Become a Model Whisperer: The "On-Policy" Secret to Better LLM Results
Ever wonder why a perfectly crafted prompt or a carefully curated fine-tuning dataset falls flat? The problem isn't always your instructions - you might be fighting against the model's fundamental nature.

This video dives deep into a critical lesson from Large Language Model Reinforcement Learning (RL): the principle of 'On-Policy' interaction. We break down why forcing an LLM to follow a script it wasn't trained on ('Off-Policy') can lead to poor performance, brittleness, and even hallucinations. You'll learn a new mental model for working with LLMs: treating them not as simple instruction-following machines, but as systems with their own deeply learned distribution of knowledge.

We cover practical, on-policy techniques you can apply today:
*Prompting:* How to coax the model into revealing its own internal data structures and preferred phrasing for more reliable results.
*Fine-Tuning:* Safer ways to introduce new facts and behaviors without corrupting the model's core knowledge.

Stop fighting the model. Learn to become a 'model whisperer' and build more robust, predictable AI applications by working with the LLM's nature, not against it.

Papers & Resources
Denny Zhou's Stanford Lecture mentioned: • Stanford CS25: V5 I Large Language Model R...
[LLMs can do reasoning unprompted](https://arxiv.org/abs/2402.10200) - Google (2024)
[Self-consistency improves chain of thought reasoning in language models](https://arxiv.org/abs/2203.11171) - Google (2022)
[ReFT: Reasoning with REinforced Fine-Tuning](https://arxiv.org/abs/2401.08967) - ByteDance (2024)
The DSPy Framework for automated 'on-policy' prompting: https://github.com/stanfordnlp/dspy

Chapters
00:00 - Introduction: Lessons from Reinforcement Learning
01:06 - How LLMs are Trained (And Why It's a Problem)
04:05 - The Inference Paradox: Untrained for Their Own Output
06:14 - Reinforcement Learning: Teaching Models Consequences
08:10 - Three Key Lessons from AI Researchers
11:32 - The Critical Rule: On-Policy vs. Off-Policy
14:38 - Practical Prompting: Stop Forcing, Start Asking
15:50 - Example 1: Extracting Bounding Boxes
17:35 - Example 2: Building Marketing Personas
22:06 - Safer Fine-Tuning with On-Policy Methods
26:17 - Conclusion: Become a Model Whisperer

ABOUT THE CHANNEL
My channel is for "The AI Builder": the developer, tinkerer, and hands-on enthusiast. We go beyond the headlines to understand the mechanisms behind the latest research, empowering you to build the future. From the Lab to Your Laptop.

SOCIALS
GitHub: https://github.com/mdda
LinkedIn: / martinandrews
X / Twitter: https://x.com/mdda123

#AI #LLM #MachineLearning #PromptEngineering #FineTuning #ReinforcementLearning #OnPolicy


