Perfect training data cripples reasoning — RLVR vs SFT has a provable exponential gap

Everyone assumes clean, flawless examples are the best reasoning data, and a new theory paper proves that intuition is backwards. By formalizing reasoning as path-finding through a maze, two researchers show imitation learning provably can't teach backtracking, while reinforcement learning learns it for free from the model's own failures. The result is a clean, exponential gap that reframes what 'high-quality reasoning data' even means.

Full episode page: https://paperdive.ai/episodes/163-provable...

Paper: Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Authors: Wei, Kim
Read the paper: https://arxiv.org/abs/2606.22938

What you'll take away:
- Why training on clean, backtracking-free solutions provably freezes a model's ability to retreat from dead ends: there's no gradient signal where there's no data
- How modeling reasoning as path-finding through a maze turns 'backtracking' into something you can prove theorems about
- The headline result: RL scales linearly with reasoning depth (W·K) while imitation blows up exponentially (W·L^K), from the identical starting model
- Why bolting a clever search wrapper onto a weak imitation model helps a lot but still can't fully close the gap
- The steelman critique: the central theorem is close to true by construction, and the exponential drama leans on a chosen graph topology and a deliberately pessimistic definition of SFT
- The practical payoff: why distilling from an RL-trained model works precisely because you inherit its messy recoveries, not just its answers

Chapters:
0:00 Is clean data secretly the problem?
1:40 Two ways to train, one key difference
3:53 Turning reasoning into a maze
6:16 No examples, no nudge
9:15 Linear versus falling off a cliff
10:27 How RL escapes the trap
13:24 Does it survive a real algorithm?
15:31 How true by construction is this?
18:51 The dead ends are the curriculum

This episode is AI-generated.
The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. The on-screen illustrations were generated by OpenAI GPT Image.
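
To get a feel for the headline scaling quoted in the takeaways (W·K for RL versus W·L^K for imitation), here is a small arithmetic sketch. It only evaluates the two expressions as written in the description; the function names and the sample values W=4, L=3 are illustrative assumptions, and the description does not define W or L beyond the formulas themselves (K is the reasoning depth). It is not the paper's algorithm or its exact bounds.

```python
# Illustrative arithmetic only: evaluates the two scaling expressions quoted
# in the episode description. The names and sample values are assumptions.

def rl_cost(W: int, K: int) -> int:
    """Linear expression W*K quoted for RL."""
    return W * K


def sft_cost(W: int, L: int, K: int) -> int:
    """Exponential expression W*L^K quoted for imitation (SFT)."""
    return W * L ** K


def gap_table(W: int, L: int, depths):
    """Return (K, W*K, W*L^K) for each depth K in `depths`."""
    return [(K, rl_cost(W, K), sft_cost(W, L, K)) for K in depths]


if __name__ == "__main__":
    # W=4 and L=3 are arbitrary sample values chosen for the illustration.
    for K, rl, sft in gap_table(W=4, L=3, depths=range(1, 9)):
        print(f"K={K}  W*K={rl:>3}  W*L^K={sft:>6}  ratio={sft / rl:.1f}")
```

With any L greater than 1, the ratio between the two columns grows roughly like L^K / K, which is the "linear versus falling off a cliff" contrast named in the chapter list.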
