Perfect training data cripples reasoning — RLVR vs SFT has a provable exponential gap

Everyone assumes clean, flawless examples are the best reasoning data, and a new theory paper proves that intuition is backwards. By formalizing reasoning as path-finding through a maze, two researchers show imitation learning provably can't teach backtracking, while reinforcement learning learns it for free from the model's own failures. The result is a clean, exponential gap that reframes what 'high-quality reasoning data' even means.

Full episode page: https://paperdive.ai/episodes/163-provable...
Paper: Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
Authors: Wei, Kim
Read the paper: https://arxiv.org/abs/2606.22938

What you'll take away:
- Why training on clean, backtracking-free solutions provably freezes a model's ability to retreat from dead ends: there's no gradient signal where there's no data
- How modeling reasoning as path-finding through a maze turns 'backtracking' into something you can prove theorems about
- The headline result: RL scales linearly with reasoning depth (W·K) while imitation blows up exponentially (W·L^K), from the identical starting model
- Why bolting a clever search wrapper onto a weak imitation model helps a lot but still can't fully close the gap
- The steelman critique: the central theorem is close to true by construction, and the exponential drama leans on a chosen graph topology and a deliberately pessimistic definition of SFT
- The practical payoff: why distilling from an RL-trained model works precisely because you inherit its messy recoveries, not just its answers

Chapters:
0:00 Is clean data secretly the problem?
1:40 Two ways to train, one key difference
3:53 Turning reasoning into a maze
6:16 No examples, no nudge
9:15 Linear versus falling off a cliff
10:27 How RL escapes the trap
13:24 Does it survive a real algorithm?
15:31 How true by construction is this?
18:51 The dead ends are the curriculum

This episode is AI-generated. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs. The on-screen illustrations were generated by OpenAI GPT Image.
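
To get a feel for the headline scaling claim, here is a minimal Python sketch that simply evaluates the two expressions quoted in the description, W·K for RL and W·L^K for imitation. The description only says that the cost scales with reasoning depth; it does not define W or L, so this sketch treats W, L, and K as plain positive numbers, and the function names and sample values are illustrative assumptions, not taken from the paper.

```python
def rl_cost(W, K):
    """Linear expression quoted for RL: W * K."""
    return W * K


def sft_cost(W, L, K):
    """Exponential expression quoted for imitation (SFT): W * L**K."""
    return W * L ** K


def gap_ratio(L, K):
    """Ratio sft_cost / rl_cost; W cancels, leaving L**K / K."""
    return L ** K / K


if __name__ == "__main__":
    # Hypothetical values chosen only to show the growth pattern.
    W, L = 4, 3
    for K in (1, 2, 5, 10):
        print(K, rl_cost(W, K), sft_cost(W, L, K), gap_ratio(L, K))
```

Because W appears in both expressions, the ratio depends only on L and K, which is why the gap grows exponentially in K for any fixed L greater than 1.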
