The Art of Scaling Reinforcement Learning Compute for LLMs | Bonnie Li
Bonnie Li is an AI Researcher at Google DeepMind, where she focuses on pushing the boundaries of frontier AI models and agentic post-training. Her work is at the frontier of developing foundation models, world models, and generalist agents, with core contributions to Gemini 2.5, Gemini 3, SIMA 2, and Genie 2. Previously, she worked on impactful research in reinforcement learning, presenting at top-tier machine learning conferences.

In this Frontier Research Club talk, Bonnie presents The Art of Scaling Reinforcement Learning Compute for LLMs, exploring how RL performance scales with compute and what it takes to make reinforcement learning for large language models more predictable, stable, and efficient.

Paper: https://arxiv.org/abs/2510.13786

The talk begins from a central shift in frontier model development: pretraining teaches a model about the world, but reinforcement learning unlocks many of the capabilities that matter most, including test-time thinking, agentic capabilities, and scientific discovery.

Bonnie explains how RL scaling differs from pretraining. Instead of following the same smooth power-law behavior associated with pretraining loss, RL training follows sigmoid-like scaling curves defined by an asymptotic ceiling, compute efficiency, and an inflection point. This means early training curves can be misleading: a recipe that starts slowly may take off later, while another may plateau because its algorithmic choices cap what it can achieve.

The talk then walks through the ScaleRL recipe, showing how different design choices affect different parts of the scaling curve. Loss functions and train-inference discrepancy can limit the final performance ceiling, while off-policyness, normalization, adaptive sampling, and length control often affect compute efficiency.

Bonnie also explains why large-scale RL systems often move beyond synchronous on-policy training.
On-policy RL can leave GPUs idle during generation, so systems use async or pipeline RL to overlap generation and training. But this introduces off-policy staleness: the data used for training may have been generated by older weights, which requires careful algorithmic correction through techniques like importance sampling, clipping, and sequence-level objectives.

The presentation also covers one of the quieter but most important engineering issues in RL for LLMs: train-inference discrepancy. Because training and inference stacks may use different kernels, precisions, or numerical behavior, the same weights can produce different probabilities across systems. Bonnie shows how these silent mismatches can corrupt RL training, and how fixes such as FP32 logits at the LM head can improve stability and raise the effective performance ceiling.

The talk closes with adaptive sampling strategies such as zero-variance filtering and no-positive resampling, which focus compute on prompts that are still informative rather than wasting training on examples that are already too easy or too hard.

The broader takeaway is that scaling RL for LLMs is not just about spending more GPU-hours; it is about understanding which design choices change the ceiling, which improve efficiency, and how to predict large-scale behavior from smaller runs.
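
To make the ceiling-versus-efficiency distinction concrete, here is a minimal Python sketch of a sigmoid-like compute curve. The functional form, the parameter names (`ceiling`, `efficiency`, `midpoint`, `start`), and both recipes are assumptions chosen for illustration, not values or code from the talk; see the paper for the exact parameterization it fits.

```python
def sigmoid_reward(compute, ceiling, efficiency, midpoint, start=0.0):
    """Illustrative saturating curve (assumed form, not taken from the talk):

        reward = start + (ceiling - start) / (1 + (midpoint / compute) ** efficiency)

    ceiling    -- asymptotic performance as compute grows without bound
    efficiency -- exponent controlling how sharply the curve rises
    midpoint   -- compute at which half of the total gain is reached
    """
    if compute <= 0:
        raise ValueError("compute must be positive")
    return start + (ceiling - start) / (1.0 + (midpoint / compute) ** efficiency)


# Two hypothetical recipes; every number here is invented for illustration.
fast_low = dict(ceiling=0.50, efficiency=2.0, midpoint=100.0)   # rises early, lower ceiling
slow_high = dict(ceiling=0.70, efficiency=2.0, midpoint=400.0)  # rises late, higher ceiling


def compare(compute):
    """Return (fast_low reward, slow_high reward) at the given compute budget."""
    return sigmoid_reward(compute, **fast_low), sigmoid_reward(compute, **slow_high)


if __name__ == "__main__":
    for c in (50, 100, 1000, 10000):
        a, b = compare(c)
        print(f"compute={c:>6}  fast_low={a:.3f}  slow_high={b:.3f}")
```

In this toy setup the first recipe is ahead at small budgets and the second overtakes it at large ones, which is why comparing recipes only on early training curves can pick the wrong one.
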
Topics include:
• Reinforcement learning for LLMs
• RL scaling laws
• The Art of Scaling Reinforcement Learning Compute for LLMs
• ScaleRL
• Sigmoid RL scaling curves
• Asymptotic ceiling
• Compute efficiency
• Inflection points
• Test-time thinking
• Agentic capabilities
• Scientific discovery
• RL recipes
• Loss functions
• CISPO
• GRPO
• DAPO
• GSPO
• Importance sampling
• Off-policyness
• Async RL
• PipelineRL
• In-flight weight updates
• Train-inference discrepancy
• FP32 logits
• Adaptive sampling
• Zero-variance filtering
• No-positive resampling
• Batch-level normalization
• Length control
• Predicting large-scale performance from small runs

Presented at Frontier Research Club by Bonnie Li. Recorded on [add date] at [add venue].

Frontier Research Club is a curated forum for rigorous discussion on how AI is reshaping the scientific research process. We convene researchers, computational scientists, and research engineers to examine concrete work across literature synthesis, hypothesis generation, experimental design, simulation, analysis, safety, and reproducibility.

Upcoming events: https://luma.com/frontiersyndicate

Subscribe for more research talks, technical discussions, and frontier AI presentations.


