Lessons From RL Systems That Looked Fine Until They Didn't

[2026 - Day 3 - Model Systems]

Reinforcement learning systems often fail not because rewards are wrong, but because optimization pressure is unbounded. Policies exploit edge cases, drift over time, and converge to brittle strategies that look fine in training but break in deployment, especially under bounded actions, safety requirements, resource budgets, and long-term user impact.

This talk focuses on controlling optimization directly: practical techniques for training RL agents that remain stable and predictable under hard constraints. Rather than modifying rewards, we explore structural and system-level approaches that shape behavior by construction.

Topics include:
- Why reward penalties alone fail to enforce hard constraints under scale and distribution shift
- Structural constraint mechanisms such as action masking, feasibility filters, and sandboxed execution
- How training inside hard boundaries changes policy behavior and improves long-horizon stability, including across retraining cycles
- Detecting constraint violations and failure modes that do not appear in aggregate return metrics
- Lessons from applying constrained RL in production-like systems, including failures only discovered after deployment and what ultimately stopped them

The goal is to share concrete algorithmic and system design strategies for deploying reinforcement learning in settings where violations are suboptimal.

SPEAKER: Ezi Ozoani - Co-founder & CTO, Aethon

👉 Sign up for our "No BS" Newsletter to get the latest technical data & AI content: https://aicouncil.com/newsletter

ABOUT AI COUNCIL: AI Council brings together the brightest minds in data to share industry knowledge, technical architectures and best practices in building cutting edge data & AI systems and tools.

FIND US:
Website: https://aicouncil.com/
LinkedIn: /aicouncilconf
X: https://x.com/aicouncilconf
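To make the "action masking" and "feasibility filter" topics concrete, here is a minimal sketch of the general idea. It is not taken from the talk: the function names (`budget_mask`, `masked_softmax`, `sample_action`) and the budget-based feasibility rule are hypothetical, chosen only for illustration. The point it shows is that infeasible actions receive exactly zero probability by construction, rather than being discouraged by a reward penalty.

```python
import math
import random
from typing import List, Sequence


def budget_mask(costs: Sequence[float], remaining_budget: float) -> List[bool]:
    """Hypothetical feasibility filter: an action is allowed only if its cost fits the budget."""
    return [c <= remaining_budget for c in costs]


def masked_softmax(logits: Sequence[float], feasible: Sequence[bool]) -> List[float]:
    """Softmax restricted to feasible actions; infeasible actions get probability 0."""
    if len(logits) != len(feasible):
        raise ValueError("logits and feasible must have the same length")
    if not any(feasible):
        raise ValueError("no feasible action in this state")
    # Subtract the largest feasible logit for numerical stability.
    peak = max(l for l, ok in zip(logits, feasible) if ok)
    weights = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, feasible)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits: Sequence[float], feasible: Sequence[bool], rng: random.Random) -> int:
    """Sample an action index from the masked distribution."""
    probs = masked_softmax(logits, feasible)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [5.0, 1.0, 1.0]   # the policy strongly prefers action 0
    costs = [10.0, 1.0, 2.0]   # but action 0 exceeds the remaining budget
    mask = budget_mask(costs, remaining_budget=2.0)
    print(masked_softmax(logits, mask))
    print(sample_action(logits, mask, random.Random(0)))
```

In this sketch the policy's preferred action is simply unavailable when it would break the budget, so no amount of optimization pressure on the logits can select it. A reward penalty, by contrast, only makes the violation less attractive.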
