Trinity: Training a 400B MoE from Scratch Without Losing Your Mind
[2026 - Day 3 - Model Systems]

Training sparse Mixture-of-Experts models at scale is notoriously unstable. Experts collapse, routers drift, and loss spikes appear out of nowhere. This talk covers how we built Trinity Large, a 400B parameter MoE (13B active), trained on 17 trillion tokens with zero loss spikes.

We'll walk through the decisions that actually mattered: why we replaced standard aux-loss-free balancing with a momentum-based approach (SMEBU), how interleaved local/global attention made context extension surprisingly smooth, and what broke when we first tried running Muon at scale.

I'll also cover the less glamorous stuff: our Random Sequential Document Buffer to reduce batch heterogeneity, recovering from B300 GPU faults on brand-new hardware, and the six changes we shipped at once when routing started collapsing mid-run.

Practical lessons for teams training their own MoEs or scaling up sparse architectures.

SPEAKER: Lucas Atkins - CTO, Arcee AI

ABOUT AI COUNCIL: AI Council brings together the brightest minds in data to share industry knowledge, technical architectures and best practices in building cutting-edge data & AI systems and tools.

