Challenges in implementing AI/ML training job recovery from GPU/Accelerator data poisoning events
Anil Agrawal (Meta Platform Corp - Hardware Systems Engineer)
David Xiao (Meta - Engineering Manager)

AI/ML training job interruptions caused by hardware faults, such as uncorrected errors in GPU/accelerator memory, are a growing issue, especially as we build very large clusters (100K+). In this presentation, we share the challenges that uncorrected errors created during the development of Meta's large training clusters built on the Grand Teton Training Platform, how such errors affected the training job 'interruption rate', and how RAS technology is used to contain the impact. We close with a call to action for the OCP community on reducing interruptions from such hardware uncorrected errors by using various 'recovery' techniques.

