A proposed solution to improve reliability by containing the impact of PCIe Uncorrected Errors
Presented by Anil Agrawal (Meta) | Gada Badeer (Meta)

Meta's next-generation AI/ML platform, "Grand Teton Training", uses a complex hierarchy of PCIe devices, including GPUs, switches, NICs, and NVMe drives, and requires various RAS features to improve system reliability. In this presentation, we share the key learnings from developing this platform and propose an optimized solution to contain the risk of data corruption due to PCIe uncorrected errors.
