Lessons From Scaling BPF To Detect RDMA Device Drivers Bugs In Real Time
Abstract: Training large models requires significant resources, and the failure of any GPU or host can significantly prolong training times. At Meta, we observed that 17% of our jobs fail due to RDMA-related syscall errors, which arise from bugs in the RDMA driver code. Unlike other parts of the kernel, RDMA-related syscalls are opaque, and the errors leave the application and the kernel with mismatched views of hardware resources. Because of this opacity and mismatch, existing observability tools provided limited visibility, and DevOps found these failures challenging to triage. We required a new scalable framework to analyze kernel state and identify the cause of the mismatch. Direct approaches, such as tracing the kernel calls and capturing the metadata involved, turned out to be prohibitively expensive. In this talk, we will describe the set of optimizations used to scale the tracking of kernel state, and the map-based systems designed to export relevant state efficiently without impacting production workloads.

Prankur Gupta: Prankur Gupta is a Staff Software Engineer at Meta with over 12 years of experience driving reliability and observability at scale. His expertise spans the entire technology stack, from developing AI transport protocols in NIC firmware and kernel drivers to pioneering network optimizations in user space using eBPF. At Meta, Prankur has played a key role in building advanced ecosystems for transport tuning and congestion control, delivering major performance gains. He currently leads the productionization of Meta’s in-house hardware initiatives, including NICs and MTIA, while enabling the next generation of AI transport protocols for Meta’s infrastructure.

https://nanog.org/events/nanog-97/con...

