llm-d NYC 2026 Meetup

Welcome to the recording of the first-ever llm-d Meetup, held on March 11, 2026, in New York City! This inaugural event brought together engineering leaders from IBM Research, AMD, and Red Hat to dive deep into the challenges of scaling LLM inference and the future of the open-source distributed stack.

In this session, we explore how llm-d (an open-source, full-stack solution) is establishing distributed inference as a first-class cloud-native workload. From managing the "prefill crunch" to state-aware scheduling on Kubernetes, our speakers break down the technical paths to production-ready AI.

📍 AGENDA & TIMESTAMPS
00:00 Welcome - Pete Cheslock (Red Hat)
01:49 Intro to llm-d for Open Source Distributed Inference - Carlos Costa (IBM)
35:40 Distributed LLM Serving on AMD with llm-d - Kenny Roche (AMD)
1:05:55 Scaling Wide-EP and Mixture-of-Experts (MoE) Models - Tyler Smith (Red Hat AI)
1:20:59 KV-Cache Wins: Prefix-Cache Scheduling & Offloading - Maroon Ayoub (IBM)
1:41:54 Closing & How to Get Involved with llm-d - Pete Cheslock

Carlos Costa (IBM Research) kicks off with an overview of the core challenges: hardware heterogeneity, varying request sizes, and the shift from monolithic to orchestrated inference.
Kenny Roche (AMD) discusses aligning llm-d with the ROCm stack and the performance potential of the ADER version of kernels.
Tyler Smith (Red Hat AI) dives into Expert Parallelism (EP) and lessons learned from scaling sparse models such as DeepSeek-style architectures.
Maroon Ayoub (IBM Research) explains why the KV cache hit rate is the most important metric in production and introduces North-South/East-West management paths.

💡 KEY TECHNICAL HIGHLIGHTS
State-Aware Scheduling: Learn how llm-d achieves significantly faster performance by optimizing KV cache reuse across clusters.
Prefill-Decode (P/D) Disaggregation: A deep dive into separating compute-bound prefill from memory-bound decode for better latency.
Offloading Strategies: How to overcome GPU memory limits by offloading terabytes of KV cache to CPU memory and file-system-based storage.
Future Frontiers: A sneak peek at the llm-d roadmap, featuring reinforcement learning (RL) support and expansion to the SGLang inference engine.

🔗 JOIN THE COMMUNITY
🌎 https://llm-d.ai
💬 https://llm-d.ai/slack
💻 https://github.com/llm-d
