#136: vLLM, LMD, and the Quest to Build the Linux of AI Inference

In this episode, hosts Ronald and Jan are joined at KubeCon by two guests from Red Hat: Brian Stevens, AI CTO and one of the original architects behind the creation of Kubernetes and the CNCF, and Rob Shaw, co-lead of the vLLM project and maintainer of LMD.

Brian shares the backstory of how Kubernetes came to be open source, including how Red Hat negotiated a single committer seat before agreeing to be a launch partner, and how he later pushed Google to contribute Kubernetes to the newly formed CNCF rather than keeping it proprietary like TensorFlow.

Rob explains what an inference runtime actually is: the critical piece of software that takes an abstract AI model and runs it as efficiently as possible on a GPU or other accelerator, handling everything from CUDA-level kernel optimization to memory management and concurrent request scheduling. vLLM serves as a "Rosetta Stone" between the ever-growing zoo of models (Llama, DeepSeek, Mistral, Qwen, Nvidia Nemotron) and accelerators (Nvidia, AMD, Intel, Google TPUs). The conversation also covers model compression and quantization, and how techniques like 4-bit precision can deliver 2x hardware efficiency gains while preserving more than 99% of model accuracy.

Brian and Rob also address the "big model vs. many small models" debate, recommending that teams always start with the largest capable model to validate a use case before optimizing down. Looking ahead, both guests see inference as potentially the single largest workload ever run on Kubernetes, and they position LMD (now contributed to the CNCF) as the distributed inference layer that will make this possible across heterogeneous accelerator environments, preventing enterprises from ending up with 42 incompatible AI stacks. The episode closes with a discussion of AI slop, human-in-the-loop thinking, and the future of Kubernetes as the universal platform for running AI agents at scale.

Powered by @acc-ict.
Send us a message: https://www.buzzsprout.com/2098061/fa...

ACC ICT, specialist in IT continuity (https://www.acc-ict.com/): business-critical applications and data securely available, independent of third parties, anytime and anywhere.

Support the show: https://www.buzzsprout.com/2098061/su...

Like and subscribe! It helps out a lot.

You can also find us on:
De Nederlandse Kubernetes Podcast - YouTube (@denederlandsekubernetespodcast)
Nederlandse Kubernetes Podcast (@k8spodcast.nl) | TikTok
De Nederlandse Kubernetes Podcast (https://www.k8spodcast.nl/)

Where can you meet us: Events (https://www.k8spodcast.nl/events)

This podcast is powered by: ACC ICT - IT Continuity for Business-Critical Applications (https://acc-ict.com/)
