llm-d: Distributed LLM Inference on Kubernetes

Blog post: https://cefboud.com/posts/llm-d/
llm-d: https://llm-d.ai/docs/getting-started

00:00 Introduction to LLMD
00:32 Why LLM inference needs smarter load balancing
01:31 Prefill vs Decode explained
03:15 KV cache awareness and session routing
04:10 How LLMD scores model servers
06:36 LLMD Router architecture
07:48 Client request flow
08:34 Envoy External Processing (ExtProc)
10:04 End-to-end request routing
12:49 Gateway API Inference Extension
15:18 Prefill/Decode disaggregation
17:13 KV cache transfer with NCCL & RDMA
18:13 Plugin architecture and extensibility
19:59 Flow control, priorities & autoscaling
20:47 Final thoughts
