llm-d: Distributed LLM Inference on Kubernetes

Blog post: https://cefboud.com/posts/llm-d/
llm-d: https://llm-d.ai/docs/getting-started

00:00 Introduction to LLMD
00:32 Why LLM inference needs smarter load balancing
01:31 Prefill vs Decode explained
03:15 KV cache awareness and session routing
04:10 How LLMD scores model servers
06:36 LLMD Router architecture
07:48 Client request flow
08:34 Envoy External Processing (ExtProc)
10:04 End-to-end request routing
12:49 Gateway API Inference Extension
15:18 Prefill/Decode disaggregation
17:13 KV cache transfer with NCCL & RDMA
18:13 Plugin architecture and extensibility
19:59 Flow control, priorities & autoscaling
20:47 Final thoughts