Scaling KV Caches for LLMs: How LMCache + NIXL Handle Network and Storage Heterogeneity - Junchen Jiang, University of Chicago & Moein Khazraee, NVIDIA

Efficient KV cache management is critical for scalable, low-latency LLM inference. LMCache, a widely adopted open-source KV caching layer used in vLLM deployments, addresses two fundamental challenges: (1) transferring KV caches across LLM instances, and (2) storing KV caches in diverse backend systems. However, in real-world deployments, both operations must navigate hardware heterogeneity, from network fabrics like NVLink, RDMA, and TCP/IP to storage layers like Infinistore, Redis, and Mooncake.

That’s where NVIDIA’s NIXL library comes in. NIXL abstracts and optimizes data movement across heterogeneous infrastructures, making it easier for systems like LMCache to deliver high throughput and low latency.

In this talk, we’ll dive into how LMCache integrates with NIXL to accelerate KV cache transfers and storage. Expect real deployment demos, performance benchmarks, and practical guidance for running next-gen LLM inference on Kubernetes with minimal GPU waste.

