Accelerating vLLM with LMCache | Ray Summit 2025

At Ray Summit 2025, Kuntai Du from TensorMesh shares how LMCache expands the resource palette for serving large language models, making LLM inference faster and more cost-efficient by moving beyond GPU-only execution.

He begins by highlighting a key limitation in today's serving stacks: KV-cache memory demands often exceed what GPUs alone can provide efficiently. LMCache addresses this by enabling KV-cache offloading to a wide range of datacenter resources, including CPU memory, local disk, and remote storage, and dynamically loading caches back to GPUs on demand. This unlocks new flexibility and reduces GPU memory pressure.

But LMCache goes beyond simple prefix caching. Kuntai introduces KV-cache-related machine learning techniques that allow the inference engine to:

- Reuse KV caches for non-prefix text
- Share and reuse caches across different LLMs
- Improve inference efficiency even for complex, non-sequential workloads

These innovations enable faster inference, lower cost, and improved hardware utilization without modifying model architectures.

Attendees will learn how LMCache opens new frontiers in LLM serving by leveraging broader datacenter resources and smart KV-cache reuse strategies, delivering scalable performance improvements even for the largest models.
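To make the offloading idea concrete, here is a minimal Python sketch of a tiered KV-cache store with prefix-chunk hashing. It is an illustration under stated assumptions, not LMCache's actual API: the names `chunk_keys` and `TieredKVCache`, the chunk size, and the LRU eviction policy are all hypothetical, and plain Python dictionaries stand in for GPU memory, CPU memory, and disk or remote storage.

```python
import hashlib
from collections import OrderedDict


def chunk_keys(token_ids, chunk_size=4):
    """Return one key per full chunk; each key also covers all earlier tokens.

    Hypothetical helper: two prompts get the same key for a chunk only if
    they agree on every token up to the end of that chunk (prefix matching).
    """
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        running.update(repr(token_ids[start:start + chunk_size]).encode())
        keys.append(running.hexdigest())
    return keys


class TieredKVCache:
    """Toy three-tier store: 'gpu' -> 'cpu' -> 'disk' (all plain dicts here)."""

    def __init__(self, gpu_slots, cpu_slots):
        self.gpu = OrderedDict()  # smallest, fastest tier, kept in LRU order
        self.cpu = OrderedDict()  # stand-in for CPU memory
        self.disk = {}            # stand-in for local disk or remote storage
        self.gpu_slots = gpu_slots
        self.cpu_slots = cpu_slots

    def put(self, key, kv):
        """Store a KV chunk in the GPU tier, offloading older chunks if full."""
        self.gpu[key] = kv
        self.gpu.move_to_end(key)
        self._offload()

    def _offload(self):
        # Push least recently used chunks down one tier at a time.
        while len(self.gpu) > self.gpu_slots:
            old_key, old_kv = self.gpu.popitem(last=False)
            self.cpu[old_key] = old_kv
        while len(self.cpu) > self.cpu_slots:
            old_key, old_kv = self.cpu.popitem(last=False)
            self.disk[old_key] = old_kv

    def get(self, key):
        """Return (kv, tier_found_in); load back to the GPU tier on a hit."""
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key], "gpu"
        for name, tier in (("cpu", self.cpu), ("disk", self.disk)):
            if key in tier:
                kv = tier.pop(key)
                self.put(key, kv)  # on-demand load back to the fastest tier
                return kv, name
        return None, "miss"
```

In this sketch a chunk that no longer fits in the fastest tier is moved down rather than discarded, so a later request with the same prefix can load it back instead of recomputing it. The non-prefix and cross-model reuse techniques mentioned in the talk are not covered here.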