Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica

About the seminar: https://faster-llms.vercel.app
Speaker: Ion Stoica (Berkeley & Anyscale & Databricks)
Title: Accelerating LLM Inference with vLLM (and SGLang)
Abstract: Inference efficiency remains a critical challenge for deploying large language models (LLMs) at scale. In this talk, I will present our work on LLM inference we have conducted at Berkeley over the past two years in the context of vLLM and SGLang, which are today the most popular open-source inference engines. In particular, I will describe some of the key techniques they introduced, PagedAttention and RadixAttention, which are now widely used by the majority of LLM inference engines. Finally, I will discuss the new architecture of vLLM.
Recorded on Mar 4, 2025.

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley

The State of vLLM | Ray Summit 2024

The Evolution of Multi-GPU Inference in vLLM | Ray Summit 2024

LLMs Don't Need More Parameters. They Need Loops.

vLLM Production Stack Community Meeting on Jan 20 2026

Next-Gen Long-Context LLM Inference with LMCache - Junchen Jiang (UChicago & LMCache)

Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

How to Turn Research Into Real Companies | Ion Stoica, Co-founder and Executive Chairman, Databricks

Faster LLMs: Accelerate Inference with Speculative Decoding

How to pick a GPU and Inference Engine?

Accelerating LLM Inference with vLLM

CS 263 Group 12 Presentation on ToolLLM

But what is quantum computing? (Grover's Algorithm)

Scalable and Efficient Systems for Large Language Models—Lianmin Zheng (Berkeley)

Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Efficient LLM Inference with SGLang, Lianmin Zheng, xAI

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

Fast LLM Serving with vLLM and PagedAttention

LLM inference optimization: Architecture, KV cache and Flash attention

19 Tips to Better AI Fine Tuning
