Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica
About the seminar: https://faster-llms.vercel.app
Speaker: Ion Stoica (Berkeley & Anyscale & Databricks)
Title: Accelerating LLM Inference with vLLM (and SGLang)

Abstract: Inference efficiency remains a critical challenge for deploying large language models (LLMs) at scale. In this talk, I will present the work on LLM inference that we have conducted at Berkeley over the past two years in the context of vLLM and SGLang, which are today the most popular open-source inference engines. In particular, I will describe some of the key techniques they introduced, PagedAttention and RadixAttention, which are now used by the majority of LLM inference engines. Finally, I will discuss the new architecture of vLLM.

Recorded on Mar 4, 2025.

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Woosuk Kwon & Xiaoxuan Liu, UC Berkeley

The State of vLLM | Ray Summit 2024

The Evolution of Multi-GPU Inference in vLLM | Ray Summit 2024

LLMs Don't Need More Parameters. They Need Loops.

vLLM Production Stack Community Meeting on Jan 20 2026

Next-Gen Long-Context LLM Inference with LMCache - Junchen Jiang (UChicago & LMCache)

Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

How to Turn Research Into Real Companies | Ion Stoica, Co-founder and Executive Chairman, Databricks

Faster LLMs: Accelerate Inference with Speculative Decoding

How to pick a GPU and Inference Engine?

Accelerating LLM Inference with vLLM

CS 263 Group 12 Presentation on ToolLLM

But what is quantum computing? (Grover's Algorithm)

Scalable and Efficient Systems for Large Language Models—Lianmin Zheng (Berkeley)

Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Efficient LLM Inference with SGLang, Lianmin Zheng, xAI

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

Fast LLM Serving with vLLM and PagedAttention

LLM inference optimization: Architecture, KV cache and Flash attention

