Fast LLM Serving with vLLM and PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention achieves up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo for the past three months. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth, and present our future plan. About Anyscale --- Anyscale is the AI Application Platform for developing, running, and scaling AI. https://www.anyscale.com/ If you're interested in a managed Ray service, check out: https://www.anyscale.com/signup/ About Ray --- Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads. https://docs.ray.io/en/latest/ #llm #machinelearning #ray #deeplearning #distributedsystems #python #genai

Intellectual Property with GenAI: What LLM Developers Need to Know

Intellectual Property with GenAI: What LLM Developers Need to Know

How vLLM Became the Standard for Fast AI Inference | Simon Mo, Inferact

How vLLM Became the Standard for Fast AI Inference | Simon Mo, Inferact

Enabling Cost-Efficient LLM Serving with Ray Serve

Enabling Cost-Efficient LLM Serving with Ray Serve

IWOCL 2026 - Advancing OpenCL-Based LLM Inference in Llama.cpp on Qualcomm Adreno GPUs

IWOCL 2026 - Advancing OpenCL-Based LLM Inference in Llama.cpp on Qualcomm Adreno GPUs

What I Learned From Implementing LLM Architectures From Scratch (And How to Get Started)

What I Learned From Implementing LLM Architectures From Scratch (And How to Get Started)

Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica

Accelerating LLM Inference with vLLM (and SGLang) - Ion Stoica

How vLLM Works + Journey of Prompts to vLLM + Paged Attention

How vLLM Works + Journey of Prompts to vLLM + Paged Attention

Yann LeCun's $1B Bet Against LLMs

Yann LeCun's $1B Bet Against LLMs

FlashAttention - Tri Dao | Stanford MLSys #67

FlashAttention - Tri Dao | Stanford MLSys #67

Deploying Many Models Efficiently with Ray Serve

Deploying Many Models Efficiently with Ray Serve

Understanding vLLM with a Hands On Demo

Understanding vLLM with a Hands On Demo

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Simon Mo, vLLM

vLLM: Easy, Fast, and Cheap LLM Serving for Everyone - Simon Mo, vLLM

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

Accelerating LLM Inference with vLLM

Accelerating LLM Inference with vLLM

Most devs don't understand how LLM tokens work

Most devs don't understand how LLM tokens work

LLM inference optimization: Architecture, KV cache and Flash attention

LLM inference optimization: Architecture, KV cache and Flash attention

AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA

AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA

How is hardware reshaping LLM design?

How is hardware reshaping LLM design?

The State of vLLM | Ray Summit 2024

The State of vLLM | Ray Summit 2024

Visualizing transformers and attention | Talk for TNG Big Tech Day '24

Visualizing transformers and attention | Talk for TNG Big Tech Day '24