Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Do you want to learn how to serve models like DeepSeek and Qwen with SOTA speeds on launch day? SGLang is an open-source fast serving framework for LLMs and VLMs that generates trillions of tokens per day at companies like xAI, AMD, and Meituan. This workshop guides AI engineers who are familiar with serving models using frameworks like vLLM, Ollama, and TensorRT-LLM through deploying and optimizing their first model with SGLang, and provides guidance on when SGLang is the appropriate tool for LLM workloads.

About Philip Kiely
Philip Kiely leads Developer Relations at Baseten. Prior to joining Baseten in 2022, he worked across software engineering and technical writing for a variety of startups. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted Bay Area sports teams.

About Yineng Zhang
Yineng Zhang is a Software Engineer on the Model Performance team at Baseten. He is also a core developer of the SGLang project.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

00:00 Introduction to LLM serving with SGLang
02:14 What is SGLang?
03:36 History of SGLang
06:49 Deploying Your First Model
13:01 Optimizing Performance with CUDA Graph Max Batch Size
24:19 Optimizing Performance with Eagle 3 Speculative Decoding
30:02 SGLang Community and Contributions
35:24 Invitations and Job Opportunities
36:52 Q&A