Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
Do you want to learn how to serve models like DeepSeek and Qwen with SOTA speeds on launch day? SGLang is an open-source fast serving framework for LLMs and VLMs that generates trillions of tokens per day at companies like xAI, AMD, and Meituan. This workshop guides AI engineers who are familiar with serving models using frameworks like vLLM, Ollama, and TensorRT-LLM through deploying and optimizing their first model with SGLang, and offers guidance on when SGLang is the appropriate tool for LLM workloads.

About Philip Kiely
Philip Kiely leads Developer Relations at Baseten. Prior to joining Baseten in 2022, he worked across software engineering and technical writing for a variety of startups. Outside of work, you'll find Philip practicing martial arts, reading a new book, or cheering for his adopted Bay Area sports teams.

About Yineng Zhang
Yineng Zhang is a Software Engineer on the Model Performance team at Baseten. He is also a core developer of the SGLang project.

Recorded at the AI Engineer World's Fair in San Francisco.

Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter

00:00 Introduction to LLM serving with SGLang
02:14 What is SGLang?
03:36 History of SGLang
06:49 Deploying Your First Model
13:01 Optimizing Performance with CUDA Graph Max Batch Size
24:19 Optimizing Performance with Eagle 3 Speculative Decoding
30:02 SGLang Community and Contributions
35:24 Invitations and Job Opportunities
36:52 Q&A

