How to Engineer AI Inference Systems [Philip Kiely] - 766

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference (batching, quantization, speculation, and KV cache reuse) lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.

🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/766.

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confi...

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: / twimlai
Follow us on LinkedIn: / twimlai
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
03:40 - Why inference is the most important AI workload?
06:21 - Inference vs model serving
07:18 - Inference challenges
09:57 - Pace of inference research to production timeline
13:41 - Reasons to care about inference engineering
15:49 - Considerations in build vs buy decisions
22:08 - Product maturity cycle
27:14 - GPU lifecycles in inference maturity
32:14 - LLM-assisted inference
36:46 - Agents and multimodal models in specialized inference optimization
47:21 - Open source runtimes: vLLM, SGLang, and TensorRT LLM
49:50 - Specialized AI hardware
51:24 - Future trends and predictions
52:36 - Where to find the inference engineering book

🔗 LINKS & RESOURCES
===============================
Inference Engineering Book - https://www.baseten.co/inference-engi...
Baseten - https://www.baseten.co/

📸 Camera: https://amzn.to/3TQ3zsg
🎙️Microphone: https://amzn.to/3t5zXeV
🚦Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
