How to Engineer AI Inference Systems [Philip Kiely] - 766

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference (batching, quantization, speculation, and KV cache reuse) lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.

🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/766.

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confi...

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: / twimlai
Follow us on LinkedIn: / twimlai
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
03:40 - Why inference is the most important AI workload?
06:21 - Inference vs model serving
07:18 - Inference challenges
09:57 - Pace of inference research to production timeline
13:41 - Reasons to care about inference engineering
15:49 - Considerations in build vs buy decisions
22:08 - Product maturity cycle
27:14 - GPU lifecycles in inference maturity
32:14 - LLM-assisted inference
36:46 - Agents and multimodal models in specialized inference optimization
47:21 - Open source runtimes: vLLM, SGLang, and TensorRT LLM
49:50 - Specialized AI hardware
51:24 - Future trends and predictions
52:36 - Where to find the inference engineering book

🔗 LINKS & RESOURCES
===============================
Inference Engineering Book - https://www.baseten.co/inference-engi...
Baseten - https://www.baseten.co/

📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
