Inside NVIDIA Dynamo: Faster, Scalable AI Deployment | Ray Summit 2025
At Ray Summit 2025, Harry Kim from NVIDIA shares how NVIDIA Dynamo is redefining large-scale LLM inference through system-level optimizations that seamlessly integrate with high-performance engines such as vLLM, SGLang, and TensorRT-LLM (TRT-LLM).

He begins by outlining the core challenge: as LLMs grow in size, context length, and real-world usage, inference systems must deliver massive efficiency gains, not just from kernels or hardware, but across the entire distributed serving stack. NVIDIA Dynamo addresses this by introducing a new layer of intelligent orchestration and memory management designed specifically for LLM workloads.

Harry walks through Dynamo's key innovations, including:

- Smart Scheduling: Routes requests based on KV-cache hit rates and system load, intelligently autoscaling and disaggregating the prefill and decode phases for maximum throughput and efficiency.
- Hierarchical Memory Management: Transparently leverages HBM, CPU memory, local NVMe, and remote storage to minimize latency and maximize effective model capacity.
- Low-Latency KV-Cache Transfer: Quickly moves KV-cache across nodes and memory tiers, enabling fast context reuse and efficient distributed inference.

The session also introduces Dynamo's production-grade LLM serving capabilities, including:

- Tools to identify optimal disaggregated serving configurations offline
- Automated tuning based on real-time traffic
- Topology-aware gang scheduling to dynamically scale prefill and decode workers
- LLM-specific fault-tolerance mechanisms for reliable serving at scale

Harry demonstrates how Dynamo enables organizations to achieve higher throughput, lower latency, and better cost efficiency across distributed LLM deployments while still leveraging their preferred inference engine.

Attendees will leave with a clear understanding of how NVIDIA Dynamo transforms end-to-end LLM serving, making large-scale inference more efficient, robust, and operationally simple.
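
To make the Smart Scheduling idea concrete, here is a minimal Python sketch of routing on KV-cache hits and system load. It is not Dynamo's API or its actual algorithm: the names `Worker`, `block_hashes`, `cached_prefix_blocks`, `pick_worker`, the `load_weight` parameter, and the cost formula are all hypothetical, chosen only to illustrate the trade-off between reusing a cached prefix and avoiding a busy worker.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of one inference worker as seen by a router."""
    name: str
    active_requests: int = 0
    cached_blocks: set = field(default_factory=set)  # hashes of KV blocks held


def block_hashes(tokens, block_size=4):
    """Hash every full prefix block; each hash covers all tokens up to that block."""
    return [
        hash(tuple(tokens[:end]))
        for end in range(block_size, len(tokens) + 1, block_size)
    ]


def cached_prefix_blocks(worker, hashes):
    """Count the leading blocks of a request already in the worker's cache."""
    count = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, load_weight=1.0, block_size=4):
    """Pick the worker with the lowest assumed cost.

    Cost (an assumption for this sketch) = blocks that still need prefill
    + load_weight * requests already running on that worker.
    """
    hashes = block_hashes(tokens, block_size)

    def cost(worker):
        missing = len(hashes) - cached_prefix_blocks(worker, hashes)
        return missing + load_weight * worker.active_requests

    return min(workers, key=cost)
```

With a low `load_weight` the router favors the worker that already holds the request's prefix; raising it shifts traffic toward idle workers even when that means recomputing the prefill. A production router would also have to track cache evictions and the separate prefill and decode pools, which this sketch leaves out.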


