Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines... M. Kaushik, S.K. Merla
Best Practices for Deploying LLM Inference, RAG and Fine Tuning Pipelines on K8s - Meenakshi Kaushik & Shiva Krishna Merla, NVIDIA

In this session, we'll cover best practices for deploying, scaling, and managing LLM inference pipelines on Kubernetes (K8s). We'll explore common patterns like inference, retrieval-augmented generation (RAG), and fine-tuning.

Key challenges addressed include:
[1] Minimizing initial inference latency with model caching
[2] Optimizing GPU usage with efficient scheduling, multi-GPU/node handling, and auto-quantization
[3] Enhancing security and management with RBAC, monitoring, auto-scaling, and support for air-gapped clusters

We'll also demonstrate building customizable pipelines for inference, RAG, and fine-tuning, and managing them post-deployment. Solutions include:
[1] a lightweight standalone tool built using the operator pattern
[2] KServe, a robust open-source AI inference platform

This session will equip you to effectively manage LLM inference pipelines on K8s, improving performance, efficiency, and security.


