Optimizing Training Workloads on GPU Clusters

The talk covers best practices, technical guidance and a live demonstration on a 2-node instant Kubernetes cluster. It will walk through key considerations from initial setup through to training execution and system monitoring.

Topics Covered:

Pre-Cluster Planning: Choosing between Kubernetes and Slurm, sizing GPU resources, and understanding model and data requirements.

Pre-Flight Validation: Verifying hardware (GPUs, CPUs, memory), software stack (e.g., Docker), and network configuration for RDMA or Ethernet-based setups.

CPU and GPU Optimization: Understanding workload characteristics, NUMA node configuration, and avoiding common bottlenecks (e.g., CPU-heavy preprocessing).

Storage and Data Handling: Comparing parallel file systems vs. local NVMe, managing data ingestion/output, and minimizing transfer overhead.

Failure Recovery and Observability: Addressing issues like GPU errors, node lockups, and network flaps, and implementing robust observability with tools like nvidia-smi and GPU utilization monitors.

Live Demo: Running a real training job with basic observability in place, and demonstrating progress checks and troubleshooting workflows.
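As a rough illustration of the kind of basic nvidia-smi observability mentioned above, the following Python sketch polls per-GPU utilization and memory and flags GPUs that look idle. It is not taken from the talk: the function names (`parse_gpu_csv`, `query_gpus`, `find_idle_gpus`) and the 10% idle threshold are hypothetical choices made for this example. The only external dependency is the `nvidia-smi` CLI with its `--query-gpu` and `--format=csv,noheader,nounits` options.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_float(value):
    """Convert a numeric field to float; return None for values like '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    gpus = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        index, name, util, mem_used, mem_total = (f.strip() for f in record)
        gpus.append({
            "index": int(index),
            "name": name,
            "utilization_pct": _to_float(util),
            "memory_used_mib": _to_float(mem_used),
            "memory_total_mib": _to_float(mem_total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once; return [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


def find_idle_gpus(gpus, threshold_pct=10.0):
    """Return indices of GPUs below the (hypothetical) utilization threshold."""
    return [
        g["index"]
        for g in gpus
        if g["utilization_pct"] is not None and g["utilization_pct"] < threshold_pct
    ]


if __name__ == "__main__":
    gpus = query_gpus()
    for gpu in gpus:
        print(gpu)
    print("idle GPUs:", find_idle_gpus(gpus))
```

Run on each node during a training job, a check like this gives a quick signal of stalled ranks or a starved input pipeline. A production setup would more likely export the same metrics to a monitoring system rather than print them.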
