The Engineering Behind LLM Inference: Inside the GPU

When a language model generates a token, the GPU doing the work spends more than 99% of its time waiting on memory, and almost none of it doing math. This video opens up an NVIDIA H100 to explain why, and what the hardware does to fight back.

A GPU makes the opposite bet from a CPU: instead of a few fast threads, an H100 runs more than 270,000 slow ones across 132 Streaming Multiprocessors, and hides memory latency by keeping enough warps in flight that a scheduler always has one ready to run. We go inside the SM (CUDA cores, tensor cores, warps and SIMT, warp divergence), walk the four-level memory hierarchy from registers down to HBM, and explain the memory wall: the widening gap between compute and bandwidth that leaves most inference memory-bound. The roofline model and arithmetic intensity tell you which side of that wall any operation falls on.

Tensor cores and lower precision (FP16, FP8, FP4) are how each generation pushes the compute ceiling higher, while NVLink, NVSwitch, and the 72-GPU GB200 NVL72 are how you scale past one chip. The final section is fair to the alternatives: Google's TPU and its systolic array, AMD's MI300X with 192 GB of HBM, and why CUDA's two-decade software lead, not the spec sheet, is still the binding constraint.

This is the second video in a series on LLM inference; the next one is about the kernels, the code that makes this hardware fast.

Chapters:
---------------
00:00 Opening Up the GPU
01:16 Throughput vs. Latency: 270,000 Threads in Flight
02:57 Inside the Streaming Multiprocessor: CUDA and Tensor Cores
05:04 Warps and SIMT: 32 Threads in Lockstep
06:23 Warp Divergence: When Branches Halve Throughput
07:36 The Four-Level Memory Hierarchy: Registers to HBM
09:12 Hiding HBM Latency by Oversubscribing Warps
10:21 Coalesced vs. Uncoalesced: One Transaction or 32
11:32 The Memory Wall: Compute Outpaces HBM Bandwidth
13:19 The Roofline Model: 295 Operations per Byte
14:01 Tensor Cores: Matrix-Multiply-Accumulate in One Instruction
15:35 Lower Precision: FP16, FP8, and FP4
17:10 Scale-Up vs. Scale-Out: NVLink and InfiniBand
18:56 TPU, MI300X, and CUDA's Software Lead

References:
------------------
NVIDIA, "CUDA C++ Programming Guide": https://docs.nvidia.com/cuda/cuda-c-p...
NVIDIA, "Hopper Architecture In-Depth" (2022): https://developer.nvidia.com/blog/nvi...
NVIDIA, "Tesla V100 GPU Architecture" whitepaper (2017): https://images.nvidia.com/content/vol...
NVIDIA, "Blackwell Architecture": https://www.nvidia.com/en-us/data-cen...
NVIDIA, "GB200 NVL72": https://www.nvidia.com/en-us/data-cen...
Williams, Waterman & Patterson (2009), "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM: https://doi.org/10.1145/1498765.1498785
Google Cloud, "TPU architecture": https://docs.cloud.google.com/tpu/doc...
AMD, "Instinct MI300X Accelerators": https://www.amd.com/en/products/accel...
SemiAnalysis (2024), "MI300X vs H100 vs H200 Benchmark Part 1: Training (CUDA Moat Still Alive)": https://newsletter.semianalysis.com/p...

#GPU #CUDA #LLMInference #NVIDIA #TensorCores #H100 #blackwell #GPUComputing #AIHardware #DeepLearning #MachineLearning #ai #llm #openai #anthropic