The Engineering Behind LLM Inference: Inside the GPU

When a language model generates a token, the GPU doing the work spends more than 99% of its time waiting on memory, and almost none of it doing math. This video opens up an NVIDIA H100 to explain why, and what the hardware does to fight back.

A GPU makes the opposite bet from a CPU: instead of a few fast threads, an H100 runs more than 270,000 slow ones across 132 Streaming Multiprocessors, and hides memory latency by keeping enough warps in flight that a scheduler always has one ready to run. We go inside the SM (CUDA cores, tensor cores, warps and SIMT, warp divergence), walk the four-level memory hierarchy from registers down to HBM, and explain the memory wall: the widening gap between compute and bandwidth that leaves most inference memory-bound. The roofline model and arithmetic intensity tell you which side of that wall any operation falls on.

Tensor cores and lower precision (FP16, FP8, FP4) are how each generation pushes the compute ceiling higher, while NVLink, NVSwitch, and the 72-GPU GB200 NVL72 are how you scale past one chip. The final section is fair to the alternatives: Google's TPU and its systolic array, AMD's MI300X with 192 GB of HBM, and why CUDA's two-decade software lead, not the spec sheet, is still the binding constraint.

This is the second video in a series on LLM inference; the next one is about the kernels, the code that makes this hardware fast.

Chapters:
---------------
00:00 Opening Up the GPU
01:16 Throughput vs. Latency: 270,000 Threads in Flight
02:57 Inside the Streaming Multiprocessor: CUDA and Tensor Cores
05:04 Warps and SIMT: 32 Threads in Lockstep
06:23 Warp Divergence: When Branches Halve Throughput
07:36 The Four-Level Memory Hierarchy: Registers to HBM
09:12 Hiding HBM Latency by Oversubscribing Warps
10:21 Coalesced vs. Uncoalesced: One Transaction or 32
11:32 The Memory Wall: Compute Outpaces HBM Bandwidth
13:19 The Roofline Model: 295 Operations per Byte
14:01 Tensor Cores: Matrix-Multiply-Accumulate in One Instruction
15:35 Lower Precision: FP16, FP8, and FP4
17:10 Scale-Up vs. Scale-Out: NVLink and InfiniBand
18:56 TPU, MI300X, and CUDA's Software Lead

References:
------------------
NVIDIA, "CUDA C++ Programming Guide": https://docs.nvidia.com/cuda/cuda-c-p...
NVIDIA, "Hopper Architecture In-Depth" (2022): https://developer.nvidia.com/blog/nvi...
NVIDIA, "Tesla V100 GPU Architecture" whitepaper (2017): https://images.nvidia.com/content/vol...
NVIDIA, "Blackwell Architecture": https://www.nvidia.com/en-us/data-cen...
NVIDIA, "GB200 NVL72": https://www.nvidia.com/en-us/data-cen...
Williams, Waterman & Patterson (2009), "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM: https://doi.org/10.1145/1498765.1498785
Google Cloud, "TPU architecture": https://docs.cloud.google.com/tpu/doc...
AMD, "Instinct MI300X Accelerators": https://www.amd.com/en/products/accel...
SemiAnalysis (2024), "MI300X vs H100 vs H200 Benchmark Part 1: Training (CUDA Moat Still Alive)": https://newsletter.semianalysis.com/p...

#GPU #CUDA #LLMInference #NVIDIA #TensorCores #H100 #blackwell #GPUComputing #AIHardware #DeepLearning #MachineLearning #ai #llm #openai #anthropic
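
The thread-count and roofline arithmetic in the description can be sketched in a few lines of Python. This is a rough illustration, not a benchmark: the peak compute (about 989 TFLOP/s dense FP16 on tensor cores), the HBM bandwidth (about 3.35 TB/s), and the 2,048 resident threads per SM are assumed H100 SXM figures that the description does not state, and the function names are hypothetical. Only the 132 SM count comes from the text above.

```python
# Roofline sketch for a single H100, under assumed spec-sheet numbers.
PEAK_FLOPS = 989e12         # assumed: dense FP16 tensor-core FLOP/s
HBM_BANDWIDTH = 3.35e12     # assumed: HBM bytes/s
SM_COUNT = 132              # from the description
MAX_THREADS_PER_SM = 2048   # assumed: resident-thread limit per SM


def ridge_point(peak_flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * bandwidth)


def is_memory_bound(intensity, peak_flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """An operation left of the ridge point is limited by memory bandwidth."""
    return intensity < ridge_point(peak_flops, bandwidth)


def decode_gemv_intensity(bytes_per_weight=2):
    """Batch-1 decode: each weight is read once and used for one
    multiply-add (2 FLOPs), so intensity is 2 / bytes_per_weight."""
    return 2 / bytes_per_weight


resident_threads = SM_COUNT * MAX_THREADS_PER_SM
intensity = decode_gemv_intensity()            # FP16 weights: 1 FLOP/byte
utilization = attainable_flops(intensity) / PEAK_FLOPS

print(f"resident threads: {resident_threads}")
print(f"ridge point:      {ridge_point():.0f} FLOP/byte")
print(f"memory-bound:     {is_memory_bound(intensity)}")
print(f"compute in use:   {utilization:.2%}")
```

Under these assumptions the ridge point lands near the 295 operations per byte named in the chapter list, and a batch-1 matrix-vector product with FP16 weights uses well under 1% of peak compute, which is in line with the "more than 99% waiting on memory" framing. Different peak or bandwidth figures move the ridge point accordingly.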
