The Engineering Behind LLM Inference: Kernels and Memory

Two GPU kernels can compute exactly the same attention on the same chip, with identical inputs and identical outputs, and one can still finish up to 7.6x faster than the other. The entire gap comes from data movement: which bytes get loaded, from which level of memory, in what order, and where each result is written.

This is the third episode of The Engineering Behind LLM Inference. It follows a single kernel, FlashAttention, through four generations, from the original 2022 paper on the A100 to FlashAttention-4 on NVIDIA's Blackwell B200. The attention function never changes across any of them. What changes is how much of the GPU's peak arithmetic the kernel actually delivers, which climbs from a small fraction, with the tensor cores mostly idle, to roughly three quarters of peak. Getting there means counting bytes instead of operations. The roofline and arithmetic intensity decide whether attention is memory-bound; tiling keeps the N-by-N score matrix in SRAM instead of streaming it to HBM; online softmax and recomputation make that legal; warp specialization and FP8 arrive with Hopper; and on Blackwell, asymmetric hardware scaling moves the bottleneck off the tensor cores and onto the exponential unit.

The second half is about the KV cache that every active request keeps in HBM. PagedAttention and vLLM borrow operating-system paging to stop wasting more than half of that memory. Multi-query attention, grouped-query attention, and DeepSeek-V2's multi-head latent attention shrink the cache itself, by up to 57x. Flash-Decoding splits the cache across the whole chip, and CUDA Graphs amortize the launch cost of the hundreds of tiny kernels that each decoded token would otherwise pay for.

The video closes on quantization, the next lever and the next episode. Every technique here ends up doing the same thing: making the bytes cheaper to move, or rarer to need.
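For readers who want to see why online softmax makes tiling legal, here is a minimal sketch in plain Python. It is not the FlashAttention kernel. It handles a single query row, treats each value as a scalar rather than a vector, and the function names and the block_size parameter are made up for illustration. The point it shows is that a running maximum, a running sum, and a rescaled accumulator are enough to get the exact attention output without ever holding the full row of scores.

```python
import math


def streaming_attention_row(scores, values, block_size=4):
    """Softmax-weighted sum of `values`, reading `scores` one block at a time.

    Illustrative only: one query row, scalar values, hypothetical names.
    """
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = 0.0                    # unnormalized output, relative to running_max

    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]

        new_max = max(running_max, max(s_blk))
        # Factor that re-expresses the old partial results against the new maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]

        running_sum = running_sum * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, v_blk))
        running_max = new_max

    return acc / running_sum


def reference_attention_row(scores, values):
    """The same quantity computed the naive way, with the whole row in hand."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    scores = [0.5, -1.0, 3.0, 2.0, 10.0, -4.0, 0.0, 7.5, 1.25]
    values = [1.0, 2.0, -3.0, 0.5, 4.0, 8.0, -1.0, 2.5, 6.0]
    print(streaming_attention_row(scores, values, block_size=4))
    print(reference_attention_row(scores, values))
```

The two functions agree to floating-point tolerance for any block size, which is the property that lets a tiled kernel keep each block of scores in fast memory and discard it once the running statistics are updated.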

Chapters:
---------------
00:00 GPU Kernels and the 7.6x Data-Movement Gap
02:47 HBM vs SRAM: The 10x Bandwidth Gap
03:56 The Roofline: Arithmetic Intensity Decides the Bottleneck
05:23 Naive Attention: the N×N Matrix and Three HBM Passes
09:29 FlashAttention: IO-Aware Tiling Keeps S in SRAM
11:48 Online Softmax: Tiling Without the Full Row
15:09 Recomputation: Linear Memory and the 7.6x Speedup
16:42 FlashAttention-2: Better Parallelism and Work Partitioning
20:28 FlashAttention-3: Warp Specialization and FP8 on Hopper
24:57 FlashAttention-4: Asymmetric Scaling on Blackwell B200
30:51 PagedAttention: Paging the KV Cache in vLLM
34:32 Multi-Query and Grouped-Query Attention: Sharing KV Heads
36:59 Multi-Head Latent Attention: DeepSeek-V2's 57x Smaller Cache
41:05 Flash-Decoding: Splitting the KV Cache Across the Chip
44:18 CUDA Graphs: Amortizing Per-Kernel Launch Overhead
46:33 Quantization: FP8, FP4, and the Accuracy Floor
47:43 From Idle Tensor Cores to 75% of Peak

References:
-------------------
Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
Milakov & Gimelshein (2018). Online normalizer calculation for softmax. https://arxiv.org/abs/1805.02867
Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. https://arxiv.org/abs/1911.02150
Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135
Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245
Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. https://arxiv.org/abs/2307.08691
Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. https://arxiv.org/abs/2309.06180
Dao et al. (2023). Flash-Decoding for long-context inference. https://pytorch.org/blog/flash-decoding/
DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. https://arxiv.org/abs/2405.04434
Shah et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. https://arxiv.org/abs/2407.08608
Zadouri et al. (2026). FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. https://arxiv.org/abs/2603.05451

#flashattention #llminference #cuda #kvcache #vllm #attentionmechanism #transformers #nvidia #blackwell #quantization #deeplearning #openai #anthropic #deepmind #ai