The Engineering Behind LLM Inference: Kernels and Memory
Two GPU kernels can compute the exact same attention, on the same chip, with identical inputs and identical outputs, and one still finishes up to 7.6x faster than the other. The entire gap comes from data movement: which bytes get loaded, from which level of memory, in what order, and where each result is written.

This is the third episode of The Engineering Behind LLM Inference, and it follows a single kernel, FlashAttention, through four generations, from the original 2022 paper on the A100 to FlashAttention-4 on NVIDIA's Blackwell B200. The attention function never changes across any of them. What changes is how much of the GPU's peak arithmetic the kernel actually delivers, climbing from tensor cores that mostly sit idle to roughly three quarters of peak. Getting there means counting bytes instead of operations. The roofline and arithmetic intensity decide whether attention is memory-bound; tiling keeps the N-by-N score matrix in SRAM instead of streaming it to HBM; online softmax and recomputation make that legal; warp specialization and FP8 arrive with Hopper; and on Blackwell, asymmetric hardware scaling moves the bottleneck off the tensor cores and onto the exponential unit.

The second half is about the KV cache that every active request keeps in HBM. PagedAttention and vLLM borrow operating-system paging to stop wasting more than half of it. Multi-query, grouped-query, and DeepSeek-V2's multi-head latent attention shrink the cache itself, by up to 57x. Flash-Decoding splits it across the whole chip, and CUDA Graphs amortize the launch cost of the hundreds of tiny kernels each decoded token would otherwise pay for. The video closes on quantization, the next lever and the next episode. Every technique here ends up doing the same thing: making the bytes cheaper to move, or rarer to need.
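For readers who want to see why online softmax makes tiling legal, here is a minimal sketch in plain Python rather than a GPU kernel. The function names, the tile size, and the use of scalar values in place of value vectors are illustrative assumptions, not code from any FlashAttention release. It walks one row of scores tile by tile, keeping only a running max, a running denominator, and a running output, and rescales the earlier partial results whenever a larger max appears.

```python
import math

def softmax_two_pass(scores):
    """Reference softmax: needs the whole row in hand before it can normalize."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def online_softmax_weighted_sum(scores, values, tile=4):
    """Compute sum_i softmax(scores)[i] * values[i] one tile at a time.

    Only three running quantities survive between tiles, so the full row
    of scores never has to be stored. Scalar values stand in for the
    value vectors of real attention (an assumption made for brevity).
    """
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running softmax denominator, relative to m
    acc = 0.0          # running unnormalized output, relative to m
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        # Rescale earlier partial results to the new max.
        # On the first tile m is -inf, so scale is exp(-inf) = 0.0.
        scale = math.exp(m - m_new)
        p_tile = [math.exp(s - m_new) for s in s_tile]
        l = l * scale + sum(p_tile)
        acc = acc * scale + sum(p * v for p, v in zip(p_tile, v_tile))
        m = m_new
    return acc / l

if __name__ == "__main__":
    scores = [0.5, -1.0, 3.0, 2.0, 0.0, 7.5, -2.5, 1.0, 4.0]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
    reference = sum(p * v for p, v in zip(softmax_two_pass(scores), values))
    tiled = online_softmax_weighted_sum(scores, values, tile=4)
    print(abs(reference - tiled) < 1e-9)
```

The tiled result matches the two-pass reference to floating-point tolerance for any tile size, which is the property that lets a kernel keep each tile of scores in SRAM and discard it after use.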
Chapters:
---------------
00:00 GPU Kernels and the 7.6x Data-Movement Gap
02:47 HBM vs SRAM: The 10x Bandwidth Gap
03:56 The Roofline: Arithmetic Intensity Decides the Bottleneck
05:23 Naive Attention: the N×N Matrix and Three HBM Passes
09:29 FlashAttention: IO-Aware Tiling Keeps S in SRAM
11:48 Online Softmax: Tiling Without the Full Row
15:09 Recomputation: Linear Memory and the 7.6x Speedup
16:42 FlashAttention-2: Better Parallelism and Work Partitioning
20:28 FlashAttention-3: Warp Specialization and FP8 on Hopper
24:57 FlashAttention-4: Asymmetric Scaling on Blackwell B200
30:51 PagedAttention: Paging the KV Cache in vLLM
34:32 Multi-Query and Grouped-Query Attention: Sharing KV Heads
36:59 Multi-Head Latent Attention: DeepSeek-V2's 57x Smaller Cache
41:05 Flash-Decoding: Splitting the KV Cache Across the Chip
44:18 CUDA Graphs: Amortizing Per-Kernel Launch Overhead
46:33 Quantization: FP8, FP4, and the Accuracy Floor
47:43 From Idle Tensor Cores to 75% of Peak

References:
-------------------
Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
Milakov & Gimelshein (2018). Online normalizer calculation for softmax. https://arxiv.org/abs/1805.02867
Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. https://arxiv.org/abs/1911.02150
Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135
Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245
Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. https://arxiv.org/abs/2307.08691
Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. https://arxiv.org/abs/2309.06180
Dao et al. (2023). Flash-Decoding for long-context inference. https://pytorch.org/blog/flash-decoding/
DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. https://arxiv.org/abs/2405.04434
Shah et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. https://arxiv.org/abs/2407.08608
Zadouri et al. (2026). FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. https://arxiv.org/abs/2603.05451

#flashattention #llminference #cuda #kvcache #vllm #attentionmechanism #transformers #nvidia #blackwell #quantization #deeplearning #openai #anthropic #deepmind #ai
