The Engineering Behind LLM Inference: The Memory Wall

When an LLM generates a token, the GPU spends almost all of its time moving data and barely any of it doing arithmetic. On an H100, the math for a single token takes under a tenth of a millisecond, but a token only comes out every 30 milliseconds or so. The gap is memory bandwidth.

This is episode 1 of a series on how LLM inference actually works in production. It covers the memory wall, where GPU compute grew about 80x from 2012 to 2022 while memory bandwidth grew only 17x, and traces a transformer forward pass that pulls all 140 GB of a 70B model's weights out of HBM on every step.

A request runs in two phases that sit at opposite ends of the roofline. The H100's ridge is near 295 FLOPs per byte: prefill lands well to the right and is compute-bound, decode sits about 300x to the left and is memory-bandwidth-bound, capped near 24 tokens per second. The KV cache makes decode possible and then competes with the weights for that same bandwidth. The numbers that decide whether a serving system actually works are TTFT, TPOT, and goodput.

These are the constraints behind serving any large model, from Llama, Kimi and DeepSeek to the systems running ChatGPT, Gemini and Claude. Later episodes will get into the ways around the memory wall: quantization, multi-GPU parallelism, mixture-of-experts, prefill and decode disaggregation, and speculative decoding.

If this was useful, like and subscribe for the rest of the series.

Chapters:
---------------
00:00 LLM Inference: One Token Every 30 Milliseconds
03:47 The Memory Wall: 80x Compute vs 17x Bandwidth
06:24 Transformer Inference: 140 GB of Weights in HBM
10:28 Prefill vs Decode: The Two Phases of Inference
14:00 The Roofline Model: Decode 300x Below the Ridge
20:37 The KV Cache: 320 KB Per Token
26:06 TTFT, TPOT, and Goodput: LLM Serving Metrics
29:31 LLM Inference Is a Memory Bandwidth Problem

References:
Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
Wulf & McKee (1995). Hitting the Memory Wall: Implications of the Obvious. https://doi.org/10.1145/216585.216588
Gholami et al. (2024). AI and Memory Wall. https://arxiv.org/abs/2403.14123
Williams, Waterman & Patterson (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. https://doi.org/10.1145/1498765.1498785
Ma & Patterson (2026). Challenges and Research Directions for Large Language Model Inference Hardware. https://arxiv.org/abs/2601.05047
Grattafiori et al. (2024). The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783
NVIDIA. H100 Tensor Core GPU datasheet. https://www.nvidia.com/en-us/data-cen...
Deloitte (2026). Technology, Media & Telecommunications Predictions: More compute for AI, not less. https://www.deloitte.com/us/en/insigh...

#llminference #gpu #nvidia #deeplearning #machinelearning #ai #llm #openai #anthropic #deepmind #deepseek #transformers #kvcache #mlops #inference
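
Back-of-the-envelope check: the headline numbers in this description (the 295 FLOPs per byte ridge, decode about 300x below it, the cap near 24 tokens per second, and 320 KB of KV cache per token) can be roughly reproduced with a few lines of Python. This is a sketch under assumptions that the description does not state: an H100 SXM at about 989 dense BF16 TFLOP/s and 3.35 TB/s of HBM bandwidth, 16-bit weights, batch size 1, and a Llama-3-70B-style attention shape (80 layers, 8 KV heads, head dimension 128). The function names are made up for illustration.

```python
# Rough roofline arithmetic for single-stream decode.
# Every constant below is an assumption, not a figure taken from the video.

PEAK_FLOPS = 989e12      # assumed H100 SXM dense BF16 throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12  # assumed H100 SXM HBM3 bandwidth, bytes/s

PARAMS = 70e9            # 70B-parameter model
BYTES_PER_PARAM = 2      # 16-bit weights, so 140 GB in total

LAYERS = 80              # assumed Llama-3-70B-style shape
KV_HEADS = 8             # grouped-query attention: 8 key/value heads
HEAD_DIM = 128


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOPs per byte) where compute and memory balance."""
    return peak_flops / bandwidth


def decode_intensity(params, bytes_per_param):
    """FLOPs per byte of weights read for one decoded token at batch size 1.

    Each weight is read once and used in one multiply-add (2 FLOPs).
    KV cache traffic is ignored here to keep the estimate simple.
    """
    return (2 * params) / (params * bytes_per_param)


def max_decode_tokens_per_second(params, bytes_per_param, bandwidth):
    """Upper bound on tokens/s if every step must stream all weights from HBM."""
    seconds_per_token = (params * bytes_per_param) / bandwidth
    return 1.0 / seconds_per_token


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
    intensity = decode_intensity(PARAMS, BYTES_PER_PARAM)
    tps = max_decode_tokens_per_second(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH)
    kv = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)

    print(f"ridge point:       {ridge:.0f} FLOPs/byte")
    print(f"decode intensity:  {intensity:.1f} FLOPs/byte")
    print(f"gap to ridge:      {ridge / intensity:.0f}x")
    print(f"decode upper bound: {tps:.1f} tokens/s")
    print(f"KV cache per token: {kv / 1024:.0f} KiB")
```

Under these assumptions the script lands on a ridge near 295 FLOPs per byte, a decode intensity of 1 FLOP per byte, roughly 24 tokens per second, and 320 KiB of KV cache per token. Changing BYTES_PER_PARAM to 1 gives a feel for why quantization, one of the topics planned for later episodes, moves the decode ceiling.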