The Engineering Behind LLM Inference: The Memory Wall
When an LLM generates a token, the GPU spends almost all of its time moving data and barely any of it doing arithmetic. On an H100, the math for a single token takes under a tenth of a millisecond, but a token only comes out every 30 milliseconds or so. The gap is memory bandwidth.

This is episode 1 of a series on how LLM inference actually works in production. It covers the memory wall, where GPU compute grew about 80x from 2012 to 2022 while memory bandwidth grew only 17x, and traces a transformer forward pass that pulls all 140 GB of a 70B model's weights out of HBM on every step.

A request runs in two phases that sit at opposite ends of the roofline. The H100's ridge is near 295 FLOPs per byte: prefill lands well to the right and is compute-bound, decode sits about 300x to the left and is memory-bandwidth-bound, capped near 24 tokens per second. The KV cache makes decode possible and then competes with the weights for that same bandwidth. The numbers that decide whether a serving system actually works are TTFT, TPOT, and goodput.

These are the constraints behind serving any large model, from Llama, Kimi and DeepSeek to the systems running ChatGPT, Gemini and Claude. Later episodes will get into the ways around the memory wall: quantization, multi-GPU parallelism, mixture-of-experts, prefill and decode disaggregation, and speculative decoding.

If this was useful, like and subscribe for the rest of the series.

Chapters:
---------------
00:00 LLM Inference: One Token Every 30 Milliseconds
03:47 The Memory Wall: 80x Compute vs 17x Bandwidth
06:24 Transformer Inference: 140 GB of Weights in HBM
10:28 Prefill vs Decode: The Two Phases of Inference
14:00 The Roofline Model: Decode 300x Below the Ridge
20:37 The KV Cache: 320 KB Per Token
26:06 TTFT, TPOT, and Goodput: LLM Serving Metrics
29:31 LLM Inference Is a Memory Bandwidth Problem

References:
Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
Wulf & McKee (1995). Hitting the Memory Wall: Implications of the Obvious. https://doi.org/10.1145/216585.216588
Gholami et al. (2024). AI and Memory Wall. https://arxiv.org/abs/2403.14123
Williams, Waterman & Patterson (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. https://doi.org/10.1145/1498765.1498785
Ma & Patterson (2026). Challenges and Research Directions for Large Language Model Inference Hardware. https://arxiv.org/abs/2601.05047
Grattafiori et al. (2024). The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783
NVIDIA. H100 Tensor Core GPU datasheet. https://www.nvidia.com/en-us/data-cen...
Deloitte (2026). Technology, Media & Telecommunications Predictions: More compute for AI, not less. https://www.deloitte.com/us/en/insigh...

#llminference #gpu #nvidia #deeplearning #machinelearning #ai #llm #openai #anthropic #deepmind #deepseek #transformers #kvcache #mlops #inference
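
The headline figures in the description (a ridge near 295 FLOPs per byte, a decode ceiling near 24 tokens per second, and 320 KB of KV cache per token) can be roughly reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the video's own calculation. It assumes H100 SXM figures of about 989 TFLOP/s dense BF16 and 3.35 TB/s of HBM3 bandwidth, a 70B model stored in 16-bit precision (the 140 GB quoted above), and a Llama-3-70B-style layout of 80 layers, 8 KV heads, and a head dimension of 128. None of those inputs are stated in the description itself, and the function names are made up for illustration.

```python
# Back-of-the-envelope roofline numbers for batch-1 decode on a single H100.
# All constants below are assumptions, not values taken from the description.

PEAK_FLOPS = 989e12       # assumed: H100 SXM dense BF16 throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12   # assumed: H100 SXM HBM3 bandwidth, bytes/s
WEIGHT_BYTES = 140e9      # 70B parameters x 2 bytes each (the 140 GB figure)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and memory balance."""
    return peak_flops / bandwidth


def decode_ceiling_tokens_per_s(weight_bytes: float, bandwidth: float) -> float:
    """Upper bound on batch-1 decode speed if every step streams all weights once."""
    return bandwidth / weight_bytes


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV cache footprint of one token: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)
    ceiling = decode_ceiling_tokens_per_s(WEIGHT_BYTES, HBM_BANDWIDTH)
    # Assumed Llama-3-70B-style layout with 16-bit cache entries.
    kv = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

    # Batch-1 decode does about 2 FLOPs per 2-byte weight, so roughly 1 FLOP/byte.
    decode_intensity = 2 / 2

    print(f"ridge point:      {ridge:.0f} FLOPs/byte")
    print(f"decode intensity: {decode_intensity:.0f} FLOP/byte "
          f"(about {ridge / decode_intensity:.0f}x left of the ridge)")
    print(f"decode ceiling:   {ceiling:.1f} tokens/s")
    print(f"KV cache:         {kv / 1024:.0f} KiB per token")
```

Under these assumptions the ceiling depends only on bandwidth and model size, which is why the peak FLOP/s figure does not appear in `decode_ceiling_tokens_per_s` at all. The estimate ignores KV cache reads, so real batch-1 decode would land somewhat below it.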


