The Memory Wall: The Invisible Cap on Every LLM

Same model, same GPU. One prompt returns in half a second. The other takes twelve. The reason isn't more compute. The model doesn't "think harder": its weights are frozen at inference time. The reason is bandwidth.

Every token in your prompt has a Key and a Value vector cached in GPU memory. At a hidden size of 4096 in 16-bit precision, that's about 16 KB per token, per layer. Multiply by ~96 layers and you're at 1.5 MB per token. A 100K-token prompt: 150 GB sitting in memory. An H100 has 80 GB. Your cache is nearly twice what the card can hold.

But even when it fits, attention has to read every prior K/V from memory on every new token. That's bandwidth-bound, not compute-bound. The H100 delivers 989 TFLOPS of dense 16-bit compute but only 3.35 TB/s of memory bandwidth, roughly 300 FLOPs for every byte it can fetch. Compute keeps doubling. Bandwidth grows far more slowly. Wulf and McKee called this "the memory wall" in 1995, decades before transformers ran straight into it.

The field has been fighting back. FlashAttention tiles the computation so blocks of K/V are processed in fast on-chip SRAM and the full attention matrix never touches HBM. Sliding-window attention looks only at the last W tokens, so cost grows as O(N) for a fixed window instead of O(N²). KV quantization stores at 4 bits instead of 16, a 4× smaller cache. Multi-query attention shares one K/V across all heads, shrinking the cache by the number of heads: ~96× for a 96-head model. Most long-context models combine several of these. The wall is bandwidth, not compute.
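
The arithmetic above is easy to check. Here is a minimal Python sketch, not a profiler: the hidden size of 4096, the 16-bit values, and the 32-head count are illustrative assumptions chosen to reproduce the 16 KB figure, and the function names are made up for this example. The latency number is only a lower bound from cache reads; it ignores weight reads, batching, and kernel overhead.

```python
# Back-of-the-envelope KV-cache arithmetic.
# Model dimensions below are illustrative assumptions, not any real model's specs.

def kv_bytes_per_token_per_layer(hidden_size, bytes_per_value=2.0):
    """One Key vector plus one Value vector, each hidden_size values wide."""
    return 2 * hidden_size * bytes_per_value


def kv_cache_bytes(tokens, layers, hidden_size, bytes_per_value=2.0, kv_share=1):
    """Total cache size in bytes.

    kv_share is how many query heads share one K/V head:
    1 for standard multi-head attention, num_heads for multi-query attention.
    """
    per_layer = kv_bytes_per_token_per_layer(hidden_size, bytes_per_value)
    return tokens * layers * per_layer / kv_share


def seconds_per_token_from_cache_reads(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on decode latency: each cached K/V is read once per new token."""
    return cache_bytes / bandwidth_bytes_per_s


# Figures from the text (H100) plus assumed model dimensions.
HIDDEN, LAYERS, TOKENS = 4096, 96, 100_000   # HIDDEN is an assumption
HEADS = 32                                   # assumption: head dimension of 128
HBM_BYTES = 80e9
PEAK_FLOPS = 989e12
BANDWIDTH = 3.35e12

if __name__ == "__main__":
    full = kv_cache_bytes(TOKENS, LAYERS, HIDDEN)
    print(f"per token, per layer: {kv_bytes_per_token_per_layer(HIDDEN) / 1024:.0f} KiB")
    print(f"100K-token cache:     {full / 1e9:.0f} GB (HBM holds {HBM_BYTES / 1e9:.0f} GB)")
    print(f"cache-read floor:     "
          f"{seconds_per_token_from_cache_reads(full, BANDWIDTH) * 1e3:.0f} ms per token")
    print(f"FLOPs per byte moved: {PEAK_FLOPS / BANDWIDTH:.0f}")

    # Effect of two of the mitigations on cache size alone.
    four_bit = kv_cache_bytes(TOKENS, LAYERS, HIDDEN, bytes_per_value=0.5)
    mqa = kv_cache_bytes(TOKENS, LAYERS, HIDDEN, kv_share=HEADS)
    print(f"4-bit KV cache:       {four_bit / 1e9:.1f} GB")
    print(f"multi-query cache:    {mqa / 1e9:.1f} GB")
```

With these assumptions the full cache comes to about 157 GB in decimal units (the "150 GB" above, rounded), and reading it once at 3.35 TB/s takes about 47 ms, which caps decoding at roughly 21 tokens per second before any compute happens. The GPU can perform close to 300 floating-point operations in the time it takes to fetch one byte, while decode-time attention does only a multiply and an add per cached value, so the arithmetic units sit idle waiting on memory.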