KV Cache: The Invisible Trick Behind Every LLM

Same prompt. Same model. The first call costs $1.00. The second costs $0.05. Same words, 20× cheaper.

The reason isn't a discount. It's the most important trick in modern AI inference: the KV cache. And the underlying idea is older than the transformer itself: Donald Michie called it "memoization" in 1968.

In this video I walk through the full mechanism, end to end:
- what a token, embedding, and dot product actually are
- why attention computes a "key" and a "value" for every token
- why one token costs ~2 billion math operations
- why the naive transformer redoes those ops every single token
- how the KV cache fixes it (~1000× faster)
- why long context is slow (memory wall, Wulf & McKee 1995)
- and the prompt-ordering trick that makes Anthropic's prompt caching 20× cheaper

If you've ever wondered why agent loops get expensive fast, this is the answer.

Chapters:
0:00 Same prompt, different bill
0:14 2017: parallel training, sequential generation
0:45 Token = chunk of text
0:57 Embedding = list of numbers
1:23 Dot product = arrow alignment
1:43 Meaning is direction
2:06 Key = advertise. Value = give.
2:27 Embedding × matrix → key (or value)
2:53 Walking attention on "the cat sat"
3:19 ~2 billion ops per token
3:48 Redone every token (the waste)
4:17 KV cache, named
4:50 The memory wall
5:22 Prompt caching (Anthropic, 2024)
5:46 Order matters (stable bottom, variable top)
6:22 Long context isn't slow because the model thinks harder
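
Code sketch: if you want to see the "redone every token" waste and the KV cache fix side by side, here is a minimal toy version in plain Python. It is not taken from the video or from any real model. The 4-number embeddings, the random matrices W_Q, W_K, W_V, and the function names are all made up for illustration, and it models a single attention head with no layers, batching, or GPU. It only counts how many key/value projections each approach performs and checks that both give the same output.

```python
# Toy single-head attention in pure Python. All sizes, weights, and names
# are illustrative assumptions, not a real model or library API.
import math
import random

D = 4  # embedding width (tiny on purpose)

def matvec(matrix, vec):
    """Multiply a D x D matrix by a length-D vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Softmax-weighted average of values, weighted by query-key alignment."""
    scores = [dot(query, k) / math.sqrt(D) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(D)]

rng = random.Random(0)

def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]

W_Q, W_K, W_V = rand_matrix(), rand_matrix(), rand_matrix()
embeddings = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

def generate_naive(embs):
    """Recompute every key and value from scratch at every step."""
    outputs, projections = [], 0
    for t in range(len(embs)):
        keys = [matvec(W_K, e) for e in embs[: t + 1]]
        values = [matvec(W_V, e) for e in embs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(matvec(W_Q, embs[t]), keys, values))
    return outputs, projections

def generate_cached(embs):
    """Compute each key and value once, then keep it in the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for e in embs:
        key_cache.append(matvec(W_K, e))
        value_cache.append(matvec(W_V, e))
        projections += 2
        outputs.append(attend(matvec(W_Q, e), key_cache, value_cache))
    return outputs, projections

naive_out, naive_proj = generate_naive(embeddings)
cached_out, cached_proj = generate_cached(embeddings)
print(naive_proj, cached_proj)  # prints: 42 12
```

With 6 tokens the naive loop does 42 key/value projections and the cached loop does 12. The naive count grows with the square of the sequence length while the cached count grows linearly, so the gap widens as the context gets longer. The price is that the cache has to sit in memory and be read back at every step.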