KV Cache: The Invisible Trick Behind Every LLM
Same prompt. Same model. The first call costs $1.00. The second costs $0.05. Same words, 20× cheaper.

The reason isn't a discount. It's the most important trick in modern AI inference: the KV cache. And the underlying idea is older than the transformer itself: Donald Michie called it "memoization" in 1968.

In this video I walk through the full mechanism, end to end:
- what a token, embedding, and dot product actually are
- why attention computes a "key" and a "value" for every token
- why one token costs ~2 billion math operations
- why the naive transformer redoes those ops every single token
- how the KV cache fixes it (~1000× faster)
- why long context is slow (memory wall, Wulf & McKee 1995)
- and the prompt-ordering trick that makes Anthropic's prompt caching 20× cheaper

If you've ever wondered why agent loops get expensive fast, this is the answer.

Chapters:
0:00 Same prompt, different bill
0:14 2017: parallel training, sequential generation
0:45 Token = chunk of text
0:57 Embedding = list of numbers
1:23 Dot product = arrow alignment
1:43 Meaning is direction
2:06 Key = advertise. Value = give.
2:27 Embedding × matrix → key (or value)
2:53 Walking attention on "the cat sat"
3:19 ~2 billion ops per token
3:48 Redone every token (the waste)
4:17 KV cache, named
4:50 The memory wall
5:22 Prompt caching (Anthropic, 2024)
5:46 Order matters (stable bottom, variable top)
6:22 Long context isn't slow because the model thinks harder
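For anyone who wants to poke at the idea in code, here is a minimal sketch of the "redone every token" waste and the cache that removes it. Everything in it is an assumption made for illustration: a single attention head, 2-number embeddings for "the cat sat", and made-up weight matrices `W_Q`, `W_K`, `W_V`. Real models stack many layers and heads, so this toy shows the bookkeeping only and does not reproduce the speedup figures quoted above.

```python
import math

# Hypothetical toy weights (2x2) for one attention head; not from any real model.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [-0.2, 0.8]]
W_V = [[0.3, -0.4], [0.7, 0.2]]

def matvec(matrix, vector):
    """Embedding x matrix -> key, value, or query."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Softmax over scaled dot products, then a weighted sum of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dims = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dims)]

def generate_naive(embeddings):
    """Recompute every key and value for the whole prefix at each step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]
        keys = [matvec(W_K, e) for e in prefix]
        values = [matvec(W_V, e) for e in prefix]
        projections += 2 * step  # one key + one value per prefix token
        query = matvec(W_Q, prefix[-1])
        outputs.append(attend(query, keys, values))
    return outputs, projections

def generate_cached(embeddings):
    """Compute each token's key and value once, then keep them (the KV cache)."""
    key_cache, value_cache = [], []
    outputs, projections = [], 0
    for e in embeddings:
        key_cache.append(matvec(W_K, e))
        value_cache.append(matvec(W_V, e))
        projections += 2  # only the newest token is projected
        query = matvec(W_Q, e)
        outputs.append(attend(query, key_cache, value_cache))
    return outputs, projections

# Made-up 2-number embeddings standing in for "the", "cat", "sat".
tokens = [[1.0, 0.0], [0.2, 0.9], [0.6, 0.4]]
naive_out, naive_count = generate_naive(tokens)
cached_out, cached_count = generate_cached(tokens)
print(naive_count, cached_count)  # key/value projections performed by each version
print(naive_out == cached_out)    # both versions produce the same attention outputs
```

The naive version's projection count grows with the square of the sequence length, while the cached version's grows linearly. The price is that the cache has to stay in memory and be read back on every step, which is where the memory wall comes in.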


