KV Cache: The Invisible Trick Behind Every LLM

Same prompt. Same model. The first call costs $1.00. The second costs $0.05. Same words, 20× cheaper.

The reason isn't a discount. It's the most important trick in modern AI inference: the KV cache. And the underlying idea is older than the transformer itself: Donald Michie called it "memoization" in 1968.

In this video I walk through the full mechanism, end to end:
- what a token, embedding, and dot product actually are
- why attention computes a "key" and a "value" for every token
- why one token costs ~2 billion math operations
- why the naive transformer redoes those ops every single token
- how the KV cache fixes it (~1000× faster)
- why long context is slow (memory wall, Wulf & McKee 1995)
- and the prompt-ordering trick that makes Anthropic's prompt caching 20× cheaper

If you've ever wondered why agent loops get expensive fast, this is the answer.

Chapters:
0:00 Same prompt, different bill
0:14 2017: parallel training, sequential generation
0:45 Token = chunk of text
0:57 Embedding = list of numbers
1:23 Dot product = arrow alignment
1:43 Meaning is direction
2:06 Key = advertise. Value = give.
2:27 Embedding × matrix → key (or value)
2:53 Walking attention on "the cat sat"
3:19 ~2 billion ops per token
3:48 Redone every token (the waste)
4:17 KV cache, named
4:50 The memory wall
5:22 Prompt caching (Anthropic, 2024)
5:46 Order matters (stable bottom, variable top)
6:22 Long context isn't slow because the model thinks harder
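
For readers who want the core idea in code, here is a minimal sketch of the "redone every token" waste and the KV cache fix. It is a toy, not how any production model is implemented: one attention head, one layer, 2-number embeddings, and made-up weight matrices (W_Q, W_K, W_V, EMBEDDINGS, and every function name are assumptions for illustration). The inputs are fixed up front rather than sampled from a model, which is enough to compare the two loops.

```python
import math

# Hypothetical toy weights: each matrix turns a 2-number embedding
# into a query, key, or value vector.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[0.3, 0.7], [0.6, 0.2]]

# Stand-ins for the embeddings of "the", "cat", "sat".
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def matvec(matrix, vec):
    """Embedding x matrix: one dot product per output number."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def attend(query, keys, values):
    """Softmax over query-key dot products, then a weighted sum of values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


def generate_naive(embeddings):
    """Recompute every key and value from scratch at each step."""
    outputs, projections = [], 0
    for step in range(1, len(embeddings) + 1):
        prefix = embeddings[:step]
        keys = [matvec(W_K, e) for e in prefix]
        values = [matvec(W_V, e) for e in prefix]
        projections += 2 * step  # one key and one value per prefix token
        outputs.append(attend(matvec(W_Q, prefix[-1]), keys, values))
    return outputs, projections


def generate_cached(embeddings):
    """Compute each key and value once, then keep it in the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for e in embeddings:
        key_cache.append(matvec(W_K, e))
        value_cache.append(matvec(W_V, e))
        projections += 2  # only the newest token is projected
        outputs.append(attend(matvec(W_Q, e), key_cache, value_cache))
    return outputs, projections


naive_out, naive_cost = generate_naive(EMBEDDINGS)
cached_out, cached_cost = generate_cached(EMBEDDINGS)
print(naive_cost, cached_cost)
```

Both loops produce the same attention outputs. The naive loop's key/value work grows with the square of the sequence length, while the cached loop's grows linearly, at the price of keeping every past key and value in memory.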

Why Inference is hard..

How DeepSeek V4 fits on a laptop and what does it mean to us?

The KV Cache: Memory Usage in Transformers

How Anthropic Engineers ACTUALLY Prompt Claude Code

Yann LeCun Says LLMs Have 2 Years Left…

The Engineering Behind Training a 2 Trillion Parameter LLM

Prompt Caching Explained: Why Prefixes Matter

I Reviewed 28,655 Flashcards Every Day for 17 Years. I Barely Had to Study.

The Greatest Unsolved Problem In Mathematics

Yann LeCun's $1B Bet Against LLMs

I Hacked This Temu Router. What I Found Should Be Illegal.

The Memory Wall: The Invisible Cap on Every LLM

Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

Self-Attention Explained: How Transformers Actually Work (Full Visual Breakdown)

AI just disproved the biggest math conjecture so far

Is RAG Still Needed? Choosing the Best Approach for LLMs

I Re-Created A Quant Trading Strategy With Claude Code (Insanely Cool)

They Lied to You About AI (This Study Proves It)

I Visualised Attention in Transformers

The AI layoffs end in 12 months and I know why
