DeepSeek Gave LLMs a Real Memory (It's Not RAG)
DeepSeek's Engram introduces a new way to retrieve knowledge through scalable lookups. This boosts LLMs across all tasks (including reasoning tasks) by freeing attention and MoE layers from the need to reconstruct static factual patterns. In this video, let's explore how and why Engram works.

00:00 Attention
01:56 How facts are stored in LLMs (FFN/MoE)
06:27 Retrieving knowledge via lookup
07:32 Hashing
10:47 Multi-head hashing
11:56 Context-aware gating
16:06 Multi-branch architecture (mHC)
16:53 Integrating Engram into a Transformer
18:01 Sparsity allocation (Engram vs MoE)
20:16 Performance on benchmark tasks
22:08 Why does Engram improve LLM reasoning?
23:45 Where should we place the Engram?
25:41 Does the Engram really make the model deeper?
27:05 Embedding scaling and the future of LLMs

References:
[Engram] https://arxiv.org/abs/2601.07372
[Layer Embeddings] https://developers.googleblog.com/en/...
[DeepEmbed] https://www.rwkv.com/
[SuperBPE] https://arxiv.org/abs/2503.13423
[SCONE] https://arxiv.org/abs/2502.01637
[OverEncoding] https://arxiv.org/abs/2501.16975
[Byte Latent Transformer] https://arxiv.org/abs/2412.09871
[LongCat-Flash-Lite] https://arxiv.org/abs/2601.21204
[Large Lookup Layers] https://arxiv.org/abs/2601.21461

Video made with manim: https://www.manim.community/

Note: I caught a cold while making this video 🤒, so part of the voiceover is generated by my cloned voice. Sorry if the voiceover feels a bit unnatural.
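
For readers who want a concrete picture of the lookup idea named in the chapters (hashing, multi-head hashing, context-aware gating), here is a rough toy sketch in plain Python. It is not the Engram implementation: the class and function names, the hash function, the table sizes, and the sigmoid-of-dot-product gate are all assumptions made for illustration only.

```python
import math
import random


def ngram_hash(tokens, seed, table_size):
    """Map a tuple of token ids to a table slot (illustrative hash, an assumption)."""
    h = seed
    for t in tokens:
        h = (h * 1000003 + t + 1) % (2**61 - 1)
    return h % table_size


class HashedNgramMemory:
    """Hypothetical toy memory: one embedding table per hash head."""

    def __init__(self, n=2, num_heads=4, table_size=1024, head_dim=8, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.table_size = table_size
        # Each head gets its own hash seed, so a collision in one head
        # is unlikely to be a collision in all of them.
        self.seeds = [rng.randrange(1, 2**31) for _ in range(num_heads)]
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(head_dim)] for _ in range(table_size)]
            for _ in range(num_heads)
        ]

    def lookup(self, token_ids, pos):
        """Fetch the memory vector for the n-gram ending at position pos."""
        start = max(0, pos - self.n + 1)
        ngram = tuple(token_ids[start:pos + 1])
        vector = []
        for seed, table in zip(self.seeds, self.tables):
            vector.extend(table[ngram_hash(ngram, seed, self.table_size)])
        return vector


def gated_add(hidden, memory):
    """Add the retrieved vector to the hidden state, scaled by a gate in (0, 1).

    The gate here is a sigmoid of a scaled dot product, chosen only to show
    that the current context decides how much of the lookup to trust.
    """
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, memory)]


memory = HashedNgramMemory()
retrieved = memory.lookup([5, 7, 9], pos=2)          # looks up the bigram (7, 9)
updated = gated_add([0.1] * len(retrieved), retrieved)
```

The point of the sketch is that the lookup depends only on token ids, so fetching a vector costs the same no matter how large the tables grow, while the gate lets the hidden state downweight a retrieved vector that does not fit the context (for example, after a hash collision).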

