DeepSeek Gave LLMs a Real Memory (It's Not RAG)

DeepSeek's Engram introduces a new way to retrieve knowledge through scalable lookups. This improves LLM performance across benchmark tasks (including reasoning tasks) by freeing the attention and MoE layers from having to reconstruct static factual patterns. In this video, let's explore how and why Engram works.

00:00 Attention
01:56 How facts are stored in LLMs (FFN/MoE)
06:27 Retrieving knowledge via lookup
07:32 Hashing
10:47 Multi-head hashing
11:56 Context-aware gating
16:06 Multi-branch architecture (mHC)
16:53 Integrating Engram into a Transformer
18:01 Sparsity allocation (Engram vs MoE)
20:16 Performance on benchmark tasks
22:08 Why does Engram improve LLM reasoning?
23:45 Where should we place the Engram?
25:41 Does Engram really make the model deeper?
27:05 Embedding scaling and the future of LLMs

References:
[Engram] https://arxiv.org/abs/2601.07372
[Layer Embeddings] https://developers.googleblog.com/en/...
[DeepEmbed] https://www.rwkv.com/
[SuperBPE] https://arxiv.org/abs/2503.13423
[SCONE] https://arxiv.org/abs/2502.01637
[OverEncoding] https://arxiv.org/abs/2501.16975
[Byte Latent Transformer] https://arxiv.org/abs/2412.09871
[LongCat-Flash-Lite] https://arxiv.org/abs/2601.21204
[Large Lookup Layers] https://arxiv.org/abs/2601.21461

Video made with manim: https://www.manim.community/

Note: I caught a cold while making this video 🤒, so part of the voiceover was generated with my cloned voice. Sorry if the voiceover feels a bit unnatural.