How Attention Got So Efficient [GQA/MLA/DSA]
Attention mechanisms have been the key behind the recent AI boom. What happened after the multi-head attention in the seminal 2017 Transformer paper? In this video, we break down several core ideas that make attention efficient and scalable.

00:00 Introduction
00:35 Tokenization
01:21 Attention (vector form)
04:26 Attention (matrix form)
07:07 Key-Value caching
09:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training

References:
Multi-Head Attention (MHA): https://arxiv.org/abs/1706.03762
Multi-Query Attention (MQA): https://arxiv.org/abs/1911.02150
Grouped Query Attention (GQA): https://arxiv.org/abs/2305.13245
Multi-head Latent Attention (MLA): https://arxiv.org/abs/2405.04434
DeepSeek Sparse Attention (DSA): https://api-docs.deepseek.com/news/ne...
Rotary Position Embedding (RoPE): https://arxiv.org/abs/2104.09864

Video made with Manim: https://www.manim.community/


