How Attention Got So Efficient [GQA/MLA/DSA]

Attention mechanisms have been a key driver of the recent AI boom. What happened after multi-head attention was introduced in the seminal 2017 Transformer paper? In this video, we break down several core ideas that make attention efficient and scalable.

00:00 Introduction
00:35 Tokenization
01:21 Attention (vector form)
04:26 Attention (matrix form)
07:07 Key-Value caching
09:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training

References:
Multi-Head Attention (MHA): https://arxiv.org/abs/1706.03762
Multi-Query Attention (MQA): https://arxiv.org/abs/1911.02150
Grouped Query Attention (GQA): https://arxiv.org/abs/2305.13245
Multi-head Latent Attention (MLA): https://arxiv.org/abs/2405.04434
DeepSeek Sparse Attention (DSA): https://api-docs.deepseek.com/news/ne...
Rotary Position Embedding (RoPE): https://arxiv.org/abs/2104.09864

Video made with Manim: https://www.manim.community/