How Attention Got So Efficient [GQA/MLA/DSA]

Attention mechanisms have been the key behind the recent AI boom. What happened after the multi-head attention in the seminal 2017 Transformer paper? In this video, we break down several core ideas that make attention efficient and scalable.

00:00 Introduction
00:35 Tokenization
01:21 Attention (vector form)
04:26 Attention (matrix form)
07:07 Key-Value caching
09:42 Multi-Query Attention (MQA)
11:03 Grouped Query Attention (GQA)
13:32 Multi-head Latent Attention (MLA)
15:37 MLA at inference time
18:15 Applying RoPE to MLA (decoupled RoPE)
22:18 DeepSeek Sparse Attention (DSA)
23:57 Quantization and rotation in DSA
27:44 DSA training

References:
Multi-Head Attention (MHA): https://arxiv.org/abs/1706.03762
Multi-Query Attention (MQA): https://arxiv.org/abs/1911.02150
Grouped Query Attention (GQA): https://arxiv.org/abs/2305.13245
Multi-head Latent Attention (MLA): https://arxiv.org/abs/2405.04434
DeepSeek Sparse Attention (DSA): https://api-docs.deepseek.com/news/ne...
Rotary Position Embedding (RoPE): https://arxiv.org/abs/2104.09864

Video made with Manim: https://www.manim.community/
