Keys, Queries, and Values: The celestial mechanics of attention

The attention mechanism is what makes Large Language Models like ChatGPT or DeepSeek talk well. But how does it work? One can see it as a mechanism that uses similarity to figure out which parts of the text to pay more or less attention to. For this, we use word embeddings. I like to see word embeddings as words flying around in the universe, like planets and stars. In this picture, the attention mechanism (the Keys, Queries, and Values matrices) defines the fabric of this universe and its laws of gravity, which resemble, yet in some ways differ greatly from, the laws of gravity that rule our own universe. Come join me in this celestial adventure in the universe of language!

See other videos in this LLM series:
The attention mechanism in LLMs: • The Attention Mechanism in Large Language ...
The math behind attention mechanisms: • The math behind Attention: Keys, Queries, ...
Transformer models: • What are Transformer Models and how do the...

Get the Grokking Machine Learning book! https://manning.com/books/grokking-ma...
Discount code (40%): serranoyt (Use the discount code on checkout)

01:55 Similarity
02:12 Embeddings
04:56 Attention
07:14 Dot product
09:29 Cosine similarity
11:10 The Keys and Queries matrices
14:19 Compressing and stretching dimensions
18:50 Combining dimensions
23:14 Asymmetric pull
40:57 Multi-head attention
45:14 The Value matrix
49:24 Summary
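
For readers who like to see the ideas in code, here is a minimal sketch of how similarity, the Keys and Queries matrices, and the Value matrix might fit together. It is not taken from the video: the toy 2-dimensional embeddings, the matrices W_q, W_k, W_v, and every function name are made up for illustration, and the scaling by the square root of the key dimension follows the common scaled dot-product convention.

```python
import math

def dot(u, v):
    """Dot product: large when two vectors point in similar directions."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product of the two vectors after normalizing their lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(embeddings, W_q, W_k):
    """For each word, how much attention it pays to every word."""
    queries = [matvec(W_q, x) for x in embeddings]
    keys = [matvec(W_k, x) for x in embeddings]
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]

def attention(embeddings, W_q, W_k, W_v):
    """Move each word toward a weighted average of the value vectors."""
    values = [matvec(W_v, x) for x in embeddings]
    weights = attention_weights(embeddings, W_q, W_k)
    dim = len(values[0])
    return [
        [sum(w * v[i] for w, v in zip(row, values)) for i in range(dim)]
        for row in weights
    ]

# Hypothetical toy embeddings for three words (values chosen by hand).
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 1.0], [0.0, 1.0]]  # mixes the two dimensions of each query

weights = attention_weights(embeddings, W_q, identity)
outputs = attention(embeddings, W_q, identity, identity)
```

Because W_q and W_k differ in this toy setup, the raw score from word 0 to word 1 is not the same as the score from word 1 to word 0, which is one way to read the "asymmetric pull" chapter title. The `weights` rows each sum to 1, so every row of `outputs` is a weighted average of the value vectors.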
