How Rotary Position Embedding Supercharges Modern LLMs [RoPE]

Positional information is critical to a transformer's understanding of sequences and to its ability to generalize beyond the training context length. In this video, we discuss:
1) Why the attention mechanism in transformers is not sufficient on its own
2) Earlier attempts at injecting positional information (e.g., sinusoidal positional encoding)
3) Rotary position embedding
4) Techniques for long-context generalization and extension

Background on Transformer: But What Are Transformers?

References:
[Transformer] Attention Is All You Need https://arxiv.org/abs/1706.03762
[RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding https://arxiv.org/abs/2104.09864
[How is RoPE useful?] Round and Round We Go! What makes Rotary Positional Encodings useful? https://arxiv.org/abs/2410.06205
[Controlled study] A Controlled Study on Long Context Extension and Generalization in LLMs https://arxiv.org/abs/2409.12181

Raw PowerPoint slides: https://www.dropbox.com/scl/fi/y43aw2...
