Flash Attention derived and coded from first principles with Triton (Python)

In this video, I'll be deriving and coding Flash Attention from scratch. I'll be deriving every operation we do in Flash Attention using only pen and "paper". Moreover, I'll explain CUDA and Triton from zero, so no prior knowledge of CUDA is required. To code the backwards pass, I'll first explain how the autograd system works in PyTorch and then derive the Jacobian of the matrix multiplication and the Softmax operation and use it to code the backwards pass. All the code will be written in Python with Triton, but no prior knowledge of Triton is required. I'll also explain the CUDA programming model from zero.

Chapters
00:00:00 - Introduction
00:03:10 - Multi-Head Attention
00:09:06 - Why Flash Attention
00:12:50 - Safe Softmax
00:27:03 - Online Softmax
00:39:44 - Online Softmax (Proof)
00:47:26 - Block Matrix Multiplication
01:28:38 - Flash Attention forward (by hand)
01:44:01 - Flash Attention forward (paper)
01:50:53 - Intro to CUDA with examples
02:26:28 - Tensor Layouts
02:40:48 - Intro to Triton with examples
02:54:26 - Flash Attention forward (coding)
04:22:11 - LogSumExp trick in Flash Attention 2
04:32:53 - Derivatives, gradients, Jacobians
04:45:54 - Autograd
05:00:00 - Jacobian of the MatMul operation
05:16:14 - Jacobian through the Softmax
05:47:33 - Flash Attention backwards (paper)
06:13:11 - Flash Attention backwards (coding)
07:21:10 - Triton Autotuning
07:23:29 - Triton tricks: software pipelining
07:33:38 - Running the code

This video won't only teach you one of the most influential algorithms in deep learning history; it'll also give you the knowledge you need to solve any new problem that involves writing CUDA or Triton kernels. Moreover, it'll give you the mathematical foundations to derive backwards passes!

As usual, the code is available on GitHub: https://github.com/hkproj/triton-flas...
🚀 Join Writer 🚀
If you're an ML researcher who wants to do research at the hottest AI startup in Silicon Valley, consider applying to Writer and help us make GPUs go brrrrrrrrr.
Join Writer: https://writer.com/company/careers/
