The Hidden Memory Crisis Within LLMs (and its solution + code explanation)

In this video, we trace the hidden memory crisis within Large Language Models from first principles and look at how FlashAttention-2 restructures the way attention moves data through the GPU memory hierarchy. We’ll dive into the mathematics of Online Softmax, briefly cover GPU architecture, and walk line by line through a complete PyTorch reference implementation of the tiling loop.

Mathematical Variables Cheat Sheet:
N, d - Sequence Length and Head Dimension.
Q_block, K_block, V_block - Sub-matrices (tiles) of Q, K and V, sliced small enough to fit into fast on-chip SRAM.
m - Running Row Maximum (tracks the highest attention score seen so far in each row; subtracting it before exponentiating prevents overflow).
l - Running Softmax Denominator (accumulates the sum of scaled exponentials, sum e^{x - m}).
alpha - Rescaling Correction Factor (e^{m_old - m_new}). Scales down the previously accumulated l and acc whenever a new maximum is discovered.
acc - Running Output Accumulator (stores the unnormalized sum of softmax weights times Values; dividing by l at the end gives the attention output).

If you are modifying or building your own custom high-performance kernels, choose block sizes that fit in your target GPU's on-chip shared memory and align with its warp size (32 threads on NVIDIA GPUs), so that global-memory accesses stay coalesced.

#DeepLearning #MachineLearning #FlashAttention #CUDA #PyTorch #LLMs #GenerativeAI #Transformers
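
Code sketch for the cheat sheet: here is a minimal illustration of how m, l, alpha and acc interact. It is not the PyTorch implementation from the video. It uses plain Python (so it runs without any dependencies), handles a single query row, and assumes the scaled scores q·k_j are already computed. The function names `online_softmax_attention_row` and `reference_attention_row` and the `block_size` parameter are made up for this example.

```python
import math

def online_softmax_attention_row(scores, values, block_size=2):
    """Attention output for ONE query row, processed block by block.

    scores[j] : scaled dot product q.k_j (assumed precomputed, all finite)
    values[j] : row j of V, as a list of d floats
    """
    d = len(values[0])
    m = float("-inf")   # running row maximum
    l = 0.0             # running softmax denominator
    acc = [0.0] * d     # running (unnormalized) output accumulator

    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]

        m_new = max(m, max(s_block))
        alpha = math.exp(m - m_new)               # e^{m_old - m_new}; 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_block]  # exponentials shifted by the new maximum

        # Rescale the history, then add this block's contribution.
        l = alpha * l + sum(p)
        acc = [
            alpha * a + sum(pj * vj[k] for pj, vj in zip(p, v_block))
            for k, a in enumerate(acc)
        ]
        m = m_new

    # Normalize once at the end.
    return [a / l for a in acc]

def reference_attention_row(scores, values):
    """Ordinary softmax(scores) @ V over the full row, for comparison."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    d = len(values[0])
    return [sum(pj * vj[k] for pj, vj in zip(p, values)) / total for k in range(d)]

# Example: the large scores would overflow a naive exp(x) without the running maximum.
scores = [1.0, 3.0, 2.0, 1000.0, 999.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, -1.0], [0.5, 0.5]]
tiled = online_softmax_attention_row(scores, values, block_size=2)
full = reference_attention_row(scores, values)
print(tiled, full)
```

The tiled version never holds more than block_size scores at a time, yet it should agree with the full-row softmax up to floating-point rounding. A real kernel does the same bookkeeping for a whole Q_block at once and would also need to handle masked (-inf) scores, which this sketch does not.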