Build Self-Attention from Scratch in Python (Transformer Core, No PyTorch)

Self-attention is the heart of the transformer architecture behind large language models such as GPT, Claude, and Llama. But the core mechanism is surprisingly small: a couple of matrix multiplies, a scale, and a softmax. In this hands-on tutorial we build it from scratch in pure NumPy, with no PyTorch or TensorFlow, so you can see exactly what's happening.

What we build, step by step:
• A numerically stable softmax (the only nonlinearity in attention)
• Scaled dot-product attention: queries, keys, values, and the QKᵀ/√d_k score matrix
• Causal masking so a token can't peek at the future (autoregressive attention)
• Multi-head self-attention that splits the feature dimension across parallel heads
• An interpretable demo on a toy sentence with an ASCII attention heatmap

By the end you'll understand what Q, K, and V actually are, why we divide by √d_k, how the causal mask makes GPT-style models autoregressive, and why multiple heads help. Everything runs in under a second.

Stack: Python 3, NumPy. No GPU, no frameworks, ~70 lines of code total.

Chapters:
00:00 Why attention is just matmuls + softmax
00:30 Stable softmax
01:30 Scaled dot-product attention
02:45 Causal masking
04:00 Multi-head attention
05:15 Interpretable demo + invariants

#machinelearning #transformers #python #deeplearning #llm
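
For readers who want a preview, the steps listed above can be sketched in a few dozen lines of NumPy. This is a minimal illustration and not the code from the video: the function names (`softmax`, `attention`, `multi_head_self_attention`), the square `(d_model, d_model)` weight matrices, and the random toy input are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention(q, k, v, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if causal:
        t_q, t_k = scores.shape[-2], scores.shape[-1]
        # True strictly above the diagonal, i.e. at "future" positions.
        future = np.triu(np.ones((t_q, t_k), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads, causal=False):
    """Split d_model across n_heads, attend per head, then merge and project."""
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    def split(m):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads, weights = attention(q, k, v, causal=causal)
    # (n_heads, seq_len, d_head) -> (seq_len, d_model)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Toy input: 5 tokens, 8 features, 2 heads (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads, causal=True)
```

Here `weights` has shape `(n_heads, seq_len, seq_len)`. With `causal=True`, each row sums to 1 and every entry above the diagonal is 0, so no token attends to a later one.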