Transformer Architecture Explained (What Changed Since 2017)

Part 1 of the Modern LLM Architectures series. We go inside the modern decoder-only block (Transformer Architecture): RoPE, RMSNorm + QK-Norm, SwiGLU, GQA, MLA, sliding window, NoPE, Flash Attention, the Chinchilla wall, and the KV cache tax that decides whether your model is shippable.

🧪 BUILD WITH THIS: PREPORATO LABS
real GPUs · all in the browser
Fine-tune Llama with LoRA: https://preporato.com/labs/fine-tune-...
Profile attention with PyTorch Profiler: https://preporato.com/labs/pytorch-pr...
Serve a model with vLLM: https://preporato.com/labs/vllm-serving
Quantization (FP8 / INT4 / AWQ): https://preporato.com/labs/quantization
Continued pretraining: https://preporato.com/labs/continued-...
All AI/ML labs: https://preporato.com/labs

TIMESTAMPS:
0:00 Intro
0:57 The 2017 block
2:33 Decoder-only wins
3:50 RoPE
6:20 Normalization
9:19 SwiGLU
11:13 KV cache problem
13:48 Attention zoo
17:10 Flash Attention
19:19 Beyond Chinchilla
22:09 Bandwidth tax
23:52 The 2026 block
27:04 Part 2 →

SOURCES:
• Sebastian Raschka, The Big LLM Architecture Comparison https://magazine.sebastianraschka.com...
• DeepSeek-V3 Technical Report https://arxiv.org/abs/2412.19437
• Gemma 3 Technical Report https://arxiv.org/abs/2503.19786
• Qwen 3 Technical Report https://arxiv.org/abs/2505.09388
• Beyond Chinchilla-Optimal: Accounting for Inference https://arxiv.org/abs/2401.00448
• RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE) https://arxiv.org/abs/2104.09864
• FlashAttention-2: Faster Attention with Better Parallelism https://arxiv.org/abs/2307.08691

#transformer #ai #llm
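
QUICK SKETCH: THE KV CACHE TAX
A rough way to see why the KV cache decides whether a model is shippable, and why GQA helps: the cache stores one key and one value vector per layer, per KV head, per token. The helper name `kv_cache_bytes` and the model shape below (32 layers, head dimension 128, 8192 tokens, fp16) are assumptions chosen for illustration, not figures from the video or from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only model.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical shape: 32 layers, head_dim 128, 8192-token context.
# MHA keeps one KV head per query head (32 here); GQA shares 8 KV heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.1f} GiB per sequence")
print(f"GQA: {gqa / 2**30:.1f} GiB per sequence")
print(f"Reduction: {mha // gqa}x")
```

Under these assumed numbers, cutting KV heads from 32 to 8 shrinks the per-sequence cache by 4x. The estimate ignores MLA-style compression, sliding-window layers, and cache quantization, each of which changes the arithmetic.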