Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

Full coding of LLaMA 2 from scratch, with full explanation, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, KV Cache, Grouped Query Attention (GQA), the SwiGLU activation function and more!

I explain the most used inference methods: Greedy, Beam Search, Temperature Scaling, Random Sampling, Top K and Top P. I also explain the math behind the Rotary Positional Embedding, with step-by-step proofs.

Repository with PDF slides: https://github.com/hkproj/pytorch-llama
Download the weights from: https://github.com/facebookresearch/l...

Prerequisites:
1) Transformer explained: • Attention is all you need (Transformer) - ...
2) LLaMA explained: • LLaMA explained: KV-Cache, Rotary Position...

Chapters
00:00:00 - Introduction
00:01:20 - LLaMA Architecture
00:03:14 - Embeddings
00:05:22 - Coding the Transformer
00:19:55 - Rotary Positional Embedding
01:03:50 - RMS Normalization
01:11:13 - Encoder Layer
01:16:50 - Self Attention with KV Cache
01:29:12 - Grouped Query Attention
01:34:14 - Coding the Self Attention
02:01:40 - Feed Forward Layer with SwiGLU
02:08:50 - Model weights loading
02:21:26 - Inference strategies
02:25:15 - Greedy Strategy
02:27:28 - Beam Search
02:31:13 - Temperature
02:32:52 - Random Sampling
02:34:27 - Top K
02:37:03 - Top P
02:38:59 - Coding the Inference
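As a rough companion to the inference strategies listed in the description (Greedy, Temperature, Random Sampling, Top K, Top P), here is a minimal sketch in plain Python. It is not the code from the video or the repository: the function names (`softmax`, `greedy`, `top_k_filter`, `top_p_filter`, `sample`) are hypothetical, it works on a plain list of logits rather than PyTorch tensors, and Beam Search is left out.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy strategy: always pick the most probable token id."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_filter(probs, k):
    """Top K: keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Top P: keep the smallest set of most probable tokens whose
    cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng=random):
    """Random sampling from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

A possible usage, assuming a toy vocabulary of four tokens: `probs = softmax([2.0, 1.0, 0.5, -1.0], temperature=0.7)`, then `sample(top_p_filter(probs, 0.9))`. A temperature below 1 sharpens the distribution toward the greedy choice, while a temperature above 1 flattens it.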

LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

Llama 4 From Scratch in PyTorch - Vision Language Models + MoE

pruning

Attention is all you need (Transformer) - Model explanation (including math), Inference and Training

Yann LeCun's $1B Bet Against LLMs [Part 1]

Visualizing transformers and attention | Talk for TNG Big Tech Day '24

BERT explained: Training, Inference, BERT vs GPT/LLamA, Fine tuning, [CLS] token

What Nobody Tells You About Being a Quant

Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer

Distributed Training with PyTorch: complete tutorial with cloud infrastructure and code

What I Learned From Implementing LLM Architectures From Scratch (And How to Get Started)

Attention in transformers, step-by-step | Deep Learning Chapter 6

Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

Turing Award Winner: Disagreeing with Google, Postgres, Future Problems | Mike Stonebraker

Building LLMs from the Ground Up: A 3-hour Coding Workshop

Building an AI Dark Factory: A Codebase That Writes Its Own Code, Live

What are Transformer Models and how do they work?

Keynote: After the AI Hype – What’s Real, and What’s Next - Richard Campbell - 2026

Deep dive - Better Attention layers for Transformer models

Yann LeCun: World Models: Enabling the next AI revolution
