Training models with only 4 bits | Fully-Quantized Training

Can you really train a large language model in just 4 bits? In this video, we explore the cutting edge of model compression: fully quantized training in FP4 (4-bit floating point). While quantization has traditionally focused on inference, new research pushes the limits of training efficiency, reducing memory, compute, and cost.

🧠 We cover:
✅ NVIDIA Tensor Cores for mixed precision training
✅ Micro-scaling (MX) data formats
✅ Modeling tricks for 4-bit gradients (e.g. Stochastic Rounding)

📎 Resources:
🔵 Main paper: https://arxiv.org/abs/2505.19115
🔵 US congressional report on DeepSeek: https://selectcommitteeontheccp.house...
🔵 Slide deck and full reading list: / juliaturc

Watch the entire quantization series here: • Model Quantization

00:00 Intro
01:00 Motivation (training is expensive)
03:06 Mixed precision
05:40 Hardware support: FP4 in NVIDIA Blackwell
13:51 Microscaling formats (MXFP4 & NVFP4)
17:45 Why not INT4?
19:51 Modeling tricks: Stochastic Rounding
22:26 Outro

Reverse-engineering GGUF | Post-Training Quantization

Albert Tseng - Training LLMs with MXFP4

The myth of 1-bit LLMs | Quantization-Aware Training

Ex-Google Insider: You're Not Ready For The Next Phase of AI

Yann LeCun's $1B Bet Against LLMs [Part 1]

Gentle Introduction to NVFP4!

How LLMs survive in low precision | Quantization Fundamentals

Training LLMs at Scale - Deepak Narayanan | Stanford MLSys #83

How is hardware reshaping LLM design?

How To Think SO CLEARLY People Assume You're A Genius

Knowledge Distillation: How LLMs train each other

Robotics' End Game: Nvidia's Jim Fan

Quantizing LLMs - How & Why (8-Bit, 4-Bit, GGUF & More)

The Tiny Idea That Lets Anyone Fine-Tune AI

Hierarchical Reasoning Model: Substance or Hype?

Yann LeCun: World Models: Enabling the next AI revolution

Why are diffusion LLMs so fast?

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

What it takes to build *realtime* voice models | Voxtral