Reverse-engineering GGUF | Post-Training Quantization

The first comprehensive explainer for the GGUF quantization ecosystem.

GGUF quantization is currently the most popular tool for Post-Training Quantization. GGUF is actually a binary file format for quantized models, sitting on top of GGML (a lean PyTorch alternative) and llama.cpp (an LLM inference engine). Due to its ad-hoc open-source nature, GGUF is poorly documented and misunderstood. Currently, information is scattered across Reddit threads and GitHub pull requests.

📌 Main topics covered in this video:
The ecosystem: GGML, llama.cpp, GGUF
Legacy quants vs K-quants vs I-quants
The importance matrix
Mixed precision (_S, _M, _L, _XL variants)

If you enjoyed this video, watch my entire series on model quantization: Model Quantization

📬 Have feedback or spotted an error? Contribute to the GitHub repo or leave a comment!
https://github.com/iuliaturc/gguf-docs

00:00 Intro
01:36 The stack: GGML, llama.cpp, GGUF
04:05 End-to-end workflow
05:29 Overview: Legacy, K-quants, I-quants
06:03 Legacy quants (Type 0, Type 1)
10:57 K-quants
13:43 I-quants
17:42 Importance Matrix
22:51 Recap
23:35 Mixed precision (_S, _M, _L, _XL)
