Compress LLMs Like a Pro: FP8, GPTQ & SmoothQuant Explained

🚀 Large Language Models are powerful, but they can be expensive to run. In this tutorial, you'll learn how to use llmcompressor to apply post-training quantization (PTQ) techniques that reduce model size, improve inference speed, and lower deployment costs without retraining your model. Using the Qwen2.5 model as a practical example, we'll explore multiple quantization strategies, benchmark their performance, and evaluate the trade-offs between efficiency and model quality.

📌 What You'll Learn
✅ What is Post-Training Quantization (PTQ)?
✅ Introduction to the llmcompressor library
✅ FP8 Dynamic Quantization explained
✅ GPTQ Quantization workflow
✅ SmoothQuant implementation and use cases
✅ Preparing and using calibration datasets
✅ Measuring model perplexity and output quality
✅ Benchmarking inference throughput and latency
✅ Comparing compression techniques side-by-side
✅ Optimizing Qwen2.5 for production deployment
✅ Best practices for efficient LLM serving
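
To make the core idea concrete before reaching for a library, here is a minimal, dependency-free sketch of what post-training quantization does to a handful of weights, plus the perplexity formula used to check quality afterwards. It is not the llmcompressor API and not FP8, GPTQ, or SmoothQuant: it uses plain symmetric round-to-nearest INT8 as a simpler stand-in, and the function names and sample values are made up for illustration.

```python
import math

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Illustrative values only, not taken from any real model.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why a single large outlier weight hurts precision for every other value in the tensor. Calibration-based methods such as GPTQ and SmoothQuant exist largely to reduce that kind of error, and comparing perplexity before and after quantization is one way to measure how much quality was lost.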
