DeepSpeed Makes Training a 13B LLM Possible On YOUR Hardware

AI news for builders: DeepSpeed's ZeRO lets you train a 13B large language model on two used RTX 3090s for under $1,000.

For anyone doing local AI or LLM fine-tuning on consumer hardware: a 13-billion-parameter model needs 208 GB of training-state memory with mixed-precision Adam, more than 8x what a single 24 GB RTX 3090 holds. Microsoft's DeepSpeed library solves this not by shrinking the model, but by partitioning that memory across multiple GPUs so no single card carries the full load.

This video breaks down exactly how DeepSpeed's Zero Redundancy Optimizer (ZeRO) works, stage by stage. Stage one shards optimizer state for up to a 4x memory cut with no added communication cost. Stage two adds gradient partitioning for up to 8x. Stage three partitions the model weights themselves, so the reduction scales linearly with GPU count at the price of roughly 50% more communication. Then ZeRO-Offload pushes the remaining optimizer state onto cheap system RAM instead of VRAM, and ZeRO-Infinity extends that further to NVMe drives. DeepSpeed's ZeRO was also part of the stack used to train BLOOM's 176 billion parameters.

You'll see the real numbers: 16 bytes per parameter, the exact math behind that 208 GB figure, why 64-128 GB of system RAM matters more than people admit, and why DeepSpeed only requires a JSON config change, not a rewritten training loop, as HuggingFace's own T5-3B demonstration shows. The video also covers where this approach loses to renting a cloud A100, and when parameter-efficient methods like LoRA make more sense than a full fine-tune.

This is for builders weighing a local training rig against cloud GPU rentals, and anyone trying to understand how large language models actually get trained outside a data center.
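
The 16-bytes-per-parameter arithmetic and the config change described above can be sketched roughly as follows. This is a minimal illustration, not the video's own code: it assumes the per-parameter breakdown from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, 12 bytes of fp32 Adam state), it ignores activations, buffers, and fragmentation, and the helper names `zero_bytes_per_gpu` and `zero_config` are hypothetical. The batch size in the config is a placeholder, not a recommendation.

```python
import json

# Assumed per-parameter costs for mixed-precision Adam (ZeRO paper breakdown).
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_FP32_OPTIMIZER = 12  # fp32 weight copy + momentum + variance

def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate training-state bytes held by each GPU for a given ZeRO stage.

    Stage 0 replicates everything; each higher stage shards one more
    component evenly across the GPUs. Activations are not counted.
    """
    weights = BYTES_FP16_WEIGHTS * num_params
    grads = BYTES_FP16_GRADS * num_params
    optimizer = BYTES_FP32_OPTIMIZER * num_params
    if stage >= 1:
        optimizer /= num_gpus  # stage 1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus      # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3: also shard the weights
    return weights + grads + optimizer

def zero_config(stage, offload_optimizer_to_cpu=False):
    """Build a small DeepSpeed-style config dict (illustrative values only)."""
    config = {
        "train_micro_batch_size_per_gpu": 1,  # placeholder value
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }
    if offload_optimizer_to_cpu:
        # ZeRO-Offload: keep optimizer state in system RAM instead of VRAM.
        config["zero_optimization"]["offload_optimizer"] = {
            "device": "cpu",
            "pin_memory": True,
        }
    return config

if __name__ == "__main__":
    params_13b = 13_000_000_000
    for stage in range(4):
        gb = zero_bytes_per_gpu(params_13b, num_gpus=2, stage=stage) / 1e9
        print(f"stage {stage}, 2 GPUs: {gb:.0f} GB per GPU")
    print(json.dumps(zero_config(stage=2, offload_optimizer_to_cpu=True), indent=2))
```

Under these assumptions, two GPUs bring the 208 GB baseline down to 130 GB, 117 GB, and 104 GB per card for stages one to three. All of those still exceed a 24 GB RTX 3090, which is why the offload step to system RAM matters for a 13B model on this hardware.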

Chapters:
0:00 Intro
0:16 The 208GB Wall That Kills Every Run
1:19 Why Every GPU Was Hoarding Data
2:19 Stage One: The Free 4x Memory Win
3:09 Stage Two: Doubling Down For Free
4:05 Stage Three: Splitting The Model Itself
5:17 The Trick That Makes 13B Actually Fit
6:35 Turning This On Takes One Config File

Tools & resources mentioned:
DeepSpeed (Microsoft): https://github.com/deepspeedai/DeepSpeed
ZeRO paper (Rajbhandari et al., SC20): https://arxiv.org/pdf/1910.02054
ZeRO-Infinity paper: https://arxiv.org/pdf/2104.07857
HuggingFace: Fit More and Train Faster With ZeRO: https://huggingface.co/blog/zero-deep...
Microsoft Research ZeRO blog: https://www.microsoft.com/en-us/resea...

About The Stack
The Stack helps you build with AI. Each video takes one tool, model, or workflow and shows how it works in a few focused minutes, with the real benchmarks and real costs. We go deep on Claude Code and Cursor for AI coding, AI agents and MCP servers, the open-source AI tools and GitHub repos most people miss, RAG and vector search, fine-tuning, and running local LLMs on your own machine with Ollama and LM Studio. We compare models like ChatGPT and Claude, test AI automation with Zapier, Make, and n8n, and flag the tools that actually ship.

Subscribe for new breakdowns: https://www.youtube.com/@the-stack-ai...

#DeepSpeed #LLMFineTuning #AINews #MachineLearning #LocalAI
