Fast Finetuning of Gemma-3, Qwen-3 and GPT-OSS on Strix Halo using Unsloth and Multi-Node Setups
In this video, I introduce an updated Strix Halo fine-tuning toolbox with two major improvements: Unsloth integration and multi-node distributed training. The setup builds on the previous fine-tuning tutorial, but now leverages Unsloth's highly optimized Triton kernels to reduce VRAM usage and speed up training for models like Gemma 3.

I cover the software details that make this possible: how Unsloth dynamically patches the Hugging Face Transformers library, and why standard PyTorch Autograd is less efficient for these specific architectures. I also show a side-by-side comparison of full fine-tuning and LoRA, demonstrating the massive memory and speed advantages Unsloth provides, along with the specific commits and patches required to get it running on ROCm.

For the distributed training side, I walk through running DDP (Distributed Data Parallel) and FSDP (Fully Sharded Data Parallel) across a 2-node Strix Halo cluster. Just like with vLLM, the main blocker here was missing RCCL support for gfx1151 in upstream ROCm. I explain how I incorporated my patched RCCL library into the toolbox, which lets us split training workloads across multiple machines using either RDMA or standard Ethernet, and how to reproduce the setup using my cluster management scripts.

Timestamps
00:00 – Introduction
03:26 – Starting the Toolbox
07:00 – How Unsloth Works (vs PyTorch Autograd)
10:15 – Unsloth Training Demo & Benchmarks
16:50 – Unsloth Patching & Implementation Details
20:28 – Multi-Node Cluster Setup
24:08 – DDP vs FSDP Training Strategies
27:00 – Multi-Node Training Demo
29:53 – Conclusion

Links & Resources
Strix Halo Toolboxes & Guides: https://strix-halo-toolboxes.com
Strix Halo Fine-Tuning Toolbox: https://github.com/kyuz0/amd-strix-ha...
LLM Chronicles (Gradient Descent Deep Dive): https://llm-chronicles.com
DDP vs. FSDP in PyTorch: https://www.jellyfishtechnologies.com...


