Fine-tuning LLMs on Human Feedback (RLHF + DPO)

🤝 Want your team getting the most out of Claude? I run 1:1 and team AI workshops for companies doing $1M+ per year: https://aibuilder.academy/yt/bbVoDXoPrPM

Here, I discuss how to use reinforcement learning to fine-tune LLMs on human feedback (i.e. RLHF) and a more efficient reformulation of it (i.e. DPO).

📰 Read more: https://medium.com/@shawhin/fine-tuni...
Example code: https://github.com/ShawhinT/YouTube-B...
🤗 Dataset: https://huggingface.co/datasets/shawh...
🤗 Fine-tuned Model: https://huggingface.co/shawhin/Qwen2....

References
[1] arXiv:2407.21783 [cs.AI]
[2] arXiv:2203.02155 [cs.CL]
[3] arXiv:1707.06347 [cs.LG]
[4] Deep Dive into LLMs like ChatGPT
[5] arXiv:2305.18290 [cs.LG]

Intro - 0:00
Base Models - 0:25
InstructGPT - 2:20
RL from Human Feedback (RLHF) - 5:18
Proximal Policy Optimization (PPO) - 9:20
Limitations of RLHF - 10:30
Direct Preference Optimization (DPO) - 11:50
Example: Fine-tuning Qwen on Title Preferences - 14:29
Step 1: Curate preference data - 17:49
Step 2: Fine-tuning with DPO - 20:53
Step 3: Evaluate fine-tuned model - 25:27
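
For readers who want a feel for what DPO optimizes before watching, here is a minimal sketch of the per-example DPO loss from reference [5]. It is not the code from the linked repository. The function name dpo_loss, its argument names, and the default beta=0.1 are assumptions made for illustration. The inputs are assumed to be summed log-probabilities of a full response under the model being trained (the policy) and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch, names are hypothetical).

    Each argument is the log-probability of a whole response, either the
    human-preferred ("chosen") or the dispreferred ("rejected") one.
    """
    # How much more likely each response is under the policy than under
    # the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # beta scales how strongly the policy is pushed away from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the chosen response's likelihood and lowering the rejected one's
# (relative to the reference) reduces the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss only needs a preference pair and two sets of log-probabilities, so there is no separate reward model and no RL rollout loop, which is the efficiency gain over RLHF with PPO mentioned above.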

How to Train LLMs to "Think" (o1 & DeepSeek-R1)

Compressing Large Language Models (LLMs) | w/ Python Code

2.1 | The LLM Assisted Workflow

Reinforcement Learning: A (practical) introduction

How to Improve LLMs with RAG (Overview + Python Code)

How to Evaluate (and Improve) Your LLM Apps

How to Build an LLM from Scratch | An Overview

Fine-Tuning Text Embeddings For Domain-specific Search (w/ Python)

Reinforcement Learning with LLMs: a new era of AI agents

QLoRA—How to Fine-tune an LLM on a Single GPU (w/ Python Code)

GraphRAG: The Marriage of Knowledge Graphs and RAG: Emil Eifrem

Reinforcement Learning from Human Feedback (RLHF) Explained

Agent Skills vs MCP: What’s the difference?

RL for Agents Workshop - Deep Dive on Training Agents with RL and Open Source

Fine-tuning LLMs for Tool Use (w/ Example Code)

Claude Code Explained in 47 Minutes [Complete Course]

LoRA & QLoRA Fine-tuning Explained In-Depth

Speech LLMs: Models that listen and talk back

[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han

How LLMs survive in low precision | Quantization Fundamentals
