Can Interpretability Control Model Training?

A talk I gave to my MATS 9.0 Training Program on using interpretability to steer finetuning. If this kind of research sounds interesting to you, apply to do research with me in MATS! Due 23 Dec: tinyurl.com/neel-mats-app

0:00:00 Introduction: Three Ways to Steer AI
0:01:45 Ablating Concepts With CAFT
0:07:45 Preventative Steering
0:13:10 Filtering Data with Attribution
0:17:30 Applying to RL?
