Can Interpretability Control Model Training?
A talk I gave to my MATS 9.0 Training Program on using interpretability to steer finetuning.

If this kind of research sounds interesting to you, apply to do research with me in MATS! Applications are due 23 Dec: tinyurl.com/neel-mats-app

0:00:00 Introduction: Three Ways to Steer AI
0:01:45 Ablating Concepts With CAFT
0:07:45 Preventative Steering
0:13:10 Filtering Data with Attribution
0:17:30 Applying to RL?

