Science of Misalignment
If a future model were to be dangerously misaligned, could we tell? If this kind of research sounds interesting to you, apply to do research with me in MATS! Due 23 Dec: tinyurl.com/neel-mats-app

00:00:00 The Problem with Viral Demos
00:06:49 Hunting for "Eval Awareness"
00:17:00 Debunking the Shutdown Demo
00:24:00 Why Do Models Blackmail
00:31:33 A New Tool: The Resilience Score
00:32:30 The Science of Misalignment
00:35:45 How to Convince Skeptics?
00:47:00 The Future of AI Psychology

