NEURAL NETWORKS ARE WEIRD! - Neel Nanda (DeepMind)

SPONSOR MESSAGES:
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. https://centml.ai/pricing/

Neel Nanda leads the mechanistic interpretability team at Google DeepMind. At 26, he's become one of the most prominent researchers working on the question of what's actually going on inside neural networks -- systems that can win IMO medals and write complex software, but which nobody actually designed or understands.

This nearly four-hour conversation is a deep technical dive into the field. Nanda explains why machine learning is fundamentally weird: we produce artifacts that do impressive things, but unlike conventional software, no one wrote the code or planned the architecture. His team's goal is to reverse-engineer these systems by finding the internal structures and algorithms that emerge during training.

The discussion covers the mechanics of sparse autoencoders at length -- how they decompose model activations into interpretable feature vectors, the mathematical foundations (ReLU vs TopK activation functions), scaling laws for feature learning, and the engineering challenges of running them at the scale of frontier models. Nanda walks through the Golden Gate Claude experiment (amplifying a single feature to make Claude obsessed with the Golden Gate Bridge), induction heads (circuits associated with in-context learning), and activation patching as a causal intervention technique.

On AI safety, Nanda is pragmatic. He argues that mechanistic interpretability gives us genuine empirical evidence about questions that are otherwise stuck in philosophical debate -- do models have goals? Do they deceive? He also discusses the limitations: sparse autoencoders haven't yet demonstrated capabilities beyond what fine-tuning already achieves, and a sufficiently complex model could potentially present a facade to interpretability measurements.
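For readers unfamiliar with the ReLU vs TopK distinction mentioned above, here is a minimal sketch of a sparse autoencoder's forward pass in plain Python. It is not code from the episode or from any particular library: the weights are arbitrary and untrained, the names (`encode`, `decode`, `relu`, `topk`) are made up for illustration, and clamping negatives after the top-k selection is an assumption about one common variant. The only point is to show that ReLU leaves the number of active features up to the input, while TopK fixes it at exactly k.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(pre: Vector) -> Vector:
    """ReLU activation: keep every positive pre-activation."""
    return [max(0.0, p) for p in pre]


def topk(pre: Vector, k: int) -> Vector:
    """TopK activation: keep only the k largest pre-activations, zero the rest.

    Negatives that survive the selection are clamped to zero (an assumption).
    """
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(0.0, p) if i in keep else 0.0 for i, p in enumerate(pre)]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Map a model activation to feature pre-activations."""
    return [p + b for p, b in zip(matvec(w_enc, x), b_enc)]


def decode(feats: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [r + b for r, b in zip(matvec(w_dec, feats), b_dec)]


# Toy, untrained weights: 2-dimensional activation, 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n_features x d_model
B_ENC = [0.0, 0.0, -0.5]
W_DEC = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]    # d_model x n_features
B_DEC = [0.0, 0.0]

x = [2.0, 1.0]
pre = encode(x, W_ENC, B_ENC)

relu_feats = relu(pre)      # all three features fire
topk_feats = topk(pre, 1)   # exactly one feature fires

relu_recon = decode(relu_feats, W_DEC, B_DEC)
topk_recon = decode(topk_feats, W_DEC, B_DEC)

print("ReLU features:", relu_feats)
print("TopK features:", topk_feats)
```

In a trained sparse autoencoder the encoder and decoder weights are learned so that the reconstruction stays close to the input while few features are active; with ReLU that sparsity is usually encouraged by a penalty term in the loss, whereas TopK enforces it directly in the activation.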
The conversation covers his path from pure maths at Cambridge through Anthropic to DeepMind, and why he thinks hands-on coding matters more than reading papers for new researchers entering the field.

---
REFERENCES:
person:
[00:00:00] Neel Nanda - Personal Website
https://www.neelnanda.io/
tool:
[00:35:00] TransformerLens
https://github.com/TransformerLensOrg...
paper:
[01:00:31] A Mathematical Framework for Transformer Circuits
https://transformer-circuits.pub/2021...
[01:01:40] In-context Learning and Induction Heads
https://transformer-circuits.pub/2022...
[01:21:06] Scaling Monosemanticity
https://transformer-circuits.pub/2024...
[01:33:27] Refusal in Language Models Is Mediated by a Single Direction
https://arxiv.org/abs/2406.11717

---
LINKS:
Full Transcript: https://app.rescript.info/share/acb41...
Download PDF transcript: https://app.rescript.info/api/public/...

NEEL NANDA:
https://www.neelnanda.io/
https://scholar.google.com/citations?...
https://x.com/NeelNanda5