Alignment faking in large language models

Most of us have encountered situations where someone appears to share our views or values but is in fact only pretending to do so, a behavior that we might call “alignment faking”. Could AI models also display alignment faking?

Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly (or even, we argue, implicitly) trained or instructed to do so.

Learn more: https://www.anthropic.com/research/al...

0:00 Introduction
0:47 Core setup and key findings of the paper
6:14 Understanding alignment faking through real-world analogies
9:37 Why alignment faking is concerning
14:57 Examples of model outputs
21:39 Situational awareness and synthetic documents
28:00 Detecting and measuring alignment faking
38:09 Model training results
47:28 Potential reasons for model behavior
53:38 Frameworks for contextualizing model behavior
1:04:30 Research in the context of current model capabilities
1:09:26 Evaluations for bad behavior
1:14:22 Limitations of the research
1:20:54 Surprises and takeaways from results
1:24:46 Future directions
