Hacking an LLM's Personality with Representation Engineering

Papers & Resources

[Persona Vectors: Monitoring and Controlling Character Traits in Language Models](https://arxiv.org/abs/2507.21509) = Interpretability
  [Blog post](https://www.anthropic.com/research/pe...)
  [Code Repo](https://github.com/safety-research/pe...)
  [Anthropic Thread](https://x.com/AnthropicAI/status/1951...)
  [Anthropic Hiring](https://x.com/Jack_W_Lindsey/status/1...)

[A Simple but Tough-to-Beat Baseline for Sentence Embeddings](https://openreview.net/pdf?id=SyK00v5xx)

[Improving Reasoning Performance in Large Language Models via Representation Engineering](https://arxiv.org/abs/2504.19483)
  [Poster: #246 ICLR](https://iclr.cc/virtual/2025/poster/3...)
  Control vectors derived from positive and negative reasoning outcomes

[Learning without training: The implicit dynamics of in-context learning](https://arxiv.org/abs/2507.16003)
  [Author Tweet](https://x.com/mikemunnster/status/194...)
  Earlier work: [Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers](https://arxiv.org/abs/2212.10559)

[It's Owl in the Numbers: Token Entanglement in Subliminal Learning](https://owls.baulab.info/)

[Colab (as in the video)](https://colab.research.google.com/dri...)

Blurb

Is it possible to perform surgery on an LLM's brain? In this video, we dive deep into one of the most exciting new frontiers in AI research: Representation Engineering, and specifically Anthropic's work on "Persona Vectors."

We've all struggled with prompting LLMs to get them to behave exactly as we want, fighting against baked-in behaviors like sycophancy, evasiveness, or even hallucination. But what if we could stop treating the model like a black box and instead edit its internal states directly? We'll break down the papers to understand how researchers are identifying and manipulating the very vectors that control these complex personality traits.

By the end of this video, you'll understand:
- The core mechanism behind Anthropic's "Persona Vectors"
- How this technique relates (sometimes tangentially) to other research works
- The potential this unlocks for creating safer, more reliable, and precisely controlled AI systems

ABOUT THE CHANNEL

My channel is for "The AI Builder": the developer, tinkerer, and hands-on enthusiast. We go beyond the headlines to understand the mechanisms behind the latest research, empowering you to build the future. From the Lab to Your Laptop.

SOCIALS

https://github.com/mdda / martinandrews
https://x.com/mdda123

#AI #LLM #MachineLearning #Research #LatentSpace #AIExplained #PersonaVector #Anthropic #Colab
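
For builders who want a feel for the idea before watching: the "control vectors derived from positive and negative" examples mentioned in the resources can be illustrated with a toy difference-of-means sketch. This is not the code from the paper, the repo, or the Colab. It runs on synthetic NumPy arrays standing in for hidden activations, and the names `persona_vector`, `steer`, `trait_score`, and `alpha` are hypothetical, chosen here for illustration only.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    """Difference of mean activations: trait-exhibiting minus trait-free.
    Both inputs have shape (num_examples, hidden_dim)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, alpha):
    """Shift every hidden state by alpha along the unit-normalised vector.
    Positive alpha pushes toward the trait, negative alpha away from it."""
    unit = vector / np.linalg.norm(vector)
    return hidden + alpha * unit

def trait_score(hidden, vector):
    """Monitoring: project hidden states onto the unit-normalised vector."""
    unit = vector / np.linalg.norm(vector)
    return hidden @ unit

# Synthetic stand-in for real model activations (assumption: in a real
# setup these would be read from one layer of an LLM).
rng = np.random.default_rng(0)
d = 16
trait_dir = np.zeros(d)
trait_dir[0] = 1.0  # the "true" trait direction in this toy world

pos = rng.normal(size=(200, d)) + 2.0 * trait_dir  # trait present
neg = rng.normal(size=(200, d)) - 2.0 * trait_dir  # trait absent

v = persona_vector(pos, neg)

h = rng.normal(size=(5, d))            # some new hidden states
steered = steer(h, v, alpha=4.0)       # amplify the trait
suppressed = steer(h, v, alpha=-4.0)   # dampen the trait

print("score before:", trait_score(h, v).round(2))
print("score after: ", trait_score(steered, v).round(2))
```

In a real model the same two steps (read activations to build the vector, then add a scaled copy back in) would be applied at a chosen layer during generation; which layer and what scale work well is an empirical question that this sketch does not address.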
