Same Model, Same Benchmark, 42% vs 95% — What Went Wrong? | Dr. Cozmin Ududec, AI Security Institute

Do you have any questions or points to add to the discussion? Any lightbulb moments? Share in the comments!

Through the Open Seminar Series, we're opening select lectures from the AI Evaluation Programme to anyone in the wider community who wants to learn. These are the same sessions our students attend.

We built evaluation for models that answer questions. Now we have systems that take actions. That changes everything. In this session, Dr. Cozmin Ududec explored how evaluating AI agents requires a different lens: one that looks at behavior over time, not just final outputs, and asks not just "did it succeed?" but "how did it get there?"
