What a 100-year-old horse teaches us about AI
How do we rigorously measure AI's intelligence? We don't really know. What we know is that measuring intelligence is tricky, and if we're not careful, our tests might not measure what we intend. We explore this topic by starting with the story of Clever Hans, a horse who seemingly could do arithmetic. Later, we explain the potential limitations of today's AI benchmarks and how we could do better by looking at the established discipline of cognitive science.

▀▀▀▀▀▀▀▀▀SOURCES & READINGS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
The Project Gutenberg EBook of Clever Hans, by Oskar Pfungst: https://www.gutenberg.org/files/33936...
The Wiring of Intelligence: https://journals.sagepub.com/doi/10.1...
New and emerging models of human intelligence: https://wires.onlinelibrary.wiley.com...
NTIRE2025 Challenge on Cross-Domain Few-Shot Object Detection: Methods and Results: https://arxiv.org/pdf/2504.10685v1
HellaSwag: https://rowanzellers.com/hellaswag/
Are We Done with MMLU? https://arxiv.org/abs/2406.04127
Artificial cognition: How experimental psychology can help generate explainable artificial intelligence: https://link.springer.com/article/10....
o3-mini System Card: https://cdn.openai.com/o3-mini-system...
Measuring Massive Multitask Language Understanding: https://arxiv.org/pdf/2009.03300
Requiem for nutrition as the cause of IQ gains: Raven's gains in Britain 1938–2008: https://www.sciencedirect.com/science...
Observational Scaling Laws and the Predictability of Language Model Performance: https://doi.org/10.48550/arXiv.2405.1...
Introducing Claude 4 (agentic benchmarks): https://www.anthropic.com/news/claude-4
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses: https://turntrout.com/original-truthf...
Phonological memory and vocabulary development during the early school years: A longitudinal study: https://psycnet.apa.org/doi/10.1037/0...
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark: https://arxiv.org/pdf/2412.15194
Smelling themselves: Dogs investigate their own odours longer when modified in an “olfactory mirror” test: https://doi.org/10.1016/j.beproc.2017...
Elephants' jumbo mirror ability: http://news.bbc.co.uk/2/hi/science/na...
ARC Prize 2024: Technical Report: https://arxiv.org/pdf/2412.04604
Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others: https://arxiv.org/pdf/2102.11938v1
CogBench: a large language model walks into a psychology lab: https://arxiv.org/pdf/2402.18225
The Animal-AI Environment: A virtual laboratory for comparative cognition and artificial intelligence research: https://doi.org/10.3758/s13428-025-02...
A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment: https://openreview.net/forum?id=eUkbT...

More about cross-benchmark metrics and tests inspired by cognitive science:
General Scales Unlock AI Evaluation with Explanatory and Predictive Power: https://arxiv.org/abs/2503.06378
I SPY WITH MY MODEL’S EYE: VISUAL SEARCH AS A BEHAVIOURAL TEST FOR MLLMS: https://arxiv.org/pdf/2510.19678
INTUIT: Investigating intuitive reasoning in humans and language models: https://escholarship.org/content/qt33...

▀▀▀▀▀▀▀▀▀PATREON, MEMBERSHIP, MERCH▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🟠 Patreon: / rationalanimations
🔵 Channel membership: / @rationalanimations
🟢 Merch: https://rational-animations-shop.four...
🟤 Ko-fi, for one-time and recurring donations: https://ko-fi.com/rationalanimations

▀▀▀▀▀▀▀▀▀SOCIAL & DISCORD▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Rational Animations Discord: / discord
Reddit: / rationalanimations
X/Twitter: / rationalanimat1
Instagram: / rationalanimations

▀▀▀▀▀▀▀▀▀PATRONS & MEMBERS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Thanks to our patrons and channel members from the Simple Adder tier and above: https://docs.google.com/document/d/1p...

▀▀▀▀▀▀▀CREDITS▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
Credits here: https://docs.google.com/document/d/1d...

