Everything you need to know about LLM benchmarks (and why they're flawed), OpenAI's HealthBench
Wherever there has been AI, there have been benchmarks. From the Turing test, to society-changing benchmarks like MNIST and ImageNet, to modern challenges like the ARC Prize, benchmarks have served a vital purpose: measuring the performance of AI models. But something has shifted. In the LLM era, have benchmarks lost their utility and become mere advertising for big tech? Even seemingly more sophisticated benchmarks like LM Arena can be gamed by tech giants. We also take a deep dive into healthcare benchmarks like OpenAI's HealthBench (deeply problematic) and Microsoft's AI-DXO orchestrator agent for diagnosis. Where is this all going? How do we make the perfect benchmark? Or is the real work to be done afterwards, in the real world?

👋 Hey! If you are enjoying our conversations, reach out and share your thoughts and journey with us. Don't forget to subscribe whilst you're here :)

Timestamps
00:00 Intro - The OG benchmarks - Turing test, MNIST, ImageNet
06:40 Are large language model benchmarks similar to humans taking tests?
10:05 Are we testing model capability or production readiness?
12:00 LLM era - data contamination
15:30 LM Arena - The Leaderboard Illusion paper - how big tech games benchmarks
28:35 Goodhart's law - When a measure becomes a target, it ceases to be a good measure
32:05 Some good benchmarks - games - Pokemon, ARC Prize, Minecraft
34:35 Medical benchmarks - OpenAI's HealthBench has some big problems
46:50 Microsoft AI-DXO orchestrator for case reports

👨🏻‍⚕️ Doc - Dr. Joshua Au Yeung - / dr-joshua-auyeung
🤖 Dev - Zeljko Kraljevic - / zeljkokr

References
Rethinking benchmarks, data contamination paper - https://arxiv.org/pdf/2311.04850
Leaderboard Illusion - https://arxiv.org/pdf/2504.20879
OpenAI's HealthBench subanalysis - / a-closer-look-at-openais-new-healthbench-e...
Microsoft - towards sequential DDx - https://arxiv.org/pdf/2506.22405

YT - / @devanddoc
Spotify - https://podcasters.spotify.com/pod/sh...
Apple - https://podcasts.apple.com/gb/podcast...
Substack - https://aiforhealthcare.substack.com/

For enquiries - 📧 [email protected]
🎞️ Editor - Dragan Kraljević - / dragan_kraljevic
🎨 Brand design and art direction - Ana Grigorovici - https://www.behance.net/anagrigorovic...