Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinakaran
With nearly two-thirds of enterprise developers planning production deployments of large language models this year, LLM evaluation has never been more important. It is also an area where confusion reigns, starting with ambiguity around what “LLM evals” even means. Often, LLM model evaluation, which quantifies general fitness (e.g. on the Hugging Face leaderboard), is conflated with task-specific LLM system evaluation. And while many foundation model providers offer their own evals, AI engineers building LLM systems designed to plug into many models or tools need a way to objectively evaluate both different foundation models and their own systems with rigorous techniques.

In this session, Arize AI founder Aparna Dhinakaran will release research onstage and walk attendees through real-life examples of building an LLM eval from scratch. The session builds on multiple research pieces that have garnered millions of views across social platforms, diving into techniques for building robust LLM evals and ultimately gaining a better understanding of the limits of LLM capabilities. Want to build your own LLM task evals for a specific use case with open source tools? Want to see the latest research on which foundation models your company should be using for specific use cases? You won’t want to miss this session!

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/20... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Aparna

Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a pioneer and early leader in AI observability and LLM evaluation. A frequent speaker at top conferences and a thought leader in the space, Dhinakaran is a Forbes 30 Under 30 honoree. Before Arize, Dhinakaran was an ML engineer and leader at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built several core ML infrastructure platforms, including Michelangelo. She has a bachelor’s from Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group. She is on a leave of absence from the Computer Vision Ph.D. program at Cornell University.


