Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

Most agents get tested by running a few queries and checking whether the output looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things.

This workshop builds a complete eval pipeline from scratch on a financial analysis agent: tracing with Phoenix, reading traces before writing a single eval, categorizing failures by root cause, then building code evals, built-in LLM-as-a-judge evals, and a custom rubric with labeled examples.

The sharpest lesson: choosing the right eval matters more than tuning it. A correctness eval scored 0 out of 13 on the same agent that a faithfulness eval scored 13 out of 13, because the model doesn't know what year it is and can't verify forward-looking financial data.

The workshop closes on the thing most eval content skips: experiments that let you prove a prompt change actually worked, rather than eyeballing it and calling it a win.

Speaker info:
https://x.com/seldo
https://github.com/seldo

Timestamps:
0:00:00 Introduction
0:00:14 Workshop Overview
0:04:31 Troubleshooting Phoenix Setup
0:05:17 Fundamentals of Evals and Tracing
0:18:44 Anatomy of an Eval Result
0:21:19 The Iteration Loop
0:26:58 Building the Financial Analysis Agent
0:33:28 Using Phoenix for Observability
0:35:38 Running Multiple Test Queries
0:38:12 Reading and Categorizing Traces
0:49:52 Implementing Code Evals
0:57:51 Built-in LLM-as-a-Judge Evals
1:03:04 Faithfulness Evaluation
1:04:35 Designing a Custom Eval Rubric
1:11:47 Running the Actionability Judge
1:19:14 Using Data Sets and Experiments
1:50:19 Final Tips and Best Practices
1:51:48 Differences Between Phoenix and Arize AX
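To make the code-eval and experiment steps in the description concrete, here is a minimal sketch in plain Python. It does not use the Phoenix API and is not the workshop's code. The `EvalResult` fields, the ticker check, the two stand-in agents, and the three-row dataset are all assumptions made up for illustration. The idea is only that a deterministic eval run over a fixed dataset gives a pass rate you can compare before and after a prompt change.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    # Hypothetical result shape: a label, a numeric score, and a reason.
    label: str
    score: float
    explanation: str


def mentions_requested_tickers(tickers, answer):
    """Code eval: deterministic check, no LLM involved."""
    missing = [
        t for t in tickers
        if not re.search(rf"\b{re.escape(t)}\b", answer)
    ]
    if missing:
        return EvalResult("fail", 0.0, f"missing tickers: {', '.join(missing)}")
    return EvalResult("pass", 1.0, "all requested tickers mentioned")


def run_experiment(dataset, agent, evaluator):
    """Run one agent version over a fixed dataset and return its pass rate."""
    results = [evaluator(row["tickers"], agent(row["query"])) for row in dataset]
    return sum(r.score for r in results) / len(results)


# Hypothetical dataset: the same queries are reused for every agent version.
DATASET = [
    {"query": "Compare AAPL and MSFT", "tickers": ["AAPL", "MSFT"]},
    {"query": "Outlook for NVDA", "tickers": ["NVDA"]},
    {"query": "Is TSLA overvalued?", "tickers": ["TSLA"]},
]


def baseline_agent(query):
    # Stand-in for the agent before the prompt change (assumption).
    return "The company looks solid overall."


def improved_agent(query):
    # Stand-in for the agent after the prompt change (assumption).
    return f"Analysis for: {query}"


if __name__ == "__main__":
    before = run_experiment(DATASET, baseline_agent, mentions_requested_tickers)
    after = run_experiment(DATASET, improved_agent, mentions_requested_tickers)
    print(f"pass rate before: {before:.2f}, after: {after:.2f}")
```

Because the dataset and the evaluator stay fixed between runs, a change in pass rate can be attributed to the agent change rather than to different test queries. An LLM-as-a-judge eval would slot into the same `evaluator` position, with the caveat from the description that the choice of eval (correctness versus faithfulness) can flip the result entirely.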

Building an AI Dark Factory: A Codebase That Writes Its Own Code, Live

RL for Agents Workshop - Deep Dive on Training Agents with RL and Open Source

Inside the Agentic Enterprise

Agent Optimization with Pydantic AI: GEPA, Evals, Feedback Loops — Samuel Colvin, Pydantic

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar

Becomming AI Engineer/Scientist LIVE #3

Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Build a Complete Medical Chatbot with LLMs, LangChain, Pinecone, Flask & AWS 🔥

Ralph Loops: Build Dumb AI Loops That Ship — Chris Parsons, Cherrypick

Claude Architect: Multi-Agent Orchestration

Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase

Anthropic Workshop: Build Agents That Run for Hours — Ash Prabaker & Andrew Wilson

Free Event: Power BI Beginner to Pro 2026 Edition - Full Hands-On Tutorial

Building your own software factory — Eric Zakariasson, Cursor

Evals 101 — Doug Guthrie, Braintrust

Full Archon Guide - Build AI Coding Harnesses That Actually Ship (LIVE)

Databricks Live Bootcamp | Day1: Introduction & Data Analytics

The best AI agents are simpler than you think

Shipping complex AI applications — Braintrust & Trainline

How to Systematically Setup LLM Evals (Metrics, Unit Tests, LLM-as-a-Judge)
