Practical AI Coding Agent Evaluation with SWE-bench, TeamCity, and Juni | Ernst Haagsman

In this talk, Ernst Haagsman, Product Leader at JetBrains, shares his expertise on scaling developer tools from his early days on the PyCharm team to his current role leading TeamCity and AI integration. We explore the practical challenges of evaluating AI coding agents using SWE-bench and how to build a robust CI/CD pipeline for non-deterministic AI outputs.

You’ll learn about:
- The architecture of SWE-bench and how it uses real-world GitHub issues as benchmarks.
- How to apply the "Arrange, Act, Assert" framework to AI agent evaluation.
- Technical strategies for caching dependencies and using Docker to reduce evaluation costs.
- Scaling parallel AI workloads using TeamCity, Kotlin DSL, and AWS infrastructure.
- Techniques for managing LLM API rate limits and handling stochastic model behavior.
- Building custom data sets for specialized AI agents like customer support bots or transcribers.
- The future of "Agentic Development" with a first look at JetBrains Air.

Links:
Repository: https://github.com/jetbrains/teamcity...
Dataset: https://huggingface.co/datasets/SWE-b...

TIMECODES:
00:00:00 Intro: workshop, speakers, and agenda
00:01:46 Demo project: a small Go service and manual testing
00:05:37 AI agents, Juni, and why unit tests don't fit
00:08:18 What SWE-bench is: real GitHub issues as tasks
00:14:18 Evaluation workflow and the SWE-bench harness
00:19:20 Scaling gotchas: cost, retries, caching and prebuilt images
00:23:25 Designing evaluation runs: slicing, CI reuse and TeamCity benefits
00:29:22 Live demo: preparing task images and kicking off evaluations
00:34:02 TeamCity config as code: Kotlin DSL and repo layout
00:43:56 How images and task environments are built and cached
00:49:51 Running the agent (Juni), formatting outputs and grading
00:55:42 Tagging builds, interpreting results and concurrency controls
01:01:14 Parallel vs sequential runs, timing, and reuse trade-offs
01:05:48 Dataset coverage, language scope and model leakage concerns
01:08:50 Aggregating results and visualizing success rates in TeamCity
01:13:06 Interpreting evaluation outcomes and model selection
01:16:49 Applying SWE-bench ideas to your own agent or skill
01:21:06 Getting started: TeamCity, Juni, Air, and next steps

This workshop is designed for Machine Learning Engineers, Data Scientists, and DevOps professionals who are building or evaluating AI agents and need to move from manual testing to automated, scalable benchmarks. It is particularly valuable for those looking to integrate LLM evaluation into their existing CI/CD workflows.

Connect with Ernst
LinkedIn - / ernsthaagsman

Connect with DataTalks.Club:
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
GitHub: https://github.com/DataTalksClub
LinkedIn - / datatalks-club
Twitter - / datatalksclub
Website - https://datatalks.club/

Connect with Alexey
Twitter - / al_grigor
LinkedIn - / agrigorev

Check our free online courses:
ML Engineering course - http://mlzoomcamp.com
Data Engineering course - https://github.com/DataTalksClub/data...
MLOps course - https://github.com/DataTalksClub/mlop...
LLM course - https://github.com/DataTalksClub/llm-...
Open-source LLM course: https://github.com/DataTalksClub/open...
AI Dev Tools course: https://github.com/DataTalksClub/ai-d...
👉🏼 Read about all our courses in one place - https://datatalks.club/blog/guide-to-...

👋🏼 Support/inquiries
If you want to support our community, use this link - https://github.com/sponsors/alexeygri...
If you’re a company, reach us at [email protected]

#AI #MachineLearning #AIAgents #SWEbench #JetBrains #TeamCity #SoftwareEngineering #LLM #DevOps #CICD #DataScience #Python #Automation #CodingAgents #KotlinDSL #AWS #Docker #TechWorkshop #AIResearch #datatalksclub
