The Dark Arts of ML Benchmarking - Yonatan Alexander

Summary

Most ML benchmarks are quietly broken. Leaderboard gaming, data leaks, logging errors, and poorly designed execution functions mean teams spend more time debugging than learning. The best case is rare: implement once, run once, analyze without pain. This talk shares hard-won wisdom practitioners rarely document: designing experiments backwards from the questions you want to answer, building robust single-execution functions that are cacheable, stateless, and failure-aware, saving raw responses, and versioning everything at the right level of rigor.

We will also demo xetrack, an open-source Python experiment tracking library built for practitioners who want lightweight logging without vendor lock-in. xetrack ships with a Claude skill that acts as a built-in methodology guide, helping AI agents design experiments correctly, avoid common pitfalls, and work methodically from the start. If you have ever stared at a 3 AM benchmark failure, wondering what went wrong, this talk is for you.

About Yonatan Alexander

Yonatan Alexander builds AI systems that ship fast, work at scale, and actually solve problems. As Head of AI at Lasso Security, he leads teams building production ML systems for enterprise security. He invented a patent-pending LLM inference architecture achieving 570X cost reduction and pioneered serverless machine learning before it became an industry standard.

Yonatan is the creator of xetrack, an open-source experiment tracking library, and a technical advisor to Vaex (8.5K+ GitHub stars), helping shape how Python handles billion-row datasets on standard hardware. His "Beyond Pandas" article has been read 18.8K times by practitioners navigating real-world data challenges. He has delivered technical talks at PyData and AIGrunn on the messy realities of production AI, from LLM hallucinations to the gap between demos and deployed systems. His "Branches Are All You Need" framework influences how teams approach ML versioning.
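
As an illustration of the summary's "single-execution function" idea (cacheable, stateless, failure-aware, saving raw responses), here is a minimal sketch using only the Python standard library. It is not taken from the talk and does not use the xetrack API. The names `run_once`, `fake_model`, and the JSON-file cache layout are hypothetical, and the sketch assumes the model's raw response is JSON-serializable.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_once(model_fn, params: dict, cache_dir: Path) -> dict:
    """Run one benchmark case and return a record instead of raising.

    Hypothetical helper: the cache key depends only on the inputs, so the
    function keeps no hidden state between calls.
    """
    key = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()
    path = cache_dir / f"{key}.json"

    # Cacheable: a previously stored success is returned without re-running.
    if path.exists():
        return {**json.loads(path.read_text()), "cached": True}

    try:
        raw = model_fn(**params)
        record = {"params": params, "raw_response": raw, "error": None}
    except Exception as exc:
        # Failure-aware: the error becomes data rather than a crashed run.
        record = {"params": params, "raw_response": None, "error": repr(exc)}

    # Only successes are cached, so failed cases can be retried later.
    if record["error"] is None:
        path.write_text(json.dumps(record))
    return {**record, "cached": False}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this example)."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


cache = Path(tempfile.mkdtemp())
first = run_once(fake_model, {"prompt": "hello"}, cache)
second = run_once(fake_model, {"prompt": "hello"}, cache)
failed = run_once(fake_model, {"prompt": ""}, cache)
```

In this sketch the second call with identical parameters is served from the cache, while the failing case returns a record with the error captured and is left uncached. Storing the raw response alongside the parameters means later analysis can be redone without re-running the model.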