BiomniBench: Evaluating AI Agents in Biology | Yunhao Qu

Join the reading group: https://hannes-stark.com/starkly-spea...

Paper: BiomniBench: Evaluating AI Agents in Biology https://phylo.bio/blog/evaluating-ai-...

Abstract: As AI agents become central to biological research, evaluation must keep pace. We examine why existing benchmarks fall short for biology, share lessons from our experience with BixBench including a verified subset, and introduce BiomniBench, a trace-based evaluation framework that scores agents on their analytical process, not just the final answer. Biomni Lab achieves state-of-the-art performance across both general-purpose and domain-specific agents on both benchmarks.