XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks (2605.30788) [Paper Commentary Series]

[AI Era Compass] Paper Commentary Series
XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks
Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju
https://arxiv.org/abs/2605.30788

⭐️ Authors' Organizations and Abbreviations
Google DeepMind
Indian Institute of Technology Bombay
International Centre for Theoretical Sciences, Tata Institute of Fundamental Research

⭐️ Problem Solved
Existing multilingual benchmarks (such as MMLU and MGSM) have been found to contain translation and annotation errors. This creates a fundamental problem: an observed gap between English and other languages cannot be attributed with confidence to either a real difference in model capability or a translation-quality issue. In addition, benchmarks saturate as models improve, so differences can no longer be measured.

To address this, XLGoBench proposes a measurement system that achieves the following at the same time:
- Only about 5 templates per task, on average, need translating, which makes manual verification of the translations realistically feasible.
- Complexity can be scaled continuously, which prevents saturation.
- Answers are scored automatically in JSON format, which removes the need for an LLM judge and keeps scoring objective.

A large-scale experiment with approximately 140,000 prompts across 7 languages, 5 tasks, and 4 models detected cross-lingual gaps in all frontier models.

⭐️ Key Points Explanation
1. Major Findings: Cross-lingual gaps were confirmed in all four frontier models. The gap was largest in the intermediate complexity range, while all languages scored near-perfectly at low complexity. On the graph shortest-path task with GLM-5.1, a reversal was observed in which Chinese surpassed English. The size of the cross-lingual gap varied by task; tournament ranking was the task where gaps appeared most consistently across all models.
The 140,000-prompt experiment quantitatively demonstrated the non-uniformity of multilingual LLMs for the first time.

2. Methodology: XLGoBench is an LLM evaluation benchmark consisting of five types of algorithmic puzzles, each generated from an average of five templates per task. Difficulty is controlled through complexity scaling, the accuracy curve is fitted using a gamma distribution (R² of 0.97), and the cross-lingual gap is quantified at the point of maximum deviation. Translation quality is verified with manual checks and back-translation, and scoring is objective because it does not rely on an LLM judge. Suggested improvements include reducing evaluation costs and measuring correlation with real-world tasks.

3. Research Limitations: There are two main limitations. First, it is unclear how well the cross-lingual gap detected by algorithmic puzzles reflects real-world tasks, including those involving creativity and cultural background. Second, evaluation costs are high, running to several thousand dollars for large-scale models. Effective countermeasures include studying score correlations with real-world tasks and narrowing evaluation to the complexity ranges where gaps occur. Adding experiments that control for tokenization and publishing detailed translation-quality verification results would also improve reliability.

4. Related Research: Existing multilingual benchmarks such as MMLU and MGSM have been criticized for containing translation and annotation errors, and this work is designed to avoid those limitations. Compared with the earlier multilingual reasoning gyms of Dobler et al. (2026) and the synthetic task study of Kim et al. (2025), the algorithmic puzzles in this study demand stronger reasoning ability. They can detect cross-lingual gaps even in frontier models, and the work is positioned as a pioneering study for measuring multilingual proficiency differences without relying on translation-dependent benchmarks.

5.
Future Impact: This research is expected to be widely used as a foundation for multilingual fairness assessment. It is also expected to serve as a diagnostic tool for early detection of proficiency declines in specific languages during multilingual quality audits of frontier models. Furthermore, it is likely to prompt additional experiments to identify whether the cross-lingual gap stems from the distribution of training data, English-centric reasoning, or tokenization. The complexity-scaling design can be applied to the evaluation of other multilingual tasks and offers guidance for research aimed at reducing cross-lingual performance gaps.
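
As a rough illustration of the methodology summarized in point 2 (template-generated puzzles, complexity scaling, JSON scoring, and a gap measured at the point of maximum deviation), here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper's code: the template wording, the choice of node count as the complexity knob, the `{"length": ...}` answer schema, and all function names are hypothetical. The sketch also skips the gamma-distribution fit and compares raw per-complexity accuracies directly, which is a simplification.

```python
import json
import random
from collections import deque

# Hypothetical templates: only these strings would need translating and
# manual checking, not the individual puzzle instances.
TEMPLATES = {
    "en": "Graph edges: {edges}. Find the length of the shortest path from "
          'node {src} to node {dst}. Answer as JSON: {{"length": <int>}}',
    "de": "Kanten des Graphen: {edges}. Bestimme die Laenge des kuerzesten "
          'Pfades von Knoten {src} zu Knoten {dst}. Antworte als JSON: {{"length": <int>}}',
}


def make_instance(n_nodes, rng):
    """Build a random connected undirected graph.

    Assumption: complexity is scaled by the number of nodes.
    """
    # A random spanning tree guarantees that every node is reachable.
    edges = [(i, rng.randrange(i)) for i in range(1, n_nodes)]
    # A few extra random edges create alternative routes.
    for _ in range(n_nodes // 2):
        a, b = rng.sample(range(n_nodes), 2)
        edges.append((a, b))
    return edges, 0, n_nodes - 1


def shortest_path_length(edges, src, dst):
    """Ground-truth answer via breadth-first search."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("dst is not reachable from src")


def render_prompt(lang, edges, src, dst):
    """Fill the language-specific template with a language-neutral instance."""
    edge_text = ", ".join(f"{a}-{b}" for a, b in edges)
    return TEMPLATES[lang].format(edges=edge_text, src=src, dst=dst)


def score_response(response_text, expected):
    """Exact-match scoring on parsed JSON; no LLM judge involved."""
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    value = parsed.get("length")
    return type(value) is int and value == expected


def max_gap(acc_ref, acc_other):
    """Return (complexity, gap) where acc_ref exceeds acc_other the most.

    Both arguments map complexity level -> accuracy in [0, 1].
    """
    shared = sorted(set(acc_ref) & set(acc_other))
    best = max(shared, key=lambda c: acc_ref[c] - acc_other[c])
    return best, acc_ref[best] - acc_other[best]
```

With made-up accuracies such as `{4: 1.0, 8: 0.9, 16: 0.5, 32: 0.1}` for the reference language and `{4: 1.0, 8: 0.6, 16: 0.3, 32: 0.05}` for another language (illustrative numbers, not results from the paper), `max_gap` selects complexity 8 with a gap of about 0.3. This mirrors the pattern described above, where scores agree at low complexity and diverge in the intermediate range.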