AI | February 10, 2026 | 3 min read
Beyond Translation: What High-Quality Multilingual Agent Benchmarks Actually Require
LILT Team

The paradigm of Large Language Models (LLMs) has shifted; models like ChatGPT and Gemini are no longer just chatbot wrappers, but agents capable of solving complex, multi-step problems like deep market research or e-commerce assistance. Models must strategically call external tools—like web search or payment gateways—at exactly the right moment. However, there is a significant blind spot in how we measure their success in non-English contexts.
The Functional and Cultural Alignment Gap
The core technical challenge is that most agentic benchmarks remain strictly English-centric, resulting in an unreliable measure of user satisfaction globally; we are essentially building global tools while measuring them with a "local yardstick." Current multilingual agent benchmarks suffer from technical and cultural artifacts that render tasks unsolvable. Consequently, these benchmarks demand more than mere translation from the English version—they require intricate functional and cultural alignment.
Our Analysis
This blog details our deep audit of the Arabic, German, and Korean localized versions of the MAPS-GAIA[1,2] dataset. To quantify the performance gap, we evaluated GPT-5.2[3] using the Open Deep Research[4] framework on the validation set. We identified critical pitfalls in the conventional data curation pipeline—machine translation (MT) with untargeted human review—and introduced a framework to shift curation from literal translation to functional alignment. Navigating this complexity and bridging this gap is what LILT excels in.
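For readers who want a sense of the setup, the sketch below shows the shape of such an evaluation loop, assuming a JSONL copy of a localized validation split and a hypothetical `run_agent` callable wrapping the Open Deep Research agent; the field names and exact-match scorer are illustrative, not the benchmark's official harness.

```python
import json

def exact_match(prediction: str, gold: str) -> bool:
    """GAIA-style scoring: normalized exact match against the gold answer."""
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(dataset_path: str, run_agent) -> float:
    """Run an agent over a localized validation split and report accuracy.

    `run_agent` is a hypothetical callable that takes a task prompt and
    returns the agent's final answer string.
    """
    correct, total = 0, 0
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)  # e.g. {"question": ..., "gold_answer": ...}
            prediction = run_agent(task["question"])
            correct += exact_match(prediction, task["gold_answer"])
            total += 1
    return correct / total if total else 0.0

# e.g. evaluate("maps_gaia_ko_validation.jsonl", run_agent=my_deep_research_agent)
```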
Challenges with Data Localization using Machine Translation
MT introduces probabilistic anomalies that compromise task integrity:
- Functional Misalignment: Minor translation errors often alter the semantics of a task, rendering the task effectively unsolvable.
- Over-translation: Some tasks require the gold answer to remain untranslated, e.g., when the task is to extract a specific string verbatim from an external source.
- Language Contamination: The model may switch to an unrelated language mid-sentence or generate output in the wrong script.
- Instructional Leakage: MT sometimes inadvertently includes the "gold answer" within the prompt, artificially inflating model scores[5].
- Translationese: Literal substitutions often result in syntactic interference (applying English word order to Korean) and a lack of idiomatic equivalents[6,7].
- Register Mismatch: Failing to adapt grammatical honorifics—such as using the formal Sie in German for human-AI interaction—makes the text feel socially misplaced.
These issues degrade model performance: structural failures corrupt the technical logic of the task, while linguistic artifacts make the model's reasoning less efficient.
Figure 1: Example of hallucinations in a translated query
Challenges with Data Localization using the Human Review Process
Standard linguistic validation is insufficient for agentic tasks; reviewers must actively validate the technical logic and functional viability of every entry. Yet the human review process itself is prone to the following failure points:
- Answer Translation Misalignment: Under-translation, where the gold answer is left unlocalized, can lead to false negatives, such as rejecting the valid German chess notation Td5 because the answer key expected the English Rd5 (see the sketch after this list). Conversely, over-translation can sever links to source evidence, such as translating the IOC country code "CUB" into the Arabic word for "lion cub" (shibl), violating the task's constraint that the code remain in English.
- Cultural Misalignment: Localization fails when tasks remain anchored to the source region, such as referencing U.S.-specific transport ordinances, American holiday cycles in the MENA region, or imperial units (miles/ounces) in a Korean context. Figure 2 shows regional and cultural misalignment in a translated query.
- Domain Misalignment: Inexpert reviewers often misapply technical terminology, either by incorrectly transliterating terms that have established native equivalents or by translating “industry-standard” English terms that professionals would naturally retain.
- Reviewer and Selection Bias: In selective audits, LLM filters frequently favor linguistic patterns from their own training, overlooking English-centric reasoning or translationese artifacts. At the same time, human reviewers often mistake grammatical fluency for technical accuracy, approving "fluent hallucinations" that are syntactically perfect but functionally broken.
These discrepancies force the agent to navigate foreign logic or incorrect terminology rather than solve the core task.
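To make the chess-notation case above concrete, the sketch below normalizes German algebraic piece letters before comparing against an English answer key, so a correctly localized answer such as Td5 is not rejected. The mapping and matching logic are a simplified illustration, not the benchmark's official scorer.

```python
# German algebraic notation uses different piece letters than English:
# K=König(K), D=Dame(Q), T=Turm(R), L=Läufer(B), S=Springer(N).
GERMAN_TO_ENGLISH_PIECES = {"K": "K", "D": "Q", "T": "R", "L": "B", "S": "N"}

def normalize_german_move(move: str) -> str:
    """Map a German algebraic move (e.g. 'Td5') to English ('Rd5')."""
    if move and move[0] in GERMAN_TO_ENGLISH_PIECES:
        return GERMAN_TO_ENGLISH_PIECES[move[0]] + move[1:]
    return move

def answers_match(localized_answer: str, english_key: str) -> bool:
    """Accept a localized gold answer when it maps onto the English key."""
    return normalize_german_move(localized_answer.strip()) == english_key.strip()

assert answers_match("Td5", "Rd5")  # valid German notation should not be rejected
```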
Figure 2: Example of regional and cultural misalignment in a translated query with applied corrections
Solution: LILT’s Standardized Audit Framework
Maintaining benchmark integrity requires a review process that allows for flagging or redesigning culturally and technically incompatible entries. To address these pitfalls, we implemented targeted interventions during both the post-translation and review stages.
First, we established a suite of automated quality checks to provide immediate technical validation:
- Mechanical Validation: Automatic scripts performed language identification to detect contamination and term-transfer matching to ensure technical identifiers and numerical data remain intact (see the sketch after this list).
- Granular LLM Judges: Specialized, narrow-scope LLM evaluators were used to detect specific artifacts. Limiting each judge to a single issue type minimized self-preference and improved detection accuracy[8,9].
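As a rough illustration of the mechanical checks, the sketch below flags cross-script contamination via Unicode script ratios and lists identifiers or numbers from the English source that never made it into the localized task. The ranges, threshold, and regular expression are illustrative assumptions; a production pipeline would pair them with a proper language-identification model.

```python
import re

# Rough Unicode ranges used for script checks (illustrative subset only).
SCRIPT_RANGES = {
    "arabic": (0x0600, 0x06FF),
    "hangul": (0xAC00, 0xD7A3),
    "latin":  (0x0041, 0x024F),
}

def script_ratio(text: str, script: str) -> float:
    """Fraction of alphabetic characters that fall in the expected script."""
    lo, hi = SCRIPT_RANGES[script]
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 1.0
    return sum(lo <= ord(c) <= hi for c in letters) / len(letters)

def untransferred_terms(source: str, target: str) -> list:
    """Identifiers and numbers in the English source that are missing from
    the localized task (e.g. 'CUB', 'IOC', '1995')."""
    candidates = dict.fromkeys(re.findall(r"\b(?:[A-Z][A-Z0-9]+|\d[\d.,]*)\b", source))
    return [t for t in candidates if t not in target]

def mechanical_flags(source: str, target: str, script: str) -> dict:
    """Flag an entry when the expected script is under-represented or a
    technical identifier was lost in translation."""
    return {
        "language_contamination": script_ratio(target, script) < 0.8,
        "missing_terms": untransferred_terms(source, target),
    }
```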
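A granular judge can be as simple as one narrowly scoped prompt per issue type. The sketch below assumes a hypothetical `call_llm` wrapper around whatever chat-completion API is in use; the prompts and JSON schema are illustrative, not the exact judges from our audit.

```python
# Each judge checks exactly one issue type, which limits self-preference
# and makes disagreements easy to trace back to a single artifact.
JUDGE_PROMPTS = {
    "instructional_leakage": (
        "You check ONE thing only: does the localized task text reveal the "
        "gold answer?\nGold answer: {gold}\nTask text: {task}\n"
        'Reply with JSON: {{"leak": true|false, "evidence": "..."}}'
    ),
    "register_mismatch": (
        "You check ONE thing only: is the form of address appropriate for a "
        "user instructing an AI assistant in {language}?\nTask text: {task}\n"
        'Reply with JSON: {{"mismatch": true|false, "evidence": "..."}}'
    ),
}

def run_judge(issue: str, call_llm, **fields) -> str:
    """Run one narrow-scope judge; `call_llm` is a hypothetical LLM wrapper."""
    return call_llm(JUDGE_PROMPTS[issue].format(**fields))
```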
For the human review process, we shifted from general linguistic editing to a specialized verification framework:
- Three-Fold Quality Control: For every potential issue, we integrated a comprehensive review guideline, a mandatory validation checkbox, and the results of the automated checks into a single interface for the reviewer (see the sketch after this list).
- Specialized Reviewer Training: We employed reviewers with a background in LLM properties and agentic operations, ensuring they can identify logic-breaking errors that general linguists might miss.
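One way to realize the three-fold control is to bundle the guideline excerpt, the mandatory checkbox, and the automated-check output into a single record per potential issue, so the reviewer sees everything in one place. The fields below are an illustrative sketch, not LILT's internal schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewItem:
    """Everything a reviewer sees for one potential issue on one entry."""
    entry_id: str
    issue_type: str                 # e.g. "over_translation", "register_mismatch"
    guideline: str                  # relevant excerpt from the review guideline
    automated_flags: dict           # output of mechanical checks and LLM judges
    reviewer_confirmed: Optional[bool] = None   # the mandatory validation checkbox
    reviewer_note: str = ""
```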
Results and Impact on Benchmark Reliability
As the performance delta below shows, model accuracy shifts markedly once structural and cultural issues are strictly controlled. We found that manually fixing these errors raised success rates substantially, demonstrating that existing benchmarks often overstate the gap between English and other languages.
Figure 3: Multilingual agent performance before and after correction
Conclusion
Our audit reveals a significant cross-lingual performance gap in frontier AI agentic reasoning. Valid benchmarking requires addressing the linguistic and procedural pitfalls inherent in localization; without functional alignment, scores reflect translation artifacts rather than actual reasoning capacity.
To ensure global parity and model reliability, the industry must move beyond English-centric biases. We propose a shift toward functional curation: moving past raw MT to prioritize technical logic, cultural viability, and the active redesign of incompatible tasks.
To explore our full methodology and detailed findings, please download the complete research article.
About LILT
LILT is a multilingual applied research lab, partnering with researchers to design custom evaluations, benchmarks, and RL environments that measure real model behavior in business workflows. We integrate expert human judgment, research-grade delivery, and forward-deployed engineering to define, operationalize, and defend what “good” means across domains and 100+ languages. Contact us today.
References:
[1] Hofman et al., MAPS: A Multilingual Benchmark for Global Agent Performance and Security, arXiv:2505.15935
[2] Mialon et al., GAIA: A Benchmark for General AI Assistants, ICLR 2024
[3] OpenAI, Update to GPT-5 System Card: GPT-5.2
[4] Roucher et al., Open-source DeepResearch – Freeing our search agents
[5] Huang et al., A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, TOIS Vol. 43 No. 2
[6] Koppel and Ordan, Translationese and Its Dialects, ACL 2011
[7] Li et al., Lost in Literalism: How Supervised Training Shapes Translationese in LLMs, ACL 2025
[8] Saha et al., Branch-Solve-Merge Improves Large Language Model Evaluation and Generation, NAACL 2024
[9] Feng et al., M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation, ACL 2025