
April 29, 2026 | 5 min read

Beyond Translation: Announcing GAIA-v2-LILT, the Multilingual Agent Benchmark That Actually Measures What It Claims To

GAIA-v2-LILT is a re-audited multilingual extension of the GAIA benchmark covering Arabic, German, Hindi, Korean, and Portuguese. It reveals that roughly 20 percentage points of measured multilingual performance gaps stem from benchmark translation artifacts rather than genuine AI model capability limits.

LILT Team


TL;DR

We're releasing GAIA-v2-LILT, a re-audited multilingual agent benchmark across Arabic, German, Hindi, Korean, and Portuguese (BR), with 165 tasks per language. We went beyond machine translation to add functional alignment, cultural alignment, and difficulty calibration audits. The result: measured agent performance jumped an average of +20.7 percentage points (up to +28.3 for Korean), showing that translation artifacts, not model capability, account for a large share of apparent multilingual performance gaps.

Intro

Machine translation has reached a point where multilingual benchmark construction appears trivial: translate, lightly post-edit, and evaluate. But this assumption breaks in a specific and measurable way when the benchmark requires agents to reason, use tools, and produce exact-match outputs grounded in real-world context.

Why Machine Translation Alone Fails Agentic Benchmarks

Most agentic benchmarks remain English-centric. When teams do build multilingual versions, they typically machine-translate an established English dataset and apply human post-editing, often to improve fluency and adequacy. For tasks such as document translation or reading comprehension, that process can often be sufficient when quality control is comprehensive and the evaluation target is primarily semantic accuracy. For agentic tasks, where a model must plan, call external tools, and return an answer that passes an exact-match evaluator, it is frequently not enough. These settings often require additional alignment at the functional level, including tool behavior, locale conventions, formatting expectations, and evaluator compatibility.

Translation quality and task integrity are distinct. A translation can be fluent, grammatically correct, and semantically faithful, yet functionally flawed if the expected answer changes, cultural context is lost, or difficulty shifts because target-language information is unavailable.

We call the gap between “this translation reads well” and “this task is valid and solvable” the functional and cultural alignment gap. Closing it was the design goal behind GAIA-v2-LILT.

What Is GAIA-v2-LILT?

GAIA-v2-LILT is a re-audited multilingual extension of GAIA: a benchmark for general AI assistants requiring multi-step tool use. It covers five non-English languages: Arabic, German, Hindi, Korean, and Portuguese (Brazil). It contains 165 query-answer pairs per language (validation set), built on top of the machine-translated MAPS-GAIA baseline.

The dataset is available here: https://huggingface.co/datasets/Fujitsu-FRE/MAPS/viewer/GAIA-v2-LILT
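
For readers who want to inspect the data directly, here is a minimal sketch of loading it with the Hugging Face datasets library. The config name "GAIA-v2-LILT" is inferred from the viewer URL above, and the split and field names are assumptions rather than confirmed details of the release.

```python
# Minimal sketch: loading GAIA-v2-LILT from the Hugging Face Hub.
# Config name is inferred from the dataset viewer URL; the split name
# ("validation") is an assumption and may differ in the actual release.
from datasets import load_dataset

dataset = load_dataset("Fujitsu-FRE/MAPS", "GAIA-v2-LILT", split="validation")

# Each entry should carry a translated query, its gold answer, and a language
# tag; the exact field names are illustrative, not confirmed.
for example in dataset.select(range(3)):
    print(example)
```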

How We Built GAIA-v2-LILT: The Three-Stage Review Workflow

Standard MT post-editing focuses on fluency and adequacy. Our workflow adds three further layers of validation: functional alignment, cultural alignment, and difficulty calibration. It also structures the review process to actively resist the two failure modes that plague quality control at scale:

  • LLM self-preference: model-based judges that favor outputs resembling their own training distribution, overlooking translationese and English-centric reasoning.
  • Human fluency bias: reviewers who approve entries because they read naturally, without checking whether the task is technically executable.

Stage 1: Deterministic Filtering

Before any model or human judgment is applied, fast rule-based scripts catch high-impact, objective defects:

  • Language identification: flags entries where the translation is not in the target language.
  • Answer leakage detection: string-matching to catch cases where the gold answer appears in the translated query.
  • Placeholder recall: verifies that fixed-term categories (numbers, URLs, proper nouns, country codes) are preserved correctly.

This stage is entirely deterministic. It has no self-preference problem and no fluency bias. Its job is to eliminate the structural defects that are straightforward to define.
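
To make the idea concrete, here is a minimal sketch of what such deterministic checks might look like. The language-detection dependency, field names, and placeholder patterns are assumptions for illustration, not the audit's actual implementation.

```python
import re
from langdetect import detect  # assumed language-ID dependency; any detector works

# Fixed-term categories whose surface form must survive translation unchanged.
PLACEHOLDER_PATTERNS = [
    r"https?://\S+",          # URLs
    r"\b\d+(?:[.,]\d+)*\b",   # numbers
    r"\b[A-Z]{2,3}\b",        # country/currency-style codes
]

def deterministic_flags(source_query, translated_query, gold_answer, target_lang):
    """Return a list of objective defects for one query-answer pair (sketch only)."""
    flags = []

    # 1. Language identification: the translation must be in the target language.
    if detect(translated_query) != target_lang:
        flags.append("wrong_language")

    # 2. Answer leakage: the gold answer must not appear verbatim in the query.
    if gold_answer.strip() and gold_answer.lower() in translated_query.lower():
        flags.append("answer_leakage")

    # 3. Placeholder recall: fixed terms from the source must be preserved.
    for pattern in PLACEHOLDER_PATTERNS:
        for token in re.findall(pattern, source_query):
            if token not in translated_query:
                flags.append(f"missing_placeholder:{token}")

    return flags
```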

Stage 2: Granular LLM Judges

Rather than asking a single model to score each entry holistically, we used separate judges evaluating one axis at a time: fluency, adequacy, cultural appropriateness, and query-answer compatibility. Each judge returns a binary label.

Limiting each judge to a single, narrow criterion significantly reduces self-preference effects. A holistic “is this a good translation?” prompt invites the model to reward its own stylistic patterns. A binary “does the expected answer match what a correct agent would return given this query?” prompt forces it to evaluate a concrete, checkable fact.
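
A minimal sketch of how such single-axis binary judges could be wired together is below. The call_llm function stands in for whatever model client the reader uses, and the prompts and entry schema are illustrative, not the ones used in our audit.

```python
# Sketch: one narrow, binary prompt per quality axis, combined afterwards.
# `call_llm` is a placeholder for any chat-completion client that returns text.
# `entry` is assumed to be a dict with keys "lang", "query", "source", "answer"
# (hypothetical schema for illustration).

JUDGE_PROMPTS = {
    "fluency": "Is the following text fluent, natural {lang}? Answer YES or NO.\n\n{query}",
    "adequacy": ("Does the {lang} text preserve the meaning of the English source? "
                 "Answer YES or NO.\n\nSource: {source}\nTranslation: {query}"),
    "cultural": ("Is this task culturally appropriate and natural for a {lang}-speaking "
                 "user? Answer YES or NO.\n\n{query}"),
    "qa_compat": ("Given this query, would a correct agent's answer exactly match the "
                  "expected answer? Answer YES or NO.\n\nQuery: {query}\nExpected answer: {answer}"),
}

def judge_entry(call_llm, entry):
    """Run each axis as its own binary judgment and return a dict of pass/fail labels."""
    labels = {}
    for axis, template in JUDGE_PROMPTS.items():
        prompt = template.format(**entry)
        labels[axis] = call_llm(prompt).strip().upper().startswith("YES")
    return labels
```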

Stage 3: Specialized Human Audit

The final stage is bilingual human review, but structured quite differently from a standard post-editing pass.

Reviewers were trained specifically on agentic task mechanics, not just linguistic correctness, with 1-on-1 briefings covering each issue category. During review, the output of Stages 1 and 2 was surfaced inline alongside the task text, directing attention to the highest-risk entries first. Mandatory checkboxes for each issue category prevented reviewers from approving entries without explicitly considering functional and cultural dimensions.

Each query-answer pair was reviewed by one human expert and meta-reviewed by another human expert and a machine learning researcher.

What Errors Did Reviewers Actually Catch?

Functional alignment: under-translation and over-translation

The trickiest class of errors. Some answers must be localized for the evaluation to be valid: a German chess task that leaves the answer key in English notation (Rd5) will reject the correct German answer (Td5). Other answers must stay in English: a task asking an agent to extract a specific string from an English document cannot have that string translated, because the agent’s output won’t match.

Distinguishing these cases requires reviewers to understand both the cultural convention and the technical logic of the question type simultaneously.
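
To make the stakes concrete, here is a tiny sketch of why an un-localized answer key breaks exact-match scoring. The chess example mirrors the one above, and the evaluator is deliberately simplified.

```python
def exact_match(prediction, gold):
    """Simplified exact-match evaluator of the kind agentic benchmarks rely on."""
    return prediction.strip().lower() == gold.strip().lower()

# A German agent correctly answers in German chess notation ("T" = Turm = rook)...
prediction = "Td5"

# ...but if the answer key was left in English notation, the correct answer fails.
print(exact_match(prediction, "Rd5"))  # False: the benchmark penalizes a correct agent
print(exact_match(prediction, "Td5"))  # True: a localized key restores task validity
```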

Cultural alignment

Tasks that remain anchored to the source region create a different kind of failure: the task is solvable in principle, but the agent must reason through a foreign cultural context instead of focusing on the core problem. U.S.-specific transport laws, American holiday calendars, imperial units, and US/UK geographic assumptions all fall into this category. Reviewers replaced these with locally equivalent context — routes, units, institutional frameworks — rather than just translating the surface text.

Translationese and hallucination

Literal MT produces unnatural target-language text: incorrect honorifics, wrong register for human-AI interaction, word-for-word idiom transfers, English syntax imposed on languages with different word order. LLM-based translators add a further failure mode: generating output in the wrong language entirely, or including the gold answer in the prompt as part of a “helpful” translation note.

Difficulty calibration

A well-translated task can still be harder or easier than the English original. If the relevant information simply doesn't exist in the target-language web, or if a localized currency introduces arithmetic complexity not present in the English version, the difficulty has shifted without any surface indication. Reviewers manually verified the solving path for each entry, running local web searches themselves, to confirm the task was executable at the intended difficulty level.

How Much Did the Translations Need to Change?

The edit rates below show that this was not light-touch cleanup: Hindi required revision to every single task, and word-level edit rates across languages ranged from 25% to 55%, indicating changes well beyond surface fluency fixes.

The table below shows character-level and word-level edit rates for GAIA-v2-LILT against the MAPS baseline.

Which Issue Types Actually Flip Agent Scores?

The clearest signal of which issue categories matter most comes from filtering to the examples whose evaluation outcome changed after correction — i.e., the tasks where the audit had a direct effect on whether the agent was scored as correct or incorrect.

Among these result-flipping examples, the overwhelming majority had at least one critical flag in the functional alignment or cultural alignment categories. Translationese flags, while common overall, appeared much less frequently in result-flipping cases, consistent with the intuition that linguistic artifacts slow the agent down but rarely cause outright failure, whereas functional and cultural defects cause structural task breakage.
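
A minimal sketch of that filtering step is below, assuming each audited entry records its pre- and post-correction evaluation outcome along with the reviewer flags it received; the field names are hypothetical.

```python
from collections import Counter

def flag_frequencies_among_flips(entries):
    """Count reviewer flag categories among entries whose score changed after correction.

    Each entry is assumed to be a dict like:
      {"correct_before": bool, "correct_after": bool, "flags": ["functional", ...]}
    (hypothetical schema for illustration).
    """
    flips = [e for e in entries if e["correct_before"] != e["correct_after"]]
    counts = Counter(flag for e in flips for flag in e["flags"])
    return len(flips), counts

# Usage: n_flips, counts = flag_frequencies_among_flips(audited_entries)
# print(counts.most_common())  # expect functional/cultural flags to dominate
```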

How Benchmark Quality Changes Measured Agent Performance

We evaluated GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 using the Open Deep Research framework on the validation set, comparing scores on the MAPS baseline (MT with standard review) against GAIA-v2-LILT (our audited version).

The average improvement across languages is approximately +20.7 percentage points, with Korean showing the largest gain, consistent with a higher rate of structure-sensitive and culturally anchored defects identified during audit.

These numbers have a specific interpretation: they represent the share of the measured multilingual performance gap that is benchmark-induced measurement error, not a genuine model capability gap. An uncorrected translation pipeline will systematically underestimate how well a model performs in non-English contexts, because the benchmark itself is penalizing the agent for translation artifacts rather than reasoning errors.

Corrections of this magnitude cannot be explained by random run variance. They are a signal that localization quality and evaluation validity are the same problem.

What This Means for Multilingual Benchmark Development

The practical takeaway is that machine translation is a starting point, not a complete pipeline. The workflow above is designed to be reproducible across language pairs and base benchmarks, not specific to GAIA or the five languages we covered here.

We’re releasing GAIA-v2-LILT — including audited tasks, review criteria, and implementation notes — as a public resource for teams working on multilingual evaluation. The dataset is available at the Hugging Face link above, together with the full technical report.

Frequently Asked Questions

What is GAIA-v2-LILT?

GAIA-v2-LILT is a re-audited multilingual extension of the GAIA benchmark, designed to evaluate AI agents on multi-step tool-use tasks across non-English languages. It contains 165 query-answer pairs per language and was built on top of the machine-translated MAPS-GAIA baseline with additional layers of functional alignment, cultural alignment, and difficulty calibration auditing.

What languages does GAIA-v2-LILT cover?

GAIA-v2-LILT covers five non-English languages: Arabic, German, Hindi, Korean, and Portuguese (Brazil).

How is GAIA-v2-LILT different from MAPS-GAIA?

While MAPS-GAIA relies on machine translation with standard post-editing focused on fluency and adequacy, GAIA-v2-LILT adds three additional validation layers: functional alignment, cultural alignment, and difficulty calibration. This multi-stage audit combines deterministic filtering, granular LLM judges, and specialized bilingual human review trained on agentic task mechanics.

Why does translation quality matter for AI agent benchmarks?

For agentic tasks where an agent must plan, call external tools, and return an exact-match answer, a fluent translation is not enough. Translation artifacts can break task validity by shifting the expected answer, losing cultural context, or changing task difficulty. These issues cause benchmarks to penalize agents for translation flaws rather than reasoning errors.

How much does benchmark quality affect measured AI performance?

Across five languages, GAIA-v2-LILT corrections revealed an average improvement of +20.7 percentage points in measured agent performance compared to the MT baseline. Korean showed the largest gain at +28.3 points. These gains represent benchmark-induced measurement error, not genuine model capability gaps.

Where can I access the GAIA-v2-LILT dataset?

The GAIA-v2-LILT dataset, including audited tasks, review criteria, and implementation notes, is available as a public resource. The full technical report is also available for teams working on multilingual evaluation.

Contact Us

Ready to build a benchmark that measures what your multilingual AI actually does?

Book a Meeting
