AI Data Services
March 13, 2026 · 5 min read
The Origin of Multilingual Performance Gap: A Deep Dive into Multi-Turn Conversational Agents
The research identifies why LLMs struggle in non-English languages using the MultiChallenge benchmark. While data artifacts and language nuances play a role, model limitations like tokenizer inefficiencies and English-centric reasoning account for 70-80% of failures. The post advocates for targeted post-training as a cost-effective solution to bridge the gap.
LILT Team
Large language model (LLM) performance consistently degrades when transitioning from English to other languages [1]. While often attributed to a simple scarcity of training data, this blog post moves beyond general speculation to analyze the systemic root causes of this gap.
Analyzing Multi-Turn Conversational Agents
We use multi-turn conversations between users and LLM agents for daily tasks—such as search or content creation—as our research lens. Unlike heavy, specialized tasks like deep research or closed-loop coding, these natural dialogues directly reveal how models handle user dynamics and linguistic constraints.
The MultiChallenge Dataset: Benchmarking LLM Reasoning
The MultiChallenge dataset [2] collects such conversations across a range of topics to evaluate long-horizon reasoning. Each dialogue ends with a rubric question that verifies whether the agent successfully managed the complex context through to its final response. These tests are organized into the four challenge axes shown in Table 1.
Table 1 Challenges for LLM agents in multi-turn conversations (view examples [2]).
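To make the setup concrete, one evaluation item can be thought of as a dialogue paired with a rubric that is checked against the model's final response. The sketch below is a hypothetical rendering of that shape — the field names and the programmatic check are our own illustration, not the actual MultiChallenge schema:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MultiTurnItem:
    """Hypothetical shape of one evaluation item; the real
    MultiChallenge schema may differ."""
    turns: List[str]              # alternating user/agent messages
    rubric_question: str          # human-readable pass criterion
    check: Callable[[str], bool]  # programmatic verdict on the final response

item = MultiTurnItem(
    turns=[
        "User: From now on, answer in at most 15 words.",
        "Agent: Understood.",
    ],
    rubric_question="Is the final response at most 15 words long?",
    check=lambda reply: len(reply.split()) <= 15,
)

final_response = "Yes, the capital of France is Paris."
passed = item.check(final_response)  # 7 words, so the rubric passes
```

The key property is that the rubric targets only the final turn, so a pass requires the model to have carried every earlier constraint through the whole dialogue.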
Measuring the Multilingual Performance Drop
We translated the original English data into Arabic, German, and Korean, and tested against three frontier models (Figure 1). While overall accuracy varies across models, the performance drop in target languages compared to English is clear.
Figure 1 Performance difference across languages.
To pinpoint why this drop happens, we group potential causes of the failures into three categories: data, language, and model.
Primary Factor 1: Data Artifacts and Translation Issues
First, the evaluation data itself can be flawed: translation occasionally makes logical tasks ill-defined. This is a surface-level issue that can be resolved through human review and refinement.
Incompatible Constraints in Non-English Tasks
In the Instruction Retention tasks, many failures stem from English-centric constraints that do not map 1:1 across languages. Rather than a reasoning failure, the model struggles with linguistic incompatibility. For example:
- “limit to 15 words” is calibrated for English. Enforcing this in Spanish makes the task artificially harder, since the language requires more words to convey the same meaning.
- “only ‘Yes’ or ‘No’” forces unnatural phrasing in Chinese, which lacks standalone equivalents and instead affirms by echoing the verb (e.g., answering “Do you have it?” with 有 / “have”). The model must either break natural syntax or fail the exact-match test.
Fixing this requires localizing the constraint's intent and replacing it with a natural equivalent of similar difficulty in the target language.
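One way to localize a length constraint's intent is to rescale the limit by how verbose the target language is relative to English. The sketch below uses hypothetical expansion factors — real values would need to be estimated from parallel corpora for each language pair:

```python
# Hypothetical verbosity factors (average words needed relative to English);
# real values should be estimated from parallel corpora per language pair.
EXPANSION = {"en": 1.0, "es": 1.2, "de": 0.95, "ko": 0.8}

def localize_word_limit(limit_en: int, lang: str) -> int:
    """Rescale an English word limit so the constraint stays comparably hard."""
    return round(limit_en * EXPANSION.get(lang, 1.0))

# "limit to 15 words" becomes a roughly equivalent 18-word limit in Spanish.
spanish_limit = localize_word_limit(15, "es")
```

Constraints with no natural numeric rescaling (like the Yes/No example) instead need a hand-written equivalent of similar difficulty.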
Inconsistent Entity References and Coreference Chains
The Inference Memory challenge requires tracking characters across long contexts. Translation often breaks these coreference chains, scrambling entity labels and causing failures [3, 4]. This occurs in two ways:
- Entity Splitting: A single entity might be translated differently across a text depending on the short-term context, e.g. “Mother” becoming “어머니” (formal) or “엄마” (casual) in Korean.
- Entity Merging: Distinct entities collapse into a single word, e.g., “nephew” and “niece” both becoming “조카” (gender-neutral) in Korean.
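Both failure modes can be caught with a lightweight consistency check over aligned entity mentions. The sketch below assumes an upstream alignment step has already extracted (source entity, target surface form) pairs; it simply flags any source entity rendered with multiple target forms (splitting) and any target form shared by multiple source entities (merging):

```python
from collections import defaultdict

def find_reference_drift(aligned_mentions):
    """Flag entity splitting/merging in (source_entity, target_form) pairs
    extracted from an aligned translation (extraction assumed upstream)."""
    source_to_targets = defaultdict(set)
    target_to_sources = defaultdict(set)
    for source, target in aligned_mentions:
        source_to_targets[source].add(target)
        target_to_sources[target].add(source)
    # One source entity, several target forms -> splitting.
    split = {s: ts for s, ts in source_to_targets.items() if len(ts) > 1}
    # One target form, several source entities -> merging.
    merged = {t: ss for t, ss in target_to_sources.items() if len(ss) > 1}
    return split, merged

mentions = [
    ("Mother", "어머니"), ("Mother", "엄마"),  # split: formal vs. casual Korean
    ("nephew", "조카"), ("niece", "조카"),     # merge: one gender-neutral word
]
split, merged = find_reference_drift(mentions)
```

Flagged entities can then be routed to human reviewers, who decide whether to unify the forms or restructure the sentence to keep referents distinct.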
Anchor-Pointer Mismatch
In Reliable Version Editing, a model must link a Version Anchor (introducing a version) to a Version Pointer (referring back to it). This typically relies on stem matching, e.g., linking “Update 1” to the anchor “the updated version.” If translation maps these to different terms, the model may treat them as separate concepts rather than states of the same object, leading to hallucinated versions or the retrieval of outdated content.
As shown in Figure 2, the first revision is labeled “수정본” (modified version, Anchor 1) and the second “업데이트된” (updated, Anchor 2). When the user later requests “첫 번째 업데이트” (the first update), lexical overlap wrongly ties this pointer to the second anchor instead of the first.
Figure 2 Example of Anchor-Pointer Mismatch. Anchors and pointers are marked with superscripts.
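The failure in Figure 2 boils down to a resolver that trusts lexical overlap over ordinal meaning. The toy sketch below (our own illustration, including the crude suffix handling) reproduces the mismatch: the shared stem "업데이트" ties the pointer to Anchor 2, even though "첫 번째" ("first") means the user wants Anchor 1:

```python
def resolve_pointer(pointer: str, anchors: list) -> int:
    """Naive pointer resolution by lexical stem overlap (illustrative only)."""
    for anchor in reversed(anchors):  # prefer the most recent match
        stem = anchor["label"].rstrip("된")  # crude Korean suffix strip
        if stem in pointer:
            return anchor["id"]
    return anchors[0]["id"]

anchors = [
    {"id": 1, "label": "수정본"},      # Anchor 1: "modified version"
    {"id": 2, "label": "업데이트된"},  # Anchor 2: "updated"
]
# User asks for "첫 번째 업데이트" ("the first update") -- they mean Anchor 1,
# but stem matching on "업데이트" wrongly resolves to Anchor 2.
resolved = resolve_pointer("첫 번째 업데이트", anchors)  # returns 2, not 1
```

A robust resolver would have to weigh the ordinal cue against the lexical one, which is exactly the reasoning step that degrades when translation swaps the anchor vocabulary.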
Regionally Invalid Statements
Models trained on US-centric data embed regional assumptions that often clash with target-culture realities after translation, triggering hallucinations or contradictions. For example:
- “December brings heavy snow” is geographically false in Brazil, where December is mid-summer.
- “School starts in August” is false in Japan, where the academic year begins in April.
Fixing this involves aligning the regional claims with local realities.
Culturally Irrelevant Prompts
If a task is tied to the specific culture of English-speaking countries, the model is likely to default to its US-centric pretraining, effectively adopting a ‘foreign persona’ to answer. This fails to measure true native proficiency; instead, it merely tests the model’s ability to map American concepts into the target language, undermining the benchmark's reliability.
- “Plan a Thanksgiving dinner” is North American-centric and lacks relevance in most target regions.
- “Ensure PG-13 rating compliance” relies on US motion picture standards, which are irrelevant in countries with different classification systems.
To fix this, we need to replace these with local cultural equivalents. Unlike correcting a single claim, this often requires rewriting the entire dialogue history to maintain logical and cultural coherence.
Primary Factor 2: Inherent Language Nuances
Beyond data artifacts, inherent linguistic properties can shift task difficulty or reasoning paths. These structural nuances cannot be edited away; forcing a literal translation only produces unnatural, robotic dialogue.
Pronoun Dropping in Pro-drop Languages
Pro-drop languages (common among the East Asian, Slavic, and Uralic families) often omit established subjects or objects rather than replacing them with pronouns. Models lacking target-language robustness frequently forget these implicit referents over multiple turns, even when they are obvious to native speakers [5].
As shown in Figure 3, the Japanese constraint “パスポートを持っていない” (does not have a passport) lacks an explicit subject, requiring the model to infer the traveler from the preceding context. Similarly, the Vietnamese “vẫn không nhận” (still does not recognize) omits both subject and object, forcing the agent to determine whether the “cổng USB” (USB port) or “ổ cứng” (hard drive) is the source of the error.
Figure 3 Examples of pronoun dropping. Omitted referents are indicated with [].
Gender Neutrality and Coreference Tracking
In languages like Korean, Turkish, or Persian, gender-neutral pronouns (e.g., 걔 in Korean) lack the clarity of he and she in English. With multiple actors, this causes agents to confuse individuals by assigning one person’s traits to another. Without gendered cues, the model incurs extra overhead to track subjects accurately across turns, especially in the Inference Memory challenge.
Conversely, translating English neutral terms (e.g., “the doctor”) into heavily gendered languages like German forces an immediate gender assignment (der Arzt or die Ärztin). If the context is limited, this might lead to an incorrect guess which shatters the coreference chain [4, 6]. When later turns contradict this guess, the model hallucinates a new entity rather than updating its assumption, dropping critical constraints tied to the original actor.
Tone Softening and Discourse Norms
In languages like Thai or Japanese, discourse norms require softening definitive statements. For example, “It is the fastest” naturally translates into Thai as “น่าจะเร็วที่สุด” (It might be the fastest). This introduces explicit uncertainty into the dialogue history, even when the underlying fact is true.
This cultural adaptation can disrupt the Self-Coherence challenge. When an agent softens its wording to be polite, it records uncertainty it does not actually have. In later turns, the model treats its own past statements as guesses rather than facts, making it likely to contradict itself and fail the task.
Primary Factor 3: Fundamental Model Limitations
To systematically identify the remaining causes, we asked bilingual experts to correct data artifacts and flag language nuances affecting the evaluation outcomes. As shown in Figure 4, once data issues are removed, language nuances account for only a small fraction of the remaining errors. What, then, drives most of the performance drop?
Figure 4 Proportions of causes for performance degradation in non-English languages relative to English.
We attribute this residual gap to fundamental model limitations: models are simply poorer at reasoning in less-resourced languages. Several technical factors underlie this:
- Tokenizer Inefficiency: Target languages often require significantly more tokens than English (e.g., 3x for Arabic; Figure 5). This higher density fragments the input, making it harder for attention mechanisms to track constraints or user details across a dialogue [7].
- Latent Space Misalignment: A model's latent space is heavily biased toward English. Rather than mapping to a shared, language-agnostic conceptual space, less-resourced languages form shallow, isolated clusters [8, 9]. This isolates them from the core reasoning engine, severely hindering the cross-lingual transfer of capabilities.
- English-Centric Reasoning: Because complex reasoning is primarily trained on English data, models often default to internal English reasoning for non-English tasks. This forces a continuous, resource-draining internal translation cycle—translating the query, reasoning in English, and converting the output back—sacrificing overall task performance [10, 11].
Figure 5 Average number of tokens per dialogue across languages.
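The tokenizer effect can be approximated without access to a model's actual vocabulary: English-centric BPE tokenizers tend to fall back toward byte-level pieces for scripts they rarely saw in training, so UTF-8 byte length gives a crude hint of relative token cost. The sentences and the byte-level proxy below are illustrative, not a real tokenizer measurement:

```python
# Crude proxy for tokenizer inefficiency. The translations are illustrative;
# a real analysis would count tokens with the model's own tokenizer.
sentences = {
    "en": "Hello, how are you today?",
    "ar": "مرحبا، كيف حالك اليوم؟",
    "ko": "안녕하세요, 오늘 어떻게 지내세요?",
}
byte_costs = {lang: len(s.encode("utf-8")) for lang, s in sentences.items()}
# Arabic letters take 2 bytes and Korean syllables 3 bytes in UTF-8, so both
# sentences cost noticeably more bytes than the English one.
```

When a tokenizer degrades toward bytes for a script, this byte gap translates directly into the longer, more fragmented sequences discussed above.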
Modified architectures, new pre-training recipes, or scaling up model parameters could theoretically resolve these issues. However, these foundational changes are computationally expensive and research-intensive.
A more practical solution lies in post-training. By collecting high-quality multilingual data and carefully designing Reinforcement Learning (RL) environments and rewards, we can directly improve native-language reasoning without costly retraining [12].
Empirical Analysis: Pinpointing Where Reasoning Breaks Down
Having identified fundamental model limitations as the primary factor, improving performance requires a targeted data curation strategy. By analyzing the performance drop across specific dimensions, we can prioritize exactly what data to collect next.
The Impact of Conversation Length on Accuracy
Figure 6 shows that, because of the tokenizer inefficiencies mentioned earlier, accuracy in non-English languages drops significantly in longer conversations (6-10 turns) while English remains stable.
Figure 6 Performance by conversation length (GPT-5.2).
Cross-Lingual Asymmetries Across Challenge Axes
Figure 7 reveals that the degradation is uneven across task types. Instruction Retention and Inference Memory drop a steady 3-7% in all non-English languages, but Reliable Version Editing is sharply language-dependent: Arabic and Korean collapse to 32-37% while German surpasses English. Self-Coherence flips the pattern again, with Arabic outperforming English and German falling behind.
Figure 7 Performance by challenge axis.
These asymmetries show that closing the multilingual gap requires a targeted approach to data curation. Rather than simply adding more data, post-training must prioritize long-turn dialogues tailored to each language's specific failure modes, such as focusing on Reliable Version Editing for Arabic.
Conclusion: Building Truly Global AI
The multilingual performance gap isn’t just about data volume; closing it requires a rigorous process of elimination. First, we must clear the noise: remove data artifacts like cultural mismatches and poor translations. Next, we must navigate the linguistic terrain: preserve inherent language nuances for natural dialogue. Only then is the final bottleneck exposed: the model itself.
Since rebuilding foundational models is too expensive, targeted post-training is the most practical solution. However, simply adding data isn't enough. By identifying exactly where a model struggles—across specific topics and reasoning types—we can curate the precise datasets needed to fix those weaknesses.
Building truly global AI requires the right data. We combine deep linguistic expertise with research-level benchmarking to identify exactly where your model breaks down. By resolving artifacts and capturing subtle nuances, we provide the targeted, high-quality datasets needed to fix the capability gap at the source. Contact us today to transform your multilingual performance.
About LILT
LILT is a multilingual applied research lab, partnering with researchers to design custom evaluations, benchmarks, and RL environments that measure real model behavior in business workflows. We integrate expert human judgment, research-grade delivery, and forward-deployed engineering to define, operationalize, and defend what “good” means—across domains, and 100+ languages. Contact us today.
-----------------------------------------------------------------------------
[1] Ahuja et al., MEGA: Multilingual Evaluation of Generative AI, EMNLP 2023
[2] Sirdeshmukh et al., MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs, ACL 2025 Findings
[3] Voita et al., When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion, ACL 2019
[4] Bawden et al., Evaluating Discourse Phenomena in Neural Machine Translation, NAACL-HLT 2018
[5] Novák et al., Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLMs Dethrone Traditional Approaches?, CODI-CRAC 2025
[6] Saunders et al., Neural Machine Translation Doesn’t Translate Gender Coreference Right Unless You Make It, GeBNLP 2020
[7] Ahia et al., Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models, EMNLP 2023
[8] Wendler et al., Do Llamas Work in English? On the Latent Language of Multilingual Transformers, ACL 2024
[9] Lim et al., Language-Specific Latent Process Hinders Cross-Lingual Performance, NAACL 2025
[10] Bafna et al., The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure, IJCNLP 2025
[11] Kang et al., Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?, arXiv:2510.27269
[12] Yang et al., Language Imbalance Driven Rewarding for Multilingual Self-improving, ICLR 2025