Automatic Translation Quality

Automatic evaluation methods try to answer the question: how close is the system's translation to one or more human reference translations deemed to be of high quality? There may be many different correct translations, so comparing against only one of them may not seem fair. Still, if the sample of translations is large enough, automatic evaluation is a fast, reliable, and cheap way to discriminate among systems.

An important prerequisite is that the systems have not seen the data before. If the evaluation data comes from a pool that has already been used to train an MT system, the evaluation scores will be unnaturally high: statistical MT systems do a good job of memorizing what they have seen before. A result of 80% overlap with a reference translation might look good at first, but if the system can only translate segments similar to those it has already seen, the result will be misleading.
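When you do have access to the training data (for example, because you supplied it to a custom MT vendor), a quick overlap check can flag contaminated test sets. The following is a minimal sketch, not a complete deduplication tool: the function names and the sample segments are hypothetical, and it only catches exact matches after lowercasing and whitespace normalization.

```python
def normalize(segment: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(segment.lower().split())


def seen_fraction(test_segments, training_segments):
    """Return the fraction of test segments that also occur in the training data."""
    if not test_segments:
        return 0.0
    seen = {normalize(s) for s in training_segments}
    hits = sum(1 for s in test_segments if normalize(s) in seen)
    return hits / len(test_segments)


# Hypothetical sample data for illustration only.
training = ["Free shipping on all orders.", "Your order has shipped."]
test = [
    "Your order has  shipped.",
    "Track your package here.",
    "free shipping on all orders.",
]
print(f"{seen_fraction(test, training):.0%} of test segments appear in the training data")
```

A high fraction suggests that the test set should be replaced or filtered before scoring. Near-duplicates would need fuzzy matching, which this sketch does not attempt.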


Suppose that we are evaluating five MT systems: A, B, C, D, and E. Assume that some of the vendors behind these systems offer custom MT, which can be adapted to your content.

To run the evaluation, you need the following:

  1. Source test set — Source text (segments) for which you will evaluate translation quality. Each test set should have at least 2,000 segments. Use at least two different test sets, preferably from different domains. For this example, let's use these two test sets:
    • email — source text from an email marketing campaign.
    • listing — product listings from an e-commerce catalog.
  2. Target test reference data — Human translations that correspond to the source test data. One set of references is sufficient, but multiple sets of references will make the evaluation more robust.
  3. Target system translations — MT output for the source test data, one set of outputs per system.
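Before scoring, it is worth confirming that the three inputs line up segment by segment, since a single missing line shifts every later comparison. The sketch below assumes each input has already been loaded as a list with one segment per entry; the function name and the toy data are hypothetical, and the default minimum of 2,000 segments comes from the recommendation above.

```python
def check_test_set(source, reference, system_outputs, min_segments=2000):
    """Return a list of problems found; an empty list means the inputs line up."""
    problems = []
    if len(source) < min_segments:
        problems.append(
            f"only {len(source)} source segments, expected at least {min_segments}"
        )
    if len(reference) != len(source):
        problems.append(
            f"reference has {len(reference)} segments, source has {len(source)}"
        )
    for name, output in sorted(system_outputs.items()):
        if len(output) != len(source):
            problems.append(
                f"system {name} has {len(output)} segments, source has {len(source)}"
            )
    return problems


# Toy data: system B is missing one segment.
source = ["s1", "s2", "s3"]
reference = ["r1", "r2", "r3"]
outputs = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
for problem in check_test_set(source, reference, outputs, min_segments=3):
    print(problem)
```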

Automatic Evaluation Measures

Automatic evaluation measures compute the similarity of the target system translations to the target test reference data. Automatic measures only consider the surface form of the translation, i.e., the words as they are written; they typically do not incorporate syntactic or semantic information. Consequently, they cannot recognize that two differently worded strings are semantically equivalent. Consider the following example:

  Reference   Peter gave the ball to Mary
  System A    Peter gave the ball to Mary
  System B    Peter gave Mary the ball

The most common automatic measures would score System A's output as a perfect translation since it exactly matches the reference. However, System B's translation, which is semantically equivalent to both the reference and System A, would be significantly penalized due to the string difference from the reference.
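To make the penalty concrete, the sketch below computes clipped n-gram precision, the building block of BLEU, for both system outputs against the reference. It assumes simple whitespace tokenization and a single reference; the function names are illustrative and not taken from any particular toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Counts are clipped so a repeated n-gram is only credited as often
    as it appears in the reference.
    """
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "Peter gave the ball to Mary"
systems = {"A": "Peter gave the ball to Mary", "B": "Peter gave Mary the ball"}
for name, output in systems.items():
    scores = [ngram_precision(output, reference, n) for n in range(1, 5)]
    print(name, " ".join(f"{p:.2f}" for p in scores))
```

System A matches every n-gram. System B matches every single word but only half of its word pairs and none of its three-word sequences, even though a human reader would accept both translations.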

The most widely used automatic evaluation measures are:

  • BLEU — a score based on the n-gram (word sequence) overlap between the system translation and the reference, averaged over several n-gram lengths. Higher is better. Pro: correlates well with human judgment, available for all languages. Con: poorly defined at the segment level, where short segments often share no longer n-grams with the reference, so it must be computed over a set of translations.
  • TER — the number of edits (insertion, deletion, substitution, and shift of words) required to transform the system translation into the reference, normalized by the length of the reference. Lower is better. Pro: intuitive relation to the translation process, defined at the segment level, available for all languages. Con: not as effective for MT system training as BLEU.
  • METEOR — exact, stem, synonym, and paraphrase matches between words and phrases in the system translation and the reference. Pro: incorporates linguistic knowledge. Con: the linguistic resource requirement means that it isn't available for all languages.

MultEval, which is free, can be used to compute these three scores.
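For intuition about what such a tool computes, here is a simplified corpus-level BLEU sketch. It assumes whitespace tokenization, a single reference per segment, n-grams up to length four, and no smoothing, so its numbers will not match MultEval or other established implementations exactly. Use a maintained tool for any real comparison.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (single reference)."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each n-gram's credit at its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a single zero precision makes the whole score zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty discourages output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["Peter gave the ball to Mary"]
print(corpus_bleu(["Peter gave the ball to Mary"], reference))
print(corpus_bleu(["Peter gave Mary the ball"], reference))
```

The second call returns zero because the single segment has no matching three-word sequence. This is why BLEU is computed over a whole test set rather than one segment at a time.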

Running the Evaluation

Execution of the evaluation is straightforward.

  1. Collect target system translations for systems A-E.
  2. Run MultEval (or your favorite evaluation suite) to compute any or all of BLEU, TER, and METEOR for each test set.
  3. Create a results table for each of the metrics. Here's an example for BLEU:

     System   email   listing
     A        45.0    29.5
     B        44.1    33.0
     C        46.2    32.1
     D        41.0    29.0
     E        39.7    29.8

Analyzing the Results

Consider the example BLEU table above. A difference of one BLEU point is usually statistically significant, but not noticeable to humans. A difference of 2-3 BLEU points is usually noticeable to humans. Using these guidelines, we can infer the following:

  • For the email data, we see that C is best, but probably not noticeably better than A and B. However, systems A-C are probably noticeably better than systems D and E.
  • For the listing data, system B is best, but probably not noticeably better than C. Systems D and E are noticeably worse, as is system A.

We know that BLEU is correlated with human judgment, so we can, with some confidence, eliminate A, D, and E from further consideration. Human judgment will be needed to distinguish between systems B and C.
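The same reasoning can be expressed as a small script. This sketch uses the BLEU scores from the example table and treats a system as competitive on a test set when it is within a fixed gap of the best score. The 2.5-point gap is an assumption chosen from the middle of the 2-3 point guideline, not a standard threshold, and the function name is illustrative.

```python
# BLEU scores from the example table.
bleu = {
    "A": {"email": 45.0, "listing": 29.5},
    "B": {"email": 44.1, "listing": 33.0},
    "C": {"email": 46.2, "listing": 32.1},
    "D": {"email": 41.0, "listing": 29.0},
    "E": {"email": 39.7, "listing": 29.8},
}

# Assumed cutoff: midpoint of the 2-3 BLEU "noticeable to humans" guideline.
NOTICEABLE_GAP = 2.5


def competitive_systems(scores, test_set, gap=NOTICEABLE_GAP):
    """Return the systems whose score is within `gap` of the best on a test set."""
    best = max(system[test_set] for system in scores.values())
    return {name for name, system in scores.items() if best - system[test_set] < gap}


# Keep only the systems that are competitive on every test set.
shortlist = competitive_systems(bleu, "email") & competitive_systems(bleu, "listing")
print(sorted(shortlist))
```

With these numbers the shortlist is B and C, which matches the conclusion above. A stricter or looser gap would change which systems survive, so the cutoff deserves a second look when scores are close.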

This evaluation has helped us cheaply and quickly narrow the list of candidate systems from five to two. Because there will be fewer systems to evaluate, the human evaluation will also be faster and cheaper, or we could evaluate more translation data for the same budget.

Good Hygiene for Automatic Evaluation

  • Get good reference translations. Data preparation for an MT system evaluation can be time-consuming, but it is important to do it carefully: you have to decide which data best reflects the use case that the system will ultimately be applied to, and then prepare reference translations for that data.
  • Use recent, "fresh" evaluation data, i.e., data that the MT systems being evaluated have most likely not seen before. This means that you should not use parallel evaluation data that can be found on the internet, as most MT providers already use these sets to train and tune their systems. You should also avoid data that has been discoverable on the internet for a long time (e.g., pages that have been online for more than half a year), as MT providers also crawl the web and add everything they can get to their models.
  • Protect your reference translations. Only a blind evaluation gives you objective results. If MT providers ask for data that they can use to tune their systems, give them a different set. Often, an evaluation set is split into two: one portion is shared for tuning, and the other is used for a final blind evaluation, where only the source side is shared with the MT provider and the evaluation is done in-house.
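
The split described in the last point can be sketched as follows. The function name, the 50/50 default, and the fixed random seed are all assumptions made for illustration; the only requirements taken from the text are that the two portions are disjoint and that only the source side of the blind portion leaves your organization.

```python
import random


def split_eval_set(pairs, tuning_fraction=0.5, seed=13):
    """Split (source, reference) pairs into a tuning set and a blind set.

    Returns the tuning pairs (safe to share in full), the blind pairs
    (kept in-house), and the blind sources (the only part of the blind
    set that is sent to MT providers).
    """
    shuffled = list(pairs)
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * tuning_fraction)
    tuning = shuffled[:cut]
    blind = shuffled[cut:]
    blind_sources = [source for source, _ in blind]
    return tuning, blind, blind_sources


# Hypothetical placeholder data.
pairs = [(f"source {i}", f"reference {i}") for i in range(10)]
tuning, blind, blind_sources = split_eval_set(pairs)
print(len(tuning), len(blind), len(blind_sources))
```

Shuffling before the cut keeps both portions representative when the original file is ordered by date or topic.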
Last updated on 12th May 2019