This article describes a human translation productivity evaluation plan. The methodology is based on human-subjects experiments in the human-computer interaction (HCI) literature. For example experiments, see our case studies.
We assume that you will compare Lilt with at least one other translation tool that includes machine translation, but the plan can be adapted to compare any set of translation tools.
The goal of a productivity evaluation is to measure two variables:
- Throughput — source words translated per hour
- Quality — of the final translations
In a machine-assisted setting, these two variables are related. To maximize throughput, a translator could simply confirm the MT output for every source sentence. To correct for this bias, we typically multiply the raw throughput by the quality score for each translator. This yields quality-adjusted words per hour. For example, suppose that translator A translates at 800 words per hour with a quality score (from the human translation quality evaluation) of 4.2. We compute:
800 * (4.2 / 5.0) = 672 quality-adjusted words per hour
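The arithmetic above can be captured in a small helper. This is a minimal sketch; the function name and the 5-point quality scale are assumptions based on the example.

```python
def quality_adjusted_throughput(words_per_hour, quality_score, scale_max=5.0):
    """Scale raw throughput by the quality score, normalized to [0, 1]."""
    return words_per_hour * (quality_score / scale_max)

# Translator A: 800 words/hour at a quality score of 4.2 out of 5.0
print(round(quality_adjusted_throughput(800, 4.2), 1))  # 672.0
```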
Let's continue the example from the human translation quality evaluation. We have two test sets (email and listing), each with source sentences and target references. We also have the output of the two systems B and C. Concretely, let's now assume that system B is Lilt and system C is a public MT system integrated into another computer-aided translation tool. Now we need:
- At least four translators — An experiment with two translators is possible, but it will be harder to separate differences due to tooling vs. human performance.
- A method for collecting timing data — Lilt collects segment-level timing. Some other tools offer plugins to collect this data. In a proctored environment, you may time the translators.
Divide the evaluation into two timed sessions separated by an untimed break. Here we call those sessions "morning" and "afternoon". Randomize the pairings of systems/tools and data sets to mitigate the effects of fatigue, source text difficulty, and human proficiency with each tool. Here is an example design:
| Translator | Morning | Afternoon |
|---|---|---|
| Translator 1 | Lilt / email | Tool / listing |
| Translator 2 | Tool / listing | Lilt / email |
| Translator 3 | Lilt / listing | Tool / email |
| Translator 4 | Tool / email | Lilt / listing |
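A design like the one above can be generated programmatically. The sketch below is one way to randomize the tool/test-set pairings, assuming two tools and two test sets; the function name and labels are illustrative, not part of Lilt's tooling.

```python
import itertools
import random

def assign_sessions(translators, tools=("Lilt", "Tool"), datasets=("email", "listing")):
    """Give each translator a random (tool, dataset) pair for the morning
    session and the complementary pair for the afternoon, so every
    translator uses both tools and both test sets exactly once."""
    # The four distinct tool/dataset pairings available for the morning.
    morning_options = list(itertools.product(tools, datasets))
    random.shuffle(morning_options)
    design = {}
    for translator, (tool, dataset) in zip(translators, morning_options):
        other_tool = tools[1] if tool == tools[0] else tools[0]
        other_dataset = datasets[1] if dataset == datasets[0] else datasets[0]
        design[translator] = {"morning": (tool, dataset),
                              "afternoon": (other_tool, other_dataset)}
    return design

for name, sessions in assign_sessions(["T1", "T2", "T3", "T4"]).items():
    print(name, sessions["morning"], "->", sessions["afternoon"])
```

With four translators, every morning pairing is distinct, which balances fatigue and text-difficulty effects across tools.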
Analyzing the Results
Absent segment-level timings, simply aggregate the number of words translated and the total translation time for each system and test set. Create a table like this:
| words / hour | email | listing |
|---|---|---|
| Lilt | | |
| Tool | | |
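The aggregation step can be sketched as follows. The session records and their numbers here are entirely illustrative; in practice they would come from your timing logs.

```python
from collections import defaultdict

# Hypothetical per-session records: (translator, tool, dataset, words, hours)
sessions = [
    ("T1", "Lilt", "email",   1400, 1.75),
    ("T1", "Tool", "listing", 1300, 2.00),
    ("T2", "Tool", "listing", 1250, 1.90),
    ("T2", "Lilt", "email",   1500, 1.80),
]

# Sum words translated and hours spent per (tool, dataset) cell.
totals = defaultdict(lambda: [0, 0.0])
for _, tool, dataset, words, hours in sessions:
    totals[(tool, dataset)][0] += words
    totals[(tool, dataset)][1] += hours

for (tool, dataset), (words, hours) in sorted(totals.items()):
    print(f"{tool:5s} {dataset:8s} {words / hours:7.1f} words/hour")
```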
To compute quality-adjusted words per hour, repeat the human translation quality evaluation for the final translations produced by the human translators. Then scale the raw throughputs in the table above by the quality scores.
Advanced topic — The segment-level timings can be used for more sophisticated statistical analysis. To learn more, refer to section 4.3.1 of our EMNLP 2014 paper.
Good Hygiene for Human Productivity Evaluation
- Each subject should translate between 2,500 and 3,000 words per day. The industry average is 2,684 words per day. Translating more could increase the effect of fatigue.
- Measure time accurately, either via software or by proctoring. Don't rely on the translators to time themselves.
- Measure quality so that translators are incentivized to translate quickly and accurately.
- Evaluate quality-adjusted throughput: throughput * quality.
- Incentivize the participants by giving a prize to the top performer on the evaluation metric. The prize could be as small as a free coffee, but there must be some incentive.
- Limit time for self-review, which is a source of considerable variance among translators.