Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation

Naskar, Sudip Kumar; Toral, Antonio; Gaspari, F.; Groves, Declan

Diagnostic evaluation of machine translation (MT) is an approach to evaluation that provides finer-grained information than state-of-the-art automatic metrics. This paper evaluates DELiC4MT, a diagnostic metric that assesses the performance of MT systems on user-defined linguistic phenomena. We present the results obtained with this diagnostic metric when evaluating three MT systems that translate from English to French, and compare them against both human judgements and a set of representative automatic evaluation metrics. In addition, as the diagnostic metric relies on word alignments, the paper compares the margin of error in diagnostic evaluation when using automatic word alignments as opposed to gold-standard manual alignments. We observed that this diagnostic metric is capable of accurately reflecting translation quality, can be used reliably with automatic word alignments and, in general, correlates well with automatic metrics and, more importantly, with human judgements.

Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation / Naskar, S. K., Toral, A., Gaspari, F., Groves, D. - (2013), pp. 135-142. (XIV Machine Translation Summit, Nice, France, 2-6 September 2013).