Improving the robustness of neural machine translation to social media text - Naver Labs Europe

WMT 2019 Robustness Competition: NAVER LABS Europe’s models rank first in 3 out of 4 subtasks.

Neural Machine Translation (NMT), based on Deep Neural Networks, has been deployed in many commercial MT engines such as Google Translate, Bing Translator, DeepL and Naver’s own Papago.

It’s behind a lot of progress in the quality of machine translation which, in some language pairs, is now almost as good as human translations.

Yet, NMT isn’t perfect, for several inherent reasons that include, but are not limited to, the following:

  • NMT usually requires extremely large amounts of expensive, translated data. That’s why the quality is so good in language pairs where such data exists (e.g. French-English) and why it’s so bad in many others.
  • It requires a lot of computing power both to train and to run. This makes research costly and deployment difficult (especially on embedded devices).
  • It’s brittle: even the smallest change in the user input can break it (see Belinkov et al., 2016). Adding or removing a single comma can lead to dramatic changes in the model output.
  • Its decisions are hard to interpret. NMT models are known to “hallucinate”, i.e. they sometimes output fluent text that’s completely unrelated to the input.

In this post, we discuss the problem of brittleness which is what we addressed in the WMT challenge. You can read the full paper here.

Robustness Task

The competition took place in the context of the yearly WMT conference. Participants were given a number of resources (translated texts and monolingual data) to train their models.

The goal was to make French-English and Japanese-English Machine Translation systems that are good at translating user generated content (UGC) from social media (specifically Reddit).

We could use large public translated corpora (e.g., Europarl, CommonCrawl), as well as a smaller domain-specific corpus called “MTNT”, which contained sentences from Reddit translated by humans. The evaluation data was a subset of the MTNT corpus.


The kind of user generated content that Reddit and similar websites aggregate often contains orthographic variations that are particularly tough for regular MT systems. Here are some (cherry-picked) examples from MTNT that illustrate this:

Examples 1, 4 and 5 have typos. Examples 1 and 3 contain emoticons. Examples 1 and 5 use capitalization for emphasis. Examples 2, 3 and 4 use Internet slang. Example 6 code-switches between Japanese and Latin scripts. Examples 1, 3 and 4 have irregular or missing punctuation. Examples 2, 3 and 5 have irregular capitalization.

Any of the above will typically cause problems for a standard MT system. For instance, while capitalization may seem trivial, NMT treats the ‘same’ word with different capitalization as two different words. If “SERIOUSLY” is not in the training data, a regular NMT system will not be able to translate it (even though “seriously” is a very frequent word). Typos and Internet slang are also a large source of confusion because they’re typically missing from public corpora. For instance, the word “bcp”, a very common abbreviation of “beaucoup” (“a lot” in French), is often mistranslated by state-of-the-art models and commercial MT engines.


Our team proposes several solutions to mitigate these problems, which we can roughly categorize into three types of techniques: corpus filtering, domain adaptation and robustness tricks.

We observed that MT models trained on the given data were particularly brittle and often generated hallucinations (output completely unrelated to the input) or copies (exact copies of the source). We traced these problems to the bad quality of some of the training corpora (in particular CommonCrawl), and were able to eliminate them thanks to three filtering steps:

  1. Language identification: sentences that were not in the right language according to an automatic tool were removed.
  2. Length filtering: sentences that were too long or sentence pairs with a large source/target ratio were removed.
  3. Attention-based filtering, which used the attention matrix of an NMT model to identify misalignments (i.e., translated pairs that don’t match).
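The length-filtering step (2) can be sketched as follows. The thresholds here are illustrative assumptions, not the values we actually used:

```python
def length_filter(pairs, max_len=100, max_ratio=1.5):
    """Keep sentence pairs that are short enough and whose
    source/target length ratio is within bounds (illustrative thresholds)."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # drop empty sides
        if n_src > max_len or n_tgt > max_len:
            continue  # drop overly long sentences
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # drop suspicious length ratios (likely misalignments)
        kept.append((src, tgt))
    return kept

pairs = [
    ("le chat dort", "the cat is sleeping"),           # kept: ratio 4/3
    ("bonjour", "hello there my good friend indeed"),  # dropped: ratio 6/1
]
print(length_filter(pairs))
```

Language identification (step 1) works the same way, with a language-ID tool in place of the length checks.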

After filtering, we applied several robustness tricks:

  • Inline casing: to deal with capitalized words, we split words and their case as follows: “They were SO TASTY!!” → “they <T> were so <U> tasty <U> !!”

<T> tokens mean that the previous word is in title case, while <U> means upper case. The model treats these special tokens like any other word. We also use the standard Byte Pair Encoding (BPE) algorithm to split rare words into more frequent “subwords” (i.e., “tasty <U>” → “ta <U> sty <U>”).
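A minimal sketch of the inline-casing step (the tokenizer here is a simplification; the real pipeline tokenizes properly and applies BPE afterwards):

```python
import re

def inline_case(sentence):
    """Lowercase each token and append a <T> (title case) or <U> (upper case)
    marker, reproducing the example above. Naive word/punctuation tokenizer."""
    out = []
    for tok in re.findall(r"\w+|[^\w\s]+", sentence):
        if tok.isupper() and len(tok) > 1:
            out += [tok.lower(), "<U>"]
        elif tok.istitle():
            out += [tok.lower(), "<T>"]
        else:
            out.append(tok)
    return " ".join(out)

print(inline_case("They were SO TASTY!!"))
# → they <T> were so <U> tasty <U> !!
```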

  • Natural noise generation: we analyzed the monolingual MTNT data with a transducer (implemented in our open source Tamgu language), to extract common orthographic variations of each word (e.g. gorgeous/georgous). We then randomly replaced source-side words in the clean parallel data with their alternative spellings.
  • Placeholders: NMT has trouble translating (or rather copying) emojis, because there are so many of them (about 3000 according to the latest Unicode standard) and because most of them are absent from the training data. We solve this by replacing all the emojis in the training and test data with a special <emoji> token. At test time, we replace the tokens in the model’s output with the source-side emojis in the same order.
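The placeholder trick can be sketched as follows. The emoji pattern here is illustrative and covers only a few common Unicode ranges, not the full set:

```python
import re

# Illustrative pattern over common emoji ranges; the real set spans
# many more Unicode blocks (about 3000 code points in total).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def mask_emojis(text):
    """Replace every emoji with an <emoji> placeholder and return the
    masked text plus the emojis in source order."""
    found = EMOJI.findall(text)
    masked = " ".join(EMOJI.sub(" <emoji> ", text).split())
    return masked, found

def unmask(output, emojis):
    """Substitute the source-side emojis back into the model output, in order."""
    it = iter(emojis)
    return re.sub("<emoji>", lambda m: next(it, m.group(0)), output)
```

At training time only `mask_emojis` is applied; at test time the source emojis collected by `mask_emojis` are re-inserted into the translation with `unmask`.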

Finally, because the evaluation data is in a specific domain (Reddit), and we have a small training corpus of the same domain (MTNT), it makes sense to apply the usual domain adaptation techniques.

  • Fine-tuning or corpus tags: fine-tuning consists of training a general-domain model on all available parallel data, then continuing training on the in-domain data only. Instead, we use the “domain tag” technique of Kobus et al., 2016, with one distinct tag per corpus in our training data (<corpus:MTNT>, <corpus:Europarl>, etc.). These tags are appended to each source-side sentence to identify the “domain” (or in our case, corpus) of the sentence pair. At test time, you can translate sentences in any domain by using the most appropriate tag. An advantage of this over fine-tuning is that it doesn’t degrade performance on other domains. It’s also less tricky to configure and can be combined with other types of tags.
  • Back-translation ‘BT’: we use MT models in the reverse direction to translate target-language monolingual data (the small MTNT corpus and the huge news-discuss corpus) into the source language. This creates a large, synthetic, parallel corpus (part in-domain, part close-to-domain) whose source side is noisy. We identify such sentences in the training data with a <type:BT> tag. Natural noise examples are identified with a <type:noise> tag, and real parallel data with <type:real>. At test time we use both the <corpus:MTNT> and <type:real> tags.
  • Back-translation ‘BT’: we use MT models in reverse direction to translate target-language monolingual data (the small MTNT corpus and the huge news-discuss corpus) to the source language. This creates a large, synthetic, parallel corpus (part in-domain, part close-to-domain), whose source side is noisy. We identify such sentences in the training data with a <type:BT> tag. Natural noise examples are identified with a <type:noise> tag, and real parallel data with <type:real>. At test time we use both the <corpus:MTNT> and <type:real> tags.
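The tagging scheme above can be sketched in a few lines. The tag names come from the post; appending them to the source side follows the description above, though the exact position is a convention the model simply learns:

```python
def tag_source(sentence, corpus, data_type="real"):
    """Append corpus and data-type tags to a source sentence.
    The model treats these tags as ordinary tokens."""
    return f"{sentence} <corpus:{corpus}> <type:{data_type}>"

# Training time: a back-translated MTNT sentence.
print(tag_source("ce post est g\u00e9nial", "MTNT", "BT"))
# Test time: always request the in-domain, real-data behaviour.
print(tag_source("bcp de typos ici", "MTNT", "real"))
```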

With these pre-processing steps, we train Transformer Big models using the Fairseq framework, with hyper-parameters similar to those of the winner of the WMT18 EN-DE News translation task [paper].


The organizers performed automatic and manual evaluation. The automatic evaluation was done with the standard BLEU metric which compares a model’s outputs with a human-translated reference. The human evaluation was done by asking human judges to score model outputs on a scale of 1 to 100, while being shown the source sentence. The main results are shown below and you can check out the full task overview paper if you want more detail.
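To give an idea of what BLEU measures, here is a toy sentence-level sketch: a geometric mean of modified n-gram precisions with a brevity penalty. Real evaluation uses corpus-level tooling such as sacreBLEU; the smoothing here is an assumption for illustration only:

```python
import math
from collections import Counter

def toy_bleu(hyp, ref, max_n=4):
    """Toy sentence-level BLEU: modified n-gram precisions (n=1..4)
    combined geometrically, times a brevity penalty."""
    hyp, ref = hyp.split(), ref.split()
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(toy_bleu("the cat sat on the mat", "the cat sat on the mat"), 2))
# → 1.0 for an exact match
```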

Automatic evaluation


The best improvements were obtained with the domain adaptation techniques. Back-translation and ensembling improved the scores a little more. Our ensemble models (which combine all the techniques mentioned) ranked first in all translation directions with a large margin on FR-EN and a tight margin on JA-EN.

Human evaluation

We see that our ensemble model ranked first in 3 out of 4 sub-tasks. Our English-to-Japanese model, which had the highest BLEU score, came second in the human evaluation.


The human rating scale goes as follows:

51-70: Understandable, with few translation errors/grammar mistakes

71-90: Very good translation, only a couple of minor mistakes

91-100: Accurate translation, no mistakes


Good baseline systems, corpus filtering, domain adaptation and back-translation were the key elements of our approach. To address some of the robustness issues we proposed specific tricks (e.g. handling of emojis and capital letters), but these had little impact on BLEU scores, partly due to the rarity of such issues in naturally occurring text. For instance, the particularly “noisy” MTNT test set contains only 5 emojis in 20k words.

In the future, we’ll be focusing more on robustness and less on domain adaptation. We’ll explore more aggressive alternatives to our natural noise generation (e.g. adversarial noise generation) and look for more specialized evaluation metrics than BLEU for robustness.

About the author: Alexandre Berard is a research scientist in the Natural Language Processing group. He worked with Ioan Calapodescu and Claude Roux on the WMT19 Robustness Task.
The results of the WMT challenge were presented at the 4th Conference on Machine Translation (1–2 August 2019), held in conjunction with ACL 2019 in Florence, Italy.