A machine translation model for Covid-19 research - Naver Labs Europe

We are releasing a state-of-the-art multilingual, multi-domain neural machine translation model specialised for biomedical data. It enables translation into English from five languages (French, German, Italian, Spanish and Korean).


You’ve heard it before: Covid-19 is disrupting every aspect of our lives. Different forms of lockdown are in operation across the world, and we’re all coming to terms with the idea that social distancing may be a long-term requirement. It’s likely you’ve also heard that this pandemic is far from over, and that its consequences will take years to fully crystallise.

In the midst of all this, chances are high that when you open a news portal or social media channel, you’re presented with an article or post that has some reference to the disease. By its very nature, a pandemic affects people of all languages, and we’re currently living through a rare period in which people around the world are all talking about a single topic. Understanding how reactions vary across different cultures, as well as pulling them together to find commonalities, will provide important insights for the future.

We believe that the vast sum of written digital communication about Covid-19 that’s currently being amassed will be the basis of hundreds of research programmes in the future. The data will be used to analyse our response during this period and—hopefully—to orient and advise future policies in economy, sociology, crisis management and, of course, public health.

To facilitate the large-scale analysis of this digital evidence at such a unique time in human history, we are releasing a multilingual translation model that can translate biomedical text. Anybody can download our model and use it locally to translate texts from five different languages (French, German, Italian, Spanish and Korean) into English.

Why is our translation model useful?

While automated translation portals are mainstream and used by millions daily, they’re not specialised in biomedical data. Such data often contains specific terminology which isn’t recognised—or is poorly translated—by most platforms. In addition, making our work available means that researchers can host their own models, enabling them to translate at will without having to monitor the budget spent on those portals. Although a few pretrained models exist (Opus-MT-train is one example), most are only bilingual, limiting their use. Additionally, as they aren’t trained using biomedical data, the models are not suitable for such specialised translation.

What exactly are we releasing?

Neural machine translation

Neural machine translation models work by encoding input sequences into mathematical structures, or intermediate representations, that consist of points in high-dimensional real space. These intermediate representations are produced by a large number of parameters, which we set by exposing our model to training data (consisting of translated sentences from publicly available resources). A decoder then uses the intermediate representation to generate an English translation, producing the translated sentence word by word.
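The encode-then-decode data flow described above can be sketched in miniature. This is a purely illustrative toy, not the released model: real systems learn millions of parameters with Transformer layers, whereas here the "parameters" are a tiny hand-written embedding table, just to show how source tokens become points in real space and how a decoder emits target words from those points.

```python
# Toy sketch of the encode/decode pipeline (illustrative only; real NMT
# models use learned Transformer layers, not lookup tables).

EMBEDDINGS = {          # source token -> point in a 2-dimensional "real space"
    "bonjour": (1.0, 0.0),
    "monde":   (0.0, 1.0),
}
TARGET_VOCAB = {        # candidate English words with their own points
    "hello": (1.0, 0.0),
    "world": (0.0, 1.0),
}

def encode(tokens):
    """Encoder: map each source token to its intermediate representation."""
    return [EMBEDDINGS[t] for t in tokens]

def decode(representation):
    """Decoder: greedily emit, word by word, the target word whose point
    lies closest to each encoded point."""
    def nearest(point):
        return min(TARGET_VOCAB,
                   key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(TARGET_VOCAB[w], point)))
    return [nearest(p) for p in representation]

print(decode(encode(["bonjour", "monde"])))  # ['hello', 'world']
```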

A variety of neural architectures exist. We based our own work on state-of-the-art models, which employ the so-called Transformer (1) architecture. Using high-capacity variants of this architecture, we’re able to translate different languages with a single model.

Our multilingual, multi-domain translation model

Typically, a number of models (n², where n is the number of languages) must be managed separately to enable translation across multiple languages. Because ours is multilingual, users are able to translate from five different languages into English using just one model, simplifying storage and maintenance. More importantly, research has shown (2) that multilingual models can greatly benefit so-called under-resourced languages, i.e. languages for which less parallel data exists (as can be seen in the data we used, where the number of training sentences varies widely across languages). In the benchmarks that we used to measure performance, we found that our model achieves results similar to the best-performing bilingual models.

Words are often ambiguous when taken out of context and can mean very different things in different settings (or ‘domains’). For example, when translated into German, high temperature could be hohe Temperatur or Fieber, depending on whether the domain is meteorology or medicine, respectively. Likewise, a French carte might be a map or a menu, depending on whether you’re on a treasure hunt or in a restaurant. For this reason, creating a multi-domain model—i.e. one that is capable of translating specialised information—is particularly challenging. To achieve multi-domain functionality and enable the accurate translation of information relating to Covid-19, we used a variety of parallel biomedical data (in particular from TAUS) when training our model.

Achieving biomedical specialisation with domain tags

One approach to achieving domain adaptation in a translation model is to fine-tune it to the specific domain of interest. We didn’t want to overspecialise, however, as this could lead to losing the advantage provided by large corpora from other domains. To maximise the usability of our model, we decided instead to use domain tags (a strategy that has been successful (3) in the past). These domain tags are used as control tokens. During training, sentences that come from one domain are assigned the same tag. The tag can then be used at inference time (as opposed to training time) to nudge the model towards one domain or the other. In the model we’re releasing, the user can employ the default settings for standard translation, or select the biomedical tag. The same sentence translated with or without this token produces different output.

We achieved multilingualism in the same way. For example, a French sentence is tagged with a different label than a Korean sentence: <fr> and <ko>, respectively. Although controlling the language tags at inference time makes less sense (as it forces the model to consider, for instance, a German sentence when translating from Italian to English), it allows for very flexible and generic training procedures.
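The tagging scheme described above amounts to prepending control tokens to the tokenised source sentence. Here is a minimal sketch of that idea; the helper name `tag_input` and the exact tag strings (`<ko>`, `<medical>`, etc.) are illustrative assumptions, and the released model's documentation should be consulted for the tags it actually expects.

```python
# Hypothetical sketch of the control-token scheme: language and domain tags
# are simply prepended to the source sentence before it is fed to the model.

def tag_input(sentence, lang, domain=None):
    """Prepend a language tag and, optionally, a domain tag (assumed format)."""
    tags = [f"<{lang}>"]
    if domain is not None:
        tags.append(f"<{domain}>")
    return " ".join(tags + [sentence])

# Standard translation from French:
print(tag_input("la carte est sur la table", "fr"))
# -> "<fr> la carte est sur la table"

# Korean input nudged towards the biomedical domain:
print(tag_input("환자는 고열 증상을 보였다", "ko", domain="medical"))
# -> "<ko> <medical> 환자는 고열 증상을 보였다"
```

Because both kinds of tag are just tokens in the input, the same training procedure handles any combination of language and domain, which is what makes the scheme so flexible.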

Indeed, by enabling two ways of varying the input (domain and language), our model achieves better translation, as determined by a standard measure (called BLEU, for bilingual evaluation understudy) that counts overlapping sequences of words with respect to a reference translation. Interestingly, although the model wasn’t presented with any biomedical data for Korean–English at the learning stage, we found that using the biomedical tag for translation resulted in a different output. In an internal test, we translated a set of biomedical texts and observed an increase of 0.44 BLEU points when the tag was selected. Although the changes are small, they’re often important (see, for example, the changes highlighted with boldface in Table 1). The first two examples in Table 1 show cases for which the translation was more accurate, while the last example actually shows degraded performance.


Table 1: The output of our translation model, with and without biomedical specialisation (selected with the tag), compared with the reference (human) translation for a Korean medical text. The boldface text highlights two examples of important differences in translation that result from this specialisation.

We’re currently exploring further the transfer of such information in multilingual and multi-domain models. We’ve already shown that the use of flexible control tokens makes summarisation more faithful (4) to the original documents, and we are interested in how the interplay between different types of control tokens affects translation. In particular, the transfer of knowledge across languages and domains could open up a wide range of exciting uses for natural language generation models, even for languages or domains where no training data is easily available.

Benchmarks: How does our model measure up?

We tested our model against competing machine translation models using the BLEU measure. BLEU is a quality metric for machine translation systems that evaluates a piece of machine-translated text by comparing it to corresponding human-translated text. Table 2 reports the BLEU scores our model obtains on standard benchmarks in the field of machine translation, which are provided as part of regular competitions. Whenever available, we also compare against the best-performing model (as reported in the corresponding competition).
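To make the metric concrete, here is a simplified sentence-level BLEU: modified n-gram precision up to 4-grams, combined with a brevity penalty. This is a bare-bones illustration (no smoothing, no corpus-level aggregation); real evaluations, including ours, should use a standard toolkit such as SacreBLEU, whose settings affect the final score.

```python
import math
from collections import Counter

# Simplified sentence-level BLEU for illustration only.
# Use a standard toolkit (e.g. SacreBLEU) for real evaluation.

def ngrams(tokens, n):
    """Count the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:          # no smoothing: any zero precision gives 0
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the patient showed a high fever",
           "the patient showed a high fever"))  # 1.0 (perfect match)
```

Scores are often reported multiplied by 100, which is the convention used in Table 2.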


Table 2: Results of benchmarking tests carried out on our machine translation model using the BLEU (bilingual evaluation understudy) algorithm. For these benchmarks, we used biomedical (MEDLINE-test2019) and non-biomedical (WMT-News and IWSLT-test) test sets. The numbers refer to BLEU scores obtained with our model, and in parentheses we report the difference with respect to the best result we know of on that test set.*

  [a] newstest2019 (de-en), newstest2014 (fr-en) and newstest2013 (es-en).
[b] Test sets (to English) from the WMT18 and WMT19 biomedical translation tasks. Results obtained using the <medical> tag.
[c] IWSLT17-test (for all but Spanish) and IWSLT16-test (for Spanish).


  • For all MEDLINE columns, we used the best result reported in the corresponding competition. However, our results were computed against untokenised references, with SacreBLEU’s default settings. The biomedical task settings might differ. The test sets are also very small (~300 lines), which can cause a large variance in results.
  • For German–English, for the News and TED Talks domains, we used the FAIR model for WMT’19.
  • For French–English, we used our own model (5).
  • We have not reported comparisons for Spanish–English, as the data comes from a 2013 competition and those models are outperformed by modern architectures (the best entry at that competition achieved 30.4). Spanish, Italian and Korean were added to the IWSLT challenge later, and were not part of the competition at that time.

Note that slightly better results can be obtained by using ensemble models, but for simplicity of use we are releasing our single best model.

How can I use this?

Detailed instructions are here. You will need a local copy of the fairseq toolkit. The version of our translation model that we’re releasing also requires a minimal amount of additional code (which we’re also releasing) in order to start translating. This additional code takes care of preprocessing the sentences you want to translate: while we rely on standard tools for tokenisation (SentencePiece), the input data must also be passed through a small script that we provide.

Other than that, just follow the instructions and… happy translating!


  1. Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. arXiv:1706.03762 [cs.CL].
  2. Massively Multilingual Neural Machine Translation. Roee Aharoni, Melvin Johnson and Orhan Firat. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, June 2019. DOI: 10.18653/v1/N19-1388.
  3. NAVER LABS Europe’s Systems for the WMT19 Machine Translation Robustness Task. Alexandre Berard, Ioan Calapodescu and Claude Roux. Fourth Conference on Machine Translation (WMT19), Florence, Italy, 1–2 August 2019.
  4. Self-Supervised and Controlled Multi-Document Opinion Summarization. Hady Elsahar, Maximin Coavoux, Matthias Gallé and Jos Rozen. arXiv:2004.14754 [cs.CL], 30 April 2020.
  5. Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness. Alexandre Berard, Ioan Calapodescu, Marc Dymetman, Claude Roux, Jean-Luc Meunier and Vassilina Nikoulina. Workshop on Neural Generation and Translation (WNGT) at the Empirical Methods in Natural Language Processing (EMNLP) Conference 2019, Hong Kong, China, 4 November 2019.

    * Some numbers have been updated to match the exact evaluation conditions used in the respective competitions. The latest figures are available on GitHub.