You’ve heard it before: Covid-19 is disrupting every aspect of our lives. Different forms of lockdown are in operation across the world, and we’re all coming to terms with the idea that social distancing may be a long-term requirement. It’s likely you’ve also heard that this pandemic is far from over, and that its consequences will take years to fully crystallise.
In the midst of all this, chances are high that when you open a news portal or social media channel, you’re presented with an article or post that has some reference to the disease. By its very nature, a pandemic affects people of all languages, and we’re currently living through a rare period in which people around the world are all talking about a single topic. Understanding how reactions vary across different cultures, as well as pulling them together to find commonalities, will provide important insights for the future.
We believe that the vast sum of written digital communication about Covid-19 that’s currently being amassed will be the basis of hundreds of research programmes in the future. The data will be used to analyse our response during this period and—hopefully—to orient and advise future policies in economy, sociology, crisis management and, of course, public health.
To facilitate the large-scale analysis of this digital evidence at such a unique time in human history, we are releasing a multilingual translation model that can translate biomedical text. Anybody can download our model and use it locally to translate texts from five different languages (French, German, Italian, Spanish and Korean) into English.
While automated translation portals are mainstream and used by millions daily, they’re not specialised in biomedical data. Such data often contains specific terminology which isn’t recognised—or is poorly translated—by most platforms. In addition, making our work available means that researchers can host their own models, enabling them to translate at will without having to monitor the budget spent on those portals. Although a few pretrained models exist (Opus-MT-train is one example), most are only bilingual, limiting their use. Additionally, as they aren’t trained using biomedical data, the models are not suitable for such specialised translation.
Neural machine translation models work by encoding input sequences into mathematical structures, or intermediate representations, that consist of points in high-dimensional real space. These intermediate representations are obtained by setting a large number of parameters, which we achieved by exposing our model to training data (consisting of translated sentences from publicly available resources). A decoder then uses the intermediate representation to generate an English translation, producing the translated sentence word by word.
A variety of neural architectures exist. We based our own work on state-of-the-art models, which employ the so-called Transformer (1) architecture. Using high-capacity variants of this architecture, we’re able to translate different languages with a single model.
Typically, a number of models (on the order of n², where n is the number of languages) must be managed separately to enable translation across multiple languages. Because ours is multilingual, users are able to translate from five different languages to English using just one model, simplifying storage and maintenance. More importantly, research has shown (2) that multilingual models can greatly benefit so-called under-resourced languages, i.e. languages for which less parallel data exists (as can be seen from the data we used, the number of training sentences varies widely across languages). In the benchmarks that we used to measure performance, we found that our model achieves results similar to the best-performing bilingual models.
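The storage argument above comes down to simple counting. The sketch below is only an illustration: the function names are hypothetical, and it assumes one bilingual model per directed language pair, which gives n(n - 1) models, a count that grows on the order of n².

```python
def directed_pairs(n_languages: int) -> int:
    """Count (source, target) pairs among n languages, excluding self-pairs.

    Assumption: one separately managed bilingual model per directed pair.
    """
    if n_languages < 0:
        raise ValueError("n_languages must be non-negative")
    return n_languages * (n_languages - 1)


def models_into_english(n_source_languages: int, multilingual: bool) -> int:
    """Models needed to translate n source languages into English only."""
    return 1 if multilingual else n_source_languages


# Five source languages plus English, every direction covered by bilingual models.
print(directed_pairs(6))
# Into-English only: separate bilingual models versus one multilingual model.
print(models_into_english(5, multilingual=False))
print(models_into_english(5, multilingual=True))
```

Even in the into-English setting of this release, a single multilingual model stands in for five bilingual ones.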
Words are often ambiguous when taken out of context and can mean very different things in different settings (or ‘domains’). For example, when translated into German, high temperature could be hohe Temperatur or Fieber, depending on whether the domain is meteorology or medicine, respectively. Likewise, a French carte might be a map or a menu, depending on whether you’re on a treasure hunt or in a restaurant. For this reason, creating a multi-domain model—i.e. one that is capable of translating specialised information—is particularly challenging. To achieve multi-domain functionality and enable the accurate translation of information relating to Covid-19, we used a variety of parallel biomedical data (in particular from TAUS) when training our model.
One approach to achieving domain adaptation in a translation model is to fine-tune it to the specific domain of interest. We didn’t want to overspecialise, however, as this could lead to losing the advantage provided by large corpora from other domains. To maximise the usability of our model, we decided instead to use domain tags (a strategy that has been successful (3) in the past). These domain tags are used as control tokens. During training, sentences that come from one domain are assigned the same tag. The tag can then be used at inference time (as opposed to training time) to nudge the model towards one domain or the other. In the model we’re releasing, the user can employ the default settings, for standard translation, or select the biomedical tag. The same sentence translated with or without this token produces different output.
We achieved multilingualism in the same way. For example, a French sentence is tagged with a different label than a Korean sentence: <fr> and <ko>, respectively. Although controlling the language tags at inference time makes less sense (as it forces the model to consider, for instance, a German sentence when translating from Italian to English), it allows for very flexible and generic training procedures.
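As a rough illustration of how control tokens might be attached to an input sentence, here is a minimal sketch. Only the <fr>, <ko> and <medical> tags appear in this post; the tags for the other languages, the order of the tags, their position at the start of the sentence and the function name are all assumptions made for the example. In practice, the preprocessing script shipped with the model is what prepares the input.

```python
from typing import Optional

# <fr> and <ko> are mentioned in the post; the others are assumed by analogy.
LANGUAGE_TAGS = {
    "fr": "<fr>",
    "de": "<de>",
    "it": "<it>",
    "es": "<es>",
    "ko": "<ko>",
}

# Domain tag for biomedical text, as named in the benchmark notes.
MEDICAL_TAG = "<medical>"


def tag_source(sentence: str, lang: str, domain_tag: Optional[str] = None) -> str:
    """Prepend a language tag and, optionally, a domain tag to a source sentence.

    Assumption: tags are plain tokens placed before the sentence,
    language first and domain second.
    """
    if lang not in LANGUAGE_TAGS:
        raise ValueError(f"unsupported source language: {lang!r}")
    tags = [LANGUAGE_TAGS[lang]]
    if domain_tag is not None:
        tags.append(domain_tag)
    return " ".join(tags + [sentence.strip()])


# Default setting: language tag only.
print(tag_source("Le patient a de la fièvre.", "fr"))
# Biomedical setting: the same sentence with the domain tag added.
print(tag_source("Le patient a de la fièvre.", "fr", MEDICAL_TAG))
```

Because the tag is just another token in the input, the same trained model can be steered at inference time without any change to its parameters.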
Indeed, by enabling two ways of varying the input (domain and language), our model achieves better translation, as determined by a standard measure (called BLEU, for bilingual evaluation understudy) that counts overlapping sequences of words with respect to a reference translation. Interestingly, although the model wasn’t presented with any biomedical data for Korean–English at the learning stage, we found that using the biomedical tag for translation resulted in a different output. In an internal test, we translated a set of biomedical texts and observed an increase of 0.44 BLEU points when the tag was selected. Although the changes are small, they’re often important (see, for example, the changes highlighted with boldface in Table 1). The first two examples in Table 1 show cases for which the translation was more accurate, while the last example actually shows degraded performance.
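To make the idea of counting overlapping word sequences more concrete, here is a simplified, sentence-level sketch of a BLEU-style score. It is not the evaluation code behind the figures in this post: reported scores are normally computed over a whole test set with a standard implementation such as sacreBLEU, and the whitespace tokenisation, single reference, lack of smoothing and function names below are all simplifying assumptions.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp: List[str], ref: List[str], n: int) -> float:
    """Share of hypothesis n-grams that also occur in the reference.

    Counts are clipped so that a repeated n-gram is not rewarded more
    often than it appears in the reference.
    """
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty (0-100)."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any missing n-gram order gives a score of zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalise hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(log_mean)


reference = "the patient has a high fever"
print(simple_bleu("the patient has a high fever", reference))
print(simple_bleu("the patient has a high temperature", reference))
```

A single changed word lowers the precision at every n-gram order, which is why a small difference in wording can show up as a measurable change in the score.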
We’re currently further exploring the transfer of such information in multilingual and multi-domain models. We’ve already proven that the use of flexible control tokens makes summarisation more faithful (4) to the original documents, and we are interested in how the interplay between different types of control tokens affects translation. In particular, the transfer of knowledge across languages and domains could open up a wide range of exciting uses for natural language generation models, even for languages or domains where no training data is easily available.
We tested our model against competing machine translation models using the BLEU measure. BLEU is a metric that scores a piece of machine-translated text by comparing it with the corresponding human-translated text. Table 2 compares the BLEU values obtained by our model with those obtained on standard benchmarks in the field of machine translation, which are provided in regular competitions. Whenever available, we also compare against the best-performing model (as reported in the corresponding competition).
[a] newstest2019.de-en, newstest2014.fr-en, newstest2013.es-en.
[b] Test sets (to English) from the WMT18 and WMT19 biomedical translation tasks. Results obtained using the <medical> tag.
[c] IWSLT17-test (for all but Spanish) and IWSLT16-test (for Spanish).
Note that slightly better results can be obtained by using ensemble models, but for simplicity of use we are releasing our single best model.
Detailed instructions are here. You will need a local copy of the fairseq toolkit. The version of our translation model that we’re releasing also requires a minimal amount of additional code (which we’re also releasing) that you’ll need to add in order to start translating. This additional code takes care of preprocessing the sentences you want to translate: while we rely on standard tools for tokenisation (SentencePiece), we are adding a necessary script through which the input data must be passed.
Other than that, just follow the instructions and… happy translating!
* Some numbers were updated so that the evaluation matches the exact conditions used in the reference competitions. The latest figures are available on GitHub.