EMNLP 2019 was in Hong Kong last week. It received a record-breaking number of submissions, but here is a very biased selection of what we liked from what we saw. NAVER LABS Europe had 8 presentations at the conference. If you’re interested in what we did, take a look at our publications.
Workshops
FEVER
The second edition of the Fact Extraction and Verification workshop came once again with a challenge, which was slightly modified this year. The dataset was the same, but submitters had to follow the build-it, break-it, fix-it paradigm. In addition to building their own models, participants had to break other people's systems and see how they could address (fix) the attacks.
Sameer Singh gave an invited talk on adversarial attacks on Q&A systems, over both knowledge graphs and unstructured text. With only one well-chosen additional link, the method under attack saw a drop of 10 points. In her talk, Emine Yilmaz mentioned current challenges in fact checking:
- Include explanations of decisions
- Incorporate source bias information into fact checking, and in general investigate the relationship between bias and truthfulness
- Multimodal fact-checking, including images & videos
- Fact checking using multilingual resources
- Early stage detection and mitigation of misinformation
DISCO-MT
The Fourth Workshop on Discourse in Machine Translation started with a keynote from Prof. Qun Liu (Noah's Ark Lab, Huawei, Hong Kong, China) on the challenges of document-level MT: inconsistent proper name translation, tense errors, pronoun translation errors, inconsistent verb phrases, etc. Approaches to taking these discourse- or document-level phenomena into account range from preprocessing and post-processing (see also the DocRepair paper at the main EMNLP conference, a very simple but efficient context-aware monolingual repair method for Neural Machine Translation) to neural architectures for doc2doc translation, along with work on evaluation and datasets.
One poster (When and Why is Document-level Context Useful in Neural Machine Translation?) pointed out the need for caution in reporting document-level NMT improvements, which often cannot be interpreted as actually utilising the context. The authors hypothesise that a dominant cause of the improvements from document-level NMT is in fact regularisation of the model, and they advocate a fair evaluation of document-level NMT methods in which one first builds the strongest possible sentence-level NMT baseline (using more data or applying proper regularisation).
The invited talk highlighted the following challenges of document-level MT: very long documents (novels, theses, articles), linguistically informed approaches (entities/relations, anaphora/ellipsis/coreference), evaluation metrics, and data-efficient doc2doc architectures.
DeepLo workshop
The second DeepLo workshop attracted a lot of attention this year and ended up being one of the biggest EMNLP workshops (90 submissions, 40% acceptance rate, ~200 participants).
Most of the work was around multilingual transfer: how to transfer various NLP models from higher-resource languages to low-resource settings. This is usually done either with MUSE (multilingual word embeddings) or m-BERT (multilingual BERT), adding some additional tricks/layers around these pre-trained representations depending on the task at hand.
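To make that recipe concrete, here is a minimal sketch of the typical setup, written with the Hugging Face transformers library (our choice here, not prescribed by any particular paper): a shared m-BERT encoder with a small task-specific head on top, trained on a high-resource language and applied as-is to the target language. The model name and the classification head are illustrative.

```python
# A minimal sketch of the common transfer recipe: multilingual BERT as a
# shared encoder plus a small task head, trained on a high-resource language
# and applied zero-shot to a low-resource one.
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

class TransferClassifier(nn.Module):
    def __init__(self, encoder, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, **inputs):
        # The [CLS] representation feeds the task head; the encoder is shared
        # across languages, which is what enables zero-shot transfer.
        hidden = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.head(hidden)

model = TransferClassifier(encoder)
batch = tokenizer(["A high-resource training sentence."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(**batch)  # train on e.g. English, evaluate on the target language
```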
Several pieces of work were also presented on domain or task transfer. Part of B. Plank's talk was devoted to this: she talked about data selection to avoid negative transfer from one domain to another. An interesting work called 'Evaluating Lottery Tickets Under Distributional Shifts' shows that lottery tickets obtained on one domain still perform reasonably well on other domains (with experiments on a text classification task).
Finally, there were plenty of tips/tricks (often task specific) on how to augment/generalise/reuse data across the tasks to improve the performance in low-resource settings. Check out the proceedings where there are many interesting papers.
Trends
NMT
On search errors versus model errors for NMT. Many people have already tried to understand whether NMT problems come from training (the model) or from beam search (the decoding). Two papers at EMNLP studied this.
The first one is “On NMT Search Errors and Model Errors: Cat Got Your Tongue?”, which brings two contributions. On the one hand, it introduces an exact search procedure, i.e. decoding without any search errors;
on the other hand, it finds that the globally optimal solution under the model is actually the empty sequence in more than 50% of cases.
This explains the observation, made by several researchers, that NMT performance often drops when the beam size increases. More importantly, it also points out that our models are actually not that good at estimating p(y|x).
Another paper, from Facebook, is called Simple and Effective Noisy Channel Modeling for Neural Machine Translation. It goes back to the SMT-style objective and uses p(x|y)*p(y) as the decoding objective. This model showed good performance at WMT this year and, more importantly, it could potentially overcome the problem the first work raises. One indicator that this is indeed the case is the experiment showing that such a model obtains bigger gains with larger beams.
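As an illustration only (not the paper's actual online decoder), here is a toy reranking sketch of the noisy-channel idea: candidates are scored by combining a direct model p(y|x), a reverse ("channel") model p(x|y) and a language model p(y). The three scoring functions and the interpolation weights are placeholders.

```python
# Illustrative reranking with a noisy-channel style objective. The scoring
# functions are placeholders for log p(y|x), log p(x|y) and log p(y);
# the weights lam and mu are made up.
def noisy_channel_score(x, y, log_p_direct, log_p_channel, log_p_lm,
                        lam=1.0, mu=1.0):
    # Direct score plus the SMT-style channel + LM terms mentioned above.
    return log_p_direct(y, x) + lam * log_p_channel(x, y) + mu * log_p_lm(y)

def rerank(x, candidates, log_p_direct, log_p_channel, log_p_lm):
    # Pick the candidate translation with the best combined score.
    return max(candidates,
               key=lambda y: noisy_channel_score(x, y, log_p_direct,
                                                 log_p_channel, log_p_lm))
```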
Non-monotonic decoding. There is more and more work on non-autoregressive models and non-left-to-right decoding. In Insertion-based Decoding with Automatically Inferred Generation Order, J. Gu et al. predict the decoding order first and the tokens afterwards. The search-adaptive order seems to give slightly better results than left-to-right decoding, but at the cost of scalability; follow-up work by the same author, the Levenshtein Transformer (accepted at NeurIPS), seems to scale better.
Facebook’s Mask-Predict: Parallel Decoding of Conditional Masked Language Models is a BERT-style model which first predicts the length of the target sequence and then attempts to unmask all of its tokens. The unmasking is done in order of decreasing probability. Although a single iteration doesn’t seem to work well, spreading it over a fixed number of iterations (unmasking the easiest tokens first and gradually moving to harder ones, each iteration conditioned on the previously unmasked tokens) improves things a lot. Combined with distillation, it almost reaches autoregressive model performance. The future challenge (stated by the author) is to get rid of the distillation step, which is still a key ingredient in NAT models. A similar (although much simpler) idea applied to dialogue is Attending to Future Tokens for Bidirectional Sequence Generation.
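Here is a rough, unofficial sketch of the Mask-Predict decoding loop; `model` is a placeholder that is assumed to return a token and a probability for every target position, and the linear re-masking schedule follows the description above.

```python
MASK = "<mask>"  # placeholder mask symbol

def mask_predict(source, model, length, iterations=10):
    # model(source, target) is assumed to return, for every position,
    # (predicted tokens, probabilities), masked or not.
    target = [MASK] * length            # start fully masked
    probs = [0.0] * length
    for t in range(iterations):
        preds, pred_probs = model(source, target)
        for i in range(length):
            if target[i] == MASK:       # only masked positions are updated
                target[i], probs[i] = preds[i], pred_probs[i]
        n_mask = int(length * (iterations - t - 1) / iterations)
        if n_mask == 0:
            break
        # Re-mask the least confident tokens; they will be re-predicted,
        # conditioned on the more confident ones, in the next iteration.
        for i in sorted(range(length), key=lambda i: probs[i])[:n_mask]:
            target[i] = MASK
    return target
```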
In case you haven’t had enough, you may also want to check out Hint-Based Training for Non-Autoregressive Machine Translation, as well as a couple of (rejected) works mentioned by K. Cho during his invited talk.
Sparser/smaller/faster models
Transformer models are highly over-parameterised, so several pieces of work try to come up with sparser models. In his invited talk at the WNGT workshop, M. Auli from Facebook presented several lines of work going towards sparser and faster models. One is on Lightweight and Dynamic Convolutions, which replace self-attention with cheap depthwise convolutions whose softmax-normalised kernels can be predicted dynamically, making decoding faster (see the sketch after this paragraph). Another one is the depth-adaptive Transformer, which observes that some tokens are easier to generate than others, so the model does not necessarily require a very deep decoder for the simple ones; the model adapts its depth per token by predicting the number of decoder layers to use based on the input.
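To give an idea of the first ingredient, here is a minimal PyTorch sketch of a lightweight convolution (softmax-normalised, depthwise, with kernel weights shared within each head). Shapes and the absence of dynamic kernel prediction are our simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, weight, heads):
    """Sketch of a lightweight convolution used as a cheaper, fixed-span
    replacement for self-attention.
    x: (batch, time, channels); weight: (heads, kernel_size)."""
    B, T, C = x.shape
    K = weight.size(1)
    w = F.softmax(weight, dim=-1)                # normalise each kernel
    w = w.repeat_interleave(C // heads, dim=0)   # share weights within a head
    x = x.transpose(1, 2)                        # (B, C, T) for conv1d
    out = F.conv1d(x, w.unsqueeze(1), padding=K // 2, groups=C)
    return out.transpose(1, 2)                   # back to (B, T, C)

x = torch.randn(2, 10, 8)
weight = torch.randn(4, 3)                       # 4 heads, kernel size 3
y = lightweight_conv(x, weight, heads=4)         # same shape as x
```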
A main conference paper (Adaptively Sparse Transformers) introduces alpha-entmax, which generalises softmax, argmax and sparsemax. The learnable parameter alpha can be optimised, and this leads to slight improvements in BLEU for NMT over a Transformer/softmax baseline. Qualitative analysis also shows that it reduces head redundancy and leads to clearer and more interpretable head roles (pip install entmax; see the short usage sketch after this paragraph). Google proposes residual adapters (Simple, Scalable Adaptation for Neural Machine Translation), which are competitive with fine-tuning while the latter requires maintaining a separate model for each target task. Residual adapters scale well to domain adaptation and multilingual NMT, and in the paper's results they ‘almost’ reach bilingual performance for high-resource languages.
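A quick usage sketch of the entmax package mentioned above (function names taken from its public API; treat the exact signatures as indicative): alpha = 1 recovers softmax, alpha = 2 recovers sparsemax, and values in between give progressively sparser distributions.

```python
import torch
from entmax import sparsemax, entmax15, entmax_bisect

scores = torch.randn(1, 8)                          # e.g. one row of attention logits
p_soft = torch.softmax(scores, dim=-1)              # dense: no exact zeros
p_15 = entmax15(scores, dim=-1)                     # alpha = 1.5: some exact zeros
p_sparse = sparsemax(scores, dim=-1)                # alpha = 2: sparsest of the three
p_alpha = entmax_bisect(scores, alpha=1.3, dim=-1)  # arbitrary (learnable) alpha
```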
An interesting architecture is also proposed for decoding into two target languages at the same time (Synchronously Generating Two Languages with Interactive Decoding). At each decoding step, the decoder keeps two states (h1, h2), one per target language (l1, l2), and the next word in each language is generated by taking advantage of both sources of information.
Analysis, Methodology
Several papers addressed corpus bias issues as well as reproducibility issues in the fast-evolving NLP domain. For instance, one paper (Show Your Work: Improved Reporting of Experimental Results) emphasized that NLP researchers do not give enough details on their hyperparameter optimization strategies. The authors introduce expected validation performance, show that the best-performing model actually depends on the hyperparameter tuning budget, and propose a useful reproducibility checklist.
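A small sketch of the expected-validation-performance idea: given the validation scores of N random hyperparameter trials, estimate the expected best score a practitioner would see under a budget of n trials. The numbers below are made up; this is a sketch of the statistic, not the paper's code.

```python
def expected_max(scores, n):
    """Expected maximum of n i.i.d. draws from the empirical distribution
    of observed validation scores."""
    N = len(scores)
    scores = sorted(scores)
    # P(max of n draws equals the i-th smallest score) = (i/N)^n - ((i-1)/N)^n
    return sum(s * ((i / N) ** n - ((i - 1) / N) ** n)
               for i, s in enumerate(scores, start=1))

trials = [0.71, 0.74, 0.69, 0.80, 0.77, 0.73]   # made-up validation accuracies
for budget in (1, 3, 6):
    print(budget, round(expected_max(trials, budget), 3))
```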
As far as corpus biases are concerned, several papers also highlight that datasets obtained via crowdsourcing can display important annotator biases, which raises the question of whether we are really modelling the task or the annotator. Others try to find ways to deal with existing bias. For example, ‘Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual‘, presented at the DeepLo workshop, proposes to decompose the model into a ‘biased’ classifier and an ‘unbiased’ classifier, where the biased classifier is learnt on ‘biased features’ and the unbiased classifier aims at fitting the residual of the biased model. Only the unbiased classifier is used at prediction time.
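A rough sketch of that residual-fitting idea, written as a product-of-experts-style combination in log space; `biased_model`, `main_model` and `bias_feats` are placeholders, and this is not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def debiased_training_step(main_model, biased_model, bias_feats, inputs,
                           labels, optimizer):
    # The biased classifier is frozen: it sees only the bias features and
    # its log-probabilities are added to the main model's logits, so the
    # main model only has to explain what the biased one cannot.
    with torch.no_grad():
        biased_logp = F.log_softmax(biased_model(bias_feats), dim=-1)
    main_logits = main_model(inputs)
    combined = F.log_softmax(main_logits + biased_logp, dim=-1)
    loss = F.nll_loss(combined, labels)   # gradients flow only into main_model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time, only main_model(inputs) is used for prediction.
```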
Universal Adversarial Triggers for Attacking and Analyzing NLP by Eric Wallace et al. also addresses the problem of model robustness with respect to dataset bias. In particular, it points out that there exist adversarial triggers which, when appended to any input, always make the model predict the same output. The work searches for such triggers and analyses them on various NLP tasks.
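The gradient-based search that finds the triggers is more involved, but measuring the effect of a candidate trigger is simple. The sketch below (with a placeholder `classify` function) just prepends a trigger to every input and reports how often the prediction becomes the attacker's target label; it illustrates the evaluation, not the paper's search procedure.

```python
def trigger_success_rate(classify, inputs, trigger_tokens, target_label):
    """Fraction of inputs pushed to the target label once the trigger is
    prepended; `classify` is a placeholder text classifier."""
    triggered = [" ".join(trigger_tokens) + " " + text for text in inputs]
    hits = sum(classify(text) == target_label for text in triggered)
    return hits / len(inputs)
```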
Methodology is also important in low-resource language processing. Towards Realistic Practices In Low-Resource Natural Language Processing: The Development Set points out that many papers on low-resource languages use a development set which is actually bigger than the training set, which is not realistic in true low-resource scenarios (where such a large dev set would rather be used for training). The authors instead propose settings that do not assume a development set, for instance using development languages, where hyperparameters are fixed on other target languages (meta-learning). Their experiments on historical datasets show differences of up to 18% absolute accuracy, which demonstrates that a realistic evaluation of models for low-resource NLP is crucial for estimating real-world performance.
Analysis of BERT/Transformer models. Many works try to analyse monolingual or multilingual BERT (or, more generally, Transformer) representations in different ways.
A very interesting work presented by Elena Voita from Yandex was “The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives”. It analyses how the representations of a Transformer evolve when trained with different objectives: language modelling, masked language modelling, and as an MT encoder. It offers a lot of interesting insights into what information is encoded, and how, at different layers and for different tokens.
There were many attempts to analyse multilingual BERT representations, to understand whether it partitions its parameters between different languages or learns truly multilingual representations. Various techniques are used for this (probing tasks, zero-shot transfer experiments, PWCCA), and the conclusions vary from one work to another. In BERT Is Not an Interlingua and the Bias of Tokenization, from Salesforce, it is shown that m-BERT partitions its parameter space rather than learning a single interlingua representation. However, several other works suggest the opposite conclusion.
Several visualisation toolkits were presented with the same goal of better analysing the models. AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models is a toolkit from AllenAI which got the Best Demo paper award.
Other
Dominance of BERT. A trend which started last year continued this year: pre-trained language models improved performance on almost every task they were tried on. What's more, at least two competitions were won basically by just using a pre-trained LM. The top-3 contenders in the multi-hop reasoning challenge of the TextGraphs workshop all used some variation on top of BERT or XLNet.
A similar scenario played out in the commonsense challenge of the COIN workshop.
Multilingual. A plot shown by Barbara Plank during her talk at the DeepLo workshop clearly illustrates the trend towards multilingual work, or at least towards putting ‘multilingual’ prominently in the title.
One (out of many) interesting pieces of work in this direction is MultiFiT, which proposes to fine-tune cross-lingual language models on your task in your target language using very few labeled examples.
Generation continued to be one of the fastest-growing areas, with at least two challenges: how to learn from less data (or even in a totally unsupervised way), and how to control the text that is generated.
Capsule networks continue to penetrate NLP applications. Capsule networks were introduced a couple of years ago by G. Hinton and colleagues for computer vision tasks, with the goal of combining various low-level features at a higher level in the model (as opposed to CNN models, which collapse all the information into a scalar). One would therefore expect such models to help capture the hierarchical nature of language. The first attempts to apply capsule networks to NLP last year were mostly on classification tasks. This year, people showed their capacity on more complex tasks such as context-aware machine translation (Capsule networks for context based translation), aspect-based sentiment analysis (Capsule Network with Interactive Attention for Aspect-Level Sentiment Classification), and semantic role labeling (Capturing Argument Interaction in Semantic Role Labeling with Capsule Networks).
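For reference, here is a minimal sketch of the squashing non-linearity and routing-by-agreement mechanism at the heart of capsule networks (shapes and iteration count are illustrative, not tied to any of the papers above).

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Capsule "squash" non-linearity: keeps direction, maps length into [0, 1).
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1 + norm2)) * s / (norm2.sqrt() + eps)

def dynamic_routing(u_hat, iterations=3):
    """Minimal sketch of routing-by-agreement.
    u_hat: (batch, in_caps, out_caps, out_dim) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)     # routing logits
    for _ in range(iterations):
        c = F.softmax(b, dim=2).unsqueeze(-1)                  # coupling coefficients
        v = squash((c * u_hat).sum(dim=1), dim=-1)             # output capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)           # agreement update
    return v

u_hat = torch.randn(2, 6, 4, 8)   # 6 input capsules routed to 4 output capsules
v = dynamic_routing(u_hat)        # shape (2, 4, 8)
```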