First of all, it seemed to us that several authors took a step back to analyze the state of the NLP field and the limitations of the many deep neural models proposed in recent years. Two award-winning papers illustrate this trend: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList draws on testing principles from software engineering and proposes a series of tests to identify critical failures in current commercial and state-of-the-art models. Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data questions the use of the term 'language understanding' in many recent papers and argues that a system trained only on form has, a priori, no way of learning meaning.
Another hot topic is the analysis of NLP systems: what they learn, and whether we can make them more explainable. But it now seems to be time to evaluate and criticise the analytical methods themselves and to take stock of the current state of interpretability research. The findings of these papers are, first, that we need to be careful about the metrics we use to evaluate explanation methods (see this and other papers); second, that attention (often used as an explanation) is easy to manipulate with a negligible drop in accuracy; and third, that it is good to confirm interpretability findings with multiple analytical approaches rather than a single one.
While full sessions were dedicated to NLP and ethics (many of them investigating "biases" in NLP systems), a survey paper reviewed 146 papers analyzing "bias" in NLP systems and highlighted the fact that most papers lack clear and consistent conceptualizations of bias. The authors also find that the proposed quantitative techniques for measuring or mitigating "bias" are not always well matched with their motivations, nor do they sufficiently engage with the relevant literature outside of NLP. Their final recommendation is to examine language use in practice by reaching out to the communities potentially affected by NLP systems.
Evaluation was an important subfield, if rather a meta one, at ACL 2020. Many papers pushed for reference-free, model-based automated evaluation using language models. An example of this is BLEURT, a learned evaluation metric based on BERT. Designing Precise and Robust Dialogue Response Evaluators used a strong encoder (RoBERTa) to build a dialogue response evaluator, which seems to generalize better than earlier models. Specific difficulties in evaluating dialogue are also highlighted in this paper, which proposes a reference-free evaluation metric for dialogue generation. Regarding the evaluation of machine translation (MT), comparing human judgements with automatic MT evaluation metrics has been common practice for several years within the framework of the WMT metrics task. However, this methodology is now being questioned: the authors of Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics highlight potential problems in the current best practices for assessing evaluation metrics. Notably, they show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly to the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy.
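The core idea behind learned metrics such as BLEURT can be illustrated with a toy sketch: regress human quality ratings onto features of a (candidate, reference) pair. Everything below is a hypothetical simplification of ours, not the paper's method; the single overlap feature and the tiny training set stand in for BERT's fine-tuned representations and large rating corpora:

```python
def overlap(candidate, reference):
    """Toy feature: fraction of reference tokens that appear in the candidate."""
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / max(len(ref), 1)

def fit_linear(xs, ys):
    """Closed-form 1-D least squares: y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical training triples: (candidate, reference, human rating in [0, 1]).
data = [
    ("the cat sat on the mat", "the cat sat on the mat", 1.0),
    ("a cat is on a mat", "the cat sat on the mat", 0.6),
    ("dogs run fast", "the cat sat on the mat", 0.1),
]
a, b = fit_linear([overlap(c, r) for c, r, _ in data],
                  [y for _, _, y in data])

def learned_metric(candidate, reference):
    """Reference-based but learned: score predicted from features, not BLEU-style counting."""
    return a * overlap(candidate, reference) + b
```

A real system replaces the hand-crafted feature with a fine-tuned encoder, but the training signal (human judgements rather than n-gram matches) is the same.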
As in previous editions of ACL, a substantial body of work still focuses on making neural machine translation (NMT) context-aware, i.e. capable of using information outside the boundaries of the current sentence or segment being translated, in order to improve the translation of inter-sentential discourse phenomena, which strongly affect the fluency and adequacy of document translations. The most notable piece of work is likely Better Document-level Machine Translation with Bayes' Rule, which rewrites the document-level MT objective using Bayes' rule. The resulting formulation is a factorized probability distribution that can be modeled by combining a noisy-channel model (standard sentence-level NMT) with a context-aware language model (such as Transformer-XL). In this way, one can leverage monolingual corpora and sentence-level parallel corpora, which are much larger than the available document-level parallel corpora. The downside of this solution is the generation process, which becomes quite demanding. On a critical note, Does Multi-Encoder Help? A Case Study on Context-Aware Neural Machine Translation highlights that it is hard to disentangle how much of a document-level NMT system's BLEU improvement comes from actually modeling context, and how much comes from the improved regularization provided by the extra contextual input, which in this case acts as a noisy regularizer.
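In our notation (not necessarily the paper's): writing x = x_1 … x_n for the source document and y = y_1 … y_n for its sentence-aligned translation, the Bayes'-rule rewrite is, up to normalization,

```latex
p(y \mid x) \;\propto\; p(x \mid y)\, p(y)
\;\approx\; \prod_{i=1}^{n}
\underbrace{p(x_i \mid y_i)}_{\substack{\text{noisy channel:}\\ \text{sentence-level NMT}}}
\;\cdot\;
\underbrace{p(y_i \mid y_{<i})}_{\substack{\text{context-aware}\\ \text{language model}}}
```

Only the language-model factor sees document context, which is why sentence-level parallel data suffices for the channel model and monolingual documents suffice for the LM; the price is that decoding must search over y under this product, which is what makes generation demanding.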
In multilingual NMT, the main effort focuses on improving the transfer-learning ability of models. Some works address it via language-aware modules (see for instance this and other papers). Low-resource and zero-shot translation can also be improved by leveraging monolingual data. Another important challenge for multilingual models is how to sample data efficiently. Balancing Training for Multilingual Neural Machine Translation proposes a sampling strategy that automatically learns to weight training data in the multilingual setting. In addition to studies in the supervised setting, unsupervised multilingual NMT has also been investigated, achieving promising results using self-knowledge distillation and language-branch knowledge distillation.
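For context, the common baseline that such learned strategies improve upon is temperature-based sampling, where a language pair with n training examples is sampled with probability proportional to n^(1/T). This sketch is our illustration of that baseline, not the paper's learned method, and the corpus sizes are made up:

```python
def temperature_sampling(sizes, T=5.0):
    """Up-sample low-resource languages: sampling probability ∝ size^(1/T).

    T=1 recovers proportional sampling; larger T flattens the
    distribution toward uniform across language pairs.
    """
    weights = {lang: n ** (1.0 / T) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes (sentence pairs per language pair).
sizes = {"en-fr": 40_000_000, "en-gu": 10_000}
print(temperature_sampling(sizes, T=5.0))
```

The learned approach can be seen as replacing this single fixed temperature with per-language weights optimized during training.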
Automatic summarization is receiving increased attention from the community: compared with ACL 2019, this track received 38.5% more submissions (115 in total), against a 27% increase in total paper submissions. In particular, the keynote by Kathleen R. McKeown focused heavily on this task, although the overall topic of that original talk (which included a series of interviews) was language generation at large. Current datasets do not always reflect the problems faced by the community, so a number of papers introduced new datasets, including a large-scale multi-document news summarization task obtained by leveraging Wikipedia events, dialogues about role-playing games, and summaries of book chapters and screenplays.
Because of the high cost of obtaining labels for this task, unsupervised methods have recently received a lot of attention. This paper used general summarization principles (brevity, coverage and fluency) in a reinforcement-learning setting to produce summaries of news articles, similarly to another one that uses comparable generic measures to obtain summaries through discrete optimization. Two papers addressed unsupervised multi-document summarization of reviews, both leveraging the idea that one review can serve as a summary of the others. This can be done either with auto-encoders augmented with latent variables (e.g. type of product, score), or by noising existing reviews. An upcoming paper at ICML 2020 (Pegasus) uses a similar self-supervised approach, as does our recent work, where we additionally use control tokens.
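The "one review can summarize the others" intuition has a simple extractive baseline: pick the review closest, on average, to all the others. This sketch uses token-set Jaccard similarity and made-up reviews; the papers above instead learn latent-variable or denoising models around the same idea:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def centroid_review(reviews):
    """Return the review most similar, on average, to all the others."""
    def avg_sim(r):
        others = [o for o in reviews if o is not r]
        return sum(jaccard(r, o) for o in others) / len(others)
    return max(reviews, key=avg_sim)

reviews = [
    "great battery life and a sharp screen",
    "battery life is great and the screen is sharp",
    "the battery lasts long and the screen looks sharp",
    "shipping box was slightly dented",
]
```

Here the off-topic outlier review scores lowest, so the "summary" picked is one of the consensus reviews; the neural variants generate such a consensus review rather than extracting it.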
Another unsupervised opinion-summarization work proposes to go through extracted aspects: the model is trained to reproduce reviews from those aspects, and at inference time the desired aspects (extracted from the original reviews) can be selected to produce a textual realization of them.
A major concern with text generation from language models is their lack of grounding in the end user's context (the interlocutor in a dialogue, the player of an interactive fiction, etc.). Image-Chat: Engaging Grounded Conversations grounds conversation by using images as visual prompts. The authors crowdsourced a dataset of conversations between humans who were asked to adopt specific style traits (optimistic, skeptical, frivolous, etc.) and to be engaging. When trained on this dataset, a hybrid model combining images, styles and text performs well in chit-chat. Grounding Conversations with Improvised Dialogues draws inspiration from improvisational theatre, whose dialogue focuses on building common ground. The collected corpus is used to fine-tune DialoGPT, with notable improvements according to human evaluators. Finally, Automatic detection of generated text is easiest when humans are fooled shows that human raters and discriminating models differ in how they assess generated text, revealing the importance of both human and automatic detectors for assessing the quality of text generation systems.
A cross-cutting topic we saw come up quite often is curriculum learning, which had previously been explored in NMT. This year, several pieces of work at ACL extended it to multi-domain NMT, NLU and speech translation. Norm-Based Curriculum Learning for Neural Machine Translation proposes a curriculum based on sentence difficulty and shows that the model can complete training 2-3 times faster and gain more than 1 BLEU point on the WMT14 En-De and WMT17 Zh-En translation tasks. Learning a Multi-Domain Curriculum for Neural Machine Translation proposes a curriculum that weights various features of domain relevance to construct multi-domain batches; this work shows that the constructed curricula yield better domain robustness. Curriculum Learning for Natural Language Understanding defines sample difficulty based on the underlying model and the gold annotations of an NLU task (with no predefined heuristics, unlike most previous work), and shows stable improvements across all the NLU tasks on which it is evaluated. Curriculum Pre-training for End-to-End Speech Translation uses a curriculum for pre-training: speech transcription serves as the elementary course, followed by two advanced courses. One is a so-called frame-based masked language model (FBML), which applies word masking directly to the speech signal for language understanding; the other is frame-based bilingual lexicon translation, which maps words between the two languages.
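A minimal sketch of the general easy-to-hard recipe these papers share (our simplification, not any single paper's algorithm): assign each example a difficulty score, and at each training step sample batches only from the easiest fraction of the data, a fraction that grows with a competence schedule:

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Square-root competence schedule: fraction of data available at `step`.

    Starts at c0 and reaches 1.0 (all data) at total_steps.
    """
    return min(1.0, math.sqrt(step / total_steps * (1 - c0 ** 2) + c0 ** 2))

def curriculum_batch(examples, difficulties, step, total_steps, k=2):
    """Sample a batch of k examples from the easiest competence-fraction of data."""
    order = sorted(range(len(examples)), key=lambda i: difficulties[i])
    cutoff = max(k, int(competence(step, total_steps) * len(examples)))
    return [examples[i] for i in random.sample(order[:cutoff], k)]
```

In the norm-based paper the difficulty score is derived from word-embedding norms; the NLU paper instead derives it from the model's own performance on gold annotations, but both fit this sample-the-easy-prefix pattern.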
The 17th International Conference on Spoken Language Translation (IWSLT) featured several shared tasks on spoken language translation. Two main approaches dominate speech translation (ST) research: cascade systems operate in two steps, with source-language automatic speech recognition (ASR) followed by source-to-target text machine translation (MT), while recent work has attempted to build end-to-end ST systems that do not use the source-language transcription during decoding. The latest results of the IWSLT 2020 shared task on offline speech translation demonstrate that end-to-end models are now on a par with (if not better than) their cascade counterparts. Another highlight of this workshop was the first-ever simultaneous speech translation shared task, in which systems start generating output before the entire input sequence (text or speech) is available. Both simultaneous speech and text translation sub-tracks were proposed, paving the way for reproducible research on such a rapidly evolving topic. Another workshop (AUTOSIMTRANS) dealt with a similar topic, as did regular ACL papers: simultaneous ST systems based on segmenter/encoder/decoder models, and dynamic read/write policies for simultaneous ST.
While going virtual was obviously the major change brought about by the current pandemic, there was another significant change at this year's conference. In a very short time frame, a special workshop with an ongoing reviewing process was put in place to present and showcase NLP research that can help combat COVID-19. Two of the main resources presented and/or heavily used by papers in that workshop were CORD-19, a dataset of scientific papers related to the virus, and tweets. The workshop was a victim of its own success: while 17 papers were presented, there were 75 submissions. Because most of them (including our own NMT model) were submitted too close to the workshop itself, it was impossible to review them in time. The organizers were however both reactive and flexible, and have already confirmed a second session at EMNLP in November.
A takeaway from this workshop is that Conversational AI is coming of age and is now growing up in the real world, where bias is not just a dataset issue, where people speak more than they write, where an agent has to be both eloquent and efficient, and where spoken turns have to be grounded and coherent. Dilek Hakkani-Tur talked about the Alexa Prize and, more generally, about Knowledge-Grounded Social Conversational AI. Robustness becomes a major issue in a noisy world of accents and less-than-perfect English (where, alas, the focus seems to remain), and efficiently modeling ASR ambiguity might help, as explained by Yun-Nung Chen during her lecture on Robust and Scalable Conversational AI. More than in other subfields, dialogue is where NLU meets NLG. It is also where speech and text meet semantics, and it makes sense to train models in parallel. Finally, Jesse Thomason, in his talk on Language Grounding with Robots, highlighted the vastness of the world in which our embodied chatbots live, which adds to the complexity of language.