This year’s conference was the largest ACL ever and, although “exponential growth” has become a common expression when describing AI conferences, an 88% increase in submissions clearly took things to another level.
This increase was the topic of many discussions in the hallways, but also of the presidential address, where Ming Zhou (Microsoft Research Asia) expressed concern that some regions, e.g. Africa and Central and South America, are trailing behind. A dedicated discussion was held in the ACL business meeting on how to guarantee good reviews while still handling the growing number of submissions. Three ideas were discussed:
1/ Papers are placed somewhere central, anonymously, during review. After review, they can be made public or taken down.
- variation 1: this is not optional
- variation 2: reviews are also made public and outsiders can comment
2/ The Program Committee of one ACL conference can recommend that a rejected paper be revised and resubmitted to the next conference along with its reviews. The Area Chair might then accept the paper upon receipt without further review, or with a reduced review.
3/ Papers are reviewed on a monthly basis, so that reviewing is continuous and authors can respond to reviews and revise accordingly.
The last year and a half have seen some impressive leaps in natural language processing, most notably in transfer learning through token-based language modelling and unsupervised machine translation. This year’s ACL had many papers that consolidated these results.
In this post, we’ll cover some of the major topics we saw and mention a few papers which caught our attention. The usual caveats apply: there were 6 parallel tracks (including – unfortunately – posters, so you had to choose between those and talks) and we didn’t see everything.
The main conference was recorded in its entirety, and the videos are available on the revamped and easy-to-browse ACL Anthology.
Generation + Summarization
These tracks had some of the largest numbers of submissions and the highest increases over previous years. Annotations here are particularly expensive to obtain, so several papers tried unsupervised approaches, using auto-encoders (e.g. Unsupervised Neural Text Simplification), language models (Simple Unsupervised Summarization by Contextual Matching), graph-based models (Sentence Centrality Revisited for Unsupervised Summarization), or document structure (Inducing Document Structure for Aspect-based Summarization). Some baselines for e-mail subject lines were also presented (This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation).
Departing from the trend of new models or tasks, A Simple Theoretical Model of Importance for Summarization (which received an outstanding paper award) proposed properties that a good summary should have from an information-theoretic perspective, using entropy (to penalize redundancy in the summary), cross-entropy (for relevance to the source) and Kullback-Leibler divergence (to unify both). This will most certainly guide future work on evaluating summarization.
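As a toy illustration of these quantities (not the paper’s exact formulation), they can be computed over unigram distributions of a summary and its source; the `distribution` helper and the example texts below are our own:

```python
import math
from collections import Counter

def distribution(tokens):
    """Unigram probability distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def entropy(p):
    # high entropy = diverse summary, low entropy = redundant one
    return -sum(pw * math.log(pw) for pw in p.values())

def cross_entropy(p, q, eps=1e-12):
    # CE(p, q) = -sum_w p(w) log q(w); eps guards words absent from q
    return -sum(pw * math.log(q.get(w, eps)) for w, pw in p.items())

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) = CE(p, q) - H(p); zero iff the distributions match
    return cross_entropy(p, q, eps) - entropy(p)

summary = "the cat sat".split()
source = "the cat sat on the mat near the cat".split()
p, q = distribution(summary), distribution(source)
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```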
Many models suffer from hallucination (generating text with non-existent content) or omit important content, so A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation proposed a way to improve training data by looking for badly paired examples.
To go beyond descriptions and generate a coherent sequence of actions involving consistent characters, Strategies for Structuring Story Generation presented a strategy that, starting from a prompt, first generates an action plan, then a generic story with placeholders for named entities, and finally the full story.
The community is grappling with the field’s increasing impact on real life, and with the fact that the data models are trained on often encodes, explicitly or implicitly, some form of bias. There were dedicated tracks and several related workshops. Word-order Biases in Deep-agent Emergent Communication studies this problem on synthetic data, while Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them shows that existing debiasing methods are far from perfect.
Revisiting Low-Resource Neural Machine Translation: A Case Study shows how hyperparameter tuning plays a key role and results in Neural Machine Translation beating Phrase-Based Statistical Machine Translation even in a low-resource setting. This reads like an answer to an earlier paper by Koehn.
Sparse Sequence-to-Sequence Models proposes a generalization of softmax, parametrized by α. For α=1, you have softmax. For larger α, you get a peakier distribution with few non-zeros. They see it as a sparse softmax. This is valuable for reducing the search space of, for instance, a machine translation (MT) decoder (for online use, or so-called near-exact beam search). Sounds intriguing and attractive.
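The α=2 member of this family is sparsemax, which can be sketched in a few lines of NumPy (the paper’s general α-entmax needs more machinery; this function is our own illustration):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the logits onto the probability
    simplex. Unlike softmax, many outputs are exactly zero."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    # support size: largest k with 1 + k * z_(k) > sum of top-k logits
    k_star = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_star - 1] - 1) / k_star  # threshold subtracted from logits
    return np.maximum(z - tau, 0.0)

print(sparsemax([2.0, 1.0, 0.1]))  # -> [1. 0. 0.], a genuinely sparse distribution
```

The zeros are what make pruning the decoder’s search space possible: hypotheses with zero probability can simply be dropped.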
An Effective Approach to Unsupervised Machine Translation beats the best WMT supervised model from 2014.
The evaluation track was very well attended, showing that the community pays attention to this topic, despite it being a relatively unamusing one. Deep Dominance – How to Properly Compare Deep Neural Models proposed a test to compare the results of deep learning models that’s claimed to be more powerful than other tests while remaining non-parametric and assumption-free. The need is there, since many neural network models produce results with quite a bit of variability due to the random initialization seed. This test could become very valuable for the ML community.
Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation proposed a metric which correlates better than BLEU with human judgements. It relies on BERT embeddings and is recall oriented. A supervised version of it appears to also do a great job, although it might be worth searching for pathological or adversarial cases that mislead it.
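The recall-oriented idea can be sketched with stand-in token embeddings (toy NumPy vectors replace the real BERT embeddings, and the helper below is ours, not the paper’s code):

```python
import numpy as np

def embedding_recall(ref_vecs, hyp_vecs):
    """Average, over reference tokens, of the best cosine similarity to any
    hypothesis token: recall-oriented, unlike n-gram precision metrics."""
    ref = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    hyp = hyp_vecs / np.linalg.norm(hyp_vecs, axis=1, keepdims=True)
    sim = ref @ hyp.T  # [ref_tokens, hyp_tokens] cosine similarities
    return float(sim.max(axis=1).mean())

# toy check: the hypothesis covers 2 of 3 (orthogonal) reference tokens
ref = np.eye(3)
hyp = np.eye(3)[:2]
print(embedding_recall(ref, hyp))  # -> 0.666...
```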
Other papers we found interesting include:
Augmenting Neural Networks with First-order Logic: The idea here is to augment the neural computation graph with first-order logic using soft differentiable versions of conjunction and disjunction. The experimental results (on reading comprehension and natural language inference) show that augmented networks work better, especially when the training dataset is small. One exception: when lots of training data is available, constraints can hurt, so in that case trust your data.
Modeling Semantic Compositionality with Sememe Knowledge: looks at how much the meaning of a multiword expression (MWE) can be composed from its constituents and investigates “sememe-based semantic compositionality degree computation formulae”.
Predicting Humorousness and Metaphor Novelty with Gaussian Process Preference Learning: The true meaning of creative language differs from a shallow interpretation. This paper introduces a Bayesian approach, Gaussian Process Preference Learning (GPPL) that can use sparse pairwise annotations to estimate humor or novelty scores given word embeddings and linguistic features.
Liang Huang (Baidu Research & Oregon State University) presented Simultaneous Translation: Recent Advances and Remaining Challenges, as even the best human interpreters struggle to sustain high quality. The main difficulty is the word-order difference between languages (e.g. subject-object-verb for German and subject-verb-object for English). As waiting for full sentences is annoying to users, he suggests waiting for k words, where k controls the aggressiveness/conservatism of the interpretation. At ACL 2019, he introduced a variable-latency approach with an on-the-fly decision to READ (wait for more words) or WRITE (commit part of the interpretation). Huge challenges still remain however, e.g. speech (ASR) for a start!
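The fixed-latency version of this idea, wait-k, is easy to sketch. The adaptive policy he presented learns when to take each action; here the schedule is deterministic, and the function below is our own illustration:

```python
def wait_k_schedule(source_len, target_len, k):
    """READ/WRITE actions for a wait-k policy: read k source words first,
    then alternate write/read, writing freely once the source is exhausted."""
    actions = []
    read, written = 0, 0
    while written < target_len:
        if read < min(source_len, k + written):
            actions.append("READ")   # consume one more source word
            read += 1
        else:
            actions.append("WRITE")  # commit one target word
            written += 1
    return actions

# wait-2 on a 4-word sentence: translation starts 2 words behind the speaker
print(wait_k_schedule(4, 4, 2))
```

Smaller k means lower latency but riskier guesses about words the speaker hasn’t said yet.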
Pascale Fung, Director of CAiRE, Hong Kong University of Science and Technology, has a long history with conversational agents, for which there’s a growing demand. These agents can differ quite a bit e.g. a chit-chatty conversational system (just wants to be nice, tries to answer naturally, thinks more turns is better…) does not behave like a task oriented personal assistant (wants to help users, needs to track states, generates actions, minimizes the number of turns…).
Multi-turn conversations (domain classification, intention detection, state tracking, natural response generation) and the incorporation of external knowledge (knowledge bases, common sense) remain challenging; end-to-end systems with a deep learning black box at their core tend to replace classic modularized ones. Conversational systems need to learn to memorize (with better memory and the capability to access external KBs), to personalize (using personas that can now be learned from dialogs) and to empathize (as it helps with engagement).
Judging by recent controversy, ethics is a major concern, with plenty to worry about: biases, humans being replaced by systems, social relations becoming virtual, invasion of privacy… The community needs to identify norms and values, implement the norms and evaluate the systems.
WMT News Task
In this task, human evaluators were given the context of the sentences at the document level (see image below). There were 263 individual researcher accounts and 766 turker accounts used in the evaluation. They found that document-level rating had low statistical power, so they used segment-level rating with document-level context.
The evaluation was done with direct assessment, in a bilingual setting when the source language was English, and in a monolingual setting otherwise. An advantage of bilingual evaluation is that the reference translation can be evaluated alongside the MT systems to test for human parity.
In the En->De and En->Ru tasks, the human evaluation found no statistical difference between the best MT models and the reference (i.e., “human parity” was reached in this particular setting). There was a lot of discussion (also in the panel) about whether human parity had actually been reached in this setting, and whether it was healthy for the field to make such claims.
APE at Scale and Its Implications on MT Evaluation Biases: the authors trained an automatic post-editing system on synthetic data only (round-trip translated monolingual text) and applied it to the output of the News Translation task. They found that, while BLEU scores decreased, human evaluators judged the quality to be significantly better. It seems that evaluating against human-translated references (i.e. the “standard” evaluation setting of using natural sources and translated references) may disadvantage systems that produce more “natural” text.
Marcin Junczys-Dowmunt from Microsoft, who participated in the News Translation task, made similar remarks. He concluded that systematically reporting scores on the two types of test sets (“translationese” and original language) would be good practice.
Like our own work on the Robustness Task, the “Tagged Back-Translation” paper found that adding a source-side “<BT>” tag to the back-translated sentences in the training data gave significant increases in BLEU, although it’s still not clear why this is the case. Interestingly, the authors also found that the “noised back-translation” of Edunov et al., 2018, had a similar effect: the model is able to identify back-translated sentences because they contain blanks. A direct consequence is that the two methods are not complementary. The Findings of the First Shared Task on Machine Translation Robustness are on arXiv.
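The trick itself is little more than a one-liner on the data side; a minimal sketch (the function name and the (source, target) pair layout are our own):

```python
def tag_back_translations(genuine_pairs, back_translated_pairs, tag="<BT>"):
    """Prefix a special token to the source side of back-translated sentence
    pairs so the model can tell synthetic data from genuine parallel data."""
    tagged = [(f"{tag} {src}", tgt) for src, tgt in back_translated_pairs]
    return genuine_pairs + tagged

data = tag_back_translations(
    [("ein Haus", "a house")],          # genuine parallel data, left untouched
    [("ein Hund", "a dog")])            # back-translated pair, gets the tag
print(data[1][0])  # -> "<BT> ein Hund"
```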
There were 4 invited speakers at RepL4NLP. Marco Baroni (FAIR & Pompeu Fabra University, Barcelona) gave a fascinating talk on the emergence of languages between learning agents. The systems tend to minimize the amount of information being sent, close to the theoretical minimum, maybe as a result of having to communicate through a discrete channel. This also means there’s no room for the emergence of more expressive language than needed, yet the encoding is not efficient: the sending agent has no pressure to obey Zipf’s Law of Abbreviation, and longer messages are easier for the receiver to discriminate. We might need to tell them about their carbon footprint!
Mohit Bansal from the University of North Carolina, Chapel Hill, delivered a solid presentation on adversarially-robust representation learning and Raquel Fernandez, University of Amsterdam, Institute for Logic, Language & Computation shared her research on representations shaped by dialog interaction.
For the last invited talk, Yulia Tsvetkov (CMU LTI) presented stimulating work on replacing the softmax layer with an embedding layer. Although the results are not (yet?) state-of-the-art, the models are much more efficient to train (time- and memory-wise).
The continuous output also opens very interesting opportunities for text generation e.g. think GAN for text.
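At decode time, a continuous output layer amounts to a nearest-neighbor lookup in embedding space; a minimal sketch (toy vectors and function names are ours, and a real system also needs a suitable training loss, as in the talk):

```python
import numpy as np

def nearest_neighbor_decode(predicted_vec, vocab_embeddings, vocab):
    """Continuous-output decoding: the model emits an embedding instead of a
    softmax over the vocabulary; return its nearest neighbor by cosine."""
    v = predicted_vec / np.linalg.norm(predicted_vec)
    E = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    return vocab[int(np.argmax(E @ v))]

vocab = ["cat", "dog", "fish"]
E = np.array([[1.0, 0.0],   # toy 2-d word embeddings
              [0.0, 1.0],
              [0.7, 0.7]])
print(nearest_neighbor_decode(np.array([0.9, 0.1]), E, vocab))  # -> cat
```

The cost of the output layer no longer grows with vocabulary size, which is where the training-time savings come from.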
This workshop was designed to cover two different research families and identify bridges that can enrich both: deep learning, which has already had impressive results, and formal languages, with its decades-long understanding of classes of languages, their properties and their learnability.
The workshop was centered on invited speakers: Kevin Knight gave an historical overview, Ariadna Quattoni talked about spectral learning, Remi Eyraud on distillation, John Kelleher on using neural networks to test the complexity of data, Robert Frank talked about the results of neural nets for learning known classes of languages and, finally, Noah Smith presented work trying to create models which combined the best of both worlds.
There were 14 papers on discourse analysis at ACL this year, presenting novel methods, performances and applications, e.g. chatbots. Challenges remain too: the need for common datasets and evaluation frameworks, for work on languages beyond English, and for learning from limited annotated data.
Giuseppe Carenini (University of British Columbia) explained that discourse analysis is akin to sentence parsing but at the level of paragraphs or whole texts (multiple sentences), where we’re interested in the relations between sentences (explanation, elaboration, result…).
Two main representations are used: the Penn Discourse Treebank (PDTB) and Rhetorical Structure Theory (RST). RST breaks a discourse up to build a tree from Elementary Discourse Units (EDUs). It assigns nuclearity, to identify the important units, and labels relations. The state of the art is not neural, as the datasets are small. PDTB has a flatter structure and is lexically based: discourse relations are often triggered by specific “connective” words, although there are also implicit relations. The datasets are larger but still not big enough for good neural approaches.
Discourse analysis is still rather inaccurate, especially when compared to syntactic parsing, but finding structure is more accurate than labelling relations. The top systems are supervised, but training data is very limited, hence the struggle of neural networks. We need semi-supervised approaches as well as domain adaptation. Work exists in languages other than English, but those languages are also in need of annotated corpora.
In the second part, Shafiq Joty (Nanyang Technological University, Singapore) focused first on coherence, then on synchronous and asynchronous conversational structures. Coherence asks whether the sentences in a text are related; there are many approaches (entity-based, graph-based, using syntax, neural, …) and applications (readability assessment, essay scoring, …). Evaluation is done either by discrimination (trying to distinguish the original text from incoherent ones generated by permutations) or insertion (trying to locate the original position of a sentence removed from a document).
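The discrimination setup is simple to sketch: shuffle a document’s sentences and check that a coherence scorer ranks the original above each permutation (the toy scorer in the usage example is ours; real systems use the coherence models mentioned above):

```python
import random

def permuted_versions(sentences, n, seed=0):
    """n shuffled versions of a document, none equal to the original."""
    rng = random.Random(seed)
    perms = []
    while len(perms) < n:
        p = sentences[:]
        rng.shuffle(p)
        if p != sentences:
            perms.append(p)
    return perms

def discrimination_accuracy(score, original, permutations):
    """Fraction of permutations the scorer ranks below the original."""
    wins = sum(score(original) > score(p) for p in permutations)
    return wins / len(permutations)

# toy scorer: counts adjacent sentence pairs in "correct" (here alphabetical) order
doc = ["a", "b", "c", "d"]
toy_score = lambda d: sum(d[i] < d[i + 1] for i in range(len(d) - 1))
perms = permuted_versions(doc, 10)
print(discrimination_accuracy(toy_score, doc, perms))  # -> 1.0
```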
As for conversations, when they’re synchronous the goal is to disentangle multiple conversations happening simultaneously; when they’re asynchronous, it’s to reconstruct the thread structure. This seems like a very active subfield and Shafiq presented a lot of his own work.
Jen-Tzung Chien (National Chiao Tung University, Taiwan) introduced a packed hall to Deep Bayesian NLP. After describing the motivation behind his work and the many potential applications in the field, he proceeded to compare probabilistic models to neural networks: top-down vs. bottom-up structure, intuitive vs. distributed representations, easy vs. hard to interpret, easy vs. hard to use with semi-supervised or unsupervised data, and easy vs. hard to incorporate domain knowledge. Deep models are very powerful for some problems (e.g. perception), but the question is: can our learning models be both Bayesian and deep?
He took us on a tour of Bayesian learning: inference and optimization, variational Bayesian inference, Markov chain Monte Carlo inference, Bayesian nonparametrics, hierarchical theme and topic models and the nested Indian buffet process; then did the same for deep sequential learning: deep unfolded topic models, gated recurrent neural networks, Bayesian recurrent neural networks, memory-augmented neural networks, sequence-to-sequence learning, convolutional neural networks, dilated neural networks, and attention networks using the transformer.
By this time, we were ready to take Bayes to deep learning with the main idea of using ‘variational’ auto-encoders (VAE): variational recurrent auto-encoders, hierarchical variational auto-encoders, stochastic recurrent neural networks, regularized recurrent neural networks, skip recurrent neural networks, Markov recurrent neural networks and temporal difference variational auto-encoder. Chien concluded by suggesting further improvements in VAE, presenting a 4-page summary and 2 pages of future trends!
These were the highlights from four of us in the NLP team here in France, but there were many more! We met lots of people at the NAVER/NAVER LABS Europe booth at the conference and we even met some new colleagues from the Korean NLP team. There were more than 30 of us in Florence, which shows just how important research in this field is to NAVER, whether it’s in translation, dialog, discourse analysis… You can check out our ACL papers in the anthology. There is, however, one other high point worth a mention for the community. We took advantage of ACL and the booth to present an open-source language we’d just released that helps create and annotate datasets for training. It’s called Tamgu and it created a LOT of interest, so you might want to check it out on GitHub or read the intro blog below.