
NAVER at EMNLP 2020

16th – 20th November 2020

NAVER and NAVER LABS EUROPE are EMNLP 2020 Silver sponsors

NAVER Corporation is Korea’s premier Internet company and a global leader in online services such as NAVER search (30M DAU), LINE messaging (164M MAU) and WEBTOON (62M MAU). NAVER is continuously focused on the future and on seamlessly connecting the physical and digital worlds through advanced technology. Its AI and robotics research in Asia and Europe is fundamental to creating this future. NAVER invests over 25 percent of annual revenue in R&D, yet innovation is only one of its core values: NAVER promotes diversity on the internet, and respects and connects people, helping them share knowledge, create communities and preserve culture.

NAVER LABS Europe

NAVER, NLP and machine learning

NAVER LABS Korea

18th November 2020

Join us on Zoom

Contact details will be shared here before the conference.

Machine Translation, Open Domain QA and Conversational AI research at NAVER, Korea’s leading internet portal

10am – 10:15am CET (UTC/GMT+1)

Machine Translation

Hyunchang Cho and Kweonwoo Jung (Papago, NAVER)
Papago is an online translation service provided by NAVER.
The MT team within Papago focuses on advancing machine translation quality, mostly for East Asian languages such as Korean, Japanese and Chinese.
In this session, we will discuss some of the team’s major research topics and the methods we use to enhance the user experience (e.g. honorific translation and quality estimation).

Papago

10:15am – 10:30am CET (UTC/GMT+1)

Document Information Extraction 

Minjoon Seo (Clova, NAVER)
Clova is the AI-first organization within NAVER & LINE conducting high-impact research in a wide range of domains to empower various AI-driven products in and out of the company. In this session, I will discuss our team’s recent work and ongoing research on end-to-end document information extraction for diverse semi-structured documents, including name cards, receipts, and invoices.

Clova AI

10:30am – 10:45am CET (UTC/GMT+1)

Conversational AI

Kyungduk Kim and Hyeon-gu Lee (NLP, NAVER)

NAVER focuses both on researching NLP technologies and on delivering AI-powered products to customers. In this session, we will share our experience in deploying conversational AI technologies in commercialised AI-powered products such as smart speakers, set-top boxes and vehicle infotainment systems. We’ll also briefly introduce our ongoing work on answer snippet extraction.


5pm – 5:30pm CET (UTC/GMT+1)

NLP research and openings at NAVER LABS Europe

Laurent Besacier, LIG and NLP group lead at NAVER LABS Europe

This presentation is intended for potential academic collaborators and PhD or internship candidates. A short presentation of NLP activities at NAVER LABS Europe and recent highlights will be given, ending with the positions currently open in the group in France.


Enter the virtual venue space and meet us on the Gather.town platform (you need to be a registered EMNLP attendee and logged in). Check out the NAVER booth on Rocket.Chat.

Tuesday, 17th November 2020

10am-11am UTC + 9:00 (2am-3am CET)

Kang Min Yoo,  Research Scientist


11am-12pm UTC + 9:00 (3am-4am CET)

Ji-Hoon Kim,  Research Scientist


8am-8:30am UTC + 1:00 (8am-8:30am CET)

Matthias Gallé, Lab Manager


9am-10am UTC + 1:00 (9am-10am CET)

Laurent Besacier, Research Scientist / NLP group leader


6pm-7pm UTC + 9:00 (10am-11am CET)

Minjoon Seo, Software Engineer


7pm-8pm UTC + 9:00 (11am-12pm CET)

Gyuwan Kim, Software Engineer


5pm-6pm UTC + 1:00 (5pm-6pm CET)

Hady Elsahar,  Research Scientist


Wednesday, 18th November 2020

10am-11am UTC + 9:00 (2am-3am CET)

Kweonwoo Jung, Software Engineer


11am-12pm UTC + 9:00 (3am-4am CET)

Hyunchang Cho,  Research Scientist


8am-9am UTC + 1:00 (8am-9am CET)

Matthias Gallé, Lab Manager


9am-10am UTC + 1:00 (9am-10am CET)

Alexandre Berard, Research Scientist


6pm-7pm UTC + 9:00 (10am-11am CET)

Kyungduk Kim, Software Engineer


7pm-8pm UTC + 9:00 (11am-12pm CET)

Seonhoon Kim, Software Engineer


5pm-6pm UTC + 1:00 (5pm-6pm CET)

German Kruszewski,  Research Scientist


Thursday, 19th November 2020

9am-10am UTC + 1:00 (9am-10am CET)

Jos Rozen, Research Scientist


Publications at EMNLP 2020

Nov 17, 10:00 – 11:00  CET (UTC/GMT+1): 6C

Context-aware answer extraction in question answering
EMNLP 2020 |  Yeon Seonwoo, Ji-Hoon Kim, Jung-Woo Ha, Alice Oh

Extractive QA models have shown very promising performance in predicting the correct answer to a question for a given passage. However, they sometimes result in predicting the correct answer text but in a context irrelevant to the given question. This discrepancy becomes especially important as the number of occurrences of the answer text in a passage increases. To resolve this issue, we propose BLANC (BLock AttentioN for Context prediction) based on two main ideas: context prediction as an auxiliary task in a multi-task learning manner, and a block attention method that learns the context prediction task. With experiments on reading comprehension, we show that BLANC outperforms the state-of-the-art QA models, and the performance gap increases as the number of answer text occurrences increases. We also conduct an experiment of training the models using SQuAD and predicting the supporting facts on HotpotQA and show that BLANC outperforms all baseline models in this zero-shot setting.
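As a rough illustration of the multi-task idea above, the sketch below combines a standard answer-span loss with a weighted auxiliary context-prediction loss. All function and parameter names (including the weight `lam`) are hypothetical; the actual BLANC architecture and its block attention are described in the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, target_idx):
    """Cross-entropy of a single target index against a logit vector."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def blanc_style_loss(start_logits, end_logits, context_logits,
                     start_idx, end_idx, context_idx, lam=0.5):
    """Hypothetical multi-task objective: answer-span loss plus a
    weighted auxiliary context-prediction loss (BLANC-style setup)."""
    span_loss = (softmax_cross_entropy(start_logits, start_idx)
                 + softmax_cross_entropy(end_logits, end_idx))
    ctx_loss = softmax_cross_entropy(context_logits, context_idx)
    return span_loss + lam * ctx_loss
```

With `lam=0` this reduces to the usual extractive-QA span loss; the auxiliary term encourages the model to also identify which block of the passage the answer belongs to.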

Nov 17, 11:00 – 13:00 CET (UTC/GMT+1): 2B 

Variational hierarchical dialog autoencoder for dialog state tracking data augmentation
EMNLP 2020 | Kang Min Yoo, Hanbit Lee, Franck Dernoncourt, Trung Bui, Walter Chang, Sang-goo Lee

Recent works have shown that generative data augmentation, where synthetic samples generated from deep generative models complement the training dataset, benefit NLP tasks. In this work, we extend this approach to the task of dialog state tracking for goal-oriented dialogs. Due to the inherent hierarchical structure of goal-oriented dialogs over utterances and related annotations, the deep generative model must be capable of capturing the coherence among different hierarchies and types of dialog features. We propose the Variational Hierarchical Dialog Autoencoder (VHDA) for modeling the complete aspects of goal-oriented dialogs, including linguistic features and underlying structured annotations, namely speaker information, dialog acts, and goals. The proposed architecture is designed to model each aspect of goal-oriented dialogs using inter-connected latent variables and learns to generate coherent goal-oriented dialogs from the latent spaces. To overcome training issues that arise from training complex variational models, we propose appropriate training strategies. Experiments on various dialog datasets show that our model improves the downstream dialog trackers’ robustness via generative data augmentation. We also discover additional benefits of our unified approach to modeling goal-oriented dialogs – dialog response generation and user simulation, where our model outperforms previous strong baselines.

Nov 17, 19:00 – 21:00 CET (UTC/GMT+1): 3A

Monolingual adapters for zero-shot neural machine translation
EMNLP 2020 | Jerin Philip, Alexandre Berard, Matthias Gallé, Laurent Besacier

We propose a novel adapter layer formalism for adapting multilingual models. They are more parameter-efficient than existing adapter layers while obtaining as good or better performance. The layers are specific to one language (as opposed to bilingual adapters), allowing them to be composed and to generalize to unseen language pairs. In this zero-shot setting, they obtain a median improvement of +2.77 BLEU points over a strong 20-language multilingual Transformer baseline trained on TED talks.
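The composition idea can be sketched as follows: one residual bottleneck adapter per language, with a source adapter applied on the encoder side and a target adapter on the decoder side, so any source/target pair can be assembled even if it was never seen together in training. All names and dimensions here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, BOTTLENECK = 16, 4  # illustrative hidden and bottleneck sizes

def make_adapter(rng):
    """One language-specific adapter: down-projection, ReLU, up-projection."""
    down = rng.normal(scale=0.1, size=(D, BOTTLENECK))
    up = rng.normal(scale=0.1, size=(BOTTLENECK, D))
    return down, up

def apply_adapter(h, adapter):
    down, up = adapter
    return h + np.maximum(h @ down, 0.0) @ up  # residual connection

# Monolingual adapters: one per language, not one per language pair.
adapters = {lang: make_adapter(rng) for lang in ["de", "ko", "fr"]}

def adapt_states(h, src, tgt):
    """Compose the source-language adapter (encoder side) with the
    target-language adapter (decoder side); unseen pairs compose freely."""
    return apply_adapter(apply_adapter(h, adapters[src]), adapters[tgt])
```

With N languages this needs N adapters instead of the N² required by pair-specific (bilingual) adapters, which is where the parameter efficiency and zero-shot generalization come from.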

Findings of EMNLP

Participatory research for low-resourced machine translation: a case study in African languages
Findings of EMNLP | Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon KABONGO KABENAMUALU, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, ghollah kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, Abdallah Bashir

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. “Low-resourced”-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Large product key memory for pre-trained language models
Findings of EMNLP | Gyuwan Kim, Tae-Hwan Jung

Product key memory (PKM), proposed by Lample et al. (2019), makes it possible to improve prediction accuracy by increasing model capacity efficiently with insignificant computational overhead. However, its empirical application has so far been limited to causal language modeling. Motivated by the recent success of pre-trained language models (PLMs), we investigate how to incorporate large PKM into PLMs that can be fine-tuned for a wide variety of downstream NLP tasks. We define a new memory usage metric, and careful observation using this metric reveals that most memory slots remain outdated during the training of PKM-augmented models. To train better PLMs by tackling this issue, we propose simple but effective solutions: (1) initialization from model weights pre-trained without memory and (2) augmenting PKM by addition rather than by replacing a feed-forward network. We verify that both are crucial for the pre-training of PKM-augmented PLMs, enhancing memory utilization and downstream performance. Code and pre-trained weights are available.
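For readers unfamiliar with the underlying mechanism, a minimal sketch of the product-key lookup from Lample et al. (2019) follows: the query is split in half, each half is scored against a small sub-key table, and the top-k candidates of each half are combined into k×k product keys addressing |K1|·|K2| value slots. Names and shapes are illustrative only.

```python
import numpy as np

def pkm_lookup(query, subkeys1, subkeys2, values, k=2):
    """Product-key memory lookup sketch: score each query half against its
    sub-key table, combine top-k per half into k*k candidate product keys,
    then return a softmax-weighted sum of the selected value slots."""
    d = query.shape[0] // 2
    q1, q2 = query[:d], query[d:]
    s1 = subkeys1 @ q1            # scores for the first query half
    s2 = subkeys2 @ q2            # scores for the second query half
    top1 = np.argsort(s1)[-k:]    # top-k sub-key indices per half
    top2 = np.argsort(s2)[-k:]
    # The Cartesian product addresses |K1|*|K2| slots with only |K1|+|K2| keys.
    cand = [(i * len(subkeys2) + j, s1[i] + s2[j]) for i in top1 for j in top2]
    idx, scores = zip(*sorted(cand, key=lambda t: -t[1])[:k])
    w = np.exp(np.array(scores) - max(scores))
    w /= w.sum()
    return w @ values[list(idx)]
```

This is what makes the memory "large but cheap": capacity grows quadratically in the number of sub-keys while the search cost stays linear.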

NLP for COVID-19 (Part 2)

Nov 20: Live Session 2, 16:00 – 22:00  CET (UTC/GMT+1)

16:00-16:15 A multi-lingual neural machine translation model for biomedical data,
NLP COVID-19 workshop, EMNLP | Alexandre Berard, Zae Myung Kim, Vassilina Nikoulina, Eunjeong Lucy Park, Matthias Gallé

We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms the existing publicly released models. We believe that this release will help the large-scale multilingual analysis of the digital content of the COVID-19 crisis and of its effects on society, economy, and healthcare policies. We also release a test set of biomedical text for Korean-English. It consists of 758 sentences from official guidelines and recent papers, all about COVID-19.
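Domain-tag training of the kind mentioned above is commonly implemented by prepending a special token to the source sentence so one model can condition on the domain. The token names below are illustrative, not the released model's actual vocabulary.

```python
def tag_source(sentence: str, domain: str) -> str:
    """Prepend a domain token (e.g. <medical> or <generic>) to the source
    sentence; the NMT model learns to condition on it at training time.
    Token names are hypothetical examples."""
    assert domain in {"medical", "generic"}
    return f"<{domain}> {sentence}"
```

At inference time the user picks the tag matching the desired output style, e.g. `<medical>` for biomedical text about COVID-19.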

Fifth Conference on Machine Translation (WMT20)

Nov 19: Live Session 1, 10:45 – 21:00  CET (UTC/GMT+1)

12:00 – 13:30  CET (UTC/GMT+1) NAVER LABS Europe’s participation to the robustness, chat and biomedical tasks at WMT 2020
Alexandre Berard, Vassilina Nikoulina, Ioan Calapodescu

This paper describes Naver Labs Europe’s participation to the Robustness, Chat and Biomedical Translation tasks at WMT 2020. We propose a bidirectional German-English model that is multi-domain, robust to noise and which can translate entire documents (or bilingual dialogues) at once. We use the same ensemble of such models as our primary submission to all three tasks, and achieve competitive results. We also experiment with language model pre-training techniques and evaluate their impact on robustness to noise and out-of-domain translation. For German, Spanish, Italian and French to English translation in the Biomedical Task, we also submit our recently released multilingual Covid19NMT model.

Nov 20: Live Session 2, 10:00 – 21:15  CET (UTC/GMT+1)

12:00 – 13:30 CET (UTC/GMT+1) PATQUEST: Papago Translation Quality Estimation
WMT20 at EMNLP | Yujin Baek, Zae Myung Kim, Jihyung Moon, Hyunjoong Kim, Eunjeong Park

This paper is a system description paper for NAVER Papago’s submission to the WMT20 Quality Estimation Task. It proposes two key strategies for quality estimation: (1) a task-specific pre-training scheme, and (2) task-specific data augmentation. The former focuses on devising learning signals for pre-training that are closely related to the downstream task. We also present data augmentation techniques that simulate the varying levels of errors that the downstream dataset may contain. Thus, our PATQUEST models are exposed to erroneous translations in both stages of task-specific pre-training and fine-tuning, effectively enhancing their generalization capability. Our submitted models achieve significant improvement over the baselines for Task 1 (Sentence-Level Direct Assessment; EN-DE only) and Task 3 (Document-Level Score).
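To give a flavour of error-simulating augmentation for quality estimation, the toy sketch below corrupts a translation with a controllable number of token deletions and duplications. This is only a generic illustration under stated assumptions; the actual PATQUEST augmentation scheme is more elaborate and is described in the paper.

```python
import random

def inject_errors(tokens, n_errors, rng):
    """Toy QE-style augmentation: apply n_errors random token drops or
    duplications to simulate translations of varying quality."""
    out = list(tokens)
    for _ in range(n_errors):
        i = rng.randrange(len(out))
        if rng.random() < 0.5 and len(out) > 1:
            del out[i]             # simulate an omission error
        else:
            out.insert(i, out[i])  # simulate a repetition error
    return out
```

Varying `n_errors` yields synthetic training pairs spanning a range of error severities, which is the kind of signal a QE model must learn to score.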

NAVER and NAVER LABS Europe publish annually at CVPR, ECCV/ICCV, NeurIPS, ICML, ACL, EMNLP, SIGIR, RecSys and many other conferences.

NAVER Clova publications     |    NAVER LABS Europe publications

Would you like to join one of the world’s most innovative companies and
have impact on the lives of millions of people?

NAVER LABS Europe

Creating new connections by advancing technology
If you enjoy a challenge, are passionate, talented and embrace diversity, then right here may be the perfect place for you!

work at NAVER

A culture that recognizes ‘capability’ regardless of age or seniority
Autonomous decision-making and freedom of choice

Diversity and inclusiveness

Diversity is the reason NAVER came into existence in 1999. The need to provide alternatives is a fundamental core value for a healthy society.

We value different ways of thinking about the world and different perceptions of the world.

We try to create an inclusive workplace where respect reigns. A place where everyone can be themselves.

NAVER HQ ‘Green Factory’, Seongnam, Korea
NAVER LABS Europe, Grenoble, France

NAVER was recognised as a top employer and as the company university students in South Korea would most like to work for, for three consecutive years
(2016 – 2019)


Downloads

NAVER LABS Europe brochure
NAVER LABS Europe brochure: Creating new connections
NAVER brochure
NAVER brochure: Connecting people beyond borders
Selected NAVER and NAVER LABS Europe publications

Recent News:

Release of a multilingual, multi-domain NMT model for Covid-19 and biomedical research. (Blog)
A new platform based on deep-learning approaches to handwritten-text recognition and information extraction enables data from century-old documents to be parsed and analysed. (Blog)
A new greedy, brute-force solution improves fairness in ranking and encompasses realistic scenarios with multiple, unknown protected groups.