Time: 23rd January 2020, 1:00PM-4:00PM. Place: Aud. 01, August Krogh Building, Universitetsparken 13, 2100 Copenhagen, Denmark
Speaker: Matthias Gallé, group lead of the NAVER LABS Europe Natural Language Processing group.
Title: Text Representation Units for Neural Machine Translation
Abstract: What is the best atomic unit to represent text? This important decision lies at the intersection between the continuous representations of modern NLP and the discrete world of text. To understand the effectiveness of byte-pair encoding (BPE), we test the hypothesis that it stems from the algorithm's compression capacity, by linking BPE to the broader family of dictionary-based compression algorithms. We then study character-based NMT with Transformer models, showing the consequences of using characters as atomic symbols for overall translation quality and robustness, as well as the need for deeper models. This is joint work with Rohit Gupta, Laurent Besacier and Marc Dymetman.
Organizers: Wouter Boomsma and Francois Lauze, Department of Computer Science, University of Copenhagen (DIKU)