Investigating the Effectiveness of BPE: The Power of Shorter Sequences - Naver Labs Europe

Abstract

Byte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique commonly used in neural machine translation and other NLP tasks. Its effectiveness has made it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary sizes show that – given a fixed vocabulary size budget – the fewer tokens an algorithm needs to cover the test set, the better the translation quality (as measured by BLEU).
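To make the connection to dictionary-based compression concrete, the core of BPE training can be sketched as a greedy loop: count adjacent symbol pairs over the corpus, merge the most frequent pair into a new dictionary symbol, and repeat until the merge budget is exhausted. The code below is a minimal illustrative sketch, not the implementation used in the paper; all function names are chosen here for clarity.

```python
from collections import Counter

def most_frequent_pair(tokenized):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in tokenized.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    # Return the most frequent pair, or None if no pairs remain.
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokenized, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in tokenized.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # concatenate the two symbols
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn a list of BPE merge operations from a list of words."""
    word_freqs = Counter(corpus)
    # Start from characters: each word is a tuple of single-character symbols.
    tokenized = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokenized)
        if pair is None:
            break  # nothing left to merge
        tokenized = merge_pair(tokenized, pair)
        merges.append(pair)
    return merges
```

Each learned merge adds one entry to the vocabulary, so `num_merges` plays the role of the fixed vocabulary size budget discussed above: a better merge schedule covers the same text with fewer tokens.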