Pretrained Language Models (LMs) such as ELMo and BERT (Peters et al., 2018; Devlin et al., 2018) have significantly improved the quality of several Natural Language Processing (NLP) tasks by transferring prior knowledge learned from data-rich monolingual corpora to data-poor NLP tasks such as question answering or biomedical information extraction. In Information Retrieval (IR), BERT-based models have recently overtaken traditional learning-to-rank models, which had been leading the field for many years. This is an exciting time to work on this subject as new possibilities and research questions emerge.
In Information Retrieval, the ranking pipeline is generally decomposed into two stages: the first stage retrieves a candidate set from the whole collection, providing documents for the second stage, which re-ranks these candidates using more complex techniques. Because of the size of web-scale corpora, the first stage is heavily constrained by efficiency, and thereby generally relies on an inverted index structure and the BM25 algorithm. It must also optimize for recall, providing as many relevant documents as possible to the re-ranking stage. In contrast, since it only considers a reduced candidate set, the second stage has relied heavily on machine learning, ranging from learning to rank on handcrafted features, to neural ranking architectures, to BERT-based rankers that achieved state-of-the-art results on several benchmarks (https://microsoft.github.io/msmarco/).
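To make the two-stage decomposition concrete, here is a minimal sketch of a BM25-style first stage over a toy corpus. The corpus, tokenization, and parameter values are illustrative only; a real first stage would score a web-scale collection through an inverted index rather than a Python loop.

```python
import math
from collections import Counter

# Toy collection; in practice the first stage scans a web-scale corpus
# through an inverted index. Everything here is illustrative.
docs = [
    "neural ranking models for information retrieval",
    "bm25 is a classic bag of words ranking function",
    "bert based rankers rerank a small candidate set",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
# Document frequency of each term (how many docs contain it).
df = Counter(t for d in tokenized for t in set(d))

def bm25_score(query, doc, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

# First stage: score the collection, keep the top-k as candidates
# for the (more expensive) second-stage re-ranker.
query = "neural ranking"
candidates = sorted(range(N),
                    key=lambda i: bm25_score(query, tokenized[i]),
                    reverse=True)[:2]
```

The second stage would then apply a heavier model (e.g. a BERT-based ranker) only to `candidates`, which is what keeps the pipeline tractable.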
While many works have improved the latter stage, the first stage still relies on bag-of-words matching, and hence remains a bottleneck of the pipeline. SNRM was the first model to directly tackle the first stage of a ranking pipeline using neural networks.
This year, several alternative first-stage rankers have been proposed, based on BERT and quantization libraries such as FAISS. In this internship, we propose to explore and benchmark several indexing methods for deep information retrieval.
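A hedged sketch of what such a dense first stage looks like: documents and queries are encoded into vectors (here mocked with random embeddings standing in for a BERT-based encoder) and retrieval is maximum inner product search. NumPy is used below for self-containment; FAISS's `IndexFlatIP` performs this same exact search, while its quantized indexes (IVF, PQ, ...) trade exactness for speed and memory at scale.

```python
import numpy as np

# Random vectors stand in for learned document embeddings from a
# BERT-style encoder; dimensions and sizes are illustrative.
rng = np.random.default_rng(0)
dim, n_docs = 64, 1000
doc_vecs = rng.standard_normal((n_docs, dim)).astype("float32")

def dense_retrieve(query_vec, k=10):
    """Return indices of the top-k documents by inner product
    (brute-force equivalent of FAISS IndexFlatIP search)."""
    scores = doc_vecs @ query_vec            # one score per document
    topk = np.argpartition(-scores, k)[:k]   # unordered top-k, O(n)
    return topk[np.argsort(-scores[topk])]   # sort only the top-k

# Sanity check: a document used as its own query ranks itself first.
query_vec = doc_vecs[42]
top = dense_retrieve(query_vec)
```

The benchmarking question of the internship is essentially where this sketch breaks down: at web scale the brute-force scan becomes infeasible, and approximate quantized indexes introduce an effectiveness/efficiency trade-off worth measuring.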
We are looking for someone with good coding skills, strong scientific rigor and creativity.
You will join a team of people working on this topic, learn about deep information retrieval, have access to many GPUs and experiment with novel ideas.
-  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
-  From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing, https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1302
-  ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, https://arxiv.org/abs/2004.12832
-  Efficient Document Re-Ranking for Transformers by Precomputing Term Representations, https://arxiv.org/abs/2004.14255
-  FAISS: https://github.com/facebookresearch/faiss
NAVER LABS Europe has full-time positions and PhD and PostDoc opportunities throughout the year, which are advertised here and on the sites of international conferences that we sponsor, such as CVPR, ICCV, ICML, NeurIPS, EMNLP etc.
NAVER LABS Europe is an equal opportunity employer.
NAVER LABS Europe is located in Grenoble, in the French Alps. We take a multi- and interdisciplinary approach to research, with scientists in machine learning, computer vision, artificial intelligence, natural language processing, ethnography and UX working together to create next-generation ambient intelligence technology and services that deeply understand users and their contexts.