CODE & DATA

Data, code and models released by NAVER LABS Europe

Speech-MASSIVE

A multilingual Spoken Language Understanding (SLU) dataset

Covers 12 languages from different families and inherits the intent prediction and slot filling annotations from the original MASSIVE dataset. See also the Interspeech 2024 paper.

BERGEN: benchmarking RAG

A Benchmarking Library for Retrieval-Augmented Generation

Designed to ease reproducibility, simplify the integration of new datasets and models, and identify strong baselines.

Pasero

Lightweight PyTorch framework for training and running text generation models.

Can be used for machine translation, speech translation, language modeling and dialogue, and supports a number of popular pre-trained models.

mHuBERT-147

The first general-purpose massively multilingual HuBERT speech representation model.

A promising compact model for speech processing pipelines, offering an unprecedented balance between high performance and parameter efficiency. Developed within the EU UTTER project.

DistilWhisper

Efficient distillation of multi-task speech models via language-specific experts.

A multitask and multilingual speech model covering 99 languages.

Multilingual machine translation

Assessing the impact of compression methods on MNMT.

Code repository for the paper "What do compressed multilingual machine translation models forget?"

SMaLL-100

A shallow multilingual machine translation model for low-resource languages.

Covers more than 10K language pairs and achieves results competitive with M2M-100 while being much smaller and faster.

NMT & Efficient Multilingual NMT

Code, model checkpoints, test sets and outputs for 4 multilingual NMT papers (EMNLP 2021).

Publications concern efficient inference, continual learning, unsupervised NMT and domain adaptation.

COVID-19 NMT

Multilingual & multi-domain translation model.

Model specialised for biomedical data.

To Annotate or Not?

Domain shift prediction

A method to predict the drop in accuracy of a trained model.

Aspect Based Sentiment Analysis (ABSA) dataset

Manually annotated ABSA dataset from Foursquare comments.

585 samples (1006 sentences) randomly selected and annotated with the SemEval-2016 annotation guidelines for the restaurant domain.
