MACHINE TRANSLATION OF RESTAURANT REVIEWS – PARALLEL CORPUS
Very detailed information about social venues such as restaurants is available from user-generated reviews in applications like Foursquare, Google Maps, or TripAdvisor. However, most of these reviews are written in the local language and are not directly usable by foreign visitors. Machine Translation (MT) has been deployed in some of these applications, but the quality is often mediocre: generic MT models (as provided by Papago or Google Translate) are not adapted to this specific domain, and they often lack robustness to the types of noise found in user reviews (see our blog post and previous paper on this topic).
We share a French-English parallel corpus of Foursquare restaurant reviews, and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in this real-world scenario where better-quality MT would be greatly beneficial.
This corpus contains over 11k reviews (or 18k sentences) originally written in French and translated into English by professional translators. We provide official train, valid and test splits. The accompanying paper describes a number of baseline models that we trained with this data, and which can be used as a starting point for future work on this resource.
The corpus also contains metadata that we have not yet used in our work, but which could be very useful: review boundaries (for work on MT beyond the sentence level), the ID of the POI (Point of Interest), and the POI's location, category and average rating.
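As an illustration of how the review boundaries might be used for MT beyond the sentence level, here is a minimal Python sketch that regroups sentence pairs into whole reviews. It is only a sketch under assumptions: this description does not specify the file layout, so the function name `group_by_review`, the idea of one review ID per line-aligned sentence pair, and the toy data are all hypothetical and would need adapting to the actual release format.

```python
from collections import OrderedDict
from typing import Iterable, List, Tuple


def group_by_review(
    review_ids: Iterable[str],
    src_sentences: Iterable[str],
    tgt_sentences: Iterable[str],
) -> List[Tuple[str, List[str], List[str]]]:
    """Group line-aligned sentence pairs into reviews, preserving order.

    Assumption (not confirmed by the corpus description): each sentence
    pair comes with the ID of the review it belongs to.
    """
    reviews = OrderedDict()
    for rid, src, tgt in zip(review_ids, src_sentences, tgt_sentences):
        # Create the (French, English) sentence lists on first sight of a review.
        fr_sents, en_sents = reviews.setdefault(rid, ([], []))
        fr_sents.append(src.strip())
        en_sents.append(tgt.strip())
    return [(rid, fr, en) for rid, (fr, en) in reviews.items()]


# Toy data invented for illustration; not taken from the corpus.
ids = ["r1", "r1", "r2"]
fr = ["Très bon resto.", "Service rapide.", "Trop cher."]
en = ["Very good restaurant.", "Fast service.", "Too expensive."]

docs = group_by_review(ids, fr, en)
for rid, fr_sents, en_sents in docs:
    print(rid, " ".join(fr_sents), "=>", " ".join(en_sents))
```

Each returned tuple holds a review's sentences in their original order, so the French and English sides can be joined into review-level inputs and references.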
We also provide, alongside the parallel corpus, the outputs of our baseline models, an analysis of the errors in French Foursquare reviews, and the outputs of our human evaluations.
Please contact Gianluca Monaci for any legal questions regarding the data, and Ioan Calapodescu or Alexandre Bérard for any technical questions.
If you use this data, please cite the following paper:
Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness, Alexandre Bérard, Ioan Calapodescu, Marc Dymetman, Claude Roux, Jean-Luc Meunier and Vassilina Nikoulina, 3rd Workshop on Neural Generation and Translation (WNGT 2019)