Multimodal NLP for Robotics
We are developing multimodal Natural Language Processing (NLP) techniques for human-robot interaction (HRI), with the goal of creating intelligent systems that communicate seamlessly with humans by understanding and generating natural language as text or speech. By leveraging multimodal data and models, we believe we can create more robust and user-friendly interfaces for robotic services. This work, centered on multi-modality, also encompasses multi-tasking and multilinguality.
Multi-tasking
Even in the simplest of scenarios, a robotic agent has to perform more than one task: it needs to understand a user command, check the user's identity and utter the relevant answer. Although most AI systems have historically been trained to perform each of these tasks individually, we believe that learning them together yields better results, and we are therefore exploring the benefits of multi-tasking in NLP for HRI.
One example is training a single deep neural network to perform multiple tasks simultaneously, so that the robotic agent can (i) understand what is said using Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), or Spoken Language Understanding (SLU), (ii) detect the emotion of the speaker using Emotion Recognition (ER), and (iii) generate an answer to the user's question taking these elements into account. Our recent experiments show that this approach enables the robot to process and respond to user input more efficiently: multi-tasking models can outperform single-task models in scenarios such as Intent Classification, Slot Filling and Spoken Emotion Recognition (SER) in spoken language understanding [1], while also being lighter and faster. We recently released Speech-MASSIVE [2], a corpus for multilingual SLU, ASR and Machine Translation (MT) covering 18 domains in 12 languages, which we hope will help the community further develop these technologies.
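As a rough illustration of the idea, the sketch below shows a single network with a shared encoder and one lightweight head per task, trained jointly. It is a toy example, not the adapter-based architecture of [1]; all dimensions, label counts and layer choices are assumptions made for the example.

```python
# Minimal sketch of a multi-task SLU model: one shared encoder feeds several
# task-specific heads (intent, slots, emotion). Illustrative only; not the
# adapter-based model described in [1].
import torch
import torch.nn as nn

class MultiTaskSLU(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_intents=60, n_slot_tags=55, n_emotions=7):
        super().__init__()
        # Shared encoder over acoustic (or textual) feature frames.
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        enc_dim = 2 * hidden
        # One lightweight head per task, all trained jointly.
        self.intent_head = nn.Linear(enc_dim, n_intents)    # utterance-level
        self.slot_head = nn.Linear(enc_dim, n_slot_tags)    # frame-level
        self.emotion_head = nn.Linear(enc_dim, n_emotions)  # utterance-level

    def forward(self, feats):                    # feats: (batch, time, feat_dim)
        states, _ = self.encoder(feats)          # (batch, time, 2*hidden)
        pooled = states.mean(dim=1)              # simple mean pooling
        return {
            "intent": self.intent_head(pooled),
            "slots": self.slot_head(states),     # one tag distribution per frame
            "emotion": self.emotion_head(pooled),
        }

# Joint training typically sums the per-task losses (weighting is a design choice).
model = MultiTaskSLU()
logits = model(torch.randn(4, 120, 80))
print({k: v.shape for k, v in logits.items()})
```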
Multi-modality
Another line of research that aims to improve accuracy and robustness is multimodal NLP models that can process and integrate various types of data, such as speech, text and, in the future, images. We are investigating how to merge the embedding spaces of foundation models initially trained on different data modalities. This research has shown that such multimodal models can lead to better performance in tasks such as speech-to-text machine translation [3], even in low-resource settings where one of the modalities is missing. Our system for the IWSLT 2023 low-resource track, which ranked 1st, showed how an end-to-end model can be trained to translate speech for low-resource African languages when no transcription is available and only a few hours of data exist.
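The following sketch illustrates the general idea of bridging embedding spaces: a small trainable projection maps the output of a speech foundation model into the space expected by a text translation model. It is a hedged illustration of the concept, not the actual IWSLT 2023 system [3]; the dimensions and the frozen/trainable split are assumptions.

```python
# Toy illustration of "merging embedding spaces": a learned projection maps the
# output of a (frozen) speech encoder into the embedding space of a (frozen)
# text translation decoder. Dimensions are assumptions for the example.
import torch
import torch.nn as nn

SPEECH_DIM, TEXT_DIM = 768, 1024   # e.g. HuBERT-style encoder vs. MT decoder

class ModalityBridge(nn.Module):
    """Projection from speech-encoder space to text-decoder space."""
    def __init__(self, speech_dim=SPEECH_DIM, text_dim=TEXT_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, speech_states):            # (batch, frames, speech_dim)
        return self.proj(speech_states)          # (batch, frames, text_dim)

bridge = ModalityBridge()
speech_states = torch.randn(2, 300, SPEECH_DIM)  # stand-in for speech encoder output
text_like = bridge(speech_states)                # can now be consumed by the text decoder
print(text_like.shape)                           # torch.Size([2, 300, 1024])
```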
We’ve open-sourced the Speech-MASSIVE [2] dataset, which contains parallel text and speech in several languages, as well as annotations for many tasks.
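A typical way to explore the dataset is through the Hugging Face `datasets` library, as in the hedged example below; the hub identifier and configuration name are assumptions for illustration, and the exact values and field names are documented on the dataset card.

```python
# Hedged example: loading Speech-MASSIVE with the Hugging Face `datasets`
# library. The hub identifier and the "fr-FR" configuration are assumptions
# for illustration; check the dataset card for the exact values.
from datasets import load_dataset

ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="train")
print(ds.features)   # inspect the available fields (audio, transcription, intent, slots, ...)
print(ds[0])         # one example with its speech signal and annotations
```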
Multilinguality
Supporting multiple languages will not only better serve users around the world, but will also enable robots to be more effective in diverse environments and settings. For example, multilingual models can be used for spoken language translation and language understanding in applications such as customer service chatbots. Beyond the machine translation [3] and multilingual SLU [2] work mentioned above, we recently released mHuBERT-147 [4], a state-of-the-art and highly efficient foundation model for speech representation that can be used to build these types of systems.
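As a hedged example of how such a model can be used, the snippet below extracts multilingual speech representations with mHuBERT-147 through the Hugging Face `transformers` library; the checkpoint identifier and the 16 kHz input rate are assumptions to be checked against the model card.

```python
# Hedged sketch: extracting multilingual speech representations with
# mHuBERT-147 via Hugging Face `transformers`. The checkpoint identifier and
# the 16 kHz sampling rate are assumptions for illustration.
import torch
from transformers import AutoFeatureExtractor, HubertModel

ckpt = "utter-project/mHuBERT-147"                 # assumed hub identifier
feature_extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = HubertModel.from_pretrained(ckpt)

waveform = torch.randn(16000)                      # 1 s of placeholder 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state     # (1, frames, hidden_dim)
print(hidden.shape)
```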
Applications in robotics
Multimodal NLP research is applicable to robotics in numerous ways, whenever human-robot interaction or collaboration is necessary. We’re currently testing our models and techniques to see how they can improve the robot delivery services deployed at our company premises, ‘1784’, where a fleet of 100 robots delivers coffee orders, parcels and more, autonomously navigating a building in which 5,000 employees work. In these kinds of situations, efficiency, robustness and safety are also a very important part of the job. The answers given by the robotic agents need to be fast and robust to environmental noise and unexpected changes, e.g. doors opening and closing, people passing by, or anything else not encountered in the training data. In addition, we’re constantly working on reducing computational power, latency and cost. For example, we’ve explored how to incrementally make a generative model robust to new sources of noise to avoid retraining [5], and we’ve developed new techniques to create small models by distillation or to specialize models using only a few seconds of in-domain training data [6,7].
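To make the distillation idea concrete, here is a generic sketch of a distillation loss in the spirit of, but not identical to, the approach behind DistilWhisper [6]: the small student model is trained to match the teacher's softened output distribution in addition to the ground-truth labels. The temperature, loss weighting and tensor shapes are assumptions.

```python
# Generic knowledge-distillation loss: soft targets from a large teacher plus
# hard targets from the labels. A sketch of the general technique, not the
# exact DistilWhisper recipe [6]; T, alpha and shapes are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: usual cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for model outputs.
student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(student, teacher, labels))
```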
Some of these technologies have been released to the community, such as our lightweight generation framework Pasero and the models and datasets listed below.
Open source
– Pasero: a lightweight text generation framework
– mHuBERT-147: compact multilingual HuBERT models
– Multilingual DistilWhisper: efficient distillation of multi-task speech models and code
– Speech-MASSIVE: multilingual, multimodal and multitask (SLU, MT, …) dataset
Recent publications
1: An adapter-based unified model for multiple spoken language processing tasks, ICASSP 2024
2: Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond, Interspeech 2024
3: NAVER LABS Europe’s multilingual speech translation systems for the IWSLT 2023 low-resource track, International Conference on Spoken Language Translation (IWSLT) 2023
4: mHuBERT-147: a compact multilingual HuBERT model, Interspeech 2024
5: Multimodal robustness for neural machine translation, EMNLP 2022
6: DistilWhisper: efficient distillation of multi-task speech models via language-specific experts, ICASSP 2024
7: Unsupervised multi-domain data selection for ASR fine-tuning, ICASSP 2024