Multimodal NLP for Robotics

We are developing multimodal Natural Language Processing (NLP) techniques for human-robot interaction (HRI), with the goal of creating intelligent systems that communicate seamlessly with humans by understanding and generating natural language as text or speech. By leveraging multimodal data and models, we believe we can build more robust and user-friendly interfaces for robotic services. This work, centered on multi-modality, also encompasses multi-tasking and multilinguality.

Multi-tasking

Even in the simplest of scenarios, a robotic agent has to perform more than one task: it needs to understand a user command, check the user's identity and utter the relevant answer. Although most AI systems have historically been trained to perform each of these tasks individually, we believe we will get better results by learning them together, and we are therefore exploring the benefits of multi-tasking in NLP for HRI.

One example is training a single deep neural network to perform multiple tasks simultaneously, so that the robotic agent can (i) understand what is said using Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), or Spoken Language Understanding (SLU), (ii) detect the emotion of the speaker using Emotion Recognition (ER) and (iii) generate an answer to the user's question taking these signals into account. Some of our recent experiments show that this approach enables the robot to process and respond to user input more efficiently: multi-task models can outperform single-task models in various scenarios, such as Intent Classification, Slot Filling and Speech Emotion Recognition (SER) in spoken language understanding [1], and they are also lighter and faster. The Speech-MASSIVE [2] corpus for multilingual SLU/ASR/MT (Machine Translation), covering 18 domains in 12 languages, which we recently released, is something we hope will help the community further develop these technologies.
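As a rough illustration of the kind of architecture this implies (a shared encoder with several lightweight task heads trained jointly), the PyTorch sketch below may help; all module choices, dimensions and label counts are illustrative assumptions rather than the model described in [1].

```python
# Minimal sketch of a multi-task SLU model: one shared encoder, one head per task.
# Architecture, dimensions and label counts are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskSLU(nn.Module):
    def __init__(self, input_dim=80, hidden_dim=256,
                 n_intents=60, n_slots=55, n_emotions=7):
        super().__init__()
        # Shared encoder over acoustic (or textual) features.
        self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        enc_dim = hidden_dim * 2
        # One lightweight head per task, all trained jointly.
        self.intent_head = nn.Linear(enc_dim, n_intents)    # utterance-level
        self.slot_head = nn.Linear(enc_dim, n_slots)        # frame-level
        self.emotion_head = nn.Linear(enc_dim, n_emotions)  # utterance-level

    def forward(self, features):
        # features: (batch, time, input_dim)
        hidden, _ = self.encoder(features)
        pooled = hidden.mean(dim=1)                 # simple mean pooling
        return {
            "intent": self.intent_head(pooled),
            "slots": self.slot_head(hidden),        # per-frame slot logits
            "emotion": self.emotion_head(pooled),
        }

model = MultiTaskSLU()
logits = model(torch.randn(4, 120, 80))
# A joint training loss would typically be a weighted sum of the per-task losses.
```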

Multi-modality

Another axis of research aimed at improving accuracy and robustness is multimodal NLP models that can process and integrate various types of data such as speech, text and, in the future, images. We are investigating merging the embedding spaces of foundation models initially trained on different data modalities. This research has shown that such multimodal models can lead to better performance in tasks such as speech-to-text machine translation [3], even in low-resource settings where one of the modalities is missing. Our system for the IWSLT23 benchmark, which ranked 1st, showed how an end-to-end model can be trained to translate speech for low-resource African languages when no transcription is available and only a few hours of data exist.
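One simple way to couple models trained on different modalities is to learn a small adapter that maps the states of a speech encoder into the embedding space of a pretrained text model. The sketch below illustrates this idea; the dimensions, pooling and adapter design are assumptions for illustration, not the exact recipe of our IWSLT23 system [3].

```python
# Hedged sketch: project frozen speech-encoder states into a text model's
# embedding space via a small trainable adapter. All sizes are assumptions.
import torch
import torch.nn as nn

class SpeechToTextBridge(nn.Module):
    def __init__(self, speech_dim=768, text_dim=1024):
        super().__init__()
        # Adapter mapping speech representations into the text model's space.
        self.adapter = nn.Sequential(
            nn.Linear(speech_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        # Length compression: average pairs of frames to shorten the sequence.
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, speech_states):
        # speech_states: (batch, frames, speech_dim) from a frozen speech encoder
        x = self.pool(speech_states.transpose(1, 2)).transpose(1, 2)
        return self.adapter(x)  # fed to the text decoder as "pseudo-embeddings"

bridge = SpeechToTextBridge()
pseudo_embeddings = bridge(torch.randn(2, 200, 768))  # -> (2, 100, 1024)
```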

We’ve open-sourced the Speech-MASSIVE [2] dataset, which contains parallel text and speech in several languages as well as annotations for many tasks.
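For readers who want to experiment with it, a minimal loading sketch with the Hugging Face datasets library might look as follows; the dataset identifier, configuration name, split and field names are assumptions, so please check the dataset card referenced in [2].

```python
# Hedged sketch of loading Speech-MASSIVE with the `datasets` library.
# The repository id, config ("fr-FR") and split are assumptions.
from datasets import load_dataset

speech_massive = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="validation")
example = speech_massive[0]
print(example.keys())  # expected fields include audio, transcription, intent and slot labels
```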

Multilinguality

Supporting multiple languages will not only better serve users around the world, it will also enable robots to be more effective in diverse environments and settings. For example, multilingual models can be used for spoken language translation and language understanding in various applications, such as customer service chatbots. Beyond the machine translation [3] and multilingual SLU [2] work mentioned above, we recently released mHuBERT-147 [4], a state-of-the-art and very efficient foundation model for speech representation that can be used to build these kinds of systems.
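A hedged sketch of extracting speech representations with mHuBERT-147 through the transformers library is shown below; the model identifier is an assumption, so refer to the release in [4] for the exact one.

```python
# Hedged sketch: obtain multilingual speech representations with mHuBERT-147.
# The Hugging Face model id below is assumed, not confirmed by this post.
import torch
from transformers import AutoFeatureExtractor, HubertModel

model_id = "utter-project/mHuBERT-147"  # assumed repository id, see [4]
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)

waveform = torch.zeros(16000)  # 1 second of silence at 16 kHz, as a placeholder
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
```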

Applications in robotics

Multimodal NLP research is applicable to robotics in numerous ways, whenever human-robot interaction or collaboration is necessary. We’re currently testing our models and techniques to see how they can improve the robot delivery services deployed on our company premises at ‘1784’, where a fleet of 100 robots delivers coffee orders, parcels and more, autonomously navigating a building where 5,000 employees work. In these settings, efficiency, robustness and safety are also critical. The answers given by the robotic agents need to be fast and robust to environmental noise and unexpected changes, e.g. doors opening and closing, people passing by, or anything not encountered in the training data. In addition, we’re constantly working on reducing computational power, latency and cost. For example, we’ve explored how to incrementally make a generative model robust to new sources of noise in order to avoid retraining [5], and we’ve developed new techniques to create small models by distillation, or to specialize models using only a few seconds of in-domain training data [6,7].
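As an illustration of the distillation idea, the sketch below shows a standard knowledge-distillation objective that combines soft teacher targets with the usual cross-entropy loss; the temperature, weighting and tensor shapes are illustrative assumptions and not the exact setup of [6,7].

```python
# Hedged sketch of a standard knowledge-distillation loss for training a small
# student model from a larger teacher. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 100), torch.randn(8, 100),
                         torch.randint(0, 100, (8,)))
```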

Some of these technologies have been released to the community, such as our lightweight generation framework Pasero and the models and datasets listed below.

Open source 

Pasero: a lightweight text generation framework
mHuBERT-147: compact multilingual HuBERT models
Multilingual DistilWhisper: efficient distillation of multi-task speech models and code
Speech-MASSIVE: multilingual, multimodal and multitask (SLU, MT, …) dataset
