Multimodal NLP for Robotics


Speech is the most natural interface humans have. We use it to coordinate in a busy kitchen, ask for help in a store, or switch seamlessly between languages in conversation. As intelligent assistants move beyond text boxes and into real-world environments, speech becomes more than a feature: it becomes the primary bridge between humans and machines.

Building a speech-enabled assistant is not just a matter of converting audio into text. Human communication carries intent, hesitation, emphasis and ambiguity. A truly capable assistant must go beyond transcription: it needs to understand meaning, follow complex instructions, reason across languages and respond in real time.

We aim to embed these capabilities into NAVER robotics services, enabling natural voice control of robotic functionalities and more user-friendly monitoring and supervision of robotic operations. From multilingual speech representations to instruction-following speech large language models, our goal is to build assistants that engage naturally across languages, environments and contexts.

Building Efficient and Robust Speech Models

For assistants to operate at scale, speech models must be both multilingual and efficient. Strong reasoning capabilities depend on strong representations.

At NAVER LABS Europe, we developed mHuBERT-147, a compact multilingual speech model that represents 147 languages within a lightweight architecture. This matters for robotic services increasingly deployed in public, multilingual environments such as airports, hospitals and office buildings, where they may face users, operators and bystanders speaking different languages. Lightweight, low-latency encoders remain essential for running robust speech understanding on platforms with limited compute and power.
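As a concrete illustration, here is a minimal sketch of extracting frame-level multilingual representations with such an encoder. It assumes the mHuBERT-147 checkpoint is published on the Hugging Face Hub under the repository name used below, with a standard preprocessing config; both are assumptions, not details from this article:

```python
# Minimal sketch: frame-level multilingual speech features from mHuBERT-147.
# The checkpoint id "utter-project/mHuBERT-147" is an assumption.
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("utter-project/mHuBERT-147")
model = HubertModel.from_pretrained("utter-project/mHuBERT-147")
model.eval()

# One second of silent 16 kHz audio stands in for a real microphone stream.
waveform = torch.zeros(16000).numpy()

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
print(hidden_states.shape)
```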

We also developed Multilingual DistilWhisper for efficiency and robustness, exploring how to compress large multi-task speech systems into smaller, language-specialized experts that maintain strong performance at a significantly reduced computational cost. In addition, techniques such as our unsupervised multi-domain data selection for ASR fine-tuning improve robustness across varied domains, making our systems reusable across services.
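The compression behind such language-specialized experts follows the familiar teacher-student pattern: a small student is trained to match both the reference transcripts and the output distribution of the large teacher. The sketch below shows a generic distillation step; the loss weighting, model interface and batch layout are illustrative assumptions, not the exact DistilWhisper recipe:

```python
# Generic knowledge-distillation step for a seq2seq ASR student/teacher.
# The 50/50 loss weighting and batch fields are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, alpha=0.5, temperature=2.0):
    """Cross-entropy on reference labels plus KL divergence toward the teacher."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits

    # Hard-label loss against the reference transcription.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    # Soft-label loss pulling the student toward the teacher's distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * ce + (1 - alpha) * kl
```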

These foundations ensure that future assistants are not only intelligent, but also scalable, robust, and easy to deploy.

From Speech Recognition to Speech Reasoning

Understanding speech is only the beginning. Assistants must also interpret intent, handle ambiguity and act on spoken instructions.

A key step toward this goal is teaching systems to understand user intent directly from speech. To support this, we introduced Speech-MASSIVE, a multilingual dataset for spoken language understanding (SLU) in 12 languages. By addressing the scarcity of SLU data, it provides models with the examples they need to move beyond transcription and toward genuine understanding.
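For readers who want to inspect the data, the sketch below assumes the dataset is released on the Hugging Face Hub; the repository id and config name are assumptions based on public releases, not details stated here:

```python
# Minimal sketch: browse a Speech-MASSIVE split with the `datasets` library.
# The repository id "FBK-MT/Speech-MASSIVE" and config "fr-FR" are assumptions.
from datasets import load_dataset

ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="train")
print(ds)            # split size and column names
print(ds[0].keys())  # audio, transcript and SLU annotation fields
```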

This capability directly supports our goal of building scalable multilingual voice assistants for robotics, specifically voice-ordering bots that provide access to robotic delivery services: the models listen to a spoken request, infer intent, ask clarifying questions, translate when needed and output a structured action that triggers the delivery workflow (see the sketch below). Our winning system in the IWSLT 2023 low-resource track shows that we can extend to new languages with parameter-efficient adaptation, while our winning instruction-following speech LLM at IWSLT 2025 demonstrates a single system that can transcribe, translate and answer questions across four languages, the core loop required for end-to-end voice ordering.
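To make the end of that loop concrete, here is what a structured delivery action could look like. Every class, field and value below is hypothetical; the article does not specify an actual schema or parser:

```python
# Hypothetical sketch of the structured action a voice-ordering bot might
# emit to trigger a delivery workflow. All names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class DeliveryAction:
    intent: str                  # e.g. "order_delivery"
    item: str                    # what the user asked for
    destination: str             # where the robot should deliver it
    language: str                # detected language of the request
    clarifications: list[str] = field(default_factory=list)  # open questions

def handle_request(transcript: str) -> DeliveryAction:
    """Toy stand-in for the speech LLM's structured output."""
    # A real system would infer these fields from the spoken request;
    # they are hard-coded here only to show the shape of the output.
    return DeliveryAction(
        intent="order_delivery",
        item="coffee",
        destination="meeting room 3",
        language="fr",
    )

print(handle_request("Apporte un café en salle de réunion 3, s'il te plaît."))
```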

Making Speech LLMs Practical

Creating speech-native assistants requires balancing capability with efficiency. Large speech models are powerful, but training them can be slow and resource-intensive. We explore approaches that make it easier to train and deploy these models in research and real-world settings.

In our Tower work, we focused on extending LLMs to process speech without degrading their existing text-language capabilities. This approach enables research into instruction-following and multilingual speech systems while keeping the backbone model’s reasoning intact.
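One common way to achieve this, sketched below under our own illustrative names (this shows the general frozen-backbone pattern, not the actual Tower architecture), is to freeze the text LLM and train only a small projection from speech-encoder features into its embedding space:

```python
# Sketch of the frozen-backbone pattern: only the adapter trains, so the
# LLM's text reasoning stays intact. Module names are illustrative.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        # (batch, frames, speech_dim) -> (batch, frames, llm_dim)
        return self.proj(speech_features)

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

# With pretrained `llm` and `speech_encoder` in scope:
#   freeze(llm); freeze(speech_encoder)
#   optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```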


Separately, we developed SpeechMapper, a method for aligning speech representations with LLM embeddings efficiently. Unlike standard approaches, SpeechMapper does not require a full LLM forward pass during training, significantly reducing computation while still producing high-quality speech-to-embedding mappings. This makes it possible to train instruction-following speech models more quickly and experiment with new data and languages at scale.
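One plausible way to realize that property, sketched below with illustrative names and shapes (this is our reading of the idea, not the paper's exact training objective), is to regress the mapper's outputs directly onto the LLM's frozen token embeddings for the reference transcript, which requires only an embedding lookup rather than a full forward pass:

```python
# Hedged sketch of embedding-level alignment: training touches only the
# LLM's embedding table, never its transformer layers. Names and shapes
# are illustrative assumptions.
import torch.nn.functional as F

def alignment_loss(mapper, speech_features, token_ids, llm_embedding_table):
    mapped = mapper(speech_features)          # (batch, T, llm_dim)
    # A plain table lookup, not an LLM forward pass.
    targets = llm_embedding_table(token_ids)  # (batch, T, llm_dim)
    # Assumes the mapper already downsamples speech frames to text length T.
    return F.mse_loss(mapped, targets)
```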

Together, these research efforts provide complementary solutions for making speech-native assistants practical: Tower allows LLMs to incorporate speech without losing existing capabilities, while SpeechMapper enables more efficient training. Both approaches bring us closer to assistants that are intelligent, robust and scalable for real-world applications.

Toward Speech-Native Assistants

Our ultimate goal is to create assistants that can listen, understand and respond naturally, moving beyond simple transcription or scripted responses. By combining efficient speech representations, instruction-following models and practical alignment techniques, we are building the foundations for systems that can engage in real-time interactions and handle complex spoken instructions.
