Speech systems (spoken language understanding, spoken question answering, speech translation) can either (a) include an explicit automatic speech recognition (ASR) module (the cascade approach) or (b) rely on an end-to-end architecture that takes speech as input and directly produces a decision from it. While these two approaches (cascade versus end-to-end) have often been opposed and compared in the past, few works have tried to take advantage of the two modalities represented by speech input and text input (the ASR transcript).
This project aims to propose a model that jointly learns from streamed audio and its noisy transcription into text, and to apply it to challenging tasks such as spoken language understanding or spoken question answering. In particular, we believe that this approach should (a) allow acoustic and semantic information to be integrated jointly for downstream tasks, (b) facilitate knowledge transfer between text and speech tasks by minimizing the difference between text and speech input representations, and (c) bring additional paralinguistic information (speaker gender, prosody, speaker emotion) to the overall model. A starting point could be two different encoders (speech and text) whose states synchronize at the utterance level, but we could imagine more advanced architectures with cross-modality attention (possibly at different layers). We would work on a recently introduced dataset called EMOTyDA (https://github.com/sahatulika15/EMOTyDA), collected from open-source dialogue datasets, which contains speech, transcripts, videos and semantic annotations.
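To make the starting point concrete, here is a minimal PyTorch sketch of the dual-encoder idea: a speech encoder and a text encoder whose utterance-level states are pulled together by a representation-matching loss, with a cross-modality attention layer fusing the two streams. All module choices, dimensions and names (GRU encoders, mean pooling, MSE synchronization loss, a 7-class head) are illustrative assumptions, not a specification of the project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Hypothetical joint speech/text model: two encoders, cross-modality
    attention, and an utterance-level synchronization loss."""
    def __init__(self, n_mels=80, vocab_size=1000, d_model=256, n_classes=7):
        super().__init__()
        # Speech branch: GRU over acoustic frames (e.g. log-Mel features).
        self.speech_enc = nn.GRU(n_mels, d_model, batch_first=True)
        # Text branch: embeddings + GRU over (possibly noisy) ASR tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        # Cross-modality attention: text states attend to speech states.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mels, tokens):
        speech_states, _ = self.speech_enc(mels)            # (B, T_s, D)
        text_states, _ = self.text_enc(self.embed(tokens))  # (B, T_t, D)
        # Text queries attend over acoustic keys/values.
        fused, _ = self.cross_attn(text_states, speech_states, speech_states)
        # Utterance-level representations: mean-pool each branch.
        u_speech = speech_states.mean(dim=1)
        u_text = fused.mean(dim=1)
        logits = self.classifier(u_text)
        # "Synchronize at the utterance level": penalize the distance
        # between the two utterance representations.
        sync_loss = F.mse_loss(u_speech, u_text)
        return logits, sync_loss

model = DualEncoder()
mels = torch.randn(2, 120, 80)            # batch of 2, 120 acoustic frames
tokens = torch.randint(0, 1000, (2, 15))  # 15 ASR tokens per utterance
logits, sync_loss = model(mels, tokens)
print(logits.shape)  # torch.Size([2, 7])
```

Training would combine the task loss (e.g. cross-entropy on dialogue-act or emotion labels) with `sync_loss`, which is one simple way to encourage the text and speech representations to converge and thus ease knowledge transfer between the two modalities.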
NAVER LABS Europe has full-time positions, PhD and PostDoc opportunities throughout the year which are advertised here and on international conference sites that we sponsor such as CVPR, ICCV, ICML, NeurIPS, EMNLP etc.
NAVER LABS Europe is an equal opportunity employer.
NAVER LABS Europe is located in Grenoble in the French Alps. We take a multi- and interdisciplinary approach to research, with scientists in machine learning, computer vision, artificial intelligence, natural language processing, ethnography and UX working together to create next-generation ambient intelligence technology and services that deeply understand users and their contexts.