Controlling prosody in end-to-end TTS: a case study on contrastive focus generation

Published by Inyoung Kim on 10 November 2021

Siddique Latif, Inyoung Kim, Ioan Calapodescu, Laurent Besacier

The SIGNLL Conference on Computational Natural Language Learning (CoNLL), co-located with EMNLP, 10-11 November, 2021

Abstract

While end-to-end text-to-speech (TTS) has made significant progress over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text in order to encode information related to contrastive focus, which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a dedicated dataset for this purpose and show that it makes it possible to train a TTS system in which this fine-grained prosodic feature can be correctly conveyed using control tokens.
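As a rough illustration of what controlling prosody from the input text could look like, the sketch below wraps one word of a sentence in control tokens before the text is passed to a TTS system. The abstract does not specify the token format, so the `<focus>` and `</focus>` strings and the `mark_focus` helper are hypothetical names chosen for this example, not the paper's actual markup.

```python
def mark_focus(words, focus_index, open_tok="<focus>", close_tok="</focus>"):
    """Wrap the word at focus_index in control tokens and return the input string.

    The token strings are placeholders (an assumption); the paper's real
    control tokens may differ.
    """
    if not 0 <= focus_index < len(words):
        raise IndexError("focus_index is out of range")
    marked = list(words)  # copy so the caller's list is not modified
    marked[focus_index] = f"{open_tok} {marked[focus_index]} {close_tok}"
    return " ".join(marked)


# Contrastive focus on "blue", e.g. correcting a listener who assumed "red".
sentence = "I said the blue car".split()
marked_text = mark_focus(sentence, 3)
print(marked_text)
```

Keeping the control tokens as separate whitespace-delimited symbols is only one possible design; it leaves the focused word itself unchanged, so a model could in principle learn the tokens independently of the vocabulary.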

Our evaluation compares synthetic and natural utterances and shows that the prosodic patterns of contrastive focus (variations of F0, intensity and duration) can be learned accurately. Such a milestone is important because it would allow, for example, the output prosody of smart speakers to be controlled programmatically.
