StacMR: Scene-text aware cross-modal retrieval

Published by Claudia Heyer on 5 January 2021

Andres Mafla, Rafael Sampaio De Rezende, Lluis Gomez, Diane Larlus, Dimosthenis Karatzas

Winter Conference on Applications of Computer Vision (WACV), virtual event, 5-9 January, 2021

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, among others. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.

In this work, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further.
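To make the idea of a common embedding space concrete, here is a minimal sketch of caption-to-image retrieval by cosine similarity. It is an illustration only, not the method of the paper: the function names, the `weight` parameter, and the weighted-sum fusion of a visual vector with a scene-text vector are all assumptions made for this example, and the vectors are toy values rather than outputs of any real encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse_image_embedding(visual_vec, scene_text_vec, weight=0.5):
    """Hypothetical fusion (an assumption, not the paper's model):
    a weighted sum of the visual and scene-text vectors, both assumed
    to already live in the same embedding space."""
    return [(1 - weight) * v + weight * t
            for v, t in zip(visual_vec, scene_text_vec)]


def rank_images(caption_vec, image_vecs):
    """Return image indices sorted from most to least similar to the caption."""
    scores = [(cosine(caption_vec, vec), idx)
              for idx, vec in enumerate(image_vecs)]
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Toy example: three image embeddings and one caption embedding.
images = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
caption = [1.0, 0.0]
ranking = rank_images(caption, images)
```

In this toy setup the second image is closest to the caption, so it is ranked first. Image-to-caption retrieval works the same way with the roles of the two modalities swapped.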

COCO-Text Captioned dataset (CTC):

The dataset we use to evaluate the StacMR task consists of a subset of the MS COCO dataset that is equipped with both scene-text and caption annotations. Both annotations were collected independently, so not all semantic descriptions of an image are dependent on its text content, as is commonly the case in the wild.

We provide experiments on two test set splits to evaluate cross-modal models: CTC-1k, for which scene text appears directly in an image’s captions, and CTC-5k, for which it does not necessarily appear.
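The distinction between the two splits can be illustrated with a small check of whether any scene-text word of an image also occurs in its captions. This is a sketch under assumptions: the exact criterion used to build CTC-1k is not given here, so the case-insensitive exact token match below, as well as the function names and sample strings, are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def scene_text_in_captions(scene_text_words, captions):
    """Return True if at least one scene-text token also appears in a caption.

    Assumption: overlap is tested by exact, case-insensitive token match.
    """
    scene_tokens = set()
    for word in scene_text_words:
        scene_tokens |= tokenize(word)
    caption_tokens = set()
    for caption in captions:
        caption_tokens |= tokenize(caption)
    return bool(scene_tokens & caption_tokens)


# Made-up examples: the first image's scene text is mentioned in its caption,
# the second image's scene text is not.
overlap = scene_text_in_captions(["STOP"], ["A red stop sign on a street corner."])
no_overlap = scene_text_in_captions(["Heineken"], ["A man drinking beer at a bar."])
```

Under this assumed criterion, an image like the first would be a candidate for a CTC-1k style split, while an image like the second would only fit a split where scene text does not necessarily appear in the captions.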

@InProceedings{mafla2021stacmr,
author = {Mafla, Andres and Rezende, Rafael S. and Gomez, Lluis and Larlus, Diane and Karatzas, Dimosthenis},
title = {StacMR: Scene-text Aware Cross-modal Retrieval},
booktitle = {WACV},
year = {2021}
}
