StacMR: Scene-Text Aware Cross-Modal Retrieval

 Winter Conference on Applications of Computer Vision (WACV ’21)

Andrés Mafla         Rafael S. Rezende          Lluís Gómez         Diane Larlus          Dimosthenis Karatzas


Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to name a few. This has improved the matching between the visual representation of an image and the textual representation of its caption. Yet current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.

In this work, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further.


COCO-Text Captioned dataset (CTC):

The dataset we use to evaluate the StacMR task consists of a subset of the MS COCO dataset that is equipped with both scene-text and caption annotations. Both annotations were collected independently, so not all semantic descriptions of an image depend on its text content, as is commonly the case in the wild.

We provide two test set splits to evaluate cross-modal models: CTC-1k, for which scene text appears directly in an image’s captions, and CTC-5k, for which it does not necessarily appear.
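
To make the distinction between the two kinds of images concrete, here is a minimal Python sketch. It is not the procedure used to build CTC: the toy annotations, the `tokenize` helper, and the word-overlap criterion in `scene_text_in_captions` are all assumptions made for illustration. It keeps the images that have both caption and scene-text annotations, then flags those whose scene text is explicitly mentioned in a caption.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def scene_text_in_captions(scene_words, captions):
    """Return True if any scene-text token also occurs in at least one caption.

    Simple word overlap is an assumed criterion, chosen for illustration only.
    """
    caption_tokens = set()
    for caption in captions:
        caption_tokens |= tokenize(caption)
    scene_tokens = set()
    for word in scene_words:
        scene_tokens |= tokenize(word)
    return bool(scene_tokens & caption_tokens)


# Hypothetical toy annotations, keyed by image id.
captions = {
    1: ["A red stop sign on a street corner."],
    2: ["A man riding a bike past a shop."],
    3: ["Two dogs play in a park."],
}
scene_text = {
    1: ["STOP"],
    2: ["Bakery"],
    4: ["EXIT"],
}

# Images equipped with both kinds of annotation.
both = sorted(captions.keys() & scene_text.keys())

# Images whose scene text is explicitly mentioned in a caption.
explicit = [i for i in both if scene_text_in_captions(scene_text[i], captions[i])]

print("annotated with both:", both)
print("scene text in captions:", explicit)
```

In this toy data, image 2 has scene text ("Bakery") that none of its captions mention, which is the situation the larger split allows for.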


Paper accepted at Winter Conference on Applications of Computer Vision (WACV) 2021.


author = {Mafla, Andres and Rezende, Rafael S. and Gomez, Lluis and Larlus, Diane and Karatzas, Dimosthenis},
title = {StacMR: Scene-text Aware Cross-modal Retrieval},
booktitle = {WACV},
year = {2021}
