Querying a database using an example image is a simple and intuitive interface to retrieve information from a database of images. Consequently, instance-level image retrieval has been heavily studied in the computer vision community over the last decade. One problem that has been overlooked, though, is the retrieval of similar visual scenes, where the retrieved images do not exhibit the same object instance as the query image but share the same semantics. In this paper, we define the task of semantic image retrieval, and show through a user study that, despite its subjective nature, it is consistently implemented across a pool of human annotators. Our study also shows that region-level captions constitute a good proxy for semantic similarity. Following this observation, we leverage human captions to learn a global image representation that is compact and still performs well at the semantic retrieval task. We also show that we can jointly train visual and textual embeddings that allow querying with both images and text (even though the database has no textual annotations) and performing arithmetic operations in this joint embedding space. As a by-product of the learning, the network can be used to visualize which regions contributed the most to the similarity between two images, allowing us to interpret our semantic retrieval results in a visual and intuitive way.
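To make the joint-embedding idea concrete, the sketch below illustrates how retrieval and embedding arithmetic could work once visual and textual embeddings live in a shared space. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins, and all names (`database`, `query_image`, `query_text`) are hypothetical.

```python
import numpy as np

def normalize(v):
    # L2-normalize so that dot products become cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings in a shared visual-textual space
# (random placeholders; in practice these come from the trained network).
rng = np.random.default_rng(0)
database = normalize(rng.normal(size=(1000, 128)))   # image embeddings
query_image = normalize(rng.normal(size=128))        # embedded query image
query_text = normalize(rng.normal(size=128))         # embedded query text

# Arithmetic in the joint space: combine an image query with a text query
# by summing the two unit vectors and re-normalizing.
combined = normalize(query_image + query_text)

# Retrieve the nearest database images by cosine similarity; note the
# database itself needs no textual annotation.
scores = database @ combined
top5 = np.argsort(-scores)[:5]
```

With a single shared metric (cosine similarity), the same nearest-neighbor search serves image-only, text-only, and combined image+text queries.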