ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity

Problem Statement

An intuitive way to search for images is to use queries composed of an example image and a complementary text. While the former provides rich but implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image. This is the problem of image search with free-form text modifiers, which amounts to ranking a collection of images by relevance with respect to a bi-modal query.

This task sits at the intersection of two frameworks: cross-modal retrieval and visual search.

Overview

Current approaches typically combine the features of each of the two elements of the query into a single representation, which can then be compared to the ones of the potential target image [1,2,3].

Our work aims to shed new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval. Taking inspiration from them, we exploit the specific relation of each query element with the targeted images and derive lightweight text-guided attention mechanisms that mediate between the two complementary modalities. We validate our approach on several retrieval benchmarks [4,5,6], querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures, as in previous works.

Our method

Figure: Overview of the ARTEMIS method.

Two independent modules

  • Our Explicit Matching (EM) module measures the compatibility of potential target images with the textual requirements.
  • Our Implicit Similarity (IS) module considers the relevance of the target images with respect to the properties of the reference image implied by the textual modifier.
  • AIS, AEM: text-guided attention mechanisms (MLP) that select the visual cues which should be emphasized during matching.
  • Tr: linear transformation (fully-connected layer) to project in the visual space.
  • Both modules are trained jointly with the batch-based classification loss.
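The scoring strategy described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the learned attention MLPs (AEM, AIS) are reduced to single sigmoid-gated linear maps, Tr to one matrix, and all weights are random stand-ins. It shows how the Explicit Matching and Implicit Similarity scores combine, and how the batch-based classification loss treats in-batch candidates as classes.

```python
import numpy as np

d = 8                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Hypothetical random weights standing in for the learned AEM, AIS, and Tr modules.
W_em = rng.normal(size=(d, d))          # attention for Explicit Matching
W_is = rng.normal(size=(d, d))          # attention for Implicit Similarity
W_tr = rng.normal(size=(d, d))          # projection of the text into the visual space

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def cosine(a, b):
    # cosine similarity along the last axis, broadcasting over candidates
    return np.sum(l2norm(a) * l2norm(b), axis=-1)

def artemis_score(ref, text, targets):
    """Score each candidate target for a (reference image, text modifier) query.

    ref:     (d,)  reference-image embedding
    text:    (d,)  text-modifier embedding
    targets: (n, d) candidate target-image embeddings
    """
    a_em = sigmoid(text @ W_em)                  # text-guided attention (EM)
    a_is = sigmoid(text @ W_is)                  # text-guided attention (IS)
    s_em = cosine(text @ W_tr, a_em * targets)   # target vs. textual requirements
    s_is = cosine(a_is * ref, a_is * targets)    # target vs. attended reference
    return s_em + s_is                           # (n,) final relevance scores

def batch_classification_loss(scores, pos_index):
    """Cross-entropy over in-batch candidates: the true target is the 'class'."""
    logits = scores - scores.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_index])
```

In this toy form, ranking a batch is just `np.argsort(-artemis_score(ref, text, targets))`; in the actual model, the attention maps operate over spatial visual features rather than a single pooled vector.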
Figure: EM and IS attention heatmaps and qualitative retrieval results on the Shoes dataset.

Ablative results

Figure: Ablative results (Recall@50).

Take-home messages

  • ARTEMIS combines cross-modal and visual search scoring strategies, making them compatible for image search with text modifiers.
  • ARTEMIS models all pairwise interactions, including those with the target image, at little extra cost.
  • ARTEMIS is versatile: it works with different visual and textual encoders, on different domains.

References

  1. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. CVPR 2019.
  2. Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. CVPR 2020.
  3. Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation for image retrieval with text feedback. CVPR 2021.
  4. Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. CVPR 2021.
  5. Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based interactive image retrieval. NeurIPS 2018.
  6. Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. ICCV 2021.

News

  • Paper accepted at the International Conference on Learning Representations (ICLR) 2022.

Bibtex

@inproceedings{delmas2022artemis,
  title={ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity},
  author={Delmas, Ginger and Rezende, Rafael S and Csurka, Gabriela and Larlus, Diane},
  booktitle={International Conference on Learning Representations},
  year={2022}
}
