ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity

Problem Statement

An intuitive way to search for images is to use queries composed of an example image and a complementary text. While the former provides rich but implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image. This is the problem of image search with free-form text modifiers, which amounts to ranking a collection of images by relevance with respect to a bi-modal query.

This task sits at the intersection of two frameworks: cross-modal retrieval and visual search.

Overview

Current approaches typically combine the features of each of the two elements of the query into a single representation, which can then be compared to the ones of the potential target image [1,2,3].

Our work aims to shed new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval. Taking inspiration from them, we exploit the specific relation of each query element with the targeted images and derive lightweight text-guided attention mechanisms that mediate between the two complementary modalities. We validate our approach on several retrieval benchmarks [4,5,6], querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures, as in previous works.

Our method

Figure: Overview of the ARTEMIS method.

Two independent modules

  • Our Explicit Matching (EM) module measures the compatibility of potential target images with the textual requirements.
  • Our Implicit Similarity (IS) module considers the relevance of the target images with respect to the properties of the reference image implied by the textual modifier.
  • AIS, AEM: text-guided attention mechanisms (MLP) that select the visual cues which should be emphasized during matching.
  • Tr: linear transformation (fully-connected layer) to project in the visual space.
  • Both modules are trained jointly with the batch-based classification loss.
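The scoring strategy described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the learned attention MLPs (AEM, AIS) are reduced to single sigmoid-gated linear maps, Tr to one matrix, and all weights are random stand-ins. It shows how the Explicit Matching and Implicit Similarity scores combine, and how the batch-based classification loss treats in-batch candidates as classes.

```python
import numpy as np

d = 8                                   # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Hypothetical random weights standing in for the learned AEM, AIS, and Tr modules.
W_em = rng.normal(size=(d, d))          # attention for Explicit Matching
W_is = rng.normal(size=(d, d))          # attention for Implicit Similarity
W_tr = rng.normal(size=(d, d))          # projection of the text into the visual space

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def cosine(a, b):
    # cosine similarity along the last axis, broadcasting over candidates
    return np.sum(l2norm(a) * l2norm(b), axis=-1)

def artemis_score(ref, text, targets):
    """Score each candidate target for a (reference image, text modifier) query.

    ref:     (d,)  reference-image embedding
    text:    (d,)  text-modifier embedding
    targets: (n, d) candidate target-image embeddings
    """
    a_em = sigmoid(text @ W_em)                  # text-guided attention (EM)
    a_is = sigmoid(text @ W_is)                  # text-guided attention (IS)
    s_em = cosine(text @ W_tr, a_em * targets)   # target vs. textual requirements
    s_is = cosine(a_is * ref, a_is * targets)    # target vs. attended reference
    return s_em + s_is                           # (n,) final relevance scores

def batch_classification_loss(scores, pos_index):
    """Cross-entropy over in-batch candidates: the true target is the 'class'."""
    logits = scores - scores.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[pos_index])
```

In this toy form, ranking a batch is just `np.argsort(-artemis_score(ref, text, targets))`; in the actual model, the attention maps operate over spatial visual features rather than a single pooled vector.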
Figure: EM and IS attention heatmaps and qualitative retrieval results on the Shoes dataset.

Ablative results

Figure: Ablative results (Recall@50).

Take-home messages

  • ARTEMIS combines cross-modal and visual search scoring strategies, making them compatible for image search with text modifiers.
  • ARTEMIS models all pairwise interactions, including those with the target image, at little extra cost.
  • ARTEMIS is versatile: it works with different visual and textual encoders, on different domains.

References

  1. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. CVPR 2019.
  2. Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. CVPR 2020.
  3. Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation for image retrieval with text feedback. CVPR 2021.
  4. Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. CVPR 2021.
  5. Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based interactive image retrieval. NeurIPS 2018.
  6. Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. ICCV 2021.

News

  • Paper accepted at the International Conference on Learning Representations (ICLR) 2022.

Bibtex

@inproceedings{delmas2022artemis,
  title={ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity},
  author={Delmas, Ginger and Rezende, Rafael S and Csurka, Gabriela and Larlus, Diane},
  booktitle={International Conference on Learning Representations},
  year={2022}
}
