ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity

ICLR 2022




@inproceedings{delmas2022artemis,
  title={ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity},
  author={Delmas, Ginger and Rezende, Rafael S. and Csurka, Gabriela and Larlus, Diane},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

Problem Statement

An intuitive way to search for images is to use queries composed of an example image and a complementary text. While the former provides rich and implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image. This is the problem of image search with free-form text modifiers, which consists in ranking a collection of images by relevance with respect to a bi-modal query.

This task lies at the intersection of two familiar frameworks: cross-modal (text-to-image) search and visual (image-to-image) search.


Current approaches typically combine the features of the two elements of the query into a single representation, which can then be compared to those of the potential target images [1,2,3].

Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval. Taking inspiration from them, we exploit the specific relation of each query element with the targeted images, and derive lightweight text-guided attention mechanisms that mediate between the two complementary modalities. We validate our approach on several retrieval benchmarks [4,5,6], querying with images and their associated free-form text modifiers. Our method obtains state-of-the-art results without resorting to side information, multi-level features, heavy pre-training, or large architectures, as in previous works.

Our method: two independent modules

ARTEMIS method
  • Our Explicit Matching (EM) module measures the compatibility of potential target images with the textual requirements.
  • Our Implicit Similarity (IS) module considers the relevance of the target images with respect to the properties of the reference image implied by the textual modifier.
  • A_IS, A_EM: text-guided attention mechanisms (MLPs) that select the visual cues which should be emphasized during matching.
  • Tr: linear transformation (fully-connected layer) to project in the visual space.
  • Both modules are trained jointly with the batch-based classification loss.
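The two scores and the training loss described above can be sketched in pure Python. This is a minimal toy illustration, not the actual implementation: the dimensions, parameter names (`W_em`, `W_is`, `Tr`, etc.) and the single-layer sigmoid gates standing in for the A_EM / A_IS attention MLPs are all assumptions; the real model uses learned image and text encoders.

```python
import math
import random

random.seed(0)
D = 8  # toy embedding dimension (the real encoders are much larger)

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-9
    nv = math.sqrt(sum(b * b for b in v)) or 1e-9
    return dot / (nu * nv)

def mlp_gate(t, W, b):
    # Single-layer sigmoid gate on the text embedding; a stand-in for
    # the A_EM / A_IS text-guided attention mechanisms.
    return [1.0 / (1.0 + math.exp(-(sum(wi * ti for wi, ti in zip(row, t)) + bi)))
            for row, bi in zip(W, b)]

def artemis_score(r, t, x, params):
    # r: reference image embedding, t: text embedding, x: target embedding.
    # Explicit Matching: projected text vs. text-attended target.
    a_em = mlp_gate(t, params["W_em"], params["b_em"])
    tr_t = [sum(wi * ti for wi, ti in zip(row, t)) for row in params["Tr"]]
    s_em = cosine(tr_t, [a * xi for a, xi in zip(a_em, x)])
    # Implicit Similarity: attended reference vs. attended target.
    a_is = mlp_gate(t, params["W_is"], params["b_is"])
    s_is = cosine([a * ri for a, ri in zip(a_is, r)],
                  [a * xi for a, xi in zip(a_is, x)])
    return s_em + s_is

def batch_classification_loss(queries, targets, params):
    # Batch-based classification: each (reference, text) query must
    # "classify" its own target among all targets in the batch
    # (cross-entropy over in-batch scores).
    loss = 0.0
    for i, (r, t) in enumerate(queries):
        scores = [artemis_score(r, t, x, params) for x in targets]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss += log_z - scores[i]
    return loss / len(queries)

# Toy usage with random parameters and a batch of 4 query/target pairs.
rand_vec = lambda: [random.uniform(-1, 1) for _ in range(D)]
rand_mat = lambda: [rand_vec() for _ in range(D)]
params = {"W_em": rand_mat(), "b_em": rand_vec(), "Tr": rand_mat(),
          "W_is": rand_mat(), "b_is": rand_vec()}
queries = [(rand_vec(), rand_vec()) for _ in range(4)]
targets = [rand_vec() for _ in range(4)]
print(batch_classification_loss(queries, targets, params))
```

Since both scores are cosine similarities, each ARTEMIS score lies in [-2, 2], and the two modules contribute on a comparable scale before being summed.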
EM and IS heatmaps on Shoes

We provide some Grad-CAM [7] visualisations of image parts contributing the most to the EM and IS scores.

The EM module addresses the image parts that are the most related to the caption (the slits on the side).

The IS module attends to visual cues that are shared between the reference and the target images (sole shape & heel).

Ablative results

Figure 3: ablative results (Recall@50).

We evaluate ARTEMIS on three datasets.

  • Visual search and cross-modal search results reveal that these datasets can be either text-centric (as FashionIQ [4]) or image-centric (as Shoes [5,8] and CIRR [6]).
  • An ablation of our modules shows that adding our attention mechanisms helps each module improve upon its respective baseline (cross-modal and visual search).
  • Combining our two modules outperforms using either one alone, on both image-centric and text-centric datasets. This shows the complementarity of our modules.

The resulting ARTEMIS model outperforms its attention-free version, referred to here as "late fusion", which simply combines the scores of visual and cross-modal search. This is especially true on the CIRR dataset, and shows the importance of the attention mechanisms in making our two modules compatible.
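For reference, the attention-free "late fusion" baseline mentioned above simply sums the two search scores. A minimal sketch, assuming precomputed (hypothetical) embeddings for the reference image, the text modifier, and a candidate target:

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    denom = (math.sqrt(sum(a * a for a in u)) *
             math.sqrt(sum(b * b for b in v))) or 1e-9
    return dot / denom

def late_fusion_score(r, t, x):
    # Attention-free baseline: sum of the cross-modal score
    # (text vs. target) and the visual score (reference vs. target).
    return cosine(t, x) + cosine(r, x)

# A target identical to both query elements gets the maximal score of 2.
print(late_fusion_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))
```

ARTEMIS keeps this same two-score structure but re-weights the visual features with its text-guided attention before each comparison, which is what makes the two scores compatible.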

Take-home messages

  • ARTEMIS combines cross-modal and visual search scoring strategies, making them compatible for image search with text modifiers.
  • ARTEMIS models all pairwise interactions, including with the target image, without a large extra cost.
  • ARTEMIS is versatile: it works with different visual and textual encoders, on different domains.


  1. Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. CVPR 2019.
  2. Yanbei Chen, Shaogang Gong, and Loris Bazzani. Image search with text feedback by visiolinguistic attention learning. CVPR 2020.
  3. Seungmin Lee, Dongwan Kim, and Bohyung Han. Cosmo: Content-style modulation for image retrieval with text feedback. CVPR 2021.
  4. Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. CVPR 2021.
  5. Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based interactive image retrieval. NeurIPS 2018.
  6. Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. ICCV 2021.
  7. Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. ICCV 2017.
  8. Tamara L. Berg, Alexander C. Berg, Jonathan Shih. Automatic Attribute Discovery and Characterization from Noisy Web Data. ECCV 2010.
