Deep Image Retrieval - NAVER LABS Europe

DEEP IMAGE RETRIEVAL

In this project, we aim to learn end-to-end deep visual representations specifically for the task of instance-level image retrieval.

[Figure: Deep Image Retrieval illustration]

Principle:

The papers listed below tackle the problem of image retrieval and explore different ways to learn deep visual representations for this task. In all cases, a CNN extracts a feature map that is aggregated into a compact, fixed-length representation by a global-aggregation layer. This representation is then projected with a fully-connected layer and L2-normalized, so that images can be efficiently compared with the dot product.

[Figure: network architecture]
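As a rough illustration of this pipeline, the following NumPy sketch uses mean pooling as a stand-in for the trained global-aggregation layer and random weights in place of the trained CNN and projection; all shapes and names are illustrative, not the library's API:

```python
import numpy as np

def global_descriptor(feature_map, proj):
    """Aggregate a CNN feature map of shape (C, H, W) into a global descriptor.

    Mean pooling stands in for the global-aggregation layer; `proj` plays the
    role of the fully-connected projection (D_out, C). Both are illustrative.
    """
    pooled = feature_map.mean(axis=(1, 2))        # global aggregation -> (C,)
    projected = proj @ pooled                     # fully-connected projection
    return projected / np.linalg.norm(projected)  # L2 normalization

rng = np.random.default_rng(0)
fmap_a = rng.standard_normal((256, 14, 14))       # toy "CNN feature maps"
fmap_b = rng.standard_normal((256, 14, 14))
proj = rng.standard_normal((128, 256))            # toy projection weights

da = global_descriptor(fmap_a, proj)
db = global_descriptor(fmap_b, proj)
similarity = da @ db   # dot product of unit vectors = cosine similarity
```

Because the descriptors are L2-normalized, the dot product is bounded in [-1, 1] and ranking a database reduces to a single matrix-vector product.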

All components of this network, including the aggregation layer, are differentiable, which makes it end-to-end trainable. In [1,2], a three-stream Siamese architecture with a triplet loss was proposed to train this network.
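A minimal sketch of the triplet ranking loss on L2-normalized descriptors is shown below; the margin value is illustrative, not the one used in the papers:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet ranking loss over three descriptor streams.

    Pushes the anchor closer to the positive than to the negative by at
    least `margin` in Euclidean distance (margin value is illustrative).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)

# Toy unit-norm descriptors: the positive is near the anchor, the negative far.
anchor = np.array([1.0, 0.0])
positive = np.array([0.8, 0.6])
negative = np.array([0.0, 1.0])

loss_ok = triplet_loss(anchor, positive, negative)   # triplet already satisfied
loss_bad = triplet_loss(anchor, negative, positive)  # violating triplet
```

During training, the three streams share the same network weights, and the gradient of this loss flows back through all of them.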

In [3], this work was extended by replacing the triplet loss with a new loss that directly optimizes Average Precision (AP), illustrated below.

[Figure: AP-based listwise loss]
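For reference, the quantity being optimized is the Average Precision of the whole ranking induced by the query's similarities. The standard (non-differentiable) definition can be computed as below; the listwise loss in [3] optimizes a smooth approximation of it, which this sketch does not implement:

```python
import numpy as np

def average_precision(scores, labels):
    """Average Precision of a ranking.

    `scores` are similarities of database images to the query;
    `labels` are 1 for relevant images, 0 otherwise.
    """
    order = np.argsort(-scores)                       # rank by decreasing similarity
    hits = labels[order]
    cum_hits = np.cumsum(hits)
    precision_at_k = cum_hits / (np.arange(len(hits)) + 1)
    # Average the precision over the ranks of the relevant images.
    return float((precision_at_k * hits).sum() / hits.sum())

scores = np.array([0.9, 0.8, 0.7, 0.6])
labels = np.array([1, 0, 1, 0])
ap = average_precision(scores, labels)   # (1/1 + 2/3) / 2 = 5/6
```

Unlike the triplet loss, which only compares three images at a time, an AP-based loss scores the position of every relevant image in the full ranked list at once.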

Relevant papers:

[1] Deep Image Retrieval: Learning global representations for image search. Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus:
ECCV 2016 [pdf]

@inproceedings{gordo2016deep,
author = {Albert Gordo and Jon Almaz{\'{a}}n and J{\'{e}}rome Revaud and Diane Larlus},
title = {Deep Image Retrieval: Learning global representations for image search},
booktitle={ECCV},
year={2016}}

[2] End-to-end Learning of Deep Visual Representations for Image Retrieval. Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus:
International Journal of Computer Vision. Volume 124, Issue 2, September 2017.
arXiv preprint [pdf] [Publication entry]

@article{gordo2017end,
author = {Gordo, Albert and Almaz\'{a}n, Jon and Revaud, Jerome and Larlus, Diane},
title = {End-to-End Learning of Deep Visual Representations for Image Retrieval},
journal = {International Journal of Computer Vision},
issue_date = {September 2017},
volume = {124},
number = {2},
month = sep,
year = {2017},
pages = {237--254},
}

[3] Learning with Average Precision: Training Image Retrieval with a Listwise Loss. Jerome Revaud, Rafael S. Rezende, Cesar de Souza, Jon Almazan:
arXiv 2019 [pdf]

This last reference, [3], gives the best results and corresponds to the models shared below.

Library:

This GitHub repository links to a library, implemented in Python 3 and PyTorch 1.0, that covers the following two papers:

[2] End-to-end Learning of Deep Visual Representations for Image Retrieval Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus, IJCV 2017 [PDF]

[3] Learning with Average Precision: Training Image Retrieval with a Listwise Loss Jerome Revaud, Rafael S. Rezende, Cesar de Souza, Jon Almazan, arXiv 2019 [PDF]

Please note that [2] originally used R-MAC pooling [4] as the global-aggregation layer. However, due to its efficiency and better performance, we have replaced the R-MAC pooling layer with the Generalized-mean (GeM) pooling layer proposed in [5].
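GeM pooling computes, per channel, the p-th power mean of the activations: ((1/HW) Σ x^p)^(1/p). With p = 1 this reduces to average pooling, and as p grows it approaches max pooling. A small NumPy sketch (the choice p = 3 and the clipping epsilon are illustrative; in [5] p can also be learned by backpropagation):

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of a (C, H, W) feature map.

    Computes ((1/HW) * sum(x^p))^(1/p) per channel. Activations are
    clipped at `eps` since the power mean assumes non-negative inputs
    (e.g. post-ReLU features).
    """
    x = np.clip(feature_map, eps, None)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)

# One-channel toy feature map: GeM lies between average and max pooling.
fmap = np.array([[[1.0, 3.0],
                  [2.0, 2.0]]])
desc = gem_pool(fmap, p=3.0)   # between mean (2.0) and max (3.0)
```

Because the power mean is differentiable in both the activations and p, GeM can simply replace R-MAC in the architecture without breaking end-to-end training.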

If you’d like to compare against older versions of this work, the exact models used in [2] are still available in Caffe format (download the old model, evaluation script and dataset).

[4] Particular object retrieval with integral max-pooling of CNN activations. Tolias, G., Sicre, R., Jegou, H., ICLR 2016

[5] Fine-tuning CNN Image Retrieval with No Human Annotation. Radenovic, F., Tolias, G., Chum, O., TPAMI 2018