DEEP IMAGE RETRIEVAL
In this project, we aim to learn end-to-end deep visual representations specifically for the task of instance-level image retrieval.
Principle:
The papers listed below tackle the problem of image retrieval and explore different ways to learn deep visual representations for this task. In all cases, a CNN extracts a feature map that is aggregated into a compact, fixed-length representation by a global-aggregation layer. Finally, this representation is first projected with a fully-connected layer and then L2-normalized, so that images can be efficiently compared with the dot product.
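As a rough illustration, the following sketch shows what such a forward pass can look like in PyTorch. It is not the code of this library: the ResNet-50 backbone, the plain average pooling standing in for the global-aggregation layer, and the output dimension are placeholder assumptions.

import torch
import torch.nn.functional as F
import torchvision

class RetrievalNet(torch.nn.Module):
    """Backbone -> global aggregation -> FC projection -> L2 normalization."""
    def __init__(self, out_dim=2048):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # keep only the convolutional trunk (drop the classifier's avgpool and fc)
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
        self.proj = torch.nn.Linear(2048, out_dim)  # fully-connected projection

    def forward(self, x):
        fmap = self.backbone(x)               # (B, 2048, H, W) feature map
        desc = fmap.mean(dim=(-2, -1))        # global aggregation (plain mean pooling here)
        desc = self.proj(desc)                # compact, fixed-length descriptor
        return F.normalize(desc, p=2, dim=1)  # L2-normalized: dot product == cosine similarity

# Comparing two images then reduces to a dot product of their descriptors:
# similarity = (net(img_a) * net(img_b)).sum(dim=1)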
All components in this network, including the aggregation layer, are differentiable, which makes it end-to-end trainable. In the first papers [2,3], a Siamese architecture combining three streams (query, relevant and irrelevant image) with a triplet loss was proposed to train this network.
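A minimal sketch of such a triplet loss on L2-normalized descriptors might look as follows; the squared-distance formulation and the margin value are illustrative assumptions, not the exact setting of [2,3].

import torch
import torch.nn.functional as F

def triplet_loss(q, p, n, margin=0.1):
    """q, p, n: (B, D) L2-normalized descriptors of the query, a relevant image
    and an irrelevant image, produced by three streams sharing the same weights."""
    d_pos = ((q - p) ** 2).sum(dim=1)             # squared distance query <-> relevant
    d_neg = ((q - n) ** 2).sum(dim=1)             # squared distance query <-> irrelevant
    return F.relu(margin + d_pos - d_neg).mean()  # hinge: push negatives farther away than positives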
In the more recent [1], this work was extended by replacing the triplet loss with a new loss that directly optimizes Average Precision, as illustrated below.
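Since Average Precision relies on a non-differentiable ranking, [1] relaxes it through a quantization (soft histogram binning) of the similarity scores. The sketch below only conveys the idea; the bin count, the triangular kernel and the normalizations are assumptions and differ from the library's actual implementation.

import torch

def soft_binned_ap(scores, labels, nbins=25):
    """Differentiable approximation of Average Precision for one query.
    scores: (N,) similarities in [-1, 1]; labels: (N,) 1 for relevant items, 0 otherwise."""
    labels = labels.float()
    centers = torch.linspace(1, -1, nbins, device=scores.device)  # bin centers, highest score first
    delta = 2.0 / (nbins - 1)                                     # bin width
    # triangular kernel: soft assignment of every score to every bin, shape (nbins, N)
    w = torch.clamp(1 - (scores[None, :] - centers[:, None]).abs() / delta, min=0)
    pos = (w * labels[None, :]).sum(dim=1)                   # soft count of relevant items per bin
    tot = w.sum(dim=1)                                       # soft count of all items per bin
    prec = pos.cumsum(0) / tot.cumsum(0).clamp(min=1e-8)     # precision after each bin
    rec_inc = pos / labels.sum().clamp(min=1e-8)             # recall increment per bin
    return (prec * rec_inc).sum()                            # approximate AP

# the training loss is then 1 - AP, averaged over the queries of a batch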
Relevant papers:
[1] Learning with Average Precision: Training Image Retrieval with a Listwise Loss. Jerome Revaud, Rafael S. Rezende, Cesar de Souza, Jon Almazan:
ICCV 2019 [PDF]
This most recent reference corresponds to the code and models shared below.
@inproceedings{revaud2019learning,
author = {Jerome Revaud and Jon Almazan and Rafael Sampaio de Rezende and Cesar Roberto de Souza},
title = {Learning with Average Precision: Training Image Retrieval with a Listwise Loss},
booktitle={ICCV},
year={2019}}
[2] End-to-end Learning of Deep Visual Representations for Image Retrieval. Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus:
International Journal of Computer Vision. Volume 124, Issue 2, September 2017.
arXiv preprint [pdf] [Publication entry]
@article{gordo2017end,
author = {Gordo, Albert and Almaz\'{a}n, Jon and Revaud, Jerome and Larlus, Diane},
title = {End-to-End Learning of Deep Visual Representations for Image Retrieval},
journal = {International Journal of Computer Vision},
issue_date = {September 2017},
volume = {124},
number = {2},
month = sep,
year = {2017},
pages = {237--254},
}
[3] Deep Image Retrieval: Learning global representations for image search. Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus:
ECCV 2016 [pdf]
@inproceedings{gordo2016deep,
author = {Albert Gordo and Jon Almaz{\'{a}}n and J{\'{e}}rome Revaud and Diane Larlus},
title = {Deep Image Retrieval: Learning global representations for image search},
booktitle={ECCV},
year={2016}}
Library:
This GitHub repository links to a library that implements, in Python 3 and PyTorch 1.0, the two following papers:
[1] Learning with Average Precision: Training Image Retrieval with a Listwise Loss. Jerome Revaud, Rafael S. Rezende, Cesar de Souza, Jon Almazan, ICCV 2019 [PDF]
[2] End-to-end Learning of Deep Visual Representations for Image Retrieval. Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus, IJCV 2017 [PDF]
Please note that, originally, [2] used R-MAC pooling [4] as the global-aggregation layer. However, we have replaced the R-MAC pooling layer with the Generalized-mean pooling layer (GeM) proposed in [5], which is more efficient and performs better; a minimal sketch of GeM is given below.
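For illustration only, here is one way to write GeM pooling over a convolutional feature map; the fixed exponent p=3 is an assumption for this sketch (in [5], p can also be learned during training).

import torch

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial dimensions of a feature map.
    fmap: (B, C, H, W) -> (B, C). p=1 recovers average pooling; large p approaches max pooling."""
    return fmap.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)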
If you’d like to compare against older versions of the work, the exact models used in [2] are still available in Caffe format (Download old model, evaluation script and dataset).
[4] Particular object retrieval with integral max-pooling of CNN activations. Tolias, G., Sicre, R., Jegou, H., ICLR 2016
[5] Fine-tuning CNN Image Retrieval with No Human Annotation. Radenovic, F., Tolias, G., Chum, O., TPAMI 2018