The third article in our four-part series on CVPR. Fashion, landmarks and breaking the SIFT descriptor record!
Two papers in the field of image retrieval particularly attracted our attention this year. The first, by Kun He, Yan Lu and Stan Sclaroff, ‘Local Descriptors Optimized for Average Precision’, focuses on the task of matching small image patches. This matters for many applications such as visual localization and image retrieval in large databases. Incredibly, no deep approach had yet managed to beat the notorious hand-crafted SIFT descriptor at this task – at least not in terms of mAP on the large and well-known RomePatches dataset – even though SIFT has been around for almost two decades. Well, that record came to an end this year with Kun He et al.
As is often the case, the training loss plays a key role in the success of their approach. While the triplet loss has been widely used to train deep networks for ranking [1,2], the key idea here is to directly optimize the loss for Average Precision (AP). This is normally impossible since the AP metric is not differentiable. However, by extending a trick recently used for binary descriptors by Ustinova and Lempitsky (‘Learning Deep Embeddings with Histogram Loss’, Advances in Neural Information Processing Systems), the authors show that AP can be differentiated through several harmless approximations. And it works great: for the first time, a deep approach outperforms SIFT and all other hand-crafted descriptors, as well as all other deep approaches, on the task of image patch matching. This is very inspiring because this new loss can readily replace the triplet loss in many existing works [1,2]. Maybe it can further push the retrieval accuracy there too.
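To give a flavour of how a non-differentiable ranking metric can be optimized, here is a minimal PyTorch sketch of a histogram-based, differentiable AP approximation. It follows the general soft-binning idea rather than the authors’ exact formulation; the function names, bin count and numerical constants are our own illustrative choices.

```python
import torch

def soft_histogram(sims, bin_centers, delta):
    # Triangular kernel: each similarity is softly assigned to the bins
    # within +/- delta of it, which keeps the histogram differentiable.
    weights = 1.0 - (sims.unsqueeze(1) - bin_centers.unsqueeze(0)).abs() / delta
    return weights.clamp(min=0.0)                      # shape (N, num_bins)

def soft_average_precision(sims, labels, num_bins=25):
    """Differentiable AP approximation for a single query.
    sims   : (N,) cosine similarities of N candidates to the query, in [-1, 1]
    labels : (N,) 1.0 for matching (positive) candidates, 0.0 otherwise
    """
    labels = labels.float()
    # Bin centers ordered from highest to lowest similarity.
    bin_centers = torch.linspace(1.0, -1.0, num_bins, device=sims.device)
    delta = 2.0 / (num_bins - 1)                       # bin spacing

    w = soft_histogram(sims, bin_centers, delta)       # (N, num_bins)
    h_pos = (w * labels.unsqueeze(1)).sum(dim=0)       # positives per bin
    h_all = w.sum(dim=0)                               # all candidates per bin

    cum_pos = torch.cumsum(h_pos, dim=0)               # positives ranked so far
    cum_all = torch.cumsum(h_all, dim=0)               # candidates ranked so far

    precision = cum_pos / cum_all.clamp(min=1e-6)
    delta_recall = h_pos / labels.sum().clamp(min=1e-6)
    return (precision * delta_recall).sum()            # approximate AP

# Training would minimize 1 - soft_average_precision(sims, labels) per query.
```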
Another topic of interest for us is fashion retrieval. It’s a hard one because detecting and identifying clothing in unconstrained images can be a problem even for us humans! A recent CVPR paper by Wang et al., ‘Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification’, proposes a new approach based on recurrent neural networks (RNNs) that yields impressive results. The method jointly outputs clothing landmarks, clothing classification and the corresponding attributes. The key contribution is a novel deep grammar network that explicitly encodes prior knowledge about fashion clothes. For instance, it encodes the kinematic and symmetry relations between clothing landmarks: which landmarks are physically connected to each other (e.g. collar and sleeve are connected whilst sleeve and pants’ hems are not), and the left/right symmetry of each landmark (a toy encoding of such relations is sketched below). The grammar is approximately modelled with the RNN combined with attention mechanisms. The quantitative results and visualizations clearly demonstrate the benefit of both the grammar and the attention mechanisms. The approach largely outperforms the state of the art on standard datasets, including FashionNet (Liu et al., ‘DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations’) and more recent works such as Lu et al. (‘Fully-Adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification’) and Corbiere et al. (‘Leveraging Weakly Annotated Data for Fashion Image Retrieval and Label Prediction’). To us it’s a winner.
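As an illustration of the kind of prior knowledge such a grammar can express, here is a toy encoding of kinematic and symmetry relations between clothing landmarks as plain Python data structures. The landmark names and relations are hypothetical examples, not the actual grammar used in the paper.

```python
# Toy encoding of fashion-landmark priors (hypothetical landmark names).
LANDMARKS = ["left_collar", "right_collar", "left_sleeve", "right_sleeve",
             "left_hem", "right_hem"]

# Kinematic relations: pairs of landmarks that are physically connected,
# e.g. a collar connects to the sleeve on the same side, but a sleeve
# does not connect to a pants/skirt hem.
KINEMATIC_EDGES = [
    ("left_collar", "right_collar"),
    ("left_collar", "left_sleeve"),
    ("right_collar", "right_sleeve"),
    ("left_hem", "right_hem"),
]

# Symmetry relations: left/right counterparts expected to mirror each other.
SYMMETRY_PAIRS = [
    ("left_collar", "right_collar"),
    ("left_sleeve", "right_sleeve"),
    ("left_hem", "right_hem"),
]

def related(a, b):
    """Return True if two landmarks are linked by a kinematic edge."""
    return (a, b) in KINEMATIC_EDGES or (b, a) in KINEMATIC_EDGES
```

In the paper this kind of structure is not a hard-coded lookup: the grammar is folded into the network itself and approximated by the RNN with attention, as described above.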
The workshop on “Large-Scale Landmark Recognition: A Challenge” was about assessing and advancing the state of the art in recognizing and retrieving images of landmarks. A landmark is somewhat loosely defined as a natural or man-made artefact that’s easily recognizable to a large number of people, e.g. the Eiffel Tower in Paris.
Visual landmark recognition has many applications, including navigating photo albums and vision-based localization. Landmark retrieval is particularly challenging due to the large number of potential landmarks to be matched, as well as the fact that landmarks are very fine-grained and can be quite similar to one another. The challenge, which preceded the workshop, was to rank more than 1 million images for a set of queries. We came 4th using a simple and efficient deep model based on ResNet-101 [2], followed by diffusion-based re-ranking (Iscen et al., ‘Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations’). The winners described their methods, which consisted mainly of supervised deep feature banks combined with query-expansion techniques. The results of the challenge were impressive but showed that there’s still a lot of room for improvement in landmark retrieval approaches.
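Query expansion is one of the simplest of those re-ranking techniques. As a point of reference, here is a minimal sketch of average query expansion (AQE) over global image descriptors; it is a generic illustration rather than the winners’ pipeline, and the function name and choice of top_k are ours.

```python
import numpy as np

def average_query_expansion(query, db, top_k=10):
    """Minimal sketch of average query expansion (AQE) for retrieval.
    query : (D,)   L2-normalized global descriptor of the query image
    db    : (N, D) L2-normalized descriptors of the database images
    """
    scores = db @ query                          # cosine similarities
    nearest = np.argsort(-scores)[:top_k]        # indices of the top-k neighbours
    # New query = original query plus its top-k neighbours, re-normalized,
    # then used to score the database a second time.
    expanded = query + db[nearest].sum(axis=0)
    expanded /= np.linalg.norm(expanded)
    return db @ expanded                         # re-ranked similarity scores
```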
[1] Diane Larlus, Jon Almazan: Improving Deep Neural Nets for Person Re-ID.
[2] Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus: End-to-end Learning of Deep Visual Representations for Image Retrieval. International Journal of Computer Vision. Volume 124, Issue 2, September 2017.
About the authors:
Naila Murray is group lead of the Computer Vision research team.
Jérôme Revaud is a researcher in the Computer Vision team.
Part 1: CVPR, Pose Estimation
Part 2: CVPR, Embedded Vision and Visual Localization
Part 4: CVPR, 3D Scene Understanding