Recognizing individuals across different images, a task known as person re-identification, has become increasingly popular because of its many applications. Surveillance is one; robotic platforms for ambient intelligence are another, and the one closest to our interests at LABS.
We’ve recently devised a simple approach that outperforms the current best methods by a significant margin on four standard benchmarks. Below we explain why it works so well, and how our method implicitly learns what previous approaches have explicitly engineered into their architectures (such as attention modules or part detectors).
Person re-identification, or Re-ID, is the task of correctly identifying individuals across different images captured under varying conditions, such as different cameras within a surveillance network, or images acquired during different interactions with a robotic platform.
The overall goal is to build a suitable representation. This can be seen as an image fingerprint: if two representations are similar, the two corresponding images contain the same person, and if the representations are different, the two images show different people.
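To make the fingerprint idea concrete, here is a minimal sketch of comparing two representations. It assumes the fingerprints are plain lists of floats compared with cosine similarity; the `same_person` helper, the toy 4-dimensional vectors and the 0.8 threshold are illustrative assumptions, not values from our system.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors, in [-1, 1]."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def same_person(a, b, threshold=0.8):
    """Hypothetical decision rule: the threshold is an illustrative value."""
    return cosine_similarity(a, b) >= threshold

# Toy fingerprints (real representations have many more dimensions).
anchor = [0.9, 0.1, 0.3, 0.2]
same = [0.8, 0.2, 0.35, 0.15]   # close to the anchor
other = [0.1, 0.9, 0.2, 0.7]    # far from the anchor

print(same_person(anchor, same))
print(same_person(anchor, other))
```

In practice a Re-ID system usually ranks a gallery of images by similarity to a query rather than thresholding a single pair, but the comparison at its core is the same.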
Like many other computer vision tasks, this means building representations that satisfy several constraints: they must be discriminant enough to capture fine details, yet invariant to the transformations a person can undergo, as well as to changes in camera viewpoint, illumination or occlusion. On top of that, these representations need to be compact to be efficient. These three constraints are largely orthogonal to each other, which makes the trade-off difficult. One way to balance them is to use a learning mechanism to find the right representation for the task from an annotated training set.
In the case of person Re-ID this is particularly hard: the same person can change appearance drastically from one image to the next, yet only a few tiny details may differentiate two different people. You can see this in the pictures below.
As a consequence, most approaches focus on explicitly aligning the people in the images, both to correct for changes of scale and to deal with very strong differences in pose, such as matching a person sitting on a bike with the same person walking. To do this, these methods explicitly integrate part detectors, pose estimators or attention models into their solution.
Another challenge in Re-ID is that, although our representation is learned from a set of images during a training stage, we expect the system to later deal with people it has never seen before. This requires a training mechanism that generalizes well to previously unknown identities, which makes the task fundamentally different from, and more challenging than, classification.
We decided to tackle the problem with an approach that learns what it means for two images to represent the same person, without relying on a previously trained part detector or pose estimator. It results in a fixed-length representation that is learned to suit the trade-off described earlier.
The solution we came up with is very close to the one we proposed for the visual search of objects (see [A] and our project page on Deep Image Retrieval).
Yet, when we applied this approach to the Re-ID problem, it became clear that the details matter: they have a huge impact on overall results. After a careful analysis, we observed that several key aspects need to be properly combined. They boil down to these two key principles:
Once all the details are properly set for the architecture and the training mechanism itself, here’s what we get: a simple deep neural network that takes an image as input and outputs a one-dimensional vector. It’s composed of convolutional layers based on the recent and successful ResNet architecture, followed by a global max-pooling layer, an embedding and a normalization step. The weights of this network are trained successively with a classification loss and then a ranking loss. For the ranking loss, a Siamese architecture simultaneously considers three images (two images of the same person, and one image of a different person).
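The steps that follow the convolutional layers, and the three-image ranking loss, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not our implementation: the feature map is a small nested list standing in for ResNet activations, the `describe` helper is a hypothetical name, and the hinge form of the loss on dot-product similarities with a margin of 0.1 is one common way to write a triplet loss.

```python
import math

def global_max_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to C values,
    keeping the strongest activation of each channel."""
    return [max(max(row) for row in channel) for channel in feature_map]

def embed(pooled, weights, bias):
    """Linear embedding layer; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def describe(feature_map, weights, bias):
    """Hypothetical helper: pooling, then embedding, then normalization."""
    return l2_normalize(embed(global_max_pool(feature_map), weights, bias))

def dot(a, b):
    """Similarity between two unit-length representations."""
    return sum(x * y for x, y in zip(a, b))

def triplet_ranking_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss that is zero once the same-person image is more similar
    to the anchor than the other-person image by at least `margin`
    (the margin value is an illustrative assumption)."""
    return max(0.0, margin + dot(anchor, negative) - dot(anchor, positive))

# Toy feature map with 2 channels of size 2 x 2.
feature_map = [[[0.1, 0.7], [0.3, 0.2]],
               [[0.5, 0.4], [0.9, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in embedding weights
vector = describe(feature_map, identity, [0.0, 0.0])
print(vector)
```

During training, the loss would be computed on the three representations produced by the shared network, so that a well-ordered triplet contributes nothing and a badly ordered one is penalized.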
Here is what the final architecture looks like:
We compared our approach to the most recent ones on the benchmark datasets Market-1501 and DukeMTMC-ReID. Despite the simplicity of our method, the results show that it clearly outperforms all previous methods. We made the same observation on the recent Person Search dataset.
To get a better understanding of why a global image representation with no explicit localization mechanism performs so well, we devised a visualization method. It’s inspired by the Grad-CAM [B] visualization, which highlights the image regions that are activated when a network predicts a visual concept.
In our case we wanted to understand which parts of the image contribute most to the decision that two images are of the same person. The visualization selects the 5 dimensions of our representation that contribute most to the similarity between the two image representations. Using a technique close to Grad-CAM, it then highlights the corresponding regions, which you can see in the figure below.
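The selection step can be sketched as follows. The sketch assumes the similarity between two representations is their dot product, so that the contribution of each dimension is simply the product of the two values in that dimension; the function name and the toy 8-dimensional vectors are illustrative. The Grad-CAM-like step that maps each selected dimension back to image regions is not shown.

```python
def top_contributing_dimensions(query, match, k=5):
    """Return the indices of the k dimensions whose element-wise
    product contributes most to the dot-product similarity."""
    contributions = [q * m for q, m in zip(query, match)]
    ranked = sorted(range(len(contributions)),
                    key=lambda i: contributions[i],
                    reverse=True)
    return ranked[:k]

# Toy 8-dimensional representations of two matching images.
query = [0.5, 0.1, 0.0, 0.4, 0.6, 0.3, 0.2, 0.3]
match = [0.4, 0.0, 0.1, 0.6, 0.6, 0.1, 0.4, 0.2]

print(top_contributing_dimensions(query, match))  # → [4, 3, 0, 6, 7]
```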
Here are two examples of matching images.
In most examples we could make the following observations. First, the activated regions are fairly localized. Second, most of these localized regions focus on the people themselves, and more precisely on body regions that are indicative of their clothing, like the length of their trousers or shirt sleeves. Finally, some of the paired responses go beyond similarity in appearance and respond to each other at a more abstract, semantic level. One striking example is the strong response to the bag in the first image, which seems to pair with the response to the strap of the bag in the second image, where the bag itself is occluded. The network is able to match these two regions despite their totally different appearance because it has learned that this is a good clue that it’s the same person.
This shows that, contrary to what’s been tried so far, carefully trained global representations can actually learn the right level of generalization for a difficult recognition task like Re-ID without any explicit attention or part-detection mechanism.
Re-ID done right: towards good practices for person re-identification. Jon Almazan, Bojana Gajic, Naila Murray, Diane Larlus. arXiv, 2018.
[A] Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus: End-to-end Learning of Deep Visual Representations for Image Retrieval. International Journal of Computer Vision. Volume 124, Issue 2, September 2017.
[B] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra: Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017.