Person search, the task of recognizing individuals across different images, is also known as person re-identification. It has become increasingly popular because of its many applications: surveillance is one, but ambient intelligence on robotic platforms is another, and the one closest to our interests at LABS.
We recently devised a simple and easy-to-understand approach that outperforms the current best methods by a significant margin on four standard benchmarks. Below we explain why it works so well, and how our method implicitly learns what previous approaches have explicitly engineered into their architectures (such as attention modules or part detectors).
Person re-identification, or Re-ID, is the task of correctly identifying individuals across different images captured under varying conditions, such as different cameras within a surveillance network, or images acquired during different interactions with a robotic platform.
It’s a very popular field, with no fewer than 12 papers on the topic at ICCV17 last October, and 13 papers just before that at CVPR17 in July.
The overall goal is to build a suitable representation: a kind of image fingerprint such that, if two representations are similar, the two corresponding images contain the same person, and if they are different, the two images show different people.
Like many other computer vision tasks, this means building representations that satisfy several constraints: they must be discriminative and able to capture fine details, yet invariant to the transformations a person can undergo, as well as to changes in camera viewpoint, illumination or occlusion. On top of that, these representations need to be compact to be efficient. These three constraints are largely orthogonal to each other, which makes the trade-off difficult. One way to balance them is to use a learning mechanism to find the right representation for the task from an annotated training set.
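To make the fingerprint analogy above concrete, here is a minimal sketch (in Python/NumPy, with random vectors standing in for learned representations) of how such representations could be compared at test time: a query image is matched against a gallery by ranking cosine similarities between L2-normalized embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so that a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted from most to least similar to the query."""
    q = l2_normalize(query_emb)
    g = l2_normalize(gallery_embs)
    scores = g @ q                      # cosine similarities
    return np.argsort(-scores)

# Toy usage: random vectors stand in for learned image representations.
query = np.random.randn(128).astype(np.float32)
gallery = np.random.randn(1000, 128).astype(np.float32)
print(rank_gallery(query, gallery)[:5])  # indices of the 5 best candidate matches
```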
In the case of person Re-ID, finding such a representation is particularly tough, as the same person can drastically change appearance between two images, while only a few tiny details may differentiate two different people. You can see this in the pictures below.
As a consequence, most approaches focus on explicitly aligning the people in the images to compensate for changes in scale, but also to handle very strong differences in pose, such as matching a person sitting on a bike with the same person walking. For this, these methods explicitly integrate part detectors, pose estimators or attention models into their solution.
Another challenge in Re-ID is that, although the representation is learned from a set of images during a training stage, we expect the system to later deal with people it has never seen before. This requires a training mechanism that generalizes well to previously unknown identities, making the task fundamentally different from, and more challenging than, classification.
We decided to tackle the problem with an approach that learns what it means for two images to represent the same person, without relying on a previously trained part detector or pose estimator. It results in a fixed-length representation that is optimal for the trade-off described earlier.
The solution we came up with is very close to the one we proposed for the visual search of objects (see [A] and our project page on Deep Image Retrieval).
Yet, when we applied this approach to the Re-ID problem, it became obvious that the details matter, to the extent that they have a huge impact on the overall results. After running a careful analysis, we observed several key aspects that needed to be properly combined. They boil down to two key principles: the design of the architecture that produces the representation, and the way this architecture is trained.
Once all the details of the architecture and the training mechanism are properly set, here’s what we get: a simple deep neural network that takes an image as input and outputs a fixed-length vector. It is composed of convolutional layers based on the recent and successful ResNet architecture, a global max-pooling layer, followed by an embedding and a normalization step. The weights of this network are trained first with a classification loss and then with a ranking loss. For the ranking loss, a Siamese architecture simultaneously considers three images (two images of the same person, and one image of a different person).
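As a rough illustration of this pipeline, here is a minimal PyTorch sketch. It assumes torchvision’s ResNet-50 backbone; the embedding size, module names and number of identities are illustrative choices, not the exact values from the paper (older torchvision versions use `pretrained=True` instead of the `weights` argument).

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ReIDNet(nn.Module):
    def __init__(self, embed_dim=512, num_identities=751):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep only the convolutional layers (drop the average pooling and fc head).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.embed = nn.Linear(2048, embed_dim)                  # embedding step
        self.classifier = nn.Linear(embed_dim, num_identities)  # used only during the classification stage

    def forward(self, x):
        feats = self.backbone(x)                              # (B, 2048, H/32, W/32)
        pooled = F.adaptive_max_pool2d(feats, 1).flatten(1)   # global max-pooling
        return F.normalize(self.embed(pooled), p=2, dim=1)    # embedding + L2 normalization
```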
Here is what the final architecture looks like:
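To make the two training stages concrete, here is a minimal sketch that reuses the hypothetical ReIDNet module from the previous snippet. The losses shown (cross-entropy and PyTorch’s TripletMarginLoss), the margin and the optimizer settings are illustrative, not the exact recipe from the paper.

```python
import torch
import torch.nn.functional as F

model = ReIDNet()  # hypothetical module from the previous sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def classification_step(images, identity_labels):
    """Stage 1: train the network as an identity classifier."""
    emb = model(images)
    logits = model.classifier(emb)
    loss = F.cross_entropy(logits, identity_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

triplet = torch.nn.TripletMarginLoss(margin=0.3)  # margin value is illustrative

def ranking_step(anchor, positive, negative):
    """Stage 2: Siamese/triplet stage. `anchor` and `positive` show the same
    person, `negative` shows a different one."""
    loss = triplet(model(anchor), model(positive), model(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```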
We compared our approach to the most recent ones on the benchmark datasets Market-1501 and DukeMTMC-ReID. Despite the simplicity of our method, the results show that it clearly outperforms all previous methods. We made the same observation on the recent Person Search dataset.
To get a better understanding of why a global image representation with no explicit localization mechanism performs so well, we designed a visualization method. It is inspired by Grad-CAM [B], which highlights the image regions that contribute most to the prediction of a visual concept.
In our case we wanted to understand which parts of the image contribute most to the decision that two images show the same person. The visualization selects the 5 dimensions of our learned representation that contribute most to the similarity between the two image representations. Using a technique close to Grad-CAM, it then highlights the corresponding regions, which you can see in the figure below.
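As a rough sketch of how such a visualization could be implemented with the hypothetical ReIDNet module from earlier: since the embeddings are L2-normalized, the contribution of each dimension to the dot-product similarity is simply the element-wise product of the two embeddings; a Grad-CAM-style map is then obtained by back-propagating the selected dimensions to the last convolutional feature map. The exact procedure used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def top_contributing_dims(model, img_a, img_b, k=5):
    """Dimensions whose element-wise product contributes most to the
    dot-product similarity of the two (L2-normalized) embeddings."""
    with torch.no_grad():
        emb_a, emb_b = model(img_a), model(img_b)
    contrib = (emb_a * emb_b).squeeze(0)
    return contrib.topk(k).indices

def gradcam_map(model, img, dims):
    """Grad-CAM-style heat-map for the selected embedding dimensions."""
    feats = model.backbone(img)                         # (1, 2048, H, W)
    feats.retain_grad()                                 # keep gradients on the feature map
    pooled = F.adaptive_max_pool2d(feats, 1).flatten(1)
    emb = F.normalize(model.embed(pooled), p=2, dim=1)
    emb[0, dims].sum().backward()                       # back-propagate only the chosen dimensions
    weights = feats.grad.mean(dim=(2, 3), keepdim=True) # channel importance
    cam = torch.relu((weights * feats).sum(dim=1))      # (1, H, W) heat-map
    return (cam / (cam.max() + 1e-12)).detach()
```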
Here are two examples of matching images.
In most examples we could make the following observations. First, the activated regions are fairly localized. Second, most of these localized regions focus on the people themselves, and more precisely on body regions that are indicative of their clothing, such as the length of their trousers or shirt sleeves. Finally, some of the paired responses go beyond similarity in appearance and respond to each other at a more abstract, semantic level. One striking example is the strong response to the bag in the first image, which seems to pair with the response to the strap of the bag in the second image, where the bag itself is occluded. The network is able to match these two regions despite their totally different appearance because it has learned that this is a good clue that it is the same person.
This shows that, contrary to what has been tried so far, carefully trained global representations can actually learn the right level of generalization for a difficult recognition task like Re-ID, without any explicit attention or part-detection mechanism.
Jon Almazan, Bojana Gajic, Naila Murray, Diane Larlus. Re-ID done right: towards good practices for person re-identification. arXiv, 2018.
[A] Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus. End-to-end Learning of Deep Visual Representations for Image Retrieval. International Journal of Computer Vision, Volume 124, Issue 2, September 2017.
[B] Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, Dhruv Batra. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. ICCV 2017.