Humans are at the center of many visual scenes, making human understanding one of the most important challenges in computer vision. Human Mesh Recovery (HMR) is the task of estimating the 3D pose and shape representation of a person from images or videos. Rather than predicting only a handful of body joints, HMR estimates a full 3D human mesh that captures body pose, shape and position in 3D space. This rich representation enables machines to reason about how people move, interact and occupy their environment. As a result, HMR has become a key technology for applications ranging from robotics and autonomous navigation to virtual and augmented reality, motion capture, sports analytics, healthcare and human-computer interaction.
HMR has made remarkable progress over the last few years, and systems can now recover highly detailed 3D human meshes from a single image, including body pose, hand articulation, facial expressions and body shape (for example NLF (8)). These advances have been driven by larger datasets, more powerful neural network architectures and improved parametric human models. As a result, HMR has evolved from a relatively niche research topic into a practical technology that can be deployed in many real-world applications.
Despite all this progress, most HMR systems still follow a two-stage pipeline. In the first stage a person detector locates every individual in the image, after which a separate mesh recovery model is applied to each detected person independently. This approach benefits from strong object detection models and highly optimized single-person HMR networks, but it comes with several limitations. The computational cost grows with the number of people in the scene, interactions between individuals are only weakly modeled and estimating the relative 3D positions of multiple humans remains challenging.
To overcome these limitations, a new generation of multi-person HMR methods has emerged. Inspired by recent advances in object detection, these approaches aim to recover all human meshes directly from the full image in a single pass (for example Multi-HMR (1), SAT-HMR (5) and AiOS (4)). By jointly reasoning about all people in the scene, they can better handle crowded environments and complex human interactions while significantly improving efficiency. One-stage approaches nevertheless still face several difficulties. Many rely on simplified camera assumptions, which limits their ability to recover metric-scale 3D locations, and most are built on body models designed only for adults, which can lead to inaccurate estimates when children are present. As HMR systems move beyond pose estimation toward complete scene understanding, accurately localizing diverse humans of all ages in 3D space is becoming just as important as recovering their body pose and shape.
Multi-HMR 2 builds upon the foundations of Multi-HMR (1) but it pushes recovery a step further. While our original Multi-HMR demonstrated it was possible to recover expressive meshes for multiple people simultaneously in a single pass, Multi-HMR 2 focuses on the broader challenge of understanding who’s present in a scene, where they’re located in 3D space and how they can be tracked over time. Table 1 below recaps the differences between the original Multi-HMR model released in 2024 and Multi-HMR 2.
Examples of Multi-HMR 2 output. Compared to the previous version, Multi-HMR 2 provides a more complete understanding of human-centered scenes with improved 3D localization, greater robustness to crowded environments and more accurate pose and shape estimation. It also introduces identity tracking and child-aware body modeling – all in a single pass.
| Feature | Multi-HMR | Multi-HMR 2 |
|---|---|---|
| Single-shot multi-person HMR | YES | YES |
| Whole-body reconstruction | YES | YES |
| Camera-space localization | Limited | Improved |
| Camera estimation | NO | YES |
| Robust to overlapping people | Limited | Improved |
| Child-aware body modeling | NO | YES (Anny) |
| Identity tracking | NO | YES |
Table 1: Comparison of Multi-HMR and Multi-HMR 2. Multi-HMR 2 extends the original framework with improved 3D localization, camera estimation, child-aware body modeling and identity tracking while maintaining efficient single-shot multi-person reconstruction.
At its core, Multi-HMR 2 adopts a DETR-inspired (6) architecture, a family of models that has transformed object detection by reasoning globally over the entire image. Instead of relying on local image regions to detect people, the model uses a set of learned queries that compete to explain the humans present in the scene. This makes the approach very robust in crowded environments where people overlap, interact closely or partially occlude one another.
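The idea of queries that look at the whole image can be illustrated with a toy cross-attention step. This is a minimal sketch of generic scaled dot-product attention, not the Multi-HMR 2 architecture: the function names, the two-dimensional features and the use of the image tokens as both keys and values are assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Let every query attend over every image token.

    Assumption for this sketch: tokens serve as both keys and values,
    and there are no learned projection matrices.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in queries:
        # One score per image token: the query sees the whole image at once.
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        attn = softmax(scores)
        pooled = [sum(w * token[d] for w, token in zip(attn, tokens)) for d in range(dim)]
        outputs.append(pooled)
        weights.append(attn)
    return outputs, weights


# Toy data (hypothetical): three image tokens and two queries.
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]
outputs, weights = cross_attend(queries, tokens)
for index, attn in enumerate(weights):
    print(f"query {index} attends most to token {attn.index(max(attn))}")
```

Because each query pools evidence from all tokens rather than from a fixed local window, different queries can settle on different people even when those people overlap in the image.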
A major evolution comes from the integration of the open source parametric body model Anny (2). Most existing HMR systems are based on body models that were designed primarily for adults. As a consequence, children are often reconstructed as scaled-down adults positioned incorrectly in the scene. Anny addresses this limitation by representing the full diversity of human bodies, from infants to elders, within a single unified model. This allows Multi-HMR 2 to recover more realistic body shapes and improves the accuracy of 3D localization in scenes containing people of different ages and body proportions.
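To see why an adult-only body model can hurt localization, consider the pinhole relation between a person's metric height, their depth and their apparent height in pixels. The sketch below is purely illustrative and is not taken from Multi-HMR 2: the function names, the 1000-pixel focal length and the 1.1 m and 1.7 m body heights are all assumptions chosen for the example.

```python
def apparent_height_px(height_m, depth_m, focal_px):
    """Pinhole projection: pixel height of an upright person at a given depth."""
    return focal_px * height_m / depth_m


def depth_from_height(height_px, assumed_height_m, focal_px):
    """Invert the projection: depth implied by an assumed metric height."""
    return focal_px * assumed_height_m / height_px


# Assumed values, for illustration only.
FOCAL_PX = 1000.0
child_px = apparent_height_px(height_m=1.1, depth_m=3.0, focal_px=FOCAL_PX)

# Reading the same pixels as a 1.7 m adult pushes the person further away.
depth_if_adult = depth_from_height(child_px, assumed_height_m=1.7, focal_px=FOCAL_PX)
depth_if_child = depth_from_height(child_px, assumed_height_m=1.1, focal_px=FOCAL_PX)
print(f"depth assuming adult: {depth_if_adult:.2f} m, assuming child: {depth_if_child:.2f} m")
```

Under these assumptions, a child standing 3 m from the camera is placed more than 1.5 m too far away when the silhouette is interpreted as an adult, which is the kind of error an all-age body model is meant to avoid.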

Another important advancement in Multi-HMR 2 is the ability to estimate camera intrinsic parameters directly from the image. Previous multi-person HMR methods typically relied on fixed assumptions about the camera, which often led to inaccurate estimates of absolute distances and relative positions between people. By estimating camera parameters from the images when they are unknown, Multi-HMR 2 can recover human-centered scenes at a much more accurate metric scale, providing a more meaningful representation of how people are arranged in 3D space.
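The effect of a wrong camera assumption can be illustrated with the standard pinhole model. The following is a minimal sketch rather than the model's actual camera estimation procedure: `Intrinsics`, `project` and `backproject` are hypothetical helper names, and the focal lengths, principal point and 3D position are assumed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    focal_px: float  # focal length in pixels
    cx: float        # principal point, x
    cy: float        # principal point, y


def project(point, K):
    """Project a 3D camera-space point (metres) to pixel coordinates."""
    x, y, z = point
    return (K.focal_px * x / z + K.cx, K.focal_px * y / z + K.cy)


def backproject(u, v, depth_m, K):
    """Lift a pixel with known depth back to a 3D camera-space point."""
    return (
        (u - K.cx) * depth_m / K.focal_px,
        (v - K.cy) * depth_m / K.focal_px,
        depth_m,
    )


# Assumed values: the true camera, and a fixed guess used when intrinsics are unknown.
TRUE_K = Intrinsics(focal_px=1000.0, cx=640.0, cy=360.0)
ASSUMED_K = Intrinsics(focal_px=1500.0, cx=640.0, cy=360.0)

person = (1.0, 0.0, 5.0)  # 1 m to the right of the optical axis, 5 m away
u, v = project(person, TRUE_K)

recovered_true = backproject(u, v, 5.0, TRUE_K)
recovered_assumed = backproject(u, v, 5.0, ASSUMED_K)
print("with true intrinsics:   ", recovered_true)
print("with assumed intrinsics:", recovered_assumed)
```

Even with the correct depth, the wrong focal length shrinks the lateral offset by a third in this example, so distances between people would be distorted accordingly.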
We also revisited the model training procedure. Training DETR-based models usually requires computationally expensive matching strategies that compare every prediction with every target. Multi-HMR 2 introduces a simpler matching approach based solely on the 2D locations of people in the image. This 2D approach significantly reduces training cost, enables the use of additional image datasets without 3D mesh annotations and allows the entire model to be trained in about a week on a single GPU without penalising reconstruction quality.
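The exact matching rule is not spelled out here, so the sketch below only illustrates the general idea of assigning targets from 2D locations alone. It assumes, purely for the sake of the example, that each query is anchored to one cell of a regular grid over the image; the function name, the grid layout and the first-come collision rule are all hypothetical and should not be read as the Multi-HMR 2 procedure.

```python
def assign_targets_by_location(gt_centers, image_size, grid_size):
    """Map each ground-truth person to the query anchored at their 2D location.

    Returns {query_index: gt_index}. The cost is linear in the number of
    people: no pairwise prediction/target cost matrix is built.
    """
    width, height = image_size
    cols, rows = grid_size
    assignment = {}
    for gt_index, (u, v) in enumerate(gt_centers):
        # Clamp so that points on or outside the border still land in a valid cell.
        col = min(max(int(u / width * cols), 0), cols - 1)
        row = min(max(int(v / height * rows), 0), rows - 1)
        query_index = row * cols + col
        # Assumption: if two people fall in the same cell, the first one keeps it.
        assignment.setdefault(query_index, gt_index)
    return assignment


# Hypothetical 2D person centres in a 640x480 image with a 4x3 grid of queries.
centers = [(100.0, 100.0), (500.0, 400.0), (110.0, 120.0)]
matches = assign_targets_by_location(centers, image_size=(640, 480), grid_size=(4, 3))
print(matches)
```

An assignment of this kind needs only 2D annotations, which is consistent with the point above that images without 3D mesh labels can still contribute to training.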
Finally, Multi-HMR 2 goes beyond static 3D reconstruction. By distilling visual features from the SAM 2 (7) memory encoder, the model learns appearance representations that can be used to associate detections across frames. This makes Multi-HMR 2 the first human mesh recovery approach capable of performing tracking without requiring any video-based supervision during training. The result is a unified system that can detect, reconstruct, localize and track multiple people directly from visual observations.
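Associating detections across frames from appearance features can be sketched as a similarity-based assignment. The code below is a generic illustration under stated assumptions, not the tracker used in Multi-HMR 2: the embeddings are made-up two-dimensional vectors, and the greedy strategy, the `associate` function and the 0.5 similarity threshold are choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def associate(tracks, detections, min_similarity=0.5):
    """Greedily link detections in the current frame to existing tracks.

    tracks: {track_id: embedding}; detections: list of embeddings.
    Returns one track id per detection, opening new ids for unmatched ones.
    """
    candidates = sorted(
        (
            (cosine_similarity(embedding, detection), track_id, det_index)
            for track_id, embedding in tracks.items()
            for det_index, detection in enumerate(detections)
        ),
        reverse=True,
    )
    ids = [None] * len(detections)
    used_tracks = set()
    for similarity, track_id, det_index in candidates:
        if similarity < min_similarity:
            break  # remaining pairs are even less similar
        if track_id in used_tracks or ids[det_index] is not None:
            continue
        ids[det_index] = track_id
        used_tracks.add(track_id)
    next_id = max(tracks, default=-1) + 1
    for det_index in range(len(detections)):
        if ids[det_index] is None:
            ids[det_index] = next_id  # unmatched detection starts a new track
            next_id += 1
    return ids


# Hypothetical appearance embeddings for two known people and three new detections.
tracks = {0: [1.0, 0.0], 1: [0.0, 1.0]}
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
ids = associate(tracks, detections)
print(ids)
```

The first two detections inherit the identities of the tracks they resemble most, while the third matches neither and opens a new track.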
Multi-HMR 2 achieves strong results across a number of benchmarks and tasks (9). It improves pelvis-centered 3D reconstruction, camera-space localization (with and without ground-truth camera parameters), detection accuracy and recall in crowded or occluded scenes, 2D keypoint reprojection and multi-person tracking. This shows that Multi-HMR 2 is not only accurate at recovering human meshes but also effective at understanding where people are in the scene and maintaining their identities over time.
Multi-HMR 2 brings us closer to practical, human-centered scene understanding by combining multi-person detection and pose/shape estimation, accurate 3D localization, child-aware body modeling and identity tracking in a single framework. Looking ahead, we aim to further improve the reconstruction of expressive body parts such as hands and faces. We also plan to leverage temporal and multi-view information to make predictions even more accurate and robust.
1: Multi-HMR: Multi-person whole body human mesh recovery in a single shot, Fabien Baradel et al., ECCV 2024
2: Human mesh modeling for Anny body, Romain Bregier et al., arXiv:2511.03589
3: Anny-Fit: all-age human mesh recovery, Laura Bravo-Sanchez et al., CVPR 2026 Findings Track
4: AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation, Qingping Sun et al., CVPR 2024
5: SAT-HMR: Real-time multi-person 3D mesh estimation via scale-adaptive tokens, Chi Su et al., CVPR 2025
6: DETR: End-to-end object DEtection with TRansformers, Nicolas Carion et al., ECCV 2020
7: SAM 2: segment anything in images and videos, Nikhila Ravi et al., ICLR 2025
8: Neural Localizer Fields for continuous 3D human pose and shape estimation (NLF), Istvan Sarandi and Gerard Pons-Moll, NeurIPS 2024
9: Multi-HMR 2: multi-person camera-centric human detection, mesh recovery and tracking, Guénolé Fiche et al., arXiv:2606.14841
