Multi-HMR 2: a unified framework for human-centered scene understanding

Published by Irene Maxwell on 16 June 2026

Fabien Baradel, Guénolé Fiche, Philippe Weinzaepfel

The first human mesh recovery system to detect, localize, recover and track multiple people in 3D from a single camera.

Humans are at the center of many visual scenes, making human understanding one of the most important challenges in computer vision. Human Mesh Recovery (HMR) is the task of estimating the 3D pose and shape representation of a person from images or videos. Rather than predicting only a handful of body joints, HMR estimates a full 3D human mesh that captures body pose, shape and position in 3D space. This rich representation enables machines to reason about how people move, interact and occupy their environment. As a result, HMR has become a key technology for applications ranging from robotics and autonomous navigation to virtual and augmented reality, motion capture, sports analytics, healthcare and human-computer interaction.

The status of HMR today

HMR has made remarkable progress over the last few years, and systems are now capable of recovering highly detailed 3D human meshes from a single image, including body pose, hand articulation, facial expressions and body shape (for example NLF (8)). These advances have been driven by larger datasets, more powerful neural network architectures and improved parametric human models. As a result, HMR has evolved from a relatively niche research topic into a practical technology that can be deployed in many real-world applications.

Despite this progress, most HMR systems still follow a two-stage pipeline. In the first stage, a person detector locates every individual in the image; in the second, a separate mesh recovery model is applied to each detected person independently. This approach benefits from strong object detection models and highly optimized single-person HMR networks, but it comes with several limitations: the computational cost grows with the number of people in the scene, interactions between individuals are only weakly modeled, and estimating the relative 3D positions of multiple humans remains challenging.

To overcome these limitations, a new generation of multi-person HMR methods has emerged. Inspired by recent advances in object detection, these approaches aim to recover all human meshes directly from the full image in a single pass (for example Multi-HMR (1), SAT-HMR (5) and AiOS (4)). By jointly reasoning about all people in the scene, they can better handle crowded environments and complex human interactions while significantly improving efficiency. One-stage approaches still face several difficulties, however. Many rely on simplified camera assumptions, which limits their ability to recover metric-scale 3D locations, and most are built on body models designed only for adults, which can lead to inaccurate estimates when children are present. As HMR systems move beyond pose estimation toward complete scene understanding, accurately localizing diverse humans of all ages in 3D space is becoming just as important as recovering their body pose and shape.

Multi-HMR 2: the next generation of multi-person human mesh recovery

Multi-HMR 2 builds upon the foundations of Multi-HMR (1) but it pushes recovery a step further. While our original Multi-HMR demonstrated it was possible to recover expressive meshes for multiple people simultaneously in a single pass, Multi-HMR 2 focuses on the broader challenge of understanding who’s present in a scene, where they’re located in 3D space and how they can be tracked over time. Table 1 below recaps the differences between the original Multi-HMR model released in 2024 and Multi-HMR 2.

Examples of Multi-HMR 2 output. Compared to the previous version, Multi-HMR 2 provides a more complete understanding of human-centered scenes with improved 3D localization, greater robustness to crowded environments and more accurate pose and shape estimation. It also introduces identity tracking and child-aware body modeling – all in a single pass.

Feature	Multi-HMR	Multi-HMR 2
Single-shot multi-person HMR	YES	YES
Whole-body reconstruction	YES	YES
Camera-space localization	Limited	Improved
Camera estimation	NO	YES
Robust to overlapping people	Limited	Improved
Child-aware body modeling	NO	YES (Anny)
Identity tracking	NO	YES

Table 1: Comparison of Multi-HMR and Multi-HMR 2. Multi-HMR 2 extends the original framework with improved 3D localization, camera estimation, child-aware body modeling and identity tracking while maintaining efficient single-shot multi-person reconstruction.

At its core, Multi-HMR 2 adopts an architecture inspired by DETR (6), a family of models that has transformed object detection by reasoning globally over the entire image. Instead of relying on local image regions to detect people, the model uses a set of learned queries that compete to explain the humans present in the scene. This makes the approach robust in crowded environments where people overlap, interact closely or partially occlude one another.

A major evolution comes from the integration of the open source parametric body model Anny (2). Most existing HMR systems are based on body models that were designed primarily for adults. As a consequence, children are often reconstructed as scaled-down adults positioned incorrectly in the scene. Anny addresses this limitation by representing the full diversity of human bodies, from infants to elders, within a single unified model. This allows Multi-HMR 2 to recover more realistic body shapes and improves the accuracy of 3D localization in scenes containing people of different ages and body proportions.

Another important advancement in Multi-HMR 2 is the ability to estimate camera intrinsic parameters directly from the image. Previous multi-person HMR methods typically relied on fixed assumptions about the camera, which often led to inaccurate estimates of absolute distances and relative positions between people. By estimating camera parameters from the images when they are unknown, Multi-HMR 2 can recover human-centered scenes at a much more accurate metric scale, providing a more meaningful representation of how people are arranged in 3D space.
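To make the link between camera intrinsics and metric scale concrete, here is a minimal sketch based on the standard pinhole camera model. It is not the Multi-HMR 2 implementation: the function names, the person height and the two focal lengths are illustrative assumptions. It shows that, for the same image evidence, the recovered distance scales linearly with the focal length the system assumes.

```python
def depth_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole model: pixel_height = focal_px * real_height_m / depth."""
    return focal_px * real_height_m / pixel_height


def backproject(u, v, depth_m, focal_px, cx, cy):
    """Lift pixel (u, v) at a given depth to camera-space (X, Y, Z) in meters."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


# Illustrative numbers: a 1.70 m person spanning 340 px in a 1920x1080 image.
true_focal = 1000.0     # hypothetical true focal length, in pixels
assumed_focal = 1500.0  # hypothetical fixed default used when intrinsics are unknown

z_true = depth_from_height(true_focal, 1.70, 340.0)        # about 5.0 m
z_assumed = depth_from_height(assumed_focal, 1.70, 340.0)  # about 7.5 m

# Same pixel, two different camera hypotheses.
p_true = backproject(1260.0, 540.0, z_true, true_focal, 960.0, 540.0)
p_assumed = backproject(1260.0, 540.0, z_assumed, assumed_focal, 960.0, 540.0)
```

In this toy setup the lateral offset comes out the same under both hypotheses, while the depth is overestimated by 50%. A wrong focal length therefore stretches the scene along the viewing direction, which is why fixed camera assumptions distort distances between people.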

We also revisited the model training procedure. Training DETR-based models usually requires computationally expensive matching strategies that compare every prediction with every target. Multi-HMR 2 introduces a simpler matching approach based solely on the 2D locations of people in the image. This 2D approach significantly reduces training cost, enables the use of additional image datasets without 3D mesh annotations and allows the entire model to be trained in about a week on a single GPU without degrading reconstruction quality.
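The post does not spell out the exact matching rule, so the following is only one plausible reading, offered as a sketch: each annotated person is assigned to the closest unused prediction by 2D image distance. The function name, the greedy strategy and the coordinates are assumptions made for illustration.

```python
import math


def match_by_2d_location(pred_xy, target_xy):
    """Greedily assign each target person to the nearest unused prediction.

    pred_xy, target_xy: lists of (x, y) image locations in pixels.
    Returns a dict {target_index: prediction_index}.
    """
    # All candidate pairs, sorted by 2D distance (closest first).
    pairs = sorted(
        (math.dist(p, t), ti, pi)
        for ti, t in enumerate(target_xy)
        for pi, p in enumerate(pred_xy)
    )
    assignment, used_preds = {}, set()
    for _, ti, pi in pairs:
        if ti not in assignment and pi not in used_preds:
            assignment[ti] = pi
            used_preds.add(pi)
    return assignment


preds = [(100.0, 200.0), (400.0, 220.0), (250.0, 50.0)]  # hypothetical query locations
targets = [(390.0, 230.0), (110.0, 190.0)]               # hypothetical annotated people
matches = match_by_2d_location(preds, targets)
```

Because the rule only needs a 2D location per person, it applies equally to datasets that carry 2D annotations but no 3D meshes, which is consistent with the benefit described above.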

Finally, Multi-HMR 2 goes beyond static 3D reconstruction. By distilling visual features from the SAM 2 (7) memory encoder, the model learns appearance representations that can be used to associate detections across frames. This makes Multi-HMR 2 the first human mesh recovery approach capable of performing tracking without requiring any video-based supervision during training. The result is a unified system that can detect, reconstruct, localize and track multiple people directly from visual observations.
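Associating detections across frames from appearance features can be sketched as follows. This is a generic illustration rather than the Multi-HMR 2 tracker: the cosine-similarity score, the greedy assignment, the similarity threshold and the toy embeddings are all assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def associate(tracks, detections, min_similarity=0.5):
    """Link detections in the current frame to existing track identities.

    tracks: dict {track_id: appearance embedding from earlier frames}
    detections: list of appearance embeddings for the current frame
    Returns one track id per detection; unmatched detections get new ids.
    """
    # Score every (track, detection) pair, most similar first.
    candidates = sorted(
        ((cosine_similarity(emb, det), tid, di)
         for tid, emb in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    ids = [None] * len(detections)
    taken = set()
    for sim, tid, di in candidates:
        if sim < min_similarity:
            break  # remaining pairs are too dissimilar to link
        if ids[di] is None and tid not in taken:
            ids[di] = tid
            taken.add(tid)
    # Any detection left unmatched starts a new identity.
    next_id = max(tracks, default=-1) + 1
    for di in range(len(detections)):
        if ids[di] is None:
            ids[di] = next_id
            next_id += 1
    return ids


tracks = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0]}                 # toy embeddings
detections = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [0.0, 0.0, 1.0]]  # current frame
frame_ids = associate(tracks, detections)
```

Here the first two detections inherit existing identities and the third, which resembles neither track, is given a new one.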

Multi-HMR 2 achieves strong results across a number of benchmarks and tasks (9). It improves pelvis-centered 3D reconstruction, camera-space localization (with and without ground-truth camera parameters), detection accuracy and recall in crowded or occluded scenes, 2D keypoint reprojection and multi-person tracking. This shows that Multi-HMR 2 is not only accurate at recovering human meshes but also effective at understanding where people are in the scene and maintaining their identities over time.

Summary and next steps

Multi-HMR 2 brings us closer to practical, human-centered scene understanding by combining multi-person detection and pose/shape estimation, accurate 3D localization, child-aware body modeling and identity tracking in a single framework. Looking ahead, we aim to further improve the reconstruction of expressive body parts such as hands and faces. We also plan to leverage temporal and multi-view information to make predictions even more accurate and robust.

References

1: Multi-HMR: Multi-person whole body human mesh recovery in a single shot, Fabien Baradel et al., ECCV 2024

2: Human mesh modeling for Anny body, Romain Bregier et al., arXiv:2511.03589

3: Anny-Fit: all-age human mesh recovery, Laura Bravo-Sanchez et al., CVPR 2026 Findings Track

4: AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation, Qingping Sun et al., CVPR 2024

5: SAT-HMR: Real-time multi-person 3D mesh estimation via scale-adaptive tokens, Chi Su et al., CVPR 2025

6: DETR: End-to-end object DEtection with TRansformers, Nicolas Carion et al., ECCV 2020

7: SAM 2: segment anything in images and videos, Nikhila Ravi et al., ICLR 2025

8: Neural Localizer Fields for continuous 3D human pose and shape estimation (NLF), Istvan Sarandi and Gerard Pons-Moll, NeurIPS 2024

9: Multi-HMR 2: multi-person camera-centric human detection, mesh recovery and tracking, Guénolé Fiche et al., arXiv:2606.14841
