
Whole-body human mesh recovery of multiple persons from a single image

Published by Fabien Baradel on 23 February 2024

Fabien Baradel, Gregory Rogez, Philippe Weinzaepfel

2024

A simple yet effective single-shot method to detect multiple people in an image and estimate their pose, body shape and expression.

Code: naver/multi-hmr

Human Mesh Recovery (HMR) is a computer vision task focussed on detecting humans in images and accurately estimating their shapes and 3D poses. This capability has immense potential in augmented and virtual reality (AR/VR), where the precise capture of facial expressions and hand gestures is pivotal to effective human communication and to people’s willingness to adopt the technology. HMR systems also promise to enhance the quality of human-robot interaction (HRI) and to refine robot navigation. By incorporating proximity, human motion and gestures into navigation algorithms, a robot can anticipate and respond to human behaviour, which makes it safer and makes its interaction with humans more natural.

How HMR works today

Current HMR systems mainly focus on estimating the mesh of a single person from the part of the image where that person has been detected by a separate detection algorithm. The detector first runs over the image to find the person(s), and the mesh estimation then has to be run again for each person found, which makes the whole process rather slow and inefficient. We’ve developed a method called Multi-HMR that processes the entire image in one go, identifying the humans in it and delivering estimates for all individuals simultaneously.

Video 1: 3D reconstruction of multiple humans in camera space with Multi-HMR. Viewpoint is moving from camera view to side view.

The challenge in devising such a single-step method is that it needs to be proficient both at detecting humans within an image and at extracting local visual cues, such as how the fingers are oriented or where the eyes are. These cues are essential for estimating expressive meshes for multiple humans. This dual capability of detection and extraction is crucial to optimizing HMR algorithms for speed, robustness and accuracy, all of which are necessary to make them widely applicable in practical scenarios.

Current approaches to HMR can be broadly categorized into two groups: single-shot methods and multi-shot methods. Single-shot methods process the entire image at once, but they yield only rough estimates of the 3D pose (e.g. ROMP (1)). By design, they estimate only the body pose and cannot estimate facial expressions or hand poses. Multi-shot methods, on the other hand, rely on the pre-detection of humans using the off-the-shelf detection algorithm mentioned earlier. After pre-detection, a number of cropping and estimation steps are carried out. These give good estimates, but at the cost of significant computational overhead. For example, PIXIE (2) crops around the body, face and hands separately, feeds each crop to its own model and then combines the predictions for each body part.

Video 2: Demonstration of Multi-HMR outputs where one can see the recovery of human poses and expressions. Top left is the RGB image, top right the RGB image and the output overlay, on the bottom left the camera side-view and on the bottom right a bird’s-eye view.

Multi-HMR: a new method for multi-person whole-body human mesh recovery in a single shot

Multi-HMR is a real-time single-shot detector that simultaneously estimates the pose and shape parameters of whole-body models for all individuals in a scene. It positions humans within the camera space and, when camera parameters are available, uses them to improve the final predictions. A standard Vision Transformer (ViT) backbone extracts visual features from the input image, capitalizing on recent advances in large-scale self-supervised pre-training (for example CroCo (3), CroCoMan (4), MAE (5) and DINOv2 (6)). From the feature map made up of the tokens produced by the backbone, we regress a 2D person-center heatmap. This heatmap gives, for each patch, the probability that a person is centered somewhere in that patch, and it comes with a location offset that predicts the precise pixel location of the person’s center in the image.
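To make the patch-level detection step more concrete, here is a minimal sketch of how a person-center heatmap and its offsets could be decoded into pixel coordinates. It is an illustration under stated assumptions, not the Multi-HMR implementation: the function name, the 0.5 threshold, the patch size of 14 pixels and the convention that offsets are fractions of a patch are all hypothetical choices made for this example.

```python
def decode_person_centers(heatmap, offsets, patch_size=14, threshold=0.5):
    """Turn a patch-level person-center heatmap into pixel coordinates.

    heatmap: grid of probabilities, one per patch (rows x columns).
    offsets: grid of the same size; each cell holds (dx, dy) in [0, 1),
             the position of the center inside the patch as a fraction
             of the patch size (an assumed convention).
    Returns a list of (x_pixel, y_pixel, score, (row, col)) tuples.
    """
    centers = []
    for row, probs in enumerate(heatmap):
        for col, score in enumerate(probs):
            if score < threshold:
                continue  # no person is centered in this patch
            dx, dy = offsets[row][col]
            # Patch corner plus the offset inside the patch, in pixels.
            x = (col + dx) * patch_size
            y = (row + dy) * patch_size
            centers.append((x, y, score, (row, col)))
    return centers


# Toy 2x3 grid of patches with one confident detection at row 1, column 2.
toy_heatmap = [[0.1, 0.0, 0.2],
               [0.0, 0.3, 0.9]]
toy_offsets = [[(0.0, 0.0)] * 3,
               [(0.0, 0.0), (0.0, 0.0), (0.5, 0.25)]]
detections = decode_person_centers(toy_heatmap, toy_offsets)
```

Each returned tuple keeps the (row, col) index of the patch so that the corresponding backbone token can be looked up afterwards.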

The Human Perception Head (HPH) is a new cross-attention based module we designed to regress the expressive body-part parameters (body, face, hands) for each detected person. The HPH predicts a variable number of pose and shape parameters for an expressive human parametric model, along with depths to place individuals in the scene. By attending to the entire set of tokens produced by the backbone (which can be seen as local features), our module extracts fine-grained visual cues to reconstruct expressive body parts such as faces and hands in a data-driven manner, without the need for additional cropping or resizing. Directly regressing the expressive body parts requires suitable data, so we generated 60,000 synthetic images featuring expressive humans close to the camera. This supplementary training data enhances the quality of the final 3D human reconstructions, particularly for expressive body parts.

Figure 1: Overview of Multi-HMR. A ViT backbone extracts image embeddings. Detection is conducted at the patch level. Each detected token feature serves as a query for a cross-attention based head, called Human Perception Head, which predicts pose, shape and 3D spatial location of each person. Optionally, camera parameters can be taken into account if known.
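The idea of using a detected token as a query over all backbone tokens can be sketched with plain scaled dot-product attention. This is a simplified illustration, not the Human Perception Head itself: it uses a single head and a single query, and it omits the learned query, key and value projections and the regression layers. The function name and the toy vectors are hypothetical.

```python
import math


def cross_attend(query, tokens):
    """Single-head scaled dot-product cross-attention with one query.

    query:  feature vector of a detected patch token (length d).
    tokens: all backbone tokens, used here as both keys and values
            (learned projections are omitted in this sketch).
    Returns (attended_feature, attention_weights).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one output entry per dimension.
    attended = [sum(w * tok[i] for w, tok in zip(weights, tokens))
                for i in range(d)]
    return attended, weights


# Three toy 2-D tokens; the first one plays the role of the detected token.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature, weights = cross_attend(toy_tokens[0], toy_tokens)
```

Because the query attends to every token, the output feature can draw on image regions far from the person's center, such as the hands or the face, without any cropping step.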

Multi-HMR positions each human in the 3D scene from the camera’s viewpoint using a simple regression loss. This gives state-of-the-art results and coherent positioning between individuals. We’ve also incorporated the option to take camera parameters into account when they are available, as this improves the accuracy of human localization within the camera space. We train a family of models at various resolutions and with different backbone sizes, and surpass the state of the art on multiple SMPL and SMPL-X benchmarks. As anticipated, higher resolutions and larger backbones yield better performance on benchmarks, yet our model based on 448×448 input images and the ViT-S backbone still achieves state-of-the-art results while running in real time on a standard GPU. All the details are in the paper (7) and you can even try it yourself on the online demo!
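As a rough illustration of why camera parameters help with localization, the sketch below places a person in camera space from a 2D center and a depth using standard pinhole back-projection. This is a generic textbook formula under assumed inputs, not necessarily the exact procedure used by Multi-HMR: the function name, the focal length and the principal point are hypothetical values chosen for the example.

```python
def backproject(center_px, depth, focal, principal_point):
    """Pinhole back-projection of a pixel to a 3D point in camera space.

    center_px:       (u, v) pixel location of the person's center.
    depth:           distance along the camera's optical axis.
    focal:           focal length in pixels (assumed equal for x and y).
    principal_point: (cx, cy) pixel where the optical axis hits the image.
    Returns (X, Y, Z) in the same unit as depth.
    """
    u, v = center_px
    cx, cy = principal_point
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return (x, y, depth)


# A person centered 100 pixels right of the image center, 4 units away,
# seen by a hypothetical camera with a 500-pixel focal length.
position = backproject((324.0, 224.0), 4.0, 500.0, (224.0, 224.0))
```

With the same pixel location and depth, a different focal length moves the person sideways in camera space, which is why knowing the actual camera parameters rather than assuming default ones can improve localization.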

Video 3: 3D reconstruction of multiple humans in camera space with Multi-HMR. Viewpoint is moving from camera view to bird’s-eye-view.

Video 4: 3D reconstruction of multiple humans in camera space with Multi-HMR. Viewpoint is moving from camera view to side view.

Summary and next steps

Multi-HMR is a significant step forward in making HMR technology more practical and more widely applicable in real-world settings. At the moment we’re working on improving Multi-HMR on single images where occlusion, rare human poses, motion blur and bad picture quality lead to less reliable predictions. To address this, we see potential in incorporating temporal information and in leveraging sequences of images (as opposed to a single image) to enhance accuracy and robustness. Multi-view scenarios of the same scene could also offer further opportunities for improvement.

References:

(1) ROMP: Monocular, One-stage, Regression of Multiple 3D People, Yu Sun et al., Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 11179-1118.
(2) PIXIE: Collaborative Regression of Expressive Bodies using Moderation, Yao Feng et al., International Conference on 3D Vision (3DV), Online, 1-3 December, 2021.
(3) CroCo: Self-Supervised Pretraining for 3D Vision Tasks by Cross-View Completion, Philippe Weinzaepfel et al., The Thirty-sixth Annual Conference on Neural Information Processing Systems (NeurIPS), New Orleans, USA, 28 November-9 December, 2022.
(4) CroCoMan: Cross-view and Cross-pose Completion for 3D Human Understanding, Matthieu Armando et al., arXiv:2311.09104, 2023.
(5) MAE: Masked Autoencoders Are Scalable Vision Learners, Kaiming He et al., arXiv:2111.06377, 2021.
(6) DINOv2: Learning Robust Visual Features without Supervision, Maxime Oquab et al., arXiv:2304.07193, 2023.
(7) Multi-HMR: Single-stage multi-person whole-body human mesh recovery with transformers, Fabien Baradel et al., arXiv:2402.14654, 2024.
