Moulding Humans - Naver Labs Europe

Non-parametric 3D Human Shape Estimation from Single Images


3D Human Shape Estimation

AI-powered technologies can now create a 3D model of your face from a single snapshot, but doing the same for the entire human body has proved trickier. Yet being able to turn a picture of a person into a 3D model would open the door to many new applications, particularly in augmented reality. You could create your own avatar for a video game in a few seconds or try on virtual clothes in a virtual changing room when shopping online, to name just two examples.

This transformation from a picture to a 3D model of the body is called 3D human shape estimation. It’s much more difficult than for a face because people’s appearance varies a lot with their body shape, their morphology and the clothes they’re wearing. Faces make many different expressions, but the human body is highly articulated: it is full of joints and can therefore take many different poses, which makes shape estimation harder still. Some body poses can be ambiguous (see examples) and body parts may simply be occluded. Because of this, initial research on 3D human shape estimation from images, e.g., HMR [1], employs a parametric model of the human body such as SMPL [2], whose parameters control the configuration of the human skeleton and the shape of the naked body while ignoring other important factors like hair and clothing.

We’ve taken a different route by looking at non-parametric approaches that don’t rely on a 3D model of the human body. Instead, they learn to recover the 3D information from a collection of images annotated with it. These methods can potentially estimate the shape of the person including their hair and clothing. The recent BodyNet [3] method uses a 3D occupancy grid as its representation: it splits 3D space into little cubes called ‘voxels’ and trains a network to predict whether each cube is full or empty. It gave promising results, recovering the 3D volume of the human body, but it’s limited by the size of the cubes. The more voxels you use, the finer the estimate. An infinite number of voxels could in principle give a perfect result, but in practice the computational cost restricts you to a rather coarse 3D volume.
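To make the cost argument concrete, here is a minimal sketch of how the number of occupancy values grows with grid resolution. The resolutions are illustrative assumptions, not the settings used by BodyNet.

```python
def voxel_count(resolution: int) -> int:
    """Number of occupancy values in a cubic grid with `resolution` voxels per side."""
    return resolution ** 3


# Hypothetical resolutions, chosen only to show the cubic growth.
for resolution in (32, 64, 128, 256):
    print(f"{resolution:>3} voxels per side: {voxel_count(resolution):>10,} values")

# Doubling the resolution multiplies the number of values by 2**3 = 8.
growth_factor = voxel_count(256) // voxel_count(128)
```

Each time the cubes are made half as large, the network has eight times as many values to predict, which is why fine detail quickly becomes expensive.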

What we do is represent the 3D surface as a combination of two depth maps: the visible depth map and the hidden depth map. A depth map is a gray-scale image of the same size as the original image that stores, for each pixel, the distance from the camera to the surface of the object (in this case, a person). As the figure below shows, this combination makes our representation more efficient and easier to handle than other non-parametric techniques, since it can capture finer details of the body shape while keeping a reasonable size.

Our method compared to a voxel grid.
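As a rough illustration of the double depth map idea, the sketch below turns a visible and a hidden depth map into a set of 3D surface points. It is not the paper's reconstruction code: it assumes an orthographic camera, uses a value of 0.0 to mark background pixels, and the name `depth_maps_to_points` is hypothetical.

```python
from typing import List, Tuple

DepthMap = List[List[float]]
Point = Tuple[float, float, float]

BACKGROUND = 0.0  # assumed marker for pixels that do not belong to the person


def depth_maps_to_points(visible: DepthMap, hidden: DepthMap) -> List[Point]:
    """Back-project two depth maps into 3D points (orthographic assumption).

    Each foreground pixel yields two points on the same camera ray:
    one on the front (visible) surface and one on the back (hidden) surface.
    """
    points: List[Point] = []
    for row, (visible_row, hidden_row) in enumerate(zip(visible, hidden)):
        for col, (z_front, z_back) in enumerate(zip(visible_row, hidden_row)):
            if z_front == BACKGROUND or z_back == BACKGROUND:
                continue  # skip pixels outside the silhouette
            points.append((float(col), float(row), z_front))
            points.append((float(col), float(row), z_back))
    return points


# Toy 2x3 depth maps: four foreground pixels, two background pixels.
visible = [[0.0, 2.0, 0.0],
           [1.8, 1.5, 1.9]]
hidden = [[0.0, 2.4, 0.0],
          [2.2, 2.6, 2.1]]

points = depth_maps_to_points(visible, hidden)
print(len(points))  # 8: two surface points for each of the four foreground pixels
```

For an image of H x W pixels this representation stores 2 x H x W depth values, so its size grows with the image area rather than with a volume.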

The architecture is composed of several stacked hourglass networks that estimate the depth maps. Since our method doesn’t rely on a parametric model, there’s nothing to prevent the network from producing monsters, e.g., a person without a head or someone with three arms. To improve the “humanness” of the generated 3D output, we add a discriminator, trained adversarially, that penalizes the network during training if the estimated shape simply doesn’t look human. The architecture is illustrated below and the details are described in the paper.

A discriminator improves the “humanness” of the generated 3D output.
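The sketch below shows one common way such a penalty can enter the training objective: a depth regression term plus a weighted adversarial term. The L1 depth loss, the `-log D` form of the adversarial term and the weight `adv_weight` are all assumptions made for illustration; the paper describes the losses actually used.

```python
import math
from typing import Sequence


def l1_depth_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth depth values."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def adversarial_penalty(disc_score: float) -> float:
    """-log D(output): large when the discriminator finds the shape implausible.

    `disc_score` is the discriminator's estimated probability, in (0, 1],
    that the predicted shape looks human.
    """
    return -math.log(max(disc_score, 1e-12))  # clamp to avoid log(0)


def generator_loss(predicted: Sequence[float], target: Sequence[float],
                   disc_score: float, adv_weight: float = 0.01) -> float:
    """Depth regression loss plus a weighted adversarial penalty (assumed form)."""
    return l1_depth_loss(predicted, target) + adv_weight * adversarial_penalty(disc_score)


# Same depth error, different discriminator verdicts.
target = [1.0, 1.2, 1.1, 0.9]
predicted = [1.1, 1.2, 1.0, 0.9]

loss_plausible = generator_loss(predicted, target, disc_score=0.9)
loss_implausible = generator_loss(predicted, target, disc_score=0.1)
print(loss_plausible < loss_implausible)  # True
```

With the depth error held fixed, the output that the discriminator scores as less human receives the higher loss, which is the pressure that steers the network away from implausible bodies.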

Another important issue with data-driven approaches is where to get the training images and the corresponding 3D information. Most approaches rely on fully synthetic data, but models trained this way don’t generalize well to real scenes. We therefore used the Kinovis capture platform of the Morpheo team at Inria to generate textured 3D models of real people. We then placed them in a realistic virtual environment to generate semi-synthetic images together with ground truth 3D shapes. See examples below.

Our extensive experiments, and the benefits of the method compared to others including BodyNet, are described in the ICCV 2019 paper.

The qualitative results obtained by our method (right) show that we recover more detailed shapes than HMR (left) and BodyNet (middle).

In future work, we’ll investigate how using multiple frames of a video can help obtain better 3D shapes of people. Stay tuned…

More about this work:

ICCV 2019 paper: Moulding Humans: Non-parametric 3D Human Shape Estimation from Single Images, Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, Gregory Rogez

ArXiv version


author    = {Gabeur, Valentin and Franco, Jean-Sébastien and Martin, Xavier and Schmid, Cordelia and Rogez, Gregory},
title     = {Moulding Humans: Non-parametric 3D Human Shape Estimation from Single Images},
booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
year      = {2019}}


[1] End-to-end Recovery of Human Shape and Pose, Angjoo Kanazawa, Michael J. Black, David W. Jacobs and Jitendra Malik, CVPR 2018

[2] SMPL: A Skinned Multi-Person Linear Model, Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, Michael J. Black, ACM Trans. Graph. 2015

[3] BodyNet: Volumetric Inference of 3D Human Body Shapes, Gül Varol, Duygu Ceylan, Bryan C. Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, Cordelia Schmid, ECCV 2018



