Non-parametric 3D Human Shape Estimation from Single Images
AI-powered technologies can nowadays create a 3D model of your face from just a snapshot, but it’s been trickier to do the same for the entire human body. Yet, if we were able to turn a picture of a person into their 3D model, it would open the door to lots of new applications, in particular in augmented reality. Creating your own avatar to play in a video game could be done in a few seconds, and you could try on virtual clothes in a virtual changing room when shopping online, to name just a few examples.
This transformation from a picture to a 3D model of the body is called 3D human shape estimation. It’s much more difficult than for a face because people’s appearances vary a lot depending on the shape of their body, their morphology and the clothes they’re wearing. And although we make lots of different facial expressions, the human body is far more articulated: it’s full of joints and can therefore take on many different poses, which makes shape estimation harder still. Some body poses can be ambiguous (see examples) or body parts might simply be occluded. Because of this, initial research work on 3D human shape estimation from images, e.g. HMR [1], employs a parametric model of the human body, such as the SMPL [2] model, where the parameters control the configuration of the human skeleton and the shape of the naked body, whilst simply ignoring other important factors like hair and clothing.
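To give an idea of what ‘parametric’ means here, the minimal sketch below drives an SMPL-style model with the open-source smplx Python package. This is just an illustration, not the code of any of the papers mentioned: the model files must be downloaded separately, and the 'models' path is a placeholder.

```python
import torch
import smplx

# Load an SMPL body model (model files are obtained separately from the
# SMPL website; 'models/' is a placeholder path).
model = smplx.create('models', model_type='smpl', gender='neutral')

betas = torch.zeros(1, 10)         # 10 shape coefficients: tall/short, thin/heavy...
body_pose = torch.zeros(1, 69)     # 23 body joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)  # overall rotation of the body

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)  # (1, 6890, 3): a naked-body mesh, no hair or clothes
```

Whatever values the parameters take, the output is always a plausible naked body, which is both the strength and the limitation of parametric approaches.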
We’ve taken a different approach by looking at non-parametric methods that don’t rely on a 3D model of the human body. Instead, they learn how to recover the 3D information from a collection of images annotated with it. These methods can potentially estimate the shape of the person including their hair and clothing. The recent BodyNet [3] method proposed using a 3D occupancy grid as a representation, splitting the 3D space into little cubes called ‘voxels’ and training an algorithm to predict whether each cube should be full or empty. It gave promising results, recovering the 3D volume of the human body, but it’s limited by the size of the voxels: the more cubes you consider, the finer the estimation. Although you could in theory get a perfect result with an infinite number of voxels, in practice the computational cost means you end up with a rather coarse 3D volume.
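A quick back-of-the-envelope computation makes the problem concrete: the number of voxels grows cubically with the grid resolution, so every doubling of the level of detail multiplies the memory and compute cost by eight.

```python
# Occupancy grids grow cubically with resolution: doubling the level of
# detail multiplies the number of voxels (and hence memory/compute) by 8.
for res in (32, 64, 128, 256, 512):
    print(f"{res}^3 grid -> {res ** 3:>11,} voxels")
# 32^3 grid  ->      32,768 voxels
# ...
# 512^3 grid -> 134,217,728 voxels
```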
What we do instead is represent the 3D surface as a combination of two depth maps: the visible depth map, which describes the side of the person facing the camera, and the hidden depth map, which describes the other side. A depth map is a gray-scale image of the same size as the original image that specifies, for each pixel, the distance from the camera to the surface of the object (where, in this case, the object is actually a person). You can see in the figure below that this combination makes our representation more efficient and easier to handle than other non-parametric techniques, since it can capture finer details of the body shape while maintaining a reasonable size.
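As a concrete illustration, here is a hedged sketch of how such a pair of depth maps could be computed from a watertight body mesh by ray casting: for each pixel, we keep the nearest surface hit (visible depth) and the farthest one (hidden depth). It uses the trimesh library with orthographic rays for simplicity; the file name, image size and ray setup are placeholders, not the paper’s exact ground-truth pipeline.

```python
import numpy as np
import trimesh

# Build the two-depth-map representation of a (watertight) body mesh by
# shooting one orthographic ray per pixel along +z and recording the
# nearest and farthest intersections. 'person.obj' is a placeholder.
mesh = trimesh.load('person.obj', force='mesh')
H = W = 128
(x0, y0, z0), (x1, y1, _) = mesh.bounds

xs, ys = np.meshgrid(np.linspace(x0, x1, W), np.linspace(y0, y1, H))
origins = np.stack([xs.ravel(), ys.ravel(),
                    np.full(H * W, z0 - 1.0)], axis=1)  # start in front of the mesh
directions = np.tile([0.0, 0.0, 1.0], (H * W, 1))

# All ray/surface intersections (front of the body, back of the body, ...).
locations, index_ray, _ = mesh.ray.intersects_location(origins, directions)
depths = locations[:, 2] - (z0 - 1.0)  # distance travelled along each ray

visible = np.full(H * W, np.inf)   # nearest hit per pixel
hidden = np.full(H * W, -np.inf)   # farthest hit per pixel
np.minimum.at(visible, index_ray, depths)
np.maximum.at(hidden, index_ray, depths)

visible_map = visible.reshape(H, W)  # pixels with no hit stay at +/- infinity
hidden_map = hidden.reshape(H, W)
```

Because the two maps have the same resolution as the image, fine details cost no more than a flat torso, whereas a voxel grid pays for every cube in between.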
The architecture is composed of several stacked hourglass networks that estimate the depth maps. Since our method doesn’t rely on a parametric model, there’s nothing to prevent the network from producing monsters, i.e. a person without a head or someone with three arms. To improve the ‘humanness’ of the generated 3D output, we incorporate a discriminator, trained in an adversarial manner, that penalizes the network during training if the estimated shape quite simply doesn’t look human. The architecture is illustrated below and the details are described in the paper.
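In code, the adversarial part of the training could look something like the following minimal PyTorch sketch, assuming a generator that maps an RGB image to the two depth maps and a discriminator that scores a depth-map pair as human-like or not. Names, shapes and loss weights are illustrative, not the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, g_opt, d_opt,
                  image, gt_depths, adv_weight=0.1):
    """One adversarial training step (illustrative sketch).

    image:     (B, 3, H, W) input pictures
    gt_depths: (B, 2, H, W) ground-truth visible + hidden depth maps
    """
    # -- Discriminator update: real ground-truth depth-map pairs vs generated ones.
    with torch.no_grad():
        fake = generator(image)
    real_score = discriminator(gt_depths)
    fake_score = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # -- Generator update: regress the depth maps, plus an adversarial term
    #    that penalizes outputs the discriminator finds non-human.
    pred = generator(image)
    pred_score = discriminator(pred)
    rec_loss = F.l1_loss(pred, gt_depths)
    adv_loss = F.binary_cross_entropy_with_logits(pred_score, torch.ones_like(pred_score))
    g_loss = rec_loss + adv_weight * adv_loss
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```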
Another important issue with data-driven approaches is how, or where, to get the training images and their associated 3D information. Most approaches rely on fully synthetic data but then don’t generalize well to real scenes, so we leveraged the Kinovis capture platform of the Morpheo team at Inria to generate textured 3D models of real people. We then placed them in realistic virtual environments to generate semi-synthetic images together with ground-truth 3D shapes. See the examples below.
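As a rough sketch of what producing one such training pair can look like, the snippet below renders a textured scan into an RGB image and its ground-truth depth map with the pyrender library. The mesh file, camera pose and lighting are placeholders; the actual Kinovis pipeline is more elaborate.

```python
import numpy as np
import trimesh
import pyrender

# Render one semi-synthetic training pair: an RGB image of a textured scan
# plus its ground-truth (visible) depth map. 'scan.obj' is a placeholder.
scan = trimesh.load('scan.obj', force='mesh')
scene = pyrender.Scene(bg_color=[0.5, 0.5, 0.5, 1.0])
scene.add(pyrender.Mesh.from_trimesh(scan))

cam_pose = np.eye(4)
cam_pose[2, 3] = 3.0  # move the camera 3 m back along +z
scene.add(pyrender.PerspectiveCamera(yfov=np.pi / 3.0), pose=cam_pose)
scene.add(pyrender.DirectionalLight(intensity=3.0), pose=cam_pose)

renderer = pyrender.OffscreenRenderer(256, 256)
color, depth = renderer.render(scene)  # (256, 256, 3) RGB and (256, 256) depth
```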
The extensive experiments we carried out, as well as the benefits of our method compared to others including BodyNet, are explained in the ICCV 2019 paper. The qualitative results below show that our method (right) recovers more detailed shapes than HMR (left) and BodyNet (middle).
In future work, we’ll investigate how using multiple frames of a video can help obtain better 3D shapes of people. Stay tuned…