PoseScript: 3D Human Poses from Natural Language
Ginger Delmas1,2, Philippe Weinzaepfel2, Thomas Lucas2, Francesc Moreno-Noguer1, Grégory Rogez2
ECCV 2022
BibTeX
@inproceedings{posescript,
  title={{PoseScript: 3D Human Poses from Natural Language}},
  author={{Delmas, Ginger and Weinzaepfel, Philippe and Lucas, Thomas and Moreno-Noguer, Francesc and Rogez, Gr\'egory}},
  booktitle={{ECCV}},
  year={2022}
}
News
- 21/03/2023: Updated version of the PoseScript dataset! (more human-written annotations available)
Introduction
Text can be used to improve semantic understanding of human poses.

Gaining semantic understanding of human poses would open the door to a number of applications such as pose teaching, pseudo 3D annotation when deploying a MoCap system is complicated, digital pose generation, or search for complex poses in large-scale datasets.

For this purpose, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS[2] with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, we propose an elaborate captioning process that automatically generates synthetic descriptions in natural language from given 3D keypoints. We show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.
The PoseScript dataset

The PoseScript dataset is composed of:
- a set of 20k diverse 3D human poses extracted from AMASS[2];
- ~6k human-written descriptions: these were collected on Amazon Mechanical Turk[3] by showing annotators 3 additional poses similar to the one to be annotated, so as to obtain detailed and discriminative captions;

- 6 synthetic descriptions, in natural language, for each pose: these were generated automatically by a randomized captioning process that takes 3D keypoints as input and extracts low-level pose information – the posecodes – using a set of simple but generic rules. The posecodes are then combined into higher-level textual descriptions using syntactic rules. This makes it possible to increase the size of the dataset to a scale compatible with typical data-hungry learning algorithms, at no cost (a toy example of posecode extraction is sketched below).
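To give an intuition of what a posecode is, here is a minimal, hypothetical sketch in Python: the joint indices, angle thresholds, and category names below are illustrative assumptions, not the actual rules of the PoseScript pipeline.

```python
import numpy as np

# Hypothetical joint indices in a SMPL-like skeleton; the actual
# pipeline uses its own joint set and calibrated thresholds.
HIP_R, KNEE_R, ANKLE_R = 2, 5, 8

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_posecode(keypoints):
    """Categorize the right-knee bend into a coarse, verbalizable bin."""
    angle = joint_angle(keypoints[HIP_R], keypoints[KNEE_R], keypoints[ANKLE_R])
    if angle < 70:
        return ("right knee", "completely bent")
    elif angle < 110:
        return ("right knee", "bent at right angle")  # the "L shape"
    elif angle < 160:
        return ("right knee", "slightly bent")
    return ("right knee", "straight")

# A syntactic rule then turns the posecode into text,
# e.g. "the right knee is bent at right angle".
```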

Here is an example of a human-written caption and a generated one, for a given 3D human pose.

The automatically generated captions won’t provide any cultural references, such as the downward dog pose[1], but will be more detailed, e.g. “the right knee is forming an L shape”.
The Text-to-Pose retrieval model
We design a text-to-pose retrieval model, where the encodings of the pose and its corresponding PoseScript description are brought close together in a joint embedding space, while elements from different data pairs are pushed apart.
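Joint-embedding training of this kind is commonly implemented with a symmetric InfoNCE-style contrastive loss. The sketch below is a generic stand-in, not necessarily the exact loss of the paper; it assumes batched pose and text embeddings where matched pairs share the same row index.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pose_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of matched (pose, text) pairs.

    pose_emb, text_emb: (B, D) outputs of the pose and text encoders.
    Matched pairs share the same row index; other rows act as negatives.
    """
    pose_emb = F.normalize(pose_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pose_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull the diagonal (true pairs) together and push off-diagonal
    # pairs apart, in both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```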

As a consequence, one can freely type a text description to query poses from a large database, based on embedding similarity.
It also makes it possible to retrieve images showing people in particular poses, provided that the images[4] are associated with SMPL body fits[5].
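Once trained, retrieval reduces to a nearest-neighbor search in the joint embedding space. A hypothetical usage sketch, where `text_encoder` and the pre-computed, L2-normalized `pose_bank_emb` are placeholders for a trained text encoder and the encoded pose database:

```python
import torch
import torch.nn.functional as F

def retrieve(query_text, text_encoder, pose_bank_emb, top_k=5):
    """Rank a pre-encoded pose database against a free-form text query."""
    with torch.no_grad():
        q = F.normalize(text_encoder(query_text), dim=-1)  # (D,)
    scores = pose_bank_emb @ q                             # (N,) cosine similarities
    return scores.topk(top_k).indices                      # best-matching pose indices
```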

The text-conditioned pose generation model
Next, we design a text-conditioned pose generation model: at test time, the text is encoded as a distribution, from which we sample a latent variable z that is then decoded into a 3D body pose.
At training time, we also encode the input pose, and sample from the corresponding distribution to decode a new pose, which is compared to the initial pose, as in a regular variational auto-encoder. To make the model conditioned on text, we force the pose and text distributions to be aligned. We additionally regularize the pose distribution so that poses can be sampled without conditioning on text.
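A minimal sketch of this training objective and of test-time sampling, assuming placeholder encoder/decoder modules that return the parameters of Gaussian distributions (the actual architectures and loss weights in the paper differ):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def training_step(pose, caption, pose_enc, text_enc, decoder):
    # Placeholder modules: each encoder returns the mean and std of a Gaussian.
    mu_p, std_p = pose_enc(pose)      # distribution from the input pose
    mu_t, std_t = text_enc(caption)   # distribution from the caption
    q_pose, q_text = Normal(mu_p, std_p), Normal(mu_t, std_t)

    # Sample from the pose distribution and decode a new pose,
    # compared to the input pose as in a regular VAE.
    z = q_pose.rsample()                    # reparameterized sample
    rec_loss = F.mse_loss(decoder(z), pose)

    # Align the pose and text distributions to condition on text.
    align_loss = kl_divergence(q_pose, q_text).sum(-1).mean()

    # Regularize towards a standard normal prior, so that poses can
    # also be sampled without any text conditioning.
    prior = Normal(torch.zeros_like(mu_p), torch.ones_like(std_p))
    reg_loss = kl_divergence(q_pose, prior).sum(-1).mean()

    return rec_loss + align_loss + reg_loss  # loss weights omitted

def generate(caption, text_enc, decoder, n_samples=4):
    # Test time: encode the text as a distribution, sample several z,
    # and decode each into a different 3D body pose.
    mu_t, std_t = text_enc(caption)
    z = Normal(mu_t, std_t).sample((n_samples,))
    return decoder(z)
```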

As a result, for a given input pose description, one can obtain several different pose generations. The more detailed the description, the less diverse the generated poses.

Take-home message
- PoseScript is a dataset pairing 3D human poses with both automatically generated and human-written descriptions in natural language.
- We used it to train a text-to-pose retrieval model and a text-conditioned pose generative model.
- We show that better performance on human-written data can be obtained by pretraining on the automatic descriptions generated by our captioning pipeline.

References
[1] https://en.wikipedia.org/wiki/Downward_Dog_Pose
[2] AMASS: Archive of Motion Capture as Surface Shapes, Mahmood et al., ICCV 2019
[3] https://www.mturk.com/
[4] Microsoft COCO: Common Objects in Context, Lin et al., ECCV 2014
[5] Exemplar fine-tuning for 3D human model fitting towards in-the-wild 3D human pose estimation, Joo et al., 3DV 2020