PoseScript: 3D human poses from natural language

Published by Gregory Rogez on 23 October 2022

Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Gregory Rogez

European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October, 2022



News

19/01/2024: Updated the paper with the task of pose description generation (i.e., generating text from an input pose with a learned model, which differs from the automatic captioning pipeline). Improved the text-to-pose retrieval model and the text-conditioned pose generative model. Updated FID scores.
21/03/2023: Updated version of the PoseScript dataset! (more human-written annotations available)

Introduction

Text can be used to improve semantic understanding of human poses.

Gaining semantic understanding of human poses would open the door to a number of applications such as pose teaching, pseudo 3D annotation when deploying a MoCap system is complicated, digital pose generation, or search for complex poses in large-scale datasets.

For this purpose, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS^[2] with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, we propose an elaborate captioning process that automatically generates synthetic natural-language descriptions from given 3D keypoints. We show applications of the PoseScript dataset to the retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.

The PoseScript dataset

The PoseScript dataset is composed of:

a set of 20k diverse 3D human poses extracted from AMASS^[2];
~6k human-written descriptions: these were collected on Amazon Mechanical Turk^[3] by showing 3 additional poses, similar to the one to be annotated, so as to obtain detailed and discriminative captions;

6 synthetic descriptions, in natural language, for each pose: these were generated automatically by a randomized captioning process that takes 3D keypoints as input and extracts low-level pose information (the posecodes) using a set of simple but generic rules. The posecodes are then combined into higher-level textual descriptions using syntactic rules. This makes it possible to increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, at no cost.

The automatically generated captions won’t provide any cultural references, but will be more detailed, e.g. “the right knee is forming an L shape”.
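To make the idea of posecodes more concrete, here is a minimal sketch of how a single rule could turn 3D keypoints into a short sentence. It is not the actual PoseScript pipeline: the angle bins, the category names, the sentence templates and the function names (joint_angle, angle_posecode, caption) are all hypothetical and chosen only for illustration.

```python
import math
import random

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def angle_posecode(angle_deg):
    """Map an angle to a categorical posecode (hypothetical bins)."""
    if angle_deg < 75:
        return "completely bent"
    if angle_deg < 105:
        return "bent at a right angle"
    if angle_deg < 160:
        return "slightly bent"
    return "straight"

# Hypothetical sentence templates standing in for the syntactic rules.
TEMPLATES = [
    "The {joint} is {code}.",
    "Their {joint} is {code}.",
]

def caption(joint, code, rng):
    """Pick a template at random, mirroring the randomized captioning idea."""
    return rng.choice(TEMPLATES).format(joint=joint, code=code)

# Toy keypoints (hip, knee, ankle) forming a right angle at the knee.
hip, knee, ankle = (0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)
code = angle_posecode(joint_angle(hip, knee, ankle))
print(caption("right knee", code, random.Random(0)))
```

Drawing a different template on each call is what would let such a process produce several distinct captions for the same pose.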

The Text-to-Pose retrieval model

We design a text-to-pose retrieval model, where the encodings of the pose and its corresponding PoseScript description are brought close together in a joint embedding space, while elements from different data pairs are pushed apart.
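One common way to bring matching pairs together while pushing mismatched pairs apart is a symmetric contrastive objective over a batch. The sketch below is an assumption about what such a loss could look like, not the loss used in the paper; the temperature value and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def log_softmax_diag(scores):
    """For each row i, return log softmax(scores[i])[i]."""
    out = []
    for i, row in enumerate(scores):
        m = max(row)
        lse = m + math.log(sum(math.exp(s - m) for s in row))
        out.append(row[i] - lse)
    return out

def symmetric_contrastive_loss(text_emb, pose_emb, temperature=0.1):
    """Row i of text_emb is assumed to describe row i of pose_emb."""
    n = len(text_emb)
    # Scaled similarity between every text and every pose in the batch.
    sim = [[cosine(t, p) / temperature for p in pose_emb] for t in text_emb]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_pose = log_softmax_diag(sim)    # each text should pick its pose
    pose_to_text = log_softmax_diag(sim_t)  # each pose should pick its text
    return -(sum(text_to_pose) + sum(pose_to_text)) / (2 * n)

texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_poses = [[1.0, 0.0], [0.0, 1.0]]
swapped_poses = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_contrastive_loss(texts, aligned_poses))  # small
print(symmetric_contrastive_loss(texts, swapped_poses))  # large
```

The loss is low when each description is most similar to its own pose and high when the pairs are mixed up, which is the behaviour the joint embedding space is trained for.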

As a consequence, one can freely type a description text to query poses from a large database by looking at embedding similarity.

It also makes it possible to retrieve images showing people in particular poses, provided that the images^[4] are associated with SMPL body fits^[5].
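Querying by embedding similarity can be sketched as a nearest-neighbour search over precomputed pose embeddings. In this illustration the embeddings are toy 2D vectors, the pose identifiers are made up, and the text encoder that would produce the query embedding is assumed to exist elsewhere.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, database, k=2):
    """Return the ids of the k poses most similar to the query.

    database is a list of (pose_id, embedding) pairs.
    """
    ranked = sorted(database, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [pose_id for pose_id, _ in ranked[:k]]

# Hypothetical pose embeddings; in practice they would come from the pose encoder.
database = [
    ("pose_a", [1.0, 0.1]),
    ("pose_b", [0.0, 1.0]),
    ("pose_c", [0.7, 0.7]),
]
query = [1.0, 0.0]  # stands in for the embedding of a typed description
print(retrieve(query, database, k=2))  # ['pose_a', 'pose_c']
```

If each image in a collection is stored with the embedding of its SMPL body fit, the same lookup returns images instead of raw poses.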

The text-conditioned pose generation model

Next, we design a text-conditioned pose generation model, such that, at test time, the text is encoded as a distribution, from which we sample a z that will then be decoded as a 3D body pose.

At train time, we also encode the initial pose, and sample from the corresponding distribution to decode it as a new pose, which is compared to the initial pose, as in a regular variational auto-encoder. To make the model conditioned on text, we force the two distributions of pose and text to be aligned. We additionally regularize the pose distribution so that poses can also be sampled without conditioning on text.
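The training objective described above can be sketched as a reconstruction term plus two Kullback-Leibler terms, assuming diagonal Gaussian distributions. The squared-error reconstruction, the direction of each KL term, the standard normal prior and the weights w_align and w_prior are assumptions made for this illustration; the source does not specify them.

```python
import math
import random

def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) for diagonal Gaussians."""
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def training_loss(pose, decode, pose_dist, text_dist, rng, w_align=1.0, w_prior=1.0):
    """Reconstruction + pose/text alignment + prior regularization."""
    mu_p, sigma_p = pose_dist
    mu_t, sigma_t = text_dist
    z = sample_latent(mu_p, sigma_p, rng)        # sample from the pose distribution
    recon = decode(z)                            # decode it as a new pose
    rec_loss = sum((a - b) ** 2 for a, b in zip(recon, pose))
    align = kl_diag_gaussians(mu_p, sigma_p, mu_t, sigma_t)  # pose vs. text
    zeros, ones = [0.0] * len(mu_p), [1.0] * len(mu_p)
    prior = kl_diag_gaussians(mu_p, sigma_p, zeros, ones)    # pose vs. N(0, I)
    return rec_loss + w_align * align + w_prior * prior

# Toy check with a hypothetical decoder that always returns the target pose.
pose = [0.1, 0.2]
dist = ([0.0, 0.0], [1.0, 1.0])
print(training_loss(pose, lambda z: pose, dist, dist, random.Random(0)))  # 0.0
```

The alignment term is what lets the text distribution replace the pose distribution at test time, while the prior term keeps unconditional sampling possible.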

As a result, for a given input pose description, one can obtain several different pose generations. The more detailed the description, the less diverse the generated poses.
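One way to picture this behaviour is to draw several latent samples from the text distribution and measure how spread out the decoded poses are. The sketch assumes that a more detailed description maps to a narrower distribution, which the source does not state explicitly; the identity decoder and the two toy distributions are hypothetical.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def decode(z):
    """Hypothetical stand-in for the learned pose decoder (identity map)."""
    return list(z)

def generate_poses(text_dist, n, rng):
    """Draw n latent samples from the text distribution and decode each one."""
    mu, sigma = text_dist
    return [decode(sample_latent(mu, sigma, rng)) for _ in range(n)]

def spread(poses):
    """Mean per-dimension standard deviation across the generated poses."""
    dims = len(poses[0])
    total = 0.0
    for d in range(dims):
        vals = [p[d] for p in poses]
        mean = sum(vals) / len(vals)
        total += math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return total / dims

vague = ([0.0, 0.0], [1.0, 1.0])     # wide distribution, e.g. a short description
detailed = ([0.0, 0.0], [0.2, 0.2])  # narrow distribution, e.g. a detailed one
print(spread(generate_poses(vague, 50, random.Random(0))))
print(spread(generate_poses(detailed, 50, random.Random(0))))
```

With the same random seed, the narrower distribution yields poses that differ less from one another.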

Take-home message

PoseScript is a dataset pairing 3D human poses with both automatically generated and human-written descriptions in natural language.
We used it to train a text-to-pose retrieval model and a text-conditioned pose generative model.
We show that better performance on human data can be obtained by pretraining on the automatic descriptions generated by our captioning pipeline.

References

[1] https://en.wikipedia.org/wiki/Downward_Dog_Pose

[2] AMASS: Archive of Motion Capture as Surface Shapes, Mahmood et al., ICCV 2019

[3] https://www.mturk.com/

[4] Microsoft COCO: Common Objects in Context, Lin et al., ECCV 2014

[5] Exemplar fine-tuning for 3D human model fitting towards in-the-wild 3D human pose estimation, Joo et al., 3DV 2020

@inproceedings{posescript,
title={{PoseScript: 3D Human Poses from Natural Language}},
author={{Delmas, Ginger and Weinzaepfel, Philippe and Lucas, Thomas and Moreno-Noguer, Francesc and Rogez, Gr\'egory}},
booktitle={{ECCV}},
year={2022}
}
