PoseBERT

Published by Fabien Baradel on 29 November 2021

Fabien Baradel, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, Gregory Rogez

2021

A novel, plug-and-play model for human 3D shape estimation of the body or hands in videos, trained by mimicking the BERT algorithm from the natural language processing community.

Code: github/naver/posebert

PoseBERT [1] is a new algorithm that takes as input the 3D poses of a person estimated in each frame of a video, i.e. the positions of their body joints, and predicts a sequence of 3D shapes. Although the per-frame estimates may be noisy due to motion blur, occlusions or ambiguities, PoseBERT returns a smooth sequence of 3D shapes. PoseBERT can also be plugged on top of any state-of-the-art pose estimation method, such as SPIN [2], our DOPE model [3] or our new MoCap-SPIN model, also presented in [1].

Video 1: Demonstration of PoseBERT-body by authors Fabien Baradel (left) and Thibault Groueix (right)

PoseBERT is inspired by the BERT algorithm from the natural language processing (NLP) community. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a method proposed by researchers at Google AI Language in 2018 that has achieved strong results on a wide variety of NLP tasks such as question answering and natural language inference. In their paper [4], the researchers detail, among other elements, a technique named Masked Language Model for bidirectional training of their models. Before sentences are fed into BERT, a percentage of the words in each sequence are masked, and the model is trained to predict these masked words based on the context provided by the other, non-masked words of the sequence. PoseBERT adapts this learning process to human 3D poses. We mask, or perturb with noise, a percentage of the poses in a sequence, and PoseBERT attempts to predict the missing or noisy poses by using the context provided by the valid, untouched poses.
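The corruption step described above can be illustrated with a minimal sketch. It is not the PoseBERT training code: the function name, the masking and noise ratios, the noise level and the use of `None` as a stand-in for a mask token are all assumptions chosen for illustration.

```python
import random


def corrupt_pose_sequence(poses, mask_ratio=0.125, noise_ratio=0.125,
                          noise_std=0.05, seed=0):
    """Mask some frames and add noise to others in a 3D pose sequence.

    poses: list of frames, each frame a list of (x, y, z) joint positions.
    All default values are illustrative assumptions, not published settings.
    """
    rng = random.Random(seed)
    n = len(poses)
    indices = list(range(n))
    rng.shuffle(indices)

    n_mask = int(round(mask_ratio * n))
    n_noise = int(round(noise_ratio * n))
    masked = set(indices[:n_mask])                  # frames to hide entirely
    noisy = set(indices[n_mask:n_mask + n_noise])   # frames to perturb

    corrupted = []
    for t, frame in enumerate(poses):
        if t in masked:
            # None stands in for the mask token a real model would use.
            corrupted.append(None)
        elif t in noisy:
            # Add Gaussian noise to every coordinate of every joint.
            corrupted.append([tuple(c + rng.gauss(0.0, noise_std) for c in joint)
                              for joint in frame])
        else:
            # Valid frames are passed through untouched and provide context.
            corrupted.append([tuple(joint) for joint in frame])
    return corrupted, masked, noisy


# Toy sequence: 16 frames, 3 joints per frame.
sequence = [[(0.0, 0.0, float(t))] * 3 for t in range(16)]
corrupted, masked, noisy = corrupt_pose_sequence(sequence)
print(len(masked), "masked frames,", len(noisy), "noisy frames")
```

During training, a model would receive `corrupted` as input and be asked to recover the original frames at the masked and noisy positions, using the untouched frames as context.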

We trained two versions of PoseBERT: one for the body and one for the hand. In practice, we rely on the SMPL parametric model [5], developed by researchers at the Max Planck Institute in Germany, and train PoseBERT to predict the parameters of this model rather than the thousands of vertices of the 3D human mesh. PoseBERT can be trained on motion capture (MoCap) data only, without requiring image annotations.
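To give a sense of why predicting model parameters is more compact than predicting vertices, the short calculation below compares the two output sizes per frame. It assumes the standard SMPL body configuration (6890 vertices, 24 joint rotations in axis-angle form, 10 shape coefficients); PoseBERT itself may use a different rotation representation, so the exact counts are illustrative only.

```python
# Standard SMPL body model sizes (assumed configuration, see lead-in).
NUM_VERTICES = 6890   # vertices of the SMPL body mesh
NUM_JOINTS = 24       # 23 body joints plus the global root orientation
NUM_SHAPE = 10        # shape coefficients commonly used with SMPL

# Values to predict per frame if regressing the mesh directly (x, y, z each).
mesh_values = NUM_VERTICES * 3

# Values to predict per frame if regressing SMPL parameters instead:
# one 3D axis-angle rotation per joint, plus the shape coefficients.
param_values = NUM_JOINTS * 3 + NUM_SHAPE

print("mesh values per frame:", mesh_values)
print("parameter values per frame:", param_values)
```

The full mesh can then be recovered from the predicted parameters by running them through the SMPL model.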

Video 2: Demonstration of PoseBERT-hand

When combined with MoCap-SPIN, PoseBERT reaches state-of-the-art performance on several benchmarks for human 3D pose estimation in videos. We also combined PoseBERT with DOPE [3] to estimate the 3D shape of a hand in real time and used these predictions to animate an ALLEGRO robot hand. This fun demo is given live at the 3DV 2021 conference demo session. We'll be pursuing this work on pose retargeting and on robots that manipulate objects like humans, so stay tuned to our blog and publications.

Video 3: Demonstration of PoseBERT with DOPE being used to animate the ALLEGRO robot hand.

References:

[1] Leveraging MoCap Data for Human Mesh Recovery. Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis and Grégory Rogez. 9th International Conference on 3D Vision (3DV), virtual event, 1-3 December 2021. 2-minute and 10-minute presentations on SlidesLive.
[2] Learning to reconstruct 3D human pose and shape via model-fitting in the loop. Nikos Kolotouros, Georgios Pavlakos, Michael J. Black and Kostas Daniilidis. International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October-2 November 2019.
[3] DOPE: Distillation Of Part Experts for whole-body 3D pose estimation in the wild. Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy and Grégory Rogez. European Conference on Computer Vision (ECCV), virtual event, 23-28 August 2020.
[4] BERT: Pre-training of deep bidirectional transformers for language understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. arXiv:1810.04805, 2018.
[5] SMPL: A skinned multi-person linear model. Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll and Michael J. Black. ACM Transactions on Graphics (TOG) 34 (6), pp. 1-16, November 2015.
