Human perception and understanding is a major domain of computer vision that, like many other vision subdomains, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, which relies on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth, such as 2D or 3D labels, does not scale well. We therefore propose a self-supervised pre-training approach that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D geometry as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks and achieve state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.
Overview
CroCo-Man leverages the recent CroCo pre-training method, which was specifically designed to learn 3D cues, and adapts it to the setting of human-centric pre-training. Like CroCo, the model takes as input a pair of photographs of the same scene (in our case, the same person). One of them is partially masked with a random pattern, and the model is trained to reconstruct it using information from both images. A minimal sketch of this objective is given below.
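To make the objective concrete, here is a small, self-contained PyTorch sketch. It is an illustration under simplifying assumptions rather than the actual architecture: positional embeddings, the asymmetric encoder/decoder design, and other details of CroCo are omitted, and the names (`CrossViewCompletion`, `patchify`, the default hyperparameters) are hypothetical.

```python
import torch
import torch.nn as nn

def patchify(img, p=16):
    """Split a (B, 3, H, W) image into flattened (B, N, 3*p*p) patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

class CrossViewCompletion(nn.Module):
    """Reconstruct the masked patches of a target image given its visible
    patches and a second, fully visible image of the same person."""

    def __init__(self, patch_dim=768, dim=256, heads=8, depth=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.head = nn.Linear(dim, patch_dim)  # predicts raw patch pixels

    def forward(self, img1, img2, mask_ratio=0.75):
        t1, t2 = patchify(img1), patchify(img2)        # (B, N, patch_dim)
        B, N, _ = t1.shape

        # Randomly mask a large fraction of the target image's patches.
        is_masked = torch.rand(B, N, device=img1.device) < mask_ratio
        x1 = torch.where(is_masked[..., None],
                         self.mask_token.expand(B, N, -1), self.embed(t1))

        # The reference image is encoded in full; the decoder then
        # cross-attends to it while completing the target tokens.
        memory = self.encoder(self.embed(t2))
        pred = self.head(self.decoder(x1, memory))

        # L2 reconstruction loss, computed on the masked patches only.
        err = ((pred - t1) ** 2).mean(-1)
        return (err * is_masked).sum() / is_masked.sum()
```

For example, `CrossViewCompletion()(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))` returns a scalar loss. Note that the same forward pass applies whether the pair is cross-view or cross-pose; only the pair construction differs.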
Our image pairs are constructed in two ways: a) by taking two views of the same pose (i.e., the same instant) and b) by taking two poses at different time steps of a motion sequence, for instance from videos showing a person in movement (see the sampling sketch below). As the human body is non-rigid, going beyond the static setting proposed in CroCo can enable the model to gain some understanding of how body parts interact and move with respect to one another.
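The snippet below sketches how such pairs might be sampled. It assumes a hypothetical data layout in which each capture sequence is a dict mapping camera ids to time-ordered frame lists; neither the function name nor the layout comes from the paper.

```python
import random

def sample_pair(sequence, pair_type):
    """Return a training pair from one capture sequence.

    `sequence` maps a camera id to a time-ordered list of frames
    (a hypothetical layout, chosen purely for illustration).
    """
    cams = list(sequence)
    if pair_type == "cross-view":
        # Same instant, two different viewpoints: a cue for 3D geometry.
        cam_a, cam_b = random.sample(cams, 2)
        t = random.randrange(len(sequence[cam_a]))
        return sequence[cam_a][t], sequence[cam_b][t]
    else:
        # Same viewpoint, two different instants: a cue for human motion.
        cam = random.choice(cams)
        t1, t2 = random.sample(range(len(sequence[cam])), 2)
        return sequence[cam][t1], sequence[cam][t2]
```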
With these two kinds of pairs, we leverage both large-scale multi-view datasets captured in labs, which provide high-quality cross-view pairs, and video datasets, which provide diverse in-the-wild cross-pose pairs.
Following the above strategy, we pre-train two different models, one trained on whole-body images, dubbed CroCo-Body, and one trained on close-up images of hands, dubbed CroCo-Hand.
The video below illustrates completion results of the pre-trained CroCo-Body model on unseen data.