CroCo-Man: Cross-view and Cross-pose Completion for 3D Human Understanding
Abstract
Human perception and understanding is a major domain of computer vision that, like many other subdomains, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, relying on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. We therefore propose a self-supervised pre-training approach that operates on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D geometry as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and achieve state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.
Overview
CroCo-Man leverages the recent CroCo pre-training method, which was specifically designed to learn 3D cues, and adapts it to the specific setting of human-centric pre-training. Like CroCo, the model takes as input a pair of photographs of the same scene (in our case, the same person). One of them is partially masked with a random pattern, and the model is trained to reconstruct it using information from both images.
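To make the objective concrete, below is a minimal PyTorch sketch of this cross-view completion setup. It is an illustrative simplification, not the CroCo-Man code: all names and hyper-parameters are hypothetical, and masked patches are replaced by a learned mask token instead of being dropped from the encoder input as in CroCo.

# Minimal sketch of cross-view completion (hypothetical names throughout).
import torch
import torch.nn as nn

class CrossCompletion(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.n = (img_size // patch) ** 2               # number of patches
        self.patchify = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)   # pixel regression

    def tokens(self, img):
        # (B, 3, H, W) -> (B, n, dim) patch tokens with position embeddings
        return self.patchify(img).flatten(2).transpose(1, 2) + self.pos

    def forward(self, masked_img, ref_img, mask):
        # mask: (B, n) boolean, True where a patch of `masked_img` is hidden
        x = self.tokens(masked_img)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x)                          # shared-weight encoder
        ref = self.encoder(self.tokens(ref_img))     # encode the second image
        x = self.decoder(x, ref)                     # cross-attend to it
        return self.head(x)                          # (B, n, 3*patch*patch)

def to_patches(img, patch=16):
    # (B, 3, H, W) -> (B, n, 3*patch*patch) ground-truth pixel patches
    return nn.functional.unfold(img, patch, stride=patch).transpose(1, 2)

def completion_loss(pred, target, mask):
    # pixel regression penalized on the masked patches only, as in MAE/CroCo
    return ((pred - target) ** 2).mean(-1)[mask].mean()

model = CrossCompletion()
view_a = torch.randn(2, 3, 224, 224)     # the view to complete
view_b = torch.randn(2, 3, 224, 224)     # second view, or a later frame
mask = torch.rand(2, model.n) < 0.9      # high masking ratio (illustrative)
loss = completion_loss(model(view_a, view_b, mask), to_patches(view_a), mask)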
Our image pairs are constructed in two ways: a) by taking two views of the same pose (i.e., at the same instant) and b) by taking two poses from a motion sequence at different time steps, for instance from videos showing a person in motion. As the human body is non-rigid, going beyond the static setting of CroCo can enable the model to gain some understanding of how body parts interact and move with respect to one another.
With these two kinds of pairs, we leverage both large-scale multi-view datasets captured in labs, which provide high-quality cross-view pairs, and video datasets, which provide diverse in-the-wild cross-pose pairs. A sketch of both sampling strategies is given below.
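The following Python sketch illustrates the two pair-sampling strategies. The index structures and names (multiview_index, video_index, max_gap) are assumptions for illustration, not the actual data pipeline.

# Hypothetical pair sampling for the two kinds of training pairs.
import random

def sample_cross_view_pair(multiview_index):
    # two cameras, same instant: a cross-view (same-pose) pair
    seq = random.choice(list(multiview_index))
    frame = random.choice(multiview_index[seq]["frames"])
    cam_a, cam_b = random.sample(multiview_index[seq]["cameras"], 2)
    return (seq, cam_a, frame), (seq, cam_b, frame)

def sample_cross_pose_pair(video_index, max_gap=30):
    # one video, two nearby time steps: a cross-pose (temporal) pair
    vid = random.choice(list(video_index))
    last = video_index[vid]["num_frames"] - 1
    t0 = random.randint(0, last - 1)
    t1 = min(t0 + random.randint(1, max_gap), last)
    return (vid, t0), (vid, t1)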
Following the above strategy, we pre-train two different models, one trained on whole-body images, dubbed CroCo-Body, and one trained on close-up images of hands, dubbed CroCo-Hand.
The video below illustrates the pre-training completion results of CroCo-Body on unseen data.
Results
Results show that, after fine-tuning, our pre-trained models perform better than Masked Autoencoders (MAE) or CroCo on several human-centric downstream tasks.
BibTeX
@inproceedings{crocoman,
  title={Cross-view and Cross-pose Completion for 3D Human Understanding},
  author={Armando, Matthieu and Galaaoui, Salma and Baradel, Fabien and Lucas, Thomas and Leroy, Vincent and Br{\'e}gier, Romain and Weinzaepfel, Philippe and Rogez, Gr{\'e}gory},
  booktitle={CVPR},
  year={2024}
}