CroCo-Man: Cross-view and Cross-pose Completion for 3D Human Understanding
Abstract
Human perception and understanding is a major domain of computer vision that, like many other subdomains, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, relying on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. We therefore propose a self-supervised pre-training approach that operates on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D geometry as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and achieve state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.
Overview
CroCo-Man leverages the recent CroCo pre-training method, which was specifically designed to learn 3D cues, and adapts it to the specific setting of human-centric pre-training. Like CroCo, the model takes as input a pair of photographs of the same scene (in our case, the same person). One of them is partially masked with a random pattern, and the model is trained to reconstruct it using information from both images.
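To make the objective concrete, below is a minimal PyTorch sketch of this cross-view completion setup. It is an illustrative simplification, not the CroCo-Man code: all names and hyper-parameters are hypothetical, and masked patches are replaced by a learned mask token instead of being dropped from the encoder input as in CroCo.

# Minimal sketch of cross-view completion (hypothetical names throughout).
import torch
import torch.nn as nn

class CrossCompletion(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.n = (img_size // patch) ** 2               # number of patches
        self.patchify = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        dec = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, depth)
        self.head = nn.Linear(dim, 3 * patch * patch)   # pixel regression

    def tokens(self, img):
        # (B, 3, H, W) -> (B, n, dim) patch tokens with position embeddings
        return self.patchify(img).flatten(2).transpose(1, 2) + self.pos

    def forward(self, masked_img, ref_img, mask):
        # mask: (B, n) boolean, True where a patch of `masked_img` is hidden
        x = self.tokens(masked_img)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x)                          # shared-weight encoder
        ref = self.encoder(self.tokens(ref_img))     # encode the second image
        x = self.decoder(x, ref)                     # cross-attend to it
        return self.head(x)                          # (B, n, 3*patch*patch)

def to_patches(img, patch=16):
    # (B, 3, H, W) -> (B, n, 3*patch*patch) ground-truth pixel patches
    return nn.functional.unfold(img, patch, stride=patch).transpose(1, 2)

def completion_loss(pred, target, mask):
    # pixel regression penalized on the masked patches only, as in MAE/CroCo
    return ((pred - target) ** 2).mean(-1)[mask].mean()

model = CrossCompletion()
view_a = torch.randn(2, 3, 224, 224)     # the view to complete
view_b = torch.randn(2, 3, 224, 224)     # second view, or a later frame
mask = torch.rand(2, model.n) < 0.9      # high masking ratio (illustrative)
loss = completion_loss(model(view_a, view_b, mask), to_patches(view_a), mask)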
Our image pairs are constructed in two ways: a) by taking two views of the same pose (i.e., at the same instant) and b) by taking two poses from a motion sequence at different time steps, for instance from videos showing a person in motion. As the human body is non-rigid, going beyond the static setting of CroCo can enable the model to gain some understanding of how body parts interact and move with respect to one another.
With these two kinds of pairs, we leverage both large-scale multi-view datasets captured in labs, which provide high-quality cross-view pairs, and video datasets, which provide diverse in-the-wild cross-pose pairs. A sketch of both sampling strategies is given below.
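The following Python sketch illustrates the two pair-sampling strategies. The index structures and names (multiview_index, video_index, max_gap) are assumptions for illustration, not the actual data pipeline.

# Hypothetical pair sampling for the two kinds of training pairs.
import random

def sample_cross_view_pair(multiview_index):
    # two cameras, same instant: a cross-view (same-pose) pair
    seq = random.choice(list(multiview_index))
    frame = random.choice(multiview_index[seq]["frames"])
    cam_a, cam_b = random.sample(multiview_index[seq]["cameras"], 2)
    return (seq, cam_a, frame), (seq, cam_b, frame)

def sample_cross_pose_pair(video_index, max_gap=30):
    # one video, two nearby time steps: a cross-pose (temporal) pair
    vid = random.choice(list(video_index))
    last = video_index[vid]["num_frames"] - 1
    t0 = random.randint(0, last - 1)
    t1 = min(t0 + random.randint(1, max_gap), last)
    return (vid, t0), (vid, t1)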
Following the above strategy, we pre-train two different models, one trained on whole-body images, dubbed CroCo-Body, and one trained on close-up images of hands, dubbed CroCo-Hand.
The video below illustrates the pre-training completion results of CroCo-Body on unseen data.
Results
Results show that, after fine-tuning, our pre-trained models perform better than Masked Autoencoders (MAE) or CroCo on several human-centric downstream tasks.
BibTeX
@inproceedings{crocoman,
  title={Cross-view and Cross-pose Completion for 3D Human Understanding},
  author={Armando, Matthieu and Galaaoui, Salma and Baradel, Fabien and Lucas, Thomas and Leroy, Vincent and Br{\'e}gier, Romain and Weinzaepfel, Philippe and Rogez, Gr{\'e}gory},
  booktitle={CVPR},
  year={2024}
}