CroCo-Man: Cross-view and Cross-pose Completion for 3D Human Understanding



Human perception and understanding is a major domain of computer vision which, like many other subdomains, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, which relies on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.


CroCo-Man leverages the recent CroCo pre-training method, specifically designed to learn 3D cues, and adapts it to the setting of human-centric pre-training. Like CroCo, the model takes as input a pair of photographs of the same scene (in our case, the same person). One of them is partially masked with a random pattern, and the model is trained to reconstruct it using information from both images.
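The masking and reconstruction objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size (16), the masking ratio (0.75), the mean-squared-error loss restricted to masked patches, and all function names are assumptions chosen for illustration, and the "model" is a trivial placeholder standing in for the transformer.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches: True means the patch is hidden."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
target_img = rng.random((224, 224, 3))     # first image: will be masked
reference_img = rng.random((224, 224, 3))  # second image: fully visible

target_patches = patchify(target_img)
reference_patches = patchify(reference_img)

# Hide a random subset of the first image's patches (ratio is an assumption).
mask = random_patch_mask(len(target_patches), mask_ratio=0.75, rng=rng)
visible_target = target_patches[~mask]

# Placeholder "model": a real network would take `visible_target` and
# `reference_patches` as input; here every patch is predicted as the
# mean reference patch, just to exercise the loss.
pred = np.tile(reference_patches.mean(axis=0), (len(target_patches), 1))
loss = masked_reconstruction_loss(pred, target_patches, mask)
```

The key point of the sketch is that only the first image is masked, while the second image is passed in full, so the reconstruction can draw on both.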

Our image pairs are constructed in two ways: a) by taking two views of the same pose (i.e., at the same instant) and b) by taking two poses from a motion sequence at different time steps, for instance from videos showing a person in movement. As the human body is non-rigid, going beyond the static setting proposed in CroCo can enable the model to gain some understanding of how body parts interact and move with respect to one another.
With these two kinds of pairs, we leverage both large-scale multi-view datasets captured in labs, which provide high-quality cross-view pairs, and video datasets, which provide diverse in-the-wild cross-pose pairs.
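The two pair types can be sketched as follows. This is a hypothetical illustration rather than the authors' data pipeline: the `Frame` record, the `max_gap` limit on the temporal distance, and the choice to draw cross-pose pairs from a single camera are all assumptions. Pairs are ordered, since the first image plays the role of the masked target and the second that of the reference.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """Hypothetical index entry for one image of a person."""
    sequence: str  # capture session or video clip showing one person
    camera: int    # camera identifier within the sequence
    time: int      # frame index within the sequence

def cross_view_pairs(frames):
    """Ordered pairs showing the same instant from two different cameras."""
    by_instant = defaultdict(list)
    for f in frames:
        by_instant[(f.sequence, f.time)].append(f)
    return [(a, b)
            for group in by_instant.values()
            for a in group for b in group
            if a.camera != b.camera]

def cross_pose_pairs(frames, max_gap):
    """Ordered pairs from the same camera at two different time steps.

    `max_gap` (an assumption) bounds how far apart the two frames may be.
    """
    by_track = defaultdict(list)
    for f in frames:
        by_track[(f.sequence, f.camera)].append(f)
    return [(a, b)
            for track in by_track.values()
            for a in track for b in track
            if a.time != b.time and abs(a.time - b.time) <= max_gap]

# Toy index: a two-camera lab capture and a single-camera video, 3 frames each.
frames = [Frame("lab_seq", cam, t) for cam in (0, 1) for t in range(3)]
frames += [Frame("wild_video", 0, t) for t in range(3)]

view_pairs = cross_view_pairs(frames)
pose_pairs = cross_pose_pairs(frames, max_gap=1)
```

In this toy index, the single-camera video contributes only cross-pose pairs, while the multi-camera capture contributes both kinds, which mirrors the split between lab datasets and in-the-wild videos described above.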
Following the above strategy, we pre-train two different models, one trained on whole-body images, dubbed CroCo-Body, and one trained on close-up images of hands, dubbed CroCo-Hand.
The video below illustrates the pre-training completion results of CroCo-Body on unseen data.

Example reconstructions for the two pre-training objectives, cross-pose and cross-view completion: given a masked image of a person, we reconstruct the masked area by additionally leveraging a second image of the same person, showing either the same pose from another viewpoint (cross-view) or another pose (cross-pose).


Results show that, after fine-tuning, our pre-trained models perform better than Masked Autoencoders (MAE) or CroCo on several human-centric downstream tasks.


Comparison with other pre-training methods on different downstream tasks (a) and under different fine-tuning data regimes (b), i.e., when varying the number of annotated training samples from COCO$_{part}$ used for fine-tuning on the body mesh recovery task from 10% to 100%. MAE-Body/Hand means that we pre-train MAE on the same data as CroCo-Body/Hand.
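As a rough illustration of the data-regime comparison in (b), the sketch below draws a fixed fraction of an annotated training set. The function name, the seed, and the choice of nested subsets (the 10% subset is contained in the 50% one) are assumptions for illustration; the source does not state how the subsets were drawn.

```python
import random

def subsample_annotations(samples, fraction, seed=0):
    """Return a reproducible subset containing `fraction` of `samples`.

    The list is shuffled once with a fixed seed and a prefix is kept, so
    subsets for smaller fractions are nested inside those for larger ones.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    keep = max(1, round(fraction * len(shuffled)))
    return shuffled[:keep]

# Hypothetical annotated sample identifiers.
annotated_ids = list(range(1000))
subset_10 = subsample_annotations(annotated_ids, 0.10)
subset_50 = subsample_annotations(annotated_ids, 0.50)
subset_100 = subsample_annotations(annotated_ids, 1.00)
```

Nesting the subsets means that each larger regime only adds annotations, which makes the curves across fractions easier to compare.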


Citation: Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, and Grégory Rogez. Cross-view and Cross-pose Completion for 3D Human Understanding.
