Human perception and understanding is a major domain of computer vision that, like many other vision subdomains, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, which relies on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth, such as 2D or 3D labels, does not scale well. We therefore propose a self-supervised pre-training approach that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D geometry as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks and achieve state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.
Overview
CroCo-Man leverages the recent CroCo pre-training method, which was specifically designed to learn 3D cues, and adapts it to the setting of human-centric pre-training. Like CroCo, the model takes as input a pair of photographs of the same scene (in our case, the same person). One of them is partially masked with a random pattern, and the model is trained to reconstruct it using information from both images. A minimal sketch of this objective is given below.
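To make the objective concrete, here is a small, self-contained PyTorch sketch. It is an illustration under simplifying assumptions rather than the actual architecture: positional embeddings, the asymmetric encoder/decoder design, and other details of CroCo are omitted, and the names (`CrossViewCompletion`, `patchify`, the default hyperparameters) are hypothetical.

```python
import torch
import torch.nn as nn

def patchify(img, p=16):
    """Split a (B, 3, H, W) image into flattened (B, N, 3*p*p) patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

class CrossViewCompletion(nn.Module):
    """Reconstruct the masked patches of a target image given its visible
    patches and a second, fully visible image of the same person."""

    def __init__(self, patch_dim=768, dim=256, heads=8, depth=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), depth)
        self.head = nn.Linear(dim, patch_dim)  # predicts raw patch pixels

    def forward(self, img1, img2, mask_ratio=0.75):
        t1, t2 = patchify(img1), patchify(img2)        # (B, N, patch_dim)
        B, N, _ = t1.shape

        # Randomly mask a large fraction of the target image's patches.
        is_masked = torch.rand(B, N, device=img1.device) < mask_ratio
        x1 = torch.where(is_masked[..., None],
                         self.mask_token.expand(B, N, -1), self.embed(t1))

        # The reference image is encoded in full; the decoder then
        # cross-attends to it while completing the target tokens.
        memory = self.encoder(self.embed(t2))
        pred = self.head(self.decoder(x1, memory))

        # L2 reconstruction loss, computed on the masked patches only.
        err = ((pred - t1) ** 2).mean(-1)
        return (err * is_masked).sum() / is_masked.sum()
```

For example, `CrossViewCompletion()(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))` returns a scalar loss. Note that the same forward pass applies whether the pair is cross-view or cross-pose; only the pair construction differs.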
Our image pairs are constructed in two ways: a) by taking two views of the same pose (i.e., the same instant) and b) by taking two poses at different time steps of a motion sequence, for instance from videos showing a person in movement (see the sampling sketch below). As the human body is non-rigid, going beyond the static setting proposed in CroCo can enable the model to gain some understanding of how body parts interact and move with respect to one another.
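The snippet below sketches how such pairs might be sampled. It assumes a hypothetical data layout in which each capture sequence is a dict mapping camera ids to time-ordered frame lists; neither the function name nor the layout comes from the paper.

```python
import random

def sample_pair(sequence, pair_type):
    """Return a training pair from one capture sequence.

    `sequence` maps a camera id to a time-ordered list of frames
    (a hypothetical layout, chosen purely for illustration).
    """
    cams = list(sequence)
    if pair_type == "cross-view":
        # Same instant, two different viewpoints: a cue for 3D geometry.
        cam_a, cam_b = random.sample(cams, 2)
        t = random.randrange(len(sequence[cam_a]))
        return sequence[cam_a][t], sequence[cam_b][t]
    else:
        # Same viewpoint, two different instants: a cue for human motion.
        cam = random.choice(cams)
        t1, t2 = random.sample(range(len(sequence[cam])), 2)
        return sequence[cam][t1], sequence[cam][t2]
```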
With these two kinds of pairs, we leverage both large-scale multi-view datasets captured in labs, which provide high-quality cross-view pairs, and video datasets, which provide diverse in-the-wild cross-pose pairs.
Following the above strategy, we pre-train two different models, one trained on whole-body images, dubbed CroCo-Body, and one trained on close-up images of hands, dubbed CroCo-Hand.
The video below illustrates completion results of the pre-trained CroCo-Body model on unseen data.