3D Foundation Models
3D vision encompasses a set of complex, interrelated tasks, including depth estimation, camera pose estimation, multi-view stereo and 3D reconstruction. These tasks have traditionally been treated individually, each requiring its own, often labour-intensive, methodology, which generally results in convoluted, inefficient processing pipelines. Each component builds independently on data that could be leveraged more effectively if the tasks were considered jointly. Our research builds on deep learning to create a foundation model that integrates multiple aspects of 3D understanding into a single framework.
Models
DUSt3R (Dense Unconstrained Stereo 3D Reconstruction)
DUSt3R is a breakthrough 3D foundation model for scene reconstruction that unifies many 3D vision tasks. Based on transformers, it is an all-in-one, easy-to-use solution that can handle almost any situation and perform 3D reconstruction from as few as two images (no overlap required), with no prior information about camera calibration or viewpoint poses. It casts the pairwise image reconstruction problem as a regression of pointmaps: a proxy output that captures the 3D scene geometry (point cloud), connects pixels to 3D points and spatially relates the two viewpoints (relative pose). DUSt3R can be used to easily perform a number of downstream tasks such as recovering camera intrinsics, visual localization, multi-view pose estimation, monocular and multi-view depth estimation, and 3D reconstruction. It can, of course, be used for any kind of application that relies on a 3D environment, such as gaming and other AR/VR activities, autonomous driving, robot services and video generation.
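To make the pointmap representation concrete, the sketch below shows why camera intrinsics can be read off a pointmap. It builds a synthetic pointmap from a known pinhole camera, then recovers the focal length with a simple least-squares fit; this is an illustration of the principle on toy data, not DUSt3R's actual code, which uses a more robust estimator on predicted (noisy) pointmaps.

```python
import numpy as np

def make_pointmap(f, cx, cy, H, W, depth):
    """Synthetic pointmap: back-project every pixel of an HxW image
    at a constant depth, using a pinhole camera (focal f, principal
    point (cx, cy)). Returns an (H, W, 3) array of 3D points."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    z = np.full_like(x, depth, dtype=float)
    return np.stack([x, y, z], axis=-1)

def estimate_focal(pointmap, cx, cy):
    """Recover the focal length from a pointmap by least squares:
    the pinhole model gives (u - cx) = f * x / z and (v - cy) = f * y / z,
    so f is the slope relating (x/z, y/z) to centred pixel coordinates."""
    H, W, _ = pointmap.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x, y, z = pointmap[..., 0], pointmap[..., 1], pointmap[..., 2]
    a = np.concatenate([(x / z).ravel(), (y / z).ravel()])
    b = np.concatenate([(u - cx).ravel(), (v - cy).ravel()])
    return float(a @ b / (a @ a))
```

On the clean synthetic pointmap the fit is exact; on a regressed pointmap the same idea is applied with robust weighting.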
MASt3R (Matching and Stereo 3D Reconstruction)
Our latest model is MASt3R, which builds on DUSt3R and adds a head for matching. Its metric pointmaps make it suitable for map-free relocalization, where it outperforms all existing methods thanks to its unprecedented pixel-matching abilities. Our experiments show that it significantly outperforms the state of the art on multiple matching tasks, by up to 30% in map-free localization. MASt3R can handle up to thousands of images, making it ideal for mapping 3D models of complex environments such as cities or building interiors. While it retains all of DUSt3R's capabilities, it significantly improves the accuracy of the system. We're continually improving the system: we're currently working on making MASt3R more robust to long-term changes in scenes (e.g. snow vs. dry and sunny conditions) and on pushing its capabilities further by enabling it to handle semantics.
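MASt3R's matching head produces per-pixel descriptors, and correspondences between two images are obtained by keeping reciprocal (mutual) nearest neighbours. The sketch below illustrates that matching rule on plain descriptor arrays; it shows the principle only, not MASt3R's optimized dense implementation.

```python
import numpy as np

def mutual_nearest_matches(desc1, desc2):
    """Reciprocal nearest-neighbour matching.

    desc1: (N, D) and desc2: (M, D) arrays of L2-normalized descriptors.
    A pair (i, j) is kept only if j is the best match of i in desc2
    AND i is the best match of j in desc1. Returns a (K, 2) index array."""
    sim = desc1 @ desc2.T                       # cosine similarity matrix
    nn12 = sim.argmax(axis=1)                   # best match in desc2 for each desc1
    nn21 = sim.argmax(axis=0)                   # best match in desc1 for each desc2
    keep = nn21[nn12] == np.arange(len(desc1))  # reciprocity check
    return np.stack([np.arange(len(desc1))[keep], nn12[keep]], axis=1)
```

The reciprocity check discards ambiguous one-way matches, which is what makes this simple rule surprisingly reliable.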
CroCo
The first model and generic architecture we developed is CroCo (Cross-view Completion). It's a self-supervised pre-training model inspired by masked image modelling (MIM), where an input image is partially masked and the model reconstructs it from what's visible. In CroCo, the reconstruction additionally uses a second, unmasked image, so it can exploit the spatial relationship between the two views. When fine-tuned, CroCo showed improved performance not only on monocular 3D vision downstream tasks like depth estimation but also on binocular ones like optical flow and relative pose estimation. An improved version, CroCo v2, targeting stereo matching and optical flow, was released in 2023.
CroCo-Man
CroCo has recently been adapted to humans in what we call CroCo-Man, where the pair of input images are photos of the same person. These image pairs can be constructed either from two views of the same human pose taken simultaneously or from two poses taken within a motion sequence, which also gives the model information about how body parts interact. These pairs were complemented with existing lab and video datasets. Two models have been trained: one on the full body and another on close-ups of hands.
Related Publications
- Grounding Image Matching in 3D with MASt3R, ECCV 2024
- DUSt3R: Geometric 3D vision made easy, CVPR 2024
- CroCo-Man: Cross-view and cross-pose completion for 3D human understanding, CVPR 2024
- CroCo: Self-supervised pretraining for 3D vision tasks by cross-view completion, NeurIPS 2022
- CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow, ICCV 2023
Code and datasets
- CroCo and CroCo v2: https://github.com/naver/croco
- DUSt3R: https://github.com/naver/dust3r
- MASt3R: https://github.com/naver/mast3r
- Training data of some datasets will be available soon