3D Foundation Models
3D vision encompasses a set of complex, interrelated tasks including depth estimation, camera pose estimation, multi-view stereo and 3D reconstruction. These tasks have usually been treated individually, each requiring specific, often labor-intensive methodologies, which generally results in convoluted, inefficient pipelines. Each component builds independently on data that could be leveraged more effectively if the tasks were considered jointly. Our research builds upon deep learning to create a foundation model that integrates multiple aspects of 3D understanding into a single framework.
Models
DUSt3R & MASt3R (Dense Unconstrained Stereo 3D Reconstruction)
DUSt3R is a breakthrough 3D foundation model for scene reconstruction which unifies many 3D vision tasks. Based on transformers, it is an all-in-one, easy-to-use solution which can handle almost any situation and perform 3D reconstruction from as few as two images (no overlap required), with no prior information on camera calibration or viewpoint poses. It casts the pairwise image reconstruction problem as a regression of pointmaps: a proxy output that captures the 3D scene geometry (point cloud), connects pixels to 3D points, and spatially relates the two viewpoints (relative pose). DUSt3R can easily be used for a number of downstream tasks such as recovering camera intrinsics, visual localization, multi-view pose estimation, monocular and multi-view depth estimation, as well as 3D reconstruction. It can of course also be used in any application built on a 3D environment, such as gaming and other AR and VR experiences, autonomous driving, service robotics and video generation.
DUSt3R video: Examples of the 3D reconstruction output of DUSt3R from only 2 input images.
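To give a concrete feel for how simple this is in practice, the sketch below follows the usage pattern of the public DUSt3R repository (linked at the end of this post). Module paths, the checkpoint name and default arguments are taken from the repo's README and may change between releases, so treat this as an indicative sketch rather than a definitive recipe.

```python
import torch
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# Two uncalibrated images are enough: no poses, no intrinsics required.
images = load_images(['view1.jpg', 'view2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', symmetrize=True)
output = inference(pairs, model, device, batch_size=1)

# Align the pairwise pointmaps into a single coordinate frame, then read
# out the recovered scene geometry and cameras.
scene = global_aligner(output, device=device,
                       mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init='mst', niter=300, schedule='cosine', lr=0.01)
pts3d = scene.get_pts3d()      # dense per-view pointmaps (the 3D reconstruction)
poses = scene.get_im_poses()   # recovered camera poses
focals = scene.get_focals()    # recovered focal lengths
```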
MASt3R (Matching and Stereo 3D Reconstruction)
MASt3R is an extension of DUSt3R with two novelties: it outputs metric pointmaps and it has an additional head for matching. This makes it suitable for map-free localization, where it outperforms all existing methods thanks to its unprecedented pixel-matching abilities. Our experiments show that it significantly outperforms the state of the art on multiple matching benchmarks and improves map-free localization by up to 30%. MASt3R can handle up to thousands of images, making it ideal for building 3D models of complex environments such as cities or building interiors. While it retains all of DUSt3R’s capabilities, it significantly enhances the accuracy of the system.
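As an illustration of the matching head, here is a minimal sketch of dense two-image matching following the public MASt3R repository (linked at the end of this post). The model, helper names and checkpoint name follow its README; exact signatures may differ between releases.

```python
import torch
from mast3r.model import AsymmetricMASt3R
from mast3r.fast_nn import fast_reciprocal_NNs
from dust3r.inference import inference
from dust3r.utils.image import load_images

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AsymmetricMASt3R.from_pretrained(
    "naver/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric").to(device)

# Run the two-view network; alongside the metric pointmaps, the extra
# matching head returns a dense descriptor map for each image.
images = load_images(['query.jpg', 'reference.jpg'], size=512)
output = inference([tuple(images)], model, device, batch_size=1)
desc1 = output['pred1']['desc'].squeeze(0).detach()
desc2 = output['pred2']['desc'].squeeze(0).detach()

# Fast reciprocal nearest-neighbour search turns the descriptor maps into
# pixel-to-pixel correspondences between the two images.
matches_im1, matches_im2 = fast_reciprocal_NNs(
    desc1, desc2, subsample_or_initxy1=8, device=device,
    dist='dot', block_size=2**13)
```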
We’re continually improving it and are currently working on making MASt3R more robust to long-term changes in scenes (e.g. snowy vs. dry and sunny conditions), and on pushing its capabilities further by enabling it to handle semantics.
MASt3R video: performance on map-free localization. On the left is the ground-truth (reference) image and, immediately to the right, the frame from the constantly moving video camera whose pose needs to be estimated by MASt3R with respect to the reference image and its coordinates. In the graph, the black triangle (camera) is the location of the reference image. The green triangle is the ground-truth camera moving in the scene, and the blue one is the prediction made by MASt3R. The objective is for the blue triangle to be aligned as closely as possible with the green one, which MASt3R achieves with great precision, ultimately performing real-time tracking.
Pow3R
One drawback of our previous models DUSt3R and MASt3R is that they are restricted to taking only images as input. In reality, there are many situations where auxiliary information is known, for instance the camera calibration, or depth information from dedicated sensors such as LiDAR. Pow3R is an extension of DUSt3R that can incorporate any such auxiliary information, including camera intrinsics, camera poses, and dense or sparse depth, alongside the input images.
This opens up new capabilities, such as performing inference at native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3R at exploiting all available information. We showcase an example below where Pow3R performs high-resolution 3D reconstruction.
Pow3R video: This is an example of the new capabilities offered with Pow3R. Two images are input, and 3D reconstruction is carried out in two steps in a coarse-to-fine manner. The result of an initial coarse 3D reconstruction is re-injected for a finer block-by-block reconstruction, yielding high-definition output.
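To make the design concrete, here is a hypothetical sketch of how such optional priors could be expressed in code. All names here are illustrative only, not the actual Pow3R interface; the point is simply that every prior is optional and the model exploits whatever subset happens to be available.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ViewPriors:
    """Optional per-view priors; every field may be None. Hypothetical names."""
    intrinsics: Optional[torch.Tensor] = None  # 3x3 camera matrix K, if calibrated
    pose: Optional[torch.Tensor] = None        # 4x4 camera-to-world, if known
    depth: Optional[torch.Tensor] = None       # HxW map, sparse (NaN = unknown) or dense

def reconstruct_pair(model, img1, img2, priors1=None, priors2=None):
    """Pairwise reconstruction with any subset of priors supplied.

    With no priors at all, this reduces to DUSt3R-style unconstrained
    two-view reconstruction (hypothetical model signature)."""
    priors1 = priors1 if priors1 is not None else ViewPriors()
    priors2 = priors2 if priors2 is not None else ViewPriors()
    return model(img1, img2, priors1, priors2)
```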
MUSt3R
While DUSt3R introduced a novel paradigm in geometric computer vision to unify 3D vision tasks, and MASt3R extended it with an additional head for matching and made the 3D output metric, both process image pairs, regressing local 3D reconstructions that must then be aligned in a global coordinate system. The number of pairs grows quadratically with the number of images, which poses significant challenges for global alignment, especially for robust and fast optimization on large-scale collections. To overcome this limitation, we propose the Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, which modifies the DUSt3R architecture by making it symmetric and extending it to directly predict the 3D structure of all views in a common coordinate frame. We also incorporate a multi-layer memory mechanism into the model to reduce computational complexity and facilitate scalability, allowing the reconstruction of thousands of 3D pointmaps at high frame rates with minimal additional complexity. The framework supports both offline and online 3D reconstruction, making it versatile for applications in Structure from Motion (SfM) and visual SLAM. Unlike many contemporary methods, MUSt3R outputs 3D data in a metric space, and it achieves state-of-the-art performance across various downstream tasks such as uncalibrated visual odometry, relative camera pose estimation, scale and focal length estimation, 3D reconstruction, and multi-view depth estimation.
MUSt3R video: MUSt3R running on a laptop with an NVIDIA RTX A2000 8GB (35W) GPU on multiple sequences of the TUM RGB-D dataset.
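For intuition about the online mode, here is an illustrative pseudo-API for such a memory-based loop. The class and method names are hypothetical, not the actual MUSt3R code; they only convey the control flow.

```python
# Hypothetical names throughout, not the released MUSt3R API: each incoming
# frame attends to a running multi-layer token memory, so cost grows linearly
# with the number of views rather than quadratically with the number of pairs,
# and every predicted pointmap lands in one common metric coordinate frame.
memory = model.init_memory()            # hypothetical: empty multi-layer memory
for frame in video_stream:              # frames arrive one at a time (online mode)
    pointmap, memory = model.step(frame, memory)  # hypothetical single-frame step
    slam_update(pointmap)               # e.g. feed visual odometry / SLAM
```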
CroCo
The first model and generic architecture we developed was CroCo (Cross-view Completion). It’s a self-supervised pre-training model inspired by masked image modelling (MIM), where an input image is partially masked and the model reconstructs the image from what’s visible. In CroCo, the reconstruction also uses a second, unmasked image, which allows it to be based on the spatial relationship between the two views. When finetuned, CroCo showed improved performance not just on monocular 3D vision downstream tasks like depth estimation, but also on binocular ones like optical flow and relative pose estimation. An improved version of CroCo (v2) for stereo matching and optical flow was released in 2023.
CroCo video: Some reconstruction examples from CroCo on scenes unseen during training. From left to right, we show the first image (input), the masked second image (input), the output from CroCo, and the original (ground-truth) second image.
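To make the objective concrete, here is a self-contained sketch of what one cross-view completion training step could look like: mask most of the first view, then regress the masked patches while cross-attending to the unmasked second view. The encoder and decoder are abstract stand-ins with hypothetical signatures; this is the shape of the objective, not the released CroCo code.

```python
import torch
import torch.nn.functional as F

def cross_view_completion_loss(encoder, decoder, img1, img2,
                               patch=16, mask_ratio=0.9):
    """One illustrative CroCo-style pre-training step (hypothetical names).
    img1, img2: [B, 3, H, W] tensors showing the same scene from two viewpoints."""
    B, C, H, W = img1.shape

    def patchify(x):
        # Split an image into non-overlapping patch tokens: [B, N, C*patch*patch].
        p = x.unfold(2, patch, patch).unfold(3, patch, patch)
        return p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

    tok1, tok2 = patchify(img1), patchify(img2)
    N, D = tok1.shape[1], tok1.shape[2]

    # Randomly hide most of the first view's patches (high masking ratio, as in MIM).
    keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N, device=img1.device).argsort(dim=1)
    vis_idx, mask_idx = idx[:, :keep], idx[:, keep:]
    visible = torch.gather(tok1, 1, vis_idx[..., None].expand(-1, -1, D))
    target = torch.gather(tok1, 1, mask_idx[..., None].expand(-1, -1, D))

    # Encode the visible patches of view 1 and ALL patches of view 2; the decoder
    # regresses the masked patches of view 1 while cross-attending to view 2
    # (hypothetical encoder/decoder signatures).
    pred = decoder(encoder(visible), context=encoder(tok2), masked_positions=mask_idx)
    return F.mse_loss(pred, target)
```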
CroCo-Man
CroCo was adapted to humans in what we call CroCo-Man, where the pair of input images are photos of the same person. These image pairs can be constructed either from two views of the same human pose taken simultaneously, or from two poses taken from a motion sequence, which also gives the model information on how body parts interact. These pairs were complemented with existing lab and video datasets. Two models have been trained: one on the full body and another on close-ups of hands.
CroCo-Man video: Example reconstructions of the pre-training objectives, consisting of cross-pose and cross-view completion: given a masked image of a person, we reconstruct the masked area by additionally leveraging a second image of the same pose from another viewpoint (cross-view) or of another pose of the same person (cross-pose).
Related Publications
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction, CVPR 2025
- Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors, CVPR 2025
- MASt3R-SfM: a fully-integrated solution for unconstrained Structure-from-Motion, 3DV 2025
- Grounding Image Matching in 3D with MASt3R, ECCV 2024
- DUSt3R: Geometric 3D vision made easy, CVPR 2024
- CroCo-Man: Cross-view and cross-pose completion for 3D human understanding, CVPR 2024
- CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow, ICCV 2023
- CroCo: Self-supervised pretraining for 3D vision tasks by cross-view completion, NeurIPS 2022
Code and datasets
- MASt3R & MASt3R-SfM: https://github.com/naver/mast3r
- DUSt3R: https://github.com/naver/dust3r
- CroCo and CroCo v2: https://github.com/naver/croco
- Training data for some datasets will be available soon