3D Foundation Models
3D vision encompasses a set of complex, interrelated tasks including depth estimation, camera pose estimation, multi-view stereo and 3D reconstruction. These tasks have usually been treated individually, each requiring specific, often labor-intensive methodologies, which generally results in convoluted, inefficient pipelines. Each component builds independently on data that could be leveraged more effectively if the tasks were considered jointly. Our research builds upon deep learning to create a foundation model that integrates multiple aspects of 3D understanding into a single framework.
Models
DUSt3R & MASt3R (Dense Unconstrained Stereo 3D Reconstruction)
DUSt3R is a breakthrough 3D foundation model for scene reconstruction which unifies many 3D vision tasks. Based on transformers, it is an all-in-one, easy-to-use solution which can handle almost any situation and perform 3D reconstruction from as few as two images (no overlap required), with no prior information on camera calibration or viewpoint poses. It casts the pairwise image reconstruction problem as a regression of pointmaps: a proxy output that captures the 3D scene geometry (point cloud), connects pixels to 3D points, and spatially relates the two viewpoints (relative pose). DUSt3R can easily be used for a number of downstream tasks such as recovering camera intrinsics, visual localization, multi-view pose estimation, monocular and multi-view depth estimation, as well as 3D reconstruction. It can of course also be used in any application built on a 3D environment, such as gaming and other AR and VR experiences, autonomous driving, service robotics and video generation.
DUSt3R video: Examples of the 3D reconstruction output of DUSt3R from only 2 input images.
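To give a concrete feel for how simple this is in practice, the sketch below follows the usage pattern of the public DUSt3R repository (linked at the end of this post). Module paths, the checkpoint name and default arguments are taken from the repo's README and may change between releases, so treat this as an indicative sketch rather than a definitive recipe.

```python
import torch
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)

# Two uncalibrated images are enough: no poses, no intrinsics required.
images = load_images(['view1.jpg', 'view2.jpg'], size=512)
pairs = make_pairs(images, scene_graph='complete', symmetrize=True)
output = inference(pairs, model, device, batch_size=1)

# Align the pairwise pointmaps into a single coordinate frame, then read
# out the recovered scene geometry and cameras.
scene = global_aligner(output, device=device,
                       mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init='mst', niter=300, schedule='cosine', lr=0.01)
pts3d = scene.get_pts3d()      # dense per-view pointmaps (the 3D reconstruction)
poses = scene.get_im_poses()   # recovered camera poses
focals = scene.get_focals()    # recovered focal lengths
```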
MASt3R (Matching and Stereo 3D Reconstruction)
MASt3R is an extension of DUSt3R with two novelties: it outputs metric pointmaps and it has an additional head for matching. This makes it suitable for map-free localization, where it outperforms all existing methods thanks to its unprecedented pixel-matching abilities. Our experiments show that it significantly outperforms the state of the art on multiple matching benchmarks and improves map-free localization by up to 30%. MASt3R can handle up to thousands of images, making it ideal for building 3D models of complex environments such as cities or building interiors. While it retains all of DUSt3R’s capabilities, it significantly enhances the accuracy of the system.
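As an illustration of the matching head, here is a minimal sketch of dense two-image matching following the public MASt3R repository (linked at the end of this post). The model, helper names and checkpoint name follow its README; exact signatures may differ between releases.

```python
import torch
from mast3r.model import AsymmetricMASt3R
from mast3r.fast_nn import fast_reciprocal_NNs
from dust3r.inference import inference
from dust3r.utils.image import load_images

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AsymmetricMASt3R.from_pretrained(
    "naver/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric").to(device)

# Run the two-view network; alongside the metric pointmaps, the extra
# matching head returns a dense descriptor map for each image.
images = load_images(['query.jpg', 'reference.jpg'], size=512)
output = inference([tuple(images)], model, device, batch_size=1)
desc1 = output['pred1']['desc'].squeeze(0).detach()
desc2 = output['pred2']['desc'].squeeze(0).detach()

# Fast reciprocal nearest-neighbour search turns the descriptor maps into
# pixel-to-pixel correspondences between the two images.
matches_im1, matches_im2 = fast_reciprocal_NNs(
    desc1, desc2, subsample_or_initxy1=8, device=device,
    dist='dot', block_size=2**13)
```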
We’re continually improving it and are currently working on making MASt3R more robust to long-term changes in scenes (e.g. snowy vs. dry and sunny conditions), and on pushing its capabilities further by enabling it to handle semantics.
MASt3R video: performance on map-free localization. On the left is the ground-truth (reference) image and, immediately to the right, the frame from the constantly moving video camera whose pose needs to be estimated by MASt3R with respect to the reference image and its coordinates. In the graph, the black triangle (camera) is the location of the reference image. The green triangle is the ground-truth camera moving in the scene, and the blue one is the prediction made by MASt3R. The objective is for the blue triangle to be aligned as closely as possible with the green one, which MASt3R achieves with great precision, ultimately performing real-time tracking.
Pow3R
One drawback of our previous models DUSt3R and MASt3R is that they are restricted to taking only images as input. In reality, there are many situations where auxiliary information is known, for instance the camera calibration, or depth information from dedicated sensors such as LiDAR. Pow3R is an extension of DUSt3R that can incorporate any such auxiliary information, including camera intrinsics, camera poses, and dense or sparse depth, alongside the input images.
This opens up new capabilities, such as performing inference at native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3R at exploiting all available information. We showcase an example below where Pow3R performs high-resolution 3D reconstruction.
Pow3R video: This is an example of the new capabilities offered with Pow3R. Two images are input, and 3D reconstruction is carried out in two steps in a coarse-to-fine manner. The result of an initial coarse 3D reconstruction is re-injected for a finer block-by-block reconstruction, yielding high-definition output.
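To make the design concrete, here is a hypothetical sketch of how such optional priors could be expressed in code. All names here are illustrative only, not the actual Pow3R interface; the point is simply that every prior is optional and the model exploits whatever subset happens to be available.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class ViewPriors:
    """Optional per-view priors; every field may be None. Hypothetical names."""
    intrinsics: Optional[torch.Tensor] = None  # 3x3 camera matrix K, if calibrated
    pose: Optional[torch.Tensor] = None        # 4x4 camera-to-world, if known
    depth: Optional[torch.Tensor] = None       # HxW map, sparse (NaN = unknown) or dense

def reconstruct_pair(model, img1, img2, priors1=None, priors2=None):
    """Pairwise reconstruction with any subset of priors supplied.

    With no priors at all, this reduces to DUSt3R-style unconstrained
    two-view reconstruction (hypothetical model signature)."""
    priors1 = priors1 if priors1 is not None else ViewPriors()
    priors2 = priors2 if priors2 is not None else ViewPriors()
    return model(img1, img2, priors1, priors2)
```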
MUSt3R
While DUSt3R introduced a novel paradigm in geometric computer vision to unify 3D vision tasks, and MASt3R extended it with an additional head for matching and made the 3D output metric, both process image pairs, regressing local 3D reconstructions that must then be aligned in a global coordinate system. The number of pairs grows quadratically with the number of images, which poses significant challenges for global alignment, especially for robust and fast optimization on large-scale collections. To overcome this limitation, we propose the Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, which modifies the DUSt3R architecture by making it symmetric and extending it to directly predict the 3D structure of all views in a common coordinate frame. We also incorporate a multi-layer memory mechanism into the model to reduce computational complexity and facilitate scalability, allowing the reconstruction of thousands of 3D pointmaps at high frame rates with minimal additional complexity. The framework supports both offline and online 3D reconstruction, making it versatile for applications in Structure from Motion (SfM) and visual SLAM. Unlike many contemporary methods, MUSt3R outputs 3D data in a metric space, and it achieves state-of-the-art performance across various downstream tasks such as uncalibrated visual odometry, relative camera pose estimation, scale and focal length estimation, 3D reconstruction, and multi-view depth estimation.
MUSt3R video: MUSt3R running on a laptop with an NVIDIA RTX A2000 8GB (35W) GPU on multiple sequences of the TUM RGB-D dataset.
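For intuition about the online mode, here is an illustrative pseudo-API for such a memory-based loop. The class and method names are hypothetical, not the actual MUSt3R code; they only convey the control flow.

```python
# Hypothetical names throughout, not the released MUSt3R API: each incoming
# frame attends to a running multi-layer token memory, so cost grows linearly
# with the number of views rather than quadratically with the number of pairs,
# and every predicted pointmap lands in one common metric coordinate frame.
memory = model.init_memory()            # hypothetical: empty multi-layer memory
for frame in video_stream:              # frames arrive one at a time (online mode)
    pointmap, memory = model.step(frame, memory)  # hypothetical single-frame step
    slam_update(pointmap)               # e.g. feed visual odometry / SLAM
```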
CroCo
The first model and generic architecture we developed was CroCo (Cross-view Completion). It’s a self-supervised pre-training model inspired by masked image modelling (MIM), where an input image is partially masked and the model reconstructs the image from what’s visible. In CroCo, the reconstruction also uses a second, unmasked image, which allows it to be based on the spatial relationship between the two views. When finetuned, CroCo showed improved performance not just on monocular 3D vision downstream tasks like depth estimation, but also on binocular ones like optical flow and relative pose estimation. An improved version of CroCo (v2) for stereo matching and optical flow was released in 2023.
CroCo video: Some reconstruction examples from CroCo on scenes unseen during training. From left to right, we show the first image (input), the masked second image (input), the output from CroCo, and the original (ground-truth) second image.
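To make the objective concrete, here is a self-contained sketch of what one cross-view completion training step could look like: mask most of the first view, then regress the masked patches while cross-attending to the unmasked second view. The encoder and decoder are abstract stand-ins with hypothetical signatures; this is the shape of the objective, not the released CroCo code.

```python
import torch
import torch.nn.functional as F

def cross_view_completion_loss(encoder, decoder, img1, img2,
                               patch=16, mask_ratio=0.9):
    """One illustrative CroCo-style pre-training step (hypothetical names).
    img1, img2: [B, 3, H, W] tensors showing the same scene from two viewpoints."""
    B, C, H, W = img1.shape

    def patchify(x):
        # Split an image into non-overlapping patch tokens: [B, N, C*patch*patch].
        p = x.unfold(2, patch, patch).unfold(3, patch, patch)
        return p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

    tok1, tok2 = patchify(img1), patchify(img2)
    N, D = tok1.shape[1], tok1.shape[2]

    # Randomly hide most of the first view's patches (high masking ratio, as in MIM).
    keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N, device=img1.device).argsort(dim=1)
    vis_idx, mask_idx = idx[:, :keep], idx[:, keep:]
    visible = torch.gather(tok1, 1, vis_idx[..., None].expand(-1, -1, D))
    target = torch.gather(tok1, 1, mask_idx[..., None].expand(-1, -1, D))

    # Encode the visible patches of view 1 and ALL patches of view 2; the decoder
    # regresses the masked patches of view 1 while cross-attending to view 2
    # (hypothetical encoder/decoder signatures).
    pred = decoder(encoder(visible), context=encoder(tok2), masked_positions=mask_idx)
    return F.mse_loss(pred, target)
```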
CroCo-Man
CroCo was adapted to humans in what we call CroCo-Man, where the pair of input images are photos of the same person. These image pairs can be constructed either from two views of the same human pose taken simultaneously, or from two poses taken from a motion sequence, which also gives the model information on how body parts interact. These pairs were complemented with existing lab and video datasets. Two models have been trained: one on the full body and another on close-ups of hands.
CroCo-Man video: Example reconstructions of the pre-training objectives, consisting of cross-pose and cross-view completion: given a masked image of a person, we reconstruct the masked area by additionally leveraging a second image of the same pose from another viewpoint (cross-view) or of another pose of the same person (cross-pose).
Related Publications
- MUSt3R: Multi-view Network for Stereo 3D Reconstruction, CVPR 2025
- Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors, CVPR 2025
- MASt3R-SfM: a fully-integrated solution for unconstrained Structure-from-Motion, 3DV 2025
- Grounding Image Matching in 3D with MASt3R, ECCV 2024
- DUSt3R: Geometric 3D vision made easy, CVPR 2024
- CroCo-Man: Cross-view and cross-pose completion for 3D human understanding, CVPR 2024
- CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow, ICCV 2023
- CroCo: Self-supervised pretraining for 3D vision tasks by cross-view completion, NeurIPS 2022
Code and datasets
- MASt3R & MASt3R-SfM: https://github.com/naver/mast3r
- DUSt3R: https://github.com/naver/dust3r
- CroCo and CroCo v2: https://github.com/naver/croco
- Training data for some datasets will be available soon