3D Foundation Models

Image: 3D reconstruction of NAVER HQ by DUSt3R.

3D vision encompasses a set of complex, interrelated tasks including depth estimation, camera pose estimation, multi-view stereo and 3D reconstruction. These tasks have usually been treated individually, each requiring specific, often labor-intensive methodologies, which generally results in convoluted, inefficient pipelines. Each component of the process builds independently on data that could be leveraged more effectively if the tasks were considered in conjunction with one another. Our research builds upon deep learning to create a foundation model that integrates multiple aspects of 3D understanding into a single framework.


Models

DUSt3R (Dense Unconstrained Stereo 3D Reconstruction)

DUSt3R is a breakthrough 3D foundation model for scene reconstruction which unifies many 3D vision tasks. Based on transformers, it is an all-in-one, easy-to-use solution that can handle almost any situation and perform 3D reconstruction from as few as 2 images (no overlap required), with no prior information on camera calibration or viewpoint poses. It casts the pairwise image reconstruction problem as the regression of pointmaps, a proxy output that captures the 3D scene geometry (point cloud), connects pixels to 3D points and spatially relates the two viewpoints (relative pose). DUSt3R can be used to easily perform a number of downstream tasks such as recovering camera intrinsics, visual localization, multi-view pose estimation, mono- and multi-view depth estimation, as well as 3D reconstruction. It can of course be used in any kind of application that relies on a 3D environment, such as gaming and other AR and VR activities, autonomous driving, robot services and video generation.
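As a concrete example of one of these downstream tasks, the sketch below recovers camera intrinsics (here, the focal length) directly from a predicted pointmap. It follows the robust Weiszfeld-style estimation described in the DUSt3R paper, but the function itself is a simplified illustration rather than the released code; it assumes a centred principal point and square pixels.

```python
import numpy as np

def estimate_focal(pointmap, n_iters=10):
    """Recover the focal length from a (H, W, 3) pointmap expressed in the
    camera's own coordinate frame, assuming a centred principal point and
    square pixels. A Weiszfeld-style robust iteration finds the scale f that
    best maps the projected directions (X/Z, Y/Z) onto pixel coordinates.
    """
    H, W, _ = pointmap.shape
    # Pixel coordinates relative to the image centre.
    u, v = np.meshgrid(np.arange(W) - (W - 1) / 2,
                       np.arange(H) - (H - 1) / 2)
    pix = np.stack([u, v], axis=-1).reshape(-1, 2)

    pts = pointmap.reshape(-1, 3)
    z = np.clip(pts[:, 2:3], 1e-8, None)   # depths are positive by convention
    proj = pts[:, :2] / z                  # normalized image coordinates (X/Z, Y/Z)

    f = 0.5 * (H + W)                      # crude initial guess
    for _ in range(n_iters):
        res = np.linalg.norm(pix - f * proj, axis=-1)
        w = 1.0 / np.clip(res, 1e-8, None)  # Weiszfeld weights: robust L1-like fit
        f = np.sum(w * np.sum(pix * proj, axis=-1)) / np.sum(w * np.sum(proj ** 2, axis=-1))
    return f
```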

MASt3R (Matching and Stereo 3D Reconstruction)

Our latest model is MASt3R which builds on DUSt3R and adds a head for dense matching. Its metric pointmaps make it suitable for map-free relocalization, where it outperforms all existing methods thanks to its unprecedented pixel-matching abilities. Our experiments show that it significantly outperforms the state of the art on multiple matching tasks, by up to 30% on map-free localization. MASt3R can handle up to thousands of images, making it ideal for mapping 3D models of complex environments such as cities or building interiors. While it retains all of DUSt3R's capabilities, it significantly enhances the accuracy of the system. We're continually improving the system: we are currently working on making MASt3R more robust to long-term changes in scenes (e.g. snow vs. dry and sunny) and on pushing its capabilities further by making it capable of handling semantics.
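To give a flavour of how dense per-pixel descriptors can be turned into pixel correspondences, here is a minimal sketch of reciprocal (mutual) nearest-neighbour matching. The tensors and the function are illustrative assumptions rather than MASt3R's actual API, and the brute-force similarity matrix shown here is exactly the quadratic cost that MASt3R's fast matching scheme avoids.

```python
import torch

def reciprocal_nn_matches(desc1, desc2):
    """Mutual nearest-neighbour matching between two dense descriptor maps.

    desc1, desc2: (H, W, D) L2-normalised per-pixel descriptors, as a dense
    matching head might produce. Returns (N, 2) pixel coordinates in each
    image for the N reciprocal matches. The full similarity matrix below is
    O((HW)^2) in memory, so it is only practical for small images.
    """
    H, W, D = desc1.shape
    d1, d2 = desc1.reshape(-1, D), desc2.reshape(-1, D)
    sim = d1 @ d2.T          # cosine similarities (descriptors are normalised)
    nn12 = sim.argmax(dim=1) # best match in image 2 for each pixel of image 1
    nn21 = sim.argmax(dim=0) # best match in image 1 for each pixel of image 2
    idx1 = torch.arange(d1.shape[0])
    mutual = nn21[nn12] == idx1  # keep only matches that agree both ways
    idx1, idx2 = idx1[mutual], nn12[mutual]
    # Convert flat indices back to (row, col) pixel coordinates.
    to_rc = lambda i: torch.stack([torch.div(i, W, rounding_mode='floor'), i % W], dim=1)
    return to_rc(idx1), to_rc(idx2)
```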


Video 1: MASt3R performance on map-free localization. On the left is the ground-truth (reference) image and, immediately to its right, the frame from the constantly moving video camera whose pose MASt3R must estimate with respect to the reference image and its coordinates. In the graph, the black triangle (camera) marks the location of the reference image. The green triangle is the ground-truth camera moving in the scene; the blue one is the prediction made by MASt3R. The objective is to align the blue triangle as closely as possible with the green one, which MASt3R achieves with great precision, ultimately performing real-time tracking.

CroCo

The first model and generic architecture we developed is CroCo (Cross-view Completion). It is a self-supervised pre-training method inspired by masked image modelling (MIM), where an input image is partially masked and the model reconstructs it from what remains visible. In CroCo, the reconstruction additionally uses a second, unmasked image, which allows it to be based on the spatial relationship between the two views. When finetuned, CroCo showed improved performance not only on monocular 3D vision downstream tasks like depth estimation, but also on binocular ones like optical flow and relative pose estimation. An improved version of CroCo (v2) for stereo matching and optical flow was released in 2023.
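The sketch below illustrates the cross-view completion principle in PyTorch: patches of the first image are masked out and reconstructed by a decoder that cross-attends to the intact second view. The patch size, dimensions and module layout are illustrative assumptions and do not reproduce CroCo's actual architecture.

```python
import torch
import torch.nn as nn

PATCH, DIM = 16, 256  # illustrative values, not CroCo's actual configuration

class CrossViewCompletion(nn.Module):
    """Toy cross-view completion: reconstruct masked patches of image 1,
    conditioned on its visible patches and on all patches of image 2."""

    def __init__(self, img_size=224):
        super().__init__()
        self.n = (img_size // PATCH) ** 2                    # patches per image
        self.embed = nn.Conv2d(3, DIM, PATCH, stride=PATCH)  # patchify + project
        self.pos = nn.Parameter(torch.zeros(1, self.n, DIM))
        enc = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerDecoderLayer(DIM, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=4)  # cross-attends to view 2
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.head = nn.Linear(DIM, 3 * PATCH * PATCH)        # predict raw patch pixels

    def forward(self, img1, img2, mask_ratio=0.9):
        B = img1.shape[0]
        t1 = self.embed(img1).flatten(2).transpose(1, 2) + self.pos  # (B, N, DIM)
        t2 = self.embed(img2).flatten(2).transpose(1, 2) + self.pos
        # Randomly hide most patches of the first view.
        mask = torch.rand(B, self.n, device=img1.device) < mask_ratio
        t1 = torch.where(mask[..., None], self.mask_token.expand_as(t1), t1)
        ctx2 = self.encoder(t2)        # encode the unmasked second view
        out = self.decoder(t1, ctx2)   # reconstruct view 1 given view 2
        return self.head(out), mask    # per-patch pixel predictions + mask
```

During pre-training, a reconstruction loss (e.g. mean squared error) would be applied only to the masked patches, as in standard MIM; without the second view, the model reduces to ordinary masked image modelling.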

We present below some reconstruction examples from CroCo on scenes unseen during training. From top to bottom, we show the first image (input), the masked second image (input), the output from CroCo, and the original (ground-truth) second image.

CroCo-Man

CroCo has recently been adapted to humans in what we call CroCo-Man, where the pair of input images are photos of the same person. These image pairs can be constructed either from two views of the same human pose taken simultaneously, or from two poses taken from a motion sequence, which also gives the model information on how body parts interact. These pairs were complemented with existing lab datasets and video datasets. Two models have been trained: one on the full body and another on close-ups of hands.

Example reconstructions of the pre-training objectives, consisting of cross-pose and cross-view completion: given a masked image of a person, we reconstruct the masked area by additionally leveraging a second image of the same pose from another viewpoint (cross-view) or of another pose of the same person (cross-pose).
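To make the two kinds of training pairs concrete, here is a small sketch of how cross-view and cross-pose pairs could be sampled from a multi-view motion recording. The data layout and the function are hypothetical, purely for illustration.

```python
import random

def sample_pair(frames, pair_type):
    """Sample a CroCo-Man-style training pair from a multi-view recording.

    `frames` is a hypothetical structure: frames[t][c] is the image of the
    person at time step t seen from camera c (same subject throughout).
    """
    T, C = len(frames), len(frames[0])
    if pair_type == "cross-view":
        # Same pose (time step), two different simultaneous viewpoints.
        t = random.randrange(T)
        c1, c2 = random.sample(range(C), 2)
        return frames[t][c1], frames[t][c2]
    elif pair_type == "cross-pose":
        # Same camera, two different poses from the motion sequence.
        c = random.randrange(C)
        t1, t2 = random.sample(range(T), 2)
        return frames[t1][c], frames[t2][c]
    raise ValueError(pair_type)
```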
