NAVER LABS Europe @CVPR 2025

Published by Irene Maxwell at 11 June 2025

At CVPR 2025, NAVER LABS Europe is presenting 10 papers that advance the state of the art in 3D reconstruction, visual localization, semantic segmentation, human motion understanding and visual navigation. You’ll find us in orals, highlights, posters and workshops. The news item has the detailed agenda of what is happening where, to help you navigate the conference (and find us!). Here you’ll find a recap of the papers grouped into 3 themes: 3D reconstruction and visual localization; visual navigation and representation learning; and semantic segmentation and human motion understanding. This work is part of our research on AI for Robotics.

3D Reconstruction and Visual Localization

MUSt3R: Multi-view Network for Stereo 3D Reconstruction (highlight)
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jérome Revaud, Vincent Leroy

MUSt3R extends DUSt3R, the breakthrough transformer-based 3D reconstruction model, to handle multiple views simultaneously in a shared coordinate system. The architecture is made symmetric and equipped with a multi-layer memory mechanism. This design significantly improves scalability and efficiency, enabling real-time inference over large image sets. MUSt3R supports both offline and online 3D reconstruction, achieving strong results across SfM, SLAM, and depth estimation tasks.
Code

MUSt3R provides light and fast reconstruction, and works both online and offline on large image collections and video.

Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors
Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, Jérome Revaud

Similar to MUSt3R, Pow3R also builds upon the DUSt3R framework, but it exploits camera and scene priors, such as camera intrinsics, alongside the input images in a single network. It is a lightweight, versatile 3D vision regression model with new capabilities such as inference at native image resolution and point-cloud completion. It achieves state-of-the-art results in 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation.

Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization
Maxime Pietrantoni, Gabriela Csurka, Torsten Sattler

Gaussian Splatting Feature Fields (GSFFs) is a visual localization method that combines 3D Gaussian Splatting with an implicit feature field to create a robust and privacy-preserving scene representation. By aligning 3D and 2D features in a shared embedding space and incorporating contrastive learning and 3D-informed clustering, the approach enables accurate pose estimation through feature or segmentation map alignment. The method achieves state-of-the-art results on several real-world datasets, supporting both privacy-preserving and standard localization pipelines.

Visual Navigation and Representation Learning

Reasoning in Visual Navigation of End-to-End Trained Agents: A Dynamical Systems Approach (highlight)
Steeven Janny, Hervé Poirier, Leonid Antsfeld, Guillaume Bono, Gianluca Monaci, Boris Chidlovskii, Francesco Giuliari, Alessio Del Bue, Christian Wolf

This study investigates how end-to-end trained agents perform real-world navigation, using a physical robot across 262 episodes. It analyzes how agents learn and use realistic dynamics, latent memory and short-horizon planning in real environments. The results reveal that these agents exhibit emerging reasoning and planning behaviours, offering new insights into robotics and control beyond simulation benchmarks. Read the related blog ‘Out of the box robot navigation and Spatial AI: Lessons learned moving AI out of simulation’.

Here we see a race comparison between 3 different agents; in red (Teleport D4) and in blue (Teleport D28) are two classic agents which were not trained with a realistic motion model, and in green (Ours) the NAVER LABS Europe agent.

What matters in ImageNav: architecture, pre-training, sim settings, pose
Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf

DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers
Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, Yannis Kalantidis

DUNE is a ViT-based encoder distilled from multiple specialised 2D and 3D foundation models to unify previously siloed visual tasks across 2D, 3D and human understanding, such as human mesh recovery, segmentation, 3D reconstruction and depth estimation. It is distilled from DINOv2, MASt3R and Multi-HMR and achieves performance comparable to its larger teachers, sometimes even outperforming them with a much smaller encoder.
Code and Models
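
To make the idea of distilling from heterogeneous teachers more concrete, here is a minimal sketch. It is not DUNE's training objective: it assumes a hypothetical setup in which the student feature is mapped through one linear projection per teacher and compared to that teacher's feature with a cosine distance. The function names, the teacher names `teacher_2d` and `teacher_3d`, and the toy dimensions are all illustrative assumptions.

```python
import math

def project(weights, feature):
    """Apply a linear projection (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def cosine_distance(u, v):
    """Return 1 - cosine similarity (0 when the vectors point the same way)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def multi_teacher_loss(student_feature, projections, teacher_features):
    """Average the per-teacher distances between projected student and teacher features."""
    losses = {}
    for name, target in teacher_features.items():
        predicted = project(projections[name], student_feature)
        losses[name] = cosine_distance(predicted, target)
    return sum(losses.values()) / len(losses), losses

# Toy example: one student feature, two hypothetical teachers with different output sizes.
student = [1.0, 2.0]
projections = {
    "teacher_2d": [[1.0, 0.0], [0.0, 1.0]],              # maps 2-D -> 2-D
    "teacher_3d": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # maps 2-D -> 3-D
}
teacher_features = {
    "teacher_2d": [2.0, 4.0],       # same direction as the projected student
    "teacher_3d": [0.0, 0.0, 1.0],  # orthogonal to the projected student
}
total, per_teacher = multi_teacher_loss(student, projections, teacher_features)
print(per_teacher, total)
```

The per-teacher projections are what allow a single student encoder to be compared against teachers whose feature spaces have different sizes; in this toy case the student already matches the first teacher and is maximally uninformative about the second.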

Semantic Segmentation and Human Motion Understanding

MEGA: Masked Generative Autoencoder for Human Mesh Recovery (oral)
Guénolé Fiche, Simon Leglaive, Xavier Alameda-Pineda, Francesc Moreno-Noguer

MEGA is a Masked Generative Autoencoder for Human Mesh Recovery (HMR) that addresses the ambiguity of predicting 3D human pose from a single RGB image. By tokenizing human pose and shape, MEGA frames HMR as a sequence generation task, enabling both deterministic single-prediction and stochastic multi-prediction modes. MEGA achieves state-of-the-art results on in-the-wild benchmarks, outperforming existing single- and multi-output methods in both accuracy and flexibility.
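
The difference between the two inference modes can be illustrated with a toy sketch. It assumes that a model has already produced, for each pose/shape token, a categorical distribution over a small codebook; the probabilities, the codebook size and the function names are hypothetical, and the actual MEGA decoding procedure is not reproduced here.

```python
import random

def decode_deterministic(token_probs):
    """Pick the most likely code index for every token (one prediction)."""
    return [max(range(len(p)), key=p.__getitem__) for p in token_probs]

def decode_stochastic(token_probs, num_samples, seed=0):
    """Sample code indices per token to obtain several plausible predictions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        sample = [rng.choices(range(len(p)), weights=p, k=1)[0] for p in token_probs]
        samples.append(sample)
    return samples

# Hypothetical distributions for 3 tokens over a codebook of 3 entries.
token_probs = [
    [0.7, 0.2, 0.1],
    [0.05, 0.9, 0.05],
    [0.4, 0.4, 0.2],  # ambiguous token: two codes are equally likely
]
single = decode_deterministic(token_probs)
hypotheses = decode_stochastic(token_probs, num_samples=5)
print(single)
print(hypotheses)
```

In the deterministic mode the ambiguous third token always resolves to the same code, whereas sampling lets different hypotheses disagree on it, which is one simple way to express the ambiguity of single-image pose estimation.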

LPOSS: Label Propagation over Patches and Pixels for Open-vocabulary Semantic Segmentation
Vladan Stojnić, Yannis Kalantidis, Jiri Matas, Giorgos Tolias

This paper shows that graph-based label propagation can refine weak, patch-level predictions from VLMs like CLIP. LPOSS leverages a Vision Model (VM) to better capture intra-image relationships, and its LPOSS+ variant applies pixel-level refinement to improve boundary accuracy. By processing the full image rather than local windows, the method captures global context and achieves state-of-the-art results on multiple benchmarks.
Code and Demo
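
As a rough illustration of graph-based label propagation, the sketch below uses the classic iterative scheme F = αSF + (1 − α)Y, where Y holds the initial class scores and S is a symmetrically normalised affinity matrix. This is a generic textbook formulation and not the LPOSS implementation; the 4-patch graph, the scores and the value of `alpha` are made-up assumptions (in the paper's setting the initial scores would come from a VLM and the affinities from a vision model).

```python
import math

def propagate_labels(affinity, initial_scores, alpha=0.8, num_iters=50):
    """Iterate F = alpha * S @ F + (1 - alpha) * Y on a small dense graph.

    Assumes a symmetric affinity matrix where every node has a positive degree.
    """
    n = len(affinity)
    num_classes = len(initial_scores[0])
    degree = [sum(row) for row in affinity]
    # Symmetric normalisation: S = D^(-1/2) W D^(-1/2)
    norm = [[affinity[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    scores = [row[:] for row in initial_scores]
    for _ in range(num_iters):
        scores = [
            [alpha * sum(norm[i][j] * scores[j][c] for j in range(n))
             + (1 - alpha) * initial_scores[i][c]
             for c in range(num_classes)]
            for i in range(n)
        ]
    return scores

def argmax_labels(scores):
    """Return the index of the highest-scoring class for each node."""
    return [max(range(len(s)), key=s.__getitem__) for s in scores]

# Patches 0-2 are strongly connected (same object); patch 3 is only weakly linked.
affinity = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
# Initial per-patch class scores; patch 1 is weakly assigned to the wrong class.
initial_scores = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.1, 0.9]]

labels_before = argmax_labels(initial_scores)
labels_after = argmax_labels(propagate_labels(affinity, initial_scores))
print(labels_before, labels_after)
```

In this toy graph the weak prediction on patch 1 is overruled by its two strongly connected neighbours, while patch 3 keeps its own label because its link to the rest of the graph is weak.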

Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos
Vadim Tschernezki, Diane Larlus, Andrea Vedaldi, Iro Laina

This work addresses the limitations of 3D vision techniques for segmenting dynamic scenes, particularly in egocentric videos where traditional 3D models struggle. Layered Motion Fusion integrates 2D motion segmentation into layered radiance fields to improve 3D understanding of moving objects. To handle the complexity of long dynamic sequences, the method includes test-time refinement, enabling more accurate geometry reconstruction and significantly outperforming 2D baselines in dynamic segmentation tasks.

CondiMen: Conditional Multi-Person Mesh Recovery
Romain Brégier, Fabien Baradel, Thomas Lucas, Salma Galaaoui, Matthieu Armando, Philippe Weinzaepfel, Grégory Rogez

CondiMen is a Bayesian network-based method for multi-person human mesh recovery that predicts a joint distribution over pose, shape, and camera parameters. Unlike traditional models that produce single estimates, CondiMen captures the uncertainty and correlations inherent in recovering 3D humans from 2D images. It supports integration of external information (e.g. known camera intrinsics or multi-view data) and allows efficient inference, achieving competitive or superior performance while remaining suitable for real-time use.
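
One reason why known camera intrinsics are useful can be seen from the pinhole camera model: a person of height H at distance d appears with a pixel height h = f·H/d, so focal length f and distance cannot be separated from a single image. The sketch below works through this with a small discrete set of hypotheses. It is only an illustration of the ambiguity, not CondiMen's model; the function name, the focal-length prior and the numbers are assumptions.

```python
def distance_distribution(pixel_height, person_height_m, focal_prior, known_focal=None):
    """Return {distance_m: probability} for a person observed with a given pixel height.

    Pinhole model: pixel_height = focal * person_height / distance.
    If the focal length is known, it replaces the prior over focal lengths.
    """
    if known_focal is not None:
        focal_prior = {known_focal: 1.0}
    distribution = {}
    for focal, prob in focal_prior.items():
        distance = round(focal * person_height_m / pixel_height, 3)
        distribution[distance] = distribution.get(distance, 0.0) + prob
    return distribution

# Hypothetical prior: the camera is equally likely to have a 500 px or 1000 px focal length.
focal_prior = {500.0: 0.5, 1000.0: 0.5}

# A 1.7 m tall person observed with a height of 340 px.
ambiguous = distance_distribution(340.0, 1.7, focal_prior)
resolved = distance_distribution(340.0, 1.7, focal_prior, known_focal=1000.0)
print(ambiguous)  # two equally likely distances
print(resolved)   # a single distance once the focal length is known
```

Without intrinsics the same observation is explained equally well by a person at 2.5 m or at 5 m; conditioning on the known focal length collapses the distribution to one distance.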
