Self-supervised pretraining and finetuning for monocular depth and visual odometry

Published by Boris Chidlovskii on 13 May 2024

The IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13-17 May, 2024

Abstract

For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. The first step is generic pretraining to learn 3D geometry using the cross-view completion objective (CroCo); the second is self-supervised finetuning on non-annotated videos.
We show that our self-supervised models can reach state-of-the-art performance ‘without bells and whistles’, using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of the proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. On all datasets, our method outperforms state-of-the-art methods, in particular on the depth prediction task.

We propose a new downstream task for CroCo pretraining. We show how to efficiently finetune the CroCo pretrained model on monocular depth and visual odometry in a self-supervised way. This is, to the best of our knowledge, the first task where both model pretraining and finetuning are fully self-supervised, with supervision signals coming from image completion at the pretraining step and from geometric constraints at the finetuning step. We benefit from generic pretrained models oriented towards understanding the 3D geometry of a scene, and finetune them on non-annotated videos to reach optimal downstream task performance.
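
To illustrate what a geometric constraint at the finetuning step can look like, here is a minimal sketch of the view-synthesis idea commonly used in self-supervised depth and odometry work: a pixel in one frame, together with its depth and the relative camera pose, determines where it should appear in a neighbouring frame, and the intensity mismatch serves as a training signal. This page does not specify the exact loss used, so this is an assumption for illustration only. The function names, the pinhole intrinsics tuple `K = (fx, fy, cx, cy)` and the nearest-neighbour sampling are all hypothetical choices.

```python
def reproject_pixel(u, v, depth, K, R, t):
    """Map pixel (u, v) of frame A, with known depth, into frame B.

    K is (fx, fy, cx, cy); R (3x3 nested lists) and t (length 3) take
    points from camera A coordinates to camera B coordinates.
    Returns None when the point falls behind camera B.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in camera A coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the rigid motion from camera A to camera B.
    xt = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    yt = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zt = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zt <= 0:
        return None
    # Project onto the image plane of camera B.
    return (fx * xt / zt + cx, fy * yt / zt + cy)


def photometric_l1(img_a, img_b, depth_a, K, R, t):
    """Mean absolute intensity difference between frame A and frame B
    sampled at the reprojected locations (nearest-neighbour sampling).

    Images and depth are lists of rows of floats with the same shape.
    Pixels that leave frame B are skipped; returns None if none remain.
    """
    height, width = len(img_a), len(img_a[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            hit = reproject_pixel(u, v, depth_a[v][u], K, R, t)
            if hit is None:
                continue
            ub, vb = round(hit[0]), round(hit[1])
            if 0 <= ub < width and 0 <= vb < height:
                total += abs(img_a[v][u] - img_b[vb][ub])
                count += 1
    return total / count if count else None
```

In a learned setting, the depth map and the pose `(R, t)` would be network predictions, and a differentiable sampler (for example bilinear interpolation) would typically replace the nearest-neighbour lookup used here for readability.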

