Foundation Models for Robot Navigation

Header image: DUSt3R 3D reconstruction of NAVER HQ.

Autonomous robots can now provide services previously reserved for humans, such as waiting tables in restaurants, guiding visitors, patrolling, surveillance and last-mile package delivery. To perform these tasks, robots need to be able to navigate in a huge diversity of environments and places, which are constantly being changed according to needs, preferences and even moods.
Robotics has traditionally addressed navigation with mapping and planning based on SLAM (simultaneous localization and mapping). Our research is based on a different Ansatz: learning a foundation model for navigation with end-to-end agents trained on large amounts of heterogeneous data. We want a robot to work out of the box in a new environment without any special preparation: no prior scans and no existing external localization system.

Towards a Foundation Model for Robot Navigation

Learning from large-scale data allows a robot agent to pick up more complex regularities, to process more subtle cues and therefore to be more robust. However, unlike large language foundation models (LLMs), which are trained on large amounts of readily available text, robotic agents need to be trained on diverse 3D scenes or navigation episodes, both of which are relatively hard to come by. To overcome this bottleneck, training in simulation has been the main focus in the field of AI (to be) embodied in autonomous agents and robots. It also makes large-scale training possible without manually supervising real robot hardware and without the risk of damaging equipment in the early stages of training.

Yet, although simple tasks like point goal navigation (navigation to coordinates given beforehand) are often considered “solved” in simulation, deploying these agents on real physical robots often paints quite a different picture. This so-called sim2real gap describes the difference between simulated environments and real physical environments. These differences have multiple aspects, including visual differences (shown in Figure 1 below), but also differences in agent dynamics, environment dynamics and agent-environment interactions (collision models and haptics). All of these differences hurt the transfer of trained agents to real robots.
To learn agents that perform robustly in real environments, we deploy our trained models on real robots and evaluate them in challenging setups on a daily basis. We work on improving navigation robustness and speed in real environments by
–    reliably simulating realistic agent behavior during training
–    leveraging large-scale geometric foundation models
–    designing and implementing efficient training losses.

Figure 1: The visual sim2real gap: images captured by onboard cameras (Real) differ from observations rendered in a simulator (Simulation), even for photo-realistic renderings. Differences are mostly due to the use of imperfect scans and reconstructions.

Reliably simulating realistic agent behavior during training

Previous attempts to address the sim2real gap have focused only on the visual gap (as in Figure 1) and, although progress has been made by training on large amounts of data and performing domain randomization during training, it has come at the cost of navigation speed. Typical agents are trained to take small steps and/or to stop for observations, which sidesteps the problem of simulating realistic agent dynamics during training.
In our work [1], we target training agents capable of navigating swiftly in real environments. To reliably model agent dynamics, we minimize the sim2real gap not only in sensing but also in actuation. Our agent predicts velocity commands while in motion, reaching a navigation speed of 1 meter per second. The resulting delay between sensing and actuation requires the agent to anticipate motion, which it learns in simulation from a realistic motion model. We create a second-order dynamical model from real robot rollouts and expert actions, which approximates the physical properties of the real robot.
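As a rough illustration of what such a model can look like, the sketch below fits a discrete second-order response of the realised velocity to the commanded velocity from logged rollouts. This is a simplified, hypothetical formulation written for this post; the function names, the single-velocity-channel assumption and the least-squares fit are our illustrative choices, not the exact model of [1].

```python
# Minimal sketch: identify a discrete second-order velocity response
#   v[t] ≈ a1 * v[t-1] + a2 * v[t-2] + b * u[t-1]
# from logged rollouts, where v is the realised velocity and u the commanded one.
# Simplified to a single velocity channel; a real model would also cover angular motion.
import numpy as np

def fit_second_order_model(v, u):
    """v: realised velocities (T,), u: commanded velocities (T,), both from real rollouts."""
    # Build the least-squares problem X @ theta ≈ y
    X = np.stack([v[1:-1], v[:-2], u[1:-1]], axis=1)  # rows: [v[t-1], v[t-2], u[t-1]]
    y = v[2:]                                         # targets: v[t]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta                                      # (a1, a2, b)

def rollout(theta, u, v0=0.0, v1=0.0):
    """Replay a sequence of commands through the fitted model."""
    a1, a2, b = theta
    v = [v0, v1]
    for t in range(2, len(u)):
        v.append(a1 * v[-1] + a2 * v[-2] + b * u[t - 1])
    return np.array(v)
```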
The video in Figure 2 shows a comparison between the real robot behavior, recorded from an onboard camera, and the simulated behavior when the same recorded velocity actions are replayed. The simulated agent behaves like the real robot, matching acceleration, braking and turning behavior. This similarity during training allows the agent to anticipate its future behavior given its own predicted actions and behave accordingly.

Figure 2: Video: comparison of real robot dynamics with simulated dynamics
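Continuing the hypothetical sketch above, the fitted model can be checked the same way as in the video: replay the recorded commands and compare the simulated response to the real one. Here `real_velocities` and `commanded_velocities` are placeholder arrays standing in for logged data, not names from our code base.

```python
# Identify the model from one recorded rollout, replay the same commands,
# and measure how closely the simulated response tracks the real one.
theta = fit_second_order_model(real_velocities, commanded_velocities)
sim_velocities = rollout(theta, commanded_velocities,
                         v0=real_velocities[0], v1=real_velocities[1])
print("mean absolute velocity error:",
      np.abs(sim_velocities - real_velocities).mean())
```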

Figure 3 shows the behavior of our agent when deployed in a real office building. It's capable of moving quickly without stopping between decisions. When the situation requires it, the agent slows down and navigates through a difficult passage, whereas the competing classical agent, using a 2D map, collides with the high chair. Figure 4 shows a comparison in a situation with a narrow path and clutter, and Figure 5 visualizes the agent's value estimate when it needs to change its navigation strategy.

Figure 4: Navigation in a complex situation with obstacles and a narrow path. Left: our model, end-to-end trained in a simulator augmented with a dynamical model of the agent. Right: the map+plan baseline of the ROS 2 NavStack uses a one-ray Lidar for sensing, which is limited in its perception capabilities and does not see the upper part of the chair, leading to collisions.

Figure 5: A difficult episode, where the robot is blocked by people and searches for a different strategy. We also visualize the agent's value estimate, trained with Reinforcement Learning.

Leveraging large-scale geometric foundation models

The perception module of a navigating robot needs to detect navigable space, obstacles, exits and goals, and to estimate its relative pose with respect to them. All of these subtasks require some form of visual reasoning. The detection of free navigable space, obstacles and exits boils down to taking decisions based on appearance cues, but the detection of goals requires solving a partial matching task, which in essence is a challenging visual correspondence problem. This problem has not been fully explored in the context of navigation, so we've recently developed a method achieving state-of-the-art performance on two different benchmarks, ImageNav and Instance-ImageNav [2], built on our geometric foundation model CroCo [3]. This differs from existing approaches based on rewards [Mezghani et al., 2022][Yadav et al., 2023] or local feature matching [Krantz et al., 2023].
CroCo is a binocular transformer model pre-trained on cross-view completion, which we then finetune with a second pretext task on relative pose and visibility estimation. We show that correspondence solutions naturally emerge from this pre-training strategy without any explicit supervision. These are then exploited for navigation. A crucial property of our method is that goal information is passed from the perception module to the agent, confirming our strategic choice of training agents end-to-end. We've made the source code and trained weights of this model publicly available [Source+weights].
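To make the pretext task concrete, here is a minimal, hypothetical PyTorch sketch of a binocular head that regresses relative pose and predicts visibility from the tokens of two encoded views. The class name, head design, output parameterization and loss weighting are illustrative assumptions made for this post; the released code and weights are the reference implementation.

```python
# Illustrative sketch (not the released CroCo code): a head on top of a shared
# ViT encoder that estimates relative pose and goal visibility from two views.
import torch
import torch.nn as nn

class BinocularPoseHead(nn.Module):
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        # Cross-attention lets goal-image tokens attend to observation tokens,
        # which is where correspondence-like behavior can emerge.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.pose_head = nn.Linear(dim, 4)        # e.g. (dx, dy, dz, yaw) to the goal
        self.visibility_head = nn.Linear(dim, 1)  # logit: is the goal visible at all?

    def forward(self, obs_tokens, goal_tokens):
        # obs_tokens, goal_tokens: (B, N, dim) token embeddings of the two views
        fused, _ = self.cross_attn(goal_tokens, obs_tokens, obs_tokens)
        pooled = fused.mean(dim=1)                # simple token pooling
        return self.pose_head(pooled), self.visibility_head(pooled)

def pretext_loss(pred_pose, pred_vis, gt_pose, gt_vis):
    # Regress relative pose, classify visibility; the equal weighting is arbitrary here.
    pose_loss = nn.functional.smooth_l1_loss(pred_pose, gt_pose)
    vis_loss = nn.functional.binary_cross_entropy_with_logits(pred_vis.squeeze(-1), gt_vis)
    return pose_loss + vis_loss
```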

Figure 6: The perception module of our agent, based on our pre-trained geometric foundation model CroCo, is capable of estimating correspondence solutions between visual observations and image goals, easily obtaining the direction to the goal.

Designing and implementing efficient training losses

Learning good representations of the scene within which the robot navigates is important, but being able to determine how useful a representation actually is can lead to greater efficiency. You can compare it to walking into an apartment, exploring and observing it, then being blindfolded and told to go to a specific place (e.g. “the table in the right corner of this room”). If you're able to get there without bumping into anything, then the mental representation of the scene you've kept in your memory has been very useful. This is the premise behind our new training method for navigation [4], where the objective is to assist an end-to-end trained agent in learning a compact and useful representation of the history of observed images through an auxiliary loss. While known auxiliary losses for robotics are often based on learning some form of reconstruction of the scene, we raise the question of whether these representations should be based on precise reconstruction at all, and whether an excess of precision during learning in simulation may lead to an overzealous agent whose excess capabilities do not transfer well to the real world.
We’ve therefore been learning a latent spatial representation with the goal of avoiding spending training signal on learning to explicitly reconstruct the scene in unnecessary detail. The representation built by the agent through the integration of visual observations over time while navigating is passed to a blind auxiliary agent (the mole), which is trained to use it as its sole source of information to navigate to a batch of intermediate subgoals (see Figure 7). Optimizing over the success of the blind auxiliary agent leads to an actionable representation, and this can be done independently of the downstream task. We’ve shown that this leads to a significant performance gain when the agent is transferred from simulation to a real environment.
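A minimal, simplified sketch of how such an auxiliary objective could be wired up is given below. The class and function names, the behavior-cloning formulation against shortest-path actions, and all dimensions are illustrative assumptions for this post rather than the exact method of [4]; the key point is that the mole's loss back-propagates into the main agent's latent.

```python
# Illustrative sketch: a blind auxiliary policy ("mole") that must reach subgoals
# using only the main agent's compact latent; its loss shapes that latent.
import torch
import torch.nn as nn

class BlindMolePolicy(nn.Module):
    """Acts from the latent alone: no images, only the compact representation
    and a relative subgoal."""
    def __init__(self, latent_dim=512, goal_dim=2, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, latent, subgoal):
        return self.net(torch.cat([latent, subgoal], dim=-1))  # action logits

def mole_auxiliary_loss(latent, subgoals, expert_actions, mole):
    """latent: (B, D) from the main agent's recurrent state (gradients flow back);
    subgoals: (B, K, 2) relative coordinates; expert_actions: (B, K) shortest-path actions."""
    B, K, _ = subgoals.shape
    lat = latent.unsqueeze(1).expand(B, K, latent.shape[-1])
    logits = mole(lat.reshape(B * K, -1), subgoals.reshape(B * K, -1))
    return nn.functional.cross_entropy(logits, expert_actions.reshape(B * K))
```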

Figure 7: Learning with a mole, a blind sidekick to the main agent, improves learning to estimate useful and actionable representations from a sequence of observed images: if the blind sub-agent can successfully navigate to any position in an environment given only the compact representation extracted by the main agent, then this representation is useful.

Summary

Our research focuses on building a foundation model for navigation by carefully balancing two directions: increasing the amount of training data, and improving models, algorithms and inductive biases so that robust and efficient models can be learned from the data already available, with hybrid training in simulation and in the real world. Ongoing work focuses on further improving scaling properties, acquiring and including real data, and learning from web-scale robotics data. This requires cross-embodiment and multi-task agents [5] to integrate heterogeneous sources.

Related publications

1: Learning to navigate efficiently and precisely in real environments, CVPR 2024
2: End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon, ICLR 2024
3: CroCo: Self-Supervised Pretraining for 3D Vision Tasks by Cross-View Completion, NeurIPS 2022
4: Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction, ICLR 2024
5: Task-conditioned adaptation of visual features in multi-task policy learning, CVPR 2024
