Foundation Models for Robot Navigation

Header image: DUSt3R 3D reconstruction of NAVER HQ.

Autonomous robots can now provide services previously reserved for humans, such as waiting tables in restaurants, guiding visitors, patrolling, surveillance and last-mile package delivery. To perform these tasks, robots need to be able to navigate in a huge diversity of environments and places, which are constantly being changed according to needs, preferences and even moods.
Robotics has traditionally addressed navigation with mapping and planning based on SLAM (simultaneous localization and mapping). Our research is based on a different Ansatz: learning a foundation model for navigation with end-to-end agents trained on large amounts of heterogeneous data. We want a robot to work out of the box in a new environment without any special preparation: no prior scans and no existing external localization system.

Towards a Foundation Model for Robot Navigation

Learning from large-scale data allows a robot agent to pick up more complex regularities, to process more subtle cues and therefore to be more robust. However, unlike large language foundation models (LLMs), which are trained on large amounts of readily available text, robotic agents need to be trained on diverse 3D scenes or navigation episodes, both of which are relatively hard to come by. To overcome this bottleneck, training in simulation has been the main focus in the field of AI (to be) embodied in autonomous agents and robots. It also makes large-scale training possible without manually supervising real robot hardware and without the risk of damaging equipment in the early stages of training.

Yet, although simple tasks like point goal navigation (navigation to coordinates given beforehand) are often considered “solved” in simulation, deploying these agents on real physical robots often paints quite a different picture. This so-called sim2real gap describes the difference between simulated environments and real physical environments. These differences have multiple aspects, including visual differences (shown in Figure 1 below), but also differences in agent dynamics, environment dynamics and agent-environment interactions (collision models and haptics). All of these differences hurt the transfer of trained agents to real robots.
To learn agents that perform robustly in real environments, we deploy our trained models on real robots and evaluate them in challenging setups on a daily basis. We work on improving navigation robustness and speed in real environments by
–    reliably simulating realistic agent behavior during training
–    leveraging large-scale geometric foundation models
–    designing and implementing efficient training losses.

Figure 1: The visual sim2real gap: images captured by onboard cameras (Real) differ from observations rendered in a simulator (Simulation), even for photo-realistic renderings. Differences are mostly due to the use of imperfect scans and reconstructions.

Reliably simulating realistic agent behavior during training

Previous attempts to address the sim2real gap have focused only on the visual gap (as in Figure 1) and, although progress has been made by training on large amounts of data and performing domain randomization during training, it has come at the cost of navigation speed. Typical agents are trained to take small steps and/or to stop for observations, which sidesteps the problem of simulating realistic agent dynamics during training.
In our work [1], we target training agents capable of navigating swiftly in real environments. To reliably model agent dynamics, we minimize the sim2real gap not only in sensing but also in actuation. Our agent predicts velocity commands while in motion, reaching a navigation speed of 1 meter per second. The resulting delay between sensing and actuation requires the agent to anticipate motion, which it learns in simulation from a realistic motion model. We create a second-order dynamical model from real robot rollouts and expert actions, which approximates the physical properties of the real robot.
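As a rough illustration of what such a model can look like, the sketch below fits a discrete second-order response of the realised velocity to the commanded velocity from logged rollouts. This is a simplified, hypothetical formulation written for this post; the function names, the single-velocity-channel assumption and the least-squares fit are our illustrative choices, not the exact model of [1].

```python
# Minimal sketch: identify a discrete second-order velocity response
#   v[t] ≈ a1 * v[t-1] + a2 * v[t-2] + b * u[t-1]
# from logged rollouts, where v is the realised velocity and u the commanded one.
# Simplified to a single velocity channel; a real model would also cover angular motion.
import numpy as np

def fit_second_order_model(v, u):
    """v: realised velocities (T,), u: commanded velocities (T,), both from real rollouts."""
    # Build the least-squares problem X @ theta ≈ y
    X = np.stack([v[1:-1], v[:-2], u[1:-1]], axis=1)  # rows: [v[t-1], v[t-2], u[t-1]]
    y = v[2:]                                         # targets: v[t]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta                                      # (a1, a2, b)

def rollout(theta, u, v0=0.0, v1=0.0):
    """Replay a sequence of commands through the fitted model."""
    a1, a2, b = theta
    v = [v0, v1]
    for t in range(2, len(u)):
        v.append(a1 * v[-1] + a2 * v[-2] + b * u[t - 1])
    return np.array(v)
```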
The video in Figure 2 shows a comparison between the real robot behavior, recorded from an onboard camera, and the simulated behavior when the same recorded velocity actions are replayed. The simulated agent behaves like the real robot, matching acceleration, braking and turning behavior. This similarity during training allows the agent to anticipate its future behavior given its own predicted actions and behave accordingly.

Figure 2: Video: comparison of real robot dynamics with simulated dynamics
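Continuing the hypothetical sketch above, the fitted model can be checked the same way as in the video: replay the recorded commands and compare the simulated response to the real one. Here `real_velocities` and `commanded_velocities` are placeholder arrays standing in for logged data, not names from our code base.

```python
# Identify the model from one recorded rollout, replay the same commands,
# and measure how closely the simulated response tracks the real one.
theta = fit_second_order_model(real_velocities, commanded_velocities)
sim_velocities = rollout(theta, commanded_velocities,
                         v0=real_velocities[0], v1=real_velocities[1])
print("mean absolute velocity error:",
      np.abs(sim_velocities - real_velocities).mean())
```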

Figure 3 shows the behavior of our agent when deployed in a real office building. It's capable of moving quickly without stopping between decisions. When the situation requires it, the agent slows down and navigates through a difficult passage, whereas the competing classical agent, using a 2D map, collides with the high chair. Figure 4 shows a comparison in a situation with a narrow path and clutter, and Figure 5 visualizes the agent's value estimate when it needs to change its navigation strategy.

Figure 4: Navigation in a complex situation with obstacles and a narrow path. Left: our model, end-to-end trained in a simulator augmented with a dynamical model of the agent. Right: the map+plan baseline of the ROS 2 NavStack uses a one-ray Lidar for sensing, which is limited in its perception capabilities and does not see the upper part of the chair, leading to collisions.

Figure 5: A difficult episode, where the robot is blocked by people and searches for a different strategy. We also visualize the agent's value estimate, trained with Reinforcement Learning.

Leveraging large-scale geometric foundation models

The perception module of a navigating robot needs to detect navigable space, obstacles, exits and goals, and to estimate its relative pose with respect to them. All of these subtasks require some form of visual reasoning. The detection of free navigable space, obstacles and exits boils down to taking decisions based on appearance cues, but the detection of goals requires solving a partial matching task, which in essence is a challenging visual correspondence problem. This problem has not been fully explored in the context of navigation, so we've recently developed a method achieving state-of-the-art performance on two different benchmarks, ImageNav and Instance-ImageNav [2], built on our geometric foundation model CroCo [3]. This differs from existing approaches based on rewards [Mezghani et al., 2022][Yadav et al., 2023] or local feature matching [Krantz et al., 2023].
CroCo is a binocular transformer model pre-trained on cross-view completion, which we then finetune with a second pretext task on relative pose and visibility estimation. We show that correspondence solutions naturally emerge from this pre-training strategy without any explicit supervision. These are then exploited for navigation. A crucial property of our method is that goal information is passed from the perception module to the agent, confirming our strategic choice of training agents end-to-end. We've made the source code and trained weights of this model publicly available [Source+weights].
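To make the pretext task concrete, here is a minimal, hypothetical PyTorch sketch of a binocular head that regresses relative pose and predicts visibility from the tokens of two encoded views. The class name, head design, output parameterization and loss weighting are illustrative assumptions made for this post; the released code and weights are the reference implementation.

```python
# Illustrative sketch (not the released CroCo code): a head on top of a shared
# ViT encoder that estimates relative pose and goal visibility from two views.
import torch
import torch.nn as nn

class BinocularPoseHead(nn.Module):
    def __init__(self, dim=768, n_heads=12):
        super().__init__()
        # Cross-attention lets goal-image tokens attend to observation tokens,
        # which is where correspondence-like behavior can emerge.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.pose_head = nn.Linear(dim, 4)        # e.g. (dx, dy, dz, yaw) to the goal
        self.visibility_head = nn.Linear(dim, 1)  # logit: is the goal visible at all?

    def forward(self, obs_tokens, goal_tokens):
        # obs_tokens, goal_tokens: (B, N, dim) token embeddings of the two views
        fused, _ = self.cross_attn(goal_tokens, obs_tokens, obs_tokens)
        pooled = fused.mean(dim=1)                # simple token pooling
        return self.pose_head(pooled), self.visibility_head(pooled)

def pretext_loss(pred_pose, pred_vis, gt_pose, gt_vis):
    # Regress relative pose, classify visibility; the equal weighting is arbitrary here.
    pose_loss = nn.functional.smooth_l1_loss(pred_pose, gt_pose)
    vis_loss = nn.functional.binary_cross_entropy_with_logits(pred_vis.squeeze(-1), gt_vis)
    return pose_loss + vis_loss
```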

Figure 6: The perception module of our agent, based on our pre-trained geometric foundation model CroCo, is capable of estimating correspondence solutions between visual observations and image goals, easily obtaining the direction to the goal.

Designing and implementing efficient training losses

Learning good representations of the scene within which the robot navigates is important, but being able to determine how useful a representation actually is can lead to greater efficiency. You can compare it to walking into an apartment, exploring and observing it, then being blindfolded and told to go to a specific place (e.g. “the table in the right corner of this room”). If you're able to get there without bumping into anything, then the mental representation of the scene you've kept in your memory has been very useful. This is the premise behind our new training method for navigation [4], where the objective is to assist an end-to-end trained agent in learning a compact and useful representation of the history of observed images through an auxiliary loss. While known auxiliary losses for robotics are often based on learning some form of reconstruction of the scene, we raise the question of whether these representations should be based on precise reconstruction at all, and whether an excess of precision during learning in simulation may lead to an overzealous agent whose excess capabilities do not transfer well to the real world.
We’ve therefore been learning a latent spatial representation with the goal of avoiding spending training signal on learning to explicitly reconstruct the scene in unnecessary detail. The representation built by the agent through the integration of visual observations over time while navigating is passed to a blind auxiliary agent (the mole), which is trained to use it as its sole source of information to navigate to a batch of intermediate subgoals (see Figure 7). Optimizing over the success of the blind auxiliary agent leads to an actionable representation, and this can be done independently of the downstream task. We’ve shown that this leads to a significant performance gain when the agent is transferred from simulation to a real environment.
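A minimal, simplified sketch of how such an auxiliary objective could be wired up is given below. The class and function names, the behavior-cloning formulation against shortest-path actions, and all dimensions are illustrative assumptions for this post rather than the exact method of [4]; the key point is that the mole's loss back-propagates into the main agent's latent.

```python
# Illustrative sketch: a blind auxiliary policy ("mole") that must reach subgoals
# using only the main agent's compact latent; its loss shapes that latent.
import torch
import torch.nn as nn

class BlindMolePolicy(nn.Module):
    """Acts from the latent alone: no images, only the compact representation
    and a relative subgoal."""
    def __init__(self, latent_dim=512, goal_dim=2, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + goal_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, latent, subgoal):
        return self.net(torch.cat([latent, subgoal], dim=-1))  # action logits

def mole_auxiliary_loss(latent, subgoals, expert_actions, mole):
    """latent: (B, D) from the main agent's recurrent state (gradients flow back);
    subgoals: (B, K, 2) relative coordinates; expert_actions: (B, K) shortest-path actions."""
    B, K, _ = subgoals.shape
    lat = latent.unsqueeze(1).expand(B, K, latent.shape[-1])
    logits = mole(lat.reshape(B * K, -1), subgoals.reshape(B * K, -1))
    return nn.functional.cross_entropy(logits, expert_actions.reshape(B * K))
```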

Figure 7: Learning with a mole, a blind sidekick to the main agent, improves learning to estimate useful and actionable representations from a sequence of observed images: if the blind sub-agent can successfully navigate to any position in an environment given only the compact representation extracted by the main agent, then this representation is useful.

Summary

Our research focuses on building a foundation model for navigation by carefully balancing two directions: increasing the amount of training data, and improving models, algorithms and inductive biases so that robust and efficient models can be learned from the data already available, with hybrid training in simulation and in the real world. Ongoing work focuses on further improving scaling properties, acquiring and including real data, and learning from web-scale robotics data. This requires cross-embodiment and multi-task agents [5] to integrate heterogeneous sources.

Related publications

1: Learning to navigate efficiently and precisely in real environments, CVPR 2024
2: End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon, ICLR 2024
3: CroCo: Self-Supervised Pretraining for 3D Vision Tasks by Cross-View Completion, NeurIPS 2022
4: Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction, ICLR 2024
5: Task-conditioned adaptation of visual features in multi-task policy learning, CVPR 2024
