Fast and successful robot navigation is made possible by incorporating physics into embodied AI, carefully observing behaviour and performing extensive real-world testing.
Encounters with robots roaming around on wheels may be more commonplace these days, but how they manage to navigate usually requires some very careful preparation. Most often they’ll be equipped with a map, probably move along dedicated pathways, maybe have specific areas marked out for them and, if possible, encounter only furniture that won’t be moved around to get in the way! Achieving ‘out of the box’ navigation that doesn’t require any such preparation, i.e. ‘fully autonomous robot policies that can operate seamlessly in new environments’, is the challenge we’ve set ourselves, and we’ve recently had some pretty good results. We chose AI as our core solution because of its ability to learn complex behaviours from large amounts of data with minimal handcrafted algorithms. Within AI, deep learning has proven to be excellent at solving various robotic tasks, including navigation, but existing methods are often limited to simulated environments, and transferring these models to real robots has been a major problem – up to now.
We recently reached a milestone in solving this problem: our best AI model, while trained end-to-end in simulation only, demonstrates near-perfect navigation in unseen buildings, avoiding collisions while moving at a fast pace (Video 1). Moving AI out of the simulator has given us a number of valuable insights along the way, which we’re eager to share with you here!
Video 1: Our end-to-end trained agent can navigate fast and efficiently to any location in any unseen building.
When we started, we had some serious doubts that an agent could be deployed on real robots without having been trained, at least partially, on real-world data. Compared to simulation, real environments are ‘out of distribution’: lighting, sensor noise and robot motion are all very different, a mismatch known as the “sim2real gap”. A model trained on simulated observations would therefore face very unexpected inputs on the real robot, and we thought it would hate that – but we were wrong!
One year later, our deep policy achieves near-perfect scores on our test episodes in the real world, as evaluated in multiple buildings on two continents (in France and in Korea). In Video 2 below you can see the NAVER robot ‘Around’ navigating in one of our research buildings in France.
Video 2: On the left, the onboard camera view of the Around robot navigating at NAVER LABS Europe. On the right, a bird’s-eye view of the same robot (red) navigating in the virtual environment. The pink doughnut is the navigation goal.
So, how does the model generalize despite the significant sim2real gap?
Our agent has a relatively standard structure, i.e. deep encoders for the different sensor inputs and a recurrent model aggregating observations over time into a latent agent memory, followed by a policy network that makes decisions (see [1] for more details). It’s relatively lightweight, which allows us to run it in real time on a small on-board GPU.
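To make this concrete, here is a minimal PyTorch sketch of this kind of architecture. The sensor set, layer sizes and class names are illustrative assumptions on our part, not the actual model from [1]:

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Illustrative sketch only: encoders per sensor, a recurrent memory,
    and policy/value heads. Sizes and names are assumptions."""
    def __init__(self, scan_dim=360, goal_dim=2, hidden=512, num_actions=28):
        super().__init__()
        # One encoder per sensor modality (here: a range scan and a goal vector).
        self.scan_encoder = nn.Sequential(nn.Linear(scan_dim, 256), nn.ReLU())
        self.goal_encoder = nn.Sequential(nn.Linear(goal_dim, 32), nn.ReLU())
        # Recurrent model aggregating observations over time into a latent memory.
        self.memory = nn.GRU(256 + 32, hidden, batch_first=True)
        # Policy head over the discrete action set; value head used by PPO.
        self.policy = nn.Linear(hidden, num_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, scan, goal, h=None):
        # scan: (batch, scan_dim), goal: (batch, goal_dim)
        x = torch.cat([self.scan_encoder(scan), self.goal_encoder(goal)], dim=-1)
        out, h = self.memory(x.unsqueeze(1), h)  # one time step per call
        out = out.squeeze(1)
        return self.policy(out), self.value(out), h
```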
A large majority of work in navigation research is based on a simple discrete action space where the agent picks one of ‘Move forward’, ‘Turn left/right’ and ‘Stop’. In general, the internal mechanics of the robotic platform are completely ignored because, in simulation, the robot is simply teleported to the requested position. This is strangely specific to the navigation community in AI: other robotic applications, such as manipulation, simulate the mechanical response of the robot precisely.
In contrast, we supercharged the Habitat simulator with a relatively simple (yet expressive) mathematical model of how the robot reacts to a command. We identified the physics parameters on the real robot to create a “dynamical model”. As a consequence, being subject to inertia, friction and motor response just like a real robot, the simulated agent behaves realistically, and the neural agent is trained to cope with this behaviour. We also made the action space more expressive than the standard discrete one: the agent selects a pair of linear and angular velocities from 28 different possible combinations.
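As an illustration, the sketch below shows what such a velocity-pair action space and first-order motor response could look like in code; the exact grid and the physics parameters identified on the real robot are assumptions here:

```python
import numpy as np

# Hypothetical 28-command action space: pairs of (linear, angular) velocity.
# The exact grid used in training is not detailed here, so this 7 x 4
# discretisation is an assumption for illustration.
LIN = np.linspace(0.0, 1.0, 7)   # linear velocities in m/s
ANG = np.linspace(-1.0, 1.0, 4)  # angular velocities in rad/s
ACTIONS = [(v, w) for v in LIN for w in ANG]  # 28 (v, w) pairs

def step_velocity(current, commanded, tau=0.3, dt=0.1):
    """First-order motor response: instead of jumping to the commanded
    velocity (teleport-style), the simulated robot relaxes towards it with
    time constant tau, emulating inertia, friction and motor lag.
    tau and dt are assumed values, not the parameters identified on Around."""
    return current + (dt / tau) * (commanded - current)
```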
Adding simulated dynamics turned out to be a game changer for real-world performance. The impact on navigation performance, speed and style was impressive. Video 3 below shows a comparison (a race) between a standard agent trained with the classical teleportation-like behaviour (red) and our approach trained with a realistic dynamical model (green). This experiment was conducted on a real robot in a real test building and, of course, without teleoperation. For convenience, we show the onboard first-person video and a replay of the performance in a synthetic (but otherwise identical) environment.
Video 3: This video shows a race between 3 different agents: in red (Teleport D4) and in blue (Teleport D28), two classic agents that were not trained with a realistic motion model, and in green (Ours), the NAVER LABS Europe agent.
The lesson from this experience is that physics clearly matters. However, successfully training a capable neural model is but a first step. Understanding why it works and what it has actually learned to do allows us to be confident in its capacity to evolve and to generalize. We dove deeply into this topic and dedicated a full suite of experiments to testing the reasoning modes of our end-to-end trained models, which we present in our CVPR 2025 paper [2].
In a nutshell, adding realistic dynamics to training in simulation encourages the model to learn a prediction-correction scheme, i.e. to predict (anticipate) its future position. Through reinforcement learning (RL) on its interactions with the environment alone, the agent discovered a model of the robot dynamics, which it leverages to predict its next position from the last action it chose. It then uses its sensor inputs to correct this predicted estimate, gaining accuracy.
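The agent learns this scheme implicitly, inside its recurrent memory, but an explicit hand-written analogue helps to see the idea; the motion model and correction gain below are our own simplified assumptions:

```python
import math

def predict_pose(pose, velocity, dt=0.1):
    # Prediction step: dead-reckon the next pose from the velocity command
    # the agent just chose (a simple unicycle model, assumed here).
    x, y, theta = pose
    v, w = velocity
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)

def correct_pose(predicted, sensed, gain=0.2):
    # Correction step: blend the prediction with a sensor-derived estimate.
    # The gain (how much the sensors are trusted) is an arbitrary choice.
    return tuple(p + gain * (s - p) for p, s in zip(predicted, sensed))
```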
Our general goal is to work towards a generalist agent capable of controlling any robot or embodiment equally well, whatever its hardware or software characteristics. The research field is working towards this with Vision Language Action models (VLAs), fine-tuned from Large Language Models (LLMs) and trained at scale on massive datasets of robot demonstrations, typically with imitation learning. For the moment this strategy has led to models that can deal with multiple embodiments, albeit slowly, clumsily and with somewhat limited precision. Many videos of robot behaviour are sped up for convenience, as the agents take decisions at painfully slow speed.
In our work we show that a finely engineered adaptation to the specifics of the robot can lead to impressive gains in performance: although not strictly necessary for a high success rate, it also increases robot speed.
Let’s look at an example: for safety reasons, the NAVER Around robot is equipped with a software bumper algorithm that monitors the surroundings and slows down or stops the robot if an obstacle is detected. When this feature isn’t correctly simulated during training, the agent can still navigate successfully, but it requires multiple attempts in complex areas because it doesn’t realise that the separate software bumper mechanism is overriding its actions. Around had to repeatedly correct its decisions, as shown in Video 4 below.
Video 4: This video shows how a minor safety procedure, such as a software bumper, can have an impact on robot navigation performance.
The robot seems to struggle to cross narrow passages and sometimes gives up and backtracks. This pattern is visible when we overlay its performance score on the map in Figure 1, which colour-codes the quality of the agent’s decisions compared to an algorithm with access to perfect information. Doors and narrow passages are highlighted in green, indicating regions where the agent frequently backtracks. We traced this behaviour back to the software bumper being triggered by the doors, which slows the robot down.
We decided to integrate this software bumper safety procedure into the simulator, thus simulating the robot in a more realistic way. This resulted in much smoother behaviour, with fewer hesitations, as the agent had now learned to correctly anticipate its future state, including the corrective behaviour of the software bumper.
Video 5: In this video we see how much smoother the robot navigation is when it anticipates future states.
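To make the idea concrete, a simulated software bumper can be as simple as a velocity override applied between the policy and the motors, as in the sketch below; the thresholds and exact logic of the real Around safety layer are assumptions:

```python
def software_bumper(cmd_v, cmd_w, scan, stop_dist=0.3, slow_dist=0.8):
    """Hypothetical software bumper: overrides the policy's velocity command
    when an obstacle is close. The thresholds and exact behaviour of the
    real Around bumper are assumptions for illustration."""
    nearest = min(scan)  # distance (m) to the closest obstacle around the robot
    if nearest < stop_dist:
        return 0.0, cmd_w                      # stop forward motion entirely
    if nearest < slow_dist:
        scale = (nearest - stop_dist) / (slow_dist - stop_dist)
        return cmd_v * scale, cmd_w            # slow down near obstacles
    return cmd_v, cmd_w                        # otherwise pass the command through
```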
The agent is trained with RL through interactions with the simulator and, while we saw that this leads to excellent navigation behaviour, we also wanted to find out whether it learned to create any long-term plans internally. As part of the RL methodology (we use the “PPO” variant), the agent learns a so-called “value function”, which at each time step provides the agent’s estimate of the amount of reward it expects to receive in the future. As such, we think it provides some indication of what the agent plans to achieve. In Video 6 below, we provide an interpretation of a single navigation episode collected during a demonstration in a crowded scenario in one of our buildings. The episode provides evidence that navigation strategies, in the form of choices of paths, are taken, tested and rejected, to be replaced by better options.
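For readers less familiar with RL, the value function is trained to predict the discounted sum of future rewards, as in this minimal sketch (the discount factor is a typical choice, not necessarily the one we use):

```python
def discounted_return(rewards, gamma=0.99):
    """The quantity the value function learns to predict at each step:
    V(s_t) ~ E[sum_k gamma**k * r_(t+k)]. gamma is the usual RL discount
    factor; 0.99 is a common choice, assumed here for illustration."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g
```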
In detail, in this episode the agent started at position ① with the goal at position ⑦ and two possible paths. The agent chooses the slightly longer path passing through the door left of position ②, which is blocked by people. It decides to take an alternative route south towards ③ and the value estimate drops. At ④ the agent tries to circumvent the situation by going north, but can’t find a path through the glass panels. The value estimate is now negative. At ⑤, finally, the agent decides to abandon this strategy and seems to re-plan. The value estimate immediately spikes, as the agent seems to anticipate a long series of positive rewards. At ⑥ we’re back at the blocked position with a similar value estimate, and the agent now decides to try the other door. At ⑦ we reach the goal, and the value estimate converges to 2.5, equal to the final reward for a successful episode.
In summary, the agent’s choices have a visible effect on the value estimate. Abandoning a navigation option for a more promising one increases the value estimate, as the agent now expects a higher cumulative future reward. This gives some evidence that the agent has an idea of where it stands in a plan structured at the level of paths, and that its estimate of success goes beyond the effect of the next action.
Video 6: This video shows the agent episode described above.
Our agent is trained end-to-end with RL, which makes it possible to train a neural model using only a reward definition, essentially informing it whether a given task has been executed well or not, but not how the agent should execute it. The agent chooses its own way of reasoning and, in particular, which sensor inputs to use and which to ignore.
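As an illustration of what “only a reward definition” means, a goal-directed navigation reward might be shaped roughly as below. Apart from the final success reward of 2.5 mentioned above, the terms and weights are guesses for illustration, not our actual training reward (see [1]):

```python
def navigation_reward(reached_goal, collided, progress):
    """Hypothetical reward shaping for goal-directed navigation. Only the
    success value of 2.5 is taken from the episode described above; the
    other terms and weights are illustrative assumptions."""
    r = 2.5 if reached_goal else 0.0  # one-off bonus for reaching the goal
    if collided:
        r -= 0.1                      # discourage collisions
    r += progress                     # reward progress (in metres) towards the goal
    r -= 0.01                         # small per-step penalty to encourage speed
    return r
```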
The Around robot is equipped with multiple sensors, two of which are capable of detecting obstacles and scene structure: a forward-facing RGB camera, and four depth cameras facing in four different directions, from which we extract a LiDAR-like representation we call a “Scan”. The Scan essentially provides the distances to the nearest obstacles in all directions and is significantly easier to use than the camera, whose 2D image is subject to geometric perspective and would need to be reconstructed for easy use. Analysis showed that the agent chose to base its reasoning on the Scan inputs rather than the camera inputs. To put it more bluntly, the agent decided to completely ignore the RGB camera inputs, which it judged to be redundant with the information from the other available sensors.
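To illustrate the kind of representation involved, the sketch below collapses a single depth image into a scan vector; the field of view, number of bins and the fusion of the four cameras are simplifying assumptions:

```python
import numpy as np

def depth_to_scan(depth, hfov_deg=90.0, n_bins=64):
    """Collapse a depth image (H x W array of distances in metres) into a 1D
    'scan' of nearest-obstacle distances per angular bin. A simplified sketch:
    the actual pipeline on the robot combines four depth cameras and is not
    reproduced here."""
    h, w = depth.shape
    # Each image column corresponds to a bearing across the horizontal FOV.
    angles = np.linspace(-hfov_deg / 2, hfov_deg / 2, w)
    edges = np.linspace(-hfov_deg / 2, hfov_deg / 2, n_bins + 1)
    bins = np.clip(np.digitize(angles, edges) - 1, 0, n_bins - 1)
    col_min = depth.min(axis=0)   # nearest return in each column
    scan = np.full(n_bins, np.inf)
    for c in range(w):
        scan[bins[c]] = min(scan[bins[c]], col_min[c])
    return scan
```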
While ignoring the camera may seem to be a disadvantage (and in some sense it is), it can actually explain why our robot performs so well in real-world scenarios. By disregarding images, the robot bypasses a major source of disturbance due to the sim2real gap, which seems to be much more prominent for RGB inputs than for scans. To demonstrate this, we propose the following test: in Figure 2 below, try to identify which of the pairs of Scan or camera observations are from the real world and which are from simulation. The task is easy with images, but almost impossible with the Scan input.
If, on the contrary, we force the agent to use the camera (by providing appropriate training signals), we still obtain excellent navigation performance in simulation, but performance drops to zero when the agent is transferred to the real Around robot. In fact, the images are so far out of distribution that, after a few steps, the agent falls into a stable attractor, which means that it repeats the same behaviour indefinitely. This is shown in Video 7 below.
Video 7: This video shows the performance using (forced) RGB in simulation and how performance drops in the real world.
This is a good illustration of another lesson learned: better performance in simulation doesn’t necessarily translate to success in the real world with a real robot.
Moving AI out of the simulator was a challenging journey. Along the way, we learned a lot and gained a deeper understanding of what truly matters in robotic navigation, which goes beyond simple engineering considerations.
We look forward to sharing future results on the next challenges in navigation: vision-based long-term planning, allowing robots to use common-sense reasoning to avoid paths with likely blocking points; “having a thousand eyes”, i.e. exploiting observations collected by a continuously operating fleet of many robots; and generalizing our realistic and fast motion capabilities to arbitrary robot embodiments out of the box, as we focus on the challenges of vision and multi-embodiment!
[1] Learning to navigate efficiently and precisely in real environments, Guillaume Bono et al., CVPR 2024.
[2] Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach, Steeven Janny et al., CVPR 2025.
[3] End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon, Guillaume Bono et al., ICLR 2024.
For navigation to targets provided visually, i.e. as a target image, we combine our agent with the geometric foundation model described in [3].