Fast and successful robot navigation is made possible by incorporating physics into embodied AI, carefully observing behaviour and performing extensive real-world testing.
Encounters with robots roaming around on wheels may be more commonplace these days, but how they manage to navigate usually requires some very careful preparation. Most often they’ll be equipped with a map, probably move along dedicated pathways, maybe have specific areas marked out for them and, if possible, encounter only furniture that won’t be moved around to get in the way! Achieving ‘out of the box’ navigation that doesn’t require any such preparation, i.e. ‘fully autonomous robot policies that can operate seamlessly in new environments’, is the challenge we’ve set ourselves, and we’ve recently had some pretty good results. We chose AI as our core solution because of its ability to learn complex behaviours from large amounts of data with minimal handcrafted algorithms. Within AI, deep learning has proven to be excellent at solving various robotic tasks, including navigation, but existing methods are often limited to simulated environments, and transferring these models to real robots has been a major problem – up to now.
We recently reached a milestone in solving this problem: our best AI model, while trained end-to-end in simulation only, demonstrates near-perfect navigation in unseen buildings, avoiding collisions while moving at a fast pace (Video 1). Moving AI out of the simulator has given us a number of valuable insights along the way, which we’re eager to share with you here!
Video 1: Our end-to-end trained agent can navigate fast and efficiently to any location in any unseen building.
When we started, we had some serious doubts that an agent could be deployed on real robots without having been trained, at least partially, on real-world data. Compared to simulation, real environments are ‘out of distribution’: lighting, sensor noise and robot motion are all very different, a mismatch known as the “sim2real gap”. A model trained on simulated observations would therefore face very unexpected inputs on the real robot, and we thought it would hate that – but we were wrong!
One year later, our deep policy achieves near-perfect scores on our test episodes in the real world, as evaluated in multiple buildings on two continents (in France and in Korea). In Video 2 below you can see the NAVER robot ‘Around’ navigating in one of our research buildings in France.
Video 2: On the left, the onboard camera view of the Around robot navigating at NAVER LABS Europe. On the right, a bird’s-eye view of the same robot (red) navigating in the virtual environment. The pink doughnut is the navigation goal.
So, how does the model generalize despite the significant sim2real gap?
Our agent has a relatively standard structure, i.e. deep encoders for the different sensor inputs and a recurrent model aggregating observations over time into a latent agent memory, followed by a policy network that makes decisions (see [1] for more details). It’s relatively lightweight, which allows us to run it in real time on a small on-board GPU.
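To make this concrete, here is a minimal PyTorch sketch of this kind of architecture. The sensor set, layer sizes and class names are illustrative assumptions on our part, not the actual model from [1]:

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Illustrative sketch only: encoders per sensor, a recurrent memory,
    and policy/value heads. Sizes and names are assumptions."""
    def __init__(self, scan_dim=360, goal_dim=2, hidden=512, num_actions=28):
        super().__init__()
        # One encoder per sensor modality (here: a range scan and a goal vector).
        self.scan_encoder = nn.Sequential(nn.Linear(scan_dim, 256), nn.ReLU())
        self.goal_encoder = nn.Sequential(nn.Linear(goal_dim, 32), nn.ReLU())
        # Recurrent model aggregating observations over time into a latent memory.
        self.memory = nn.GRU(256 + 32, hidden, batch_first=True)
        # Policy head over the discrete action set; value head used by PPO.
        self.policy = nn.Linear(hidden, num_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, scan, goal, h=None):
        # scan: (batch, scan_dim), goal: (batch, goal_dim)
        x = torch.cat([self.scan_encoder(scan), self.goal_encoder(goal)], dim=-1)
        out, h = self.memory(x.unsqueeze(1), h)  # one time step per call
        out = out.squeeze(1)
        return self.policy(out), self.value(out), h
```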
A large majority of work in navigation research is based on a simple discrete action space where the agent picks one of ‘Move forward’, ‘Turn left/right’ and ‘Stop’. In general, the internal mechanics of the robotic platform are completely ignored because, in simulation, the robot is simply teleported to the requested position. This is strangely specific to the navigation community in AI: other robotic applications, such as manipulation, simulate the mechanical response of the robot precisely.
In contrast, we supercharged the Habitat simulator with a relatively simple (yet expressive) mathematical model of how the robot reacts to a command. We identified the physics parameters on the real robot to create a “dynamical model”. As a consequence, being subject to inertia, friction and motor response just like a real robot, the simulated agent behaves realistically, and the neural agent is trained to cope with this behaviour. We also made the action space more expressive than the standard discrete one: the agent selects a pair of linear and angular velocities from 28 different possible combinations.
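As an illustration, the sketch below shows what such a velocity-pair action space and first-order motor response could look like in code; the exact grid and the physics parameters identified on the real robot are assumptions here:

```python
import numpy as np

# Hypothetical 28-command action space: pairs of (linear, angular) velocity.
# The exact grid used in training is not detailed here, so this 7 x 4
# discretisation is an assumption for illustration.
LIN = np.linspace(0.0, 1.0, 7)   # linear velocities in m/s
ANG = np.linspace(-1.0, 1.0, 4)  # angular velocities in rad/s
ACTIONS = [(v, w) for v in LIN for w in ANG]  # 28 (v, w) pairs

def step_velocity(current, commanded, tau=0.3, dt=0.1):
    """First-order motor response: instead of jumping to the commanded
    velocity (teleport-style), the simulated robot relaxes towards it with
    time constant tau, emulating inertia, friction and motor lag.
    tau and dt are assumed values, not the parameters identified on Around."""
    return current + (dt / tau) * (commanded - current)
```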
Adding simulated dynamics turned out to be a game changer for real-world performance. The impact on navigation performance, speed and style was impressive. Video 3 below shows a comparison (a race) between a standard agent trained with the classical teleportation-like behaviour (red) and our approach trained with a realistic dynamical model (green). This experiment was conducted on a real robot in a real test building and, of course, without teleoperation. For convenience, we show the onboard first-person video and a replay of the performance in a synthetic (but otherwise identical) environment.
Video 3: This video shows a race between 3 different agents: in red (Teleport D4) and in blue (Teleport D28), two classic agents that were not trained with a realistic motion model, and in green (Ours), the NAVER LABS Europe agent.
The lesson from this experience is that physics clearly matters. However, successfully training a capable neural model is but a first step. Understanding why it works and what it has actually learned to do allows us to be confident in its capacity to evolve and to generalize. We dove deeply into this topic and dedicated a full suite of experiments to testing the reasoning modes of our end-to-end trained models, which we present in our CVPR 2025 paper [2].
In a nutshell, adding realistic dynamics to training in simulation encourages the model to learn a prediction-correction scheme, i.e. to predict (anticipate) its future position. Through reinforcement learning (RL) on its interactions with the environment alone, the agent discovered a model of the robot dynamics, which it leverages to predict its next position from the last action it chose. It then uses its sensor inputs to correct this predicted estimate, gaining accuracy.
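The agent learns this scheme implicitly, inside its recurrent memory, but an explicit hand-written analogue helps to see the idea; the motion model and correction gain below are our own simplified assumptions:

```python
import math

def predict_pose(pose, velocity, dt=0.1):
    # Prediction step: dead-reckon the next pose from the velocity command
    # the agent just chose (a simple unicycle model, assumed here).
    x, y, theta = pose
    v, w = velocity
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)

def correct_pose(predicted, sensed, gain=0.2):
    # Correction step: blend the prediction with a sensor-derived estimate.
    # The gain (how much the sensors are trusted) is an arbitrary choice.
    return tuple(p + gain * (s - p) for p, s in zip(predicted, sensed))
```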
Our general goal is to work towards a generalist agent capable of controlling any robot or embodiment equally well, whatever its hardware or software characteristics. The research field is working towards this with Vision Language Action models (VLAs), fine-tuned from Large Language Models (LLMs) and trained at scale on massive datasets of robot demonstrations, typically with imitation learning. For the moment this strategy has led to models that can deal with multiple embodiments, albeit slowly, clumsily and with somewhat limited precision. Many videos of robot behaviour are sped up for convenience, as the agents take decisions at painfully slow speed.
In our work we show that a finely engineered adaptation to the specifics of the robot can lead to impressive gains in performance: although not strictly necessary for a high success rate, it also increases robot speed.
Let’s look at an example: for safety reasons, the NAVER Around robot is equipped with a software bumper algorithm that monitors the surroundings and slows down or stops the robot if an obstacle is detected. When this feature isn’t correctly simulated during training, the agent can still navigate successfully, but it requires multiple attempts in complex areas because it doesn’t realise that the separate software bumper mechanism is overriding its actions. Around had to repeatedly correct its decisions, as shown in Video 4 below.
Video 4: This video shows how a minor safety procedure, such as a software bumper, can have an impact on robot navigation performance.
The robot seems to struggle to cross narrow passages and sometimes gives up and backtracks. This pattern is visible when we overlay its performance score on the map in Figure 1, which colour-codes the quality of the agent’s decisions compared to an algorithm with access to perfect information. Doors and narrow passages are highlighted in green, indicating regions where the agent frequently backtracks. We traced this behaviour back to the software bumper being triggered by the doors, which slows the robot down.
We decided to integrate this software bumper safety procedure into the simulator, thus simulating the robot in a more realistic way. This resulted in much smoother behaviour, with fewer hesitations, as the agent had now learned to correctly anticipate its future state, including the corrective behaviour of the software bumper.
Video 5: In this video we see how much smoother the robot navigation is when it anticipates future states.
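To make the idea concrete, a simulated software bumper can be as simple as a velocity override applied between the policy and the motors, as in the sketch below; the thresholds and exact logic of the real Around safety layer are assumptions:

```python
def software_bumper(cmd_v, cmd_w, scan, stop_dist=0.3, slow_dist=0.8):
    """Hypothetical software bumper: overrides the policy's velocity command
    when an obstacle is close. The thresholds and exact behaviour of the
    real Around bumper are assumptions for illustration."""
    nearest = min(scan)  # distance (m) to the closest obstacle around the robot
    if nearest < stop_dist:
        return 0.0, cmd_w                      # stop forward motion entirely
    if nearest < slow_dist:
        scale = (nearest - stop_dist) / (slow_dist - stop_dist)
        return cmd_v * scale, cmd_w            # slow down near obstacles
    return cmd_v, cmd_w                        # otherwise pass the command through
```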
The agent is trained with RL through interactions with the simulator and, while we saw that this leads to excellent navigation behaviour, we also wanted to find out whether it learned to create any long-term plans internally. As part of the RL methodology (we use the “PPO” variant), the agent learns a so-called “value function”, which at each time step provides the agent’s estimate of the amount of reward it expects to receive in the future. As such, we think it provides some indication of what the agent plans to achieve. In Video 6 below, we provide an interpretation of a single navigation episode collected during a demonstration in a crowded scenario in one of our buildings. The episode provides evidence that navigation strategies, in the form of choices of paths, are taken, tested and rejected, to be replaced by better options.
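For readers less familiar with RL, the value function is trained to predict the discounted sum of future rewards, as in this minimal sketch (the discount factor is a typical choice, not necessarily the one we use):

```python
def discounted_return(rewards, gamma=0.99):
    """The quantity the value function learns to predict at each step:
    V(s_t) ~ E[sum_k gamma**k * r_(t+k)]. gamma is the usual RL discount
    factor; 0.99 is a common choice, assumed here for illustration."""
    g = 0.0
    for r in reversed(rewards):  # accumulate from the end of the episode
        g = r + gamma * g
    return g
```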
In detail, in this episode the agent started at position ① with the goal at position ⑦ and two possible paths. The agent chooses the slightly longer path passing through the door left of position ②, which is blocked by people. It decides to take an alternative route south towards ③ and the value estimate drops. At ④ the agent tries to circumvent the situation by going north, but can’t find a path through the glass panels. The value estimate is now negative. At ⑤, finally, the agent decides to abandon this strategy and seems to re-plan. The value estimate immediately spikes, as the agent seems to anticipate a long series of positive rewards. At ⑥ we’re back at the blocked position with a similar value estimate, and the agent now decides to try the other door. At ⑦ we reach the goal, and the value estimate converges to 2.5, equal to the final reward for a successful episode.
In summary, the agent’s choices have a visible effect on the value estimate. Abandoning a navigation option for a more promising one increases the value estimate, as the agent now expects a higher cumulative future reward. This gives some evidence that the agent has an idea of where it stands in a plan structured at the level of paths, and that its estimate of success goes beyond the effect of the next action.
Video 6: This video shows the agent episode described above.
Our agent is trained end-to-end with RL, which makes it possible to train a neural model using only a reward definition, essentially informing it whether a given task has been executed well or not, but not how the agent should execute it. The agent chooses its own way of reasoning and, in particular, which sensor inputs to use and which to ignore.
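As an illustration of what “only a reward definition” means, a goal-directed navigation reward might be shaped roughly as below. Apart from the final success reward of 2.5 mentioned above, the terms and weights are guesses for illustration, not our actual training reward (see [1]):

```python
def navigation_reward(reached_goal, collided, progress):
    """Hypothetical reward shaping for goal-directed navigation. Only the
    success value of 2.5 is taken from the episode described above; the
    other terms and weights are illustrative assumptions."""
    r = 2.5 if reached_goal else 0.0  # one-off bonus for reaching the goal
    if collided:
        r -= 0.1                      # discourage collisions
    r += progress                     # reward progress (in metres) towards the goal
    r -= 0.01                         # small per-step penalty to encourage speed
    return r
```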
The Around robot is equipped with multiple sensors, two of which are capable of detecting obstacles and scene structure: a forward-facing RGB camera, and four depth cameras facing in four different directions, from which we extract a LiDAR-like representation we call a “Scan”. The Scan essentially provides the distances to the nearest obstacles in all directions and is significantly easier to use than the camera, whose 2D image is subject to geometric perspective and would need to be reconstructed for easy use. Analysis showed that the agent chose to base its reasoning on the Scan inputs rather than the camera inputs. To put it more bluntly, the agent decided to completely ignore the RGB camera inputs, which it judged to be redundant with the information from the other available sensors.
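To illustrate the kind of representation involved, the sketch below collapses a single depth image into a scan vector; the field of view, number of bins and the fusion of the four cameras are simplifying assumptions:

```python
import numpy as np

def depth_to_scan(depth, hfov_deg=90.0, n_bins=64):
    """Collapse a depth image (H x W array of distances in metres) into a 1D
    'scan' of nearest-obstacle distances per angular bin. A simplified sketch:
    the actual pipeline on the robot combines four depth cameras and is not
    reproduced here."""
    h, w = depth.shape
    # Each image column corresponds to a bearing across the horizontal FOV.
    angles = np.linspace(-hfov_deg / 2, hfov_deg / 2, w)
    edges = np.linspace(-hfov_deg / 2, hfov_deg / 2, n_bins + 1)
    bins = np.clip(np.digitize(angles, edges) - 1, 0, n_bins - 1)
    col_min = depth.min(axis=0)   # nearest return in each column
    scan = np.full(n_bins, np.inf)
    for c in range(w):
        scan[bins[c]] = min(scan[bins[c]], col_min[c])
    return scan
```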
While ignoring the camera may seem to be a disadvantage (and in some sense it is), it can actually explain why our robot performs so well in real-world scenarios. By disregarding images, the robot bypasses a major source of disturbance due to the sim2real gap, which seems to be much more prominent for RGB inputs than for scans. To demonstrate this, we propose the following test: in Figure 2 below, try to identify which of the pairs of Scan or camera observations are from the real world and which are from simulation. The task is easy with images, but almost impossible with the Scan input.
If, on the contrary, we force the agent to use the camera (by providing appropriate training signals), we still obtain excellent navigation performance in simulation, but performance drops to zero when the agent is transferred to the real Around robot. In fact, the images are so far out of distribution that, after a few steps, the agent falls into a stable attractor, which means that it repeats the same behaviour indefinitely. This is shown in Video 7 below.
Video 7: This video shows the performance using (forced) RGB in simulation and how performance drops in the real world.
This is a good illustration of another lesson learned: better performance in simulation doesn’t necessarily translate to success in the real world with a real robot.
Moving AI out of the simulator was a challenging journey. Along the way, we learned a lot and gained a deeper understanding of what truly matters in robotic navigation, which goes beyond simple engineering considerations.
We look forward to sharing future results on the next challenges in navigation: vision-based long-term planning, allowing robots to use common-sense reasoning to avoid paths with likely blocking points; “having a thousand eyes”, i.e. exploiting observations collected by a continuously operating fleet of many robots; and generalizing our realistic and fast motion capabilities to arbitrary robot embodiments out of the box, as we focus on the challenges of vision and multi-embodiment!
[1] Learning to navigate efficiently and precisely in real environments, Guillaume Bono et al., CVPR 2024.
[2] Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach, Steeven Janny et al., CVPR 2025.
[3] End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon, Guillaume Bono et al., ICLR 2024.
For navigation to targets provided visually, i.e. as a target image, we combine our agent with the geometric foundation model described in [3].