Christopher Dance, Jinyoung Choi
2020
If robots are to make last-mile deliveries, assist in the homes of elderly people and serve in offices, shops and warehouses, they must be capable of navigating robustly in dynamic environments. They should react quickly to avoid collisions with objects that move unpredictably (e.g. children and pets); proceed cautiously at blind turns, so as not to startle passers-by; and blend in socially, by moving at an appropriate speed and not going against the flow of pedestrians. Such robots must also avoid getting lost or stuck, even in environments with obstacles (e.g. office furniture or supermarket promotions) that move from one day to the next.
Given the complexity of real-world environments, successful robot navigation systems tend to involve many tuneable parameters. One parameter might determine the maximum speed at which the robot is allowed to drive, while another might govern the trade-off between staying well away from obstacles and following a short path to the goal. Such ‘navigation parameters’ (1) play a central role in the design of navigation systems. The appropriate settings for these parameters vary depending on the robot’s environment and use case. Hospital robots, for example, must be cautious to avoid collisions with delicate equipment, and must move slowly so as not to alarm patients. In contrast, the top priority for warehouse robots may be to reach a given shelf as quickly as possible. Given this wide variety of considerations, setting navigation parameters manually requires time and expertise. Even settings chosen by an expert may be unsatisfactory since, for robots interacting with humans, desirable behaviour usually depends on the preferences of those humans.
In this work, we take a deep reinforcement-learning approach to designing navigation policies, motivated by several other recent works (2, 3, 4, 5) that demonstrate superior performance of the resulting policies when compared with classical navigation algorithms. In addition, we exploit human preferences to adapt our navigation policies to specific use cases. Just one other work (6) has previously considered using human preferences in this way and it addresses only simple deterministic path-planning problems. In contrast, we address the full challenge of designing dynamic robot behaviours that successfully cope with partially unmapped environments and unpredictably moving obstacles.
Nevertheless, there is a considerable literature on preference-based reinforcement learning. This literature is largely motivated by the observation that reinforcement learning often fails when applied to complex tasks, due to issues with the reward functions upon which reinforcement-learning algorithms are based (e.g. reward hacking (7) or misspecification (8)). To overcome such issues, human feedback—in the form of answers to queries based on trajectories—can be used to guide the design of reward functions, and to select policy parameters. But how this feedback is obtained and analysed can have a large impact both on the resulting policies and on the amount of human feedback required. Previous research on preference-based reinforcement learning has considered a wide variety of query types, including queries based on pairwise comparison of trajectories (9, 6), on ranking three or more trajectories (10), on numerical ratings (11), and on segmenting a trajectory into “good” or “bad” portions (12). Pairwise comparisons, which we use here, are arguably easier to elicit and model, whereas the other types of queries are more informative. Regression models have been used extensively to relate human feedback to reward-function parameters (6, 13, 14, 10, 15, 16). Some such work uses simplified regression models and ignores the deleterious impact of misspecification (i.e. poor model fit) on the resulting policies. Other work (11, 17) employs complex reward models to try to overcome misspecification, but such models tend to require many more preferences to fit.
Previous work on preference-based reinforcement learning first estimates a reward function by regression and then infers a policy from this estimated reward function. We take a different approach: we work with policies that take a navigation-parameter vector as input, and we model preferences directly as a function of that vector. This has several advantages. For instance, our approach enables us to set policy parameters that correspond to hard constraints on robot behaviour, such as maximum allowed velocities and accelerations, noting that such constraints are notoriously difficult to impose via the choice of reward function in deep reinforcement learning (18). As in some other work on preference-based reinforcement learning (7, 10, 17), we use deep reinforcement learning to generate policies that are suited to the high dimension and complex dynamics of the task at hand. However, rather than repeatedly solving the reinforcement-learning problem at each iteration, we solve it just once—in advance—to get a single adaptable policy. Thus in our approach, users do not need to wait for a reinforcement-learning algorithm to converge before they can provide their next set of feedback.
We’ve developed a navigation system that can quickly adapt to a given use case with only small amounts of preference data. Our approach combines deep reinforcement learning with human preference feedback to produce navigation policies that are better suited to a given use case and environment. We formulate the design of a robot navigation system as the problem of maximizing human preference with respect to the choice of a navigation-parameter vector for a particular use case.
We begin by using reinforcement learning to generate a family of navigation policies, parameterized by a navigation-parameter vector. In the case of a robot delivering drinks in a café, for example, the navigation-parameter vector might include the robot’s maximum linear speed, linear acceleration, angular speed and angular acceleration, which should be chosen so as to avoid spilling drinks and to reduce the chance of collisions with other agents. Then, to design a navigation policy, we select the navigation-parameter vector that maximizes the probability that human participants will prefer the resulting robot behaviour.
We model a navigation task as a partially observed Markov decision problem (POMDP), parameterized by a waypoint and a navigation-parameter vector. The waypoint specifies the next location that the robot should approach on the way to its goal, and a navigation task is completed when the robot gets close enough to the waypoint. In our model, a navigation-parameter vector has seven components. Four of these components parameterize the dynamics, giving the maximum linear and angular speeds and accelerations with which the robot is allowed to move under the policy. The remaining three components parameterize the rewards (which can be positive or negative), the aim of reinforcement learning being to find a policy that maximizes the cumulative reward. Of these three, two prevent the agent from rushing straight towards other moving agents (by giving the minimum acceptable estimated time to collision with those agents and the penalty incurred if this minimum time is violated), and one gives the penalty incurred if the agent has to make an emergency stop to avoid a collision. The reward also has terms that keep the path length and journey time short, but those terms involve no free parameters.
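As a concrete illustration, the seven-component navigation-parameter vector described above might look like the sketch below. This is our own hedged reading of the text: the field names, units and the exact form of the penalty terms are illustrative assumptions, not the definitions used in the paper.

```python
from dataclasses import dataclass

@dataclass
class NavigationParameters:
    """Seven tuneable components, as described in the text.
    Names and units are illustrative, not taken from the paper."""
    # Dynamics limits (hard constraints enforced by the policy).
    max_linear_speed: float        # m/s
    max_angular_speed: float       # rad/s
    max_linear_accel: float        # m/s^2
    max_angular_accel: float       # rad/s^2
    # Reward-shaping parameters.
    min_time_to_collision: float   # seconds; minimum acceptable estimated time to collision
    collision_time_penalty: float  # penalty when the minimum time to collision is violated
    emergency_stop_penalty: float  # penalty for an emergency stop

def shaped_reward(params: NavigationParameters,
                  estimated_time_to_collision: float,
                  emergency_stop: bool,
                  fixed_terms: float) -> float:
    """Illustrative reward: fixed terms (path length and journey time,
    no free parameters) plus the two parameterized penalties."""
    reward = fixed_terms
    if estimated_time_to_collision < params.min_time_to_collision:
        reward -= params.collision_time_penalty
    if emergency_stop:
        reward -= params.emergency_stop_penalty
    return reward
```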
We call policies for these POMDPs ‘adaptable navigation policies’, since their behaviour can be adapted by varying the navigation-parameter vector that is fed to them as part of their input. To train such adaptable navigation policies, we run soft actor-critic, or SAC (19)—a reinforcement-learning algorithm—in a simulator. We collect the replay buffer (i.e. the history of observations, navigation-parameter vectors, actions and rewards) from several agents operating simultaneously with different navigation-parameter vectors, but in the same environment. The simulator we use is based on the Unity Machine Learning Agents Toolkit, or ML-Agents (20), shown in Figure 1, which provides a wide variety of floorplans and can simulate a set of dynamic agents, each of which has its own policy. Each agent is modelled as a differential two-wheeled robot with sensor uncertainties, and its observations consist of: coarse scan data from a depth sensor; the robot’s linear and angular velocities; its change in position relative to the previous time step; and the angle to the next waypoint. The robot is controlled by specifying its linear and angular velocities for the next time step.
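The sketch below illustrates how such a replay buffer might be filled by several simulated agents that share one adaptable policy but run with different navigation-parameter vectors. The `simulator`, `policy` and `replay_buffer` objects are hypothetical stand-ins, not the actual ML-Agents or SAC interfaces used in this work.

```python
def collect_experience(simulator, policy, replay_buffer, param_vectors, steps):
    """Roll out several agents in the same simulated environment, each with
    its own navigation-parameter vector, storing transitions in one shared
    buffer. All objects here are hypothetical stand-ins for the real code."""
    observations = simulator.reset()  # one observation per agent
    for _ in range(steps):
        # The adaptable policy takes the navigation-parameter vector as
        # part of its input (see Figure 2).
        actions = [policy.act(obs, w)
                   for obs, w in zip(observations, param_vectors)]
        next_observations, rewards, dones = simulator.step(actions)
        for obs, w, a, r, nxt, done in zip(observations, param_vectors,
                                           actions, rewards,
                                           next_observations, dones):
            replay_buffer.add(obs, w, a, r, nxt, done)
        observations = next_observations
    return replay_buffer
```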
One benefit of using an adaptable policy, rather than training a new parameter-specific policy for each new navigation-parameter vector, is that it enables a robot to switch between multiple use cases more effectively. For instance, a robot operating in a café may need to shift between carrying full cups to customers and returning dirty cups to the kitchen. Indeed, our adaptable policy is able to switch while keeping track of the state of the environment (in Figure 2, observe that the navigation-parameter vector is fed to the network after the recurrent units), a feat that would be difficult to achieve simply by switching between multiple stored policies. Another major benefit of using an adaptable policy is that it removes the need to run a time-consuming reinforcement-learning algorithm at every iteration of preference optimization. There is a potential trade-off with the use of adaptable policies, however, as their performance may not be competitive with that of parameter-specific policies. Fortunately, the results of our experiments suggest that this performance difference is small: by plotting the cumulative reward for both adaptable and parameter-specific policies (shown in Figure 3), we found a negligible difference in robot performance across three types of environment (structured, unstructured and mixed).
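As a rough sketch of this architectural choice, the navigation-parameter vector can be concatenated with the output of the recurrent units rather than with the raw observations, so the recurrent state that tracks the environment is undisturbed when the parameters are switched. The framework and layer sizes below are our own illustrative assumptions, not the network of Figure 2.

```python
import torch
import torch.nn as nn

class AdaptablePolicyNetwork(nn.Module):
    """Illustrative actor network: observations pass through a recurrent
    encoder, and the navigation-parameter vector is injected afterwards."""
    def __init__(self, obs_dim, param_dim=7, hidden_dim=128, action_dim=2):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + param_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # linear and angular velocity
        )

    def forward(self, obs_seq, params, hidden=None):
        encoded, hidden = self.encoder(obs_seq, hidden)
        last = encoded[:, -1, :]               # latest recurrent output
        x = torch.cat([last, params], dim=-1)  # inject parameters after the GRU
        return self.head(x), hidden
```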
Given an adaptable navigation policy, we can readily collect a robot trajectory for any given navigation-parameter vector. Our preference-optimization algorithm determines which pairs of navigation-parameter vectors should be compared, so as to make a good final choice of navigation-parameter vector while keeping the total number of preference queries small. At each step of this process, we present the corresponding pairs of trajectories to human participants and ask them which trajectory they find better suited to the given use case.
Our algorithm uses a well-known model for preference probabilities, known variously as the Bradley-Terry model, the softmax model or the logit discrete-choice model (21). In this model, the probability of preferring one trajectory to another is proportional to the exponential of some function of that trajectory’s navigation-parameter vector. This function can be thought of as a utility function. We model this utility function with a Bayesian neural network (22), so that we can keep track of our uncertainty about its value. This enables us to perform active query selection, in which we carefully select queries so as to minimize the number of preference queries that must be put to the human participants. The parameters of this network determine Gaussian distributions over its weights. By averaging over several forward passes through the network with different samples of those weights, we can estimate the mean and standard deviation of the network’s output for any given navigation-parameter vector. We then take this averaged output as our estimate of the mean utility function.
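A minimal sketch of these two ingredients is shown below: the Bradley-Terry preference probability for a pair of utilities, and Monte Carlo estimation of the utility’s mean and standard deviation from a network with sampled weights. The `bayesian_net.sample_forward` interface is a hypothetical stand-in for the Bayesian neural network of reference (22).

```python
import numpy as np

def preference_probability(utility_a: float, utility_b: float) -> float:
    """Bradley-Terry / logit model: probability that trajectory A (utility
    u_A) is preferred to trajectory B (utility u_B):
    exp(u_A) / (exp(u_A) + exp(u_B))."""
    return 1.0 / (1.0 + np.exp(utility_b - utility_a))

def utility_mean_std(bayesian_net, w: np.ndarray, n_samples: int = 50):
    """Estimate the mean and standard deviation of the utility of a
    navigation-parameter vector w by averaging forward passes, each with a
    fresh sample of the network's Gaussian-distributed weights.
    `sample_forward` is a hypothetical interface."""
    samples = np.array([bayesian_net.sample_forward(w) for _ in range(n_samples)])
    return samples.mean(), samples.std()
```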
Our preference-optimization algorithm takes as input a prior over the optimal navigation-parameter vector. For simplicity, we take this prior to be a box \(\mathcal{W}\) in navigation-parameter space, the upper and lower bounds of which are defined by a robot-navigation expert’s a priori guesses at the “best” choice of each navigation parameter. The algorithm maintains a set of candidates for the best navigation-parameter vector and, at each iteration, grows this set by sampling uniformly from \(\mathcal{W}\) and keeping only those samples with the highest few values of an upper confidence bound on the utility function, computed using the Bayesian neural network. We then construct query pairs of the form \((\pmb{w}_1, \pmb{w}_2)\), where \(\pmb{w}_1\) is sampled from the subset of candidates with the highest few values of this upper confidence bound and \(\pmb{w}_2\) is sampled from the subset of candidates with the highest few values of the mean utility function. Given the participants’ preferences for these queries, the algorithm then updates the Bayesian neural network’s parameters, completing a single iteration. Finally, the algorithm returns the navigation-parameter vector, from among the set of candidates, for which the average output of the Bayesian network is highest.
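The loop below sketches one iteration of this procedure under our own assumptions about the interfaces involved: it reuses the `utility_mean_std` helper from the previous sketch, takes a hypothetical `ask_participants` callable that turns query pairs into preference labels, and assumes a hypothetical `update` method for refitting the Bayesian network. It illustrates the query-selection logic, not the authors’ exact implementation.

```python
import numpy as np

def preference_optimization_step(bayesian_net, candidates, box_low, box_high,
                                 ask_participants, rng,
                                 n_new=100, top_k=10, n_queries=10, beta=1.0):
    """One iteration: grow the candidate set, build UCB-versus-mean query
    pairs, collect preferences and refit the utility model.
    Interfaces are hypothetical stand-ins for the actual algorithm."""
    # Grow the pool with uniform samples from the box W.
    new_samples = rng.uniform(box_low, box_high, size=(n_new, len(box_low)))
    pool = np.vstack([candidates, new_samples])

    stats = np.array([utility_mean_std(bayesian_net, w) for w in pool])
    means, stds = stats[:, 0], stats[:, 1]
    ucb = means + beta * stds

    # Keep only candidates with the highest upper confidence bounds.
    by_ucb = pool[np.argsort(ucb)[-top_k:]]
    by_mean = pool[np.argsort(means)[-top_k:]]
    candidates = by_ucb

    # Query pairs: w_1 chosen for high UCB, w_2 chosen for high mean utility.
    queries = [(by_ucb[rng.integers(top_k)], by_mean[rng.integers(top_k)])
               for _ in range(n_queries)]

    answers = ask_participants(queries)    # humans compare the two trajectories
    bayesian_net.update(queries, answers)  # hypothetical: refit on all preferences
    return candidates
```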
To determine how our method performs in practice, we used our preference-optimization algorithm to select navigation parameters for a robot delivering drinks in a real-world café environment. First, we collected responses to 100 preference queries by asking five human participants to each answer 20 of those queries. This process was distributed over 10 iterations of the algorithm, with each participant stating two preferences per iteration. The output of the algorithm was our estimate of the best choice of navigation-parameter vector, \(\pmb{w}_\text{best}\). To evaluate this choice, we collected ten videos of the robot using navigation-parameter vector \(\pmb{w}_\text{best}\), and ten videos using parameters sampled uniformly from the box \(\mathcal{W}\). We then randomly matched the videos into 30 pairs, by pairing each \(\pmb{w}_\text{best}\) video with three videos selected at random from the set of videos with random parameters. Working with the same five participants as above, we asked each to evaluate six of these pairs. The navigation-parameter vector \(\pmb{w}_\text{best}\) was preferred in 28 of these 30 pairs, suggesting that users are significantly more likely to prefer the output of our preference-optimization algorithm to parameters drawn from the expert-specified box.
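As a back-of-the-envelope check (our own, not taken from the paper), an outcome of 28 preferences out of 30 would be very unlikely if both kinds of parameters were equally likely to be preferred:

```python
from math import comb

# Probability of 28 or more preferences out of 30 under a fair coin,
# treating the 30 comparisons as independent.
p_one_sided = sum(comb(30, k) for k in range(28, 31)) / 2**30
print(f"one-sided p-value: {p_one_sided:.1e}")  # about 4e-7
```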
Our results suggest that reinforcement learning based on our preference-optimization algorithm can be used to tune robot navigation systems for particular use cases—e.g. making last-mile food deliveries, collecting used teacups in homes for the elderly, delivering stationery to meeting rooms, replenishing shop shelves and collecting stock from warehouse shelves—to match users’ preferences, with only a few preference queries (which can come from users who are not robotics experts). We have also demonstrated that our preference-optimization algorithm performs well in an “oracle” setting (i.e. where preferences are given by a simple known formula). Additionally, our simulation experiments provide further evidence that users are significantly more likely to judge the results of our method to be better adapted to a given use case than navigation systems tuned according to an expert’s guesses.
Our preference-optimization scheme could be improved in several ways. For example, given a better prior describing how navigation parameters differ from one use case to another, one should be able to learn from fewer human preferences. Furthermore, the preference model might be improved by allowing more nuanced human feedback, such as ranking three trajectories at a time or accounting for the fact that preferences can vary from one participant to another.
In our current work, we’re developing reinforcement-learning systems for tasks that couple uncertain perception with mechanical contact (e.g. object manipulation and legged locomotion). Our aim is to develop robot behaviours that can robustly adapt to a broad range of situations and tasks.
We thank Jung-eun Kim, Kyung-sik Park, Jaehun Han, Joonho Seo, Minsu Kim, the NAVER LABS PDX Team and Tomi Silander for their support and contribution to this work.