Fast Adaptation of Reinforcement-Learned Robot Navigation Skills to Human Preferences - Naver Labs Europe

#### Our navigation system enables robots to adapt to specific real-world environments and use cases with only small amounts of human preference data.

If robots are to make last-mile deliveries, assist in the homes of elderly people and serve in offices, shops and warehouses, they must be capable of navigating robustly in dynamic environments. They should react quickly to avoid collisions with objects that move unpredictably (e.g. children and pets); proceed cautiously at blind turns, so as not to startle passers-by; and blend in socially, by moving at an appropriate speed and not going against the flow of pedestrians. Such robots must also avoid getting lost or stuck, even in environments with obstacles (e.g. office furniture or supermarket promotions) that move from one day to the next.

Given the complexity of real-world environments, successful robot navigation systems tend to involve many tuneable parameters. One parameter might determine the maximum speed at which the robot is allowed to drive, while another might calculate the trade-off between staying well away from obstacles and following a short path to the goal. Such ‘navigation parameters’ (1) play a central role in the design of navigation systems. The appropriate settings for these parameters vary depending on the robot’s environment and the use case. Hospital robots, for example, must be cautious to avoid collisions with delicate equipment, and move slowly so as not to alarm patients. In contrast, the top priority for warehouse robots may be to get to a given shelf as soon as possible. Due to the wide variety of considerations, setting navigation parameters manually requires time and expertise. Even settings decided upon by an expert may be unsatisfactory since, for robots interacting with humans, desirable behaviour usually depends on the preferences of those humans.

Demonstration showing an agent operating in a simulated environment, and a robot putting policies learned in this simulation into practice—and operating with a policy whose parameters resulted from our preference-optimization method—in a real-world café environment.

In this work, we take a deep reinforcement-learning approach to designing navigation policies, motivated by several other recent works (2, 3, 4, 5) that demonstrate superior performance of the resulting policies when compared with classical navigation algorithms. In addition, we exploit human preferences to adapt our navigation policies to specific use cases. Just one other work (6) has previously considered using human preferences in this way and it addresses only simple deterministic path-planning problems. In contrast, we address the full challenge of designing dynamic robot behaviours that successfully cope with partially unmapped environments and unpredictably moving obstacles.

Nevertheless, there is a considerable literature on preference-based reinforcement learning. This literature is largely motivated by the observation that reinforcement learning often fails when applied to complex tasks, due to issues with the reward systems upon which reinforcement-learning models are based (e.g. reward hacking (7) or misspecification (8)). To overcome such issues, human feedback—in the form of answers to queries based on trajectories—can be used to guide the design of reward functions, and to select policy parameters. But how this feedback is obtained and analysed can have a large impact both on the resulting policies and the amount of human feedback required. Previous research on preference-based reinforcement learning has considered a wide variety of query types, including queries based on pairwise comparison of trajectories (9, 6), on ranking three or more trajectories (10), on numerical ratings (11), and on segmenting a trajectory into “good” or “bad” portions (12). Pairwise comparisons, which we use here, are arguably easier to elicit and model, whereas the other types of queries are more informative. Regression models have been used extensively to relate human feedback to reward-function parameters (6, 13, 14, 10, 15, 16). Some such work uses simplified regression models and ignores the deleterious impact of misspecification (i.e. poor model fit) on the resulting policies. Other work (11, 17), employs complex reward models to try to overcome misspecification, but such models tend to require many more preferences to fit.

Previous work on preference-based reinforcement learning first estimates a reward function by regression and then infers a policy from this estimated reward function. We take a different approach: we work with policies that take a navigation-parameter vector as input, and we model preferences directly as a function of that vector. This has several advantages. For instance, our approach enables us to set policy parameters that correspond to hard constraints on robot behaviour, such as maximum allowed velocities and accelerations, noting that such constraints are notoriously difficult to impose via the choice of reward function in deep reinforcement learning (18). As in some other work on preference-based reinforcement learning (7, 10, 17), we use deep reinforcement learning to generate policies that are suited to the high dimension and complex dynamics of the task at hand. However, rather than repeatedly solving the reinforcement-learning problem at each iteration, we solve it just once—in advance—to get a single adaptable policy. Thus in our approach, users do not need to wait for a reinforcement-learning algorithm to converge before they can provide their next set of feedback.

### Developing our robot navigation system

We’ve developed a navigation system that can quickly adapt to a given use case with only small amounts of preference data. Our approach delivers policies informed by human preference and based on reinforcement learning for human-robot interaction that are more suited to any given use case and environment. We formulate the design of a robot navigation system as a problem of maximizing human preference with respect to the choice of a navigation-parameter vector for a particular use case.

We begin by using reinforcement learning to generate a family of navigation policies, which is parameterized by a navigation-parameter vector. In the case of a robot delivering drinks in a café environment, for example, the navigation parameter might include the robot’s maximum linear speed, linear acceleration, angular speed and angular acceleration, which should be selected so as to avoid spilling drinks and to reduce the possibility of collisions with other agents. Then, to design a navigation policy, we select a navigation-parameter vector that maximizes the probability that human participants will prefer it.

We model a navigation task as a partially-observed Markov decision problem (POMDP), parameterized by a waypoint and a navigation-parameter vector. The waypoint specifies the next location that the robot should approach on the way to its goal, and a navigation task is completed when the robot gets close enough to the waypoint. In our model, a navigation-parameter vector has seven components. Four of these components parameterize the dynamics, giving the maximum linear and angular speeds and accelerations with which the robot will be allowed to move under the policy. The remaining components parameterize the rewards (which can be positive or negative), the aim of reinforcement learning being to find a policy that maximizes the cumulative reward. Of these three remaining navigation-parameter-vector components, two prevent the agent from rushing straight towards other moving agents (by giving the minimum acceptable estimated time to collision with those agents and the penalty incurred if this minimum time is violated), and one gives the penalty incurred if the agent has to make an emergency stop to avoid a collision. The reward also has terms which keep the path-length and journey time short, but those terms do not involve free parameters.

We call policies for these POMDPs ‘adaptable navigation policies’, since their behaviour can be adapted by varying the navigation-parameter that is fed to them as part of their input. To train such adaptable navigation policies, we run soft actor critic or SAC (19)—a reinforcement-learning algorithm—on a simulator. We collect the replay buffer (i.e. the history of observations, navigation-parameter vectors, actions and rewards) from several agents simultaneously operating with different navigation-parameter vectors, but in the same environment. The simulator we use is based on the Unity Machine Learning Agents Toolkit, or ML-Agents (20), shown in Figure 1, which provides a wide variety of floorplans and can simulate a set of dynamic agents which each have their own policy. Each agent is modelled as a differential two-wheeled robot with sensor uncertainties, and its observations consist of: coarse scan data from a depth sensor; the robot’s linear and angular velocities; its change in position relative to the previous time step; and the angle to the next waypoint. The robot is controlled by giving its linear and angular velocities for the next time step.

Figure 2: Architecture of the adaptable policy and critic. The policy maps the navigation-parameter vector and the robot’s observations—coarse scan data from a depth sensor (blue); the robot’s linear and angular velocities (orange); the change in position relative to the previous time step (green); and the angle to the next waypoint (yellow)—to a probability distribution over the robot’s actions (which are scaled by the appropriate components of the navigation-parameter vector to give its linear and angular velocities). The action branch (purple) is only present in the critic. Merged branches indicated concatenation. FC: Fully connected layer. GRU: Layer of gated recurrent units. The leaky ReLU (rectified-linear-unit) activation function 𝑥 ↦ max{𝑥, $$\frac{𝑥}{100}$$} is used between layers.

One benefit of using an adaptable policy, rather than training a new parameter-specific policy for each new navigation-parameter vector, is that it enables a robot to switch between multiple use cases more effectively. For instance, a robot operating in a café may need to shift between carrying full cups to customers and returning dirty cups to the kitchen. Indeed, our adaptable policy is able to switch while keeping track of the state of the environment (in Figure 2, observe that the navigation-parameter vector is fed to the network after the recurrent units), a feat which would be difficult to achieve simply by switching between multiple stored policies. Another major benefit of using an adaptable policy is that it negates the need to run a time-consuming reinforcement-learning algorithm at every iteration of preference optimization. There is a potential trade-off with the use of adaptable policies, however, as their performance may not be competitive with that of parameter-specific policies. Fortunately, the results of our experiments suggest that this performance difference is small: by plotting the cumulative reward for both adaptable and parameter-specific policies (shown in Figure 3), we found a negligible difference in robot performance across three types of environment (structured, unstructured, and mixed).

Figure 3: Left: The performance of agents operating with an adaptable policy versus parameter-specific policies in terms of cumulative reward (i.e. the sum of all rewards received). Each point corresponds to one setting of the navigation-parameter vector across three different environments: structured (red); unstructured (green); and mixed (blue). Right: Floorplans of each environment.

### Building a preference-optimization algorithm

Given an adaptable navigation policy, we can readily collect a robot trajectory for any given navigation-parameter vector. Our preference-optimization algorithm determines which pairs of navigation-parameter vectors should be compared, so as to make a good final choice of navigation-parameter vector while keeping the total number of preference queries small. At each step of this process, we present the pairs of such trajectories to human participants and ask them which trajectory they find better suited to the given use case.

Our algorithm uses a well-known model for preference probabilities, which is known as the Bradley-Terry model, softmax model or logit discrete choice model (21). In this model, the probability of preferring one trajectory to another is proportional to the exponent of some function of that trajectory’s navigation-parameter vector. This function can be thought of as a utility functionWe model this utility function with a Bayesian neural network (22), so that we can keep track of our uncertainty about the value of the utility function. This enables us to perform active query selection, wherein we carefully select queries so as to minimize the number of preference queries that must be made to the human participants. The parameters of this network determine Gaussian distributions for its weights. By averaging over several forward passes through this network with different samples for those weights, we can estimate the mean and standard deviation of the output of the Bayesian network for any given a navigation-parameter vector. We then take this averaged output as our estimate of the mean utility function.

Our preference-optimization algorithm takes a prior over the distribution of the optimal navigation-parameter vector as input. For simplicity, we take this prior to be a box ($$\mathcal{W}$$) in navigation-parameter space, the upper and lower bounds of which are defined by the a priori guesses of a robot-navigation expert on the “best” choice of each navigation parameter. The algorithm maintains a set of candidates for the best navigation-parameter vector and, at each iteration, we grow this set of candidates by sampling uniformly from  and keeping only those samples with the highest few values for an upper confidence bound on the utility function, computed using the Bayesian neural network. We then construct query pairs of the form , where  is sampled from the subset of candidates with the highest few values of this upper confidence bound and  is sampled from the subset of candidates with the highest few values of the mean utility function. Given the participants’ preferences for these queries, the algorithm then updates the Bayesian neural network’s parameters, completing a single iteration. Finally, the algorithm returns the navigation-parameter vector, in among the set of candidates, for which the average output of the Bayesian network is highest.

Figure 4: Bayesian neural network used to model preference probabilities. BayesianFC denotes a Bayesian fully-connected layer, whose weights 𝜽 are sampled from a diagonal Gaussian distribution with learned mean and variance. Merged branches denote concatenation. The functions $$f(𝒘_𝟏)$$ and $$f(𝒘_𝟐)$$ are random samples of the utility function for navigation-parameter vector $$𝒘_𝒌$$, with both ($$𝒘_𝟏$$) and ($$𝒘_𝟐$$) being calculated using the same randomly-sampled weights 𝜽. The output is the probability of preferring navigation-parameter vector $$𝒘_𝟏$$ to navigation-parameter vector $$𝒘_𝟐$$, which is given by the logistic distribution as $$\frac{1}{1 + \exp(f(𝒘_2) – f(𝒘_1))}$$.

### Our navigation technology in practice

To determine how our method performs in practice, we used our preference-optimization algorithm to select navigation parameters for a robot delivering drinks in a real-world café environment. First, we collected responses to 100 preference queries by asking five human participants to each answer 20 of those queries. This process was distributed over 10 iterations of the algorithm, with each participant stating two preferences per iteration.  The output of the algorithm was our estimate of the best choice of navigation-parameter vector ( $$\pmb{w}_\text{best}$$ ). To evaluate the performance of this best parameter, we collected ten videos of the robot using navigation-parameter vector $$\pmb{w}_\text{best}$$, and ten videos using parameters sampled uniformly from the box $$\mathcal{W}$$. We then randomly matched the videos in pairs, to make 30 pairs (by selecting three random videos from the set of videos with random parameters to be paired with each of the videos using $$\pmb{w}_\text{best}$$ ). Working with the same five participants as above, we asked each to evaluate six of these pairs. The navigation parameter $$\pmb{w}_\text{best}$$ was preferred in 28 out of these 30 pairs, suggesting that users are significantly more likely to prefer the results of our preference-optimization algorithm.

### Summary

Our results suggest that reinforcement learning based on our preference-optimization algorithm can be used to tune robot navigation systems for particular use cases—e.g. making last-mile food deliveries, collecting used teacups in homes for the elderly, delivering stationery to meeting rooms, replenishing shop shelves and collecting stock from warehouse shelves—to match users’ preferences, with only a few preference queries (which can come from users that are not robotics experts). We have also demonstrated that our preference-optimization algorithm performs well in an “oracle” setting (i.e. where preferences are given by a simple known formula). Additionally, our simulation experiments provide further evidence that users are significantly more likely to judge the results of our method to be better adapted to a given use case than navigation systems that have been tuned according to an expert’s guesses.

Our preference-optimization scheme could be improved in several ways. For example, given a better prior describing how navigation parameters differ from one use case to another, one should be able to learn from fewer human preferences. Furthermore, the preference model might be improved by allowing more nuanced human feedback, such as ranking three trajectories at a time or accounting for the fact that preferences can vary from one participant to another.

In our current work, we’re developing reinforcement-learning systems for tasks that couple uncertain perception with mechanical contact (e.g. manipulating objects and legged locomotion). Our aim is to develop robots behaviours that can robustly adapt to a broad range of situations and tasks.

We thank Jung-eun Kim, Kyung-sik Park, Jaehun Han, Joonho Seo, Minsu Kim, the NAVER LABS PDX Team and Tomi Silander for their support and contribution to this work.

Our navigation technology enables a café delivery robot to move naturally through a real-world environment and maintain a comfortable distance between itself and customers.

1. Fast Adaptation of Deep Reinforcement Learning-Based Navigation. Jinyoung Choi, Christopher Dance, Jung-eun Kim, Kyung-sik Park, Jaehun Han, Joonho Seo and Minsu Kim. IEEE International Conference on Robotics and Automation (ICRA), 2020.
2. Learning Navigation Behaviors End-to-End with AutoRL. Hao-Tien Lewis Chiang, Aleksandra Faust, Marek Fiser and Anthony Francis. IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2007–2014, 2019.
3. PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning. Aleksandra Faust, Oscar Ramirez, Marek Fiser, Ken Oslund, Anthony Francis, James Davidson and Lydia Tapia. IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120, 2018.
4. Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei and Ali Farhadi. IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364, 2017.
5. Building Generalizable Agents with a Realistic and Rich 3D Environment. Yi Wu, Yuxin Wu, Georgia Gkioxari and Yuandong Tian. arXiv:1801.02209, 2018.
6. Improving User Specifications for Robot Behavior through Active Preference Learning: Framework and Evaluation. Nils Wilde, Alexandru Blidaru, Stephen L. Smith and Dana Kulić. The International Journal of Robotics Research, 39(6), pp. 651–667, 2020.
7. Reward learning from human preferences and demonstrations in Atari. Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg and Dario Amodei. Advances in Neural Information Processing Systems (NeurIPS), pp. 8011–8023, 2018.
8. A survey of preference-based reinforcement learning methods. Christian Wirth, Riad Akrour, Gerhard Neumann and Johannes Fürnkranz. Journal of Machine Learning Research, 18(1), 4945–4990, 2017.
9. Preference-Based Policy Learning. Riad Akrour, Marc Schoenauer and Michèle Sebag. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Athens, Greece, 5–9 September 2011.
10. Learning Reward Functions by Integrating Human Demonstrations and Preferences. Malayandi Palan, Nicholas C. Landolfi, Gleb Shevchuk and Dorsa Sadigh. arXiv:1906.08928, 2019.
11. Active Reward Learning. Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer and Jan Peters. Robotics: Science and Systems, 2014.
12. Active Reward Learning from Critiques. Yuchen Cui and Scott Niekum. IEEE International Conference on Robotics and Automation (ICRA), pp. 6907–6914, 2018.
13. Active Preference-Based Learning of Reward Functions. Dorsa Sadigh, Anca D. Dragan, Shankar Sastry and Sanjit A. Seshia. Robotics: Science and Systems, 2017.
14. Learning from Richer Human Guidance: Augmenting Comparison-Based Learning with Feature Queries. Chandrayee Basu, Mukesh Singhal and Anca D. Dragan. Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI ’18). New York: ACM Press, pp. 132–140, 2018.
15. APRIL: Active Preference-learning based Reinforcement Learning. Riad Akrour, Marc Schoenauer and Michèle Sebag. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD). Berlin: Springer, pp. 116–131, 2012.
16. Dueling Posterior Sampling for Preference-Based Reinforcement Learning. Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue and Joel W. Burdick. arXiv:1908.01289, 2019.
17. Deep reinforcement learning from human preferences. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg and Dario Amodei. Advances in Neural Information Processing Systems (NIPS), pp. 4299–4307, 2017.
18. Constrained Policy Optimization. Joshua Achiam, David Held, Aviv Tamar and Pieter Abbeel. Proceedings of the 34th International Conference on Machine Learning (ICML), vol. 70, pp. 22–31, 2017.
19. Soft Actor-Critic Algorithms and Applications. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel and Sergey Levine. arXiv:1812.05905 (v2), 2019.
20. Unity: A General Platform for Intelligent Agents. Arthur Juliani, Vincent-Pierre Berges, Ervin Teng, Andrew Cohen, Jonathan Harper, Chris Elion, Chris Goy, Yuan Gao, Hunter Henry, Marwan Mattar and Danny Lange. arXiv:1809.02627, 2018.
21. Discrete Choice Methods with Simulation. Kenneth Train. Cambridge University Press, 2009.
22. Weight Uncertainty in Neural Networks. Charles Blundell, Julien Cornebise, Koray Kavukcuoglu and Daan Wierstra. arXiv:1505.05424, 2015.