1st AI for Robotics Workshop
NAVER LABS Europe, 28th-29th November 2019
Workshop at NAVER LABS Europe: “How AI can help solve the challenges of real-world robotics applications.”
12 keynote speakers, a poster session and demonstrations. Videos of all keynotes are below.
Horst Bischof – “Understanding long-term complex activities”
Graz University of Technology
Abstract: Understanding complex human activities is a requirement for efficient human-robot interaction as well as for several other tasks in a production environment. This talk will highlight challenges that arise when analyzing complex human activities (e.g., assembly tasks) with a computer vision system. We will demonstrate our recent work in this area and describe some of the major research challenges, including training these systems with minimal supervision and representing complex activities.
Christian Wolf – “Integrating Learning and Projective Geometry for Robotics”
INSA & CNRS
Abstract: In this talk we address perception and navigation problems in robotics settings, in particular mobile terrestrial robots and intelligent vehicles. We focus on learning structured representations that allow high-level reasoning about the presence of objects and actors in a scene and support planning and control decisions. We compare two methods, both of which structure their state as a metric map in a bird’s eye view, updated through affine transforms given the ego-motion.
The first method combines Bayesian filtering and Deep Learning to fuse LIDAR input and monocular RGB input, resulting in a semantic occupancy grid centered on a vehicle. A deep network is trained to segment the RGB input and to fuse it with Bayesian occupancy grids. The second method automatically learns robot navigation in 3D environments from interactions and reward using Deep Reinforcement Learning. Like the first method, it keeps a metric map of the environment in a bird’s eye view, which is dynamically updated. Unlike the first method, however, the semantic meaning of the map’s content is neither determined beforehand nor learned from supervision. Instead, projective geometry is used as an inductive bias in the deep neural network. The content of the metric map is learned from interactions and reward, allowing the agent to discover regularities and object affordances from the task itself.
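The abstract does not detail the Bayesian fusion step. For orientation only, the textbook log-odds update for occupancy grids, one standard way to combine per-cell occupancy evidence from several sensors under a conditional-independence assumption, can be sketched as below. The function names, the 2x2 grid and the sensor probabilities are hypothetical, and this sketch is not the speaker's method, which additionally learns the RGB branch with a deep network.

```python
import numpy as np

EPS = 1e-6  # keeps probabilities away from 0 and 1 so log-odds stay finite


def prob_to_logodds(p):
    """Convert occupancy probabilities to log-odds."""
    p = np.clip(p, EPS, 1.0 - EPS)
    return np.log(p / (1.0 - p))


def logodds_to_prob(l):
    """Convert log-odds back to occupancy probabilities."""
    return 1.0 / (1.0 + np.exp(-l))


def fuse_occupancy(prior, observations):
    """Fuse per-cell occupancy estimates into a prior grid (hypothetical helper).

    prior: (H, W) array of prior occupancy probabilities.
    observations: list of (H, W) arrays, one inverse-sensor-model output per sensor.
    Assumes the sensors are conditionally independent given each cell's state.
    """
    l0 = prob_to_logodds(prior)
    l = l0.copy()
    for obs in observations:
        # Standard log-odds update: add each sensor's evidence, minus the prior.
        l += prob_to_logodds(obs) - l0
    return logodds_to_prob(l)


# Hypothetical example: an uninformative prior fused with two sensors.
prior = np.full((2, 2), 0.5)
lidar = np.array([[0.9, 0.5], [0.2, 0.5]])  # assumed LIDAR inverse sensor model output
rgb = np.array([[0.8, 0.5], [0.3, 0.7]])    # assumed RGB segmentation network output
fused = fuse_occupancy(prior, [lidar, rgb])
print(np.round(fused, 3))
```

Where both sensors agree that a cell is occupied, the fused estimate is more confident than either sensor alone (0.9 and 0.8 combine to about 0.97), while a cell on which both sensors are uninformative stays at 0.5.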
We also present a new benchmark and a suite of tasks requiring complex reasoning and exploration in continuous, partially observable 3D environments. The objective is to provide challenging scenarios and a robust baseline agent architecture that can be trained on mid-range consumer hardware in under 24 hours. Solving our tasks requires substantially more complex reasoning capabilities than the standard benchmarks available for this kind of environment.
Alexandre Alahi – “Socially-aware AI for Last-mile Mobility”
Abstract: Artificial Intelligence (AI) is poised to reshape the future of mobility with autonomous moving platforms tackling the “last mile” problem. Integrating these AI-driven systems into our society remains a grand challenge: they must not only perform transportation tasks, but also do so in close proximity to humans in the open world. Current AI implementations are not yet safe and do not earn the public’s trust, as unacceptable fatal accidents involving self-driving cars have shown. Delivery and social robots face similar issues, either freezing in crowded scenes or recklessly forcing humans to move away. To address these challenges, AI must go beyond classification tasks and develop broader cognition: it must learn and obey unwritten common-sense rules and comply with social conventions in order to gain human trust. Robots should respect personal space, yield right-of-way, and ultimately “read” the behavior of others to navigate crowded spaces effectively. I will present a new type of cognition, which I call socially-aware AI, to address these challenges.
Sangbae Kim – “Robots with Physical Intelligence”
Abstract: While industrial robots are effective at repetitive, precise kinematic tasks in factories, their design and control are not suited to the physically interactive tasks that humans perform easily. These tasks require ‘physical intelligence’ through complex dynamic interactions with the environment, whereas conventional robots are designed primarily for position control. To develop a robot with ‘physical intelligence’, we first need a new type of machine that allows dynamic interactions. This talk will discuss how this new design paradigm enables dynamic interactive tasks. As embodiments of this paradigm, the latest version of the MIT Cheetah robots and force-feedback teleoperation arms will be presented. These robots are equipped with proprioceptive actuators, a new design paradigm for dynamic robots. This new class of actuators will play a crucial role in developing ‘physical intelligence’ and in future robot applications such as elderly care, home service, delivery, and services in environments unfavorable to humans.
Cordelia Schmid – “Learning to combine primitive skills: A versatile approach to robotic manipulation”
INRIA & Google
Abstract: Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. Traditional task and motion planning (TAMP) methods can solve complex tasks but require full state observability and are not adapted to dynamic scene changes. Recent learning methods can operate directly on visual inputs but typically require many demonstrations and/or task-specific reward engineering. In this work we aim to overcome these limitations and propose a reinforcement learning (RL) approach to task planning that learns to combine primitive skills. First, unlike previous learning methods, our approach requires neither intermediate rewards nor complete task demonstrations during training. Second, we demonstrate the versatility of our vision-based task planning in challenging settings with temporary occlusions and dynamic scene changes. Third, we propose an efficient way to train basic skills from only a few synthetic demonstrations by exploring recent CNN architectures and data augmentation. Notably, although all of our policies are learned from visual inputs in simulated environments, we demonstrate successful transfer, with high success rates, when applying these policies to manipulation tasks on a real UR5 robotic arm. This is joint work with R. Strudel, A. Pashevich, I. Kalevatykh, I. Laptev, J. Sivic.
Sangok Seok – “New Connections Between People, Spaces and Information: Robotics, Autonomous driving, AI and 5G”
Abstract: The goal of this talk is to introduce a variety of the latest achievements of NAVER LABS, NAVER’s R&D corporation, as well as the direction in which NAVER LABS is heading. NAVER, which has specialized in online platforms, is pursuing core technologies in the fields of robotics, autonomous driving and artificial intelligence in order to bring its services naturally to offline platforms (physical spaces) in our daily lives. NAVER LABS is thus focusing on future technologies that bring information and services closer to people. In particular, precise spatial data (machine-readable 3D/HD maps) is imperative for robots and autonomous driving machines to interact with people and provide services. NAVER LABS has developed a novel in-house solution to create high-precision 3D maps of various environments, including indoor spaces, outdoor spaces and even roads, through technologies such as mapping robots, mobile mapping systems (MMS) and aerial photogrammetry. These technologies generate essential data that allow robots and vehicles to drive autonomously, safely and precisely. They are also used as the core data for visual localization (VL) technology, which can accurately determine a location from a single photograph in places where GPS signals are not available, such as indoor environments and areas dense with skyscrapers. Based on these high-precision spatial data and localization technologies, NAVER LABS is building an expandable mobile platform that incorporates cloud computing, computer vision-based deep learning, advanced driver-assistance systems (ADAS) and human-robot interaction (HRI). Through continuous innovation, NAVER LABS has demonstrated 5G brainless robot technologies, 3D augmented reality head-up displays (AR HUDs) and indoor AR navigation systems. NAVER LABS is constantly striving to offer the technologies of the future to the world.
New connections between people, spaces and information: this is the vision of NAVER LABS, and the future of our lives.
Radu Horaud – “Audio-visual machine perception for socially interacting robots”
Abstract: In this talk I will give an overview of the research carried out by the Perception team at Inria Grenoble over the past five years. I will start by stating the scientific challenges of socially interactive robots, as opposed to the interactive devices widely available commercially today, which are essentially based on speech technologies. I will discuss the difficulties of multiple users interacting with a robot, as opposed to a single user. I will emphasize the complementary roles of visual and audio perception and address in detail the problems associated with fusing these two modalities in unrestricted settings.
Torsten Sattler – “Visual Localization: To Learn or Not to Learn?”
Chalmers University of Technology
Abstract: Visual localization is the problem of estimating the 6 degree-of-freedom camera pose from which a given image was taken with respect to the scene. Localization is an important subsystem in Computer Vision / AI applications such as autonomous robots (self-driving cars, drones, etc.) and Augmented / Mixed / Virtual Reality. Consequently, visual localization is currently receiving significant interest from both academia and industry. Deep learning has had a significant impact on many fields in Computer Vision, and significant gains have been made by replacing traditional hand-crafted pipelines with end-to-end trained neural networks. Accordingly, recent work on visual localization has proposed replacing (parts of) classical localization approaches with learned alternatives. In this talk, we examine how successful these works are. The talk consists of three parts: (i) We show that replacing the full localization pipeline with a convolutional neural network does not lead to accurate camera pose estimates. In fact, current approaches do not consistently outperform simple image retrieval pipelines that approximate the pose of a query image by the poses of the visually most similar training images. (ii) State-of-the-art learned approaches only replace the 2D-3D matching stage of classical feature-based localization approaches with a neural network. We show that, contrary to what has been reported in the literature, these approaches are not necessarily more accurate than classical feature-based approaches on easier datasets, while performing considerably worse on larger and more complex datasets. (iii) We show that replacing classical hand-crafted features with learned alternatives that describe higher-level structures can significantly boost localization performance under challenging conditions, e.g., in weakly textured indoor scenes or under changes in illumination.
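The retrieval baseline mentioned in part (i), which approximates the pose of a query image by the poses of its visually most similar training images, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a pipeline evaluated in the talk: the global image descriptors are assumed to be given, and the function name `retrieve_pose`, the similarity-weighted position average and taking the orientation from the single best match are all hypothetical simplifications.

```python
import numpy as np


def retrieve_pose(query_desc, db_descs, db_positions, db_quats, k=2):
    """Approximate a query camera pose from its k most similar database images.

    query_desc: (D,) global descriptor of the query image (assumed given).
    db_descs: (N, D) descriptors of the posed training/database images.
    db_positions: (N, 3) camera centres of the database images.
    db_quats: (N, 4) unit quaternions (w, x, y, z) of the database images.
    Returns (position, quaternion).
    """
    # Cosine similarity between the query and every database descriptor.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q
    # Indices of the k most similar database images, best first.
    top = np.argsort(-sims)[:k]
    # Similarity-weighted mean of the top-k camera positions.
    w = np.clip(sims[top], 0.0, None)
    if w.sum() == 0:
        w = np.ones_like(w)  # fall back to a plain mean
    position = (w[:, None] * db_positions[top]).sum(axis=0) / w.sum()
    # Simplification: take the orientation of the single best match.
    quat = db_quats[top[0]]
    return position, quat


# Hypothetical toy database of three posed images with 2-D descriptors.
db_descs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
db_positions = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
db_quats = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0],
                     [0.7071068, 0.0, 0.0, 0.7071068]])
query_desc = np.array([1.0, 0.1])
position, quat = retrieve_pose(query_desc, db_descs, db_positions, db_quats, k=2)
print(position, quat)
```

In this toy example the query is most similar to database images 0 and 2, so its estimated position falls between their camera centres, and image 1, ten units away, does not contribute. Such a baseline can only interpolate between known poses, which is why it serves as a reference point rather than an accurate localizer.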
Vincent Lepetit – “3D Scene Understanding from a Single Image”
Abstract: 3D scene understanding is a long-standing, fundamental problem in Computer Vision, with direct applications to robotics. In this talk, I will present several of our very recent works on 3D scene understanding. The first is a method for 3D object recognition and pose estimation based on a feedback loop inspired by biological mechanisms, which provides very accurate and reliable results. The second is a method for understanding the 3D layout (walls, floor, ceiling, etc.) of an indoor environment from a single image despite possible occlusions by furniture. I will then discuss the challenges in creating training and evaluation data for 3D registration problems, and present the direction we are currently exploring.
Martin Humenberger – “New Approaches to Robot Perception”
NAVER LABS Europe
Abstract: In this talk, I will present new approaches for robot perception in the fields of camera pose estimation, local feature extraction and synthetic datasets. I will provide an overview of popular visual localization methods, followed by more details on our recent approaches, which use deep learning techniques for visual localization and feature extraction. In particular, I will present a new method that uses predefined landmarks for localization (CVPR19), as well as a new feature detector, R2D2, which outperforms the state of the art in feature-based localization, recently won the local feature category of the long-term visual localization challenge at CVPR19, and will be presented at NeurIPS19. Finally, I will present our work on VSLAM in dynamic environments and provide some updates on the synthetic dataset Virtual KITTI.