Visual Localization by Learning OOIs Dense Match Regression - Naver Labs Europe

The first deep regression-based method that generalizes to larger environments, performs better than state-of-the-art with little training data and is robust to occlusion. Virtual Gallery dataset provided.


This article presents our recent CVPR’19 paper on “Visual Localization by Learning Objects-of-Interest Dense Match Regression”.

Task and challenges

 Visual localization consists of estimating the 6-DoF camera pose (position and orientation) from a single RGB image within a known area, referred to as the ‘map’. It is particularly useful in indoor locations, where GPS is unavailable, for applications such as robot navigation, self-driving cars or augmented reality.

 The main difficulties in estimating the camera pose are large viewpoint changes between the query and training images, incomplete maps, regions with no valuable information (e.g. textureless surfaces), symmetric and repetitive elements, varying lighting conditions, structural changes, dynamic objects (e.g. people) and scalability to large areas. Being able to handle dynamic scenes, where objects may change between mapping and query time, is key to all long-term visual localization applications and is a typical failure case of state-of-the-art methods.



[Figures 1a and 1b]

Our main idea here is to leverage what we call Objects-of-Interest (OOIs). We define an OOI as a discriminative and stable area within the 3D map which can be reliably detected from multiple viewpoints, even when partly occluded and under various lighting conditions. Typical examples are paintings in a museum, or storefronts and brand logos in a shopping mall.


OOIs-based Visual Localization

[Figure 2]

Assuming there’s a database of OOIs, our visual localization approach relies on a CNN to detect the objects-of-interest, segment them and provide a dense set of 2D-2D matches between the detected OOIs and their reference images. Reference images are standard views of the OOIs for which the mapping to the 3D coordinates of the object is given by the database. The CNN architecture we use is inspired by DensePose [1]. By transitivity, we can combine the set of 2D-2D matches with the 2D-3D correspondences of the reference images to obtain a set of 2D-3D matches, from which the camera pose is estimated by solving a Perspective-n-Point problem using RANSAC.
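The pose-solving step at the end of this pipeline can be sketched as follows. This is a minimal, illustrative DLT pose solver written in plain numpy, standing in for the full PnP+RANSAC solver the paper uses (in practice one would call a robust solver such as OpenCV’s `solvePnPRansac`); it assumes the 2D-3D matches have already been obtained by transitivity and that the camera intrinsics `K` are known.

```python
import numpy as np

def pnp_dlt(world_pts, image_pts, K):
    """Minimal DLT pose solver (stand-in for PnP + RANSAC).
    world_pts: (N, 3) 3D map coordinates of the matches (N >= 6).
    image_pts: (N, 2) corresponding pixel coordinates in the query image.
    K:         (3, 3) camera intrinsics.
    Returns (R, t) such that x ~ K (R X + t)."""
    # Normalize pixels with the intrinsics: x_norm = K^-1 [u, v, 1]^T.
    pts_h = np.hstack([image_pts, np.ones((len(image_pts), 1))])
    norm = (np.linalg.inv(K) @ pts_h.T).T
    # Build the linear system A p = 0 for the 3x4 pose matrix P = [R | t].
    A = []
    for (X, Y, Z), (u, v, _) in zip(world_pts, norm):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)
    # Fix the unknown scale (third rotation row must have unit norm)
    # and the sign (rotation must have positive determinant).
    P /= np.linalg.norm(P[2, :3])
    if np.linalg.det(P[:, :3]) < 0:
        P = -P
    # Project the rotation part onto SO(3) via SVD.
    U, _, Vt2 = np.linalg.svd(P[:, :3])
    return U @ Vt2, P[:, 3]
```

A robust estimator matters in practice because the dense CNN matches contain outliers; RANSAC repeatedly runs a minimal solver like this on small random subsets and keeps the pose with the most inliers.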

Our method is carefully designed to tackle the open challenges of visual localization. It has several advantages and few limitations with respect to the state-of-the-art.

  • First of all, reasoning in 2D allows the model to be trained with little data: we can artificially generate a rich set of viewpoints for each object through homography data augmentation, and achieve robustness to lighting changes with color jittering.
  • Second, our method can handle dynamic scenes as long as the OOIs remain static. For instance, you can accurately estimate the pose in a museum where there are visitors even if the training data doesn’t contain any humans.
  • Third, if some OOIs are moved, we don’t have to retrain the whole network as required by most existing approaches. We only need to update the 2D-3D mapping of the reference images.
  • Fourth, our method focuses on discriminative objects and therefore avoids ambiguous textureless areas.
  • Fifth, our method can scale up to large areas and high numbers of OOIs as object detectors can segment thousands of categories.
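The homography data augmentation mentioned in the first point can be sketched as below. This is an illustrative numpy implementation: the corner-jitter magnitude and function names are our assumptions, not taken from the paper.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H mapping src -> dst (DLT, 4+ pairs)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u*x, -u*y, -u])
        A.append([0, 0, 0, x, y, 1, -v*x, -v*y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=np.float64))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def random_viewpoint_homography(w, h, jitter=0.2, rng=None):
    """Randomly perturb the four image corners to simulate a new viewpoint.
    `jitter` (fraction of image size) is an illustrative magnitude."""
    rng = rng if rng is not None else np.random.default_rng()
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float64)
    noise = rng.uniform(-jitter, jitter, corners.shape) * [w, h]
    return homography_from_points(corners, corners + noise)
```

The key point is that the training image and its dense ground-truth match map are warped with the same homography, so the 2D-2D labels stay consistent; color jittering is then applied independently to the pixels.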

One clear limitation of our method is that query images without any OOI cannot be localized. However, in many applications such as AR navigation, OOIs exist most of the time and local pose tracking (e.g. visual-inertial odometry) can be used in-between. OOI detection is interesting by itself in this kind of AR application e.g. to display metadata on paintings in a museum or on shops in malls and airports. Furthermore, in a complex real-world application, OOIs can be used to more easily guide people. Commands such as ‘Take a picture of the closest painting’ might be easier to understand than ‘Take a picture with sufficient visual information’.


The Virtual Gallery dataset

[Figure 3]

We’ve introduced a new synthetic dataset to study the applicability of our approach, but also to measure the impact of varying lighting conditions and occlusions on different localization methods. It consists of a scene containing 3-4 rooms in which 42 freely usable famous paintings are placed on the walls. The scene was created with the Unity game engine, which lets us extract ground-truth information such as depth maps, semantic and instance segmentations, and 2D-2D and 2D-3D correspondences, together with the rendered images.

To study robustness to lighting conditions, we generate the scene using 6 different lighting configurations with significant variations between them. To evaluate robustness to occluders such as visitors, we generate test images which contain randomly placed human body models.

The dataset can be downloaded from our site here: Virtual Gallery Dataset


Result overview

Here’s a short recap of the main findings of our experiments. You can read all the details in the paper.

  • Regressing dense 2D-2D matches performs better than regressing 3D coordinates or fitting a homography.
  • DSAC++ [2] and structure-based methods are extremely accurate on Virtual Gallery, as is our method, with a median error below 3cm and 1°.
  • Approaches directly regressing the pose with a CNN like PoseNet [3] perform poorly as they basically learn the training set’s biases.
  • Most images where our method fails contain no object-of-interest, or only poorly visible ones.
  • When using color jittering at training, our method generalizes to varying lighting conditions.
  • Our method is robust to occlusion by humans, as are most existing approaches.
  • Our method performs significantly better than the state-of-the-art deep approaches when trained on few images.
  • Our method is the first deep regression-based method to generalize to larger environments such as the Baidu localization dataset [4].



[1] Densepose: Dense human pose estimation in the wild. Güler et al. CVPR’18.

[2] Learning less is more – 6D camera localization via 3D surface regression. Brachmann and Rother. CVPR’18.

[3] Posenet: A convolutional network for real-time 6-DoF camera relocalization. Kendall et al. ICCV’15.

[4] A dataset for benchmarking image-based localization. Sun et al. CVPR’17.