Visual Localization by Learning OOIs Dense Match Regression - Naver Labs Europe

The first deep regression-based method that generalizes to larger environments, performs better than state-of-the-art with little training data and is robust to occlusion. Virtual Gallery dataset provided.


This article presents our recent CVPR’19 paper on “Visual Localization by Learning Objects-of-Interest Dense Match Regression”.

Task and challenges

Visual localization consists of estimating the 6-DoF camera pose (position and orientation) from a single RGB image within a given area, also referred to as a ‘map’. It is particularly useful in indoor locations where there is no GPS, for applications such as robot navigation, self-driving cars or augmented reality.

The main difficulties in estimating the camera pose are large viewpoint changes between the query and training images, incomplete maps, regions with no valuable information (e.g. textureless surfaces), symmetric and repetitive elements, varying lighting conditions, structural changes, dynamic objects (e.g. people) and scalability to large areas. Handling dynamic scenes, where objects may change between mapping and query time, is key to all long-term visual localization applications and is a typical failure case of state-of-the-art methods.



[Figure 1a] [Figure 1b]

Our main idea here is to leverage what we call Objects-of-Interest (OOIs). We define an OOI as a discriminative and stable area within the 3D map which can be reliably detected from multiple viewpoints, even when partly occluded and under various lighting conditions. Typical examples are paintings in a museum, or storefronts and brand logos in a shopping mall.


OOIs-based Visual Localization

[Figure 2]

Assuming there’s a database of OOIs, our visual localization approach relies on a CNN to detect the objects-of-interest, segment them and provide a dense set of 2D-2D matches between the detected OOIs and their reference images. Reference images are standard views of the OOIs for which the mapping to the 3D coordinates of the object is given by the database. The CNN architecture we use is inspired by DensePose [1]. By transitivity, combining the 2D-2D matches with the 2D-3D correspondences of the reference images yields a set of 2D-3D matches, from which the camera pose is obtained by solving a Perspective-n-Point (PnP) problem using RANSAC.
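To make the transitivity step concrete, here is a minimal sketch of how 2D-2D matches could be chained with the reference-to-3D mapping. It assumes planar OOIs and a hypothetical database layout (the `origin`, `axis_u` and `axis_v` fields, the OOI name and all numeric values are illustrative, not taken from the paper). The resulting 2D-3D pairs are what a PnP + RANSAC solver, for example OpenCV's `cv2.solvePnPRansac`, would take as input.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]
Match2D2D = Tuple[Pixel, str, Pixel]  # (query pixel, OOI id, reference pixel)


def reference_pixel_to_3d(ooi: Dict, ref_px: Pixel) -> Point3D:
    """Map a reference-image pixel to world coordinates for a planar OOI.

    Hypothetical storage: the 3D position of the reference image's top-left
    corner, plus the 3D displacement of one pixel step along each image axis.
    """
    u, v = ref_px
    o, ax, ay = ooi["origin"], ooi["axis_u"], ooi["axis_v"]
    return tuple(o[i] + u * ax[i] + v * ay[i] for i in range(3))


def chain_matches(matches: List[Match2D2D],
                  database: Dict[str, Dict]) -> List[Tuple[Pixel, Point3D]]:
    """Turn query-to-reference matches into query-to-world (2D-3D) matches."""
    pairs = []
    for query_px, ooi_id, ref_px in matches:
        ooi = database.get(ooi_id)
        if ooi is None:
            continue  # detection of an OOI that is not in the database
        pairs.append((query_px, reference_pixel_to_3d(ooi, ref_px)))
    return pairs


# Illustrative database entry: a painting on a wall, 1 mm per reference pixel.
database = {
    "painting_7": {
        "origin": (2.0, 1.5, 0.0),     # top-left corner, in metres
        "axis_u": (0.001, 0.0, 0.0),   # image columns run along +x
        "axis_v": (0.0, -0.001, 0.0),  # image rows run downwards (-y)
    }
}
matches = [
    ((320.0, 240.0), "painting_7", (100.0, 50.0)),
    ((400.0, 260.0), "painting_7", (300.0, 80.0)),
    ((10.0, 10.0), "unknown_ooi", (0.0, 0.0)),
]
pairs_2d3d = chain_matches(matches, database)
```

Because the 3D information lives only in the database lookup, moving an OOI would only require changing its entry, not the matches predicted by the network.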

Our method is carefully designed to tackle the open challenges of visual localization. It has several advantages and few limitations with respect to the state-of-the-art.

  • First of all, reasoning in 2D allows you to train the model with little training data: we can artificially generate a rich set of viewpoints for each object with homography data augmentation and we can achieve robustness to changes in lighting with color jittering.
  • Second, our method can handle dynamic scenes as long as the OOIs remain static. For instance, you can accurately estimate the pose in a museum where there are visitors even if the training data doesn’t contain any humans.
  • Third, if some OOIs are moved, we don’t have to retrain the whole network as required by most existing approaches. We only need to update the 2D-3D mapping of the reference images.
  • Fourth, our method focuses on discriminative objects and therefore avoids ambiguous textureless areas.
  • Fifth, our method can scale up to large areas and high numbers of OOIs as object detectors can segment thousands of categories.
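The first point, training from little data thanks to 2D reasoning, can be illustrated with a small sketch. Warping a reference view with a random homography gives a new synthetic viewpoint for which the ground-truth 2D-2D matches are known by construction, and color jittering simulates lighting changes. The sampling scheme, the function names and the parameter values below are assumptions made for illustration, not the augmentation used in the paper.

```python
import random
from typing import List, Tuple

Matrix3 = List[List[float]]
Point = Tuple[float, float]


def random_homography(rng: random.Random, strength: float = 0.1) -> Matrix3:
    """Identity plus small random perturbations; the bottom-right entry stays 1.

    Intended for coordinates normalized to [0, 1], so the perspective
    denominator stays positive for strength <= 0.1.
    """
    h = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for r in range(3):
        for c in range(3):
            if (r, c) != (2, 2):
                h[r][c] += rng.uniform(-strength, strength)
    return h


def warp_point(h: Matrix3, p: Point) -> Point:
    """Apply a homography to a 2D point using homogeneous coordinates."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def jitter_color(rgb: Tuple[int, int, int], rng: random.Random,
                 max_gain: float = 0.2) -> Tuple[int, ...]:
    """Scale each channel by an independent random gain, clamped to [0, 255]."""
    return tuple(
        min(255, max(0, round(c * (1.0 + rng.uniform(-max_gain, max_gain)))))
        for c in rgb
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
H = random_homography(rng)

# A 5x5 grid of reference-image points in normalized coordinates.
reference_grid = [(u / 4, v / 4) for v in range(5) for u in range(5)]

# Each pair is (position in the augmented view, matching reference position):
# exactly the dense 2D-2D supervision the network is trained to regress.
training_matches = [(warp_point(H, p), p) for p in reference_grid]

jittered_pixel = jitter_color((200, 120, 40), rng)
```

In a real pipeline the same homography would also be applied to the image pixels and the segmentation mask; only the point-level bookkeeping is shown here.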

One clear limitation of our method is that query images without any OOI cannot be localized. However, in many applications such as AR navigation, OOIs exist most of the time and local pose tracking (e.g. visual-inertial odometry) can be used in between. OOI detection is also interesting in itself in this kind of AR application, e.g. to display metadata on paintings in a museum or on shops in malls and airports. Furthermore, in a complex real-world application, OOIs can be used to guide people more easily. Commands such as ‘Take a picture of the closest painting’ might be easier to understand than ‘Take a picture with sufficient visual information’.


The Virtual Gallery dataset

[Figure 3]

We’ve introduced a new synthetic dataset to study the applicability of our approach and to measure the impact of varying lighting conditions and occlusions on different localization methods. It consists of a scene containing 3-4 rooms in which 42 free-to-use famous paintings are placed on the walls. The scene was created with the Unity software, which lets us extract ground-truth information such as depth, semantic and instance segmentations, and 2D-2D and 2D-3D correspondences, together with the rendered images.

To study robustness to lighting conditions, we generate the scene using 6 different lighting configurations with significant variations between them. To evaluate robustness to occluders such as visitors, we generate test images which contain randomly placed human body models.

The dataset can be downloaded from our site here: Virtual Gallery Dataset


Result overview

Here’s a short recap of the main findings of our experiments. You can read all the details in the paper.

  • Regressing dense 2D-2D matches performs better than regressing 3D coordinates or fitting a homography.
  • DSAC++ [2] and structure-based methods are extremely accurate on Virtual Gallery, as is our method, which reaches a median error below 3 cm and 1°.
  • Approaches directly regressing the pose with a CNN like PoseNet [3] perform poorly as they basically learn the training set’s biases.
  • Most images where our method fails contain no object-of-interest, or only poorly visible ones.
  • When using color jittering at training, our method generalizes to varying lighting conditions.
  • Our method is robust to occlusion by humans, as are most existing approaches.
  • Our method performs significantly better than the state-of-the-art deep approaches when trained on few images.
  • Our method is the first deep regression-based method to generalize to larger environments such as the Baidu localization dataset [4].
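For readers unfamiliar with how figures such as ‘below 3 cm and 1°’ are typically obtained, the sketch below computes median position and orientation errors from estimated and ground-truth poses. It follows the common convention (Euclidean distance between camera positions, and the angle of the relative rotation); the exact evaluation protocol is described in the paper, and the poses used here are made-up values for illustration.

```python
import math
import statistics
from typing import List, Sequence

Matrix3 = List[List[float]]


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Angle (degrees) of the relative rotation between two rotation matrices.

    Uses trace(R_est^T R_gt) = sum of element-wise products, and
    angle = arccos((trace - 1) / 2).
    """
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.degrees(math.acos(cos_angle))


def rot_z(deg: float) -> Matrix3:
    """Rotation about the z axis, used here only to build example poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Made-up example: three query images, ground truth at the origin.
gt_rotation, gt_position = rot_z(0.0), (0.0, 0.0, 0.0)
estimates = [
    (rot_z(0.5), (0.01, 0.0, 0.0)),
    (rot_z(1.0), (0.0, 0.02, 0.0)),
    (rot_z(2.0), (0.0, 0.0, 0.10)),
]

median_position_error = statistics.median(
    translation_error(t, gt_position) for _, t in estimates)
median_rotation_error = statistics.median(
    rotation_error_deg(r, gt_rotation) for r, _ in estimates)
```

Reporting the median rather than the mean keeps a few badly failed localizations, such as images without any visible OOI, from dominating the summary.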



[1] DensePose: Dense human pose estimation in the wild. Güler et al. CVPR’18.

[2] Learning less is more – 6D camera localization via 3D surface regression. Brachmann and Rother. CVPR’18.

[3] PoseNet: A convolutional network for real-time 6-DoF camera relocalization. Kendall et al. ICCV’15.

[4] A dataset for benchmarking image-based localization. Sun et al. CVPR’17.