Visual Localization


Visual localization is an important component of many location-based systems such as self-driving cars, autonomous robots, and augmented, mixed and virtual reality. The goal is to accurately estimate the position and orientation of a camera from captured images. More precisely, correspondences between a representation of the environment (a 3D map) and a query image are used to estimate the camera pose.

Visual localization methods must cope with environmental changes, e.g. differences in time of day or season, structural changes such as modified building facades or storefronts, moving objects that occlude parts of the scene (cars, people, etc.), as well as large changes in viewpoint between the mapping and query images. To help address these issues, we have released several datasets, such as the Virtual Gallery dataset and a large-scale localization dataset captured in crowded indoor spaces.

NAVER LABS large-scale localization dataset.


Structure-based visual localization methods use local features, such as our R2D2 (Repeatable and Reliable Detector and Descriptor) and PUMP (pyramidal and uniqueness matching priors for unsupervised learning of local features), to establish correspondences between 2D query images and 3D reconstructions. These correspondences are then used to compute the camera pose with a perspective-n-point (PnP) solver inside a RANSAC loop.
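To make the PnP-inside-RANSAC step concrete, here is a minimal numpy sketch. It is not the implementation behind R2D2 or PUMP: it uses a simple Direct Linear Transform (DLT) pose solver on normalized image coordinates (in practice one would typically call a dedicated solver such as OpenCV's `solvePnPRansac`), and the thresholds and iteration counts are illustrative.

```python
import numpy as np

def dlt_pnp(X, x):
    """Direct Linear Transform pose estimation from n >= 6 correspondences.
    X: (n, 3) 3D map points; x: (n, 2) normalized image coordinates
    (pixel coordinates with the intrinsics K already removed)."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])          # homogeneous 3D points
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh                             # row for u: p1.X - u * p3.X = 0
    A[0::2, 8:12] = -x[:, 0:1] * Xh
    A[1::2, 4:8] = Xh                             # row for v: p2.X - v * p3.X = 0
    A[1::2, 8:12] = -x[:, 1:2] * Xh
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)                      # null vector = projection matrix up to scale
    M, t = P[:, :3], P[:, 3]
    U, S, Vt2 = np.linalg.svd(M)
    Q = U @ Vt2                                   # orthogonal polar factor of M
    sgn = np.sign(np.linalg.det(Q))               # recover the sign of the global scale
    return sgn * Q, sgn * t / S.mean()

def pnp_ransac(X, x, n_iters=200, thresh=0.01, seed=0):
    """Robust pose estimation: solve PnP on minimal 6-point samples, keep the
    hypothesis with the most inliers, then refit on all inliers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best_inliers = None
    for _ in range(n_iters):
        sample = rng.choice(n, size=6, replace=False)
        R, t = dlt_pnp(X[sample], x[sample])
        cam = X @ R.T + t                         # transform points into camera frame
        proj = cam[:, :2] / cam[:, 2:3]           # perspective division
        err = np.linalg.norm(proj - x, axis=1)    # reprojection error
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    R, t = dlt_pnp(X[best_inliers], x[best_inliers])
    return R, t, best_inliers
```

The RANSAC loop is what makes the pose robust to mismatched local features: any hypothesis fit on a sample containing an outlier correspondence gathers few inliers and is discarded.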

To reduce the search range in large 3D reconstructions, image retrieval methods such as Deep Image Retrieval, Super-Features or Weatherproofing can be used to first retrieve the most relevant images from the Structure-from-Motion (SfM) model. Local correspondences are then established only in the area defined by those images.
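The retrieval step reduces to a nearest-neighbor search over global image descriptors. A minimal sketch, assuming descriptors have already been extracted by some retrieval model (the descriptor dimension and similarity measure here are illustrative choices, not those of any particular method above):

```python
import numpy as np

def retrieve_top_k(query_desc, db_descs, k=5):
    """Return indices of the k database images most similar to the query,
    ranked by cosine similarity between L2-normalized global descriptors."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                     # cosine similarity to every database image
    return np.argsort(-sims)[:k]      # indices of the k highest similarities
```

Only the map points visible in these top-k images then need to be considered when matching local features, which is what makes localization tractable at city scale.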

Scene point regression methods like SACReg establish the 2D-3D correspondences with a deep neural network (DNN), while absolute pose regression methods directly estimate the camera pose with a DNN.
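For absolute pose regression, the network head typically outputs a translation plus a rotation parameterization such as a quaternion. The sketch below shows that output decoding and a PoseNet-style training loss in plain numpy; it is a hedged illustration of the general recipe, not the loss of any specific method mentioned above, and the weight `beta` is an arbitrary example value.

```python
import numpy as np

def decode_pose(raw):
    """Split a raw 7-D regression output into translation and unit quaternion."""
    t, q = raw[:3], raw[3:]
    q = q / np.linalg.norm(q)          # normalize: only unit quaternions are rotations
    return t, q

def pose_loss(raw, t_gt, q_gt, beta=250.0):
    """PoseNet-style loss: translation error plus weighted rotation error.
    q and -q encode the same rotation, so take the smaller of the two distances."""
    t, q = decode_pose(raw)
    rot_err = min(np.linalg.norm(q - q_gt), np.linalg.norm(q + q_gt))
    return np.linalg.norm(t - t_gt) + beta * rot_err
```

The quaternion sign ambiguity handled in `pose_loss` matters in practice: without it, a network predicting a rotation perfectly but with flipped sign would be heavily penalized.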

Another strategy is to estimate the pose of an image by directly aligning deep image features with a reference 3D model, as we have done with SegLoc and NeRF for camera pose refinement.

Image retrieval can also be used for visual localization when no 3D map is available. The camera pose of a query image can then be computed from the poses of the top retrieved database images: by interpolating the retrieved image poses, by estimating the relative pose between the query and retrieved images, or by building local 3D models on the fly (see our Image Retrieval Benchmark).
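The first of these options, pose interpolation, can be sketched in a few lines. This is a generic weighted blend of neighbor poses (weights would typically come from retrieval similarity scores), not the exact scheme of the benchmark; quaternion signs are aligned before averaging because q and -q represent the same rotation.

```python
import numpy as np

def interpolate_pose(ts, qs, weights):
    """Estimate a query pose from the poses of top-retrieved database images:
    weighted mean of translations and a weighted quaternion blend.
    ts: (n, 3) translations; qs: (n, 4) unit quaternions; weights: (n,)."""
    w = np.asarray(weights, float)
    w = w / w.sum()                               # normalize weights
    t = (w[:, None] * np.asarray(ts, float)).sum(axis=0)
    qs = np.asarray(qs, float)
    signs = np.sign(qs @ qs[0])                   # align each quaternion with the first
    signs[signs == 0] = 1.0
    q = (w[:, None] * signs[:, None] * qs).sum(axis=0)
    return t, q / np.linalg.norm(q)               # renormalize the blended quaternion
```

Such interpolation is coarse but needs no 3D structure at all, which is exactly the appeal of retrieval-only localization.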

Furthermore, objects or semantic information can also be used for visual localization as detailed in our work on Objects of Interest (OOI).

To help the visual localization community advance further, we released Kapture, a data format and toolbox to make it easier to use public datasets, create maps and re-localize images.
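Kapture stores maps and poses as plain CSV-like text files, which makes them easy to inspect and convert. As a hedged illustration, the parser below reads one record from a trajectories file, assuming the field layout `timestamp, device_id, qw, qx, qy, qz, tx, ty, tz`; please check the kapture format specification before relying on this layout.

```python
def parse_trajectory_line(line):
    """Parse one record from a kapture-style trajectories file.
    Assumed field layout (verify against the kapture format spec):
    timestamp, device_id, qw, qx, qy, qz, tx, ty, tz"""
    fields = [f.strip() for f in line.split(",")]
    timestamp, device_id = int(fields[0]), fields[1]
    qw, qx, qy, qz, tx, ty, tz = map(float, fields[2:9])
    return {"timestamp": timestamp,
            "device": device_id,
            "rotation": (qw, qx, qy, qz),      # camera orientation as a quaternion
            "translation": (tx, ty, tz)}       # camera position

# Example record (hypothetical values):
record = parse_trajectory_line("0, cam0, 1, 0, 0, 0, 0.5, 0.0, 1.0")
```

Keeping everything in simple text files is what lets Kapture act as a common interchange format between public datasets, mapping pipelines and localization tools.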

The kapture visual localization toolbox.
