Abstract
Scene coordinate regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain mostly scene-specific or limited to small scenes, and thus hardly scale to realistic datasets. In this paper, we propose a new paradigm in which a single generic SCR model is trained once and then deployed to new test scenes, regardless of their scale and without further finetuning. For a given query image, it collects inputs from off-the-shelf image retrieval techniques and Structure-from-Motion databases: a list of relevant database images with sparse pointwise 2D-3D annotations. The model is based on the transformer architecture and can take a variable number of images and sparse 2D-3D annotations as input. It is trained on a few diverse datasets and significantly outperforms other scene coordinate regression approaches, including scene-specific models, on several visual localization benchmarks. In particular, we set a new state of the art on the Cambridge localization benchmark, even outperforming feature-matching-based approaches.
Method overview
Given a query image and a set of related views with sparse 2D/3D annotations retrieved from a database, SACReg predicts absolute 3D coordinates for each pixel of the query image. These predictions can then be used for visual localization with a robust PnP algorithm, as sketched below. Importantly, SACReg is scene-agnostic: it does not need any retraining for new datasets; only the images and 2D-3D annotations that serve as input are scene-specific.
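As a concrete illustration, the minimal sketch below shows how a camera pose could be recovered from a dense scene-coordinate prediction using OpenCV's robust PnP solver. The `coords`, `conf`, and `K` inputs are placeholders for the predicted 3D point map, its confidence map, and the query camera intrinsics; they are illustrative names, not part of any released API.

```python
import numpy as np
import cv2

def localize_from_scene_coords(coords, conf, K, conf_thresh=0.5):
    """Estimate the query camera pose from a dense point map (illustrative sketch).

    coords: (H, W, 3) predicted absolute 3D coordinates per pixel (assumed)
    conf:   (H, W) per-pixel confidence (assumed)
    K:      (3, 3) query camera intrinsic matrix
    """
    H, W = conf.shape
    # Pixel grid matching the predicted 3D coordinates
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    mask = conf > conf_thresh  # keep only confident pixels
    pts2d = np.stack([u[mask], v[mask]], axis=-1).astype(np.float64)
    pts3d = coords[mask].astype(np.float64)
    # Robust PnP: RANSAC rejects outliers among the dense 2D-3D correspondences
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, distCoeffs=None,
        iterationsCount=1000, reprojectionError=5.0)
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)  # world-to-camera rotation
    return R, tvec, inliers
```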
Regression examples
Below are regression examples on Aachen-Day, a dataset on which SACReg has not been trained. Our model predicts a dense 3D coordinate point map and a confidence map for a given query image, using reference images retrieved from an SfM database. Only the first 3 reference images (out of 8) are shown. For visualization purposes, the 3D coordinates and confidence are colorized, and low-confidence areas are not displayed.
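For reference, the short sketch below illustrates one way such a visualization could be produced, assuming hypothetical `coords` (H x W x 3) and `conf` (H x W) arrays for the predicted point map and confidence map: each 3D axis is normalized to [0, 1] and mapped to an RGB channel, and pixels below a confidence threshold are blanked out.

```python
import numpy as np

def colorize_coords(coords, conf, conf_thresh=0.5):
    """Map predicted 3D coordinates to RGB, hiding low-confidence pixels (illustrative)."""
    # Normalize each coordinate axis independently to [0, 1] so XYZ maps to RGB
    lo = coords.min(axis=(0, 1))
    hi = coords.max(axis=(0, 1))
    rgb = (coords - lo) / np.maximum(hi - lo, 1e-6)
    # Blank out low-confidence areas (shown as white)
    rgb[conf < conf_thresh] = 1.0
    return rgb  # (H, W, 3) array, displayable with any image viewer
```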