R2D2: Repeatable and Reliable Detector and Descriptor - Naver Labs Europe

Interest point detection and local feature description are fundamental steps in many computer vision applications. Classical methods for these tasks follow a detect-then-describe paradigm: in a first step, repeatable keypoints are detected, often using handcrafted methods; in a second step, they are described using another, often independent, method. Neural networks trained with metric learning losses have recently replaced handcrafted features for this second step. While previous learning approaches focused on learning repeatable saliency maps for keypoint detection or learning descriptors adapted to the detected keypoint locations, we argue that salient regions are not necessarily discriminative, and can therefore harm the performance of the description. Instead, we argue that descriptors should be learned only in regions where matching can be performed with high confidence. An extreme example is a checkerboard, where every corner or blob is a salient and repeatable keypoint but where matching is inherently ambiguous, so that learning descriptors on these keypoints harms performance.
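The checkerboard argument can be illustrated with a small toy sketch. It is not the method of the paper: raw pixel patches stand in for descriptors, and a nearest-neighbor ratio test stands in for a matching criterion. The helper names (`checkerboard`, `patch`, `is_ambiguous`) and the `ratio` threshold of 0.8 are assumptions made only for this illustration.

```python
def checkerboard(size, cell):
    """Binary checkerboard image, stored as a list of rows."""
    return [[((y // cell) + (x // cell)) % 2 for x in range(size)]
            for y in range(size)]


def patch(img, cy, cx, r):
    """Flattened (2r x 2r) pixel patch around the corner at (cy, cx)."""
    return [img[y][x]
            for y in range(cy - r, cy + r)
            for x in range(cx - r, cx + r)]


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def is_ambiguous(dists, ratio=0.8):
    """Ratio test: a match is kept only if the best distance is clearly
    smaller than the second best; otherwise it is ambiguous."""
    best, second = sorted(dists)[:2]
    return not (best < ratio * second)


img = checkerboard(size=32, cell=8)

# One corner is used as the query, three other corners as candidates.
query = patch(img, 8, 8, r=4)
candidates = [patch(img, 16, 16, r=4),
              patch(img, 24, 24, r=4),
              patch(img, 8, 24, r=4)]

distances = [sq_dist(query, c) for c in candidates]
ambiguous = is_ambiguous(distances)
```

In this toy setting every candidate corner has exactly the same local appearance as the query, so all distances are zero and no candidate can be preferred over another, even though each corner is perfectly salient and repeatable.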
In this paper, we propose to jointly learn keypoint detection and description, together with a confidence value that predicts whether the descriptor is discriminative enough, thus avoiding ambiguous areas and leading to reliable keypoint detectors and descriptors. More precisely, we propose to train a network in a self-supervised manner that outputs a repeatable saliency map with sparse local maxima, together with descriptors and a confidence value on the discriminativeness of each descriptor. The confidence value is trained to be high where matching is expected to be correct, and low otherwise. Our detection-and-description approach simultaneously outputs sparse, repeatable and reliable keypoints that outperform state-of-the-art detectors and descriptors on the HPatches dataset.
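As a rough sketch of how the two outputs described above could be combined at inference time, the following keeps only positions that are local maxima of the saliency map and whose confidence is high enough. The function name `select_keypoints`, the 3x3 neighborhood, the `min_confidence` threshold, and the ranking by the product of saliency and confidence are all assumptions of this illustration, and the map values are made up.

```python
def select_keypoints(saliency, confidence, min_confidence=0.5, top_k=2):
    """Pick positions that are strict 3x3 local maxima of `saliency`
    and have `confidence` of at least `min_confidence`, ranked by
    saliency * confidence (an assumed scoring rule)."""
    h, w = len(saliency), len(saliency[0])
    scored = []
    for y in range(h):
        for x in range(w):
            # Values of the up-to-8 neighbors inside the image bounds.
            neighbors = [saliency[ny][nx]
                         for ny in range(max(0, y - 1), min(h, y + 2))
                         for nx in range(max(0, x - 1), min(w, x + 2))
                         if (ny, nx) != (y, x)]
            is_peak = all(saliency[y][x] > v for v in neighbors)
            if is_peak and confidence[y][x] >= min_confidence:
                scored.append((saliency[y][x] * confidence[y][x], (y, x)))
    scored.sort(reverse=True)
    return [pos for _, pos in scored[:top_k]]


# Toy saliency map with three local maxima: (1, 1), (2, 3) and (3, 0).
saliency = [[0.1, 0.2, 0.1, 0.0],
            [0.2, 0.9, 0.2, 0.1],
            [0.1, 0.2, 0.1, 0.8],
            [0.7, 0.1, 0.2, 0.1]]

# Toy confidence map: the strongest peak (1, 1) lies in an ambiguous area.
confidence = [[0.5, 0.5, 0.5, 0.5],
              [0.5, 0.2, 0.5, 0.5],
              [0.5, 0.5, 0.5, 0.9],
              [0.6, 0.5, 0.5, 0.5]]

keypoints = select_keypoints(saliency, confidence)
```

Here the most salient peak at (1, 1) is discarded because its confidence is low, which mirrors the idea that a salient location is not necessarily a reliable one to match.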