This is the second article in our four-part series on CVPR 2018, with highlights on embedded vision and visual localization.
Embedded Vision
Although accuracy on vision tasks keeps improving, little or no attention has been paid to the energy it takes to get there. This is somewhat surprising as many models end up embedded on mobile devices with limited battery life. To address this concern in practice, Facebook and Google co-sponsored a challenge that aims to unify the way we compare models in terms of energy consumption, together with a set of metrics that make such comparisons possible.
Facebook is proposing an API to evaluate model performance and generate these metrics on hardware platforms like the Nvidia TX2. Google focused more on good practices for developing mobile computer vision models: they described how the latest mobile architectures such as MobileNet v2 were designed and the advantages of model quantization. Google also introduced NetAdapt, a platform whose goal is to adapt a model to a target device by progressively reducing the network until the target resource constraints are satisfied.
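To give a feel for what "reducing a network until the constraints are satisfied" means in practice, here is a rough, hypothetical sketch of a NetAdapt-style loop. The helper functions (measure_resource, propose_reduced_layer, short_finetune, evaluate) are placeholders we introduce for illustration, not NetAdapt's actual API.

```python
# Hypothetical sketch of a NetAdapt-style reduce-and-finetune loop (placeholder helpers).
def netadapt_style_reduce(model, budget, step,
                          measure_resource, propose_reduced_layer,
                          short_finetune, evaluate):
    """Shrink `model` until its measured resource usage (e.g. latency on the target
    device) drops below `budget`, keeping the most accurate candidate at each step."""
    while measure_resource(model) > budget:
        candidates = []
        for layer in model.layers:
            # Propose a version of the model where this layer is reduced enough
            # to save roughly `step` of the resource (e.g. by pruning filters).
            candidate = propose_reduced_layer(model, layer, step)
            candidate = short_finetune(candidate)       # quick adaptation
            candidates.append((evaluate(candidate), candidate))
        # Keep the candidate that best preserves accuracy for this reduction step.
        _, model = max(candidates, key=lambda c: c[0])
    return model  # a final, longer fine-tune would normally follow
```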
With unified metrics, we hope to see more and more people and papers treating the energy problem as seriously as accuracy.
It’s no surprise that the Efficient Deep Learning for Computer Vision Workshop had a strong focus on accuracy vs. watts. Forrest Iandola from DeepScale presented a number of tips and tricks for developing smaller neural nets. XNOR.ai showed how they quantize networks with binary weights to make them more efficient. The new SqueezeNext: Hardware-Aware Neural Network Design network was also presented: it outperforms AlexNet accuracy with 63× fewer parameters (and an 8× speedup).
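As a concrete illustration of binary-weight quantization in the XNOR-Net flavour, the minimal numpy sketch below approximates a real-valued weight tensor W as α·sign(W), with α the mean absolute weight. This is just an illustrative example, not XNOR.ai's implementation.

```python
# Minimal numpy sketch of XNOR-Net-style binary-weight quantization: W ≈ alpha * sign(W).
import numpy as np

def binarize_weights(W):
    """Return binary weights in {-1, +1} and the scaling factor alpha = mean(|W|)."""
    alpha = np.mean(np.abs(W))        # per-tensor scale; real implementations often use per-filter scales
    B = np.where(W >= 0, 1.0, -1.0)   # binary weights
    return B, alpha

W = np.random.randn(64, 3, 3, 3)      # e.g. a conv layer's weights (out, in, kH, kW)
B, alpha = binarize_weights(W)
W_approx = alpha * B                  # used in place of W at inference time
print("mean approximation error:", np.mean(np.abs(W - W_approx)))
```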
It’s also worth noting that processors specialized in deep learning have arrived, such as the recent Huawei Kirin or the Nvidia Jetson Xavier, but the “right” DNN is increasingly platform-dependent (mobile, server, cloud, …).
Neural network efficiency is most definitely a trend for computer vision.
The Embedded Vision Workshop had a rather diverse program covering topics such as low-level hardware/software designs for dedicated image sensor chips, obstacle detection for UAVs, semantic segmentation for autonomous driving and learning methods for efficient keypoint detectors.
Invited talks were given by Prof. Mohan Trivedi (UCSD) on challenges in autonomous driving and human interaction; Cormac Brick (Intel) gave interesting insights into the Intel Compute Stick; Raghuveer Rao presented work on image processing for autonomous platforms at the Army Research Laboratory; Donghwan Lee from our own NAVER LABS introduced the topic of scalable and semantic indoor mapping; and Prof. Warren J. Gross (McGill University) talked about hardware architectures for DNNs.
NAVER LABS Europe awarded the best paper prize to Di Febbo et al. for KCNN: Extremely-Efficient Hardware Keypoint Detection with a Compact Convolutional Neural Network, and the runner-up award to Vallurupalli et al. for Efficient Semantic Segmentation using Gradual Grouping.
The best poster award, sponsored by Intel, went to Bhowmik et al. for Design of a Reconfigurable 3D Pixel-Parallel Neuromorphic Architecture for Smart Image Sensor.
Visual Localization
Visual localization is about estimating the location of a camera within a given area (also referred to as a map). This is particularly valuable when no other localization technique is available, e.g. in GPS-denied environments such as indoor locations. Interesting applications include robot navigation, self-driving cars, indoor positioning and augmented reality. Various input sources are used, the most common being RGB images, images + depth (RGBD) or 3D point clouds (e.g. from lidar). To find the location of a camera using images, you have to establish correspondences between the query image and the map. Structure-based methods do this with descriptor matching between 3D points of the map and keypoints in the query image. Image retrieval-based methods match the query image against the images of the map using global descriptors, and pose regression-based methods try to directly regress the camera pose from the query image.

The biggest challenges in localization are large viewpoint changes between the query image and the map, incomplete maps, textureless areas without valuable visual information, symmetric and repetitive elements, changing light conditions, structural changes, and dynamic areas which cause unpredictable occlusions (e.g. people).
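To make the structure-based flavour concrete, here is a minimal sketch (not tied to any particular paper) of estimating a camera pose from 2D-3D matches with OpenCV's PnP + RANSAC. The match arrays and camera intrinsics below are placeholders; in a real pipeline they would come from descriptor matching against the map and from camera calibration.

```python
# Minimal sketch: camera pose from 2D-3D correspondences with PnP + RANSAC (OpenCV).
import numpy as np
import cv2

pts3d = np.random.rand(100, 3).astype(np.float32)         # placeholder 3D map points
pts2d = np.random.rand(100, 2).astype(np.float32) * 640   # placeholder matched 2D keypoints
K = np.array([[600, 0, 320],                               # placeholder camera intrinsics
              [0, 600, 240],
              [0,   0,   1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d, pts2d, K, distCoeffs=None,
    reprojectionError=8.0, iterationsCount=1000)

if ok:
    R, _ = cv2.Rodrigues(rvec)     # rotation of the map-to-camera transform
    cam_center = -R.T @ tvec       # camera position in map coordinates
    print("inliers:", 0 if inliers is None else len(inliers))
    print("camera center:", cam_center.ravel())
```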
Camposeco et al., in their Hybrid Camera Pose Estimation, present an improvement for geometry- or structure-based methods by leveraging both 2D-3D and 2D-2D correspondences between the query image and the map. The paper asks whether one of the two is always preferable, or whether a hybrid solution (best of both) exists. For example, poorly triangulated 3D points degrade accuracy, so in those cases using 2D-2D matches would probably be better. Their answer is a new RANSAC-based approach that automatically chooses the best solver. The figure below shows how a query image (red) is matched with the model (2D-3D in blue) and with the database images (2D-2D in pink).
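As a rough illustration of the idea (and explicitly not the authors' algorithm, which is much smarter about which solver to sample and how hypotheses are verified), a hybrid RANSAC loop can be sketched as below. Here `solvers` and `score` are hypothetical callables wrapping, say, a P3P solver over 2D-3D matches and a relative-pose solver over 2D-2D matches.

```python
# Toy sketch of a hybrid RANSAC loop mixing solver types; not the method of Camposeco et al.
import random

def hybrid_ransac(solvers, score, n_iters=1000):
    """solvers: callables, each drawing its own minimal sample of one correspondence type
    (2D-3D or 2D-2D) and returning a candidate camera pose, or None on failure.
    score: callable returning how many correspondences (of all types) a pose explains."""
    best_pose, best_inliers = None, -1
    for _ in range(n_iters):
        solver = random.choice(solvers)   # the paper chooses solvers far more cleverly than uniformly
        pose = solver()                   # hypothesise a pose from a minimal sample
        if pose is None:
            continue
        n_inliers = score(pose)           # verify against all 2D-3D and 2D-2D matches
        if n_inliers > best_inliers:
            best_pose, best_inliers = pose, n_inliers
    return best_pose, best_inliers
```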
About the authors:
Claudine Combe and Nicolas Monet are research engineers in the Edge Computing group. Martin Humenberger leads the 3D Vision group at NAVER LABS Europe.
Part 1: Pose Estimation
Part 3: CVPR, Image Retrieval
Part 4: CVPR, 3D Scene Understanding