This is the second article in our four-part series on CVPR 2018, with highlights on embedded vision and visual localization.
Embedded Vision
Although accuracy on vision tasks keeps improving, little or no attention has been paid to the energy it takes to get there. This is somewhat surprising as many models end up embedded on mobile devices with limited battery life. To address this concern in practice, Facebook and Google co-sponsored a challenge that aims to unify the way we compare models in terms of energy consumption, together with a set of metrics that make such comparisons possible.
Facebook is proposing an API to evaluate model performance and generate these metrics on hardware platforms like the Nvidia TX2. Google focused more on good practices for developing mobile computer vision models: they described how the latest mobile architectures such as MobileNet v2 were designed and the advantages of model quantization. Google also introduced NetAdapt, a platform whose goal is to adapt a model to a target device by progressively reducing the network until the target resource constraints are satisfied.
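To give a feel for what "reducing a network until the constraints are satisfied" means in practice, here is a rough, hypothetical sketch of a NetAdapt-style loop. The helper functions (measure_resource, propose_reduced_layer, short_finetune, evaluate) are placeholders we introduce for illustration, not NetAdapt's actual API.

```python
# Hypothetical sketch of a NetAdapt-style reduce-and-finetune loop (placeholder helpers).
def netadapt_style_reduce(model, budget, step,
                          measure_resource, propose_reduced_layer,
                          short_finetune, evaluate):
    """Shrink `model` until its measured resource usage (e.g. latency on the target
    device) drops below `budget`, keeping the most accurate candidate at each step."""
    while measure_resource(model) > budget:
        candidates = []
        for layer in model.layers:
            # Propose a version of the model where this layer is reduced enough
            # to save roughly `step` of the resource (e.g. by pruning filters).
            candidate = propose_reduced_layer(model, layer, step)
            candidate = short_finetune(candidate)       # quick adaptation
            candidates.append((evaluate(candidate), candidate))
        # Keep the candidate that best preserves accuracy for this reduction step.
        _, model = max(candidates, key=lambda c: c[0])
    return model  # a final, longer fine-tune would normally follow
```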
With unified metrics, we hope to see more and more people and papers treating the energy problem as seriously as accuracy.
It’s no surprise that the Efficient Deep Learning for Computer Vision Workshop had a strong focus on accuracy vs. watts. Forrest Iandola from DeepScale presented a number of tips and tricks for developing smaller neural nets. XNOR.ai showed how they quantize networks with binary weights to make them more efficient. The new SqueezeNext: Hardware-Aware Neural Network Design network was also presented: it outperforms AlexNet accuracy with 63× fewer parameters (and an 8× speedup).
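As a concrete illustration of binary-weight quantization in the XNOR-Net flavour, the minimal numpy sketch below approximates a real-valued weight tensor W as α·sign(W), with α the mean absolute weight. This is just an illustrative example, not XNOR.ai's implementation.

```python
# Minimal numpy sketch of XNOR-Net-style binary-weight quantization: W ≈ alpha * sign(W).
import numpy as np

def binarize_weights(W):
    """Return binary weights in {-1, +1} and the scaling factor alpha = mean(|W|)."""
    alpha = np.mean(np.abs(W))        # per-tensor scale; real implementations often use per-filter scales
    B = np.where(W >= 0, 1.0, -1.0)   # binary weights
    return B, alpha

W = np.random.randn(64, 3, 3, 3)      # e.g. a conv layer's weights (out, in, kH, kW)
B, alpha = binarize_weights(W)
W_approx = alpha * B                  # used in place of W at inference time
print("mean approximation error:", np.mean(np.abs(W - W_approx)))
```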
It’s also worth noting that processors specialized in deep learning have arrived, such as the recent Huawei Kirin or the Nvidia Jetson Xavier, but the “right” DNN is increasingly platform-dependent (mobile, server, cloud, …).
Neural network efficiency is most definitely a trend for computer vision.
The Embedded Vision Workshop had a rather diverse program covering topics such as low-level hardware/software designs for dedicated image sensor chips, obstacle detection for UAVs, semantic segmentation for autonomous driving and learning methods for efficient keypoint detectors.
Invited talks were given by Prof. Mohan Trivedi (UCSD) on challenges in autonomous driving and human interaction; Cormac Brick (Intel) gave interesting insights into the Intel Compute Stick; Raghuveer Rao presented work on image processing for autonomous platforms at the Army Research Laboratory; Donghwan Lee from our own NAVER LABS introduced the topic of scalable and semantic indoor mapping; and Prof. Warren J. Gross (McGill University) talked about hardware architectures for DNNs.
NAVER LABS Europe awarded the best paper prize to Di Febbo et al. for KCNN: Extremely-Efficient Hardware Keypoint Detection with a Compact Convolutional Neural Network, and the runner-up award to Vallurupalli et al. for Efficient Semantic Segmentation using Gradual Grouping.
The best poster award, sponsored by Intel, went to Bhowmik et al. for Design of a Reconfigurable 3D Pixel-Parallel Neuromorphic Architecture for Smart Image Sensor.
Visual Localization
Visual localization is about estimating the location of a camera within a given area (also referred to as a map). This is particularly valuable when no other localization technique is available, e.g. in GPS-denied environments such as indoor locations. Interesting applications include robot navigation, self-driving cars, indoor positioning and augmented reality. Various input sources are used, the most common being RGB images, images + depth (RGBD) or 3D point clouds (e.g. from lidar). To find the location of a camera using images, you have to establish correspondences between the query image and the map. Structure-based methods do this with descriptor matching between 3D points of the map and keypoints in the query image. Image retrieval-based methods match the query image against the images of the map using global descriptors, and pose regression-based methods try to directly regress the camera pose from the query image.

The biggest challenges in localization are large viewpoint changes between the query image and the map, incomplete maps, textureless areas without valuable visual information, symmetric and repetitive elements, changing light conditions, structural changes, and dynamic areas which cause unpredictable occlusions (e.g. people).
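To make the structure-based flavour concrete, here is a minimal sketch (not tied to any particular paper) of estimating a camera pose from 2D-3D matches with OpenCV's PnP + RANSAC. The match arrays and camera intrinsics below are placeholders; in a real pipeline they would come from descriptor matching against the map and from camera calibration.

```python
# Minimal sketch: camera pose from 2D-3D correspondences with PnP + RANSAC (OpenCV).
import numpy as np
import cv2

pts3d = np.random.rand(100, 3).astype(np.float32)         # placeholder 3D map points
pts2d = np.random.rand(100, 2).astype(np.float32) * 640   # placeholder matched 2D keypoints
K = np.array([[600, 0, 320],                               # placeholder camera intrinsics
              [0, 600, 240],
              [0,   0,   1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d, pts2d, K, distCoeffs=None,
    reprojectionError=8.0, iterationsCount=1000)

if ok:
    R, _ = cv2.Rodrigues(rvec)     # rotation of the map-to-camera transform
    cam_center = -R.T @ tvec       # camera position in map coordinates
    print("inliers:", 0 if inliers is None else len(inliers))
    print("camera center:", cam_center.ravel())
```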
Camposeco et al., in their Hybrid Camera Pose Estimation, present an improvement for geometry- or structure-based methods by leveraging both 2D-3D and 2D-2D correspondences between the query image and the map. The paper asks whether one of the two is always preferable, or whether a hybrid solution (best of both) exists. For example, poorly triangulated 3D points degrade accuracy, so in those cases using 2D-2D matches would probably be better. Their answer is a new RANSAC-based approach that automatically chooses the best solver. The figure below shows how a query image (red) is matched with the model (2D-3D in blue) and with the database images (2D-2D in pink).
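As a rough illustration of the idea (and explicitly not the authors' algorithm, which is much smarter about which solver to sample and how hypotheses are verified), a hybrid RANSAC loop can be sketched as below. Here `solvers` and `score` are hypothetical callables wrapping, say, a P3P solver over 2D-3D matches and a relative-pose solver over 2D-2D matches.

```python
# Toy sketch of a hybrid RANSAC loop mixing solver types; not the method of Camposeco et al.
import random

def hybrid_ransac(solvers, score, n_iters=1000):
    """solvers: callables, each drawing its own minimal sample of one correspondence type
    (2D-3D or 2D-2D) and returning a candidate camera pose, or None on failure.
    score: callable returning how many correspondences (of all types) a pose explains."""
    best_pose, best_inliers = None, -1
    for _ in range(n_iters):
        solver = random.choice(solvers)   # the paper chooses solvers far more cleverly than uniformly
        pose = solver()                   # hypothesise a pose from a minimal sample
        if pose is None:
            continue
        n_inliers = score(pose)           # verify against all 2D-3D and 2D-2D matches
        if n_inliers > best_inliers:
            best_pose, best_inliers = pose, n_inliers
    return best_pose, best_inliers
```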
About the authors:
Claudine Combe and Nicolas Monet are research engineers in the Edge Computing group. Martin Humenberger leads the 3D Vision group at NAVER LABS Europe.
Part 1: Pose Estimation
Part 3: CVPR, Image Retrieval
Part 4: CVPR, 3D Scene Understanding