This is our fourth and final article on CVPR 2018. Lots of new approaches on the (3D) scene and some new datasets to look at.
In the Google tutorial UltraFast 3D Sensing, Reconstruction and Understanding of People, Objects and Environments, Juergen Sturm addressed the need for more robust systems for 3D capture, reconstruction and understanding, in particular for virtual and augmented reality. What it takes to understand physical space was pretty much summarised as being able to answer four questions: Where am I? Can I put something here? What is around me? What might this do?
In their paper Context Encoding for Semantic Segmentation, Zhang et al. propose first classifying what is present in the scene and using that context to filter out implausible labels during segmentation.
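As a rough illustration of the idea (a minimal sketch, not the authors' implementation; the module and loss below are our own simplification), an auxiliary head can predict which classes appear anywhere in the image, supervised directly from the ground-truth segmentation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScenePresenceHead(nn.Module):
    """Toy auxiliary head: predict which semantic classes appear anywhere in the image."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, features):            # features: (B, C, H, W)
        pooled = features.mean(dim=(2, 3))  # global average pooling -> (B, C)
        return self.fc(pooled)              # per-class presence logits

def presence_loss(logits, seg_labels, num_classes):
    """BCE between predicted presence and the classes actually present in the
    ground-truth segmentation (seg_labels: (B, H, W) integer label map)."""
    present = torch.stack(
        [(seg_labels == c).flatten(1).any(dim=1).float() for c in range(num_classes)],
        dim=1)
    return F.binary_cross_entropy_with_logits(logits, present)
```

In training, such a loss would simply be added with a small weight to the usual per-pixel cross-entropy, discouraging the segmentation head from predicting classes that are unlikely to be in the scene.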
In DenseASPP for Semantic Segmentation in Street Scenes, Yang et al. presented a new model for street-scene segmentation, where objects vary widely in scale.
For human pose estimation, Facebook presented DensePose: Dense Human Pose Estimation In The Wild, a model inspired by Mask R-CNN. They also released the related DensePose-COCO dataset with image-to-surface correspondences.
For pose estimation, the paper 3D Pose Estimation and 3D Model Retrieval for Objects in the Wild by Grabner et al. proposes an approach to retrieve 3D CAD models for objects in the wild.
Understanding the 3D structure of the world from still images is of increasing interest to the vision community. Techniques like depth, normal or plane estimation enable better interaction with the surrounding world, and AR apps and robotics directly benefit from this progress.
In a side tutorial, Microsoft showed how to reconstruct a 3D world using the well-known HoloLens AR headset and its new ‘Research’ mode, which gives access to all of the device’s sensors, in particular the depth data.
A number of papers also showed improvements in the resolution of depth estimation from monocular images by including geometric constraints, e.g. Single View Stereo Matching (see Figure below), LEGO: Learning Edge with Geometry all at Once by Watching Videos and GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation. This helps preserve edges in the depth maps.
In the paper PlaneNet: Piece-wise Planar Reconstruction from a Single RGB Image, Liu et al. propose a method that estimates planes in a single image and then derives the depth map from these planes.
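Planes are an appealing intermediate representation partly because converting them back to depth is trivial. Here is a minimal numpy sketch under one common parametrisation (a plane encoded by a vector p such that p·X = 1 for 3D points X on it); the intrinsics and plane values are toy numbers, not taken from the paper:

```python
import numpy as np

def depth_from_planes(plane_params, plane_ids, K):
    """Convert per-pixel plane assignments into a depth map.

    plane_params: (P, 3) array; plane i satisfies plane_params[i] . X = 1
                  for 3D points X lying on it (camera coordinates).
    plane_ids:    (H, W) integer map assigning each pixel to a plane.
    K:            (3, 3) camera intrinsics.
    """
    H, W = plane_ids.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    # A pixel's 3D point is depth * ray; plugging it into p . (depth * ray) = 1
    # gives depth = 1 / (p . ray).
    p = plane_params[plane_ids.reshape(-1)]           # (H*W, 3)
    depth = 1.0 / np.einsum('ij,ji->i', p, rays)
    return depth.reshape(H, W)

# Toy example: a fronto-parallel plane 3 m away with simple pinhole intrinsics.
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
planes = np.array([[0., 0., 1. / 3.]])                # p . X = 1  <=>  Z = 3
print(depth_from_planes(planes, np.zeros((480, 640), dtype=int), K))
```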
In LayoutNet: Reconstructing the 3D Room Layout From a Single RGB Image, Zou et al. predict the 3D layout of an indoor room from a panorama image.
The most popular dataset for training disparity or depth estimation is KITTI. Its data were collected in outdoor scenes, which is well suited to autonomous vehicles but less so to indoor AR experiences. We came across some papers that address this limitation. With MegaDepth: Learning Single-View Depth Prediction from Internet Photos, Li and Snavely released a depth dataset built from Internet photos of a wide variety of scenes. In Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains, Pang et al. propose an approach for cross-domain adaptation.
For the task of optical flow estimation, which consists of estimating the displacement of every pixel between video frames, the focus has shifted towards efficient CNN architectures.
Inspired by traditional coarse-to-fine approaches, Sun et al. propose PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume, which extracts a feature pyramid for each image and then leverages a warping layer and a cost volume layer to estimate optical flow with a good trade-off between accuracy and speed. Similarly, Hui et al. present LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation, which significantly reduces the number of weights of the state-of-the-art CNN-based FlowNet2 and runs faster while being more accurate. Finally, Lu et al. introduced Devon: Deformable Volume Network for Learning Optical Flow for fast optical flow inference without requiring any warping that would deform images.
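For readers unfamiliar with these building blocks, here is a rough PyTorch sketch of the two ingredients PWC-Net combines at each pyramid level: warping the second feature map with the current flow estimate, and computing a local cost volume between the two. It is a simplified illustration rather than the paper's implementation, and the search range is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def warp(features, flow):
    """Backward-warp a feature map (B, C, H, W) with a flow field (B, 2, H, W);
    flow channel 0 is the horizontal displacement."""
    _, _, H, W = features.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    base = torch.stack([xs, ys], dim=0).float().to(features.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                 # where each pixel samples from
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0           # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)              # (B, H, W, 2)
    return F.grid_sample(features, grid, align_corners=True)

def cost_volume(feat1, feat2_warped, max_disp=4):
    """Correlation of feat1 with feat2_warped over a (2*max_disp+1)^2 neighbourhood."""
    _, _, H, W = feat1.shape
    padded = F.pad(feat2_warped, [max_disp] * 4)
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + H, dx:dx + W]
            costs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)                    # (B, (2*max_disp+1)^2, H, W)
```

At each pyramid level, a small estimator network would take cost_volume(feat1, warp(feat2, upsampled_flow)) together with feat1 as input and predict a flow refinement.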
These approaches assume that ground-truth optical flow is available at training time. In the past, unsupervised approaches were proposed to train CNN-based optical flow methods with a smoothness prior and a photometric loss. Wang et al. extend them in Occlusion Aware Unsupervised Learning of Optical Flow by estimating both forward and backward flow, which makes it possible to model occlusions, as illustrated below.
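The underlying check can be sketched in a few lines: a pixel is marked occluded when its forward flow disagrees with the backward flow at the pixel it maps to, and the photometric loss is only evaluated on visible pixels. The sketch below (numpy, nearest-neighbour lookup for brevity) follows a common forward-backward consistency criterion with arbitrary threshold constants, rather than the paper's exact formulation:

```python
import numpy as np

def visibility_mask(flow_fwd, flow_bwd, alpha=0.05, beta=0.5):
    """Forward-backward consistency check.

    flow_fwd, flow_bwd: (H, W, 2) flows from frame 1 to 2 and from 2 to 1.
    Returns a boolean (H, W) mask that is True where the pixel is visible
    in both frames, i.e. not occluded.
    """
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Where does each pixel of frame 1 land in frame 2?
    xt = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, W - 1)
    yt = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, H - 1)
    bwd_at_target = flow_bwd[yt, xt]        # backward flow at the target location
    diff = flow_fwd + bwd_at_target         # ~0 when the two flows are consistent
    sq_err = (diff ** 2).sum(-1)
    bound = alpha * ((flow_fwd ** 2).sum(-1) + (bwd_at_target ** 2).sum(-1)) + beta
    return sq_err < bound

def masked_photometric_loss(img1, img2_warped, visible):
    """Mean absolute photometric error over visible (non-occluded) pixels only."""
    err = np.abs(img1 - img2_warped).mean(-1)
    return err[visible].mean()
```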
Yin and Shi propose GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose, which learns dense depth, optical flow and camera pose without supervision. The approach is illustrated in the Figure below: given a series of frames, depth maps and camera motion are estimated and combined into a rigid flow, which is then refined to account for moving objects.
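The rigid-flow step itself is compact: back-project each pixel with its depth, apply the relative camera motion, re-project, and take the difference of pixel coordinates. Below is a minimal numpy sketch under a standard pinhole model; it is not GeoNet's code and ignores the residual flow that accounts for moving objects:

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced purely by camera motion.

    depth: (H, W) depth map of the source frame (assumed > 0).
    K:     (3, 3) camera intrinsics.
    R, t:  rotation (3, 3) and translation (3,) from source to target camera.
    Returns a flow field of shape (H, W, 2).
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1).astype(float)
    # Back-project to 3D in the source camera, then move into the target camera.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = R @ pts + t.reshape(3, 1)
    # Re-project and take the difference of pixel coordinates.
    proj = K @ pts
    proj = proj[:2] / proj[2:3]
    return (proj - pix[:2]).T.reshape(H, W, 2)
```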
A general observation from the workshops and presentations is the tremendous interest in understanding the world around us from standard cameras. The techniques being used are both interesting and useful for robotic navigation, autonomous driving and AR apps, even if there is still a lot to solve, especially for real-world applications. There's no doubt this area of research will keep improving over the next few years and will serve a much broader range of applications.
About the authors.
Philippe Weinzaepfel is a researcher in the computer vision group at NAVER LABS Europe. He presented PoTion: Pose Motion Representation for Action Recognition at CVPR 2018. This work is carried out in collaboration with INRIA.
Claudine Combe and Nicolas Monet are research engineers in the Edge Computing group.
Part 1: Pose Estimation
Part 2: Embedded Vision and Visual Localization
Part 3: CVPR, Image Retrieval