Assessing performance on data not seen during training is critical to validating machine learning models. In computer vision, however, experimentally measuring the actual robustness and generalization performance of high-level recognition methods is difficult in practice, especially in video analysis, due to high data acquisition and labeling costs.
Furthermore, it is sometimes nearly impossible to acquire data for some test scenarios of interest (e.g., storms, accidents, …). In this work, we show how to leverage the recent progress in computer graphics (especially off-the-shelf tools like game engines) to generate photo-realistic virtual worlds useful to assess the performance of video analysis algorithms.
The main benefits of our approach are (i) the low cost of data generation, including high-quality, detailed annotations, (ii) the flexibility to automatically generate rich and varied scenes and their annotations, including under rare conditions, to perform “what-if” and “ceteris paribus” analyses, and (iii) techniques to quantify the “transferability of conclusions” from synthetic to real-world data.
The main novel idea behind our approach is to initialize the virtual worlds from 3D synthetic clones of real-world video sequences.
Citation: CVPR 2016, Las Vegas, Nevada, USA; June 26th – July 1st, 2016.
Also: MIT Technology Review | 16th March 2016
Modern computer vision algorithms typically require expensive data acquisition and accurate manual labeling. In this work, we instead leverage the recent progress in computer graphics to generate fully labeled, dynamic, and photo-realistic proxy virtual worlds. We propose an efficient real-to-virtual world cloning method, and validate our approach by building and publicly releasing a new video dataset, called “Virtual KITTI”, automatically labeled with accurate ground truth for object detection, tracking, scene and instance segmentation, depth, and optical flow. We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance. As the gap between real and virtual worlds is small, virtual worlds enable measuring the impact of various weather and imaging conditions on recognition performance, all other things being equal. We show that these factors may drastically affect otherwise high-performing deep models for tracking.