Announcing Virtual KITTI 2 - Naver Labs Europe

New release of the popular synthetic image dataset for training and testing.

Machine learning has led to astonishing developments in many fields, but very large amounts of data are often required for training, which can be challenging to collect. In autonomous driving systems, this is a particular problem, because data must be collected by manually driven vehicles equipped with special sensors and then annotated, often by hand.

In 2015, a researcher in our computer vision team, a keen video game player, wondered whether video game technology could be used to create a large amount of training data for machine learning systems quickly and cheaply. In a video-game world, it would be easy to create data for rare events and to generate scenes that differ in only one condition, such as the weather. Moreover, exact ground truth could be generated along with the data, so little annotation would be required.

To explore this idea, our researchers used the Unity game engine to carefully recreate real-world videos from the popular KITTI autonomous driving benchmark suite [1]. The result, Virtual KITTI [2], was one of the first synthetic datasets for training and testing machine learning models for autonomous driving applications.

Five years later, the need for machine learning data is greater than ever, and we’re excited to release a new, more photorealistic version of the dataset [3]: Virtual KITTI 2.


Example images from the original KITTI (top row), Virtual KITTI (middle row), and Virtual KITTI 2 (bottom row) datasets.

What’s new in Virtual KITTI 2

The first Virtual KITTI consists of five driving video sequences cloned from the original KITTI dataset. Virtual KITTI 2 consists of the same five sequence clones as Virtual KITTI, but has the following new features.

Increased photorealism: One of the unique advantages of synthetic data is that the same scenario can be recreated with small changes in conditions such as lighting or viewing angle. As in the original dataset, each sequence is provided with small horizontal rotations of the cameras (±15° and ±30°) as well as changes in the weather (fog, clouds, and rain) and time of day (morning and sunset). Advances in the Unity game engine (version 2018.4) [4] bring the basic image sequences closer to those of the original real KITTI dataset. Moreover, Virtual KITTI 2 exploits recent improvements in the game engine's lighting and post-processing, so that the changes in the virtual sequences are even closer to real changes in conditions.
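As a rough illustration of how the variations listed above multiply out, here is a minimal Python sketch that enumerates every sequence and condition pair. All the labels in it (`sequence_1`, `rot+15`, `clone`, and so on) are hypothetical names chosen for this example; the dataset's actual folder or file names may differ.

```python
from itertools import product

# Hypothetical labels; the dataset's actual folder or file names may differ.
SEQUENCES = [f"sequence_{i}" for i in range(1, 6)]    # five cloned sequences
ROTATIONS = ["rot-30", "rot-15", "rot+15", "rot+30"]  # camera rotations in degrees
WEATHER = ["fog", "clouds", "rain"]
TIME_OF_DAY = ["morning", "sunset"]

# "clone" stands for the unmodified recreation of the real KITTI sequence.
VARIANTS = ["clone"] + ROTATIONS + WEATHER + TIME_OF_DAY


def all_conditions():
    """Return every (sequence, variant) pair as a list of tuples."""
    return list(product(SEQUENCES, VARIANTS))


if __name__ == "__main__":
    pairs = all_conditions()
    print(len(VARIANTS), "variants per sequence,", len(pairs), "pairs in total")
```

A list like this can drive an evaluation loop that runs the same model on each condition and compares the scores, which is how a single factor such as fog can be isolated.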

Stereo cameras: The original Virtual KITTI dataset provides images from one camera. However, the real KITTI dataset provides stereo images from two cameras. In Virtual KITTI 2, a new camera has been added to provide stereo images, enabling this dataset to be used with the wide range of existing methods that employ more than one camera.
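To make the value of a second camera concrete: for a rectified stereo pair, depth follows from disparity through the standard pinhole relation Z = f · B / d. The sketch below assumes rectified images, and the focal length and baseline in it are illustrative placeholders, not the dataset's calibration values.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d.

    disparity_px: horizontal pixel shift of a point between left and right image.
    focal_px:     focal length in pixels.
    baseline_m:   distance between the two camera centres in metres.
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Placeholder calibration, NOT taken from the dataset.
    FOCAL_PX = 700.0
    BASELINE_M = 0.5
    print(depth_from_disparity(35.0, FOCAL_PX, BASELINE_M))  # 10.0
```

Because the dataset also provides ground-truth depth, a stereo method's disparity output can be converted this way and compared directly against it.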

Additional ground truth: Each Virtual KITTI camera renders an RGB image. It also renders several types of ground truth: class segmentation, instance segmentation, depth, and forward optical flow. For each sequence, camera parameters as well as vehicle colour, pose, and bounding boxes are provided. Virtual KITTI 2 additionally provides backward optical flow as well as forward and backward scene flow images.
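One common use of having both forward and backward optical flow is a forward-backward consistency check, which flags pixels that are likely occluded in the next frame. The following is a minimal NumPy sketch of that check, not part of the dataset's tooling. It assumes the flow has already been decoded into arrays of shape (H, W, 2) holding (dx, dy) in pixels; decoding the dataset's files into that layout is outside the sketch.

```python
import numpy as np


def forward_backward_error(flow_fw, flow_bw):
    """Per-pixel forward-backward flow inconsistency, in pixels.

    flow_fw: (H, W, 2) flow from frame t to t+1, stored as (dx, dy).
    flow_bw: (H, W, 2) flow from frame t+1 to t, same layout.
    Pixels whose forward target falls outside the image get NaN.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame t lands in frame t+1 (nearest neighbour).
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)

    err = np.full((h, w), np.nan)
    # For a consistent pair, the forward flow at p and the backward flow
    # sampled at p + flow_fw(p) cancel out, so their sum is close to zero.
    residual = flow_fw[inside] + flow_bw[yt[inside], xt[inside]]
    err[inside] = np.linalg.norm(residual, axis=-1)
    return err


if __name__ == "__main__":
    # Toy example: everything moves one pixel to the right, and back again.
    fw = np.zeros((4, 5, 2))
    fw[..., 0] = 1.0
    bw = np.zeros((4, 5, 2))
    bw[..., 0] = -1.0
    print(np.nanmax(forward_backward_error(fw, bw)))
```

Pixels with a large error (the threshold is a design choice) are typically treated as occluded or unreliable and masked out when training or evaluating flow methods.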

Testing Virtual KITTI 2

To showcase the capabilities of Virtual KITTI 2, we repeated the original multi-object tracking experiments of Gaidon et al. [2] and added new ones on stereo matching, monocular depth estimation, camera pose estimation, and semantic segmentation to demonstrate the multiple potential uses of the dataset [3].

Multi-object tracking: In these experiments, two trackers, dynamic programming min-cost flow [5] and Markov decision process [6], were compared on Virtual KITTI, Virtual KITTI 2, and KITTI image sequences. Our results show that the original conclusion of [2] holds: for multi-object tracking performance, the gap between real and virtual data is small.

Stereo matching: The dense deep learning-based stereo matching method GANet [7] was evaluated on the Virtual KITTI 2 and KITTI images. Our clone of the real sequence yields results similar to those of the real sequence. Moreover, we found that while camera rotations and lighting conditions do not affect GANet results, fog and rain do. This finding would be difficult to obtain without the advantages of synthetic data.

Monocular depth and camera pose estimation: The deep learning-based estimation algorithm SfMLearner [8] was used to obtain monocular depth and camera pose on Virtual KITTI, Virtual KITTI 2, and KITTI images. The results on Virtual KITTI 2 are similar to those obtained using the original KITTI images.

Semantic segmentation: Virtual KITTI 2’s ground-truth semantic segmentation annotations were used to evaluate the state-of-the-art urban scene segmentation method AdapNet++ [9]. The results show that AdapNet++ performs better on RGB images than on depth images, which is consistent with the results of the original AdapNet++ study on real images. Moreover, the results were numerically similar to those obtained using the real KITTI images, indicating the potential utility of synthetic data for semantic segmentation evaluation.


Example semantic segmentation results for RGB-depth pairs. Each image block shows (top row) the original input frame (left: RGB, right: depth), (middle row) ground-truth, and (bottom row) predicted segmentation.
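For readers who want to score predicted segmentations against ground-truth label maps like those shown above, here is a minimal NumPy sketch of mean intersection-over-union (mIoU). This post does not say which metric was used in the experiments; mIoU is assumed here because it is a common choice for semantic segmentation. The function name and the integer class-id encoding are also assumptions of this example.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: integer class-id arrays of the same shape,
                  with values in the range [0, num_classes).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return float((intersection[valid] / union[valid]).mean())


if __name__ == "__main__":
    target = [[0, 0], [1, 1]]
    pred = [[0, 1], [1, 1]]
    # Class 0: IoU 1/2. Class 1: IoU 2/3. Class 2 is absent and ignored.
    print(mean_iou(pred, target, num_classes=3))
```

Running the same function on predictions for the synthetic clone and for the corresponding real sequence gives the kind of side-by-side comparison described in the semantic segmentation experiment.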

Using Virtual KITTI 2

Since the introduction of Virtual KITTI in 2015, several other synthetic datasets have appeared. They’ve proven useful in evaluating preliminary prototypes and, in combination with real-world datasets, they can improve performance [2,10].

Together, these datasets have demonstrated that while synthetic datasets cannot completely replace real-world data, they are a cost-effective alternative with good transferability. This new, expanded version of the dataset enables investigations that wouldn’t be practical otherwise and which, we hope, will contribute to future developments in autonomous driving technology.

To start using the dataset, you can download it here:

Our paper with more information about the experiments and usage of the new dataset is available as an arXiv preprint [3].

[1] The KITTI Vision Benchmark Suite.

[2] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

[3] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv:2001.10773, 2020.

[4] Unity Real-Time Development.

[5] Hamed Pirsiavash, Deva Ramanan, and Charles C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.

[6] Yu Xiang, Alexandre Alahi, and Silvio Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, 2015.

[7] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip H. S. Torr. GANet: Guided aggregation net for end-to-end stereo matching. In CVPR, 2019.

[8] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

[9] Abhinav Valada, Rohit Mohan, and Wolfram Burgard. Self-supervised model adaptation for multimodal semantic segmentation. International Journal of Computer Vision, Jul. 2019. Special Issue: Deep Learning for Robotic Vision.

[10] César De Souza, Adrien Gaidon, Yohann Cabon, and Antonio Lopez. Procedural generation of videos to train deep action recognition networks. In CVPR, 2017.