Virtual KITTI 2 Dataset
Virtual KITTI 2 is a more photo-realistic and better-featured version of the original virtual KITTI dataset. It exploits recent improvements of the Unity game engine and provides new data such as stereo images or scene flow.
Blog article: Announcing Virtual KITTI 2
Terms of Use and Reference
PLEASE READ THESE TERMS CAREFULLY BEFORE DOWNLOADING THE VIRTUAL KITTI 2 DATASET. DOWNLOADING OR USING THE DATASET MEANS YOU ACCEPT THESE TERMS.
The Virtual KITTI 2 Dataset may be used for non-commercial purposes only and is subject to the Creative Commons Attribution-NonCommercial-ShareAlike 3.0, a summary of which is located here.
COPYRIGHT
Copyrights in the Virtual KITTI 2 dataset are owned by Naver Corporation.
ACKNOWLEDGMENT
The Virtual KITTI 2 dataset is an adaptation of the Virtual KITTI 1.3.1 dataset as described in the papers below.
CITATION
When using or referring to this dataset in your research, please cite the papers below and cite Naver as the originator of Virtual KITTI 2, an adaptation of Xerox’s Virtual KITTI Dataset.
BibTex:
@misc{cabon2020vkitti2,
  title={Virtual KITTI 2},
  author={Cabon, Yohann and Murray, Naila and Humenberger, Martin},
  year={2020},
  eprint={2001.10773},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{gaidon2016virtual,
  title={Virtual worlds as proxy for multi-object tracking analysis},
  author={Gaidon, Adrien and Wang, Qiao and Cabon, Yohann and Vig, Eleonora},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={4340--4349},
  year={2016}
}
Download
We provide one .tar[.gz] archive per type of data as described below. Here is a list of the MD5 checksums for each archive.
vkitti_2.0.3_rgb.tar
vkitti_2.0.3_depth.tar
vkitti_2.0.3_classSegmentation.tar
vkitti_2.0.3_instanceSegmentation.tar
vkitti_2.0.3_textgt.tar.gz
vkitti_2.0.3_forwardFlow.tar
vkitti_2.0.3_backwardFlow.tar
vkitti_2.0.3_forwardSceneFlow.tar
vkitti_2.0.3_backwardSceneFlow.tar
Format Description
Because of the changes in implementation between Virtual KITTI 1 and Virtual KITTI 2, the format of Virtual KITTI 2 is closer to that of the Virtual Gallery dataset.
SceneX/Y/frames/rgb/Camera_Z/rgb_%05d.jpg
SceneX/Y/frames/depth/Camera_Z/depth_%05d.png
SceneX/Y/frames/classsegmentation/Camera_Z/classgt_%05d.png
SceneX/Y/frames/instancesegmentation/Camera_Z/instancegt_%05d.png
SceneX/Y/frames/backwardFlow/Camera_Z/backwardFlow_%05d.png
SceneX/Y/frames/backwardSceneFlow/Camera_Z/backwardSceneFlow_%05d.png
SceneX/Y/frames/forwardFlow/Camera_Z/flow_%05d.png
SceneX/Y/frames/forwardSceneFlow/Camera_Z/sceneFlow_%05d.png
SceneX/Y/colors.txt
SceneX/Y/extrinsic.txt
SceneX/Y/intrinsic.txt
SceneX/Y/info.txt
SceneX/Y/bbox.txt
SceneX/Y/pose.txt

where
X ∈ {01, 02, 06, 18, 20} represents one of the 5 different locations,
Y ∈ {15-deg-left, 15-deg-right, 30-deg-left, 30-deg-right, clone, fog, morning, overcast, rain, sunset} represents one of the 10 variations,
Z ∈ {0, 1} represents the left camera (same as in Virtual KITTI 1) or the right camera (offset by 0.532725m to the right).
Note that our indexes always start from 0.
RGB: link (7.01GB)
All RGB images are encoded as RGB JPG files with 8 bits per channel.
Depth: link (7.58GB)
All depth images are encoded as grayscale 16-bit PNG files. We use a fixed far plane of 655.35 meters (pixels farther away are clipped; however, this is not relevant for this dataset). This allows us to truncate and normalize the z values to the [0; 2^16 - 1] integer range, such that a pixel intensity of 1 in our single-channel PNG16 depth images corresponds to a distance of 1cm to the camera plane. The depth map in centimeters can be directly loaded in Python with numpy and OpenCV via the one-liner (assuming "import cv2"):
depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
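To work in meters rather than centimeters, the loaded uint16 array can be rescaled; a minimal sketch using numpy (marking the clipped far-plane pixels as NaN is our own choice here, not part of the dataset specification):

```python
import numpy as np

def depth_cm_to_meters(depth_png):
    """Convert a uint16 VKITTI depth image (1 unit == 1 cm) to float32 meters.

    Pixels at 65535 sit on the fixed far plane (655.35 m, clipped sky/background);
    this sketch optionally marks them as invalid with NaN.
    """
    depth_m = depth_png.astype(np.float32) / 100.0  # 1 intensity unit == 1 cm
    depth_m[depth_png == np.iinfo(np.uint16).max] = np.nan  # clipped far plane
    return depth_m

# toy example: 123 (cm) -> 1.23 m, 65535 -> NaN (clipped)
toy = np.array([[123, 65535]], dtype=np.uint16)
meters = depth_cm_to_meters(toy)
```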
Class Segmentation: link (969MB)
All class segmentation images are encoded as RGB PNG files with 8-bit per channel.
Colors are described in the colors.txt file (described below).
Instance Segmentation: link (166MB)
All instance segmentation images are encoded as 8-bit indexed PNG files.
The value of a pixel is equal to (trackID+1). A value of 0 means it is not a vehicle.
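Per-object binary masks can be recovered from the decoded index image; an illustrative helper (not part of the dataset tools), assuming the indexed PNG has been loaded as an array of palette indices (e.g., via PIL's Image.open followed by numpy.array, since cv2.imread expands palette PNGs to color):

```python
import numpy as np

def instance_masks(index_img):
    """Split a VKITTI instance index image into per-track boolean masks.

    index_img: 2D uint8 array of palette indices, where 0 means "not a vehicle"
    and a value v > 0 belongs to the vehicle with trackID == v - 1.
    Returns a dict mapping trackID -> boolean mask.
    """
    masks = {}
    for v in np.unique(index_img):
        if v == 0:  # background / not a vehicle
            continue
        masks[int(v) - 1] = index_img == v  # pixel value is trackID + 1
    return masks
```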
Optical/Scene flow
Forward optical flow: link (29.6GB)
Backward optical flow: link (27.1GB)
Forward scene flow: link (14.8GB)
Backward scene flow: link (14.8GB)
For all types of flow, images are encoded as RGB PNG files with 16 bits per channel.
Optical flow:
Forward – rgb at t + optical flow at t = rgb at t+1
Backward – rgb at t + optical flow at t = rgb at t-1
In R, flow along the x-axis, normalized by image width and quantized to [0; 2^16 - 1]
In G, flow along the y-axis, normalized by image height and quantized to [0; 2^16 - 1]
B = 0 for invalid flow (e.g., sky pixels)
Some example decoding code in Python using OpenCV and numpy:

    import cv2
    import numpy as np

    def read_vkitti_png_flow(flow_fn):
        "Convert from .png to (h, w, 2) (flow_x, flow_y) float32 array"
        # read png to bgr in 16 bit unsigned short
        bgr = cv2.imread(flow_fn, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
        h, w, _c = bgr.shape
        assert bgr.dtype == np.uint16 and _c == 3
        # b == invalid flow flag == 0 for sky or other invalid flow
        invalid = bgr[..., 0] == 0
        # g, r == flow_y, x normalized by height, width and scaled to [0; 2**16 - 1]
        out_flow = 2.0 / (2 ** 16 - 1.0) * bgr[..., 2:0:-1].astype('f4') - 1
        out_flow[..., 0] *= w - 1
        out_flow[..., 1] *= h - 1
        out_flow[invalid] = 0  # or another value (e.g., np.nan)
        return out_flow
Scene flow :
Scene flow encodes, for each pixel, the 3D movement in camera space of the corresponding surface point between two frames.
Forward – displacement of the point in meters between t and t+1, with each component clamped between -10m and 10m
Backward – displacement of the point in meters between t and t-1, with each component clamped between -10m and 10m
To get the values in meters, you have to do:
dx = ((sceneFlow.r * 2.0 / 65535.0) - 1.0) * 10.0  # from [0, 2^16-1] to [0, 2], then [-1, 1], then [-10, 10]
dy = ((sceneFlow.g * 2.0 / 65535.0) - 1.0) * 10.0
dz = ((sceneFlow.b * 2.0 / 65535.0) - 1.0) * 10.0
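Mirroring the optical-flow reader above, a scene-flow PNG can be decoded the same way; a sketch assuming the image has already been loaded as a uint16 BGR array (e.g., with cv2.imread(fn, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)), so the r/g/b channels of the formula sit at indices 2/1/0:

```python
import numpy as np

def decode_vkitti_sceneflow(bgr):
    """Decode a uint16 BGR scene-flow array to (h, w, 3) float32 (dx, dy, dz) in meters.

    Each channel maps [0, 2^16 - 1] -> [-10, 10] meters, per the formula above.
    """
    assert bgr.dtype == np.uint16 and bgr.shape[-1] == 3
    rgb = bgr[..., ::-1].astype(np.float32)  # reverse B,G,R -> R,G,B == dx,dy,dz
    return (rgb * 2.0 / 65535.0 - 1.0) * 10.0

# toy check: channel value 65535 decodes to +10 m, 0 decodes to -10 m
toy = np.full((1, 1, 3), 65535, dtype=np.uint16)
```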
Camera parameters, Object Detection (2D & 3D) and Multi-Object Tracking Ground Truth: link (23.3MB)
In our system of 3D camera coordinates x is going to the right, y is going down, and z is going forward (the origin is the optical center of the camera).
colors.txt
Category      r    g    b
Terrain       210  0    200
Sky           90   200  255
Tree          0    199  0
Vegetation    90   240  0
Building      140  140  140
Road          100  60   100
GuardRail     250  100  255
TrafficSign   255  255  0
TrafficLight  200  200  0
Pole          255  130  0
Misc          80   80   80
Truck         160  60   60
Car           255  127  80
Van           0    139  139
Undefined     0    0    0
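A class-segmentation image can be mapped back to integer label ids with this table; a minimal sketch, assuming the image has been loaded as an (h, w, 3) RGB uint8 array (note that cv2.imread returns BGR, so reverse the channel order first):

```python
import numpy as np

# (category, r, g, b) rows copied from colors.txt
VKITTI_COLORS = [
    ("Terrain", 210, 0, 200), ("Sky", 90, 200, 255), ("Tree", 0, 199, 0),
    ("Vegetation", 90, 240, 0), ("Building", 140, 140, 140), ("Road", 100, 60, 100),
    ("GuardRail", 250, 100, 255), ("TrafficSign", 255, 255, 0),
    ("TrafficLight", 200, 200, 0), ("Pole", 255, 130, 0), ("Misc", 80, 80, 80),
    ("Truck", 160, 60, 60), ("Car", 255, 127, 80), ("Van", 0, 139, 139),
    ("Undefined", 0, 0, 0),
]

def rgb_to_label_ids(rgb_img):
    """Map an (h, w, 3) RGB uint8 class-segmentation image to (h, w) label ids.

    The label id is the row index in VKITTI_COLORS; unmatched pixels get -1.
    """
    label_ids = np.full(rgb_img.shape[:2], -1, dtype=np.int32)
    for idx, (_name, r, g, b) in enumerate(VKITTI_COLORS):
        match = np.all(rgb_img == np.array([r, g, b], dtype=np.uint8), axis=-1)
        label_ids[match] = idx
    return label_ids
```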
extrinsic.txt
Header: frame cameraID r1,1 r1,2 r1,3 t1 r2,1 r2,2 r2,3 t2 r3,1 r3,2 r3,3 t3 0 0 0 1

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
ri,j: coefficients of the camera rotation matrix R
ti: coefficients of the camera translation vector t

            r1,1 r1,2 r1,3 t1
Extrinsic = r2,1 r2,2 r2,3 t2
            r3,1 r3,2 r3,3 t3
            0    0    0    1
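Since each row holds the 16 entries of the matrix in row-major order after the frame and camera indices, a row can be reshaped directly into a 4x4 matrix; a minimal parsing sketch over the file content (the function name and dict layout are our own illustration):

```python
import numpy as np

def read_extrinsics(extrinsic_txt):
    """Parse extrinsic.txt content into {(frame, cameraID): 4x4 float64 matrix}.

    extrinsic_txt: full file content as a string, first line being the header.
    """
    extrinsics = {}
    for line in extrinsic_txt.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        frame, camera_id = int(fields[0]), int(fields[1])
        # remaining 16 values are the row-major 4x4 [R|t; 0 0 0 1] matrix
        extrinsics[(frame, camera_id)] = np.array(fields[2:], dtype=np.float64).reshape(4, 4)
    return extrinsics
```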
intrinsic.txt
Header: frame cameraID K[0,0] K[1,1] K[0,2] K[1,2]

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
K[0,0]: coefficient of the 1st line, 1st column
K[1,1]: coefficient of the 2nd line, 2nd column
K[0,2]: coefficient of the 1st line, 3rd column
K[1,2]: coefficient of the 2nd line, 3rd column

            K[0,0]   0     K[0,2]
Intrinsic =   0    K[1,1]  K[1,2]
              0      0       1
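As an illustration of how the intrinsics combine with the depth maps, a pixel (u, v) with depth z can be backprojected into the camera frame described above (x right, y down, z forward); a hedged sketch, with placeholder intrinsic values rather than the dataset's actual calibration:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Backproject pixel (u, v) with depth z (meters, along the optical axis)
    into 3D camera coordinates (x right, y down, z forward).

    fx == K[0,0], fy == K[1,1], cx == K[0,2], cy == K[1,2].
    """
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.array([x, y, z])

# at the principal point the ray is the optical axis: (0, 0, z)
# (fx, fy, cx, cy below are made-up placeholders, read the real ones from intrinsic.txt)
p = backproject(620.5, 187.0, 5.0, fx=725.0, fy=725.0, cx=620.5, cy=187.0)
```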
info.txt
Header: trackID label model color

trackID: track identification number (unique for each object instance)
label: KITTI-like name of the 'type' of the object (Car, Van)
model: the name of the 3D model used to render the object (can be used for fine-grained recognition)
color: the name of the color of the object
bbox.txt
Header: frame cameraID trackID left right top bottom number_pixels truncation_ratio occupancy_ratio isMoving

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
trackID: track identification number (unique for each object instance)
left, right, top, bottom: KITTI-like 2D 'bbox', bounding box in pixel coordinates (inclusive, (0,0) origin is on the upper left corner of the image)
number_pixels: number of pixels of this vehicle in the image
truncation_ratio: object 2D truncation ratio in [0..1] (0: no truncation, 1: entirely truncated)
occupancy_ratio: object 2D occupancy ratio (fraction of non-occluded pixels) in [0..1] (0: fully occluded, 1: fully visible, independent of truncation)
isMoving: 0/1 flag to indicate whether the object is really moving between this frame and the next one
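Since the file is whitespace-separated with a self-describing header, rows can be parsed generically; a small illustrative filter keeping only reasonably visible boxes (the thresholds are arbitrary examples, not a dataset recommendation):

```python
def visible_boxes(bbox_txt, min_occupancy=0.5, min_pixels=100):
    """Parse bbox.txt content (string, header on the first line) and return the
    rows, as dicts keyed by the header fields, that pass visibility thresholds."""
    lines = bbox_txt.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if (float(row["occupancy_ratio"]) >= min_occupancy
                and int(row["number_pixels"]) >= min_pixels):
            rows.append(row)
    return rows
```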
pose.txt
Header: frame cameraID trackID alpha width height length world_space_X world_space_Y world_space_Z rotation_world_space_y rotation_world_space_x rotation_world_space_z camera_space_X camera_space_Y camera_space_Z rotation_camera_space_y rotation_camera_space_x rotation_camera_space_z

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
trackID: track identification number (unique for each object instance)
alpha: KITTI-like observation angle of the object in [-pi..pi]
width, height, length: KITTI-like 3D object 'dimensions' in meters
world_space_X, world_space_Y, world_space_Z: KITTI-like 3D object 'location', respectively x, y, z in world coordinates in meters (center of bottom face of 3D bounding box)
rotation_world_space_y, rotation_world_space_x, rotation_world_space_z: yaw, pitch, roll in world coordinates [-pi..pi] (KITTI convention is ry == 0 iff object is aligned with x-axis and pointing right)
camera_space_X, camera_space_Y, camera_space_Z: KITTI-like 3D object 'location', respectively x, y, z in camera coordinates in meters (center of bottom face of 3D bounding box)
rotation_camera_space_y, rotation_camera_space_x, rotation_camera_space_z: yaw, pitch, roll in camera coordinates [-pi..pi] (KITTI convention is ry == 0 iff object is aligned with x-axis and pointing right)
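The world-space and camera-space locations of an object are linked by the extrinsic matrix of the same frame and camera; a sanity-check sketch, assuming the extrinsic has been parsed into a 4x4 matrix that maps world coordinates to camera coordinates (as laid out in extrinsic.txt):

```python
import numpy as np

def world_to_camera(extrinsic, world_xyz):
    """Apply a 4x4 world-to-camera extrinsic matrix to a 3D world-space point."""
    homogeneous = np.append(np.asarray(world_xyz, dtype=np.float64), 1.0)
    return (extrinsic @ homogeneous)[:3]

# toy extrinsic: a pure translation of -2 m along z, so z=10 maps to z=8
E = np.eye(4)
E[2, 3] = -2.0
```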
Contact: For questions, please contact Yohann Cabon.