Proxy Virtual Worlds VKITTI 2 - Naver Labs Europe

Virtual KITTI 2 Dataset


Virtual KITTI 2 is a more photo-realistic and better-featured version of the original Virtual KITTI dataset. It exploits recent improvements of the Unity game engine and provides new data such as stereo images and scene flow.

 

Blog article: Announcing Virtual KITTI 2

Terms of Use and Reference

PLEASE READ THESE TERMS CAREFULLY BEFORE DOWNLOADING THE VIRTUAL KITTI 2 DATASET. DOWNLOADING OR USING THE DATASET MEANS YOU ACCEPT THESE TERMS.

The Virtual KITTI 2 Dataset may be used for non-commercial purposes only and is subject to the Creative Commons Attribution-NonCommercial-ShareAlike 3.0, a summary of which is located here.

COPYRIGHT

Copyrights in the Virtual KITTI 2 dataset are owned by Naver Corporation.

ACKNOWLEDGMENT

The Virtual KITTI 2 dataset is an adaptation of the Virtual KITTI 1.3.1 dataset as described in the papers below.

CITATION

When using or referring to this dataset in your research, please cite the papers below and cite Naver as the originator of Virtual KITTI 2, an adaptation of Xerox’s Virtual KITTI Dataset.

BibTex:

@misc{cabon2020vkitti2,
  title={Virtual KITTI 2},
  author={Cabon, Yohann and Murray, Naila and Humenberger, Martin},
  year={2020},
  eprint={2001.10773},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
pdf
arXiv

@inproceedings{gaidon2016virtual,
  title={Virtual worlds as proxy for multi-object tracking analysis},
  author={Gaidon, Adrien and Wang, Qiao and Cabon, Yohann and Vig, Eleonora},
  booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
  pages={4340--4349},
  year={2016}
}
pdf

Download
We provide one .tar[.gz] archive per type of data as described below. Here is a list of the MD5 checksums for each archive.

vkitti_2.0.3_rgb.tar
vkitti_2.0.3_depth.tar
vkitti_2.0.3_classSegmentation.tar
vkitti_2.0.3_instanceSegmentation.tar
vkitti_2.0.3_textgt.tar.gz
vkitti_2.0.3_forwardFlow.tar
vkitti_2.0.3_backwardFlow.tar
vkitti_2.0.3_forwardSceneFlow.tar
vkitti_2.0.3_backwardSceneFlow.tar


Format Description
Because of implementation changes between Virtual KITTI 1 and Virtual KITTI 2, the format of Virtual KITTI 2 is closer to that of the Virtual Gallery dataset.

SceneX/Y/frames/rgb/Camera_Z/rgb_%05d.jpg
SceneX/Y/frames/depth/Camera_Z/depth_%05d.png
SceneX/Y/frames/classsegmentation/Camera_Z/classgt_%05d.png
SceneX/Y/frames/instancesegmentation/Camera_Z/instancegt_%05d.png
SceneX/Y/frames/backwardFlow/Camera_Z/backwardFlow_%05d.png
SceneX/Y/frames/backwardSceneFlow/Camera_Z/backwardSceneFlow_%05d.png
SceneX/Y/frames/forwardFlow/Camera_Z/flow_%05d.png
SceneX/Y/frames/forwardSceneFlow/Camera_Z/sceneFlow_%05d.png
SceneX/Y/colors.txt
SceneX/Y/extrinsic.txt
SceneX/Y/intrinsic.txt
SceneX/Y/info.txt
SceneX/Y/bbox.txt
SceneX/Y/pose.txt

where X ∈ {01, 02, 06, 18, 20} represents one of the 5 different locations,
Y ∈ {15-deg-left, 15-deg-right, 30-deg-left, 30-deg-right, clone, fog, morning, overcast, rain, sunset} represents the different variations,
and Z ∈ {0, 1} represents the left camera (same as in Virtual KITTI 1) or the right camera (offset by 0.532725 m to the right).
Note that our indexes always start from 0.
The corresponding real KITTI scenes can be found in the KITTI object tracking benchmark.
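Given this layout, assembling a frame path is a matter of string formatting. A minimal sketch, assuming the archives were extracted under some root directory (the `vkitti2_rgb_path` helper and the root are illustrative assumptions, not part of the dataset):

```python
import os

def vkitti2_rgb_path(root, scene, variation, camera, frame):
    """Build the path of an RGB frame following the directory layout above.

    `root` is wherever the archives were extracted (an assumption of this
    sketch, not part of the dataset itself).
    """
    return os.path.join(
        root,
        f"Scene{scene:02d}", variation, "frames",
        "rgb", f"Camera_{camera}", f"rgb_{frame:05d}.jpg")

# e.g. vkitti2_rgb_path("vkitti", 1, "clone", 0, 0)
#   -> "vkitti/Scene01/clone/frames/rgb/Camera_0/rgb_00000.jpg"
```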


RGB: link (7.01GB)
All RGB images are encoded as RGB JPG files with 8 bits per channel.


Depth: link (7.58GB)
All depth images are encoded as grayscale 16-bit PNG files. We use a fixed far plane of 655.35 meters (pixels farther away are clipped; however, it is not relevant for this dataset). This allows us to truncate and normalize the z values to the [0; 2^16 - 1] integer range, such that a pixel intensity of 1 in our single-channel PNG16 depth images corresponds to a distance of 1cm to the camera plane. The depth map in centimeters can be directly loaded in Python with numpy and OpenCV via the one-liner (assuming "import cv2"):

depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
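Since the raw values are centimeters, dividing by 100 yields depth in meters. A small pure-numpy sketch (the `depth_to_meters` helper is illustrative, not part of the dataset tools):

```python
import numpy as np

def depth_to_meters(depth_raw):
    """Convert a raw VKITTI2 depth map (uint16, centimeters) to float32 meters.

    Pixels at the far plane (65535 == 655.35 m) are clipped sky/background
    and can be masked out by the caller if needed.
    """
    assert depth_raw.dtype == np.uint16
    return depth_raw.astype(np.float32) / 100.0
```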


Class Segmentation: link (969MB)
All class segmentation images are encoded as RGB PNG files with 8 bits per channel.
Colors are described in the colors.txt file (described below).


Instance Segmentation: link (166MB)
All instance segmentation images are encoded as 8-bit indexed PNG files.
The value of a pixel is equal to (trackID+1). A value of 0 means it is not a vehicle.
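Per-instance masks can then be recovered by grouping pixel values. Note that loading the indexed PNG with Pillow keeps the raw palette indices, whereas OpenCV may expand the palette to colors. The `instance_masks` helper below is an illustrative sketch operating on the raw index array:

```python
import numpy as np

def instance_masks(index_img):
    """Split an instance-segmentation index image (pixel value == trackID + 1,
    0 == not a vehicle) into a {trackID: boolean mask} dictionary."""
    masks = {}
    for v in np.unique(index_img):
        if v == 0:          # background / not a vehicle
            continue
        masks[int(v) - 1] = index_img == v
    return masks
```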


Optical/Scene flow
Forward optical flow: link (29.6GB)
Backward optical flow: link (27.1GB)
Forward scene flow: link (14.8GB)
Backward scene flow: link (14.8GB)

For all types of flow, images are encoded as RGB PNG files with 16 bits per channel.
Optical flow:
Forward: rgb at t + optical flow at t = rgb at t+1
Backward: rgb at t + optical flow at t = rgb at t-1

In R, flow along the x-axis, normalized by image width and quantized to [0; 2^16 - 1]
In G, flow along the y-axis, normalized by image height and quantized to [0; 2^16 - 1]
B = 0 for invalid flow (e.g., sky pixels)
Some example decoding code in Python using OpenCV and numpy:

import cv2
import numpy as np

def read_vkitti_png_flow(flow_fn):
    """Convert from .png to (h, w, 2) (flow_x, flow_y) float32 array"""
    # read png to bgr in 16 bit unsigned short
    bgr = cv2.imread(flow_fn, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
    h, w, _c = bgr.shape
    assert bgr.dtype == np.uint16 and _c == 3
    # b == invalid flow flag == 0 for sky or other invalid flow
    invalid = bgr[..., 0] == 0
    # g, r == flow_y, x normalized by height, width and scaled to [0; 2**16 - 1]
    out_flow = 2.0 / (2**16 - 1.0) * bgr[..., 2:0:-1].astype('f4') - 1
    out_flow[..., 0] *= w - 1
    out_flow[..., 1] *= h - 1
    out_flow[invalid] = 0  # or another value (e.g., np.nan)
    return out_flow

 

Scene flow:

Scene flow contains the 3D displacement, in camera space, of the surface point observed at each pixel between two frames.
Forward: displacement of the pixel's 3D point in meters between t and t+1, clamped between -10m and 10m
Backward: displacement of the pixel's 3D point in meters between t and t-1, clamped between -10m and 10m

To get the values in meters you have to do:

dx = ((sceneFlow.r * 2.0 / 65535.0) - 1.0) * 10.0 # from [0, 2^16-1] to [0, 2], then [-1, 1], then [-10, 10]
dy = ((sceneFlow.g * 2.0 / 65535.0) - 1.0) * 10.0
dz = ((sceneFlow.b * 2.0 / 65535.0) - 1.0) * 10.0
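The same recipe can be vectorized over a whole image. A sketch assuming the PNG was loaded with cv2.imread (BGR channel order, uint16); the `decode_vkitti_scene_flow` name is an assumption of this example:

```python
import numpy as np

def decode_vkitti_scene_flow(bgr):
    """Decode a 16-bit BGR scene-flow image into an (h, w, 3) array of
    (dx, dy, dz) displacements in meters, following the
    [0; 2^16 - 1] -> [-10 m; 10 m] quantization described above."""
    assert bgr.dtype == np.uint16 and bgr.shape[2] == 3
    # reverse the channel order so columns are (r, g, b) == (dx, dy, dz)
    rgb = bgr[..., ::-1].astype(np.float32)
    return (rgb * 2.0 / 65535.0 - 1.0) * 10.0
```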

Camera parameters, Object Detection (2D & 3D) and Multi-Object Tracking Ground Truth: link (23.3MB)
In our system of 3D camera coordinates x is going to the right, y is going down, and z is going forward (the origin is the optical center of the camera).

colors.txt

Correspondence table for class segmentation images.
Category r g b
Terrain 210 0 200
Sky 90 200 255
Tree 0 199 0
Vegetation 90 240 0
Building 140 140 140
Road 100 60 100
GuardRail 250 100 255
TrafficSign 255 255 0
TrafficLight 200 200 0
Pole 255 130 0
Misc 80 80 80
Truck 160 60 60
Car 255 127 80
Van 0 139 139
Undefined 0 0 0
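To turn a class-segmentation image into integer labels, the table above can be inverted. A numpy sketch (the class ordering and the `rgb_to_class_ids` helper are choices of this example, not defined by the dataset):

```python
import numpy as np

# (category, r, g, b) rows copied from the colors.txt table above
VKITTI2_CLASSES = [
    ("Terrain", 210, 0, 200), ("Sky", 90, 200, 255), ("Tree", 0, 199, 0),
    ("Vegetation", 90, 240, 0), ("Building", 140, 140, 140),
    ("Road", 100, 60, 100), ("GuardRail", 250, 100, 255),
    ("TrafficSign", 255, 255, 0), ("TrafficLight", 200, 200, 0),
    ("Pole", 255, 130, 0), ("Misc", 80, 80, 80), ("Truck", 160, 60, 60),
    ("Car", 255, 127, 80), ("Van", 0, 139, 139), ("Undefined", 0, 0, 0),
]

def rgb_to_class_ids(rgb):
    """Map an (h, w, 3) uint8 RGB class-segmentation image to an (h, w)
    array of integer ids indexing VKITTI2_CLASSES (-1 for unknown colors)."""
    ids = np.full(rgb.shape[:2], -1, dtype=np.int32)
    for i, (_name, r, g, b) in enumerate(VKITTI2_CLASSES):
        ids[(rgb == (r, g, b)).all(axis=-1)] = i
    return ids
```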

extrinsic.txt

Get the extrinsic parameters of every camera, for every frame.
Header: frame cameraID r1,1 r1,2 r1,3 t1 r2,1 r2,2 r2,3 t2 r3,1 r3,2 r3,3 t3 0 0 0 1 

frame: frame index in the video (starts from 0) 
cameraID: 0 (left) or 1 (right) 
ri,j: the coefficients of the camera rotation matrix R, and ti: the coefficients of the camera translation vector t 
            r1,1 r1,2 r1,3 t1 
Extrinsic = r2,1 r2,2 r2,3 t2 
            r3,1 r3,2 r3,3 t3 
            0    0    0    1
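Assuming the data rows are whitespace-separated in the header order above, one row can be parsed into a 4x4 matrix like this (the `parse_extrinsic_row` helper is illustrative, not part of the dataset tools):

```python
import numpy as np

def parse_extrinsic_row(line):
    """Parse one whitespace-separated data row of extrinsic.txt
    (header order above) into (frame, cameraID, 4x4 extrinsic matrix)."""
    fields = line.split()
    frame, camera_id = int(fields[0]), int(fields[1])
    # 16 matrix coefficients in row-major order, including the 0 0 0 1 row
    mat = np.array(fields[2:18], dtype=np.float64).reshape(4, 4)
    return frame, camera_id, mat
```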

intrinsic.txt

Get the intrinsic parameters of every camera, for every frame.
Header: frame cameraID K[0,0] K[1,1] K[0,2] K[1,2]

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
K[0,0]: coefficient of the 1st line, 1st column
K[1,1]: coefficient of the 2nd line, 2nd column
K[0,2]: coefficient of the 1st line, 3rd column
K[1,2]: coefficient of the 2nd line, 3rd column
            K[0,0] 0      K[0,2]
Intrinsic = 0      K[1,1] K[1,2]
            0      0      1
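Combining these intrinsics with a depth map (in meters) gives 3D points in the camera coordinate system described above (x right, y down, z forward). A pinhole back-projection sketch (the `backproject_depth` helper is an assumption of this example):

```python
import numpy as np

def backproject_depth(depth_m, fx, fy, cx, cy):
    """Back-project an (h, w) depth map in meters into an (h, w, 3)
    point cloud in camera coordinates, using the pinhole model with
    fx == K[0,0], fy == K[1,1], cx == K[0,2], cy == K[1,2]."""
    h, w = depth_m.shape
    # u, v are the pixel column/row coordinates of every pixel
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return np.stack([x, y, depth_m], axis=-1)
```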

info.txt

Get the vehicle metadata
Header: trackID label model color

trackID: track identification number (unique for each object instance)
label: KITTI-like name of the ‘type’ of the object (Car, Van)
model: the name of the 3D model used to render the object (can be used for fine-grained recognition)
color: the name of the color of the object

bbox.txt

Get the 2D information of the vehicles
Header: frame cameraID trackID left right top bottom number_pixels truncation_ratio occupancy_ratio isMoving

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
trackID: track identification number (unique for each object instance)
left, right, top, bottom: KITTI-like 2D ‘bbox’, bounding box in pixel coordinates
(inclusive, (0,0) origin is on the upper left corner of the image)
number_pixels: number of pixels of this vehicle in the image
truncation_ratio: object 2D truncation ratio in [0..1] (0: no truncation, 1: entirely truncated)
occupancy_ratio: object 2D occupancy ratio (fraction of non-occluded pixels) in [0..1]
(0: fully occluded, 1: fully visible, independent of truncation)
isMoving: 0/1 flag to indicate whether the object is really moving between this frame and the next one
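Assuming whitespace-separated rows in the header order above, the file can be read into per-object records; values are kept as strings here and casting is left to the caller (the `parse_bbox_file` helper is illustrative, not part of the dataset tools):

```python
def parse_bbox_file(lines):
    """Parse bbox.txt content (header line + whitespace-separated data rows)
    into a list of dicts keyed by the header fields. Values stay as strings;
    cast to int/float as needed."""
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        if not line.strip():
            continue
        rows.append(dict(zip(header, line.split())))
    return rows
```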

pose.txt

Get the 3D information of the vehicles
Header: frame cameraID trackID alpha width height length world_space_X world_space_Y world_space_Z rotation_world_space_y rotation_world_space_x rotation_world_space_z camera_space_X camera_space_Y camera_space_Z rotation_camera_space_y rotation_camera_space_x rotation_camera_space_z

frame: frame index in the video (starts from 0)
cameraID: 0 (left) or 1 (right)
trackID: track identification number (unique for each object instance)
alpha: KITTI-like observation angle of the object in [-pi..pi]
width, height, length: KITTI-like 3D object ‘dimensions’ in meters
world_space_X, world_space_Y, world_space_Z: KITTI-like 3D object ‘location’, respectively x, y, z in world coordinates in meters (center of bottom face of 3D bounding box)
rotation_world_space_y, rotation_world_space_x, rotation_world_space_z: yaw, pitch, roll in world coordinates [-pi..pi] (KITTI convention is ry == 0 iff object is aligned with x-axis and pointing right)
camera_space_X, camera_space_Y, camera_space_Z: KITTI-like 3D object ‘location’, respectively x, y, z in camera coordinates in meters (center of bottom face of 3D bounding box)
rotation_camera_space_y, rotation_camera_space_x, rotation_camera_space_z: yaw, pitch, roll in camera coordinates [-pi..pi] (KITTI convention is ry == 0 iff object is aligned with x-axis and pointing right)

Contact: For questions, please contact Yohann Cabon.