Virtual Gallery Dataset
The Virtual Gallery dataset is a synthetic dataset that targets challenges such as varying lighting conditions and different occlusion levels, for tasks such as depth estimation, instance segmentation and visual localization.
It consists of a scene containing 3-4 rooms, in which a total of 42 free-for-use famous paintings are placed on the walls.
The virtual model and the captured images were generated with Unity, allowing us to extract ground-truth information such as depth, semantic and instance segmentation, and 2D-2D and 2D-3D correspondences.
Terms of Use and Reference
COPYRIGHT
Copyrights of The Virtual Gallery Dataset are owned by NAVER LABS Corp.
PLEASE READ THESE TERMS CAREFULLY BEFORE DOWNLOADING THE VIRTUAL GALLERY DATASET. DOWNLOADING OR USING THE DATASET MEANS YOU ACCEPT THESE TERMS.
The Virtual Gallery dataset may be used for non-commercial purposes only and is subject to the Creative Commons Attribution Non Commercial No Derivatives 4.0 International License.
ACKNOWLEDGMENT
Artwork: Courtesy National Gallery of Art, Washington.
CITATION
When using or referring to this dataset in your research, please cite NAVER LABS Europe as the originator of the Virtual Gallery dataset and cite our CVPR 2019 paper (full reference below):
Visual Localization by Learning Objects-of-Interest Dense Match Regression
Philippe Weinzaepfel, Gabriela Csurka, Yohann Cabon, Martin Humenberger
In CVPR, 2019
@inproceedings{Weinzaepfel:VirtualGallery:CVPR2019,
  author    = {Weinzaepfel, Philippe and Csurka, Gabriela and Cabon, Yohann and Humenberger, Martin},
  title     = {Visual Localization by Learning Objects-of-Interest Dense Match Regression},
  booktitle = {CVPR},
  year      = {2019}
}
News
- 27/05/2019 – Released first version of the Virtual Gallery dataset
Data
We consider a realistic scenario that simulates the scene captured by a robot equipped with 6 cameras for training, and photos taken by visitors for testing.
For both training and testing scenarios, RGB, depth, class segmentation and instance segmentation images are available alongside camera poses and 2D-2D, 2D-3D painting correspondences.
All images were recorded at 1080p resolution (1920×1080).
There is a total of 43668 frames in the training set and 11904 frames in the testing set.
The format of each of the provided modalities is described below. Note that frame indexes always start from 0.
Training
For training, we consider a realistic scenario that simulates the scene captured by a robot.
The camera setup of the robot consists of 6 cameras in a 360° configuration at a fixed height of 165cm.
Cameras are arranged in a circle, each 10 cm from the center, with a horizontal field of view of 70°.
The robot follows 5 different loops inside the gallery, taking pictures roughly every 20 cm, which results in about 250 images per camera per loop.
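For intuition, the rig geometry can be sketched in Python as follows; the even 60° spacing and the axis convention are our assumptions for illustration, not part of the dataset specification:

import numpy as np

NUM_CAMERAS = 6   # 360° coverage
RADIUS = 0.10     # distance from rig center to each optical center, in meters
HEIGHT = 1.65     # camera height above the floor, in meters

for cam in range(NUM_CAMERAS):
    yaw = np.deg2rad(cam * 360.0 / NUM_CAMERAS)  # assumed even spacing
    # Optical center relative to the point on the floor below the rig center
    # (illustrative axes: x right, z forward, y up)
    center = np.array([RADIUS * np.sin(yaw), HEIGHT, RADIUS * np.cos(yaw)])
    print(f"camera {cam}: yaw {np.rad2deg(yaw):5.1f} deg, center {center}")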
Links
We provide one .tar archive per trajectory:
virtual_gallery_1.0.0_training_loop1.tar
virtual_gallery_1.0.0_training_loop2.tar
virtual_gallery_1.0.0_training_loop3.tar
virtual_gallery_1.0.0_training_loop4.tar
virtual_gallery_1.0.0_training_loop5.tar
Archives contain files in the format:
training/gallery_lightX_loopY/frames/rgb/camera_Z/rgb_%05d.jpg
training/gallery_lightX_loopY/frames/depth/camera_Z/depth_%05d.png
training/gallery_lightX_loopY/frames/classsegmentation/camera_Z/classgt_%05d.png
training/gallery_lightX_loopY/frames/instancesegmentation/camera_Z/instancegt_%05d.png
training/gallery_lightX_loopY/colors.txt
training/gallery_lightX_loopY/extrinsic.txt
training/gallery_lightX_loopY/instances.txt
training/gallery_lightX_loopY/instances_points.txt
training/gallery_lightX_loopY/instances_pose.txt
training/gallery_lightX_loopY/intrinsic.txt
where
X ∈ [1, 6] represents one of the 6 lighting conditions,
Y ∈ [1, 5] represents one of the 5 camera trajectories (loops),
Z ∈ [0, 5] represents one of the 6 cameras (viewpoints).
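A hypothetical helper (not part of the dataset tools) that assembles such a path for one training frame:

def training_path(root, modality, light, loop, camera, frame):
    """Build the path of one training frame; 'modality' is one of the
    four frame folders listed above."""
    prefix_ext = {"rgb": ("rgb", "jpg"),
                  "depth": ("depth", "png"),
                  "classsegmentation": ("classgt", "png"),
                  "instancesegmentation": ("instancegt", "png")}
    prefix, ext = prefix_ext[modality]
    return (f"{root}/training/gallery_light{light}_loop{loop}"
            f"/frames/{modality}/camera_{camera}/{prefix}_{frame:05d}.{ext}")

# training_path(".", "rgb", 1, 2, 0, 0)
# -> "./training/gallery_light1_loop2/frames/rgb/camera_0/rgb_00000.jpg"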
The training set consists of 221 to 266 images per camera per loop, for each of the 6 lighting conditions.
Testing
For testing, we consider a scenario that simulates photos taken by visitors.
We sample random positions, orientations and focal lengths, ensuring that viewpoints
- are plausible and realistic (in terms of orientation, height and distance to the wall)
- cover the entire scene
The test set consists of 496 images, each rendered under the 6 lighting conditions and with 4 different densities of humans present in the scene (including the empty case), i.e. 496 × 6 × 4 = 11904 frames in total.
Links
We provide one .tar archive per occlusion level:
virtual_gallery_1.0.0_testing_occlusion1.tar
virtual_gallery_1.0.0_testing_occlusion2.tar
virtual_gallery_1.0.0_testing_occlusion3.tar
virtual_gallery_1.0.0_testing_occlusion4.tar
Archives contain files in the format:
testing/gallery_lightX_occlusionY/frames/rgb/camera_0/rgb_%05d.jpg
testing/gallery_lightX_occlusionY/frames/depth/camera_0/depth_%05d.png
testing/gallery_lightX_occlusionY/frames/classsegmentation/camera_0/classgt_%05d.png
testing/gallery_lightX_occlusionY/frames/instancesegmentation/camera_0/instancegt_%05d.png
testing/gallery_lightX_occlusionY/colors.txt
testing/gallery_lightX_occlusionY/extrinsic.txt
testing/gallery_lightX_occlusionY/instances.txt
testing/gallery_lightX_occlusionY/instances_points.txt
testing/gallery_lightX_occlusionY/instances_pose.txt
testing/gallery_lightX_occlusionY/intrinsic.txt
where
X ∈ [1, 6] represents one of the 6 lighting conditions,
Y ∈ [1, 4] represents the occlusion level:
- Y == 1: there are no human occluders in the scene
- Y == 2: there are between 5 and 20 human occluders in the virtual gallery for every frame (not necessarily visible)
- Y == 3: there are between 5 and 40 human occluders in the virtual gallery for every frame (not necessarily visible)
- Y == 4: there are between 5 and 60 human occluders in the virtual gallery for every frame (not necessarily visible)
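For scripting, the occlusion levels can be encoded as a small lookup table (ranges taken from the list above; the naming is ours):

# (min, max) number of human occluders placed in the gallery per level Y
OCCLUSION_HUMANS = {1: (0, 0), 2: (5, 20), 3: (5, 40), 4: (5, 60)}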
Format Description
RGB
All RGB images are encoded as RGB JPG files with 8 bits per channel.
Depth
All depth images are encoded as grayscale 16-bit PNG files.
We use a fixed far plane of 655.35 meters (pixels farther away are clipped; however, it is not relevant for this dataset).
This allows us to truncate and normalize the z values to the [0;2^16 – 1] integer range such that a pixel intensity of 1 in our single channel PNG16 depth images corresponds to a distance of 1cm to the camera plane.
The depth map in centimeters can be loaded directly in Python with OpenCV, which returns a NumPy array:

import cv2
# Read the 16-bit PNG as-is; pixel values are distances in centimeters
depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
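If values in meters are more convenient, the array can simply be rescaled; a minimal sketch continuing from the snippet above:

import numpy as np
# 1 intensity unit = 1 cm, so divide by 100 to obtain meters;
# a value of 65535 corresponds to the clipped far plane at 655.35 m
depth_m = depth.astype(np.float32) / 100.0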
Class Segmentation
All class segmentation images are encoded as RGB PNG files with 8 bits per channel.
Colors are described in the colors.txt file (described below).
Instance Segmentation
All instance segmentation images are encoded as 8-bit indexed PNG files.
The value of a pixel is the instanceId: a value of 0 means the pixel does not belong to a painting; the other values are described in the instances.txt section below.
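OpenCV typically expands the palette of indexed PNGs when reading them, so Pillow is the safer choice for recovering the raw instanceIds; a minimal sketch where instancegt_png_filename and painting_id are placeholders:

import numpy as np
from PIL import Image

# Pillow keeps the palette indices of 'P'-mode PNGs, so the array values
# are instanceIds directly (0 = not a painting)
instance_map = np.array(Image.open(instancegt_png_filename))
painting_mask = instance_map == painting_id  # boolean mask of one painting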
Text files description
In our 3D camera coordinate system, x points right, y points down and z points forward; the origin is the optical center of the camera.
colors.txt
Correspondence table for class segmentation images.
Category    r    g    b
Wall        255  127  80
Ceiling     255  248  220
Sky         0    255  255
Door        189  183  107
Light       230  230  250
Floor       233  150  122
Misc        80   80   80
Painting    128  0    0
Human       0    255  0
Undefined   0    0    0
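Assuming colors.txt is a plain whitespace-separated file with the header row shown above (the exact layout may differ), a minimal parser could look like:

def load_colors(path):
    """Parse colors.txt into a {category: (r, g, b)} dictionary."""
    mapping = {}
    with open(path) as f:
        next(f)  # skip the 'Category r g b' header
        for line in f:
            name, r, g, b = line.split()
            mapping[name] = (int(r), int(g), int(b))
    return mapping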
instances.txt
Correspondence table for instance segmentation images.
Header: instanceId width height author painting_name year
instanceId: painting identification number (starts from 1)
width: width of the painting in the world, in meters
height: height of the painting in the world, in meters
author: name of the author of the painting
painting_name: name of the painting
year: date of the painting as indicated on the National Gallery of Art website; when multiple dates were possible, the oldest one was chosen
extrinsic.txt
Get the extrinsic parameters of every camera, for every frame.
Header: frame cameraID r1,1 r1,2 r1,3 t1 r2,1 r2,2 r2,3 t2 r3,1 r3,2 r3,3 t3 0 0 0 1
frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
ri,j: coefficients of the camera rotation matrix R
ti: coefficients of the camera translation vector t

            | r1,1 r1,2 r1,3 t1 |
Extrinsic = | r2,1 r2,2 r2,3 t2 |
            | r3,1 r3,2 r3,3 t3 |
            | 0    0    0    1  |
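A sketch of turning one data row into a 4×4 matrix and recovering the camera position, assuming the matrix maps world to camera coordinates (the usual convention) and that the 16 coefficients are stored in row-major order:

import numpy as np

fields = line.split()  # 'line' is one data row of extrinsic.txt (placeholder)
frame, camera_id = int(fields[0]), int(fields[1])
extrinsic = np.array([float(v) for v in fields[2:18]]).reshape(4, 4)

R, t = extrinsic[:3, :3], extrinsic[:3, 3]
camera_center_world = -R.T @ t  # optical center in world coordinates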
intrinsic.txt
Get the intrinsic parameters of every camera, for every frame.
Header: frame cameraID K[0,0] K[1,1] K[0,2] K[1,2]
frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
K[0,0]: coefficient of the 1st row, 1st column
K[1,1]: coefficient of the 2nd row, 2nd column
K[0,2]: coefficient of the 1st row, 3rd column
K[1,2]: coefficient of the 2nd row, 3rd column

            | K[0,0]   0      K[0,2] |
Intrinsic = |   0    K[1,1]   K[1,2] |
            |   0      0        1    |
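A sketch of building K from one row and projecting a camera-space point to pixel coordinates; k00, k11, k02, k12 stand for the four coefficients of the row and X_cam for a 3D point in the camera frame (all placeholders):

import numpy as np

K = np.array([[k00, 0.0, k02],
              [0.0, k11, k12],
              [0.0, 0.0, 1.0]])

x, y, z = X_cam                # camera frame: x right, y down, z forward
u = K[0, 0] * x / z + K[0, 2]  # pixel column
v = K[1, 1] * y / z + K[1, 2]  # pixel row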
instances_points.txt
For every frame and every visible painting, get a list of at least 4 points on the painting that are within the camera frustum.
Header: frame cameraID instanceID number_of_points [position_world_space position_screen_space position_texture_space]...
frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
instanceID: painting identification number (starts from 1)
number_of_points: number of points listed, i.e. the number of times the [position_world_space position_screen_space position_texture_space] part is repeated
position_world_space: coordinates of the point in world space, in meters
position_screen_space: coordinates of the point in screen space, in pixels (coordinates of the point in the rgb/classgt/instancegt/depth image); inclusive, (0,0) is the upper left corner of the image
position_texture_space: coordinates of the point in the original texture, as floats between 0 and 1 for horizontal/vertical; (0,0) is the upper left corner and (1,1) the bottom right corner of the texture
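These 2D-3D correspondences are exactly what PnP-based localization consumes; a sketch using OpenCV, assuming the row has already been parsed into world_pts (an N×3 array) and screen_pts (an N×2 array), with K the intrinsic matrix (parsing not shown):

import cv2
import numpy as np

ok, rvec, tvec = cv2.solvePnP(world_pts.astype(np.float64),
                              screen_pts.astype(np.float64),
                              K, None)
R, _ = cv2.Rodrigues(rvec)  # world-to-camera rotation
# Up to numerical precision, (R | tvec) should match the camera pose
# stored in the corresponding row of extrinsic.txt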
instances_pose.txt
For every frame and every visible painting, get the 3D pose.
Header: frame cameraID instanceID number_pixels position_center_world_space position_bottom_left_world_space position_bottom_right_world_space position_top_left_world_space position_top_right_world_space rotation_world_space position_center_camera_space position_bottom_left_camera_space position_bottom_right_camera_space position_top_left_camera_space position_top_right_camera_space rotation_camera_space
frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
instanceID: painting identification number (starts from 1)
number_pixels: number of pixels of the instance visible in the corresponding frame
position_center_world_space: position of the center of the painting in world space, in meters
position_bottom_left_world_space: position of the bottom left corner of the painting in world space, in meters
position_bottom_right_world_space: position of the bottom right corner of the painting in world space, in meters
position_top_left_world_space: position of the top left corner of the painting in world space, in meters
position_top_right_world_space: position of the top right corner of the painting in world space, in meters
rotation_world_space: yaw, pitch and roll of the painting relative to the world axes:
- rotation_y: rotation around the Y-axis (yaw) in world coordinates, in [-pi, pi] (convention: ry == 0 iff the object is aligned with the x-axis and pointing right)
- rotation_x: rotation around the X-axis (pitch) in world coordinates, in [-pi, pi]
- rotation_z: rotation around the Z-axis (roll) in world coordinates, in [-pi, pi]
position_center_camera_space: position of the center of the painting in camera space, in meters
position_bottom_left_camera_space: position of the bottom left corner of the painting in camera space, in meters
position_bottom_right_camera_space: position of the bottom right corner of the painting in camera space, in meters
position_top_left_camera_space: position of the top left corner of the painting in camera space, in meters
position_top_right_camera_space: position of the top right corner of the painting in camera space, in meters
rotation_camera_space: yaw, pitch and roll of the painting relative to the camera axes:
- rotation_y: rotation around the Y-axis (yaw) in camera coordinates, in [-pi, pi] (convention: ry == 0 iff the object is aligned with the x-axis and pointing right)
- rotation_x: rotation around the X-axis (pitch) in camera coordinates, in [-pi, pi]
- rotation_z: rotation around the Z-axis (roll) in camera coordinates, in [-pi, pi]
Contact
For questions, please contact Yohann Cabon.