Virtual Gallery Dataset - Naver Labs Europe

Virtual Gallery Dataset

Figure: examples of variations in lighting and occlusion level.

The Virtual Gallery dataset is a synthetic dataset that targets multiple challenges such as varying lighting conditions and different occlusion levels for various tasks such as depth estimation, instance segmentation and visual localization.

It consists of a scene containing 3-4 rooms, in which a total of 42 free-for-use famous paintings are placed on the walls.

The virtual model and the captured images were generated with Unity software, allowing us to extract ground-truth information such as depth, semantic and instance segmentation, 2D-2D and 2D-3D correspondences.

Terms of Use and Reference

COPYRIGHT

Copyrights of The Virtual Gallery Dataset are owned by NAVER LABS Corp.

PLEASE READ THESE TERMS CAREFULLY BEFORE DOWNLOADING THE VIRTUAL GALLERY DATASET. DOWNLOADING OR USING THE DATASET MEANS YOU ACCEPT THESE TERMS.

The Virtual Gallery dataset may be used for non-commercial purposes only and is subject to the Creative Commons Attribution Non Commercial No Derivatives 4.0 International License.

ACKNOWLEDGMENT

Artwork: Courtesy National Gallery of Art, Washington.

CITATION

When using or referring to this dataset in your research, please cite NAVER LABS Europe as the originator of the Virtual Gallery dataset and cite our CVPR 2019 paper; the full reference is given below:

Visual Localization by Learning Objects-of-Interest Dense Match Regression
Philippe Weinzaepfel, Gabriela Csurka, Yohann Cabon, Martin Humenberger
In CVPR, 2019

@inproceedings{Weinzaepfel:VirtualGallery:CVPR2019,
  author = {Weinzaepfel, Philippe and Csurka, Gabriela and Cabon, Yohann and Humenberger, Martin},
  title = {Visual Localization by Learning Objects-of-Interest Dense Match Regression},
  booktitle = {CVPR},
  year = {2019}
}

News

  • 27/05/2019 – Released first version of the Virtual Gallery dataset

Data

We consider a realistic scenario that simulates the scene captured by a robot equipped with 6 cameras for training, and photos taken by visitors for testing.
For both training and testing scenarios, RGB, depth, class segmentation and instance segmentation images are available alongside camera poses and 2D-2D, 2D-3D painting correspondences.

All images were recorded in 1080p.
There are a total of 43668 frames in the training set and 11904 frames in the testing set.

The format of each of the provided modalities is described below. Note that frame indexes always start from 0.

Training

For training, we consider a realistic scenario that simulates the scene captured by a robot.
The camera setup of the robot consists of 6 cameras in a 360° configuration at a fixed height of 165cm.
The cameras are arranged in a circle, each camera 10cm from the center, with a horizontal field of view of 70°.

The robot follows 5 different loops inside the gallery, with pictures taken roughly every 20cm, which results in about 250 images for each camera.

Figure: floor plan of the virtual gallery.

Links

We provide one .tar archive per trajectory:

virtual_gallery_1.0.0_training_loop1.tar
virtual_gallery_1.0.0_training_loop2.tar
virtual_gallery_1.0.0_training_loop3.tar
virtual_gallery_1.0.0_training_loop4.tar
virtual_gallery_1.0.0_training_loop5.tar

Archives contain files in the format:

training/gallery_lightX_loopY/frames/rgb/camera_Z/rgb_%05d.jpg
training/gallery_lightX_loopY/frames/depth/camera_Z/depth_%05d.png
training/gallery_lightX_loopY/frames/classsegmentation/camera_Z/classgt_%05d.png
training/gallery_lightX_loopY/frames/instancesegmentation/camera_Z/instancegt_%05d.jpg
training/gallery_lightX_loopY/colors.txt
training/gallery_lightX_loopY/extrinsic.txt
training/gallery_lightX_loopY/instances.txt
training/gallery_lightX_loopY/instances_points.txt
training/gallery_lightX_loopY/instances_pose.txt
training/gallery_lightX_loopY/intrinsic.txt

where X ∈ [1, 6] denotes one of 6 different lighting conditions,
Y ∈ [1, 5] denotes one of 5 different camera trajectories (loops),
and Z ∈ [1, 6] denotes one of 6 different cameras (viewpoints).

The training set consists of 221 to 266 images per camera per loop, for each of the 6 lighting conditions.
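The archive layout above can be parsed programmatically. A minimal sketch, assuming the regular expression below matches the documented naming convention (the path is a hypothetical example, not a file name taken from the archives):

```python
import re

# Matches the documented training layout:
# training/gallery_lightX_loopY/frames/<modality>/camera_Z/<modality>_%05d.<ext>
# Note: the text files (extrinsic.txt, intrinsic.txt) index cameras from 0 to 5.
PATTERN = re.compile(
    r"training/gallery_light(?P<light>\d)_loop(?P<loop>\d)/"
    r"frames/(?P<modality>rgb|depth|classsegmentation|instancesegmentation)/"
    r"camera_(?P<camera>\d)/\w+_(?P<frame>\d{5})\.(?:jpg|png)"
)

path = "training/gallery_light2_loop3/frames/rgb/camera_4/rgb_00017.jpg"  # hypothetical
m = PATTERN.match(path)
fields = {k: (int(v) if v.isdigit() else v) for k, v in m.groupdict().items()}
print(fields)  # {'light': 2, 'loop': 3, 'modality': 'rgb', 'camera': 4, 'frame': 17}
```

The same pattern, with `training` and `loop` swapped for `testing` and `occlusion`, applies to the testing archives.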

Testing

For testing, we consider a scenario that simulates photos taken by visitors.
We sample random positions, orientations and focal lengths, ensuring that viewpoints

  • are plausible and realistic (in terms of orientation, height and distance to the wall)
  • cover the entire scene

The test set consists of 496 images that are rendered with each of the 6 lighting conditions, and for 4 different densities of humans present in the scene (including the empty case).
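These counts are consistent with the total given in the Data section: 496 viewpoints rendered under 6 lighting conditions and 4 occlusion levels yield the 11904 testing frames.

```python
# 496 viewpoints, each rendered under 6 lighting conditions
# and 4 occlusion levels (including the empty case)
viewpoints, lightings, occlusion_levels = 496, 6, 4
total_testing_frames = viewpoints * lightings * occlusion_levels
print(total_testing_frames)  # 11904
```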

Links

We provide one .tar archive per occlusion level:

virtual_gallery_1.0.0_testing_occlusion1.tar
virtual_gallery_1.0.0_testing_occlusion2.tar
virtual_gallery_1.0.0_testing_occlusion3.tar
virtual_gallery_1.0.0_testing_occlusion4.tar

Archives contain files in the format:

testing/gallery_lightX_occlusionY/frames/rgb/camera_0/rgb_%05d.jpg
testing/gallery_lightX_occlusionY/frames/depth/camera_0/depth_%05d.png
testing/gallery_lightX_occlusionY/frames/classsegmentation/camera_0/classgt_%05d.png
testing/gallery_lightX_occlusionY/frames/instancesegmentation/camera_0/instancegt_%05d.png
testing/gallery_lightX_occlusionY/colors.txt
testing/gallery_lightX_occlusionY/extrinsic.txt
testing/gallery_lightX_occlusionY/instances.txt
testing/gallery_lightX_occlusionY/instances_points.txt
testing/gallery_lightX_occlusionY/instances_pose.txt
testing/gallery_lightX_occlusionY/intrinsic.txt

where X ∈ [1, 6] denotes one of 6 different lighting conditions
and Y ∈ [1, 4] denotes the occlusion level:

  • (Y == 1) => There are no human occluders in the scene
  • (Y == 2) => There are between 5 and 20 human occluders in the virtual gallery for every frame (not necessarily visible)
  • (Y == 3) => There are between 5 and 40 human occluders in the virtual gallery for every frame (not necessarily visible)
  • (Y == 4) => There are between 5 and 60 human occluders in the virtual gallery for every frame (not necessarily visible)

Format Description

RGB

All RGB images are encoded as RGB JPG files with 8 bits per channel.

Depth

All depth images are encoded as grayscale 16-bit PNG files.

We use a fixed far plane of 655.35 meters (pixels farther away are clipped; in practice this never happens in this dataset, since the scene is much smaller).

This allows us to truncate and normalize the z values to the [0; 2^16 − 1] integer range, such that a pixel intensity of 1 in our single-channel 16-bit PNG depth images corresponds to a distance of 1cm from the camera plane.

The depth map in centimeters can be loaded directly in Python with OpenCV (the result is a numpy array):
     import cv2
     # IMREAD_ANYDEPTH preserves the 16-bit values (centimeters)
     depth = cv2.imread(depth_png_filename, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
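The centimeter encoding can be checked without the dataset on a synthetic array; this is a sketch of the conversion to meters, not part of any official tooling:

```python
import numpy as np

# Synthetic 16-bit depth values as stored in the PNGs: 1 unit == 1 cm
depth_cm = np.array([[100, 250],
                     [65535, 0]], dtype=np.uint16)

# Convert to meters; 65535 is the clipped far-plane value (655.35 m)
depth_m = depth_cm.astype(np.float32) / 100.0
```

An array loaded with the one-liner above can be converted to meters the same way.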

Class Segmentation

All class segmentation images are encoded as RGB PNG files with 8 bits per channel.
Colors are described in the colors.txt file (described below).

Instance Segmentation

All instance segmentation images are encoded as 8-bit indexed PNG files.
The value of a pixel is the instanceId. A value of 0 means the pixel does not belong to a painting. Other values are described in the instances.txt file (described below).
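Per-painting visible pixel counts (the number_pixels column of instances_pose.txt) can be recomputed from an instance image with numpy; the mask below is synthetic rather than read from an instancegt file:

```python
import numpy as np

# Synthetic instance mask: 0 = not a painting, 1 and 2 = painting instanceIds
mask = np.array([[0, 0, 1, 1],
                 [0, 2, 2, 2],
                 [0, 0, 0, 2]], dtype=np.uint8)

# Count the pixels of each visible instance, skipping the background value 0
ids, counts = np.unique(mask, return_counts=True)
visible = {int(i): int(c) for i, c in zip(ids, counts) if i != 0}
print(visible)  # {1: 2, 2: 4}
```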

Text files description

In our system of 3D camera coordinates x is going to the right, y is going down, and z is going forward (the origin is the optical center of the camera).

colors.txt

Correspondence table for class segmentation images.

Category    r    g    b
Wall        255  127  80
Ceiling     255  248  220
Sky         0    255  255
Door        189  183  107
Light       230  230  250
Floor       233  150  122
Misc        80   80   80
Painting    128  0    0
Human       0    255  0
Undefined   0    0    0
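For decoding classgt images, the table can be turned into a lookup from RGB triplet to category; this sketch embeds the table as a string instead of reading the colors.txt file:

```python
COLORS_TXT = """Category r g b
Wall 255 127 80
Ceiling 255 248 220
Sky 0 255 255
Door 189 183 107
Light 230 230 250
Floor 233 150 122
Misc 80 80 80
Painting 128 0 0
Human 0 255 0
Undefined 0 0 0"""

# Map (r, g, b) -> category name, skipping the header line
rgb_to_category = {}
for line in COLORS_TXT.splitlines()[1:]:
    name, r, g, b = line.split()
    rgb_to_category[(int(r), int(g), int(b))] = name

print(rgb_to_category[(128, 0, 0)])  # Painting
```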

instances.txt

Correspondence table for instance segmentation images.

Header : instanceId width height author painting_name year

where

instanceId: painting identification number (starts from 1)
width: width of the painting in the world in meters
height: height of the painting in the world in meters
author: name of the author of the painting
painting_name: title of the painting
year: date of the painting as indicated on the National Gallery of Art website; when multiple dates were possible, the oldest one was chosen

extrinsic.txt

Gives the extrinsic parameters of every camera, for every frame.

Header : frame cameraID r1,1 r1,2 r1,3 t1 r2,1 r2,2 r2,3 t2 r3,1 r3,2 r3,3 t3 0 0 0 1

where

frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
ri,j : the coefficients of the camera rotation matrix R, and ti the coefficients of the camera translation vector t
Extrinsic = r1,1 r1,2 r1,3 t1
            r2,1 r2,2 r2,3 t2
            r3,1 r3,2 r3,3 t3
            0    0    0    1
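Assuming the 4×4 matrix maps world coordinates to camera coordinates (X_cam = R·X_world + t; the document does not state the direction, so this convention is an assumption), the camera position in world space is −Rᵀt:

```python
import numpy as np

# Synthetic extrinsic: camera rotated 90 degrees around the world y-axis,
# with t chosen so that the true camera center is C = (2, 0, 1).
R = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0]])
C_true = np.array([2.0, 0.0, 1.0])
t = -R @ C_true  # translation column of the extrinsic matrix

# Recover the camera center in world space: C = -R^T t
C = -R.T @ t
print(C)  # [2. 0. 1.]
```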

intrinsic.txt

Gives the intrinsic parameters of every camera, for every frame.

Header : frame cameraID K[0,0] K[1,1] K[0,2] K[1,2]

where

frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
K[0,0]: coefficient of the 1st row, 1st column
K[1,1]: coefficient of the 2nd row, 2nd column
K[0,2]: coefficient of the 1st row, 3rd column
K[1,2]: coefficient of the 2nd row, 3rd column

Intrinsic = K[0,0] 0      K[0,2]
            0      K[1,1] K[1,2]
            0      0      1
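The four coefficients form a standard pinhole intrinsic matrix. For the training cameras (1920×1080 images, 70° horizontal field of view) the focal length would be roughly (1920/2)/tan(35°) ≈ 1371 pixels; the values below are illustrative, not read from an intrinsic.txt file:

```python
import numpy as np

fx, fy, cx, cy = 1371.0, 1371.0, 960.0, 540.0  # illustrative values only
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Project a camera-space point (x right, y down, z forward) to pixel coordinates
point = np.array([0.5, -0.2, 2.0])  # 0.5 m right, 0.2 m up, 2 m in front
u, v, w = K @ point
pixel = (u / w, v / w)  # approximately (1302.75, 402.9)
```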

instances_points.txt

For every frame and every visible painting, gives a list of at least 4 points on the painting that lie within the camera frustum.

Header : frame cameraID instanceID number_of_points [world_space screen_space texture_space]…

where

frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
instanceID: painting identification number (starts from 1)
number_of_points: number of points listed, i.e. the number of times the [world_space screen_space texture_space] triplet is repeated
world_space: coordinates of the point in world space in meters
screen_space: coordinates of the point in screen space in pixels (coordinates of the point in the rgb/classgt/instancegt/depth image); coordinates are inclusive, and (0,0) is the upper left corner of the image
texture_space: coordinates of the point in the original texture: floats between 0 and 1 horizontally/vertically; (0,0) is the upper left corner and (1,1) the bottom right corner of the texture
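Texture-space coordinates can be mapped to pixel positions in the original painting texture; the texture size below is hypothetical, and mapping (1,1) to the last pixel follows the inclusive convention stated above (an interpretation, since the exact convention is not specified):

```python
def texture_to_pixel(u, v, tex_width, tex_height):
    """Map texture-space (u, v), with (0, 0) the upper left corner and
    (1, 1) the bottom right corner, to pixel coordinates in the texture."""
    return (u * (tex_width - 1), v * (tex_height - 1))

# Hypothetical 800x600 painting texture
print(texture_to_pixel(0.0, 0.0, 800, 600))  # (0.0, 0.0)
print(texture_to_pixel(1.0, 1.0, 800, 600))  # (799.0, 599.0)
```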

instances_pose.txt

For every frame and every visible painting, gives the 3D pose.

Header : frame cameraID instanceID number_pixels position_center_world_space position_bottom_left_world_space position_bottom_right_world_space position_top_left_world_space position_top_right_world_space rotation_world_space position_center_camera_space position_bottom_left_camera_space position_bottom_right_camera_space position_top_left_camera_space position_top_right_camera_space rotation_camera_space

where

frame: frame index in the video (starts from 0)
cameraID: camera index (between 0 and 5)
instanceID: painting identification number (starts from 1)
number_pixels: number of pixels of the instance visible in the corresponding frame
position_center_world_space: position of the center of the painting in world space in meters
position_bottom_left_world_space: position of the bottom left corner of the painting in world space in meters
position_bottom_right_world_space: position of the bottom right corner of the painting in world space in meters
position_top_left_world_space: position of the top left corner of the painting in world space in meters
position_top_right_world_space: position of the top right corner of the painting in world space in meters
rotation_world_space: Yaw Pitch Roll of the painting relative to the world system of axis

  • rotation_y : rotation around Y-axis (yaw) in world coordinates [-pi..pi] (convention is ry == 0 iff object aligned with x-axis and pointing right)
  • rotation_x : rotation around X-axis (pitch) in world coordinates [-pi..pi]
  • rotation_z : rotation around Z-axis (roll) in world coordinates [-pi..pi]

position_center_camera_space: position of the center of the painting in camera space in meters
position_bottom_left_camera_space: position of the bottom left corner of the painting in camera space in meters
position_bottom_right_camera_space: position of the bottom right corner of the painting in camera space in meters
position_top_left_camera_space: position of the top left corner of the painting in camera space in meters
position_top_right_camera_space: position of the top right corner of the painting in camera space in meters
rotation_camera_space: Yaw Pitch Roll of the painting relative to the camera system of axis

  • rotation_y : rotation around Y-axis (yaw) in camera coordinates [-pi..pi] (convention is ry == 0 iff object aligned with x-axis and pointing right)
  • rotation_x : rotation around X-axis (pitch) in camera coordinates [-pi..pi]
  • rotation_z : rotation around Z-axis (roll) in camera coordinates [-pi..pi]
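The corner positions allow a sanity check against the width and height columns of instances.txt; the corners below are synthetic, not taken from the dataset:

```python
import numpy as np

# Synthetic world-space corners of a 1.2 m x 0.8 m painting on a wall at z = 3
bottom_left = np.array([0.0, 1.0, 3.0])
bottom_right = np.array([1.2, 1.0, 3.0])
top_left = np.array([0.0, 0.2, 3.0])

# Physical dimensions from corner-to-corner distances
width = float(np.linalg.norm(bottom_right - bottom_left))
height = float(np.linalg.norm(top_left - bottom_left))
```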

Contact

For questions, please contact Yohann Cabon.