Annotated scenes in HRI: Robots waiting for the elevator
This dataset of 125 procedurally generated, expert-annotated scenes accompanies the RO-MAN 2025 paper 'Robots waiting for the elevator: integrating social norms in a low-data regime goal selection problem'.
LPOSS
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs).
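The shared ingredient behind such training-free approaches is per-patch similarity between VLM image features and text embeddings of the class names. The sketch below illustrates that generic step with CLIP via Hugging Face transformers; it is not the LPOSS algorithm itself, and reusing the visual projection on patch tokens is a simplification assumed for illustration.

```python
# Minimal sketch of patch-text similarity for open-vocabulary
# segmentation. Generic idea only, NOT the LPOSS method; applying
# the visual projection to patch tokens is an assumed shortcut.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

classes = ["a photo of a dog", "a photo of grass", "a photo of sky"]
image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=classes, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = vision_out.last_hidden_state[:, 1:]  # drop the CLS token
    patches = model.visual_projection(model.vision_model.post_layernorm(patches))

# Cosine similarity between every patch and every class prompt;
# an argmax yields a coarse patch-level segmentation map.
patches = patches / patches.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
sim = patches @ text_emb.T                 # (1, num_patches, num_classes)
seg = sim.argmax(dim=-1).reshape(14, 14)   # 224 / 16 = 14 patches per side
print(seg)
```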
DUNE
A single encoder that unifies different foundation models, excelling in 2D vision, 3D understanding, and 3D human perception. The code accompanies the CVPR 2025 paper.
Speech-MASSIVE
Covers 12 languages from different families and inherits the intent-prediction and slot-filling annotations of the original MASSIVE dataset. See also the Interspeech 2024 paper.
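If the corpus is mirrored on the Hugging Face Hub, it can be loaded with the datasets library; the repository ID, configuration name, and field names below are assumptions, so check the dataset card for the actual identifiers.

```python
# Hypothetical loading sketch for Speech-MASSIVE via the Hugging Face
# `datasets` library. Repo ID, config name, and field names are
# assumptions based on the MASSIVE schema.
from datasets import load_dataset

ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="train")
sample = ds[0]
print(sample["utt"])     # assumed field: transcript inherited from MASSIVE
print(sample["intent"])  # assumed field: intent label for the utterance
```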
mHuBERT-147
A promising compact model for speech processing pipelines, offering an unprecedented balance between high performance and parameter efficiency. Developed within the EU UTTER project.
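Being a HuBERT-style encoder, the checkpoint should be loadable through the transformers HubertModel class; the Hub repository name below is an assumption based on the UTTER project naming.

```python
# Hypothetical feature-extraction sketch for mHuBERT-147 with the
# `transformers` HubertModel API. The Hub ID is an assumption; the
# 16 kHz mono input matches standard HuBERT preprocessing.
import torch
from transformers import AutoFeatureExtractor, HubertModel

model_id = "utter-project/mHuBERT-147"  # assumed repository name
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
print(features.shape)
```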
Models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.
A codebase implementing the CVPR 2024 paper, for evaluating the robustness and uncertainty of semantic segmentation models.
A PyTorch research codebase to replicate the CVPR 2022 paper.
Resources related to our EMNLP and WMT 2021 publications on multilingual MT. We release model checkpoints, fairseq modules to decode from those models, the test splits we used in the papers, and translation outputs by our models.
Kapture is both a file format and a set of tools for manipulating datasets, in particular Visual Localization and Structure-from-Motion data.
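For example, once installed (pip install kapture), a dataset in kapture format can be read through the Python API; kapture_from_dir is kapture's documented entry point, while the directory path is a placeholder.

```python
# Sketch of reading a dataset in kapture format with the kapture
# Python API. The directory path is a placeholder; attribute names
# follow the kapture data model.
from kapture.io.csv import kapture_from_dir

data = kapture_from_dir("/path/to/my_dataset")  # hypothetical path

# A kapture object groups sensors, poses, and image records,
# which is what visual localization / SfM pipelines consume.
print(data.sensors)       # camera definitions
print(data.trajectories)  # poses, if present in the dataset
```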
Data mixing strategies that can be computed on the fly with minimal computational overhead, yielding highly transferable visual representations.
Benchmark associated with the 3DV 2020 paper of the same name.
Updated photo-realistic synthetic video dataset designed for training and evaluating computer vision models on several video understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
713 YouTube video clips of mimed actions covering a subset of 50 classes from the Kinetics-400 dataset.
Targets challenges such as varying lighting conditions and different occlusion levels, for tasks including depth estimation, instance segmentation, and visual localization.
585 samples (1006 sentences) randomly selected and annotated following the SemEval-2016 annotation guidelines for the restaurant domain.