Model checkpoints, fairseq modules to decode from those models, the test splits used in the papers, and translation outputs from our models.
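For reference, a minimal sketch of loading one of the released checkpoints and decoding with fairseq's Python hub interface; the directory layout, file names, and BPE settings below are placeholders and should be replaced by those of the actual release:

```python
from fairseq.models.transformer import TransformerModel

# All paths below are hypothetical placeholders for the released files.
model = TransformerModel.from_pretrained(
    "checkpoints/",                       # directory containing the checkpoint
    checkpoint_file="model.pt",           # released model checkpoint
    data_name_or_path="data-bin/",        # binarized data dir with the dictionaries
    bpe="sentencepiece",                  # subword scheme used at training time
    sentencepiece_model="checkpoints/spm.model",
)
model.eval()
print(model.translate("Hello, world!"))   # decode a single sentence
```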
Robots waiting for the elevator
We train a variety of models on only 125 procedurally generated, expert-annotated scenes to test the impact of the proposed feature maps. In our ablation study, the feature maps improve the models' performance and their generalization to non-synthetic, real scenes.
LPOSS
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs).
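As a rough illustration of the training-free, VLM-based idea (not the LPOSS algorithm itself), the sketch below zero-shot classifies coarse image crops against free-form class prompts with an off-the-shelf CLIP model; the image path and class list are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Naive training-free baseline: classify crops with CLIP text prompts.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
classes = ["a photo of a dog", "a photo of grass", "a photo of sky"]  # open vocabulary

image = Image.open("example.jpg")          # hypothetical input image
w, h = image.size
labels = torch.zeros(4, 4, dtype=torch.long)
for i in range(4):                         # coarse 4x4 grid of crops
    for j in range(4):
        crop = image.crop((j * w // 4, i * h // 4, (j + 1) * w // 4, (i + 1) * h // 4))
        inputs = processor(text=classes, images=crop, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # (1, num_classes)
        labels[i, j] = logits.argmax(-1).item()
print(labels)                              # coarse open-vocabulary label map
```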
DUNE
A single encoder that unifies different foundation models excelling in 2D vision, 3D understanding, and 3D human perception. Code accompanies the CVPR 2025 paper.
Speech-MASSIVE
Covers 12 languages from different families and inherits the intent-prediction and slot-filling annotations from the original MASSIVE dataset. See also the Interspeech 2024 paper.
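A minimal sketch of loading the corpus with the Hugging Face `datasets` library; the dataset identifier, language config, and split name are assumptions, so check the release page for the exact values:

```python
from datasets import load_dataset

# "FBK-MT/Speech-MASSIVE", the "fr-FR" config, and the split name are assumed identifiers.
ds = load_dataset("FBK-MT/Speech-MASSIVE", "fr-FR", split="validation")
example = ds[0]
print(example.keys())                     # audio plus the intent / slot annotations inherited from MASSIVE
print(example["audio"]["sampling_rate"])  # speech is stored as a decoded audio column
```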
mHuBERT-147
A promising compact model for speech processing pipelines, offering an unprecedented balance between high performance and parameter efficiency. Developed within the EU UTTER project.
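A minimal sketch of extracting speech features with the model through the `transformers` library; the checkpoint identifier is an assumption (and presumes a transformers-compatible release), and the dummy waveform stands in for real 16 kHz speech:

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Model id is an assumption -- check the release page for the exact checkpoint name.
model_id = "utter-project/mHuBERT-147"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio standing in for real speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_dim) speech features
print(hidden.shape)
```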
Models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.
A codebase to evaluate the robustness and uncertainty properties of semantic segmentation models as implemented in the CVPR 2024 paper.
A PyTorch research codebase to replicate the CVPR 2022 paper.
Kapture is a file format as well as a set of tools for manipulating datasets, and in particular Visual Localization and Structure from Motion data.
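A minimal sketch of reading a dataset already converted to the kapture format with the kapture Python package; the dataset path is a placeholder:

```python
import kapture.io.csv as kcsv

# Path is hypothetical; point it at any dataset already converted to the kapture format.
kdata = kcsv.kapture_from_dir("/path/to/kapture_dataset")
print(kdata.sensors)          # cameras and other sensors
print(kdata.trajectories)     # poses, if present
print(kdata.records_camera)   # image records per timestamp / sensor
```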
Data mixing strategies that can be computed on the fly with minimal computational overhead, yielding highly transferable visual representations.
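As a generic illustration of on-the-fly mixing (standard mixup, not the specific strategies proposed here), the sketch below blends two batches inside the training loop, so no mixed data has to be generated or stored in advance:

```python
import torch

# Generic mixup used purely as an illustration of on-the-fly data mixing.
def mixup(x1, x2, alpha=0.4):
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient
    return lam * x1 + (1.0 - lam) * x2, lam

batch_a = torch.randn(8, 3, 224, 224)  # e.g. one source of images
batch_b = torch.randn(8, 3, 224, 224)  # e.g. another source of images
mixed, lam = mixup(batch_a, batch_b)
print(mixed.shape, float(lam))
```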
Benchmark associated with the 3DV 2020 paper of the same name.
Updated photo-realistic synthetic video dataset designed to train and evaluate computer vision models on several video-understanding tasks: object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation.
713 YouTube video clips of mimed actions covering a subset of 50 classes from the Kinetics-400 dataset.
Targets challenges such as varying lighting conditions and different occlusion levels, for tasks including depth estimation, instance segmentation, and visual localization.
585 samples (1,006 sentences) randomly selected and annotated following the SemEval-2016 annotation guidelines for the restaurant domain.