CODE & DATA

Contributing to the open science community

We publish code and other resources throughout the year to further research, foster collaboration with the global research community and accelerate innovation across different fields of application. Depending on the resource type, it may be hosted on Naver Hugging Face and/or Naver GitHub platforms or occasionally on our own servers.

Multi-HMR + Anny

A Multi-HMR checkpoint trained with the Anny body model.

Monocular multi-person Human Mesh Recovery in a unified, open and interpretable representation. Anny is our open-source (Apache 2.0) parametric 3D human model.

3D vision, Computer vision, Human understanding

Anny

3D human parametric model

An open-source (Apache 2.0) parametric 3D human model that covers a wide spectrum of humanity, from infants to elders, with interpretable and easy-to-use controls.

3D vision, Computer vision, Human understanding, Synthetic data

StarDrinks

English and Korean test set for SLU evaluation in a drink ordering scenario

The dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation. To download this dataset you need to submit a request online.

Data, NLP, Speech

Anny-One

A large-scale synthetic dataset for 3D human understanding

Multi-person configurations, multi-view setups with extreme camera positions, diverse cameras, viewpoints, and FOVs, indoor 3D scenes and full Anny annotations. Designed for Human Mesh Recovery in challenging settings.

3D vision, Computer vision, Human understanding, Synthetic data

LUDVIG

Learning-free Uplifting of 2D Visual features to Gaussian splatting scenes

A simple yet effective aggregation technique applied to semantic masks from Segment Anything (SAM) and to generic DINOv2 features, integrating 3D scene geometry through graph diffusion (ICCV 2025 paper).

3D vision, Computer vision

Spire

A speech-augmented LLM

Spire transcribes English speech and translates it into 10 other languages, and also translates text input in both language directions. Spire is an output of the EU project UTTER (Unified Transcription and Translation for Extended Reality).

Foundation models, LLM, Machine translation, NLP

Hugging Face link

Annotated scenes in HRI: Robots waiting for the elevator

Integrating social norms in a low-data regime goal selection problem

This dataset of 125 procedurally generated, expert-annotated scenes accompanies the RO-MAN 2025 paper ‘Robots waiting for the elevator: integrating social norms in a low-data regime goal selection problem’.

Data, Human understanding

Pow3R

EmPOWering unconstrained 3D reconstruction with camera and scene priors

A novel large 3D vision regression model that is highly versatile in the input modalities it accepts: alongside input images, it can incorporate any combination of auxiliary information, such as intrinsics or relative pose, within a single network. Builds upon the DUSt3R paradigm.

3D vision, Foundation models

MUSt3R

Multi-view Network for Stereo 3D Reconstruction

The latest DUSt3R-based model. MUSt3R is symmetric and enables online predictions of the camera pose and 3D structure of a collection of images by using a multi-layer memory mechanism. Works online and offline.

3D vision, Foundation models

LPOSS

Label Propagation over patches and pixels for Open-vocabulary Semantic Segmentation

A training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs).

Computer vision, Data, LLM

LLM-as-a-qualitative-judge

Automating error analysis in natural language generation

LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 of cases and can produce error type reports resembling those composed by human annotators.

DUNE

Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers

A unified encoder of different foundation models excelling in 2D vision, 3D understanding, and 3D human perception. Code accompanies the CVPR 2025 paper.

3D vision, Computer vision, Data, Foundation models, Human understanding

GUARD

Guaranteed Generation from Large Language Models

A principled approach to enforcing strict guarantees for LLMs without compromising their generative capabilities, combining an autoregressive proposal distribution with rejection sampling.
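The combination of a proposal distribution with rejection sampling can be illustrated with a minimal sketch. This is not the GUARD implementation: the toy vocabulary, the uniform random proposal standing in for an LLM, the keyword constraint and all function names are assumptions made purely for illustration. The idea shown is that candidates are drawn from a proposal and only those satisfying the constraint are returned, so every returned sample meets the constraint.

```python
import random
from typing import Callable, List, Optional

# Hypothetical toy vocabulary; a real system would sample from an LLM instead.
VOCAB = ["the", "robot", "waits", "please", "now", "."]


def make_toy_proposal(seed: int, length: int = 5) -> Callable[[], List[str]]:
    """Return a sampler that draws token sequences uniformly from VOCAB."""
    rng = random.Random(seed)

    def propose() -> List[str]:
        return [rng.choice(VOCAB) for _ in range(length)]

    return propose


def contains_keyword(tokens: List[str]) -> bool:
    """Illustrative hard constraint: the output must contain 'please'."""
    return "please" in tokens


def guaranteed_sample(
    propose: Callable[[], List[str]],
    constraint: Callable[[List[str]], bool],
    max_tries: int = 1000,
) -> Optional[List[str]]:
    """Rejection sampling: draw from the proposal until the constraint holds.

    Returns None if no candidate is accepted within max_tries.
    """
    for _ in range(max_tries):
        candidate = propose()
        if constraint(candidate):
            return candidate
    return None


sample = guaranteed_sample(make_toy_proposal(seed=0), contains_keyword)
print(sample)
```

The closer the proposal is to the constrained target distribution, the fewer candidates are rejected, which is why the choice of proposal matters for efficiency.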

OSCAR

Online Soft Compression And Reranking

A novel query-dependent online soft compression method for RAG that reduces computational overhead while preserving performance. Unlike traditional hard compression methods, which shorten retrieved texts, or soft compression approaches, which map documents to continuous embeddings offline, OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates.

Information retrieval, LLM

Hugging Face link

Geo4D

Leveraging video generators for Geometric 4D scene reconstruction

Geo4D is a method that repurposes video diffusion models for monocular 3D reconstruction of dynamic scenes, including long videos.

3D vision, Computer vision

PoseEmbroider

Towards 3D, visual, semantic-aware human pose representation

A transformer-based multi-modal alignment retrieval model that processes 3D poses, pictures of people and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation that can handle partial information (e.g. an image with the lower body occluded). Code accompanies the ECCV 2024 paper.

3D vision, Computer vision, Human understanding, NLP

GOAL

A Generalist Combinatorial Optimization Agent Learning

Code to learn to solve 16 combinatorial optimization problems with strong transfer learning capacities.

mRAG

Retrieval-augmented generation in multilingual settings

Retrieval-augmented generation (RAG) in the multilingual setting (mRAG). Our findings highlight that, despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjusting for the multilingual setting to account for variations in the spelling of named entities.

Foundation models, Information retrieval, Neural retrieval

Speech-MASSIVE

A multilingual Spoken Language Understanding (SLU) dataset

Covers 12 languages from different families and inherits from the original MASSIVE dataset the annotations for the intent prediction and slot filling tasks. See also the Interspeech 2024 paper.

Data, LLM, NLP, Speech

Hugging Face link

UNIC

Universal Classification Models via Multi-teacher Distillation

General encoder for classification. Accompanies the ECCV 2024 paper.

Computer vision, Visual representation learning

DEBiT (Dual Encoder Binocular Transformer)

Correspondence Pretext Tasks for Goal-oriented Visual Navigation

An end-to-end trained agent for image goal navigation. Accompanies the ICLR 2024 paper ‘End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon’.

Computer vision, Foundation models

BERGEN: benchmarking RAG

A Benchmarking Library for Retrieval-Augmented Generation

Designed to ease the reproducibility and integration of new datasets and models and identify strong baselines.

Neural retrieval, NLP

ELITR-Bench

A benchmark for the evaluation of long-context LLMs on meeting transcripts.

The meeting data used in this benchmark originally comes from the ELITR dataset. This dataset and experiments are described in the paper and are an output of the EU UTTER project.

Data, Foundation models, LLM, Speech

Pasero

Lightweight PyTorch framework for training and running text generation models.

Can be used for machine translation, speech translation, language modeling and dialogue, and supports a number of popular pre-trained models.

Foundation models, LLM, Machine translation, NLP

SHiNe

Semantic Hierarchy Nexus for Open-vocabulary Object Detection

A novel classifier that uses semantic knowledge from class hierarchies. Can be seamlessly integrated with any off-the-shelf OvOD detector, with no additional computational overhead during inference.

Computer vision

ZLaP

A classification method based on label propagation (LP) that utilizes geodesic distances.

Code that accompanies the CVPR 2024 paper ‘Label Propagation for Zero-shot Classification with Vision-Language Models’.

Computer vision
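Label propagation, the idea underlying ZLaP above, can be sketched on a toy graph. The snippet below is not the ZLaP method and does not use geodesic distances: it implements the classic iterative scheme F <- alpha * S * F + (1 - alpha) * Y on a small hand-written graph, where S is the symmetrically normalized adjacency matrix and Y holds the seed labels. The graph, seed labels, parameter values and function name are all assumptions chosen for illustration.

```python
from typing import List


def propagate_labels(
    adjacency: List[List[float]],
    seeds: List[List[float]],
    alpha: float = 0.8,
    iterations: int = 100,
) -> List[int]:
    """Classic iterative label propagation on a weighted undirected graph.

    adjacency: symmetric matrix of non-negative edge weights.
    seeds: one row per node, one column per class (1.0 marks a labeled node).
    Returns the predicted class index for every node.
    """
    n = len(adjacency)
    num_classes = len(seeds[0])
    degrees = [sum(row) for row in adjacency]

    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    norm = [
        [
            adjacency[i][j] / (degrees[i] * degrees[j]) ** 0.5
            if adjacency[i][j] > 0
            else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]

    scores = [row[:] for row in seeds]
    for _ in range(iterations):
        # Mix scores diffused from neighbors with the original seed labels.
        scores = [
            [
                alpha * sum(norm[i][j] * scores[j][c] for j in range(n))
                + (1 - alpha) * seeds[i][c]
                for c in range(num_classes)
            ]
            for i in range(n)
        ]
    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Toy graph: two 3-node chains (0-1-2 and 3-4-5) joined by a weak edge 2-3.
W = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]
# Only node 0 (class 0) and node 5 (class 1) are labeled.
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

predictions = propagate_labels(W, Y)
print(predictions)
```

In this toy setup each seed label spreads through its own chain, so nodes 0 to 2 receive class 0 and nodes 3 to 5 receive class 1.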
