A Universal Encoder for Embodied Perception

Modern robotic systems deployed in unconstrained environments typically have to process visual information across multiple tasks simultaneously. These capabilities, which range from recognition and localization to 3D understanding, are powered by separate models, each with its own visual encoder and task-specific decoder. As the number of tasks grows, running multiple large models becomes computationally demanding and often requires the kind of resources only available in the cloud. Yet, despite being trained for different tasks, these encoders usually share the same architecture. The work described here began with the observation that this computational redundancy in the encoder could be reduced, making it feasible to run multiple visual perception tasks directly on a robot.
Replacing these multiple encoders with a universal one has a further advantage: it provides a single shared output representation that feeds all the perception tasks. Individual, pre-trained perception models perform remarkably well, but because they are inherently specialized, each captures different aspects of visual understanding, often making them complementary to one another. Combining their strengths offers an opportunity to build more generic representations that generalize more robustly across tasks, domains and environments. A universal encoder therefore not only simplifies system design; it also reduces memory and computational requirements and helps the system generalize better.
Learning a universal encoder through multi-teacher distillation
To leverage this complementarity while reducing computational cost, we adopted a multi-teacher distillation approach. Multiple, strong pre-trained encoders act as teachers and their knowledge is distilled into a single student model. This results in a universal encoder that integrates the strengths of diverse models while remaining compact and efficient. The task-specific decoders are then fine-tuned to remain compatible with this new shared encoder, which replaces all the original task-specific ones.
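The multi-teacher distillation described above can be sketched as follows. This is a minimal illustration, not the actual UNIC/DUNE training code: all module and variable names are hypothetical, the frozen teachers are assumed to expose patch-level features, and a cosine distillation loss stands in for whatever objective the papers use.

```python
import torch
import torch.nn as nn

class MultiTeacherDistiller(nn.Module):
    """Distill several frozen teacher encoders into one student (illustrative)."""

    def __init__(self, student, teachers, student_dim, teacher_dims):
        super().__init__()
        self.student = student
        self.teachers = nn.ModuleList(teachers)
        for t in self.teachers:
            t.requires_grad_(False)  # teachers stay frozen during distillation
        # One lightweight projector per teacher maps student features
        # into that teacher's feature space.
        self.projectors = nn.ModuleList(
            nn.Linear(student_dim, d) for d in teacher_dims
        )

    def forward(self, images):
        feats = self.student(images)              # (B, N, student_dim)
        loss = 0.0
        for teacher, proj in zip(self.teachers, self.projectors):
            with torch.no_grad():
                target = teacher(images)          # (B, N, teacher_dim)
            pred = proj(feats)
            # Cosine distillation loss, averaged over patch tokens.
            loss = loss + (
                1 - nn.functional.cosine_similarity(pred, target, dim=-1)
            ).mean()
        return loss / len(self.teachers)
```

Only the student and the small per-teacher projectors receive gradients, so the result is a single compact encoder whose features can be mapped to each teacher's representation space.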
Our first model based on this idea, UNIC, was distilled from four complementary teacher models. It outperforms each of them on several image-level and patch-level classification tasks.
We extended this approach with DUNE, a universal visual encoder trained across both 2D and 3D perception tasks. The DUNE encoder is distilled from DINOv2, Multi-HMR and MASt3R, and it shines on a number of semantic and 3D reasoning tasks: depth estimation, semantic segmentation, map-free 3D relocalization and human-mesh recovery, all within a single unified framework.
Ongoing work extends distillation to more teachers and evaluation to even more diverse tasks, such as multi-view consistent panoptic segmentation (PanSt3R) and even Visual Question Answering (VQA).
What this enables
A universal encoder allows multiple perception tasks to share the same visual representation, eliminating redundant computation. For example, in the DUNE setup, replacing three ViT-Large encoders with a single ViT-Base encoder reduces memory usage for encoding by roughly 90%.
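The roughly 90% figure can be checked with a back-of-the-envelope calculation using standard published parameter counts for ViT backbones (ViT-Large has about 307M parameters, ViT-Base about 86M):

```python
# Approximate parameter counts for standard ViT backbones.
vit_large_params = 307e6  # ViT-Large, ~307M parameters
vit_base_params = 86e6    # ViT-Base,  ~86M parameters

before = 3 * vit_large_params  # three task-specific ViT-Large encoders
after = vit_base_params        # one shared ViT-Base encoder

reduction = 1 - after / before
print(f"{reduction:.0%}")  # -> 91%
```

The exact saving in practice also depends on activation memory and precision, but the parameter count alone already accounts for the order of magnitude.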
This makes systems faster, more efficient and compact enough to run directly on robotic platforms. In the DUNE setup, this reduction at the encoder level translates into an overall system that is 4x faster, and the more tasks you add, the greater the gain in overall efficiency.
Additionally, once such a generic encoder is in place, new capabilities can be added simply by attaching additional decoders, without having to retrain the encoder.
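Concretely, attaching a new capability amounts to training a decoder head on top of the frozen shared encoder. The sketch below is illustrative only (the class and parameter names are hypothetical, and a single linear layer stands in for a real task decoder):

```python
import torch
import torch.nn as nn

class NewTaskHead(nn.Module):
    """A hypothetical new decoder trained on top of a frozen shared encoder."""

    def __init__(self, encoder, feat_dim, num_outputs):
        super().__init__()
        self.encoder = encoder
        self.encoder.requires_grad_(False)       # shared encoder stays frozen
        self.decoder = nn.Linear(feat_dim, num_outputs)  # only this is trained

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)         # shared representation
        return self.decoder(feats)
```

Because the encoder is frozen, training the new head touches only the decoder parameters, and all previously attached decoders keep working unchanged.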

Related publications and code
UNIC: Universal Classification Models via Multi-teacher Distillation, ECCV 2024
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers, CVPR 2025
