A Universal Encoder for Embodied Perception


Modern robotic systems deployed in unconstrained environments usually have to process visual information across multiple tasks simultaneously. These capabilities, which range from recognition and localization to 3D understanding, are powered by separate models, each with its own visual encoder and task-specific decoder. As the number of tasks grows, running multiple large models becomes computationally demanding and often requires the kind of resources only available on the cloud. Yet, despite being trained for different tasks, these encoders usually share the same architecture. The work described here originated in identifying this redundancy and realising the potential to reduce it in the encoder, so that multiple visual perception tasks could run on a robot.

Replacing these multiple encoders with a universal one has a further advantage: a single shared encoder output representation feeds all the perception tasks. Individual, pre-trained perception models perform remarkably well but, because they're inherently specialized, each captures different aspects of visual understanding, often making them complementary to one another. Combining their strengths provides an opportunity to build more generic representations that generalize more robustly across tasks, domains and environments. A universal encoder therefore not only simplifies the system design, it also reduces memory and computational requirements while helping the system generalize better.

Learning a universal encoder through multi-teacher distillation

To leverage this complementarity while reducing computational cost, we adopted a multi-teacher distillation approach. Multiple, strong pre-trained encoders act as teachers and their knowledge is distilled into a single student model. This results in a universal encoder that integrates the strengths of diverse models while remaining compact and efficient. The task-specific decoders are then fine-tuned to remain compatible with this new shared encoder, which replaces all the original task-specific ones.
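To make the idea concrete, here is a minimal numpy sketch of a multi-teacher distillation objective. The dimensions, the cosine-distance loss and the per-teacher projection heads are illustrative assumptions, not the exact UNIC/DUNE recipe: the shared student features are mapped into each teacher's feature space by a lightweight head, and the per-teacher losses are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): one student, three teachers with
# potentially different embedding sizes.
D_STUDENT = 64
TEACHER_DIMS = [96, 128, 96]
N_PATCHES = 16                 # patch tokens for one image

def cosine_distill_loss(student_feats, teacher_feats, projection):
    """Mean (1 - cosine similarity) between projected student and teacher features."""
    projected = student_feats @ projection          # (N, D_teacher)
    p = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=1)))

# One student feature map per image; one frozen feature map per teacher.
student = rng.normal(size=(N_PATCHES, D_STUDENT))
teachers = [rng.normal(size=(N_PATCHES, d)) for d in TEACHER_DIMS]

# Only these small projection heads are teacher-specific; the student
# backbone itself is shared across all teachers.
heads = [rng.normal(size=(D_STUDENT, d)) / np.sqrt(D_STUDENT) for d in TEACHER_DIMS]

# The total distillation objective averages the per-teacher losses; in
# training, gradients of this loss would update the student and the heads.
total_loss = float(np.mean([cosine_distill_loss(student, t, h)
                            for t, h in zip(teachers, heads)]))
print(total_loss)
```

The key design point this illustrates is that the student keeps a single shared representation: teacher-specific capacity lives only in the small projection heads, which can be discarded or kept after distillation.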

Our first model based on this idea, UNIC, was distilled from four complementary teacher models. It outperforms each of them on several image-level and patch-level classification tasks.

We extended this approach with DUNE, a universal visual encoder trained across both 2D and 3D perception tasks. The DUNE encoder is distilled from DINOv2, Multi-HMR and MASt3R, and it shines on a number of semantic and 3D reasoning tasks: depth estimation, semantic segmentation, map-free 3D relocalization and human-mesh recovery, all within a single unified framework.

Ongoing work is currently extending distillation to more teachers and evaluation to even more diverse tasks, such as multi-view consistent panoptic segmentation (PanSt3R) and even Visual Question Answering (VQA).

This video illustrates how our universal encoder can be used across multiple tasks leading to a novel, highly efficient system that can run on a robot.

What this enables

A universal encoder allows multiple perception tasks to share the same visual representation, eliminating redundant computation. For example, in the DUNE setup, replacing three ViT-Large encoders with a single ViT-Base encoder reduces memory usage for encoding by 90%.
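The 90% figure follows directly from approximate backbone sizes. Using the commonly cited parameter counts for standard ViT variants (roughly 304M for ViT-Large and 86M for ViT-Base; exact counts vary slightly with configuration):

```python
# Approximate parameter counts, in millions, for standard ViT backbones.
VIT_LARGE_M = 304   # ViT-Large (approximate)
VIT_BASE_M = 86     # ViT-Base (approximate)

separate = 3 * VIT_LARGE_M   # three task-specific ViT-L encoders
shared = VIT_BASE_M          # one universal ViT-B encoder

reduction = 1 - shared / separate
print(f"{reduction:.0%}")    # prints 91%, i.e. roughly the 90% quoted above
```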

This makes systems faster, more efficient and compact enough to run directly on robotic platforms. In the DUNE setup, this reduction at the encoder level translates into an overall system that is 4x faster, and the efficiency gain grows as the number of tasks increases.

Additionally, once such a generic encoder is in place, new capabilities can be added simply by attaching additional decoders, without having to retrain the encoder.
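The following toy sketch illustrates that last point: adding a capability means fitting only a new decoder on top of frozen encoder features. The encoder stand-in, the data and the closed-form linear decoder are all illustrative assumptions (a real system would train a neural decoder), but the structure is the same: the encoder's weights are never touched.

```python
import numpy as np

rng = np.random.default_rng(1)

def frozen_encoder(images):
    """Stand-in for the shared universal encoder (weights never updated)."""
    # A fixed deterministic projection plays the role of pretrained features.
    W = np.linspace(-1.0, 1.0, images.shape[1] * 32).reshape(images.shape[1], 32)
    return np.tanh(images @ W)

# Toy data for a *new* task the encoder was never trained on.
X = rng.normal(size=(200, 48))
features = frozen_encoder(X)
true_w = rng.normal(size=(32, 1))
y = features @ true_w + 0.01 * rng.normal(size=(200, 1))

# Adding the capability = fitting only a small decoder on frozen features
# (here a linear head via least squares, for simplicity).
decoder_w, *_ = np.linalg.lstsq(features, y, rcond=None)

mse = float(np.mean((features @ decoder_w - y) ** 2))
print(mse)  # near the noise floor: the new task is learned without retraining the encoder
```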

Universal Encoder vs Separate Models
Figure 1: The DUNE setup (3 ViT-L teachers and 4 evaluation tasks): the original system using separate encoders for each task (top) vs. the updated system using our universal encoder (bottom). With separate models, each encoder produces its own visual representation; our approach computes a single representation that can be fed to multiple decoders. This yields a 90% reduction in encoder memory footprint and a 12x speed-up for the encoding part. Overall, this translates into a 62% reduction in total memory and a system that is 4x faster.
