Lifelong Learning for Visual Representation

Visual representation learning (VRL) is a core component of computer vision and fundamental to the perception tools deployed on robots so they can understand and interact with their environment. As robotic environments and tasks evolve, the visual perception modules need to be updated; this need for plasticity in the visual perception of robotic platforms is what drives our research in lifelong learning. Our ultimate goal is to design robots that continuously enrich their models and skills when deployed in new environments and/or when taking part in new interactions.
Lifelong learning (also known as continual learning) encompasses many research questions and concrete application challenges. Our work can be roughly organized along three complementary research lines that span the lifecycle of visual perception models: learning strong initial representations, adapting them over time, and consolidating multiple models into unified ones.
- Pre-training strong models. Lifelong learning benefits from intervention at the pre-training stage, so our strategies aim to start from the strongest possible models. In this line of work, we produce models that generalize better when exposed to new tasks or domains.
- Adapting models efficiently while mitigating catastrophic forgetting. We develop efficient adaptation techniques so that our models can adapt to new data and new tasks, whilst retaining what was learned from previous data and tasks.
- Towards universal visual encoders. Pre-trained models have become a commodity and, individually, they obtain strong results on a broad range of tasks. They also tend to be complementary. We aim to reconcile multiple models into a single, universal one with broader applicability.
1: Pre-training strong models
The early success of deep learning in computer vision was mostly fuelled by large quantities of data annotated with fine-grained labels. This led to strong models that were used to initialise dedicated ones for most computer vision tasks. The data and, more importantly, the labels have always been difficult to acquire. To overcome this limitation, we’ve explored ways to train visual representations with self-supervision, such as our MoCHi method, or with weak supervision, such as with ICMLM, which pioneered the learning of visual representations from scratch using textual image descriptions. When labels are available, we’ve used them to help pre-train large models: our generalization benchmarks such as CoG have shown that self-supervised models tend to generalize better, whereas our t-ReX model is an example of a supervised model with good generalization properties. Image generation methods have progressed enormously, and we’ve been researching how they can help overcome the lack of raw training data. Leveraging recent models that can create realistic images from simple textual prompts, we’ve explored whether large synthetic datasets could replace real ones at the pre-training stage (ImageNetSD).
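To make the self-supervised contrastive idea concrete, here is a minimal NumPy sketch of hard-negative mixing in a contrastive (InfoNCE-style) loss, in the spirit of MoCHi. This is an illustration of the general technique, not the published implementation; function names, the temperature value, and the mixing scheme are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project features onto the unit sphere, as is standard in contrastive learning.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(q, pos, negs, tau=0.1):
    # InfoNCE loss for one query embedding against one positive and a bank of negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / tau
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # the positive sits at index 0

def mix_hard_negatives(q, negs, n_synth=4, rng=None):
    # Synthesize extra hard negatives as convex combinations of the negatives
    # most similar to the query (illustrative version of hard-negative mixing).
    rng = np.random.default_rng(0) if rng is None else rng
    hardest = negs[np.argsort(negs @ q)[::-1][: n_synth + 1]]
    lam = rng.uniform(size=(n_synth, 1))
    mixed = lam * hardest[:-1] + (1 - lam) * hardest[1:]
    return l2_normalize(mixed)
```

Adding the mixed negatives enlarges the denominator of the softmax, making the pretext task harder and, in MoCHi's findings, yielding representations that transfer better.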
2: Adapting models while mitigating catastrophic forgetting
Once we’ve pre-trained strong and versatile models, the question of how to use them for specific tasks arises, and this often requires adaptation. For the task of visual localization, we’ve designed an adapter-based architecture, Grappa, which performs well for both indoor and outdoor retrieval. We’ve also explored how to adapt large, generic pretrained models to specialized tasks using distillation (Distill). Finally, we’ve shown how large image-level pre-trained models can be leveraged to enhance 3D representations with semantic information (such as N3F and LUDVIG). The term catastrophic forgetting describes the tendency of neural networks, after being updated, to forget previous training. These updates can come with a distribution shift, such as moving from daylight to night scenes. This shift can occur abruptly across batches of data or gradually, as illustrated in OASIS. We’ve also explored class-incremental learning for semantic segmentation in RaSP and novel class discovery for detection in PANDAS, and we’re interested in many more scenarios.
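One common way adapter-based methods reconcile adaptation with forgetting is to freeze the pretrained backbone and train only small residual modules. The sketch below shows a generic bottleneck adapter in NumPy; it is a textbook illustration of the idea, not the architecture used in Grappa, and all names and initialization choices are assumptions.

```python
import numpy as np

class BottleneckAdapter:
    """Residual bottleneck adapter: only these few parameters are trained,
    while the large pretrained backbone stays frozen, which limits forgetting."""

    def __init__(self, dim, bottleneck, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.down = rng.normal(scale=0.02, size=(dim, bottleneck))  # down-projection
        self.up = np.zeros((bottleneck, dim))  # zero-init: adapter starts as identity

    def __call__(self, x):
        h = np.maximum(x @ self.down, 0.0)     # down-project + ReLU
        return x + h @ self.up                 # residual: backbone features preserved
```

Because the up-projection is zero-initialized, the adapted model initially reproduces the frozen backbone exactly; adaptation then only adds a small learned correction on top of it.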
3: Towards universal visual encoders
Pre-trained models have become widely available and offer strong results on a broad range of tasks. These models tend to be complementary, and any method that can combine them could lead to even stronger generalization across a variety of tasks. In robotics applications, multiple components need to run in parallel to simultaneously perform a large variety of tasks, yet each typically relies on its own encoder to produce visual representations. Replacing these encoders with a single universal one, generic enough to support all target tasks while inheriting the strengths of the models it was trained on, would therefore be of high practical interest. In this line of work, we focus on distilling multiple strong and complementary encoders into a single model, which would significantly reduce the overall computational cost. Our first universal encoder, UNIC, was distilled from four strong teacher models and performs better than each of them on several classification tasks. We’ve recently extended this approach to heterogeneous tasks beyond classification, using some of the best available visual representations as teachers. Again applying multi-teacher distillation, we trained DUNE, a universal visual encoder that excels in 2D vision, 3D understanding, and 3D human perception.
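The core mechanism of multi-teacher distillation can be sketched as follows: a shared student encoder is trained so that lightweight per-teacher heads can reproduce each teacher's features. This NumPy sketch uses a simple L2 matching loss for illustration; the actual objectives and heads in UNIC and DUNE differ, and all names here are assumptions.

```python
import numpy as np

def multi_teacher_distill_loss(student_feat, teacher_feats, projections):
    """Average of per-teacher regression losses: a separate linear head projects
    the shared student features into each teacher's embedding space."""
    total = 0.0
    for t_feat, W in zip(teacher_feats, projections):
        pred = student_feat @ W                 # project into this teacher's space
        total += np.mean((pred - t_feat) ** 2)  # e.g. an L2 feature-matching loss
    return total / len(teacher_feats)
```

At inference time, the per-teacher heads can be dropped and the single student encoder serves all downstream tasks, so only one backbone needs to run on the robot.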
Related publications
- [LUDVIG] LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian Splatting scenes, ICCV 2025
- [DUNE] DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers, CVPR 2025
- [LPOSS] LPOSS: Label Propagation over patches and pixels for Open-vocabulary Semantic Segmentation, CVPR 2025
- [UNIC] UNIC: Universal Classification Models via Multi-teacher Distillation, ECCV 2024
- [Distill] On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models, TMLR 2024
- [RaSP] RaSP: Relation-aware Semantic Prior for weakly supervised incremental segmentation, CoLLAs 2023
- [ImageNetSD] Fake it till you make it: learning transferable representations from synthetic ImageNet clones, CVPR 2023
- [t-ReX] No reason for no supervision: improved generalization in supervised models, ICLR 2023
- [Grappa] Granularity-aware adaptation for image retrieval over multiple tasks, ECCV 2022
- [N3F] Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations, 3DV 2022
- [OASIS] On the road to online adaptation for semantic image segmentation, CVPR 2022
- [CoG] Concept generalization in visual representation learning, ICCV 2021
- [MoCHi] Hard negative mixing for contrastive learning, NeurIPS 2020
- [ICMLM] Learning visual representations with caption annotations, ECCV 2020
