Lifelong Learning for Visual Representation
Visual representation learning (VRL) is a core component of computer vision and is fundamental to the perception tools deployed on robots so that they can understand and interact with their environment. As robot environments and tasks evolve, the visual perception modules need to be updated, and this need for plasticity in the visual perception of robotic platforms is what drives our research in lifelong learning. Our ultimate goal is to design robots that continuously enrich their models and skills when deployed in new environments and/or when taking part in new interactions.
Lifelong learning (also known as continual learning) encompasses many research questions and concrete applicative challenges. Our work can be roughly organized along the following research lines:
- Starting from the strongest possible model. Lifelong learning can be improved by acting at the pretraining stage, so our strategies are designed to produce models that generalize better when exposed to new tasks or new domains.
- Adapting models efficiently while mitigating catastrophic forgetting. We develop efficient adaptation techniques so that our models can adapt to new data and new tasks, whilst retaining what was learned from previous data and tasks.
- Leveraging generative AI. When prompted with text, recent tools can automatically generate new images from scratch or alter existing ones. We leverage these tools to improve and extend existing perception pipelines.
- Towards universal models. Pretrained models have become a commodity and, individually, they obtain strong results on a broad range of tasks. They also tend to be complementary. We aim to reconcile multiple models into a single, universal one with broader applicability.
1: Pretraining strong models
Until recently, the success of deep learning in computer vision was mostly fueled by large quantities of data annotated with fine-grained labels. This led to strong models which were used to initialise dedicated ones for most computer vision tasks. The data and, more importantly, the labels have always been difficult to acquire. To overcome this limitation, we’ve explored ways to train visual representations with self-supervision, such as our MoCHi method (Mixing of Contrastive Hard negatives), or with weak supervision. On the weak-supervision side, back in 2020 we pioneered learning visual representations from scratch using the textual captions/descriptions that accompany images, with ICMLM (Image-Conditioned Masked Language Modeling). This idea has since been pushed to much larger scales in large pretrained multimodal models such as CLIP. When labels are available, we’ve used them to help pretrain large models. Our generalization benchmarks such as CoG (Concept Generalization in VRL) have shown that self-supervised models tend to generalize better, whereas our recent t-ReX model is an example of a supervised model with good generalization properties.
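To make the self-supervised side more concrete, here is a minimal PyTorch sketch of contrastive learning with synthesized hard negatives, in the spirit of MoCHi: the negatives most similar to the query are mixed together to create extra, harder negatives for the InfoNCE loss. The function name, tensor shapes and hyperparameters (`n_mix`, `tau`) are illustrative assumptions, not the exact MoCHi recipe from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_with_mixed_negatives(q, k_pos, queue, n_mix=64, tau=0.2):
    """InfoNCE loss with extra negatives synthesized by mixing hard ones.

    q:      (B, D) query embeddings (L2-normalised)
    k_pos:  (B, D) positive key embeddings (L2-normalised)
    queue:  (N, D) memory bank of negative embeddings (L2-normalised)
    """
    # Similarity of each query to every negative in the memory bank.
    sim_neg = q @ queue.t()                                  # (B, N)

    # Select the hardest negatives (most similar to the query) ...
    hard_idx = sim_neg.topk(n_mix, dim=1).indices            # (B, n_mix)
    hard = queue[hard_idx]                                   # (B, n_mix, D)

    # ... and mix random pairs of them to create new, synthetic negatives.
    perm = torch.randperm(n_mix, device=q.device)
    alpha = torch.rand(q.size(0), n_mix, 1, device=q.device)
    mixed = F.normalize(alpha * hard + (1 - alpha) * hard[:, perm], dim=-1)

    # Logits over: the positive, the real negatives, the synthetic negatives.
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)             # (B, 1)
    l_mix = torch.einsum("bd,bmd->bm", q, mixed)             # (B, n_mix)
    logits = torch.cat([l_pos, sim_neg, l_mix], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

The synthetic negatives sit closer to the decision boundary than randomly sampled ones, which is what makes the contrastive task harder and the resulting representation stronger.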
2: Adapting models while mitigating catastrophic forgetting
The term catastrophic forgetting describes the fact that neural networks, after being updated, tend to forget what they learned from previous training. These updates can come with a distribution shift, such as moving from daylight to night scenes. The distribution can vary abruptly across batches of data or change gradually, as illustrated in the OASIS (Online Adaptation for Semantic Image Segmentation) video below. We’ve also explored class-incremental learning for semantic segmentation in RaSP (Relation-aware Semantic Prior for weakly supervised incremental segmentation) and novel class discovery for detection in PANDAS (Prototype-based novel class discovery and detection), and we’re interested in many more scenarios.
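As a simple illustration of what mitigating forgetting can look like in practice, here is a sketch of experience replay, one of the most common continual-learning baselines: a small buffer of past samples is rehearsed alongside new-task data. This is not the method used in OASIS, RaSP or PANDAS; the class and function names are assumptions made for the example.

```python
import random
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Tiny reservoir-sampling buffer of past (image, label) pairs."""
    def __init__(self, capacity=2000):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, x, y):
        for xi, yi in zip(x, y):
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append((xi.cpu(), yi.cpu()))
            else:
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.data[j] = (xi.cpu(), yi.cpu())

    def sample(self, n, device):
        batch = random.sample(self.data, min(n, len(self.data)))
        xs = torch.stack([b[0] for b in batch]).to(device)
        ys = torch.stack([b[1] for b in batch]).to(device)
        return xs, ys

def train_step(model, optimizer, x_new, y_new, buffer, device="cuda"):
    """One update on new-task data, interleaved with replayed old data."""
    model.train()
    loss = F.cross_entropy(model(x_new), y_new)
    if buffer.data:
        # Rehearse previously seen classes to limit forgetting.
        x_old, y_old = buffer.sample(x_new.size(0), device)
        loss = loss + F.cross_entropy(model(x_old), y_old)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.add(x_new, y_new)
    return loss.item()
```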
3: Leveraging generative AI
Image generation methods have progressed enormously, and we’ve been researching how they can help overcome the lack of raw data for training. Leveraging recent models that can create realistic images from simple textual prompts, we’ve explored whether large synthetic datasets could replace real ones at the pretraining stage (ImageNet-SD). We’ve pushed this idea of using generative models further, to complement real data in concrete applications: making image retrieval models more robust to weather and lighting changes with Ret4Loc, and improving the distillation of large, pretrained models into small specialized ones with Distill.
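The basic recipe of building a synthetic training set from class names with an off-the-shelf text-to-image model can be sketched in a few lines with the diffusers library. The checkpoint name, prompt template and file layout below are illustrative assumptions; the ImageNet-SD work uses its own, more elaborate prompting and filtering pipeline.

```python
import os
import torch
from diffusers import StableDiffusionPipeline

# Any public Stable Diffusion checkpoint works; this name is only an example.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

class_names = ["golden retriever", "fire truck", "espresso"]
images_per_class = 4
os.makedirs("synthetic", exist_ok=True)

for name in class_names:
    for i in range(images_per_class):
        # A simple prompt template; richer templates (backgrounds, styles,
        # viewpoints) tend to yield more diverse synthetic training sets.
        prompt = f"a photo of a {name}"
        image = pipe(prompt).images[0]
        image.save(f"synthetic/{name.replace(' ', '_')}_{i:03d}.png")
```

The resulting folder of generated images can then be used like any image dataset, either on its own for pretraining or mixed with real data as augmentation.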
4: Towards universal models
A careful trade-off has to be found between strong, multi-purpose models and highly specialized ones, and this trade-off depends on the application. It is particularly relevant for visual localization, where we designed an adaptor-based architecture called GRAPPA which performs well across multiple scenarios. In robotics applications, multiple components need to run in parallel to simultaneously perform a large variety of tasks, yet they all rely on an encoder that produces visual representations. We’re very interested in distilling multiple strong, complementary encoders into a single one, as this would reduce the overall compute, and we’ve been making recent progress with our universal encoder UNIC, which performs better than each of the four strong individual teachers it is distilled from. We’re very excited about how this could be a key enabler for embodied applications.
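To illustrate the general idea of multi-teacher distillation into a single encoder, here is a minimal sketch: the student has one lightweight projection head per teacher, and its shared representation is trained to match each frozen teacher's features. Class names, the cosine loss and the projection heads are assumptions for the example, not the exact UNIC architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherStudent(nn.Module):
    """Single encoder distilled from several frozen teachers."""
    def __init__(self, student_encoder, teacher_dims, student_dim=768):
        super().__init__()
        self.encoder = student_encoder
        # One small projection head per teacher feature space.
        self.heads = nn.ModuleList(nn.Linear(student_dim, d) for d in teacher_dims)

    def forward(self, x):
        feat = self.encoder(x)                        # (B, student_dim)
        return [head(feat) for head in self.heads]    # one prediction per teacher

def distillation_loss(student_outs, teacher_feats):
    """Average cosine-similarity loss against every teacher."""
    loss = 0.0
    for s, t in zip(student_outs, teacher_feats):
        loss = loss + (1 - F.cosine_similarity(s, t.detach(), dim=-1)).mean()
    return loss / len(teacher_feats)

# Training-loop sketch: teachers stay frozen, only the student is updated.
# for x in loader:
#     with torch.no_grad():
#         teacher_feats = [teacher(x) for teacher in teachers]
#     loss = distillation_loss(student(x), teacher_feats)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```

At deployment only the shared student encoder needs to run, which is where the compute savings over running several specialized encoders in parallel come from.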
Related publications
- [UNIC] UNIC: Universal Classification Models via Multi-teacher Distillation, ECCV 2024
- [POC] Placing Objects in Context via Inpainting for Out-of-distribution Segmentation, ECCV 2024
- [Distill] On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models, TMLR 2024
- [Ret4Loc] Weatherproofing retrieval for localization with generative AI and geometric consistency, ICLR 2024
- [RaSP] RaSP: Relation-aware Semantic Prior for weakly supervised incremental segmentation, CoLLAs 2023
- [ImageNetSD] Fake it till you make it: learning transferable representations from synthetic ImageNet clones, CVPR 2023
- [t-ReX] No reason for no supervision: improved generalization in supervised models, ICLR 2023
- [Grappa] Granularity-aware adaptation for image retrieval over multiple tasks, ECCV 2022
- [OASIS] On the road to online adaptation for semantic image segmentation, CVPR 2022
- [CoG] Concept generalization in visual representation learning, ICCV 2021
- [MoCHi] Hard negative mixing for contrastive learning, NeurIPS 2020
- [ICMLM] Learning visual representations with caption annotations, ECCV 2020