Lifelong Learning for Visual Representation


Visual representation learning (VRL) is a core component of computer vision and fundamental to the perception tools deployed on robots so that they can understand and interact with their environment. As robot environments and tasks evolve, visual perception modules need to be updated, and this need for plasticity in the visual perception of robotic platforms is what drives our research in lifelong learning. Our ultimate goal is to design robots that continuously enrich their models and skills when deployed in new environments and/or when taking part in new interactions.

Lifelong learning (also known as continual learning) encompasses many research questions and concrete application challenges. Our work can be roughly organized along the following research lines:

  1. Starting from the strongest possible model. Lifelong learning can be improved already at the pretraining stage, so our pretraining strategies are designed to produce models that generalize better when exposed to new tasks or new domains.
  2. Adapting models efficiently while mitigating catastrophic forgetting. We develop efficient adaptation techniques so that our models can adapt to new data and new tasks, whilst retaining what was learned from previous data and tasks.
  3. Leveraging generative AI. When prompted with text, recent tools can automatically generate new images from scratch or alter existing ones. We leverage these tools to improve and extend existing perception pipelines.
  4. Towards universal models. Pretrained models have become a commodity and, individually, they obtain strong results on a broad range of tasks. They also tend to be complementary. We aim to reconcile multiple models into a single, universal one with broader applicability.


1: Pretraining strong models

Until recently, the success of deep learning in computer vision was mostly fueled by large quantities of data annotated with fine-grained labels. This led to strong models which were used to initialize dedicated ones for most computer vision tasks. The data and, more importantly, the labels have always been difficult to acquire. To overcome this limitation, we’ve explored ways to train visual representations with self-supervision, such as our MoCHi method (Mixing of Contrastive Hard negatives), or with weak supervision. On the weak-supervision side, in 2022 we pioneered learning visual representations from scratch using the textual captions/descriptions that accompany images, with ICMLM (Image Conditioned Masked Language Modeling). This idea has since been pushed to much larger scales in large pretrained multimodal models such as CLIP. When labels are available, we’ve used them to help pretrain large models, and our generalization benchmarks such as CoG (Concept Generalization in VRL) have shown that self-supervised models tend to generalize better, whereas our recent t-ReX model is an example of a supervised model with good generalization properties.
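To give a flavour of the hard-negative mixing idea behind MoCHi, the sketch below shows one way synthetic hard negatives can be added to a contrastive (InfoNCE) loss in PyTorch. It is a minimal sketch under our own simplifications, not the actual MoCHi implementation; the function name, the memory-bank tensor `queue` and hyperparameters such as `n_hard` and `n_mix` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def infonce_with_mixed_hard_negatives(q, k_pos, queue, n_hard=64, n_mix=32, tau=0.2):
    """InfoNCE loss with extra synthetic negatives obtained by convexly
    mixing the hardest negatives from a memory bank (illustrative sketch)."""
    q = F.normalize(q, dim=1)          # (B, D) query features
    k_pos = F.normalize(k_pos, dim=1)  # (B, D) positive key features
    queue = F.normalize(queue, dim=1)  # (K, D) memory bank of negative features

    l_pos = (q * k_pos).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negative logits

    # Select the hardest negatives per query and mix random pairs of them.
    B = q.size(0)
    hard_idx = l_neg.topk(n_hard, dim=1).indices   # (B, n_hard)
    hard = queue[hard_idx]                         # (B, n_hard, D)
    i = torch.randint(0, n_hard, (B, n_mix), device=q.device)
    j = torch.randint(0, n_hard, (B, n_mix), device=q.device)
    rows = torch.arange(B, device=q.device).unsqueeze(1)
    alpha = torch.rand(B, n_mix, 1, device=q.device)
    mixed = F.normalize(alpha * hard[rows, i] + (1 - alpha) * hard[rows, j], dim=2)
    l_mix = torch.einsum('bd,bmd->bm', q, mixed)   # (B, n_mix) synthetic-negative logits

    # Positives sit at index 0 of the concatenated logits.
    logits = torch.cat([l_pos, l_neg, l_mix], dim=1) / tau
    labels = torch.zeros(B, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```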

2: Adapting models while mitigating catastrophic forgetting

The term catastrophic forgetting describes the fact that neural networks, after being updated, tend to forget their previous training. These updates often come with a distribution shift, such as moving from daylight to night scenes. The distribution can shift abruptly across batches of data or drift gradually, as illustrated in the OASIS (Online Adaptation for Semantic Image Segmentation) video below. We’ve also explored class-incremental learning for semantic segmentation with RASP (Relation Aware Semantic Prior for weakly supervised incremental segmentation) and novel class discovery for detection with PANDAS (Prototype-based novel class discovery and detection), and we’re interested in many more scenarios.

This video shows how our proposed continual online adaptation method, applied to image segmentation (right, with classes listed in boxes), compares to segmentation without adaptation (left) and to a naive continual learning method (middle), which suffers from catastrophic forgetting as classes are progressively forgotten over time.
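To make the forgetting problem concrete, here is a minimal sketch of one standard mitigation, experience replay, where a small memory of past samples is interleaved with incoming data during adaptation. This is not the method used in OASIS, RASP or PANDAS; the `ReplayBuffer` class, its capacity and the equal loss weighting are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Tiny reservoir-style memory of past (input, label) pairs (sketch)."""
    def __init__(self, capacity=2000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        for xi, yi in zip(x, y):
            self.seen += 1
            if len(self.data) < self.capacity:
                self.data.append((xi.cpu(), yi.cpu()))
            else:
                idx = random.randrange(self.seen)   # reservoir sampling
                if idx < self.capacity:
                    self.data[idx] = (xi.cpu(), yi.cpu())

    def sample(self, n, device):
        batch = random.sample(self.data, min(n, len(self.data)))
        xs = torch.stack([b[0] for b in batch]).to(device)
        ys = torch.stack([b[1] for b in batch]).to(device)
        return xs, ys

def adaptation_step(model, optimizer, x_new, y_new, buffer, replay_size=32):
    """One update on the incoming batch, regularized by replaying old samples."""
    loss = F.cross_entropy(model(x_new), y_new)
    if buffer.data:
        x_old, y_old = buffer.sample(replay_size, x_new.device)
        loss = loss + F.cross_entropy(model(x_old), y_old)  # rehearse the past
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.add(x_new, y_new)
    return loss.item()
```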

3: Leveraging generative AI

Image generation methods have progressed enormously, and we’ve been researching how they can help overcome the lack of raw training data. Leveraging recent models that can create realistic images from simple textual prompts, we’ve explored whether large synthetic datasets could replace real ones at the pretraining stage (ImageNet-SD). We’ve pushed the idea of using generative models further to complement real data in concrete applications: making image retrieval models more robust to weather and lighting changes with Ret4Loc, and improving the distillation of large, pretrained models into small specialized ones with Distill.
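As a concrete example of this kind of workflow, the sketch below generates a few synthetic training images from class-name prompts with an off-the-shelf text-to-image model via the diffusers library. It is a minimal illustration, not the pipeline behind ImageNet-SD, Ret4Loc or Distill; the checkpoint name, prompts and output paths are assumptions.

```python
# Minimal sketch: build a small synthetic image set from class-name prompts.
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",   # any text-to-image checkpoint would do
    torch_dtype=torch.float16,
).to("cuda")

class_names = ["golden retriever", "fire truck", "espresso"]          # example labels
conditions = ["at night", "in heavy rain", "under bright sunlight"]   # appearance variations

out_dir = Path("synthetic")
out_dir.mkdir(exist_ok=True)
for name in class_names:
    for idx, cond in enumerate(conditions):
        prompt = f"a photo of a {name} {cond}"
        image = pipe(prompt, num_inference_steps=30).images[0]
        image.save(out_dir / f"{name.replace(' ', '_')}_{idx}.png")
```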

4: Towards universal models

A careful trade-off has to be found between strong, multi-purpose models and highly specialized ones, and this trade-off depends on the application. It is particularly relevant for the task of visual localization, where we designed an adaptor-based architecture called GRAPPA which performs well across multiple scenarios. In robotics applications, multiple components need to run in parallel to simultaneously perform a large variety of tasks, yet they all rely on an encoder which produces visual representations. We’re very interested in distilling multiple strong, complementary encoders into a single one, as this would reduce the overall compute, and have been making recent progress with our universal encoder UNIC, which performs better than four individual strong teachers. We’re very excited about how this could be a key enabler for embodied applications.

Multi-teacher distillation into a single encoder: UNIC. Relative gains using the UNIC encoder distilled from teachers DINO, DeiT-III, iBOT, dBOT-ft, over the respective best teacher for each task using a single encoder and no task-specific parameters. All models are trained on ImageNet-1K.
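To give a flavour of multi-teacher distillation, here is a minimal sketch in the spirit of UNIC, not its actual architecture or losses: a shared student encoder with one lightweight projection head per frozen teacher, trained to match each teacher's features. The class name, the linear heads and the cosine matching loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Sketch of distilling several frozen teachers into one student encoder."""
    def __init__(self, student, teachers, student_dim, teacher_dims):
        super().__init__()
        self.student = student                      # trainable shared encoder
        self.teachers = nn.ModuleList(teachers)     # frozen pretrained encoders
        for t in self.teachers:
            t.requires_grad_(False)
        # One lightweight projection head per teacher.
        self.heads = nn.ModuleList(
            [nn.Linear(student_dim, d) for d in teacher_dims]
        )

    def forward(self, x):
        z = self.student(x)                         # (B, student_dim) shared features
        loss = 0.0
        for teacher, head in zip(self.teachers, self.heads):
            with torch.no_grad():
                target = teacher(x)                 # (B, teacher_dim) teacher features
            pred = head(z)
            # Cosine distance between projected student and teacher features.
            loss = loss + (1 - F.cosine_similarity(pred, target, dim=1)).mean()
        return loss / len(self.teachers)
```

In such a setup only the student and the projection heads receive gradients; at deployment the heads can be dropped and the single student encoder serves all downstream tasks.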
