Lifelong Learning for Visual Representation


Visual representation learning (VRL) is a core component of computer vision and fundamental to the perception tools deployed in robots so that they can understand and interact with their environment. As robotic environments and tasks evolve, visual perception modules need to be updated; this need for plasticity in the visual perception of robotic platforms is what drives our research in lifelong learning. Our ultimate goal is to design robots that continuously enrich their models and skills when deployed in new environments and/or when taking part in new interactions.

Lifelong learning (also known as continual learning) encompasses many research questions and concrete applicative challenges. Our work can be roughly organized along three complementary research lines that span the lifecycle of visual perception models: learning strong initial representations, adapting them over time, and consolidating multiple models into unified ones.

  1. Pre-training strong models. Lifelong learning benefits from acting at the pre-training stage, so our strategies aim to start from the strongest possible models: in this work, we produce models that generalize better when exposed to new tasks or domains.
  2. Adapting models efficiently while mitigating catastrophic forgetting. We develop efficient adaptation techniques so that our models can adapt to new data and new tasks, whilst retaining what was learned from previous data and tasks.
  3. Towards universal visual encoders. Pre-trained models have become a commodity and, individually, they obtain strong results on a broad range of tasks. They also tend to be complementary. We aim to reconcile multiple models towards a single, universal one with broader applicability.

 

1: Pre-training strong models

Originally, the success of deep learning in computer vision was mostly fuelled by large quantities of data annotated with fine-grained labels. This led to strong models that were used to initialise dedicated ones for most computer vision tasks. The data and, more importantly, the labels have always been difficult to acquire. To overcome this limitation, we’ve explored ways to train visual representations with self-supervision, such as our MoCHi method, or with weak supervision, such as ICMLM, which pioneered learning visual representations from scratch using textual image descriptions. When labels are available, we’ve used them to help pre-train large models: our generalization benchmarks such as CoG have shown that self-supervised models tend to generalize better, whereas our t-ReX model is an example of a supervised model with good generalization properties. Image generation methods have also progressed enormously, and we’ve been researching how they can help overcome the lack of raw training data. Leveraging recent models that can create realistic images from simple textual prompts, we’ve explored whether large synthetic datasets could replace real ones at the pre-training stage (ImageNet-SD).
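Contrastive self-supervised methods such as MoCHi build on an InfoNCE-style objective: two augmented views of the same image are pulled together in embedding space, while the other images in the batch act as negatives. As a rough illustration of this idea (not the exact MoCHi loss, which additionally mixes hard negatives), here is a minimal NumPy sketch; all names and sizes are illustrative:

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE: the key at the same index as a query is its positive pair;
    all other keys in the batch serve as negatives."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)  # unit vectors,
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)        # so dot = cosine
    logits = q @ k.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy, diagonal targets

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 32))                    # embeddings of one augmented view
view_b = view_a + 0.05 * rng.normal(size=(8, 32))    # a correlated second view
loss_matched = info_nce_loss(view_a, view_b)
loss_random = info_nce_loss(view_a, rng.normal(size=(8, 32)))
# matched views should incur a much lower loss than random pairings
```

The temperature controls how sharply the softmax concentrates on the hardest negatives; small values such as 0.07 are a common default in this family of methods.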

2: Adapting models while mitigating catastrophic forgetting

Once we’ve pre-trained strong and versatile models, the question of how to use them for specific tasks arises, and this often requires adaptation. For the task of visual localization, we’ve designed an adapter-based architecture, GRAPPA, which performs well for both indoor and outdoor retrieval. We’ve also explored how to adapt large, generic pre-trained models to specialized tasks using distillation (Distill). Finally, we’ve shown how large image-level pre-trained models can be leveraged to enrich 3D representations with semantic information (N3F and LUDVIG). The term catastrophic forgetting describes the fact that neural networks, after being updated, tend to forget their previous training. Such updates often come with a distribution shift, such as moving from daylight to night scenes, and this shift can differ across batches of data or vary gradually, as illustrated in OASIS. We’ve also explored class-incremental learning for semantic segmentation in RASP and novel class discovery for detection in PANDAS, and we’re interested in many more scenarios.
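Adapter-based approaches like the one behind GRAPPA typically freeze the pre-trained backbone and train only small bottleneck modules inserted into it, which also limits forgetting because the original weights stay untouched. The exact GRAPPA architecture is not reproduced here; the following is a generic residual bottleneck-adapter sketch in NumPy, with illustrative names and sizes:

```python
import numpy as np

def bottleneck_adapter(h, w_down, w_up):
    """Residual bottleneck adapter: project down, apply a ReLU, project up,
    then add the result back onto the frozen features h."""
    z = np.maximum(h @ w_down, 0.0)   # down-projection + non-linearity
    return h + z @ w_up               # residual: frozen behaviour is preserved

rng = np.random.default_rng(0)
dim, bottleneck = 64, 8
h = rng.normal(size=(4, dim))                       # features from a frozen backbone
w_down = 0.01 * rng.normal(size=(dim, bottleneck))  # trainable down-projection
w_up = np.zeros((bottleneck, dim))                  # zero-init: adapter starts as identity
out = bottleneck_adapter(h, w_down, w_up)
# at initialisation the adapter is a no-op, so pre-trained behaviour is intact;
# only 2 * dim * bottleneck parameters are trained instead of dim * dim
```

Zero-initialising the up-projection is a common trick: the adapted network starts out exactly equal to the pre-trained one, and adaptation proceeds smoothly from there.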

3: Towards universal visual encoders

Pre-trained models have become widely available and offer strong results on a broad range of tasks. These models tend to be complementary, and any method that can combine them could lead to even stronger generalization across a variety of tasks. In robotics applications, multiple components need to run in parallel to perform a large variety of tasks simultaneously, yet each typically relies on its own encoder to produce visual representations. Replacing these encoders with a single universal one, generic enough to support all target tasks while inheriting the strengths of the models it was trained on, would therefore be of high practical interest. In this line of work, we focus on distilling multiple strong and complementary encoders into a single model, which significantly reduces the overall computational cost. Our first universal encoder, UNIC, was distilled from four strong teacher models and performs better than each of them on several classification tasks. We’ve recently extended this approach to heterogeneous tasks beyond classification, using some of the best available visual representations as teachers. Again applying multi-teacher distillation, we trained DUNE, a universal visual encoder that excels in 2D vision, 3D understanding, and 3D human perception.
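At its core, multi-teacher distillation trains one student to match the features of several frozen teachers at once, usually through small teacher-specific projection heads since the teachers have different embedding dimensions. The sketch below is a generic version of such an objective using a cosine-distance matching term per teacher; the actual UNIC/DUNE losses may differ, and all names and sizes here are illustrative:

```python
import numpy as np

def multi_teacher_loss(student_feat, teacher_feats, projections):
    """Average, over teachers, of the cosine distance between a projected
    student feature and the corresponding (frozen) teacher feature."""
    total = 0.0
    for t_feat, proj in zip(teacher_feats, projections):
        s = student_feat @ proj                            # teacher-specific head
        s = s / np.linalg.norm(s, axis=1, keepdims=True)
        t = t_feat / np.linalg.norm(t_feat, axis=1, keepdims=True)
        total += np.mean(1.0 - np.sum(s * t, axis=1))      # 1 - cosine similarity
    return total / len(teacher_feats)

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 128))                        # student embeddings
teacher_dims = (256, 384, 768)                             # heterogeneous teachers
teachers = [rng.normal(size=(4, d)) for d in teacher_dims]
projs = [rng.normal(size=(128, d)) for d in teacher_dims]  # trainable heads
loss = multi_teacher_loss(student, teachers, projs)
# each per-teacher term lies in [0, 2], so the averaged loss does too
```

At inference time the projection heads can be dropped: only the single student encoder runs, which is where the computational savings over running every teacher come from.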
