Modern machine learning has been successfully applied to many problems and has recently brought huge improvements to fields as diverse as object localization and recognition in images [4], game playing [12] and autonomous driving [1], to name but a few. Unfortunately, such performance is achieved at the expense of having a single, very specialized model per task: a model trained to distinguish between cats and dogs has no idea how to distinguish apples from pears. Updating the recognition model to take these two new classes into account works to some extent, but the same underlying technology that makes these models perform so well (deep learning) has a major drawback: it has a terrible memory! After a few model updates, each retraining the model on example images of the newly introduced classes (a procedure called “fine-tuning”), the initial model will have forgotten what differentiates a dog from a cat. In jargon, this issue is called catastrophic forgetting [8, 2]. The line of research that tries to mitigate this issue, and which we discuss here, is called lifelong learning.
How big the memory problem is depends very much on the application. In some contexts, we might be happy with a model that can do a single task very well, without the need to learn new concepts during its lifespan. One example would be warehouse robots that need to perform repetitive actions in environments that rarely change. In other contexts, catastrophic forgetting is a real issue. Think about a computer vision model for a social network app, such as a filter that detects whether images violate a code of conduct before they’re posted. The module may initially be trained to handle realistic images (DSLR/phone photos). After some time, you may want it to also handle sketches or artwork, and update the model using relevant samples. However, unless you take the necessary precautions, your module is likely to forget how to handle the camera images.
In this blog post, we take you on a journey through the current state of lifelong learning research. Some of the points we cover are the solutions proposed, what researchers are currently focussed on and the benchmarks they use.
Throughout the lifetime of a model, different factors can vary across time. It can be exposed to new domains, to new tasks, to new classes or a combination of all of them. These different factors are illustrated in Figure 1.
New domains, where “domain” refers to the image statistics. One example is the one used earlier, where we’ve trained a model on camera images and later want to include sketches or drawings in the “comfort zone” of the model. In Figure 1 (left), in a simplified triangles vs. circles task, we have a domain where shapes are blue and a domain where shapes are purple.
New tasks. We have a model trained on one (or more) task(s), and we wish to include more. Using the same example as above, we may have an original filter that classifies whether an image contains violence, and in a second phase we may also want to be able to tell whether it contains hate. In Figure 1 (middle) the first task is again triangles vs. circles, and the second task is squares vs. crosses.
New classes. We have a model trained on a certain number of classes, and we want to include other classes. For an intuitive example, consider a house robot that needs to recognize an increasing number of different concepts. In Figure 1 (right), we’re again initially interested in the triangles vs. circles task, but later we want to also classify squares vs. crosses.
Deep learning models are generally trained by being repeatedly exposed to the samples they need to learn from. An effective way to counter forgetting (at least as far as model accuracy is concerned) is to keep a record of the data used to train the original model, so that every time samples from new domains/tasks/classes arrive, the original ones can be re-used as well, helping the model to remember.
This strategy requires a steadily increasing amount of memory and hence a steadily increasing use of energy. In some cases, it’s not even possible to keep old data (for example, for privacy reasons, when retention periods expire). Lifelong learning research tries to find more effective and efficient solutions to the problem. We cover the main ones here, largely drawing from Parisi et al. [9] and Maltoni and Lomonaco [7] to categorize them (see also Table 1).
Rehearsal-based methods. These methods keep a memory buffer with samples from the older domains/tasks/classes that the model needs to remember. The main research directions are (i) preventing new knowledge from interfering with existing knowledge and (ii) reducing memory requirements. The main advantage of these methods is their effectiveness in retaining prior knowledge. The main disadvantage is the need to store old data. This can be problematic for several reasons: apart from the memory requirements, sometimes it’s simply not possible to keep old data, e.g., because of privacy-related issues (especially in the medical field) or because the original model comes from a third party.
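To make the idea of a memory buffer concrete, here is a minimal sketch in plain Python. It assumes the buffer is filled by reservoir sampling, which is one common choice and not something prescribed by the methods above; the class name `ReplayBuffer`, its methods and the toy samples are all hypothetical.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []   # the stored old samples
        self.seen = 0       # how many samples were offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Keep the new sample with probability capacity / seen, so every
            # sample offered so far is equally likely to be in the buffer.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_batch, n_replay):
        # Training batch = new samples plus a few replayed old ones.
        k = min(n_replay, len(self.samples))
        return list(new_batch) + self.rng.sample(self.samples, k)


# Old domain: camera photos. New domain: sketches.
buffer = ReplayBuffer(capacity=4)
for i in range(100):
    buffer.add(("camera_photo", i))
batch = buffer.mixed_batch([("sketch", 0), ("sketch", 1)], n_replay=2)
```

The memory cost stays fixed at `capacity` samples no matter how long the stream is, which is the trade-off direction (ii) is about: a smaller buffer is cheaper but remembers less of the past.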
Architecture growing. These methods help a model to remember old information by increasing the number of parameters throughout its lifespan. The idea is to protect old patterns by freezing the parts of the model that were previously trained. This is effective in avoiding catastrophic forgetting, but a major drawback is that the memory footprint of the model increases throughout its lifespan. This can be critical in embedded systems (for example, mobile applications) where the model needs to fit specific constraints such as limited memory.
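A deliberately simplified sketch of this idea follows. It assumes the only part that grows is a linear classification head with one weight row per class, whereas real methods typically grow deeper parts of the network; `GrowingHead` and its methods are hypothetical names, and the gradients are made-up numbers.

```python
class GrowingHead:
    """Linear classifier head that grows by one weight row per new class."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.weights = []   # one row of weights per class
        self.frozen = []    # frozen[i] is True if row i must not change

    def add_classes(self, n_new):
        # Freeze everything learned so far, then append fresh trainable rows.
        self.frozen = [True] * len(self.weights)
        for _ in range(n_new):
            self.weights.append([0.0] * self.n_features)
            self.frozen.append(False)

    def sgd_step(self, grads, lr=0.1):
        # Apply gradients only to rows that are not frozen.
        for i, row_grad in enumerate(grads):
            if self.frozen[i]:
                continue
            self.weights[i] = [w - lr * g
                               for w, g in zip(self.weights[i], row_grad)]

    def n_parameters(self):
        return len(self.weights) * self.n_features


head = GrowingHead(n_features=3)
head.add_classes(2)                                   # triangles, circles
head.sgd_step([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
old_rows = [row[:] for row in head.weights]

head.add_classes(2)                                   # squares, crosses
head.sgd_step([[1.0, 1.0, 1.0]] * 4)                  # old rows stay untouched
```

The old rows are protected by construction, but `n_parameters()` doubles from 6 to 12 here, which illustrates the growth in model size described above.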
Regularization strategies. Regularization strategies approach the lifelong learning problem from an optimization perspective. These methods generally constrain the loss that the model optimizes so that forgetting earlier concepts is penalized. A notable example is penalizing large changes to the weights that matter most for earlier tasks. The huge advantage of this class of methods is that it overcomes the weaknesses of the other two families: there’s no need to store old data points or to increase the model capacity throughout its lifespan. However, it becomes increasingly difficult to retain good performance on older tasks as new ones accumulate.
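As an illustration of penalizing changes to important weights, the sketch below adds a quadratic penalty to the new-task loss. This is one possible form of such a penalty, not the definition used by any particular method; the function name, the `strength` parameter and the importance scores are assumptions, and how importance is estimated is left out entirely.

```python
def regularized_loss(new_task_loss, weights, old_weights, importance,
                     strength=1.0):
    """New-task loss plus a penalty for moving important weights.

    importance[i] says how much weight i mattered for earlier tasks
    (assumed to be given; estimating it is method-specific).
    """
    penalty = sum(imp * (w - w_old) ** 2
                  for w, w_old, imp in zip(weights, old_weights, importance))
    return new_task_loss + 0.5 * strength * penalty


old_weights = [1.0, 0.0]
importance = [10.0, 0.5]   # first weight matters a lot for the old task

# Same new-task loss, same size of change, applied to different weights.
move_unimportant = regularized_loss(0.3, [1.0, 2.0], old_weights, importance)
move_important = regularized_loss(0.3, [3.0, 0.0], old_weights, importance)
```

Moving the unimportant weight by 2 adds 1.0 to the loss, while moving the important one by the same amount adds 20.0, so the optimizer is steered towards solutions that leave the old task's key weights alone. No old data is stored and the model does not grow; only the old weights and their importance scores are kept.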
One could naturally take the best of all worlds and define a hybrid strategy. See Figure 2 for an overview of the various possibilities.
Defining realistic protocols to assess the performance of lifelong learning models is a vibrant research area in itself. The use of unrealistic protocols has lately been a source of debate. Some widely adopted evaluation benchmarks are indeed far less realistic than the ones typically used in “standard” machine learning. For example, the most widely adopted benchmark consists of learning from different versions of the MNIST dataset in which pixels are (unrealistically) randomly permuted. It’s true that even in such simplistic contexts neural networks generally forget the past as new information comes in, but these scenarios are extremely different from any realistic application.
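For readers who haven’t seen the permuted benchmark, the sketch below shows how such tasks are usually built: each task applies one fixed random pixel permutation to every flattened image. The helper name `make_permuted_task` and the tiny stand-in “images” are ours; real MNIST images have 28 × 28 = 784 pixels.

```python
import random


def make_permuted_task(images, seed):
    """Apply one fixed random pixel permutation to every flattened image."""
    n_pixels = len(images[0])
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)   # the permutation defines the task
    return [[img[p] for p in perm] for img in images]


# Stand-in data: two flattened 'images' with 6 pixels each.
images = [[0, 1, 2, 3, 4, 5],
          [5, 4, 3, 2, 1, 0]]

# Three tasks, each with its own permutation, to be learned one after another.
tasks = [make_permuted_task(images, seed=s) for s in range(3)]
```

Every task contains exactly the same pixel values as the original data, just shuffled in a way that destroys the spatial structure, which is why the resulting images look nothing like what a deployed vision model would encounter.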
Among the different attempts to propose new protocols, a notable one is the CORe50 dataset [6], where lifelong learning performance can be evaluated along different directions (varying classes, varying domains or both). It also allows learning from temporally coherent streams of data, which is consistent with the way humans are exposed to visual information. This direction is also pursued in a very recent work [11], which introduces a new dataset that allows learning from streams of data recorded in the wild.
Practitioners have also started using ImageNet, with the goal of sequentially learning samples from the 1,000 provided classes. Intriguing results were obtained with the REMIND algorithm [3], which achieves competitive performance with just a single pass over ImageNet’s samples.
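To give a feel for what “a single pass” means, here is a toy streaming learner that sees each sample exactly once and never stores it. It is a plain nearest-class-mean classifier with running averages, chosen only because it fits in a few lines; it is not REMIND nor any other method cited here, and the class name and two-dimensional features are made up.

```python
class StreamingClassMeans:
    """Nearest-class-mean classifier updated one sample at a time."""

    def __init__(self):
        self.means = {}    # label -> running mean of its feature vectors
        self.counts = {}   # label -> number of samples seen

    def learn_one(self, features, label):
        # Incremental mean update: no past sample needs to be kept.
        n = self.counts.get(label, 0) + 1
        mean = self.means.get(label, [0.0] * len(features))
        self.means[label] = [m + (x - m) / n for m, x in zip(mean, features)]
        self.counts[label] = n

    def predict(self, features):
        def sq_dist(mean):
            return sum((x - m) ** 2 for x, m in zip(features, mean))
        return min(self.means, key=lambda label: sq_dist(self.means[label]))


stream = [([0.0, 0.0], "circle"), ([4.0, 4.0], "triangle"),
          ([1.0, 0.0], "circle"), ([4.0, 5.0], "triangle")]

model = StreamingClassMeans()
for features, label in stream:      # single pass, one sample at a time
    model.learn_one(features, label)
```

New classes can appear at any point in the stream and simply get their own mean, so the learner can be queried with `model.predict(...)` at any time without retraining.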
Of course we can’t predict the future, but we strongly believe that lifelong learning will play a crucial role in democratizing AI applications. This is because real, ambient AI should be able to adapt to an evolving environment. This will only be possible if models can enrich their capabilities as they’re exposed to new problems they need to solve, instead of drifting away from their initial purpose.
Our bet is that, although rehearsal approaches might be the only solution in some cases, we’ll see the gap narrow between them and methods that are less data-demanding (e.g., regularization strategies). Some meta-learning approaches have also started to appear, providing alternatives with very promising performance. The idea here is to learn the learning algorithms themselves, in order to accommodate specific needs (for instance, avoiding catastrophic forgetting). In contexts where rehearsal is the only option, a natural direction is to reduce the storage requirements. Recently, different pieces of work have independently explored the space of “feature replay” [5, 10, 3], where the information related to previous tasks is stored in a more compressed fashion, namely as feature embeddings.
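The storage saving behind feature replay can be sketched as follows. We assume a frozen feature extractor whose output is smaller than its input; the block-averaging function below is a hypothetical stand-in for what would really be the intermediate activations of a network, and `FeatureMemory` is an illustrative name.

```python
def extract_features(image, block=4):
    """Hypothetical frozen feature extractor: average each block of pixels."""
    return [sum(image[i:i + block]) / block
            for i in range(0, len(image), block)]


class FeatureMemory:
    """Stores compact feature embeddings instead of raw images."""

    def __init__(self):
        self.items = []   # list of (embedding, label) pairs

    def remember(self, image, label):
        self.items.append((extract_features(image), label))

    def stored_values(self):
        return sum(len(embedding) for embedding, _ in self.items)


image = [float(i) for i in range(16)]   # a flattened 16-pixel 'image'
memory = FeatureMemory()
memory.remember(image, "dog")
```

Here the memory holds 4 numbers for a 16-pixel input. The price is that the layers producing the embeddings need to stay fixed (or change very little), otherwise the stored embeddings no longer match what the model would compute for the same image.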
Apart from the methods themselves, significant effort has been devoted to the design of more realistic protocols that more closely mimic the conditions in which a human learns. For instance, there’s a surge of interest in benchmarks where the goal is to learn from a data stream, without allowing the learner to perform multiple passes over the data. This is very exciting, as it challenges the classical setting in which neural networks perform so well (multiple passes over a training set). Countless applications could arise from a system that can learn efficiently from a stream of data! We look forward to the next few years of lifelong learning research, and to contributing to its progress with our own work.
[1] End to End Learning for Self-Driving Cars. Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao and Karol Zieba. arXiv:1604.07316 [cs.CV], 2016.
[2] Catastrophic Interference in Connectionist Networks: Can It Be Predicted, Can It Be Prevented? Robert M. French. Proceedings of Advances in Neural Information Processing Systems 6 (NIPS), 1993.
[3] REMIND Your Neural Network to Prevent Catastrophic Forgetting. Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya and Christopher Kanan. Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[4] ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012.
[5] Continuous Domain Adaptation with Variational Domain-Agnostic Feature Replay. Qicheng Lao, Xiang Jiang, Mohammad Havaei and Yoshua Bengio. arXiv:2003.04382 [cs.LG], 2020.
[6] CORe50: a New Dataset and Benchmark for Continuous Object Recognition. Vincenzo Lomonaco and Davide Maltoni. Proceedings of the Conference on Robot Learning (CoRL), pp. 17–26, 2017.
[7] Continuous learning in single-incremental-task scenarios. Davide Maltoni and Vincenzo Lomonaco. Neural Networks, 116: 56–73, 2019. DOI: 10.1016/j.neunet.2019.03.010
[8] Catastrophic interference in connectionist networks: The sequential learning problem. Michael McCloskey and Neil J. Cohen. The Psychology of Learning and Motivation, 24: 109–165, 1989. DOI: 10.1016/S0079-7421(08)60536-8
[9] Continual Lifelong Learning with Neural Networks: A Review. German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan and Stefan Wermter. Neural Networks, 113: 54–71, 2019. DOI: 10.1016/j.neunet.2019.01.012
[10] Latent Replay for Real-Time Continual Learning. Lorenzo Pellegrini, Gabriele Graffieti, Vincenzo Lomonaco and Davide Maltoni. arXiv:1912.01100 [cs.LG], 2019.
[11] Stream-51: Streaming Classification and Novelty Detection from Videos. Ryne Roady, Tyler L. Hayes, Hitesh Vaidya and Christopher Kanan. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Workshops, 2020.
[12] Mastering the game of Go with deep neural networks and tree search. David Silver, Aja Huang, Chris J. Maddison et al. Nature, 529: 484–489, 2016. DOI: 10.1038/nature16961