SuperLoss: Robust curriculum learning helps machines to learn like humans - Naver Labs Europe

Our novel framework, SuperLoss—which uses individual sample losses as error measures to determine the relative difficulty of samples in a dataset—can be plugged on top of existing neural network models to implement curriculum learning for any task, even with noisy datasets.

Generally, humans (and animals) learn concepts by mastering a series of progressively challenging problems. One example of this process is the way in which schoolchildren learn to solve increasingly advanced mathematical problems over several years, as illustrated in Figure 1. By learning simpler concepts first, children are better equipped to solve more difficult problems later. Overcoming simple challenges in this way provides them with a baseline of knowledge that can be built upon iteratively.

Curriculum learning takes inspiration from this natural style of learning and applies it in the context of machines (1). In typical machine learning, a neural network is trained on samples drawn at random from the entire set of training data. In curriculum learning, however, the network is presented with the easier samples first. This approach has been shown to perform better than traditional training (2, 3, 4), even for small datasets (5).

For curriculum learning to succeed in machines, some prior knowledge about the task at hand is normally required: the relative difficulty of each sample in a given dataset must first be estimated so that the network can tackle the samples in curriculum order.

Figure 1: Curriculum learning describes the process of building knowledge by finding solutions to iteratively more difficult problems. Easier problems (left) provide the basic building blocks on which harder problems (right) can be understood.

Estimating the difficulty of samples in a dataset

In early work on curriculum learning (2), experiments were carried out on toy datasets in which the separation between easy and hard samples was clear and predefined in the dataset construction. More recent approaches (6) have shown that the losses—i.e. the prediction errors—of samples during training can be used to identify which ones are difficult, as they typically exhibit a high loss throughout training compared to easy samples. To apply curriculum learning effectively, the importance of the difficult samples is lessened during training by reducing the weight of their contribution (downweighting). At later stages of training, the model learns to tackle more difficult samples, which then contribute more to the training objective.

However, this approach is challenging to implement. Even ‘self-learning’ models that are capable of estimating difficulty themselves call for significant changes to the training procedure to work properly. They may require, for example, multistage training (7), extra parameters or layers (8, 9), and ad hoc adaptations specific to each task. For these reasons, such methods are generally specialized to specific tasks, like image classification (10, 11).

All in all, current curriculum-learning approaches demand significant adaptation of the training procedure for a given task and, for this reason, generally require dedicated training schemes. Such schemes are time-consuming to implement and computationally expensive in practice, as well as being restrictive in terms of application. Additionally, they often require clean labelled datasets for training, which places further limits on their applicability.

SuperLoss: a straightforward framework for implementing curriculum learning for any task

We have developed an easy-to-use framework, called SuperLoss, which makes curriculum learning applicable to any task (12). Our SuperLoss module can in fact be plugged on top of an existing loss function during training, as shown in Figure 2. SuperLoss automatically downweights the contribution of hard samples while upweighting easy samples, effectively implementing the core principle of curriculum learning.

To distinguish easy from hard samples, the current loss of a sample is compared to an exponential moving average of the losses over all samples. A direct benefit of this approach is that no change is required at test time, and very little computational overhead is added during training.
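As a rough illustration, the running threshold that separates easy from hard samples can be sketched as an exponential moving average of per-batch losses. The class name and smoothing factor below are our own illustrative choices, not details taken from the paper:

```python
class RunningLossAverage:
    """Exponential moving average of per-batch mean losses.

    Illustrative sketch only: the smoothing factor and exact update rule
    are assumptions, not necessarily the scheme used by SuperLoss (12).
    """

    def __init__(self, alpha=0.9):
        self.alpha = alpha  # smoothing factor (assumed value)
        self.tau = None     # current threshold; None until the first update

    def update(self, batch_losses):
        # fold the mean loss of the current batch into the running average
        batch_mean = sum(batch_losses) / len(batch_losses)
        if self.tau is None:
            self.tau = batch_mean
        else:
            self.tau = self.alpha * self.tau + (1 - self.alpha) * batch_mean
        return self.tau

# A sample whose loss sits above tau is treated as hard (downweighted);
# a sample whose loss sits below tau is treated as easy (upweighted).
```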

Figure 2: (Left) Illustration showing the process of standard training compared with training with SuperLoss. (Top left) A sample from a dataset is processed by the neural network. The algorithm then compares the input (label) to the output—to determine the error (or ‘loss’)—and changes the weights that the model may use to reduce the loss in the next evaluation. (Bottom left) The SuperLoss module, which can be plugged on top of an existing loss function, adds an extra step to each iteration that increases the weight of easier problems and reduces the weight of harder ones to effectively implement the core principle of curriculum learning. (Right) At test time, no change is required: the sample is fed to the network, which outputs a prediction.

Using confidence estimates to increase the reliability of network predictions

Our work is inspired by a family of recently proposed loss functions referred to as ‘confidence-aware’. Such functions incorporate confidence estimates, which increase the reliability of predictions made by a neural network without adding a great deal of computational cost for training. Additionally, a confidence-aware loss allows curriculum learning to be performed automatically. Existing loss functions are specialized to precise tasks and do not generalize easily, which limits their application.

Somewhat surprisingly, however, we’ve discovered that confidence-aware loss functions for different tasks share striking similarities.

Three recently designed confidence-aware loss functions are shown in Figure 3. Each was designed for a different task and was independently proposed (10, 13, 14). In each plot, the region that corresponds to low confidence values is almost flat, whereas the higher-confidence region contains standard or emphasized gradients. In other words, the gradient of the loss with respect to the network parameters increases with the confidence when all other parameters are fixed.

Figure 3: Plots of three confidence-aware loss functions, each designed for different tasks—left: confidence-aware cross entropy; middle: reliability loss; right: introspection loss—show remarkably similar features. At low confidence values (the left side of each plot), the gradient is flattened. At higher confidence values (the right side of each plot) standard/emphasized gradients are visible. Y axis: Correctness of the network prediction. X axis: Confidence value. The colour gradient shown in the plot represents the value of the loss function, where blue is smaller and yellow is larger.

Based on these similarities, we propose a novel way to transform any loss function into a confidence-aware version. Our solution is a task-agnostic, interpretable, confidence-aware loss function that receives the standard loss and an additional confidence parameter. For it to comply with any type of loss, we design our function such that it is translation-invariant and homogeneous with respect to the input loss and that it generalizes the input loss.

The formulation of our confidence-aware transform admits an optimal confidence value given the input loss (as this specific confidence value has a closed-form solution). We can therefore define the SuperLoss as the value of our confidence-aware loss for the optimal confidence. The SuperLoss has a single input: the original loss value. Therefore, it can simply be placed on top of any loss function (hence the name!) and does not require any change in the training procedure, nor any extra parameters.
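Concretely, the SuperLoss of the paper (12) takes the form SL(ℓ) = (ℓ − τ)σ* + λ(log σ*)², where ℓ is the original loss, τ the running average loss, λ a regularization hyperparameter, and σ* the optimal confidence, available in closed form via the Lambert W function. Below is a minimal pure-Python sketch; the helper names are ours, and the simple Newton solver for W is for illustration only:

```python
import math

def lambert_w(z, iters=40):
    """Principal branch of the Lambert W function (w * exp(w) = z), z >= -1/e.

    Plain Newton iteration; sufficient for this illustration.
    """
    w = 0.0 if z < 1.0 else math.log(z)
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
    return w

def superloss(loss, tau, lam=1.0):
    """SuperLoss value for one sample, following the closed form in (12).

    loss: task loss of the sample; tau: running average loss;
    lam: regularization strength (a hyperparameter).
    """
    beta = (loss - tau) / lam
    # optimal confidence sigma*; the argument is clipped at -2/e so the
    # Lambert W input stays inside its domain [-1/e, inf)
    sigma = math.exp(-lambert_w(0.5 * max(-2.0 / math.e, beta)))
    # confidence-aware loss evaluated at the optimal confidence
    return (loss - tau) * sigma + lam * math.log(sigma) ** 2
```

Note the behaviour this induces: a sample whose loss equals the running average contributes zero, an easier sample (loss below τ) gets σ* > 1 and a negative, rewarding loss, and a harder sample gets σ* < 1, shrinking its contribution below the raw loss gap.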

High robustness to noise with SuperLoss enables learning from automatically collected web data

To determine the performance of SuperLoss, we carry out extensive experiments on various computer vision tasks (image classification, deep regression, object detection and image retrieval). Overall, our results show that the use of SuperLoss gives rise to small, consistent improvements when training on clean data. More significantly, however, we found that for data labels containing noise—e.g. those automatically collected from the web—training with SuperLoss leads to significantly higher performance.

The improvement in accuracy for noisy data arises because noisy samples remain difficult (i.e. keep a high loss) even after numerous passes over the data (epochs). As a result, the contributions of these samples are downweighted by the SuperLoss module throughout training. This is illustrated in Figure 4, which shows the evolution of losses for easy, hard and noisy samples. Easy and hard samples are both clean; they can be told apart after a few epochs by their respectively small and high losses.
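The separation visible in Figure 4 can be mimicked with made-up loss trajectories. The decay rates below are illustrative assumptions, not measured values; the point is only that clean samples eventually fall below the average loss, while noisy ones stay above it and therefore keep being downweighted:

```python
# Toy trajectories (invented numbers) mimicking Figure 4: per-epoch losses
# for an easy, a hard and a noisy sample, plus the dataset-average loss.
epochs = range(1, 31)
easy  = [2.0 * 0.70 ** t for t in epochs]   # fits within a few epochs
hard  = [2.0 * 0.93 ** t for t in epochs]   # fits slowly
noisy = [2.0 for _ in epochs]               # mislabelled: never fits
avg   = [(e + h + n) / 3 for e, h, n in zip(easy, hard, noisy)]

# By the last epoch, only the noisy sample still sits above the average,
# so only its contribution keeps being downweighted.
late = 29  # index of epoch 30
```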

Figure 4: The evolution of losses for easy (green), hard (blue) and noisy (red) samples when using SuperLoss on the CIFAR-100 dataset with 60% noise. The easy samples are determined within a few passes (or epochs) of training. We observe that the hard and noisy samples are clearly separated (after roughly 20 epochs).
Figure 5: Image classification results for a number of models, including SuperLoss (blue), on CIFAR-10 and CIFAR-100 datasets with noise (up to 80%) artificially added to the labels.

We compared SuperLoss to a number of other models on datasets with varying noise (see Figure 5). Our results show that the addition of noise, in the form of false labels, causes the performance of all models to decrease significantly when using standard training. However, the results obtained with SuperLoss show that accuracy is less impacted by noise, even at high percentages (80%). This is true even compared to most state-of-the-art models, which have more limited applicability (e.g. are specialized to image classification, require the addition of novel network parameters and/or necessitate a change in the procedure to work properly).

Another strong result that we obtain is on image retrieval when the model is trained on a dataset automatically collected from the web. Researchers have previously had to apply geometric verification to clean such data, and then use the subset of clean data to obtain a reasonable performance in this area (15). However, by plugging SuperLoss on top of their method, we obtain a better performance when training on the full, noisy dataset rather than the subset of clean data. This highlights the ability of SuperLoss to enable learning from large, automatically collected noisy datasets instead of clean, manually curated datasets.

Summarizing SuperLoss and future work

In summary, we have developed a novel, easy-to-use framework that enables curriculum learning to be applied to any task, even with noisy data. SuperLoss works as a module that can be plugged on top of an existing loss function to increase the accuracy of any model by upweighting easy samples and downweighting hard ones, thus creating a curriculum by which the network is able to learn. This approach is computationally less expensive than state-of-the-art models and, moreover, does not require specialization for a specific task. Additionally, SuperLoss is adept at learning from large, noisy datasets, such as those collected automatically from the web. In this respect, we plan to investigate how SuperLoss could help in the context of semi-supervised learning where some labels are missing, which in a sense is another form of noise.


  1. Learning and development in neural networks: the importance of starting small. Jeffrey L. Elman. Cognition, vol. 48, no. 1, 1993, pp. 71–99.
  2. Curriculum learning. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), Montreal, Quebec, Canada, 14–18 June 2009.
  3. Training agent for first-person shooter game with actor-critic curriculum learning. Yuxin Wu, Yuandong Tian. 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–29 April 2017.
  4. Curriculum learning for multi-task classification of visual attributes. Nikolaos Sarafianos, Theodore Giannakopoulos, Christophoros Nikou, Ioannis A. Kakadiaris. Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017.
  5. Multi-task curriculum transfer deep learning of clothing attributes. Qi Dong, Shaogang Gong, Xiatian Zhu. IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, California, USA, 24–31 March 2017.
  6. SELF: learning to filter noisy labels with self-ensembling. Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox. 8th International Conference on Learning Representations (ICLR), virtual event, 26 April–1 May 2020.
  7. Self-paced learning for latent variable models. M. Pawan Kumar, Benjamin Packer, Daphne Koller. Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS’10), Vancouver, Canada, 6–11 December 2010.
  8. O2U-Net: a simple noisy label detection approach for deep neural networks. Jinchi Huang, Lie Qu, Rongfei Jia, Binqian Zhao. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  9. MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, Li Fei-Fei. 2018 International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018.
  10. Data parameters: a new family of parameters for learning a differentiable curriculum. Shreyas Saxena, Oncel Tuzel, Dennis DeCoste. Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 8–14 December 2019.
  11. Dynamic curriculum learning for imbalanced data classification. Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, Junjie Yan. IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  12. SuperLoss: a generic loss for robust curriculum learning. Thibault Castells, Philippe Weinzaepfel, Jerome Revaud. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), virtual event, 6–12 December 2020.
  13. Self-supervised learning of geometrically stable features through probabilistic introspection. David Novotny, Samuel Albanie, Diane Larlus, Andrea Vedaldi. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, 18–22 June 2018.
  14. R2D2: reliable and repeatable detector and descriptor. Jerome Revaud, César De Souza, Martin Humenberger, Philippe Weinzaepfel. Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 8 December–14 December 2019.
  15. Deep image retrieval: learning global representations for image search. Albert Gordo, Jon Almazán, Jérome Revaud, Diane Larlus. 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016.