SuperLoss: Robust curriculum learning helps machines to learn like humans - Naver Labs Europe

Our novel framework, SuperLoss—which uses individual sample losses as error measures to determine the relative difficulty of samples in a dataset—can be plugged on top of existing neural network models to implement curriculum learning for any task, even with noisy datasets.

Generally, humans (and animals) learn concepts by mastering a series of progressively challenging problems. One example of this process is the way in which schoolchildren learn to solve increasingly advanced mathematical problems over several years, as illustrated in Figure 1. By learning simpler concepts first, children are better equipped to solve more difficult problems later. Overcoming simple challenges in this way provides them with a baseline of knowledge that can be built upon iteratively.

Curriculum learning takes inspiration from this natural style of learning and applies it in the context of machines (1). In typical machine learning, a neural network is trained on samples drawn at random from the entire set of training data. In curriculum learning, however, the network is presented with the easier samples first. This approach has been shown to perform better than traditional training (2, 3, 4), even for small datasets (5).

For curriculum learning to succeed in machines, some prior knowledge about the task at hand is normally required: the relative difficulty of each sample in a given dataset must first be estimated so that the network can tackle the samples in curriculum order.

Figure 1: Curriculum learning describes the process of building knowledge by finding solutions to iteratively more difficult problems. Easier problems (left) provide the basic building blocks on which harder problems (right) can be understood.

Estimating the difficulty of samples in a dataset

In early work on curriculum learning (2), experiments were carried out on toy datasets in which the separation between easy and hard samples was clear and predefined in the dataset construction. More recent approaches (6) have shown that the losses—i.e. the prediction errors—of samples during training can be used to identify which ones are difficult, as they typically exhibit a high loss throughout training compared to easy samples. To apply curriculum learning effectively, the importance of the difficult samples is lessened during training by reducing the weight of their contribution (downweighting). At later stages of training, the model learns to tackle more difficult samples, which then contribute more to the training objective.

However, this approach is challenging to implement. Even ‘self-learning’ models that are capable of estimating difficulty themselves call for significant changes to the training procedure to work properly. They may require, for example, multistage training (7), extra parameters or layers (8, 9), and ad hoc adaptations specific to each task. For these reasons, such methods are generally specialized to specific tasks, like image classification (10, 11).

All in all, current curriculum-learning approaches demand significant adaptation of the training procedure for a given task and, for this reason, generally require dedicated training schemes. Such schemes are time-consuming to implement and computationally expensive in practice, as well as being restrictive in terms of application. Additionally, they often require clean labelled datasets for training, which places further limits on their applicability.

SuperLoss: a straightforward framework for implementing curriculum learning for any task

We have developed an easy-to-use framework, called SuperLoss, which makes curriculum learning applicable to any task (12). Our SuperLoss module can in fact be plugged on top of an existing loss function during training, as shown in Figure 2. SuperLoss automatically downweights the contribution of hard samples while upweighting easy samples, effectively implementing the core principle of curriculum learning.

To distinguish easy from hard samples, the current loss of a sample is compared to an exponential moving average of the losses over all samples. A direct benefit of this approach is that no change is required at test time, and very little computational overhead is added during training.
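As a rough illustration, the running threshold that separates easy from hard samples can be sketched as an exponential moving average of per-batch losses. The class name and smoothing factor below are our own illustrative choices, not details taken from the paper:

```python
class RunningLossAverage:
    """Exponential moving average of per-batch mean losses.

    Illustrative sketch only: the smoothing factor and exact update rule
    are assumptions, not necessarily the scheme used by SuperLoss (12).
    """

    def __init__(self, alpha=0.9):
        self.alpha = alpha  # smoothing factor (assumed value)
        self.tau = None     # current threshold; None until the first update

    def update(self, batch_losses):
        # fold the mean loss of the current batch into the running average
        batch_mean = sum(batch_losses) / len(batch_losses)
        if self.tau is None:
            self.tau = batch_mean
        else:
            self.tau = self.alpha * self.tau + (1 - self.alpha) * batch_mean
        return self.tau

# A sample whose loss sits above tau is treated as hard (downweighted);
# a sample whose loss sits below tau is treated as easy (upweighted).
```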

Figure 2: (Left) Illustration showing the process of standard training compared with training with SuperLoss. (Top left) A sample from a dataset is processed by the neural network. The algorithm then compares the input (label) to the output—to determine the error (or ‘loss’)—and changes the weights that the model may use to reduce the loss in the next evaluation. (Bottom left) The SuperLoss module, which can be plugged on top of an existing loss function, adds an extra step to each iteration that increases the weight of easier problems and reduces the weight of harder ones to effectively implement the core principle of curriculum learning. (Right) At test time, no change is required: the sample is fed to the network, which outputs a prediction.

Using confidence estimates to increase the reliability of network predictions

Our work is inspired by a family of recently proposed loss functions referred to as ‘confidence-aware’. Such functions incorporate confidence estimates, which increase the reliability of predictions made by a neural network without adding a great deal of computational cost for training. Additionally, a confidence-aware loss allows curriculum learning to be performed automatically. Existing loss functions are specialized to precise tasks and do not generalize easily, which limits their application.

Somewhat surprisingly, however, we’ve discovered that confidence-aware loss functions for different tasks share striking similarities.

Three recently designed confidence-aware loss functions are shown in Figure 3. Each was designed for a different task and was independently proposed (10, 13, 14). In each plot, the region that corresponds to low confidence values is almost flat, whereas the higher-confidence region contains standard or emphasized gradients. In other words, the gradient of the loss with respect to the network parameters increases with the confidence when all other parameters are fixed.

Figure 3: Plots of three confidence-aware loss functions, each designed for different tasks—left: confidence-aware cross entropy; middle: reliability loss; right: introspection loss—show remarkably similar features. At low confidence values (the left side of each plot), the gradient is flattened. At higher confidence values (the right side of each plot) standard/emphasized gradients are visible. Y axis: Correctness of the network prediction. X axis: Confidence value. The colour gradient shown in the plot represents the value of the loss function, where blue is smaller and yellow is larger.

Based on these similarities, we propose a novel way to transform any loss function into a confidence-aware version. Our solution is a task-agnostic, interpretable, confidence-aware loss function that receives the standard loss and an additional confidence parameter. For it to comply with any type of loss, we design our function such that it is translation-invariant and homogeneous with respect to the input loss and that it generalizes the input loss.

The formulation of our confidence-aware transform admits an optimal confidence value given the input loss (as this specific confidence value has a closed-form solution). We can therefore define the SuperLoss as the value of our confidence-aware loss for the optimal confidence. The SuperLoss has a single input: the original loss value. Therefore, it can simply be placed on top of any loss function (hence the name!) and does not require any change in the training procedure, nor any extra parameters.
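Concretely, the SuperLoss of the paper (12) takes the form SL(ℓ) = (ℓ − τ)σ* + λ(log σ*)², where ℓ is the original loss, τ the running average loss, λ a regularization hyperparameter, and σ* the optimal confidence, available in closed form via the Lambert W function. Below is a minimal pure-Python sketch; the helper names are ours, and the simple Newton solver for W is for illustration only:

```python
import math

def lambert_w(z, iters=40):
    """Principal branch of the Lambert W function (w * exp(w) = z), z >= -1/e.

    Plain Newton iteration; sufficient for this illustration.
    """
    w = 0.0 if z < 1.0 else math.log(z)
    for _ in range(iters):
        ew = math.exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
    return w

def superloss(loss, tau, lam=1.0):
    """SuperLoss value for one sample, following the closed form in (12).

    loss: task loss of the sample; tau: running average loss;
    lam: regularization strength (a hyperparameter).
    """
    beta = (loss - tau) / lam
    # optimal confidence sigma*; the argument is clipped at -2/e so the
    # Lambert W input stays inside its domain [-1/e, inf)
    sigma = math.exp(-lambert_w(0.5 * max(-2.0 / math.e, beta)))
    # confidence-aware loss evaluated at the optimal confidence
    return (loss - tau) * sigma + lam * math.log(sigma) ** 2
```

Note the behaviour this induces: a sample whose loss equals the running average contributes zero, an easier sample (loss below τ) gets σ* > 1 and a negative, rewarding loss, and a harder sample gets σ* < 1, shrinking its contribution below the raw loss gap.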

High robustness to noise with SuperLoss enables learning from automatically collected web data

To determine the performance of SuperLoss, we carry out extensive experiments on various computer vision tasks (image classification, deep regression, object detection and image retrieval). Overall, our results show that the use of SuperLoss gives rise to small, consistent improvements when training on clean data. More significantly, however, we found that for data labels containing noise—e.g. those automatically collected from the web—training with SuperLoss leads to significantly higher performance.

The improvement in accuracy for noisy data arises because noisy samples remain difficult (i.e. keep a high loss) even after numerous passes over the data (epochs). As a result, the contributions of these samples are downweighted by the SuperLoss module throughout training. This is illustrated in Figure 4, which shows the evolution of losses for easy, hard and noisy samples. Easy and hard samples are both clean; they can be told apart after a few epochs by their respectively small and high losses.
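The separation visible in Figure 4 can be mimicked with made-up loss trajectories. The decay rates below are illustrative assumptions, not measured values; the point is only that clean samples eventually fall below the average loss, while noisy ones stay above it and therefore keep being downweighted:

```python
# Toy trajectories (invented numbers) mimicking Figure 4: per-epoch losses
# for an easy, a hard and a noisy sample, plus the dataset-average loss.
epochs = range(1, 31)
easy  = [2.0 * 0.70 ** t for t in epochs]   # fits within a few epochs
hard  = [2.0 * 0.93 ** t for t in epochs]   # fits slowly
noisy = [2.0 for _ in epochs]               # mislabelled: never fits
avg   = [(e + h + n) / 3 for e, h, n in zip(easy, hard, noisy)]

# By the last epoch, only the noisy sample still sits above the average,
# so only its contribution keeps being downweighted.
late = 29  # index of epoch 30
```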

Figure 4: The evolution of losses for easy (green), hard (blue) and noisy (red) samples when using SuperLoss on the CIFAR-100 dataset with 60% noise. The easy samples are determined within a few passes (or epochs) of training. We observe that the hard and noisy samples are clearly separated (after roughly 20 epochs).
Figure 5: Image classification results for a number of models, including SuperLoss (blue), on CIFAR-10 and CIFAR-100 datasets with noise (up to 80%) artificially added to the labels.

We compared SuperLoss to a number of other models on datasets with varying noise (see Figure 5). Our results show that the addition of noise, in the form of false labels, causes the performance of all models to decrease significantly when using standard training. However, the results obtained with SuperLoss show that accuracy is less impacted by noise, even at high percentages (80%). This is true even compared to most state-of-the-art models, which have more limited applicability (e.g. are specialized to image classification, require the addition of novel network parameters and/or necessitate a change in the procedure to work properly).

Another strong result that we obtain is on image retrieval when the model is trained on a dataset automatically collected from the web. Researchers have previously had to apply geometric verification to clean such data, and then use the subset of clean data to obtain a reasonable performance in this area (15). However, by plugging SuperLoss on top of their method, we obtain a better performance when training on the full, noisy dataset rather than the subset of clean data. This highlights the ability of SuperLoss to enable learning from large, automatically collected noisy datasets instead of clean, manually curated datasets.

Summarizing SuperLoss and future work

In summary, we have developed a novel, easy-to-use framework that enables curriculum learning to be applied to any task, even with noisy data. SuperLoss works as a module that can be plugged on top of an existing loss function to increase the accuracy of any model by upweighting easy samples and downweighting hard ones, thus creating a curriculum by which the network is able to learn. This approach is computationally less expensive than state-of-the-art models and, moreover, does not require specialization for a specific task. Additionally, SuperLoss is adept at learning from large, noisy datasets, such as those collected automatically from the web. In this respect, we plan to investigate how SuperLoss could help in the context of semi-supervised learning where some labels are missing, which in a sense is another form of noise.


  1. Learning and development in neural networks: the importance of starting small. Jeffrey L. Elman. Cognition, vol. 48, no. 1, 1993, pp. 71–99.
  2. Curriculum learning. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston. Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), Montreal, Quebec, Canada, 14–18 June 2009.
  3. Training agent for first-person shooter game with actor-critic curriculum learning. Yuxin Wu, Yuandong Tian. 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–29 April 2017.
  4. Curriculum learning for multi-task classification of visual attributes. Nikolaos Sarafianos, Theodore Giannakopoulos, Christophoros Nikou, Ioannis A. Kakadiaris. Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017.
  5. Multi-task curriculum transfer deep learning of clothing attributes. Qi Dong, Shaogang Gong, Xiatian Zhu. IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, California, USA, 24–31 March 2017.
  6. SELF: learning to filter noisy labels with self-ensembling. Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox. 8th International Conference on Learning Representations (ICLR), virtual event, 26 April–1 May 2020.
  7. Self-paced learning for latent variable models. M. Pawan Kumar, Benjamin Packer, Daphne Koller. Proceedings of the 23rd International Conference on Neural Information Processing Systems (NIPS’10), Vancouver, Canada, 6–11 December 2010.
  8. O2U-Net: a simple noisy label detection approach for deep neural networks. Jinchi Huang, Lie Qu, Rongfei Jia, Binqian Zhao. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  9. MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, Li Fei-Fei. 2018 International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018.
  10. Data parameters: a new family of parameters for learning a differentiable curriculum. Shreyas Saxena, Oncel Tuzel, Dennis DeCoste. Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 8–14 December 2019.
  11. Dynamic curriculum learning for imbalanced data classification. Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, Junjie Yan. IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  12. SuperLoss: a generic loss for robust curriculum learning. Thibault Castells, Philippe Weinzaepfel, Jerome Revaud. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), virtual event, 6–12 December 2020.
  13. Self-supervised learning of geometrically stable features through probabilistic introspection. David Novotny, Samuel Albanie, Diane Larlus, Andrea Vedaldi. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, 18–22 June 2018.
  14. R2D2: reliable and repeatable detector and descriptor. Jerome Revaud, César De Souza, Martin Humenberger, Philippe Weinzaepfel. Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 8 December–14 December 2019.
  15. Deep image retrieval: learning global representations for image search. Albert Gordo, Jon Almazán, Jérome Revaud, Diane Larlus. 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016.