Contrastive learning is an effective way of learning visual representations in a self-supervised manner. By pushing the embeddings of two transformed versions of the same image (the positive pair) close to each other, and away from the embeddings of all other images (the negatives) using a contrastive loss, the model learns powerful and transferable representations. We demonstrate that harder negatives are needed to facilitate better and faster learning for contrastive self-supervised learning, and we propose ways of synthesizing harder negative features on the fly, with minimal computational overhead.
During the past decade, advances in deep learning have revolutionized the way practitioners tackle computer vision tasks such as image classification or object detection. These powerful tools are typically combined with supervised learning: a labelled dataset is collected for the task at hand and a deep neural network is trained to solve it. Assuming the size and quality of the labelled training set are sufficient, this process works quite well for many computer vision problems and, in particular, for image classification. Manually annotating the data, however, often requires expert-level knowledge, making it costly, and it does not scale: repeating the annotation effort for every new task is unsustainable. Fortunately, there are ways to overcome this difficulty. One of them is to train a costly model only once and reuse it for many different computer vision problems, i.e. to learn a model that produces general-purpose representations that can be transferred to other tasks.
Self-supervised learning goes a step further: it aims at learning general-purpose representations without relying on human annotations. Models pre-trained with self-supervision are a great starting point for transfer learning to downstream tasks like classification on other datasets, or even object detection or instance segmentation [1,2]. In some cases, they outperform models pre-trained with supervision.
Contrastive self-supervised learning
The most successful self-supervised models are based on contrastive learning. The idea is simple: first create a positive pair from two transformed versions of the same image, then train a model to solve the proxy task of bringing the representations of the positive pair close to each other while pushing them away from the representations of any other image, drawn from a set of negatives.
This naturally raises the question of where the negatives come from. For SimCLR [2], negatives come from the same batch, while MoCo [1] keeps a larger memory of features. Although other methods explore ways of creating better positive pairs, e.g. MoCo-v2 [3] and InfoMinAug [5], another direction is to look for harder negatives. To get more challenging negatives, MoCo increases the size of the memory. This works well up to a point but, after a while, the performance on downstream tasks plateaus.
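To make the proxy task concrete, here is a minimal sketch of such a contrastive (InfoNCE-style) loss in PyTorch, where each query is scored against its positive and a shared set of negatives. The function name, shapes and temperature value are illustrative rather than taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k_pos, negatives, temperature=0.2):
    """InfoNCE-style loss: pull each query towards its positive key and
    push it away from every item in a set/memory of negatives."""
    q = F.normalize(q, dim=1)                    # (B, D) query embeddings
    k_pos = F.normalize(k_pos, dim=1)            # (B, D) positives (other view of the same image)
    negatives = F.normalize(negatives, dim=1)    # (K, D) negatives (batch items or a memory queue)

    l_pos = (q * k_pos).sum(dim=1, keepdim=True)        # (B, 1) similarity to the positive
    l_neg = q @ negatives.t()                           # (B, K) similarity to each negative
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```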
Mixing of Contrastive Hard Negatives
In our paper [6], we propose using the hardest existing negatives to synthesize additional hard negatives, on the fly, directly in the feature space. We do that by either mixing two of the hardest negatives, or by mixing the query itself with one of the hardest negatives. This process is graphically depicted in Figure 3. We call our approach “MoCHi”, which stands for “Mixing of Contrastive Hard Negatives”.
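A rough, per-query sketch of this mixing step is shown below. The hyperparameter names and default values are illustrative (the actual implementation in [6] operates on the MoCo queue and has its own parameterization), but the two mixing strategies are the ones described above:

```python
import torch
import torch.nn.functional as F

def mochi_synthesize(q, negatives, n_hard=1024, s=1024, s_prime=256):
    """Synthesize hard negatives for one query directly in feature space.

    q:          (D,)   L2-normalized query embedding
    negatives:  (K, D) L2-normalized negative embeddings (e.g. the memory queue)
    """
    # Rank the negatives by similarity to the query and keep the hardest ones.
    sims = negatives @ q
    hard = negatives[sims.topk(min(n_hard, negatives.size(0))).indices]

    # (1) Mix pairs of hard negatives with random convex weights.
    i = torch.randint(hard.size(0), (s,))
    j = torch.randint(hard.size(0), (s,))
    alpha = torch.rand(s, 1)
    mixed = alpha * hard[i] + (1 - alpha) * hard[j]

    # (2) Mix the query itself with hard negatives; keeping the query
    #     coefficient below 0.5 keeps the synthetic point closer to the negative.
    k = torch.randint(hard.size(0), (s_prime,))
    beta = 0.5 * torch.rand(s_prime, 1)
    harder = beta * q + (1 - beta) * hard[k]

    # Re-normalize and append these to the negatives used for this query's loss.
    return F.normalize(torch.cat([mixed, harder], dim=0), dim=1)
```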
Experimental Evaluation
We pre-trained our models on the ImageNet-1K dataset without using any labels and evaluated our approach on a number of downstream tasks and datasets. We chose to build our approach on top of MoCo-v2, but MoCHi can be applied to any contrastive learning method that uses a set of negatives.
Linear classification
In the first column of results in Table 1 below, we report image classification performance when freezing the pre-trained model and only training linear classifiers on ImageNet. Overall, MoCHi retains the state-of-the-art performance of MoCo-v2 [3] on ImageNet. We attribute the slight decrease in performance to biases induced by training with hard negatives on the same dataset as the downstream task: as we explain below, hard negative mixing reduces alignment and increases uniformity for the dataset used during training.
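For reference, the linear-evaluation protocol boils down to something like the following sketch; the optimizer settings shown are common choices for this protocol, not necessarily the exact ones used in our runs:

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone, feat_dim, num_classes=1000):
    """Linear evaluation: freeze the pre-trained backbone and train only a
    linear classifier on top of its (fixed) features."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()
    classifier = nn.Linear(feat_dim, num_classes)
    # A high learning rate with SGD + momentum is a common choice here.
    optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)
    return classifier, optimizer
```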
Transfer Learning
When it comes to transfer learning, we see from Table 1 (detection on PASCAL VOC) and Table 2 (detection and segmentation on COCO) that MoCHi offers consistent gains over MoCo-v2, and matches or outperforms the state of the art. Specifically we see that:
- MoCHi helps the model learn faster.
MoCHi achieves notable gains over MoCo-v2 for transfer learning after only 100 epochs of pre-training: hard negative mixing helps a lot for shorter training schedules, and MoCHi gains +1% Average Precision (AP) over MoCo-v2 on PASCAL VOC when fine-tuning after 100 epochs of pre-training. What is more, Table 2 shows that MoCHi can match supervised pre-training performance on COCO after only 100 epochs.
When training for the commonly used setup of 200 epochs, MoCHi shows consistent gains over MoCo-v2 and other methods for transfer learning. Notably, MoCHi trained for 200 epochs achieves performance similar to that of MoCo-v2 trained for 800 epochs on PASCAL VOC, and it reaches state-of-the-art performance for both object detection and instance segmentation on COCO.
- Gains are consistent across hyperparameters.
Performance gains are consistent across multiple hyperparameter configurations for MoCHi, such as the number of synthesized points and the number of top-ranked negatives taken into account.
Understanding the feature space
In order to better understand how MoCHi works and affects contrastive learning, we analyzed training using the class labels from ImageNet-1K, our training set. Specifically, we track how false negatives vary during training. By false negatives (FN) we refer to negative/memory items from the same class as the query that are highly ranked with respect to the logits (the not-yet-normalized predictions), i.e. that appear among the top-1024 highest logits for the query. In Figure 4 below we show the percentage of FNs in the top-1024 across training, averaged over all queries.
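Concretely, the quantity plotted in Figure 4 can be sketched per query as follows; the function and argument names are illustrative, and the class labels are only used for this analysis, never during training:

```python
import torch

def false_negative_rate(q, memory, q_label, memory_labels, topk=1024):
    """Fraction of a query's top-k highest-scoring memory items that share its
    ImageNet class label, i.e. the false negatives tracked in Figure 4."""
    logits = memory @ q                                    # (K,) similarity logits
    top = logits.topk(min(topk, logits.numel())).indices   # indices of the hardest negatives
    return (memory_labels[top] == q_label).float().mean().item()
```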
Let's first take a look at the synthetic points. From the lines with square markers we see that only a small percentage of synthetic points are (definitely) FNs (<2%). But what about the “real” negatives? We see that, for all runs, the FNs in the top-1024 increase during training, which is highly desirable: it implies that we are learning a space where features from the same class lie closer together. However, we also see that MoCHi has, overall, a smaller percentage of FNs than MoCo; why then does MoCHi perform better on downstream tasks?

To better understand the embedding space learned by MoCHi, we also looked into the recently proposed alignment and uniformity scores from Wang & Isola [4]. In short, alignment is defined as the average distance between representations of images from the same class, while uniformity refers to the average pairwise distance between all embeddings (regardless of class). In Figure 5 below, we plot the two metrics for variants of MoCHi, MoCo, and a supervised model on the validation set of the ImageNet-100 dataset.
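As a rough sketch of how such metrics can be computed over L2-normalized validation features (following the loss formulation of Wang & Isola [4], where lower values correspond to better alignment and to more uniform features; the scores plotted in Figure 5 may use a different sign convention, and the class-based alignment below mirrors the description above rather than the positive-pair definition of [4]):

```python
import torch

def alignment(feats, labels):
    """Average squared distance between L2-normalized features of images from
    the same class (lower = same-class features lie closer together)."""
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    same.fill_diagonal_(False)                  # ignore self-pairs
    dist2 = torch.cdist(feats, feats).pow(2)
    return dist2[same].mean().item()

def uniformity(feats, t=2.0):
    """Log of the average Gaussian potential over all pairs of features
    (lower = features are spread more uniformly over the hypersphere)."""
    dist2 = torch.pdist(feats).pow(2)
    return torch.log(torch.exp(-t * dist2).mean()).item()
```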
We see that MoCHi increases the uniformity of the representations compared to both MoCo-v2 and the supervised model. This further supports our hypothesis that MoCHi allows the proxy task to better utilize the embedding space. In fact, we see that the supervised model leads to high alignment but very low uniformity, i.e. to features tailored to the classification task. MoCo-v2 and MoCHi, on the other hand, exhibit higher uniformity, which we experimentally show leads to more generalizable representations: both MoCo-v2 and MoCHi outperform the supervised ImageNet-pretrained backbone for transfer learning.
Conclusions
Overall, MoCHi increases the difficulty of the instance discrimination task by synthesizing more challenging negatives while training visual representations. This translates into consistent gains in transfer learning performance over state-of-the-art methods, across multiple hyperparameter configurations and for a number of tasks and datasets. MoCHi also facilitates faster learning: we observed gains of 1% Average Precision over MoCo-v2 on PASCAL VOC when pre-training for only 100 epochs, and on COCO, MoCHi was able to match the performance of supervised pre-training for instance segmentation after only 100 epochs. Finally, by measuring the alignment and uniformity metrics recently proposed in [4], we show that MoCHi results in better utilization of the underlying feature space.
Find out more about MoCHi by reading our paper [6], viewing our recorded presentation at NeurIPS 2020, or visiting our project page for pre-trained models.
This is the collaborative work of Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel and Diane Larlus.