Improving self-supervised representation learning by synthesizing challenging negatives - Naver Labs Europe

Contrastive learning is an effective way of learning visual representations in a self-supervised manner. Pushing the embeddings of two transformed versions of the same image (the positive pair) close to each other, and further away from the embeddings of all other images (the negatives), using a contrastive loss leads to powerful and transferable representations. We demonstrate that harder negatives are needed to facilitate better and faster learning for contrastive self-supervised learning, and we propose ways of synthesizing harder negative features on the fly, with minimal computational overhead.

During the past decade, advances in deep learning have revolutionized the way practitioners develop models for computer vision tasks such as image classification or object detection. These powerful tools are typically combined with supervised learning: a labelled dataset is collected for the task at hand and a deep neural network is trained to solve it. Provided the labelled training set is large and clean enough, this process works quite well for many computer vision problems and, in particular, for image classification. Manually annotating the data, however, often requires expert-level knowledge of the problem, which makes it costly. Moreover, it doesn't scale: we cannot afford to repeat it every time we want to solve a new, different task. Fortunately, there are several ways to overcome this difficulty. One is to train the costly model only once and reuse it for many different computer vision problems, i.e. to learn a model that produces general-purpose representations that can be transferred to other tasks.

Self-supervised learning goes a step beyond this as it aims at learning general purpose representations without having to rely on human annotations. Models pre-trained with self-supervision are a great starting point for transfer learning to downstream tasks like classification on other datasets, or even object detection or instance segmentation [1,2]. In some cases, they outperform models pre-trained with supervision.

Figure 1: In supervised learning we train a model with classification labels on a source dataset (e.g. ImageNet). Transfer learning is the process of reusing, i.e. transferring, the representations learned this way to novel, downstream tasks different from the ones they were trained for. Self-supervised learning aims at learning general-purpose representations without relying on any human annotations.

Contrastive self-supervised learning

The most successful self-supervised models are based on contrastive learning. The idea is simple: first create a positive pair from two transformed versions of the same image, then train a model that solves the proxy task of bringing the representations of the positive pair close to each other, and further away than the representations of any other image from a set of negatives.

Figure 2: The main idea behind contrastive self-supervised learning. You first create a positive pair from two transformed versions of the same image. Then, you train a model that solves the proxy task of bringing the representations of the positive pair close to each other, and further away than the representations of any other image from a set of negatives.
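
To make the proxy task concrete, here is a minimal PyTorch-style sketch of a contrastive (InfoNCE-like) loss over a batch of queries. The function name, tensor shapes and the temperature value are illustrative assumptions, not the exact code of any of the methods cited here.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.2):
    """Contrastive loss for a batch of queries.

    query, positive: (B, D) L2-normalized embeddings of two views of the same images.
    negatives:       (N, D) L2-normalized embeddings of other images.
    """
    # Similarity of each query with its own positive: (B, 1)
    l_pos = torch.einsum("bd,bd->b", query, positive).unsqueeze(-1)
    # Similarity of each query with every negative: (B, N)
    l_neg = torch.einsum("bd,nd->bn", query, negatives)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive always sits at index 0, so the target "class" is 0 for every query.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```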

This naturally leads to the question of where the negatives come from. For SimCLR [2], negatives are the other images in the same batch, while MoCo [1] keeps a larger memory of features from previous batches. While other methods explore ways of creating better positive pairs, e.g. MoCo-v2 [3] and InfoMinAug [5], another direction is to look for harder negatives. To get more challenging negatives, MoCo simply increases the size of the memory. This works well up to a point, but after a while the performance on downstream tasks stops improving.
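
The sketch below illustrates, under simplifying assumptions, the MoCo-style alternative to in-batch negatives: a first-in, first-out memory of features from previous batches that serves as the pool of negatives. The class name, memory size and update logic are illustrative, not MoCo's actual implementation.

```python
import torch
import torch.nn.functional as F

class NegativeMemory:
    """Simplified FIFO memory of negative features (MoCo-style)."""

    def __init__(self, dim=128, size=65536):
        # Start from random unit-norm features; they get overwritten as training proceeds.
        self.features = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        """Insert the latest batch of key features, overwriting the oldest entries."""
        n = keys.size(0)
        idx = (self.ptr + torch.arange(n)) % self.features.size(0)
        self.features[idx] = keys
        self.ptr = (self.ptr + n) % self.features.size(0)
```

At each training step, the query is then contrasted against all features currently in the memory, e.g. by passing them as the negatives of the loss sketch above.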

Mixing of Contrastive Hard Negatives

In our paper [6], we propose using the hardest existing negatives to synthesize additional hard negatives, on the fly, directly in the feature space. We do that by either mixing two of the hardest negatives, or by mixing the query itself with one of the hardest negatives. This process is graphically depicted in Figure 3. We call our approach “MoCHi”, which stands for “Mixing of Contrastive Hard Negatives”.

Figure 3: Mixing the hardest negatives of each query to synthesize new hard negatives.
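
To make the two mixing strategies depicted in Figure 3 concrete, here is a simplified sketch of MoCHi-style hard-negative synthesis for a single query, directly in feature space. Variable names and default values are illustrative; see the paper [6] for the exact formulation and hyperparameters.

```python
import torch
import torch.nn.functional as F

def mochi_synthesize(query, negatives, n_hard=1024, s=1024, s_prime=256):
    """Synthesize hard negatives for one query.

    query:     (D,)  L2-normalized query embedding.
    negatives: (N, D) L2-normalized negative/memory embeddings, with N >= n_hard.
    """
    # 1) Rank existing negatives by similarity to the query and keep the hardest ones.
    sims = negatives @ query                               # (N,)
    hard = negatives[sims.topk(n_hard).indices]            # (n_hard, D)

    # 2) Mix random pairs of hard negatives with convex coefficients, then re-normalize.
    i = torch.randint(0, n_hard, (s,))
    j = torch.randint(0, n_hard, (s,))
    alpha = torch.rand(s, 1)
    mixed = F.normalize(alpha * hard[i] + (1 - alpha) * hard[j], dim=1)

    # 3) Mix the query itself with hard negatives; keeping the mixing coefficient
    #    below 0.5 lets the negative component dominate, giving even harder points.
    k = torch.randint(0, n_hard, (s_prime,))
    beta = 0.5 * torch.rand(s_prime, 1)
    harder = F.normalize(beta * query + (1 - beta) * hard[k], dim=1)

    # The synthetic features are simply appended to this query's set of negatives;
    # no gradients are propagated through them.
    return torch.cat([mixed, harder], dim=0)
```

Because the synthesis happens directly in the (low-dimensional) feature space and only involves a top-k ranking plus a few mixing operations, the computational overhead is minimal.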

Experimental Evaluation

We pre-trained our models on the ImageNet-1K dataset without using any labels and evaluated our approach on a number of downstream tasks and datasets. We chose to build our approach on top of MoCo-v2, but MoCHi can be applied to any contrastive learning method that uses a set of negatives.

Linear classification

In the first column of results in Table 1 below, we report image classification performance after freezing the model and training only linear classifiers on ImageNet. Overall, MoCHi retains the state-of-the-art performance of MoCo-v2 [3] on ImageNet. We attribute the slight decrease in performance to biases induced by training with hard negatives on the same dataset as the downstream task. As we explain below, hard negative mixing reduces alignment and increases uniformity for the dataset used during training.

Table 1: Results for linear classification on ImageNet-1K and object detection on PASCAL VOC with a ResNet-50 backbone. Wherever standard deviation is reported, it refers to multiple runs for the fine-tuning part. For MoCHi runs we also report in parenthesis the difference to MoCo-v2. * denotes reproduced results. We bold (resp. underline) the highest results overall (resp. for MoCHi).
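
For reference, here is a minimal sketch of the linear evaluation protocol used for the first column of Table 1: the pre-trained backbone is frozen and only a linear classifier is trained on top of its features. The backbone initialization, learning rate and training step are simplified assumptions; in practice the self-supervised (MoCo/MoCHi) checkpoint would be loaded into the backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()                # in practice, load the self-supervised checkpoint here
backbone.fc = nn.Identity()          # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad = False          # the representation stays frozen
backbone.eval()

classifier = nn.Linear(2048, 1000)   # 1000 ImageNet-1K classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def linear_probe_step(images, labels):
    """One training step of the linear classifier on frozen features."""
    with torch.no_grad():
        features = backbone(images)  # (B, 2048), no gradients flow to the backbone
    loss = criterion(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```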

Transfer Learning

When it comes to transfer learning, we see from Table 1 (detection on PASCAL VOC) and Table 2 (detection and segmentation on COCO) that MoCHi offers consistent gains over MoCo-v2, and matches or outperforms the state of the art. Specifically, we see that:

  • MoCHi helps the model learn faster.

Hard negative mixing is especially beneficial for shorter training: MoCHi achieves a gain of +1% Average Precision (AP) over MoCo-v2 on PASCAL VOC when fine-tuning after only 100 epochs of pre-training. What is more, Table 2 shows that MoCHi can already match supervised pre-training performance on COCO after 100 epochs.

  • MoCHi transfers better.

When training for the commonly used setup of 200 epochs, MoCHi shows consistent gains over MoCo-v2 and other methods for transfer learning. Notably, MoCHi trained for 200 epochs achieves performance similar to MoCo-v2 trained for 800 epochs on PASCAL VOC, and reaches state-of-the-art performance for both object detection and instance segmentation on COCO.

  • Gains are consistent across hyperparameters.

Performance gains are consistent across multiple hyperparameter configurations for MoCHi, including the number of points synthesized and the number of top negatives taken into account.

Table 2: Object detection and instance segmentation results on COCO with the ×1 training schedule and a C4 backbone. * denotes reproduced results.

Understanding the feature space

In order to better understand how MoCHi works and affects contrastive learning, we analyzed training using the class labels of ImageNet-1K, our training set. Specifically, we track how false negatives vary during training. By false negatives (FN) we refer to negative/memory items from the same class as the query that are highly ranked with respect to the logits (the not-yet-normalized predictions), i.e. that appear among the top-1024 highest logits for the query. Figure 4 below shows the percentage of FNs in the top-1024 across training, averaged over all queries.
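
This diagnostic is straightforward to compute given the held-out class labels; a possible sketch is shown below. Tensor names are illustrative, and the labels are only used for analysis, never during training.

```python
import torch

def false_negative_rate(query_logits, memory_labels, query_label, k=1024):
    """Fraction of the top-k highest-logit memory items that share the query's class.

    query_logits:  (N,) logits of one query against all memory items.
    memory_labels: (N,) class label of each memory item (used only for analysis).
    query_label:   scalar class label of the query image.
    """
    topk_idx = query_logits.topk(k).indices          # indices of the k hardest negatives
    same_class = memory_labels[topk_idx] == query_label
    return same_class.float().mean().item()          # fraction of false negatives
```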

Figure 4: Percentage of false negatives in the top-1024 across training, averaged over all queries.

Let’s first take a look at the synthetic points. From the lines with square markers we see that only a small percentage of synthetic points are (definitely) FNs (<2%). But what about the “real” negatives? We see that, for all runs, the FNs in the top-1024 increase during training, which is highly desirable as it implies that we are learning a space where features from the same class are closer together. However, we also see that MoCHi has, overall, a smaller percentage of FNs than MoCo; why then does MoCHi perform better on downstream tasks?

To better understand the embedding space learned by MoCHi, we also looked into the recently proposed scores of alignment and uniformity from Wang & Isola [4]. In short, alignment is defined as the average distance between representations of the same class, while uniformity refers to the average pairwise distance between all embeddings (regardless of class), i.e. how uniformly the embeddings are spread in the representation space. In Figure 5 below, we plot the two metrics for variants of MoCHi, MoCo and a supervised model on the validation set of the ImageNet-100 dataset.
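
For completeness, here is a short sketch of the two metrics following the definitions in Wang & Isola [4]. Note that their reference formulation computes both quantities as losses, so lower values correspond to better alignment and to more uniformly spread embeddings; the default exponents below follow their paper.

```python
import torch

def alignment(x, y, alpha=2):
    """Average distance between positive pairs: x[i] and y[i] are embeddings
    that should be close (e.g. two views of the same image, or same-class samples)."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """Log of the average pairwise Gaussian potential over all L2-normalized
    embeddings; lower values mean the embeddings cover the space more uniformly."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```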

Figure 5: Alignment (y-axis) versus uniformity (x-axis) for variants of MoCHi, MoCo and a supervised model on the validation set of ImageNet-100.

We see that MoCHi increases the uniformity of the representations compared to both MoCo-v2 and the supervised model. This further supports our hypothesis that MoCHi allows the proxy task to better utilize the embedding space. In fact, the supervised model leads to high alignment but very low uniformity, denoting features tailored to the classification task. MoCo-v2 and MoCHi, on the other hand, trade some alignment for higher uniformity, which we experimentally show leads to more generalizable representations: both outperform the supervised ImageNet-pretrained backbone for transfer learning.

Conclusions

Overall, MoCHi increases the difficulty of the instance discrimination task by synthesizing more challenging negatives on the fly while training visual representations. This translates into consistent gains in transfer learning performance over state-of-the-art methods, across multiple hyperparameter configurations and for a number of tasks and datasets. MoCHi also facilitates faster learning: we observed gains of 1% Average Precision over MoCo-v2 on PASCAL VOC when pre-training for only 100 epochs, and on COCO MoCHi was able to match the performance of supervised pre-training for instance segmentation after only 100 epochs. Finally, by measuring the alignment and uniformity metrics recently proposed in [4], we show that MoCHi results in better utilization of the underlying feature space.

Find out more about MoCHi by reading our paper [6], watching our recorded presentation at NeurIPS 2020 or visiting our project page for pre-trained models.

This is the collaborative work of Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel and Diane Larlus.

References

  1. He, K., Fan, H., Wu, Y., Xie, S. and Girshick, R. Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), virtual event, 14-19 June, 2020.
  2. Chen, T., Kornblith, S., Norouzi, M., Hinton, G. A simple framework for contrastive learning of visual representations. International Conference on Machine Learning (ICML), virtual event, 12-18 July, 2020.
  3. Chen, X., Fan, H., Girshick, R. and He, K. Improved baselines with momentum contrastive learning [MoCo-v2]. arXiv:2003.04297 (2020).
  4. Wang, T., and Isola P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. International Conference on Machine Learning (ICML), virtual event, 12-18 July, 2020.
  5. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C. and Isola, P. What makes for good views for contrastive learning? Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), virtual event, 6–12 December, 2020.
  6. Kalantidis, Y., Sariyildiz, M.B., Pion, N., Weinzaepfel, P. and Larlus, D. Hard Negative Mixing for Contrastive Learning. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), virtual event, 6–12 December, 2020.
Yannis Kalantidis
I have been a research scientist at Naver Labs Europe since March 2020; I am a member of the Computer Vision team. My research interests include representation learning, video understanding, multi-modal learning and large-scale vision and language. For a full list of publications please visit my Google Scholar profile (https://scholar.google.com/citations?user=QJZQgN8AAAAJ&hl=en) or my personal website (https://www.skamalas.com/). I am very passionate about making the research community tackle more socially impactful problems. Together with Laura Sevilla-Lara, I lead the Computer Vision for Global Challenges initiative; please visit https://www.cv4gc.org/ for more info. I grew up and lived in Greece until 2015, with brief breaks in Sweden, Spain and the United States. I lived in the Bay Area from 2015 till 2020, working as a research scientist at Yahoo Research (2015-2017) and Facebook AI (2017-2020). I got my PhD in late 2014 from the National Technical University of Athens under the supervision of Yannis Avrithis. I am passionate about traveling, photography, film, interactive visual arts and music.