Concept generalization in visual representation learning
Mert Bulent Sariyildiz1,2, Yannis Kalantidis1, Diane Larlus1, Karteek Alahari2
1 NAVER LABS Europe 2 Inria
ICCV 2021
Figure-1: An overview of our ImageNet Concept Generalization (CoG) benchmark. (a) An example of five concepts from the full ImageNet dataset (IN-21K), ranked by increasing semantic distance (decreasing Lin similarity) to the ImageNet-1K (IN-1K) concept “Tiger cat”. (b) We rank the 21K concepts of IN-21K according to their semantic distance to the 1000 concepts of IN-1K and split the ranked list into 5 groups of 1000 concepts. We refer to these five IN-1K-sized datasets of increasing semantic distance from IN-1K as concept generalization levels, denoted L1/2/3/4/5. (c) The proposed ImageNet-CoG benchmark uses a model trained on IN-1K as a feature extractor and evaluates its concept generalization capabilities by learning linear classifiers on each level of increasingly challenging unseen concepts.
Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently from the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge that comes from WordNet in order to define a sequence of unseen IN-21K concept sets that are semantically more and more distant from the ImageNet-1K (IN-1K) subset, a ubiquitous training set. This allows us to benchmark visual representations learned on IN-1K out of the box. We conduct a large-scale study encompassing 31 convolutional and transformer-based models and show how different architectures, levels of supervision, regularization techniques and use of web data impact the concept generalization performance.
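The levels are built by ranking unseen concepts by their Lin similarity to the seen IN-1K concepts and chunking the ranked list. The following minimal, self-contained sketch illustrates that ranking step with a toy hierarchy; all concept names and information-content values here are illustrative stand-ins, not the paper's actual WordNet computation:

```python
# Toy information-content (IC) values for a tiny hypothetical hierarchy,
# and the lowest common subsumer (LCS) of each concept pair.
# All names and numbers are illustrative, not taken from the paper.
ic = {"entity": 0.0, "cat": 5.3, "tiger_cat": 7.8, "tabby": 7.5, "guitar": 6.9}
lcs = {("tiger_cat", "tabby"): "cat", ("tiger_cat", "guitar"): "entity"}

def lin_similarity(a, b):
    """Lin similarity: sim(a, b) = 2 * IC(LCS(a, b)) / (IC(a) + IC(b))."""
    return 2 * ic[lcs[(a, b)]] / (ic[a] + ic[b])

seen = "tiger_cat"              # stand-in for one of the 1000 IN-1K concepts
unseen = ["guitar", "tabby"]    # stand-ins for IN-21K concepts outside IN-1K

# Rank by decreasing similarity, i.e. increasing semantic distance; splitting
# the full ranked IN-21K list into equal chunks yields the levels L1..L5.
ranked = sorted(unseen, key=lambda c: lin_similarity(seen, c), reverse=True)
print(ranked)  # ['tabby', 'guitar'] -- the semantically closer concept first
```

In practice the similarity of an unseen concept to the whole seen set (rather than a single concept) drives the ranking; WordNet's noun hierarchy supplies the LCS and information-content statistics.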
Citation:
If you find our paper interesting, please consider citing us:
@InProceedings{sariyildiz2021conceptgeneralization,
  title     = {Concept Generalization in Visual Representation Learning},
  author    = {Sariyildiz, Mert Bulent and Kalantidis, Yannis and Larlus, Diane and Alahari, Karteek},
  booktitle = {International Conference on Computer Vision},
  year      = {2021}
}
Benchmark results:
In the paper, we evaluate 31 state-of-the-art models on the ImageNet-CoG benchmark. We use ResNet50 as the reference model. The remaining 30 models are divided into four categories:
Architecture: Models with different backbone architectures. Those with a number of parameters similar (resp. dissimilar) to ResNet50 are colored red (resp. orange).
- T2T-ViT-t-14, visual transformer model
- DeiT-S, visual transformer model
- DeiT-S-distilled, distilled DeiT-S
- Inception-v3, CNN with inception modules
- NAT-M4, neural architecture search model
- EfficientNet-B1, neural architecture search model
- EfficientNet-B4, bigger EfficientNet-B1
- DeiT-B-distilled, bigger DeiT-S-distilled
- ResNet152, bigger ResNet50
- VGG-19, simple CNN architecture
Self-supervision: ResNet50 models trained with self-supervised learning.
- SimCLR-v2, online instance discrimination (ID)
- MoCo-v2, ID with momentum encoder and memory bank
- BYOL, negative-free ID with momentum encoder
- MoCHi, ID with negative pair mining
- InfoMin, ID with careful positive pair selection
- OBoW, online bag-of-visual-words prediction
- SwAV, online clustering
- DINO, online clustering
- BarlowTwins, feature de-correlation using positive pairs
- CompReSS, distilled model from SimCLR-v1 (with ResNet50x4)
Regularization: ResNet50 models trained with additional regularization.
- MixUp, label-associated augmentation in input space
- Manifold-MixUp, label-associated augmentation in representation space
- CutMix, label-associated augmentation in input space
- ReLabel, model trained on a “multi-label” version of IN-1K
- Adv-Robust, adversarially robust model
- MEAL-v2, distilled ResNet50
Use of web data: ResNet50 models trained using additional web data.
- MoPro, trained on Webvision-V1
- Semi-Sup, semi-supervised model first pretrained on YFCC-100M, then fine-tuned on IN-1K
- Semi-Weakly-Sup, semi-weakly supervised model first pretrained on IG-1B, then fine-tuned on IN-1K
- CLIP, vision & language model trained on WebImageText.
Results
Figure-2: Linear classification on the ImageNet-CoG benchmark. Top-1 accuracies for all the 31 models listed above after training logistic regression classifiers on IN-1K and each level L1/2/3/4/5. (a) Absolute top-1 accuracy on all levels. (b)-(e) accuracy relative to the baseline ResNet50 for all the models, split across the four model categories presented above.
Benchmark files
These two files contain the concepts and data splits for ImageNet-CoG:
- cog_concepts_split_file.pkl: List of image filenames in the train and test splits for all 5000 ImageNet concepts in the CoG levels (~678MB).
- cog_levels_mapping_file.pkl: List of ImageNet concept names for each ImageNet-CoG level (~100KB).
If downloading doesn’t start automatically when clicking the links above, please copy the links directly into your browser.
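Assuming both files are pickled Python dictionaries (a level-to-concepts mapping and a concept-to-splits mapping; this layout is an assumption, so inspect the files if yours differ), loading them might look like the sketch below. The round-trip demo uses tiny toy stand-ins in place of the real ~100KB / ~678MB files:

```python
import os
import pickle
import tempfile

def load_cog_files(levels_path, splits_path):
    """Load the two CoG benchmark files (dictionary layouts assumed, not guaranteed)."""
    with open(levels_path, "rb") as f:
        levels = pickle.load(f)   # assumed: level name -> list of concept names
    with open(splits_path, "rb") as f:
        splits = pickle.load(f)   # assumed: concept -> {"train": [...], "test": [...]}
    return levels, splits

# Round-trip demo with toy stand-in data.
tmp = tempfile.mkdtemp()
toy_levels = {"L1": ["tabby"], "L5": ["guitar"]}
toy_splits = {"tabby": {"train": ["img_0.jpg"], "test": ["img_1.jpg"]}}
for name, obj in [("levels.pkl", toy_levels), ("splits.pkl", toy_splits)]:
    with open(os.path.join(tmp, name), "wb") as f:
        pickle.dump(obj, f)

levels, splits = load_cog_files(os.path.join(tmp, "levels.pkl"),
                                os.path.join(tmp, "splits.pkl"))
print(levels["L1"])  # ['tabby']
```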