Concept generalization in visual representation learning

Mert Bulent Sariyildiz1,2, Yannis Kalantidis1, Diane Larlus1, Karteek Alahari2

1 NAVER LABS Europe            2 Inria

ICCV 2021


Figure-1: An overview of our ImageNet Concept Generalization (CoG) benchmark. (a) An example of five concepts from the full ImageNet dataset (IN-21K), ranked by increasing semantic distance (decreasing Lin similarity) to the concept “Tiger cat” from the ImageNet-1K subset (IN-1K). (b) We rank the 21K concepts of IN-21K according to their semantic distance to the 1000 concepts of IN-1K and split the ranked list to extract 5 groups of 1000 concepts. We refer to these five IN-1K-sized datasets of increasing semantic distance from IN-1K as concept generalization levels, denoted as L1/2/3/4/5. (c) The proposed ImageNet-CoG benchmark uses a model trained on IN-1K as a feature extractor and evaluates its concept generalization capabilities by learning linear classifiers for each level of increasingly challenging unseen concepts.


  • October 2021: We are presenting this work at the virtual ICCV 2021! Come chat with us during Paper Sessions 8a and 8b.
  • October 2021: Code is now publicly available on GitHub.
  • September 2021: Our paper has been accepted to the International Conference on Computer Vision (ICCV) 2021! Please see the new version on arXiv, which includes an extensive evaluation of 31 visual representation learning models on the ImageNet-CoG benchmark.
  • December 2020: The first version of our paper is released on arXiv.

Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily and independently of the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance, and we propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge from WordNet to define a sequence of unseen IN-21K concept sets that are semantically more and more distant from the ImageNet-1K (IN-1K) subset, a ubiquitous training set. This allows us to benchmark visual representations learned on IN-1K out of the box. We conduct a large-scale study encompassing 31 convolution-based and transformer-based models and show how different architectures, levels of supervision, regularization techniques and use of web data impact concept generalization performance.
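To make the ranking idea concrete, here is a minimal sketch of Lin similarity and level splitting on a toy taxonomy. It is not the benchmark's code: the taxonomy `PARENT`, the counts `COUNTS`, and the helpers `rank_unseen` and `split_levels` are all hypothetical, and scoring an unseen concept by its maximum similarity to any seen concept is an assumption made for illustration. Lin similarity itself is the standard definition, 2·IC(lcs(a, b)) / (IC(a) + IC(b)), where IC is information content and lcs is the lowest common subsumer.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for the WordNet graph.
PARENT = {
    "animal": "entity", "artifact": "entity",
    "feline": "animal", "canine": "animal",
    "tiger_cat": "feline", "lion": "feline", "wolf": "canine",
    "vehicle": "artifact", "car": "vehicle",
}
# Hypothetical occurrence counts, used only to estimate information content.
COUNTS = {"tiger_cat": 4, "lion": 2, "wolf": 2, "car": 8}


def ancestors(concept):
    """Return the path from a concept up to the root, concept included."""
    path = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        path.append(concept)
    return path


def information_content():
    """IC(c) = -log p(c), where p(c) covers c and all of its descendants."""
    total = sum(COUNTS.values())
    mass = {}
    for leaf, n in COUNTS.items():
        for node in ancestors(leaf):
            mass[node] = mass.get(node, 0) + n
    return {node: -math.log(m / total) for node, m in mass.items()}


IC = information_content()


def lin_similarity(a, b):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b))."""
    anc_b = set(ancestors(b))
    lcs = next(n for n in ancestors(a) if n in anc_b)  # lowest common subsumer
    denom = IC[a] + IC[b]
    return 2 * IC[lcs] / denom if denom > 0 else 1.0


def rank_unseen(seen, unseen):
    """Sort unseen concepts by decreasing max similarity to the seen set (assumed scoring)."""
    score = {u: max(lin_similarity(u, s) for s in seen) for u in unseen}
    return sorted(unseen, key=lambda u: -score[u])


def split_levels(ranked, n_levels):
    """Cut the ranked list into equally sized consecutive levels (any remainder is dropped)."""
    size = len(ranked) // n_levels
    return [ranked[i * size:(i + 1) * size] for i in range(n_levels)]


ranked = rank_unseen(seen=["tiger_cat"], unseen=["car", "wolf", "lion"])
levels = split_levels(ranked, n_levels=3)
print(levels)
```

In this toy setting each level holds a single concept; in the benchmark, each of the five levels holds 1000 concepts.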


If you find our paper interesting, please consider citing us:

      title={Concept Generalization in Visual Representation Learning},
      author={Sariyildiz, Mert Bulent and Kalantidis, Yannis and Larlus, Diane and Alahari, Karteek},
      booktitle={International Conference on Computer Vision},

Benchmark results:

In the paper, we evaluate 31 state-of-the-art models on the ImageNet-CoG benchmark. We use ResNet50 as a reference model. The remaining 30 models are divided into four categories.

Architecture: Models with different backbone architectures. Those with a number of parameters similar (resp. dissimilar) to ResNet50 are colored in red (resp. orange).

  1. T2T-ViT-t-14, visual transformer model
  2. DeiT-S, visual transformer model
  3. DeiT-S-distilled, distilled DeiT-S
  4. Inception-v3, CNN with inception modules
  5. NAT-M4, neural architecture search model
  6. EfficientNet-B1, neural architecture search model
  7. EfficientNet-B4, bigger EfficientNet-B1
  8. DeiT-B-distilled, bigger DeiT-S-distilled
  9. ResNet152, bigger ResNet50
  10. VGG-19, simple CNN architecture

Self-supervision: ResNet50 models trained with self-supervision.

  1. SimCLR-v2, online instance discrimination (ID)
  2. MoCo-v2, ID with momentum encoder and memory bank
  3. BYOL, negative-free ID with momentum encoder
  4. MoCHi, ID with negative pair mining
  5. InfoMin, ID with careful positive pair selection
  6. OBoW, online bag-of-visual-words prediction
  7. SwAV, online clustering
  8. DINO, online clustering
  9. BarlowTwins, feature de-correlation using positive pairs
  10. CompReSS, distilled model from SimCLR-v1 (with ResNet50x4)

Regularization: ResNet50 models with additional regularization

  1. MixUp, label-associated augmentation in input space
  2. Manifold-MixUp, label-associated augmentation in representation space
  3. CutMix, label-associated augmentation in input space
  4. ReLabel, model trained on a “multi-label” version of IN-1K
  5. Adv-Robust, adversarially robust model
  6. MEAL-v2, distilled ResNet50

Use of web data: ResNet50 models trained using additional data

  1. MoPro, trained on Webvision-V1
  2. Semi-Sup, semi-supervised model first pretrained on YFCC-100M, then fine-tuned on IN-1K
  3. Semi-Weakly-Sup, semi-weakly supervised model first pretrained on IG-1B, then fine-tuned on IN-1K
  4. CLIP, vision & language model trained on WebImageText.



Figure-2: Linear classification on the ImageNet-CoG benchmark. Top-1 accuracies for all 31 models listed above after training logistic regression classifiers on IN-1K and on each level L1/2/3/4/5. (a) Absolute top-1 accuracy on all levels. (b)-(e) Accuracy relative to the baseline ResNet50 for all the models, split across the four model categories presented above.
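The evaluation protocol described above (frozen features, then a logistic regression classifier per level) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the features are synthetic 2-D points rather than outputs of an IN-1K-trained network, the optimizer is plain full-batch gradient descent, and the names `train_linear_probe`, `top1_accuracy` and `make_toy_features` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def scores(W, b, x):
    """Linear class scores W x + b for one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_linear_probe(feats, labels, n_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression on frozen features (full-batch gradient descent)."""
    dim, n = len(feats[0]), len(feats)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(feats, labels):
            probs = softmax(scores(W, b, x))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                grad_b[k] += err
                for j in range(dim):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def top1_accuracy(W, b, feats, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(feats, labels):
        s = scores(W, b, x)
        correct += int(max(range(len(s)), key=s.__getitem__) == y)
    return correct / len(labels)


def make_toy_features(n_per_class, centers, rng, noise=0.3):
    """Synthetic stand-in for features extracted by a frozen backbone (assumption)."""
    feats, labels = [], []
    for y, center in enumerate(centers):
        for _ in range(n_per_class):
            feats.append([v + rng.gauss(0, noise) for v in center])
            labels.append(y)
    return feats, labels


rng = random.Random(0)
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0)]
train_x, train_y = make_toy_features(30, centers, rng)
test_x, test_y = make_toy_features(30, centers, rng)
W, b = train_linear_probe(train_x, train_y, n_classes=3)
acc = top1_accuracy(W, b, test_x, test_y)
print(f"top-1 accuracy: {acc:.3f}")
```

Only the linear layer is trained here; the feature extractor is never updated, which is what lets the same pretrained model be compared across IN-1K and each of the five levels.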

Benchmark files

These two files contain the concepts and data splits for ImageNet-CoG:

If the download doesn’t start automatically when you click the links above, please copy the links directly into your browser.
