Concept generalization in visual representation learning
Mert Bulent Sariyildiz1,2, Yannis Kalantidis1, Diane Larlus1, Karteek Alahari2
1 NAVER LABS Europe 2 Inria
ICCV 2021
Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be leveraged to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially in a self-supervised learning framework. Nonetheless, the choice of unseen concepts for such an evaluation is usually made arbitrarily, and independently of the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge that comes from WordNet in order to define a sequence of unseen IN-21K concept sets that are semantically more and more distant from the ImageNet-1K (IN-1K) subset, a ubiquitous training set. This allows us to benchmark visual representations learned on IN-1K out of the box. We conduct a large-scale study encompassing 31 convolutional and transformer-based models and show how different architectures, levels of supervision, regularization techniques, and use of web data impact the concept generalization performance.
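The benchmark orders unseen concepts by their semantic distance from IN-1K in the WordNet hierarchy. As a purely illustrative sketch (not the paper's exact measure, which uses the full IN-21K/WordNet hierarchy), the snippet below computes a simple path-based distance between two concepts in a toy is-a graph; the graph and all concept names are made up for this example:

```python
# Toy hypernym graph (illustrative only): child -> parent edges,
# standing in for WordNet's is-a hierarchy used by ImageNet-CoG.
hypernyms = {
    "tabby_cat": "cat",
    "cat": "carnivore",
    "dog": "carnivore",
    "carnivore": "mammal",
    "trout": "fish",
    "fish": "vertebrate",
    "mammal": "vertebrate",
}

def ancestors(node):
    """Return the chain [node, parent, grandparent, ..., root]."""
    chain = [node]
    while node in hypernyms:
        node = hypernyms[node]
        chain.append(node)
    return chain

def semantic_distance(a, b):
    """Number of edges on the path from a to b through their
    lowest common ancestor, or None if they share no ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    depth_b = {n: i for i, n in enumerate(up_b)}
    for i, n in enumerate(up_a):
        if n in depth_b:
            return i + depth_b[n]
    return None

print(semantic_distance("tabby_cat", "dog"))  # -> 3 (via carnivore)
```

Ranking IN-21K concepts by such a distance to the seen IN-1K concepts is what allows the benchmark to define levels that are "semantically more and more distant" from the training set.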
Citation:
If you find our paper interesting, please consider citing us:
@InProceedings{sariyildiz2021conceptgeneralization,
    title={Concept Generalization in Visual Representation Learning},
    author={Sariyildiz, Mert Bulent and Kalantidis, Yannis and Larlus, Diane and Alahari, Karteek},
    booktitle={International Conference on Computer Vision},
    year={2021}
}
Benchmark results:
In the paper, we evaluate 31 state-of-the-art models on the ImageNet-CoG benchmark. We use ResNet50 as a reference model. The remaining 30 models are divided into four categories.
Architecture: Models with different backbone architectures. Those with a similar (resp. dissimilar) number of parameters to ResNet50 are colored in red (resp. orange).
- T2T-ViT-t-14, visual transformer model
- DeiT-S, visual transformer model
- DeiT-S-distilled, distilled DeiT-S
- Inception-v3, CNN with inception modules
- NAT-M4, neural architecture search model
- EfficientNet-B1, neural architecture search model
- EfficientNet-B4, bigger EfficientNet-B1
- DeiT-B-distilled, bigger DeiT-S-distilled
- ResNet152, bigger ResNet50
- VGG-19, simple CNN architecture
Self-supervision: ResNet50 models trained with self-supervised learning.
- SimCLR-v2, online instance discrimination (ID)
- MoCo-v2, ID with momentum encoder and memory bank
- BYOL, negative-free ID with momentum encoder
- MoCHi, ID with negative pair mining
- InfoMin, ID with careful positive pair selection
- OBoW, online bag-of-visual-words prediction
- SwAV, online clustering
- DINO, online clustering
- BarlowTwins, feature de-correlation using positive pairs
- CompReSS, distilled model from SimCLR-v1 (with ResNet50x4)
Regularization: ResNet50 models trained with additional regularization.
- MixUp, label-associated augmentation in input space
- Manifold-MixUp, label-associated augmentation in representation space
- CutMix, label-associated augmentation in input space
- ReLabel, model trained on a “multi-label” version of IN-1K
- Adv-Robust, adversarially robust model
- MEAL-v2, distilled ResNet50
Use of web data: ResNet50 models trained using additional web data.
- MoPro, trained on Webvision-V1
- Semi-Sup, semi-supervised model first pretrained on YFCC-100M, then fine-tuned on IN-1K
- Semi-Weakly-Sup, semi-weakly supervised model first pretrained on IG-1B, then fine-tuned on IN-1K
- CLIP, vision & language model trained on WebImageText.
Results
Benchmark files
These two files contain the concepts and data splits for ImageNet-CoG:
- cog_concepts_split_file.pkl: List of image filenames in the train and test splits for all 5000 ImageNet concepts in the CoG levels (~678MB).
- cog_levels_mapping_file.pkl: List of ImageNet concept names for each ImageNet-CoG level (~100KB).
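As a minimal sketch of how these files might be consumed, assuming (this structure is not guaranteed by the descriptions above) that the levels file maps each CoG level to a list of concept IDs and the splits file maps each concept to its train/test image filenames, one could load them with Python's pickle module. The snippet writes tiny stand-in files so it runs end to end; with the real downloads, only the paths change:

```python
import os
import pickle
import tempfile

# Dummy stand-ins with the *assumed* structure of the two benchmark files.
levels = {"L1": ["n02119789", "n02100735"], "L2": ["n01580077"]}
splits = {"n02119789": {"train": ["n02119789_1.JPEG"],
                        "test": ["n02119789_2.JPEG"]}}

tmp = tempfile.mkdtemp()
levels_path = os.path.join(tmp, "cog_levels_mapping_file.pkl")
splits_path = os.path.join(tmp, "cog_concepts_split_file.pkl")
for path, obj in [(levels_path, levels), (splits_path, splits)]:
    with open(path, "wb") as f:
        pickle.dump(obj, f)

# Loading: with the real files, point these paths at the downloads.
with open(levels_path, "rb") as f:
    cog_levels = pickle.load(f)
with open(splits_path, "rb") as f:
    cog_splits = pickle.load(f)

print(sorted(cog_levels))                      # -> ['L1', 'L2']
print(cog_splits["n02119789"]["train"])        # -> ['n02119789_1.JPEG']
```

Please inspect the actual pickle contents after downloading, as the exact keys and nesting used here are an assumption for illustration.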
If the download doesn’t start automatically when you click the links above, please copy the links directly into your browser.