Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning
CVPR 2022

University of Maryland, College Park


By leveraging contrastive learning, clustering, and other pretext tasks, unsupervised methods for learning image representations have reached impressive results on standard benchmarks. The result has been a crowded field: many methods with substantially different implementations yield results that seem nearly identical on popular benchmarks, such as linear evaluation on ImageNet. However, a single result does not tell the whole story. In this paper, we compare methods using performance-based benchmarks such as linear evaluation, nearest neighbor classification, and clustering for several different datasets, demonstrating the lack of a clear frontrunner within the current state-of-the-art. In contrast to prior work that performs only supervised vs. unsupervised comparison, we compare several different unsupervised methods against each other. To enrich this comparison, we analyze embeddings with measurements such as uniformity, tolerance, and centered kernel alignment (CKA), and propose two new metrics of our own: nearest neighbor graph similarity and linear prediction overlap. Our analysis reveals that no single popular method should be treated, in isolation, as though it represents the field as a whole, and that future work ought to consider how to leverage the complementary nature of these methods. We also leverage CKA to provide a framework for robustly quantifying augmentation invariance, and provide a reminder that certain types of invariance will be undesirable for downstream tasks.


We examine a few popular assumptions about unsupervised image representation learning. We add to recent evidence that, among current methods, there is no clear "best" method. We show, as in the figure above, that unsupervised methods learn representations that are quite distinct from each other (and the analysis community should not treat them as interchangeable), even though some have similar objectives. We also take a closer look at augmentation invariance, providing evidence where others have only speculated about the interactions between color jitter and unsupervised training objectives.

No clear "best" method

No single method achieves the highest accuracy on every dataset for these ResNet-50 based models in this linear probing experiment (training only a final linear layer on top of frozen features).
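The linear probing protocol can be sketched as follows. This is a minimal numpy stand-in: in practice the features would come from a frozen, pre-trained ResNet-50 backbone and the probe would be trained with a full optimizer schedule; the function names and hyperparameters here are illustrative, not the paper's implementation.

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.1, steps=500):
    """Fit a single linear (softmax) layer on frozen backbone features.

    `features` has shape (n, d); in a real probe these would be the
    outputs of a frozen, pre-trained encoder.
    """
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # softmax cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    preds = np.argmax(features @ W + b, axis=1)
    return (preds == labels).mean()
```

Because the backbone is never updated, the probe's accuracy directly reflects how linearly separable the frozen representation already is, which is what makes it a standard benchmark for comparing pre-training methods.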

Learned representations exhibit interesting similarities, substantial differences

In addition to the figure at the top of this page, we show uniformity and tolerance results. These suggest that, intuitively, methods with similar objectives, such as the clustering methods (DeepCluster, SwAV), might be more similar to each other than to other methods. However, this same line of reasoning would suggest that these clustering-based methods are more similar to supervised learning than to the contrastive methods.
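For reference, both measurements can be computed directly from a set of embeddings. The sketch below assumes the common definitions: uniformity as the log of the mean pairwise Gaussian potential on the unit hypersphere (Wang & Isola, 2020), and tolerance as the mean cosine similarity between embeddings that share a class label; treat these as standard formulations rather than the paper's exact implementation.

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit hypersphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def uniformity(emb, t=2.0):
    # Log of the mean pairwise Gaussian potential (Wang & Isola, 2020).
    # More negative = embeddings spread more uniformly on the sphere.
    z = l2_normalize(emb)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))

def tolerance(emb, labels):
    # Mean cosine similarity between embeddings of the same class:
    # a proxy for how tightly semantic classes cluster.
    z = l2_normalize(emb)
    sims = z @ z.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return sims[same].mean()
```

The two metrics pull in opposite directions: pushing all embeddings apart improves uniformity but can break up class clusters, which is why reporting both gives a fuller picture of a representation than either alone.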

Color-invariant training produces demonstrably color-invariant representations

We use CKA to compare representations of augmented and non-augmented images. We show that the unsupervised methods tend to produce very similar representations for augmentations that were used during their pre-training.
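The measurement above relies on CKA between two representations of the same inputs. A minimal sketch of linear CKA (Kornblith et al., 2019) is below; feeding it the features of clean images and of their augmented counterparts yields an invariance score near 1 when the representation ignores the augmentation. The kernel variant used in any given experiment may differ; this linear form is the simplest instance.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n, d1) and Y (n, d2)
    of the same n inputs (Kornblith et al., 2019)."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # HSIC-based formulation with linear kernels.
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

A useful property for this kind of analysis is that linear CKA is invariant to orthogonal transformations and isotropic scaling of either representation, so it compares the geometry of the embeddings rather than their particular coordinates.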


The website template was borrowed from Ben Mildenhall via Hao Chen.