Diffusion Models Beat GANs on Image Classification
Preprint, superseded by Do text-free diffusion models learn discriminative visual representations?


While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.


As the title suggests, we show that diffusion models are better than GANs for both image generation and image classification. A few recent papers have started working toward unified image representation modeling, with better results than BigBiGAN. While current diffusion models fall short of the current state of the art, we hope that by showing their performance and promise, we bring much-deserved attention to their potential as joint generation-classification models. To this end, we explore their performance on several classification tasks and distill general principles for using their features effectively in these settings.

Diffusion models are better than GANs for unified representation learning.

They beat GAN-based models such as BigBiGAN for both image generation and image classification. This holds even though BigBiGAN was designed and trained with classification in mind, whereas the diffusion model originally targeted only generative tasks. Careful pooling and feature selection improve our results further, and a learned attention head operating on the frozen features improves them substantially. Together, these results demonstrate the promise of diffusion features for recognition tasks.
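To make the idea of an attention head over frozen features concrete, here is a minimal NumPy sketch of generic attention pooling followed by a linear classifier. It is not the head used in the paper, whose details are not given here: the single learned query vector, the scaling by the square root of the channel count, and the names `attention_pool` and `classify` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def attention_pool(tokens, query):
    """Pool (N, C) frozen feature tokens into one (C,) vector.

    `query` is a hypothetical learned (C,) vector; in training, only the
    head's parameters would be updated while the backbone stays frozen.
    """
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = softmax(scores)          # (N,), sums to 1
    return weights @ tokens, weights   # weighted average of the tokens


def classify(feature_map, query, w, b):
    """Map a frozen (C, H, W) feature map to class logits."""
    c = feature_map.shape[0]
    tokens = feature_map.reshape(c, -1).T   # (H*W, C): one token per location
    pooled, _ = attention_pool(tokens, query)
    return pooled @ w + b                   # w: (C, num_classes), b: (num_classes,)
```

With a zero query every spatial location receives the same weight, so the head reduces to global average pooling; a trained query can instead emphasize the locations that matter for the label.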

Performance is heavily reliant on block number, time step, and feature pooling.

Feature selection matters. Diffusion, unlike typical systems, has an additional hyperparameter: the time step. We find that features from time steps around the start and middle of the range tend to be best. In terms of block number, the features at or just after the bottleneck are the most useful. A similar trend holds for pooling: some pooling is better than both no pooling and pooling everything.
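The two knobs other than block number can be sketched in a few lines of NumPy. The first function applies the standard DDPM forward process to noise an image to time step `t`; the linear beta schedule and `num_steps=1000` are assumptions, not settings taken from this page. The second average-pools a `(C, H, W)` feature map, such as one U-Net block's activation, to a chosen spatial size before flattening. The function names are hypothetical, and the U-Net forward pass that would sit between the two steps is omitted.

```python
import numpy as np


def noise_image(x0, t, num_steps=1000, rng=None):
    """Forward diffusion: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    betas = np.linspace(1e-4, 0.02, num_steps)   # assumed linear schedule
    alpha_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def avg_pool(feature_map, out_size):
    """Average-pool a (C, H, W) map to (C, out_size, out_size).

    Requires H and W to be divisible by out_size.
    """
    c, h, w = feature_map.shape
    assert h % out_size == 0 and w % out_size == 0
    kh, kw = h // out_size, w // out_size
    blocks = feature_map.reshape(c, out_size, kh, out_size, kw)
    return blocks.mean(axis=(2, 4))


def pooled_embedding(feature_map, out_size):
    """Flatten the pooled map into a single vector for a linear classifier."""
    return avg_pool(feature_map, out_size).reshape(-1)
```

Setting `out_size=1` corresponds to pooling everything (one value per channel), while `out_size` equal to the input height and width corresponds to no pooling; intermediate values trade spatial detail against embedding size.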

Diffusion features are useful for downstream recognition tasks.

The diffusion model we use was originally trained for generation on ImageNet. Nevertheless, we find that the frozen features are useful for classification on FGVC datasets. While other self-supervised methods tend to generate better features, diffusion is somewhat competitive on Aircraft. We suggest that with more research and perhaps minor adjustments to the pre-training, diffusion could close the gap for these tasks.

Feature selection settings depend on the data.

Surprisingly, the same features that are best for ImageNet are not necessarily the best for FGVC datasets. Here, we explore settings for CUB. We find that the features at the bottleneck block tend to be best, and that smaller pooling kernels (larger feature maps) work better as well. This introduces an additional difficulty for extracting diffusion features: choosing the ideal settings for each dataset. However, it also points to an important direction for future research: automatic feature selection, as well as cross-time and cross-block feature pooling and aggregation.

Diffusion features vary dramatically depending on time and block.

We compute and plot centered kernel alignment (CKA) to compare features from diffusion and other models. By comparing diffusion features at different blocks and time steps, both to each other and to other models, we see interesting patterns emerge. The similarities between diffusion and other models on the diagonal of each plot suggest that, like other methods, diffusion models learn features that gradually become more discriminative toward later layers. Additionally, the near-zero values in, for example, the plot in the top right corner show that diffusion features at different blocks differ more from each other than they do from features of other methods. Overall, the qualitative trends in the CKA confirm our findings about the usefulness of the features for different block numbers and time steps.
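For readers unfamiliar with CKA, the sketch below computes the linear variant between two feature matrices whose rows correspond to the same examples. Whether the plots on this page use a linear or another kernel is not stated here, so the choice of linear CKA is an assumption, as is the function name.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between x (n, d1) and y (n, d2), computed on the same n examples.

    Returns a value in [0, 1]; 1 means the two representations match up to
    an orthogonal transform and isotropic scaling.
    """
    # Center each feature dimension across examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

Evaluating this for every pair of (block, time step) settings, or for a diffusion block against a layer of another model, yields the kind of similarity grid described above. The feature dimensions `d1` and `d2` may differ, which is what allows comparisons across architectures.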


The website was based on the popular template from Ben Mildenhall.