Do text-free diffusion models learn discriminative visual representations?
Submission Under Review, supersedes Diffusion Models Beat GANs on Image Classification

Abstract

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism in DifFormer, a transformer-based fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks -- image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation.

Overview


We answer the titular question with an emphatic "yes" -- text-free diffusion models learn highly useful, discriminative visual representations. Building upon recent works which explore the utility of diffusion for zero-shot tasks, segmentation, or classification with tiny image datasets such as CIFAR, we perform a comprehensive benchmark using popular classification, detection, and segmentation datasets to demonstrate the promising potential of text-free diffusion models for these tasks. We furthermore develop and propose sophisticated, flexible mechanisms for ideally using diffusion features, that is, DifFormer (attention-based feature fusion) and DifFeed (feedback-based feature fusion). With compelling results, we suggest diffusion models can be an unsupervised, unified representation model, which learns weights that generate good features for both generative and discriminative tasks, all from a single pretraining task and with a single set of weights.

Inputs need some noise.


While for standard models adding noise is an "attack" that degrades performance, we find that diffusion models actually need some noise in order to produce useful features, a phenomenon we illustrate with the plot above. We hypothesize that this is because the denoising task must not be too easy. However, there must be a balance: at later noise steps, performance declines because the input itself becomes unrecognizable.
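As a concrete illustration, the sketch below shows one way to read off such features: noise a clean batch to a chosen step t with the standard DDPM forward process and hook an intermediate U-Net block. This is a minimal sketch under assumptions, not the exact pipeline used in the paper; the helper names and the (x_t, timesteps) call signature are illustrative placeholders.

```python
import torch

def linear_alpha_bar(T: int = 1000) -> torch.Tensor:
    """Cumulative alpha_bar for the standard DDPM linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, T)
    return torch.cumprod(1.0 - betas, dim=0)

def ddpm_noise(x0: torch.Tensor, t: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Forward process q(x_t | x_0): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

@torch.no_grad()
def extract_block_features(unet: torch.nn.Module, block: torch.nn.Module,
                           x0: torch.Tensor, t: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Noise a clean batch to step t, run the frozen U-Net, and grab one block's output."""
    cache = {}
    hook = block.register_forward_hook(lambda m, inp, out: cache.update(feat=out))
    x_t = ddpm_noise(x0, t, alpha_bar)
    timesteps = torch.full((x0.shape[0],), t, device=x0.device, dtype=torch.long)
    unet(x_t, timesteps)  # guided-diffusion U-Nets typically take (x_t, timesteps)
    hook.remove()
    return cache["feat"]  # (B, C, H, W) feature map for a downstream head
```

Sweeping t in such a setup reproduces the qualitative trend above: too little noise makes the denoising task trivial, too much destroys the content.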

Features are complementary.


We show a comparison between features from guided diffusion and an MAE-pretrained ViT-B, plotting their similarity in terms of centered kernel alignment (CKA). We notice that many different diffusion blocks have features with significant similarity to the most discriminative layers of the ViT (the later layers). We also find that the features from different diffusion blocks and noise time steps are quite distinct from one another. Thus, we suggest that to unlock the power of the diffusion network as a visual representation learner, leveraging these diverse features in concert with each other is essential.
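For reference, the sketch below computes the standard linear form of CKA between two flattened, column-centered feature matrices; the function name and usage line are illustrative, not the exact evaluation code behind the figure.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between (n_samples, dim_x) and (n_samples, dim_y) feature matrices."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    xty = (x.T @ y).norm() ** 2           # ||X^T Y||_F^2
    xtx = (x.T @ x).norm()                # ||X^T X||_F
    yty = (y.T @ y).norm()                # ||Y^T Y||_F
    return xty / (xtx * yty)

# Illustrative usage: compare a pooled diffusion feature map with ViT tokens.
# sim = linear_cka(diffusion_feats.flatten(1), vit_feats.flatten(1))
```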

How should features be combined?


The first issue we tackle is pooling. Diffusion feature maps can be large, and for some tasks (e.g., fine-grained visual categorization), naively selecting a small pooling size can be detrimental to performance. So, we propose an attention head to act as a learned pooling mechanism. Also, as we note above, the features tend to be quite complementary, so we propose combining them across both blocks and time using our attention head. We refer to this mechanism, powered by the unique diversity of diffusion features, as DifFormer. We note that fusing across time requires many forward passes. As a more efficient alternative, we match, and sometimes even exceed, this performance without fusing across time by leveraging feedback. Feedback works very well due to the U-Net shape of the model and the compatibility of corresponding encoder and decoder blocks (demonstrated by the X-shape in the CKA figure). We name this approach DifFeed.
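To make the pooling idea concrete, here is a minimal sketch of a learned attention-pooling head over flattened feature maps; concatenating tokens from several blocks and noise steps gives a simple form of the fusion described above. The class name and design details are illustrative assumptions and may differ from the actual DifFormer head.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """A learned query cross-attends over spatial tokens to pool feature maps."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, C, H, W) maps from different U-Net blocks / noise steps,
        # assumed already projected to a shared channel dimension C == dim.
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)  # (B, N, C)
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # cross-attend query -> all tokens
        return self.norm(pooled.squeeze(1))        # (B, C) pooled representation
```

The pooled vector can then feed a lightweight classification head, avoiding the information loss of a fixed, small average pool.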

Diffusion models beat GANs (and some other models too) for image classification.


To borrow from the title of one of the most popular papers in the area, we find that "diffusion models beat GANs for image [classification]," not just synthesis. We also find that our DifFormer and DifFeed help them compete even with models like MAGE that leverage state-of-the-art pretrained image features (VQ-GAN tokens) and were pretrained with both tasks in mind. We believe that if diffusion models perform this well out of the box, it is very worthwhile to pursue more tailored modifications to unlock their potential for unified representation learning.

Diffusion models do well on fine-grained visual categorization (FGVC), too.


Compared to other SOTA unsupervised representation learning methods, diffusion models perform very well in this transfer learning setting.

And object detection.


Also for semantic segmentation!


Citation


The website was based on the popular template from Ben Mildenhall.