Accelerate High-Quality Diffusion Models with Inner Loop Feedback
Submission Under Review

  • 1University of Maryland, College Park
  • 2NVIDIA
  • *Internship at NVIDIA
  • Correspondence to: Matthew Gwilliam, Zhiyu Cheng.

Abstract

We propose Inner Loop Feedback (ILF), a novel approach to accelerating diffusion model inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs of a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor initialized to zero. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone and trains only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with the diffusion transformer (DiT) and text-to-image generation with the DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF's 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons.
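To make the feedback module described above concrete, here is a simplified PyTorch sketch: a copied backbone block whose contribution is gated by a zero-initialized learnable scale. The exact wiring is simplified relative to the paper, and the way the gated output gets folded back into the forward pass is shown further down this page.

```python
import copy

import torch
import torch.nn as nn


class FeedbackModule(nn.Module):
    """Simplified sketch of the feedback module: a copy of one backbone block whose
    contribution is gated by a learnable scale initialized to zero, so the frozen
    backbone's behavior is unchanged at the start of training."""

    def __init__(self, backbone_block: nn.Module):
        super().__init__()
        self.block = copy.deepcopy(backbone_block)  # same architecture and settings as the backbone block
        self.scale = nn.Parameter(torch.zeros(1))   # zero init => zero influence at first

    def forward(self, feats: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        # Returns a gated correction; the caller adds it to the backbone features
        # wherever the feedback is injected (an illustrative choice, see below).
        return self.scale * self.block(feats, *args, **kwargs)
```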

Overview


If you want to speed up your diffusion model, what you do might depend on the resources you have access to. If you don't want to or can't afford to train anything, you might opt for caching. By re-using intermediate diffusion outputs across time steps, you can save a lot of time, albeit at the expense of image quality. On the other hand, if you can afford to train, you will often use distillation, either to train a model that is smaller (and therefore faster) or to train one that takes more aggressive denoising steps (and so requires fewer steps to generate a decent image). The drawback here is that you typically won't recover quite the same quality and aesthetics as the original model (which we refer to as "maximum" quality). On this webpage (see the paper for more!) we describe the motivation for our feedback mechanism, explain how it works, and show that it works well. And that's it! The code is private, but feel free to email Matt and/or Zhiyu if you want to learn more.

Caching is not a free lunch.


The caching literature indicates that the features for a given block, at consecutive timesteps, are fairly similar. That is, if we take the first decoder block of a diffusion U-Net, its outputs at step 800 and step 750 are pretty similar. We find this tends to hold for diffusion transformers as well (the prior works are concerned mainly with U-Nets). However, we also find that the makeup of the features changes. In the plot above, we show, for each DiT block, the difference between the features at the given timestep and those at the first timestep, calculated by summing the elementwise differences. We show this for both a baseline (standard diffusion inference) and caching (every other step, re-use the latest result for the inner 14 blocks, from the 8th to the 21st block). We normalize by dividing the values in both plots by their shared maximum. This reveals that caching causes the features to evolve differently, especially at earlier steps, where they drift away from the first timestep's features more slowly.
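For reference, the measurement behind a plot like this can be sketched as follows; the hook-based feature capture and the use of absolute differences in the reduction are simplifications of what the plot actually uses.

```python
import torch


@torch.no_grad()
def feature_drift(features_per_step):
    """features_per_step: a list over sampled timesteps (first entry = first timestep),
    each entry a dict {block_idx: feature tensor}, captured e.g. with forward hooks.
    Returns {block_idx: [drift per timestep]}, where drift is the summed (absolute)
    elementwise difference from that block's features at the first timestep."""
    first = features_per_step[0]
    drift = {b: [] for b in first}
    for feats in features_per_step:
        for b in first:
            drift[b].append((feats[b] - first[b]).abs().sum().item())
    return drift

# To compare two runs (e.g., baseline vs. caching), divide both drift tables by their
# shared maximum so the plots sit on the same normalized scale.
```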


The fact that the intermediate features evolve differently is, by itself, a trivial observation; the more important question is whether the cached features perform worse. We find that, when we look closely, caching does in fact harm output quality. For these PixArt-alpha 512x512 images, note the dramatic loss of detail and identity in the left two samples, the blurriness in the next sample, and the artifacts in the rightmost sample. We hypothesize that we can rectify these issues by operating in the feature space itself. That is, we introduce a module that attempts to predict what the baseline features would have been: still skipping the computation for the step, but without the drawbacks.

Feedback attempts to enable more powerful diffusion steps.


A standard diffusion forward pass attempts to predict the noise added at the given timestep. Our method essentially attempts to predict some future noise, or, rather, multiple steps' worth of noise. This allows the diffusion process to proceed with the same quality, but in fewer steps. We do this by looping within the model: feeding the outputs of some block to a feedback module, which modifies the features before passing them to an earlier block to continue the forward pass.
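One simplified way to wire this loop up is sketched below; the chosen block indices, the cross-step caching, and the residual fusion are illustrative stand-ins rather than the exact configuration from the paper.

```python
import torch
import torch.nn as nn


class ILFBackbone(nn.Module):
    """Simplified sketch of the inner loop: the output of a chosen later block is cached
    at each denoising step, passed through the feedback module, and folded into the
    features entering a chosen earlier block at the next step. Indices and the fusion
    point are illustrative choices."""

    def __init__(self, blocks: nn.ModuleList, feedback: nn.Module,
                 early_idx: int = 8, late_idx: int = 21):
        super().__init__()
        self.blocks = blocks        # frozen backbone blocks (e.g., DiT blocks)
        self.feedback = feedback    # FeedbackModule from the sketch above; the only trained part
        self.early_idx = early_idx
        self.late_idx = late_idx
        self.cached = None          # later block's output from the previous denoising step

    def reset(self):
        self.cached = None          # call before sampling a new image

    def forward(self, x: torch.Tensor, cond) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i == self.early_idx and self.cached is not None:
                # Fold the previous step's deep features back into an earlier block's input.
                x = x + self.feedback(self.cached, cond)
            x = block(x, cond)
            if i == self.late_idx:
                self.cached = x.detach()
        return x

# The sampler then runs this backbone for far fewer denoising steps than the baseline,
# which is where the speedup comes from.
```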


Unlike caching, which re-uses features directly, we use features from the later blocks to modulate the computations of the earlier blocks. In effect, we use the model's previous progress toward a noise prediction to inform its current progress. When trained correctly, this allows for noise predictions that result in images that more closely approximate, or even exceed, the quality of the original results.
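As noted in the abstract, training is set up as distillation with the backbone frozen, so only the feedback module receives gradients. A simplified sketch of one training step is below; the specific teacher targets and loss terms shown here are a condensed stand-in for the actual distillation losses.

```python
import torch
import torch.nn.functional as F


def ilf_training_step(student, teacher, optimizer, x_t, cond):
    """Simplified distillation step. `student` is the frozen backbone wrapped with the
    feedback module (only the feedback parameters require grad); `teacher` is the
    unmodified backbone. The target shown here is a stand-in for the full losses."""
    with torch.no_grad():
        target = teacher(x_t, cond)      # what the un-accelerated model would predict

    pred = student(x_t, cond)            # forward pass with inner loop feedback
    loss = F.mse_loss(pred, target)      # simple stand-in for the distillation losses

    optimizer.zero_grad(set_to_none=True)
    loss.backward()                      # gradients flow only into the feedback module
    optimizer.step()
    return loss.item()

# The optimizer is built over the feedback module's parameters only, e.g.
# torch.optim.AdamW(student.feedback.parameters(), lr=1e-4), with the backbone
# frozen via requires_grad_(False).
```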

ILF generates high-quality images quickly.


Notice the superior detail of ILF compared to caching. With a 1.7x speedup, we achieve results similar to the un-accelerated baseline for these 512x512 PixArt-sigma images!


Image generation metrics all have their shortcomings, but the majority of these numbers confirm ILF's efficacy.

Citation


The website was based on the popular template from Ben Mildenhall.