Can Dispersive Loss Supercharge Diffusion Models Without Extra Pre‑training?
Dispersive Loss is a plug‑and‑play regularization technique that improves diffusion‑based generative models by encouraging dispersed internal representations. It requires no additional pre‑training, parameters, or data, and extensive experiments show consistent gains across model sizes and configurations.
Background and Motivation
Diffusion models have become a dominant class of generative models, yet their training objectives are largely based on regression losses and lack explicit regularization of internal representations. In parallel, self‑supervised representation learning, especially contrastive methods, has shown strong ability to learn useful features. Existing approaches such as Representation Alignment (REPA) improve diffusion training by aligning model representations with those of a pretrained encoder, but they depend on extra pre‑training, additional parameters, and external data.
To address these drawbacks, the authors propose a self‑contained, lightweight regularization technique called Dispersive Loss, which can be inserted into any diffusion model without extra pre‑training, parameters, or data.
Dispersive Loss Concept
Dispersive Loss encourages the internal representations of a diffusion model to spread out in the hidden space, mirroring the repulsive effect of contrastive learning but without requiring positive pairs. The loss is added alongside the standard diffusion regression loss:

L_total = L_diffusion + λ * L_dispersive

where L_dispersive depends only on the current batch of intermediate features and introduces no additional learnable parameters or layers.
Construction of Dispersive Loss
The loss is derived from the InfoNCE formulation. Starting from the standard InfoNCE loss,

L_InfoNCE = −log( exp(sim(z_i, z_i⁺)/τ) / Σ_j exp(sim(z_i, z_j)/τ) )

the authors retain only the denominator term, which pushes all pairs apart, and discard the positive‑pair attraction term in the numerator. This yields the Dispersive Loss:

L_dispersive = log( (1/B²) Σ_{i,j} exp(−D(z_i, z_j)/τ) )

where B is the batch size and D is a dissimilarity measure (squared ℓ2 distance by default) computed over all pairs of intermediate representations in the batch.
In practice, the loss can be computed with a few lines of code (see Algorithm 1).
Algorithm 1: Dispersive Loss based on InfoNCE and L2 distance
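Since the algorithm listing is not reproduced here, the following is a minimal plain‑Python sketch of the InfoNCE variant with squared ℓ2 distance; the function name and the default τ are illustrative choices, not the paper's exact code:

```python
import math

def dispersive_loss(z, tau=0.5):
    """InfoNCE-style dispersive loss: log of the mean pairwise
    exp(-||z_i - z_j||^2 / tau) over a batch of flattened features.
    Lower values correspond to more dispersed representations."""
    b = len(z)
    vals = []
    for i in range(b):
        for j in range(b):
            d2 = sum((a - c) ** 2 for a, c in zip(z[i], z[j]))
            vals.append(math.exp(-d2 / tau))
    return math.log(sum(vals) / len(vals))
```

Because exp(−d²/τ) ≤ 1, the loss is maximized at 0 when all representations collapse to a single point, and minimizing it pushes them apart.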
Variants of Dispersive Loss
The core idea extends to other contrastive objectives that only penalize negative pairs. The paper studies four variants:
InfoNCE with squared ℓ2 distance – the default version.
InfoNCE with cosine dissimilarity.
Hinge loss – a squared hinge formulation on negative‑pair distances.
Covariance loss – encouraging off‑diagonal entries of the feature covariance matrix toward zero.
All variants consistently outperform the baseline across experiments.
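The hinge and covariance variants can be sketched in the same plain‑Python style; the margin value and the exact normalization below are illustrative assumptions rather than the paper's settings:

```python
import math

def hinge_dispersive_loss(z, margin=1.0):
    # Squared hinge on pairwise l2 distances: penalize pairs that are
    # closer than `margin` (margin is an illustrative assumption).
    b, vals = len(z), []
    for i in range(b):
        for j in range(i + 1, b):
            d = math.sqrt(sum((a - c) ** 2 for a, c in zip(z[i], z[j])))
            vals.append(max(0.0, margin - d) ** 2)
    return sum(vals) / len(vals)

def covariance_dispersive_loss(z):
    # Sum of squared off-diagonal entries of the (biased) feature
    # covariance over the batch: zero when feature dims are uncorrelated.
    b, d = len(z), len(z[0])
    mean = [sum(row[k] for row in z) / b for k in range(d)]
    loss = 0.0
    for p in range(d):
        for q in range(d):
            if p == q:
                continue
            cov = sum((row[p] - mean[p]) * (row[q] - mean[q]) for row in z) / b
            loss += cov ** 2
    return loss
```

Both terms vanish once representations are sufficiently spread out (hinge) or decorrelated (covariance), so they act purely as repulsive regularizers.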
Integration with Diffusion Models
Dispersive Loss can be inserted into any diffusion or flow‑based generative model without modifying the original regression loss. Algorithm 2 shows the simple integration step: select the intermediate layer(s) whose features will be regularized and add the loss term.
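The integration step might look like the hedged sketch below; the model interface, the simple MSE regression objective, and the hyper‑parameter names are assumptions for illustration, not the paper's code:

```python
import math

def mse(pred, target):
    # Per-element mean squared error over a batch of flat vectors.
    n = sum(len(p) for p in pred)
    return sum((a - b) ** 2
               for p, t in zip(pred, target)
               for a, b in zip(p, t)) / n

def dispersive_loss(z, tau=0.5):
    # InfoNCE-style dispersive term: log-mean of exp(-||z_i - z_j||^2 / tau).
    b = len(z)
    s = sum(math.exp(-sum((a - c) ** 2 for a, c in zip(z[i], z[j])) / tau)
            for i in range(b) for j in range(b))
    return math.log(s / (b * b))

def training_step(model, x_noisy, target, lam=0.5, tau=0.5):
    # `model` is assumed to return (prediction, intermediate features)
    # from the chosen block; the original regression loss is untouched.
    pred, feats = model(x_noisy)
    return mse(pred, target) + lam * dispersive_loss(feats, tau)
```

The only change relative to a standard training step is the single added term; the regression loss and the model architecture are left as they were.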
Experimental Setup
Models: DiT and SiT variants.
VAE tokenizer produces a 32×32×4 latent space.
Sampling: 250‑step ODE Heun sampler.
Training: 80 epochs, no classifier‑free guidance (CFG).
Default hyper‑parameters: λ = 0.5, temperature τ = 0.5.
Results
Contrastive vs. Dispersive
Contrastive loss, which includes positive‑pair terms, fails to improve generation quality under most settings, especially when the two views have independent noise. Dispersive Loss, applied to a single‑view batch, consistently yields lower FID scores.
Effect of Different Variants
All four Dispersive variants (InfoNCE with ℓ2 distance, InfoNCE with cosine dissimilarity, hinge, and covariance) outperform the baseline, with the ℓ2‑based InfoNCE version achieving the largest FID reduction (≈11.35%).
Block Selection
Applying Dispersive Loss to any Transformer block improves performance, with the best results when regularizing all blocks simultaneously. The effect propagates to other blocks even when only a single block receives the loss.
Loss Weight and Temperature
Varying the regularization weight λ and temperature τ shows that all configurations improve over the baseline (FID = 36.49), confirming the robustness of the method.
Scaling to Different Model Sizes
Experiments on SiT and DiT models of sizes S, B, L, and XL demonstrate that Dispersive Loss yields consistent gains, with larger models benefiting more due to higher over‑fitting risk.
One‑Step Generative Models
The technique also improves recent one‑step diffusion models such as MeanFlow, confirming its generality.
System‑Level Comparison with REPA
Unlike REPA, which requires external pretrained encoders and extra data, Dispersive Loss is self‑contained, adding negligible computational overhead while delivering comparable or superior performance.
Conclusion
Dispersive Loss provides a simple, plug‑and‑play regularizer for diffusion‑based generative models that requires no extra pre‑training, parameters, or data. Extensive experiments across model architectures, sizes, and training settings demonstrate consistent improvements in generation quality, establishing Dispersive Loss as an effective and practical tool for advancing diffusion modeling.