Drifting Models Enable One‑Step Generation, Shattering Speed Records

The paper introduces Drifting Models, a new generative paradigm that moves the distribution evolution to the training phase, achieving true one‑step (1‑NFE) generation with state‑of‑the‑art ImageNet FID scores of 1.54 in latent space and 1.61 in pixel space, while eliminating the need for distillation or classifier‑free guidance.


Problem

Current diffusion and flow‑matching models rely on 20‑100 iterative denoising steps during inference, which makes generation slow and computationally expensive. Existing one‑step methods such as Consistency Models require complex distillation pipelines and still fall short of the quality of multi‑step diffusion models.

Proposed Approach: Drifting Models

The authors propose a fundamentally different generative paradigm called Drifting Models. Instead of iterating at inference time, the distribution evolution is shifted to the training stage: each SGD update is interpreted as a push-forward of the current sample distribution, and a drift-field describes how generated samples should move to match the data distribution.

Drift‑Field Theory

A drift‑field is a vector field \(v(x)\) that combines an attractive force from the data distribution and a repulsive force from the current generated distribution. When the generated distribution equals the data distribution, the drift‑field becomes zero, indicating equilibrium. The field is constructed using kernel‑based mean‑shift ideas, where a kernel \(k(\cdot)\) measures similarity between samples.
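To make this concrete, below is a minimal PyTorch sketch of such a kernel-based drift-field. The RBF kernel, the `bandwidth` parameter, and the `drift_field` helper name are illustrative assumptions; the paper's exact kernel and normalisation may differ.

```python
import torch

def drift_field(x_gen: torch.Tensor, x_data: torch.Tensor,
                bandwidth: float = 1.0) -> torch.Tensor:
    """Sketch of a kernel-based drift-field (assumed RBF kernel).

    Each generated sample is attracted toward a kernel-weighted mean
    of real data samples and repelled from a kernel-weighted mean of
    the generated samples themselves; in expectation the field
    vanishes when the two distributions coincide.
    """
    def weights(x, y):
        # RBF kernel on pairwise squared distances, shape (len(x), len(y))
        d2 = torch.cdist(x, y).pow(2)
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))

    w_pos = weights(x_gen, x_data)   # attraction weights
    w_neg = weights(x_gen, x_gen)    # repulsion weights

    attract = w_pos @ x_data / w_pos.sum(dim=1, keepdim=True)
    repel = w_neg @ x_gen / w_neg.sum(dim=1, keepdim=True)
    return attract - repel           # ~0 in expectation at equilibrium
```

The attractive term is exactly a mean-shift step toward the data, while the repulsive term keeps the generated samples from collapsing onto each other.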

Training Objective

The training loss minimizes the squared norm of the drift-field: \( \mathcal{L} = \|v(x)\|^2 \). A stop-gradient operation freezes the previous iteration’s drift-field, allowing the network to move toward the frozen target without back-propagating through the drift computation itself. This yields a simple, differentiable objective that encourages the push-forward distribution to converge to the data distribution.
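A hedged sketch of how the stop-gradient enters a training step, reusing the `drift_field` helper sketched above; constructing the frozen target as \( x + \mathrm{sg}(v(x)) \) is one natural reading of the objective, not the authors' verbatim code.

```python
def drifting_loss(generator: torch.nn.Module,
                  noise: torch.Tensor,
                  x_data: torch.Tensor) -> torch.Tensor:
    """Squared-norm drift objective with a frozen (stop-gradient) target."""
    x_gen = generator(noise)            # differentiable forward pass
    with torch.no_grad():               # stop-gradient: freeze the drift
        target = x_gen + drift_field(x_gen, x_data)
    # The loss value equals ||v(x)||^2, but gradients flow only through
    # x_gen, pulling each sample along the frozen drift direction.
    return (x_gen - target).pow(2).sum(dim=-1).mean()
```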

Feature‑Space Implementation

To obtain richer training signals, the drift‑field is computed in a pretrained feature space rather than raw pixel space. The authors use a latent‑MAE encoder (trained on ImageNet) to extract latent features, and also experiment with self‑supervised encoders such as SimCLR and MoCo v2. Positive samples are real data features, while negative samples are features of generated images, mirroring contrastive learning.
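The same objective lifts to feature space by passing real and generated images through a frozen pretrained encoder before computing the drift. In the sketch below, `encoder` stands in for the latent-MAE / SimCLR / MoCo v2 feature extractor, and the flattening step is an assumption about its output shape.

```python
def feature_drifting_loss(generator, encoder, noise, x_real):
    """Drift computed on frozen-encoder features instead of raw pixels.

    The encoder's weights are assumed frozen (requires_grad_(False)),
    but gradients still flow through it into the generator.
    """
    f_gen = encoder(generator(noise)).flatten(1)   # negatives: generated
    with torch.no_grad():
        f_real = encoder(x_real).flatten(1)        # positives: real data
        target = f_gen + drift_field(f_gen, f_real)
    return (f_gen - target).pow(2).sum(dim=-1).mean()
```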

One‑Step Generation

At inference time the model performs a single forward pass (1‑NFE) without any iterative denoising or distillation. The same network architecture used during training (a DiT‑like transformer with adaLN‑zero) directly maps Gaussian noise to an image (or latent) in one step.
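Inference then collapses to a single network evaluation. A minimal sketch, assuming a latent-space model paired with a decoder (the `decoder` name is hypothetical; in pixel space the generator's output is the image itself):

```python
@torch.no_grad()
def sample(generator, decoder, num_samples, latent_shape, device="cuda"):
    """1-NFE generation: one forward pass from Gaussian noise,
    with no iterative denoising, distillation, or guidance."""
    z = torch.randn(num_samples, *latent_shape, device=device)
    latents = generator(z)      # the single network evaluation
    return decoder(latents)     # map latents back to pixels
```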

Experiments

ImageNet 256×256 (latent space): 1-NFE FID = 1.54, outperforming SiT-XL/2 (2.06) and DiT-XL/2 (2.27) and comparable to multi-step diffusion models.

ImageNet 256×256 (pixel space): 1-NFE FID = 1.61, a large margin over StyleGAN-XL (2.30) and ADM (4.59).

Robot control: replacing the multi-step diffusion policy in Diffusion Policy with a 1-NFE Drifting Model matches or exceeds the success rate of the original 100-NFE policy.

No classifier-free guidance needed: best results are achieved with CFG scale = 1.0, i.e., without additional guidance.

Ablation Studies

Anti-symmetry of the drift-field: disrupting the anti-symmetry causes catastrophic failure, confirming the theoretical requirement (see the sketch after this list).

Batch size and sample count: larger numbers of positive and negative samples improve the estimation of the drift-field and boost generation quality.

Feature encoder quality: encoders pretrained with SimCLR or MoCo v2 yield strong results; a latent-MAE trained on ImageNet further improves FID to 1.36 after classifier fine-tuning.
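As a toy illustration of the anti-symmetry property the first ablation probes, the contribution between any pair of samples should be equal and opposite. The `pairwise_contribution` helper below is hypothetical, using an assumed RBF weight:

```python
def pairwise_contribution(a, b, bandwidth=1.0):
    # Drift contribution that sample b exerts on sample a; because the
    # RBF weight is symmetric in (a, b), the pair contribution is
    # anti-symmetric: swapping the arguments flips the sign.
    w = torch.exp(-(a - b).pow(2).sum() / (2.0 * bandwidth ** 2))
    return w * (b - a)

a, b = torch.randn(16), torch.randn(16)
assert torch.allclose(pairwise_contribution(a, b),
                      -pairwise_contribution(b, a))
```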

System‑Level Comparisons

The authors train larger variants and compare against prior one-step methods. Their best model (Base size) reaches 1.54 FID with only 87 GFLOPs, whereas StyleGAN-XL requires 1574 GFLOPs for a 2.30 FID, demonstrating superior efficiency.

Discussion and Conclusion

Drifting Models reinterpret generative modeling as training‑time distribution evolution, eliminating iterative inference. While the empirical results are compelling, several open questions remain, such as the precise conditions under which the drift‑field guarantees convergence and how to design optimal drift‑fields, kernels, and feature encoders. The authors anticipate that this perspective will inspire further research into alternative implementations of training‑driven distribution dynamics.

Key Highlights

Introduces a new generative paradigm that removes inference-time iteration.

Achieves true one-step, high-quality generation without distillation.

Sets new SOTA 1-NFE FID scores on ImageNet (1.54 latent, 1.61 pixel).

Demonstrates applicability to robot control tasks.
[Figures: Drifting Models illustration; drift-field visualisation; 2D distribution evolution example; sample evolution during training; anti-symmetry ablation; effect of batch size and sample count; feature encoder comparison; system-level comparison table; robot control results]
Tags: generative modeling, diffusion, ImageNet, Drifting Models, One-step Generation
Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.