Pixel Mean Flow: One‑Step Diffusion Beats Multi‑Step Models on ImageNet

The Pixel Mean Flow (pMF) method eliminates multi‑step sampling and latent‑space encoding, generating high‑quality images in a single step and achieving state‑of‑the‑art FID scores on ImageNet while drastically reducing computational cost.


Background

Traditional diffusion models require multi‑step sampling (dozens to hundreds of network passes) and latent‑space encoding (compressing images into a low‑dimensional space). Recent works such as Consistency Models, MeanFlow, and JiT have reduced one of these costs, but achieving single‑step generation directly in pixel space remains challenging.

Pixel Mean Flow (pMF) design

pMF trains a neural network to output a denoised image x in a single forward pass. During training, the loss is computed on the average velocity field u that describes the ODE trajectory of the diffusion process; the predicted image x is obtained by a simple transformation of u, which amounts to integrating the velocity over the diffusion time interval. Because natural images lie on a low‑dimensional manifold (the manifold hypothesis), predicting x, which resembles a clean or slightly blurred image, is easier than predicting u, whose target in high dimensions is dominated by noise‑like components far from the image manifold.
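To make the transformation concrete, here is a minimal sketch of the x↔u relation under common MeanFlow conventions (z_0 is the clean image, z_1 pure noise, and z_r = z_t − (t − r)·u). The function names and the model signature are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch

def u_from_x(x_pred, z_t, t, eps=1e-4):
    # Average velocity over [0, t] implied by an x-prediction:
    # z_0 = x = z_t - t * u  =>  u = (z_t - x) / t.
    return (z_t - x_pred) / t.clamp(min=eps)

def x_from_u(u_pred, z_t, t):
    # Inverse map: one-step integration of the average velocity from t to 0.
    return z_t - t * u_pred

@torch.no_grad()
def one_step_sample(model, shape):
    # Single forward pass: from pure noise at t = 1 directly to r = 0;
    # the network's image-like output is the generated sample.
    z1 = torch.randn(shape)
    t = torch.ones(shape[0], 1, 1, 1)
    r = torch.zeros_like(t)
    return model(z1, r, t)
```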

2‑D toy experiment: when the data are projected into a 512‑dimensional observation space, u‑prediction collapses while x‑prediction remains stable, confirming the hypothesis.

ImageNet experiment (256×256): u‑prediction yields an FID of 164.89, whereas x‑prediction stays in the single‑digit range.
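The toy setup can be reproduced in a few lines. This sketch assumes a circle manifold and a random orthonormal lift (the paper's exact construction may differ); it illustrates why the u‑target is dominated by isotropic high‑dimensional noise while the x‑target stays on the manifold.

```python
import torch

torch.manual_seed(0)
n, d_low, d_high = 4096, 2, 512

# 2-D data on a unit circle (a 1-D manifold embedded in 2-D).
theta = torch.rand(n, 1) * 2 * torch.pi
x2d = torch.cat([theta.cos(), theta.sin()], dim=1)

# Lift into a 512-D observation space with a fixed orthonormal map;
# the data still live on a 1-D manifold, just in higher dimensions.
A = torch.linalg.qr(torch.randn(d_high, d_low)).Q
x = x2d @ A.T                                  # shape (n, 512)

# Flow-matching targets at a random time t with z_t = (1 - t) * x + t * eps.
t = torch.rand(n, 1)
eps = torch.randn_like(x)
z_t = (1 - t) * x + t * eps

u_target = eps - x   # velocity target: dominated by isotropic 512-D noise
x_target = x         # x-target: stays on the low-dimensional manifold
```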


Training details

Network architecture: pMF‑H/16 (a "Huge"‑size Transformer with patch size 16, following ViT naming).

Losses: the standard diffusion loss on u, plus an optional perceptual loss on x (see the training‑step sketch after this list).

Optimizer: Muon, which converges faster and yields better results than Adam.
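Putting these pieces together, the following is a hedged sketch of one training step, assuming a rectified‑flow interpolation z_t = (1 − t)·x + t·ε and the MeanFlow identity u = v − (t − r)·du/dt with a stop‑gradient target. The helper names, the perceptual plug‑in, and the loss weight are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def pmf_training_step(model, x, perceptual_fn=None, lam=0.5, eps_div=1e-4):
    b = x.shape[0]
    t = torch.rand(b, 1, 1, 1)              # full-plane (r, t) sampling, r <= t
    r = torch.rand(b, 1, 1, 1) * t

    noise = torch.randn_like(x)
    z_t = (1 - t) * x + t * noise           # rectified-flow interpolation
    v = noise - x                           # instantaneous velocity dz/dt

    def u_fn(z, r_, t_):
        # The network outputs an image-like prediction; convert it to an
        # average velocity via the displacement z_r = z_t - (t - r) * u.
        x_pred = model(z, r_, t_)
        u = (z - x_pred) / (t_ - r_).clamp(min=eps_div)
        return u, x_pred

    # MeanFlow identity: u = v - (t - r) * d/dt u(z_t, r, t); the total time
    # derivative is a JVP along (dz/dt, dr/dt, dt/dt) = (v, 0, 1).
    (u_pred, x_pred), (du_dt, _) = torch.func.jvp(
        u_fn, (z_t, r, t), (v, torch.zeros_like(r), torch.ones_like(t)))
    u_tgt = (v - (t - r) * du_dt).detach()  # stop-gradient on the target

    loss = F.mse_loss(u_pred, u_tgt)        # diffusion loss on u
    if perceptual_fn is not None:           # optional perceptual loss on x
        loss = loss + lam * perceptual_fn(x_pred, x).mean()
    return loss
```

In a real run, Muon would serve as the outer optimizer; only the loss construction is sketched here.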

Experimental results

ImageNet 256×256: pMF‑H/16 achieves an FID of 2.22, surpassing the previous single‑step method EPG (FID = 8.82). Compared with StyleGAN‑XL (1574 Gflops per forward pass), pMF reduces FLOPs by a factor of 5.8 (i.e., roughly 270 Gflops per generated image) while delivering comparable FID.

ImageNet 512×512: using 32×32 patches, pMF reaches an FID of 2.48 with similar compute.

Adding a perceptual loss reduces FID from 9.56 to 3.53 (≈ 6‑point improvement).

Latent‑space VAE decoders incur large overhead (310 Gflops at 256 px, 1230 Gflops at 512 px), exceeding the total compute of the pMF generator.


Ablation studies

Muon optimizer converges faster and yields better FID than Adam.

MeanFlow full‑plane sampling (drawing (r, t) pairs from the entire plane) is essential; restricting sampling to r = t or r = 0 degrades performance (see the sketch after this list).

Traditional pre‑conditioners (EDM, sCM) are less effective than direct x‑prediction in high‑dimensional settings.
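For reference, the three (r, t) schedules compared in this ablation can be expressed as a small sampler; a minimal sketch with illustrative names:

```python
import torch

def sample_r_t(batch, mode="full"):
    # (r, t) schedules compared in the ablation (illustrative sketch).
    t = torch.rand(batch)
    if mode == "full":        # full-plane: any r <= t (essential per the paper)
        r = torch.rand(batch) * t
    elif mode == "r_eq_t":    # collapses to instantaneous flow matching
        r = t.clone()
    elif mode == "r_eq_0":    # only full-interval [0, t] averages
        r = torch.zeros_like(t)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return r, t
```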


Conclusion

The study demonstrates that single‑step, pixel‑space image generation without a latent space is feasible and competitive with multi‑step diffusion and GAN approaches, offering a simpler and more compute‑efficient generative model.

Paper: https://arxiv.org/abs/2601.22158
