Pixel Mean Flow: One‑Step Diffusion Beats Multi‑Step Models on ImageNet
The Pixel Mean Flow (pMF) method eliminates multi‑step sampling and latent‑space encoding, generating high‑quality images in a single step and achieving state‑of‑the‑art FID scores on ImageNet while drastically reducing computational cost.
Background
Traditional diffusion models require multi‑step sampling (dozens to hundreds of network passes) and latent‑space encoding (compressing images into a low‑dimensional space). Recent works such as Consistency Models, MeanFlow, and JiT have reduced one of these costs, but achieving single‑step generation directly in pixel space remains challenging.
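To make the cost difference concrete, here is a minimal sketch (ours, not the paper's code; `velocity_fn` and `avg_velocity_fn` are hypothetical stand-ins for trained networks) contrasting classic multi-step ODE sampling with the single-step jump that an average-velocity model enables:

```python
import numpy as np

def multi_step_sample(velocity_fn, z, n_steps=50):
    """Classic flow/diffusion sampling: integrate the ODE dz/dt = u(z, t)
    from t = 1 (noise) down to t = 0 (data) with n_steps Euler updates.
    Each update costs one network forward pass."""
    dt = 1.0 / n_steps
    t = 1.0
    for _ in range(n_steps):
        z = z - dt * velocity_fn(z, t)
        t -= dt
    return z

def one_step_sample(avg_velocity_fn, z):
    """Single-step generation: a network trained to predict the *average*
    velocity over the whole interval [0, 1] lets us traverse the entire
    trajectory with one forward pass."""
    return z - avg_velocity_fn(z, 0.0, 1.0)
```

In the linear toy case where the velocity is constant along the trajectory, both samplers recover the same endpoint, but the second uses one network evaluation instead of fifty.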
Pixel Mean Flow (pMF) design
pMF trains a neural network to output a denoised image x in a single forward pass. During training, the loss is computed on a velocity field u that describes the ODE trajectory of the diffusion process. The predicted image x is obtained by a simple transformation of the average velocity field, effectively integrating u over the diffusion time interval. Because natural images lie on a low-dimensional manifold (the manifold hypothesis), predicting x (which resembles a clean or slightly blurred image) is easier than predicting u, which behaves like high-dimensional noise.
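The "simple transformation" between the average velocity and the image prediction can be sketched as follows. This assumes the standard flow-matching setup z_t = x + t·ū(z_t, 0, t), where ū is the velocity averaged over [0, t]; the function names are ours:

```python
import numpy as np

def x_from_average_velocity(z_t, t, u_bar):
    """Integrating the average velocity u_bar over [0, t] gives
    z_t = x + t * u_bar, so the image prediction is x = z_t - t * u_bar."""
    return z_t - t * u_bar

def u_bar_from_x(z_t, t, x_pred, tiny=1e-4):
    """Inverse map: the network can output an image-like x while the
    training loss is still computed on the velocity field."""
    return (z_t - x_pred) / max(t, tiny)
```

Because the two parameterizations are related by this affine map, the loss can live on u while the network's output lives near the image manifold.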
2‑D toy experiment: when the data are projected into a 512‑dimensional observation space, u‑prediction collapses while x‑prediction remains stable, confirming the hypothesis.
ImageNet experiment (256×256): u‑prediction yields an FID of 164.89, whereas x‑prediction stays in the single‑digit range.
Training details
Network architecture: pMF‑H/16 (hierarchical transformer with patch size 16).
Losses: standard diffusion loss on u plus an optional perceptual loss on x.
Optimizer: Muon optimizer, which converges faster and yields better results than Adam.
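The objective described above can be sketched as a regression loss on the velocity plus an optional perceptual term on the image-like prediction. This is our illustrative reconstruction, not the paper's code; `perceptual_fn` is a stand-in for a feature-space distance such as LPIPS:

```python
import numpy as np

def pmf_training_loss(u_pred, u_target, x_pred, x_clean,
                      perceptual_fn=None, w_perc=1.0):
    """Squared-error regression on the (average) velocity u, plus an
    optional weighted perceptual loss on the predicted image x."""
    loss = float(np.mean((u_pred - u_target) ** 2))
    if perceptual_fn is not None:
        loss += w_perc * perceptual_fn(x_pred, x_clean)
    return loss
```

Because x-prediction gives the network an image-like output, the perceptual term can act directly on it; the ≈6-point FID gain reported below suggests this term carries real weight.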
Experimental results
ImageNet 256×256: pMF‑H/16 achieves an FID of 2.22, surpassing the previous single‑step method EPG (FID = 8.82). Compared with StyleGAN‑XL (1574 Gflops per forward pass), pMF reduces FLOPs by a factor of 5.8 while delivering comparable FID.
ImageNet 512×512: using 32×32 patches, pMF reaches an FID of 2.48 with similar compute.
Adding a perceptual loss reduces FID from 9.56 to 3.53 (≈ 6‑point improvement).
Latent‑space VAE decoders incur large overhead (310 Gflops at 256 px, 1230 Gflops at 512 px), exceeding the total compute of the pMF generator.
Ablation studies
Muon optimizer converges faster and yields better FID than Adam.
MeanFlow full‑plane sampling (sampling across the entire (r, t) plane) is essential; restricting sampling to r = t or r = 0 degrades performance.
Traditional pre‑conditioners (EDM, sCM) are less effective than direct x‑prediction in high‑dimensional settings.
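The three (r, t) sampling schemes from the ablation above can be sketched as follows. This is our illustration, assuming the MeanFlow convention r ≤ t (the interval over which the velocity is averaged); the mode names are ours:

```python
import numpy as np

def sample_time_pairs(batch, rng, mode="full"):
    """Draw (r, t) pairs for average-velocity training.
    'full'   : independent points over the whole triangle r <= t
               (reported as essential for performance);
    'r_eq_t' : r = t, collapsing to the instantaneous velocity;
    'r_eq_0' : r = 0, always averaging from the start of the interval."""
    t = rng.uniform(0.0, 1.0, size=batch)
    if mode == "full":
        r = rng.uniform(0.0, 1.0, size=batch) * t
    elif mode == "r_eq_t":
        r = t.copy()
    elif mode == "r_eq_0":
        r = np.zeros(batch)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return r, t
```

Restricting to either degenerate diagonal removes supervision over most of the plane, which matches the degradation the ablation reports.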
Conclusion
The study demonstrates that single‑step, no‑latent‑space image generation is feasible and competitive with multi‑step diffusion and GAN approaches, offering a simpler and more compute‑efficient generative model.
Paper: https://arxiv.org/abs/2601.22158
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.