FluxSR: The First 12B‑Parameter Single‑Step Diffusion Model for Real‑World Super‑Resolution
FluxSR introduces a novel single‑step diffusion approach for real‑world image super‑resolution built on the 12‑billion‑parameter FLUX.1‑dev model, employing Flow‑Trajectory Distillation, TV‑LPIPS and attention‑diversity losses to achieve high fidelity, reduced artifacts, and lower memory and compute costs.
Highlights
Developed FluxSR, a single‑step diffusion Real‑ISR model based on FLUX.1‑dev and the first such model built on a 12B‑parameter foundation.
Proposed Flow‑Trajectory Distillation (FTD) to align the noise‑to‑image flow with the low‑resolution‑to‑high‑resolution flow, preserving the realism of the original T2I model while enabling super‑resolution.
Introduced a large‑model‑friendly training strategy that eliminates the need for an extra teacher model during training, reducing memory consumption and training cost.
Problem Statement
Multi‑step diffusion models incur high computational cost, limiting their use in real‑world image super‑resolution (Real‑ISR).
Existing single‑step diffusion methods are constrained by the performance of their teacher models, often producing artifacts.
Training large models with distillation adds significant memory and compute overhead.
Proposed Solution
Introduce FluxSR, a single‑step diffusion Real‑ISR technique that leverages flow matching.
Apply Flow‑Trajectory Distillation (FTD) to distill a multi‑step flow model into a single step while preserving the output distribution of the original T2I model.
Adopt a training strategy that embeds teacher knowledge into the noise‑to‑image flow and generates flow data offline, avoiding an extra teacher model during training.
Design a new perceptual loss, TV‑LPIPS, that incorporates total‑variation ideas to restore high‑frequency components and reduce artifacts.
Introduce Attention Diversity Loss (ADL) as a regularizer to mitigate repetitive patterns in generated images.
Technical Foundations
Base Model: FLUX.1‑dev serves as the foundation; FluxSR learns the relationship between its noise‑to‑image flow and the low‑resolution‑to‑high‑resolution flow.
FTD: By keeping the original T2I flow unchanged and learning a super‑resolution (SR) flow trajectory, the method derives a direct LR‑to‑HR flow without distribution shift.
Training Strategy: Uses offline‑generated noise‑image pairs from the teacher model, eliminating online teacher inference and reducing GPU memory usage.
Loss Functions:
Reconstruction loss with v‑prediction.
TV‑LPIPS combines LPIPS with a total‑variation term to penalize high‑frequency periodic artifacts.
ADL computes cosine similarity between token features and their mean to encourage diverse attention patterns.
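To make the v‑prediction reconstruction loss concrete, here is a minimal sketch. It assumes the rectified‑flow convention commonly used by FLUX‑style models, x_t = (1 − t)·x_0 + t·ε with data at t = 0 and noise at t = 1, so the target velocity is ε − x_0. The function names and the toy "oracle" predictor are hypothetical, and flat lists stand in for image latents.

```python
import random

def interpolate(x0, eps, t):
    """Point on the straight path from data x0 (t=0) to noise eps (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

def velocity_target(x0, eps):
    """Constant velocity dx_t/dt along the straight path: eps - x0."""
    return [b - a for a, b in zip(x0, eps)]

def v_prediction_loss(predict_v, x0, eps, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, eps, t)
    v_hat = predict_v(xt, t)
    v = velocity_target(x0, eps)
    return sum((p - q) ** 2 for p, q in zip(v_hat, v)) / len(v)

# Toy data: a "clean" sample and Gaussian noise (assumed, for illustration only).
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]

def oracle(xt, t):
    """Hypothetical perfect predictor that 'knows' x0 (for testing only)."""
    return [(a - b) / t for a, b in zip(xt, x0)]

loss = v_prediction_loss(oracle, x0, eps, 0.7)  # essentially zero
```

A real training loop would replace `oracle` with the network being trained and average the loss over sampled times t.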
Method Details
The goal is to distill a pre‑trained text‑to‑image (T2I) flow model into a single‑step diffusion ISR model. Existing single‑step ISR methods fine‑tune T2I models with additional components (e.g., VSD or GAN objectives), but they suffer from flow misalignment, which shifts the generated distribution away from the real data distribution.
FTD formulates the LR‑to‑HR flow by fitting the intermediate vector field that maps noise to LR images, then exploits the linearity of ReFlow trajectories to compute the LR‑to‑HR flow analytically. The resulting equations are presented as figures in the original paper.
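The linearity argument can be illustrated in isolation. On a straight (ReFlow‑style) trajectory the velocity is constant, so it can be recovered from any two points, and a single Euler step from any time t lands exactly on the data endpoint. The sketch below assumes the convention x_t = (1 − t)·x_0 + t·ε and shows only this property; it is not the paper's FTD derivation, and all names are hypothetical.

```python
def velocity_from_two_points(x_a, t_a, x_b, t_b):
    """Constant velocity of a straight path through x_a (time t_a) and x_b (time t_b)."""
    return [(a - b) / (t_a - t_b) for a, b in zip(x_a, x_b)]

def one_step_to_data(x_t, t, v):
    """Single Euler step from time t back to t = 0: x_0 = x_t - t * v."""
    return [x - t * vi for x, vi in zip(x_t, v)]

# Toy endpoints of one straight trajectory (assumed values).
x0 = [0.2, -0.5, 0.9]   # data end, t = 0
eps = [1.0, 0.3, -0.4]  # noise end, t = 1

def path(t):
    """Linear interpolation between x0 and eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

# Velocity estimated from two intermediate points equals eps - x0.
v = velocity_from_two_points(path(0.8), 0.8, path(0.3), 0.3)
# One step from a third point recovers x0 up to floating-point rounding.
recovered = one_step_to_data(path(0.6), 0.6, v)
```

Curved trajectories would need many small steps; straightness is what makes the one‑step jump exact here.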
Direct parameterization of the flow avoids the need for a separate teacher model during inference, reducing both memory and compute overhead.
Training Strategy for Large Models
A naive implementation faces two challenges: inference efficiency, since it would require two flow models, and estimation error in the time‑dependent velocity field. The proposed strategy needs only one flow model at inference time and adds a reconstruction loss to improve performance.
Offline generation of 2400 noise‑image pairs at 1024×1024 resolution using FLUX.1‑dev provides the training data, removing the need for a real‑world training dataset.
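A minimal sketch of offline pair generation follows. It assumes a hypothetical `teacher_sample` function standing in for a multi‑step FLUX.1‑dev sampler (here just a cheap deterministic map, not the real model). Seeding each noise vector by its index makes every pair reproducible, and because the teacher runs only once per pair, it does not need to stay in GPU memory during training.

```python
import math
import random

def teacher_sample(noise):
    """Hypothetical stand-in for a multi-step teacher sampler (NOT FLUX.1-dev)."""
    return [math.tanh(0.5 * z) for z in noise]

def make_noise(seed, dim):
    """Draw a reproducible Gaussian noise vector for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def build_pair_dataset(num_pairs, dim, sampler):
    """Run the teacher once per seed and cache (seed, noise, image) records."""
    dataset = []
    for seed in range(num_pairs):
        noise = make_noise(seed, dim)
        dataset.append({"seed": seed, "noise": noise, "image": sampler(noise)})
    return dataset

# The described setup uses 2400 pairs at 1024x1024; tiny sizes keep this sketch fast.
pairs = build_pair_dataset(num_pairs=4, dim=16, sampler=teacher_sample)
```

In practice the records would be written to disk once and loaded by the training job, which then only needs the student model.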
Loss Functions
TV‑LPIPS reduces periodic artifacts by penalizing pixel‑wise variations while preserving sharp edges. Its formal definition is given in the original paper.
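The exact TV‑LPIPS formula is not reproduced in this summary. As a hedged illustration of its total‑variation ingredient only, the sketch below computes an anisotropic TV map (absolute horizontal plus vertical neighbor differences) on a small grayscale image stored as nested lists. A periodic checkerboard, the kind of repetitive high‑frequency pattern the loss targets, scores far higher than a smooth ramp. How the method combines TV with LPIPS is not shown, and `tv_map` and `tv_value` are hypothetical names.

```python
def tv_map(img):
    """Per-pixel anisotropic TV: |right-neighbor diff| + |down-neighbor diff|.

    At the last column/row the neighbor is the pixel itself, so that difference is 0.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            dx = img[i][min(j + 1, w - 1)] - img[i][j]
            dy = img[min(i + 1, h - 1)][j] - img[i][j]
            row.append(abs(dx) + abs(dy))
        out.append(row)
    return out

def tv_value(img):
    """Mean of the TV map: a scalar measure of high-frequency variation."""
    m = tv_map(img)
    return sum(map(sum, m)) / (len(m) * len(m[0]))

n = 8
checkerboard = [[float((i + j) % 2) for j in range(n)] for i in range(n)]
ramp = [[j / (n - 1) for j in range(n)] for i in range(n)]

print(tv_value(checkerboard))  # 1.75
print(tv_value(ramp))          # roughly 0.125
```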
ADL (Attention Diversity Loss) computes cosine similarity between each token’s feature vector and the mean feature vector across all tokens, encouraging diverse attention patterns and mitigating repetitive artifacts.
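Following the description above, here is a minimal sketch of an attention‑diversity score: the average cosine similarity between each token feature and the mean token feature. Which layers' features the method uses, and how the term is weighted, are not stated here, so the token vectors below are toy assumptions and `attention_diversity_loss` is a hypothetical name. Lower values mean more diverse tokens; identical (repetitive) tokens score close to 1.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def attention_diversity_loss(tokens):
    """Average cosine similarity between each token and the mean token."""
    dim = len(tokens[0])
    mean = [sum(t[k] for t in tokens) / len(tokens) for k in range(dim)]
    return sum(cosine(t, mean) for t in tokens) / len(tokens)

# Four identical tokens mimic a repetitive pattern; the second set is varied.
repetitive = [[1.0, 2.0, 0.5]] * 4
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, -1.0]]

rep_loss = attention_diversity_loss(repetitive)  # close to 1.0
div_loss = attention_diversity_loss(diverse)     # about 0.558
```

Minimizing this term during training pushes token features apart from their common mean, which is the intended counter to repeated textures.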
Experiments
Setup
Training data: 2400 synthetic 1024×1024 noise‑image pairs generated by FLUX.1‑dev; LR images obtained via the RealESRGAN degradation pipeline.
Test data: DIV2K‑val (synthetic) and two real datasets, RealSR and RealSet65, evaluated on full‑resolution images.
Baselines: Multi‑step diffusion ISR models (StableSR, DiffBIR, SeeSR, ResShift, AddSR) and single‑step diffusion ISR models (SinSR, OSEDiff, etc.).
Metrics: Four full‑reference (PSNR, SSIM, LPIPS, DISTS) and four no‑reference (MUSIQ, MANIQA, TOPIQ, Q‑Align) scores.
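To show how an HR training image becomes an LR input, here is a deliberately simplified, hypothetical degradation chain: 3×3 box blur, 4× average‑pool downsampling, and clamped Gaussian noise. It is not the RealESRGAN pipeline, which applies two rounds of randomized blur, resizing, noise, and JPEG compression; the function names and parameters are assumptions for illustration.

```python
import random

def box_blur(img):
    """3x3 mean filter with edge clamping."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [img[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            row.append(sum(vals) / 9.0)
        out.append(row)
    return out

def downsample(img, scale):
    """Average-pool non-overlapping scale x scale blocks."""
    h, w = len(img) // scale, len(img[0]) // scale
    return [[sum(img[i * scale + a][j * scale + b]
                 for a in range(scale) for b in range(scale)) / (scale * scale)
             for j in range(w)] for i in range(h)]

def add_noise(img, sigma, rng):
    """Additive Gaussian noise, clamped to [0, 1]."""
    return [[min(max(v + rng.gauss(0.0, sigma), 0.0), 1.0) for v in row] for row in img]

def degrade(hr, scale=4, sigma=0.02, seed=0):
    """Toy blur -> downsample -> noise chain producing a seeded LR image."""
    rng = random.Random(seed)
    return add_noise(downsample(box_blur(hr), scale), sigma, rng)

# A 16x16 synthetic "HR" image with values in [0, 1].
hr = [[((i * 7 + j * 3) % 16) / 15.0 for j in range(16)] for i in range(16)]
lr = degrade(hr)  # 4x4 LR image
```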
Quantitative Comparison
Tables 1 and 2 of the original paper show that FluxSR achieves the best performance on all no‑reference metrics across all test sets and competitive results on full‑reference metrics. Notably, FluxSR outperforms the multi‑step StableSR on every dataset despite using a single inference step.
Qualitative Comparison
Figure 5 of the original paper shows visual results on severely degraded inputs. FluxSR restores fine details more faithfully than competing methods, avoiding the artificial textures and ringing artifacts observed in DiffBIR, ResShift, SinSR, and others.
Ablation Studies
FTD loss: Removing FTD and using only reconstruction loss degrades high‑frequency detail and introduces noticeable artifacts (Table 3).
ADL and TV‑LPIPS: Replacing them with alternative perceptual losses reduces performance; the combination of TV‑LPIPS and ADL yields the best scores (Table 4).
Conclusion and Limitations
FluxSR demonstrates that a 12B‑parameter diffusion model can be distilled into an efficient single‑step Real‑ISR system that delivers high realism and high‑frequency fidelity at a reduced inference cost. However, the model remains large and computationally demanding, and periodic artifacts have not been completely eliminated. Future work will explore model pruning and more effective artifact‑suppression techniques to achieve a lightweight yet high‑performance Real‑ISR solution.
References
[1] One Diffusion Step to Real‑World Super‑Resolution via Flow Trajectory Distillation
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
