FLUX-Lightning Slashes Diffusion Inference to 4 Steps, Doubling Speed
FLUX-Lightning, introduced by PaddleMIX, combines phased consistency distillation, adversarial learning, distribution-matching distillation, and a reflow loss to cut diffusion model inference to just four steps while preserving image quality; paired with the CINN compiler, it gains a further 30%+ speedup on A800 GPUs, surpassing existing SOTA acceleration methods.
Background
Diffusion models have achieved remarkable results in high‑fidelity image and video generation, but their inference requires dozens to hundreds of denoising steps, each invoking a large U‑Net or Transformer, leading to prohibitive latency especially for high‑resolution or video generation.
Overall Optimization Scheme
PaddleMIX builds the Fast‑Diffusers toolbox, which integrates training‑free acceleration techniques such as dynamic redundant computation skipping (SortBlock), intelligent cache reuse (TeaBlockCache), and mathematical approximation (FirstBlock‑Taylor). These methods double inference speed while keeping generation quality nearly unchanged.
Distillation Acceleration and Framework Performance Optimization
The toolbox also incorporates model distillation and deep-learning compiler optimizations. It integrates Phased Consistency Models (PCM) and Improved Distribution Matching Distillation (DMD2), and introduces the self-developed distillation model FLUX-Lightning, which generates high-quality, high-resolution images in only four steps, achieving SOTA performance. In addition, the CINN compiler delivers a further inference speedup over Torch Compile, OneDiff, and TensorRT.
FLUX‑Lightning Overview
FLUX‑Lightning combines four components: phased consistency distillation, adversarial learning, distribution‑matching distillation, and reflow loss. It achieves over 2× speedup with almost unchanged quality.
Phased Consistency Distillation
Consistency models map any point on a PF‑ODE trajectory to the start point, enabling one‑step generation while retaining multi‑step capability. FLUX‑Lightning splits the ODE trajectory into M sub‑trajectories, samples a timestep within each sub‑trajectory, adds noise, and uses a teacher denoiser to denoise to the sub‑trajectory endpoint, enforcing consistency across the interval.
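The sub-trajectory sampling and interval consistency can be sketched as follows. This is a toy NumPy illustration: M, T, the linear noise schedule, and the contracting "teacher step" are placeholders, not FLUX-Lightning's actual settings.

```python
import numpy as np

# Toy sketch of phased consistency distillation on a 1-D "latent".
M = 4                                          # number of sub-trajectories
T = 1000                                       # total diffusion timesteps
edges = np.linspace(0, T, M + 1).astype(int)   # sub-trajectory boundaries

rng = np.random.default_rng(0)

def sample_phase_timestep(rng):
    """Pick a sub-trajectory m, then a timestep inside it."""
    m = int(rng.integers(0, M))
    t = int(rng.integers(edges[m] + 1, edges[m + 1] + 1))  # t in (edge_m, edge_{m+1}]
    return m, t

def consistency_loss(student, teacher_step, x0, rng):
    """Noise the latent at t, take one teacher solver step toward the
    sub-trajectory start, and require the student to map both points
    to the same output (self-consistency on the interval)."""
    m, t = sample_phase_timestep(rng)
    noise = rng.standard_normal(x0.shape)
    alpha = 1.0 - t / T                        # toy linear schedule
    x_t = alpha * x0 + (1.0 - alpha) * noise   # noised latent at timestep t
    x_prev = teacher_step(x_t, t)              # one teacher ODE step toward edges[m]
    return np.mean((student(x_t, t) - student(x_prev, t - 1)) ** 2)

# Toy usage: identity student, mildly contracting teacher step.
x0 = rng.standard_normal(8)
loss = consistency_loss(lambda x, t: x, lambda x, t: 0.99 * x, x0, rng)
```

Splitting the trajectory into M phases is what preserves multi-step capability: each phase boundary is a valid stopping point, so the distilled model can sample in 1 to M steps.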
Adversarial Learning
An adversarial discriminator, composed of a frozen teacher denoiser and trainable heads, judges real versus fake samples in latent space, further improving image quality under few‑step generation.
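A minimal sketch of the latent-space adversarial objective: a frozen random projection stands in for the frozen teacher features, and a hinge loss is used as one common adversarial objective. The actual head architecture and loss form are not specified in the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-space discriminator: frozen projection + small trainable head.
W_frozen = rng.standard_normal((16, 8))  # stand-in for frozen teacher features
w_head = rng.standard_normal(8)          # trainable discriminator head

def disc_logit(x):
    feats = np.tanh(x @ W_frozen)        # frozen feature extraction
    return feats @ w_head                # trainable head -> real/fake logit

def d_loss(real_latents, fake_latents):
    """Hinge discriminator loss: push real logits above +1, fakes below -1."""
    return (np.mean(np.maximum(0.0, 1.0 - disc_logit(real_latents)))
            + np.mean(np.maximum(0.0, 1.0 + disc_logit(fake_latents))))

def g_loss(fake_latents):
    """Student (generator) loss: raise the head's logit on generated latents."""
    return -np.mean(disc_logit(fake_latents))
```

Reusing frozen teacher blocks as the feature extractor keeps the discriminator's trainable parameter count small while grounding its judgments in features already adapted to the diffusion latent space.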
Distribution Matching Distillation
Inspired by One‑step Diffusion with Distribution Matching Distillation, FLUX‑Lightning minimizes the KL divergence between the student and teacher output distributions, using the student model’s score function to reduce extra parameters and computation.
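The core identity behind distribution-matching distillation is that the KL gradient at a generated sample reduces to a difference of score functions. It can be illustrated with 1-D Gaussians, where the scores are closed-form; everything below is a toy, whereas the real method estimates both scores with denoising networks.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def dmd_gradient(x, mu_real, sig_real, mu_fake, sig_fake):
    """grad_x KL(p_fake || p_real) ~= s_fake(x) - s_real(x)."""
    return gaussian_score(x, mu_fake, sig_fake) - gaussian_score(x, mu_real, sig_real)

# Descending this gradient pulls a sample toward the "real" mode: with the
# fake distribution centered on the current sample, x converges to mu_real.
x = 0.0
for _ in range(50):
    x -= 0.2 * dmd_gradient(x, 2.0, 1.0, x, 1.0)   # x -> 2.0
```

Using the student's own score function for the "fake" side, as the article notes, avoids training a separate fake-score network, saving parameters and compute.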
Reflow Loss
The reflow loss, borrowed from rectified flow, trains the student to predict a constant velocity along the straight path between noise and data. Straighter trajectories accumulate less solver error per step, which is exactly what matters in the few-step regime, improving stability and fidelity during distillation.
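A rectified-flow-style objective can be sketched as below; the straight-path parameterization is an assumption about the exact loss form used.

```python
import numpy as np

def reflow_loss(v_pred, x0, x1):
    """Along the straight path x_t = (1 - t) * x0 + t * x1 the target
    velocity is constant, v* = x1 - x0; the loss penalizes the student's
    predicted velocity for deviating from it."""
    return np.mean((v_pred - (x1 - x0)) ** 2)
```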
Algorithm Flow
The complete algorithm proceeds through interval definition, consistency function application, adversarial discrimination, and distribution‑matching optimization, as illustrated in the accompanying diagram.
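Putting the four objectives together, the overall training loss has the following shape; the weights λ are illustrative placeholders, not published values:

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_{\mathrm{PCD}}\,\mathcal{L}_{\mathrm{PCD}}
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
  + \lambda_{\mathrm{DMD}}\,\mathcal{L}_{\mathrm{DMD}}
  + \lambda_{\mathrm{reflow}}\,\mathcal{L}_{\mathrm{reflow}}
```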
CINN High‑Performance Inference
The CINN compiler translates high‑level model graphs into optimized low‑level code, applying multiple optimization passes to improve execution efficiency on CPUs and GPUs. Experiments on A800 GPUs show 30%‑36% speed improvements for FLUX‑1‑dev and FLUX‑1‑schnell models, outperforming competing frameworks.
Experiments
Setup: 450k training images from LAION-aesv2 (resolution > 1024, aesthetic score > 6, watermark probability < 0.5), with COCO-10k for evaluation. The FLUX base model with a CFG-augmented ODE solver is used. Metrics are CLIP similarity and FID-FLUX.
Quantitative Results: Ablation studies confirm that adversarial learning, distribution-matching distillation, and the reflow loss each improve performance. FLUX-Lightning achieves the best FID-FLUX (8.0182) among SOTA distillation models.
Qualitative Results: Visual comparisons show FLUX-Lightning excels in human body-part accuracy, text rendering, pose realism, and overall prompt adherence.
Human Evaluation: Four reviewers ranked five models on 50 challenging prompts. FLUX-Lightning received the highest average score (7.37), indicating superior aesthetic quality.
Training
Data preparation involves downloading the LAION-45w dataset and its file list, then launching distributed training with the provided command line. The training script text_to_image_generation_flux_lightning.py loads the FLUX teacher model and trains LoRA weights for the student under the distillation objectives described above.
Inference
Model weights can be downloaded, and inference is performed with a simple Python script that loads the LoRA weights and generates images using the specified prompt, steps, and guidance scale. CINN‑accelerated inference is enabled via environment variables and a dedicated script.
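A minimal inference sketch, assuming a diffusers-style ppdiffusers API; the pipeline class, checkpoint ID, LoRA filename, and parameter values below are all assumptions for illustration, and the repository's scripts are authoritative:

```python
# Hypothetical sketch -- pipeline class, checkpoint ID, LoRA path, and
# parameter values are assumptions; see the PaddleMIX repo for real scripts.
from ppdiffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev")
pipe.load_lora_weights("path/to/flux_lightning_lora.safetensors")

image = pipe(
    prompt="a photo of an astronaut riding a horse",
    num_inference_steps=4,   # the distilled four-step regime
    guidance_scale=3.5,      # illustrative value
).images[0]
image.save("flux_lightning_out.png")
```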
Conclusion and Outlook
FLUX‑Lightning demonstrates that combining phased consistency distillation, adversarial learning, distribution‑matching distillation, and reflow loss can achieve four‑step high‑quality image generation, and CINN further reduces latency to 1.66 s per image on A800. Future work includes exploring TrigFlow for quantization error reduction and more efficient adversarial losses, aiming for even faster generation without quality loss.
Open‑Source Links
Source code and usage instructions are available at https://github.com/PaddlePaddle/PaddleMIX.
