How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion
The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, reducing attention FLOPs by 99.5%, accelerating 8K image generation 6.3× while preserving quality, and supporting multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.
Background
Diffusion Transformers (DiT) have become a leading architecture for text‑to‑image generation, but their quadratic‑complexity self‑attention creates prohibitive latency for high‑resolution outputs. The authors aim to replace the full‑attention mechanism with a linear‑complexity alternative that can be applied to pretrained DiT models without sacrificing performance.
Survey of Efficient Attention
The paper categorises existing efficient‑attention techniques into three groups: (1) Formulation Variation, which replaces softmax with alternative kernels (e.g., Mamba‑2, Gated Linear Attention, Generalized Linear Attention); (2) Key‑Value Compression, which reduces the number of KV tokens via down‑sampling or learned projections (e.g., PixArt‑Sigma, Agent Attention, Linformer); and (3) Key‑Value Sampling, which sparsifies the attention matrix by selecting a subset of tokens (e.g., Routing Attention, Swin Transformer, BigBird, Longformer). The authors evaluate these strategies on FLUX.1‑dev, a state‑of‑the‑art DiT model, and find that most fail to preserve the quality of the pretrained DiT because they either distort fine‑grained details or break the strict token‑wise interactions the model relies on.
Four Critical Factors for Linearising Pretrained DiT
Through extensive experiments the authors identify four components that are essential for a successful linearisation: locality (attention should be confined to a neighbourhood around each query), formulation consistency (the linearised attention must retain the softmax‑scaled‑dot‑product form), high‑rank attention maps (the approximated maps should still capture complex token relationships), and feature integrity (original Q, K, V features must remain uncompressed).
CLEAR: Convolution‑Like Linear Diffusion Transformer
Guided by the above observations, the authors propose CLEAR, a convolution‑style local attention scheme in which each query attends only to tokens whose Euclidean distance from it is less than a radius r. This gives every query a fixed number of KV tokens, so the model's complexity is linear in the total number of image tokens. The attention window is circular rather than square, incurring only a modest constant‑factor overhead.
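The circular window is easy to picture as a boolean mask over the flattened token grid. The sketch below (our illustration, not the paper's code) builds that mask and confirms the key complexity fact: every interior query sees the same fixed number of KV tokens (≈πr²), so total attention work grows linearly with token count rather than quadratically.

```python
import numpy as np

def circular_local_mask(h, w, r):
    """Boolean mask of shape (h*w, h*w): query i may attend to key j only
    if their 2D grid positions are within Euclidean distance r (a sketch
    of CLEAR's circular local window; r is the paper's radius)."""
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (h*w, 2)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return dist < r

# Each interior query sees a constant number of KV tokens, independent of
# image size — the source of the linear complexity.
mask = circular_local_mask(16, 16, r=4)
print(mask.sum(axis=1).max())  # → 45 KV tokens (≈ π·4²), vs. 256 for full attention
```

Doubling the resolution quadruples the token count but leaves the per‑query KV budget unchanged, which is exactly why the FLOPs saving grows with resolution.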
Training and Optimisation
Only the attention‑layer parameters are fine‑tuned; the rest of the DiT backbone remains frozen. A knowledge‑distillation objective built on flow matching [20, 21] is combined with an attention‑layer consistency loss, with the terms weighted by hyper‑parameters α and β (α = β = 0.5, following LinFusion). Training uses 10K samples generated by the pretrained DiT, 10K fine‑tuning iterations, batch size 32, and DeepSpeed ZeRO‑2 on four H100 GPUs (≈1 day). Evaluation on the COCO‑2014 validation set (5,000 images at 1024×1024) reports FID, LPIPS, CLIP image similarity, DINO similarity, IS, FLOPs, PSNR, and MS‑SSIM.
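The overall objective can be sketched as a weighted sum of the three terms described above. This is a minimal illustration with names of our choosing (the paper's exact loss formulation may group the terms differently); MSE stands in for the actual distillation and consistency losses:

```python
import numpy as np

def clear_distill_loss(student_out, teacher_out,
                       student_attn, teacher_attn,
                       fm_loss, alpha=0.5, beta=0.5):
    """Sketch of the training objective: the student's flow-matching loss
    plus an output-distillation term against the frozen teacher and an
    attention-layer consistency term, weighted by alpha and beta
    (alpha = beta = 0.5 in the paper's setup)."""
    kd = np.mean((student_out - teacher_out) ** 2)      # match teacher outputs
    attn = np.mean((student_attn - teacher_attn) ** 2)  # match attention-layer outputs
    return fm_loss + alpha * kd + beta * attn

# Hypothetical toy tensors; fm_loss would come from the flow-matching target.
loss = clear_distill_loss(np.ones(4), np.zeros(4), np.ones(4), np.zeros(4), fm_loss=0.2)
print(loss)
```

Because only the attention layers receive gradients, the frozen teacher and student share every other parameter, which keeps the 10K‑iteration fine‑tune cheap.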
Multi‑GPU Parallel Inference
Because each query only accesses a local window, CLEAR enables efficient patch‑wise parallelism: each GPU processes a vertical image patch, and inter‑GPU communication is limited to the narrow border region. For text tokens, the authors approximate full‑attention by averaging the attention outputs across patches, eliminating the need to broadcast all KV pairs. This strategy is orthogonal to existing patch‑parallel methods such as DistriFusion [22].
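The communication pattern follows directly from the window radius: a patch only needs a halo of roughly r rows from each neighbour, and text‑token outputs are merged by averaging. The helpers below are a toy illustration with made‑up names (the real implementation exchanges token tensors across GPUs); using r rows as the halo slightly over‑covers the strict Euclidean window:

```python
def patch_rows_with_halo(h, num_gpus, gpu, r):
    """Row range of the token grid that GPU `gpu` needs when the image is
    split into `num_gpus` horizontal bands: its own rows plus a border of
    r rows from each neighbour — the only region that must be communicated."""
    rows_per = h // num_gpus
    start, stop = gpu * rows_per, (gpu + 1) * rows_per
    return max(0, start - r), min(h, stop + r)

def merge_text_outputs(per_patch_text_outs):
    """Approximate full attention for text tokens by averaging the
    per-patch attention outputs, avoiding a broadcast of all KV pairs."""
    n = len(per_patch_text_outs)
    return [sum(vals) / n for vals in zip(*per_patch_text_outs)]
```

For a 64‑row grid split across 4 GPUs with r = 8, GPU 1 needs rows 8–40: its own 16 rows plus an 8‑row border on each side, a small fraction of the full image.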
Experimental Setup
All experiments replace the attention layers of FLUX.1‑dev with CLEAR, testing three window radii. The implementation relies on PyTorch FlexAttention for fast sparse attention. Training data are synthetic samples generated by the original DiT, which yields better fine‑tuning results than training on real datasets.
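FlexAttention expresses sparse patterns through a `mask_mod` callable with signature `(batch, head, q_idx, kv_idx)` that, in real use, is passed to `torch.nn.attention.flex_attention.create_block_mask`. The sketch below (our reconstruction, not the paper's code) writes the circular window in that shape, kept as plain Python so the geometry is easy to check:

```python
def make_circular_mask_mod(width, r):
    """Build a mask_mod in the (b, h, q_idx, kv_idx) -> bool form that
    FlexAttention expects. Flattened token indices are mapped back to 2D
    grid coordinates and kept only within Euclidean distance r. In real
    use this would operate on index tensors and be compiled via
    create_block_mask; here it runs on plain ints for illustration."""
    def mask_mod(b, h, q_idx, kv_idx):
        qy, qx = q_idx // width, q_idx % width
        ky, kx = kv_idx // width, kv_idx % width
        # Strict squared-distance test avoids any floating-point sqrt.
        return (qy - ky) ** 2 + (qx - kx) ** 2 < r * r
    return mask_mod

mask_mod = make_circular_mask_mod(width=16, r=4)
print(mask_mod(0, 0, 0, 3), mask_mod(0, 0, 0, 4))  # → True False
```

Because FlexAttention only materialises the non‑masked blocks, this mask gives the speedups reported above without a custom CUDA kernel.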
Results
CLEAR reduces attention FLOPs by 99.5% and speeds up 8K image generation by 6.3× while achieving comparable or superior quantitative metrics to the original model (see Fig. 10). With the knowledge‑distillation loss, CLIP‑image scores exceed 90. Qualitative comparisons (Fig. 11) show that CLEAR preserves layout, texture, and colour. The method also scales to 2048×2048 and 4096×4096 resolutions, achieving MS‑SSIM up to 0.9 (Fig. 13). Using SDEdit as a high‑resolution up‑sampling baseline, CLEAR can control the trade‑off between fine detail and content preservation (Fig. 12).
Conclusion
By enforcing locality, formulation consistency, high‑rank attention maps, and feature integrity, CLEAR provides a practical linear‑complexity alternative for pretrained diffusion transformers, dramatically cutting compute and memory while maintaining generation quality, and enabling efficient multi‑GPU inference for ultra‑high‑resolution text‑to‑image synthesis.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.