How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion
The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, reducing attention FLOPs by 99.5%, accelerating 8K image generation 6.3× while preserving quality, and supporting multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.
Background
Diffusion Transformers (DiT) have become a leading architecture for text‑to‑image generation, but their quadratic‑complexity self‑attention creates prohibitive latency for high‑resolution outputs. The authors aim to replace the full‑attention mechanism with a linear‑complexity alternative that can be applied to pretrained DiT models without sacrificing performance.
Survey of Efficient Attention
The paper categorises existing efficient‑attention techniques into three groups: (1) Formulation Variation, which replaces softmax with alternative kernels (e.g., Mamba‑2, Gated Linear Attention, Generalized Linear Attention); (2) Key‑Value Compression, which reduces the number of KV tokens via down‑sampling or learned projections (e.g., PixArt‑Sigma, Agent Attention, Linformer); and (3) Key‑Value Sampling, which sparsifies the attention matrix by selecting a subset of tokens (e.g., Routing Attention, Swin Transformer, BigBird, Longformer). The authors evaluate these strategies on FLUX.1‑dev, a state‑of‑the‑art DiT model, and find that most fail to preserve the quality of the pretrained DiT because they either distort fine‑grained details or break the strict token‑wise interactions the model relies on.
Four Critical Factors for Linearising Pretrained DiT
Through extensive experiments the authors identify four components that are essential for a successful linearisation: locality (attention should be confined to a neighbourhood around each query), formulation consistency (the linearised attention must retain the softmax‑scaled‑dot‑product form), high‑rank attention maps (the approximated maps should still capture complex token relationships), and feature integrity (original Q, K, V features must remain uncompressed).
CLEAR: Convolution‑Like Linear Diffusion Transformer
Guided by the above observations, the authors propose CLEAR, a convolution‑style local attention scheme in which each query attends only to tokens whose Euclidean distance from it is less than a radius r. This gives every query a fixed number of KV tokens, so the model's complexity is linear in the total number of image tokens. The attention window is circular rather than square, incurring only a modest constant‑factor overhead.
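The circular window is easy to picture as a boolean mask over the flattened token grid. The sketch below (our illustration, not the paper's code) builds that mask and confirms the key complexity fact: every interior query sees the same fixed number of KV tokens (≈πr²), so total attention work grows linearly with token count rather than quadratically.

```python
import numpy as np

def circular_local_mask(h, w, r):
    """Boolean mask of shape (h*w, h*w): query i may attend to key j only
    if their 2D grid positions are within Euclidean distance r (a sketch
    of CLEAR's circular local window; r is the paper's radius)."""
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (h*w, 2)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return dist < r

# Each interior query sees a constant number of KV tokens, independent of
# image size — the source of the linear complexity.
mask = circular_local_mask(16, 16, r=4)
print(mask.sum(axis=1).max())  # → 45 KV tokens (≈ π·4²), vs. 256 for full attention
```

Doubling the resolution quadruples the token count but leaves the per‑query KV budget unchanged, which is exactly why the FLOPs saving grows with resolution.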
Training and Optimisation
Only the attention‑layer parameters are fine‑tuned; the rest of the DiT backbone remains frozen. A knowledge‑distillation objective built on flow matching [20, 21] is combined with an attention‑layer consistency loss, with the terms weighted by hyper‑parameters α and β (α = β = 0.5, following LinFusion). Training uses 10K samples generated by the pretrained DiT, 10K fine‑tuning iterations, batch size 32, and DeepSpeed ZeRO‑2 on four H100 GPUs (≈1 day). Evaluation on the COCO‑2014 validation set (5,000 images at 1024×1024) reports FID, LPIPS, CLIP image similarity, DINO similarity, IS, FLOPs, PSNR, and MS‑SSIM.
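The overall objective can be sketched as a weighted sum of the three terms described above. This is a minimal illustration with names of our choosing (the paper's exact loss formulation may group the terms differently); MSE stands in for the actual distillation and consistency losses:

```python
import numpy as np

def clear_distill_loss(student_out, teacher_out,
                       student_attn, teacher_attn,
                       fm_loss, alpha=0.5, beta=0.5):
    """Sketch of the training objective: the student's flow-matching loss
    plus an output-distillation term against the frozen teacher and an
    attention-layer consistency term, weighted by alpha and beta
    (alpha = beta = 0.5 in the paper's setup)."""
    kd = np.mean((student_out - teacher_out) ** 2)      # match teacher outputs
    attn = np.mean((student_attn - teacher_attn) ** 2)  # match attention-layer outputs
    return fm_loss + alpha * kd + beta * attn

# Hypothetical toy tensors; fm_loss would come from the flow-matching target.
loss = clear_distill_loss(np.ones(4), np.zeros(4), np.ones(4), np.zeros(4), fm_loss=0.2)
print(loss)
```

Because only the attention layers receive gradients, the frozen teacher and student share every other parameter, which keeps the 10K‑iteration fine‑tune cheap.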
Multi‑GPU Parallel Inference
Because each query only accesses a local window, CLEAR enables efficient patch‑wise parallelism: each GPU processes a vertical image patch, and inter‑GPU communication is limited to the narrow border region. For text tokens, the authors approximate full‑attention by averaging the attention outputs across patches, eliminating the need to broadcast all KV pairs. This strategy is orthogonal to existing patch‑parallel methods such as DistriFusion [22].
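The communication pattern follows directly from the window radius: a patch only needs a halo of roughly r rows from each neighbour, and text‑token outputs are merged by averaging. The helpers below are a toy illustration with made‑up names (the real implementation exchanges token tensors across GPUs); using r rows as the halo slightly over‑covers the strict Euclidean window:

```python
def patch_rows_with_halo(h, num_gpus, gpu, r):
    """Row range of the token grid that GPU `gpu` needs when the image is
    split into `num_gpus` horizontal bands: its own rows plus a border of
    r rows from each neighbour — the only region that must be communicated."""
    rows_per = h // num_gpus
    start, stop = gpu * rows_per, (gpu + 1) * rows_per
    return max(0, start - r), min(h, stop + r)

def merge_text_outputs(per_patch_text_outs):
    """Approximate full attention for text tokens by averaging the
    per-patch attention outputs, avoiding a broadcast of all KV pairs."""
    n = len(per_patch_text_outs)
    return [sum(vals) / n for vals in zip(*per_patch_text_outs)]
```

For a 64‑row grid split across 4 GPUs with r = 8, GPU 1 needs rows 8–40: its own 16 rows plus an 8‑row border on each side, a small fraction of the full image.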
Experimental Setup
All experiments replace the attention layers of FLUX.1‑dev with CLEAR, testing three window radii. The implementation relies on PyTorch FlexAttention for fast sparse attention. Training data are synthetic samples generated by the original DiT, which yields better fine‑tuning results than training on real datasets.
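FlexAttention expresses sparse patterns through a `mask_mod` callable with signature `(batch, head, q_idx, kv_idx)` that, in real use, is passed to `torch.nn.attention.flex_attention.create_block_mask`. The sketch below (our reconstruction, not the paper's code) writes the circular window in that shape, kept as plain Python so the geometry is easy to check:

```python
def make_circular_mask_mod(width, r):
    """Build a mask_mod in the (b, h, q_idx, kv_idx) -> bool form that
    FlexAttention expects. Flattened token indices are mapped back to 2D
    grid coordinates and kept only within Euclidean distance r. In real
    use this would operate on index tensors and be compiled via
    create_block_mask; here it runs on plain ints for illustration."""
    def mask_mod(b, h, q_idx, kv_idx):
        qy, qx = q_idx // width, q_idx % width
        ky, kx = kv_idx // width, kv_idx % width
        # Strict squared-distance test avoids any floating-point sqrt.
        return (qy - ky) ** 2 + (qx - kx) ** 2 < r * r
    return mask_mod

mask_mod = make_circular_mask_mod(width=16, r=4)
print(mask_mod(0, 0, 0, 3), mask_mod(0, 0, 0, 4))  # → True False
```

Because FlexAttention only materialises the non‑masked blocks, this mask gives the speedups reported above without a custom CUDA kernel.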
Results
CLEAR reduces attention FLOPs by 99.5% and speeds up 8K image generation by 6.3× while achieving comparable or superior quantitative metrics to the original model (see Fig. 10). With the knowledge‑distillation loss, CLIP‑image scores exceed 90. Qualitative comparisons (Fig. 11) show that CLEAR preserves layout, texture, and colour. The method also scales to 2048×2048 and 4096×4096 resolutions, achieving MS‑SSIM up to 0.9 (Fig. 13). Using SDEdit as a high‑resolution up‑sampling baseline, CLEAR can control the trade‑off between fine detail and content preservation (Fig. 12).
Conclusion
By enforcing locality, formulation consistency, high‑rank attention maps, and feature integrity, CLEAR provides a practical linear‑complexity alternative for pretrained diffusion transformers, dramatically cutting compute and memory while maintaining generation quality, and enabling efficient multi‑GPU inference for ultra‑high‑resolution text‑to‑image synthesis.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.