7× Faster Inference: Tsinghua’s Huang‑Gao Team Redesigns Vision‑Transformer Attention via Fourier Transforms
The AAAI 2026 paper by Tsinghua’s Huang‑Gao team shows that modeling Vision‑Transformer attention as a Block‑Circulant matrix and computing it with FFT reduces the quadratic complexity to O(N log N), delivering up to seven‑fold real‑world speedups without sacrificing accuracy.
Motivation and Key Observation
Self‑Attention in Vision Transformers requires pairwise similarity between all queries and keys, giving a quadratic O(N²) cost. Empirical analysis of DeiT, Swin‑Transformer and PVT shows that their learned attention matrices decompose naturally into 14×14 blocks, each exhibiting the same circulant pattern. This structure matches a Block‑Circulant with Circulant Blocks (BCCB) matrix, i.e., a 2‑D translation‑invariant global convolution.
Mathematical Foundations
A circulant matrix C satisfies that each row is a cyclic shift of the previous row. Multiplication C·x equals a circular cross‑correlation, which by the Cross‑Correlation Theorem can be computed as element‑wise multiplication of the discrete Fourier transforms (DFT) of the two signals followed by an inverse DFT. This yields an O(N log N) algorithm.
Cross‑Correlation Theorem: The Fourier transform of the cross‑correlation of two signals equals the element‑wise product of the first signal’s conjugated DFT and the second signal’s DFT.
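The 1‑D identity can be checked numerically in a few lines. A minimal NumPy sketch (variable names are illustrative, not from the paper): each row of C is a cyclic shift of the first row r, so C·x equals a circular cross‑correlation, computable via conjugated DFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
r = rng.standard_normal(n)   # first row of the circulant matrix
x = rng.standard_normal(n)

# circulant matrix: each row is a cyclic shift of the previous one
C = np.stack([np.roll(r, i) for i in range(n)])

direct = C @ x  # O(n^2) matrix-vector product
# O(n log n): cross-correlation theorem, conj(DFT(r)) * DFT(x), then inverse DFT
via_fft = np.fft.ifft(np.conj(np.fft.fft(r)) * np.fft.fft(x)).real

assert np.allclose(direct, via_fft)
```

The conjugation distinguishes cross‑correlation from circular convolution; for a circulant matrix built from its first column instead, the unconjugated product would apply.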
Extending to two dimensions, a BCCB matrix is equivalent to a 2‑D convolution with circular padding. Therefore a BCCB‑based attention can be evaluated with 2‑D FFTs at the same O(N log N) cost.
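The 2‑D case can be verified the same way. The sketch below (an illustration, not the paper's code) builds an explicit BCCB matrix from a kernel k on an H×W token grid and checks that its matrix‑vector product equals a circular 2‑D convolution computed with `fft2`.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 4, 4
k = rng.standard_normal((H, W))   # convolution kernel defining the BCCB matrix
X = rng.standard_normal((H, W))   # input feature map (one channel)

# explicit BCCB matrix: entry depends only on the wrapped 2-D shift between tokens
idx = np.arange(H * W)
i1, j1 = np.divmod(idx[:, None], W)
i2, j2 = np.divmod(idx[None, :], W)
A = k[(i1 - i2) % H, (j1 - j2) % W]

direct = (A @ X.ravel()).reshape(H, W)                  # O(N^2) dense product
fast = np.fft.ifft2(np.fft.fft2(k) * np.fft.fft2(X)).real  # O(N log N) via 2-D FFT

assert np.allclose(direct, fast)
```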
Circulant Attention Design
The attention matrix A is forced into the BCCB subspace by solving a Frobenius‑norm projection:
$$\min_{\tilde{A}\in\mathcal{B}} \|A-\tilde{A}\|_F^2,$$
where $\mathcal{B}$ denotes the set of BCCB matrices. Because a BCCB matrix is fully defined by its first row, the optimal $\tilde{A}$ is obtained by orthogonal projection onto the basis of shifted one‑hot vectors. The projected matrix replaces the raw attention matrix in the standard Softmax‑Attention pipeline, while the matrix‑vector products are implemented as FFT‑based convolutions.
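One natural realization of this projection (a sketch under my reading of the basis construction, not necessarily the paper's exact parameterization): because the BCCB basis matrices have disjoint supports, the Frobenius‑optimal approximation simply averages A over each wrapped‑shift equivalence class.

```python
import numpy as np

def project_to_bccb(A, H, W):
    """Frobenius-nearest BCCB matrix: average A over each wrapped-shift class."""
    idx = np.arange(H * W)
    i1, j1 = np.divmod(idx[:, None], W)
    i2, j2 = np.divmod(idx[None, :], W)
    di, dj = (i1 - i2) % H, (j1 - j2) % W   # wrapped 2-D shift of each entry
    kernel = np.zeros((H, W))
    np.add.at(kernel, (di, dj), A)          # sum entries within each shift class
    kernel /= H * W                         # every class has exactly H*W entries
    return kernel[di, dj]                   # expand averaged kernel back to BCCB form
```

As an orthogonal projection it is idempotent, and the residual A − Ã is orthogonal to the BCCB subspace, which is easy to confirm numerically.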
Auxiliary components are added, including:
Token Reweighting – an input‑dependent scalar r, applied before and after attention, that adaptively emphasizes important tokens.
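The description leaves the exact form of r open; the following is a hypothetical sketch of the wrapping, with `attn_fn` and `weight_fn` as placeholder callables rather than the paper's learned modules.

```python
import numpy as np

def reweighted_attention(x, attn_fn, weight_fn):
    """Hypothetical token-reweighting wrapper: an input-dependent per-token
    scalar r scales tokens before and after the attention operator."""
    r = weight_fn(x)            # shape (N, 1): one scalar per token
    return r * attn_fn(r * x)

# toy usage with placeholder functions (identity attention, mean-based weights)
x = np.ones((4, 3))
out = reweighted_attention(x,
                           attn_fn=lambda t: t,
                           weight_fn=lambda t: t.mean(axis=1, keepdims=True))
```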
Complexity Analysis
Standard Softmax Attention: O(N²) operations.
Circulant Attention: O(N log N) due to FFT‑based multiplication (see Equation 12 in the paper). The authors derive a step‑by‑step reduction of each BCCB product term to a sum of 2‑D cross‑correlations, confirming the asymptotic gain.
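Ignoring constant factors, the asymptotic gap between the two operation counts can be tabulated for token counts typical of patch‑16 ViTs at growing resolutions (the resolutions below are illustrative, not the paper's benchmark settings):

```python
import math

# N = (resolution / 16)^2 tokens for 224, 448, 896, 1792 px inputs (illustrative)
for N in (196, 784, 3136, 12544):
    ratio = (N * N) / (N * math.log2(N))   # O(N^2) vs O(N log N) operation counts
    print(f"N={N:6d}  asymptotic speedup factor ~ {ratio:.1f}x")
```

The ratio grows with N, which is why the measured speedup is largest at high resolutions; real throughput also depends on constants and memory layout, so it tracks rather than equals these figures.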
Experimental Results
ImageNet‑1K: Replacing the attention modules of DeiT, Swin‑Transformer and PVT with Circulant Attention consistently improves top‑1 accuracy while reducing FLOPs. For Swin‑Transformer, only the first two stages are modified, yet gains are still observed.
COCO Object Detection: The Circulant‑Attention‑augmented PVT‑S (CA‑PVT‑S) surpasses the larger PVT‑L by 1.3 box AP with fewer FLOPs, demonstrating the method’s advantage on high‑resolution tasks.
ADE20K Semantic Segmentation: Consistent performance gains are reported for both PVT and Swin backbones when Circulant Attention is applied.
Efficiency Analysis: At a resolution of 1536×1536, CA‑DeiT‑T uses roughly one‑eighth the FLOPs of the baseline and achieves a 7× real‑world speedup. Throughput‑vs‑accuracy trade‑off curves show that Circulant Attention improves the balance between latency and accuracy across resolutions.
Visualization of Equivalent Global Convolution Kernels
Because a BCCB matrix corresponds to a 2‑D global convolution, the authors visualize the equivalent convolution kernels for different inputs. The kernels adapt to image content (e.g., bird shapes or power‑line patterns), confirming that the model generates data‑dependent global filters.
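This correspondence makes kernel extraction trivial: for a BCCB matrix, the first column reshaped to the token grid is the equivalent global kernel. A small self‑check (reusing the explicit BCCB construction from earlier, purely for illustration):

```python
import numpy as np

H, W = 4, 4
k = np.random.default_rng(2).standard_normal((H, W))  # ground-truth kernel

# build the BCCB matrix generated by kernel k
idx = np.arange(H * W)
i1, j1 = np.divmod(idx[:, None], W)
i2, j2 = np.divmod(idx[None, :], W)
A = k[(i1 - i2) % H, (j1 - j2) % W]

# the first column of A, reshaped to the grid, recovers the kernel to visualize
recovered = A[:, 0].reshape(H, W)
assert np.allclose(recovered, k)
```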
Conclusion
Circulant Attention exploits the inherent BCCB structure of attention maps learned by Vision Transformers. By projecting attention onto the BCCB subspace and performing matrix‑vector products with FFT‑based convolutions, the method reduces the theoretical complexity from O(N²) to O(N log N) and delivers up to 7× real‑world inference acceleration while preserving or improving accuracy on classification, detection and segmentation benchmarks.
Paper: Vision Transformers are Circulant Attention Learners (AAAI 2026) – https://arxiv.org/pdf/2512.21542
Code: https://github.com/LeapLabTHU/Circulant-Attention
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms, sharing hardcore technology, engineering practice, and deep insights from a dedicated AI practitioner.
