7× Faster Inference: Tsinghua’s Huang‑Gao Team Redesigns Vision‑Transformer Attention via Fourier Transforms

The AAAI 2026 paper by Tsinghua's Huang-Gao team shows that modeling Vision-Transformer attention as a Block-Circulant with Circulant Blocks (BCCB) matrix and computing it with the FFT reduces the quadratic O(N²) complexity to O(N log N), delivering up to seven-fold real-world speedups without sacrificing accuracy.


Motivation and Key Observation

Self-Attention in Vision Transformers computes pairwise similarities between all queries and keys, giving a quadratic O(N²) cost in the number of tokens N. Empirical analysis of DeiT, Swin-Transformer and PVT shows that their learned attention matrices decompose naturally into 14×14 blocks that are arranged in a circulant pattern, with each block itself approximately circulant. This structure matches a Block-Circulant with Circulant Blocks (BCCB) matrix, i.e., a 2-D translation-invariant global convolution.

Visualization of DeiT attention patterns resembling BCCB matrices

Mathematical Foundations

A circulant matrix C is one in which each row is a cyclic shift of the previous row. The product C·x is therefore a circular cross-correlation, which by the Cross-Correlation Theorem can be computed as an element-wise product of the discrete Fourier transforms (DFTs) of the two signals followed by an inverse DFT. With the FFT, this yields an O(N log N) algorithm.

Cross-Correlation Theorem: The Fourier transform of the cross-correlation of two signals equals the element-wise product of the first signal's conjugated DFT and the second signal's DFT.
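A minimal NumPy sketch (not the authors' implementation) that checks this identity: a circulant matrix defined by its first row is applied to a vector both naively and via the cross-correlation theorem.

```python
import numpy as np

def circulant_matvec_fft(r, x):
    """C @ x for a circulant matrix C whose first row is r, computed in
    O(N log N) via the cross-correlation theorem: IDFT(conj(DFT(r)) * DFT(x))."""
    return np.fft.ifft(np.conj(np.fft.fft(r)) * np.fft.fft(x)).real

def circulant_matvec_naive(r, x):
    """Reference O(N^2) product: build the dense circulant matrix explicitly."""
    N = len(r)
    C = np.stack([np.roll(r, i) for i in range(N)])  # row i = first row shifted by i
    return C @ x

rng = np.random.default_rng(0)
r, x = rng.standard_normal(16), rng.standard_normal(16)
assert np.allclose(circulant_matvec_fft(r, x), circulant_matvec_naive(r, x))
```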

Extending to two dimensions, a BCCB matrix is equivalent to a 2‑D convolution with circular padding. Therefore a BCCB‑based attention can be evaluated with 2‑D FFTs at the same O(N log N) cost.
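A corresponding 2-D sketch (illustrative, not from the paper's code): applying a BCCB matrix to an H×W token map amounts to a 2-D circular cross-correlation, which can be evaluated with `numpy.fft.fft2`.

```python
import numpy as np

def bccb_apply_fft2(kernel, x):
    """Apply a BCCB 'attention' matrix, parameterized by an (H, W) kernel, to an
    (H, W) token map x: a 2-D circular cross-correlation in O(N log N) via FFTs."""
    return np.fft.ifft2(np.conj(np.fft.fft2(kernel)) * np.fft.fft2(x)).real

def bccb_apply_naive(kernel, x):
    """Reference O(N^2) version: explicit 2-D circular cross-correlation."""
    H, W = x.shape
    y = np.zeros_like(x)
    for di in range(H):
        for dj in range(W):
            y += kernel[di, dj] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return y

rng = np.random.default_rng(0)
k, x = rng.standard_normal((6, 5)), rng.standard_normal((6, 5))
assert np.allclose(bccb_apply_fft2(k, x), bccb_apply_naive(k, x))
```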

Circulant Attention Design

The attention matrix A is forced into the BCCB subspace by solving a Frobenius‑norm projection:

min_{\tilde{A}\in\mathcal{B}} \|A-\tilde{A}\|_F^2

where \mathcal{B} denotes the set of BCCB matrices. Because a BCCB matrix is fully determined by its first row, the optimal \tilde{A} is obtained by orthogonally projecting A onto the basis of cyclic-shift matrices, which amounts to averaging the entries of A within each cyclic-shift class. The projected matrix replaces the raw attention matrix in the standard Softmax-Attention pipeline, while the matrix-vector products are implemented as FFT-based convolutions.
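A minimal sketch of this projection for the plain circulant (1-D) case, assuming the standard Frobenius-norm solution of averaging along cyclic diagonals; the paper's BCCB version applies the same idea in 2-D.

```python
import numpy as np

def project_to_circulant(A):
    """Nearest circulant matrix to A in Frobenius norm: average A along each
    cyclic diagonal to obtain the first row, then rebuild the circulant matrix."""
    N = A.shape[0]
    idx = np.arange(N)
    first_row = np.array([A[idx, (idx + k) % N].mean() for k in range(N)])
    return np.stack([np.roll(first_row, i) for i in range(N)])

A = np.random.default_rng(0).standard_normal((8, 8))
A_proj = project_to_circulant(A)
# Sanity check: projecting a circulant matrix changes nothing (idempotent projection).
assert np.allclose(project_to_circulant(A_proj), A_proj)
```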

Two auxiliary components are added:

Token Reweighting – an input‑dependent scalar r applied before and after attention to adaptively emphasize important tokens.

Complexity Analysis

Standard Softmax Attention: O(N²) operations.

Circulant Attention: O(N log N) due to FFT‑based multiplication (see Equation 12 in the paper). The authors derive a step‑by‑step reduction of each BCCB product term to a sum of 2‑D cross‑correlations, confirming the asymptotic gain.
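An illustrative timing comparison (not from the paper) contrasting a dense O(N²) circulant matrix-vector product with the FFT route; absolute numbers depend on hardware, but the gap widens as N grows.

```python
import time
import numpy as np

N = 4096
rng = np.random.default_rng(0)
r, x = rng.standard_normal(N), rng.standard_normal(N)

C = np.stack([np.roll(r, i) for i in range(N)])           # dense circulant matrix
t0 = time.perf_counter(); y_dense = C @ x                 # O(N^2) product
t1 = time.perf_counter()
y_fft = np.fft.ifft(np.conj(np.fft.fft(r)) * np.fft.fft(x)).real   # O(N log N)
t2 = time.perf_counter()

assert np.allclose(y_dense, y_fft)
print(f"dense matvec: {t1 - t0:.5f}s   FFT matvec: {t2 - t1:.5f}s")
```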

Experimental Results

ImageNet-1K: Replacing the attention modules of DeiT, Swin-Transformer and PVT with Circulant Attention consistently improves top-1 accuracy while reducing FLOPs. For Swin-Transformer, only the first two stages are modified, yet gains are still observed.

ImageNet‑1K results

COCO Object Detection: The Circulant-Attention-augmented PVT-S (CA-PVT-S) surpasses the larger PVT-L by 1.3 box AP with fewer FLOPs, demonstrating the method's advantage on high-resolution tasks.

COCO detection results

ADE20K Semantic Segmentation: Consistent performance gains are reported for both PVT and Swin backbones when Circulant Attention is applied.

ADE20K results

Efficiency Analysis: At a resolution of 1536×1536, CA-DeiT-T uses roughly 8× fewer FLOPs than its Softmax-Attention counterpart and achieves a 7× real-world speedup. Throughput-vs-accuracy trade-off curves show that Circulant Attention improves the balance between latency and accuracy across resolutions.

Efficiency comparison between Softmax and Circulant Attention

Visualization of Equivalent Global Convolution Kernels

Because a BCCB matrix corresponds to a 2‑D global convolution, the authors visualize the equivalent convolution kernels for different inputs. The kernels adapt to image content (e.g., bird shapes or power‑line patterns), confirming that the model generates data‑dependent global filters.

Visualization of learned global convolution kernels
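A small hypothetical helper for this kind of visualization (under the cross-correlation convention used above; the paper's exact plotting recipe may differ): the first row of an HW×HW BCCB attention matrix, reshaped to the (H, W) token grid, gives the equivalent global convolution kernel.

```python
import numpy as np

def equivalent_kernel(bccb_attention, H, W):
    """Extract the data-dependent global convolution kernel represented by an
    (H*W, H*W) BCCB attention matrix: its first row, reshaped to the token grid."""
    return np.asarray(bccb_attention)[0].reshape(H, W)
```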

Conclusion

Circulant Attention exploits the inherent BCCB structure of attention maps learned by Vision Transformers. By projecting attention onto the BCCB subspace and performing matrix‑vector products with FFT‑based convolutions, the method reduces the theoretical complexity from O(N²) to O(N log N) and delivers up to 7× real‑world inference acceleration while preserving or improving accuracy on classification, detection and segmentation benchmarks.

Paper: Vision Transformers are Circulant Attention Learners (AAAI 2026) – https://arxiv.org/pdf/2512.21542

Code: https://github.com/LeapLabTHU/Circulant-Attention

Tags: computer vision, efficiency, Attention Mechanisms, FFT, Vision Transformers, AAAI 2026, Circulant Matrices
Written by AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
