Can Circulant Attention Reduce Vision Transformer Cost by 7×?
This article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners," explaining how modeling self‑attention as a Block‑Circulant matrix enables FFT‑based multiplication that replaces the quadratic complexity of standard attention with O(N log N), achieving up to a seven‑fold inference speed‑up while preserving accuracy on ImageNet, COCO, and ADE20K benchmarks.
TL;DR
Circulant Attention reduces the quadratic complexity of standard self‑attention (O(N²)) to O(N log N) by constraining the attention matrix to be Block‑Circulant with Circulant Blocks (BCCB) and performing the matrix‑vector product with FFT‑based multiplication.
Paper Information
Title: Vision Transformers are Circulant Attention Learners (AAAI 2026)
ArXiv: https://arxiv.org/pdf/2512.21542
Code: https://github.com/LeapLabTHU/Circulant-Attention
1. Self‑Attention and Circulant Matrices
Standard self‑attention computes a raw attention matrix A = QKᵀ followed by a softmax. In vision Transformers the learned attention patterns closely resemble a BCCB matrix, a 2‑D extension of a circulant matrix whose multiplication can be carried out via FFT in O(N log N) time.
2. Circulant Matrix Theory
A circulant matrix C is defined by its first row; each subsequent row is a cyclic shift of the previous one. Multiplying C by a vector is equivalent to a circular cross‑correlation, which by the Cross‑Correlation Theorem equals element‑wise multiplication of the DFTs of the two signals followed by an inverse DFT.
Cross‑Correlation Theorem: The Fourier transform of the cross‑correlation of two signals equals the element‑wise product of the first signal’s conjugated DFT and the second signal’s DFT.
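As a quick sanity check of this equivalence, the following NumPy sketch (toy sizes and random data, not from the paper) builds a small circulant matrix from its first row and verifies that the FFT route matches the direct matrix‑vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)   # first row of the circulant matrix
x = rng.standard_normal(n)   # arbitrary input vector

# Row i of the circulant matrix is the first row cyclically shifted by i.
C = np.stack([np.roll(c, i) for i in range(n)])

# Direct O(n^2) product.
y_direct = C @ x

# FFT route via the Cross-Correlation Theorem: conjugate the kernel's
# spectrum, multiply element-wise, then invert the DFT.
y_fft = np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(x)).real

assert np.allclose(y_direct, y_fft)
```

The conjugation on the kernel's spectrum is what distinguishes cross‑correlation from circular convolution; for real kernels the two differ only by a reversal of the kernel.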
3. Block‑Circulant with Circulant Blocks (BCCB)
A BCCB matrix consists of blocks that are themselves circulant, and the blocks follow a circulant pattern. The entire matrix is fully determined by its first row. Reshaping that row into a 2‑D kernel shows that the product B·x is a 2‑D depthwise convolution with circular padding, which can be computed efficiently with FFT.
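A minimal NumPy sketch (illustrative grid size, not the paper's) that materializes a small BCCB matrix from its reshaped first row and confirms that multiplying it by a flattened token grid equals a 2‑D circular cross‑correlation computed with `fft2`:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 4, 4
N = H * W
k = rng.standard_normal((H, W))   # first row of the BCCB matrix, reshaped to 2-D
X = rng.standard_normal((H, W))   # input tokens on the 2-D grid

# Dense BCCB matrix: entry ((i1,i2),(j1,j2)) = k[(j1-i1) % H, (j2-i2) % W],
# i.e. circulant blocks arranged in a circulant pattern.
B = np.zeros((N, N))
for i1 in range(H):
    for i2 in range(W):
        for j1 in range(H):
            for j2 in range(W):
                B[i1 * W + i2, j1 * W + j2] = k[(j1 - i1) % H, (j2 - i2) % W]

y_dense = (B @ X.ravel()).reshape(H, W)

# FFT route: 2-D circular cross-correlation of the kernel with the grid.
y_fft = np.fft.ifft2(np.conj(np.fft.fft2(k)) * np.fft.fft2(X)).real

assert np.allclose(y_dense, y_fft)
```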
4. Motivation for Efficient Vision Transformers
Standard softmax attention incurs O(N²) cost. Empirically, vision Transformers learn attention maps that are highly structured and can be approximated by BCCB matrices, opening the possibility to replace the expensive pairwise computation with a structured, FFT‑friendly operation.
5. Modeling Attention as BCCB
The authors project the raw attention matrix A onto the subspace of BCCB matrices by orthogonal projection that minimizes the Frobenius norm ‖A − A_BCCB‖_F. The resulting matrix is fully described by its first row (the kernel).
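The paper projects onto the BCCB subspace; as a 1‑D illustration of the same idea (a toy example, not the authors' code), the Frobenius‑norm projection onto plain circulant matrices simply averages the matrix along each cyclic diagonal to obtain the kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))   # a raw, non-circulant attention matrix

# Frobenius-norm (orthogonal) projection onto circulant matrices:
# the kernel entry c[k] is the mean of A along the k-th cyclic diagonal.
c = np.array([np.mean([A[i, (i + k) % n] for i in range(n)]) for k in range(n)])
A_circ = np.stack([np.roll(c, i) for i in range(n)])

# Sanity check: orthogonal projections are idempotent, so projecting
# the projection must return the same kernel.
c2 = np.array([np.mean([A_circ[i, (i + k) % n] for i in range(n)]) for k in range(n)])
assert np.allclose(c, c2)
```

The BCCB case is analogous: averaging happens over the 2‑D cyclic diagonals of the block structure instead of the 1‑D ones.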
6. Efficient Computation
Given the BCCB representation, the matrix‑vector product is performed as:
Compute the DFT of the kernel (the first row) and the input vector.
Multiply the two spectra element‑wise.
Apply the inverse DFT to obtain the spatial result.
This reduces the per‑layer attention cost from O(N²) to O(N log N).
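Putting the three steps together for a grid of multi‑channel tokens that share one spatial kernel (shapes and the per‑channel layout are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, d = 8, 8, 4                         # token grid and channel dim (illustrative)
kernel = rng.standard_normal((H, W))      # BCCB first row, reshaped to 2-D
tokens = rng.standard_normal((H, W, d))   # value tokens laid out on the grid

# Step 1: DFT of the kernel and of the input (per channel).
K = np.fft.fft2(kernel)
T = np.fft.fft2(tokens, axes=(0, 1))

# Step 2: element-wise product in the spectral domain
# (conjugated kernel spectrum, per the Cross-Correlation Theorem).
Y = np.conj(K)[..., None] * T

# Step 3: inverse DFT back to the spatial domain.
out = np.fft.ifft2(Y, axes=(0, 1)).real
```

Each channel is correlated with the same kernel, i.e. a depthwise global convolution with circular padding, matching the BCCB interpretation above.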
7. Complexity Analysis
The attention step of Circulant Attention costs O(N log N), compared with O(N²) for vanilla softmax attention. FLOP counts show an 8× theoretical reduction, which translates into up to a 7× measured speed‑up at high resolutions (e.g., 1536 × 1536).
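Note that the asymptotic ratio for the attention product alone is far larger than the whole‑model 8× figure, since the linear projections and MLP blocks still scale with N and are untouched. A back‑of‑the‑envelope sketch (the 16 × 16 patch size is an assumption, not stated here):

```python
import math

# Illustrative token count: a 1536x1536 image with (assumed) 16x16 patches
# gives a 96x96 grid of N = 9216 tokens.
N = 96 * 96

pairwise = N ** 2              # O(N^2) pairwise attention-score products
fft_cost = N * math.log2(N)    # O(N log N) FFT-based circulant product

print(round(pairwise / fft_cost))  # ratio for this step alone (~700)
```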
8. Experimental Results
ImageNet‑1K: Replacing attention in DeiT, Swin Transformer, and PVT with Circulant Attention consistently improves top‑1 accuracy while reducing FLOPs.
COCO Object Detection: The CAT‑PVT‑S model surpasses the larger PVT‑L in box AP while using fewer FLOPs, demonstrating the advantage on high‑resolution detection tasks.
ADE20K Semantic Segmentation: Both PVT and Swin backbones gain accuracy when Circulant Attention is applied.
Efficiency: At 1536 × 1536 resolution, CA‑DeiT‑T uses 8× fewer FLOPs and achieves a 7× speed‑up. Throughput‑accuracy trade‑offs on ImageNet‑224 also improve.
9. Visualization of Equivalent Global Convolution Kernels
Because a BCCB matrix corresponds to a 2‑D global convolution, the authors visualize the learned kernels. Different input images (e.g., a bird on a wire) produce kernels that highlight semantically relevant structures, confirming the translation‑invariant behavior of the attention.
Conclusion
Circulant Attention bridges Vision Transformer attention with classical signal‑processing tools (FFT and convolution). By enforcing a BCCB structure, it provides a mathematically grounded, hardware‑friendly alternative that dramatically reduces computation while preserving or improving model performance, paving the way for next‑generation efficient visual foundation models.