Can Circulant Attention Reduce Vision Transformer Cost by 7×?
This article reviews the AAAI 2026 paper "Vision Transformers are Circulant Attention Learners," explaining how modeling self‑attention as a Block‑Circulant matrix enables FFT‑based multiplication that replaces the quadratic complexity of standard attention with O(N log N), achieving up to a seven‑fold inference speed‑up while preserving accuracy on ImageNet, COCO, and ADE20K benchmarks.
TL;DR
Circulant Attention reduces the quadratic complexity of standard self‑attention (O(N²)) to O(N log N) by constraining the attention matrix to be Block‑Circulant with Circulant Blocks (BCCB) and performing the matrix‑vector product with FFT‑based multiplication.
Paper Information
Title: Vision Transformers are Circulant Attention Learners (AAAI 2026)
ArXiv: https://arxiv.org/pdf/2512.21542
Code: https://github.com/LeapLabTHU/Circulant-Attention
1. Self‑Attention and Circulant Matrices
Standard self‑attention computes a raw attention matrix A = QKᵀ followed by a softmax. In vision Transformers the learned attention patterns closely resemble a BCCB matrix, a 2‑D extension of a circulant matrix whose multiplication can be carried out via FFT in O(N log N) time.
2. Circulant Matrix Theory
A circulant matrix C is defined by its first row; each subsequent row is a cyclic shift of the previous one. Multiplying C by a vector is equivalent to a circular cross‑correlation, which by the Cross‑Correlation Theorem equals element‑wise multiplication of the DFTs of the two signals followed by an inverse DFT.
Cross‑Correlation Theorem: The Fourier transform of the cross‑correlation of two signals equals the element‑wise product of the first signal’s conjugated DFT and the second signal’s DFT.
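As a quick sanity check of this equivalence, the following NumPy sketch (toy sizes and random data, not from the paper) builds a small circulant matrix from its first row and verifies that the FFT route matches the direct matrix‑vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)   # first row of the circulant matrix
x = rng.standard_normal(n)   # arbitrary input vector

# Row i of the circulant matrix is the first row cyclically shifted by i.
C = np.stack([np.roll(c, i) for i in range(n)])

# Direct O(n^2) product.
y_direct = C @ x

# FFT route via the Cross-Correlation Theorem: conjugate the kernel's
# spectrum, multiply element-wise, then invert the DFT.
y_fft = np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(x)).real

assert np.allclose(y_direct, y_fft)
```

The conjugation on the kernel's spectrum is what distinguishes cross‑correlation from circular convolution; for real kernels the two differ only by a reversal of the kernel.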
3. Block‑Circulant with Circulant Blocks (BCCB)
A BCCB matrix consists of blocks that are themselves circulant, and the blocks follow a circulant pattern. The entire matrix is fully determined by its first row. Reshaping that row into a 2‑D kernel shows that the product B·x is a 2‑D depthwise convolution with circular padding, which can be computed efficiently with FFT.
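A minimal NumPy sketch (illustrative grid size, not the paper's) that materializes a small BCCB matrix from its reshaped first row and confirms that multiplying it by a flattened token grid equals a 2‑D circular cross‑correlation computed with `fft2`:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 4, 4
N = H * W
k = rng.standard_normal((H, W))   # first row of the BCCB matrix, reshaped to 2-D
X = rng.standard_normal((H, W))   # input tokens on the 2-D grid

# Dense BCCB matrix: entry ((i1,i2),(j1,j2)) = k[(j1-i1) % H, (j2-i2) % W],
# i.e. circulant blocks arranged in a circulant pattern.
B = np.zeros((N, N))
for i1 in range(H):
    for i2 in range(W):
        for j1 in range(H):
            for j2 in range(W):
                B[i1 * W + i2, j1 * W + j2] = k[(j1 - i1) % H, (j2 - i2) % W]

y_dense = (B @ X.ravel()).reshape(H, W)

# FFT route: 2-D circular cross-correlation of the kernel with the grid.
y_fft = np.fft.ifft2(np.conj(np.fft.fft2(k)) * np.fft.fft2(X)).real

assert np.allclose(y_dense, y_fft)
```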
4. Motivation for Efficient Vision Transformers
Standard softmax attention incurs O(N²) cost. Empirically, vision Transformers learn attention maps that are highly structured and can be approximated by BCCB matrices, opening the possibility to replace the expensive pairwise computation with a structured, FFT‑friendly operation.
5. Modeling Attention as BCCB
The authors project the raw attention matrix A onto the subspace of BCCB matrices by orthogonal projection that minimizes the Frobenius norm ‖A − A_BCCB‖_F. The resulting matrix is fully described by its first row (the kernel).
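The paper projects onto the BCCB subspace; as a 1‑D illustration of the same idea (a toy example, not the authors' code), the Frobenius‑norm projection onto plain circulant matrices simply averages the matrix along each cyclic diagonal to obtain the kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))   # a raw, non-circulant attention matrix

# Frobenius-norm (orthogonal) projection onto circulant matrices:
# the kernel entry c[k] is the mean of A along the k-th cyclic diagonal.
c = np.array([np.mean([A[i, (i + k) % n] for i in range(n)]) for k in range(n)])
A_circ = np.stack([np.roll(c, i) for i in range(n)])

# Sanity check: orthogonal projections are idempotent, so projecting
# the projection must return the same kernel.
c2 = np.array([np.mean([A_circ[i, (i + k) % n] for i in range(n)]) for k in range(n)])
assert np.allclose(c, c2)
```

The BCCB case is analogous: averaging happens over the 2‑D cyclic diagonals of the block structure instead of the 1‑D ones.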
6. Efficient Computation
Given the BCCB representation, the matrix‑vector product is performed as:
Compute the DFT of the kernel (the first row) and the input vector.
Multiply the two spectra element‑wise.
Apply the inverse DFT to obtain the spatial result.
This reduces the per‑layer attention cost from O(N²) to O(N log N).
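Putting the three steps together for a grid of multi‑channel tokens that share one spatial kernel (shapes and the per‑channel layout are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, d = 8, 8, 4                         # token grid and channel dim (illustrative)
kernel = rng.standard_normal((H, W))      # BCCB first row, reshaped to 2-D
tokens = rng.standard_normal((H, W, d))   # value tokens laid out on the grid

# Step 1: DFT of the kernel and of the input (per channel).
K = np.fft.fft2(kernel)
T = np.fft.fft2(tokens, axes=(0, 1))

# Step 2: element-wise product in the spectral domain
# (conjugated kernel spectrum, per the Cross-Correlation Theorem).
Y = np.conj(K)[..., None] * T

# Step 3: inverse DFT back to the spatial domain.
out = np.fft.ifft2(Y, axes=(0, 1)).real
```

Each channel is correlated with the same kernel, i.e. a depthwise global convolution with circular padding, matching the BCCB interpretation above.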
7. Complexity Analysis
The attention step of Circulant Attention costs O(N log N), compared with O(N²) for vanilla softmax attention. FLOP counts show an 8× theoretical reduction, which translates into up to a 7× measured speed‑up at high resolutions (e.g., 1536 × 1536).
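Note that the asymptotic ratio for the attention product alone is far larger than the whole‑model 8× figure, since the linear projections and MLP blocks still scale with N and are untouched. A back‑of‑the‑envelope sketch (the 16 × 16 patch size is an assumption, not stated here):

```python
import math

# Illustrative token count: a 1536x1536 image with (assumed) 16x16 patches
# gives a 96x96 grid of N = 9216 tokens.
N = 96 * 96

pairwise = N ** 2              # O(N^2) pairwise attention-score products
fft_cost = N * math.log2(N)    # O(N log N) FFT-based circulant product

print(round(pairwise / fft_cost))  # ratio for this step alone (~700)
```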
8. Experimental Results
ImageNet‑1K: Replacing attention in DeiT, Swin Transformer, and PVT with Circulant Attention consistently improves top‑1 accuracy while reducing FLOPs.
COCO Object Detection: The CAT‑PVT‑S model surpasses the larger PVT‑L in box AP while using fewer FLOPs, demonstrating the advantage on high‑resolution detection tasks.
ADE20K Semantic Segmentation: Both PVT and Swin backbones gain accuracy when Circulant Attention is applied.
Efficiency: At 1536 × 1536 resolution, CA‑DeiT‑T uses 8× fewer FLOPs and achieves a 7× speed‑up. Throughput‑accuracy trade‑offs on ImageNet‑224 also improve.
9. Visualization of Equivalent Global Convolution Kernels
Because a BCCB matrix corresponds to a 2‑D global convolution, the authors visualize the learned kernels. Different input images (e.g., a bird on a wire) produce kernels that highlight semantically relevant structures, confirming the translation‑invariant behavior of the attention.
Conclusion
Circulant Attention bridges Vision Transformer attention with classical signal‑processing tools (FFT and convolution). By enforcing a BCCB structure, it provides a mathematically grounded, hardware‑friendly alternative that dramatically reduces computation while preserving or improving model performance, paving the way for next‑generation efficient visual foundation models.