Circulant Attention Mechanisms
- Circulant Attention is a novel mechanism that replaces dense self-attention with circulant and BCCB matrices to achieve sub-quadratic complexity.
- It uses FFT-based circular convolutions in place of dense matrix multiplications, vastly reducing memory usage and runtime.
- Empirical results show enhanced accuracy and speed in vision transformers, language models, and imaging tasks compared to traditional attention methods.
Circulant Attention is a class of attention mechanisms that leverage structured (circulant or block-circulant) matrices to achieve substantial reductions in computational and memory complexity over conventional dense attention, while maintaining or even enhancing the modeling capacity of deep neural networks. By capitalizing on the fast convolutional algorithms enabled by circulant structure, these methods offer sub-quadratic run time—typically $O(N \log N)$ or $O(NW)$ for $N$ tokens/locations and window size $W$—making high-resolution and long-sequence applications tractable in domains such as vision, language, and scientific imaging (Han et al., 25 Dec 2025, Yamada, 9 Apr 2025, Janjusevic et al., 2024).
1. Mathematical Definition of Circulant and Block-Circulant Attention
A circulant matrix $C \in \mathbb{R}^{N \times N}$ is generated by a vector $c \in \mathbb{R}^N$: each row is a cyclic shift of the previous, and matrix-vector multiplication by $C$ implements circular convolution with $c$ (Yamada, 9 Apr 2025). In two dimensions, a Block Circulant with Circulant Blocks (BCCB) matrix ($HW \times HW$ for an $H \times W$ grid) consists of $H \times H$ circulantly arranged blocks, each itself a $W \times W$ circulant.
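As a concrete check of the definition, the following minimal NumPy sketch builds a circulant matrix from its generator and verifies that the dense matvec matches circular convolution computed through the FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
c = rng.standard_normal(N)   # generator vector
x = rng.standard_normal(N)   # input vector

# Circulant matrix: C[i, j] = c[(i - j) mod N]; each row is a cyclic shift.
C = np.array([[c[(i - j) % N] for j in range(N)] for i in range(N)])

# Dense O(N^2) matvec vs. O(N log N) circular convolution via the FFT.
dense = C @ x
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real
assert np.allclose(dense, via_fft)
```

The equivalence is exactly the convolution theorem used throughout the methods below.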
In Circulant Attention, the conventional dense $A = \mathrm{softmax}(QK^\top/\sqrt{d})$ from self-attention is replaced or closely approximated by a circulant or BCCB matrix, so that

$$A \approx \mathrm{BCCB}(a), \qquad \mathrm{Attn}(Q, K, V) = \mathrm{BCCB}(a)\,V,$$

where the generator $a$ determines the first row (and thus all rows) of the BCCB (Han et al., 25 Dec 2025). In the CAT model, a softmaxed weighting vector $w$ forms the circulant generator, and the output is the circular convolution $\mathrm{circ}(w)\,V$ (Yamada, 9 Apr 2025).
In locally-windowed settings, a circulant-sparse attention pattern is adopted, restricting nonzero attention to a fixed window around each token while preserving circulant structure via periodic boundary conditions (Janjusevic et al., 2024).
2. Efficient Computation via FFT and Algorithmic Design
Exploiting the convolution theorem, multiplication by circulant (1D) and BCCB (2D) matrices is equivalently convolution in the spatial domain or pointwise multiplication in the frequency domain. For $x \in \mathbb{R}^N$ and circulant $C = \mathrm{circ}(c)$:

$$Cx = \mathcal{F}^{-1}\!\big(\mathcal{F}(c) \odot \mathcal{F}(x)\big).$$

For BCCB $C$ generated by $a$, with $x$ reshaped spatially as $X \in \mathbb{R}^{H \times W}$:

$$Cx \,\widehat{=}\, \mathcal{F}_{2\mathrm{D}}^{-1}\!\big(\mathcal{F}_{2\mathrm{D}}(a) \odot \mathcal{F}_{2\mathrm{D}}(X)\big),$$

with each transform costing $O(N \log N)$ for $N = HW$ (Han et al., 25 Dec 2025).
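The 2D case can be verified in a few lines of NumPy: a BCCB matrix acting on the flattened input is the same as a 2D circular convolution of the spatial reshape with the generator kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 4, 6
K = rng.standard_normal((H, W))   # generator kernel of the BCCB matrix
X = rng.standard_normal((H, W))   # input, reshaped spatially

# 2D circular convolution via FFT: O(HW log(HW)) instead of O((HW)^2).
Y = np.fft.ifft2(np.fft.fft2(K) * np.fft.fft2(X)).real

# Explicit BCCB matrix on the flattened input, for comparison only:
# C[(i, j), (p, q)] = K[(i - p) mod H, (j - q) mod W].
C = np.zeros((H * W, H * W))
for i in range(H):
    for j in range(W):
        for p in range(H):
            for q in range(W):
                C[i * W + j, p * W + q] = K[(i - p) % H, (j - q) % W]
assert np.allclose(C @ X.ravel(), Y.ravel())
```

The explicit matrix is built only to check the identity; in practice the BCCB is never materialized.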
A typical FFT-based circulant attention algorithm (flattened, single-head, omitting batch):
- Compute 2D FFTs of $Q$ and $K$, multiply pointwise, IFFT, and sum across channels to get the logit generator $a$.
- Apply a row-wise softmax (equivalently, softmax the generator $a$) to yield $A$; compute FFT($a$) and FFT($V$).
- Multiply pointwise and IFFT to recover the output $AV$ (Han et al., 25 Dec 2025).
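The steps above can be sketched as a minimal single-head NumPy version. This is an illustrative reconstruction, not the papers' exact implementation: the scaling and normalization choices here are assumptions.

```python
import numpy as np

def circulant_attention(Q, K, V):
    """Single-head circulant attention sketch; Q, K, V have shape (N, d).
    The N x N attention map is never materialized: every product is an
    O(N log N) circular correlation/convolution via the FFT."""
    N, d = Q.shape
    # Step 1: correlate Q and K per channel in the frequency domain and
    # sum over channels to get the logit generator a (length N).
    Fq = np.fft.fft(Q, axis=0)
    Fk = np.fft.fft(K, axis=0)
    a = np.fft.ifft((Fq * np.conj(Fk)).sum(axis=1)).real / np.sqrt(d)
    # Step 2: softmax the generator. Each row of the circulant map is a
    # cyclic shift of w, so this equals a row-wise softmax of the map.
    w = np.exp(a - a.max())
    w /= w.sum()
    # Step 3: apply the circulant map to V via FFT, channel by channel.
    Fw = np.fft.fft(w)
    return np.fft.ifft(Fw[:, None] * np.fft.fft(V, axis=0), axis=0).real
```

One useful property falls out directly: because the map is circulant, cyclically shifting $Q$, $K$, $V$ together cyclically shifts the output (shift-equivariance).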
For circulant-sparse attention in local windows, only $W$ weights per row are kept and softmaxed, and the application is a sparse matmul with cost $O(NW)$ (Janjusevic et al., 2024).
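A hedged sketch of the windowed variant (names and the shift-and-add application are illustrative, not the paper's CUDA kernel): mask the generator to a periodic window, softmax the surviving entries, and apply at $O(NW)$ cost.

```python
import numpy as np

def windowed_circulant_attention(sim, V, window):
    """Circulant-sparse attention sketch. sim is the length-N generator of
    similarities; only offsets within a periodic half-window survive, are
    softmaxed, and are applied by shift-and-add: O(N * W), not O(N^2)."""
    N = len(sim)
    k = np.arange(N)
    keep = np.minimum(k, N - k) <= window // 2  # circular distance mask
    logits = np.where(keep, sim, -np.inf)       # W nonzeros per row
    w = np.exp(logits - logits[keep].max())
    w /= w.sum()
    out = np.zeros_like(V)
    for off in np.nonzero(keep)[0]:             # loop over W offsets only
        out += w[off] * np.roll(V, off, axis=0)  # each row: V[(i-off) % N]
    return out
```

Since the softmaxed rows sum to one, feeding a constant signal returns it unchanged, which gives a quick sanity check.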
3. Architectural Integrations and Parameter Efficiency
Circulant attention is deployed as a drop-in replacement for standard attention in Vision Transformers, LLMs, and interpretable unrolled networks.
- In ViTs, the vanilla attention is replaced by the BCCB-projected attention. Token reweighting modules may compensate for normalizations inherent to BCCB structure (Han et al., 25 Dec 2025).
- In CAT, combining query and key into a single projection reduces the per-head projections from three weight matrices to two (the "qv" variant); the "qkv" variant retains three distinct but FFT-friendly projections. Average pooling and partial attention replacement are also explored (Yamada, 9 Apr 2025).
- In GroupCDL, circulant-sparse adjacency acts as a nonlocal self-similarity prior, integrated within an unrolled dictionary learning framework for interpretable image restoration (Janjusevic et al., 2024).
| Method | Attention Matrix | Complexity | Parameter Regime |
|---|---|---|---|
| Vanilla SA | Dense $N \times N$ | $O(N^2)$ per head | Three projections per head |
| Circulant/BCCB | Circulant/BCCB-approx. | $O(N \log N)$ | Two (qv) or three (qkv) projections |
| Sparse CircAtt | BCCB with local window | $O(NW)$ | Modest, window-determined |
4. Complexity, Memory, and Empirical Properties
Circulant attention provides dramatic reductions in computational and memory requirements compared to quadratic dense attention.
- ViT with BCCB attention: $O(N \log N)$ compute, does not materialize $N \times N$ matrices, stores $O(N)$ activations plus their FFT spectra (Han et al., 25 Dec 2025).
- CAT: Never constructs the $N \times N$ softmax matrix; all steps cost at most $O(N \log N)$, with comparable or fewer parameters (Yamada, 9 Apr 2025).
- CircAtt (GroupCDL): Locally windowed, BCCB sparse matrix with only $O(NW)$ nonzero entries, for $O(NWC)$ complexity in $C$ channels (Janjusevic et al., 2024).
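A back-of-envelope comparison makes the scaling concrete. The numbers below are illustrative choices, not figures from the papers:

```python
import math

# Per-head operation counts (order-of-magnitude only):
# N tokens, d = 64 channels, local window W = 32.
N, d, W = 4096, 64, 32
dense  = N * N * d             # vanilla: apply a dense N x N attention map
fft    = N * math.log2(N) * d  # circulant/BCCB attention via FFT
sparse = N * W * d             # windowed circulant-sparse application

ratio_fft = dense / fft        # roughly N / log2(N)
ratio_sparse = dense / sparse  # exactly N / W
```

At $N = 4096$ the FFT route is hundreds of times cheaper than the dense map, and the windowed route is $N/W = 128\times$ cheaper.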
Empirically, circulant attention has been shown to:
- Increase ImageNet-1k top-1 accuracy of DeiT-T from 72.2% to 75.0% (+2.8 pts), and show similar gains for PVT and Swin-T (Han et al., 25 Dec 2025).
- Yield 7× faster runtime than vanilla attention at high input resolution (CA-DeiT-T) (Han et al., 25 Dec 2025).
- Achieve consistent 10% speedups in naive PyTorch CAT implementations for moderate sequence lengths $N$, with greater advantage for larger $N$ and true FFT kernels (Yamada, 9 Apr 2025).
- Match or outperform black-box baselines in denoising and compressed sensing MRI with 10–40× speedups and up to 8× fewer parameters (GroupCDL) (Janjusevic et al., 2024).
5. Limitations, Assumptions, and Model Adaptations
One key assumption is that attention matrices in vision transformers already closely approximate BCCB structure in practice (Han et al., 25 Dec 2025). However, BCCB imposes a doubly stochastic-like normalization after softmax, preventing the attention from highlighting certain keys across all queries. To address this, lightweight post-attention token reweighting modules are introduced to restore flexibility, with post-reweighting yielding the best results (Han et al., 25 Dec 2025).
For data with non-stationary dependencies (e.g., edges), mixing circulant global and local or sparse patterns may be advantageous. Adaptive frequency filtering (e.g., AFNO/AFF) could further enrich the BCCB prior. The same principles extend to 1D and non-uniform grids (Han et al., 25 Dec 2025).
CAT is most effective when token mixing is globally uniform (average pooling) or under masked language modeling objectives. Partial replacement ("CAT-Alter") can offer robust performance across pooling or masking conditions (Yamada, 9 Apr 2025).
Circulant-sparse attention (CircAtt) delivers full shift-invariance and eliminates the need for overlapping patches, at a small windowed computational price, but inherently restricts interactions to a fixed radius unless further adaptations are used (Janjusevic et al., 2024).
6. Applications and Empirical Outcomes
- Image Classification: Circulant attention in ViTs raises top-1 ImageNet-1k accuracy by +0.5 to +3.0 points over baselines for multiple architectures (DeiT, PVT, Swin), often with equal or lower compute cost (Han et al., 25 Dec 2025).
- Object Detection/Segmentation: CA-PVT and CA-Swin improve COCO box AP by +0.8–3.8 and ADE20K mIoU by +0.7–3.7, with no increase in model size (Han et al., 25 Dec 2025).
- Language Modeling: CAT achieves perplexity reductions on masked LM (WikiText-103: 13.94 → 10.28) and is competitive in causal settings, especially when alternated with standard attention (Yamada, 9 Apr 2025).
- Image Restoration and Scientific Imaging: GroupCDL with CircAtt yields competitive or superior denoising performance (e.g., +0.4 dB on Set12) and state-of-the-art compressed sensing MRI reconstruction (0.35–0.56 dB gain) at a small parameter and computational footprint (Janjusevic et al., 2024).
- Generalization: Models with noise-adaptive thresholds (GroupCDL) maintain near-optimal denoising across a wide noise range, whereas “blind” methods are sensitive to mismatches (Janjusevic et al., 2024).
7. Implementation Considerations and Extensions
FFT-based circulant attention is highly amenable to GPU and hardware acceleration, particularly for large $N$. Batched real FFT implementations reduce compute/memory further, and zero-padding to powers of two maximizes FFT kernel throughput (Yamada, 9 Apr 2025). For windowed CircAtt, a single CUDA kernel computes all sparse-similarity neighborhoods in parallel.
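The real-FFT saving is easy to demonstrate: for real-valued inputs, `rfft` stores only $N/2 + 1$ frequency bins, roughly halving compute and memory while producing an identical circular convolution.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 512                        # power-of-two length: fastest FFT code paths
c = rng.standard_normal(N)
x = rng.standard_normal(N)

# Complex FFT path: full N-bin spectra.
y_cplx = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

# Real FFT path: only N//2 + 1 bins per spectrum, same result.
y_real = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x), n=N)
assert np.allclose(y_cplx, y_real)
```

Note that zero-padding changes the periodicity of a *circular* convolution, so padding to a power of two applies when sequence lengths can be chosen freely (or when linear-convolution semantics are acceptable).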
Extending circulant attention to hybrid forms, adaptive filtering, and irregular domains remains an active research direction. The combinatorial flexibility of circulant priors points to broad applicability—from image and sequence models to interpretable scientific workflows—while preserving mathematical transparency and computational tractability.
References:
- "Vision Transformers are Circulant Attention Learners" (Han et al., 25 Dec 2025)
- "CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers" (Yamada, 9 Apr 2025)
- "GroupCDL: Interpretable Denoising and Compressed Sensing MRI via Learned Group-Sparsity and Circulant Attention" (Janjusevic et al., 2024)