
Symmetric Dot-Product Attention in DCT-Formers

Updated 9 February 2026
  • DCT-Former compresses self-attention computation via the Discrete Cosine Transform (DCT), significantly reducing time complexity and memory usage.
  • The approach leverages low-frequency DCT projections and fast FFT-based transforms to achieve quasi-linear scaling while approximating full attention.
  • Empirical evaluations reveal a trade-off between minor accuracy drops and substantial gains in efficiency, making it ideal for low-resource and edge deployments.

The term "DCT-Former" encompasses a class of architectures that integrate the Discrete Cosine Transform (DCT) into deep neural network transformers for efficiency or for leveraging frequency-domain information. DCT-Former architectures span efficient attention approximations for language and vision, frequency-aware transformers for enhancement tasks, and lightweight ViT alternatives for medical imaging. Their common motivation is to address the memory and computational bottlenecks of self-attention or to exploit the complementary strengths of spatial and frequency representations.

1. DCT-Former: DCT-Based Self-Attention Approximation

DCT-Former (Scribano et al., 2022) is an attention-efficient transformer architecture that approximates the self-attention mechanism by compressed computation in the DCT domain. The core problem addressed is the quadratic complexity of standard self-attention, where for input sequence length n, the attention operation requires O(n²) time and memory.

Algorithmic Principle

The DCT-Former leverages lossy compression, specifically type-II DCT, to reduce sequence length:

  • The input matrix X ∈ ℝ^(n×d) is projected using the first n̄ low-frequency rows of the DCT basis D̄, yielding X̄ = D̄X with n̄ ≪ n.
  • Compressed queries, keys, and values are computed as Q̄ = X̄W_Q, K̄ = X̄W_K, and V̄ = X̄W_V.
  • Attention is performed in this low-dimensional subspace: Ā = softmax(Q̄K̄^T / √d).
  • The result Ȳ = ĀV̄ is upsampled back with the inverse DCT: Y = D̄^T Ȳ.

There is no need to materialize the full n × n attention matrix; the cost is dominated by DCT compression and decompression, which fast FFT-based DCT variants perform in O(n log n).

Complexity

| Method             | Time Complexity       | Memory Complexity |
|--------------------|-----------------------|-------------------|
| Standard Attention | O(n²d)                | O(n²)             |
| DCT-Former         | O(n log n · d + n̄²d) | O(n̄²)            |

With n̄ = O(n^α) for α < 1, DCT-Former achieves quasi-linear scaling.
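A quick back-of-the-envelope check makes the scaling concrete (the choice α = 0.5 is illustrative, not prescribed by the paper):

```python
# With n̄ = n**alpha (alpha < 1), the compressed attention matrix shrinks
# from n*n to n̄*n̄ entries; alpha = 0.5 is an illustrative choice.
n, alpha = 4096, 0.5
n_bar = int(n ** alpha)                  # 64
reduction = (n * n) / (n_bar * n_bar)
print(n_bar, reduction)                  # 64 4096.0
```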

Pseudocode

barX = barD @ X                                  # compress: (n̄×n)(n×d) → n̄×d
barQ = barX @ W_Q; barK = barX @ W_K; barV = barX @ W_V
S = (barQ @ barK.T) / sqrt(d)                    # n̄×n̄ attention scores
barA = softmax(S)                                # row-wise softmax
barY = barA @ barV                               # n̄×d
Y = barD.T @ barY                                # decompress back to n×d
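The pipeline above can be written as a runnable NumPy/SciPy sketch. The shapes, random weights, and function name are illustrative, not from the paper; `scipy.fft.dct` applied along the sequence axis plays the role of D̄:

```python
import numpy as np
from scipy.fft import dct, idct

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dct_former_attention(X, W_Q, W_K, W_V, n_bar):
    """Approximate self-attention in an n_bar-dimensional DCT subspace (sketch)."""
    n, d = X.shape
    # Compress: orthonormal type-II DCT along the sequence axis, keeping the
    # n_bar low-frequency coefficients (equivalent to barX = barD @ X).
    X_bar = dct(X, type=2, norm='ortho', axis=0)[:n_bar]          # (n_bar, d)
    Q, K, V = X_bar @ W_Q, X_bar @ W_K, X_bar @ W_V
    A = softmax(Q @ K.T / np.sqrt(d))                             # (n_bar, n_bar)
    Y_bar = A @ V                                                 # (n_bar, d)
    # Decompress: zero-pad the coefficients and invert the DCT (barD.T @ barY).
    Y_pad = np.zeros((n, d))
    Y_pad[:n_bar] = Y_bar
    return idct(Y_pad, type=2, norm='ortho', axis=0)              # (n, d)

rng = np.random.default_rng(0)
n, d, n_bar = 128, 16, 16
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
Y = dct_former_attention(X, W_Q, W_K, W_V, n_bar)
print(Y.shape)  # (128, 16)
```

Note that the n × n attention matrix is never formed; the largest intermediate is the n_bar × n_bar score matrix.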

2. Empirical Evaluation and Trade-offs

Experiments on language modeling and transfer learning demonstrate that DCT-Former:

  • Reduces peak memory by up to 80% and inference latency by ~65% at sequence length 4096 (baseline: 1,250 MB/45.6 ms; DCT-Former: 326 MB/15.8 ms).
  • Shows a small accuracy drop (e.g., F1 0.90 → 0.87 on IMDB sentiment), but strictly outperforms Linformer and matches Nyströmformer at comparable n̄.
  • Maintains competitive pretraining MLM accuracy (vanilla: 59.7%; DCT-Former: 54.7% at n̄ = 32), with a normalized speed–accuracy improvement of ~15%.

This positions DCT-Former as an effective foundation for low-resource, real-time, or edge deployment where linear or subquadratic scaling dominates practical feasibility (Scribano et al., 2022).

Limitations

  • DCT-Former applies softmax in compressed space, not the full domain: Ā = softmax(DCT_2D(QK^T)/√d), which introduces a relaxation error since DCT and softmax are non-commutative.
  • The choice of n̄ presents a direct accuracy–efficiency trade-off; adaptive or learned frequency selection is proposed as future work.
  • DCT-Former is readily extensible to 2D/3D DCT for vision or video, and could be fused with local or sparse attention for extreme-length sequences.

3. DCT-Former Variants in CNNs: Dynamic Clone Transformer

A distinct DCT-Former appears in convolutional architectures as the "Dynamic Clone Transformer" (DCT) module (Ye, 2021), forming the backbone of the "DyClotNet" network. Here, DCT does not denote the cosine transform, but a dual-branch channel expansion unit inspired by multi-path fully connected (MPFC) theories:

  • Clone generator branch ("replicator"): Channel-wise replication, D(X′, p) = [X′, ..., X′], a cost-free expansion from C′ to pC′ channels.
  • Difference vector branch ("recalibrator"): Squeeze-and-excitation style context vector s modulating each replicate, computed by global average pooling and two lightweight FC layers (W_1, W_2).

The DCT-Former block enables sublinear FLOP scaling versus standard pointwise convolution expansion, reducing expansion cost to ≤ 5/16 of the original for p ≥ 2 with negligible accuracy loss. While not DCT in the frequency sense, this variant is referenced as DCT-Former in DyClotNet (Ye, 2021).
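A minimal NumPy sketch of the dual-branch expansion described above; the activation choices (ReLU, sigmoid), channels-first layout, and hidden width are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def dynamic_clone_expand(X, W1, W2, p):
    """Expand C' channels to p*C' via cloning plus learned recalibration (sketch).
    X: (C', H, W); W1: (hidden, C'); W2: (p*C', hidden)."""
    C, H, W = X.shape
    # Replicator branch: cost-free channel-wise cloning, C' -> p*C'.
    clones = np.concatenate([X] * p, axis=0)               # (p*C', H, W)
    # Recalibrator branch: global average pooling + two small FC layers.
    s = X.mean(axis=(1, 2))                                # (C',)
    s = np.maximum(W1 @ s, 0.0)                            # ReLU (assumed)
    s = 1.0 / (1.0 + np.exp(-(W2 @ s)))                    # sigmoid gate (assumed)
    # Modulate each cloned channel by its difference-vector entry.
    return clones * s[:, None, None]

rng = np.random.default_rng(1)
Cp, H, W, p, hidden = 8, 4, 4, 2, 4
X = rng.standard_normal((Cp, H, W))
out = dynamic_clone_expand(X, rng.standard_normal((hidden, Cp)),
                           rng.standard_normal((p * Cp, hidden)), p)
print(out.shape)  # (16, 4, 4)
```

The expansion itself costs no multiplications beyond the two tiny FC layers, which is what drives the FLOP reduction relative to a pointwise convolution.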

4. DCT-Former in Frequency-Aware Vision Transformers

Frequency-aware DCT-Former architectures incorporate DCT or frequency-domain processing for vision or enhancement applications:

4.1 DEFormer for Low-Light Image Enhancement

DEFormer (Yin et al., 2023) employs a dual-branch design:

  • Learnable Frequency Branch (LFB): YCbCr conversion, 8x8 patchwise 2D DCT, grouping coefficients into 192 explicit frequency channels; enhanced with a curvature-based module (CFE) that splits and processes channels by frequency energy.
  • Shallow RGB Branch: Standard convolutional and transformer layers in the RGB space.
  • Cross-Domain Fusion (CDF): Channel-wise gating, cross-fusion, and spatial attention combine frequency and RGB features.

Integration of DCT features leads to increased PSNR/SSIM (+2.61 dB PSNR over a pure-RGB baseline) and boosts dark-object detection mAP by 2.1%. Each addition (LFB, CFE, CDF) yields quantifiable improvements, showing the utility of explicit frequency channels for texture recovery in challenging low-light scenarios.
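The LFB's frequency-channel layout can be sketched as follows (3 color channels × 64 coefficients per 8×8 patch = 192 channels); the YCbCr conversion is omitted and the helper name is illustrative:

```python
import numpy as np
from scipy.fft import dctn

def patchwise_dct_channels(img):
    """Map an (H, W, 3) image to an (H/8, W/8, 192) grid of DCT-coefficient
    channels, one 64-coefficient group per color channel per 8x8 patch."""
    H, W, C = img.shape
    assert H % 8 == 0 and W % 8 == 0
    out = np.empty((H // 8, W // 8, C * 64))
    for i in range(H // 8):
        for j in range(W // 8):
            for c in range(C):
                block = img[8 * i:8 * i + 8, 8 * j:8 * j + 8, c]
                # 2D type-II DCT of one 8x8 patch -> 64 frequency coefficients.
                out[i, j, c * 64:(c + 1) * 64] = dctn(block, norm='ortho').ravel()
    return out

img = np.random.default_rng(2).random((16, 16, 3))
feats = patchwise_dct_channels(img)
print(feats.shape)  # (2, 2, 192)
```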

4.2 DCT-HistoTransformer for Histopathological Classification

DCT-HistoTransformer (also called DCT-Former) (Ranjbar et al., 2024) targets resource-constrained histopathological image analysis:

  • DCT-Attention Branch: 2D-DCT per channel, low-pass filtering (retaining only the (M/r) × (N/r) lowest frequencies), then group self-attention (cascaded multihead) on the resulting tokens, inverse DCT, and upsampling/padding.
  • MobileConv Branch: Lightweight spatial processing via a bottlenecked convolutional path.
  • Fusion: Both outputs are element-wise summed and passed onward; the architecture stacks several such DCT-Conv blocks.

Key effects:

  • Attention cost scales as O((L²/r⁴)d) after DCT low-pass filtering: the cost remains quadratic in token count but is reduced by a constant factor of r⁴ (a 1/16 reduction for the typical r = 2).
  • SOTA-comparable accuracy (96% binary, 88% multiclass on BreaKHis), outperforming vanilla ViT and Swin in all magnification regimes with lower computational burden.
  • DCT energy compaction and filtering eliminate high-frequency noise and reduce overfitting—critical for small datasets and edge deployment.
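The low-pass step behind the r⁴ factor can be sketched directly: keeping an (M/r) × (N/r) corner of the coefficient grid cuts the token count by r², hence the pairwise attention cost by r⁴. The function name and test sizes are illustrative:

```python
import numpy as np
from scipy.fft import dctn

def dct_lowpass_tokens(x, r=2):
    """Keep only the (M/r) x (N/r) lowest-frequency 2D-DCT coefficients,
    yielding the compressed token grid on which attention runs (sketch)."""
    M, N = x.shape
    return dctn(x, norm='ortho')[:M // r, :N // r]       # (M/r, N/r)

x = np.random.default_rng(3).random((32, 32))
tokens = dct_lowpass_tokens(x, r=2)
print(tokens.shape, tokens.size / x.size)  # (16, 16) 0.25
```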

5. Architectural Themes and Commonalities

A unifying term for these models is the frequency-structured transformer: architectures that embed the DCT or other spectral transforms in their processing pipeline. They share several traits:

  • They compress or filter high-frequency (noise) components, focusing attention on global or smooth patterns.
  • Computational bottlenecks of self-attention are addressed either by operating in a compressed domain (DCT-Former (Scribano et al., 2022), DCT-HistoTransformer (Ranjbar et al., 2024)) or by manipulating channel expansions efficiently (Dynamic Clone Transformer (Ye, 2021)).
  • For enhancement and analysis tasks (DEFormer, DCT-HistoTransformer), frequency-domain representations aid both interpretability and performance under data or resource constraints.

6. Empirical Performance Summary

| Architecture | Task/Domain | Key Benefit | Accuracy/Score | Efficiency Gain |
|---|---|---|---|---|
| DCT-Former (Scribano et al., 2022) | Language (MLM/IMDB) | O(n log n), param-free | F1 0.87 vs 0.90 | −70–80% mem/time |
| DCT-Former (Ye, 2021) | Efficient CNN backbone | >3× expansion speedup | — | ≤ 5/16 FLOPs |
| DEFormer (Yin et al., 2023) | Low-light image enhancement | Texture/PSNR restoration | PSNR 23.73 / SSIM 0.821 | +2.61 dB / +2.1% mAP |
| DCT-HistoTransformer (Ranjbar et al., 2024) | Histopathology (BreaKHis) | Lightweight, robust | 96% binary acc. | SOTA by cost/acc. |

All performance claims are extracted directly from the referenced works.

7. Research Directions and Limitations

  • Relaxation error from softmax/DCT non-commutativity may be minimized using adaptive, learned, or hybrid compression strategies.
  • Application beyond 1D sequences (to 2D/3D) is straightforward for DCT-Former and DCT-HistoTransformer, suggesting potential in video, high-resolution image, and spatiotemporal domains.
  • Modular frequency branches are synergistic with "Green AI" goals (low-power, real-time, edge inference).
  • Further study is needed on ablation between spatial and frequency branches, adaptive frequency selection, and integration with sparse or local attention.

In summary, DCT-Former architectures demonstrate that DCT-based compression is a principled, efficient, and empirically validated strategy for attention reduction, frequency-enhanced representation, and lightweight transformer design for both NLP and computer vision domains (Scribano et al., 2022, Ye, 2021, Yin et al., 2023, Ranjbar et al., 2024).
