
Double-Condensing Attention Condenser

Updated 7 February 2026
  • Double-Condensing Attention Condenser is an attention mechanism that uses dual parallel branches to condense features and enable fine-grained self-attention with low computational cost.
  • Its architecture employs multi-stage feature condensation, bottleneck embedding, and parallel attention, significantly reducing memory and FLOPs compared to traditional methods.
  • DC-AC modules achieve competitive accuracy in tasks like medical image analysis while meeting TinyML constraints, delivering fast inference and energy efficiency.

A Double-Condensing Attention Condenser (DC-AC) is an efficient attention mechanism designed to enable fine-grained, highly selective self-attention in deep neural networks while maintaining an extremely low computational and memory footprint. Originating in the context of resource-constrained machine learning (TinyML), DC-AC modules have been adopted for both edge-device inference and large-scale classification tasks due to their condensed representations and selective focus mechanisms. The architecture is characterized by multi-stage feature condensation and parallel attention computation, achieving strong accuracy-to-efficiency trade-offs in tasks such as medical image analysis, particularly skin lesion classification (Tai et al., 2023; Wong et al., 2022).

1. Fundamental Model Architecture

A DC-AC block receives an activation tensor $X \in \mathbb{R}^{H \times W \times C}$, with $N = H \cdot W$ tokens. The module contains two parallel "attention condenser" branches, each consisting of:

  • Condensation: Spatial and channel reduction using a $1 \times 1$ convolution, a depthwise convolution, and another $1 \times 1$ convolution.
  • Bottleneck Embedding: Transformation to a low-dimensional subspace, yielding $E_i \in \mathbb{R}^{N_i \times d}$, where $N_i \ll N$ and $d \ll C$.
  • Expansion: Projection of this embedding back to the value space $V_i \in \mathbb{R}^{N_i \times C}$.

In parallel, the module computes a global query projection $Q \in \mathbb{R}^{N \times d}$ from $X$. Each branch produces keys $K_i \in \mathbb{R}^{N_i \times d}$. For each branch $i$, attention weights are computed as $A_i = \mathrm{softmax}(Q K_i^\top / \sqrt{d})$, with outputs $O_i = A_i V_i$. The two outputs are fused (summed, or concatenated followed by a pointwise convolution) and undergo a residual addition with the input $X$, optionally followed by normalization and activation. The output retains the input shape, $Y \in \mathbb{R}^{H \times W \times C}$ (Tai et al., 2023).

Typical dimensioning (reference implementation) uses $d = 16$, $N_1 = 14^2$, $N_2 = 7^2$, and $C = 64$ on $56 \times 56$ inputs, with all projections structured for minimal parameter count.
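With this reference dimensioning, the savings over conventional full self-attention follow directly from the size of the attention score matrices. A quick arithmetic check (counting score-matrix entries only, not the full MAC count):

```python
# Attention cost under the reference dimensioning: N = 56*56 query tokens,
# branches condensed to N1 = 14^2 and N2 = 7^2 key/value positions.
N = 56 * 56
N1, N2 = 14 * 14, 7 * 7

full_attention = N * N              # [N, N] score matrix of full self-attention
dcac_attention = N * N1 + N * N2    # one [N, N_i] score matrix per branch

print(full_attention)                             # 9834496
print(dcac_attention)                             # 768320
print(round(full_attention / dcac_attention, 1))  # 12.8
```

At these settings the two condensed branches together need roughly 13× fewer attention scores than a single full-attention map over the same tokens.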

2. Mathematical Formulation

For input $X \in \mathbb{R}^{N \times C}$ and each branch $i = 1, 2$:

  1. Condensation: $C_i = f_{\mathrm{condense},i}(X) \in \mathbb{R}^{N_i \times C_i}$
  2. Embedding: $E_i = f_{\mathrm{embed},i}(C_i) = C_i W_{E,i} \in \mathbb{R}^{N_i \times d}$
  3. Expansion: $V_i = f_{\mathrm{expand},i}(E_i) = E_i W_{V,i} \in \mathbb{R}^{N_i \times C}$

Queries: $Q = X W_Q \in \mathbb{R}^{N \times d}$

Keys: $K_i = E_i W_{K,i} \in \mathbb{R}^{N_i \times d}$

Attention and output per branch:

$A_i = \mathrm{softmax}\left(\frac{Q K_i^\top}{\sqrt{d}}\right) \in \mathbb{R}^{N \times N_i}$

$O_i = A_i V_i \in \mathbb{R}^{N \times C}$

Branch fusion and output:

$O = O_1 + O_2$

$Y = \mathrm{LayerNorm}(X + O)$

(Tai et al., 2023)
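The formulation above can be sketched as a minimal NumPy forward pass. This is an illustrative stand-in, not the reference implementation: the convolutional condensation is approximated by token average pooling, and all weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def dcac_forward(X, params, branch_sizes):
    """One DC-AC block on a token matrix X: [N, C]."""
    N, C = X.shape
    Q = X @ params["W_Q"]                                # global queries, [N, d]
    O = np.zeros_like(X)
    for i, N_i in enumerate(branch_sizes):
        # Condensation: average-pool the N tokens down to N_i (stands in for the convs).
        groups = np.array_split(np.arange(N), N_i)
        C_i = np.stack([X[g].mean(axis=0) for g in groups])   # [N_i, C]
        E_i = C_i @ params[f"W_E{i}"]                    # bottleneck embedding, [N_i, d]
        K_i = E_i @ params[f"W_K{i}"]                    # keys, [N_i, d]
        V_i = E_i @ params[f"W_V{i}"]                    # expansion to value space, [N_i, C]
        A_i = softmax(Q @ K_i.T / np.sqrt(Q.shape[1]))   # attention weights, [N, N_i]
        O += A_i @ V_i                                   # branch output, summed fusion
    return layer_norm(X + O)                             # residual + normalization

# Reference dimensioning: N = 56*56 tokens, C = 64, d = 16, N_i = 14^2 and 7^2.
N, C, d = 56 * 56, 64, 16
params = {"W_Q": rng.normal(0, 0.02, (C, d))}
for i in range(2):
    params[f"W_E{i}"] = rng.normal(0, 0.02, (C, d))
    params[f"W_K{i}"] = rng.normal(0, 0.02, (d, d))
    params[f"W_V{i}"] = rng.normal(0, 0.02, (d, C))

X = rng.normal(size=(N, C))
Y = dcac_forward(X, params, branch_sizes=(14 * 14, 7 * 7))
print(Y.shape)  # (3136, 64)
```

Note that the output shape matches the input, as required for the residual connection and for stacking DC-AC blocks inside a larger topology.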

3. Implementation Workflow

The following pseudocode summarizes the DC-AC block computation (Tai et al., 2023):

def DCACBlock(X):
    # X: [H, W, C] activation tensor
    H, W, C = shape(X)
    N = H * W
    Q = reshape(X, [N, C]) @ W_Q             # global queries, [N, d]
    branch_sum = zeros([N, C])
    for i in (1, 2):                         # two parallel condenser branches
        C_i = Conv1x1_reduce_i(X)            # channel reduction
        C_i = DepthwiseConv(C_i, stride=s_i) # spatial condensation to N_i positions
        C_i = Conv1x1_embed_i(C_i)           # project to d channels
        E_i = reshape(C_i, [N_i, d])         # bottleneck embedding
        K_i = E_i @ W_K_i                    # keys, [N_i, d]
        V_i = E_i @ W_V_i                    # values, [N_i, C]
        A_i = softmax((Q @ K_i.T) / sqrt(d)) # attention weights, [N, N_i]
        branch_sum += A_i @ V_i              # branch output, summed fusion
    O = reshape(branch_sum, [H, W, C])
    Y = Activation(LayerNorm(X + O))         # residual, normalization, activation
    return Y

This dual-branch strategy enables a high compression ratio for attention computation, at $O(N \cdot N_i)$ cost per branch, in contrast to the $O(N^2)$ cost of conventional full-rank attention.

4. Integration in Network Topologies and Constraints

DC-AC modules are embedded in deep network topologies, including multi-column architectures where each column uses DC-AC blocks at multiple feature resolutions. For instance, in skin lesion classification:

  • Input: $224 \times 224 \times 3$
  • Stem: $3 \times 3$ convolution, stride 2
  • Four parallel columns with interleaved convolution and DC-AC blocks
  • Unified feature map via concatenation and $1 \times 1$ convolution
  • Global average pooling → dense → sigmoid head

This configuration achieves approximately 1.6M parameters and 0.325G FLOPs per $224 \times 224$ input (Tai et al., 2023).

AttendNeXt (Wong et al., 2022) employs a related DC-AC block with two stages of channel-wise condensation ($C \rightarrow d_1 \rightarrow d_2$), projecting the final condensed embedding to query, key, and value via linear subspaces and restoring the representation via expansion, with batch normalization and ReLU at each step. Machine-driven generative synthesis is used to fix the degrees of condensation (e.g., $d_1 \approx C/4$, $d_2 \approx C/16$ to $C/8$), anti-aliased downsampling is imposed, and the design is constrained to avoid strided pointwise convolutions.
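The reported condensation degrees can be made concrete with a small worked example (the channel count $C = 64$ here is a hypothetical choice, not a value from the papers):

```python
# Two-stage channel-wise condensation C -> d1 -> d2, using the approximate
# degrees reported for AttendNeXt's machine-driven synthesis.
C = 64                        # hypothetical input channel count
d1 = C // 4                   # first-stage condensed width, ~C/4
d2_range = (C // 16, C // 8)  # second stage falls between C/16 and C/8

print(d1)        # 16
print(d2_range)  # (4, 8)
```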

5. Quantitative Evaluation and Efficiency

When applied to the SIIM-ISIC skin cancer dataset, a DC-AC-powered network reached an AUROC of 0.8865 (private test set) with a model size under 7MB (FP32) or under 2MB (8-bit quantized). Single-image inference requires only 325M FLOPs, translating to 10ms latency on an ARM Cortex-M55 or 3ms on a Raspberry Pi 4 (Tai et al., 2023). Comparison models such as MobileViT-S (5.6M parameters, 2.03G FLOPs) achieved only 0.8566 AUROC, while Cancer-Net SCa variants reached up to 0.7430 (Tai et al., 2023).

AttendNeXt with DC-AC blocks achieved 75.8% ImageNet top-1 at 3.6MB, with roughly $10.5\times$ the throughput of FB-Net C (ARM Cortex-A72 benchmark). DC-AC blocks are routinely $1.2$–$1.5\times$ smaller and $6$–$10\times$ faster than high-accuracy MobileNet and MobileViT baselines (Wong et al., 2022).

Error tolerance to reduced precision and pruning is demonstrated: 8-bit post-training quantization results in $<1\%$ AUROC loss, and 20% structured channel pruning yields a 25% size reduction with $<0.5\%$ AUROC loss (Tai et al., 2023).
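The mechanics of 8-bit post-training quantization can be illustrated with a minimal symmetric per-tensor scheme. This is a generic sketch, not the specific quantizer used in the cited work, and the weight tensor is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, size=(64, 64)).astype(np.float32)  # placeholder FP32 weights

# Symmetric per-tensor INT8 post-training quantization (illustrative).
scale = np.abs(W).max() / 127.0                    # one scale for the whole tensor
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dq = W_q.astype(np.float32) * scale              # dequantized reconstruction

# Round-to-nearest bounds the reconstruction error by half a quantization step,
# which is why accuracy typically degrades only slightly at 8 bits.
max_err = np.abs(W - W_dq).max()
print(max_err <= scale / 2 + 1e-6)  # True
```

Storing `W_q` plus a single scalar `scale` cuts weight memory by roughly 4× relative to FP32, consistent with the <2MB quantized model size quoted above.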

6. TinyML Considerations

DC-AC is explicitly engineered for TinyML constraints:

  • Memory: Weights occupy 1.6M $\times$ 4B = 6.4MB (FP32) or 1.6MB (INT8), and peak activation memory is $\leq$ 3MB, fitting well within a 16MB SRAM budget.
  • Compute: Depthwise-separable and condenser operations minimize multiply-accumulate counts. The double-condensing structure restricts self-attention to reduced token sets ($O(N \cdot N_i)$), avoiding prohibitive scaling.
  • Compression: Quantization and structured pruning can be performed aggressively without substantial loss in accuracy.

These properties facilitate deployment on mobile and embedded systems for applications such as on-device dermoscopy and tele-dermatology (Tai et al., 2023).
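The memory figures above follow from simple arithmetic on the parameter count, which can serve as a quick feasibility check when sizing a model for an SRAM budget:

```python
# Weight memory for the 1.6M-parameter skin-lesion network (Tai et al., 2023):
# 4 bytes per parameter at FP32, 1 byte at INT8.
params = 1.6e6
fp32_mb = params * 4 / 1e6   # FP32 weight footprint in MB
int8_mb = params * 1 / 1e6   # INT8 weight footprint in MB

print(fp32_mb, int8_mb)  # 6.4 1.6

# Both variants, plus the <=3MB peak activations, fit a 16MB SRAM budget.
assert fp32_mb + 3 <= 16 and int8_mb + 3 <= 16
```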

7. Relation to Prior and Parallel Work

DC-AC is conceptually and architecturally distinct from the double-attention modules in $A^2$-Nets (Chen et al., 2018). The latter introduce "gather and distribute" second-order attention pooling, wherein features are aggregated globally and redistributed via a two-stage attention mechanism, achieving efficiency via factorization to $O(d^2 N)$ cost per block. DC-AC, in contrast, relies on multi-stage condensation and lightweight self-attention applied to compressed representations within each branch, optimizing for the computational bottlenecks typical of embedded hardware.

The impact of DC-AC is most pronounced in use-cases where both fine-grained attention and tight memory or latency budgets are required. The approach exemplifies a transition from monolithic attention computation to modular, highly controlled, hardware-aware neural network design paradigms.

Table: Key Quantitative Comparisons

Model               | Size        | Throughput (A72, rel.) | ImageNet Top-1 | AUROC (SIIM-ISIC)
--------------------|-------------|------------------------|----------------|------------------
AttendNeXt (DC-AC)  | 3.6 MB      | 10.5×                  | 75.8%          | not reported
MobileViT-XS        | 2.8 MB      | 4.0×                   | 74.7%          | 0.8566
MobileNetV3-L       | 4.1 MB      | 1.7×                   | 75.6%          | –
FB-Net C            | 4.5 MB      | 1.0×                   | 74.7%          | –
DC-AC (skin cancer) | 1.6M params | –                      | –              | 0.8865

All results from (Tai et al., 2023, Wong et al., 2022). These results affirm the heightened throughput and competitive performance of DC-AC-equipped architectures in both generic and skin image-specific benchmarks.


DC-AC and its variants establish a paradigm for selective attention that is tractable for deployment under stringent TinyML requirements. By fusing efficient condensation, low-rank attention, parallelism, and hardware-aware design, DC-AC enables advanced visual recognition at dynamic, embedded, and clinical endpoints with minimal losses in discriminative power (Tai et al., 2023, Wong et al., 2022).
