
Double-Condensing Attention Condenser

Updated 7 February 2026
  • Double-Condensing Attention Condenser is an attention mechanism that uses dual parallel branches to condense features and enable fine-grained self-attention with low computational cost.
  • Its architecture employs multi-stage feature condensation, bottleneck embedding, and parallel attention, significantly reducing memory and FLOPs compared to traditional methods.
  • DC-AC modules achieve competitive accuracy in tasks like medical image analysis while meeting TinyML constraints, delivering fast inference and energy efficiency.

A Double-Condensing Attention Condenser (DC-AC) is an efficient attention mechanism designed to enable fine-grained, highly selective self-attention in deep neural networks while maintaining an extremely low computational and memory footprint. Originating in the context of resource-constrained machine learning (TinyML), DC-AC modules have been adopted for both edge-device inference and large-scale classification tasks due to their condensed representations and selective focus mechanisms. The architecture is characterized by multi-stage feature condensation and parallel attention computation, achieving strong accuracy-to-efficiency trade-offs in tasks such as medical image analysis, particularly skin lesion classification (Tai et al., 2023; Wong et al., 2022).

1. Fundamental Model Architecture

A DC-AC block receives an activation tensor $X \in \mathbb{R}^{H \times W \times C}$, with $N = H \cdot W$ tokens. The module contains two parallel "attention condenser" branches, each consisting of:

  • Condensation: Spatial and channel reduction using a $1 \times 1$ convolution, a depthwise convolution, and another $1 \times 1$ convolution.
  • Bottleneck Embedding: Transformation to a low-dimensional subspace, yielding $E_i \in \mathbb{R}^{N_i \times d}$, where $N_i \ll N$ and $d \ll C$.
  • Expansion: Projection of this embedding back to the value space $V_i \in \mathbb{R}^{N_i \times C}$.

In parallel, the module computes a global query projection $Q \in \mathbb{R}^{N \times d}$ from $X$. Each branch produces keys $K_i \in \mathbb{R}^{N_i \times d}$. For each branch $i$, attention weights are computed as $A_i = \mathrm{softmax}(Q K_i^\top / \sqrt{d})$, with outputs $O_i = A_i V_i$. The two outputs are fused (summed, or concatenated followed by a pointwise convolution) and undergo a residual addition with the input $X$, optionally followed by normalization and activation. The output retains the input shape, $Y \in \mathbb{R}^{H \times W \times C}$ (Tai et al., 2023).

Typical dimensioning (reference implementation) uses $d = 16$, $N_1 = 14^2$, $N_2 = 7^2$, and $C = 64$ on $56 \times 56$ inputs, with all projections structured for minimal parameter count.
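With this reference dimensioning, the savings over conventional full self-attention follow directly from the size of the attention score matrices. A quick arithmetic check (counting score-matrix entries only, not the full MAC count):

```python
# Attention cost under the reference dimensioning: N = 56*56 query tokens,
# branches condensed to N1 = 14^2 and N2 = 7^2 key/value positions.
N = 56 * 56
N1, N2 = 14 * 14, 7 * 7

full_attention = N * N              # [N, N] score matrix of full self-attention
dcac_attention = N * N1 + N * N2    # one [N, N_i] score matrix per branch

print(full_attention)                             # 9834496
print(dcac_attention)                             # 768320
print(round(full_attention / dcac_attention, 1))  # 12.8
```

At these settings the two condensed branches together need roughly 13× fewer attention scores than a single full-attention map over the same tokens.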

2. Mathematical Formulation

For input $X \in \mathbb{R}^{N \times C}$ and each branch $i = 1, 2$:

  1. Condensation: $C_i = f_{\mathrm{condense},i}(X) \in \mathbb{R}^{N_i \times C_i}$
  2. Embedding: $E_i = f_{\mathrm{embed},i}(C_i) = C_i W_{E,i} \in \mathbb{R}^{N_i \times d}$
  3. Expansion: $V_i = f_{\mathrm{expand},i}(E_i) = E_i W_{V,i} \in \mathbb{R}^{N_i \times C}$

Queries: $Q = X W_Q \in \mathbb{R}^{N \times d}$

Keys: $K_i = E_i W_{K,i} \in \mathbb{R}^{N_i \times d}$

Attention and output per branch:

$A_i = \mathrm{softmax}\left(\frac{Q K_i^\top}{\sqrt{d}}\right) \in \mathbb{R}^{N \times N_i}$

$O_i = A_i V_i \in \mathbb{R}^{N \times C}$

Branch fusion and output:

$O = O_1 + O_2$

$Y = \mathrm{LayerNorm}(X + O)$

(Tai et al., 2023)
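The formulation above can be sketched as a minimal NumPy forward pass. This is an illustrative stand-in, not the reference implementation: the convolutional condensation is approximated by token average pooling, and all weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def dcac_forward(X, params, branch_sizes):
    """One DC-AC block on a token matrix X: [N, C]."""
    N, C = X.shape
    Q = X @ params["W_Q"]                                # global queries, [N, d]
    O = np.zeros_like(X)
    for i, N_i in enumerate(branch_sizes):
        # Condensation: average-pool the N tokens down to N_i (stands in for the convs).
        groups = np.array_split(np.arange(N), N_i)
        C_i = np.stack([X[g].mean(axis=0) for g in groups])   # [N_i, C]
        E_i = C_i @ params[f"W_E{i}"]                    # bottleneck embedding, [N_i, d]
        K_i = E_i @ params[f"W_K{i}"]                    # keys, [N_i, d]
        V_i = E_i @ params[f"W_V{i}"]                    # expansion to value space, [N_i, C]
        A_i = softmax(Q @ K_i.T / np.sqrt(Q.shape[1]))   # attention weights, [N, N_i]
        O += A_i @ V_i                                   # branch output, summed fusion
    return layer_norm(X + O)                             # residual + normalization

# Reference dimensioning: N = 56*56 tokens, C = 64, d = 16, N_i = 14^2 and 7^2.
N, C, d = 56 * 56, 64, 16
params = {"W_Q": rng.normal(0, 0.02, (C, d))}
for i in range(2):
    params[f"W_E{i}"] = rng.normal(0, 0.02, (C, d))
    params[f"W_K{i}"] = rng.normal(0, 0.02, (d, d))
    params[f"W_V{i}"] = rng.normal(0, 0.02, (d, C))

X = rng.normal(size=(N, C))
Y = dcac_forward(X, params, branch_sizes=(14 * 14, 7 * 7))
print(Y.shape)  # (3136, 64)
```

Note that the output shape matches the input, as required for the residual connection and for stacking DC-AC blocks inside a larger topology.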

3. Implementation Workflow

The following pseudocode summarizes the DC-AC block computation (Tai et al., 2023):

def DCACBlock(X):
    # X: [H, W, C] activation tensor
    H, W, C = shape(X)
    N = H * W
    Q = reshape(X, [N, C]) @ W_Q             # global queries, [N, d]
    branch_sum = zeros([N, C])
    for i in (1, 2):                         # two parallel condenser branches
        C_i = Conv1x1_reduce_i(X)            # channel reduction
        C_i = DepthwiseConv(C_i, stride=s_i) # spatial condensation to N_i positions
        C_i = Conv1x1_embed_i(C_i)           # project to d channels
        E_i = reshape(C_i, [N_i, d])         # bottleneck embedding
        K_i = E_i @ W_K_i                    # keys, [N_i, d]
        V_i = E_i @ W_V_i                    # values, [N_i, C]
        A_i = softmax((Q @ K_i.T) / sqrt(d)) # attention weights, [N, N_i]
        branch_sum += A_i @ V_i              # branch output, summed fusion
    O = reshape(branch_sum, [H, W, C])
    Y = Activation(LayerNorm(X + O))         # residual, normalization, activation
    return Y

This dual-branch strategy enables a high compression ratio for attention computation, at $O(N \cdot N_i)$ cost per branch, in contrast to the $O(N^2)$ cost of conventional full-rank attention.

4. Integration in Network Topologies and Constraints

DC-AC modules are embedded in deep network topologies, including multi-column architectures where each column uses DC-AC blocks at multiple feature resolutions. For instance, in skin lesion classification:

  • Input: $224 \times 224 \times 3$
  • Stem: $3 \times 3$ convolution, stride 2
  • Four parallel columns with interleaved convolution and DC-AC blocks
  • Unified feature map via concatenation and $1 \times 1$ convolution
  • Global average pooling → dense → sigmoid head

This configuration achieves approximately 1.6M parameters and 0.325G FLOPs per $224 \times 224$ input (Tai et al., 2023).

AttendNeXt (Wong et al., 2022) employs a related DC-AC block with two stages of channel-wise condensation ($C \rightarrow d_1 \rightarrow d_2$), projecting the final condensed embedding to query, key, and value via linear subspaces and restoring the representation via expansion, with batch normalization and ReLU at each step. Machine-driven generative synthesis is used to fix the degrees of condensation (e.g., $d_1 \approx C/4$, $d_2 \approx C/16$ to $C/8$), anti-aliased downsampling is imposed, and the design is constrained to avoid strided pointwise convolutions.
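The reported condensation degrees can be made concrete with a small worked example (the channel count $C = 64$ here is a hypothetical choice, not a value from the papers):

```python
# Two-stage channel-wise condensation C -> d1 -> d2, using the approximate
# degrees reported for AttendNeXt's machine-driven synthesis.
C = 64                        # hypothetical input channel count
d1 = C // 4                   # first-stage condensed width, ~C/4
d2_range = (C // 16, C // 8)  # second stage falls between C/16 and C/8

print(d1)        # 16
print(d2_range)  # (4, 8)
```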

5. Quantitative Evaluation and Efficiency

When applied to the SIIM-ISIC skin cancer dataset, a DC-AC-powered network reached an AUROC of 0.8865 (private test set) with a model size under 7MB (FP32) or under 2MB (8-bit quantized). Single-image inference requires only 325M FLOPs, translating to 10ms latency on an ARM Cortex-M55 or 3ms on a Raspberry Pi 4 (Tai et al., 2023). Comparison models such as MobileViT-S (5.6M parameters, 2.03G FLOPs) achieved only 0.8566 AUROC, while Cancer-Net SCa variants reached up to 0.7430 (Tai et al., 2023).

AttendNeXt with DC-AC blocks achieved 75.8% ImageNet top-1 at 3.6MB, with roughly $10.5\times$ the throughput of FB-Net C (ARM Cortex-A72 benchmark). DC-AC blocks are routinely $1.2$–$1.5\times$ smaller and $6$–$10\times$ faster than high-accuracy MobileNet and MobileViT baselines (Wong et al., 2022).

Error tolerance to reduced precision and pruning is demonstrated: 8-bit post-training quantization results in $<1\%$ AUROC loss, and 20% structured channel pruning yields a 25% size reduction with $<0.5\%$ AUROC loss (Tai et al., 2023).
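The mechanics of 8-bit post-training quantization can be illustrated with a minimal symmetric per-tensor scheme. This is a generic sketch, not the specific quantizer used in the cited work, and the weight tensor is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, size=(64, 64)).astype(np.float32)  # placeholder FP32 weights

# Symmetric per-tensor INT8 post-training quantization (illustrative).
scale = np.abs(W).max() / 127.0                    # one scale for the whole tensor
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dq = W_q.astype(np.float32) * scale              # dequantized reconstruction

# Round-to-nearest bounds the reconstruction error by half a quantization step,
# which is why accuracy typically degrades only slightly at 8 bits.
max_err = np.abs(W - W_dq).max()
print(max_err <= scale / 2 + 1e-6)  # True
```

Storing `W_q` plus a single scalar `scale` cuts weight memory by roughly 4× relative to FP32, consistent with the <2MB quantized model size quoted above.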

6. TinyML Considerations

DC-AC is explicitly engineered for TinyML constraints:

  • Memory: Weights occupy 1.6M $\times$ 4B = 6.4MB (FP32) or 1.6MB (INT8), and peak activation memory is $\leq$ 3MB, fitting well within a 16MB SRAM budget.
  • Compute: Depthwise-separable and condenser operations minimize multiply-accumulate counts. The double-condensing structure restricts self-attention to reduced token sets ($O(N \cdot N_i)$), avoiding prohibitive scaling.
  • Compression: Quantization and structured pruning can be performed aggressively without substantial loss in accuracy.

These properties facilitate deployment on mobile and embedded systems for applications such as on-device dermoscopy and tele-dermatology (Tai et al., 2023).
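The memory figures above follow from simple arithmetic on the parameter count, which can serve as a quick feasibility check when sizing a model for an SRAM budget:

```python
# Weight memory for the 1.6M-parameter skin-lesion network (Tai et al., 2023):
# 4 bytes per parameter at FP32, 1 byte at INT8.
params = 1.6e6
fp32_mb = params * 4 / 1e6   # FP32 weight footprint in MB
int8_mb = params * 1 / 1e6   # INT8 weight footprint in MB

print(fp32_mb, int8_mb)  # 6.4 1.6

# Both variants, plus the <=3MB peak activations, fit a 16MB SRAM budget.
assert fp32_mb + 3 <= 16 and int8_mb + 3 <= 16
```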

7. Relation to Prior and Parallel Work

DC-AC is conceptually and architecturally distinct from the double-attention modules in $A^2$-Nets (Chen et al., 2018). The latter introduce "gather and distribute" second-order attention pooling, wherein features are aggregated globally and redistributed via a two-stage attention mechanism, achieving efficiency via factorization to $O(d^2 N)$ cost per block. DC-AC, in contrast, relies on multi-stage condensation and lightweight self-attention applied to compressed representations within each branch, optimizing for the computational bottlenecks typical of embedded hardware.

The impact of DC-AC is most pronounced in use-cases where both fine-grained attention and tight memory or latency budgets are required. The approach exemplifies a transition from monolithic attention computation to modular, highly controlled, hardware-aware neural network design paradigms.

Table: Key Quantitative Comparisons

Model               | Size        | Throughput (A72, rel.) | ImageNet Top-1 | AUROC (SIIM-ISIC)
--------------------|-------------|------------------------|----------------|------------------
AttendNeXt (DC-AC)  | 3.6 MB      | 10.5×                  | 75.8%          | not reported
MobileViT-XS        | 2.8 MB      | 4.0×                   | 74.7%          | 0.8566
MobileNetV3-L       | 4.1 MB      | 1.7×                   | 75.6%          | –
FB-Net C            | 4.5 MB      | 1.0×                   | 74.7%          | –
DC-AC (skin cancer) | 1.6M params | –                      | –              | 0.8865

All results from (Tai et al., 2023, Wong et al., 2022). These results affirm the heightened throughput and competitive performance of DC-AC-equipped architectures in both generic and skin image-specific benchmarks.


DC-AC and its variants establish a paradigm for selective attention that is tractable for deployment under stringent TinyML requirements. By fusing efficient condensation, low-rank attention, parallelism, and hardware-aware design, DC-AC enables advanced visual recognition at dynamic, embedded, and clinical endpoints with minimal losses in discriminative power (Tai et al., 2023, Wong et al., 2022).
