Cluster Attention Adapter (CLAdapter)

Updated 3 January 2026
  • CLAdapter is a lightweight, plug-and-play module that uses learnable cluster centers and transformation matrices to adapt pre-trained vision representations for data-scarce tasks.
  • It employs a unified interface for both CNNs and Transformers, utilizing cluster-based cosine similarity to dynamically adjust feature maps in 2D and 3D domains.
  • Empirical results demonstrate significant performance improvements with marginal overhead, adding only +7–10% parameters and <1G extra FLOPs in low-resource scenarios.

The Cluster Attention Adapter (CLAdapter) is a lightweight, plug-and-play architectural module for vision models, specifically designed to refine and adapt feature representations from large-scale pre-trained backbones—including both convolutional neural networks (CNNs) and Transformers—to data-limited downstream scientific tasks. CLAdapter achieves this by integrating cluster-based attention mechanisms and learnable transformation matrices, providing both parameter efficiency and effective adaptation across 2D and 3D vision domains. Distinct from earlier clustering-based sparse attention in dense prediction settings, CLAdapter primarily focuses on transferring and personalizing pre-trained representations to specialized, often low-resource, applications by introducing attention-guided distribution correlation with a set of learnable cluster centers (Li et al., 27 Dec 2025, Xie et al., 2022).

1. Architectural Formulation

At the core of CLAdapter is a unified interface that processes the output feature maps or tokens from any standard pre-trained backbone (e.g., ViT, ConvNeXt, Swin, or their 3D variants). All spatial (and, if present, temporal) dimensions of backbone features are flattened into a matrix $H \in \mathbb{R}^{N \times D}$ with $N$ flattened patches/tokens and feature dimension $D$. The CLAdapter module introduces $K$ learnable cluster centers $A = \{A_1, \dots, A_K\}$, $A_i \in \mathbb{R}^D$, which are optimized via gradient descent. Feature sets are layer-normalized and mapped onto cluster centers to generate a distribution correlation via cosine similarity:

$$\beta = \mathrm{softmax}\bigl(\hat H^q \cdot [\hat A_1, \ldots, \hat A_K]\bigr), \quad \hat H^q = \frac{H^q}{\|H^q\|}, \quad \hat A_i = \frac{A_i}{\|A_i\|}$$

with $H^q = \frac{1}{N}\sum_{i=1}^N H_i$.
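As a concrete illustration, the cluster-distribution step above can be sketched in NumPy (the function names and shapes here are illustrative, not taken from the paper's released code):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1D array
    e = np.exp(x - x.max())
    return e / e.sum()

def cluster_distribution(H, A):
    """Compute the cluster distribution beta from features H (N x D)
    and learnable cluster centers A (K x D), following the cosine-similarity
    formulation above. Hypothetical NumPy sketch."""
    Hq = H.mean(axis=0)                                    # mean-pooled query, shape (D,)
    Hq_hat = Hq / np.linalg.norm(Hq)                       # L2-normalized query
    A_hat = A / np.linalg.norm(A, axis=1, keepdims=True)   # L2-normalized centers
    return softmax(A_hat @ Hq_hat)                         # cosine sims -> softmax, shape (K,)
```

The resulting `beta` is a proper distribution over the $K$ clusters, which is what makes the downstream combination of transformation matrices a convex combination.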

For each cluster, a transformation matrix $M_i \in \mathbb{R}^{D \times D}$ is learned. A sample-specific adapter is produced as the convex combination $M^* = \sum_{i=1}^K \beta_i M_i$, which is applied to all features:

$$H^* = \mathrm{LayerNorm}(H M^*)$$

Subsequently, a two-layer MLP (expansion ratio 4, activation: GELU) further enhances the representation:

$$H' = \mathrm{GELU}(H^* W_1 + b_1)\, W_2 + b_2$$

This $H'$ is reshaped to the original spatial/temporal format and passed to the task-specific head.
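Putting the pieces together, a minimal NumPy sketch of the adapter forward pass might look as follows (shapes as defined above; GELU uses the common tanh approximation, and all names are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def cladapter_forward(H, beta, M, W1, b1, W2, b2):
    """Sample-specific adapter: convex combination of per-cluster
    transformation matrices M (K, D, D) weighted by beta (K,), then
    LayerNorm and a two-layer MLP. Illustrative sketch only."""
    M_star = np.tensordot(beta, M, axes=1)     # (D, D) combined adapter
    H_star = layer_norm(H @ M_star)            # (N, D) adapted, normalized features
    return gelu(H_star @ W1 + b1) @ W2 + b2    # (N, D) after the 4x-expansion MLP
```

Because only `beta` depends on the input sample, the per-cluster matrices `M` are shared across samples while the effective transform `M_star` is sample-specific.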

2. Mathematical and Algorithmic Framework

Cluster centers are initialized from a zero-mean Gaussian, $A_i \sim \mathcal{N}(0, \sigma^2 I)$, and jointly optimized with the transformation matrices and MLP parameters. The loss is typically a cross-entropy objective with L2 regularization:

$$\ell_{\rm CE} = -\frac{1}{B} \sum_{j=1}^B \sum_{c} y_{j,c} \log \mathrm{softmax}_c (W_{hd} h'_j + b_{hd}), \qquad \mathcal{L} = \ell_{\rm CE} + \lambda\|\theta\|_2^2$$

where $\theta$ denotes all CLAdapter and head parameters.
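Under the notation above, the objective can be written as a short NumPy sketch (a hypothetical re-implementation, not the authors' code):

```python
import numpy as np

def cladapter_loss(logits, y_onehot, params, lam):
    """Batch cross-entropy on head logits (B, C) with one-hot labels
    (B, C), plus L2 regularization with weight lam over all CLAdapter
    and head parameters theta. Illustrative sketch only."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilize softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -(y_onehot * log_probs).sum(axis=1).mean()         # l_CE averaged over the batch
    l2 = lam * sum(float((p ** 2).sum()) for p in params)   # lambda * ||theta||_2^2
    return ce + l2
```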

Training follows a staged fine-tuning (SFT) algorithm:

  1. Stage 1 (LP-like): Freeze the backbone; optimize only the CLAdapter and head for $T_1$ epochs.
  2. Stage 2 (Full): Unfreeze the backbone; optimize all parameters for $T_2$ epochs.

Default hyperparameters include $K = 20$ clusters, an MLP expansion of $4\times$, the AdamW optimizer, batch size 16, and weight initialization $A_i, M_i \sim \mathcal{N}(0, 0.02^2)$. SFT typically converges in 30–40 epochs in stage 2 (Li et al., 27 Dec 2025).
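The two-stage schedule can be expressed as a simple epoch-indexed policy; the sketch below is illustrative, and the parameter-group names are hypothetical:

```python
def sft_trainable_groups(epoch, T1):
    """Which parameter groups are trainable at a given epoch under staged
    fine-tuning (SFT): stage 1 (epochs < T1) trains only the adapter and
    head; stage 2 unfreezes the backbone as well. Illustrative sketch."""
    stage1 = epoch < T1
    return {
        "backbone": not stage1,  # frozen in stage 1, trained in stage 2
        "cladapter": True,       # always trained
        "head": True,            # always trained
    }
```

A training loop would query this policy each epoch and set `requires_grad` (or the framework's equivalent) on each group accordingly.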

3. Modular Integration Across Architectures

The CLAdapter's unified interface supports seamless insertion into both CNNs and Transformers, adaptable to 2D or 3D tasks. For CNNs, feature maps of shape $(C', H', W')$ are reshaped to $(N, D)$ with $N = H'W'$ and $D = C'$, while for video and volumetric data of shape $(T, C, H, W)$, the feature map is flattened over $T \times H \times W$. The CLAdapter module can be inserted after one or more blocks, or just before the task head, minimizing the increment in FLOPs ($<1$ G); the additional parameter count is marginal (+7–10%) (Li et al., 27 Dec 2025).
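The reshaping convention can be sketched as a small helper (NumPy, channels-first layout assumed; a hypothetical utility, not part of the published code):

```python
import numpy as np

def flatten_features(feat):
    """Flatten backbone features to (N, D) for CLAdapter.
    (N, D) token sequences pass through; (C, H, W) CNN maps become
    (H*W, C); (T, C, H, W) video/volumetric maps become (T*H*W, C).
    Illustrative sketch assuming channels-first layout."""
    if feat.ndim == 2:                                       # already (N, D) tokens
        return feat
    if feat.ndim == 3:                                       # (C, H, W) CNN map
        C, H, W = feat.shape
        return feat.reshape(C, H * W).T                      # -> (H*W, C)
    if feat.ndim == 4:                                       # (T, C, H, W) video/volume
        T, C, H, W = feat.shape
        return feat.transpose(0, 2, 3, 1).reshape(T * H * W, C)
    raise ValueError(f"unsupported feature rank: {feat.ndim}")
```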

In contrast with clustering-based sparse attention as in ClusTR (Xie et al., 2022), which reduces the quadratic self-attention cost via pre-aggregation of tokens and supports multi-scale clustering for dense tasks, CLAdapter’s primary objective is not computational sparsification but dynamic feature adaptation through cluster attention in data-scarce domains.

4. Comparative Performance and Ablation Results

Extensive evaluation across ten benchmarks encompassing generic, OOD, and multiple scientific modalities demonstrates CLAdapter’s state-of-the-art performance under data-limited conditions. Selected results, with backbone where specified:

| Dataset | Metric | Score (CLAdapter + backbone) |
|---|---|---|
| Tiny-ImageNet | Top-1 Acc | 94.21% (ViT-L) |
| PACS (OOD) | Top-1 Acc | 91.41% (ViT-B) |
| BreakHis (4-way) | F1 | 93.71% (ViT-B) |
| HCRF (2-way) | F1 | 98.59% (ConvNeXt-B) |
| Apple Disease | Acc | 98.36% |
| WHU-RS19 | Acc | 99.80% |
| UCF101 (video) | Top-1 Acc | 97.60% (Swin-B) |
| HMDB51 (video) | Top-1 Acc | 75.80% |

Ablation studies indicate optimal performance at $K = 20$ clusters, and a significant benefit from SFT over pure linear probing or full fine-tuning (e.g., stage-1 LP at $\sim$84% vs. stage-2 FT at 95% on BreakHis). Relative gains of up to +175% in F1 are observed (ConvNeXt), with only 0.4–1 G additional FLOPs (Li et al., 27 Dec 2025).

5. Relation to Clustering-Guided Sparse Attention in Transformers

ClusTR (Xie et al., 2022) introduces clustering-guided sparse self-attention, where key/value tokens are aggregated into $K \ll N$ cluster tokens, reducing attention cost from $O(N^2 d)$ to $O(NKd)$ and supporting multi-scale aggregation. For dense prediction, this cluster attention yields higher parameter/FLOP efficiency and state-of-the-art accuracy; e.g., on ImageNet-1K, ClusTR-S achieves 83.2% Top-1 accuracy (22.7M params), surpassing grid aggregation (PVT-style) and competing sparse attention schemes.
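The cost reduction comes from shrinking the attention score matrix from $(N, N)$ to $(N, K)$. The toy sketch below illustrates the idea with a hard cluster assignment and mean aggregation of keys/values; this is a deliberate simplification of ClusTR's actual scheme, and all names are illustrative:

```python
import numpy as np

def softmax_rows(x):
    # Row-wise numerically stable softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def clustered_attention(Q, K_tok, V, assign):
    """Toy clustering-guided sparse attention: key/value tokens (N, d) are
    mean-aggregated into K cluster tokens via a hard assignment vector
    assign (N,), so the score matrix is (N, K) instead of (N, N).
    A simplified illustration, not ClusTR's exact formulation."""
    clusters = np.unique(assign)
    Kc = np.stack([K_tok[assign == c].mean(axis=0) for c in clusters])  # (K, d)
    Vc = np.stack([V[assign == c].mean(axis=0) for c in clusters])      # (K, d)
    scores = Q @ Kc.T / np.sqrt(Q.shape[1])                             # (N, K) scores
    return softmax_rows(scores) @ Vc                                    # (N, d) output
```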

Methodologically, while both ClusTR’s sparse self-attention and CLAdapter use clustering concepts, CLAdapter diverges by using learnable cluster centers for attention-guided selection and adaptation (not just token aggregation), focuses on downstream adaptation to scientific or low-resource data, and is architecturally agnostic (applicable to both CNNs and Transformers).

6. Limitations and Future Directions

CLAdapter has not been validated for dense prediction tasks (such as detection or segmentation), which have been the focus of clustering-guided sparse attention in works like ClusTR. For extremely small downstream sample sizes ($<100$), reliable estimation of cluster centers may be challenging. A plausible implication is that further validation, particularly on dense prediction benchmarks, could be necessary to demonstrate CLAdapter's full applicability (Li et al., 27 Dec 2025). The unified, lightweight adaptation mechanism nevertheless provides a foundation for future extensions, including potential integration with multi-scale clustering paradigms observed in the sparse attention literature (Xie et al., 2022).
