Residual Adapter Block (RAB)
- Residual Adapter Block (RAB) is a lightweight two-layer bottleneck adapter that integrates frozen pre-trained features with task-specific updates using a residual blending mechanism.
- It employs dual branches for visual and text outputs via a structured MLP formulation, optimized for few-shot generalization while maintaining low parameter overhead.
- Empirical evidence shows that optimal residual blending yields superior performance by balancing pre-trained knowledge retention with new feature adaptation.
A Residual Adapter Block (RAB) is a lightweight, two-layer bottleneck module designed to augment task-specific adaptation for large-scale vision-language models, most notably in the CLIP-Adapter framework. RABs are appended to the frozen pre-trained CLIP backbones in both the visual and text branches. Their core mechanism involves learning a residual-style blend between pre-trained and newly adapted features, controlled by a tunable hyperparameter. This approach is engineered to enhance few-shot generalization by reprojecting features through a low-dimensional bottleneck while explicitly preserving pre-trained knowledge.
1. Architectural Role and Placement
RABs operate as post-encoder modules that interface directly with the frozen outputs of both image and text encoders in CLIP. In the visual branch, after the global-pooled image feature f is extracted by the frozen image encoder (e.g., ResNet-50), a two-layer bottleneck adapter is applied. Analogously, in the text branch, a similar bottleneck adapter is appended after the text encoder produces the classifier weight matrix W from the class-name prompts. The output of each adapter is blended with the original feature via a residual-style combination controlled by a scalar coefficient (either learned or fixed), before classification proceeds with updated or frozen text weights (Gao et al., 2021).
Input Image I
↓
Frozen CLIP Visual Encoder
↓
f ∈ ℝ^D ← original CLIP feature
↓
┌────────────────────────────┐
│ Bottleneck Adapter A_v │
│ ℝ^D → ℝ^D via ℝ^d │
└────────────────────────────┘
↓
A_v(f) ∈ ℝ^D ← new feature
↓
f* = (1–α)·f + α·A_v(f) ← residual blending
↓
Classification
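The flow in the diagram can be sketched directly in NumPy. The D = 1024 feature dimension and 4× bottleneck match CLIP's ResNet-50 configuration; the random weight initialization and the ReLU on the bottleneck layer are illustrative assumptions, not the paper's exact initialization:

```python
import numpy as np

def visual_adapter(f, W1, W2, alpha):
    """Two-layer bottleneck adapter with residual blending.

    f     : (D,) frozen CLIP image feature
    W1    : (D, d) down-projection, W2 : (d, D) up-projection
    alpha : residual blend coefficient in [0, 1]
    """
    hidden = np.maximum(f @ W1, 0.0)          # ReLU bottleneck, shape (d,)
    adapted = hidden @ W2                     # back up to shape (D,)
    return (1 - alpha) * f + alpha * adapted  # residual blend f*

rng = np.random.default_rng(0)
D, d = 1024, 256                              # ResNet-50 feature dim, 4x reduction
f = rng.standard_normal(D)
W1 = 0.02 * rng.standard_normal((D, d))       # toy initialization
W2 = 0.02 * rng.standard_normal((d, D))

f_star = visual_adapter(f, W1, W2, alpha=0.6)
```

Setting alpha = 0 recovers the frozen feature exactly, while alpha = 1 discards it in favor of the adapter output; intermediate values interpolate between the two regimes.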
2. Mathematical Formulation
The RAB consists of a two-layer bottleneck MLP. For the visual adapter:

A_v(f) = ReLU(f W1_v) W2_v

and for the text adapter:

A_t(W) = ReLU(W W1_t) W2_t

with W1 ∈ ℝ^(D×d) (bottleneck, d < D) and W2 ∈ ℝ^(d×D). The blended residual outputs are:

f* = (1 − α)·f + α·A_v(f),  W* = (1 − β)·W + β·A_t(W)

These are used in the CLIP-style softmax classifier over L2-normalized features with temperature τ:

p(y = i | f*) = exp(⟨W*_i, f*⟩/τ) / Σ_j exp(⟨W*_j, f*⟩/τ)
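The full formulation can be rendered in a few lines of NumPy. The dimensions, temperature, class count, and random weights below are toy stand-ins chosen only to make the sketch runnable:

```python
import numpy as np

def adapt(x, W1, W2):
    # two-layer bottleneck MLP with a ReLU on the hidden layer
    return np.maximum(x @ W1, 0.0) @ W2

def clip_adapter_probs(f, W_text, params, alpha, beta, tau=0.01):
    """Blend adapted visual/text features, then apply the cosine softmax."""
    f_star = (1 - alpha) * f + alpha * adapt(f, *params["visual"])
    W_star = (1 - beta) * W_text + beta * adapt(W_text, *params["text"])
    f_star = f_star / np.linalg.norm(f_star)                      # unit-norm feature
    W_star = W_star / np.linalg.norm(W_star, axis=1, keepdims=True)
    logits = (W_star @ f_star) / tau                              # (K,) cosine logits
    exp = np.exp(logits - logits.max())                           # stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
D, d, K = 512, 128, 5                     # toy dimensions, not the paper's
params = {
    "visual": (0.02 * rng.standard_normal((D, d)), 0.02 * rng.standard_normal((d, D))),
    "text":   (0.02 * rng.standard_normal((D, d)), 0.02 * rng.standard_normal((d, D))),
}
f = rng.standard_normal(D)                # stand-in for a frozen CLIP image feature
W_text = rng.standard_normal((K, D))      # stand-in for prompt-derived class weights
probs = clip_adapter_probs(f, W_text, params, alpha=0.2, beta=0.2)
```

Note that normalization happens after blending, so the softmax always operates on cosine similarities regardless of α and β.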
3. Dimensions and Hyperparameters
CLIP-Adapter typically employs D = 1024 (ResNet-50) or D = 512 (ViT-B). The bottleneck reduction is set at d = D/4 (e.g., d = 256 for ResNet-50). Consequently, adapter layer shapes are W1 ∈ ℝ^(1024×256) and W2 ∈ ℝ^(256×1024) for the ResNet configuration. The residual coefficients α (visual) and β (text) are dataset-specific hyperparameters: smaller values tend to be optimal for generic datasets (ImageNet) and larger values for fine-grained datasets (DTD, EuroSAT). Selection is made via a small discrete search (Gao et al., 2021).
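The discrete search over the residual coefficient amounts to evaluating a handful of candidate values on held-out few-shot data. The candidate grid and the stand-in accuracy function below are placeholders, not the paper's protocol:

```python
def grid_search_alpha(eval_accuracy, candidates=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return the residual coefficient maximizing validation accuracy."""
    scores = {a: eval_accuracy(a) for a in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# toy stand-in: accuracy that peaks at an intermediate blend
best, scores = grid_search_alpha(lambda a: 1.0 - (a - 0.6) ** 2)
```

In practice `eval_accuracy` would train the adapter once and re-score the validation set at each candidate α, which is cheap because only the blending step changes.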
4. Training Protocol in Few-Shot Regimes
CLIP-Adapter with RABs is trained in a few-shot setting with the backbone fully frozen; only the adapter weights (W1_v, W2_v, W1_t, W2_t) and optionally the residual coefficients (α, β) are optimized. The AdamW optimizer is used with batch size 32. The loss function is standard cross-entropy over the blended softmax outputs. No explicit regularization (such as dropout or additional weight decay) is applied beyond that inherent in AdamW.
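The objective is ordinary cross-entropy on the adapted logits; a NumPy sketch with a numerically stable log-softmax (the batch and labels are synthetic):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood over a batch.

    logits : (N, K) class scores, labels : (N,) integer targets
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3]])   # synthetic 2-sample, 3-class batch
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
```

Because the backbone is frozen, gradients of this loss flow only into the small adapter matrices (and optionally α, β).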
5. Empirical Performance: Ablation and Generalization
Ablation experiments demonstrate the critical role of the residual blend. On DTD (fine-grained, 16-shot), accuracy is 40.72% with α = 0 (zero-shot), 63.79% with α = 1 (adapter only), and 66.06% with the optimal intermediate residual blend. On ImageNet (generic, 16-shot), the corresponding results are 60.46% (α = 0), 59.05% (α = 1), and 61.33% (optimal α) (Gao et al., 2021). Pure adapter adaptation (α = 1) can overfit on broader datasets, while blending consistently yields superior few-shot generalization. This suggests that residual blending balances new-knowledge injection against preservation of the pre-trained manifold structure.
6. Comparative Assessment: Advantages and Limitations
RAB-equipped CLIP-Adapter is highly parameter-efficient, requiring approximately 2·D·d parameters per adapted branch (about 0.5M for ResNet-50), in contrast to the millions updated by full fine-tuning. The frozen backbone avoids catastrophic forgetting, ensuring the original zero-shot capability is partly retained. The method outperforms prompt tuning (CoOp) on nearly all few-shot benchmarks, while requiring no prompt-specific continuous tokens and presenting a simpler design. RABs can adapt the visual branch, the textual branch, or both, with independent or shared blending ratios.
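The parameter-count figure can be checked with simple arithmetic, assuming the 4× bottleneck and ignoring biases:

```python
D, d = 1024, 256                  # CLIP ResNet-50 feature dim, 4x bottleneck
per_adapter = D * d + d * D       # down-projection plus up-projection weights
print(per_adapter)                # 524288, i.e. roughly 0.5M per adapted branch
```

This is two to three orders of magnitude below the parameter count of the frozen backbone itself, which is what makes dataset-level grid search over α affordable.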
Limitations include the introduction of one or two new hyperparameters (α and β) that require dataset-level tuning, though the grid search is computationally light. There is also a small additional computational footprint for the adapter forward pass, which remains negligible compared to full backbone passes. RABs do not replace prompt tuning in scenarios dominated by prompt engineering or complex language structure; they are complementary tools within the overall adaptation toolkit.
7. Conceptual Summary and Significance
The Residual Adapter Block constitutes a minimal, two-layer MLP “bottleneck” attached to the output of a frozen pre-trained model. Through low-rank projection and a learnable residual blend, RABs integrate task-specific adaptation with preservation of prior knowledge. This design enables robust few-shot generalization with minimal parameter overhead and mitigates overfitting, and is empirically superior to both prompt-tuning and full fine-tuning baselines in the CLIP-Adapter context (Gao et al., 2021).