Two-Stream Adapter for Cross-Modal Fusion
- Two-Stream Adapter is a module that enables efficient feature transformation and fusion between two parallel data streams using an adapter-based design.
- It employs intra-stream adaptation and bi-directional cross-stream fusion through techniques like bottleneck projections, temporal convolutions, and multi-head attention.
- The design achieves significant parameter efficiency while delivering state-of-the-art performance on tasks such as tracking and segmentation with minimal overhead.
A Two-Stream Adapter, in contemporary visual perception architectures, refers to an adapter-based module (or set of modules) inserted into a predominantly frozen backbone that enables efficient, learnable feature transformation and fusion across two parallel data streams, typically corresponding to distinct modalities (e.g., RGB and thermal, or image and side network), parallel encoders, or temporal segments. These adapters achieve parameter-efficient fine-tuning, robust cross-modal or cross-stream information exchange, and rapid adaptation to downstream tasks while minimizing the number of trainable parameters.
1. Architectural Foundations of Two-Stream Adapters
A Two-Stream Adapter architecture employs two main feature-processing branches (or “streams”), each handling either a different modality (e.g., RGB and depth, image and side-convolutional network) or different tokens (e.g., template and search). Adapters are strategically inserted at specific depth positions within these streams—generally after or within the backbone's transformer blocks or convolutional stages. The adapters may serve several distinct roles:
- Intra-stream adaptation: Modules such as Spatio-Temporal Modality Adapters (STMA) (Li et al., 3 Aug 2025) or Convolutional Side Adapters (CSA) (Yu et al., 2024) process features within each modality/stream, providing self-prompting or intra-stream refinement.
- Cross-stream fusion: Bi-directional adapters (e.g., BAT’s universal bi-directional adapter (Cao et al., 2023), DSTA’s trainable BA blocks (Zeng et al., 2024), or Progressive Modality Complementary Adapters (PMCA) (Li et al., 3 Aug 2025)) inject cross-modal information, enabling dynamic reciprocal fusion.
Adapter placement varies by model. For instance, DSTA blocks are inserted in specific transformer layers following both self-attention and MLP sublayers (e.g., layers 5, 6, 11 in a 12-layer ViT) (Zeng et al., 2024), while CSAs in TS-SAM are present after every block of the frozen SAM encoder (Yu et al., 2024). In DMTrack, both STMA (per stream) and PMCA (cross-stream) are distributed across all transformer layers (Li et al., 3 Aug 2025).
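The placement logic above can be sketched in a few lines. The sketch below is a minimal NumPy illustration, not any paper's implementation: the transformer block is a placeholder, dimensions are illustrative, and the layer set {5, 6, 11} simply mirrors the DSTA example.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 16, 64                  # tokens, embedding dim (illustrative)
NUM_LAYERS = 12
ADAPTER_LAYERS = {5, 6, 11}    # DSTA-style: adapters only at selected depths

def frozen_block(x):
    """Stand-in for a frozen transformer block (weights never trained)."""
    return x  # identity placeholder; a real block applies MSA + MLP

def adapter(x, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    return x + np.maximum(x @ W_down, 0.0) @ W_up

# One small trainable adapter per selected layer; backbone stays frozen.
params = {l: (rng.normal(0, 0.02, (D, D // 16)),
              rng.normal(0, 0.02, (D // 16, D)))
          for l in ADAPTER_LAYERS}

x = rng.normal(size=(N, D))
for layer in range(NUM_LAYERS):
    x = frozen_block(x)
    if layer in ADAPTER_LAYERS:        # adapter inserted only at chosen depths
        x = adapter(x, *params[layer])

print(x.shape)  # (16, 64)
```

Only the entries of `params` would receive gradients in training; everything in `frozen_block` stays fixed, which is the essence of the placement strategies described above.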
2. Mathematical Formulation and Module Details
Across works, Two-Stream Adapters adopt a combination of bottleneck projections and attention or gating mechanisms for feature fusion and transformation:
- Linear Bottlenecks: A standard adapter design employs down-projection, intermediate bottleneck, and up-projection; in DSTA, for example, each BA is parameterized as three linear projections: down (ℝ^(N×D) → ℝ^(N×D/16)), intermediate (ℝ^(N×D/16) → ℝ^(N×D/16)), and up (ℝ^(N×D/16) → ℝ^(N×D)), followed by non-linearities (Zeng et al., 2024).
- Residual Cross-Injection: In TS-SAM, CSA implements cross-stream fusion by injecting the side-adapter output into the main stream with 1×1 convolutions and batch normalization, maintaining signal alignment (Yu et al., 2024).
- Feature Prompting: BAT’s adapter function is A(x) = W_up · σ(W_down · x), a low-rank bottleneck with non-linearity σ whose down/up projection dimensions are much smaller than the token dimension; the resulting prompts are added after the MSA and MLP blocks in both the RGB and TIR branches (Cao et al., 2023).
- Pixel-wise Multi-Head Attention: DMTrack’s Deep Adapter employs pixel-wise multi-head attention, where modalities exchange K/V signals weighted by a learnable gating mechanism, producing modality-aware prompts (Li et al., 3 Aug 2025).
- Temporal Convolutions in Self-Adapters: DMTrack’s STMA leverages 1D temporal convolutions within each modality to capture spatio-temporal evolution from a memory bank of frames (Li et al., 3 Aug 2025).
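The linear-bottleneck form shared by these designs can be sketched directly. The following is a minimal NumPy sketch of the three-projection DSTA-style variant; the ReLU nonlinearity, initialization, and dimensions are illustrative assumptions, not the papers' exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 64                  # tokens, embedding dim (illustrative)
d = D // 16                    # bottleneck width, as in DSTA's D/16

# Three projections: down, intermediate, up (cf. the DSTA parameterization)
W_down = rng.normal(0, 0.02, (D, d))
W_mid  = rng.normal(0, 0.02, (d, d))
W_up   = rng.normal(0, 0.02, (d, D))

def bottleneck_adapter(x):
    """x: (N, D) -> same shape; the adapter prompt is added residually."""
    h = np.maximum(x @ W_down, 0.0)   # down-project + nonlinearity
    h = np.maximum(h @ W_mid, 0.0)    # intermediate bottleneck
    return x + h @ W_up               # up-project, residual add

x = rng.normal(size=(N, D))
y = bottleneck_adapter(x)
print(y.shape)                 # (16, 64): token shape is preserved
```

Because the bottleneck width d is a small fraction of D, the trainable parameter count scales as O(D·d) rather than O(D²), which is the source of the efficiency figures in the next section.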
3. Parameter Efficiency and Training Characteristics
A primary motivation behind Two-Stream Adapter designs is significant parameter reduction for downstream fine-tuning:
| Model | Adapter Params | % of Backbone | Backbone Status |
|---|---|---|---|
| BAT (Cao et al., 2023) | 0.32M | <0.3% | All backbone frozen |
| DSTA (CFBT) (Zeng et al., 2024) | 0.259M | <0.3% | Backbone + orig. BAs frozen |
| TS-SAM_B (Yu et al., 2024) | 11.9M | 4.4% | Backbone frozen |
| DMTrack (Li et al., 3 Aug 2025) | 0.93M | 0.9% | Backbone frozen |
Adapters are typically trained jointly with, at most, the task-specific prediction head. Optimizer choice centers on Adam/AdamW with moderate weight decay. Training regimens freeze all major backbone parameters and focus learning solely on adapter and head weights, ensuring stability and efficiency (Cao et al., 2023, Zeng et al., 2024, Yu et al., 2024, Li et al., 3 Aug 2025).
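A back-of-the-envelope count illustrates why the overhead in the table stays small. All numbers below are illustrative assumptions (ViT-Base scale, D/16 bottleneck, an arbitrary adapter count), not figures taken from any one paper.

```python
# Rough trainable-parameter budget for bottleneck adapters in a frozen ViT.
D = 768                                # ViT-Base token dimension
d = D // 16                            # bottleneck width
per_adapter = D * d + d * d + d * D    # down + intermediate + up projections

num_adapters = 6                       # e.g., a few adapters per stream (assumed)
backbone_params = 86_000_000           # ~86M, typical ViT-Base scale

adapter_params = num_adapters * per_adapter
ratio = adapter_params / backbone_params
print(f"{adapter_params / 1e6:.2f}M adapter params, "
      f"{100 * ratio:.2f}% of backbone")
# prints: 0.46M adapter params, 0.53% of backbone
```

Even a handful of such adapters stays well under 1% of the backbone, consistent with the sub-1% figures reported for BAT, DSTA, and DMTrack above.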
4. Information Flow and Fusion Mechanisms
Two-Stream Adapters structure information exchange with distinct but complementary strategies:
- Bi-Directional Additive Prompts: Adapters generate prompts in both directions, enabling each stream to dynamically assimilate features from the other (e.g., RGB↔TIR in BAT (Cao et al., 2023), RGB↔Aux modality in DMTrack (Li et al., 3 Aug 2025)). This design maintains full reciprocal fusion per layer.
- Template-Search Token Interaction: DSTA further splits branches into template and search tokens, applies fusion for search tokens only, and reconsolidates to optimize for tracking scenarios (Zeng et al., 2024).
- Mutual Cross-Attention and Complementarity: CSTAF and CSTCF in CFBT exemplify interleaved attention-based fusion for template and search tokens, with DSTA propagating these enhancements deeper by means of adapters (Zeng et al., 2024).
- Side-Stream Feature Injection: In TS-SAM, the side network (CSA stream) is repeatedly injected into the main ViT stream via 1×1 convolutions, and enhanced features are fused in the Feature Fusion Decoder using both global pooling and multi-stage upsampling (Yu et al., 2024).
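The bi-directional additive-prompt pattern that underlies the first of these strategies can be sketched as follows. This is a minimal NumPy sketch under assumed dimensions; real adapters use the full bottleneck form and gating described in Section 2, and both prompts are computed from the pre-fusion features so the exchange is symmetric per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 16, 64, 4            # tokens, embedding dim, bottleneck (illustrative)

def make_adapter():
    """One direction's adapter: a small bottleneck projection."""
    W_down = rng.normal(0, 0.02, (D, d))
    W_up = rng.normal(0, 0.02, (d, D))
    return lambda x: np.maximum(x @ W_down, 0.0) @ W_up

a_tir_to_rgb = make_adapter()  # prompt generator: TIR features -> RGB stream
a_rgb_to_tir = make_adapter()  # prompt generator: RGB features -> TIR stream

x_rgb = rng.normal(size=(N, D))
x_tir = rng.normal(size=(N, D))

# Each stream additively assimilates a prompt computed from the other,
# using pre-fusion features so the per-layer exchange is fully reciprocal.
x_rgb_new = x_rgb + a_tir_to_rgb(x_tir)
x_tir_new = x_tir + a_rgb_to_tir(x_rgb)

print(x_rgb_new.shape, x_tir_new.shape)   # (16, 64) (16, 64)
```

Dropping either direction recovers the single-direction variant that the ablations in Section 5 report as inferior.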
5. Empirical Performance and Impact
Experimental results consistently demonstrate substantial parameter efficiency, high tracking/segmentation accuracy, and robustness to cross-modal or temporal variation:
- BAT attains 86.8%/64.1% (MPR/MSR) on RGBT234, exceeding both full-tuning (DMCNet) and prompt-learning (ViPT) by 2–5% while using only 0.32M extra parameters (Cao et al., 2023).
- DSTA-equipped CFBT achieves state-of-the-art benchmark performance on three RGB-T benchmarks with <0.3% overhead (Zeng et al., 2024).
- TS-SAM-H outperforms SAM-Adapter by 3.1% (S_α) and 5.5% (F_βω) and reduces MAE by 0.008 on COD10K, at a 4.4% parameter cost (Yu et al., 2024).
- DMTrack achieves SOTA on five benchmarks, including 90.3% MPR (RGBT234) and 79.4% EAO (VOT-RGBD2022), demonstrating improved resilience to occlusion, distractors, and domain shift (Li et al., 3 Aug 2025).
Ablation studies in cited works show that full bi-directional adapters outperform single-direction or split adapters, and that both intra-stream self-adaptation and cross-stream prompting are critical for optimal performance (Cao et al., 2023, Li et al., 3 Aug 2025).
6. Generalization and Adaptation to Broader Domains
The Two-Stream Adapter paradigm generalizes beyond vision tasks and specific modal pairs:
- Any two (or more) modalities, e.g., depth+RGB, audio+video, LiDAR+camera, can be fused with the same adapter logic by designating one branch as the primary conduit (the “hub”) or extending PMCA to additional modalities (Zeng et al., 2024, Li et al., 3 Aug 2025).
- The same adapter form can implement temporally aligned multi-view fusions (treating time-indexed views as separate streams), or be embedded in hierarchical transformers such as Swin/DeiT by adjusting dimension reductions or distribution of adapter blocks (Zeng et al., 2024, Li et al., 3 Aug 2025).
- In all cases, the paradigm remains highly parameter- and computation-efficient (typically <1% overhead, and under 5% even for the heaviest TS-SAM variant), facilitating rapid adaptation to new tasks or modalities.
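One way the hub pattern could extend to more than two streams is sketched below. The stream names, the choice of RGB as hub, and the additive fusion are hypothetical illustrations, not a scheme from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d = 16, 64, 4            # tokens, embedding dim, bottleneck (illustrative)

def make_adapter():
    """Bottleneck adapter projecting one auxiliary stream into the hub."""
    W_down = rng.normal(0, 0.02, (D, d))
    W_up = rng.normal(0, 0.02, (d, D))
    return lambda x: np.maximum(x @ W_down, 0.0) @ W_up

# Hypothetical streams: RGB as the hub, depth and thermal as auxiliaries.
streams = {m: rng.normal(size=(N, D)) for m in ("rgb", "depth", "thermal")}
adapters = {m: make_adapter() for m in ("depth", "thermal")}

# Each auxiliary modality contributes an additive prompt to the hub stream.
hub = streams["rgb"]
for m, adapt in adapters.items():
    hub = hub + adapt(streams[m])

print(hub.shape)   # (16, 64)
```

Adding a modality costs only one more small adapter, which is why the overhead scales gently with the number of streams.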
A plausible implication is that this modular adapter approach will underpin scalable multi-modal and spatio-temporal architectures across varied domains due to the minimal parameter and compute footprint combined with strong empirical performance.
7. Comparative Analysis and Relationship to Prior Methods
Two-Stream Adapters distinguish themselves from related approaches through their balance of frozen-backbone architecture, deep reciprocal fusion, and parameter efficiency. Unlike prompt tuning, which typically injects prompts only at the input or select layers, these adapters operate per-layer and are often bi-directional and universal, as in BAT (Cao et al., 2023). Compared to full fine-tuning or single-stream adapter tuning, the two-stream approach offers both robust cross-modal generalization and practical leverage of large pre-trained models. The modularity and extensibility of these schemes have been quantitatively validated in ablation and cross-benchmark comparisons (Cao et al., 2023, Zeng et al., 2024, Yu et al., 2024, Li et al., 3 Aug 2025).
No systematic evidence exists in these works for a performance gap between two-stream adapters and domain-specialized models when sufficient adapter capacity is allocated; in TS-SAM, for instance, the two-stream adapter matches strong domain-specific baselines on multiple SOD tasks (Yu et al., 2024). Single-direction adapters and fully independent adapters tend to deliver inferior results or unnecessary parameter cost, underscoring the distinct advantage of the two-stream, bi-directional, and universal recipe (Li et al., 3 Aug 2025).