SA Merge: Selective Attention Fusion

Updated 16 January 2026

Selective Attention Merge (SA Merge) names two techniques: exponentially scheduled fusion of domain-specific attention parameters for speech model adaptation, and correlation-aware sparse attention for long-context transformers.
The speech variant improves low-resource ASR by merging attention task vectors across transformer layers, achieving up to 14% relative WER reduction and new SOTA benchmarks.
The sparse-attention variant optimizes long-context inference by selectively merging semantically correlated regions, enabling inference with over 1M tokens while reducing GPU load.

Selective Attention Merge (SA Merge) denotes two algorithmically distinct approaches at the intersection of transformer attention efficiency and parameter-adaptive representation fusion. In recent literature, SA Merge refers, first, to a domain-adaptive model merging technique for Speech Foundation Models (SFMs), in which attention-layer “task vectors” from multiple fine-tuned models are fused via exponentially weighted schedules to enhance low-resource ASR (Shankar et al., 14 Jan 2025). Second, SA Merge designates a correlation-aware sparse attention framework for length-efficient transformers, in which the key regions most semantically similar to each query region are selected and then merged, keeping computation tractable while preserving accuracy (Wang et al., 2024). Both approaches target resource-constrained scenarios, either data-limited or hardware-limited, and both treat attention selectively rather than uniformly.

1. Definitions and Mathematical Formalism

SA Merge for Speech Model Fusion

Given a pretrained SFM $\mathcal{M}_0$, a child-speech–adapted version $\mathcal{M}_1$, and an adult-speech–adapted version $\mathcal{M}_2$, attention-layer task vectors are defined for transformer layer $i$ as:

$$\tau_{1,i}^{Q,K,V} = W_{1,i}^{Q,K,V} - W_{0,i}^{Q,K,V};\quad \tau_{2,i}^{Q,K,V} = W_{2,i}^{Q,K,V} - W_{0,i}^{Q,K,V}$$

The merged model’s task vector is:

$$\tau_{SA,i}^{Q,K,V} = \lambda_i \tau_{1,i}^{Q,K,V} + (1-\lambda_i) \tau_{2,i}^{Q,K,V}$$

where $\lambda_i = \lambda^{\alpha_i}$, with global mixing factor $\lambda$ and per-layer decay exponent $\alpha_i$. The resulting attention matrices are:

$$W_{SA,i}^{Q,K,V} = W_{0,i}^{Q,K,V} + \tau_{SA,i}^{Q,K,V}$$
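As a rough illustration of the formulas above, the following sketch merges a single attention matrix of one layer. The function name `merge_attention_weight`, the nested-list weight format, and the toy numbers are assumptions made for this example and do not come from the cited work.

```python
def merge_attention_weight(w0, w1, w2, lam, alpha_i):
    """Merge one attention matrix (Q, K, or V) of layer i.

    w0, w1, w2: same-shape nested lists holding the base, child-adapted,
    and adult-adapted weights. lam is the global mixing factor and
    alpha_i the per-layer exponent.
    """
    lam_i = lam ** alpha_i  # lambda_i = lambda^{alpha_i}
    merged = []
    for r0, r1, r2 in zip(w0, w1, w2):
        row = []
        for a0, a1, a2 in zip(r0, r1, r2):
            tau1 = a1 - a0  # task-vector entry for M_1
            tau2 = a2 - a0  # task-vector entry for M_2
            row.append(a0 + lam_i * tau1 + (1.0 - lam_i) * tau2)
        merged.append(row)
    return merged


# Toy 2x2 example with made-up numbers.
W0 = [[0.0, 1.0], [2.0, 3.0]]
W1 = [[1.0, 1.0], [2.0, 5.0]]
W2 = [[0.0, 3.0], [4.0, 3.0]]
W_sa = merge_attention_weight(W0, W1, W2, lam=0.5, alpha_i=1.0)
```

Because the base weights cancel algebraically, the merged matrix equals $\lambda_i W_{1,i} + (1-\lambda_i) W_{2,i}$; with `alpha_i=0` the result reduces to the child-adapted weights.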

SA Merge for Sparse Attention Extension

The input sequence is segmented into query regions and key regions. A semantic token is obtained for each query region and each key region (e.g., via mean pooling), and region-wise affinity is computed as the dot product between query-region and key-region semantic tokens.

For each query region, the top-K most correlated key regions are selected, the selected indices are merged across a fixed number of adjacent query regions, and a final multi-query attention is computed over the consolidated key/value set. This lowers time and memory cost relative to full attention and makes the degree of compression tunable.
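The selection stage described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: regions are fixed-size and consecutive, mean pooling produces the semantic tokens, and the helper names (`mean_pool`, `split_regions`, `top_k_key_regions`) are hypothetical.

```python
def mean_pool(tokens):
    """Average a list of token vectors into one semantic token."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def split_regions(seq, region_size):
    """Split a token sequence into consecutive fixed-size regions."""
    return [seq[i:i + region_size] for i in range(0, len(seq), region_size)]


def top_k_key_regions(queries, keys, region_size, k):
    """For each query region, return the indices of its k most correlated key regions."""
    q_sem = [mean_pool(r) for r in split_regions(queries, region_size)]
    k_sem = [mean_pool(r) for r in split_regions(keys, region_size)]
    selected = []
    for q in q_sem:
        # Dot-product affinity between one query region and every key region.
        scores = [sum(a * b for a, b in zip(q, kk)) for kk in k_sem]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(order[:k])
    return selected


# Toy example: 4 tokens of dimension 2, regions of 2 tokens each.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[0.0, 2.0], [0.0, 2.0], [3.0, 0.0], [3.0, 0.0]]
selection = top_k_key_regions(queries, keys, region_size=2, k=1)
```

Here the first query region picks key region 1 and the second picks key region 0, since those pairs have the larger dot products.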

2. Algorithms and Implementation Protocols

Speech SFM Task-Vector Merge

Construction of the merged model proceeds as follows:

For each transformer layer $i$, extract the attention weights $W_{0,i}^{Q,K,V}$, $W_{1,i}^{Q,K,V}$, $W_{2,i}^{Q,K,V}$.
Compute task vectors $\tau_{1,i}^{Q,K,V}$ and $\tau_{2,i}^{Q,K,V}$.
Exponentiate mixing ratio: $\lambda_i = \lambda^{\alpha_i}$.
Merge the $Q$, $K$, $V$ deltas and reconstruct $W_{SA,i}^{Q,K,V}$.
All non-attention parameters are taken unmerged from a single source model.
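The steps above might be applied to a whole checkpoint roughly as follows. This is a sketch under stated assumptions, not the authors' tooling: checkpoints are plain dicts of flat float lists, the parameter naming scheme `layers.<i>.attn.<q|k|v>` is hypothetical, and the model that supplies non-attention parameters is passed in explicitly as `non_attn_source`.

```python
import re

# Hypothetical naming scheme for attention projections.
ATTN_PATTERN = re.compile(r"^layers\.(\d+)\.attn\.(q|k|v)$")


def sa_merge(base, child, adult, lam, alphas, non_attn_source):
    """Merge attention Q/K/V weights layer by layer.

    base, child, adult, non_attn_source: dicts mapping parameter names to
    flat lists of floats. alphas[i] is the exponent used for layer i.
    """
    merged = {}
    for name, w0 in base.items():
        match = ATTN_PATTERN.match(name)
        if match is None:
            # Non-attention parameters are copied from one source model.
            merged[name] = list(non_attn_source[name])
            continue
        lam_i = lam ** alphas[int(match.group(1))]
        merged[name] = [
            a0 + lam_i * (a1 - a0) + (1.0 - lam_i) * (a2 - a0)
            for a0, a1, a2 in zip(w0, child[name], adult[name])
        ]
    return merged


# Toy checkpoints with made-up values.
base = {"layers.0.attn.q": [0.0, 0.0], "layers.1.attn.q": [0.0, 0.0], "layers.0.mlp": [1.0]}
child = {"layers.0.attn.q": [1.0, 1.0], "layers.1.attn.q": [1.0, 1.0], "layers.0.mlp": [2.0]}
adult = {"layers.0.attn.q": [0.0, 0.0], "layers.1.attn.q": [0.0, 0.0], "layers.0.mlp": [3.0]}
merged = sa_merge(base, child, adult, lam=0.5, alphas=[0.0, 1.0], non_attn_source=child)
```

In practice the same loop would run over framework tensors (for example a model state dict) rather than Python lists; the list form is used here only to keep the sketch dependency-free.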

Model families used include Whisper (all variants), Wav2Vec 2.0-base, HuBERT-base, and WavLM-base. Tooling is provided via HuggingFace Transformers, fairseq, and MergeKit (Shankar et al., 14 Jan 2025).

Correlation-Aware Sparse Attention Pipeline

Selection and merge stages are implemented as:

Segment the input sequence into query regions and key regions.
Pool the tokens of each region to generate query and key semantic tokens.
Compute dot-product correlations and select the top-K key regions per query region.
For every group of neighboring query regions, unique-merge their selection indices and keep the top-ranked key/value regions.
For each merged block, compute multi-head attention with the gathered key/value regions.
Positional encoding augmentation is performed post-selection using CRD-NTK (cyclic/randomly truncated/dynamically growing NTK positional embeddings) (Wang et al., 2024).
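The merge and attention stages could look roughly like the sketch below, which assumes the per-query-region selections are already available. Several choices here are assumptions rather than details from the source: merged indices are ranked by interleaving each region's picks in order, attention is single-head for brevity, positional encoding is omitted, and the helper names are hypothetical.

```python
import math


def merge_selections(selected, m, keep):
    """Unique-merge key-region indices over every m neighboring query regions.

    selected[r] lists key-region indices for query region r, best first.
    Returns one index list per merged block, truncated to `keep` entries.
    """
    merged = []
    for start in range(0, len(selected), m):
        group = selected[start:start + m]
        seen = []
        # Interleave by rank so every region's best picks are kept first.
        for rank in range(max(len(s) for s in group)):
            for s in group:
                if rank < len(s) and s[rank] not in seen:
                    seen.append(s[rank])
        merged.append(seen[:keep])
    return merged


def gather(tokens, region_ids, region_size):
    """Collect the tokens that belong to the listed regions."""
    out = []
    for r in region_ids:
        out.extend(tokens[r * region_size:(r + 1) * region_size])
    return out


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over gathered keys/values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
        )
    return outputs


# Four query regions, merged in pairs, keeping three key regions per block.
selected = [[2, 0], [2, 1], [0, 3], [1, 3]]
blocks = merge_selections(selected, m=2, keep=3)
```

Each merged block then attends only to `gather(keys, block, region_size)` and the matching values, so neighboring query regions share one consolidated key/value set.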

3. Empirical Results and Baselines

Low-Resource ASR with SA Merge

WER for Whisper-small on MyST, by training-subset size:

Train Subset (h)	Fine-tuned WER	SA Merge WER	Relative WER Change
1	10.64%	10.40%	−2.3%
5	10.05%	9.85%	−2.0%
10	9.94%	9.80%	−1.4%
full	9.34%	8.85%	−5.2%
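The last column of the table can be reproduced from the two WER columns. The short check below uses only the figures listed above; the function name `relative_change` is chosen for this example.

```python
# (train subset in hours, fine-tuned WER %, SA Merge WER %) from the table.
rows = [
    ("1", 10.64, 10.40),
    ("5", 10.05, 9.85),
    ("10", 9.94, 9.80),
    ("full", 9.34, 8.85),
]


def relative_change(baseline, merged):
    """Relative WER change in percent; negative means the merged model is better."""
    return 100.0 * (merged - baseline) / baseline


changes = {subset: round(relative_change(ft, sa), 1) for subset, ft, sa in rows}
```

For the full training set this gives (8.85 − 9.34) / 9.34 ≈ −5.2%, matching the table.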

Combining SA Merge with data augmentation (SpecAugment) sets a new SOTA WER of 8.69% (Shankar et al., 14 Jan 2025).

Efficient Long-Context Fine-Tuning

For Llama2-7B, SA Merge extends the context window to over 1M tokens with stable perplexity and achieves exact passkey recall at 4M tokens. GPU resource use is reduced compared to full attention (Wang et al., 2024).

4. Analytical Insights and Ablation Studies

Layerwise Fusion for Acoustic-Linguistic Feature Adaptation

High mixing ratios $\lambda_i$ in lower layers preferentially preserve acoustic/phonetic adaptation, while upper layers draw on broader-source linguistic patterns. In contrast to uniform merging, the exponential $\lambda^{\alpha_i}$ schedule mirrors the stratification of features across transformer layers (Shankar et al., 14 Jan 2025). Benchmarking against Lerp, Slerp, TA, RegMean, TIES, and DARE+TA shows a statistically significant improvement (Whisper-small).
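To make the contrast with uniform merging concrete, the sketch below prints a per-layer mixing schedule next to a constant one. The linear exponent schedule (from 0 at the first layer to 1 at the last), the layer count, and the value of the global mixing factor are assumptions for illustration only; the source does not specify them here.

```python
def mixing_schedule(lam, alphas):
    """Return lambda_i = lam ** alpha_i for each layer."""
    return [lam ** a for a in alphas]


num_layers = 4
# Hypothetical exponents growing linearly from 0 (first layer) to 1 (last layer).
alphas = [i / (num_layers - 1) for i in range(num_layers)]
schedule = mixing_schedule(0.5, alphas)

# Uniform merging (plain linear interpolation) uses one ratio everywhere.
uniform = [0.5] * num_layers

for layer, (lam_i, lam_u) in enumerate(zip(schedule, uniform)):
    print(f"layer {layer}: exponential={lam_i:.3f} uniform={lam_u:.3f}")
```

With a global factor below 1 and exponents that grow with depth, the ratio starts high in the lowest layer and decays toward the global factor, which is the layerwise behavior the paragraph above describes.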

Sparse Selection Coverage Tradeoff

Merging query regions gives neighboring queries shared access to their top-K key-value regions, which mitigates context starvation in isolated regions and improves long-sequence generalization. The segment and merge factors provide a controllable trade-off between compute and accuracy (Wang et al., 2024).

Task-Vector Orthogonality

Cosine similarity analysis reveals that task vectors from signal-processing–based augmentations (PP, SP, VTLP, SpecAug) are highly aligned with one another, while task vectors from synthetic TTS data are nearly orthogonal to them, implying complementary robustness when combined (Shankar et al., 14 Jan 2025).
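The alignment measure referred to above is the standard cosine similarity between flattened task vectors. A minimal sketch follows; the vectors are toy values, not measurements from the cited work, and the helper names are chosen for this example.

```python
import math


def task_vector(finetuned, base):
    """Flattened task vector: fine-tuned weights minus base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


base = [0.0, 0.0]
tau_a = task_vector([1.0, 0.0], base)  # toy augmentation A
tau_b = task_vector([2.0, 0.0], base)  # toy augmentation B, same direction
tau_c = task_vector([0.0, 3.0], base)  # toy augmentation C, orthogonal

aligned = cosine_similarity(tau_a, tau_b)
orthogonal = cosine_similarity(tau_a, tau_c)
```

A value near 1 indicates that two adaptations move the weights in the same direction, while a value near 0 indicates that they change largely independent directions.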

5. Practical Applications and Limitations

Model Fusion for Low-Resource Domains

SA Merge demonstrates efficacy on child ASR benchmarks, where pretraining data is scarce. By restricting the merge to attention layers, it stays parameter-efficient and leaves non-attention layers undisturbed. Extensions to dysarthric/accented speech and multilingual adaptation are logical next steps (Shankar et al., 14 Jan 2025).

Sparse Attention for Commodity Hardware

SA Merge enables inference and fine-tuning of 7B+ parameter models over long contexts on single A100s, outperforming LongLoRA/Longformer in resource usage. Positional encoding augmentation is critical for extrapolation (1M+ tokens) (Wang et al., 2024).

Limitations and Future Enhancements

Hyperparameter schedules ($\lambda$, $\alpha_i$, region/merge/sparsity factors) currently require grid or manual search.

Non-attention parameter merging remains unexplored.

CRD-NTK positional augmentation could be further developed by integrating relative positional encodings.

Routing complexity for extreme context lengths still presents bottlenecks.

SA Merge’s speech-domain instantiation is conceptually analogous to techniques such as Task Arithmetic and DARE+TA but is distinguished by its selective, exponentially scheduled fusion specific to attention matrices. The sparse attention variant advances beyond BigBird, Longformer, Routing Transformers, and Biformer by leveraging single-pass, correlation-driven selection rather than fixed local/global windows or clustering. Both frameworks illustrate the trend toward targeted adaptation of transformer attention for domain specificity and computational scalability.

7. Summary of Impact and Research Directions

Selective Attention Merge constitutes an algorithmic advance in both speech foundation model adaptation and length-efficient transformer attention. It offers up to 14% relative WER reduction over conventional fine-tuning (child ASR, Whisper-small) and, separately, unlocks 1M+ context-length inference on a single A100 with competitive PPL and passkey recall for LLMs. A plausible implication is that selective domain fusion and sparse attention fusion, when combined with principled positional augmentation, will become standard practice in settings where either data or hardware is severely limited. Key future directions include per-head adaptive schedules, extension of merging to non-attention submodules, and learned selection controllers for sparse attention routing (Shankar et al., 14 Jan 2025, Wang et al., 2024).

References

Shankar et al. (2025). Selective Attention Merging for low resource tasks: A case study of Child ASR.

Wang et al. (2024). Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension.
