Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

Published 24 Apr 2026 in cs.CL and cs.LG | (2604.22127v1)

Abstract: Hybrid LLMs that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.

Summary

  • The paper introduces a systematic empirical analysis of LoRA placement in hybrid models, demonstrating that attention-only adaptation outperforms full-model tuning with significantly fewer parameters.
  • It finds that the adaptation impact on the recurrent backbone is topology-dependent, with sequential hybrids suffering performance drops while parallel hybrids benefit.
  • The study highlights that attention-only LoRA minimizes catastrophic forgetting and achieves notable efficiency gains in domain adaptation across tasks.

Component-Type LoRA Placement in Hybrid LLMs

Introduction

The paper "Where Should LoRA Go? Component-Type Placement in Hybrid Language Models" (2604.22127) presents a systematic empirical analysis of LoRA (Low-Rank Adaptation) placement across the distinct architectural component types in hybrid LLMs. Rather than following default Transformer-centric PEFT conventions, in which LoRA adapters typically target attention projections, the authors examine whether the functionally distinct modules within hybrid models (e.g., attention, recurrent/state-space modules, MLPs) require differentiated adaptation protocols. They focus on two representative hybrid architectures, the sequential Qwen3.5-0.8B (GatedDeltaNet backbone interleaved with softmax attention) and the parallel Falcon-H1-0.5B (Mamba-2 SSM and attention in parallel), and assess domain adaptation dynamics and transfer asymmetries under several LoRA placement schemes.

Figure 1: Parameter breakdown for Qwen3.5-0.8B, highlighting the dominance of recurrent and MLP over attention.

Experimental Design

The study isolates and combines LoRA placement across attention, recurrent, and MLP components. Six LoRA placement conditions are defined per architecture, and fine-tuning is conducted on three distinct domains: GSM8K (grade-school math), CodeAlpaca (code generation), and UltraChat (general instruction following). For each domain, the resulting model variants are evaluated on MMLU, GSM8K, ARC-Challenge, HellaSwag, and HumanEval. All models are trained with consistent hyperparameters (LoRA rank r = 16, fixed learning rates, identical hardware) to ensure meaningful comparisons. The evaluation covers not only in-domain performance but also cross-task transfer and catastrophic forgetting.
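
A placement condition amounts to choosing which component types receive adapters. The following is a minimal Python sketch of that selection step, not the authors' code. The module-name patterns, the toy module list, and the condition names are all hypothetical, since this summary does not list the paper's six conditions or the models' real module names.

```python
import re

# Hypothetical name patterns per component type. Real module names depend on
# the model implementation and must be checked against the actual checkpoint.
COMPONENT_PATTERNS = {
    "attention": re.compile(r"\.self_attn\.\w+_proj$"),
    "recurrent": re.compile(r"\.linear_attn\.\w+_proj$"),
    "mlp": re.compile(r"\.mlp\.\w+_proj$"),
}


def select_targets(module_names, components):
    """Return the module names that belong to any requested component type."""
    patterns = [COMPONENT_PATTERNS[c] for c in components]
    return [n for n in module_names if any(p.search(n) for p in patterns)]


# Toy stand-in for a model's module names: one recurrent block followed by
# one attention block, each with its own MLP.
TOY_MODULES = [
    "layers.0.linear_attn.in_proj",
    "layers.0.linear_attn.out_proj",
    "layers.0.mlp.up_proj",
    "layers.0.mlp.down_proj",
    "layers.1.self_attn.q_proj",
    "layers.1.self_attn.v_proj",
    "layers.1.mlp.up_proj",
    "layers.1.mlp.down_proj",
]

# Illustrative placement conditions (not necessarily the paper's six).
CONDITIONS = {
    "attention_only": ["attention"],
    "recurrent_only": ["recurrent"],
    "mlp_only": ["mlp"],
    "all": ["attention", "recurrent", "mlp"],
}

if __name__ == "__main__":
    for name, components in CONDITIONS.items():
        print(name, select_targets(TOY_MODULES, components))
```

The list returned for a condition is the set of layers a LoRA library would be told to adapt; every other weight stays frozen.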

Figure 2: GSM8K accuracy heatmap for Qwen3.5-0.8B under various LoRA placement conditions.

Key Findings

Attention as an Adaptation Hotspot

A core result, consistent across both sequential and parallel hybrid architectures, is that LoRA adapters applied solely to the attention pathway match or exceed the performance of full-model adaptation while using 5–10× fewer trainable parameters. For Qwen3.5-0.8B, softmax-attention-only LoRA attains a +10.2 pp improvement on the target domain (GSM8K), outperforming all-layers adaptation by a substantial margin. For Falcon-H1-0.5B, attention-only LoRA yields an even larger gain (+17.2 pp on GSM8K).
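
To see why restricting adapters to attention shrinks the trainable-parameter count, recall that LoRA adds two low-rank factors to each adapted linear layer. The sketch below counts those parameters per component type. The layer shapes and block counts are made-up placeholders, not the real dimensions of Qwen3.5-0.8B or Falcon-H1-0.5B, so the resulting ratio is only illustrative.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def total_lora_params(layer_shapes, rank):
    """Sum LoRA parameters over a list of (d_in, d_out) layer shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)


RANK = 16  # the rank reported in the study

# Hypothetical layer shapes, NOT the real dimensions of either model.
TOY_SHAPES = {
    # 4 projections in each of 6 attention blocks
    "attention": [(1024, 1024)] * 4 * 6,
    # 2 projections in each of 18 recurrent blocks
    "recurrent": [(1024, 2048), (2048, 1024)] * 18,
    # 3 projections in each of 24 MLP blocks
    "mlp": [(1024, 4096), (1024, 4096), (4096, 1024)] * 24,
}

totals = {name: total_lora_params(shapes, RANK) for name, shapes in TOY_SHAPES.items()}
full_model = sum(totals.values())

if __name__ == "__main__":
    for name, count in totals.items():
        print(f"{name:>10}: {count:,} trainable parameters")
    print(f"full model / attention-only: {full_model / totals['attention']:.1f}x")
```

Because the count grows with both the number of adapted layers and their widths, a component that holds a small share of the layers also contributes a small share of the adapter parameters.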

Topology-Dependent Adaptation Asymmetry

The response to LoRA placement on the recurrent backbone diverges fundamentally according to model topology. For Qwen3.5-0.8B (sequential), LoRA on GatedDeltaNet is strongly destructive (-14.8 pp on GSM8K), whereas for Falcon-H1-0.5B (parallel), LoRA on the SSM backbone is constructive (+8.6 pp). This finding indicates that the way recurrent and attention modules are interconnected determines how fragile or robust they are under adaptation.

Cross-Task Transfer and Forgetting

Cross-domain evaluation reveals a further pronounced asymmetry. Sequential hybrids (Qwen3.5-0.8B) are highly susceptible to catastrophic forgetting under LoRA adaptation: UltraChat training changes GSM8K accuracy by up to -16.0 pp. Conversely, parallel hybrids (Falcon-H1-0.5B) exhibit positive transfer, gaining accuracy on untrained tasks after instruction tuning. Notably, attention-only LoRA placement is the most robust across cross-task transfer scenarios, consistently minimizing forgetting.

Figure 3: Cross-task accuracy impact heatmap after UltraChat training for Qwen3.5-0.8B, indicating severe catastrophic forgetting on GSM8K.
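
Forgetting and transfer in this kind of analysis are read off as accuracy changes, in percentage points, between the base model and the fine-tuned model on each benchmark. The sketch below shows that bookkeeping. The accuracies are placeholders rather than results from the paper, and the 5 pp forgetting threshold is an arbitrary assumption.

```python
def delta_pp(base, tuned):
    """Per-benchmark accuracy change in percentage points (tuned minus base)."""
    return {bench: round(tuned[bench] - base[bench], 1) for bench in base}


def forgotten(deltas, trained_on, threshold_pp=5.0):
    """Benchmarks outside the training domain that dropped by more than threshold_pp."""
    return sorted(
        bench
        for bench, change in deltas.items()
        if bench != trained_on and change < -threshold_pp
    )


# Placeholder accuracies in percent; these are NOT results from the paper.
base = {"GSM8K": 30.0, "MMLU": 40.0, "HellaSwag": 50.0}
tuned = {"GSM8K": 20.0, "MMLU": 41.5, "HellaSwag": 49.0}

deltas = delta_pp(base, tuned)

if __name__ == "__main__":
    print(deltas)
    print("forgotten:", forgotten(deltas, trained_on="UltraChat"))
```

A negative change on a benchmark outside the training domain indicates forgetting, and a positive one indicates cross-task transfer.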

Analysis and Theoretical Implications

The observed effectiveness of attention-only adaptation, despite attention representing a parameter minority (4.4–6.4% of model parameters), aligns with the functional specialization hypothesis: the recurrent backbone encodes the core sequence dynamics and is finely optimized (and thus fragile to perturbation), whereas attention modules perform task-adaptable refinement. This rigidity/plasticity dichotomy is substantiated by prior functional ablation evidence showing that loss of recurrent modules produces orders-of-magnitude larger degradation than loss of attention (Borobia et al., 23 Mar 2026).

The destructive effect of simultaneous LoRA on attention and MLP in sequential hybrids (e.g., a collapse in GSM8K performance for Qwen3.5-0.8B) suggests that optimization conflicts may emerge when both adaptation pathways are modified without backbone plasticity. The independent branches of the parallel hybrid make it resilient to such interference.

Figure 4: Pareto efficiency plot for Qwen3.5-0.8B, with attention-only LoRA on the Pareto frontier (highest accuracy per trainable parameter).
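
A configuration lies on the Pareto frontier when no other configuration is at least as good on both axes (fewer or equal trainable parameters, higher or equal accuracy) and strictly better on one. The sketch below applies that definition. The configuration names and numbers are invented for illustration and do not reproduce the figure.

```python
def pareto_frontier(points):
    """Return the names of non-dominated configurations.

    points: list of (name, trainable_params, accuracy) tuples.
    A point is dominated if another point has <= params and >= accuracy,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for name, params, acc in points:
        dominated = any(
            other_params <= params
            and other_acc >= acc
            and (other_params < params or other_acc > acc)
            for _, other_params, other_acc in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Invented (name, trainable params in millions, accuracy in percent) triples.
TOY_POINTS = [
    ("A", 0.2, 20.0),
    ("B", 0.8, 40.0),
    ("C", 1.8, 15.0),
    ("D", 5.9, 35.0),
    ("E", 8.5, 30.0),
]

if __name__ == "__main__":
    print(pareto_frontier(TOY_POINTS))
```

In this toy data, B dominates C, D, and E because it is both smaller and more accurate, while A stays on the frontier as the cheapest option.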

Practical Recommendations

The results yield clear actionable guidelines:

  • For sequential hybrids, restrict LoRA placement to attention modules; avoid the recurrent backbone.
  • For parallel hybrids, attention-only LoRA is both maximally efficient and robust, but backbone adaptation is also safe.
  • For all hybrids, attention-only LoRA provides minimal-parameter adaptation with low risk of catastrophic forgetting.
  • Practitioners should prefer parallel hybrid architectures when substantial domain adaptation or cross-task robustness is required.
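
The guidelines above can be written down as a small lookup table keyed by topology. This is only one way to encode them. The function name, the dictionary keys, and the topology labels are assumptions made for illustration.

```python
# Encoding of the placement guidelines; the keys and labels are hypothetical.
PLACEMENT_GUIDE = {
    "sequential": {"adapt": ["attention"], "optional": [], "avoid": ["recurrent"]},
    "parallel": {"adapt": ["attention"], "optional": ["recurrent"], "avoid": []},
}


def placement_for(topology):
    """Return the suggested LoRA placement for a hybrid topology."""
    try:
        return PLACEMENT_GUIDE[topology]
    except KeyError:
        known = ", ".join(sorted(PLACEMENT_GUIDE))
        raise ValueError(f"unknown topology {topology!r}; expected one of: {known}") from None


if __name__ == "__main__":
    for topology in PLACEMENT_GUIDE:
        print(topology, placement_for(topology))
```

Attention is the default target in both cases; the topology only decides whether the recurrent backbone is off limits or an optional extra.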

Figure 5: GSM8K-trained Qwen3.5-0.8B accuracy as a function of placement condition: attention-adapted variants outperform alternatives with fewer parameters.

Directions for Future Work

The paper identifies several limitations and future research trajectories:

  • Larger models (>1B, e.g., 7B or 70B) may exhibit different component sensitivities.
  • Multi-seed experiments are necessary to fully quantify statistical effects.
  • Component-type sensitivity under alternative PEFT methods (e.g., (IA)³, prefix tuning) remains unexplored.
  • Extension to other hybrid topologies (e.g., interleaved or deeper SSM-attention stacks) could reveal further adaptation regularities.

Conclusion

This systematic study demonstrates that LoRA placement in hybrid LLMs should be explicitly guided by architectural component type and topology. Attention-only adaptation is distinctly advantageous (parameter efficiency, robustness), while the recurrent backbone’s adaptability is highly topology-dependent. These findings establish component-type targeting as a core dimension of PEFT design for hybrid architectures and motivate both architecture and adaptation method refinements for future large-scale LLMs.
