Scaling sufficiency of a small set of retrieval heads

Determine whether the same small subset of attention heads that suffices for retrieval-aware distillation below 3B parameters also preserves retrieval behavior in Transformer–SSM hybrids at larger scales, and characterize how retrieval behavior and attention-head redundancy change as parameter count increases.

Background

The paper shows that preserving a very small subset of retrieval-critical attention heads (about 2% of heads in 1B–1.5B models) and replacing the rest with SSM heads can recover most of the teacher’s retrieval-heavy performance while greatly reducing memory. However, the experiments are limited to models under 3B parameters.
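The head-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the function name, the `keep_fraction` parameter, and the idea of ranking heads by a precomputed per-head retrieval score (e.g., from a needle-in-a-haystack probe) are all assumptions made here for concreteness.

```python
def select_retrieval_heads(retrieval_scores, keep_fraction=0.02):
    """Partition heads into a small set kept as attention and the rest
    marked for SSM replacement.

    retrieval_scores: one score per attention head, higher = more
    retrieval-critical (how these scores are obtained is out of scope here).
    keep_fraction: fraction of heads to preserve as attention (~2% in the
    sub-3B experiments described above).
    """
    n_heads = len(retrieval_scores)
    n_keep = max(1, round(keep_fraction * n_heads))  # always keep at least one
    # Rank head indices by score, descending.
    ranked = sorted(range(n_heads), key=lambda i: retrieval_scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])                   # stay as attention heads
    replace = sorted(set(range(n_heads)) - set(keep))  # distill into SSM heads
    return keep, replace
```

The open question posed in this section is whether `keep_fraction` can stay this small as `n_heads` grows with model scale, or whether redundancy patterns force it upward.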

The authors note that retrieval behavior and head redundancy may evolve with scale. Verifying whether a similarly small number of heads remains sufficient at larger scales is central to understanding the scalability and generality of retrieval-aware distillation.

References

"First, we only evaluate models below 3B parameters, so it is unclear whether the same small set of heads remains sufficient at larger scale, where retrieval behavior and head redundancy may change."

Retrieval-Aware Distillation for Transformer-SSM Hybrids  (2602.11374 - Bick et al., 11 Feb 2026) in Section: Conclusion and Future Work