Scaling sufficiency of a small set of retrieval heads
Determine whether the same small subset of attention heads that suffices for retrieval-aware distillation below 3B parameters remains sufficient to preserve retrieval behavior in Transformer-SSM hybrids at larger scales, and characterize how retrieval behavior and attention-head redundancy change as parameter count grows.
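The masking protocol this question implies can be prototyped at toy scale before committing to large models. The sketch below is a minimal, hypothetical harness and not the paper's setup: a single multi-head attention layer is trained on a synthetic key-value recall task, heads are ranked by their solo retrieval accuracy, and the model is re-evaluated with all but the top-k heads zeroed out. All names and dimensions (`OneLayerAttn`, `batch`, `V`, `D`, `H`, `PAIRS`) are illustrative assumptions; a scaling study would repeat the same masking protocol across model sizes.

```python
# Hypothetical toy harness for head-subset sufficiency; none of these names or
# dimensions come from the paper, which studies Transformer-SSM hybrids below 3B.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, D, H, PAIRS = 32, 64, 8, 8  # vocab size, model dim, attention heads, pairs per prompt

class OneLayerAttn(nn.Module):
    """One multi-head attention layer with a per-head output mask for ablation."""
    def __init__(self):
        super().__init__()
        self.key_emb = nn.Embedding(V, D)   # embeds key tokens
        self.val_emb = nn.Embedding(V, D)   # embeds value tokens
        self.qkv = nn.Linear(D, 3 * D)
        self.out = nn.Linear(D, V)

    def forward(self, keys, vals, query, head_mask=None):
        B = keys.shape[0]
        slots = self.key_emb(keys) + self.val_emb(vals)           # (B, PAIRS, D)
        h = torch.cat([slots, self.key_emb(query)[:, None]], 1)   # append query slot
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        split = lambda t: t.view(B, -1, H, D // H).transpose(1, 2)  # -> (B, H, T, D/H)
        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / (D // H) ** 0.5
        z = att.softmax(-1) @ v                                   # (B, H, T, D/H)
        if head_mask is not None:
            z = z * head_mask.view(1, H, 1, 1)                    # zero out ablated heads
        z = z.transpose(1, 2).reshape(B, -1, D)
        return self.out(z[:, -1])                                 # logits from query slot

def batch(n=256):
    """Each prompt holds PAIRS distinct (key, value) slots plus one query key."""
    keys = torch.stack([torch.randperm(V)[:PAIRS] for _ in range(n)])
    vals = torch.randint(0, V, (n, PAIRS))
    idx = torch.randint(0, PAIRS, (n,))
    return keys, vals, keys[torch.arange(n), idx], vals[torch.arange(n), idx]

model = OneLayerAttn()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(3000):  # brief training on the toy recall task
    keys, vals, query, y = batch()
    loss = F.cross_entropy(model(keys, vals, query), y)
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def accuracy(head_mask=None):
    keys, vals, query, y = batch(2048)
    pred = model(keys, vals, query, head_mask).argmax(-1)
    return (pred == y).float().mean().item()

full = accuracy()
solo = [accuracy(F.one_hot(torch.tensor(i), H).float()) for i in range(H)]  # rank heads
order = sorted(range(H), key=lambda i: -solo[i])
for topk in (1, 2, 4, H):
    mask = torch.zeros(H)
    mask[order[:topk]] = 1.0
    print(f"top-{topk} heads: acc={accuracy(mask):.3f} (full model: {full:.3f})")
```

If redundancy is high, a small top-k mask should match full-model accuracy; the open question is whether that gap stays small as the same protocol is applied at increasing parameter counts.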
References
First, we only evaluate models below 3B parameters, so it is unclear whether the same small set of heads remains sufficient at larger scale, where retrieval behavior and head redundancy may change.
— Retrieval-Aware Distillation for Transformer-SSM Hybrids
(arXiv:2602.11374, Bick et al., 11 Feb 2026), Section: Conclusion and Future Work