Cleaner functional separation of retrieval and recurrent components

Establish methods that achieve a cleaner functional separation between retrieval (handled by preserved attention heads) and recurrent modeling (handled by SSM components) in Transformer–SSM hybrids, so that reducing the SSM state dimension (d_state) does not degrade retrieval-intensive performance; quantify and minimize the residual contribution of SSM heads to retrieval tasks.

Background

Ablation studies show that even after preserving retrieval-critical attention heads, some SSM heads still affect retrieval tasks, indicating residual coupling between recurrent and retrieval mechanisms.
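The residual coupling described above can be quantified with a leave-one-out head ablation: disable each SSM head in turn and record the drop in a retrieval metric. The sketch below is a minimal toy illustration of that procedure; `retrieval_score`, the head names, and their contribution values are hypothetical stand-ins for an actual benchmark run (e.g. a needle-in-a-haystack evaluation on the hybrid model), not the paper's evaluation harness.

```python
def retrieval_score(active_heads):
    # Toy proxy for a retrieval benchmark score with only `active_heads`
    # enabled. In practice this would run the real model on a retrieval
    # task; the per-head contributions here are illustrative only.
    contributions = {"attn_0": 0.40, "attn_1": 0.35,
                     "ssm_0": 0.05, "ssm_1": 0.01}
    return sum(contributions[h] for h in active_heads)

def residual_ssm_contribution(all_heads, ssm_heads):
    """Score drop from ablating each SSM head individually.

    A "clean" separation of retrieval from the recurrent backbone would
    mean every SSM head's drop is ~0 (retrieval lives entirely in the
    preserved attention heads).
    """
    base = retrieval_score(all_heads)
    drops = {}
    for head in ssm_heads:
        ablated = [h for h in all_heads if h != head]
        drops[head] = base - retrieval_score(ablated)
    return drops

heads = ["attn_0", "attn_1", "ssm_0", "ssm_1"]
drops = residual_ssm_contribution(heads, ["ssm_0", "ssm_1"])
```

In this toy setup, `ssm_0` would show a nonzero drop, flagging it as a head whose recurrent state still carries retrieval-relevant information; minimizing such drops is the stated goal.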

A cleaner separation would enable more aggressive reductions in SSM state dimension without harming retrieval-heavy performance, improving efficiency while maintaining capability.

References

Our study leaves several open questions about when retrieval-aware distillation transfers cleanly. Finally, retrieval is not fully separated from the recurrent backbone: our ablations show that a small number of SSM heads still affect retrieval tasks. This coupling limits how aggressively we can shrink d_state without degrading retrieval-intensive performance, and achieving a cleaner separation remains an open direction.

Retrieval-Aware Distillation for Transformer-SSM Hybrids  (2602.11374 - Bick et al., 11 Feb 2026) in Section: Conclusion and Future Work