Cleaner functional separation of retrieval and recurrent components
Establish methods that achieve a cleaner functional separation in Transformer–SSM hybrids between retrieval, handled by preserved attention heads, and recurrent modeling, handled by SSM components, so that reducing the SSM state dimension (d_state) does not degrade retrieval-intensive performance; quantify and minimize the residual contribution of SSM heads to retrieval tasks.
Our study leaves several open questions about when retrieval-aware distillation transfers cleanly. In particular, retrieval is not fully separated from the recurrent backbone: our ablations show that a small number of SSM heads still contribute measurably to retrieval tasks. This coupling limits how aggressively we can shrink d_state without degrading retrieval-intensive performance, and achieving a cleaner separation remains an open direction.
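To make the "residual contribution" concrete, the sketch below shows one way such an ablation could be organized: disable one SSM head at a time, re-run a retrieval benchmark, and record the drop from the unablated baseline. This is a minimal illustration, not the paper's released code; the callables ablate_head and eval_retrieval, and the toy stand-ins in the main block, are assumptions standing in for whatever hook and benchmark machinery a specific hybrid model exposes.

```python
# Hypothetical sketch: per-head ablation sweep to quantify how much each SSM
# head contributes to retrieval accuracy. `ablate_head` and `eval_retrieval`
# are assumed interfaces, not part of any released codebase.

from typing import Callable, Dict, List


def ssm_head_ablation_sweep(
    head_ids: List[int],
    ablate_head: Callable[[int], Callable[[], None]],
    eval_retrieval: Callable[[], float],
) -> Dict[int, float]:
    """Zero out one SSM head at a time and record the drop in retrieval score.

    ablate_head(h) disables head h (e.g. by zeroing its state output) and
    returns a callable that restores it; eval_retrieval() runs the
    retrieval-intensive benchmark and returns a scalar score.
    """
    baseline = eval_retrieval()
    deltas: Dict[int, float] = {}
    for h in head_ids:
        restore = ablate_head(h)                  # disable head h
        deltas[h] = baseline - eval_retrieval()   # positive delta => head mattered
        restore()                                 # re-enable before the next trial
    return deltas


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end: head 3 "carries" 0.2 of
    # the retrieval score, every other head is inert.
    disabled: set = set()

    def ablate_head(h: int) -> Callable[[], None]:
        disabled.add(h)
        return lambda: disabled.discard(h)

    def eval_retrieval() -> float:
        return 0.9 - (0.2 if 3 in disabled else 0.0)

    print(ssm_head_ablation_sweep(list(range(8)), ablate_head, eval_retrieval))
```

Heads with near-zero deltas are candidates for aggressive d_state reduction, while heads with large deltas mark where retrieval is still entangled with the recurrent backbone.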