KV sharing during distillation to further reduce cache cost
Develop and evaluate distillation procedures that enforce key–value sharing across layers or within layers (e.g., grouped-query attention tying) in Transformer–SSM hybrids, and determine the resulting effects on KV-cache size and downstream performance.
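The cache savings at stake can be made concrete. Below is a minimal sketch, not from the paper, of (a) a KV-cache size formula covering both GQA-style within-layer tying (fewer KV heads than query heads) and cross-layer sharing (only every k-th layer stores KV), and (b) a grouped-query attention forward pass in NumPy where each stored KV head serves a group of query heads. All function names and parameters are illustrative assumptions.

```python
import numpy as np

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, share_every=1):
    """KV-cache size in bytes: 2 (K and V) x storing layers x KV heads x dim x tokens.

    share_every=k models cross-layer sharing where only every k-th layer
    stores its own KV and the intervening layers reuse it (an assumption
    for illustration, not the paper's scheme).
    """
    storing_layers = -(-n_layers // share_every)  # ceil division
    return 2 * storing_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def gqa_attention(q, k, v):
    """Grouped-query attention. q: (H, T, d); k, v: (G, T, d), H % G == 0.

    Each of the G stored KV heads is shared by H // G query heads,
    so the cache holds G, not H, KV heads per layer.
    """
    H, T, d = q.shape
    G = k.shape[0]
    rep = H // G
    k_full = np.repeat(k, rep, axis=0)  # broadcast each KV head to its query group
    v_full = np.repeat(v, rep, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v_full
```

For example, with 24 layers, 128-dim heads, and 4096 tokens, tying 32 query heads to 8 KV heads shrinks the cache 4x, and additionally sharing KV across pairs of layers (`share_every=2`) halves it again; the open question above is whether distillation can enforce such tying without hurting downstream performance.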
References
Our study leaves several open questions about when retrieval-aware distillation transfers cleanly. In particular, although our hybrid stores fewer total KV pairs than prior hybrids, our distillation does not enforce KV sharing across layers (or within layers via GQA-style tying). Enabling KV sharing during distillation could further reduce the KV cache, but we leave this to future work.
— Retrieval-Aware Distillation for Transformer–SSM Hybrids
(2602.11374 - Bick et al., 11 Feb 2026) in Section: Conclusion and Future Work