KV sharing during distillation to further reduce cache cost

Develop and evaluate distillation procedures that enforce key–value sharing across layers or within layers (e.g., grouped-query attention tying) in Transformer–SSM hybrids, and determine the resulting effects on KV-cache size and downstream performance.

Background

Although the proposed hybrids store far fewer KV pairs by retaining only a few attention heads, the current distillation pipeline does not enforce sharing of K and V across or within layers. Such sharing could further shrink the KV cache but has not been integrated into the distillation framework.
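To make the within-layer variant concrete, the following is a minimal NumPy sketch of GQA-style K/V tying, in which several query heads read one shared key/value head, so only the shared K/V heads ever enter the cache. All sizes here are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sizes (not from the paper): 8 query heads share 2 KV heads.
n_q_heads, n_kv_heads, d_head, seq_len = 8, 2, 16, 32
group = n_q_heads // n_kv_heads  # query heads per shared KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, d_head))
# Only n_kv_heads K/V tensors are ever cached, not n_q_heads.
k = rng.standard_normal((n_kv_heads, seq_len, d_head))
v = rng.standard_normal((n_kv_heads, seq_len, d_head))

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group  # query head h attends over shared KV head kv
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    out[h] = w @ v[kv]

# Cached K+V entries, tied vs. untied: the ratio is n_kv_heads / n_q_heads.
cache_tied = 2 * n_kv_heads * seq_len * d_head
cache_full = 2 * n_q_heads * seq_len * d_head
print(out.shape, cache_tied / cache_full)  # (8, 32, 16) 0.25
```

Distillation would then need to train the student so that the tied K/V heads serve all query heads in their group, rather than tying weights post hoc.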

Formalizing and testing KV sharing in the context of retrieval-aware distillation would clarify the trade-off between additional memory savings and any potential performance impact.
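The memory side of that trade-off is simple arithmetic. The sketch below compares cache sizes for an illustrative hybrid under no sharing, within-layer (GQA) sharing, and cross-layer sharing; the layer counts, head counts, and precision are assumptions for illustration, not figures from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   bytes_per_elem=2, layer_share_factor=1):
    """KV-cache size: K and V each hold n_kv_heads * d_head values per
    token per cached layer. layer_share_factor > 1 models cross-layer
    sharing, where groups of that many layers reuse one K/V cache."""
    cached_layers = n_layers // layer_share_factor
    return 2 * cached_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Hypothetical hybrid: 4 retained attention layers, 8 KV heads of
# dimension 128, fp16, 4096-token context.
base   = kv_cache_bytes(n_layers=4, n_kv_heads=8, d_head=128, seq_len=4096)
gqa    = kv_cache_bytes(n_layers=4, n_kv_heads=2, d_head=128, seq_len=4096)
xlayer = kv_cache_bytes(n_layers=4, n_kv_heads=8, d_head=128, seq_len=4096,
                        layer_share_factor=2)
print(base // 2**20, gqa // 2**20, xlayer // 2**20)  # MiB: 64 16 32
```

The open empirical question is the other side of the trade-off: how much of this 2-4x memory reduction can be realized before downstream (especially retrieval-heavy) performance degrades.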

References

Our study leaves several open questions about when retrieval-aware distillation transfers cleanly. In particular, although our hybrid stores fewer total KV pairs than prior hybrids, our distillation does not enforce KV sharing across layers (or within layers via GQA-style tying). Enabling KV sharing during distillation could further reduce the KV cache, but we leave this to future work.

Retrieval-Aware Distillation for Transformer-SSM Hybrids  (2602.11374 - Bick et al., 11 Feb 2026) in Section: Conclusion and Future Work