Efficiently recovering post-training capabilities after hybrid distillation
Develop an efficient method to recover the instruction-following and alignment capabilities that post-training instills in the original Transformer base models, which are lost when those models are distilled into RNN–attention hybrid architectures using pre-training-style data.
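To make the setup concrete, below is a minimal PyTorch sketch of the two stages involved, assuming a `teacher` (the post-trained Transformer) and a `student` (the hybrid initialized from it) that both map token ids to logits; all names, batch keys, and hyperparameters are illustrative assumptions, not the paper's method. The first function distills on pre-training-style data by matching the student's token distribution to the teacher's; the second is the kind of plain supervised recovery pass whose efficient replacement is the open question.

```python
import torch
import torch.nn.functional as F

# Hypothetical models: `teacher` is the post-trained Transformer,
# `student` is the RNN-attention hybrid initialized from it.
# Both map input_ids (batch, seq_len) -> logits (batch, seq_len, vocab).

def distill_step(teacher, student, batch, optimizer, temperature=2.0):
    """One distillation step on pre-training-style data:
    match the student's next-token distribution to the teacher's."""
    with torch.no_grad():
        t_logits = teacher(batch["input_ids"])
    s_logits = student(batch["input_ids"])
    # Temperature-scaled KL divergence between teacher and student logits.
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.log_softmax(t_logits / temperature, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def recovery_step(student, batch, optimizer):
    """One naive recovery step: supervised fine-tuning on instruction
    data, with the loss restricted to response tokens. Finding a
    cheaper substitute for this stage is the open problem."""
    logits = student(batch["input_ids"])
    # Shift so position t predicts token t+1; prompt tokens carry -100.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```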
References
How to efficiently recover the base models' capabilities remains an open question.
— Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
(arXiv:2601.22156 - Chen et al., 29 Jan 2026) in Limitations