Which LLM capabilities are easier or harder to make robust to downstream fine-tuning
Determine which specific capabilities of large language models are inherently easier or harder to make robust to downstream fine-tuning, where robustness means preserving the model's desired behavior or reward when it is adapted to new tasks after initial RLHF training.
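One way to make this robustness notion concrete is to track how much of a capability's reward survives a downstream fine-tune. A minimal sketch, where the capability names and reward values are invented for illustration and not taken from the cited paper:

```python
# Hypothetical sketch: robustness of a capability to downstream fine-tuning,
# measured as the fraction of pre-fine-tuning reward that is retained.
# All rewards below are illustrative toy numbers, not real measurements.

def retention(reward_before: float, reward_after: float) -> float:
    """Fraction of the original reward preserved after fine-tuning."""
    if reward_before == 0:
        raise ValueError("reward_before must be nonzero")
    return reward_after / reward_before

# Toy per-capability rewards (before, after) a downstream fine-tune;
# capability names are assumptions for illustration.
capabilities = {
    "instruction_following": (0.90, 0.84),
    "refusal_behavior":      (0.95, 0.60),
    "factual_recall":        (0.80, 0.78),
}

scores = {name: retention(b, a) for name, (b, a) in capabilities.items()}

# Rank capabilities from most to least robust under this toy fine-tune.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under such a metric, the open question becomes which capabilities systematically land at the top or bottom of this ranking across fine-tuning tasks.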
References
Moreover, understanding which capabilities are inherently easier or harder to make robust remains an open question, with implications for alignment and continual learning.
— Robust Policy Optimization to Prevent Catastrophic Forgetting
(2602.08813 - Sabbaghi et al., 9 Feb 2026) in Conclusion