Which LLM capabilities are easier or harder to make robust to downstream fine-tuning

Determine which specific capabilities of large language models are inherently easier or harder to make robust to future downstream fine-tuning, where robustness means preserving the model’s desired behavior or reward when adapted to new tasks after initial RLHF training.

Background

The paper introduces Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that trains a base policy to maintain reward under downstream adaptation by optimizing worst-case performance over a KL-bounded neighborhood of policies. This aims to prevent catastrophic forgetting when the model is later fine-tuned on other tasks.
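The worst-case objective described above can be sketched as a min-max problem; the notation below is illustrative and the paper's exact formulation may differ (here theta parameterizes the base policy, phi a downstream-adapted policy, and epsilon the KL budget):

```latex
\max_{\theta} \;\; \min_{\phi \,:\, D_{\mathrm{KL}}\!\left(\pi_{\phi} \,\|\, \pi_{\theta}\right) \le \epsilon} \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\phi}(\cdot \mid x)}\!\left[ r(x, y) \right]
```

Intuitively, the inner minimization models an adversarial downstream fine-tune constrained to stay within a KL ball around the base policy, and the outer maximization trains the base policy so that even the worst such neighbor retains high reward.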

Empirically, the authors demonstrate improved retention of safety alignment under subsequent instruction and math fine-tuning, and preservation of mathematical accuracy under subsequent code fine-tuning. Despite these results, they note that the broader question of which capabilities are inherently easier or harder to make robust in this way remains unresolved, with implications for alignment and continual learning.

References

Moreover, understanding which capabilities are inherently easier, or harder, to make robust remains an open question, with implications for alignment and continual learning.

Robust Policy Optimization to Prevent Catastrophic Forgetting (2602.08813 - Sabbaghi et al., 9 Feb 2026) in Conclusion