Robust Safety Alignment Without Utility Loss
Develop safety alignment methods for large language models that achieve robust refusal of harmful inputs while preserving core utility on benign queries, thereby resolving the safety–utility trade-off observed in supervised safety fine-tuning approaches.
References
As a result, achieving robust safety alignment without sacrificing core utility remains an open problem.
— Robust Multimodal Safety via Conditional Decoding
(2604.00310 - Kumar et al., 31 Mar 2026) in Section 1: Introduction