Robust Safety Alignment Without Utility Loss

Develop safety alignment methods for large language models that achieve robust refusal of harmful inputs while preserving core utility on benign queries, thereby resolving the safety–utility trade-off observed in supervised safety fine-tuning approaches.

Background

The introduction discusses how supervised safety fine-tuning (SSFT) improves refusal rates on harmful inputs but often degrades model performance on benign inputs, leading to overblocking. The authors highlight that balancing these two objectives—safety and utility—has proven difficult in practice.

They further note that this trade-off is especially pronounced in multimodal systems, where cross-modal interactions can weaken prior alignment. This motivates the need for methods that deliver strong safety without compromising general utility, which the paper attempts to address via the proposed CASA framework.

References

As a result, achieving robust safety alignment without sacrificing core utility remains an open problem.

Robust Multimodal Safety via Conditional Decoding  (2604.00310 - Kumar et al., 31 Mar 2026) in Section 1: Introduction