Reliable Alignment of AI Behavior with Complex Values

Develop reliable methodologies to align AI behavior with complex human values so that AI systems consistently pursue intended objectives rather than undesirable goals.

Background

In discussing risks from autonomous AI systems, the paper emphasizes the danger of systems pursuing undesirable goals, whether through malicious design or as unintended outcomes of training. The authors state explicitly that no one currently knows how to reliably align AI behavior with complex values, and that research breakthroughs are needed.

This problem is central to preventing harmful objectives in autonomous AI and is presented as a key technical challenge: it cannot be solved merely by increasing capabilities and instead requires targeted safety research.

References

Moreover, no one currently knows how to reliably align AI behavior with complex values; several research breakthroughs are needed (see below).

Managing extreme AI risks amid rapid progress (arXiv:2310.17688, Bengio et al., 2023), Subsection "Societal-scale risks"