Interpretation of multiple constitutional principles in Constitutional AI-trained models

Determine how Anthropic’s Claude model, trained via the Collective Constitutional AI method, applies multiple constitutional principles when evaluating and generating prompt completions. Specifically, establish whether the model implicitly selects a single governing principle per prompt or employs an internal weighting scheme across principles when several are simultaneously relevant, in order to clarify the model’s decision-making and compliance behavior.

Background

The paper describes the Collective Constitutional AI process, in which a producer model generates completions and an evaluator model judges which completion better complies with a randomly sampled constitutional principle. These pairwise judgments are used to train a reward model that subsequently guides reinforcement learning of the LLM. However, the procedure provides no explicit guidance for applying multiple principles to a single prompt completion.
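The pipeline above can be sketched in code. This is a toy illustration, not the paper's implementation: the principles, the producer, and the evaluator heuristic are all placeholder assumptions, and the key point is that each pairwise judgment is made against exactly one randomly sampled principle.

```python
import random

# Illustrative constitutional principles (assumed for this sketch,
# not quoted from any actual constitution).
PRINCIPLES = [
    "Choose the response that is most respectful.",
    "Choose the response that is most honest.",
    "Choose the response that least endorses harmful behavior.",
]

def evaluator_judges(prompt, completion_a, completion_b, principle):
    """Stand-in for the evaluator model: returns whichever completion
    better complies with the sampled principle. A toy heuristic
    (longer answer wins) replaces a real LLM judgment here."""
    return completion_a if len(completion_a) >= len(completion_b) else completion_b

def build_preference_pair(prompt, producer):
    """One step of the data pipeline: sample two completions, sample
    ONE principle uniformly at random, and record the pairwise judgment
    that would later train the reward model."""
    a, b = producer(prompt), producer(prompt)
    principle = random.choice(PRINCIPLES)  # a single principle per comparison
    chosen = evaluator_judges(prompt, a, b, principle)
    rejected = b if chosen == a else a
    return {"prompt": prompt, "chosen": chosen,
            "rejected": rejected, "principle": principle}

# Toy producer model returning canned completions.
producer = lambda p: random.choice(["Short reply.", "A longer, more detailed reply."])
pair = build_preference_pair("Explain voting.", producer)
```

Because each judgment is conditioned on only one sampled principle, any multi-principle behavior in the final model must emerge from aggregation across many such single-principle comparisons, which is precisely what the research question targets.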

The authors highlight that when multiple principles are relevant to a single completion, it is unclear whether the model implicitly chooses one principle per case or uses some internal, possibly learned, weighting across principles. Clarifying this behavior is important for understanding how a constitution is operationalized by the model and for assessing authorization, transparency, and accountability in democratic contexts.
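The two hypotheses can be contrasted with a toy scoring model. Both the per-principle compliance scores and the weights below are invented for illustration; the sketch only shows how the two decision rules would diverge on the same inputs.

```python
def single_principle_score(principle_scores):
    """Hypothesis 1: the completion is governed by one principle.
    'Most relevant' is modeled here as the largest-magnitude score
    (an assumption made purely for illustration)."""
    return max(principle_scores, key=abs)

def weighted_score(principle_scores, weights):
    """Hypothesis 2: an implicit (possibly learned) weighting combines
    all relevant principles into one overall score."""
    return sum(w * s for w, s in zip(weights, principle_scores))

scores = [0.9, -0.2, 0.4]    # toy per-principle compliance scores
weights = [0.5, 0.3, 0.2]    # hypothetical internal weights

h1 = single_principle_score(scores)    # 0.9
h2 = weighted_score(scores, weights)   # 0.5*0.9 + 0.3*(-0.2) + 0.2*0.4 = 0.47
```

Empirically distinguishing the two would require probing whether the model's preferences track a single dominant principle or shift smoothly as the relative relevance of principles changes, as a weighted scheme would predict.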

References

Importantly, the LLM is given no guidance on how to apply multiple principles to a given completion; indeed, we do not know how it handles such cases - whether it implicitly sorts each prompt that it is given so that it is governed by one principle, or whether it has some 'intuitive' weighting of principles that it uses.

Lazar et al. (2024), “Can LLMs advance democratic values?” (arXiv:2410.08418), Section II (Collective Constitutional AI description)