Interpretation of multiple constitutional principles in Constitutional AI-trained models
Determine how Anthropic’s Claude model, trained via the Collective Constitutional AI method, applies multiple constitutional principles when evaluating and generating completions. Specifically, establish whether the model implicitly assigns each prompt to a single governing principle or applies an internal weighting across principles when several are relevant at once. Answering this would clarify the model’s decision-making and compliance behavior.
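The two hypotheses can be contrasted with a toy sketch. This is purely illustrative and says nothing about how the model actually works: the principle names, the relevance values, the per-principle scores, and both function names are assumptions made up for the example. It only shows that "one governing principle" and "weighted combination" can rank the same pair of completions differently, which is what makes the question empirically testable.

```python
from typing import Dict

def single_principle_score(relevance: Dict[str, float],
                           scores: Dict[str, float]) -> float:
    """Hypothesis A (illustrative): only the most relevant principle governs."""
    governing = max(relevance, key=relevance.get)
    return scores[governing]

def weighted_score(relevance: Dict[str, float],
                   scores: Dict[str, float]) -> float:
    """Hypothesis B (illustrative): principles contribute in proportion to relevance."""
    total = sum(relevance.values())
    return sum(relevance[p] * scores[p] for p in relevance) / total

# Hypothetical relevance of three made-up principles to one prompt.
relevance = {"harmlessness": 0.5, "helpfulness": 0.3, "honesty": 0.2}

# Hypothetical per-principle scores for two candidate completions.
completion_a = {"harmlessness": 0.9, "helpfulness": 0.2, "honesty": 0.2}
completion_b = {"harmlessness": 0.7, "helpfulness": 0.9, "honesty": 0.9}

for name, fn in [("single", single_principle_score), ("weighted", weighted_score)]:
    a, b = fn(relevance, completion_a), fn(relevance, completion_b)
    print(f"{name}: A={a:.2f} B={b:.2f} prefers {'A' if a > b else 'B'}")
```

With these made-up numbers the single-principle rule prefers completion A while the weighted rule prefers completion B, so pairs of completions constructed this way could in principle help distinguish the two behaviors.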
References
Importantly, the LLM is given no guidance on how to apply multiple principles to a given completion; indeed, we do not know how it handles such cases - whether it implicitly sorts each prompt that it is given so that it is governed by one principle, or whether it has some 'intuitive' weighting of principles that it uses.