Validate impact of Claude 4.5 Haiku summarized CoTs on honesty and faithfulness scores
Determine whether the summarized chains of thought (CoTs) that the Anthropic API returns for Claude 4.5 Haiku differ from the model's original full CoTs in ways that change the measured honesty and faithfulness scores. Quantify the size of any such deviation by computing each metric on the summaries and on the underlying full CoTs, then comparing the two.
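The comparison described above could be sketched as follows. This is a minimal illustration, not the evaluation's actual scoring code. It assumes that the full CoTs are available to the evaluator, that each example has already been graded twice (once from the full CoT and once from the API summary), and that each grade is a boolean such as "the hint was verbalized". The names `DeviationReport` and `compare_scores` are hypothetical, and the labels in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DeviationReport:
    full_rate: float          # score computed from the full CoTs
    summary_rate: float       # score computed from the API summaries
    deviation: float          # summary_rate - full_rate
    disagreement_rate: float  # fraction of examples whose label flips


def compare_scores(
    full_labels: Sequence[bool],
    summary_labels: Sequence[bool],
) -> DeviationReport:
    """Compare a binary metric graded on full CoTs vs. on their summaries.

    Both sequences must be aligned: index i refers to the same example.
    """
    if len(full_labels) != len(summary_labels):
        raise ValueError("label sequences must be paired one-to-one")
    n = len(full_labels)
    if n == 0:
        raise ValueError("need at least one example")

    full_rate = sum(full_labels) / n
    summary_rate = sum(summary_labels) / n
    # Per-example flips can be nonzero even when the aggregate rates match,
    # because flips in opposite directions cancel in the aggregate.
    flips = sum(f != s for f, s in zip(full_labels, summary_labels))
    return DeviationReport(
        full_rate=full_rate,
        summary_rate=summary_rate,
        deviation=summary_rate - full_rate,
        disagreement_rate=flips / n,
    )


# Hypothetical labels for four examples (illustrative only, not real results).
full = [True, True, False, True]
summary = [True, False, False, True]
report = compare_scores(full, summary)
print(report)
```

Reporting the per-example disagreement rate alongside the aggregate deviation is a design choice: a summary could drop a verbalization in one example and appear to add one in another, leaving the aggregate score unchanged while the underlying labels still differ.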
References
Thus, there is the potential for a gap between content appearing in the original CoT (which is hidden) and in the summary returned from the API, which could conceivably lead to deviations between the measured and true honesty and faithfulness scores for Claude 4.5 Haiku. Given the directness of our instructions about verbalization—and thus the salience of verbalized hint presence/reliance to the summarization model—we think it is unlikely these deviations are large, but we cannot validate this.