Validate impact of Claude 4.5 Haiku summarized CoTs on honesty and faithfulness scores

Determine whether the summarized chains of thought that the Anthropic API returns for Claude 4.5 Haiku differ from the model’s original full chains of thought in ways that alter measured honesty and faithfulness scores. If they do, quantify the deviation by comparing the metrics computed from the summaries with the same metrics computed from the underlying full chains of thought.

Background

The paper measures honesty and faithfulness of reasoning by inspecting chains of thought (CoTs). For Claude 4.5 Haiku, the Anthropic API returns a summarized CoT rather than the model’s full internal CoT. This creates a potential measurement artifact: if the summary omits relevant mentions of hint presence or reliance, computed honesty and faithfulness scores could diverge from those that would be obtained from the full CoT.

The authors believe large deviations are unlikely, because their instructions explicitly emphasize verbalization, but they cannot validate this belief because the full CoTs are not exposed via the API. Whether, and by how much, the summarization process biases the reported metrics therefore remains unresolved.
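The comparison described above could be sketched as follows. This is a minimal illustration, not the paper's method: it assumes that full CoTs can be obtained by some means other than the public API, and that each sample has already been labeled with a binary flag for whether hint presence or reliance is verbalized, once from the full CoT and once from the summary. The names `PairedLabel`, `score_gap`, `omission_rate`, and `bootstrap_ci` are hypothetical, and the verbalization rate is only a stand-in for the paper's actual honesty and faithfulness scores, which are not defined here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedLabel:
    """Hypothetical per-sample labels; the labeling procedure is assumed."""
    full_verbalizes: bool     # hint verbalized in the full CoT
    summary_verbalizes: bool  # hint verbalized in the API summary


def rate(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def score_gap(pairs):
    """Summary-based verbalization rate minus full-CoT verbalization rate."""
    return (rate(p.summary_verbalizes for p in pairs)
            - rate(p.full_verbalizes for p in pairs))


def omission_rate(pairs):
    """Share of full-CoT verbalizations that are missing from the summary."""
    verbalized = [p for p in pairs if p.full_verbalizes]
    if not verbalized:
        return 0.0
    return sum(not p.summary_verbalizes for p in verbalized) / len(verbalized)


def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for score_gap, resampling whole pairs."""
    rng = random.Random(seed)
    gaps = sorted(
        score_gap(rng.choices(pairs, k=len(pairs)))
        for _ in range(n_resamples)
    )
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data for illustration only; these are not results from the paper.
pairs = (
    [PairedLabel(True, True)] * 6
    + [PairedLabel(True, False)] * 2
    + [PairedLabel(False, False)] * 2
)
gap = score_gap(pairs)
omitted = omission_rate(pairs)
ci_low, ci_high = bootstrap_ci(pairs)
```

Resampling whole pairs, rather than the two label columns independently, keeps the per-sample correspondence between a full CoT and its summary. A gap near zero with a tight interval would support the authors' expectation that deviations are small.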

References

Thus, there is the potential for a gap between content appearing in the original CoT (which is hidden) and in the summary returned from the API, which could conceivably lead to deviations between the measured and true honesty and faithfulness scores for Claude 4.5 Haiku. Given the directness of our instructions about verbalization—and thus the salience of verbalized hint presence/reliance to the summarization model—we think it is unlikely these deviations are large, but we cannot validate this.

— Reasoning Models Will Blatantly Lie About Their Reasoning (2601.07663 - Walden, 12 Jan 2026) in Limitations (Section)
