Causal role of reasoning bonds in Long CoT learning and why imitation-based distillation fails

Determine whether the three reasoning bond types—Deep-Reasoning, Self-Reflection, and Self-Exploration—causally drive the learning of Long Chain-of-Thought structure in large language models, and explain why explicit human imitation or random in-context-learning (ICL) distillation of these bond markers often fails to induce this structure.

Background

The paper hypothesizes that effective long chain-of-thought (Long CoT) trajectories are organized around three stable behavior 'bonds': Deep-Reasoning, Self-Reflection, and Self-Exploration. Analyses suggest that supervised fine-tuning (SFT) internalizes these structural reasoning behaviors rather than surface keywords, and that only certain behavior distributions support stable Long CoT learning.

Despite evidence that models learn behavior-level structure, the authors explicitly note uncertainty about whether these bonds themselves causally drive learning, and about why direct human imitation or random ICL-style distillation of the bond markers fails to reproduce Long CoT structure. This motivates a deeper causal and mechanistic investigation of bond roles and imitation failure modes.
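For concreteness, the marker-level (surface-keyword) view that the paper contrasts with behavior-level structure can be sketched as a simple tagger. The marker strings and bond names below are illustrative assumptions, not the paper's actual taxonomy; counting keywords like this is exactly the shallow signal that, per the hypothesis, imitation-based distillation latches onto and that fails to induce Long CoT structure:

```python
import re

# Hypothetical surface markers for each reasoning bond type
# (illustrative assumptions only; the paper argues SFT internalizes
# behavior-level structure, not these keywords themselves).
BOND_MARKERS = {
    "deep_reasoning": ["therefore", "it follows that", "because"],
    "self_reflection": ["wait", "let me check", "i made an error"],
    "self_exploration": ["alternatively", "another approach", "what if"],
}

def tag_bonds(cot_text: str) -> dict:
    """Count surface-marker hits per bond type in a CoT trace."""
    text = cot_text.lower()
    return {
        bond: sum(len(re.findall(re.escape(m), text)) for m in markers)
        for bond, markers in BOND_MARKERS.items()
    }

trace = (
    "Let me check the base case first. Because n=1 holds, "
    "it follows that we can induct. Alternatively, what if "
    "we argue by contradiction? Wait, the earlier step was fine."
)
counts = tag_bonds(trace)
print(counts)  # -> {'deep_reasoning': 2, 'self_reflection': 2, 'self_exploration': 2}
```

A distillation pipeline keyed to such counts can match marker frequencies while missing the temporal organization of the bonds, which is one plausible reading of why marker imitation fails.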

References

However, a key open question remains: do these bonds drive Long CoT structure learning, and if so, why does explicit human imitation or random ICL distillation of these markers so often fail?

The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning (2601.06002 - Chen et al., 9 Jan 2026), in Verification: Molecular Structure, subsection "SFT actually learns these bond structures rather than keywords."