Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Published 29 May 2025 in cs.LG | (2505.23648v2)

Abstract: Modern LLMs generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial "subset sum problem" given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the LLM outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism. Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.

Abstract PDF Upgrade to Chat

Summary

The paper introduces CoT2, demonstrating that continuous tokens can effectively enable parallel reasoning in models.
The methodology combines attention and mixture-of-experts layers with continuous supervision and reinforcement learning to enhance convergence and accuracy on tasks like MNNS and ProsQA.
Empirical results reveal that CoT2 outperforms traditional discrete chain-of-thought methods, showcasing its efficacy in complex reasoning and combinatorial problem-solving.

Continuous Chain of Thought Enables Parallel Exploration and Reasoning

The paper introduces the concept of conducting chain-of-thought (CoT) reasoning with continuously-valued tokens, referred to as CoT2. This approach is posited as an enhancement over traditional discrete token methods due to its potential to handle multiple parallel reasoning paths, offering richer and more efficient solutions in logical reasoning tasks.

Theoretical Foundations and Model Construction

Theoretical analysis in the paper shows that CoT2 allows models to track multiple reasoning paths concurrently, significantly boosting inference efficiency. The paper provides a comprehensive example with the Minimum Non-Negative Sum (MNNS) task, demonstrating that a single-layer transformer using CoT2 can solve complex combinatorial problems with sufficient embedding dimensions. The structure effectively combines attention mechanisms with a mixture-of-experts MLP layer, ensuring all partial sums are strategically encoded and managed. *Figure 1: The figure reveals that CoT2 model is superior to discrete CoT in ProsQA, while also exhibiting faster convergence. Setting: 4âlayer, 4âhead GPT2 with $d \in \{24,32, 40$ .

Supervised Training Techniques

The paper details a Continuous Supervised Fine-Tuning (CSFT) strategy for training CoT2 models. This approach involves fitting the softmax outputs to empirical token distributions from a set of target traces, providing a dense supervision signal that guides models toward efficient parallel exploration. Such an approach capitalizes on the potential of continuous tokens to represent multiple hypotheses, improving both accuracy and convergence speed, as illustrated in various tasks like MNNS and ProsQA.

*Figure 2: CoT2 vs.\ discrete CoT (Pass@k) accuracies for different temperatures on MNNS task.

Reinforcement Learning and Sampling Strategies

The work further enhances CoT2 using reinforcement learning (RL) approaches, introducing Multi-Token Sampling (MTS) and continuous exploration via Dirichlet sampling. These techniques encourage the model to explore wider reasoning paths and improve its policy over time. The RL strategies are supported by experiments showing improved accuracy and reduced token entropy, indicating more confident and precise model predictions. *Figure 3: Comparison between CoT2 and discrete CoT model for different embedding dimensions. The figure demonstrates that above a certain embedding dimension threshold, the CoT2 model outperforms the discrete CoT model in the MNNS task.

Empirical Results and Comparative Analysis

Experiments conducted on the MNNS, ProntoQA, and ProsQA datasets underscore the efficacy of CoT2, particularly at higher embedding dimensions where the model exhibits superior performance and faster training convergence compared to discrete CoT models. These results validate the theoretical benefits, with CoT2 models effectively balancing exploration and computation efficiency through its continuous token approach. *Figure 4: The figure illustrates that when the range of digits makes the question non-trivial on an MNNS task, the discrete CoT model trained with full token supervision outperforms sparse supervisions; in particular, single token supervision yields the worst performance.

Conclusion

The paper presents CoT2 as a significant advancement in the implementation of logical reasoning in AI models. By leveraging continuous tokens for more expressive and parallel path exploration, CoT2 demonstrates substantial improvements in both theoretical robustness and empirical performance. Future work is encouraged to explore further optimization methods and applications of CoT2 in diverse reasoning tasks, expanding the boundaries of efficient and scalable AI reasoning systems.

Markdown