Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Abstract: LLMs have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate “thinking tokens” before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps, where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to a sequential search that requires many more steps and may be trapped in local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, the encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.
Explain it Like I'm 14
What is this paper about?
This paper studies how LLMs can “think” better before answering, especially on logic problems. It compares two ways of thinking:
- Discrete chain-of-thought: the model writes out thinking steps as words or tokens.
- Continuous chain-of-thought (called “Coconut”): the model thinks in hidden vectors (numbers inside the model) instead of words.
The authors show, with theory and experiments, that continuous thoughts can solve certain reasoning problems much more efficiently than discrete thoughts, because they let the model keep many possibilities in mind at the same time.
What questions did the researchers ask?
In simple terms, they asked:
- Why and how can “continuous thoughts” help a model reason better than traditional, token-by-token “chain-of-thought”?
- Can a small model with continuous thoughts solve a classic logic task—graph reachability—much faster than models that use discrete thoughts?
- Do the model’s learned “thought vectors” actually store many possible reasoning paths at once, like a blended “superposition” of ideas?
How did they try to answer them?
The authors focus on a core reasoning task called “directed graph reachability.” Think of a map of cities connected by one-way roads:
- Given all the roads, a starting city, and two candidate destination cities, decide which destination can be reached from the start.
- This is important because many reasoning and planning tasks can be turned into this kind of “which places can I reach?” question.
Here’s how they approached it:
Discrete chain-of-thought vs continuous thoughts
- Discrete chain-of-thought: The model writes out each thought token and commits to it, step by step. That’s like choosing one path to follow at a time. If it guesses badly, it must backtrack, which can be slow.
- Continuous thoughts (Coconut): The model keeps its thoughts as hidden vectors and directly feeds them into the next step. This lets it store many possibilities at once, similar to keeping a “blurry highlight” over lots of promising places. The authors call this a “superposition” state (like having multiple options layered together, not actual quantum physics).
Graph reachability as a test problem
- The model reads the list of edges (one-way roads), the question, the two candidate destinations, and the starting node.
- Then it takes several “continuous thought” steps. Each step is like one round of exploring outward from the places it can currently reach.
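The reachability task itself fits in a few lines of plain Python: an ordinary BFS over the edge list. The toy graph and names below are illustrative, not taken from the paper:

```python
from collections import deque

def reachable(edges, start):
    """Return the set of nodes reachable from `start` via directed edges (plain BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen = {start}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

# Toy one-way road map: which of the two candidate cities can we reach from "A"?
edges = [("A", "B"), ("B", "C"), ("A", "D")]
reach = reachable(edges, "A")
answer = "C" if "C" in reach else "E"   # candidates: "C" and "E"
```

The model never runs this algorithm explicitly; the point of the paper is that its continuous-thought steps end up behaving like the `while` loop above, one frontier expansion per step.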
The key idea: superposition and breadth-first search (BFS)
- Each thought vector encodes many “frontier” nodes at once—the set of cities you can reach in a certain number of steps.
- This is like running BFS in parallel: instead of exploring one path, it spreads out and explores all promising next steps together.
- Discrete chain-of-thought usually explores paths one-by-one (more like depth-first search), which can take much longer.
The model: a small transformer with a clever setup
- They use a 2-layer transformer (a common neural network for language).
- At each step, the model uses attention (a mechanism to “look back” at parts of the input) to expand the set of reachable nodes.
- It uses “buffers” (extra parts of each vector) as scratch space to store source and target nodes while it reasons.
- It works with standard position marking methods used in real LLMs (like sinusoidal or RoPE position encodings).
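As a reminder of what these position markers look like, here is a minimal sketch of the standard sinusoidal positional encoding (the textbook construction, independent of how this paper uses it in its proofs):

```python
import math

def sinusoidal_pe(pos, d_model):
    """Standard sinusoidal positional encoding for one position:
    alternating sin/cos at geometrically spaced frequencies."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```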
What did they find?
Main theoretical result
- A 2-layer transformer with D steps of continuous thoughts can solve reachability on any graph with diameter D (roughly, the number of hops needed to reach the farthest node along shortest paths).
- In contrast, known results for constant-depth transformers with discrete chain-of-thought need about O(n²) steps for a graph with n nodes, which is much slower.
- In other words, continuous thoughts can do the job in a number of steps that scales with how far you need to explore, not with the square of the graph size.
Why this works:
- Each continuous thought is a “superposition” that stores many reachable nodes at once.
- Attention acts like a smart pointer: it copies the right pieces (like source/target of each edge) into its buffers and expands the frontier in parallel.
- A simple clean-up step (the MLP part of the transformer) filters out noise and equalizes the weights so all truly reachable nodes are represented clearly.
- It also works with common positional encodings (the usual ways models track word positions), so it’s practical, not a custom trick for a single problem size.
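The "expand, filter, equalize" loop described above can be mimicked outside a transformer: if a thought vector carries weight on every currently reachable node, one continuous step is a vectorized BFS. A toy sketch (illustrative graph and variable names, not the paper's actual construction):

```python
# A "thought" is a weight per node; reachable nodes carry nonzero weight.
n = 5
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]  # directed edges u -> v

thought = [0.0] * n
thought[0] = 1.0  # superposition initially contains only the start node

D = 2  # number of continuous-thought steps (the graph diameter here)
for _ in range(D):
    expanded = thought[:]
    for u, v in edges:
        if thought[u] > 0:           # like attention copying edges whose source is reachable
            expanded[v] += thought[u]
    total = sum(1 for w in expanded if w > 0)
    # like the MLP filter: drop noise and equalize weights over reachable nodes
    thought = [1.0 / total if w > 0 else 0.0 for w in expanded]

reachable_nodes = [i for i, w in enumerate(thought) if w > 0]
```

Every frontier node is expanded in the same pass, which is why D steps suffice; a discrete chain that commits to one node per step cannot do this.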
Experimental results
- A 2-layer model using continuous thoughts (Coconut) solves the benchmark (a ProsQA subset) almost perfectly.
- A standard discrete chain-of-thought model does worse even with 12 layers; continuous thoughts with just 2 layers outperform it.
- When they look inside the model:
  - Layer 1 attention “copies” each edge’s source and target into the right places (like setting up the data for reasoning).
  - Layer 2 attention focuses on edges whose source can currently be reached, which matches how BFS expands the frontier.
  - The continuous thought vectors are most similar to the nodes that are reachable now, especially the “frontier” (the newest reachable nodes), and often give extra weight to nodes on the optimal path. This prioritization emerges from training, even without the model being told to explore multiple paths.
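The probing idea behind these observations can be pictured with idealized one-hot node embeddings, where each node's weight in a superposed thought vector is just an inner product (a hypothetical toy, not the trained model's actual embeddings):

```python
# Idealized probe: with orthonormal (here, one-hot) node embeddings, projecting a
# thought vector onto a node's embedding recovers that node's weight.
nodes = ["A", "B", "C", "D"]
embed = {node: [1.0 if i == j else 0.0 for j in range(len(nodes))]
         for i, node in enumerate(nodes)}

# Hypothetical thought vector: B and C are on the frontier,
# with C (on the optimal path) given extra weight.
thought = [0.0, 0.4, 0.6, 0.0]

def weight(node):
    """Inner product of the thought vector with a node's embedding."""
    return sum(t * e for t, e in zip(thought, embed[node]))

frontier = sorted((v for v in nodes if weight(v) > 0), key=weight, reverse=True)
```

Reading off `frontier` here ranks C above B, mirroring the paper's observation that optimal-path nodes receive larger coefficients in the learned superposition.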
Why does this matter?
- It shows that letting models think in continuous, latent space can make reasoning both faster and more reliable, because the model can keep many options open and explore them in parallel.
- This could help LLMs handle bigger, more complex reasoning tasks in planning, math, and science.
- It also offers a clear, testable mechanism (superposition of search frontiers) for why continuous thoughts help, guiding future model design and training.
- Future directions include:
  - Proving strict lower bounds for how many steps discrete chain-of-thought must use on reachability (to formalize the gap).
  - Understanding why “good exploration” emerges naturally during training.
  - Extending these benefits to broader reasoning problems beyond graphs.
In short: continuous thoughts turn step-by-step guessing into parallel exploration, making small models surprisingly strong reasoners.
Knowledge Gaps
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable future research.
- Formal separation between continuous and discrete CoT
  - No lower bound is provided proving that constant-depth transformers with discrete CoT cannot solve directed reachability in o(n²) decoding steps; a tight lower bound and explicit separation are left open.
  - It is unclear whether discrete CoT augmented with parallel decoding (e.g., beam search or ensembles) can asymptotically match the D-step performance of continuous CoT.
- Realism of the theoretical construction under standard transformer constraints
  - The construction relies on orthonormal token embeddings and explicit “buffer” subspaces; demonstrate feasibility (or alternatives) with standard learned embeddings and without hand-engineered subspace separation.
  - The MLP filter uses a hard thresholding nonlinearity σ(x)=1{x≥ε}; show the same effect with common activations (ReLU, GELU) and finite precision, or quantify the approximation error and its impact.
  - The attention chooser depends on exact sinusoidal PE algebra (rotation, distinct inner products) and large scaling parameters to make attention “almost deterministic”; analyze stability under realistic magnitudes, noise, and finite precision.
  - The main theorem assumes d=O(|Voc|), which is impractical for real vocabularies; derive minimal dimensionality requirements and compression schemes that preserve the superposition mechanism.
- Robustness and scalability beyond small synthetic settings
  - Experiments target 3–4-hop graphs on a ProsQA subset; evaluate performance and learned superposition on larger graphs (n, D), higher branching factors, and deeper reasoning (e.g., D≥10, ≥20).
  - Test generalization to out-of-distribution diameters: can the model adaptively use more thought steps than seen during training and still succeed?
  - Characterize memory/compute trade-offs for increasing continuous-thought steps, and practical limits on T_max in training and inference.
- Training dynamics and emergence of superposition
  - Provide a theoretical explanation for why gradient descent with the given objectives and curriculum induces superpositional BFS without explicit supervision.
  - Quantify and explain the observed bias toward frontier and optimal edges (e.g., derive a training-time potential or implicit regularizer that drives prioritization).
  - Compare curricula (Coconut vs Coconut-BFS vs other variants) with controlled ablations to isolate which signals are necessary and sufficient for superposition to emerge.
- Measurement/prediction and decision-making
  - The “measurement” step via <A> and greedy decoding is described informally; formalize the readout mechanism (attention + W_O) and prove correctness for k>2 candidate queries, ties, or ambiguous cases (both or neither reachable).
  - Evaluate robustness of the final decision under sampling, temperature, and noise, and characterize failure modes.
- Minimal architectural requirements and lower bounds
  - Is two layers the minimum depth for continuous-thought reachability with D steps? Establish lower bounds on depth and width for this mechanism.
  - Investigate whether fewer than D continuous thoughts can suffice (e.g., via multi-hop expansion per step) and under what conditions.
- Extension to broader reasoning tasks
  - Generalize the superposition-based construction to other graph problems (shortest path, connectivity, cycle detection with guarantees) and to non-graph algorithmic tasks (dynamic programming, arithmetic), with clear assumptions and proofs.
  - Identify limitations where superposition may be detrimental (e.g., tasks needing strict sequencing or irreversible decisions) and propose mitigations.
- Handling cycles, duplicates, and noise formally
  - The MLP filter is claimed to remove noise and equalize weights; provide formal guarantees of correctness in cyclic graphs (duplicate expansions) using realistic activations, and quantify residual error.
  - Analyze sensitivity to spurious edges, noisy graph encodings, or extraneous tokens, and the degree to which superposition amplifies or dampens noise.
- Positional encodings and sequence-length generalization
  - The core proofs hinge on sinusoidal PE; the RoPE case is only mentioned. Provide full constructions and empirical verification with RoPE and learned PEs.
  - Study the effect of position-index range and aliasing on attention chooser reliability, especially for longer sequences than seen in training.
- Integration with pretrained LLMs and natural language inputs
  - Demonstrate Coconut within large pretrained LLMs on natural-language graph reasoning (e.g., text-described knowledge graphs), where embeddings are not orthonormal and vocabularies are large.
  - Investigate how continuous thoughts interact with textual CoT (hybrid schemes), and whether superposition emerges or is disrupted by general-purpose token distributions.
- Discrete approximations to superposition
  - Assess whether multi-trajectory discrete CoT (beam search, stochastic sampling with aggregation) can emulate parallel BFS sufficiently to close the empirical gap, and quantify the step/computation trade-offs.
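To make this open question concrete, a minimal beam-search sketch shows why a narrow beam over discrete paths can miss part of the frontier that a superposition keeps for free (hypothetical toy graph; ranking by model score is replaced by simple truncation):

```python
# Directed toy graph: node 0 branches to two chains.
edges = {0: [1, 2], 1: [3], 2: [4]}

def beam_paths(start, width, steps):
    """Expand paths like discrete CoT with beam search:
    keep at most `width` partial paths after each step."""
    beams = [[start]]
    for _ in range(steps):
        nxt = []
        for path in beams:
            for v in edges.get(path[-1], []):
                nxt.append(path + [v])
        beams = nxt[:width]   # truncation: a narrow beam silently drops branches
    return beams

# Width 1 behaves like a single greedy discrete chain and reaches only one leaf.
narrow = beam_paths(0, 1, 2)
# Width 2 covers the whole frontier, emulating parallel BFS at extra decoding cost.
wide = beam_paths(0, 2, 2)
```

Whether such width/step trade-offs can close the gap to D-step continuous CoT is exactly what this bullet leaves open.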
- Multi-answer, uncertainty, and more general queries
  - The task assumes exactly one of c1,c2 is reachable; extend to multiple candidates, ranking, uncertainty calibration, and cases with both or neither reachable.
  - Explore queries requiring path extraction (not just existence), and evaluate how superposition can be “collapsed” to produce explicit paths.
- Readout matrix and interpretability of thought vectors
  - Provide quantitative recovery of coefficients in the supposed superposition (e.g., projecting thought vectors onto the node basis to estimate per-node weights), beyond inner-product histograms.
  - Use causal interventions (e.g., zeroing components corresponding to frontier nodes) to test whether superposition components are necessary for correct reasoning.
- Training stability and reproducibility
  - Report variance across seeds, training curves, and failure modes at scale; characterize when superposition fails to emerge and how to remedy it (optimizer, normalization, curriculum, regularization).
- LayerNorm and other architectural choices
  - The proofs rely on LayerNorm; analyze sensitivity to normalization choices (RMSNorm, no norm), dropout, residual scaling, and attention temperature.
- Stopping criteria and adaptive thought length
  - The method requires an explicit <A> token after a fixed number of thoughts; design and evaluate mechanisms for adaptive stopping based on the latent state (e.g., confidence thresholds).
- Practical vocabulary and embedding constraints
  - The construction presumes bespoke graph tokens; assess performance when graph nodes must be represented via subword tokens or learned entity embeddings in realistic vocabularies.