Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Abstract: LLMs have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate “thinking tokens” before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem, where $D$ is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires $O(n^2)$ decoding steps, where $n$ is the number of vertices ($D<n$). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to a sequential search that requires many more steps and may be trapped in local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, the encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.
Explain it Like I'm 14
What is this paper about?
This paper studies how LLMs can “think” better before answering, especially on logic problems. It compares two ways of thinking:
- Discrete chain-of-thought: the model writes out thinking steps as words or tokens.
- Continuous chain-of-thought (called “Coconut”): the model thinks in hidden vectors (numbers inside the model) instead of words.
The authors show, with theory and experiments, that continuous thoughts can solve certain reasoning problems much more efficiently than discrete thoughts, because they let the model keep many possibilities in mind at the same time.
What questions did the researchers ask?
In simple terms, they asked:
- Why and how can “continuous thoughts” help a model reason better than traditional, token-by-token “chain-of-thought”?
- Can a small model with continuous thoughts solve a classic logic task—graph reachability—much faster than models that use discrete thoughts?
- Do the model’s learned “thought vectors” actually store many possible reasoning paths at once, like a blended “superposition” of ideas?
How did they try to answer them?
The authors focus on a core reasoning task called “directed graph reachability.” Think of a map of cities connected by one-way roads:
- Given all the roads, a starting city, and two candidate destination cities, decide which destination can be reached from the start.
- This is important because many reasoning and planning tasks can be turned into this kind of “which places can I reach?” question.
Here’s how they approached it:
Discrete chain-of-thought vs continuous thoughts
- Discrete chain-of-thought: The model writes out each thought token and commits to it, step by step. That’s like choosing one path to follow at a time. If it guesses badly, it must backtrack, which can be slow.
- Continuous thoughts (Coconut): The model keeps its thoughts as hidden vectors and directly feeds them into the next step. This lets it store many possibilities at once, similar to keeping a “blurry highlight” over lots of promising places. The authors call this a “superposition” state (like having multiple options layered together, not actual quantum physics).
Graph reachability as a test problem
- The model reads the list of edges (one-way roads), the question, the two candidate destinations, and the starting node.
- Then it takes several “continuous thought” steps. Each step is like one round of exploring outward from the places it can currently reach.
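The reachability task itself fits in a few lines of plain Python: an ordinary BFS over the edge list. The toy graph and names below are illustrative, not taken from the paper:

```python
from collections import deque

def reachable(edges, start):
    """Return the set of nodes reachable from `start` via directed edges (plain BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen = {start}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

# Toy one-way road map: which of the two candidate cities can we reach from "A"?
edges = [("A", "B"), ("B", "C"), ("A", "D")]
reach = reachable(edges, "A")
answer = "C" if "C" in reach else "E"   # candidates: "C" and "E"
```

The model never runs this algorithm explicitly; the point of the paper is that its continuous-thought steps end up behaving like the `while` loop above, one frontier expansion per step.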
The key idea: superposition and breadth-first search (BFS)
- Each thought vector encodes many “frontier” nodes at once—the set of cities you can reach in a certain number of steps.
- This is like running BFS in parallel: instead of exploring one path, it spreads out and explores all promising next steps together.
- Discrete chain-of-thought usually explores paths one-by-one (more like depth-first search), which can take much longer.
The model: a small transformer with a clever setup
- They use a 2-layer transformer (a common neural network for language).
- At each step, the model uses attention (a mechanism to “look back” at parts of the input) to expand the set of reachable nodes.
- It uses “buffers” (extra parts of each vector) as scratch space to store source and target nodes while it reasons.
- It works with standard position marking methods used in real LLMs (like sinusoidal or RoPE position encodings).
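As a reminder of what these position markers look like, here is a minimal sketch of the standard sinusoidal positional encoding (the textbook construction, independent of how this paper uses it in its proofs):

```python
import math

def sinusoidal_pe(pos, d_model):
    """Standard sinusoidal positional encoding for one position:
    alternating sin/cos at geometrically spaced frequencies."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```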
What did they find?
Main theoretical result
- A 2-layer transformer with D steps of continuous thoughts can solve reachability on any graph with diameter D (roughly, the number of hops needed to reach the farthest node along shortest paths).
- In contrast, known results for constant-depth transformers with discrete chain-of-thought need about O(n²) steps for a graph with n nodes, which is much slower.
- In other words, continuous thoughts can do the job in a number of steps that scales with how far you need to explore, not with the square of the graph size.
Why this works:
- Each continuous thought is a “superposition” that stores many reachable nodes at once.
- Attention acts like a smart pointer: it copies the right pieces (like source/target of each edge) into its buffers and expands the frontier in parallel.
- A simple clean-up step (the MLP part of the transformer) filters out noise and equalizes the weights so all truly reachable nodes are represented clearly.
- It also works with common positional encodings (the usual ways models track word positions), so it’s practical, not a custom trick for a single problem size.
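The "expand, filter, equalize" loop described above can be mimicked outside a transformer: if a thought vector carries weight on every currently reachable node, one continuous step is a vectorized BFS. A toy sketch (illustrative graph and variable names, not the paper's actual construction):

```python
# A "thought" is a weight per node; reachable nodes carry nonzero weight.
n = 5
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]  # directed edges u -> v

thought = [0.0] * n
thought[0] = 1.0  # superposition initially contains only the start node

D = 2  # number of continuous-thought steps (the graph diameter here)
for _ in range(D):
    expanded = thought[:]
    for u, v in edges:
        if thought[u] > 0:           # like attention copying edges whose source is reachable
            expanded[v] += thought[u]
    total = sum(1 for w in expanded if w > 0)
    # like the MLP filter: drop noise and equalize weights over reachable nodes
    thought = [1.0 / total if w > 0 else 0.0 for w in expanded]

reachable_nodes = [i for i, w in enumerate(thought) if w > 0]
```

Every frontier node is expanded in the same pass, which is why D steps suffice; a discrete chain that commits to one node per step cannot do this.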
Experimental results
- A 2-layer model using continuous thoughts (Coconut) solves the benchmark (a ProsQA subset) almost perfectly.
- A standard discrete chain-of-thought model does worse even with 12 layers; continuous thoughts with just 2 layers outperform it.
- When they look inside the model:
  - Layer 1 attention “copies” each edge’s source and target into the right places (like setting up the data for reasoning).
  - Layer 2 attention focuses on edges whose source can currently be reached, which matches how BFS expands the frontier.
  - The continuous thought vectors are most similar to the nodes that are reachable now, especially the “frontier” (the newest reachable nodes), and often give extra weight to nodes on the optimal path. This prioritization emerges from training, even without the model being told to explore multiple paths.
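The probing idea behind these observations can be pictured with idealized one-hot node embeddings, where each node's weight in a superposed thought vector is just an inner product (a hypothetical toy, not the trained model's actual embeddings):

```python
# Idealized probe: with orthonormal (here, one-hot) node embeddings, projecting a
# thought vector onto a node's embedding recovers that node's weight.
nodes = ["A", "B", "C", "D"]
embed = {node: [1.0 if i == j else 0.0 for j in range(len(nodes))]
         for i, node in enumerate(nodes)}

# Hypothetical thought vector: B and C are on the frontier,
# with C (on the optimal path) given extra weight.
thought = [0.0, 0.4, 0.6, 0.0]

def weight(node):
    """Inner product of the thought vector with a node's embedding."""
    return sum(t * e for t, e in zip(thought, embed[node]))

frontier = sorted((v for v in nodes if weight(v) > 0), key=weight, reverse=True)
```

Reading off `frontier` here ranks C above B, mirroring the paper's observation that optimal-path nodes receive larger coefficients in the learned superposition.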
Why does this matter?
- It shows that letting models think in continuous, latent space can make reasoning both faster and more reliable, because the model can keep many options open and explore them in parallel.
- This could help LLMs handle bigger, more complex reasoning tasks in planning, math, and science.
- It also offers a clear, testable mechanism (superposition of search frontiers) for why continuous thoughts help, guiding future model design and training.
- Future directions include:
  - Proving strict lower bounds for how many steps discrete chain-of-thought must use on reachability (to formalize the gap).
  - Understanding why “good exploration” emerges naturally during training.
  - Extending these benefits to broader reasoning problems beyond graphs.
In short: continuous thoughts turn step-by-step guessing into parallel exploration, making small models surprisingly strong reasoners.
Knowledge Gaps
Below is a concise list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable future research.
- Formal separation between continuous and discrete CoT
  - No lower bound is provided proving that constant-depth transformers with discrete CoT cannot solve directed reachability in o(n²) decoding steps; a tight lower bound and explicit separation are left open.
  - It is unclear whether discrete CoT augmented with parallel decoding (e.g., beam search or ensembles) can asymptotically match the D-step performance of continuous CoT.
- Realism of the theoretical construction under standard transformer constraints
  - The construction relies on orthonormal token embeddings and explicit “buffer” subspaces; demonstrate feasibility (or alternatives) with standard learned embeddings and without hand-engineered subspace separation.
  - The MLP filter uses a hard thresholding nonlinearity σ(x)=1{x≥ε}; show the same effect with common activations (ReLU, GELU) and finite precision, or quantify the approximation error and its impact.
  - The attention chooser depends on exact sinusoidal PE algebra (rotation, distinct inner products) and large scaling parameters to make attention “almost deterministic”; analyze stability under realistic magnitudes, noise, and finite precision.
  - The main theorem assumes d=O(|Voc|), which is impractical for real vocabularies; derive minimal dimensionality requirements and compression schemes that preserve the superposition mechanism.
- Robustness and scalability beyond small synthetic settings
  - Experiments target 3–4-hop graphs on a ProsQA subset; evaluate performance and learned superposition on larger graphs (n, D), higher branching factors, and deeper reasoning (e.g., D≥10, ≥20).
  - Test generalization to out-of-distribution diameters: can the model adaptively use more thought steps than seen during training and still succeed?
  - Characterize memory/compute trade-offs for increasing continuous-thought steps, and practical limits on T_max in training and inference.
- Training dynamics and emergence of superposition
  - Provide a theoretical explanation for why gradient descent with the given objectives and curriculum induces superpositional BFS without explicit supervision.
  - Quantify and explain the observed bias toward frontier and optimal edges (e.g., derive a training-time potential or implicit regularizer that drives prioritization).
  - Compare curricula (Coconut vs Coconut-BFS vs other variants) with controlled ablations to isolate which signals are necessary and sufficient for superposition to emerge.
- Measurement/prediction and decision-making
  - The “measurement” step via <A> and greedy decoding is described informally; formalize the readout mechanism (attention + W_O) and prove correctness for k>2 candidate queries, ties, or ambiguous cases (both or neither reachable).
  - Evaluate robustness of the final decision under sampling, temperature, and noise, and characterize failure modes.
- Minimal architectural requirements and lower bounds
  - Is two layers the minimum depth for continuous-thought reachability with D steps? Establish lower bounds on depth and width for this mechanism.
  - Investigate whether fewer than D continuous thoughts can suffice (e.g., via multi-hop expansion per step) and under what conditions.
- Extension to broader reasoning tasks
  - Generalize the superposition-based construction to other graph problems (shortest path, connectivity, cycle detection with guarantees) and to non-graph algorithmic tasks (dynamic programming, arithmetic), with clear assumptions and proofs.
  - Identify limitations where superposition may be detrimental (e.g., tasks needing strict sequencing or irreversible decisions) and propose mitigations.
- Handling cycles, duplicates, and noise formally
  - The MLP filter is claimed to remove noise and equalize weights; provide formal guarantees of correctness in cyclic graphs (duplicate expansions) using realistic activations, and quantify residual error.
  - Analyze sensitivity to spurious edges, noisy graph encodings, or extraneous tokens, and the degree to which superposition amplifies or dampens noise.
- Positional encodings and sequence-length generalization
  - The core proofs hinge on sinusoidal PE; the RoPE case is only mentioned. Provide full constructions and empirical verification with RoPE and learned PEs.
  - Study the effect of position-index range and aliasing on attention chooser reliability, especially for longer sequences than seen in training.
- Integration with pretrained LLMs and natural language inputs
  - Demonstrate Coconut within large pretrained LLMs on natural-language graph reasoning (e.g., text-described knowledge graphs), where embeddings are not orthonormal and vocabularies are large.
  - Investigate how continuous thoughts interact with textual CoT (hybrid schemes), and whether superposition emerges or is disrupted by general-purpose token distributions.
- Discrete approximations to superposition
  - Assess whether multi-trajectory discrete CoT (beam search, stochastic sampling with aggregation) can emulate parallel BFS sufficiently to close the empirical gap, and quantify the step/computation trade-offs.
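To make this open question concrete, a minimal beam-search sketch shows why a narrow beam over discrete paths can miss part of the frontier that a superposition keeps for free (hypothetical toy graph; ranking by model score is replaced by simple truncation):

```python
# Directed toy graph: node 0 branches to two chains.
edges = {0: [1, 2], 1: [3], 2: [4]}

def beam_paths(start, width, steps):
    """Expand paths like discrete CoT with beam search:
    keep at most `width` partial paths after each step."""
    beams = [[start]]
    for _ in range(steps):
        nxt = []
        for path in beams:
            for v in edges.get(path[-1], []):
                nxt.append(path + [v])
        beams = nxt[:width]   # truncation: a narrow beam silently drops branches
    return beams

# Width 1 behaves like a single greedy discrete chain and reaches only one leaf.
narrow = beam_paths(0, 1, 2)
# Width 2 covers the whole frontier, emulating parallel BFS at extra decoding cost.
wide = beam_paths(0, 2, 2)
```

Whether such width/step trade-offs can close the gap to D-step continuous CoT is exactly what this bullet leaves open.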
- Multi-answer, uncertainty, and more general queries
  - The task assumes exactly one of c1,c2 is reachable; extend to multiple candidates, ranking, uncertainty calibration, and cases with both or neither reachable.
  - Explore queries requiring path extraction (not just existence), and evaluate how superposition can be “collapsed” to produce explicit paths.
- Readout matrix and interpretability of thought vectors
  - Provide quantitative recovery of coefficients in the supposed superposition (e.g., projecting thought vectors onto the node basis to estimate per-node weights), beyond inner-product histograms.
  - Use causal interventions (e.g., zeroing components corresponding to frontier nodes) to test whether superposition components are necessary for correct reasoning.
- Training stability and reproducibility
  - Report variance across seeds, training curves, and failure modes at scale; characterize when superposition fails to emerge and how to remedy it (optimizer, normalization, curriculum, regularization).
- LayerNorm and other architectural choices
  - The proofs rely on LayerNorm; analyze sensitivity to normalization choices (RMSNorm, no norm), dropout, residual scaling, and attention temperature.
- Stopping criteria and adaptive thought length
  - The method requires an explicit <A> token after a fixed number of thoughts; design and evaluate mechanisms for adaptive stopping based on the latent state (e.g., confidence thresholds).
- Practical vocabulary and embedding constraints
  - The construction presumes bespoke graph tokens; assess performance when graph nodes must be represented via subword tokens or learned entity embeddings in realistic vocabularies.