Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

Published 26 Sep 2025 in cs.AI and cs.LG | (2509.21743v1)

Abstract: Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.

Summary

  • The paper presents the RoT framework, which reuses prior reasoning steps stored in a thought graph to reduce the number of tokens a model must generate.
  • It dynamically assembles problem-specific templates by following sequential and semantic edges, reducing output tokens by up to 40% and inference latency by 82%.
  • The approach combines retrieval with reward-guided traversal at low memory and latency overhead, lowering the inference cost of large reasoning models.

Introduction

The paper "Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts" presents a framework for improving inference-time efficiency in Large Reasoning Models (LRMs). Retrieval-of-Thought (RoT) reuses prior reasoning steps to guide the solution of new problems. It organizes those steps as nodes in a thought graph and uses retrieval to assemble a problem-specific template at inference time. The authors report that this reduces the number of output tokens, and with it latency and cost, while maintaining accuracy.

Thought Graph and Dynamic Template Assembly

Central to the RoT framework is the construction of a thought graph, which encodes reusable reasoning steps as nodes within a directed, weighted graph. Nodes are interconnected by two types of edges: sequential edges, which capture the natural order of steps within templates, and semantic edges, which link semantically similar steps across different templates (Figure 1).

Figure 1: The figure contrasts Chain-of-Thought (CoT) inference in LRMs with our Retrieval-of-Thought (RoT) approach.
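The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `ThoughtGraph`, its fields, the unit weight on sequential edges, the toy two-dimensional embeddings, and the similarity threshold are all assumptions made for the example. Only the two edge types and the use of cosine similarity for semantic edges come from the paper's description.

```python
import math
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class ThoughtGraph:
    """Directed, weighted graph whose nodes are reusable reasoning steps."""
    steps: list = field(default_factory=list)        # node id -> step text
    embeddings: list = field(default_factory=list)   # node id -> vector
    template_of: list = field(default_factory=list)  # node id -> template id
    edges: dict = field(default_factory=dict)        # (src, dst) -> (kind, weight)
    num_templates: int = 0

    def add_template(self, step_texts, step_vectors):
        """Add one template; consecutive steps get a sequential edge."""
        ids = []
        for text, vec in zip(step_texts, step_vectors):
            self.steps.append(text)
            self.embeddings.append(vec)
            self.template_of.append(self.num_templates)
            ids.append(len(self.steps) - 1)
        for src, dst in zip(ids, ids[1:]):
            # Assumed weight of 1.0; the source does not specify edge weights.
            self.edges[(src, dst)] = ("sequential", 1.0)
        self.num_templates += 1
        return ids

    def link_semantic(self, threshold=0.8):
        """Connect similar steps from different templates (threshold is assumed)."""
        for i in range(len(self.steps)):
            for j in range(len(self.steps)):
                if i == j or self.template_of[i] == self.template_of[j]:
                    continue
                sim = cosine(self.embeddings[i], self.embeddings[j])
                if sim >= threshold:
                    self.edges.setdefault((i, j), ("semantic", sim))


# Toy data: hand-written 2-D vectors stand in for real step embeddings.
graph = ThoughtGraph()
graph.add_template(
    ["Define variables", "Set up the equation", "Solve for x"],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
graph.add_template(
    ["Write the equation", "Check the answer"],
    [[0.1, 1.0], [-1.0, 0.2]],
)
graph.link_semantic(threshold=0.9)
```

In this toy graph, "Set up the equation" and "Write the equation" come from different templates and have nearly parallel vectors, so they are joined by semantic edges in both directions, while each template keeps its own sequential chain.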

Semantic edges are established using cosine similarity between step embeddings, which allows steps relevant to a given query to be retrieved. At inference, RoT retrieves query-relevant nodes and then applies reward-guided traversal over the graph to assemble a problem-specific template (Figure 2).

Figure 2: Key observations motivating the RoT framework.
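The retrieve-then-traverse idea can be sketched as a greedy walk. The paper's actual reward function is not given in this summary, so the reward used here (edge weight multiplied by the candidate node's cosine similarity to the query) is a stand-in chosen purely for illustration. The function name `assemble_template`, the `max_steps` cap, and the toy embeddings are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assemble_template(query_vec, embeddings, neighbors, max_steps=4):
    """Greedy sketch: retrieve an entry node, then follow the best-rewarded edge.

    embeddings: node name -> vector
    neighbors:  node name -> {successor name: edge weight}
    """
    # Retrieval: enter the graph at the node most similar to the query.
    current = max(embeddings, key=lambda n: cosine(query_vec, embeddings[n]))
    path = [current]
    while len(path) < max_steps:
        # Assumed reward: edge weight times query similarity of the successor.
        candidates = [
            (weight * cosine(query_vec, embeddings[node]), node)
            for node, weight in neighbors.get(current, {}).items()
            if node not in path
        ]
        if not candidates:
            break
        _, current = max(candidates)
        path.append(current)
    return path


# Toy graph with hand-written vectors; not data from the paper.
embeddings = {
    "define": [1.0, 0.0],
    "equation": [0.8, 0.6],
    "solve": [0.6, 0.8],
    "geometry": [0.0, 1.0],
}
neighbors = {
    "define": {"equation": 1.0, "geometry": 1.0},
    "equation": {"solve": 1.0},
}
query = [1.0, 0.2]
template = assemble_template(query, embeddings, neighbors)
```

The resulting list of step names plays the role of the problem-specific template that would be placed in the prompt to guide generation. A greedy walk is only one possible traversal strategy and is used here because it is the simplest to read.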

Efficiency and Performance Metrics

The paper reports substantial efficiency gains from the RoT framework. Across the reasoning benchmarks evaluated, RoT reduces output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. Compared with standard Chain-of-Thought (CoT) inference, RoT is more efficient because the template steers the model directly toward promising reasoning paths, reducing unnecessary exploration and path switching (Figure 3).

Figure 3: Average accuracy versus output tokens across Qwen3 models (1.7B, 4B, 8B).
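The abstract notes that RoT adds a small amount of prompt text while removing a larger amount of generated text. The arithmetic behind that trade-off can be sketched with a simple per-request cost model. Every number below (token counts and per-token prices) is hypothetical and chosen only to make the calculation concrete; none of it is taken from the paper, and the linear pricing model is an assumption.

```python
def request_cost(prompt_tokens, output_tokens, price_in, price_out):
    """Cost of one request under an assumed linear per-token pricing model."""
    return prompt_tokens * price_in + output_tokens * price_out


def relative_saving(baseline, candidate):
    """Fraction of the baseline cost that the candidate avoids."""
    return 1.0 - candidate / baseline


# Hypothetical prices: output tokens cost four times as much as prompt tokens.
PRICE_IN, PRICE_OUT = 1.0, 4.0

# Hypothetical request: the template triples the prompt (200 -> 600 tokens)
# but output shrinks by 40% (5000 -> 3000 tokens).
baseline_cost = request_cost(200, 5000, PRICE_IN, PRICE_OUT)
rot_cost = request_cost(600, 3000, PRICE_IN, PRICE_OUT)
saving = relative_saving(baseline_cost, rot_cost)

print(f"baseline={baseline_cost}, with template={rot_cost}, saving={saving:.1%}")
```

Under these made-up numbers the longer prompt is more than paid for by the shorter output, because reasoning traces dominate the token budget. The actual savings depend on real token counts and pricing.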

RoT adds little memory overhead, and the retrieval step contributes negligible latency, which makes it suitable for deployment alongside existing LRM infrastructure. Scalability is examined by increasing the number of templates in the thought graph: performance improves as the graph grows (Figure 4).

Figure 4: Template scalability analysis of RoT+TI.

Implications and Future Developments

The RoT framework offers a scalable foundation for efficient LRM reasoning and suggests potential extensions beyond the current mathematical domains. Dynamic retrieval and template assembly offer one way to address the growing computational demands of reasoning-intensive AI applications. Future work may focus on adapting the thought graph to a broader range of domains and on refining the retrieval mechanisms to improve efficiency further.

Conclusion

Retrieval-of-Thought is a promising approach to efficient reasoning in Large Reasoning Models. By reusing structured reasoning steps as dynamically composable templates, RoT reduces output tokens and inference latency without sacrificing accuracy. The work provides a practical and scalable way to address the high computational cost of reasoning-heavy inference.
