RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Published 20 May 2025 in cs.AI | (2505.14140v2)

Abstract: Despite rapid advancements in LLMs, the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), successfully improve the performance, as they are fairly cost-effective by guiding reasoning through sophisticated logical structures without modifying LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To improve this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques by up to 13.4%. Remarkably, with less than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://anonymous.4open.science/r/RL-LLM-Reasoning-1A30 for reproducibility.


Summary

The paper introduces RL-of-Thoughts (RLoT), a reinforcement learning framework that dynamically selects logic blocks to make LLM reasoning adapt to the task at hand.
It employs a navigator model trained with the Double-Dueling DQN algorithm within an MDP framework to structure reasoning dynamically.
Experiments show up to 13.4% improvement on benchmarks across STEM, mathematics, and commonsense tasks with strong transferability.

RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

The paper "RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning" introduces an inference-time technique that enhances LLM reasoning with reinforcement learning. Its main contribution is the RL-of-Thoughts (RLoT) method, in which a navigator model trained with reinforcement learning dynamically selects and sequences logic blocks during the reasoning process. The resulting task-specific logical structures are intended to improve the adaptability and performance of LLMs on complex reasoning tasks.

Problem Statement

Despite the significant progress of LLMs such as GPT and Llama, their token-level autoregressive nature limits their ability to handle complex reasoning tasks, which demand sophisticated logical structures and long-term dependencies. Existing inference-time techniques such as Chain-of-Thought and Tree-of-Thoughts offer a lightweight remedy by imposing predefined logical structures on the reasoning process. However, these structures are task-agnostic and inflexible: the same structure is applied across diverse tasks without adaptation.

Methodology

The RLoT framework uses reinforcement learning to train a navigator model that constructs adaptive logical structures at inference time based on the characteristics of each problem.

MDP Framework: The reasoning process is modeled as a Markov Decision Process (MDP), with specially defined state, action, reward, and state transition mechanisms.

State: Obtained through a self-evaluation mechanism that gives a concise summary of the problem-solving status in terms of correctness, complexity, and completeness.
Action: Five basic logic blocks inspired by human cognitive strategies (Reason one step, Decompose, Debate, Refine, Terminate) form the action space.
Reward: A process reward model (PRM) scores intermediate results, providing feedback on the quality of each single step.
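
The way the state and action definitions above fit together at inference time can be pictured as a simple loop. The following is a minimal sketch, not the authors' implementation: the names `LogicBlock`, `ReasoningState`, and `run_episode`, the one-score-per-aspect state layout, the `max_steps` cap, and the three callables standing in for the navigator, the LLM, and the self-evaluation step are all assumptions made for illustration. The PRM reward is left out because the sketch covers action selection only, not training.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class LogicBlock(Enum):
    """The five actions named in the text."""
    REASON_ONE_STEP = 0
    DECOMPOSE = 1
    DEBATE = 2
    REFINE = 3
    TERMINATE = 4


@dataclass(frozen=True)
class ReasoningState:
    """Hypothetical state layout: one self-evaluation score per aspect."""
    correctness: float
    complexity: float
    completeness: float


# Hypothetical interfaces; real versions would wrap LLM calls.
Navigator = Callable[[ReasoningState], LogicBlock]
ApplyBlock = Callable[[str, LogicBlock], str]   # (reasoning so far, block) -> updated reasoning
SelfEvaluate = Callable[[str], ReasoningState]  # reasoning so far -> state


def run_episode(
    problem: str,
    navigator: Navigator,
    apply_block: ApplyBlock,
    self_evaluate: SelfEvaluate,
    max_steps: int = 8,  # assumed safety cap, not from the source
) -> Tuple[str, List[LogicBlock]]:
    """Let the navigator pick logic blocks until it chooses TERMINATE."""
    reasoning = problem
    trajectory: List[LogicBlock] = []
    for _ in range(max_steps):
        state = self_evaluate(reasoning)            # observe the current state
        action = navigator(state)                   # navigator selects a block
        trajectory.append(action)
        reasoning = apply_block(reasoning, action)  # LLM executes the block
        if action is LogicBlock.TERMINATE:
            break
    return reasoning, trajectory
```

The sequence of blocks collected in `trajectory` is the task-specific logical structure: different problems lead the navigator to different sequences, rather than every problem following one fixed chain or tree.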

Navigator Model Training: The RLoT framework trains the navigator model using the Double-Dueling DQN algorithm. The navigator dynamically selects logic blocks based on the current reasoning state, creating task-specific logical structures for LLMs.
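
For readers unfamiliar with the algorithm, "Double-Dueling DQN" combines two standard ideas: a dueling head that splits Q-values into a state value plus mean-centered advantages, and a Double DQN target in which the online network picks the next action while the target network evaluates it. The sketch below shows both computations in plain Python on already-computed network outputs. It illustrates the generic algorithm rather than the paper's code; the function names and the example numbers are made up for illustration.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


def double_dqn_target(
    reward: float,
    gamma: float,
    done: bool,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
) -> float:
    """Double DQN target: select the action with the online net, evaluate it with the target net."""
    if done:
        # No bootstrapping after a terminal transition.
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Illustrative numbers only: five advantages, one per logic block.
q_values = dueling_q_values(1.0, [0.0, 1.0, 2.0, 3.0, 4.0])
target = double_dqn_target(0.5, 0.9, False, [0.1, 0.7, 0.3], [1.0, 2.0, 5.0])
```

Decoupling action selection from action evaluation in the target is what reduces the overestimation bias of vanilla DQN; in the example, the target uses the target network's value for action 1 (the online network's choice), not the target network's own maximum.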

Figure 1: Framework of RL-of-Thoughts (RLoT), enhancing LLMs' ability to handle complex reasoning tasks via dynamic selection and combination of logic blocks.

Experiments

Experiments were conducted across various reasoning benchmarks, including mathematics, STEM, and commonsense tasks, to evaluate the efficacy of RLoT.

Results: RLoT outperforms established inference-time techniques by up to 13.4%. With fewer than 3K parameters, the RL navigator makes sub-10B LLMs comparable to 100B-scale counterparts.
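
To give a sense of how small a sub-3K-parameter network is, the arithmetic below counts the weights and biases of a tiny dueling MLP. The source does not state the navigator's architecture, so every size here is an assumption chosen purely for illustration: a three-dimensional state, two hidden layers of 32 units, a scalar value head, and a five-way advantage head (one output per logic block).

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def dueling_mlp_params(state_dim: int, hidden: Sequence[int], n_actions: int) -> int:
    """Parameter count of a shared MLP trunk followed by value and advantage heads."""
    total = 0
    prev = state_dim
    for width in hidden:
        total += dense_params(prev, width)
        prev = width
    total += dense_params(prev, 1)          # value head
    total += dense_params(prev, n_actions)  # advantage head
    return total


# Hypothetical sizes, not taken from the paper.
n_params = dueling_mlp_params(state_dim=3, hidden=[32, 32], n_actions=5)
```

Under these assumed sizes the count is 128 + 1056 + 33 + 165 = 1382 parameters, which shows that a network of this shape fits comfortably within a 3K budget and is many orders of magnitude smaller than the LLM it steers.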

Transferability: The navigator model exhibits strong transferability across different LLMs and tasks without fine-tuning, further demonstrating its practical utility.

Figure 2: Learning curves during RL training of all navigator models, indicating good convergence of the training process.

Figure 3: A case study comparing few-shot CoT and RLoT on a representative problem in the MATH dataset, showing the superiority of the RLoT-generated reasoning pathway.

Implications and Future Work

The use of RL in inference-time reasoning presents a promising pathway for enhancing LLM capabilities without the need for costly model parameter updates. The adaptability and efficiency of the RLoT framework make it well-suited for deployment in diverse real-world applications requiring complex reasoning. Future developments could explore extending this approach to broader problem domains and refining logic block designs to further enhance adaptability and effectiveness.

Conclusion

RLoT offers a flexible, adaptive way to enhance LLM reasoning at inference time. By training a lightweight navigator model with reinforcement learning, the framework builds task-specific logical structures that improve LLM performance across diverse problem domains, while remaining efficient and transferable. The findings highlight the potential of RL to unlock more adaptive and efficient reasoning processes in AI systems.
