
Policy of Thoughts (PoT)

Updated 4 February 2026
  • Policy of Thoughts (PoT) is a dynamic framework that refines large language model reasoning by converting inference into instance-specific policy optimization.
  • It integrates latent-space optimization (Latent Thought Policy Optimization, LTPO) and transient LoRA adaptation trained with Group Relative Policy Optimization (GRPO) to update thought processes in real time using intrinsic rewards and guided search.
  • Empirical evaluations show that PoT enhances robustness and efficiency, significantly outperforming static chain-of-thought approaches on complex reasoning tasks.

The Policy of Thoughts (PoT) paradigm represents a significant advance in LLM reasoning by reframing inference as an instance-specific, online policy optimization process. PoT encompasses both latent-space policy evolution (as exemplified by Latent Thought Policy Optimization, LTPO) and text-based policy adaptation driven by explicit execution feedback (as in transient LoRA adaptation via Group Relative Policy Optimization, GRPO). This framework internalizes feedback—execution, confidence, or other signals—into transient policy shifts, enabling LLMs to escape the limitations of frozen reasoning protocols at test time. The underlying insight is that multi-step reasoning in LLMs should be dynamically refined “on the fly,” in analogy with epistemic cycles of conjecture and refutation, rather than remaining fixed throughout problem solving.

1. Theoretical Foundations: Dynamic Reasoning as Policy Evolution

PoT is inspired by Popperian epistemology, drawing a direct analogy between scientific progress—where conjectures are tested and refuted—and LLM reasoning, where the model iteratively generates, tests, and revises its own reasoning strategy. In classical chain-of-thought (CoT) and tree-of-thoughts (ToT) frameworks, the policy governing thought generation remains static, with feedback only used for post-hoc selection or rejection. PoT instead internalizes this feedback to update the model’s inference-time policy:

  • In (Jiao et al., 28 Jan 2026), reasoning is formalized as a finite-horizon Markov Decision Process (MDP), where the state comprises the original problem and all prior (thought, feedback) pairs, and actions correspond to the generation of new thoughts (code or text).
  • The policy is parameterized by a frozen LLM backbone with an online-adapted “ephemeral” component (e.g., LoRA adapter), so test-time feedback directly shapes the decision process within the current instance.

This closed-loop formulation transforms test-time inference from static exploration and selection to adaptive policy refinement, enabling improved stability and robustness, especially in long-horizon and out-of-distribution tasks (Jiao et al., 28 Jan 2026).
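The closed-loop MDP formulation can be sketched as follows. This is a toy stand-in, not the authors' implementation: the state type, the `propose`/`evaluate` callables, and the terminal condition are all illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    """MDP state: the original problem plus all prior (thought, feedback) pairs."""
    problem: str
    history: list = field(default_factory=list)  # [(thought, feedback), ...]

def run_episode(state, propose, evaluate, horizon=4):
    """Finite-horizon loop: propose a thought, collect feedback, fold it into the state."""
    for _ in range(horizon):
        thought = propose(state)             # action: generate a new thought (code or text)
        feedback = evaluate(state, thought)  # execution / confidence signal
        state.history.append((thought, feedback))
        if feedback == "pass":               # toy terminal condition
            break
    return state

# Toy policy and feedback oracle standing in for the LLM and the executor.
state = run_episode(
    ReasoningState("2 + 2 = ?"),
    propose=lambda s: f"try answer {len(s.history) + 2}",
    evaluate=lambda s, t: "pass" if t.endswith("4") else "fail",
)
print(state.history[-1])
```

In the full framework, `propose` would additionally be conditioned on an ephemeral adapter updated from the accumulated feedback, which is what distinguishes PoT from pure post-hoc selection.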

2. Methodologies: Instantiations and Algorithms

The PoT paradigm admits several rigorous algorithmic instantiations:

LTPO instantiates PoT in the LLM’s latent vector space. Here, $K$ “thought slots” are introduced as placeholder tokens within the input, with their embeddings (latent vectors) $h^{(t)} \in \mathbb{R}^{K \times d}$ actively optimized at test time:

  • At each optimization step $t$:
    • The latent is perturbed: $h' \sim \mathcal{N}(h^{(t)}, \sigma_t^2 I)$.
    • An intrinsic reward $R(h')$ is computed from model confidence, derived from the average log-probability over the top-$k$ next-token predictions.
    • The latent is updated via a policy-gradient (REINFORCE) step:

$$h^{(t+1)} = h^{(t)} + \eta\, R(h')\, \frac{h' - h^{(t)}}{\sigma_t^2}$$

  • No model weights are updated. All adaptation is per-instance and per-episode; the LLM backbone remains entirely frozen.
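The latent update above can be sketched in NumPy. This is a toy sketch: a random linear head stands in for the frozen LLM, and the mean top-$k$ log-probability stands in for the confidence reward; all names, shapes, and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def confidence_reward(h, W, k=5):
    """Toy intrinsic reward: mean log-probability of the top-k next-token
    predictions under a linear 'LM head' W (a stand-in for the frozen LLM)."""
    logits = h.mean(axis=0) @ W  # pool the K thought slots, project to the vocab
    logp = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))  # log-softmax
    return float(np.sort(logp)[-k:].mean())  # higher = more confident

def ltpo_step(h, W, eta=1e-3, sigma=0.1, k=5):
    """One REINFORCE step on the latent thought slots; no model weights change."""
    h_pert = h + sigma * rng.normal(size=h.shape)     # h' ~ N(h, sigma^2 I)
    r = confidence_reward(h_pert, W, k)
    return h + eta * r * (h_pert - h) / sigma**2, r   # policy-gradient update

K, d, vocab = 4, 8, 32
h = np.zeros((K, d))                # K thought-slot embeddings, reset per instance
W = rng.normal(size=(d, vocab))
for _ in range(20):
    h, r = ltpo_step(h, W)
print(round(confidence_reward(h, W), 3))
```

Note that practical variants would subtract a reward baseline to reduce gradient variance; the bare update shown here follows the equation above literally.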

PoT is also realized at the textual level via policy adaptation:

  • An auxiliary LoRA adapter (rank $r=8$) is zero-initialized at the start of each instance.
  • Trajectory exploration is conducted via MCTS, guided by the current policy parameters $\pi_{\theta, \phi_t}$.
  • After each exploration step, group-relative advantages based on execution signals are computed and a GRPO (PPO-style) policy update is applied:

$$\mathcal{L}_{\mathrm{GRPO}}(\phi) = \frac{1}{G} \sum_{i=1}^G \frac{1}{|\tau_i|} \sum_{j=1}^{|\tau_i|} \Big[ \min\!\big(r_{ij}(\phi)\, \hat{A}_i,\ \mathrm{clip}(r_{ij}(\phi),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\big) - \beta\, D_{\mathrm{KL}}(\cdots) \Big]$$

where $r_{ij}$ is the per-token importance ratio, $\hat{A}_i$ is the group-relative advantage, and $\beta$ controls the KL regularization.

Policy adaptation is discarded and reset after each problem, yielding pure test-time, instance-specific adaptation.
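The group-relative objective can be sketched numerically. This is a toy NumPy sketch under stated assumptions: the per-token KL term is a crude proxy, and the trajectory log-probabilities and rewards are fabricated purely for illustration.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, rewards, eps=0.2, beta=0.01, logp_ref=None):
    """Clipped PPO-style surrogate with group-relative advantages.

    logp_new / logp_old: per-token log-prob arrays, one per trajectory.
    rewards: one scalar reward per trajectory (e.g. a unit-test pass rate).
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative advantage
    if logp_ref is None:
        logp_ref = logp_old  # reference policy defaults to the pre-update policy
    total = 0.0
    for lp_new, lp_old, lp_ref, a in zip(logp_new, logp_old, logp_ref, adv):
        ratio = np.exp(lp_new - lp_old)                         # per-token importance ratio
        surr = np.minimum(ratio * a, np.clip(ratio, 1 - eps, 1 + eps) * a)
        kl = lp_ref - lp_new  # crude per-token KL proxy toward the reference policy
        total += np.mean(surr - beta * kl)                      # average over |tau_i| tokens
    return -total / len(rewards)                                # negate: maximize -> minimize

# A group of two sampled trajectories: one passes its tests, one fails.
lp_old = [np.log([0.5, 0.4]), np.log([0.3, 0.6])]
lp_new = [np.log([0.55, 0.45]), np.log([0.28, 0.6])]
loss = grpo_loss(lp_new, lp_old, rewards=[1.0, 0.0])
print(round(loss, 4))
```

Because the advantage is normalized within the group, the update needs no learned value function; the reward signal (here, pass/fail) is enough.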

PoT is further instantiated as a policy-guided heuristic search within ToT decoders, where LM-assigned probabilities $\pi_\tau(s)$ are used as cost heuristics:

  • The LTS algorithm prioritizes tree expansions by $h(s) = g(s)/\pi_\tau(s)$, where $g(s)$ is the number of thoughts in state $s$.
  • Theoretical bounds guarantee expansion efficiency under computational budgets, and the framework remains sample-efficient even under strict query budgets.
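The prioritization above can be sketched as a best-first search over the thought tree. This is a toy sketch: the bit-string state encoding (one character per thought) and the per-thought probabilities are illustrative, not from the papers.

```python
import heapq
import math

def lts_search(root, expand, prob, is_goal, budget=50):
    """Best-first search popping the state that minimizes h(s) = g(s) / pi(s),
    with g(s) the number of thoughts and pi(s) the policy probability of s."""
    counter = 0  # heap tie-breaker so equal priorities never compare states
    frontier = [(len(root) / prob(root), counter, root)]
    expanded = 0
    while frontier and expanded < budget:
        _, _, s = heapq.heappop(frontier)
        if is_goal(s):
            return s, expanded
        expanded += 1
        for child in expand(s):
            counter += 1
            heapq.heappush(frontier, (len(child) / prob(child), counter, child))
    return None, expanded

# Toy instance: states are bit-strings; the policy assigns probability 0.9 to
# each '1' thought and 0.1 to each '0' thought, so high-probability branches
# are expanded first.
found, n_expanded = lts_search(
    "1",
    expand=lambda s: [s + "0", s + "1"] if len(s) < 3 else [],
    prob=lambda s: math.prod(0.9 if c == "1" else 0.1 for c in s),
    is_goal=lambda s: s == "111",
)
print(found, n_expanded)
```

On this instance the search reaches the goal after expanding only the high-probability spine of the tree, illustrating why the expansion count is controlled by the bound quoted above.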

3. Empirical Evaluation and Comparative Performance

PoT-based approaches showcase substantial empirical gains across a range of challenging reasoning tasks:

Model          Zero-shot CoT   SoftCoT   LTPO (PoT)
LLaMA-3.1-8B   42.35           41.55     48.66
LLaMA-3.2-3B   41.54           38.61     46.38
Qwen-2.5-7B    54.53           47.88     56.79
Qwen-3-14B     54.59           51.60     55.86
  • On AIME benchmarks with significant distribution shift, SoftCoT collapses to 0% across models, Zero-shot CoT ranges from 6–10%, whereas LTPO achieves 13–17%, exhibiting a 3× gain and unique robustness.
Method               GSM8K   MATH-500   AIME2024
Genius (self-RL)     78.09   47.60       3.33
SimpleRL-Zoo         79.20   23.00       0.00
SoftCoT              80.36   39.80       0.00
LTPO (no training)   81.27   49.00      16.67

PoT-based LTPO achieves the best or competitive accuracy on GSM8K and MATH-500, while outperforming all listed baselines on AIME2024.

Method                 LCB V5   LCB V6   Overall
DeepSeek-V3 (235B)     31.74    16.00    50.55
GPT-4o (large)         29.94    29.71    49.75
Qwen3-4B (PoT, full)   57.49    49.71    58.98

A PoT 4B model achieves 49.71% on LCB V6, outperforming DeepSeek-V3 and GPT-4o despite being more than 50× smaller.

3.4 Efficiency and Stability

Ablations in (Jiao et al., 28 Jan 2026) show that removing transient LoRA adaptation in PoT reduces LCB V6 accuracy from 49.71% to 37.14%. The number of rollouts needed is reduced by 2.8× compared to static ensembles, indicating substantially higher computational efficiency.

4. Algorithmic Details and Policy Optimization Procedures

4.1 Policy Evolution Procedures

  • LTPO: Optimization is performed entirely in the latent embedding space using Gaussian perturbations and REINFORCE. The reward is self-derived, requiring only confidence statistics from the LLM output distribution.
  • GRPO: Test-time adaptation employs PPO-style loss with group-relative advantage, KL-regularized, applied for a small number of gradient steps (E=3) per reasoning turn.
  • MCTS-Guided Exploration: Test-time thought generation employs tree search, with up to $k=3$ candidate expansions per node and $M=20$ simulations per reasoning turn.
  • LTS-Guided Search: Thought expansion probability, as provided by the LLM, is used to prioritize search. The LTS expansion bound ensures that the number of generated thoughts is controlled by $b_{\max} \min_{s\in H'} g(s) / \pi_\tau(s)$.

4.3 Adaptation Parameters

  • Transient Adaptation: All per-instance adaptation parameters (LoRA adapters, latent vectors) are reset at the start of each new problem.
  • Heuristic Control: Softmax temperature $\tau$, LoRA rank $r$, group size $G$, and learning rate $\eta$ are tuned for the best speed–diversity–capacity trade-off.
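The transient-adaptation discipline (fresh, zero-effect adapter per problem) can be sketched as follows. The class and function names are hypothetical; zero-initializing the $B$ matrix so the adapter starts as an exact no-op follows standard LoRA practice.

```python
import numpy as np

class TransientLoRA:
    """Per-instance LoRA adapter: delta_W = B @ A, with B zero-initialized so
    the adapter starts as an exact no-op and is discarded after each problem."""
    def __init__(self, d_in, d_out, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))   # zero-init => delta_W = 0 at the start

    def delta(self):
        return self.B @ self.A

def solve_instance(problem, d_in=16, d_out=16):
    adapter = TransientLoRA(d_in, d_out)   # fresh adapter for this problem only
    assert np.allclose(adapter.delta(), 0.0)  # no carry-over from past instances
    # ... MCTS exploration and GRPO updates on adapter.A / adapter.B would go here ...
    return adapter

a1 = solve_instance("problem 1")
a2 = solve_instance("problem 2")           # starts from scratch, not from a1
print(np.allclose(a2.delta(), 0.0))
```

The reset guarantees that adaptation remains purely instance-specific: nothing learned on one problem can leak into the next.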

5. Practical Considerations and Model Robustness

PoT frameworks display:

  • No requirement for external supervision, as both rewards (in LTPO) and adaptation signals (in GRPO) are intrinsic, derived solely from model confidence or unit-test pass rates.
  • Robust performance across architectures and scales (3B–14B), with qualitative improvements particularly pronounced in out-of-distribution and long-horizon reasoning.
  • Stable optimization with respect to hyperparameters: Sensitivity analyses show LTPO’s stability across 1–16 thought tokens, negligible change for top-$k$ between 5 and 100, and accuracy gains from best-reward selection over the final iterate (Ye et al., 5 Oct 2025).

A plausible implication is that the universal, self-derived adaptation signal enables PoT to overcome dataset and domain shifts that defeat both frozen and static offline methods.

6. Comparative Analysis and Theoretical Guarantees

PoT bridges the gap between search-based augmentation (e.g., best-of-$N$ sampling, beam search) and true policy adaptation:

A distinctive feature is that this axis of improvement—instance-level, data-driven adaptation at test-time—is orthogonal to improvements due to scaling model parameters or improved prompt engineering.

7. Broader Implications and Future Directions

PoT represents a unifying framework for robust, scalable, and efficient LLM reasoning:

  • Per-instance, test-time adaptation: Each input is treated as a mini “learning problem,” with policy refinement localized to the current instance.
  • Lightweight compute requirements: A small number of forward and (in textual adaptation) local gradient steps suffice to produce substantial accuracy improvements, contrasting with resource-intensive ensemble or retraining protocols.
  • Generalization to hybrid frameworks: PoT naturally subsumes both latent and textual intermediate representations, and its core algorithmic strategies (online policy optimization, confidence or execution-driven feedback) are extendable to new architectures and domains.

The Policy of Thoughts paradigm thus reframes LLM inference from static reasoning trajectory generation to adaptive, intrinsically driven policy evolution, supporting both rigorous theoretical guarantees and empirical performance gains in demanding reasoning scenarios (Ye et al., 5 Oct 2025, Jiao et al., 28 Jan 2026, Pendurkar et al., 7 Jan 2026).
