- The paper demonstrates that RL reward improvements do not guarantee increased information acquisition and belief refinement in multi-turn active reasoning tasks.
- It introduces the AREW framework, which uses directional critique signals to reweight policy-gradient updates for both Action Selection and Belief Tracking.
- Empirical evaluations show AREW achieving up to 60% performance gains and breaking through information self-locking plateaus across diverse benchmarks and RL algorithms.
Problem Characterization and Failure Modes
This paper investigates information self-locking (SeL), a phenomenon arising when LLM-based agents engaged in active reasoning are trained with reinforcement learning. Unlike single-turn or passive tasks, where standard RL on outcome-based rewards often suffices, multi-turn active reasoning requires agents to strategically query their environment to acquire and internalize new information. The paper formalizes agentic behavior by splitting it into Action Selection (AS), which governs the acquisition of informative observations, and Belief Tracking (BT), which manages the integration of new evidence.
Empirical analysis across multiple multi-turn active reasoning benchmarks shows that RL-trained LLM agents frequently become trapped in low-information regimes. In these regimes:
- AS becomes conservative, causing information acquisition to plateau or even degrade despite reward improvements.
- BT stagnates, failing to absorb new evidence efficiently.
- The mutual dependence between AS and BT creates a negative feedback loop: weak BT masks credit assignment for informative queries, while poor AS starves BT of new evidence to integrate, thus enforcing SeL.
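The negative feedback loop above can be illustrated with a toy dynamical sketch (our construction, not the paper's model): if each capability's learning signal scales with the product of both current capacities, a low-low start barely moves, while a moderately capable start bootstraps itself upward.

```python
# Toy illustration (not the paper's model) of the AS <-> BT feedback loop.
# a = AS informativeness, b = BT capability, both in [0, 1]. Each RL step
# improves both in proportion to a * b, mimicking learning signals that
# scale with the product of the current capacities.

def simulate(a, b, lr=0.1, steps=100):
    for _ in range(steps):
        grad = a * b                  # weak whenever either capacity is low
        a = min(1.0, a + lr * grad)
        b = min(1.0, b + lr * grad)
    return a, b

low = simulate(0.02, 0.02)    # SeL regime: stays near its starting point
high = simulate(0.4, 0.4)     # moderate start: bootstraps to the ceiling
print(low, high)
```

The point of the sketch is only the qualitative gap: with the same learning rule and step budget, the low-low trajectory remains essentially frozen while the moderate one saturates.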
Using proxy measurements of AS and BT capability, the paper shows that this decoupling of reward from information gain is robust: reward improvements obtained through RL do not automatically translate into higher information acquisition or belief refinement.
Theoretical Framework
To elucidate the origin and persistence of SeL, the authors introduce a rigorous theoretical model:
- Active reasoning is formalized as a POMDP with belief-state abstraction. The agent's trajectory is characterized in terms of its AS policy and BT update operator.
- Two capability indices are defined:
  - AS Informativeness, I_th(w): the expected improvement in oracle belief quality induced by the agent's query policy.
  - Belief Tracking capability, C_BT(w): the belief improvement actually realized by the agent's update mechanism.
The paper establishes that the outcome reward's policy gradient can be decomposed into AS- and BT-isolated update directions. Within the SeL regime (low AS, low BT), learning signals scale linearly with current information and tracking capacities, resulting in diminished optimization magnitude. Theorem 3.4 proves that escaping SeL requires significant explicit intervention; otherwise, the agent remains trapped for a large number of RL update steps.
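Schematically, using this summary's notation (the paper's exact statement may differ), the decomposition and the SeL scaling can be written as:

```latex
% Schematic only; symbols follow this summary's notation, not the paper's.
\nabla_w J(w) = g_{\mathrm{AS}}(w) + g_{\mathrm{BT}}(w),
\qquad
\bigl\lVert g_{\mathrm{AS}}(w) \bigr\rVert,\;
\bigl\lVert g_{\mathrm{BT}}(w) \bigr\rVert
\;\propto\; I_{\mathrm{th}}(w)\, C_{\mathrm{BT}}(w)
```

so when both I_th(w) and C_BT(w) are small, both isolated update directions shrink together, which is the mechanism keeping the agent in the SeL regime for many update steps.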
AREW: Directional Critique-Driven Reweighting
Building on the formalism, the authors propose Advantage Reweighting by Easy-to-Obtain Directional Critiques (AREW). This lightweight method leverages stepwise diagnostic signals:
- AS Critiques: Binary labels for queries (+1 if informative evidence is obtained, -1 if not).
- BT Critiques: Scalar readouts tracking whether belief updates increase confidence in the correct hypothesis.
These critiques are injected as a margin-aware auxiliary objective into the policy-gradient optimization. The resulting surrogate loss reweights the advantage estimation such that positively critiqued steps increase their probability mass, while negatively critiqued steps decrease theirs. Critique signals are cheaply acquired from user feedback or confidence margins after belief updates.
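As a sketch, the reweighting described above might look like the following (our reading of the summary, not the paper's exact objective; `lam` and `margin` are illustrative knobs, not parameters from the paper):

```python
# Hedged sketch of critique-driven advantage reweighting. Each step t carries
# an advantage estimate A_t and a directional critique c_t in {+1, -1}; an
# auxiliary bonus shifts probability mass toward positively critiqued steps.

def reweight_advantages(advantages, critiques, lam=0.3, margin=0.05):
    out = []
    for a, c in zip(advantages, critiques):
        bonus = lam * c
        # Margin-aware guard (assumption): do not let a negative critique
        # flip a near-zero advantage on its own.
        if c < 0 and abs(a) < margin:
            bonus = 0.0
        out.append(a + bonus)
    return out

adv = [0.8, -0.2, 0.5]      # raw advantage estimates for three steps
crit = [+1, +1, -1]         # directional critiques for the same steps
print(reweight_advantages(adv, crit))
```

The reweighted advantages then feed the usual policy-gradient loss, so positively critiqued steps gain probability mass and negatively critiqued ones lose it.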
The paper's analysis establishes that AREW does not require perfect critique accuracy; it is effective as long as the critiques' weighted alignment with oracle-good actions exceeds chance, i.e., Acc_Q(w) > 1/2.
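The intuition behind this threshold can be checked with a one-line expected-value computation (illustrative, not the paper's proof): a ±1 critique that agrees with the oracle-good direction with probability p contributes 2p - 1 in expectation, which is positive exactly when p > 1/2.

```python
# A +/-1 critique agrees with the oracle-good direction with probability p.
# Its expected signed contribution is (+1)*p + (-1)*(1 - p) = 2p - 1,
# so the expected update points the right way iff p > 1/2.

def expected_critique_signal(p):
    return (+1) * p + (-1) * (1 - p)

for p in (0.4, 0.5, 0.7):
    print(p, expected_critique_signal(p))
```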
Empirical Evaluation
AREW is evaluated across seven tasks in three domains (preference estimation, medical diagnosis, troubleshooting) with multiple RL baselines (PPO, GRPO, GSPO) and major model families (Qwen-2.5-7B-Instruct, LLaMA-3.1-8B-Instruct). Key findings include:
- AREW yields substantial improvements in cumulative task rewards and AS/BT capability proxies: up to 60% performance gains over vanilla RL.
- The AS+BT variant of AREW dominates AS-only in the majority of settings, confirming the benefit of critique injection in both channels.
- AREW demonstrates robust performance even under strong directional critique noise (perturbation ratios up to 0.5) and across different RL algorithms, indicating method generality and stability.
- Reward curves show that AREW breaks through SeL-induced plateaus, enabling continual reward and information acquisition improvements.
Numerical highlights from Table 1:
- AREW-AS+BT achieves up to 62-point reward improvements in preference estimation (PE-GS=3) and 21-point gains in troubleshooting (FloDial-Hard), compared to PPO vanilla baselines.
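The reported robustness to critique noise has a simple arithmetic reading (our illustration, assuming independent critique flips): flipping each critique with probability p degrades an accuracy Acc to (1 - p)·Acc + p·(1 - Acc), which stays above the 1/2 threshold for any Acc > 1/2 until p reaches 0.5, consistent with the perturbation ratios tested.

```python
# Effective critique accuracy when each critique is independently flipped
# with probability flip_ratio (illustrative model of the noise experiments).

def effective_accuracy(acc, flip_ratio):
    return (1 - flip_ratio) * acc + flip_ratio * (1 - acc)

for p in (0.0, 0.25, 0.5):
    print(p, effective_accuracy(0.9, p))
```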
Practical and Theoretical Implications
The study identifies SeL as a structural RL failure mode in agentic LLMs, driven by misaligned credit assignment in coupled multi-turn environments. AREW offers an extremely practical remedy, requiring only lightweight labels available during interaction, without complex reward shaping or external model verifiers.
Theoretically, this work advances understanding of RL dynamics in interactive reasoning. It provides formal justification for decoupled and margin-aware training objectives, underlining the necessity for stepwise credit assignment beyond outcome rewards.
Practically, AREW opens up new directions for designing RL schemes for LLM agents in domains where information acquisition and internalization are critical, including complex dialogue systems, decision support, and collaborative agents.
Conclusion
The paper delivers a comprehensive diagnosis of information self-locking in RL for active reasoning, grounded in both empirical and theoretical contributions. The AREW framework effectively reallocates learning signal and successfully mitigates SeL across diverse settings. The results suggest that robust agentic behaviors in LLMs demand explicit intervention in multi-turn RL—and that stepwise directional critique-driven reweighting is a principled, scalable solution. Future research may extend AREW to more adaptive reweighting strategies, finer-grained critique signals, and broader agentic environments.
Reference: "On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents" (2603.12109)