
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

Published 17 May 2025 in cs.LG (arXiv:2505.11821v2)

Abstract: This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of LLM agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards. Both training and validation reward curves illustrate that our method achieves greater stability, faster convergence, and higher accuracy. Numerical results across diverse question-answering datasets further show that our approach consistently delivers the highest answer correctness and 100% format correctness.

Summary

  • The paper introduces a paradigm shift from traditional trajectory-level rewards to turn-level reward design, enabling fine-grained credit assignment in multi-turn reasoning tasks.
  • It presents adapted GRPO and PPO variants (MT-GRPO and MT-PPO) that integrate intermediate and outcome rewards to significantly improve training efficiency and agent performance.
  • Experimental results show that turn-level rewards boost training stability, answer correctness, and overall task performance in complex, multi-turn interactive environments.

Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

The paper "Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design" investigates reinforcement learning techniques to enhance the reasoning capabilities of LLM agents in scenarios requiring complex, multi-turn interactions. The study focuses on overcoming limitations of sparse outcome rewards by introducing systematic turn-level reward design to facilitate fine-grained credit assignment.

Introduction to Turn-Level Reward Design

Reinforcement Learning (RL) approaches such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) are traditionally used to train LLM agents but typically operate with trajectory-level rewards. This paper introduces a paradigm shift by systematically studying and implementing turn-level rewards to improve these RL algorithms' effectiveness. Turn-level rewards allow the agent to receive immediate feedback on intermediate steps of decision-making, which is crucial for refining complex reasoning tasks.

Figure 1: Curves for different training reward components during training with various algorithms. GRPO-OR denotes GRPO with outcome rewards, while GRPO-MR indicates GRPO with merged outcome and intermediate rewards.

Implementation of Turn-Level GRPO and PPO Variants

GRPO with Turn-Level Rewards

GRPO traditionally estimates advantage on a trajectory level, ignoring intermediary actions' contributions during long-horizon tasks. The paper modifies GRPO to a new variant—multi-turn GRPO (MT-GRPO)—which accommodates turn-level rewards.

MT-GRPO incorporates iterative advantages for both intermediate and outcome results:

  • Intermediate Rewards guide actions during the process.
  • Outcome Rewards reflect final task success.

The MT-GRPO algorithm thus computes advantages by leveraging both:

  • Current step outcome rewards.
  • Intermediate turn-specific feedback.
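As a rough illustration of how group-normalized, turn-level advantages might be combined with outcome advantages, here is a minimal Python sketch. The function name, the reward values, and the normalization details are illustrative assumptions, not the paper's exact estimator:

```python
import statistics

def mt_grpo_advantages(turn_rewards, outcome_rewards):
    """Group-normalized advantages for one group of rollouts.

    turn_rewards[i]    -- intermediate (turn-level) reward of rollout i
    outcome_rewards[i] -- final outcome reward of rollout i
    Illustrative sketch of a turn-level GRPO-style estimator, not the
    paper's exact formulation.
    """
    def normalize(rewards):
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
        return [(r - mean) / std for r in rewards]

    return normalize(turn_rewards), normalize(outcome_rewards)

# Four rollouts in a group: tokens inside a turn would receive the turn
# advantage plus the outcome advantage; final-answer tokens only the latter.
turn_adv, out_adv = mt_grpo_advantages([0.3, 0.0, 0.3, 0.1],
                                       [1.0, 0.0, 1.0, 0.0])
```

Normalizing each reward type within the group separately keeps the intermediate and outcome signals on comparable scales before they are merged per token.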

The complexity arises from managing rewards at each turn, which demands a number of rollout samples that grows exponentially with the number of turns and can be computationally prohibitive in highly interactive tasks.

PPO with Turn-Level Rewards

To address GRPO’s complexity, the paper proposes an adaptation of PPO—multi-turn PPO (MT-PPO)—which employs generalized advantage estimation (GAE) to compute token-level advantages efficiently. This allows assigning specific rewards at each turn without the overhead imposed by GRPO’s sample complexity.

MT-PPO utilizes a critic model for value estimation, thereby achieving scalable and efficient training with turn-specific signals rather than relying solely on sparse trajectory-level rewards.
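The turn-reward placement that MT-PPO relies on can be sketched with textbook generalized advantage estimation. The hyperparameters (gamma=1.0, lam=0.95), the all-0.5 critic values, and the convention of placing each turn's reward on its last token are assumptions for illustration, not necessarily the paper's exact setup:

```python
def gae_token_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over a token sequence.

    rewards -- per-token rewards: zero everywhere except the last token of
               each turn (turn-level reward) and the final token (outcome).
    values  -- critic value estimate at each token position.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                      # backward recursion
        advantages[t] = gae
    return advantages

# Ten tokens: a turn-level reward on token 4, the outcome reward on token 9.
rewards = [0, 0, 0, 0, 0.3, 0, 0, 0, 0, 1.0]
values = [0.5] * 10
adv = gae_token_advantages(rewards, values)
```

Because the recursion runs backward, credit from both the outcome reward and the mid-sequence turn reward propagates to earlier tokens without requiring extra rollouts per turn.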

Figure 2: Overview of the multi-turn reasoning-augmented search agent pipeline, highlighting interleaved reasoning and search iterations.

Case Studies in Multi-Turn Reasoning-Augmented Agents

The approach has been validated on reasoning-augmented search agents that integrate external searches into their decision-making processes. Here, two types of turn-level rewards were carefully crafted:

  • Verifiable Rewards: Measure strict correctness, ensuring responses are formatted and content-valid.
  • LLM-as-Judge Rewards: Leverage LLMs to evaluate nuanced response quality beyond binary validity.
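A minimal sketch of what a verifiable final-step reward could look like, assuming the `<think>`/`<answer>` tag structure described in this paper's search-agent setup; the function name and the reward weights (1.0 for an exact match, 0.2 for correct format) are illustrative, not the paper's exact magnitudes:

```python
import re

def verifiable_outcome_reward(solution, ground_truth):
    """Rule-based final-step reward: exact answer match plus format check.

    Illustrative weights; the check mirrors the two verifiable criteria
    (content validity and format validity) described above.
    """
    # Format check: the solution is exactly one <think> block then one <answer> block.
    format_ok = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", solution, re.DOTALL
    ) is not None
    # Exact match: compare the text between <answer> tags to the ground truth.
    match = re.search(r"<answer>(.*?)</answer>", solution, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    exact = answer.lower() == ground_truth.lower()
    return (1.0 if exact else 0.0) + (0.2 if format_ok else 0.0)
```

For example, `verifiable_outcome_reward("<think>reason</think><answer>Paris</answer>", "Paris")` earns both components, while a correct answer with missing tags earns only the match component.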

Experiments demonstrated improved stability in training dynamics and higher accuracy with both types of rewards.

Figure 3: Training reward curves for PPO baselines and MT-PPO across datasets, showcasing stability and convergence improvements in MT-PPO.

Experimental Evaluation and Results

An extensive suite of experiments was conducted across multiple datasets, revealing substantial improvements in answer correctness and format adherence. The MT-PPO variant consistently outperformed conventional PPO algorithms, achieving close to 100% format correctness and demonstrable improvements in answer validity across domain-specific tasks.

Figure 4: Ablation studies showcasing the impact of search count rewards and maximum turn adjustments on performance.

Conclusion

The study establishes the importance of turn-level reward mechanisms in multi-turn agent interactions. By reformulating RL approaches to utilize turn-specific rewards, both training stability and interactive agent accuracy are significantly enhanced. This suggests broader applicability for turn-level reward frameworks in dynamic environments where complex agent decisions guide task completion.

Future work could expand these paradigms to diverse agent-based systems, refining interactions in domains such as complex software navigation, real-time user support interfaces, and adaptive content generation tasks.

Explain it Like I'm 14

Overview

This paper is about teaching AI chatbots (large language models, or LLMs) to solve problems step by step over several turns, especially when they use tools like a search engine. The authors show how to give the AI useful feedback at each step (not just at the end), so it can learn faster, make better decisions, and produce answers in the right format.

Think of it like a long puzzle. If you only find out at the end whether you were right or wrong, it’s hard to know which steps helped. This paper designs “turn-level rewards,” like points or feedback after each move, to guide the AI through multi-step tasks.

Key Questions the Paper Tries to Answer

  • How can we train LLM agents to make better multi-step decisions instead of judging them only at the end?
  • What’s the best way to give feedback (rewards) at each turn so the AI knows which actions helped?
  • Can adding turn-level feedback make training more stable, faster, and more accurate?
  • How do popular training methods (GRPO and PPO) need to change to use turn-level rewards?

How They Did It (Methods and Ideas)

First, two important ideas:

  • Multi-turn tasks: The AI doesn’t just answer once. It thinks, searches for information, reads results, thinks again, and finally answers. Each cycle is a “turn.”
  • Rewards: Like points in a game. Outcome rewards score the final answer. Turn-level rewards score each step to show if the AI is on the right track.

Here’s the everyday picture: training without turn-level rewards is like playing a long game and only getting a score at the very end. With turn-level rewards, you get feedback after each move—“Nice search query,” “Good formatting,” “Found something useful,” or “Too many searches—slow down.”

The paper adapts two popular Reinforcement Learning (RL) methods:

  • GRPO (Group Relative Policy Optimization): Think of it as comparing multiple answers from the AI and pushing it toward the better ones. The authors create a multi-turn version (MT-GRPO) that tries to score each turn. But there’s a catch: it needs a lot of samples (like trying many paths at every step), which becomes expensive and forces a fixed number of turns. That makes it hard to use for longer tasks.
  • PPO (Proximal Policy Optimization): This method uses a “critic” model—like a coach that estimates how good the current step is likely to be. The authors make a multi-turn version (MT-PPO) that inserts rewards at the right moments (end of each turn). The critic helps assign credit to specific tokens (words and symbols) and steps. MT-PPO is much more efficient and flexible than MT-GRPO.

They test these ideas on a reasoning-augmented search agent:

  • The agent follows a loop: 1) Thinks about what’s needed, 2) Creates a search query, 3) Reads results, 4) Repeats until ready, 5) Writes the final answer.
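The five-step loop above can be sketched in a few lines of Python. Here, `llm_step` and `search` are hypothetical stand-ins for the policy model and the retrieval tool, not the paper's actual interfaces:

```python
def search_agent_loop(question, llm_step, search, max_turns=6):
    """Reasoning-augmented search loop following the five steps above.

    llm_step(history) returns ("search", query) or ("answer", text);
    both callables are illustrative stand-ins.
    """
    history = [("question", question)]
    for _ in range(max_turns):
        action, content = llm_step(history)
        if action == "answer":
            return content  # step 5: write the final answer
        history.append(("search", content))               # step 2: issue a query
        history.append(("information", search(content)))  # step 3: read results
    return None  # turn budget exhausted without an answer

# Tiny scripted stand-ins for demonstration: search once, then answer.
def scripted_llm(history):
    if any(tag == "information" for tag, _ in history):
        return ("answer", "Paris")
    return ("search", "capital of France")

answer = search_agent_loop("What is the capital of France?",
                           scripted_llm,
                           lambda q: "Paris is the capital of France.")
```

Each pass through the loop is one "turn," and it is exactly these turn boundaries where the turn-level rewards are attached during training.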

Turn-Level Reward Design

To make feedback clear and fair, they use two types of turn-level rewards:

  • Verifiable rewards (strict, rule-based):
    • Intermediate steps:
      • Did the retrieved text contain the correct answer (or a clue) from Wikipedia?
      • Did the agent use the correct format tags (like <think>, <search>, <information>) in the right order?
      • Did it avoid overusing the search tool? (Too many searches get a small penalty.)
    • Final step:
      • Exact match: Does the answer between <answer> tags match the ground truth?
      • Format check: Did the solution contain only <think> and <answer> tags in the right order?
  • LLM-as-judge rewards (flexible, expert grading):
    • A strong LLM acts like a teacher. It reads each turn, reasons through it, then scores based on a rubric (format, reasoning quality, search usefulness, etc.). This captures nuanced improvements that strict rules might miss.

Main Findings and Why They Matter

The authors run lots of experiments on question-answering datasets (like NQ, TriviaQA, HotpotQA, etc.) and compare different training setups. Their key results:

  • Training is more stable with turn-level rewards. The learning curves don’t swing wildly or crash.
  • Training is faster early on. Turn-level feedback helps the agent quickly learn good habits (like writing properly formatted output and making efficient searches).
  • Accuracy improves. The agent answers more questions correctly, especially on harder multi-hop tasks that require multiple searches and reasoning steps (e.g., HotpotQA).
  • Format correctness is near-perfect. MT-PPO produces outputs in the required structure about 100% of the time, which is crucial for automated grading and tool use.
  • Overall, MT-PPO (the PPO version with turn-level rewards) consistently beats baselines that only use final rewards or merge everything into one lump score.

In simple terms: giving the AI feedback at each step made it smarter, faster, and more reliable.

Implications and Potential Impact

This approach can make AI agents better at complex, step-by-step tasks: not just answering from memory, but using tools (like search engines, calculators, or code interpreters) responsibly and effectively. With turn-level rewards:

  • Agents learn which actions help long-term success, not just whether the final answer was right.
  • Outputs are more trustworthy and well-structured, which helps real-world systems that depend on correct formats.
  • The idea generalizes beyond search: it could improve tool use in coding, math, data analysis, and more.

Bottom line: turn-level rewards are a practical way to build smarter, more dependable AI agents that can reason over multiple steps and interact with the world in a controlled, productive way.
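As a concrete companion to the LLM-as-judge rewards described earlier, here is a hedged sketch of how a grader call could be wrapped into a turn reward. The function name, `query_llm` callable, prompt wording, and 0-to-1 scale are all assumptions for illustration, not the paper's actual judging setup:

```python
def judge_turn_reward(turn_text, rubric, query_llm):
    """Score one agent turn with a grader model.

    query_llm is a hypothetical callable that sends a prompt to a strong
    LLM and returns its text reply.
    """
    prompt = (
        "Score the following agent turn from 0 to 1 against this rubric. "
        "Reply with only the number.\n"
        f"Rubric: {rubric}\n"
        f"Turn:\n{turn_text}"
    )
    try:
        score = float(query_llm(prompt).strip())
    except ValueError:
        return 0.0  # an unparseable grade earns no reward
    return min(1.0, max(0.0, score))  # clamp to the rubric's range

# Scripted grader stand-ins for demonstration.
good = judge_turn_reward("<search>capital of France</search>",
                         "Is the query relevant?", lambda p: "0.9")
bad = judge_turn_reward("...", "Is the query relevant?",
                        lambda p: "not a number")
```

Clamping and the zero fallback keep a noisy or malformed judge reply from injecting out-of-range values into the advantage estimates.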

Knowledge Gaps

Here is a concise, actionable list of the knowledge gaps, limitations, and open questions left unresolved by the paper:

  • Generalization beyond search: Effectiveness of turn-level rewards for other tool-augmented tasks (e.g., calculators, code execution, database queries, APIs with errors/latency) is not evaluated.
  • Dynamic environments: Results are on a static 2018 Wikipedia index; robustness to live web dynamics, content drift, and tool failures remains untested.
  • Horizon length: MT-PPO is only evaluated with short horizons (N_max ≤ 6); scalability and stability on long-horizon, branching tasks are unknown.
  • MT-GRPO practicality: The proposed MT-GRPO is only demonstrated for two turns; no approximation or variance-reduction strategy is offered to mitigate its exponential rollout cost or fixed-turn constraint for K > 2.
  • Reward shaping sensitivity: The paper fixes ad-hoc intermediate reward magnitudes (e.g., 0.3, 0.1, −0.2) and search penalty λ_s = 0.1; sensitivity analyses and principled tuning guidelines are missing.
  • Discounting and GAE choices: No study on the impact of γ and GAE λ on stability/credit propagation across turns; best practices for these hyperparameters in multi-turn settings remain unclear.
  • Token-level credit within a turn: Rewards are placed only on the last token of a turn; whether denser token-level signals (e.g., at reasoning/tool-call boundaries) yield better credit assignment is not explored.
  • Value function design: Architecture, conditioning, and training choices for the critic (e.g., turn markers, action-type embeddings, separate value heads per turn) are not analyzed; their effect on collapse/variance is unknown.
  • PPO collapse diagnosis: While MT-PPO appears more stable, the root causes of PPO collapse in long-CoT settings (e.g., value overfitting, bootstrapping bias) are not systematically diagnosed or mitigated.
  • Reward hacking risks: The retrieval-existence reward (string match) can be gamed by retrieving passages containing the surface form without true evidential grounding; safeguards or evidence-quality signals are absent.
  • Judge reliability: The LLM-as-judge’s calibration, bias, inter-rater reliability, and susceptibility to manipulation by the trained agent are not measured; no adversarial or cross-judge robustness checks.
  • Cost and throughput: The computational and wall-clock overhead of turn-level rewards (especially judge-based) is not quantified; trade-offs vs. trajectory-level training are unclear.
  • Fairness of baselines: Some baselines reportedly crash; comparisons use “final or last-before-collapse” checkpoints; controlled re-trains under identical seeds, budgets, and early stopping criteria are needed.
  • Out-of-domain coverage: Benchmarks vary within QA but remain English, Wikipedia-centric; transfer to domains with noisy/long documents (legal, biomedical) or multilingual settings is untested.
  • Termination behavior: How turn-level rewards influence the stop decision (premature vs. overly long searches) is only partially probed via λ_s; a principled termination reward or learnable stopping criterion is not developed.
  • Multi-objective trade-offs: There is no explicit framework for balancing accuracy, format adherence, tool cost, and latency; Pareto analyses or meta-optimization of weights are missing.
  • Evidence sufficiency: Intermediate rewards do not assess whether retrieved context is sufficient/causally supportive; no supervision on reasoning-faithfulness or citation grounding.
  • Credit across tool chains: The approach assumes a single search tool per turn; credit assignment across multi-tool pipelines (e.g., retrieve→parse→compute) is not addressed.
  • Robustness to retriever quality: Results hinge on a fixed retriever (E5) and k=3 passages; sensitivity to retriever model, k, re-ranking, and noisy top-k is not studied.
  • Format generalization: Near-100% format correctness may reflect strong tag adherence; whether such structural control transfers to more flexible or schema-free outputs is unknown.
  • Safety and alignment: Effects on hallucination rates, harmful content, and miscalibration are not evaluated; no human evaluation or preference alignment beyond exact match/format metrics.
  • Sample efficiency: Improvements are reported in curves but not normalized per environment interaction or token generated; data-efficiency and compute-efficiency comparisons are missing.
  • Continual/transfer learning: How turn-level reward policies adapt across datasets over time (catastrophic forgetting, positive transfer) has not been explored.
  • Stochastic tool feedback: The framework assumes deterministic feedback; handling noisy, delayed, or partial tool responses and their impact on advantage estimation is open.
  • Theoretical guarantees: There is no theoretical analysis of bias/variance in turn-level reward assignment with GAE, or convergence/sample-complexity benefits over trajectory-level methods.
  • Open benchmarking: Code/reward templates appear adapted from Search-R1; standardized, open benchmarks for turn-level reward design (with ablations and judge audits) would aid reproducibility.
