Multi-Turn Reinforcement Learning

Updated 16 January 2026
  • Multi-Turn Reinforcement Learning is a paradigm where agents, often large language models, learn to perform complex tasks over multiple interactive turns.
  • It defines state, action, reward, and policy over extended dialogues, enabling efficient credit assignment and non-myopic planning in sequential decision-making.
  • Applications span search, reasoning, dialogue, tool-use, and software engineering, with innovations in turn-level optimization and hierarchical reward shaping enhancing performance.

Multi-Turn Reinforcement Learning (MT-RL) is a paradigm in which an agent, typically instantiated as an LLM, learns to interact with an environment over multiple interconnected turns to accomplish complex, temporally extended tasks. Unlike single-turn or bandit RL, MT-RL addresses the core RL setting where state, action, reward, and policy are all defined over structured, multi-step dialogues or decision processes. Tasks span search, reasoning, dialogue, tool-use, code synthesis, software engineering, navigation, and collaborative scenarios—each presenting distinct challenges for credit assignment, non-myopic planning, and sample efficiency.

1. Formalization and Core Definitions

Multi-turn RL is formalized as a Markov Decision Process (MDP) or Partially Observable MDP (POMDP) defined over episodes lasting several turns:

  • State space $\mathcal{S}$: The agent’s state $s_t$ at turn $t$ typically encodes the history of observations, actions, tool calls, and environment feedback up to $t$. For language agents, this can include token-level or utterance-level dialog histories—potentially truncated or encoded for memory efficiency (Kalyan et al., 28 Oct 2025).
  • Action space $\mathcal{A}$: At each turn, actions may be single tokens, full utterances, or structured tool invocations (search, code, environment commands) (Wei et al., 17 May 2025, Abdulhai et al., 2023).
  • Transition function $P(s_{t+1} \mid s_t, a_t)$: Deterministic or stochastic; in tool-augmented settings, a structured action triggers tool or environment state transitions, modifying $s_{t+1}$ (Kalyan et al., 28 Oct 2025, Golubev et al., 5 Aug 2025).
  • Reward function $R(s_t, a_t)$: May be sparse (only at the end of the episode), turn-level (providing intermediate feedback), or hierarchical (outcome + process rewards) (Wei et al., 17 May 2025, Xiong et al., 8 Dec 2025).
  • Objective: Maximize discounted cumulative reward,

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right], \quad \gamma \in [0.9, 1).$$

Multi-turn reinforcement learning seeks to optimize policies that operate over long horizons—typically tens of turns—with delayed or partial feedback, requiring robust credit assignment and planning mechanisms (Abdulhai et al., 2023, Kalyan et al., 28 Oct 2025, Zhou et al., 2024).
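
As a concrete illustration of the objective above, the following minimal sketch (with illustrative reward values; $\gamma = 0.95$ is an assumed choice within the stated range) computes the discounted return for a five-turn episode under a sparse terminal reward versus a dense turn-level scheme:

```python
def discounted_return(rewards, gamma=0.95):
    """Compute sum_{t=0}^{T-1} gamma**t * r_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse scheme: feedback arrives only at the final turn of the episode.
sparse = [0.0, 0.0, 0.0, 0.0, 1.0]
# Dense scheme: small verifiable signal at every turn, same total mass.
dense = [0.2, 0.2, 0.2, 0.2, 0.2]

print(round(discounted_return(sparse), 4))  # → 0.8145 (0.95**4)
print(round(discounted_return(dense), 4))   # → 0.9049
```

The sparse reward is discounted by $\gamma^{T-1}$ before it reaches the learner, one reason dense per-turn shaping (Section 3) tends to give a stronger and lower-variance learning signal.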

2. Algorithmic Foundations and Variants

Trajectory- and Turn-Level Policy Optimization

Early MT-RL for LLMs relied on extending single-turn methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) to the multi-turn domain (Wang et al., 1 Oct 2025, Li et al., 18 Dec 2025). However, trajectory-level (final outcome only) advantages proved insufficient for dense credit assignment or stable learning, motivating several key innovations:

  • Turn-Level Advantage Estimation: Algorithms such as Turn-PPO fit critics and estimate advantages at the level of full dialogue turns rather than tokens, reducing variance and permitting more granular credit assignment (Li et al., 18 Dec 2025).
  • Hierarchical RL: ArCHer (Zhou et al., 2024) executes high-level policy/value RL at the utterance/turn boundary (aggregating delayed reward) and token-level REINFORCE updates within each turn, dramatically improving sample efficiency in dialog and web-invocation tasks.
  • Group-Normalized and Group-Turn Methods: Both GTPO (Ding et al., 18 Nov 2025) and MT-GRPO/MT-PPO (Wei et al., 17 May 2025) introduce fine-grained turn-level or return-based normalization for advantages across stochastic trajectory batches, supporting stable optimization in environments with high trajectory variance.
  • Gated Reward Accumulation: Masking or gating low-level rewards with respect to meeting outcome thresholds prevents reward hacking and misalignment in sparse-reward, long-horizon domains such as software engineering (Sun et al., 14 Aug 2025).
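
A hedged sketch of the group-normalized, turn-level advantage idea (simplified; MT-GRPO, GTPO, and Turn-PPO each differ in detail from this toy version): per-turn returns-to-go from a group of sampled trajectories are z-scored across the group at each turn index.

```python
import statistics

def turn_level_advantages(group_turn_returns):
    """group_turn_returns: one list of per-turn returns-to-go per trajectory.
    Returns advantages z-scored across the group at each turn index."""
    n_turns = max(len(traj) for traj in group_turn_returns)
    advantages = [[] for _ in group_turn_returns]
    for t in range(n_turns):
        vals = [traj[t] for traj in group_turn_returns if t < len(traj)]
        mu = statistics.mean(vals)
        sd = statistics.pstdev(vals) or 1.0  # identical returns -> zero advantage
        for i, traj in enumerate(group_turn_returns):
            if t < len(traj):
                advantages[i].append((traj[t] - mu) / sd)
    return advantages

# Two sampled trajectories: they diverge at turn 0 and agree at turn 1,
# so only turn 0 carries a non-zero (credit-assigning) advantage.
adv = turn_level_advantages([[1.0, 0.5], [0.0, 0.5]])
```

Normalizing at the turn level rather than over whole trajectories is what lets credit concentrate on the specific turns where sampled rollouts diverge.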

Reinforcement Learning from Preferences

Building on reinforcement learning from human feedback (RLHF), MT-RL has been extended to incorporate preference feedback over entire conversations or trajectories, enabling optimization with only relative human or oracle judgments at the dialogue level. Mirror-descent policy optimization (MTPO) converges to Nash equilibria between competing policies given pairwise trajectory preferences, and deep actor-critic instantiations have demonstrated that multi-turn preference learning can match explicit reward-based RL (Shani et al., 2024).
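
A toy analogue of the mirror-descent idea (this multiplicative-weights update, the preference "win rate" signal, and the learning rate are all illustrative assumptions, not the MTPO algorithm itself): the policy's probability mass over candidate trajectories is reweighted exponentially by pairwise preference outcomes.

```python
import math

def mirror_descent_step(policy, pref_signal, lr=0.5):
    """policy: dict mapping trajectory id -> probability.
    pref_signal: dict mapping trajectory id -> average pairwise preference
    win rate in [0, 1] against sampled opponent trajectories."""
    weights = {k: p * math.exp(lr * pref_signal[k]) for k, p in policy.items()}
    z = sum(weights.values())  # renormalize back onto the simplex
    return {k: w / z for k, w in weights.items()}

policy = {"tau_a": 0.5, "tau_b": 0.5}
# tau_a wins 90% of pairwise comparisons; mass shifts toward it.
updated = mirror_descent_step(policy, {"tau_a": 0.9, "tau_b": 0.1})
```

Only relative judgments enter the update, which is what lets preference-based MT-RL proceed without a scalar reward model.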

3. Reward Shaping and Credit Assignment

Dense, per-turn reward schemes have proven critical for practical MT-RL:

  • Turn-Level (Verifiable and LLM-Judged) Rewards: Dense, actionable feedback—whether computed via structured tags, retrieval correctness, compliance scoring, or LLM-judge models—yields faster convergence, lower variance, and higher stability than sparse, final-outcome-only rewards (Wei et al., 17 May 2025, Xue et al., 2 Sep 2025, Ding et al., 18 Nov 2025).
  • Self-supervised and Embedding-Based Shaping: Similarity between intermediate tool outputs (e.g., code) and reference successes can be used to shape rewards for turns leading toward the correct outcome, even in the absence of explicit supervision (Ding et al., 18 Nov 2025).
  • Process vs. Outcome Rewards: In adversarial or long-term interactive settings, hybrid schemes blending sparse outcome-judgment and process-based heuristics (e.g., for progress, safety, adversarialness) improve attack strategies and sample efficiency over naively optimizing only for terminal scores (Xiong et al., 8 Dec 2025).
  • Gated Accumulation: Explicit thresholding of which reward components may be accumulated (G-RA) avoids premature or misleading low-level reward propagation, especially for partially-verified substeps (Sun et al., 14 Aug 2025).
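
The gating idea can be sketched as follows (a simplified illustration loosely in the spirit of G-RA; the gate threshold and reward decomposition here are assumptions, not the published formulation):

```python
def gated_return(outcome_reward, process_rewards, gate_threshold=0.5):
    """Accumulate low-level process rewards only when the episode-level
    outcome reward clears the gate; otherwise return the outcome alone.
    This blocks the agent from farming process rewards on failed episodes."""
    if outcome_reward >= gate_threshold:
        return outcome_reward + sum(process_rewards)
    return outcome_reward

print(gated_return(1.0, [0.25, 0.25]))  # → 1.5 (gate open: process credit counts)
print(gated_return(0.0, [0.25, 0.25]))  # → 0.0 (gate closed: no process credit)
```

Without the gate, a policy could maximize intermediate rewards (e.g., passing partial checks) while never completing the task, a classic reward-hacking failure in long-horizon SWE domains.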

4. Architectures, Implementation, and Scalability

Stateful multi-turn RL requires agentic infrastructures capable of long-context rollouts, heterogeneous tool chains, and efficient batch processing.

  • Batching and Asynchrony: AgentRL (Zhang et al., 5 Oct 2025) deploys fully asynchronous rollout-training pipelines, with cross-policy sampling from stale and current models to maintain exploration, and per-task advantage normalization to ensure stability in multi-task, heterogeneous agent settings.
  • Unified Function-Call and Containerization: Environments are exposed to the agent via unified APIs, often containerized for OS- and tool-level isolation, enabling scalable, reproducible experimentation across text-based games, web interfaces, SQL, shell, and search (Zhang et al., 5 Oct 2025, Kalyan et al., 28 Oct 2025).
  • Off-Policy and Replay Buffers: Hierarchical or value-based approaches exploit replay of prior utterance-level transitions, accelerating learning and credit assignment across long horizons (Zhou et al., 2024).
  • Gradient Filtering and Stability: Filtering and masking of gradient updates corresponding to low-probability, distributionally-drifted tokens stabilize policy gradients in feedback-rich MT-RL, as in SimpleTIR (Xue et al., 2 Sep 2025).
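
The filtering step can be sketched as a simple token mask (an illustrative simplification of SimpleTIR-style masking; the probability floor is an assumed hyperparameter): tokens assigned very low probability under the sampling policy, typically tool or environment feedback far outside the model's distribution, are excluded from the policy-gradient loss.

```python
def policy_gradient_mask(token_probs, floor=1e-4):
    """Return a 0/1 mask over tokens: 1 keeps the token's gradient
    contribution, 0 filters out low-probability (likely OOD) tokens."""
    return [1 if p >= floor else 0 for p in token_probs]

# Tokens 1 and 3 fall below the floor (e.g., raw tool output) and are masked.
probs = [0.3, 0.00005, 0.12, 0.000001]
print(policy_gradient_mask(probs))  # → [1, 0, 1, 0]
```

In practice the mask multiplies the per-token loss, so masked tokens still condition the context but contribute no gradient.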

5. Empirical Findings and Benchmarks

MT-RL has been applied to a wide range of domains, yielding state-of-the-art or near-SOTA results:

  • Search and Retrieval: RL-trained LLMs in legal document search surpass prompt-based and frontier API models (e.g., 85% vs. 78%) and demonstrate monotonic accuracy improvement with increased turn budget, contingent on sufficient horizon during training (Kalyan et al., 28 Oct 2025).
  • Medical Dialogue: DoctorAgent-RL achieves higher diagnostic and recommendation accuracy by learning to dynamically adjust questioning and diagnosis strategies compared to SFT or static approaches (Feng et al., 26 May 2025).
  • Tool-Augmented Reasoning and SWE: MT-RL methods, particularly those with turn-level or gated rewards, significantly outpace supervised or trajectory-level RL on software engineering (SWE-bench, kBench) and mathematical reasoning benchmarks, demonstrating strong gains in both sample efficiency and end-task accuracy (Sun et al., 14 Aug 2025, Ding et al., 18 Nov 2025, Xue et al., 2 Sep 2025).
  • Multi-Task, Multi-Domain: AgentRL (multi-task, multi-turn) achieves new performance records across five heterogeneous benchmarks, matching specialist models while significantly outperforming both prompt-only and prior agentic RL frameworks (Zhang et al., 5 Oct 2025).
  • Benchmarks: LMRL-Gym (Abdulhai et al., 2023) provides a standardized suite spanning RL capability tests (Maze, Chess, Wordle) and interactive dialogue (20Q, Guess My City, negotiation), revealing the broad applicability and challenge diversity of MT-RL.

6. Challenges, Limitations, and Design Considerations

MT-RL faces several open challenges:

  • Credit Assignment and Reward Sparsity: Dense, well-designed turn-level rewards are critical for PPO/GRPO stability, while unbiased estimators (e.g., RLOO) are more robust to moderate sparsity but generally less sample-efficient (Wang et al., 1 Oct 2025).
  • Distributional Drift and Stability: Distributional shift from tool/environment feedback frequently induces low-probability or OOD contexts, driving instability unless filtered or robustified by architectural/algorithmic adjustments (Xue et al., 2 Sep 2025).
  • Policy Generalization and Sample Efficiency: Hierarchical RL, multi-level advantage estimation, and replay buffer reuse (as in ArCHer) markedly improve generalization and enable scaling to larger model sizes and longer horizons (Zhou et al., 2024).
  • Reward Misalignment and Hacking: Unchecked accumulation of low-level rewards leads to policy collapse. Gated or prioritized reward accumulation aligns agent learning with true long-term objectives (Sun et al., 14 Aug 2025).
  • Scalability to Real-World Environments: Full asynchrony, containerization, and API-based environment stacks are now standard for robust, reproducible, large-scale MT-RL (Zhang et al., 5 Oct 2025).

7. Future Directions and Open Problems

Research directions include:

  • Richer Feedback and Hierarchical Credit: Integration of more nuanced intermediate rewards, meta-learned or model-based reward schemes, and hierarchical credit assignment across sub-task/action boundaries (Wei et al., 17 May 2025, Li et al., 18 Dec 2025).
  • Preference-Based Learning at Scale: Broader application of Nash-equilibrium, mirror-descent, and deep preference learning across dialogue, planning, and tool-use tasks (Shani et al., 2024).
  • Automated Reward Design: Automated discovery or learning of reward functions to facilitate porting MT-RL to new domains without extensive engineering (Wei et al., 17 May 2025, Ding et al., 18 Nov 2025).
  • Task and Policy Curriculum: Dynamic task weighting, multi-domain and multi-agent scaling to accelerate generalist policy learning (Zhang et al., 5 Oct 2025).
  • Robustness and Adversarial Safety: Safe exploration, red-teaming via agentic adversaries, and defense against multi-turn jailbreaks are emerging considerations for safe MT-RL deployment (Xiong et al., 8 Dec 2025).
  • Human-in-the-Loop Optimization: Scalable human preference collection and evaluation in more naturalistic and longer-horizon agent dialogues (Shani et al., 2024, Abdulhai et al., 2023).

The MT-RL paradigm now underpins a new generation of interactive, tool-using, decision-theoretic LLM agents—characterized by sophisticated planning, non-myopic optimization, and real-world adaptability—across ever-longer and more complex decision horizons.
