From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

Published 14 Apr 2026 in cs.CL | (2604.12385v1)

Abstract: Multi-turn dialogue is the predominant form of interaction with LLMs. While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.


Authors (8)

Summary

The paper introduces DialRouter, a long-horizon LLM routing paradigm that uses Monte Carlo Tree Search for expert trajectory generation and behavior cloning for policy learning.
It achieves a 5–6% improvement in dialogue success across diverse domains while reducing inference costs by up to 80% with minimal quality loss.
The framework balances current context with retrieval-based future state approximation, enabling cost-aware, sequential decision-making in multi-turn dialogues.

Long-Horizon LLM Routing for Multi-Turn Dialogue: DialRouter

Introduction and Motivation

LLMs are predominantly engaged through multi-turn dialogues, where user intent develops iteratively and queries are contextually linked. Classical LLM routing frameworks optimize for single-turn interactions, focusing on per-turn model selection based on immediate response quality. However, this myopic strategy neglects the long-term effects of model selection on cumulative dialogue outcomes, feedback loops, dialogue branching, and intent fulfillment dynamics. Existing approaches underperform in multi-turn contexts due to their inability to leverage delayed rewards and model the sequential interaction process, where each LLM selection can induce altered user behaviors and divergent dialogue trajectories.

Problem Formulation and Limitations of Myopic Routing

The multi-turn dialogue routing problem is formalized as a sequential decision process: at every turn, the router selects an LLM from a candidate pool so as to maximize a global reward reflecting total user intent fulfillment across the entire dialogue. The reward is defined by a checklist-based evaluation in which fulfillment combines binary and graded signals for nuanced progress tracking. In single-turn routing, optimality is local. In multi-turn dialogue, each selection affects both the current turn and all subsequent ones, because user inputs evolve with the context. Delayed feedback, dynamic state transitions, and cost accumulation compound the difficulty, so purely per-turn strategies are suboptimal over the full horizon.
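The gap between per-turn and full-horizon selection can be illustrated with a toy example. The sketch below is not from the paper: the two models, the two-turn horizon, and every number in `GAIN` are made-up assumptions, chosen only so that the model with the best immediate gain at turn 1 leads to a worse cumulative outcome.

```python
from itertools import product

MODELS = ("A", "B")  # hypothetical candidate pool

# Hypothetical per-turn checklist gains, keyed by the sequence of models
# chosen so far. The gain at a turn depends on earlier choices, which is
# what makes the problem sequential.
GAIN = {
    ("A",): 0.5, ("B",): 0.4,
    ("A", "A"): 0.1, ("A", "B"): 0.2,
    ("B", "A"): 0.5, ("B", "B"): 0.3,
}

def cumulative(seq):
    """Sum of per-turn gains along one routing sequence."""
    return sum(GAIN[seq[: t + 1]] for t in range(len(seq)))

def greedy(horizon):
    """Myopic routing: pick the model with the best immediate gain each turn."""
    seq = ()
    for _ in range(horizon):
        seq += (max(MODELS, key=lambda m: GAIN[seq + (m,)]),)
    return seq

def best(horizon):
    """Long-horizon routing by brute force over all sequences (toy scale only)."""
    return max(product(MODELS, repeat=horizon), key=cumulative)

print(greedy(2), round(cumulative(greedy(2)), 2))  # ('A', 'B') 0.7
print(best(2), round(cumulative(best(2)), 2))      # ('B', 'A') 0.9
```

Brute-force enumeration only works at this scale. With a realistic pool and horizon the number of sequences grows exponentially, which is why a search procedure is needed.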

DialRouter: Sequential LLM Routing via Long-Horizon Planning

To address these deficiencies, the paper introduces DialRouter, a long-horizon routing paradigm designed for multi-turn dialogue. DialRouter decomposes the problem into two phases: (1) generation of high-quality sequential LLM routing trajectories using Monte Carlo Tree Search (MCTS) within a simulated environment, and (2) learning a lightweight router capable of long-horizon model selection by behavior cloning from the MCTS-derived expert data, augmented by retrieval-based approximate future state modeling.

Search-Derived Routing Trajectories

DialRouter employs an LLM-based user simulator to drive the conversation realistically, capturing how LLM outputs affect next-turn user inputs and thus the evolving state space. MCTS systematically explores the dialogue branches induced by alternative LLM selections and estimates cumulative returns with the checklist-based reward. Each dialogue task is thus decomposed into state–action pairs reflecting the long-term impact of each selection, forming a dataset of search-derived expert trajectories.
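A minimal sketch of how such a search could be organized, using a generic UCT-style MCTS. This is an illustration under stated assumptions, not the paper's implementation: `simulate_return` is a toy stand-in for the user simulator plus checklist reward, and the model names, horizon, reward values, and exploration constant are all hypothetical.

```python
import math
import random

MODELS = ("small", "large")  # hypothetical candidate pool
HORIZON = 3                  # hypothetical number of dialogue turns

def simulate_return(actions):
    """Toy stand-in for the user simulator and checklist reward (assumption).
    A 'large' turn scores low immediately but raises the gain of later
    'small' turns, so the best sequence is not the greedy one."""
    total, clarified = 0.0, False
    for model in actions:
        if model == "large":
            clarified = True
            total += 0.2
        else:
            total += 0.4 if clarified else 0.25
    return total

class Node:
    def __init__(self, actions=()):
        self.actions = actions  # models chosen so far (routing prefix)
        self.children = {}      # model name -> Node
        self.visits = 0
        self.value = 0.0        # sum of returns backed up through this node

def uct_child(node, c=1.0):
    """Pick the child maximizing mean return plus an exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def mcts(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        node, path = root, [root]
        # Selection: descend while the node is fully expanded and non-terminal.
        while len(node.actions) < HORIZON and len(node.children) == len(MODELS):
            node = uct_child(node)
            path.append(node)
        # Expansion: add one untried model choice.
        if len(node.actions) < HORIZON:
            model = rng.choice([m for m in MODELS if m not in node.children])
            node.children[model] = Node(node.actions + (model,))
            node = node.children[model]
            path.append(node)
        # Rollout: complete the dialogue with random model choices.
        remaining = HORIZON - len(node.actions)
        rollout = node.actions + tuple(rng.choice(MODELS) for _ in range(remaining))
        ret = simulate_return(rollout)
        # Backup: propagate the cumulative return to every node on the path.
        for n in path:
            n.visits += 1
            n.value += ret
    return root

def best_trajectory(root):
    """Follow most-visited children; emit (routing prefix, chosen model) pairs."""
    node, pairs = root, []
    while node.children:
        model, child = max(node.children.items(), key=lambda kv: kv[1].visits)
        pairs.append((node.actions, model))
        node = child
    return pairs

root = mcts()
print(best_trajectory(root))
```

In this toy the state is just the tuple of earlier model choices. In a real system each node would hold the dialogue context produced by the simulator, and the pairs returned by `best_trajectory` would play the role of the state–action pairs in the expert dataset.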

Policy Learning with Future State Retrieval

The core challenge for long-term routing is modeling the consequences of current choices on future dialogue states. DialRouter proposes a retrieval-based future state approximation, whereby a fixed retriever locates semantically similar trajectories in the expert dataset to approximate future evolution given the current state. This mechanism, together with gated fusion of current and future state representations, enables the router to condition decisions not only on present context but also on potential (approximate) future states, thereby embedding long-horizon awareness into a lightweight routing policy.
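The retrieval and fusion steps can be sketched as follows. This is a hedged illustration: the summary does not specify the retriever, the aggregation of retrieved futures, or the form of the gate, so cosine similarity, top-k averaging, and a single scalar sigmoid gate are all assumptions, as are the names `retrieve_future` and `gated_fusion`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_future(state, memory, k=2):
    """memory holds (state_embedding, future_state_embedding) pairs taken from
    expert trajectories. Returns the mean future embedding of the k stored
    states most similar to `state` (averaging is an assumption)."""
    top = sorted(memory, key=lambda pair: cosine(state, pair[0]), reverse=True)[:k]
    dim = len(top[0][1])
    return [sum(future[i] for _, future in top) / len(top) for i in range(dim)]

def gated_fusion(current, future, w_gate, b_gate=0.0):
    """Scalar gate g = sigmoid(w . [current; future] + b).
    Output is g * current + (1 - g) * future (assumed form)."""
    z = sum(w * x for w, x in zip(w_gate, current + future)) + b_gate
    g = 1.0 / (1.0 + math.exp(-z))
    fused = [g * c + (1.0 - g) * f for c, f in zip(current, future)]
    return fused, g

# Toy 2-dimensional embeddings (hypothetical values).
memory = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0]),
]
state = [1.0, 0.0]
future = retrieve_future(state, memory, k=2)
fused, gate = gated_fusion(state, future, w_gate=[0.0, 0.0, 0.0, 0.0])
print(future, fused, gate)  # [0.0, 1.0] [0.5, 0.5] 0.5
```

With zero gate weights the gate sits at 0.5 and the two representations are simply averaged. A learned gate would shift the weight toward the current context when the retrieved futures are unreliable.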

Supervised by cross-entropy minimization between the learned policy and the MCTS-derived expert policy, DialRouter absorbs long-horizon information and generalizes this knowledge for efficient search-free routing at inference time.
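The cross-entropy objective can be made concrete with a small sketch. It assumes the expert policy is available as a probability distribution over candidate LLMs for each state (for example, normalized MCTS visit counts, which is an assumption here), and it updates raw logits directly rather than the parameters of a router network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(expert_probs, logits):
    """H(expert, policy) = -sum_a expert(a) * log policy(a)."""
    probs = softmax(logits)
    return -sum(p * math.log(q) for p, q in zip(expert_probs, probs) if p > 0)

def sgd_step(expert_probs, logits, lr=0.5):
    """One gradient step; d(loss)/d(logit_a) = policy(a) - expert(a)."""
    probs = softmax(logits)
    return [x - lr * (q - p) for x, p, q in zip(logits, expert_probs, probs)]

expert = [0.8, 0.2]   # hypothetical expert distribution over two LLMs
logits = [0.0, 0.0]   # router starts out indifferent
loss_before = cross_entropy(expert, logits)
logits = sgd_step(expert, logits)
loss_after = cross_entropy(expert, logits)
print(loss_before > loss_after)  # True
```

At inference time only the forward pass is needed: the router takes the argmax (or samples) from its softmax output, with no tree search.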

Integration of Cost-Aware Objectives

A critical extension is the explicit modeling of inference cost within the routing objective. Real-world deployments incur nontrivial invocation costs, particularly with closed-source LLMs that rely on caching mechanisms, because frequent model switching negates KV cache reuse. DialRouter’s extended reward function integrates both checklist-based performance metrics and monetary cost, weighted by a hyperparameter, thereby permitting Pareto-efficient trade-offs between answer quality and deployment cost.
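One plausible shape for such a reward is sketched below. The summary does not give the exact formula, so the linear form `score - lam * cost`, the cache discount, and all prices are assumptions made for illustration. The cost model bills context tokens at a discount only when the same model served the previous turn, which mirrors the cache-reuse point above.

```python
def turn_cost(model, prev_model, context_tokens, new_tokens, price, cached_discount=0.5):
    """Hypothetical per-turn cost. Context tokens are discounted only when the
    previous turn used the same model, so its prefix cache can be reused."""
    rate = price[model]
    context_rate = rate * cached_discount if model == prev_model else rate
    return context_tokens * context_rate + new_tokens * rate

def cost_aware_reward(checklist_score, total_cost, lam):
    """Assumed linear trade-off: quality minus lam times monetary cost."""
    return checklist_score - lam * total_cost

# Hypothetical per-token prices in dollars.
price = {"small": 1e-6, "large": 1e-5}

stay = turn_cost("large", "large", context_tokens=1000, new_tokens=100, price=price)
switch = turn_cost("large", "small", context_tokens=1000, new_tokens=100, price=price)
print(stay < switch)  # True: switching loses the cached-context discount

# Same checklist score, different cost: the cheaper routing earns more reward.
print(cost_aware_reward(0.8, stay, lam=10.0) > cost_aware_reward(0.8, switch, lam=10.0))  # True
```

Setting `lam` to zero recovers the pure quality objective. Larger values push the router toward cheaper models and fewer switches.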

Empirical Evaluation

Performance across Domains and Model Sets

Extensive experiments are conducted across three multi-turn dialogue datasets: ShareGPT (open-domain), JDDC (e-commerce service), and MedDG (medical consultation), each annotated with fine-grained checklists. DialRouter is benchmarked against the strongest single LLMs, several routing baselines (KNN Router, RouterDC, Avengers, and Matrix Factorization approaches), and simulation-based oracles (Greedy and MCTS Routers).

Key empirical findings include:

DialRouter achieves average success rate improvements of 5–6% over the strongest single model across all candidate sets and domains.
It consistently outperforms the best baseline routers by over 5%, demonstrating the necessity and benefit of long-horizon planning in multi-turn settings.
DialRouter exhibits robust cross-domain generalization, retaining clear margins over baselines on out-of-domain legal/financial tasks.
On mixed LLM pools, DialRouter effectively exploits model heterogeneity, achieving higher robustness and performance.

Cost-Performance Trade-Offs

With the cost-aware reward, DialRouter reduces total inference cost by up to 80% with only marginal performance degradation (about 1.4 points in task success). Baseline routers fail to achieve this trade-off because their policies collapse toward a single low-cost model. The cost-control mechanism is especially effective in candidate sets with large price disparities, where simplistic ratio-based or greedy reward formulations fail to balance cost and quality.

Ablation Studies

Long-horizon expert data (MCTS) is critical: omitting it lowers task success by over 3.5%.
Removal or randomization of future state retrieval likewise leads to pronounced performance drops (3–4%).
The adaptive gating mechanism for fusing current and retrieval-based future states yields more robust routing than naïve fusion approaches.

Theoretical and Practical Implications

The transition from myopic to long-horizon routing reframes multi-turn LLM orchestration as a sequential, non-Markovian planning problem. This paradigm (a) rigorously accounts for the global effects of sequential routing and (b) allows search-based knowledge distillation into lightweight, inference-efficient policies. The integration of cost into the reward function further aligns routing with real-world deployment constraints, offering a blueprint for scalable, high-utility LLM system coordination.

Practically, DialRouter enables service providers to harmonize heterogeneous LLM deployments, facilitating dynamic model selection that is both quality-driven and cost-sensitive. Applications encompassing customer service, clinical decision support, education, and agentic systems stand to benefit from improved multi-turn task completion, user alignment, and resource efficiency.

Future Directions

Future research could extend DialRouter via:

Incorporation of more sophisticated user simulation or reward modeling (e.g., preference learning from human feedback).
Joint model and prompt routing to further optimize system responses.
Online adaptation of routing policies for nonstationary user behavior or model drift.
Scaling to larger LLM pools, more diverse cost models, and settings with additional constraints (latency, policy compliance).

Conclusion

DialRouter establishes a new methodological foundation for LLM routing in multi-turn dialogue, leveraging long-horizon search, future state approximation, and cost-awareness to maximize cumulative dialogue success. It demonstrates marked improvements over both fixed and dynamically routed baselines in challenging domains, substantiating the necessity of sequential long-range planning in LLM orchestration under realistic cost constraints.

