- The paper introduces a novel framework that leverages collaborative self-play and Reinforced Self-Training to learn cost-sensitive clarification strategies.
- The methodology employs multi-agent self-play with action clustering to balance clarification, multi-answering, and direct guessing based on numerical cost coefficients.
- Experimental results on the AmbigQA and Pacific benchmarks demonstrate that the steerable policies generalize to unseen coefficients, yielding improved accuracy and cost rewards.
Learning Steerable Clarification Policies with Collaborative Self-play
Motivation and Problem Statement
Natural language user queries to AI assistants frequently exhibit varying degrees of underspecification and ambiguity, often due to ellipsis, domain assumptions, or broad intent. AI systems must implement effective clarification policies to maximize communication efficiency without sacrificing interpretive accuracy. Traditional approaches generally enforce uniform clarification heuristics—i.e., always ask a clarifying question when ambiguity is detected—ignoring context-driven variability such as user modality preferences or the cost-benefit tradeoff of additional dialog turns and output verbosity.
This work addresses the challenge of enabling dynamic, steerable clarification policies whereby the assistant modifies its behavior contingent on explicit, user- or context-conditioned cost coefficients. These coefficients numerically specify the penalty incurred for extra conversational turns or for producing lengthy responses. The aim is to maximize a cost-sensitive accuracy reward, incorporating factors like conversation length and answer succinctness, as parameterized by user-supplied α,β.
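One way such coefficients could be surfaced to the model is simply as numbers in the prompt. The sketch below illustrates the idea; the field names and layout are illustrative assumptions, not the paper's actual prompt template.

```python
def build_prompt(query: str, history: list[str], alpha: float, beta: float) -> str:
    """Condition the assistant on explicit cost coefficients.

    The field names below are illustrative assumptions, not the
    paper's actual prompt template.
    """
    lines = [
        f"turn_cost_alpha: {alpha}",   # penalty per clarification turn
        f"length_cost_beta: {beta}",   # penalty per answer token
        f"query: {query}",
    ]
    lines += [f"history: {h}" for h in history]
    lines.append("action:")  # model emits CLARIFY / ANSWER / MULTI_ANSWER
    return "\n".join(lines)
```

Because the coefficients are ordinary prompt text, steering a trained policy requires no architectural change, only a different α,β at inference time.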
Figure 1: Illustration of steerable policy: the assistant adapts its response strategy—clarification, multi-answer, or direct educated guess—depending on given cost coefficients α and β.
Method: Collaborative Self-Play and Reinforced Self-Training
The core training framework leverages multi-agent self-play, wherein an assistant and a simulated user jointly construct QA rollouts. The environment injects ambiguity by supplying both the ambiguous query q and a sampled ground-truth interpretation i, only accessible to the user. The assistant, informed of the current dialog history, any relevant non-parametric context (e.g., tables), and the given α,β coefficients, chooses one of three actions: ask a clarification question, respond directly, or enumerate multiple plausible answers.
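The episode structure described above can be sketched as follows. The `assistant_policy` and `user_sim` callables stand in for LM calls (in the paper a single LM plays both roles); the sampled intent is visible only to the user side. This is a schematic reading, not the paper's exact environment.

```python
import random

ACTIONS = ("CLARIFY", "ANSWER", "MULTI_ANSWER")

def rollout(query, interpretations, assistant_policy, user_sim, max_turns=4):
    """One collaborative self-play episode (schematic).

    The environment samples a ground-truth interpretation that is
    hidden from the assistant; any non-CLARIFY action ends the episode.
    """
    intent = random.choice(interpretations)   # accessible only to the user
    history, actions = [query], []
    for _ in range(max_turns):
        action, utterance = assistant_policy(history)
        actions.append(action)
        history.append(utterance)
        if action != "CLARIFY":               # answering ends the episode
            return actions, history, intent
        history.append(user_sim(intent, utterance))  # user reply grounded in intent
    return actions, history, intent
```

The returned action sequence is what the ReST selection step later clusters on, and the final utterance is scored against the gold answer for the sampled intent.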
Figure 2: Example rollout showing the interactive exchange, from ambiguous query to clarification, multi-answer, and final resolution, given the user's true intent.
To drive learning, Reinforced Self-Training (ReST) is deployed: rollouts are sampled from the current policy, scored with the episode reward, and the highest-yield trajectories are selected for iterative finetuning, avoiding much of the variance and instability of fully online RL. Rather than naively maximizing per-rollout reward, the algorithm clusters rollouts by action sequence and cost coefficients, then identifies the action pattern with the highest expected reward across interpretations. This ties each coefficient setting to a robustly optimal disambiguation strategy rather than to lucky individual rollouts.
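A minimal sketch of that selection step, under the assumption that each rollout carries its coefficients, action sequence, and scalar reward (a schematic reconstruction, not the paper's exact algorithm):

```python
from collections import defaultdict
from statistics import mean

def select_for_finetuning(rollouts):
    """Keep rollouts whose action pattern has the best expected reward.

    `rollouts` is a list of dicts with keys: 'coeffs' (alpha, beta),
    'actions' (tuple of action names), 'reward' (float). Schematic
    reconstruction of the ReST selection step.
    """
    by_pattern = defaultdict(list)
    for r in rollouts:
        by_pattern[(r["coeffs"], r["actions"])].append(r)
    best = {}  # winning action pattern per coefficient setting
    for (coeffs, actions), group in by_pattern.items():
        score = mean(r["reward"] for r in group)  # expectation over interpretations
        if coeffs not in best or score > best[coeffs][0]:
            best[coeffs] = (score, actions)
    # retain only rollouts that follow the winning pattern for their coeffs
    return [r for r in rollouts if r["actions"] == best[r["coeffs"]][1]]
```

Averaging within a pattern before selecting is what prevents the policy from imitating a single high-variance rollout whose action sequence happened to match a lucky interpretation.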
The reward function formalizes the cost-accuracy tradeoff:

R(ρ; a*) = acc(a, a*) − α · n_clar(ρ) − β · |o_{T−1}|

where acc(a, a*) is the token-level F1 of the final answer a with respect to the gold answer a*, n_clar(ρ) is the number of clarification questions in rollout ρ, and |o_{T−1}| is the length of the final answer o_{T−1}.
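The reward can be computed directly from a finished rollout. A minimal implementation, assuming whitespace tokenization and answer length measured in tokens (both assumptions, as the paper's tokenization is not specified here):

```python
def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (whitespace tokens)."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return float(p == g)
    gold_counts = {}
    for t in g:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in p:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def reward(final_answer, gold, n_clarifications, alpha, beta):
    """R = acc(a, a*) - alpha * n_clar - beta * |o_{T-1}| (length in tokens)."""
    return (token_f1(final_answer, gold)
            - alpha * n_clarifications
            - beta * len(final_answer.split()))
```

Note how α prices each extra turn and β prices each token of the final answer, so the same exchange can score positively or negatively depending on the coefficients alone.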
Importantly, the assistant can be exposed to non-parametric context—such as private tables—to more accurately assess ambiguity grounded in external data. This creates asymmetric knowledge conditions and further tests the steerability of the learned policy.
Figure 3: Assistant equipped with private context necessary for ambiguity detection, while the user remains unaware of the underlying data.
Experimental Design
Evaluation is performed with the Gemma 2 9B model on two benchmarks:
- AmbigQA: Parametric open-domain QA with annotation of ambiguous queries and multiple interpretations. Filtering ensures alignment between model and human annotation. Dataset includes 1,776 train and 382 dev instances, average 2.4–2.7 interpretations per query.
- Pacific: Table- and document-grounded QA in finance, supporting both symmetric and asymmetric knowledge regimes and complex conversational phenomena. 3,744 train and 640 dev samples.
Baselines comprise direct prompting (with and without chain-of-thought reasoning), constrained one-shot strategies (always answer, always clarify, always use multi-answers, or clarify-and-multi-answer), and an oracle SGP upper bound. SGP models are tested for both in-distribution and out-of-distribution cost coefficient generalization.
Results and Analysis
SGP outperforms all baselines in both AmbigQA and Pacific. In AmbigQA, SGP achieves average ambiguous query token F1 of 19.94 and reward 6.63, compared to the best baseline F1 of 16.94 and reward of only -9.59. For unambiguous queries, SGP fares even better with F1 41.28 and reward 31.36. SGP reduces clarification and multi-answer overuse compared to direct prompting, producing strategies that optimally balance cost and accuracy according to α,β input.
In Pacific, SGP achieves ambiguous F1 65.73 and reward 54.32, unambiguous F1 79.52 and reward 65.70, outperforming static and prompted approaches by wide margins.
A critical result is generalization to unseen coefficients: SGP-trained models adjust their clarification and multi-answer frequencies monotonically, and in the correct direction, even for coefficient values absent from training, confirming that the learned policy steerability is continuous and robust rather than memorized at the training values.
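The steering effect can be pictured as a strategy switch: under the paper's reward, sweeping α (and β) changes which action pattern has the highest expected reward. The numbers below are made up for illustration, not measured values from the paper.

```python
def best_strategy(alpha, beta=0.0):
    """Toy illustration of cost-driven strategy switching.

    Accuracy, turn, and length figures are invented for the example.
    """
    strategies = {
        # name: (expected accuracy, n_clarifications, answer length in tokens)
        "clarify_then_answer": (0.9, 1, 5),
        "multi_answer":        (0.7, 0, 20),
        "direct_guess":        (0.5, 0, 5),
    }
    scores = {name: acc - alpha * n - beta * length
              for name, (acc, n, length) in strategies.items()}
    return max(scores, key=scores.get)
```

As α grows, clarification stops paying for itself and the optimum shifts to multi-answering; once β also penalizes long outputs, a direct educated guess wins. This is exactly the monotone sensitivity the trained policies are reported to exhibit.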
Empirical analysis further demonstrates that chain-of-thought prompting only marginally improves coefficient sensitivity unless specialized finetuning is applied. SGP policies can select better clarification questions and multi-answer expansions, as illustrated by head-to-head comparisons.
The training procedure exhibits monotonically increasing performance across ReST epochs, with observable stability in generalization and diminishing returns after several iterations.
Practical and Theoretical Implications
This work introduces direct algorithmic support for steerability in dialog grounding—flexible control over the assistant's disambiguation strategy via numerical coefficients in prompts. The results show that large LMs, when trained under collaborative self-play with explicit reward/cost modeling, can learn to reason about cost-benefit profiles of clarification, multi-answering, and guessing, dynamically adjusting their conversational policies to maximize user-aligned utility.
This mechanism is highly generalizable. In addition to clarification, steerable policies can be extended to control tool use, resource consumption, personalization, or output style, simply by defining appropriate cost coefficients. This provides a unified framework for fine-grained, context-sensitive LM alignment and adaptation.
Theoretically, this offers a pragmatic approach to reward-model conditioning in reinforcement learning for dialog; by leveraging action sequence clustering and episode-level reward aggregation, SGP achieves Pareto improvements without the instability common in RL finetuning of LMs.
Future Directions
Several avenues for extension emerge:
- Preference Inference: Integrate automatic estimation of α,β coefficients via online user modeling, enabling end-to-end adaptation to individual preferences and environmental constraints.
- Fine-Grained Grounding Acts: Expand the action repertoire beyond clarification, multi-answer, and guessing—incorporating succinct low-cost grounding acts, conversational acknowledgments, or nonverbal signals.
- Multi-Objective Steerability: Apply steerable policies for multi-task, multi-modal, or multi-user settings, exploiting the demonstrated generalization properties.
- Theory-of-Mind Integration: Couple steerable policies with explicit inference over user state to optimize collaborative efficiency and practical utility in real-world agentic contexts.
Conclusion
This work establishes that dynamic, steerable clarification policies can be efficiently acquired by LLMs via collaborative self-play and Reinforced Self-Training. The proposed approach reliably induces sensitivity to user- or context-provided numerical cost coefficients, leading to optimal, adaptive dialog strategies. The framework is robust to unseen steering coefficients and marks a significant step towards contextually intelligent grounding in AI assistants (2512.04068).