Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Published 29 Jan 2026 in cs.CL and cs.AI | (2601.22139v1)

Abstract: Reasoning-oriented LLMs have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1


Summary

The paper proposes the PIR framework that transforms passive LLM solvers into proactive inquirers by identifying uncertainty and initiating targeted clarification.
It employs a two-phase method—Interactive Capability Activation and User-Intent Alignment with reinforcement learning—to optimize reasoning and reduce unnecessary computation.
Experimental results across benchmarks demonstrate significant accuracy gains and improved efficiency, reducing token usage and interaction turns compared to conventional models.

Proactive Interactive Reasoning: Transforming LLMs from Passive Solvers to Active Inquirers

Contemporary reasoning-oriented LLMs employing CoT prompting, such as GPT-o1 and DeepSeek-R1, exhibit significant progress in explicit stepwise reasoning but are critically impaired by blind self-thinking: they perform exhaustive internal reasoning without accounting for ambiguity or information gaps in user queries. This leads to phenomena such as overthinking, hallucinations, and misaligned conclusions, requiring users to provide iterative corrective feedback—reducing efficiency and user satisfaction.

The PIR paradigm introduced in this paper addresses this deficiency by recasting reasoning LLMs as proactive inquirers that strategically interleave reasoning with clarification. Unlike search- or tool-based frameworks, which query external environments to resolve knowledge uncertainty, PIR targets premise- and intent-level uncertainty through direct interaction with the user; during training, a user simulator stands in for the user.

Figure 1: PIR paradigm schematic contrasting blind self-thinking (inefficient, uninformed reasoning) with the PIR approach, which incorporates proactive clarification leveraging uncertainty detection and targeted user simulator interaction.

PIR Framework: Architecture and Phases

The PIR framework consists of two main phases: Interactive Capability Activation and User-Intent Alignment.

Interactive Capability Activation: This stage leverages uncertainty-aware supervised fine-tuning. Reasoning trajectories from a frozen teacher model (e.g., DeepSeek-R1) are segmented and evaluated for prediction entropy (PE) at each reasoning step. High-entropy regions signify decision points with high uncertainty, at which the dataset is augmented with clarification questions and simulated user responses. Training on these "think-ask-respond" chains enables the model to learn when and how to initiate clarifications, activating proactive interaction capability.

User-Intent Alignment: To optimize interactive reasoning beyond cold-start, PIR integrates a reinforcement learning pipeline, US-GRPO (User Simulator-GRPO). A dynamic, instruction-following LLM user simulator is constructed for multi-turn rollouts. Composite rewards explicitly evaluate correctness (extrinsic) and reasoning trajectory (intrinsic: helpfulness, efficiency), guiding policy optimization towards strategies that actively resolve intent ambiguity with minimal unnecessary interaction.

Figure 2: PIR framework operations: transition from passive solver to active inquirer using two-phase optimization and user simulator interaction.
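
The User-Intent Alignment phase described above combines multi-turn rollouts against a user simulator with GRPO-style policy optimization. The sketch below illustrates those two ingredients only: a rollout loop and the group-relative advantage obtained by normalizing rewards within a group of rollouts. The `toy_policy` and `toy_simulator` callables, the turn budget, and the example rewards are hypothetical stand-ins, not the paper's components.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# Hypothetical interfaces: a policy maps the dialogue history to
# ("ask", question) or ("answer", text); a simulator maps a question to a reply.
Policy = Callable[[List[str]], Tuple[str, str]]
Simulator = Callable[[str], str]


def rollout(policy: Policy, simulator: Simulator, query: str,
            max_turns: int = 4) -> Tuple[str, int]:
    """Run one episode; return (final answer, number of clarification turns)."""
    history = [query]
    asks = 0
    for _ in range(max_turns):
        kind, text = policy(history)
        if kind == "answer":
            return text, asks
        asks += 1
        history += [text, simulator(text)]  # question, then simulated reply
    return "", asks  # turn budget exhausted without an answer


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_policy(history: List[str]) -> Tuple[str, str]:
    # Assumption: ask exactly once before answering.
    if len(history) == 1:
        return ("ask", "Do you want the result in meters or feet?")
    return ("answer", "42 meters")


def toy_simulator(question: str) -> str:
    return "Meters, please."


answer, n_asks = rollout(toy_policy, toy_simulator, "How tall is the tower?")

# Illustrative rewards for a group of four rollouts of the same query.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to rank them.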

Experimental Results: Efficiency and Accuracy

On the MATH-Chat, BigCodeBench-Chat, and DocEdit-Chat benchmarks, PIR models outperform the multi-turn baselines, including instruction-tuned models, proactive prompting, STaR-GATE, CollabLLM, and non-interactive reasoning LLMs.

Dataset	Metric	Best Baseline	PIR (US-GRPO)	Absolute Gain (points)
MATH-Chat	Accuracy	21.30	32.70	+11.40
BigCodeBench-Chat	Pass Rate	19.70	22.90	+3.20
DocEdit-Chat	BLEU	28.00	41.36	+13.36
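
The last column of the table is the difference between the two score columns. As a quick check, the sketch below recomputes that difference from the table values and also derives the corresponding relative improvement, which the summary does not report. The helper name `gains` is illustrative.

```python
# Scores copied from the table: (best baseline, PIR with US-GRPO).
rows = {
    "MATH-Chat": (21.30, 32.70),
    "BigCodeBench-Chat": (19.70, 22.90),
    "DocEdit-Chat": (28.00, 41.36),
}


def gains(baseline: float, pir: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = pir - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 1)


summary = {name: gains(b, p) for name, (b, p) in rows.items()}

for name, (absolute, relative_pct) in summary.items():
    print(f"{name}: +{absolute:.2f} points, +{relative_pct:.1f}% relative")
```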

Notable findings:

PIR maintains low token usage (1.3–1.7k tokens per session) and needs about half the interaction turns of non-interactive baselines.
Reinforcement alignment via US-GRPO is essential; models trained only with active SFT degrade in interactive settings.
PIR models generate clarification questions with consistently higher helpfulness metrics (0.44–0.66 vs. <0.42 for baselines).

Figure 3: Comparative performance of PIR models with different reward modeling/user simulators in US-GRPO on MATH-Chat.

Reliability, Robustness, and Generalization

PIR generalizes to non-interactive tasks such as factual knowledge (MMLU/MMLU-Pro), question answering (TriviaQA/SQuAD), and missing premise scenarios. On pure knowledge tasks, PIR abstains from unnecessary interaction but delivers decisive gains when ambiguity is present.

For Missing Premise testing (e.g., MIP-GSM8K, MIP-MATH), PIR maintains high accuracy (up to 25.00 on MIP-MATH) with reduced computation, whereas conventional LLMs show degraded performance due to overconfidence and ungrounded reasoning.

Figure 4: Mathematical reasoning PIR case study, showing targeted clarification followed by accurate reasoning and minimal compute.

Mechanistic Insights: Uncertainty-Driven Interactivity

Analysis of PIR SFT reveals that the model's querying behavior is tightly correlated with its internal predictive entropy (PE). As the training set grows, the PE distribution of trigger sentences (the sentences that precede a clarification question) becomes elevated and stable, indicating that the model learns to initiate clarification precisely where its uncertainty is high. Template accuracy for generating structured "think-ask-respond" chains converges rapidly as dataset size increases.

Figure 5: Uncertainty analysis: distribution and convergence of predictive entropy and template correctness at different dataset sizes.
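
To make the entropy signal concrete, the following sketch scores each reasoning step by the mean Shannon entropy of its next-token distributions and flags steps above a threshold as candidate points for a clarification question. The summary does not state how PE is aggregated over a step or how the cutoff is chosen, so the mean-over-tokens aggregation, the fixed threshold, and the toy four-token vocabulary are all assumptions.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_token_dists: Sequence[Sequence[float]]) -> float:
    """Mean token entropy over the tokens of one reasoning step (assumed aggregation)."""
    if not step_token_dists:
        return 0.0
    total = sum(token_entropy(d) for d in step_token_dists)
    return total / len(step_token_dists)


def high_entropy_steps(trajectory: Sequence[Sequence[Sequence[float]]],
                       threshold: float) -> List[int]:
    """Indices of steps whose mean entropy exceeds the (assumed) threshold."""
    return [i for i, step in enumerate(trajectory)
            if step_entropy(step) > threshold]


# Toy trajectory: three steps, each a list of distributions over 4 tokens.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
trajectory = [
    [confident, confident],
    [uncertain, confident],  # one highly uncertain token raises the step mean
    [confident],
]

candidates = high_entropy_steps(trajectory, threshold=0.5)
print(candidates)
```

In a data-construction pipeline of the kind described for the SFT stage, the flagged indices would mark where a clarification question and a simulated user reply are spliced into the teacher trajectory.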

Reward Design and Ablation Analysis

Composite reward modeling proves essential. Ablations reveal the following:

Removing reasoning-aware rewards increases interaction turns and destabilizes learning (reward hacking risk).
Helpfulness-only rewards elevate accuracy but promote over-asking.
Efficiency-only rewards minimize tokens but degrade accuracy through premature reasoning commitment.
Fully inclusive composite rewards yield optimal interaction balance and performance.

Figure 6: Training dynamics: effect of excluding reward components on PIR framework learning curves.
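
The ablations above amount to switching individual reward terms off. A minimal sketch of such a composite reward follows, assuming a weighted sum of a correctness term, a helpfulness score in [0, 1], and an efficiency term derived from token usage. The weights, the linear efficiency formula, and the token budget are illustrative assumptions; the summary does not give the paper's actual functional form.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Assumed weights; setting one to 0.0 mimics the corresponding ablation.
    correctness: float = 1.0
    helpfulness: float = 0.5
    efficiency: float = 0.5


def composite_reward(correct: bool, helpfulness: float, tokens_used: int,
                     token_budget: int, w: RewardWeights) -> float:
    """Weighted sum of extrinsic (correctness) and intrinsic reward terms."""
    # Assumed efficiency term: fraction of the token budget left unused.
    efficiency = max(0.0, 1.0 - tokens_used / token_budget)
    return (w.correctness * float(correct)
            + w.helpfulness * helpfulness
            + w.efficiency * efficiency)


# One hypothetical episode scored under three reward configurations.
episode = dict(correct=True, helpfulness=0.6, tokens_used=1500, token_budget=3000)

full = composite_reward(**episode, w=RewardWeights())
no_efficiency = composite_reward(**episode, w=RewardWeights(efficiency=0.0))
helpfulness_only = composite_reward(
    **episode, w=RewardWeights(correctness=0.0, helpfulness=1.0, efficiency=0.0))
```

Under the helpfulness-only configuration the episode's score no longer depends on token usage or on the final answer, which is consistent with the over-asking behavior reported for that ablation.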

Case Studies: Interactive Efficiency in Mathematical and Code Domains

Concrete cases demonstrate how PIR proactively clarifies intent prior to reasoning, preventing misaligned conclusions and unnecessary computation with minimal turns.

Figure 7: PIR vs. blind reasoning in math task—PIR queries for missing premise and produces a concise, correct solution.

Figure 8: PIR handling code generation—proactively inquiring about user preferences, yielding modular, intent-aligned code and efficient interaction.

Practical and Theoretical Implications

The PIR approach establishes a robust mechanism for intent-aligned, ambiguity-aware reasoning in LLMs. By integrating policy optimization with user simulator-driven rewards and leveraging uncertainty-aware trajectory augmentation, models are endowed with strategic clarification capabilities, directly addressing the limitations of conventional CoT-based systems. This paradigm supports the development of next-generation LLMs capable of adaptive, efficient, and human-centric problem solving across open-ended domains.

Potential future directions include extending user simulation diversity, direct deployment in live settings, and incorporating explicit safety alignment for sensitive topics.

Conclusion

The PIR framework advances the state-of-the-art in reasoning LLMs by transforming passive solvers into proactive inquirers capable of interactive, uncertainty-resolving clarification. Through mechanism-driven reward modeling and rigorous evaluation, PIR models achieve strong gains in accuracy, compute efficiency, and generalization, establishing a scalable foundation for reliable, intent-aligned AI reasoning systems (2601.22139).