
Are Large Reasoning Models Interruptible?

Published 13 Oct 2025 in cs.CL and cs.LG | arXiv:2510.11713v2

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.

Summary

  • The paper introduces a benchmark suite to test LRMs’ ability to handle time-constrained interruptions and update-driven changes mid-reasoning.
  • It reveals that early interruptions can lead to reasoning leakage and that late updates significantly decrease performance due to integration challenges.
  • It finds that larger models are more robust under interruptions compared to smaller ones, underscoring scalability and reliability differences.

Are Large Reasoning Models Interruptible?

The paper "Are Large Reasoning Models Interruptible?" (arXiv:2510.11713) evaluates Large Reasoning Models (LRMs), such as Qwen3 and GPT-OSS, in dynamic scenarios that go beyond typical static evaluation frameworks. The goal is to assess how LRMs handle interruptions and mid-reasoning context updates, challenging the assumption that the environment stays frozen while the model computes.

Evaluation of Dynamic Contexts

Interruptible Reasoning

The assumption that models operate in a frozen context is scrutinized through experimental setups that involve interruption and dynamic context assessments. Two core scenarios are examined:

  1. Time-Constrained Interruptions: Models are interrupted mid-reasoning and must either produce an answer immediately from their partial output or conclude within a sharply reduced budget.
  2. Update-Driven Interruptions: Models must adapt to changes in task specifications provided mid-inference.

Methodology

The paper introduces a new benchmark suite that assesses model behavior on mathematical and coding tasks by halting reasoning at varying points and applying interruptions under both time constraints and context updates.

Mathematical tasks include datasets like GSM8K, MATH500, and AIME, while programming evaluations use LiveCodeBench. Key metrics involve evaluating models' ability to adapt and produce correct responses when interrupted or updated mid-inference.
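The interruption protocol can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the paper's actual code: `fake_model` is a toy stand-in for a real LRM inference call, and the interrupt suffix wording is an assumption.

```python
# Hypothetical sketch of a time-constrained interruption harness:
# let the model reason up to a token budget, then force a final answer.
# `fake_model` is a purely illustrative stand-in for a real LRM API.

def fake_model(prompt: str, max_tokens: int) -> str:
    """Toy stand-in: emits one 'step' per token, or answers when interrupted."""
    if "Time is up" in prompt:
        return "42"
    return " ".join(["step"] * max_tokens)

INTERRUPT_SUFFIX = "\n\nTime is up. Give your final answer now."

def run_with_interruption(model, problem: str, budget: int) -> str:
    partial = model(problem, max_tokens=budget)           # truncated reasoning
    forced = problem + "\n" + partial + INTERRUPT_SUFFIX  # inject interrupt
    return model(forced, max_tokens=32)                   # forced final answer

print(run_with_interruption(fake_model, "What is 6*7?", budget=8))
```

Varying `budget` across many problems yields the accuracy-versus-budget measurements the paper reports.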

Findings

Time-Constrained Interruptions

LRMs behave like anytime algorithms: answer quality improves progressively as the reasoning budget grows. However, early interruptions can trigger reasoning leakage, where the model continues its chain of thought inside the final-answer space instead of committing to an answer. These results highlight the gap between performance in static settings and actual behavior when interruptions cut into long-running reasoning.
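The anytime property can be made concrete by plotting accuracy against the reasoning budget. The sketch below uses a toy threshold model (an assumption, not the paper's setup) purely to show the shape of such a curve.

```python
# Toy sketch of an anytime evaluation curve: accuracy measured at
# increasing reasoning budgets. The threshold "model" is illustrative only.

def solve_at_budget(problem: dict, budget: int) -> bool:
    """Toy model: succeeds once the budget covers the problem's difficulty."""
    return budget >= problem["difficulty"]

def anytime_curve(problems: list, budgets: list) -> dict:
    """Fraction of problems solved at each reasoning budget."""
    return {
        b: sum(solve_at_budget(p, b) for p in problems) / len(problems)
        for b in budgets
    }

probs = [{"difficulty": d} for d in (2, 4, 8)]
print(anytime_curve(probs, [1, 4, 16]))  # accuracy rises with budget
```

A real evaluation would replace `solve_at_budget` with the interruption harness run at each budget, but the monotone-improvement pattern is the signature of anytime behavior.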

Update-Driven Interruptions

Performance drops sharply when models must incorporate updates into their reasoning, and the later an update arrives, the larger the drop. Models struggle to integrate the new information, often exhibiting self-doubt: they either fail to fold the update into their reasoning or disregard it entirely. Prompt guidance alleviates this only partially.
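Splicing an update into the reasoning trace at a controlled depth might look like the following sketch. The update template and function names are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an update-driven interruption: the task changes
# mid-reasoning, and the update is spliced into the trace at a chosen depth.

UPDATE_TEMPLATE = (
    "\n\n[UPDATE] The problem has changed: {new_task}\n"
    "Revise your reasoning to account for this update.\n"
)

def inject_update(reasoning_tokens: list, new_task: str,
                  position: float) -> str:
    """Splice an update at a fractional position (0.0 = early, 1.0 = late)."""
    cut = int(len(reasoning_tokens) * position)
    kept = " ".join(reasoning_tokens[:cut])  # reasoning before the update
    return kept + UPDATE_TEMPLATE.format(new_task=new_task)

trace = "compute the sum of the first ten primes".split()
print(inject_update(trace, "use the first twelve primes", position=0.5))
```

Sweeping `position` from early to late is what exposes the finding above: the later the splice, the more accumulated reasoning the model must revise, and the steeper the degradation.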

Scalability and Robustness

Model scale influences robustness under interruption: larger models tend to handle mid-inference changes and interruptions more gracefully, while smaller models degrade noticeably, suggesting that robustness to dynamic conditions scales with capacity.

Conclusion

The paper challenges the assumption that reasoning tasks unfold in static environments, showing that LRMs are vulnerable to interruptions and in-flight context changes. It documents several novel failure modes, including reasoning leakage, panic, and self-doubt, and motivates future work on models that remain robust under the dynamic, interactive conditions that conventional static evaluations ignore.
