- The paper demonstrates that as LRMs scale their reasoning capacity, their ability to follow explicit user instructions declines, with the best model achieving only 50.71% hard accuracy.
- The study introduces the MathIF benchmark, comprising 420 Python-verifiable tasks with multi-layered constraints designed to rigorously test instruction adherence in mathematical reasoning.
- Empirical analysis reveals an inverse correlation between extended chain-of-thought reasoning and obedience, underscoring a critical trade-off for deploying safe, high-performing LRMs.
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
The paper "Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models" (2505.14810) rigorously interrogates the interplay between reasoning capacity and controllability in Large Reasoning Models (LRMs). While the progression of LRMs has yielded remarkable advances in mathematical problem-solving—including formal theorem proving and olympiad-level tasks—their capability to follow user-specified, natural language instructions in contextually rich mathematical environments remains largely unmeasured. The central argument is that as models scale in reasoning prowess via chain-of-thought (CoT) expansion and sophisticated training (e.g., RL-based reward optimization), they tend to deprecate fine-grained control expressed via explicit user instructions. This tension is critically important for alignment and safe deployment in real-world, high-stakes applications where strict adherence to user constraints is paramount.
The MathIF Benchmark: Design and Methodology
To systematically expose and assess this tension, the authors introduce MathIF, a benchmark comprising 420 programmatically generated, Python-verifiable math reasoning tasks, each instantiated with up to three simultaneously active user constraints. These constraints span four categories: length, lexical (including language and keyword requirements), format (e.g., punctuation, case, section demarcation), and affix (prefix/suffix enforcement).
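To make the four constraint categories concrete, here is a minimal sketch of what Python-verifiable checkers of this kind can look like. The function names, signatures, and specific rules below are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical constraint checkers in the spirit of MathIF's four
# categories; MathIF's real checker names and rules may differ.

def check_length(response: str, max_words: int) -> bool:
    """Length constraint: response must stay within a word budget."""
    return len(response.split()) <= max_words

def check_keyword(response: str, keyword: str) -> bool:
    """Lexical constraint: a required keyword must appear."""
    return keyword.lower() in response.lower()

def check_no_commas(response: str) -> bool:
    """Format constraint: a punctuation restriction (no commas)."""
    return "," not in response

def check_suffix(response: str, suffix: str) -> bool:
    """Affix constraint: response must end with a fixed suffix."""
    return response.strip().endswith(suffix)

def verify(response: str, constraints) -> list:
    """Apply each (checker, args) pair; return per-constraint booleans."""
    return [fn(response, *args) for fn, args in constraints]
```

The key design point such checkers illustrate is that every constraint is mechanically decidable, so adherence can be scored without human judgment or an LLM grader.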
MathIF is distinct from prior instruction-following benchmarks in two key respects:
- It is domain-specific, tailored for mathematical reasoning rather than general-purpose conversation, thereby minimizing confounds due to domain shift or lack of complex CoT in existing datasets.
- It scales instruction complexity via compositional constraint injection, emulating real-world multi-factor directives.
Two hierarchical evaluation metrics are used: hard accuracy (HAcc), which requires all constraints for a query to be satisfied, and soft accuracy (SAcc), which counts the proportion of constraints satisfied per query. Math-solving correctness is independently tracked, distinguishing intelligence (problem-solving) from obedience (constraint-following).
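The relationship between the two metrics can be sketched in a few lines, assuming each query yields a boolean vector of per-constraint outcomes (this representation is an assumption for illustration, not the paper's code):

```python
def hard_soft_accuracy(results):
    """Compute HAcc and SAcc from per-query constraint outcomes.

    results: list of boolean vectors, one vector per query,
             e.g. [[True, False, True], [True, True]].
    HAcc: fraction of queries with ALL constraints satisfied.
    SAcc: mean over queries of the fraction of constraints satisfied.
    """
    hacc = sum(all(r) for r in results) / len(results)
    sacc = sum(sum(r) / len(r) for r in results) / len(results)
    return hacc, sacc
```

Since HAcc requires a conjunction while SAcc averages, HAcc can never exceed SAcc, which is why multi-constraint tasks can show flat SAcc alongside collapsing HAcc.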
Empirical Analysis: Trade-Off Between Reasoning and Instruction-Following
The study evaluates 23 LRMs (parameter range: 0.6B to 70B) drawn from prominent recent model families (Qwen, DeepSeek-R1, Open-Reasoner, and others). The experimental protocol employs nucleus sampling and strict, verifiable constraint parsing.
Key findings:
- No LRM achieves robust instruction adherence: The best model, Qwen3-14B, achieves only 50.71% HAcc. Most large models do not outperform significantly smaller ones in instruction-following, undermining the assumption of monotonic controllability improvement with scale.
- Negative scaling trend for controllability: Increasing CoT length, whether via RL policies or inference-time budget forcing, monotonically degrades both HAcc and SAcc. The effect holds across multiple families and is exacerbated in challenging (multi-constraint, high-difficulty) tasks.
- Explicit separation of reasoning and answer text (using delimiters) slightly improves obedience, but is insufficient.
- Instruction-following and reasoning are inversely correlated: Error analysis shows that only a minority of samples are simultaneously correct (math-solving) and obedient (constraint-following); most outputs satisfy one objective but not the other. Imposing more constraints severely degrades HAcc while SAcc sometimes remains flat or even increases, indicating that models can satisfy individual constraints in isolation yet fail to satisfy their conjunction.
Training Paradigms and Their Implications
Reasoning-oriented training strategies exhibit a pronounced double-edged effect. Supervised fine-tuning (SFT) and RL, when optimized for long-form reasoning or outcome-based rewards, improve average problem-solving correctness but reduce instruction adherence. Notably, cold-start RL (without SFT) or small, format-aware reward augmentations marginally recover obedience but incur significant cost to mathematical accuracy.
Shortening CoT length, whether by truncation during RL rollouts or via inference-time interventions, improves constraint compliance (especially HAcc), suggesting that the loss of instruction salience stems from the growing contextual separation between the initial instruction and the final output. Simple interventions, such as repeating the constraints at the end of the CoT, clarify this mechanism: obedience increases, but math correctness is further diminished.
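The constraint-repetition intervention described above can be sketched as a simple prompt-assembly step. The `answer_marker` delimiter and the reminder wording here are assumptions for illustration; the paper's exact prompt format may differ.

```python
def inject_constraint_reminder(cot: str, constraints: list,
                               answer_marker: str = "Final answer:") -> str:
    """Re-state the user's constraints just before answer generation,
    shrinking the contextual distance between instruction and output.

    cot: the model's chain-of-thought text so far.
    constraints: human-readable constraint descriptions to repeat.
    answer_marker: assumed delimiter that precedes the final answer.
    """
    reminder = "Remember the constraints: " + "; ".join(constraints)
    return f"{cot.rstrip()}\n\n{reminder}\n{answer_marker}"
```

Because the reminder sits immediately before the answer tokens, the constraints regain salience at exactly the point where the answer is produced, consistent with the contextual-separation explanation.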
Theoretical and Practical Implications
The findings deliver clear empirical evidence of a fundamental trade-off between intelligence and controllability in current LRM training pipelines. From a theoretical standpoint, the dominance of task-oriented rewards and the tendency of scaled models to exploit implicit, often under-specified priors lead to the dilution of explicit user directives, especially under long-range context propagation (i.e., as the CoT grows). This not only questions the sufficiency of contemporary RL reward shaping but also amplifies the alignment problem for advanced reasoning systems.
Practically, these results signal that models geared towards high mathematical competence cannot be safely assumed to follow instructions—posing a direct challenge for deployment in scenarios with formal or regulatory requirements on output structure, language, or information disclosure. Moreover, the lack of monotonic scaling advantage suggests that architectural or training innovations, rather than brute-force size, are necessary to close the intelligence–obedience gap.
Future Directions
Bridging the intelligence–obedience dichotomy will likely require:
- Instruction-aware or hybrid reward schemes, blending reasoning success with explicit constraint satisfaction during both SFT and RL phases.
- Model architectures or protocols that dynamically bind constraints closer to answer generation steps, mitigating contextual dilution (e.g., through specialized memory mechanisms or persistent attention to instruction tokens).
- Augmented datasets with domain-specific, compositional instructions explicitly tied to reasoning artifacts, improving generalization to real-world, user-centric task directives.
The MathIF benchmark provides an actionable metric and testbed for such innovations.
Conclusion
"Scaling Reasoning, Losing Control" exposes the intrinsic friction between mathematical reasoning power and strict instruction adherence in LRMs. By casting this as a measurable trade-off and offering a domain-specific evaluation framework, the paper establishes a new baseline for research at the intersection of intelligence, controllability, and alignment. Future progress in LRM safety and reliability will depend on reconciling these two goals, not merely scaling parameter counts or extending reasoning traces.