EntroCut: Adaptive Entropy Truncation for LRMs
- EntroCut is a training-free, entropy-guided method that adaptively terminates chain-of-thought reasoning in large reasoning models by using per-token entropy as a confidence signal.
- It leverages early-step entropy statistics over probe tokens to determine a dynamic stopping point, significantly reducing the number of generated tokens with minimal accuracy loss.
- Experimental results on mathematical benchmarks demonstrate up to 47% token savings and a high efficiency-performance ratio (EPR), highlighting its practical impact on inference efficiency.
EntroCut is a training-free, entropy-guided procedure designed to reduce computational costs in Large Reasoning Models (LRMs) during chain-of-thought (CoT) reasoning by adaptively truncating the generation process once the model's confidence—measured by per-token entropy—indicates sufficient certainty. By leveraging early-step entropy statistics as a stopping criterion, EntroCut enables dynamic reasoning budgets, achieving substantial token savings while maintaining accuracy, as validated across multiple mathematical reasoning benchmarks and small-scale LRMs (Yan et al., 30 Jan 2026).
1. Theoretical Foundations: Entropy as a Confidence Signal
At the core of EntroCut lies the use of the model's per-token entropy as a proxy for its internal uncertainty during CoT generation. Specifically, at each generation step in the probe phase, the Shannon entropy is computed over the model's output distribution for the next token:
$$H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid c_t)\,\log p_\theta(v \mid c_t)$$

where $\mathcal{V}$ is the vocabulary and $c_t$ represents the full context up to step $t$, encompassing the input query, all generated reasoning tokens thus far, and any appended probe prompt. Low entropy in this context reflects high model confidence regarding the next generation step. Empirical findings indicate that low entropy during early probe steps reliably distinguishes correct from incorrect reasoning trajectories in LRMs, motivating its adoption as a truncation signal.
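The per-token entropy computation can be sketched in a few lines of Python; `logits` here stands for a hypothetical vector of next-token logits and is not tied to any particular inference framework:

```python
import math

def token_entropy(logits):
    """Shannon entropy -sum_v p(v) log p(v) of the softmax
    distribution induced by next-token logits."""
    m = max(logits)                               # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident model) has near-zero entropy;
# a uniform one (uncertain model) attains the maximum log|V|:
print(token_entropy([10.0, 0.0, 0.0, 0.0]))   # close to 0
print(token_entropy([1.0, 1.0, 1.0, 1.0]))    # log 4 ≈ 1.386
```

Natural-log entropies are used throughout, consistent with the thresholds in the $0.15$–$0.225$ range reported later.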
2. Adaptive Reasoning Termination: Core Algorithmic Workflow
EntroCut introduces an adaptive stopping rule that departs from fixed-length truncation and from reliance on predetermined special tokens. Upon detecting a "reflection cue" token (such as "Wait" or "So") in the generated sequence, EntroCut appends a short probe string (e.g., `</think>\n\nSo the final answer is`) and proceeds to generate $k$ probe tokens. The entropy $H_i$ of each probe token is computed, and their average

$$\bar{H} = \frac{1}{k} \sum_{i=1}^{k} H_i$$

serves as a global confidence score. If $\bar{H} \le \tau$, where $\tau$ is a predefined entropy threshold, EntroCut terminates further CoT generation and transitions to answer synthesis. This mechanism adaptively determines the minimal sufficient reasoning budget in a context-sensitive, model-agnostic manner.
EntroCut Inference Procedure
The following pseudocode encapsulates EntroCut’s inference pipeline:
```
T = []                                  # Reasoning sequence
while True:
    τ_t = π_θ(T; q)                     # Generate next reasoning token
    T.append(τ_t)
    if τ_t in cues:                     # Reflection cue detected
        C = T + S_probe                 # Append probe string
        H = []
        for i in range(k):
            p_i = π_θ(· | C)            # Distribution of i-th probe token
            H.append(-Σ_v p_i[v] · log p_i[v])
            r_i ~ p_i                   # Sample probe token
            C.append(r_i)
        if mean(H) <= τ:                # Early-stop criterion met
            break
A = π_θ(q, T)                           # Generate answer tokens
return T + A
```
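A toy, executable version of this loop helps make the control flow concrete. The stub model, cue set, probe string, and all numeric settings below are illustrative stand-ins, not the paper's:

```python
import math

REFLECTION_CUES = {"Wait", "So"}                       # illustrative cue set
PROBE = ["</think>", "So", "the", "final", "answer", "is"]

def entropy(dist):
    """Shannon entropy of a token -> probability dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def entrocut_generate(next_dist, k=4, tau=0.2, max_steps=50):
    """next_dist(context) -> {token: prob}; returns the reasoning
    trace generated up to the point the entropy criterion fires."""
    trace = []
    for _ in range(max_steps):
        dist = next_dist(trace)
        token = max(dist, key=dist.get)                # greedy decode (sketch)
        trace.append(token)
        if token in REFLECTION_CUES:                   # reflection cue detected
            ctx = trace + PROBE                        # append probe string
            probe_entropies = []
            for _ in range(k):                         # generate k probe tokens
                d = next_dist(ctx)
                probe_entropies.append(entropy(d))
                ctx.append(max(d, key=d.get))
            if sum(probe_entropies) / k <= tau:        # mean probe entropy
                break                                  # confident: stop the CoT
    return trace

def stub_model(ctx):
    """Toy model: confidence grows with context length; emits a
    reflection cue every fifth reasoning token."""
    conf = min(0.999, 0.55 + 0.03 * len(ctx))
    if "</think>" not in ctx and len(ctx) % 5 == 4:
        return {"Wait": conf, "step": 1 - conf}
    return {"step": conf, "other": 1 - conf}

print(entrocut_generate(stub_model))   # stops after the second "Wait"
```

With this stub, the first probe's mean entropy exceeds `tau`, so generation continues; by the second reflection cue the model is confident enough to trigger the early stop.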
3. Efficiency-Performance Trade-off: The Efficiency-Performance Ratio (EPR)
To systematically evaluate the utility of early truncation, EntroCut introduces the Efficiency-Performance Ratio (EPR):
$$\mathrm{EPR} = \frac{\Delta\mathrm{Tok} \,/\, \Delta\mathrm{Tok}_{\mathrm{NOWAIT}}}{\Delta\mathrm{Acc} \,/\, \Delta\mathrm{Acc}_{\mathrm{NOWAIT}}}$$

where $\Delta\mathrm{Tok} = \mathrm{Tok}_{\mathrm{Vanilla}} - \mathrm{Tok}_{\mathrm{method}}$ is the token saving and $\Delta\mathrm{Acc} = \mathrm{Acc}_{\mathrm{Vanilla}} - \mathrm{Acc}_{\mathrm{method}}$ is the accuracy drop. Here, the "Vanilla" baseline denotes full CoT generation without truncation, while "NOWAIT" refers to immediate answer synthesis without reflection. An EPR above $1$ signifies achieving larger token savings per unit of accuracy loss than the NOWAIT baseline, with the metric enabling direct, quantitative comparison between efficiency-driven inference schemes.
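Taking each system as an (accuracy, tokens) pair and normalizing a method's token saving and accuracy drop by the NOWAIT baseline's (a reading that reproduces the reported EntroCut values to within rounding), the metric can be sketched as:

```python
def epr(vanilla, nowait, method):
    """Efficiency-Performance Ratio: the method's token saving and
    accuracy drop relative to Vanilla, each normalized by NOWAIT's.
    Arguments are (accuracy, tokens) pairs."""
    acc_v, tok_v = vanilla
    acc_n, tok_n = nowait
    acc_m, tok_m = method
    rel_saving = (tok_v - tok_m) / (tok_v - tok_n)   # normalized token saving
    rel_drop = (acc_v - acc_m) / (acc_v - acc_n)     # normalized accuracy drop
    return rel_saving / rel_drop

# DS-1.5B on AIME25, using values from the results table in Section 4:
print(round(epr((25.4, 15099), (18.8, 7772), (23.8, 7912)), 2))  # 4.05
```

Averaging this quantity across the four benchmarks yields the per-model "Avg. EPR" column reported in the tables.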
4. Experimental Methodology and Quantitative Results
Experiments were conducted on DeepSeek-R1-Distill-Qwen-1.5B (DS-1.5B) and DS-7B models using four mathematical reasoning benchmarks: AIME24, AIME25, MATH500, and AMC23. Standard temperature and top-$p$ decoding hyperparameters were used, and the probe consisted of $3$–$5$ tokens. Entropy thresholds were tuned per dataset and model (ranging $0.15$–$0.225$). Each configuration was evaluated over multiple runs (16 for AIME24/25 and AMC23; 4 for MATH500).
Main results demonstrate that EntroCut secures substantial token savings with only marginal losses in accuracy. On DS-1.5B, token usage is reduced by up to $47\%$ (AIME24) with a $0.4$ percentage point drop in accuracy ($12.4$ avg. EPR), while DS-7B achieves $24\%$ savings on AIME24 with a $4.1$ percentage point accuracy decrease ($2.4$ avg. EPR).
Comparative Results Table
| Model | Method | AIME24 (Acc, Tok) | AIME25 (Acc, Tok) | MATH500 (Acc, Tok) | AMC23 (Acc, Tok) | Avg. EPR |
|---|---|---|---|---|---|---|
| DS-1.5B | Vanilla | 29.2, 15591 | 25.4, 15099 | 83.1, 4984 | 73.0, 8834 | — |
| | NOWAIT | 22.1, 8196 | 18.8, 7772 | 79.0, 3086 | 65.3, 4594 | — |
| | TIP | 27.9, 11458 | 22.9, 9160 | 80.8, 3419 | 72.0, 5998 | 2.9 |
| | DEER | 27.5, 8695 | 21.6, 8807 | 70.1, 2575 | 68.2, 5152 | 1.7 |
| | EntroCut | 28.8, 8295 | 23.8, 7912 | 81.6, 3341 | 72.8, 6000 | 12.4 |
| DS-7B | Vanilla | 55.8, 12775 | 42.3, 14319 | 92.1, 4097 | 90.3, 6434 | — |
| | NOWAIT | 46.3, 7921 | 30.0, 7449 | 88.7, 2768 | 82.2, 3811 | — |
| | TIP | 51.3, 9919 | 32.9, 10138 | 90.5, 3055 | 87.3, 4714 | 1.3 |
| | DEER | 49.2, 9138 | 37.7, 9586 | 89.0, 2342 | 86.9, 4322 | 1.5 |
| | EntroCut | 51.7, 9663 | 38.1, 9555 | 91.1, 3046 | 89.2, 5216 | 2.4 |
EntroCut consistently outperforms TIP and DEER on the Pareto frontier delineating token usage versus accuracy.
5. Ablation Studies and Hyperparameter Sensitivity
Ablation experiments highlight the importance of both the entropy probe and the tuned threshold $\tau$. Imposing a hard fixed budget for reasoning, as opposed to entropy-adaptive termination, leads to up to a $21$ percentage point drop in accuracy, accompanied by diminished EPR despite additional token savings. Eliminating the entropy probe in favor of fixed-length truncation substantially reduces EPR (from $4.05$ to $1.69$ on DS-1.5B) and incurs a $1.1$–$1.9$ percentage point decrease in accuracy. This indicates that the entropy-based, context-sensitive stopping rule is essential for optimal trade-offs.
| Model | Variant | Acc | Tok | EPR |
|---|---|---|---|---|
| DS-1.5B | EntroCut | 23.8 | 7912.8 | 4.05 |
| | + hard budget | 22.0 | 7590.8 | 1.99 |
| | – entropy probe | 21.9 | 8547.1 | 1.69 |
| DS-7B | EntroCut | 38.1 | 9555.3 | 2.04 |
| | + hard budget | 17.2 | 6815.8 | 0.54 |
| | – entropy probe | 37.0 | 9536.6 | 1.62 |
Optimal performance depends on careful tuning of $\tau$ (within $0.15$–$0.225$ depending on model and dataset), the probe length $k$ (typically $3$–$5$ tokens), and the set of reflection-cue tokens.
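One plausible way to tune $\tau$ on a small development set is a grid search over the reported range; the loop, its scoring rule, penalty weight, and the synthetic dev statistics below are all illustrative, not the paper's protocol:

```python
def tune_threshold(dev_runs, taus):
    """dev_runs: (mean_probe_entropy, correct_if_stopped,
    tokens_saved_if_stopped) triples gathered on a dev set.
    Picks the tau maximizing tokens saved across runs that would
    stop early, penalizing runs whose early answer is wrong."""
    best_tau, best_score = None, float("-inf")
    for tau in taus:
        stopped = [r for r in dev_runs if r[0] <= tau]
        saved = sum(r[2] for r in stopped)
        wrong = sum(1 for r in stopped if not r[1])
        score = saved - 5000 * wrong          # ad-hoc penalty per error
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

# Synthetic dev statistics: low-entropy probes tend to be correct.
runs = [(0.10, True, 6000), (0.14, True, 5500),
        (0.18, True, 4000), (0.21, False, 3500), (0.30, True, 0)]
taus = [0.15, 0.175, 0.2, 0.225]
print(tune_threshold(runs, taus))   # 0.2
```

Raising $\tau$ past the sweet spot admits a wrong early stop that outweighs its token saving, which is why the loop settles on an intermediate value.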
6. Significance and Broader Context
Empirical results validate the hypothesis that a model's intrinsic uncertainty, as quantified by early-step entropy, robustly signals the completion of effective reasoning. Adaptive truncation via EntroCut demonstrably outperforms naive suppression techniques and fixed reasoning budgets. The procedure is model-agnostic and avoids retraining or architectural modification, offering immediate practicality for practitioners aiming to improve LRM inference efficiency.
A notable implication is that entropy-guided truncation can generalize across domains and architectures with minimal parameterization. The EPR metric further sets a standard for fair comparison of efficiency-accuracy trade-offs in inference-time reasoning control. Given the increasing proliferation of cost-sensitive LRM deployments, EntroCut furnishes a rigorous, lightweight approach for dynamic reasoning that harmonizes accuracy and computational efficiency (Yan et al., 30 Jan 2026).