
EntroCut: Adaptive Entropy Truncation for LRMs

Updated 6 February 2026
  • EntroCut is a training-free, entropy-guided method that adaptively terminates chain-of-thought reasoning in large reasoning models by using per-token entropy as a confidence signal.
  • It leverages early-step entropy statistics over probe tokens to determine a dynamic stopping point, significantly reducing computational tokens with minimal accuracy loss.
  • Experimental results on mathematical benchmarks demonstrate up to 47% token savings and a high efficiency-performance ratio (EPR), highlighting its practical impact on inference efficiency.

EntroCut is a training-free, entropy-guided procedure designed to reduce computational costs in Large Reasoning Models (LRMs) during chain-of-thought (CoT) reasoning by adaptively truncating the generation process once the model's confidence—measured by per-token entropy—indicates sufficient certainty. By leveraging early-step entropy statistics as a stopping criterion, EntroCut enables dynamic reasoning budgets, achieving substantial token savings while maintaining accuracy, as validated across multiple mathematical reasoning benchmarks and small-scale LRMs (Yan et al., 30 Jan 2026).

1. Theoretical Foundations: Entropy as a Confidence Signal

At the core of EntroCut lies the use of the model's per-token entropy as a proxy for its internal uncertainty during CoT generation. Specifically, at each generation step $t$ in the probe phase, the Shannon entropy $H(t)$ is computed over the model's output distribution for the next token:

$$H(t) = -\sum_{v \in V} p(v \mid \mathbf{x}_{\leq t}) \log p(v \mid \mathbf{x}_{\leq t}),$$

where $V$ is the vocabulary, and $\mathbf{x}_{\leq t}$ represents the full context up to step $t$, encompassing the input query, all generated reasoning tokens thus far, and any appended probe prompt. Low entropy in this context reflects high model confidence regarding the next generation step. Empirical findings indicate that low entropy during early probe steps reliably distinguishes correct from incorrect reasoning trajectories in LRMs, motivating its adoption as a truncation signal.
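
As a concrete illustration of this definition, the Shannon entropy of a next-token distribution can be computed directly from a probability vector; the sketch below is framework-independent and the variable names are illustrative:

```python
import math

def token_entropy(probs):
    """Shannon entropy H(t) of a next-token distribution.

    `probs` is a probability vector over the vocabulary V at step t,
    i.e. p(v | x_<=t) for each candidate token v. Zero-probability
    entries are skipped since lim p->0 of p*log(p) is 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked (confident) distribution has low entropy; a uniform
# (uncertain) one has the maximum entropy log(|V|):
confident = token_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])   # = log(4)
```

The truncation signal rests on exactly this contrast: confident probe steps produce distributions closer to the first case.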

2. Adaptive Reasoning Termination: Core Algorithmic Workflow

EntroCut introduces an adaptive stopping rule that departs from fixed-length truncation or reliance on predetermined special tokens. Upon detecting a "reflection cue" token (such as "Wait" or "So") in the generated sequence, EntroCut appends a short probe string (e.g., </think>\n\nSo the final answer is) and proceeds to generate $k$ probe tokens. The entropy $H(i)$ of each probe token ($i = 1, \ldots, k$) is computed, and their average

$$\bar H_{\text{probe}} = \frac{1}{k} \sum_{i=1}^{k} H(i)$$

serves as a global confidence score. If $\bar H_{\text{probe}} \leq \tau$, where $\tau$ is a predefined entropy threshold, EntroCut terminates further CoT generation and transitions to answer synthesis. This mechanism adaptively determines the minimal sufficient reasoning budget in a context-sensitive and model-agnostic manner.
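
The stopping rule itself reduces to a one-line check; a minimal sketch (the function name is illustrative, not from the paper's code):

```python
def should_stop(probe_entropies, tau):
    """EntroCut stopping rule: truncate CoT generation once the mean
    entropy of the k probe tokens falls at or below the threshold tau."""
    h_bar = sum(probe_entropies) / len(probe_entropies)  # H̄_probe
    return h_bar <= tau

# Low probe entropies signal confidence and trigger early termination;
# high ones let reasoning continue past the reflection cue.
```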

EntroCut Inference Procedure

The following pseudocode encapsulates EntroCut’s inference pipeline:

T = []                                  # reasoning tokens generated so far
while True:
    t = sample from πθ(· | q, T)        # next reasoning token
    T.append(t)
    if t in ReflectionCues:             # reflection cue detected, e.g. "Wait", "So"
        C = T + S_probe                 # append the probe string
        H = []
        for i in 1, …, k:
            p_i = πθ(· | q, C)          # next-token distribution
            H.append(-Σ_v p_i[v] · log p_i[v])   # probe entropy H(i)
            C.append(sample from p_i)
        if mean(H) ≤ τ:
            break                       # early-stop criterion met

A = sample answer from πθ(· | q, T, S_probe)   # answer tokens
return T + A
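
The loop above can be exercised end-to-end with a stub in place of the model; everything in the sketch below (function names, the dict-based distribution, greedy decoding) is illustrative scaffolding, not the paper's implementation:

```python
import math

def entrocut_generate(step, reflection_cues, probe_ids, k, tau, max_len=64):
    """Toy EntroCut loop.

    `step(ctx)` stands in for the LRM pi_theta: given a context token
    list, it returns a next-token distribution as a dict {token: prob}.
    Decoding is greedy here for determinism.
    """
    T = []                                   # reasoning tokens so far
    while len(T) < max_len:
        p = step(T)
        t = max(p, key=p.get)                # next reasoning token
        T.append(t)
        if t in reflection_cues:             # e.g. "Wait", "So"
            C = T + list(probe_ids)          # context + probe string
            H = []
            for _ in range(k):               # generate k probe tokens
                p_i = step(C)
                H.append(-sum(q * math.log(q) for q in p_i.values() if q > 0))
                C.append(max(p_i, key=p_i.get))
            if sum(H) / k <= tau:            # mean probe entropy <= tau
                break                        # confidence sufficient: stop
    return T                                 # answer synthesis would follow
```

With a stub that emits a cue token and then a highly peaked distribution, the loop stops immediately after the first reflection cue, mirroring the early-exit path of the pseudocode.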

3. Efficiency-Performance Trade-off: The Efficiency-Performance Ratio (EPR)

To systematically evaluate the utility of early truncation, EntroCut introduces the Efficiency-Performance Ratio (EPR):

$$\mathrm{EPR} = \frac{\text{TokenSavingRatio}}{\text{AccuracyLossRatio}},$$

where

  • $\text{TokenSavingRatio} = \frac{\mathrm{Tok}_{\mathrm{Vanilla}} - \mathrm{Tok}_{\mathrm{Target}}}{\mathrm{Tok}_{\mathrm{Vanilla}} - \mathrm{Tok}_{\mathrm{NOWAIT}}}$
  • $\text{AccuracyLossRatio} = \frac{\mathrm{Acc}_{\mathrm{Vanilla}} - \mathrm{Acc}_{\mathrm{Target}}}{\mathrm{Acc}_{\mathrm{Vanilla}} - \mathrm{Acc}_{\mathrm{NOWAIT}}}$

Here, the "Vanilla" baseline denotes full CoT generation without truncation, while "NOWAIT" refers to immediate answer synthesis without reflection. An EPR $\gg 1$ signifies large token savings for minimal accuracy loss, and the metric enables direct, quantitative comparison between efficiency-driven inference schemes.
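
To make the metric concrete, the sketch below computes a single-dataset EPR from the DS-1.5B AIME24 numbers reported in Section 4 (Vanilla: 29.2 acc, 15591 tok; NOWAIT: 22.1, 8196; EntroCut: 28.8, 8295); the function name is illustrative:

```python
def epr(acc_vanilla, tok_vanilla, acc_nowait, tok_nowait, acc_target, tok_target):
    """EPR = TokenSavingRatio / AccuracyLossRatio, both normalized
    against the Vanilla (full CoT) and NOWAIT (no reflection) baselines."""
    token_saving = (tok_vanilla - tok_target) / (tok_vanilla - tok_nowait)
    acc_loss = (acc_vanilla - acc_target) / (acc_vanilla - acc_nowait)
    return token_saving / acc_loss

# DS-1.5B on AIME24.  Note this is a per-dataset value; the 12.4
# reported in Section 4 is an average over all four benchmarks.
aime24_epr = epr(29.2, 15591, 22.1, 8196, 28.8, 8295)   # ≈ 17.5
```

Note that the ratio degenerates when the target matches the Vanilla accuracy exactly (zero denominator), so in practice it is only meaningful when some accuracy is traded away.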

4. Experimental Methodology and Quantitative Results

Experiments were conducted on DeepSeek-R1-Distill-Qwen-1.5B (DS-1.5B) and DS-7B models using four mathematical reasoning benchmarks: AIME24, AIME25, MATH500, and AMC23. Standard decoding hyperparameters were temperature $= 0.6$ and top-$p = 1.0$, and the probe consisted of $k = 3$–$5$ tokens. Entropy thresholds $\tau$ were tuned per dataset and model (ranging from $0.15$ to $0.225$). Each configuration was evaluated over multiple runs (16 for AIME24/25 and AMC23; 4 for MATH500).

Main results demonstrate that EntroCut secures substantial token savings with only marginal losses in accuracy. On DS-1.5B, token usage is reduced by up to $47\%$ (AIME24) with a $0.4$ percentage point drop in accuracy (average EPR $= 12.4$), while DS-7B achieves $24\%$ savings with a $4.1$ percentage point accuracy decrease (average EPR $= 2.4$).

Comparative Results Table

| Model   | Method   | AIME24 (Acc, Tok) | AIME25 (Acc, Tok) | MATH500 (Acc, Tok) | AMC23 (Acc, Tok) | Avg. EPR |
|---------|----------|-------------------|-------------------|--------------------|------------------|----------|
| DS-1.5B | Vanilla  | 29.2, 15591       | 25.4, 15099       | 83.1, 4984         | 73.0, 8834       | —        |
| DS-1.5B | NOWAIT   | 22.1, 8196        | 18.8, 7772        | 79.0, 3086         | 65.3, 4594       | —        |
| DS-1.5B | TIP      | 27.9, 11458       | 22.9, 9160        | 80.8, 3419         | 72.0, 5998       | 2.9      |
| DS-1.5B | DEER     | 27.5, 8695        | 21.6, 8807        | 70.1, 2575         | 68.2, 5152       | 1.7      |
| DS-1.5B | EntroCut | 28.8, 8295        | 23.8, 7912        | 81.6, 3341         | 72.8, 6000       | 12.4     |
| DS-7B   | Vanilla  | 55.8, 12775       | 42.3, 14319       | 92.1, 4097         | 90.3, 6434       | —        |
| DS-7B   | NOWAIT   | 46.3, 7921        | 30.0, 7449        | 88.7, 2768         | 82.2, 3811       | —        |
| DS-7B   | TIP      | 51.3, 9919        | 32.9, 10138       | 90.5, 3055         | 87.3, 4714       | 1.3      |
| DS-7B   | DEER     | 49.2, 9138        | 37.7, 9586        | 89.0, 2342         | 86.9, 4322       | 1.5      |
| DS-7B   | EntroCut | 51.7, 9663        | 38.1, 9555        | 91.1, 3046         | 89.2, 5216       | 2.4      |

On the Pareto frontier of token usage versus accuracy, EntroCut consistently dominates TIP and DEER.

5. Ablation Studies and Hyperparameter Sensitivity

Ablation experiments highlight the importance of both the entropy probe and the tuned threshold $\tau$. Imposing a hard fixed budget for reasoning, as opposed to entropy-adaptive termination, leads to up to a $21$ percentage point drop in accuracy, accompanied by a diminished EPR despite additional token savings. Eliminating the entropy probe in favor of fixed-length truncation roughly halves EPR on DS-1.5B and incurs a $1.1$–$1.9$ percentage point decrease in accuracy. This indicates that the entropy-based, context-sensitive stopping rule is essential for an optimal trade-off.

| Model   | Variant         | Acc  | Tok    | EPR  |
|---------|-----------------|------|--------|------|
| DS-1.5B | EntroCut        | 23.8 | 7912.8 | 4.05 |
| DS-1.5B | + hard budget   | 22.0 | 7590.8 | 1.99 |
| DS-1.5B | – entropy probe | 21.9 | 8547.1 | 1.69 |
| DS-7B   | EntroCut        | 38.1 | 9555.3 | 2.04 |
| DS-7B   | + hard budget   | 17.2 | 6815.8 | 0.54 |
| DS-7B   | – entropy probe | 37.0 | 9536.6 | 1.62 |

Optimal performance depends on careful tuning of $\tau$ (within $0.15$–$0.225$, depending on model and dataset), the probe length $k$ (typically $3$–$5$), and the set of reflection-cue tokens.

6. Significance and Broader Context

Empirical results validate the hypothesis that the model's uncertainty, as quantified by early-step entropy, robustly signals the completion of effective reasoning. Adaptive truncation via EntroCut demonstrably outperforms naive suppression techniques and fixed reasoning budgets. The procedure is model-agnostic and avoids retraining or architectural modification, offering immediate practicality for practitioners aiming to improve LRM inference efficiency.

A notable implication is that entropy-guided truncation can generalize across domains and architectures with minimal parameterization. The EPR metric further sets a standard for fair comparison of efficiency-accuracy trade-offs in inference-time reasoning control. Given the increasing proliferation of cost-sensitive LRM deployments, EntroCut furnishes a rigorous, lightweight approach for dynamic reasoning that harmonizes accuracy and computational efficiency (Yan et al., 30 Jan 2026).
