EntroCut: Adaptive Entropy Truncation for LRMs
- EntroCut is a training-free, entropy-guided method that adaptively terminates chain-of-thought reasoning in large reasoning models by using per-token entropy as a confidence signal.
- It leverages early-step entropy statistics over probe tokens to determine a dynamic stopping point, significantly reducing the number of generated tokens with minimal accuracy loss.
- Experimental results on mathematical benchmarks demonstrate up to 47% token savings and a high efficiency-performance ratio (EPR), highlighting its practical impact on inference efficiency.
EntroCut is a training-free, entropy-guided procedure designed to reduce computational costs in Large Reasoning Models (LRMs) during chain-of-thought (CoT) reasoning by adaptively truncating the generation process once the model's confidence—measured by per-token entropy—indicates sufficient certainty. By leveraging early-step entropy statistics as a stopping criterion, EntroCut enables dynamic reasoning budgets, achieving substantial token savings while maintaining accuracy, as validated across multiple mathematical reasoning benchmarks and small-scale LRMs (Yan et al., 30 Jan 2026).
1. Theoretical Foundations: Entropy as a Confidence Signal
At the core of EntroCut lies the use of the model's per-token entropy as a proxy for its internal uncertainty during CoT generation. Specifically, at each generation step in the probe phase, the Shannon entropy is computed over the model's output distribution for the next token:
$$H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid c_t)\,\log p_\theta(v \mid c_t)$$

where $\mathcal{V}$ is the vocabulary and $c_t$ represents the full context up to step $t$, encompassing the input query, all generated reasoning tokens thus far, and any appended probe prompt. Low entropy in this context reflects high model confidence regarding the next generation step. Empirical findings indicate that low entropy during early probe steps reliably distinguishes correct from incorrect reasoning trajectories in LRMs, motivating its adoption as a truncation signal.
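The per-token entropy computation can be sketched in a few lines of Python; `logits` here stands for a hypothetical vector of next-token logits and is not tied to any particular inference framework:

```python
import math

def token_entropy(logits):
    """Shannon entropy -sum_v p(v) log p(v) of the softmax
    distribution induced by next-token logits."""
    m = max(logits)                               # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution (confident model) has near-zero entropy;
# a uniform one (uncertain model) attains the maximum log|V|:
print(token_entropy([10.0, 0.0, 0.0, 0.0]))   # close to 0
print(token_entropy([1.0, 1.0, 1.0, 1.0]))    # log 4 ≈ 1.386
```

Natural-log entropies are used throughout, consistent with the thresholds in the $0.15$–$0.225$ range reported later.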
2. Adaptive Reasoning Termination: Core Algorithmic Workflow
EntroCut introduces an adaptive stopping rule that departs from fixed-length truncation and from reliance on predetermined special tokens. Upon detecting a "reflection cue" token (such as "Wait" or "So") in the generated sequence, EntroCut appends a short probe string (e.g., `</think>\n\nSo the final answer is`) and proceeds to generate $k$ probe tokens. The entropy $H_i$ of each probe token is computed, and their average

$$\bar{H} = \frac{1}{k} \sum_{i=1}^{k} H_i$$

serves as a global confidence score. If $\bar{H} \le \tau$, where $\tau$ is a predefined entropy threshold, EntroCut terminates further CoT generation and transitions to answer synthesis. This mechanism adaptively determines the minimal sufficient reasoning budget in a context-sensitive, model-agnostic manner.
EntroCut Inference Procedure
The following pseudocode encapsulates EntroCut’s inference pipeline:
```
T = []                                  # Reasoning sequence
while True:
    τ_t = π_θ(T; q)                     # Generate next reasoning token
    T.append(τ_t)
    if τ_t in cues:                     # Reflection cue detected
        C = T + S_probe                 # Append probe string
        H = []
        for i in range(k):
            p_i = π_θ(· | C)            # Distribution of i-th probe token
            H.append(-Σ_v p_i[v] · log p_i[v])
            r_i ~ p_i                   # Sample probe token
            C.append(r_i)
        if mean(H) <= τ:                # Early-stop criterion met
            break
A = π_θ(q, T)                           # Generate answer tokens
return T + A
```
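A toy, executable version of this loop helps make the control flow concrete. The stub model, cue set, probe string, and all numeric settings below are illustrative stand-ins, not the paper's:

```python
import math

REFLECTION_CUES = {"Wait", "So"}                       # illustrative cue set
PROBE = ["</think>", "So", "the", "final", "answer", "is"]

def entropy(dist):
    """Shannon entropy of a token -> probability dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def entrocut_generate(next_dist, k=4, tau=0.2, max_steps=50):
    """next_dist(context) -> {token: prob}; returns the reasoning
    trace generated up to the point the entropy criterion fires."""
    trace = []
    for _ in range(max_steps):
        dist = next_dist(trace)
        token = max(dist, key=dist.get)                # greedy decode (sketch)
        trace.append(token)
        if token in REFLECTION_CUES:                   # reflection cue detected
            ctx = trace + PROBE                        # append probe string
            probe_entropies = []
            for _ in range(k):                         # generate k probe tokens
                d = next_dist(ctx)
                probe_entropies.append(entropy(d))
                ctx.append(max(d, key=d.get))
            if sum(probe_entropies) / k <= tau:        # mean probe entropy
                break                                  # confident: stop the CoT
    return trace

def stub_model(ctx):
    """Toy model: confidence grows with context length; emits a
    reflection cue every fifth reasoning token."""
    conf = min(0.999, 0.55 + 0.03 * len(ctx))
    if "</think>" not in ctx and len(ctx) % 5 == 4:
        return {"Wait": conf, "step": 1 - conf}
    return {"step": conf, "other": 1 - conf}

print(entrocut_generate(stub_model))   # stops after the second "Wait"
```

With this stub, the first probe's mean entropy exceeds `tau`, so generation continues; by the second reflection cue the model is confident enough to trigger the early stop.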
3. Efficiency-Performance Trade-off: The Efficiency-Performance Ratio (EPR)
To systematically evaluate the utility of early truncation, EntroCut introduces the Efficiency-Performance Ratio (EPR):
$$\mathrm{EPR} = \frac{\Delta\mathrm{Tok} \,/\, \Delta\mathrm{Tok}_{\mathrm{NOWAIT}}}{\Delta\mathrm{Acc} \,/\, \Delta\mathrm{Acc}_{\mathrm{NOWAIT}}}$$

where $\Delta\mathrm{Tok} = \mathrm{Tok}_{\mathrm{Vanilla}} - \mathrm{Tok}_{\mathrm{method}}$ is the token saving and $\Delta\mathrm{Acc} = \mathrm{Acc}_{\mathrm{Vanilla}} - \mathrm{Acc}_{\mathrm{method}}$ is the accuracy drop. Here, the "Vanilla" baseline denotes full CoT generation without truncation, while "NOWAIT" refers to immediate answer synthesis without reflection. An EPR above $1$ signifies achieving larger token savings per unit of accuracy loss than the NOWAIT baseline, with the metric enabling direct, quantitative comparison between efficiency-driven inference schemes.
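Taking each system as an (accuracy, tokens) pair and normalizing a method's token saving and accuracy drop by the NOWAIT baseline's (a reading that reproduces the reported EntroCut values to within rounding), the metric can be sketched as:

```python
def epr(vanilla, nowait, method):
    """Efficiency-Performance Ratio: the method's token saving and
    accuracy drop relative to Vanilla, each normalized by NOWAIT's.
    Arguments are (accuracy, tokens) pairs."""
    acc_v, tok_v = vanilla
    acc_n, tok_n = nowait
    acc_m, tok_m = method
    rel_saving = (tok_v - tok_m) / (tok_v - tok_n)   # normalized token saving
    rel_drop = (acc_v - acc_m) / (acc_v - acc_n)     # normalized accuracy drop
    return rel_saving / rel_drop

# DS-1.5B on AIME25, using values from the results table in Section 4:
print(round(epr((25.4, 15099), (18.8, 7772), (23.8, 7912)), 2))  # 4.05
```

Averaging this quantity across the four benchmarks yields the per-model "Avg. EPR" column reported in the tables.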
4. Experimental Methodology and Quantitative Results
Experiments were conducted on DeepSeek-R1-Distill-Qwen-1.5B (DS-1.5B) and DS-7B models using four mathematical reasoning benchmarks: AIME24, AIME25, MATH500, and AMC23. Standard temperature and top-$p$ decoding hyperparameters were used, and the probe consisted of $3$–$5$ tokens. Entropy thresholds were tuned per dataset and model (ranging $0.15$–$0.225$). Each configuration was evaluated over multiple runs (16 for AIME24/25 and AMC23; 4 for MATH500).
Main results demonstrate that EntroCut secures substantial token savings with only marginal losses in accuracy. On DS-1.5B, token usage is reduced by up to $47\%$ (AIME24) with a $0.4$ percentage point drop in accuracy ($12.4$ avg. EPR), while DS-7B achieves $24\%$ savings on AIME24 with a $4.1$ percentage point accuracy decrease ($2.4$ avg. EPR).
Comparative Results Table
| Model | Method | AIME24 (Acc, Tok) | AIME25 (Acc, Tok) | MATH500 (Acc, Tok) | AMC23 (Acc, Tok) | Avg. EPR |
|---|---|---|---|---|---|---|
| DS-1.5B | Vanilla | 29.2, 15591 | 25.4, 15099 | 83.1, 4984 | 73.0, 8834 | — |
| | NOWAIT | 22.1, 8196 | 18.8, 7772 | 79.0, 3086 | 65.3, 4594 | — |
| | TIP | 27.9, 11458 | 22.9, 9160 | 80.8, 3419 | 72.0, 5998 | 2.9 |
| | DEER | 27.5, 8695 | 21.6, 8807 | 70.1, 2575 | 68.2, 5152 | 1.7 |
| | EntroCut | 28.8, 8295 | 23.8, 7912 | 81.6, 3341 | 72.8, 6000 | 12.4 |
| DS-7B | Vanilla | 55.8, 12775 | 42.3, 14319 | 92.1, 4097 | 90.3, 6434 | — |
| | NOWAIT | 46.3, 7921 | 30.0, 7449 | 88.7, 2768 | 82.2, 3811 | — |
| | TIP | 51.3, 9919 | 32.9, 10138 | 90.5, 3055 | 87.3, 4714 | 1.3 |
| | DEER | 49.2, 9138 | 37.7, 9586 | 89.0, 2342 | 86.9, 4322 | 1.5 |
| | EntroCut | 51.7, 9663 | 38.1, 9555 | 91.1, 3046 | 89.2, 5216 | 2.4 |
EntroCut consistently outperforms TIP and DEER on the Pareto frontier delineating token usage versus accuracy.
5. Ablation Studies and Hyperparameter Sensitivity
Ablation experiments highlight the importance of both the entropy probe and the tuned threshold $\tau$. Imposing a hard fixed budget for reasoning, as opposed to entropy-adaptive termination, leads to up to a $21$ percentage point drop in accuracy, accompanied by diminished EPR despite additional token savings. Eliminating the entropy probe in favor of fixed-length truncation substantially reduces EPR (from $4.05$ to $1.69$ on DS-1.5B) and incurs a $1.1$–$1.9$ percentage point decrease in accuracy. This indicates that the entropy-based, context-sensitive stopping rule is essential for optimal trade-offs.
| Model | Variant | Acc | Tok | EPR |
|---|---|---|---|---|
| DS-1.5B | EntroCut | 23.8 | 7912.8 | 4.05 |
| | + hard budget | 22.0 | 7590.8 | 1.99 |
| | – entropy probe | 21.9 | 8547.1 | 1.69 |
| DS-7B | EntroCut | 38.1 | 9555.3 | 2.04 |
| | + hard budget | 17.2 | 6815.8 | 0.54 |
| | – entropy probe | 37.0 | 9536.6 | 1.62 |
Optimal performance depends on careful tuning of $\tau$ (within $0.15$–$0.225$ depending on model and dataset), the probe length $k$ (typically $3$–$5$ tokens), and the set of reflection-cue tokens.
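One plausible way to tune $\tau$ on a small development set is a grid search over the reported range; the loop, its scoring rule, penalty weight, and the synthetic dev statistics below are all illustrative, not the paper's protocol:

```python
def tune_threshold(dev_runs, taus):
    """dev_runs: (mean_probe_entropy, correct_if_stopped,
    tokens_saved_if_stopped) triples gathered on a dev set.
    Picks the tau maximizing tokens saved across runs that would
    stop early, penalizing runs whose early answer is wrong."""
    best_tau, best_score = None, float("-inf")
    for tau in taus:
        stopped = [r for r in dev_runs if r[0] <= tau]
        saved = sum(r[2] for r in stopped)
        wrong = sum(1 for r in stopped if not r[1])
        score = saved - 5000 * wrong          # ad-hoc penalty per error
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

# Synthetic dev statistics: low-entropy probes tend to be correct.
runs = [(0.10, True, 6000), (0.14, True, 5500),
        (0.18, True, 4000), (0.21, False, 3500), (0.30, True, 0)]
taus = [0.15, 0.175, 0.2, 0.225]
print(tune_threshold(runs, taus))   # 0.2
```

Raising $\tau$ past the sweet spot admits a wrong early stop that outweighs its token saving, which is why the loop settles on an intermediate value.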
6. Significance and Broader Context
Empirical results validate the hypothesis that a model's intrinsic uncertainty, as quantified by early-step entropy, robustly signals the completion of effective reasoning. Adaptive truncation via EntroCut demonstrably outperforms naive suppression techniques and fixed reasoning budgets. The procedure is model-agnostic and avoids retraining or architectural modification, offering immediate practicality for practitioners aiming to improve LRM inference efficiency.
A notable implication is that entropy-guided truncation can generalize across domains and architectures with minimal parameterization. The EPR metric further sets a standard for fair comparison of efficiency-accuracy trade-offs in inference-time reasoning control. Given the increasing proliferation of cost-sensitive LRM deployments, EntroCut furnishes a rigorous, lightweight approach for dynamic reasoning that harmonizes accuracy and computational efficiency (Yan et al., 30 Jan 2026).