
Overthinking-Adjusted Accuracy (OAAₜ)

Updated 27 January 2026
  • Overthinking-Adjusted Accuracy (OAAₜ) is a metric family that jointly quantifies model correctness and token efficiency by rewarding precise, concise reasoning within a defined token budget.
  • OAAₜ employs fixed and difficulty-adaptive token thresholds to penalize excessive reasoning, enabling rigorous cross-model comparisons on both accuracy and computational parsimony.
  • Empirical findings show that high raw accuracy can mask overthinking, as models generating overly long explanations score lower on OAAₜ, guiding improvements in efficient problem solving.

Overthinking-Adjusted Accuracy (OAAₜ) is a family of evaluation metrics designed to jointly quantify the correctness and efficiency of machine reasoning models, particularly LLMs equipped with chain-of-thought (CoT) capabilities. OAAₜ penalizes excessive token usage (“overthinking”) by rewarding only those answers which are not just correct, but also achieved within a specified reasoning–token budget. This unifies notions of accuracy and computational parsimony, allowing rigorous model comparisons along the accuracy–efficiency spectrum. The concept and its principal computational variants are formalized and empirically validated in recent large-scale benchmarking and efficient reasoning literature, including OptimalThinkingBench (Aggarwal et al., 18 Aug 2025), THOUGHTTERMINATOR (Pu et al., 17 Apr 2025), “Correct, Concise and Complete” (Rakotonirina et al., 6 Jan 2026), TRAAC (Singh et al., 2 Oct 2025), and LLMThinkBench (Srivastava et al., 5 Jul 2025).

1. Formal Definition and Mathematical Structure

For an evaluation set of n questions, the canonical version of Overthinking-Adjusted Accuracy at budget t, denoted OAAₜ, is defined as:

\mathrm{OAA}_t = \frac{1}{n} \sum_{i=1}^{n} \left[\text{Correctness}_i \cdot \mathbb{I}(\text{ThinkTokens}_i < t)\right]

Where:

  • \text{Correctness}_i \in \{0,1\} indicates whether the model’s answer for item i is correct.
  • \text{ThinkTokens}_i is the number of reasoning (a.k.a. “thinking,” or CoT) tokens produced on item i.
  • t is the reasoning–token threshold.
  • \mathbb{I}(\text{ThinkTokens}_i < t) is the indicator that the sample fits within the token budget.

Alternative instantiations replace t with a question-specific token budget \tau(q), which is adaptively set using a difficulty estimator, yielding:

\mathrm{OAA}_\tau(M) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\left[\,M\text{ correct for }q_i \;\wedge\; \text{spend} \le \tau(q_i)\,\right] \quad \text{(Pu et al., 17 Apr 2025)}

Other variants (see LLMThinkBench) employ harmonic means of accuracy and a normalized efficiency factor for additional sensitivity to average token usage (Srivastava et al., 5 Jul 2025).
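A harmonic-mean variant of this kind can be sketched as follows. The exact normalization used by LLMThinkBench is not reproduced here; defining efficiency as one minus the mean token count over a reference budget is an illustrative assumption:

```python
def harmonic_efficiency_accuracy(tokens, correct, token_budget=1000):
    """Harmonic mean of accuracy and a normalized efficiency factor.

    The efficiency definition (1 - mean_tokens / token_budget) is an
    illustrative assumption, not the exact LLMThinkBench formula.
    """
    n = len(tokens)
    accuracy = sum(correct) / n
    # Efficiency shrinks toward 0 as average token usage approaches the budget.
    mean_tokens = sum(tokens) / n
    efficiency = max(0.0, 1.0 - mean_tokens / token_budget)
    if accuracy + efficiency == 0:
        return 0.0
    # Harmonic mean punishes imbalance: a model must be BOTH accurate and lean.
    return 2 * accuracy * efficiency / (accuracy + efficiency)
```

Because the harmonic mean collapses toward its smaller argument, a model with perfect accuracy but near-budget token usage still scores close to zero, giving the variant its extra sensitivity to average token usage.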

2. Motivation and Conceptual Rationale

Standard accuracy metrics reward models solely for correctness, regardless of computational resource expenditure. This is pathological in the context of LLMs with CoT decoders, which frequently generate unnecessarily lengthy solutions—a phenomenon known as overthinking—especially on trivial inputs (Aggarwal et al., 18 Aug 2025, Rakotonirina et al., 6 Jan 2026). OAAₜ addresses this by:

  • Assigning zero credit to answers requiring t or more tokens, regardless of correctness.
  • Creating a strict trade-off: models that “overthink” are penalized unless their solutions are both correct and concise.
  • Enabling rigorous cross-model comparisons conditioned on token budgets tailored to user requirements or estimated task difficulty.

By integrating over a range of thresholds t, OAA metrics trace out a curve whose area under the curve (AUC) aggregates overall performance across efficiency regimes.

3. Principal Variants and Curve-Based Summarization

Fixed-Threshold OAAₜ

All major benchmarking frameworks (OptimalThinkingBench, TRAAC, THOUGHTTERMINATOR) report OAAₜ at various preset values of t, enabling practitioners to examine performance given specific compute constraints (Aggarwal et al., 18 Aug 2025, Singh et al., 2 Oct 2025).

Difficulty-Adapted OAAₜ

THOUGHTTERMINATOR (Pu et al., 17 Apr 2025) introduces a generalization:

\tau(q): \text{difficulty} \rightarrow \text{token budget}

where \tau(q) is calibrated via model error rates or shallow LM regressors. \mathrm{OAA}_\tau then penalizes overthinking relative to task difficulty rather than against a universal fixed threshold, providing more granular calibration on mixed-difficulty evaluations.
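One way such a calibration could be realized is sketched below, assuming difficulty is estimated from empirical error rates on a calibration set; the bucketing scheme, quantile choice, and function names are illustrative, not THOUGHTTERMINATOR's exact procedure:

```python
def calibrate_budgets(calib_tokens, calib_errors, n_buckets=4, quantile=0.9):
    """Map difficulty buckets (grouped by error rate) to token budgets.

    calib_tokens: token counts observed on calibration items
    calib_errors: per-item error rates from a reference model pool
    Returns a list of (error_rate_upper_bound, budget) pairs, easiest first.
    """
    # Sort calibration items by estimated difficulty (error rate).
    items = sorted(zip(calib_errors, calib_tokens))
    bucket_size = max(1, len(items) // n_buckets)
    budgets = []
    for b in range(n_buckets):
        chunk = items[b * bucket_size:(b + 1) * bucket_size] or items[-bucket_size:]
        errs, toks = zip(*chunk)
        toks = sorted(toks)
        # Budget: the chosen quantile of observed token usage in the bucket.
        budget = toks[min(len(toks) - 1, int(quantile * len(toks)))]
        budgets.append((max(errs), budget))
    return budgets

def tau(error_rate, budgets):
    """Look up the token budget for a question's estimated error rate."""
    for upper, budget in budgets:
        if error_rate <= upper:
            return budget
    return budgets[-1][1]
```

Under this sketch, easy questions (low error rate) receive tight budgets while hard ones keep generous allowances, so \mathrm{OAA}_\tau only penalizes spending that is excessive *for the question's difficulty*.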

Area-Under-Curve AUC₍OAA₎

To avoid arbitrariness in the selection of t, leading studies report the (normalized) area under the OAA curve:

\mathrm{AUC}_{\mathrm{OAA}} = \frac{1}{t_{\max}} \int_{0}^{t_{\max}} \mathrm{OAA}_t \, dt \approx \frac{1}{t_{\max}} \sum_{t=0}^{t_{\max}} \mathrm{OAA}_t

Here, t_{\max} is either a fixed maximal token budget or the mean number of tokens required by the base model (Rakotonirina et al., 6 Jan 2026, Aggarwal et al., 18 Aug 2025, Singh et al., 2 Oct 2025). \mathrm{AUC}_{\mathrm{OAA}} supplies a scalar summary of the entire correctness–efficiency trade-off curve, yielding a statistic in [0, 1] (often scaled to percentage points).

4. Empirical Methodology and Implementation

The practical computation of OAAₜ and its summary statistics adheres to the following protocol:

  1. For each example i, record (\text{Correctness}_i, \text{ThinkTokens}_i).
  2. For a range of thresholds t (or for each \tau(q_i)), compute OAAₜ per the formula above.
  3. Aggregate across thresholds to obtain \mathrm{AUC}_{\mathrm{OAA}} using a Riemann sum.
  4. For difficulty-adaptive OAA, employ a per-example \tau(q_i) determined either by empirical calibration or difficulty regression (Pu et al., 17 Apr 2025, Singh et al., 2 Oct 2025).
  5. Tabulate and/or plot OAAₜ vs. t to compare model behaviors.

Illustrative pseudocode (OptimalThinkingBench):

from typing import List, Tuple

def compute_oaa_and_auc(tokens: List[int], correct: List[int],
                        t_max: int = 1000) -> Tuple[List[float], float]:
    """Compute OAA_t for every t in 0..t_max and its normalized AUC."""
    n = len(tokens)
    oaa = [0.0] * (t_max + 1)
    for t in range(t_max + 1):
        # Credit only answers that are correct AND fit strictly within budget t.
        count = sum(1 for i in range(n)
                    if correct[i] == 1 and tokens[i] < t)
        oaa[t] = count / n
    # Riemann-sum approximation of the normalized area under the OAA curve.
    auc = sum(oaa) / (t_max + 1)
    return oaa, auc
(Aggarwal et al., 18 Aug 2025)
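As a sanity check, the routine can be exercised on a toy set of three items (token counts and correctness values invented for illustration); a compact standalone version is repeated here so the snippet runs on its own:

```python
from typing import List, Tuple

def compute_oaa_and_auc(tokens: List[int], correct: List[int],
                        t_max: int = 1000) -> Tuple[List[float], float]:
    n = len(tokens)
    # OAA_t: fraction of items both correct and using fewer than t think tokens.
    oaa = [sum(c == 1 and tk < t for c, tk in zip(correct, tokens)) / n
           for t in range(t_max + 1)]
    return oaa, sum(oaa) / (t_max + 1)

# Three items: an instant answer, a concise CoT, and a very long CoT.
tokens = [0, 50, 2000]
correct = [1, 1, 1]              # raw accuracy is 100%
oaa, auc = compute_oaa_and_auc(tokens, correct, t_max=100)
print(oaa[100])                  # only 2 of 3 answers fit within 100 tokens
print(round(auc, 3))             # AUC is dragged down by the 2000-token answer
```

Despite perfect raw accuracy, the single 2000-token solution caps OAAₜ at 2/3 for every budget up to t_max, illustrating how the metric exposes overthinking that accuracy alone hides.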

5. Comparative Performance and Observed Trade-Offs

Comprehensive benchmarks demonstrate that OAAₜ and \mathrm{AUC}_{\mathrm{OAA}} provide sharper discrimination than raw accuracy when evaluating reasoning models:

| Model | Avg Think Tokens | Raw Accuracy (%) | AUC_OAA (%) | Source |
|---|---|---|---|---|
| Llama-3.3-70B | 0 | 96.8 | 96.8 | (Aggarwal et al., 18 Aug 2025) |
| GPT-OSS-120B | 110 | 94.9 | 84.3 | (Aggarwal et al., 18 Aug 2025) |
| Magistral-Small-2506 | 2303 | 92.7 | 11.6 | (Aggarwal et al., 18 Aug 2025) |
| Qwen3-4B base | — | — | 80.1 | (Singh et al., 2 Oct 2025) |
| Qwen3-4B + TRAAC | — | — | 85.1 | (Singh et al., 2 Oct 2025) |

Key experimental findings:

  • For non-thinking models (zero CoT tokens), OAAₜ coincides with raw accuracy.
  • Many CoT-equipped reasoning LLMs incur a severe penalty: high accuracy paired with large average token counts lowers \mathrm{AUC}_{\mathrm{OAA}} dramatically.
  • Adaptive reasoning or compression methods increase \mathrm{AUC}_{\mathrm{OAA}} by truncating redundant reasoning on simple instances.
  • Difficulty-adaptive OAAₜ reveals that most models vastly exceed minimal necessary token budgets on easy queries (Pu et al., 17 Apr 2025).

6. Diagnostic and Optimization Roles

OAAₜ and \mathrm{AUC}_{\mathrm{OAA}} are diagnostically and algorithmically significant:

  • Provide an explicit scalarization of the efficiency–accuracy Pareto frontier, enabling quantitative model selection.
  • Serve as training objectives for model optimization, either via RL (reward shaping with post-answer token penalties (Rakotonirina et al., 6 Jan 2026)) or post-hoc calibration (via budgeted decoding (Pu et al., 17 Apr 2025)).
  • Clarify calibration failures: high raw accuracy often obscures substantial unnecessary reasoning, which OAAₜ reveals in summary and per-instance analyses.
  • Underwrite composite metrics (e.g., an optimal-thinking F₁ score between \mathrm{AUC}_{\mathrm{OAA}} and underthinking accuracy (Aggarwal et al., 18 Aug 2025)).
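Such a composite F₁ can be sketched as a harmonic mean of the two component scores; treating it as a plain harmonic mean of \mathrm{AUC}_{\mathrm{OAA}} and underthinking accuracy is an assumption about OptimalThinkingBench's exact definition, not a reproduction of it:

```python
def optimal_thinking_f1(auc_oaa: float, underthinking_acc: float) -> float:
    """Harmonic mean of overthinking-adjusted AUC and underthinking accuracy.

    Both inputs are fractions in [0, 1]. Treating the composite as a plain
    harmonic mean is an illustrative assumption, not the benchmark's spec.
    """
    if auc_oaa + underthinking_acc == 0:
        return 0.0
    return 2 * auc_oaa * underthinking_acc / (auc_oaa + underthinking_acc)
```

As with any F₁-style score, a model cannot compensate for severe overthinking with strong hard-problem performance (or vice versa); both components must be high for the composite to be high.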

7. Extensions, Limitations, and Interpretative Nuances

  • Extensions: Harmonic-mean variants allow smooth interpolation between accuracy and efficiency (Srivastava et al., 5 Jul 2025). Difficulty-adaptive OAAₜ (THOUGHTTERMINATOR) generalizes fixed-budget analysis to heterogeneous test distributions.
  • Limitations: OAAₜ’s penalization is strict: correct answers using t or more tokens receive zero credit, which for very hard problems may understate valid but costly reasoning. This motivates adaptive thresholds \tau(q) aligned with estimated task difficulty (Pu et al., 17 Apr 2025, Singh et al., 2 Oct 2025).
  • Interpreting Results: High OAAₜ at low t indicates concise problem solving; large gaps between OAAₜ and raw accuracy signal overthinking. The shape and knee of the OAAₜ curve can reveal how well, or how poorly, a model is calibrated to input difficulty (Rakotonirina et al., 6 Jan 2026).

Overthinking-Adjusted Accuracy metrics provide a principled, interpretable framework for evaluating and developing reasoning models that are not only correct but computationally efficient, forming a core part of modern evaluation and mitigation strategies for large-scale reasoning-capable LLMs (Aggarwal et al., 18 Aug 2025, Pu et al., 17 Apr 2025, Rakotonirina et al., 6 Jan 2026, Singh et al., 2 Oct 2025, Srivastava et al., 5 Jul 2025).
