MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

Published 17 Apr 2026 in cs.AI | (2604.16009v1)

Abstract: Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces MEDLEY-BENCH, demonstrating that increasing model scale improves evaluation ability while control remains flat, challenging traditional scaling assumptions.
It employs a three-step protocol—solo, private, and social—and uses rule-based and judge-based scoring to derive detailed metacognitive scores (MMS and MAS).
The study highlights practical implications by identifying vulnerabilities to social pressure and recommending ensemble strategies for robust, calibrated AI deployment.

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

Introduction and Framework

"MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition" (2604.16009) introduces MEDLEY-BENCH, a benchmark specifically designed to evaluate behavioural metacognition in LLMs. Metacognition is operationalized in terms of four sub-abilities (Monitoring, Control, Evaluation, Self-regulation) aligned with Nelson and Narens’s framework and the DeepMind cognitive taxonomy. MEDLEY-BENCH separates independent reasoning, private self-revision, and socially influenced revision by genuine inter-model disagreement, providing a fine-grained analysis that is methodologically orthogonal to classic accuracy-focused benchmarks.

The benchmark employs a three-step protocol:

Solo (Step A): Independent analysis.
Private (Step B-Private): Structured self-review nudge, isolating self-revision without social input.
Social (Step B-Social): Exposure to responses and consensus from eight anonymised analyst models.

Unique in its design, MEDLEY-BENCH builds on genuine ambiguity and real inter-model disagreement, avoiding synthetic conflict. It evaluates models on 130 instances across five domains (medical diagnosis, system troubleshooting, code review, architecture design, statistical reasoning), using both rule-based and judge-based scoring to derive the Medley Metacognition Score (MMS) and Medley Ability Score (MAS).

Evaluation–Control Dissociation and Scaling Effects

A central empirical result is the dissociation between metacognitive evaluation and control with respect to model scale. Analysis across 35 models from 12 families demonstrates that evaluation (i.e., the ability to assess reasoning quality) increases with model scale, but control (the capacity to adjust strategy or resist unjustified consensus pressure) does not.

Figure 1: Evaluation ability rises consistently with model size, while control remains flat or inconsistent across model generations.

For example, in the Gemma family, evaluation ability improves monotonically from Gemma-3N (4B) through Gemma-3 (27B), yet control remains nearly constant and drops in the newest variants. Similar dissociation is observed within the GPT family. These results directly contradict the conventional assumption that scaling LLMs uniformly enhances all cognitive-related abilities.

Several numerical highlights:

Smaller models, such as Claude Haiku 4.5, ranked first in MMS (62.2), surpassing larger and more expensive models.
GEMMA-3 (12B) closely matches GEMMA-3 (27B) and overtakes GEMMA-4 (31B) in metacognitive quality, illustrating non-monotonicity of competence with model scale.

The progressive adversarial stage scrutinizes how models revise beliefs under manipulated consensus signals. This stage reveals two distinct behavioural profiles:

Argument-evaluators: Revise beliefs in response to analyst arguments, showing robustness against manipulated consensus.
Statistics-followers: Capitulate to consensus statistics rather than epistemic arguments, showing large positive shifts when consensus is adversarially inverted.
Figure 2: Cognitive map differentiates models as argument-evaluators or statistics-followers under adversarial and stripped social pressure conditions.

No model occupies the off-diagonal quadrants of the two-dimensional cognitive map, indicating that argument-evaluation and analyst-dependence are manifestations of the same underlying source monitoring capability.

Further, cross-validation identifies that the Normative/Informational judge dimension in normal mode robustly predicts adversarial susceptibility:

Figure 3: Only the Normative vs Informational judge dimension significantly correlates with adversarial pressure sensitivity ( $\rho = -0.82$ ).

Models citing “most analysts agree” in normal mode capitulate under adversarial conditions, while those citing specific arguments resist. This critical dimension provides an actionable deployment signal for screening models based on vulnerability to social pressure.

Ipsative Profiling and Universal Evaluation Deficit

Applying ipsative scoring to decouple the dominant general articulation factor shows family-level specialization:

Anthropic models: Monitoring-dominant.
GPT-5.4 generation: Control-dominant.
Google Gemma models: Self-regulation-dominant.

Strikingly, evaluation is universally the weakest relative ability across all models; no model demonstrates evaluation as its dominant capacity under ipsative scoring.

Figure 4: Heatmap of ipsative ability profiles highlights the universal evaluation deficit and family-level metacognitive specializations.

This pattern exposes a systematic knowing–doing gap, substantiating the discrepancy between recognizing the quality of reasoning and regulating one’s own reasoning processes.

Practical and Theoretical Implications

The results have direct ramifications for both the design and deployment of AI systems. Aggregate task performance, commonly used in leaderboards, insufficiently captures metacognitive robustness or belief revision under social pressure. Explicitly optimizing models for metacognitive calibration and resistance to consensus error is absent in current alignment pipelines (including RLHF, Constitutional AI, DPO).

MEDLEY-BENCH provides deterministic, behaviourally targeted metrics suitable for incorporation into training reward signals. Screening for metacognitive diversity and robustness (rather than only output quality) is relevant for ensemble system design and regulatory compliance, especially under frameworks such as the EU AI Act.

Practically, models vary in their failure modes under social influence. Deploying models that privilege argument quality over majority agreement is critical in multi-agent contexts to avoid propagation of consensus errors and mitigate sycophantic interactions. Ipsative profiling enables construction of ensembles with complementary metacognitive strengths.

Limitations and Future Directions

While MEDLEY-BENCH advances the evaluation of metacognitive behaviour, it is bounded by several limitations:

It measures externally prompted reconsideration, not spontaneous metacognitive monitoring.
The four-ability structure is strongly influenced by a general factor, though ipsative scoring ameliorates this at the instance level.
No direct human baseline; future work should benchmark human metacognitive updating under ambiguous and socially pressurized settings.
Some benchmarked models overlap with analyst pools, but residual analysis suggests minimal bias.

Extension to spontaneous metacognition, expanded domain coverage, and implementation of the progressive adversarial protocols as training signals represent promising future research trajectories. The non-monotonic scaling of metacognitive competence suggests that architectural innovation, not parameter count alone, governs progress toward robust metacognitive AI.

Conclusion

MEDLEY-BENCH establishes a new paradigm for evaluating behavioural metacognition in LLMs. Its three-step protocol, adversarial analysis, and ipsative profiling uncover a pronounced evaluation–control dissociation and a universal evaluation deficit—phenomena not detectable by classical QA benchmarks. The practical implications span ensemble design, deployment screening, and regulatory audit. Theoretically, the benchmark problematizes notions of scale-driven cognitive enhancement and underlines the necessity of direct measurement and reward of calibrated, proportional belief revision and social robustness in next-generation AI training schemes.

Markdown Report Issue