BBEH: Advanced LLM Reasoning Benchmark

Updated 6 February 2026
  • BBEH is a benchmark suite featuring 23 engineered tasks that test large language models' compositional, adversarial, and in-depth reasoning skills.
  • It upgrades previous benchmarks by introducing longer prompts, deeper chain-of-thought requirements, and sophisticated distractors to expose model deficits.
  • Evaluation using micro-averages, harmonic means, and zero/few-shot protocols highlights significant performance gaps, driving innovations like per-instance program synthesis.

BIG-Bench Extra Hard (BBEH) is a benchmark suite developed to systematically evaluate and push the boundaries of general reasoning in LLMs. BBEH supersedes earlier efforts—BIG-Bench and BIG-Bench Hard (BBH)—by introducing 23 carefully engineered tasks that require deeper, more compositional, and adversarial reasoning. Its construction, metrics, and utility are directed at revealing deficits in current LLM generalization and reasoning capabilities, establishing rigorous criteria for progress in the field (Kazemi et al., 26 Feb 2025).

1. Genesis and Motivation

BBEH was introduced to address saturation in existing reasoning benchmarks. While BIG-Bench provided broad multi-domain coverage, and BBH distilled a subset of tasks where LLMs fell short of average human performance, both have seen near-perfect scores from contemporary models such as Gemini 2.0 Flash and GPT-4o. This diminishing utility motivated the introduction of BBEH, which not only preserves the original domains but replaces each with substantially harder, novel tasks. These tasks systematically defeat prior solution heuristics via adversarial input construction, increased context length, and sophisticated distractors. Construction of BBEH employs semi-adversarial iteration with state-of-the-art LLMs as black-box oracles, iteratively escalating complexity until top models score below 70% (Kazemi et al., 26 Feb 2025).

2. Task Taxonomy and Benchmark Construction

BBEH consists of 23 tasks, each corresponding to a reasoning family from BBH but designed for greater difficulty and resistance to memorization or shallow heuristics. The coverage includes:

  • Logical and deductive reasoning (Boolean Expressions, Boardgame QA, Web of Lies)
  • Temporal and causal reasoning (Temporal Sequence, Time Arithmetic, Causal Understanding)
  • Spatial and geometric reasoning (Spatial Reasoning, Geometric Shapes)
  • Multi-hop and compositional inference (Zebra Puzzles, Multistep Arithmetic, SportQA)
  • Challenge types such as learning on the fly (Hyperbaton, Linguini), distractor resilience (Shuffled Objects, Word Sorting), error detection (Dyck Languages), and understanding of humor/sarcasm (NYCC, Sarc Triples) (Kazemi et al., 26 Feb 2025, Stein et al., 26 Oct 2025).

Each task includes 200 instances (except Disambiguation QA, which has 120), released exclusively for evaluation (no train/dev/test splits). Prompts are substantially longer—on average 6× longer than in BBH—and require deeper chains of inference; model chain-of-thought (CoT) responses are roughly 7× longer than on BBH.

3. Quantifying Task Difficulty and Metrics

Task hardness is engineered along several axes:

  • Input complexity: Macro-average prompt length is increased by a factor of six relative to BBH.
  • Reasoning depth: Measured via average CoT length, which is approximately sevenfold higher in BBEH.
  • Distractor and adversarial content: Tasks include out-of-distribution patterns and “needle in a haystack” retrieval scenarios.
  • Baseline performance: Random accuracy is reduced to ≈8.4% (micro-average), with a harmonic mean of 2.4%—an order of magnitude below BBH.

For aggregate reporting, per-task accuracy $a_i = c_i / n_i$ is used. The harmonic mean, which penalizes uneven per-task performance, is defined as

$$H = \frac{N}{\sum_{i=1}^{N} \frac{1}{a_i'}}$$

with $a_i' = a_i + 0.01$ for numerical stability (Kazemi et al., 26 Feb 2025).
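As a concrete illustration of why the harmonic mean is reported alongside the micro-average, the two aggregates can be sketched as follows (the per-task counts below are invented for the example, not BBEH results):

```python
def micro_average(correct, total):
    """Micro-average: pool all instances across tasks."""
    return sum(correct) / sum(total)

def harmonic_mean(correct, total, eps=0.01):
    """Harmonic mean of per-task accuracies a_i = c_i / n_i,
    with a_i' = a_i + eps for numerical stability."""
    accs = [c / n + eps for c, n in zip(correct, total)]
    return len(accs) / sum(1.0 / a for a in accs)

# Hypothetical per-task results: 3 tasks of 200 instances each,
# where the model completely fails one task.
correct = [180, 100, 0]
total = [200, 200, 200]

print(f"micro-avg: {micro_average(correct, total):.3f}")   # 0.467
print(f"harmonic:  {harmonic_mean(correct, total):.3f}")   # 0.029
```

The single failed task drags the harmonic mean far below the micro-average, which is exactly the uneven-performance penalty described above.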

4. Model Evaluation Protocols and Results

Evaluation is conducted in zero- or few-shot settings, using standardized prompting with explicit reasoning requests (“Think step by step…”). Answer extraction employs minimal normalization and strict exact-match metrics (Kazemi et al., 26 Feb 2025).
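A minimal sketch of this protocol is shown below. The prompt template, answer-marker heuristic, and normalization rules here are assumptions for illustration, not the paper's exact implementation:

```python
import re

def build_prompt(question: str) -> str:
    # Zero-shot prompt with an explicit reasoning request (assumed template).
    return f"{question}\n\nThink step by step, then give your final answer."

def normalize(text: str) -> str:
    # Minimal normalization: strip whitespace and surrounding
    # punctuation, lowercase.
    return text.strip().strip(".,:;\"'()").lower()

def extract_answer(response: str) -> str:
    # Heuristic extraction (assumed): take the text after a
    # "final answer" marker, else fall back to the last line.
    match = re.search(r"(?i)final answer[:\s]*(.+)$", response.strip())
    text = match.group(1) if match else response.strip().splitlines()[-1]
    return normalize(text)

def exact_match(prediction: str, gold: str) -> bool:
    # Strict exact match after normalization.
    return extract_answer(prediction) == normalize(gold)

print(exact_match("Reasoning...\nFinal answer: (A)", "(a)"))  # True
```

Strict exact match keeps scoring unambiguous, at the cost of occasionally penalizing correct answers phrased differently from the gold label.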

The table below summarizes key metrics for representative models:

| Model | Params | Micro-avg | Harmonic mean |
|---|---|---|---|
| Random baseline | – | 8.4% | 2.4% |
| Llama 3.1 8B Instruct | 8B | 10.6% | 3.6% |
| Gemini 2.0 Flash | ≈200B | 23.9% | 9.8% |
| GPT-4o (2024-11-20) | – | 22.3% | 6.0% |
| o3-mini (high; specialized) | – | 54.2% | 44.8% |

Performance on BBEH is generally an order of magnitude lower than on BBH (e.g., Gemini 2.0 Flash: 85.2% → 23.9% micro). Even the best reasoning-specialized model (o3-mini) falls short of 50% harmonic mean on the suite, and on several tasks (e.g., Buggy Tables, Object Properties, Temporal Sequence), micro-averages remain below 10% (Kazemi et al., 26 Feb 2025).

5. Impact on Model Development and Methodological Innovations

The persistent difficulty of BBEH has catalyzed methodological advances, notably structured reasoning such as Chain of Thought (CoT), Program of Thought (PoT), and Per-Instance Program Synthesis (PIPS) (Stein et al., 26 Oct 2025):

  • CoT and PoT: Enhance performance on tasks requiring multi-step reasoning, but struggle with algorithmic and symbolic domains, often yielding trivial or broken solutions.
  • PIPS: Advances over PoT by introducing per-instance symbolic extraction, iterative program refinement using explicit structural feedback, and dynamic confidence-based selection between CoT and program synthesis. Program correctness is enforced by penalizing outputs with hard-coded returns, syntax/type errors, or failure to use the extracted symbols.

On the full BBEH suite (Gemini-2.0-Flash), PIPS achieves up to 8.6% and 9.4% absolute harmonic mean improvements over PoT and CoT, respectively, while reducing PoT’s erroneous program generations by 65.1% (well-formed code rate rising from 38% to 83% on algorithmic tasks). Confidence metrics accurately switch between CoT and synthesis in 65.3% of critical cases, and ablations confirm that both symbol extraction and iterative feedback are critical to performance (Stein et al., 26 Oct 2025).
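The structural checks described above can be sketched as a simple validator over candidate programs. This is an illustrative assumption about the kind of feedback involved, not the authors' implementation (function names and penalty labels are invented):

```python
import ast

def validate_program(source: str, extracted_symbols: list[str]) -> list[str]:
    """Return structural penalties for a candidate program, in the
    spirit of PIPS-style feedback (illustrative, not the paper's code)."""
    penalties = []

    # 1. Syntax errors: the program must at least parse.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["syntax_error"]

    # 2. Hard-coded returns: flag functions whose body is just
    #    "return <constant>", i.e. the model guessed the answer.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            body = [n for n in node.body if not isinstance(n, ast.Expr)]
            if len(body) == 1 and isinstance(body[0], ast.Return) \
                    and isinstance(body[0].value, ast.Constant):
                penalties.append(f"hard_coded_return:{node.name}")

    # 3. Unused extracted symbols: every symbol pulled from the
    #    problem instance should appear somewhere in the program.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    for sym in extracted_symbols:
        if sym not in used:
            penalties.append(f"unused_symbol:{sym}")

    return penalties

print(validate_program("def solve():\n    return 42\n", ["n_items"]))
# → ['hard_coded_return:solve', 'unused_symbol:n_items']
```

Feeding penalties like these back into the model for another refinement round is what drives the well-formed-code improvements reported above.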

6. Insights from Scaling and Emergent Abilities

Scaling studies indicate that BBEH tasks predominantly exhibit “emergent” properties: only large models with explicit reasoning scaffolds—and sometimes architectural or training innovations—achieve non-random accuracy. For example, UL2R “mixture of denoisers” (as in U-PaLM) shifts the scaling curve, enabling strong performance (up to ~2x compute savings over vanilla scaling laws) and unlocking capabilities (e.g., spatial reasoning, sarcasm detection) at significantly smaller model sizes (Tay et al., 2022).

Specifically, U-PaLM 62B, with UL2R continuation, outperforms or matches PaLM 540B on spatial and visual reasoning tasks and demonstrates earlier "emergence" on several BBEH-aligned tasks (e.g., "navigate," "geometric_shapes," "snarks"). This suggests that inductive biases introduced by UL2R or similar objectives may play a significant role in sample efficiency and few-shot reasoning on BBEH-class tasks.

7. Significance, Shortcomings, and Future Directions

BBEH provides a robust standard for evaluating LLM generalization in reasoning, exposing large headroom for progress and sharply highlighting the current limitations of state-of-the-art models. Model performance remains far behind human (or expert annotator) accuracy—gaps above 55 points (harmonic mean) are typical. Reasoning-specialized models yield the largest advantages only for formal, algorithmic, and counting domains; gains on commonsense or social reasoning remain minimal.

Recommendations for further benchmark development emphasize adversarial, model-in-the-loop construction, broader inclusion of soft and “normative” reasoning, multi-modal grounding (integration of text, code, tables, and vision), challenging long-context and state-tracking tasks, and systematic evaluation of meta-reasoning and compositional planning (Kazemi et al., 26 Feb 2025). These directions are vital to sustain evaluation rigor as LLM capabilities continue to advance.

