How2Bench: LLM Procedure & Benchmark

Updated 10 February 2026
  • How2Bench is a benchmark that evaluates procedural validity in LLM-generated instructions using a dataset of 7,000 goal-conditioned examples and a multi-stage web-mining pipeline.
  • It introduces How2Score, a binary evaluation protocol and automated judge model that identifies critical failures with 80.5% human agreement, highlighting scaling trends across model families.
  • The 55-criterion checklist provides a systematic approach to constructing reproducible, high-quality code benchmarks across design, construction, evaluation, analysis, and release phases.

The name How2Bench refers to two distinct contributions to LLM benchmarking: (1) a dataset and protocol for evaluating goal-conditioned procedure generation in LLMs (Chang et al., 9 Feb 2026), and (2) a 55-item methodological checklist for constructing reliable, reproducible code-related benchmarks (Cao et al., 18 Jan 2025). Both frameworks address persistent challenges in dataset quality, procedural validity, evaluation reliability, and transparency in LLM assessment.

1. How2Bench for Procedure Generation: Dataset and Composition

How2Bench is a benchmark designed to systematically evaluate the procedural validity of LLMs in generating stepwise, goal-conditioned instructions across real-world tasks (Chang et al., 9 Feb 2026). It consists of exactly 7,000 examples, evenly distributed across 14 diverse topics—Art & Design, Crime & Law, Education & Jobs, Electronics & Hardware, Fashion & Beauty, Food & Dining, Health, Home & Hobbies, Industrial, Religion, Science/Math/Technology, Sports & Fitness, Transportation, and Travel & Tourism. Each example includes a clear goal, an explicit list of required resources, and a reference procedure containing 5–15 ordered steps.

Dataset construction is grounded in a multi-stage web-mining pipeline (How2Mine) involving:

  1. Sampling & Topic Stratification: Extraction from the DCLM web corpus filtered for tutorial content, balanced by topic.
  2. Procedure Extraction: Automated identification and parsing of procedural tasks and outcome-focused goal statements by GPT-4.1.
  3. Heuristics Filtering: Exclusion based on step count and n-gram repetition thresholds (e.g., bigram ≥ 40%, trigram ≥ 35%, four-gram ≥ 30%).
  4. LLM-Based Filtering: Removal of procedures unsuitable for procedural evaluation (e.g., those dependent on named entities, UI-driven, pure calculations, creative or logically incoherent tasks).
  5. Post-processing & Validation: Deterministic rewriting of goals, explicit resource extraction, and GPT-4.1-based final checks.
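
The heuristics stage can be sketched as a repetition filter over the step text. This is a minimal illustration, not the released pipeline code: the tokenization and the exact rejection rule are assumptions, while the step-count range (5–15) and the n-gram thresholds follow the figures above.

```python
from collections import Counter

def ngram_repetition(tokens, n):
    """Fraction of n-grams that duplicate an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c - 1 for c in counts.values()) / len(grams)

def passes_heuristics(steps, min_steps=5, max_steps=15, thresholds=None):
    """Apply the step-count and n-gram repetition filters to a procedure
    given as a list of step strings (thresholds per the paper)."""
    thresholds = thresholds or {2: 0.40, 3: 0.35, 4: 0.30}
    if not (min_steps <= len(steps) <= max_steps):
        return False
    tokens = " ".join(steps).lower().split()
    return all(ngram_repetition(tokens, n) < t for n, t in thresholds.items())
```

A highly repetitive procedure trips the bigram threshold long before the trigram one, which is why the thresholds tighten as n grows.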

The final evaluation set is sampled from 351,162 high-quality procedures; the remainder (≈344,162) forms a training pool, How2Train. GPT-4.1 spot checks confirm 96.6% reference reasonableness.

2. Critical-Failure Evaluation Protocol (How2Score)

How2Bench employs How2Score, a stringent binary evaluation protocol for model-generated procedures. For each sample, the LLM-generated procedure is tested for the presence of “critical failures”—errors sufficient to prevent achieving the stated goal under supplied resource constraints. Failure modes include essential step omissions, harmful or extraneous steps, logical contradictions, and severe vagueness.

Formally, for an input $x = (g, R, S^*, S)$ with goal $g$, resources $R$, reference procedure $S^*$, and model output $S$, the judge function

$$J(g, R, S^*, S) \in \{\mathrm{no\_failure},\ \mathrm{has\_failure}\}$$

determines validity, and the aggregate Success Rate over a dataset $D$ is

$$\mathrm{Score}(D) = \frac{1}{|D|} \sum_{x \in D} \mathbf{1}\!\left[\, J(g, R, S^*, S) = \mathrm{no\_failure} \,\right]$$

Consensus labeling by annotators achieves Krippendorff's $\alpha = 0.593$ after binary aggregation. Automated judging is performed by How2Judge, an 8B open model distilled from GPT-5, which attains 80.5% agreement with the human majority, making low-cost, large-scale evaluation tractable.
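
The aggregation step above is straightforward to express in code. The sketch below mocks the judge as a plain function standing in for a How2Judge model call; the dictionary field names are illustrative assumptions, not the released data format.

```python
def success_rate(dataset, judge):
    """Aggregate How2Score: percentage of examples the judge labels free
    of critical failures. `judge` is a stand-in for the How2Judge call;
    it returns "no_failure" or "has_failure" per example."""
    ok = sum(
        judge(ex["goal"], ex["resources"], ex["reference"], ex["generated"])
        == "no_failure"
        for ex in dataset
    )
    return 100.0 * ok / len(dataset)
```

Because each judgment is binary, the Success Rate is simply the mean of the indicator in the formula above, scaled to a percentage.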

3. Empirical Benchmarks and Scaling Analysis

How2Bench reveals clear performance stratification and scaling effects across major model families. Selected How2Score results (closed/open) are summarized below:

Model              How2Score (%)   Avg. Tokens
GPT-5                  67.99           103
Claude Opus 4.5        64.26           110
Gemini 2.5 Pro         56.11           101
Qwen 3 32B             46.04           106
OLMo 3 32B             43.16           101
Llama 3.1 70B          42.13           114
Qwen 3 8B              35.34            99
OLMo 3 7B              30.23           102
Llama 3.1 8B           26.86            99

Comprehensive scaling experiments using OLMo checkpoints demonstrate monotonic gains in How2Score with increasing pretraining progress and model size. Notably, formatting proxy metrics stabilize within the first 10% of pretraining, while How2Score continues to improve, reflecting gains in deeper procedural validity. Correlation between conditional perplexity and How2Score is only partial ($\rho \in [0.23, 0.97]$), indicating that likelihood-based metrics do not reliably capture procedural correctness.

At the instance level, reference step count is the principal determinant of difficulty (odds ratio ≈ 0.75 per additional step), while resource count exerts minimal influence. Residual verbosity introduces mild bias (odds ratio ≈ 1.015 per percentage-point increase in the generation-to-reference length ratio).

4. Usage Policies and Procedural Constraints

How2Bench requires strict adherence to evaluation constraints: models must be supplied the goal, resource list, and required number of steps; outputs must match the reference step count, with each step corresponding to a discrete action. Released artifacts include the How2Judge model weights, prompts, and data splits, facilitating reproducible, scalable deployment. Key caveats are that all procedures are descriptive (not executable), critical-failure detection is only a proxy for true execution suitability, and judge disagreement with human annotation remains at approximately 19.5%. Memorization risks are empirically limited: midtraining exposure yields only modest (+3 pp) How2Score gains.
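
The format constraints described above reduce to a simple check on model output. This is a rough sketch: the step-count match is stated by the protocol, while the non-emptiness test is only a crude proxy for "each step is a discrete action."

```python
def conforms_to_protocol(generated_steps, reference_steps):
    """Check the protocol's output-format constraints: the generated
    procedure must have exactly as many steps as the reference, and
    every step must be non-empty (proxy for a discrete action)."""
    return (
        len(generated_steps) == len(reference_steps)
        and all(s.strip() for s in generated_steps)
    )
```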

5. Representative Data Format and Task Instances

Each How2Bench entry contains:

  • Goal: Textual description of the task’s intended outcome.
  • Resources: Explicitly enumerated objects, tools, or ingredients.
  • Reference Steps: Ordered sequence, length 5–15.
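
The three fields above can be captured in a small schema. The class and field names below are illustrative assumptions, not the released data format; only the 5–15 step-count constraint comes from the benchmark description.

```python
from dataclasses import dataclass

@dataclass
class How2BenchEntry:
    """Illustrative schema for one benchmark entry (field names assumed)."""
    goal: str                  # textual description of the intended outcome
    resources: list            # explicitly enumerated tools / ingredients
    reference_steps: list      # ordered steps, length 5-15

    def __post_init__(self):
        if not 5 <= len(self.reference_steps) <= 15:
            raise ValueError("reference procedure must have 5-15 steps")
```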

Examples include legal property transfer (Crime & Law) and culinary procedures (Food & Dining), indicating coverage of both procedural formality and practical, multi-step tasks. This structure enforces specificity and determinism in model outputs, a critical property for robust evaluation.

6. How2Bench as a Benchmark Construction Checklist

An alternative and independent use of the term How2Bench denotes a 55-criterion checklist for the creation of high-quality, reproducible code-related benchmarks (Cao et al., 18 Jan 2025). This framework systematically organizes benchmark design into five phases:

  1. Design (gap analysis, scope definition, capability specification, application context)
  2. Construction (data traceability, contamination checks, deduplication, coverage guarantees, validation)
  3. Evaluation (model/prompting selection, environment reproducibility, trial repetition, logging)
  4. Analysis (difficulty calibration, variance measurement, differentiability, visualization, qualitative review)
  5. Release (open sourcing, licensing, documentation, results/log release, removal of sensitive data, community support)

Empirical analysis using the checklist reveals extensive gaps in current practice: nearly 70% of surveyed code-related benchmarks lack any quality assurance, and only 8.7% report code coverage as part of test oracle design. Common failures include non-removal of secrets, missing prompt disclosures, and deficient reproducibility information (e.g., hardware/OS configurations). Human studies confirm strong consensus (>80% agreement) on the importance of the checklist criteria, but also identify persistent awareness gaps, especially in denoising, environment logging, and experiment replication.

7. Significance, Impact, and Recommendations

How2Bench, in both its dataset/protocol and benchmarking guideline incarnations, constitutes a comprehensive effort to address persistent deficits in LLM evaluation—chiefly data curation, evaluation rigor, and procedural correctness in real-world settings. As a procedural dataset, How2Bench offers robust signal on both scaling trends and failure modes across model families, supporting closed-loop improvement (e.g., reinforcement learning over How2Score, yielding gains >10 points on held-out tasks without regressions on existing benchmarks) (Chang et al., 9 Feb 2026). As a benchmarking checklist, How2Bench advances community standards for transparency, data hygiene, reproducibility, and comprehensive documentation (Cao et al., 18 Jan 2025).

Adoption of How2Bench methodologies is recommended at the earliest design stages, with automation of data checks and regular community-facing releases of all associated artifacts. These combined practices advance the trustworthiness, utility, and comparability of LLM evaluations by enforcing strong guarantees on both data and metric integrity.
