How2Bench: LLM Procedure & Benchmark
- How2Bench is a benchmark that evaluates procedural validity in LLM-generated instructions using a dataset of 7,000 goal-conditioned examples and a multi-stage web-mining pipeline.
- It introduces How2Score, a binary evaluation protocol backed by an automated judge model (How2Judge) that identifies critical failures with 80.5% human agreement, and reports scaling trends across model families.
- The 55-criterion checklist provides a systematic approach to constructing reproducible, high-quality code benchmarks across design, construction, evaluation, analysis, and release phases.
How2Bench refers to two distinct contributions to LLM benchmarking: (1) a dataset and protocol for evaluating goal-conditioned procedure generation in LLMs (Chang et al., 9 Feb 2026), and (2) a 55-item methodological checklist for constructing reliable, reproducible code-related benchmarks (Cao et al., 18 Jan 2025). Both frameworks address persistent challenges in dataset quality, procedural validity, evaluation reliability, and transparency in LLM assessment.
1. How2Bench for Procedure Generation: Dataset and Composition
How2Bench is a benchmark designed to systematically evaluate the procedural validity of LLMs in generating stepwise, goal-conditioned instructions across real-world tasks (Chang et al., 9 Feb 2026). It consists of exactly 7,000 examples, evenly distributed across 14 diverse topics—Art & Design, Crime & Law, Education & Jobs, Electronics & Hardware, Fashion & Beauty, Food & Dining, Health, Home & Hobbies, Industrial, Religion, Science/Math/Technology, Sports & Fitness, Transportation, and Travel & Tourism. Each example includes a clear goal, an explicit list of required resources, and a reference procedure containing 5–15 ordered steps.
Dataset construction is grounded in a multi-stage web-mining pipeline (How2Mine) involving:
- Sampling & Topic Stratification: Extraction from the DCLM web corpus filtered for tutorial content, balanced by topic.
- Procedure Extraction: Automated identification and parsing of procedural tasks and outcome-focused goal statements by GPT-4.1.
- Heuristics Filtering: Exclusion based on step count and n-gram repetition thresholds (e.g., bigram ≥ 40%, trigram ≥ 35%, four-gram ≥ 30%).
- LLM-Based Filtering: Removal of procedures unsuitable for procedural evaluation (e.g., those dependent on named entities, UI-driven, pure calculations, creative or logically incoherent tasks).
- Post-processing & Validation: Deterministic rewriting of goals, explicit resource extraction, and GPT-4.1-based final checks.
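The heuristic filtering stage above can be sketched in a few lines. This is a minimal illustration, assuming the n-gram repetition ratio is defined as the fraction of non-unique n-grams over whitespace tokens; the function names are hypothetical, and the thresholds are the bigram/trigram/four-gram limits stated above.

```python
def ngram_repetition(tokens, n):
    """Fraction of n-gram occurrences that are repeats (1 - unique/total)."""
    grams = [tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def passes_repetition_filter(text):
    """Exclude texts whose bigram (>=40%), trigram (>=35%), or
    four-gram (>=30%) repetition ratio exceeds its threshold."""
    tokens = text.lower().split()
    thresholds = {2: 0.40, 3: 0.35, 4: 0.30}
    return all(ngram_repetition(tokens, n) < t for n, t in thresholds.items())
```

In practice such a filter runs cheaply over the whole mined corpus before the more expensive LLM-based filtering stage.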
The final evaluation set is sampled from 351,162 high-quality procedures; the remainder (≈344,162) forms a training pool, How2Train. GPT-4.1 spot checks confirm 96.6% reference reasonableness.
2. Critical-Failure Evaluation Protocol (How2Score)
How2Bench employs How2Score, a stringent binary evaluation protocol for model-generated procedures. For each sample, the LLM-generated procedure is tested for the presence of “critical failures”—errors sufficient to prevent achieving the stated goal under supplied resource constraints. Failure modes include essential step omissions, harmful or extraneous steps, logical contradictions, and severe vagueness.
Formally, for an input with goal $g$, resources $R$, reference procedure $y^{*}$, and model output $\hat{y}$, the judge function

$$J(g, R, y^{*}, \hat{y}) \in \{0, 1\}$$

determines validity, and the aggregate Success Rate over $N$ samples is

$$\text{How2Score} = \frac{100}{N} \sum_{i=1}^{N} J\left(g_i, R_i, y_i^{*}, \hat{y}_i\right).$$
Consensus labeling by annotators is assessed with Krippendorff's $\alpha$ after binary aggregation. Automated judging is executed by How2Judge, an 8B open model distilled from GPT-5. This judge attains 80.5% agreement with the human majority, making low-cost, large-scale evaluation tractable.
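The aggregation step is simple once per-sample verdicts exist. The sketch below assumes a `judge` callable standing in for How2Judge that maps a sample's goal, resources, reference, and model output to a binary verdict; the field names are hypothetical.

```python
def how2score(judge, samples):
    """Aggregate binary critical-failure verdicts into a Success Rate (%).

    `judge(goal, resources, reference, output)` returns 1 if the generated
    procedure has no critical failure, else 0 (a stand-in for How2Judge).
    """
    verdicts = [judge(s["goal"], s["resources"], s["reference"], s["output"])
                for s in samples]
    return 100.0 * sum(verdicts) / len(verdicts)
```

A toy judge that only checks step-count agreement already exercises the interface, though the real judge evaluates semantic failure modes.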
3. Empirical Benchmarks and Scaling Analysis
How2Bench reveals clear performance stratification and scaling effects across major model families. Selected How2Score results (closed/open) are summarized below:
| Model | How2Score (%) | Avg. Tokens |
|---|---|---|
| GPT-5 | 67.99 | 103 |
| Claude Opus 4.5 | 64.26 | 110 |
| Gemini 2.5 Pro | 56.11 | 101 |
| Qwen 3 32B | 46.04 | 106 |
| OLMo 3 32B | 43.16 | 101 |
| Llama 3.1 70B | 42.13 | 114 |
| Qwen 3 8B | 35.34 | 99 |
| OLMo 3 7B | 30.23 | 102 |
| Llama 3.1 8B | 26.86 | 99 |
Comprehensive scaling experiments using OLMo checkpoints demonstrate monotonic gains in How2Score with increased pretraining progress and model size. Notably, formatting proxy metrics stabilize within the first 10% of pretraining, while How2Score continues to improve, reflecting gains in deeper procedural validity. Correlation between conditional perplexity and How2Score is only partial, indicating that likelihood-based metrics do not reliably capture procedural correctness.
At the instance level, reference step count is the principal determinant of difficulty (odds ratio ≈ 0.75 per additional step), while resource count exerts minimal influence. Residual verbosity introduces mild bias (odds ratio ≈ 1.015 per percentage-point increase in the generation-to-reference length ratio).
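The reported odds ratios can be turned into a toy logistic difficulty model. Only the two odds ratios (0.75 per extra step, 1.015 per verbosity percentage point) come from the text; the baseline log-odds and the 5-step reference point are illustrative assumptions, not fitted values.

```python
import math

def success_probability(n_steps, verbosity_pp=0.0,
                        base_logodds=0.8, or_step=0.75, or_verbosity=1.015,
                        ref_steps=5):
    """Toy logistic model of per-instance success probability.

    base_logodds (assumed) anchors a 5-step, zero-verbosity-residual
    instance; each extra reference step multiplies the odds by 0.75,
    each percentage point of length-ratio residual by 1.015.
    """
    logodds = (base_logodds
               + math.log(or_step) * (n_steps - ref_steps)
               + math.log(or_verbosity) * verbosity_pp)
    return 1.0 / (1.0 + math.exp(-logodds))
```

The model makes the qualitative claim concrete: success probability decays steadily with procedure length, while verbosity shifts it only slightly.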
4. Usage Policies and Procedural Constraints
How2Bench requires strict adherence to evaluation constraints: models must be supplied the goal, the resource list, and the required number of steps; outputs must match the reference step count, with each step corresponding to a discrete action. Released artifacts include the How2Judge model weights, prompts, and data splits, facilitating reproducible, scalable deployment. Key caveats are that all procedures are descriptive (not executable), critical-failure detection is only a proxy for true execution suitability, and judge disagreement with human annotation remains at approximately 19.5%. Memorization risks are empirically limited: mid-training exposure yields only modest (+3 pp) How2Score gains.
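The format constraints above lend themselves to a cheap pre-check before a generation is passed to the judge. This is a sketch under the stated constraints; the helper name and the exact notion of "discrete action" (non-empty string per step) are assumptions.

```python
def validate_output(steps, required_steps):
    """Pre-judge format check: the generated procedure must have exactly
    the required (reference) number of steps, each a non-empty string
    corresponding to one discrete action."""
    if len(steps) != required_steps:
        return False
    return all(isinstance(s, str) and s.strip() for s in steps)
```

Generations that fail this check can be scored 0 without invoking the judge model at all.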
5. Representative Data Format and Task Instances
Each How2Bench entry contains:
- Goal: Textual description of the task’s intended outcome.
- Resources: Explicitly enumerated objects, tools, or ingredients.
- Reference Steps: Ordered sequence, length 5–15.
Examples include legal property transfer (Crime & Law) and culinary procedures (Food & Dining), indicating coverage of both procedural formality and practical, multi-step tasks. This structure enforces specificity and determinism in model outputs, a critical property for robust evaluation.
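The entry structure above can be mirrored as a small schema. This is a hypothetical representation (field names and the `topic` attribute are assumptions), with the 5–15 step bound from the dataset description enforced at construction.

```python
from dataclasses import dataclass

@dataclass
class How2BenchEntry:
    """Hypothetical schema mirroring the fields described above."""
    goal: str                 # intended outcome of the task
    resources: list           # explicitly enumerated objects/tools/ingredients
    reference_steps: list     # ordered reference procedure
    topic: str = ""           # one of the 14 topic categories

    def __post_init__(self):
        # How2Bench reference procedures contain 5-15 ordered steps.
        if not 5 <= len(self.reference_steps) <= 15:
            raise ValueError("reference procedure must have 5-15 steps")
```

Encoding the bound in the schema makes malformed entries fail loudly during dataset loading rather than silently skewing evaluation.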
6. How2Bench as a Benchmark Construction Checklist
An alternative and independent use of the term How2Bench denotes a 55-criterion checklist for the creation of high-quality, reproducible code-related benchmarks (Cao et al., 18 Jan 2025). This framework systematically organizes benchmark design into five phases:
- Design (gap analysis, scope definition, capability specification, application context)
- Construction (data traceability, contamination checks, deduplication, coverage guarantees, validation)
- Evaluation (model/prompting selection, environment reproducibility, trial repetition, logging)
- Analysis (difficulty calibration, variance measurement, differentiability, visualization, qualitative review)
- Release (open sourcing, licensing, documentation, results/log release, removal of sensitive data, community support)
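The five phases above can be encoded as a simple coverage tracker. This sketch uses only the illustrative sub-criteria listed in the bullets, not the full 55-item checklist, and the dictionary layout is an assumption for demonstration.

```python
# Phase -> illustrative sub-criteria from the bullets above (not all 55 items).
PHASES = {
    "Design": ["gap analysis", "scope definition",
               "capability specification", "application context"],
    "Construction": ["data traceability", "contamination checks",
                     "deduplication", "coverage guarantees", "validation"],
    "Evaluation": ["model/prompting selection", "environment reproducibility",
                   "trial repetition", "logging"],
    "Analysis": ["difficulty calibration", "variance measurement",
                 "differentiability", "visualization", "qualitative review"],
    "Release": ["open sourcing", "licensing", "documentation",
                "results/log release", "sensitive-data removal",
                "community support"],
}

def checklist_coverage(satisfied):
    """Per-phase fraction of listed criteria a benchmark satisfies."""
    return {phase: sum(c in satisfied for c in items) / len(items)
            for phase, items in PHASES.items()}
```

Such a tracker makes gap reports like those below mechanical: a benchmark's satisfied-criteria set maps directly to per-phase coverage numbers.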
Empirical analysis using the checklist reveals extensive gaps in current practice: nearly 70% of surveyed code-related benchmarks lack any quality assurance, and only 8.7% report code coverage as part of test oracle design. Common failures include failing to remove secrets, missing prompt disclosures, and deficient reproducibility information (e.g., hardware/OS configurations). Human studies confirm strong consensus (>80% agreement) on the importance of the checklist criteria, but also identify persistent awareness gaps, especially in denoising, environment logging, and experiment replication.
7. Significance, Impact, and Recommendations
How2Bench, in both its dataset/protocol and benchmarking guideline incarnations, constitutes a comprehensive effort to address persistent deficits in LLM evaluation—chiefly data curation, evaluation rigor, and procedural correctness in real-world settings. As a procedural dataset, How2Bench offers robust signal on both scaling trends and failure modes across model families, supporting closed-loop improvement (e.g., reinforcement learning over How2Score, yielding gains >10 points on held-out tasks without regressions on existing benchmarks) (Chang et al., 9 Feb 2026). As a benchmarking checklist, How2Bench advances community standards for transparency, data hygiene, reproducibility, and comprehensive documentation (Cao et al., 18 Jan 2025).
Adoption of How2Bench methodologies is recommended at the earliest design stages, with automation of data checks and regular community-facing releases of all associated artifacts. These combined practices advance the trustworthiness, utility, and comparability of LLM evaluations by enforcing strong guarantees on both data and metric integrity.