CombiBench Subset Overview

Updated 6 February 2026
  • CombiBench Subset is a carefully selected sample of combinatorial problems designed to improve computational efficiency and target specific statistical objectives.
  • Various methodologies—including DPP, BISS, Scales++, and COMBSS—offer distinct trade-offs in preserving ranking, predictive fidelity, and reducing evaluation costs.
  • Empirical studies show that targeted subset selection can significantly lower evaluation overhead while maintaining performance accuracy and benchmark reliability.

A CombiBench Subset refers to a selected subset of items from the CombiBench combinatorics benchmark, constructed for the purpose of reducing evaluation cost, improving computational efficiency, or targeting specific statistical objectives while retaining fidelity to the original benchmark's evaluation goals. Multiple advanced methodologies have been proposed for constructing such subsets, including item-centric, model-centric, optimization-based, and determinant-based selection, each offering distinct trade-offs in terms of fidelity, computational cost, and suitability for varying evaluation scenarios.

1. Foundations: CombiBench Structure and Use Cases

CombiBench is a formal mathematics benchmark comprising 100 combinatorial problems, each presented with both an informal statement and a rigorously type-checked Lean 4 formalization. The problem set spans a wide range of difficulty, from middle-school and undergraduate exercises to International Mathematical Olympiad (IMO) and other high-level competition problems. Problems are distributed over at least ten combinatorial topics, such as permutations, the pigeonhole principle, inclusion–exclusion, recurrence relations, combinatorial designs, and graph theory (Liu et al., 6 May 2025).

Each instance pairs the informal English statement with a Lean formalization; problems are structured in either theorem proof-based or fill-in-the-blank formats, with formal objects like Finset α, Equiv.Perm, and SimpleGraph encoding combinatorial concepts. The Fine-Eval framework provides standardized, automation-friendly evaluation of model submissions by checking both solution codes and fully formal proofs, in "with solution" (answer provided) and "without solution" (blind) settings.
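Concretely, a fill-in-the-blank item pairs the two representations roughly as follows (a schematic toy example, not an actual CombiBench problem; the name `answer` and the statement are invented):

```lean
import Mathlib

-- Informal: "How many elements does the set {0, 1, 2, 3, 4} have?"
-- The solver must supply the value of `answer` and close the proof.
def answer : ℕ := 5

theorem toy_item : (Finset.range 5).card = answer := by
  simp [answer]
```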

A CombiBench Subset thus refers to a selection from this pool, constructed typically to enable rapid model evaluation and ranking, or to serve statistical tasks such as experimental design and modelling.

2. Statistical and Algorithmic Principles for Subset Construction

Selection of a CombiBench subset may follow multiple theoretical frameworks, which include:

  • D-Optimal Subset Selection: Maximizing the determinant of a principal submatrix indexed by the chosen subset, a classical design-theoretic criterion (Wang et al., 2017).
  • Ranking Preservation: Selecting a minimal subset such that the ranking of solvers (models) by cumulative performance remains unchanged (measured by Kendall's τ) (Matricon et al., 8 Sep 2025).
  • Predictive Fidelity: Choosing a subset that permits the accurate estimation of a model's full-benchmark performance score using only its scores on the subset, quantified by mean absolute error (MAE) (Bean et al., 30 Oct 2025).
  • Best Subset Selection in Regression: For tasks with response data, finding a feature subset optimizing fit quality under cardinality constraints (ℓ₀-sparsity in linear regression), via continuous optimization (Moka et al., 2022).

The specific selection algorithm and corresponding statistical objective depend on whether the subset is to be used for benchmarking, statistical design, regression modelling, or meta-evaluation.
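As a toy illustration of the ranking-preservation and predictive-fidelity criteria above, the sketch below computes Kendall's τ between full-benchmark and subset rankings, and the MAE of a naive rescaled score prediction. The 4-solver × 6-item score matrix and the candidate subset are invented for illustration:

```python
import numpy as np
from itertools import combinations

def kendall_tau(r1, r2):
    """Kendall's tau between two rankings given as score vectors."""
    n = len(r1)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = np.sign(r1[i] - r1[j]) * np.sign(r2[i] - r2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical 4-solver x 6-item binary score matrix.
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
])
subset = [0, 3, 4]                   # candidate subset of item indices

full = scores.sum(axis=1)            # full-benchmark totals
sub = scores[:, subset].sum(axis=1)  # subset totals

tau = kendall_tau(full, sub)         # ranking preservation
# Naive score prediction: rescale the subset total to the full size.
pred = sub / len(subset) * scores.shape[1]
mae = np.abs(pred - full).mean()     # predictive fidelity
```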

3. Dominant Methodologies for CombiBench Subset Generation

Below is a summary of the principal algorithms and paradigms for generating CombiBench subsets, each with characteristic strengths and operational regimes:

A. DPP-Based Maximum-Determinant Selection

The DPP (Determinantal Point Process) algorithm samples size-k subsets S from the item set [n] with probability proportional to det(A_{S,S}), where A is a symmetric positive definite matrix reflecting item correlations. The two-phase algorithm first samples eigenvectors and then points, yielding subsets that favor statistical diversity (D-optimality). The process, including eigendecomposition and sampling, scales as O(n^3) upfront and O(nk^2 + k^3) per iteration. Iterated sampling, recording the best determinant found so far, converges to the optimal subset as the sample count increases, since every subset S has strictly positive probability (Wang et al., 2017).
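The two-phase eigen-sampler itself is more involved; as a minimal sketch of the underlying D-optimality criterion only (a stand-in, not the paper's sampler), the following greedy routine grows a subset by maximizing det(A_{S,S}) one item at a time, on an invented SPD kernel matrix:

```python
import numpy as np

def greedy_dopt(A, k):
    """Greedily grow a subset S maximizing det(A[S, S]).

    A: symmetric positive definite (n x n) item-correlation matrix.
    Illustrates the D-optimality objective; the DPP sampler in the
    text instead draws S with probability proportional to this
    determinant.
    """
    n = A.shape[0]
    S = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(n):
            if i in S:
                continue
            idx = S + [i]
            d = np.linalg.det(A[np.ix_(idx, idx)])
            if d > best_det:
                best, best_det = i, d
        S.append(best)
    return S, np.linalg.det(A[np.ix_(S, S)])

# Small SPD example: RBF kernel over 1-D item "positions".
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=8)
A = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1) + 1e-6 * np.eye(8)
S, det_S = greedy_dopt(A, k=3)
```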

B. BISS (BISection Sampling) for Ranking Stability

Designed for ranking-based benchmarks, BISS minimizes the number of instances needed to preserve the rank-ordering (by total score) of all solvers. It uses recursive bisection and necessary-test filtering to eliminate tests that do not impact the global ranking. The DC-merge orchestration and anytime improvements provide scalability for large test suites. BISS provably reduces suite size (often by over 90%), maintaining exact or approximate τ depending on the admissible slack parameter δ. Weighted variants can further compress subsets using non-uniform instance weights determined by regression (Matricon et al., 8 Sep 2025).
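BISS proper uses recursive bisection with DC-merge orchestration; a much-simplified stand-in for its necessary-test filtering can be sketched as a single greedy pass that drops any test whose removal leaves the solver ranking intact (the 3-solver × 5-test score matrix is invented):

```python
import numpy as np

def ranking(scores, cols):
    """Solver order induced by total score on the given test columns."""
    totals = scores[:, cols].sum(axis=1)
    return tuple(np.argsort(-totals, kind="stable"))

def greedy_filter(scores):
    """Greedily drop tests whose removal leaves the solver ranking
    unchanged: a simplified stand-in for BISS's recursive bisection
    with necessary-test filtering."""
    n_tests = scores.shape[1]
    keep = list(range(n_tests))
    target = ranking(scores, keep)
    for t in range(n_tests):
        trial = [c for c in keep if c != t]
        if trial and ranking(scores, trial) == target:
            keep = trial
    return keep

# Hypothetical solver x test score matrix.
scores = np.array([
    [3, 0, 2, 5, 2],
    [1, 0, 2, 4, 1],
    [4, 1, 3, 5, 3],
])
keep = greedy_filter(scores)  # tests sufficient to preserve the ranking
```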

C. Item-Centric Selection: Scales++ and Scales++ Lite

Scales++ generates a small, highly predictive subset by leveraging cognitive scale embeddings—a 16-dimensional vector per item encoding required cognitive skills and knowledge domains (e.g., logical reasoning, algebra). These embeddings are derived either via LLM annotation (GPT-4o) or predicted via a GNN from LLM token embeddings (Scales++ Lite, for reduced cost). A diverse k-medoids subset is selected post-UMAP projection and k-means clustering of scale vectors. Predictive aggregation combines cluster-weighted averages and scale-wise logistic regression to estimate total model scores. Scales++ achieves MAE ≈2–3% on real LLM leaderboards at k/N=0.005–0.01, with upfront annotation cost slashed by more than 18x compared to IRT-type (model-centric) approaches and immediate applicability to any new benchmark, including CombiBench (Bean et al., 30 Oct 2025).
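The full Scales++ pipeline (LLM annotation, UMAP projection, logistic aggregation) is beyond a short sketch; a minimal cluster-and-select analogue in the same spirit, with invented embeddings and scores and plain k-means standing in for UMAP plus k-medoids, looks like:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means; returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        z = d.argmin(1)
        C = np.array([X[z == j].mean(0) if (z == j).any() else C[j]
                      for j in range(k)])
    return C, z

def select_and_predict(emb, scores, k):
    """Pick one representative item per embedding cluster; predict each
    model's full-benchmark score as the cluster-size-weighted average
    of its scores on the representatives. A simplified stand-in for
    Scales++ (no UMAP, no logistic head)."""
    C, z = kmeans(emb, k)
    reps, weights = [], []
    for j in range(k):
        members = np.where(z == j)[0]
        if members.size == 0:
            continue
        d = ((emb[members] - C[j]) ** 2).sum(1)
        reps.append(members[d.argmin()])   # medoid-like representative
        weights.append(len(members))
    w = np.array(weights) / np.sum(weights)
    pred = scores[:, reps] @ w             # weighted mean accuracy
    return reps, pred

# Hypothetical data: 30 items with 4-dim "scale" embeddings,
# 5 models with per-item accuracies in [0, 1].
rng = np.random.default_rng(1)
emb = rng.normal(size=(30, 4))
scores = rng.uniform(size=(5, 30))
reps, pred = select_and_predict(emb, scores, k=6)
```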

D. Continuous Best Subset Selection (COMBSS)

For regression tasks on combinatorial data, COMBSS recasts ℓ₀-constrained subset selection as a continuous optimization problem over [0,1]^p ("t-variables"), penalizing non-sparsity via Lagrange multipliers. Closed-form gradients are computed, and gradient descent (or Adam) tracks a solution path. Subsets are extracted by thresholding or by maximizing over candidate supports along the path, integrating directly with benchmarking suites. Empirical studies show COMBSS attains near-exact support recovery in low dimensions and the lowest MSE in high dimensions, outperforming classical stepwise and Lasso alternatives (Moka et al., 2022).
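A minimal sketch of the continuous-relaxation idea, on a synthetic sparse-regression problem: this is not COMBSS's exact closed-form-gradient algorithm, and the penalty weight, step size, and 0.5 thresholding rule are illustrative choices.

```python
import numpy as np

def combss_sketch(X, y, lam=0.05, lr=0.05, steps=500, seed=0):
    """Continuous relaxation of best-subset selection, in the spirit
    of COMBSS: relax the 0/1 inclusion indicators to t in [0,1]^p,
    minimize ||y - X(t * beta)||^2 / n + lam * sum(t) by projected
    gradient descent, then threshold t."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.1, size=p)
    t = np.full(p, 0.5)
    for _ in range(steps):
        r = X @ (t * beta) - y                 # residual
        g_beta = (X * t).T @ r * (2 / n)       # gradient w.r.t. beta
        g_t = (X * beta).T @ r * (2 / n) + lam # gradient w.r.t. t
        beta -= lr * g_beta
        t = np.clip(t - lr * g_t, 0.0, 1.0)    # project onto [0,1]^p
    return np.where(t > 0.5)[0], t

# Synthetic sparse regression: only features 0 and 3 matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
support, t = combss_sketch(X, y)
```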

4. Empirical Results and Comparative Performance

The following table synthesizes summary results for subset construction algorithms applied to various evaluation tasks, focusing on benchmark and solver ranking preservation.

| Method | Objective / Metric | Subset Size (Example) | Fidelity (Ranking/Score) | Size Reduction | Runtime |
|---|---|---|---|---|---|
| DPP (Wang et al., 2017) | Maximize det(A_{S,S}) | k = 10, k = 60 | Matches or exceeds GA and greedy | Up to 90% | ~20 min–1 hr (parallel; GA baseline heavier) |
| BISS (Matricon et al., 8 Sep 2025) | Exact rank preservation (τ = 1) | <44% of original | τ(π_S, π_T) = 1 (or ≥ 0.99 with slack) | Often >90% | 30–60 min (DC merge) |
| Scales++ (Bean et al., 30 Oct 2025) | Minimize MAE of score prediction | 0.5–1% of full set | MAE 2–3 pp on LLM leaderboards | >99% (size) | Sub-hour (Lite) |
| COMBSS (Moka et al., 2022) | Minimize regression loss | Application-specific | Predictive MSE, support recovery | n/a | Seconds–minutes (p ≤ 1000) |

All methods have been successfully applied in domains involving LLM leaderboards, SAT competitions, and performance modeling for combinatorial systems. On benchmark data at much larger scale than CombiBench's 100 problems (N ≈ 20,000 items), Scales++ achieves MAE ≈ 3% with subsets of size k = 100 (0.5%), while BISS can shrink ranking test suites to a few dozen items without τ degradation in most empirical cases.

5. Integration and Practical Recommendations

Integration of subset selection algorithms into CombiBench evaluation or benchmarking workflows requires several key steps:

  1. Data Preparation: Construct the relevant performance matrix (solvers × instances), ensure penalization for timeouts/failures, and encode item metadata if required (Matricon et al., 8 Sep 2025).
  2. Algorithm Selection: Match the subset construction approach to the evaluation objective (ranking preservation, score prediction, D-optimality, regression).
  3. Hyperparameter Tuning: Select subset size, approximation tolerance (τ_target or MAE), and algorithmic specifics (DC fan-in, step-size, λ-grid) appropriate to the computational budget and desired fidelity (Matricon et al., 8 Sep 2025, Moka et al., 2022).
  4. Evaluation and Validation: For ranking-focused subsets, report post-selection τ(π_S,π_T) (on both included and holdout solver sets), and for scoring, track MAE on held-out models. Distributional checks of item features in subset vs. full set are advised to prevent pathological coverage gaps (Bean et al., 30 Oct 2025).
  5. Reproducibility: When using randomized or anytime variants (BISS, Scales++), aggregate over multiple seeds to report stability and standard deviation of the achieved metrics (Matricon et al., 8 Sep 2025, Bean et al., 30 Oct 2025).
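Steps 4 and 5 can be miniaturized as follows; the random subset drawn per seed is a hypothetical stand-in for any randomized selector, and the score matrix is invented:

```python
import numpy as np

def evaluate_subset(scores, subset):
    """MAE between subset-extrapolated and full-benchmark mean scores."""
    full = scores.mean(axis=1)
    pred = scores[:, subset].mean(axis=1)
    return np.abs(pred - full).mean()

def seeded_runs(scores, k, n_seeds=20):
    """Draw a random size-k subset per seed (a stand-in for any
    randomized selector) and report the mean and standard deviation
    of the achieved MAE, as recommended for reproducibility."""
    maes = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        subset = rng.choice(scores.shape[1], size=k, replace=False)
        maes.append(evaluate_subset(scores, subset))
    return float(np.mean(maes)), float(np.std(maes))

# Hypothetical 10-model x 200-item accuracy matrix.
rng = np.random.default_rng(0)
scores = rng.uniform(size=(10, 200))
mean_mae, std_mae = seeded_runs(scores, k=20)
```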

A plausible implication is that, because Scales++ is model-agnostic and operates solely on item "content" embeddings, it is advantageous for cold-start and transfer scenarios, whereas DPP and BISS may excel in settings with extensive solver performance history or statistical design requirements.

6. Outlook and Limitations

CombiBench subset methodologies now span a spectrum from highly interpretable, scalable cognitive embedding–driven selection (Scales++), through combinatorial optimization and kernel methods (DPP), to advanced ranking- and regression-based schemes (BISS, COMBSS). Each approach brings unique strengths—Scales++ for initialization and transfer, DPP for experimental designs, BISS for benchmark minimization, and COMBSS for high-dimensional regression sparsification.

Empirical findings indicate that targeted subset selection can reduce evaluation cost by orders of magnitude with minimal or no compromise in ranking stability or predictive fidelity. However, the hardest formal combinatorics problems—such as those in the IMO subset of CombiBench—remain essentially unsolved by current LLMs and solvers, regardless of subset size or construction technique (Liu et al., 6 May 2025). This suggests that improvements in formal library coverage and LLM combinatorics reasoning remain bottlenecks for further progress.
