Predict-then-Optimize ML Benchmarks
- Predict-then-optimize benchmarks rigorously evaluate how machine learning predictions influence downstream optimization decisions through a two-stage process.
- They encompass classical prediction-centric losses and modern decision-focused, loss-adaptive methods that target regret minimization directly.
- Empirical frameworks span synthetic combinatorial tasks to realistic AI system configurations, highlighting robust metrics and scalable loss designs.
Predict-then-optimize (PtO) machine learning benchmarks are designed to rigorously evaluate the interplay between prediction accuracy and downstream decision quality when machine learning predictions are integrated into optimization-based decision systems. These benchmarks have rapidly evolved from classical two-stage approaches to encompass decision-focused, loss-adaptive, and joint modeling paradigms, capturing a broad spectrum of problem structures, loss functions, and evaluation protocols. This survey consolidates the technical design principles, methodological variants, empirical frameworks, and domain coverage of modern PtO benchmarks, referencing leading developments in online algorithms, combinatorial optimization, resource allocation, and system deployment.
1. Formal Structure of Predict-Then-Optimize Benchmarks
The canonical PtO pipeline is a two-stage process:
- A supervised machine learning model predicts uncertain parameters or objective coefficients of an optimization problem, conditioned on observable features.
- The predicted parameters are used as fixed inputs in a downstream deterministic or stochastic optimization to compute the decision.
Formally, let $x$ denote observable features, $c$ the true, a priori unknown parameters (e.g., costs or returns), and $\hat{c} = f_\theta(x)$ the ML predictor. The induced optimization is then
$$w^*(\hat{c}) \in \arg\min_{w \in S} \hat{c}^\top w,$$
where $S$ is the feasible set (e.g., combinatorial, integer, or convex constraints). Solution quality is measured using the true objective function evaluated at the decision produced by the predicted parameters, $c^\top w^*(\hat{c})$, which motivates regret-based metrics
$$\mathrm{Regret}(\hat{c}, c) = c^\top w^*(\hat{c}) - c^\top w^*(c)$$
as the central criterion for benchmarking (Geng et al., 2023).
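The two-stage pipeline and the resulting regret can be illustrated end-to-end on a toy 0/1 knapsack (a minimal sketch with made-up instance values; real benchmarks replace the brute-force oracle with a MIP or combinatorial solver):

```python
import itertools
import numpy as np

def solve_knapsack(values, weights, capacity):
    """Exact 0/1 knapsack by enumeration (fine for toy instances only)."""
    best_val, best_x = -np.inf, None
    for bits in itertools.product([0, 1], repeat=len(values)):
        x = np.array(bits)
        if weights @ x <= capacity and values @ x > best_val:
            best_val, best_x = values @ x, x
    return best_x

def regret(true_values, pred_values, weights, capacity):
    """Decision regret: true objective of the true-optimal decision minus
    the true objective of the decision induced by the prediction."""
    x_pred = solve_knapsack(pred_values, weights, capacity)
    x_true = solve_knapsack(true_values, weights, capacity)
    return true_values @ x_true - true_values @ x_pred

weights = np.array([2.0, 3.0, 4.0, 5.0])
capacity = 7.0
true_v = np.array([3.0, 4.0, 5.0, 8.0])
pred_v = np.array([3.0, 4.0, 8.0, 5.0])  # misranks the last two items

print(regret(true_v, pred_v, weights, capacity))  # 2.0
```

Note that the prediction error here is small in MSE terms, but the misranking flips the chosen item set and produces nonzero regret, which is exactly the effect PtO benchmarks measure.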
This structure extends to settings with more general objectives (e.g., continuous treatment allocation (Vos et al., 2024), portfolio optimization with autocorrelated uncertainty (Wang et al., 2 Feb 2026)), and supports customizable constraints and decision regimes.
2. Loss Functions and Decision-Focused Training Objectives
Classical PtO relies on prediction-centric losses such as mean squared error (MSE) or mean absolute error (MAE) for $\ell_2$ or $\ell_1$ regression, respectively. However, these are agnostic to how prediction errors propagate through the downstream optimization. To remedy this, decision-focused losses explicitly incorporate the structure of the optimization problem into ML training:
- Benchmark-aware losses: For online algorithms, loss functions directly penalize excess incurred cost over the offline optimum, e.g., $L_{\mathrm{bench}} = \mathrm{ALG}(\hat{c}) - \mathrm{OPT}$, or its normalized variant (Anand et al., 2022).
- SPO loss and surrogates: The "Smart Predict-then-Optimize" (SPO) loss measures decision regret, $\ell_{\mathrm{SPO}}(\hat{c}, c) = c^\top w^*(\hat{c}) - c^\top w^*(c)$, and its convex surrogate (SPO+) allows tractable minimization over arbitrary polyhedral feasible sets (Elmachtoub et al., 2017).
- Task-specific, differentiable surrogates: Approaches such as SPO+, differentiable perturbation (DBB/DPO/PFY), and custom relaxations embed solver behavior inside the learning loop to enable gradient-based training (Tang et al., 2022), and learn-to-optimize surrogates (e.g., Lagrangian dual, primal-dual, and constraint completion/correction) that are tuned for joint feature-to-solution prediction (Kotary et al., 2023, Kotary et al., 2024).
- Empirical Soft Regret (ESR): In bandit or binary-action settings, differentiable "soft" regret surrogates directly target per-instance regret, yielding asymptotically optimal regret in parametric settings (Tan et al., 2024).
- Efficient Global Losses (EGL): Sample-efficient frameworks learn loss functions parameterized by instance features and calibrated against decision quality, dramatically reducing the sample complexity for optimizing loss adaption (Shah et al., 2023).
In all cases, empirical or surrogate minimization aligns model selection more directly with operational end goals, outperforming purely predictive losses as problem complexity increases or as model misspecification becomes pronounced.
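As a concrete illustration of decision-focused training, the SPO+ surrogate and its subgradient can be sketched for a minimization problem whose feasible set is small enough to enumerate (a sketch following Elmachtoub and Grigas's construction; the route-choice instance and step size are illustrative):

```python
import numpy as np

def argmin_over(S, c):
    """Decision oracle: w*(c) = argmin_{w in S} c.w, with S given as rows."""
    return S[np.argmin(S @ c)]

def spo_plus(S, c_hat, c):
    """SPO+ convex surrogate of decision regret for min_{w in S} c.w."""
    w_star = argmin_over(S, c)
    return np.max(S @ (c - 2 * c_hat)) + 2 * c_hat @ w_star - c @ w_star

def spo_plus_subgrad(S, c_hat, c):
    """A subgradient of SPO+ with respect to the prediction c_hat."""
    return 2 * (argmin_over(S, c) - argmin_over(S, 2 * c_hat - c))

S = np.eye(3)                      # pick exactly one of three routes
c = np.array([1.0, 3.0, 2.0])      # true costs: route 0 is optimal
c_hat = np.array([3.0, 1.0, 2.0])  # prediction misranking routes 0 and 1

print(spo_plus(S, c_hat, c))                 # 6.0 before the update
c_hat = c_hat - spo_plus_subgrad(S, c_hat, c)
print(spo_plus(S, c_hat, c))                 # 0.0 after one subgradient step
```

In practice the oracle call is a solver invocation (LP, MIP, or combinatorial), and the subgradient above is backpropagated into the parameters of the predictor producing `c_hat`.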
3. Benchmark Domains, Algorithms, and Realizations
A spectrum of benchmark tasks has been developed to expose both the capabilities and limitations of PtO models under controlled settings:
- Synthetic and combinatorial problems: Shortest path (Elmachtoub et al., 2017, Tang et al., 2022), multi-dimensional knapsack (Geng et al., 2023, Tang et al., 2022), traveling salesperson (Tang et al., 2022), portfolio optimization (Elmachtoub et al., 2017, Kotary et al., 2024, Wang et al., 2 Feb 2026), single-machine scheduling (Smet, 2 Sep 2025), and cubic Top-K resource allocation (Shah et al., 2023). These enable systematic variation of nonlinearity, structure, and noise.
- Online algorithms: The ski rental (rent-or-buy) problem is used to demonstrate end-to-end decision performance under adversarial and typical input regimes, including decision-focused vs. standard ML losses (Anand et al., 2022).
- Prescriptive resource allocation and causal uplift: Continuous-dose treatment assignment with fair allocation constraints (Vos et al., 2024).
- Realistic, high-dimensional domains: Combinatorial advertising with submodular and budgeted constraints (Geng et al., 2023), AC optimal power flow in electricity networks (Kotary et al., 2023), and AI system configuration for MLPerf benchmarking (Fursin et al., 14 Sep 2025).
- Simulated error benchmarking: Synthetic confusion matrices for multiclass classification/scheduling create controlled error surfaces, decoupling ML accuracy from decision regret and enabling full mapping of cost-error landscapes (Smet, 2 Sep 2025).
Benchmarking frameworks (e.g., PyEPO (Tang et al., 2022), PredictiveCO-Benchmark (Geng et al., 2023)) provide modular infrastructure for training, evaluation, and replication across problem classes and loss strategies, supporting quantitative reporting of normalized regret, regret distributions, constraint violations, and inference time.
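The simulated-error idea above can be sketched by sampling "predictions" from a prescribed confusion matrix and measuring the induced decision cost, so the error profile is controlled exactly and independently of any trained model (a hypothetical two-class dispatch cost matrix, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y_true, confusion, rng):
    """Sample predicted labels from a row-stochastic confusion matrix so the
    simulated classifier's error rates are controlled exactly."""
    k = confusion.shape[0]
    return np.array([rng.choice(k, p=confusion[y]) for y in y_true])

# Hypothetical two-class dispatch problem: treating a non-urgent job as
# urgent costs 1; missing an urgent job costs 5 (cost[true, pred]).
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])
y_true = rng.integers(0, 2, size=5000)

def avg_cost(fnr):
    """Average decision cost when the class-1 false-negative rate is fnr."""
    C = np.array([[0.95, 0.05],
                  [fnr, 1.0 - fnr]])
    y_pred = corrupt_labels(y_true, C, rng)
    return cost[y_true, y_pred].mean()

for fnr in (0.0, 0.1, 0.2):
    print(f"FNR={fnr:.1f}  avg decision cost={avg_cost(fnr):.3f}")
```

Sweeping one error rate at a time while holding the others fixed traces out the cost-error landscape discussed above, without retraining any classifier.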
4. Comparison of PtO and PnO Paradigms
Predict-and-Optimize (PnO, sometimes called decision-focused or end-to-end) approaches represent an evolution of the PtO paradigm, training predictors to minimize the ultimate decision loss—often by differentiating through the optimization procedure or by leveraging surrogate or ranking-based losses. Systematic benchmarking demonstrates:
- PnO methods (discrete/continuous/statistical/surrogate): These include Blackbox, Identity, QPTL, CPLayer, SPO-relax, NCE, LTR, and LODL, as well as learned surrogate objectives (e.g., SurCO). PnO consistently outperforms PtO across complex nonlinear or constraint-rich benchmarks: PnO wins in 7 out of 8 synthetic/real-world tasks, with pronounced gains in energy-aware scheduling and budget allocation (Geng et al., 2023).
- When PtO suffices: For simple objective functions or when computational cost and label availability prohibit solving (possibly expensive) surrogate problems, strong PtO predictors can be competitive (e.g., in cubic Top-K or when surrogate gradients are intractable) (Geng et al., 2023, Shah et al., 2023).
- Proxy and joint models: Joint mapping of features directly to solutions (LtOF) via learning-to-optimize losses provides robust regret reductions and order-of-magnitude speedups over two-stage or solver-differentiation pipelines, especially under high feature-complexity, nonconvexity, or lack of differentiability (Kotary et al., 2023, Kotary et al., 2024).
5. Empirical Insights: Metrics, Trade-offs, and Best Practices
Assessment of PtO methods centers on decision-centric metrics:
- Regret (decision error): Average, worst-case, or relative regret, i.e., the gap between predicted and optimal decisions under true costs, is the primary metric (Elmachtoub et al., 2017, Geng et al., 2023).
- Cost ratio: For online or budgeted problems, the ratio of incurred to optimal cost provides an interpretable measure (Anand et al., 2022).
- Normalized regret / relative regret: Used for cross-dataset and cross-domain comparison, facilitating aggregation over instances (Tang et al., 2022, Geng et al., 2023).
- Solution-level statistics: Distributional metrics (e.g., 90th-percentile regret), constraint satisfaction rates, and robustness to misalignment between predictive and operational objectives.
- Sample efficiency: Advanced loss-adaptation frameworks (EGL) achieve higher decision quality with orders-of-magnitude fewer solver calls than per-instance local loss learning (Shah et al., 2023).
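A batch of instances can be scored against these metrics with a few vectorized reductions (an illustrative helper; metric names and exact normalizations vary across frameworks):

```python
import numpy as np

def decision_metrics(true_costs, pred_decisions, opt_decisions):
    """Decision-centric metrics for a batch of minimization instances.
    Rows are instances, columns are decision variables."""
    obj_pred = np.einsum("ij,ij->i", true_costs, pred_decisions)
    obj_opt = np.einsum("ij,ij->i", true_costs, opt_decisions)
    reg = obj_pred - obj_opt
    return {
        "avg_regret": reg.mean(),
        "normalized_regret": reg.sum() / np.abs(obj_opt).sum(),
        "p90_regret": np.percentile(reg, 90),   # tail behaviour
        "cost_ratio": (obj_pred / obj_opt).mean(),
    }

true_costs = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
opt = np.array([[1, 0], [0, 1], [1, 0]])   # true-optimal picks
pred = np.array([[0, 1], [0, 1], [1, 0]])  # first instance decided wrongly

m = decision_metrics(true_costs, pred, opt)
print(m["avg_regret"], m["cost_ratio"])
```

Reporting the tail statistic alongside the mean is what exposes methods that are good on average but occasionally produce very poor decisions.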
Notable empirical findings include:
- Optimizing standard predictive loss (MSE/MAE) is generally suboptimal for decision quality as model misspecification or structural complexity increases.
- Decision-focused surrogate losses (SPO+, PFY, LTR, EGL, ESR) consistently lower regret, particularly in the presence of restrictions on prediction/optimization coupling or high-dimensional mapping complexity (Elmachtoub et al., 2017, Shah et al., 2023, Tan et al., 2024).
- End-to-end, proxy, and statistical approaches—especially those integrating constraint satisfaction and feasibility correction—enable practical and scalable application to both convex and nonconvex, continuous and discrete, and large-scale domains (Kotary et al., 2023, Kotary et al., 2024).
- Empirical cost-error landscapes can be highly nonlinear with respect to classification error, indicating that small improvements in certain error types (e.g., FPR) produce disproportionately large gains in decision quality (Smet, 2 Sep 2025).
6. Extensions, Open Problems, and Domain-Specific Developments
Recent benchmark expansions include:
- Causal and treatment assignment: Estimators such as S-learner, DRNet, and VCNet in uplift modeling with continuous treatments, assessed by area under the uplift curve (AUUC), fair allocation, and resource- or cost-sensitive prescriptions (Vos et al., 2024).
- Autocorrelated estimation: PtO and FPtP models under autocorrelated processes (e.g., VARMA) reveal further separation between predictive and prescriptive performance, with robust regret minimization available only under specialized optimize-via-estimate (A-OVE) models (Wang et al., 2 Feb 2026).
- AI system configuration: Predict-then-optimize deployed in large-scale AI system benchmarking (FlexBench), confronting heterogeneous, high-cardinality categorical features, constraint-based pruning, and cost/throughput/energy trade-offs (Fursin et al., 14 Sep 2025).
- Simulation-driven benchmarking: Simulation frameworks decouple classifier accuracy from solution quality, allowing precise mapping of error-regret landscapes for combinatorial decision settings (Smet, 2 Sep 2025).
Key open challenges include extending learn-to-optimize proxies to structured, mixed-integer, and combinatorial settings, theoretical analysis of generalization/regret under covariate shift, robust optimization under simulated error models, and integration with reinforcement learning and bandit feedback scenarios (Tan et al., 2024).
7. Implementation Recommendations and Reproducibility Tools
Best practice recommendations for constructing and evaluating PtO benchmarks include:
- Choose or design loss functions that reflect true operational regret, such as SPO+, L_{bench}, or soft regret surrogates, instead of (or in addition to) standard predictive error losses (Elmachtoub et al., 2017, Anand et al., 2022, Shah et al., 2023).
- Use modular pipeline frameworks (e.g., PyEPO (Tang et al., 2022), PredictiveCO-Benchmark (Geng et al., 2023)) for rapid prototyping, model comparison, and fair replication across tasks.
- Leverage simulation of error profiles to map decision-quality surfaces and set evidence-based targets for classifier or regressor performance (Smet, 2 Sep 2025).
- For high-complexity, time-constrained, or non-differentiable settings, employ joint or proxy models (LtOF) with constraint-aware surrogates to maximize computational and decision efficiency (Kotary et al., 2023, Kotary et al., 2024).
- Use robust cross-validation, grid search for hyperparameters, and careful train/validation/test splits to ensure model selection is performed with respect to decision-centric metrics.
- Measure both predictive and decision-level metrics, and report both average and tail performance to reveal robustness.
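The case for decision-centric model selection can be made concrete with a toy validation comparison in which MSE-based and regret-based rankings of two candidate predictors disagree (synthetic numbers, assumed purely for illustration):

```python
import numpy as np

# Pick-the-cheaper-of-two decision on a validation set:
# model A has lower MSE but misranks the options; model B the reverse.
true_c = np.tile([1.0, 2.0], (100, 1))   # option 0 is always cheaper
pred_a = np.tile([1.6, 1.5], (100, 1))   # small errors, wrong ranking
pred_b = np.tile([2.0, 4.0], (100, 1))   # large errors, right ranking

def mse(pred):
    return ((pred - true_c) ** 2).mean()

def avg_regret(pred):
    chosen = pred.argmin(axis=1)                      # decision from prediction
    realized = true_c[np.arange(len(true_c)), chosen]
    return (realized - true_c.min(axis=1)).mean()

print(mse(pred_a), avg_regret(pred_a))   # best by MSE, worst by regret
print(mse(pred_b), avg_regret(pred_b))   # worst by MSE, zero regret
```

Selecting by validation MSE would deploy model A and incur maximal regret on every instance; selecting by validation regret picks model B, which is what decision-centric model selection is meant to catch.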
A systematic PtO benchmarking strategy thus incorporates decision-focused loss design, domain-appropriate simulation or real-world data, modular training and evaluation infrastructure, and interpretable performance metrics tightly aligned with the operational use case. These benchmarks now underpin both empirical and theoretical advances in ML-augmented decision-making (Geng et al., 2023, Elmachtoub et al., 2017, Tang et al., 2022, Kotary et al., 2023, Smet, 2 Sep 2025, Anand et al., 2022, Shah et al., 2023, Vos et al., 2024, Kotary et al., 2024, Tan et al., 2024, Wang et al., 2 Feb 2026, Fursin et al., 14 Sep 2025).