
Predict-then-Optimize Approach

Updated 26 January 2026
  • Predict-then-Optimize (PO) is a decision-making framework that first predicts uncertain model parameters using machine learning before solving an optimization problem.
  • It is widely applied in areas like supply chain, scheduling, and routing where key model inputs such as costs and demands are inferred from data.
  • Advanced PO methods leverage SPO loss and convex surrogates to directly connect predictive errors with decision quality improvements in optimization tasks.

A Predict-then-Optimize (PO) Approach is a methodology for decision-making under parameter uncertainty that decomposes a data-driven decision problem into two stages: first, a machine learning model predicts the unknown input parameters of an optimization model from observed features; then, those predictions are plugged into the model and a downstream decision is made via mathematical optimization. The classical PO paradigm is ubiquitously applied in domains such as supply chain management, scheduling, routing, recommendation, resource allocation, and more, wherever essential model parameters (costs, demands, availabilities) are not known at decision time but can be inferred from data (Elmachtoub et al., 2020).

1. Formal Definition and Key Principles

The canonical PO pipeline consists of:

  1. Prediction Stage: Train a function $f \colon x \mapsto \hat{c}$ mapping contextual features $x$ to predictions $\hat{c}$ of the unknown model parameter $c$. In the context of, e.g., shortest-path problems, $c$ encodes edge travel times or costs. The prediction model may take any form, including regression, decision trees, or modern deep neural architectures.
  2. Optimization Stage: Substitute the prediction $\hat{c}$ into the mathematical optimization problem to obtain a prescriptive decision:

$$w^*(\hat{c}) \in \arg\min_{w \in S} \hat{c}^T w,$$

where $S \subset \mathbb{R}^d$ is the feasible set (possibly with convex, integer, or combinatorial structure) (Elmachtoub et al., 2020).

Classically, predictors are trained to minimize input parameter error, e.g., mean-squared error (MSE), without regard for how this error propagates into the optimization and impacts decision quality in the true problem.
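The classical two-stage pipeline can be sketched on a toy item-selection problem (all data, names, and the least-squares predictor here are hypothetical illustrations, not from the cited work): the feasible set is the set of one-hot vectors, so the argmin reduces to picking the item with the smallest predicted cost.

```python
import numpy as np

def fit_linear_predictor(X, C):
    """Stage 1 (prediction): least-squares fit of a map x -> c_hat."""
    B, *_ = np.linalg.lstsq(X, C, rcond=None)  # each column predicts one cost entry
    return B

def optimize(c_hat, feasible):
    """Stage 2 (optimization): w*(c_hat) = argmin over feasible w of c_hat^T w."""
    return feasible[int(np.argmin(feasible @ c_hat))]

# Hypothetical toy instance: choose one of 3 items, so S = {e_1, e_2, e_3}.
rng = np.random.default_rng(0)
feasible = np.eye(3)
X = rng.normal(size=(100, 2))                     # contextual features
W_true = np.array([[1.0, 0.0, -1.0],
                   [0.0, 1.0, 0.5]])
C = X @ W_true + 0.1 * rng.normal(size=(100, 3))  # noisy true costs

B = fit_linear_predictor(X, C)
c_hat = np.array([1.0, 0.0]) @ B                  # predict costs for a new context
w_star = optimize(c_hat, feasible)                # prescriptive decision
```

Note that the predictor here is trained on squared error alone; the decision-focused methods below instead train against the decision loss itself.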

2. Decision-Quality Loss: SPO and Generalizations

The critical insight motivating advanced PO approaches is that high predictive accuracy in parameters does not necessarily guarantee high-quality decisions, as downstream optimization can amplify small prediction errors, especially near boundaries where decisions are sensitive to parameter changes (Elmachtoub et al., 2020). This motivates performance metrics and training paradigms that directly capture operational loss.

  • SPO Loss: The Smart Predict-then-Optimize (SPO) loss quantifies the realized suboptimality of a decision induced by predicted parameters $\hat{c}$ when evaluated under the true $c$:

$$\ell_{\mathrm{SPO}}(\hat{c}, c) = \max_{w \in W^*(\hat{c})} c^T w - z^*(c),$$

where $z^*(c)$ is the optimal (minimum) value under $c$, and $W^*(\hat{c})$ is the set of optimal solutions under $\hat{c}$. The SPO loss thus measures the excess objective incurred by using the parameter estimate $\hat{c}$ instead of $c$ (Elmachtoub et al., 2020).

  • Empirical Risk Minimization under SPO Loss: Given data $\{(x_i, c_i)\}_{i=1}^n$, the learning objective is

$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \ell_{\mathrm{SPO}}(f(x_i), c_i).$$

For general function classes, optimizing the SPO loss is computationally intractable due to nonconvexity and discontinuities.

  • Convex Surrogates: Convex and differentiable relaxations, e.g., the SPO+ loss, have been developed to enable tractable training and gradient-based optimization (Elmachtoub et al., 2020, Tang et al., 2022).
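Assuming an enumerable feasible set with a unique optimum, the SPO loss and the SPO+ surrogate can be sketched as follows (the toy cost vectors are hypothetical; the subgradient formula $2(w^*(c) - w^*(2\hat{c} - c))$ is the known SPO+ subgradient from Elmachtoub and Grigas):

```python
import numpy as np

def w_star(c, feasible):
    """An optimal solution under cost c (ties broken by first index)."""
    return feasible[int(np.argmin(feasible @ c))]

def z_star(c, feasible):
    """Optimal value z*(c) = min over feasible w of c^T w."""
    return float(np.min(feasible @ c))

def spo_loss(c_hat, c, feasible):
    """SPO loss under a unique optimum: c^T w*(c_hat) - z*(c)."""
    return float(c @ w_star(c_hat, feasible) - z_star(c, feasible))

def spo_plus_loss(c_hat, c, feasible):
    """SPO+ surrogate: max_w (c - 2 c_hat)^T w + 2 c_hat^T w*(c) - z*(c)."""
    return float(np.max(feasible @ (c - 2 * c_hat))
                 + 2 * c_hat @ w_star(c, feasible) - z_star(c, feasible))

def spo_plus_subgrad(c_hat, c, feasible):
    """A subgradient of SPO+ in c_hat: 2 (w*(c) - w*(2 c_hat - c))."""
    return 2 * (w_star(c, feasible) - w_star(2 * c_hat - c, feasible))

feasible = np.eye(3)                 # toy set: choose one of three routes
c_true = np.array([2.0, 1.0, 3.0])   # hypothetical true costs
c_good = np.array([1.9, 0.8, 3.5])   # correct ranking -> zero decision loss
c_bad  = np.array([0.5, 1.2, 3.0])   # large error on item 0 flips the argmin
```

The example illustrates the decoupling of parameter error from decision quality: `c_good` is imperfect yet incurs zero SPO loss, while `c_bad` flips the argmin and pays positive decision loss; SPO+ upper-bounds the SPO loss and admits subgradients, enabling gradient-based training.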

3. Model Classes and Algorithmic Techniques

Decision Trees under PO

In the context of interpretability and efficient substructure exploitation, (Elmachtoub et al., 2020) demonstrates that decision trees can be directly trained to minimize SPO loss:

  • Leaf-wise Closed-form: For any partition of the data into leaves, the leaf-wise SPO loss minimizer is the average of true costs in each leaf (assuming a mild uniqueness condition on the optimization solution).
  • Greedy and MILP Algorithms: Split selection is performed by evaluating SPO-based decision loss for every possible split, recursively partitioning until stopping criteria are met. Entire decision trees can also be learned by mixed-integer linear programming (MILP), yielding globally optimal (on training data) trees under SPO loss.
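The leaf-wise closed form and a single greedy split step can be sketched as follows (toy data; an enumerable feasible set and tie-breaking by first index stand in for the mild uniqueness condition):

```python
import numpy as np

def leaf_spo_loss(C_leaf, feasible):
    """SPO loss of a leaf: decide once using the mean of the true costs in
    the leaf (the leaf-wise minimizer), then sum excess true cost over the
    leaf's samples."""
    c_bar = C_leaf.mean(axis=0)
    w = feasible[int(np.argmin(feasible @ c_bar))]
    z = np.min(feasible @ C_leaf.T, axis=0)   # per-sample optimal values z*(c_i)
    return float(np.sum(C_leaf @ w - z))

def best_split(x, C, feasible):
    """One greedy step: scan thresholds on a single feature and return the
    split whose two leaves have the smallest total SPO loss."""
    best = (None, leaf_spo_loss(C, feasible))  # baseline: no split
    for t in np.unique(x)[:-1]:                # exclude max so both leaves are nonempty
        left, right = x <= t, x > t
        loss = leaf_spo_loss(C[left], feasible) + leaf_spo_loss(C[right], feasible)
        if loss < best[1]:
            best = (float(t), loss)
    return best

# Hypothetical two-regime data: item 0 is cheap for x <= 1, item 1 for x > 1.
feasible = np.eye(2)
x = np.array([0.0, 1.0, 2.0, 3.0])
C = np.array([[1.0, 2.0], [1.0, 2.0], [2.0, 1.0], [2.0, 1.0]])
threshold, loss = best_split(x, C, feasible)
```

On this toy data the unsplit leaf averages the two regimes and pays positive SPO loss, while the search recovers the threshold between them and drives the loss to zero; recursing on each leaf yields the greedy SPOT procedure.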

General Learning Architectures

Beyond trees, recent developments generalize the PO approach to neural predictors, decision-focused fine-tuning, autoregressive models for dependent data, and settings involving complex downstream optimization (integer, QP, or MIP) (Tang et al., 2022, Liu et al., 2024, Yang et al., 3 Jan 2025).

Table: Representative PO Approaches

| Approach | Predictor class $\mathcal{H}$ | Loss Function | Optimization Layer |
| --- | --- | --- | --- |
| Classical two-stage | Regression (any class) | MSE, MAE | Argmin under $\hat{c}$ |
| Decision-focused | Regression, tree, NN | Decision loss (SPO, regret) | Argmin, KKT, surrogates |
| SPO Tree | Decision tree | SPO | MILP / recursive splitting |
| End-to-end neural | NNs, ensembles | Surrogate (SPO+, PFY, etc.) | Differentiable LP/MIP |

(Elmachtoub et al., 2020, Tang et al., 2022, Yang et al., 3 Jan 2025)

4. Interpretability and Structural Properties

A central consequence of decision-focused PO learning is that the learned predictor aligns partitions of the feature space with boundaries corresponding to distinct optimal solutions, not just regions of similar cost predictions. In the context of trees, SPO Trees (SPOTs) are observed to produce shallow, easily interpretable models whose leaves correspond to regimes with qualitatively distinct optimal actions (Elmachtoub et al., 2020). This contrasts sharply with standard CART (regression tree) models, which may require deep, convoluted splits to optimize parameter fit rather than operational decision quality.

5. Empirical Performance and Practical Considerations

Extensive empirical studies on both synthetic and real-world benchmarks consistently demonstrate that PO-motivated models (e.g., SPOT, SPO+ NNs) outperform conventional MSE-minimization-based predictors in terms of realized decision quality, often with simpler or more interpretable models. Key findings include:

  • In shortest-path tasks, SPO-trained trees recover the correct decision boundary with far fewer leaves and less depth than CART, incurring zero excess travel cost at shallow depths where CART incurs over 20% overhead.
  • In grid-based synthetic routing problems, SPOTs yield approximately 25% reduction in extra travel time relative to depth-matched CART trees in data-limited regimes, and maintain performance advantages even as sample sizes increase.
  • In large-scale real-world news recommendation (Yahoo! click-prediction), depth-2 SPOTs improved click-through rates by 4.3% over depth-2 CART and remained superior even against unconstrained-depth CART (Elmachtoub et al., 2020).

6. Limitations, Assumptions, and Open Research Directions

Key limitations and domain assumptions of the PO approach include:

  • Linear Objective Structure: The SPO formalism fundamentally assumes that the downstream optimization is a linear function of the parameter vector. Generalization to nonlinear or dynamic optimization is open.
  • Uniqueness Condition: Leaf-wise SPO minimizers rely on the feasible set yielding a unique optimal solution under the leaf-average costs; this is typically satisfied for continuous cost distributions or can be enforced by tie-breaking.
  • Scalability: MILP-based global tree search and end-to-end optimization may be limited by problem and model size; practical implementations use heuristics, warm starts, or relaxed formulations.
  • Generalization Guarantees: Statistical rates and robust surrogates that balance tractability, calibration, and generalization (especially under dependent data or model misspecification) are ongoing research topics (Elmachtoub et al., 2020, Yang et al., 3 Jan 2025, Liu et al., 2024).

Open research avenues include the development of surrogate losses with strong theoretical guarantees; PO frameworks for nonlinear, nonconvex, or stochastic optimization; pipelines robust to model misspecification and constraint uncertainty; and scalable implementations for large combinatorial problems.

7. Broader Context and Extensions

The PO approach is a foundation for a range of advanced prescriptive analytics methodologies:

  • Multi-task PO: Simultaneously predicting parameters for, and optimizing, across multiple related decision problems (Tang et al., 2022).
  • End-to-End PO Learning: Fully differentiable optimization pipelines, leveraging differentiable solvers, KKT-based layers, or surrogate losses for gradient-based parameter updates (Tang et al., 2022).
  • Black-box and Bandit PO: Learning policies or predictors directly from observed actions and rewards under partial feedback with regret minimization as an objective (Tan et al., 2024).
  • Data Acquisition and Robustness: Extending PO to settings where observation of features or input blocks is itself a constrained decision, or the problem requires robust optimization under uncertainty of predictions (Peršak et al., 21 Apr 2025, Pan et al., 2024).
  • Time Series and Dependent Data: Handling temporally dependent parameters with autoregressive models, mixing conditions, and corresponding calibration guarantees under non-i.i.d. sampling (Liu et al., 2024).

This body of work demonstrates that the PO approach is an essential and generalizable paradigm for integrating predictive machine learning and operational optimization, with a rapidly expanding theoretical and practical toolkit for real-world prescriptive analytics.
