
Learning-to-Optimize from Features (LtOF)

Updated 26 January 2026
  • LtOF is a machine learning framework that maps observed problem features directly to high-quality optimization decisions, eliminating the predict-then-optimize step.
  • It leverages surrogate losses and methods like primal-dual learning to incorporate task structure, reducing sample complexity and enhancing computational efficiency.
  • Empirical results show LtOF outperforms traditional two-stage approaches across diverse domains such as convex, nonconvex, sequential, and decentralized optimization tasks.

Learning-to-Optimize from Features (LtOF) refers to a class of machine learning frameworks in which one learns a mapping from observed features of a problem instance directly to high-quality solutions for an underlying optimization task. Unlike classical predict-then-optimize (PtO) pipelines, which first use features to predict missing optimization parameters and then solve the induced problem, LtOF eliminates the intermediate parameter estimation and instead targets direct decision or policy output. This approach is applicable to a wide range of optimization domains—including constrained, convex, nonconvex, combinatorial, sequential (MDPs), and decentralized problems—and offers advantages in computational efficiency, sample complexity, and generalization to unseen problems and features. LtOF incorporates task structure (objective and constraints) into end-to-end training, often through surrogate losses, primal-dual or Lagrangian methods, or interior-point–style barriers, and has been empirically shown to outperform classical PtO and two-stage learning approaches across canonical benchmarks and real-world applications (Shah et al., 2023, Kotary et al., 2023, Kotary et al., 2024, Babier et al., 2018, Wang et al., 2021, He et al., 2024, Kotary et al., 2024).

1. Foundational Problem Setting and Motivation

In the canonical PtO setting, the learner observes input features $x \in \mathbb{R}^n$ or $z \in \mathcal{Z}$ and faces an optimization problem parameterized by unknown or uncertain variables—most commonly a cost vector $c$ or parameters $\zeta$. Traditionally, one first trains a predictive model $C_\theta(x)$ to estimate these parameters from $x$, and then solves $y^*(\hat c)$ (or $x^*(\hat\zeta)$) using an optimization routine. Performance is assessed via regret, i.e., the gap $c^\top y^*(C_\theta(x)) - c^\top y^*(c)$ averaged over the data.
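As a concrete illustration, regret can be computed whenever the oracle solver $y^*$ is available. The toy top-$k$ selection instance below is an assumed setup, not taken from the cited papers; since it maximizes $c^\top y$, the sign of the gap is flipped relative to the minimization convention above.

```python
import numpy as np

def topk_solution(c, k):
    """Oracle solver y*(c): select the k items with the largest coefficients."""
    y = np.zeros_like(c)
    y[np.argsort(c)[-k:]] = 1.0
    return y

def regret(c_true, c_pred, k):
    """Value lost by optimizing predicted costs instead of the true ones.
    The task maximizes c'y, so the gap is c'y*(c) - c'y*(c_pred) >= 0."""
    y_opt = topk_solution(c_true, k)
    y_hat = topk_solution(c_pred, k)
    return c_true @ y_opt - c_true @ y_hat

c_true = np.array([3.0, 1.0, 2.0, 0.5])
c_pred = np.array([0.9, 2.5, 2.4, 0.1])   # a poor parameter estimate
print(regret(c_true, c_pred, k=2))        # 2.0: true optimum picks {3, 2}
```

A perfect prediction (or any prediction inducing the same decision) yields zero regret, which is why PtO pipelines are judged on decision quality rather than parameter accuracy.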

LtOF reframes this two-stage process by introducing a direct map $f_\phi(x) \approx y^*(c)$ or $J_\phi(z) \approx x^*(\zeta)$, trained end-to-end to produce solutions from features without explicit parameter recovery. This formulation supports integration of task-aware loss functions and backpropagation through the full “feature-to-decision” pipeline, maintaining a focus on decision quality under the true (unobserved) problem parameters (Kotary et al., 2023, Kotary et al., 2024).
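A minimal sketch of such a direct map, under two simplifying assumptions not drawn from the cited papers: the decision task is an unconstrained QP whose optimum $y^*(c) = Q^{-1}c$ is linear in the cost, and the unobserved cost depends linearly on features, $c(x) = W x$. Least squares stands in for the neural networks used in the literature; the learner never sees the costs themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_c = 3, 4

# Hypothetical ground truth linking features to the unobserved cost
# vector, c(x) = W_true x. The decision task is an unconstrained QP,
#   y*(c) = argmin_y 0.5 y'Qy - c'y = Q^{-1} c,
# so the optimal feature-to-decision map is itself linear in x.
W_true = rng.normal(size=(d_c, d_x))
A = rng.normal(size=(d_c, d_c))
Q = A @ A.T + d_c * np.eye(d_c)          # symmetric positive definite

def oracle_solution(c):
    return np.linalg.solve(Q, c)

# Training set: features paired with oracle decisions; the costs c(x)
# are never exposed to the learner (no parameter-recovery step).
X = rng.normal(size=(200, d_x))
Y = np.stack([oracle_solution(W_true @ x) for x in X])

# Direct map f_phi fit by least squares, standing in for a network.
Phi, *_ = np.linalg.lstsq(X, Y, rcond=None)

x_test = rng.normal(size=d_x)
y_hat = x_test @ Phi
y_star = oracle_solution(W_true @ x_test)
print(float(np.abs(y_hat - y_star).max()))  # close to 0
```

Because the composite map $x \mapsto Q^{-1} W x$ is linear here, the fit is essentially exact; in the general nonlinear, constrained case this role is played by the trained networks and surrogate losses described below.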

2. LtOF Algorithmic Paradigms and Loss Functions

LtOF admits a broad variety of algorithmic designs, unified by their feature-to-solution mapping and surrogate objective structure. Core paradigms include:

  • Direct solution networks: Feedforward or recurrent neural architectures $f_\phi : x \mapsto y$ trained to approximate $y^*(c)$ by minimizing surrogate losses, e.g., quadratic regret, constraint violations, or regularized objective value (Kotary et al., 2023, Kotary et al., 2024, Babier et al., 2018).
  • Task-aware loss parameterization: LtOF employs loss functions $L_\theta(\hat y, y; x)$ whose parameters $\theta$ are themselves feature-dependent, e.g., via networks $P_\psi(x)$ generating per-feature loss weights or matrices. The training loop matches the learned loss to the true regret incurred by predicted decisions, using model-based prediction sampling for sample efficiency (Shah et al., 2023).
  • Primal-dual and Lagrangian methods: Primal-dual learning (PDL), (augmented) Lagrangian Dual (LD), and Deep Constraint Completion & Correction (DC³) are instantiated as surrogate loss choices—each enforcing constraints and optimality via distinct penalty and dual variable strategies, and all adaptable to the feature-input setting (Kotary et al., 2024, Kotary et al., 2023, Kotary et al., 2024).
  • Barrier-adversarial and dual-feature learning: For contextually constrained problems, LtOF architectures may include adversarial classifiers to encode feasible sets or learn barrier-augmented generators. Dual variable prediction (with ReLU-enforced feasibility) further supports feasibility in constrained optimization (Babier et al., 2018, Kotary et al., 2024).

Representative loss formulations in the literature:

| Method | Surrogate Loss | Notes |
|--------|----------------|-------|
| LD | $\|\hat y - y^*(c)\|^2 + \lambda^\top [g(\hat y)]_+ + \mu^\top h(\hat y)$ | Iterative multipliers |
| PDL | $f(\hat y, c) + \tilde\lambda^\top g(\hat y) + \tilde\mu^\top h(\hat y) + \frac{\rho}{2}(\|[g(\hat y)]_+\|^2 + \|h(\hat y)\|^2)$ | Separate dual-net prediction |
| DC³ | $f(\hat y, c) + \alpha\|[g(\hat y)]_+\|^2 + \beta\|h(\hat y)\|^2$ | Completion/correction layers |
| FBP | $L_\theta(\hat y, y; x)$, a feature-conditioned weighted or quadratic surrogate | Fisher-consistent if linear |
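To make one of these surrogates concrete, the DC³-style penalty loss listed above can be evaluated directly. The toy constraint set (a simplex), the linear objective $f(\hat y, c) = c^\top \hat y$, and the penalty weights below are all illustrative assumptions, not details of the cited papers.

```python
import numpy as np

def dc3_style_loss(y_hat, c, g, h, alpha=10.0, beta=10.0):
    """Penalty surrogate in the DC3 style: objective value plus squared
    penalties on inequality violations [g(y)]_+ and equality residuals
    h(y). Weights alpha, beta are illustrative hyperparameters."""
    ineq_viol = np.maximum(g(y_hat), 0.0)   # [g(y)]_+
    eq_resid = h(y_hat)
    return (c @ y_hat
            + alpha * np.sum(ineq_viol ** 2)
            + beta * np.sum(eq_resid ** 2))

# Toy instance: minimize c'y subject to y >= 0 and sum(y) = 1.
c = np.array([1.0, 2.0])
g = lambda y: -y                            # y >= 0  <=>  g(y) = -y <= 0
h = lambda y: np.array([y.sum() - 1.0])

feasible = np.array([1.0, 0.0])
infeasible = np.array([1.5, -0.2])
print(dc3_style_loss(feasible, c, g, h))    # objective only: 1.0
print(dc3_style_loss(infeasible, c, g, h))  # 1.1 + 0.4 + 0.9 = 2.4
```

Feasible candidates pay only the objective, while infeasible ones accumulate quadratic penalties; during training these penalties push the solution network toward the feasible set without a solver in the loop.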

3. Computational Strategies and Theoretical Guarantees

Key computational ingredients in LtOF algorithms include:

  • Low-rank, sample-efficient losses: Model-Based Sampling (MBS) focuses sampling on prediction-relevant decisions $\hat y$, while Feature-Based Parameterization (FBP) reduces the sample complexity of high-dimensional loss fitting, achieving order-of-magnitude efficiency gains (Shah et al., 2023).
  • Unbiased and scalable derivatives: In sequential or high-dimensional settings—e.g., Markov Decision Processes (MDPs)—differentiation through the optimization step is made tractable by trajectory sampling (for Bellman/policy gradients) and low-rank Hessian approximations using the Woodbury identity, scaling to large state and policy spaces (Wang et al., 2021).
  • Backpropagation through differentiable proxies: All major LtOF classes allow end-to-end differentiation through the feature-to-decision network using automatic differentiation, eschewing the need for hand-crafted KKT or black-box solver gradients (Kotary et al., 2023, Kotary et al., 2024).
  • Feasibility enforcement: For constrained or dual-parameterized problems, architecture choices like output activation (ReLU for dual variables), box-constrained inner solves, and primal recovery by stationarity enable approximate or exact feasibility at inference (Kotary et al., 2024).
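The Woodbury trick mentioned above can be sketched in isolation: for a diagonal matrix plus a rank-$k$ update, linear solves require only $k \times k$ dense algebra, costing $O(nk^2)$ instead of $O(n^3)$. The setup below is illustrative and not the cited papers' actual Hessian structure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 5                      # ambient dimension vs. rank of update

# Woodbury identity:
#   (A + U V')^{-1} = A^{-1} - A^{-1} U (I + V' A^{-1} U)^{-1} V' A^{-1}
a = rng.uniform(1.0, 2.0, size=n)  # diagonal of A (kept as a vector)
U = rng.normal(size=(n, k))
V = rng.normal(size=(n, k))

def woodbury_solve(a, U, V, b):
    """Solve (diag(a) + U V') x = b using only k x k dense algebra."""
    Ainv_b = b / a
    Ainv_U = U / a[:, None]
    S = np.eye(k) + V.T @ Ainv_U   # small k x k capacitance matrix
    return Ainv_b - Ainv_U @ np.linalg.solve(S, V.T @ Ainv_b)

b = rng.normal(size=n)
x_fast = woodbury_solve(a, U, V, b)
x_dense = np.linalg.solve(np.diag(a) + U @ V.T, b)
print(float(np.abs(x_fast - x_dense).max()))  # agrees to machine precision
```

The same structure is what makes differentiating through large state and policy spaces tractable when the Hessian admits a low-rank-plus-diagonal approximation.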

Theoretical analyses encompass Fisher consistency (in the FBP case with a weighted-MSE surrogate, minimizers of the learned loss recover expected optimal predictions under linear cost), sample-complexity bounds (LtOF achieves substantial reductions versus per-instance methods), and robustness to distribution shift (unlike pretrained proxies, LtOF’s joint training avoids input drift) (Shah et al., 2023, Kotary et al., 2023).

4. Sequential, Decentralized, and High-Order Extensions

LtOF methods have been generalized to address domains beyond static convex optimization:

  • Sequential decision problems/MDPs: Direct mapping from observed features (e.g., terrain descriptors, patient covariates) to missing MDP parameters, with RL-based solvers embedded in memory- and computation-efficient end-to-end training—addressing both intractable dimensionality and expensive policy differentiation (Wang et al., 2021).
  • Decentralized optimization: In the MiLoDo framework, LtOF parametrizes algebraically-structured per-node primal/dual updates using node- and neighbor-specific feature vectors, enforced through LSTM-based diagonal preconditioners and consensus-aware mixing weights, with strong convergence and generalization guarantees (He et al., 2024).
  • Meta-learner optimizers: Techniques such as Dynamic Mode Decomposition (DMD) extract trajectory-level features, which, when fed to LSTM meta-optimizers, yield robustly generalized learning-to-optimize behaviors not possible with coordinate-wise gradient input alone (Šimánek et al., 2022).
  • Zeroth-order and high-dimensional navigation: Policies trained as navigation agents on prototypical loss surfaces, and extended to high-d via random 2D projections, are representative of LtOF applied to black-box and partial-information scenarios (Faury et al., 2018).
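The random-projection idea from the last bullet can be sketched as follows: a high-dimensional black-box loss is repeatedly restricted to a random 2D subspace and descended with finite differences. This crude inner loop is a stand-in for the learned navigation policy of Faury et al.; the quadratic objective and all step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000                                  # high-dimensional decision space

def loss(x):
    """Black-box objective: only zeroth-order (function-value) access."""
    return float(np.sum((x - 1.0) ** 2))

x = np.zeros(d)
for _ in range(1000):
    # Draw a fresh orthonormal 2D basis and optimize within that slice:
    # the restricted surface is f(alpha) = loss(x + P @ alpha).
    P, _ = np.linalg.qr(rng.normal(size=(d, 2)))
    alpha, eps, lr = np.zeros(2), 1e-4, 0.1
    for _ in range(5):
        # Central finite differences give the 2D restricted gradient.
        grad = np.array([
            (loss(x + P @ (alpha + eps * e)) - loss(x + P @ (alpha - eps * e)))
            / (2 * eps)
            for e in np.eye(2)
        ])
        alpha -= lr * grad
    x = x + P @ alpha

print(loss(np.zeros(d)), "->", loss(x))   # the loss drops markedly
```

Each slice captures only a $2/d$ fraction of the descent direction in expectation, so progress per projection is slow but dimension-independent machinery suffices, which is the appeal in black-box and partial-information settings.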

5. Empirical Performance and Benchmarks

LtOF consistently yields substantial performance, runtime, and generalization benefits across diverse testbeds:

  • Standard PtO benchmarks: On tasks such as cubic Top-K selection, portfolio optimization, and web advertising, LtOF achieves normalized decision quality improvements up to 200% and matches/exceeds state-of-the-art surrogate or per-instance loss methods while using an order of magnitude fewer samples (Shah et al., 2023).
  • Constrained and nonconvex programs: In convex/nonconvex QP and AC-OPF, regret is reduced by 10× relative to two-stage PtO. Inference speeds are accelerated by 10–100× due to the elimination of per-instance solver calls (Kotary et al., 2023, Kotary et al., 2024, Babier et al., 2018).
  • Sequential/MDP tasks: In partially observed gridworlds, decision-focused Bellman-W (Woodbury) and PG-W methods double OPE scores against two-stage baselines (e.g., 1.5 vs 0.8) and reach near-optimal performance (Wang et al., 2021).
  • Decentralized networks and deep nets: MiLoDo-trained optimizers demonstrate 1.5–3× faster convergence on LASSO, logistic regression, and deep architectures, maintaining empirical stability across $10^5$ iterations and successfully transferring to unseen dimensions and tasks (He et al., 2024).

6. Practical Considerations, Limitations, and Future Directions

Strengths of LtOF include broad applicability across decision domains, avoidance of restrictive localness and per-instance loss assumptions, elimination of hand-crafted differentiation, and real-time solution inference. The methodology is robust under complex and nonlinear feature–parameter relationships and does not suffer from drift issues seen in pretrained proxy pipelines (Kotary et al., 2023, Kotary et al., 2024, Shah et al., 2023).

Limitations remain, including the need for task-appropriate LtO surrogates, absence of explicit parameter recovery (for downstream auditing), and feasibility restoration complexity in highly nonconvex or combinatorial domains. Future work includes extensions to mixed-integer optimization, online and streaming settings, incorporation of domain knowledge into loss architectures, and development of regret-bounding theory in the presence of uncertain features (Shah et al., 2023, Kotary et al., 2023, Kotary et al., 2024).


References:

  • "Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize" (Shah et al., 2023)
  • "Predict-Then-Optimize by Proxy: Learning Joint Models of Prediction and Optimization" (Kotary et al., 2023)
  • "Learning Joint Models of Prediction and Optimization" (Kotary et al., 2024)
  • "Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning" (Wang et al., 2021)
  • "Learning to Optimize Contextually Constrained Problems for Real-Time Decision-Generation" (Babier et al., 2018)
  • "Learning Constrained Optimization with Deep Augmented Lagrangian Methods" (Kotary et al., 2024)
  • "A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization" (He et al., 2024)
  • "Learning to Optimize with Dynamic Mode Decomposition" (Šimánek et al., 2022)
  • "Rover Descent: Learning to optimize by learning to navigate on prototypical loss surfaces" (Faury et al., 2018)
