Discretize–Optimize Methods for SDEs
- Discretize–Optimize for SDEs is a computational paradigm that approximates continuous stochastic processes via discrete integrators and then optimizes the resulting system.
- It employs forward and adjoint sensitivity techniques to yield unbiased gradient estimates and improve control performance in stochastic settings.
- Adaptive mesh optimization and tailored discretization schemes minimize numerical error and boost efficiency in simulation and neural SDE training.
Discretize–Optimize for SDEs
The Discretize–Optimize paradigm for stochastic differential equations (SDEs) is an algorithmic and conceptual workflow for computing sensitivities, solving optimal control problems, or bounding expectations of SDE functionals by first discretizing the SDE dynamics in time and then performing optimization (or differentiation or dynamic programming) at the discrete level. This stands in contrast to the optimize–discretize alternative, where continuous adjoints or gradients are derived and then discretized post hoc; the two approaches may not commute for SDEs. Discretize–Optimize is central in adjoint sensitivity analysis, pathwise differentiation, optimal sampling design, stochastic control, numerical solution of backward SDEs (BSDEs), neural SDE training, and mesh optimization for error control. This article surveys the mathematical foundations, discretization strategies, gradient computation, convergence properties, sample efficiency, and problem-specific algorithmics associated with the Discretize–Optimize methodology for SDEs.
1. Mathematical and Algorithmic Foundation
Given a continuous-time SDE, either in the Itô or Stratonovich sense,
$$dX_t = b(X_t, \theta)\,dt + \sigma(X_t, \theta)\,dW_t, \qquad X_0 = x_0,$$
with standard Itô or Stratonovich integration, the primary objective is often to compute or optimize
$$J(\theta) = \mathbb{E}\left[\varphi(X_T)\right],$$
where $\varphi$ is a given terminal payoff or cost. The typical goal is to compute the gradient $\nabla_\theta J(\theta)$ or solve related optimal control problems.
Discretize–Optimize proceeds in two steps:
- Discretization: The SDE is approximated by a discrete-time integrator (e.g., Euler–Maruyama for Itô, Heun or midpoint for Stratonovich), yielding a Markov chain $(X_k)_{k=0}^{N}$ over a mesh $0 = t_0 < t_1 < \cdots < t_N = T$. The functional $J(\theta)$ is replaced by its discrete counterpart $J_N(\theta)$, e.g., $J_N(\theta) = \mathbb{E}[\varphi(X_N)]$.
- Optimization/Sensitivity: All optimization, backward recursion, control, and differentiation are applied to the discrete problem. For sensitivities, this means differentiating through the stepwise update rules and adjoint recursions using autodiff or manual reverse-mode derivation (Leburu et al., 13 Jan 2026, Kidger et al., 2021).
This methodology reliably produces unbiased discrete gradients, correct KKT systems for control, and accurate strong error estimates for sample expectations under conventional regularity assumptions.
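The discretization step can be made concrete with a minimal sketch. The example below (pure Python; the scalar geometric Brownian motion dynamics, parameter values, and identity payoff are illustrative assumptions, not taken from the cited works) simulates Euler–Maruyama paths and forms the discrete Monte Carlo estimate of $J_N(\theta)$:

```python
import math
import random

def euler_maruyama_path(x0, mu, sigma, T, N, rng):
    """Simulate one Euler-Maruyama path of dX = mu*X dt + sigma*X dW."""
    dt = T / N
    x = x0
    for _ in range(N):
        dW = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + mu * x * dt + sigma * x * dW
    return x

def discrete_functional(x0, mu, sigma, T, N, paths, payoff, seed=0):
    """Monte Carlo estimate of the discrete functional J_N = E[payoff(X_N)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(paths):
        total += payoff(euler_maruyama_path(x0, mu, sigma, T, N, rng))
    return total / paths

# Discretize first; every subsequent optimization or differentiation targets J_N.
J_N = discrete_functional(1.0, 0.05, 0.2, 1.0, 100, 2000, lambda x: x)
```

For this toy dynamics the exact value is $\mathbb{E}[X_T] = x_0 e^{\mu T}$, so the estimate can be sanity-checked against $e^{0.05} \approx 1.051$.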
2. Time Discretization Strategies and Mesh Optimization
A critical ingredient of discretize–optimize is the choice of integrator and grid:
- Schemes: Euler–Maruyama (Itô), Heun/midpoint (Stratonovich), Milstein (for higher strong order), and implicit Euler (for control with constraints) are standard (Leburu et al., 13 Jan 2026, Chaudhary, 2024, Przybyłowicz, 2017).
- Adaptive and Optimal Grids: Instead of equidistant meshes, a mesh density $\rho(t)$ can be optimized to minimize the strong mean-square error. In linear SDEs, the optimal density follows the 1/3 power law $\rho^*(t) \propto e(t)^{1/3}$, where $e(t)$ is the local error density function, often expressed via Lyapunov or observability Gramians. Nodes are then given by $\int_0^{t_k} \rho^*(s)\,ds = \frac{k}{N}\int_0^T \rho^*(s)\,ds$ for $k = 0, \dots, N$ (Vladimirov, 5 Aug 2025, Przybyłowicz, 2017). For jump-diffusion SDEs, a similar construction applies, with the sampling density optimized against the local variance (Przybyłowicz, 2017).
These principles ensure that discretization error is minimized for a fixed computational budget.
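The equal-quantile node placement behind such optimal grids can be sketched as follows (pure Python; the particular error density used in the example, and the trapezoid-rule tabulation, are illustrative assumptions rather than the exact construction of the cited papers):

```python
def power_law_grid(error_density, T, N, resolution=10000):
    """Place N+1 nodes so each cell carries equal mass of rho ~ e(t)^(1/3)."""
    # Tabulate the density rho*(t) proportional to e(t)^(1/3) on a fine grid.
    ts = [T * i / resolution for i in range(resolution + 1)]
    rho = [error_density(t) ** (1.0 / 3.0) for t in ts]
    # Cumulative integral of rho via the trapezoid rule.
    cum = [0.0]
    for i in range(resolution):
        cum.append(cum[-1] + 0.5 * (rho[i] + rho[i + 1]) * (ts[i + 1] - ts[i]))
    total = cum[-1]
    # Node t_k solves int_0^{t_k} rho = (k/N) * int_0^T rho (inverse CDF scan).
    nodes, j = [], 0
    for k in range(N + 1):
        target = total * k / N
        while j < resolution and cum[j + 1] < target:
            j += 1
        span = cum[j + 1] - cum[j]
        frac = 0.0 if span == 0.0 else (target - cum[j]) / span
        nodes.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    nodes[0], nodes[-1] = 0.0, T
    return nodes

# Example: an error density growing in time concentrates nodes near t = T.
grid = power_law_grid(lambda t: 1.0 + 4.0 * t, T=1.0, N=10)
```

With a growing error density, the returned nonequidistant grid places shorter steps where the local error is largest, as the power law prescribes.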
3. Pathwise and Adjoint Sensitivity Computation
Discretize–Optimize produces discrete pathwise sensitivity algorithms:
- Forward mode: The one-step Jacobians of the integrator are differentiated directly. In Euler–Maruyama, with update $X_{k+1} = X_k + b(X_k,\theta)\,\Delta t_k + \sigma(X_k,\theta)\,\Delta W_k$, the one-step Jacobians are
$$A_k = \frac{\partial X_{k+1}}{\partial X_k} = I + \partial_x b\,\Delta t_k + \partial_x \sigma\,\Delta W_k, \qquad B_k = \frac{\partial X_{k+1}}{\partial \theta} = \partial_\theta b\,\Delta t_k + \partial_\theta \sigma\,\Delta W_k.$$
For gradient computation, these matrices are chain-multiplied to obtain $\partial X_N/\partial x_0 = A_{N-1}\cdots A_0$ and $\partial X_N/\partial \theta = \sum_{k=0}^{N-1} A_{N-1}\cdots A_{k+1}\,B_k$ (Leburu et al., 13 Jan 2026).
- Reverse (adjoint) mode: Introduce discrete adjoint variables $\lambda_k$ and propagate them backward using the discrete chain rule and the linearizations of the stepwise update. For Itô SDEs, the recursion is
$$\lambda_k = A_k^\top \lambda_{k+1}, \qquad \lambda_N = \nabla_x \varphi(X_N),$$
and the parameter gradient accumulates similarly as $\nabla_\theta J_N = \mathbb{E}\big[\sum_k B_k^\top \lambda_{k+1}\big]$ (Leburu et al., 13 Jan 2026). For Stratonovich/Heun, correction terms arise from the predictor–corrector update.
For jump-diffusion SDEs, the backward recursion and adjoint "re-start" conditions must accommodate both diffusion steps and discrete jumps (Bartsch et al., 7 May 2025).
Monte Carlo expectation over sample trajectories yields unbiased estimators of gradients. Pathwise propagation is the core of autodiff compatibility in differentiable programming frameworks (Leburu et al., 13 Jan 2026, Kidger et al., 2021).
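A minimal scalar instance of the adjoint recursion can be written out by hand (pure Python; the linear drift $b(x,\theta)=\theta x$, additive diffusion, and quadratic payoff $\varphi(x)=x^2$ are illustrative assumptions) and validated against a finite-difference derivative computed on the same frozen noise path:

```python
import math
import random

def simulate_and_adjoint(theta, x0, sigma, T, N, noise):
    """Forward Euler-Maruyama pass for dX = theta*X dt + sigma dW, then a
    backward adjoint recursion for d/dtheta of phi(X_N) = X_N^2 along one
    frozen noise path."""
    dt = T / N
    xs = [x0]
    for k in range(N):
        xs.append(xs[-1] + theta * xs[-1] * dt + sigma * noise[k])
    # One-step Jacobians for this model: A_k = 1 + theta*dt, B_k = x_k*dt.
    lam = 2.0 * xs[-1]            # lambda_N = phi'(X_N)
    grad = 0.0
    for k in reversed(range(N)):
        grad += xs[k] * dt * lam  # accumulate B_k * lambda_{k+1}
        lam *= 1.0 + theta * dt   # lambda_k = A_k * lambda_{k+1}
    return xs[-1], grad

rng = random.Random(7)
N, T = 200, 1.0
dt = T / N
noise = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(N)]

_, adj_grad = simulate_and_adjoint(0.3, 1.0, 0.1, T, N, noise)

# Finite-difference check on the same noise path (pathwise derivative).
eps = 1e-6
xp, _ = simulate_and_adjoint(0.3 + eps, 1.0, 0.1, T, N, noise)
xm, _ = simulate_and_adjoint(0.3 - eps, 1.0, 0.1, T, N, noise)
fd_grad = (xp**2 - xm**2) / (2 * eps)
```

Because both quantities differentiate the same discrete map under the same noise, they agree to numerical precision, which is exactly the "correct at the discrete level" property.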
4. SDE Optimal Control and BSDEs: Algorithms and Performance
In stochastic optimal control, discretize–optimize is applied by first discretizing both state and (when present) control variables and constructing the Lagrangian or dynamic programming recursion at the discrete level:
- For jump-diffusion SDE control, the discrete system samples Wiener and Poisson increments, evolves forward over merged grids, and solves backward adjoint recursions. The discrete KKT conditions or Lagrange multipliers are assembled in a Monte Carlo–averaged gradient estimator, often within a stochastic gradient descent or projected gradient iteration (Bartsch et al., 7 May 2025, Chaudhary, 2024).
- In linear-quadratic control, implicit Euler discretization is coupled with recursive calculation of discrete adjoints and conditional expectations, sometimes avoiding Monte Carlo altogether via closed-form propagation in high dimension (Chaudhary, 2024).
- For backward SDEs and convex dynamic programming, one constructs primal-dual iterates (pathwise lower/upper solutions) by recursion on the time grid; each iteration tightens the bounds, with explicit guarantees on convergence and error under time discretization (Bender et al., 2016).
Empirical evidence confirms that, in these settings, discretize–optimize achieves strong order $1/2$ errors (or higher with Milstein/Heun/midpoint), and the gradient error norm decays as $O(\Delta t^{1/2})$, or as $O(\Delta t)$ for smooth payoffs and strong (Stratonovich-compatible) integrators (Leburu et al., 13 Jan 2026).
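The deterministic backward pass in the linear-quadratic case can be sketched as a scalar discrete-time Riccati recursion (pure Python; this is a certainty-equivalence sketch under assumed dynamics $x_{k+1} = a x_k + b u_k + \text{noise}$ with stage cost $q x_k^2 + r u_k^2$, not the constrained higher-dimensional scheme of (Chaudhary, 2024)):

```python
def lq_backward_pass(a, b, q, r, qT, N):
    """Backward Riccati recursion for scalar discrete-time LQ control.
    Returns cost-to-go coefficients P_k and feedback gains K_k, u_k = -K_k x_k."""
    P = [0.0] * (N + 1)
    K = [0.0] * N
    P[N] = qT  # terminal cost coefficient
    for k in reversed(range(N)):
        # Optimal gain from minimizing q x^2 + r u^2 + P_{k+1} (a x + b u)^2.
        K[k] = (b * P[k + 1] * a) / (r + b * b * P[k + 1])
        # Scalar Riccati update for the cost-to-go coefficient.
        P[k] = q + a * a * P[k + 1] - (a * b * P[k + 1]) ** 2 / (r + b * b * P[k + 1])
    return P, K

# Example: mildly unstable discretized dynamics over a 50-step horizon.
P, K = lq_backward_pass(a=1.02, b=0.1, q=1.0, r=0.1, qT=1.0, N=50)
```

For additive noise the optimal feedback gains are the same as in the noiseless problem, which is why no Monte Carlo is needed in this regime.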
5. Neural SDEs and Memory-Efficient Differentiable Implementations
Training of Neural SDEs via backpropagation exemplifies scalability and efficiency in discretize–optimize:
- The reversible Heun method is a two-stage, algebraically reversible SDE integrator, enabling exact reversal of the discrete trajectory for gradient computation and eliminating numerical gradient error. The Brownian Interval method allows O(1) memory and fast reevaluation of arbitrary Brownian increments (Kidger et al., 2021).
- Implementation in autodiff libraries (e.g., torchsde) is straightforward: the time-stepping SDE solver is written naively, with all randomness stored or reproducibly generated, and autodiff computes the correct pathwise adjoint. The discrete gradient matches the autodiff output to machine precision when algebraic reversibility is used.
- Empirically, the reversible Heun and Brownian Interval approaches yield 2–10× speedups in large-batch training and zero relative error in discrete gradients, outperforming standard midpoint or Heun adjoints, which incur nonnegligible relative gradient error due to discretization mismatches (Kidger et al., 2021).
This paradigm demonstrates the centrality of discretize–optimize in modern large-scale SDE-based learning.
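The algebraic-reversibility idea can be illustrated with a minimal pure-Python step in the spirit of the reversible Heun method (scalar, with explicitly stored Brownian increments rather than a Brownian Interval; the staging below follows the two-stage pattern of (Kidger et al., 2021) only up to notational choices, and the drift/diffusion functions are illustrative):

```python
import math
import random

def reversible_heun_forward(y, yh, f, g, dt, dW):
    """One forward step of an algebraically reversible Heun-type scheme,
    carrying the pair (y, y_hat)."""
    yh_next = 2.0 * y - yh + f(yh) * dt + g(yh) * dW
    y_next = y + 0.5 * (f(yh) + f(yh_next)) * dt + 0.5 * (g(yh) + g(yh_next)) * dW
    return y_next, yh_next

def reversible_heun_backward(y_next, yh_next, f, g, dt, dW):
    """Exact algebraic inverse of the forward step: no stored trajectory and
    no approximate backward solve is needed."""
    yh = 2.0 * y_next - yh_next - f(yh_next) * dt - g(yh_next) * dW
    y = y_next - 0.5 * (f(yh) + f(yh_next)) * dt - 0.5 * (g(yh) + g(yh_next)) * dW
    return y, yh

# Forward-integrate, then reverse; the initial state is recovered exactly
# (up to floating-point roundoff), which is what removes numerical gradient
# error in backpropagation.
f = lambda x: math.sin(x)         # illustrative drift
g = lambda x: 0.3 * math.cos(x)   # illustrative diffusion
rng = random.Random(3)
dt, N = 0.01, 100
dWs = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(N)]
y, yh = 1.0, 1.0
for dW in dWs:
    y, yh = reversible_heun_forward(y, yh, f, g, dt, dW)
for dW in reversed(dWs):
    y, yh = reversible_heun_backward(y, yh, f, g, dt, dW)
```

The backward step is derived by algebraically inverting the forward update, so reversal holds for any drift and diffusion, not just this toy pair.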
6. Specialized Approaches: Particle Flows, Filtering, and Grid Optimization
Beyond pathwise differentiation, discretize–optimize underpins a range of innovative SDE solvers:
- Particle flows for control: Solutions of controlled Fokker–Planck equations are approximated by deterministic ODEs for particle clouds, where optimal control is directly extracted from score differences between constrained and unconstrained flows. Here, the optimization is achieved in a single “one-shot” iteration after time discretization (Maoutsa et al., 2021).
- Kalman filtering and grid shaping: In linear SDEs, minimization of strong mean-square approximation error with respect to time grid or sampling density is cast as a variational calculus problem, solved by the 1/3 power law (Vladimirov, 5 Aug 2025). Similarly, in jump–diffusion SDEs, optimal mesh design is realized by matching local strong error densities and constructing nonequidistant grids that realize minimal error constants (Przybyłowicz, 2017).
These strategies generalize discretize–optimize to mesh-adaptive, strongly convergent, and high-dimensional regimes.
7. Limitations, Practical Considerations, and Extensions
- Commutation issues: For Itô SDEs, optimize–discretize and discretize–optimize may not commute. Discretize–optimize is "correct at the discrete level"—that is, the discrete gradient is always unbiased for the discrete expectation (Leburu et al., 13 Jan 2026).
- Monte Carlo cost: All practical schemes require averaging over $M$ sampled paths, with gradient variance decaying as $O(1/M)$. Checkpointing, seed control, and Brownian intervalization mitigate memory and compute bottlenecks (Kidger et al., 2021).
- High-dimensional control: In linear-quadratic and some convex control problems, recursive formulas for conditional expectations eliminate the need for Monte Carlo, yielding deterministic closed-form iterates (Chaudhary, 2024). For nonlinear or nonconvex problems, variance reduction and control variates can accelerate convergence (Bender et al., 2016).
- Score-based/particle methods: In high dimensions, particle-based or kernel-based score matching for optimal control suffers from the curse of dimensionality, although neural score estimators and entropic optimal transport offer plausible improvements (Maoutsa et al., 2021).
- Extensions: Mesh optimization, adaptive time-stepping, and discrete strong approximation constants enable application-specific tailoring of step sizes and sampling grids in complex SDE models (Vladimirov, 5 Aug 2025, Przybyłowicz, 2017).
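The seed-control point above can be made concrete: storing only a seed, rather than the whole noise path, lets a backward pass regenerate identical Brownian increments on demand. The sketch below is a toy stand-in for the Brownian Interval idea (pure Python; the function name and parameters are illustrative):

```python
import math
import random

def brownian_increments(seed, n, dt):
    """Regenerate the same n Brownian increments from a stored seed, so the
    reverse pass needs only O(1) memory for the noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]

# Forward pass: draw a seed, integrate, and store only (seed, n, dt).
seed, n, dt = 12345, 1000, 0.001
forward_noise = brownian_increments(seed, n, dt)

# Backward pass: regenerate the identical increments instead of storing them.
backward_noise = brownian_increments(seed, n, dt)
```

The real Brownian Interval additionally supports fast random access to increments over arbitrary subintervals, which this linear-replay sketch does not attempt.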
A plausible implication is that future research may further couple mesh-adaptive discretization, pathwise adjoint methods, and variance-controlled Monte Carlo to push the computational frontier of discretize–optimize algorithms for SDEs.
Key References:
- "Differentiating through Stochastic Differential Equations: A Primer" (Leburu et al., 13 Jan 2026)
- "Efficient and Accurate Gradients for Neural SDEs" (Kidger et al., 2021)
- "Adjoint-based optimal control of jump-diffusion processes" (Bartsch et al., 7 May 2025)
- "Filtering and 1/3 Power Law for Optimal Time Discretisation in Numerical Integration of Stochastic Differential Equations" (Vladimirov, 5 Aug 2025)
- "Deterministic particle flows for constraining SDEs" (Maoutsa et al., 2021)
- "Optimal sampling design for global approximation of jump diffusion SDEs" (Przybyłowicz, 2017)
- "A numerical method to simulate the stochastic linear-quadratic optimal control problem with control constraint in higher dimensions" (Chaudhary, 2024)
- "Pathwise Iteration for Backward SDEs" (Bender et al., 2016)