
Causal Sinkhorn DRO Optimization

Updated 6 February 2026
  • Causal-SDRO is a robust optimization framework that integrates causal optimal transport with Sinkhorn entropic regularization to construct ambiguity sets respecting non-anticipativity and temporal dynamics.
  • It employs dual reformulations and gradient-based algorithms to ensure tractable computation and interpretable decision rules under distributional uncertainty.
  • Causal-SDRO offers provable learning guarantees and superior empirical performance in finance, optimal control, and contextual decision-making.

Causal Sinkhorn Distributionally Robust Optimization (Causal-SDRO) is a class of distributionally robust optimization (DRO) frameworks that combine causal optimal transport and Sinkhorn entropic regularization. These approaches enable the construction of ambiguity sets respecting information flow, non-anticipativity, and conditional independence, thereby addressing dynamic or contextual settings where decisions or predictions must account for temporal structure, stochastic control, or covariate-dependence. Causal-SDRO achieves robust risk control and statistically grounded guarantees, producing tractable, interpretable procedures for robust optimization and learning under distributional uncertainty.

1. Core Principles and Mathematical Foundations

Causal-SDRO extends classical distributionally robust optimization by replacing the standard Wasserstein ambiguity set with a causal (non-anticipative) version, further regularized by an entropic (Sinkhorn) penalty. This yields ambiguity sets of the form

\mathcal{U}_{\mathrm{causal}}(\hat\mu_N;\varepsilon) = \{\nu : W_{c,\mathrm{causal}}(\hat\mu_N, \nu) \leq \varepsilon\},

where $W_{c,\mathrm{causal}}(\cdot, \cdot)$ is the causal Wasserstein (optimal transport) distance (Han, 2022). The causal distance is defined by minimizing transport cost over couplings satisfying

\pi(dy_t \mid x_{1:T}, y_{1:t-1}) = \pi(dy_t \mid x_{1:t}, y_{1:t-1}),

enforcing that realizations at each time $t$ depend only on information revealed up to time $t$, reflecting the information structure of time series and stochastic control.

Sinkhorn regularization augments the transport cost with an entropy term, $d_c^\epsilon(\mu, \nu) = \inf_{\pi \in \Pi_c(\mu, \nu)} \left\{ \mathbb{E}_\pi[c] + \epsilon\, \mathrm{KL}(\pi \| \mu \otimes \nu) \right\}$, inducing absolutely continuous couplings and enabling efficient solution via iterative scaling (Sinkhorn) algorithms (Jiang, 2024, Zhang et al., 16 Jan 2026, Cescon et al., 31 Aug 2025). The resulting ambiguity set is then specified by a causal Sinkhorn discrepancy (CSD) ball $\mathcal{U} = \{ \mathbb{P} : R_p(\widehat{\mathbb{P}}, \mathbb{P}) \leq \rho \}$, with $R_p$ denoting the entropy-regularized causal Wasserstein distance (Zhang et al., 16 Jan 2026).
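The iterative-scaling solver for the entropic transport problem can be illustrated on discrete marginals. The sketch below is a minimal, unconstrained Sinkhorn iteration (it enforces only the two marginals, not the causal constraint); the distributions, cost, and regularization strength are illustrative choices.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, n_iter=2000):
    """Entropic OT: min over couplings of <pi, C> + eps * KL(pi || mu x nu),
    solved by alternating (Sinkhorn) scaling of the Gibbs kernel."""
    K = np.exp(-C / eps)              # Gibbs kernel
    b = np.ones_like(nu)
    for _ in range(n_iter):           # scalings converge geometrically
        a = mu / (K @ b)              # match the row marginal
        b = nu / (K.T @ a)            # match the column marginal
    pi = a[:, None] * K * b[None, :]  # optimal entropic coupling
    return pi, float(np.sum(pi * C))

# Two 3-point distributions on a line, squared-distance cost.
x = np.array([0.0, 1.0, 2.0])
C = (x[:, None] - x[None, :]) ** 2
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
pi, cost = sinkhorn(mu, nu, C, eps=0.5)
```

Larger `eps` yields smoother (more diffuse) couplings and faster convergence; as `eps` shrinks, the coupling approaches the unregularized optimal transport plan.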

2. Duality and Structural Extensions

Causal-SDRO admits strong dual reformulations. Under appropriate continuity and growth conditions, the worst-case expectation over the ambiguity set can be recast as a minimization involving test functions drawn from a space reflecting the causal structure: $J(\varepsilon; \hat\mu_N) = \inf_{\lambda \geq 0,\, \gamma \in \Gamma} \left\{ \lambda \varepsilon + \int_\mathcal{X} F(x; \lambda, \gamma)\, \hat\mu_N(dx) \right\}$, where

F(x;\lambda,\gamma) = \sup_{y \in \mathcal{X}} \{ f(y) - \lambda c(x,y) + \gamma(x,y) \}

and $\Gamma$ consists of sums of test functions encoding the non-anticipative constraints (Han, 2022).
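A numerical sketch of this dual in the simplest one-period case, where $\Gamma$ is trivial ($\gamma \equiv 0$) and the formula collapses to the classical Wasserstein-DRO dual $J = \inf_{\lambda \geq 0}\{\lambda\varepsilon + \mathbb{E}_{\hat\mu_N}[F(x;\lambda)]\}$. The loss $f$, cost $c$, empirical points, and radius below are illustrative choices, and both the inner sup and outer inf are approximated by grid search.

```python
import numpy as np

f = lambda y: np.abs(y)              # loss to be robustified
c = lambda x, y: (x - y) ** 2        # quadratic transport cost
xs = np.array([-0.5, 0.0, 0.8])      # empirical support of mu_hat
ys = np.linspace(-5.0, 5.0, 2001)    # grid for the inner sup over y
eps = 0.1                            # ambiguity radius

def F(x, lam):
    # inner maximization: F(x; lambda) = sup_y { f(y) - lambda * c(x, y) }
    return np.max(f(ys) - lam * c(x, ys))

def dual_objective(lam):
    return lam * eps + np.mean([F(x, lam) for x in xs])

lams = np.linspace(0.1, 10.0, 200)   # grid for the outer inf over lambda >= 0
J = min(dual_objective(l) for l in lams)
```

Since the ambiguity set contains the empirical measure itself, the robust value `J` is never below the nominal expected loss on `xs`.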

The framework can incorporate structural information by intersecting the ambiguity set with a model class, restricting it to distributions with parametric, factor, or moment structures (e.g., induced by RNNs or moment constraints) (Han, 2022, Zhang et al., 16 Jan 2026). The corresponding dual then involves nested minimax optimizations over the dual multiplier, the test functions, and the worst-case measures within the model class.

In the entropically regularized (Sinkhorn) case, dual potentials at each time step are updated via coordinate ascent, yielding a nested, time-indexed family of potentials coupled recursively across time (Jiang, 2024).

3. Optimization Algorithms and Computational Aspects

Causal-SDRO problems (including their duals) reduce, after parameterization of test functions and structural classes, to large-scale finite-dimensional minimax or saddle programs. Prototypical solution algorithms are variants of gradient descent-ascent (GDA) or stochastic compositional gradient methods, exploiting the differentiability and convexity induced by the entropic penalty (Han, 2022, Zhang et al., 16 Jan 2026).

Algorithm templates:

  • COT-GDA: Parameterize the dual test-function network; perform inner gradient ascent over the adversarial variables and gradient descent in the network parameters and the multiplier $\lambda$ (Han, 2022).
  • SCOT-GDA: Parameterize both the generator and the dual network; alternate Sinkhorn computations with adversarial updates to the two.
  • Dynamic Sinkhorn: For time-indexed processes, employ backward recursion to update the dual potentials using a "soft max" version of dynamic programming (Jiang, 2024).
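The GDA template underlying the first two variants can be sketched on a toy strongly-convex-strongly-concave saddle problem; this stands in for the (much larger) parameterized minimax programs above and is not code from the cited papers.

```python
# Minimal simultaneous gradient descent-ascent (GDA) on
# min_x max_y L(x, y) with L(x, y) = 0.5 x^2 + x y - 0.5 y^2,
# whose unique saddle point is (0, 0).
def L_grad(x, y):
    return x + y, x - y      # (dL/dx, dL/dy)

x, y, lr = 2.0, -1.5, 0.05
for _ in range(2000):
    gx, gy = L_grad(x, y)
    x, y = x - lr * gx, y + lr * gy   # descent in x, ascent in y
```

For bilinear terms alone, simultaneous GDA cycles; the strongly convex/concave regularizers (here the quadratic terms, in Causal-SDRO the entropic penalty) are what make the iteration contract to the saddle point.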

Stochastic compositional optimization: For contextual settings (covariate-dependent policies), substitute a parameterized policy (e.g., a soft regression forest), yielding a three-level stochastic composition problem optimized via a stochastically corrected scheme with a provable convergence rate (Zhang et al., 16 Jan 2026).
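The key mechanism, tracking an inner expectation with a running average while the outer variable descends, can be shown on the simplest nested objective $F(w) = f(\mathbb{E}[g(w,\xi)])$ with $f(u)=u^2$ and $g(w,\xi)=w+\xi$, whose minimizer is $w=0$. This is an illustrative two-level stand-in for the three-level scheme; all step-size choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w, u = 3.0, 0.0
for t in range(20000):
    alpha = 0.1 / (t + 1) ** 0.75          # outer (decision) step size
    beta = 1.0 / (t + 1) ** 0.5            # inner tracking rate
    xi = rng.normal()
    u = (1 - beta) * u + beta * (w + xi)   # running estimate of E[g(w, xi)] = w
    w -= alpha * 2.0 * u                   # chain rule: f'(u) * dg/dw, f(u) = u^2
```

The correction variable `u` avoids the bias that would arise from plugging a single noisy sample of the inner expectation into the nonlinear outer function.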

In optimal control, Causal-SDRO LQG admits exact reformulations as convex semidefinite programs by restricting nature to Gaussian laws, reducing computational burden and ensuring global saddle points achieved by linear policies (Cescon et al., 31 Aug 2025).
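A scalar toy version of the worst-case-Gaussian idea (an illustrative reduction, not the SDP of the paper): for zero-mean Gaussians the Gelbrich distance reduces to the difference of standard deviations, so the worst-case variance over a Gelbrich ball is attained on the boundary.

```python
# For zero-mean scalar Gaussians N(0, s1^2), N(0, s2^2), the Gelbrich
# distance is |s1 - s2|, so maximizing variance over the ball
# {sigma : |sigma - sigma0| <= rho, sigma >= 0} gives (sigma0 + rho)^2.
def gelbrich_scalar(s1, s2):
    return abs(s1 - s2)

def worst_case_variance(sigma0, rho):
    return (sigma0 + rho) ** 2

sigma0, rho = 1.0, 0.3
wc_var = worst_case_variance(sigma0, rho)   # ~ 1.69
```

In the matrix-valued LQG setting this boundary structure is what the SDP reformulation exploits, with the ball radius tied to the entropy parameter.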

4. Sample Complexity and Learning Guarantees

Parametric approximations of the dual test-function space enable explicit learning-theoretic guarantees: the generalization error of the robust value is controlled by the Rademacher complexity of the test-function class, with explicit rates for neural network classes (Han, 2022).

Universal approximation holds: as parameterization is refined, the empirical and population values converge, ensuring the statistical validity of robust risk estimates and decisions.
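Empirical Rademacher complexity can be estimated by Monte Carlo for simple classes. The sketch below does this for a norm-bounded linear class (an illustrative stand-in for the neural classes above), where the inner supremum has the closed form $(B/n)\,\|\sum_i \sigma_i x_i\|_2$, and checks it against the standard $B\sqrt{\sum_i \|x_i\|^2}/n$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 200, 5, 1.0
X = rng.normal(size=(n, d))

def rademacher_linear(X, B, n_mc=2000, rng=rng):
    # Empirical Rademacher complexity of {x -> <w, x> : ||w||_2 <= B}:
    # by duality, sup_w (1/n) sum_i sigma_i <w, x_i> = (B/n) ||sum_i sigma_i x_i||_2.
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # Rademacher signs
    return B / n * np.mean(np.linalg.norm(sigma @ X, axis=1))

R = rademacher_linear(X, B)
bound = B * np.sqrt((X ** 2).sum()) / n   # standard O(1/sqrt(n)) upper bound
```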

5. Theoretical and Practical Applications

Robust Finance

  • Volatility Estimation: SCOT (structurally-constrained Causal-SDRO) produces strictly tighter, smoother volatility scenarios compared to classical OT. Table 5.1 in (Han, 2022) demonstrates that SCOT achieves lower mean dual values (1.9284) and smaller volatility than SOT/OT, with a statistically significant improvement. Worst-case paths generated by OT "spike" unrealistically, whereas SOT/SCOT preserve temporal coherence.
  • S&P 500 Index Prediction: SCOT and SOT provide higher prediction coverage with minimally increased MAE compared to classical OT, which substantially degrades MAE. In Table 5.3 of (Han, 2022), SCOT achieves 67.3% coverage with 1.5% MAE, OT reaches 73.8% coverage with 1.87% MAE, and the non-robust benchmark underperforms both in coverage and stability.

Robust Control

  • Distributionally Robust LQG: Causal-SDRO enables the synthesis of globally optimal linear policies under entropy-regularized Wasserstein ambiguity. Gaussian restrictions lead to precise semidefinite program (SDP) reductions where strong duality holds and linear policies are proven optimal. The worst-case nature's law is Gaussian with covariance in a Gelbrich set determined by the entropy parameter (Cescon et al., 31 Aug 2025).

Contextual Decision-Making

  • CSD-Based SDRO: Incorporating contextual information, Causal-SDRO ambiguity sets are governed by the causal Sinkhorn discrepancy between joint empirical and candidate laws (Zhang et al., 16 Jan 2026). The worst-case law admits a Gibbs-mixture form. Decision rules are parameterized using Soft Regression Forests (SRF), which are differentiable, smooth, and interpretable. Empirical results indicate that SRF-Causal-SDRO achieves substantial improvements in prescriptiveness scores (e.g., ≈50%) and outperforms neural nets and classical ERM in nonlinear newsvendor, inventory-substitution, and portfolio selection tasks.
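The differentiability that makes SRF policies trainable inside a minimax program can be shown with a single depth-1 "soft" tree, where the split is a sigmoid gate so the prediction is smooth in all parameters. This is an illustrative sketch, not the exact SRF architecture of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(X, w, b, leaf_left, leaf_right):
    p = sigmoid(X @ w + b)            # soft (differentiable) routing probability
    return p * leaf_left + (1 - p) * leaf_right

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 2.0, -1.0)  # ground truth is a hard axis-aligned split

w, b = rng.normal(scale=0.1, size=2), 0.0
lL, lR = 1.0, -1.0                    # asymmetric leaves break the symmetry
mse0 = np.mean((soft_tree_predict(X, w, b, lL, lR) - y) ** 2)
lr = 0.2
for _ in range(2000):                 # plain gradient descent on squared loss
    p = sigmoid(X @ w + b)
    err = p * lL + (1 - p) * lR - y
    gp = err * (lL - lR) * p * (1 - p)   # d(loss)/d(logit), up to constants
    w -= lr * (X.T @ gp) / len(X)
    b -= lr * gp.mean()
    lL -= lr * (err * p).mean()
    lR -= lr * (err * (1 - p)).mean()
mse1 = np.mean((soft_tree_predict(X, w, b, lL, lR) - y) ** 2)
```

Because the routing probabilities and leaf values are learned jointly by gradient descent, the same updates can be interleaved with the adversarial (Sinkhorn) steps of the robust objective.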

6. Interpretability, Algorithmic Features, and Empirical Evidence

Causal-SDRO procedures support interpretability both globally and locally. In SRF-based approaches, feature importance measured via norms of input gradients correlates highly with permutation-based measures, while local attributions via Empirical Integrated Gradients closely match SHAP values (Zhang et al., 16 Jan 2026). This reflects the compatibility of Causal-SDRO with post-hoc and intrinsic interpretability frameworks.
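Integrated gradients itself is easy to compute numerically for a smooth model, and its completeness axiom (attributions sum to the difference of model outputs) gives a built-in check. The model below is an illustrative polynomial, not one of the paper's policies.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def integrated_gradients(x, baseline, n_steps=1000):
    # Average the gradient along the straight-line path from baseline to x
    # (midpoint rule), then scale by the displacement per coordinate.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_f(p) for p in path])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 2.0])
base = np.zeros(2)
attr = integrated_gradients(x, base)
# completeness: attr.sum() equals f(x) - f(base) = 7.0
```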

Algorithmic convergence is geometric for the inner Sinkhorn scaling, with outer convex optimization in dual variables. In all empirical cases surveyed, Causal-SDRO methods yield improved robustness to distributional shifts, reduced tail sensitivity, and realistic scenario generation compared to both classical OT-DRO and naive non-robust baselines (Han, 2022, Zhang et al., 16 Jan 2026, Cescon et al., 31 Aug 2025).

| Application Domain | Key Causal-SDRO Methodology | Principal Outcome |
| --- | --- | --- |
| Volatility Estimation | SCOT (structural, causal, Sinkhorn) | Smoother, realistic paths |
| S&P 500 Prediction | SCOT, SOT, classical OT comparison | Higher coverage, lower MAE |
| LQG Control | Causal-Sinkhorn, SDP reformulation | Globally optimal linear policies |
| Portfolio, Inventory, Newsvendor | CSD-based, SRF policy, stochastic optimization | Robust outperformance, interpretability |

7. Summary and Scope

Causal Sinkhorn Distributionally Robust Optimization frameworks provide a rigorous, computationally tractable approach for robust decision-making and risk estimation in dynamic and contextual environments. The central innovation is the fusion of causal optimal transport—with respect for filtration and non-anticipative flows—and entropic regularization, leading to ambiguity sets and solutions that are statistically controlled, interpretable, and directly applicable in high-dimensional, structured tasks. Strong duality, statistical learning guarantees, efficient computational algorithms, and demonstrated empirical superiority establish Causal-SDRO as foundational in robust machine learning, financial risk management, and distributionally robust control (Han, 2022, Zhang et al., 16 Jan 2026, Cescon et al., 31 Aug 2025, Jiang, 2024).
