Efficient ε-Optimal Dynamic Policies
- The paper presents a primal–dual π-learning algorithm that reformulates the Bellman equation as a convex–concave saddle-point problem for efficient ε-optimal policy computation.
- It leverages online mirror descent and explicit policy extraction to achieve sample and run-time complexity sublinear in the size of the explicit transition model.
- The approach extends to robust, multi-objective, and partially observable settings, offering provable performance guarantees under strict ergodicity and mixing-time conditions.
Efficient Design of ε-Optimal Dynamic Policies
The efficient design of ε-optimal dynamic policies concerns the development of algorithms and analytical frameworks for computing or approximating dynamic decision policies whose performance is within an additive ε of the optimum for classes of stochastic control, reinforcement learning, and inventory management problems. This topic spans Markov decision processes (MDPs), robust RL, inventory scheduling, and associated sample- and computational-complexity lower and upper bounds. Theoretical advances enable the computation of near-optimal policies under both model-based and model-free information, often by exploiting duality, convexity, and structure-specific algorithmic constructions.
1. Problem Formulations and Complexity Parameters
A unifying setup is the infinite-horizon, finite-state, finite-action, undiscounted ergodic MDP. Here, the system transitions under a stationary randomized policy π according to a Markovian transition kernel with state space $\mathcal{S}$ (size $|\mathcal{S}|$), action space $\mathcal{A}$ (size $|\mathcal{A}|$), and rewards defined on transitions. The long-run average reward is

$$\bar v^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} r_{i_t a_t}\right],$$

and the goal is to find a policy π maximizing this average.
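As a concrete illustration, for a fixed policy the limit above can be evaluated in closed form as $\bar v^{\pi} = \sum_i \nu_\pi(i) \sum_a \pi(a \mid i)\, r_{ia}$, where $\nu_\pi$ is the stationary distribution of the induced chain. A minimal sketch on a hypothetical 2-state, 2-action MDP (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, i, j] = P(j | i, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],   # transitions under action 1
])
r = np.array([[1.0, 0.0],       # r[i, a]: reward for (state i, action a)
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],      # pi[i, a]: probability of action a in state i
               [0.4, 0.6]])

# Chain and one-step reward induced by the policy.
P_pi = np.einsum('ia,aij->ij', pi, P)
r_pi = np.einsum('ia,ia->i', pi, r)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
nu = nu / nu.sum()

avg_reward = float(nu @ r_pi)   # long-run average reward of pi
```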
Two critical ergodicity parameters control complexity:
- Stationary distribution parameter $\tau$: ensures all stationary policies yield ergodic Markov chains whose stationary distributions $\nu_\pi$ are uniformly bounded away from zero, $\min_i \nu_\pi(i) \ge 1/(\tau|\mathcal{S}|)$.
- Uniform mixing-time bound $t_{\mathrm{mix}}$: the maximum over policies of the minimal time until the total-variation distance to stationarity is below $1/4$ from every initial state.
These encode the "evenness" of recurrent distributions and the speed of convergence to stationarity, and directly enter sample and run-time guarantees for algorithms in this regime (Wang, 2017).
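The two parameters can be checked numerically for a single policy's chain (the definitions take a maximum over all policies; this sketch, with an invented transition matrix, handles one chain):

```python
import numpy as np

# Invented single-policy chain; the paper's tau and t_mix maximize over policies.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

# Stationary distribution of the chain.
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
nu = nu / nu.sum()

# "Evenness" parameter: tau such that min_i nu(i) = 1 / (tau * |S|).
n = len(nu)
tau = 1.0 / (n * nu.min())

def mixing_time(P, nu, tol=0.25, t_max=10_000):
    """Smallest t with max_i TV(P^t(i, .), nu) <= tol."""
    Pt = np.eye(len(nu))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - nu).sum(axis=1).max()
        if tv <= tol:
            return t
    return t_max

t_mix = mixing_time(P_pi, nu)
```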
2. Primal-Dual π-Learning Algorithm
The efficient design of ε-optimal dynamic policies in this domain is exemplified by the primal-dual π-learning method, whose central features are:
- Saddle-point reformulation: The Bellman equation is recast as a convex–concave saddle-point problem:

$$\min_{v}\;\max_{\mu \in \Delta}\; L(v,\mu) \;=\; \sum_{i,a} \mu_{ia}\,\bigl(r_{ia} + (P_a v)_i - v_i\bigr),$$

where $\Delta = \{\mu \ge 0 : \sum_{i,a}\mu_{ia} = 1\}$ is the simplex of state–action distributions.
- Online primal–dual mirror-descent: Iterates maintain primal (differential value) and dual (policy mixture) variables and update them by online stochastic gradients using next-state and reward feedback, with Bregman/KL projection in the dual and Euclidean projection in the primal:
- At each iteration, sample a state–action pair $(i,a) \sim \mu$, obtain the reward and next state, and form stochastic gradients.
- Dual (policy) update: exponentiated-gradient step followed by KL-projection onto $\Delta$.
- Primal (value) update: Euclidean projection onto a bounded set of differential-value vectors (an $\ell_\infty$ box whose radius scales with $t_{\mathrm{mix}}$).
- Policy extraction: At each step, extract a randomized policy $\pi_t(a \mid i) \propto \mu_t(i,a)$ by normalizing the dual iterate over actions. The final output is the time-average policy (Wang, 2017).
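The steps above can be sketched as follows. For readability this toy implementation uses full (deterministic) gradients of the bilinear objective on an invented 2-state, 2-action MDP, whereas the actual π-learning method replaces them with single-transition stochastic gradients from the sampling oracle; the value-box radius `V_BOX` and the step sizes are arbitrary illustrative choices:

```python
import numpy as np

# Invented 2-state, 2-action ergodic MDP: P[a, i, j] = P(j | i, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
nS, nA = 2, 2

# Bilinear objective: min_v max_{mu in simplex} sum_ia mu_ia (r_ia + (P_a v)_i - v_i).
mu = np.full((nS, nA), 1.0 / (nS * nA))   # dual variable: state-action mixture
v = np.zeros(nS)                          # primal variable: differential values
V_BOX = 10.0                              # illustrative bound on ||v||_inf
alpha, beta = 0.05, 0.05                  # illustrative step sizes

mu_sum = np.zeros_like(mu)
T = 5000
for _ in range(T):
    # Dual gradient: Bellman residuals r_ia + (P_a v)_i - v_i.
    Pv = np.einsum('aij,j->ia', P, v)
    delta = r + Pv - v[:, None]
    # Exponentiated-gradient ascent; renormalizing = KL projection onto the simplex.
    mu = mu * np.exp(beta * delta)
    mu /= mu.sum()
    # Primal gradient descent; clipping = Euclidean projection onto the box.
    grad_v = np.einsum('ia,aij->j', mu, P) - mu.sum(axis=1)
    v = np.clip(v - alpha * grad_v, -V_BOX, V_BOX)
    mu_sum += mu

# Policy extraction: pi_t(a | i) proportional to mu_t(i, a); output the average.
mu_avg = mu_sum / T
pi = mu_avg / mu_avg.sum(axis=1, keepdims=True)
```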
3. Sample and Run-Time Complexity
The main result gives a high-probability sample complexity for achieving ε-optimality: a number of iterations and generative samples that is linear in $|\mathcal{S}||\mathcal{A}|$, polynomial in $\tau$ and $t_{\mathrm{mix}}$, and proportional to $1/\epsilon^{2}$ (up to logarithmic factors) suffices to guarantee $\bar v^{*} - \bar v^{\pi} \le \epsilon$. This bound arises from the decay of the expected Bellman duality gap under mirror descent and its relation to policy suboptimality.
- Sublinear-time regime: In the explicit-model setting (where $P$ and $r$ are given), state–action sampling and coordinate updates cost $\tilde O(1)$ per step, so the overall algorithm runs in time and space sublinear in the $\Theta(|\mathcal{S}|^2|\mathcal{A}|)$ input matrix size as long as ε is not too small relative to the ergodicity parameters and $|\mathcal{S}|$.
- Model-free regime: When only a generative sampling oracle is available, each iteration draws a single transition and performs $\tilde O(1)$ arithmetic, so the sample and run-time complexities match (Wang, 2017).
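A back-of-the-envelope comparison makes the sublinearity concrete; the `poly(tau, t_mix)` factor below is purely illustrative, not the paper's exact bound:

```python
# Invented sizes; the (tau * t_mix)^2 factor is illustrative, not the paper's bound.
S, A = 10_000, 100
tau, t_mix, eps = 2.0, 10.0, 0.5

input_size = S * S * A                          # entries in the transition tensor
samples = S * A * (tau * t_mix) ** 2 / eps**2   # schematic sample-count estimate

# ~1.6e9 sampled transitions vs. 1e10 model entries: sublinear in the input,
# though the advantage evaporates as eps shrinks.
sublinear = samples < input_size
```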
4. Comparison to Classical Methods
The primal-dual π-learning paradigm improves substantially on Bellman-iteration-based methods:
| Method | Sample/Run-Time (Ergodic MDP) | Remarks |
|---|---|---|
| Value/Policy iteration | $\Omega(|\mathcal{S}|^2|\mathcal{A}|)$ per iteration | At least linear in input size |
| Q-learning / sampled VI | Extra factors in $1/(1-\gamma)$ | Discounted MDPs; polynomial dependence on the effective horizon |
| Primal-dual π-learning | $\tilde O(|\mathcal{S}||\mathcal{A}|\,\mathrm{poly}(\tau, t_{\mathrm{mix}})/\epsilon^{2})$ | Sublinear in input size for fast mixing |
The optimization leverages linear-programming duality, replacing successive nonlinear Bellman backups with bilinear saddle-point updates that apply mirror descent to both the policy and the value variables. The dominant terms in the complexity depend on the mixing time and the uniformity of stationary distributions, which can be substantially smaller than the discount-factor-induced terms of discounted formulations when the underlying process mixes rapidly (Wang, 2017).
5. Bellman Duality, Concentration, and High-Probability Guarantees
Sample complexity guarantees are established via:
- Analysis of the expected “duality gap”—the departure from saddle-point optimality in the value-policy variables.
- Mirror-prox and martingale concentration results showing that the time-averaged duality gap decays as $O(1/\sqrt{T})$ over $T$ iterations.
- The suboptimality guarantee translates the duality gap into average-reward error: the gap of the averaged iterates upper-bounds $\bar v^{*} - \bar v^{\pi}$ up to factors depending on the ergodicity parameters, so driving the gap below a level proportional to ε yields an ε-optimal policy (Wang, 2017).
All constants and log terms in high-probability guarantees are handled by martingale concentration on the stochastic updates fed by the sampling oracle.
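Schematically, the argument of this section chains two bounds; the constants $C_1, C_2$ absorb the $\tau$-, $t_{\mathrm{mix}}$-, and dimension-dependent factors quantified in Wang (2017):

```latex
% Mirror-descent decay of the expected time-averaged duality gap:
\mathbb{E}\bigl[\mathrm{gap}(\bar v_T, \bar\mu_T)\bigr] \;\le\; \frac{C_1}{\sqrt{T}},
\qquad
% Translation of the gap into average-reward suboptimality:
\bar v^{\,*} - \bar v^{\,\pi_T} \;\le\; C_2\, \mathbb{E}\bigl[\mathrm{gap}(\bar v_T, \bar\mu_T)\bigr].
% Requiring C_1 C_2 / \sqrt{T} \le \epsilon gives T = O\bigl((C_1 C_2)^2 \epsilon^{-2}\bigr).
```

Combining the two displays and solving for $T$ recovers the $\epsilon^{-2}$ iteration count of Section 3.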
6. Extensions and Broader Impact
The efficient design of ε-optimal dynamic policies via primal–dual and saddle-point frameworks has broader applicability:
- The same duality-and-mirror-descent mechanisms underpin efficient polynomial-time approximation schemes (EPTAS) for dynamic policies in continuous-time joint replenishment and economic warehouse scheduling. There the policy space is infinite-dimensional and classical dynamic programming is intractable, yet multi-scale grid alignment, recursion across “frequency classes”, and analytic rounding yield approximate representations computable in polynomial time and space (Segev, 2023, Segev, 23 Jun 2025, Segev, 21 Jan 2026).
- Sublinear-time and sample-complexity regimes can be achieved for dynamic control tasks whenever ergodicity and mixing-time assumptions hold, leveraging the boundedness of the saddle-point domains.
- The paradigm generalizes to robust constrained MDPs, where mirror descent is used in conjunction with robust evaluation oracles over uncertainty sets, ensuring feasibility and ε-optimality of the iterative updates without the need for binary search or Lagrangian duality (Ganguly et al., 25 May 2025).
These methods fundamentally alter the algorithmic landscape for dynamic policy computation, reducing complexity from linear in the input size, or worse, to a regime dictated by mixing, stationarity, and the desired precision rather than the full cardinality of the state–action space. The approach admits extensions to multi-objective, robust, and partially observable settings, contingent on structure enabling saddle-point characterizations.