Efficient ε-Optimal Dynamic Policies
- The paper presents a primal–dual π-learning algorithm that reformulates the Bellman equation as a convex–concave saddle-point problem for efficient ε-optimal policy computation.
- It leverages online mirror descent and explicit policy extraction to achieve sample and run-time complexity sublinear in the size of the explicit transition model.
- The approach extends to robust, multi-objective, and partially observable settings, offering provable performance guarantees under strict ergodicity and mixing-time conditions.
Efficient Design of ε-Optimal Dynamic Policies
The efficient design of ε-optimal dynamic policies concerns the development of algorithms and analytical frameworks for computing or approximating dynamic decision policies whose performance is within an additive ε of the optimum for classes of stochastic control, reinforcement learning, and inventory management problems. This topic spans Markov decision processes (MDPs), robust RL, inventory scheduling, and associated sample- and computational-complexity lower and upper bounds. Theoretical advances enable the computation of near-optimal policies under both model-based and model-free information, often by exploiting duality, convexity, and structure-specific algorithmic constructions.
1. Problem Formulations and Complexity Parameters
A unifying setup is the infinite-horizon, finite-state, finite-action, undiscounted ergodic MDP. Here, the system transitions under a stationary randomized policy π according to a Markovian transition kernel with state space $\mathcal{S}$ (size $|\mathcal{S}|$), action space $\mathcal{A}$ (size $|\mathcal{A}|$), and rewards defined on transitions. The long-run average reward is

$$\bar v^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=1}^{T} r_{i_t a_t}\right],$$

and the goal is to find a policy π maximizing this average.
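As a concrete illustration, for a fixed policy the limit above can be evaluated in closed form as $\bar v^{\pi} = \sum_i \nu_\pi(i) \sum_a \pi(a \mid i)\, r_{ia}$, where $\nu_\pi$ is the stationary distribution of the induced chain. A minimal sketch on a hypothetical 2-state, 2-action MDP (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[a, i, j] = P(j | i, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions under action 0
    [[0.5, 0.5], [0.6, 0.4]],   # transitions under action 1
])
r = np.array([[1.0, 0.0],       # r[i, a]: reward for (state i, action a)
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],      # pi[i, a]: probability of action a in state i
               [0.4, 0.6]])

# Chain and one-step reward induced by the policy.
P_pi = np.einsum('ia,aij->ij', pi, P)
r_pi = np.einsum('ia,ia->i', pi, r)

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
nu = nu / nu.sum()

avg_reward = float(nu @ r_pi)   # long-run average reward of pi
```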
Two critical ergodicity parameters control complexity:
- Stationary distribution parameter $\tau$: ensures all stationary policies yield ergodic Markov chains whose stationary distributions $\nu_\pi$ are uniformly bounded away from zero, $\min_i \nu_\pi(i) \ge 1/(\tau|\mathcal{S}|)$.
- Uniform mixing-time bound $t_{\mathrm{mix}}$: the maximum over policies of the minimal time until the total-variation distance to stationarity is below $1/4$ from every initial state.
These encode the "evenness" of recurrent distributions and the speed of convergence to stationarity, and directly enter sample and run-time guarantees for algorithms in this regime (Wang, 2017).
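The two parameters can be checked numerically for a single policy's chain (the definitions take a maximum over all policies; this sketch, with an invented transition matrix, handles one chain):

```python
import numpy as np

# Invented single-policy chain; the paper's tau and t_mix maximize over policies.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])

# Stationary distribution of the chain.
w, V = np.linalg.eig(P_pi.T)
nu = np.real(V[:, np.argmin(np.abs(w - 1.0))])
nu = nu / nu.sum()

# "Evenness" parameter: tau such that min_i nu(i) = 1 / (tau * |S|).
n = len(nu)
tau = 1.0 / (n * nu.min())

def mixing_time(P, nu, tol=0.25, t_max=10_000):
    """Smallest t with max_i TV(P^t(i, .), nu) <= tol."""
    Pt = np.eye(len(nu))
    for t in range(1, t_max + 1):
        Pt = Pt @ P
        tv = 0.5 * np.abs(Pt - nu).sum(axis=1).max()
        if tv <= tol:
            return t
    return t_max

t_mix = mixing_time(P_pi, nu)
```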
2. Primal-Dual π-Learning Algorithm
The efficient design of ε-optimal dynamic policies in this domain is exemplified by the primal-dual π-learning method, whose central features are:
- Saddle-point reformulation: The Bellman equation is recast as a convex–concave saddle-point problem:

$$\min_{v}\;\max_{\mu \in \Delta}\; L(v,\mu) \;=\; \sum_{i,a} \mu_{ia}\,\bigl(r_{ia} + (P_a v)_i - v_i\bigr),$$

where $\Delta = \{\mu \ge 0 : \sum_{i,a}\mu_{ia} = 1\}$ is the simplex of state–action distributions.
- Online primal–dual mirror-descent: Iterates maintain primal (differential value) and dual (policy mixture) variables and update them by online stochastic gradients using next-state and reward feedback, with Bregman/KL projection in the dual and Euclidean projection in the primal:
- At each iteration, sample a state–action pair $(i,a) \sim \mu$, obtain the reward and next state, and form stochastic gradients.
- Dual (policy) update: exponentiated-gradient step followed by KL-projection onto $\Delta$.
- Primal (value) update: Euclidean projection onto a bounded set of differential-value vectors (an $\ell_\infty$ box whose radius scales with $t_{\mathrm{mix}}$).
- Policy extraction: At each step, extract a randomized policy $\pi_t(a \mid i) \propto \mu_t(i,a)$ by normalizing the dual iterate over actions. The final output is the time-average policy (Wang, 2017).
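The steps above can be sketched as follows. For readability this toy implementation uses full (deterministic) gradients of the bilinear objective on an invented 2-state, 2-action MDP, whereas the actual π-learning method replaces them with single-transition stochastic gradients from the sampling oracle; the value-box radius `V_BOX` and the step sizes are arbitrary illustrative choices:

```python
import numpy as np

# Invented 2-state, 2-action ergodic MDP: P[a, i, j] = P(j | i, a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
nS, nA = 2, 2

# Bilinear objective: min_v max_{mu in simplex} sum_ia mu_ia (r_ia + (P_a v)_i - v_i).
mu = np.full((nS, nA), 1.0 / (nS * nA))   # dual variable: state-action mixture
v = np.zeros(nS)                          # primal variable: differential values
V_BOX = 10.0                              # illustrative bound on ||v||_inf
alpha, beta = 0.05, 0.05                  # illustrative step sizes

mu_sum = np.zeros_like(mu)
T = 5000
for _ in range(T):
    # Dual gradient: Bellman residuals r_ia + (P_a v)_i - v_i.
    Pv = np.einsum('aij,j->ia', P, v)
    delta = r + Pv - v[:, None]
    # Exponentiated-gradient ascent; renormalizing = KL projection onto the simplex.
    mu = mu * np.exp(beta * delta)
    mu /= mu.sum()
    # Primal gradient descent; clipping = Euclidean projection onto the box.
    grad_v = np.einsum('ia,aij->j', mu, P) - mu.sum(axis=1)
    v = np.clip(v - alpha * grad_v, -V_BOX, V_BOX)
    mu_sum += mu

# Policy extraction: pi_t(a | i) proportional to mu_t(i, a); output the average.
mu_avg = mu_sum / T
pi = mu_avg / mu_avg.sum(axis=1, keepdims=True)
```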
3. Sample and Run-Time Complexity
The main result gives a high-probability sample complexity for achieving ε-optimality: a number of iterations and generative samples that is linear in $|\mathcal{S}||\mathcal{A}|$, polynomial in $\tau$ and $t_{\mathrm{mix}}$, and proportional to $1/\epsilon^{2}$ (up to logarithmic factors) suffices to guarantee $\bar v^{*} - \bar v^{\pi} \le \epsilon$. This bound arises from the decay of the expected Bellman duality gap under mirror descent and its relation to policy suboptimality.
- Sublinear-time regime: In the explicit-model setting (where $P$ and $r$ are given), state–action sampling and coordinate updates cost $\tilde O(1)$ per step, so the overall algorithm runs in time and space sublinear in the $\Theta(|\mathcal{S}|^2|\mathcal{A}|)$ input matrix size as long as ε is not too small relative to the ergodicity parameters and $|\mathcal{S}|$.
- Model-free regime: When only a generative sampling oracle is available, each iteration draws a single transition and performs $\tilde O(1)$ arithmetic, so the sample and run-time complexities match (Wang, 2017).
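A back-of-the-envelope comparison makes the sublinearity concrete; the `poly(tau, t_mix)` factor below is purely illustrative, not the paper's exact bound:

```python
# Invented sizes; the (tau * t_mix)^2 factor is illustrative, not the paper's bound.
S, A = 10_000, 100
tau, t_mix, eps = 2.0, 10.0, 0.5

input_size = S * S * A                          # entries in the transition tensor
samples = S * A * (tau * t_mix) ** 2 / eps**2   # schematic sample-count estimate

# ~1.6e9 sampled transitions vs. 1e10 model entries: sublinear in the input,
# though the advantage evaporates as eps shrinks.
sublinear = samples < input_size
```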
4. Comparison to Classical Methods
The primal-dual π-learning paradigm improves substantially on Bellman-iteration-based methods:
| Method | Sample/Run-Time (Ergodic MDP) | Remarks |
|---|---|---|
| Value/Policy iteration | $\Omega(|\mathcal{S}|^2|\mathcal{A}|)$ per iteration | At least linear in input size |
| Q-learning / sampled VI | Extra factors in $1/(1-\gamma)$ | Discounted MDPs; polynomial dependence on the effective horizon |
| Primal-dual π-learning | $\tilde O(|\mathcal{S}||\mathcal{A}|\,\mathrm{poly}(\tau, t_{\mathrm{mix}})/\epsilon^{2})$ | Sublinear in input size for fast mixing |
The optimization leverages linear-programming duality, replacing successive nonlinear Bellman backups with bilinear saddle-point updates that apply mirror descent to both the policy and the value variables. The dominant terms in the complexity depend on the mixing time and the uniformity of stationary distributions, which can be substantially smaller than the discount-factor-induced terms of discounted formulations when the underlying process mixes rapidly (Wang, 2017).
5. Bellman Duality, Concentration, and High-Probability Guarantees
Sample complexity guarantees are established via:
- Analysis of the expected “duality gap”—the departure from saddle-point optimality in the value-policy variables.
- Mirror-prox and martingale concentration results showing that the time-averaged duality gap decays as $O(1/\sqrt{T})$ over $T$ iterations.
- The suboptimality guarantee translates the duality gap into average-reward error: the gap of the averaged iterates upper-bounds $\bar v^{*} - \bar v^{\pi}$ up to factors depending on the ergodicity parameters, so driving the gap below a level proportional to ε yields an ε-optimal policy (Wang, 2017).
All constants and log terms in high-probability guarantees are handled by martingale concentration on the stochastic updates fed by the sampling oracle.
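Schematically, the argument of this section chains two bounds; the constants $C_1, C_2$ absorb the $\tau$-, $t_{\mathrm{mix}}$-, and dimension-dependent factors quantified in Wang (2017):

```latex
% Mirror-descent decay of the expected time-averaged duality gap:
\mathbb{E}\bigl[\mathrm{gap}(\bar v_T, \bar\mu_T)\bigr] \;\le\; \frac{C_1}{\sqrt{T}},
\qquad
% Translation of the gap into average-reward suboptimality:
\bar v^{\,*} - \bar v^{\,\pi_T} \;\le\; C_2\, \mathbb{E}\bigl[\mathrm{gap}(\bar v_T, \bar\mu_T)\bigr].
% Requiring C_1 C_2 / \sqrt{T} \le \epsilon gives T = O\bigl((C_1 C_2)^2 \epsilon^{-2}\bigr).
```

Combining the two displays and solving for $T$ recovers the $\epsilon^{-2}$ iteration count of Section 3.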
6. Extensions and Broader Impact
The efficient design of ε-optimal dynamic policies via primal–dual and saddle-point frameworks has broader applicability:
- The same duality-and-mirror-descent mechanisms underpin efficient polynomial-time approximation schemes (EPTAS) for dynamic policies in continuous-time joint replenishment and economic warehouse scheduling. There the policy space is infinite-dimensional and classical dynamic programming is intractable, yet multi-scale grid alignment, recursion across “frequency classes”, and analytic rounding yield approximate representations computable in polynomial time and space (Segev, 2023, Segev, 23 Jun 2025, Segev, 21 Jan 2026).
- Sublinear-time and sample-complexity regimes can be achieved for dynamic control tasks whenever ergodicity and mixing-time assumptions hold, leveraging the boundedness of the saddle-point domains.
- The paradigm generalizes to robust constrained MDPs, where mirror descent is used in conjunction with robust evaluation oracles over uncertainty sets, ensuring feasibility and ε-optimality of the iterative updates without the need for binary search or Lagrangian duality (Ganguly et al., 25 May 2025).
These methods fundamentally alter the algorithmic landscape for dynamic policy computation, reducing complexity from linear in the input size, or worse, to a regime dictated by mixing, stationarity, and the desired precision rather than the full cardinality of the state–action space. The approach admits extensions to multi-objective, robust, and partially observable settings, contingent on structure enabling saddle-point characterizations.