Anchor-Changing Regularized NPG (ARNPG)
- The paper introduces ARNPG, which unifies mirror descent, optimism, and extra-gradient methods into a natural policy gradient framework for multi-objective reinforcement learning.
- It optimizes policies in finite MDPs with several reward functions using an anchor-changing strategy and KL regularization to enhance convergence and performance.
- Theoretically, ARNPG achieves $O(1/T)$ global convergence with exact gradients; empirically, it outperforms baseline methods in both tabular and deep RL settings.
Anchor-Changing Regularized Natural Policy Gradient (ARNPG) is a multi-objective reinforcement learning (MORL) framework for policy optimization in Markov decision processes (MDPs) with several scalar reward functions. ARNPG unifies and systematically embeds advanced first-order optimization techniques—such as mirror descent, optimism, and extra-gradient methods—into natural policy gradient (NPG) architectures, introducing an anchor-changing strategy and Kullback–Leibler (KL) regularization to achieve theoretically optimal convergence and empirical superiority in complex policy search spaces (Zhou et al., 2022).
1. Multi-Objective Markov Decision Processes
The ARNPG framework operates on a discounted finite MDP defined by:
- finite state space $\mathcal{S}$,
- finite action space $\mathcal{A}$,
- transition kernel $P(s' \mid s, a)$,
- initial state distribution $\rho$,
- discount factor $\gamma \in [0, 1)$,
- reward functions $r_i : \mathcal{S} \times \mathcal{A} \to [0, 1]$ for $i = 1, \dots, m$.
For a stationary policy $\pi$, the discounted state-action visitation distribution is
$$d^{\pi}_{\rho}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s, a_t = a \mid s_0 \sim \rho, \pi).$$
The value for reward $r_i$ is
$$V_i(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t)\Big] = \frac{1}{1 - \gamma}\, \mathbb{E}_{(s,a) \sim d^{\pi}_{\rho}}\big[r_i(s, a)\big].$$
The achievable multi-reward value set is $\{(V_1(\pi), \dots, V_m(\pi)) : \pi \in \Pi\}$ for policy class $\Pi$, typically parametrized as softmax policies $\pi_\theta(a \mid s) \propto \exp(\theta_{s,a})$.
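As a concrete illustration of these definitions, the visitation distribution and the per-reward values can be evaluated exactly for a small tabular MDP by linear algebra. The sketch below is illustrative; the function name and array layout are assumptions, not the paper's code.

```python
import numpy as np

def visitation_and_values(P, R, pi, rho, gamma):
    """Exact evaluation for a tabular MDP (illustrative helper).

    P:   (S, A, S) transition kernel P(s' | s, a)
    R:   (m, S, A) stacked reward functions r_i(s, a)
    pi:  (S, A) stochastic policy pi(a | s)
    rho: (S,) initial state distribution
    gamma: discount factor in [0, 1)
    """
    S, A, _ = P.shape
    # State-to-state kernel under pi: P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)
    # d(s) solves (I - gamma * P_pi^T) d = (1 - gamma) * rho
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * rho)
    d_sa = d_s[:, None] * pi                       # state-action visitation
    # V_i = (1 / (1 - gamma)) * sum_{s,a} d(s,a) r_i(s,a)
    values = np.tensordot(R, d_sa, axes=([1, 2], [0, 1])) / (1.0 - gamma)
    return d_sa, values
```

Because rewards lie in $[0, 1]$, each value returned is bounded by $1/(1-\gamma)$, which is a convenient sanity check.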
2. Optimization Objectives in ARNPG
ARNPG addresses three principal MORL optimization paradigms:
- Smooth Concave Scalarization (Proportional Fairness):
$$\max_{\pi \in \Pi} \; f\big(V_1(\pi), \dots, V_m(\pi)\big).$$
Here $f$ is concave and $L$-smooth; proportional fairness corresponds to $f(\boldsymbol V) = \sum_{i} \log V_i$.
- Hard Constraints (Constrained MDP):
$$\max_{\pi \in \Pi} \; V_1(\pi) \quad \text{s.t.} \quad V_i(\pi) \ge b_i, \quad i = 2, \dots, m.$$
Using the Lagrangian formulation,
$$\max_{\pi \in \Pi} \; \min_{\boldsymbol\lambda \ge 0} \; V_1(\pi) + \sum_{i=2}^{m} \lambda_i \big(V_i(\pi) - b_i\big).$$
- Max–Min Trade-Off:
$$\max_{\pi \in \Pi} \; \min_{1 \le i \le m} V_i(\pi) \;=\; \max_{\pi \in \Pi} \; \min_{\boldsymbol w \in \Delta_m} \; \sum_{i=1}^{m} w_i V_i(\pi),$$
formulated as a min–max saddle-point problem over the dual weight simplex $\Delta_m$.
3. The ARNPG Meta-Algorithm
The core ARNPG method iterates through macro-steps $t = 1, \dots, T$, each defined by an anchor policy and an ascent direction, and solves an advanced regularized subproblem per macro-step:
- Macro-Iteration Setup: anchor policy $\bar\pi_t$, ascent reward $\tilde r_t$, KL-regularization coefficient $\lambda > 0$.
- Subproblem (“INNER”):
$$\max_{\pi} \; V_{\tilde r_t}(\pi) - \lambda \, D_{\bar\pi_t}(\pi),$$
where $V_{\tilde r_t}(\pi)$ is the value of $\pi$ under the ascent reward $\tilde r_t$, and
$$D_{\bar\pi_t}(\pi) = \mathbb{E}_{s \sim d^{\pi}_{\rho}}\Big[\mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \bar\pi_t(\cdot \mid s)\big)\Big]$$
is the state visitation-weighted KL divergence to the anchor.
- Natural Policy Gradient Update: Inner loops apply $N$ steps of natural gradient ascent on (INNER):
$$\theta_{k+1} = \theta_k + \eta \, F(\theta_k)^{\dagger} \, \nabla_\theta \Big[ V_{\tilde r_t}(\pi_{\theta_k}) - \lambda D_{\bar\pi_t}(\pi_{\theta_k}) \Big],$$
with Fisher information matrix $F(\theta) = \mathbb{E}_{(s,a) \sim d^{\pi_\theta}_{\rho}}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\big]$.
- Anchor Update: $\bar\pi_{t+1} \leftarrow \pi_{\theta_N}$; the anchor switches to the final inner iterate.
- Dual Updates (for CMDP and max–min): Lagrange multipliers and dual weights updated via optimistic mirror-descent or extra-primal-dual (EPD) mechanisms.
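Specialized to tabular softmax policies with exact gradients, one macro-step can be sketched as below. This is a simplified illustration, not the authors' implementation: the function name and array layout are assumptions, and the inner update uses the known multiplicative-weights closed form of KL-regularized NPG under softmax parametrization.

```python
import numpy as np

def arnpg_macro_step(P, r_tilde, pi_anchor, rho, gamma, lam, eta, n_inner):
    """One ARNPG-style macro-step for a tabular MDP (illustrative sketch).

    Approximately solves  max_pi  V_{r_tilde}(pi) - lam * KL-to-anchor
    via n_inner KL-regularized NPG steps; each step has the
    multiplicative-weights closed form
        pi'(a|s) ∝ pi(a|s)^{1 - eta*lam} * pi_anchor(a|s)^{eta*lam}
                   * exp(eta * Q(s, a)).
    """
    S, A, _ = P.shape
    pi = pi_anchor.copy()
    for _ in range(n_inner):
        # Q-values of the regularized reward r_tilde - lam * log(pi / anchor)
        r_reg = r_tilde - lam * np.log(pi / pi_anchor)
        P_pi = np.einsum('sa,sat->st', pi, P)          # state-to-state kernel
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * r_reg).sum(-1))
        Q = r_reg + gamma * (P @ v)                    # (S, A) Q-table
        logits = ((1.0 - eta * lam) * np.log(pi)
                  + eta * lam * np.log(pi_anchor) + eta * Q)
        pi = np.exp(logits - logits.max(-1, keepdims=True))
        pi /= pi.sum(-1, keepdims=True)
    return pi  # the returned policy becomes the next anchor
```

With $\lambda$ small, the update reduces to a plain NPG step, so the value under $\tilde r_t$ improves monotonically across inner iterations in this exact-gradient setting.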
Reward ascent directions $\tilde r_t$ are chosen according to the specific paradigm:
- Scalarization: $\tilde r_t = \sum_{i=1}^{m} \partial_i f\big(\boldsymbol V(\bar\pi_t)\big)\, r_i$.
- CMDP: $\tilde r_t = r_1 + \sum_{i=2}^{m} \lambda_{t,i}\, r_i$.
- Max–Min: $\tilde r_t = \sum_{i=1}^{m} w_{t,i}\, r_i$, with $\boldsymbol w_t \in \Delta_m$.
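The three choices can be collected in one small helper. This is illustrative only; the function name and calling convention are assumptions, and the reward tables are zero-indexed so that `R[0]` plays the role of the CMDP objective reward $r_1$.

```python
import numpy as np

def ascent_reward(paradigm, R, V=None, grad_f=None, lambdas=None, w=None):
    """Combine m stacked reward tables R (m, S, A) into a single scalar
    ascent reward r_tilde, per optimization paradigm (illustrative)."""
    if paradigm == 'scalarization':
        # r_tilde = sum_i (df/dV_i)(V) * r_i, gradient taken at anchor values
        return np.tensordot(grad_f(V), R, axes=1)
    if paradigm == 'cmdp':
        # r_tilde = objective reward + sum_i lambda_i * constraint rewards
        return R[0] + np.tensordot(lambdas, R[1:], axes=1)
    if paradigm == 'maxmin':
        # r_tilde = sum_i w_i * r_i, with w on the probability simplex
        return np.tensordot(w, R, axes=1)
    raise ValueError(f"unknown paradigm: {paradigm}")
```

Each branch returns an `(S, A)` table that plays the role of $\tilde r_t$ in the macro-step above.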
4. Theoretical Guarantees and Convergence
ARNPG-derived algorithms exhibit global convergence for all three objective classes over macro-steps $t = 1, \dots, T$. Under softmax parametrization, a suitable step-size $\eta$, and sufficiently many inner iterations $N$, the anchor update satisfies a fundamental mirror-descent inequality relating the anchor KL divergence to value improvement.
Convergence Theorem (Smooth Scalarization Example):
For a concave, $L$-smooth, Lipschitz scalarization $f$, with the regularization coefficient $\lambda$ chosen to absorb the smoothness constant and sufficiently many inner iterates $N$ per macro-step, after $T$ macro-steps
$$f\big(\boldsymbol V(\pi^\star)\big) - f\big(\boldsymbol V(\bar\pi_T)\big) = O\!\left(\frac{1}{T}\right).$$
Analogous proofs (with dual updates for CMDP and max–min) yield $O(1/T)$ rates via telescoping the fundamental inequalities and mirror descent arguments.
5. Connections to First-Order Optimization
ARNPG establishes a formal connection between policy gradient optimization and first-order algorithms:
- Mirror Descent: KL regularization toward the anchor policy implements a Bregman divergence ($D_\Phi(\pi, \bar\pi) = \mathrm{KL}(\pi \,\|\, \bar\pi)$), rendering the inner loop a mirror ascent step in policy space with the negative-entropy mirror map $\Phi(\pi) = \sum_a \pi(a \mid s) \log \pi(a \mid s)$.
- Accelerated and Optimistic Methods: For CMDP, the Extra-Primal-Dual (EPD) update is used to achieve $O(1/T)$ rates (consistent with Yu–Neely 2017). For max–min, an Optimistic Mirror-Descent Ascent (OMDA) coupling is applied, enabling simultaneous primal–dual optimization.
By virtue of the anchor/KL design, ARNPG can systematically “plug in” advanced first-order strategies (accelerated mirror descent, optimistic gradient descent, etc.) for use in policy-gradient-based MORL.
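For completeness, the identity behind the mirror-descent reading can be written out: with the negative-entropy mirror map, the Bregman divergence reduces exactly to the per-state KL term that ARNPG regularizes with (the state argument is suppressed below):

```latex
\begin{align*}
\Phi(\pi) &= \sum_a \pi(a)\log\pi(a), \qquad \nabla\Phi(\bar\pi)_a = \log\bar\pi(a) + 1, \\
D_\Phi(\pi, \bar\pi) &= \Phi(\pi) - \Phi(\bar\pi) - \langle \nabla\Phi(\bar\pi),\, \pi - \bar\pi \rangle \\
 &= \sum_a \pi(a)\log\frac{\pi(a)}{\bar\pi(a)} \;=\; \mathrm{KL}(\pi \,\|\, \bar\pi),
\end{align*}
```

where the linear terms cancel because both $\pi$ and $\bar\pi$ sum to one.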
6. Empirical Evaluation Across Tabular and Deep RL
Empirical assessments demonstrate ARNPG’s theoretical properties and practical efficacy in both tabular and deep RL domains:
- Tabular CMDP (Exact Gradients): On random MDPs with one constraint, ARNPG-EPD consistently outperforms NPG-Primal-Dual [Ding et al. 2020] and CRPO [Xu et al. 2021] in both average optimality gap and average constraint violation, achieving $O(1/T)$ convergence versus the slower sublinear rates of the baselines.
- Tabular CMDP (Sample-Based): ARNPG-EPD retains fast convergence and constraint satisfaction with sample-based gradient estimation (generative trajectories).
- Deep-RL CMDP (Acrobot-v1, Hopper-v3): Applied with actor-critic neural softmax policies and constraints on link angles/velocities, ARNPG-EPD (1–5 inner loops) achieves higher cumulative reward than the TRPO-based FOCOPS, CRPO, and NPG-PD while satisfying the safety constraints.
- Smooth & Max–Min in Tabular/Deep RL: ARNPG-IMD (scalarization) and ARNPG-OMDA (max–min) demonstrate rapid convergence and outperform subgradient-based multi-objective NPG in both exact and sample-driven settings.
Critical implementation practices include:
- Selecting the KL coefficient $\lambda$ large enough to absorb smoothness constants,
- Updating the anchor every $N$ inner steps,
- Tuning the EPD step-size to balance CMDP objective improvement against feasibility,
- Employing optimistic mirror descent for the dual variables in max–min.
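As an illustration of the last practice, a minimal optimistic mirror-descent step on the max–min dual weights might look as follows. This is a sketch under assumptions: the extrapolated-gradient form and the helper name are not taken from the paper.

```python
import numpy as np

def omd_weight_update(w, V_curr, V_prev, step):
    """Optimistic mirror-descent (multiplicative-weights) step on the
    simplex: dual weights move against an extrapolated value vector, so
    under-performing objectives receive more weight.  Illustrative only."""
    g = 2.0 * V_curr - V_prev                 # optimistic extrapolation
    logits = np.log(w) - step * g             # mirror step in log space
    w_new = np.exp(logits - logits.max())     # stable exponentiation
    return w_new / w_new.sum()                # renormalize onto the simplex
```

The entropic mirror map makes the update multiplicative, so weights stay strictly positive and on the simplex without any projection step.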
7. Synthesis and Operational Significance
ARNPG delineates a generalizable “anchor-change + KL + NPG” structure, integrating sophisticated first-order update ideas into multi-objective, constrained, and max–min RL. It offers unified global theoretical guarantees ($O(1/T)$ convergence rates with exact gradients) and demonstrates robust empirical superiority across tabular and deep RL benchmarks (Zhou et al., 2022). Code and models are accessible at https://github.com/tliu1997/ARNPG-MORL.
A plausible implication is that ARNPG’s anchor-regularized paradigm may serve as an extensible template for future multi-objective policy optimization algorithms needing principled first-order acceleration in RL.