
Anchor-Changing Regularized NPG (ARNPG)

Updated 2 February 2026
  • The paper introduces ARNPG, which unifies mirror descent, optimism, and extra-gradient methods into a natural policy gradient framework for multi-objective reinforcement learning.
  • It optimizes policies in finite MDPs with several reward functions, using an anchor-changing strategy and KL regularization to improve convergence and performance.
  • Theoretical analysis establishes $\tilde O(1/T)$ global convergence, and empirical evaluations show that ARNPG outperforms baseline methods in both tabular and deep RL settings.

Anchor-Changing Regularized Natural Policy Gradient (ARNPG) is a multi-objective reinforcement learning (MORL) framework for policy optimization in Markov decision processes (MDPs) with several scalar reward functions. ARNPG unifies and systematically embeds advanced first-order optimization techniques—such as mirror descent, optimism, and extra-gradient methods—into natural policy gradient (NPG) architectures, introducing an anchor-changing strategy and Kullback–Leibler (KL) regularization to achieve theoretically optimal convergence and empirical superiority in complex policy search spaces (Zhou et al., 2022).

1. Multi-Objective Markov Decision Processes

The ARNPG framework operates on a discounted finite MDP defined by:

  • finite state space $S$,
  • finite action space $A$,
  • transition kernel $P: S \times A \rightarrow \Delta(S)$,
  • initial distribution $\rho \in \Delta(S)$,
  • discount factor $\gamma \in (0,1)$,
  • $m$ reward functions $r_i : S \times A \rightarrow [0,1]$ for $i=1,\dots,m$.

For a stationary policy $\pi: S \rightarrow \Delta(A)$, the state-action visitation distribution is

$$d_\rho^\pi(s,a) = (1-\gamma)\, \mathbb{E}_{s_0 \sim \rho} \left[ \sum_{t \geq 0} \gamma^t \Pr(s_t = s, a_t = a \mid s_0, \pi) \right].$$

The value for reward $i$ is

$$V_i^\pi(\rho) = \frac{1}{1-\gamma} \sum_{s,a} d_\rho^\pi(s,a)\, r_i(s,a).$$

The achievable multi-reward value set is $V = \{ V_{1:m}^\pi(\rho) : \pi \in \Pi \}$ for a policy class $\Pi$, typically parametrized as softmax policies $\{\pi_\theta\}$.
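For a tabular MDP these quantities have a closed form via a linear solve, since the discounted state occupancy satisfies $d_\rho = (1-\gamma)(I - \gamma P_\pi^\top)^{-1}\rho$. The sketch below is an illustrative implementation (not the paper's code) that computes $d_\rho^\pi$ and all $m$ values at once:

```python
import numpy as np

def visitation_and_values(P, pi, rewards, rho, gamma):
    """Exact discounted state-action visitation d_rho^pi and values V_i^pi
    for a tabular MDP (illustrative sketch).
    P: (S, A, S) transition kernel, pi: (S, A) policy,
    rewards: (m, S, A) reward functions, rho: (S,) initial distribution."""
    S, A, _ = P.shape
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)
    # d(s) = (1 - gamma) * rho^T (I - gamma P_pi)^{-1}
    d_state = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    d_sa = d_state[:, None] * pi          # d(s, a) = d(s) * pi(a|s)
    # V_i = (1 / (1 - gamma)) * sum_{s,a} d(s,a) r_i(s,a)
    values = np.einsum('isa,sa->i', rewards, d_sa) / (1 - gamma)
    return d_sa, values
```

Since $d_\rho^\pi$ is a probability distribution over state-action pairs, `d_sa` always sums to 1, and with rewards in $[0,1]$ each value lies in $[0, 1/(1-\gamma)]$.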

2. Optimization Objectives in ARNPG

ARNPG addresses three principal MORL optimization paradigms:

  • Smooth Concave Scalarization (Proportional Fairness):

$$\max_{\pi} F(V_{1:m}^\pi(\rho)), \quad F \text{ concave, e.g. } F(v) = \sum_{i=1}^m a_i \log(\delta + v_i).$$

Here $F$ is $\beta$-smooth with $\|\nabla F\|_1 \leq L$.

  • Hard Constraints (Constrained MDP):

$$\max_{\pi} V_1^\pi(\rho) \quad \text{s.t. } V_i^\pi(\rho) \geq b_i, \quad i=2,\dots,m.$$

Using the Lagrangian formulation,

$$L(\pi,\lambda) = V_1^\pi(\rho) + \sum_{i=2}^m \lambda_i \left(V_i^\pi(\rho) - b_i\right), \qquad \lambda_i \geq 0.$$

  • Max–Min Trade-Off:

$$\max_{\pi} \min_{i=1,\dots,m} \frac{V_i^\pi(\rho)}{c_i} = \max_{\pi} \min_{\lambda \in \Delta([m])} \sum_{i=1}^m \lambda_i \frac{V_i^\pi(\rho)}{c_i}.$$

This is formulated as a min–max saddle-point problem in the dual space.
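To make the three objectives concrete, the following sketch evaluates each scalarization on a given value vector $v = V_{1:m}^\pi(\rho)$. The weights $a_i$, offset $\delta$, bounds $b_i$, and scales $c_i$ are illustrative choices, not values fixed by the paper:

```python
import numpy as np

def proportional_fairness(v, a, delta):
    """Smooth concave scalarization F(v) = sum_i a_i * log(delta + v_i)."""
    return float(np.sum(a * np.log(delta + v)))

def cmdp_lagrangian(v, lam, b):
    """CMDP Lagrangian L = V_1 + sum_{i>=2} lambda_i * (V_i - b_i),
    with v[0] the objective value and v[1:] the constraint values."""
    return float(v[0] + np.sum(lam * (v[1:] - b)))

def max_min_tradeoff(v, c):
    """Max-min objective: the worst-case scaled value min_i V_i / c_i."""
    return float(np.min(v / c))
```

For example, with $v = (2, 3, 4)$, unit weights, and $\delta = 1$, the fairness objective is $\ln 3 + \ln 4 + \ln 5$; with $\lambda = (0.5, 0.5)$ and $b = (2, 2)$, the Lagrangian is $2 + 0.5(3-2) + 0.5(4-2) = 3.5$.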

3. The ARNPG Meta-Algorithm

The core ARNPG method iterates through macro-steps indexed by anchor policies $\pi_k$, computing ascent directions via regularized subproblems:

  1. Macro-Iteration Setup: Anchor policy $\pi_k$, ascent reward $\tilde r_k$, KL regularization $\alpha > 0$.
  2. Subproblem ("INNER"):

$$\max_{\pi}\; V_{\tilde r_k}^\pi(\rho) - \frac{\alpha}{1-\gamma} D_{d_\rho^\pi}(\pi \,\|\, \pi_k),$$

where

$$V_{\tilde r_k}^\pi(\rho) = \frac{1}{1-\gamma} \sum_{s,a} d_\rho^\pi(s,a)\, \tilde r_k(s,a)$$

and

$$D_{d}(\pi \,\|\, \pi') = \sum_s d(s) \sum_a \pi(a|s) \log \frac{\pi(a|s)}{\pi'(a|s)}$$

is the state-visitation-weighted KL divergence to the anchor.

  3. Natural Policy Gradient Update: Inner loops apply $t_k$ steps of natural gradient ascent on (INNER):

$$\theta^{(t+1)} = \theta^{(t)} + \eta\, F_\rho(\theta^{(t)})^\dagger\, \nabla_\theta \left\{ V_{\tilde r_k}^{\pi_\theta}(\rho) - \frac{\alpha}{1-\gamma} D_{d_\rho^{\pi_\theta}}(\pi_\theta \,\|\, \pi_k) \right\},$$

with Fisher information matrix $F_\rho(\theta)$.

  4. Anchor Update: $\pi_{k+1} = \pi_k^{(t_k)}$; the anchor switches to the latest iterate.
  5. Dual Updates (for CMDP and max–min): Lagrange multipliers and dual weights are updated via optimistic mirror-descent or extra-primal-dual (EPD) mechanisms.

Reward ascent directions $\tilde r_k$ are chosen according to the specific paradigm:

  • Scalarization: $\tilde r_k(s,a) = \langle \nabla_v F(V_{1:m}^{\pi_k}(\rho)),\, r_{1:m}(s,a) \rangle$.
  • CMDP: $\tilde r_k(s,a) = r_1(s,a) + \sum_{i=2}^m \left(\lambda_{k,i} + \eta'(b_i - V_i^{\pi_k})\right) r_i(s,a)$.
  • Max–Min: $\tilde r_k(s,a) = \langle \tilde G_k^v,\, r_{1:m}(s,a) \rangle$, with $\tilde G_k^v = \nabla_v \Phi(V_{1:m}^{\tilde\pi_k}, \tilde\lambda_k)$.
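The two tabular primitives of the inner loop, the visitation-weighted KL penalty and the NPG step itself, can be sketched as follows. For softmax policies the regularized NPG step admits a multiplicative-weights form in which the new policy interpolates in log space between the current iterate and the anchor, tilted by the ascent-reward Q-values; the exact exponents and constants below are illustrative and may differ from the paper's derivation:

```python
import numpy as np

def weighted_kl(d, pi, pi_anchor, eps=1e-12):
    """State-visitation-weighted KL divergence D_d(pi || pi_anchor).
    d: (S,) state weights; pi, pi_anchor: (S, A) policies."""
    return float(np.sum(d[:, None] * pi * np.log((pi + eps) / (pi_anchor + eps))))

def arnpg_inner_step(pi_t, pi_anchor, Q, eta, alpha):
    """One NPG step on the KL-to-anchor regularized subproblem for tabular
    softmax policies (hedged sketch): new log-policy mixes the current
    iterate and the anchor, then adds the scaled Q-values.
    pi_t, pi_anchor: (S, A) policies; Q: (S, A) action values."""
    log_pi = ((1.0 - eta * alpha) * np.log(pi_t)
              + eta * alpha * np.log(pi_anchor)
              + eta * Q)
    # Renormalize row-wise (stable softmax over actions).
    pi_new = np.exp(log_pi - log_pi.max(axis=1, keepdims=True))
    return pi_new / pi_new.sum(axis=1, keepdims=True)
```

Two sanity checks on this form: with $\eta\alpha = 1$ and $Q \equiv 0$ the update collapses onto the anchor, and the weighted KL from any policy to itself is zero.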

4. Theoretical Guarantees and Convergence

ARNPG-derived algorithms exhibit $\tilde O(1/T)$ global convergence for all three objective classes (with $K \approx (1-\gamma)T/\ln T$ macro-steps within a total budget of $T$ NPG iterations). Under softmax parametrization, step-size $\eta = (1-\gamma)/\alpha$, and sufficiently many inner iterations $t_k = O(\ln(1/\epsilon_k))$, the anchor update satisfies a fundamental mirror-descent inequality relating the anchor KL divergence to value improvement.

Convergence Theorem (Smooth Scalarization Example):

For concave, $\beta$-smooth, $L$-Lipschitz scalarizations, choosing regularization $\alpha \geq \beta/(1-\gamma)^3$ and inner iterates $t_k = \left\lceil \frac{1}{1-\gamma} \ln \frac{5LK}{\beta \ln|A|} \right\rceil$, after $K$ macro-steps:

$$F(V^{\pi^*}) - \max_{1 \le k \le K} F(V^{\pi_k}) \le \frac{2\alpha \ln|A|}{(1-\gamma)K} \implies O\!\left(\frac{\beta \ln T}{(1-\gamma)^5\, T}\right).$$

Analogous proofs (with dual updates for CMDP and max–min) yield $\tilde O(1/T)$ rates via telescoping fundamental inequalities and mirror-descent arguments.
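The final rate follows by direct substitution: taking $\alpha = \beta/(1-\gamma)^3$ and a total NPG-step budget $T = K t_k$ with $t_k = \Theta(\ln T/(1-\gamma))$, so $K = \Theta((1-\gamma)T/\ln T)$,

```latex
\frac{2\alpha\ln|A|}{(1-\gamma)K}
  = \frac{2\beta\ln|A|}{(1-\gamma)^4 K}
  = O\!\left(\frac{\beta\,\ln|A|\,\ln T}{(1-\gamma)^5\,T}\right),
```

which matches the stated $O\!\left(\beta\ln T/((1-\gamma)^5 T)\right)$ rate up to the $\ln|A|$ factor.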

5. Connections to First-Order Optimization

ARNPG establishes a formal connection between policy gradient optimization and first-order algorithms:

  • Mirror Descent: KL regularization to the anchor policy implements a Bregman divergence $D_{d_\rho^\pi}(\pi\|\pi_k)$, rendering the inner loop a mirror-ascent step in policy space with mirror map $h(d) = \sum_{s,a} d(s,a)\ln d(s,a) - \sum_{s,a} d(s,a)\ln \sum_{a'} d(s,a')$.
  • Accelerated and Optimistic Methods: For CMDP, the extra-primal-dual (EPD) update is used to achieve $O(1/T)$ rates (consistent with Yu–Neely 2017). For max–min, an optimistic mirror-descent-ascent (OMDA) coupling is applied, enabling simultaneous primal–dual optimization.

By virtue of the anchor/KL design, ARNPG can systematically “plug in” advanced first-order strategies (accelerated mirror descent, optimistic gradient descent, etc.) for use in policy-gradient-based MORL.

6. Empirical Evaluation Across Tabular and Deep RL

Empirical assessments demonstrate ARNPG’s theoretical properties and practical efficacy in both tabular and deep RL domains:

  • Tabular CMDP (Exact Gradients): On random MDPs ($|S|=20$, $|A|=10$, $\gamma=0.8$, one constraint), ARNPG-EPD consistently outperforms NPG-Primal-Dual [Ding et al. 2020] and CRPO [Xu et al. 2021] in both reward gap ($V_1^* - V_1(\pi_k)$) and constraint violation ($b_2 - V_2(\pi_k)$), achieving $O(1/T)$ convergence versus $O(1/\sqrt{T})$ for the baselines (log–log slope $\approx -1$).
  • Tabular CMDP (Sample-Based): ARNPG-EPD retains fast convergence and constraint satisfaction with sample-based gradient estimation (generative trajectories).
  • Deep-RL CMDP (Acrobot-v1, Hopper-v3): Applied with actor-critic neural softmax policies and constraints on link angles/velocities, ARNPG-EPD (1–5 inner loops) achieves higher cumulative reward while satisfying safety constraints than TRPO-based FOCOPS, CRPO, and NPG-PD.
  • Smooth & Max–Min in Tabular/Deep RL: ARNPG-IMD (scalarization) and ARNPG-OMDA (max–min) demonstrate rapid $O(1/T)$ convergence and outperform subgradient-based multi-objective NPG in both exact and sample-driven settings.

Critical implementation practices include:

  • Selecting $\alpha$ large enough to absorb the smoothness constants,
  • Updating the anchor every $t_k > 1$ inner steps,
  • Tuning the EPD step-size $\eta'$ to balance objective progress against feasibility in CMDPs,
  • Employing optimistic mirror descent for the dual variables in max–min problems.
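On the dual side, a minimal projected dual-ascent step for the CMDP multipliers can be sketched as below. This is a hedged simplification: the paper's EPD mechanism additionally uses an extrapolated ("extra") evaluation point, which is omitted here:

```python
import numpy as np

def dual_ascent_step(lam, V_constraints, b, eta_dual):
    """Projected gradient ascent on CMDP Lagrange multipliers:
    lambda_i grows when constraint V_i >= b_i is violated, and shrinks
    (projected to stay >= 0) when it holds with slack.
    lam, V_constraints, b: (m-1,) arrays; eta_dual: dual step-size."""
    return np.maximum(0.0, lam + eta_dual * (b - V_constraints))
```

For instance, with $\lambda = 0.2$, $b = 1$, and step-size 1, a constraint value of $0.5$ (violated) raises the multiplier to $0.7$, while a value of $2.0$ (slack) drives it down to the boundary at $0$.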

7. Synthesis and Operational Significance

ARNPG delineates a generalizable “anchor-change + KL + NPG” structure, integrating sophisticated first-order update ideas into multi-objective, constrained, and max–min RL. It offers unified global theoretical guarantees ($\tilde O(1/T)$ convergence rates with exact gradients) and demonstrates robust empirical superiority across tabular and deep RL benchmarks (Zhou et al., 2022). Code and models are accessible at https://github.com/tliu1997/ARNPG-MORL.

A plausible implication is that ARNPG’s anchor-regularized paradigm may serve as an extensible template for future multi-objective policy optimization algorithms needing principled first-order acceleration in RL.
