
Anchor-Changing Regularized NPG (ARNPG)

Updated 2 February 2026
  • The paper introduces ARNPG, which unifies mirror descent, optimism, and extra-gradient methods into a natural policy gradient framework for multi-objective reinforcement learning.
  • It optimizes policies in finite MDPs with several reward functions, using an anchor-changing strategy and KL regularization to improve convergence and performance.
  • Theoretical analysis establishes $\tilde O(1/T)$ global convergence, and empirical evaluations show that ARNPG outperforms baseline methods in both tabular and deep RL settings.

Anchor-Changing Regularized Natural Policy Gradient (ARNPG) is a multi-objective reinforcement learning (MORL) framework for policy optimization in Markov decision processes (MDPs) with several scalar reward functions. ARNPG unifies and systematically embeds advanced first-order optimization techniques—such as mirror descent, optimism, and extra-gradient methods—into natural policy gradient (NPG) architectures, introducing an anchor-changing strategy and Kullback–Leibler (KL) regularization to achieve theoretically optimal convergence and empirical superiority in complex policy search spaces (Zhou et al., 2022).

1. Multi-Objective Markov Decision Processes

The ARNPG framework operates on a discounted finite MDP defined by:

  • finite state space $S$,
  • finite action space $A$,
  • transition kernel $P: S \times A \rightarrow \Delta(S)$,
  • initial distribution $\rho \in \Delta(S)$,
  • discount factor $\gamma \in (0,1)$,
  • $m$ reward functions $r_i : S \times A \rightarrow [0,1]$ for $i=1,\dots,m$.

For a stationary policy $\pi: S \rightarrow \Delta(A)$, the state-action visitation distribution is

$$d_\rho^\pi(s,a) = (1-\gamma)\, \mathbb{E}_{s_0 \sim \rho} \left[ \sum_{t \geq 0} \gamma^t \Pr(s_t = s, a_t = a \mid s_0, \pi) \right].$$

The value for reward $i$ is

$$V_i^\pi(\rho) = \frac{1}{1-\gamma} \sum_{s,a} d_\rho^\pi(s,a)\, r_i(s,a).$$

The achievable multi-reward value set is $V = \{ V_{1:m}^\pi(\rho) : \pi \in \Pi \}$ for a policy class $\Pi$, typically parametrized as softmax policies $\{\pi_\theta\}$.
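For a tabular MDP these quantities have a closed form via a linear solve, since the discounted state occupancy satisfies $d_\rho = (1-\gamma)(I - \gamma P_\pi^\top)^{-1}\rho$. The sketch below is an illustrative implementation (not the paper's code) that computes $d_\rho^\pi$ and all $m$ values at once:

```python
import numpy as np

def visitation_and_values(P, pi, rewards, rho, gamma):
    """Exact discounted state-action visitation d_rho^pi and values V_i^pi
    for a tabular MDP (illustrative sketch).
    P: (S, A, S) transition kernel, pi: (S, A) policy,
    rewards: (m, S, A) reward functions, rho: (S,) initial distribution."""
    S, A, _ = P.shape
    # State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum('sa,sat->st', pi, P)
    # d(s) = (1 - gamma) * rho^T (I - gamma P_pi)^{-1}
    d_state = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    d_sa = d_state[:, None] * pi          # d(s, a) = d(s) * pi(a|s)
    # V_i = (1 / (1 - gamma)) * sum_{s,a} d(s,a) r_i(s,a)
    values = np.einsum('isa,sa->i', rewards, d_sa) / (1 - gamma)
    return d_sa, values
```

Since $d_\rho^\pi$ is a probability distribution over state-action pairs, `d_sa` always sums to 1, and with rewards in $[0,1]$ each value lies in $[0, 1/(1-\gamma)]$.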

2. Optimization Objectives in ARNPG

ARNPG addresses three principal MORL optimization paradigms:

  • Smooth Concave Scalarization (Proportional Fairness):

$$\max_{\pi} F(V_{1:m}^\pi(\rho)), \quad F \text{ concave, e.g. } F(v) = \sum_{i=1}^m a_i \log(\delta + v_i).$$

Here $F$ is $\beta$-smooth with $\|\nabla F\|_1 \leq L$.

  • Hard Constraints (Constrained MDP):

$$\max_{\pi} V_1^\pi(\rho) \quad \text{s.t. } V_i^\pi(\rho) \geq b_i, \quad i=2,\dots,m.$$

Using the Lagrangian formulation,

$$L(\pi,\lambda) = V_1^\pi(\rho) + \sum_{i=2}^m \lambda_i \left(V_i^\pi(\rho) - b_i\right), \qquad \lambda_i \geq 0.$$

  • Max–Min Trade-Off:

$$\max_{\pi} \min_{i=1,\dots,m} \frac{V_i^\pi(\rho)}{c_i} = \max_{\pi} \min_{\lambda \in \Delta([m])} \sum_{i=1}^m \lambda_i \frac{V_i^\pi(\rho)}{c_i}.$$

This is formulated as a min–max saddle-point problem in the dual space.
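To make the three objectives concrete, the following sketch evaluates each scalarization on a given value vector $v = V_{1:m}^\pi(\rho)$. The weights $a_i$, offset $\delta$, bounds $b_i$, and scales $c_i$ are illustrative choices, not values fixed by the paper:

```python
import numpy as np

def proportional_fairness(v, a, delta):
    """Smooth concave scalarization F(v) = sum_i a_i * log(delta + v_i)."""
    return float(np.sum(a * np.log(delta + v)))

def cmdp_lagrangian(v, lam, b):
    """CMDP Lagrangian L = V_1 + sum_{i>=2} lambda_i * (V_i - b_i),
    with v[0] the objective value and v[1:] the constraint values."""
    return float(v[0] + np.sum(lam * (v[1:] - b)))

def max_min_tradeoff(v, c):
    """Max-min objective: the worst-case scaled value min_i V_i / c_i."""
    return float(np.min(v / c))
```

For example, with $v = (2, 3, 4)$, unit weights, and $\delta = 1$, the fairness objective is $\ln 3 + \ln 4 + \ln 5$; with $\lambda = (0.5, 0.5)$ and $b = (2, 2)$, the Lagrangian is $2 + 0.5(3-2) + 0.5(4-2) = 3.5$.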

3. The ARNPG Meta-Algorithm

The core ARNPG method iterates through macro-steps indexed by anchor policies $\pi_k$, computing ascent directions via regularized subproblems:

  1. Macro-Iteration Setup: Anchor policy $\pi_k$, ascent reward $\tilde r_k$, KL regularization $\alpha > 0$.
  2. Subproblem ("INNER"):

$$\max_{\pi}\; V_{\tilde r_k}^\pi(\rho) - \frac{\alpha}{1-\gamma} D_{d_\rho^\pi}(\pi \,\|\, \pi_k),$$

where

$$V_{\tilde r_k}^\pi(\rho) = \frac{1}{1-\gamma} \sum_{s,a} d_\rho^\pi(s,a)\, \tilde r_k(s,a)$$

and

$$D_{d}(\pi \,\|\, \pi') = \sum_s d(s) \sum_a \pi(a|s) \log \frac{\pi(a|s)}{\pi'(a|s)}$$

is the state-visitation-weighted KL divergence to the anchor.

  3. Natural Policy Gradient Update: Inner loops apply $t_k$ steps of natural gradient ascent on (INNER):

$$\theta^{(t+1)} = \theta^{(t)} + \eta\, F_\rho(\theta^{(t)})^\dagger\, \nabla_\theta \left\{ V_{\tilde r_k}^{\pi_\theta}(\rho) - \frac{\alpha}{1-\gamma} D_{d_\rho^{\pi_\theta}}(\pi_\theta \,\|\, \pi_k) \right\},$$

with Fisher information matrix $F_\rho(\theta)$.

  4. Anchor Update: $\pi_{k+1} = \pi_k^{(t_k)}$; the anchor switches to the latest iterate.
  5. Dual Updates (for CMDP and max–min): Lagrange multipliers and dual weights are updated via optimistic mirror-descent or extra-primal-dual (EPD) mechanisms.

Reward ascent directions $\tilde r_k$ are chosen according to the specific paradigm:

  • Scalarization: $\tilde r_k(s,a) = \langle \nabla_v F(V_{1:m}^{\pi_k}(\rho)),\, r_{1:m}(s,a) \rangle$.
  • CMDP: $\tilde r_k(s,a) = r_1(s,a) + \sum_{i=2}^m \left(\lambda_{k,i} + \eta'(b_i - V_i^{\pi_k})\right) r_i(s,a)$.
  • Max–Min: $\tilde r_k(s,a) = \langle \tilde G_k^v,\, r_{1:m}(s,a) \rangle$, with $\tilde G_k^v = \nabla_v \Phi(V_{1:m}^{\tilde\pi_k}, \tilde\lambda_k)$.
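The two tabular primitives of the inner loop, the visitation-weighted KL penalty and the NPG step itself, can be sketched as follows. For softmax policies the regularized NPG step admits a multiplicative-weights form in which the new policy interpolates in log space between the current iterate and the anchor, tilted by the ascent-reward Q-values; the exact exponents and constants below are illustrative and may differ from the paper's derivation:

```python
import numpy as np

def weighted_kl(d, pi, pi_anchor, eps=1e-12):
    """State-visitation-weighted KL divergence D_d(pi || pi_anchor).
    d: (S,) state weights; pi, pi_anchor: (S, A) policies."""
    return float(np.sum(d[:, None] * pi * np.log((pi + eps) / (pi_anchor + eps))))

def arnpg_inner_step(pi_t, pi_anchor, Q, eta, alpha):
    """One NPG step on the KL-to-anchor regularized subproblem for tabular
    softmax policies (hedged sketch): new log-policy mixes the current
    iterate and the anchor, then adds the scaled Q-values.
    pi_t, pi_anchor: (S, A) policies; Q: (S, A) action values."""
    log_pi = ((1.0 - eta * alpha) * np.log(pi_t)
              + eta * alpha * np.log(pi_anchor)
              + eta * Q)
    # Renormalize row-wise (stable softmax over actions).
    pi_new = np.exp(log_pi - log_pi.max(axis=1, keepdims=True))
    return pi_new / pi_new.sum(axis=1, keepdims=True)
```

Two sanity checks on this form: with $\eta\alpha = 1$ and $Q \equiv 0$ the update collapses onto the anchor, and the weighted KL from any policy to itself is zero.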

4. Theoretical Guarantees and Convergence

ARNPG-derived algorithms exhibit $\tilde O(1/T)$ global convergence for all three objective classes (with $K \approx (1-\gamma)T/\ln T$ macro-steps within a total budget of $T$ NPG iterations). Under softmax parametrization, step-size $\eta = (1-\gamma)/\alpha$, and sufficiently many inner iterations $t_k = O(\ln(1/\epsilon_k))$, the anchor update satisfies a fundamental mirror-descent inequality relating the anchor KL divergence to value improvement.

Convergence Theorem (Smooth Scalarization Example):

For concave, $\beta$-smooth, $L$-Lipschitz scalarizations, choosing regularization $\alpha \geq \beta/(1-\gamma)^3$ and inner iterates $t_k = \left\lceil \frac{1}{1-\gamma} \ln \frac{5LK}{\beta \ln|A|} \right\rceil$, after $K$ macro-steps:

$$F(V^{\pi^*}) - \max_{1 \le k \le K} F(V^{\pi_k}) \le \frac{2\alpha \ln|A|}{(1-\gamma)K} \implies O\!\left(\frac{\beta \ln T}{(1-\gamma)^5\, T}\right).$$

Analogous proofs (with dual updates for CMDP and max–min) yield $\tilde O(1/T)$ rates via telescoping fundamental inequalities and mirror-descent arguments.
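The final rate follows by direct substitution: taking $\alpha = \beta/(1-\gamma)^3$ and a total NPG-step budget $T = K t_k$ with $t_k = \Theta(\ln T/(1-\gamma))$, so $K = \Theta((1-\gamma)T/\ln T)$,

```latex
\frac{2\alpha\ln|A|}{(1-\gamma)K}
  = \frac{2\beta\ln|A|}{(1-\gamma)^4 K}
  = O\!\left(\frac{\beta\,\ln|A|\,\ln T}{(1-\gamma)^5\,T}\right),
```

which matches the stated $O\!\left(\beta\ln T/((1-\gamma)^5 T)\right)$ rate up to the $\ln|A|$ factor.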

5. Connections to First-Order Optimization

ARNPG establishes a formal connection between policy gradient optimization and first-order algorithms:

  • Mirror Descent: KL regularization to the anchor policy implements a Bregman divergence $D_{d_\rho^\pi}(\pi\|\pi_k)$, rendering the inner loop a mirror-ascent step in policy space with mirror map $h(d) = \sum_{s,a} d(s,a)\ln d(s,a) - \sum_{s,a} d(s,a)\ln \sum_{a'} d(s,a')$.
  • Accelerated and Optimistic Methods: For CMDP, the extra-primal-dual (EPD) update is used to achieve $O(1/T)$ rates (consistent with Yu–Neely 2017). For max–min, an optimistic mirror-descent-ascent (OMDA) coupling is applied, enabling simultaneous primal–dual optimization.

By virtue of the anchor/KL design, ARNPG can systematically “plug in” advanced first-order strategies (accelerated mirror descent, optimistic gradient descent, etc.) for use in policy-gradient-based MORL.

6. Empirical Evaluation Across Tabular and Deep RL

Empirical assessments demonstrate ARNPG’s theoretical properties and practical efficacy in both tabular and deep RL domains:

  • Tabular CMDP (Exact Gradients): On random MDPs ($|S|=20$, $|A|=10$, $\gamma=0.8$, one constraint), ARNPG-EPD consistently outperforms NPG-Primal-Dual [Ding et al. 2020] and CRPO [Xu et al. 2021] in both reward gap ($V_1^* - V_1(\pi_k)$) and constraint violation ($b_2 - V_2(\pi_k)$), achieving $O(1/T)$ convergence versus $O(1/\sqrt{T})$ for the baselines (log–log slope $\approx -1$).
  • Tabular CMDP (Sample-Based): ARNPG-EPD retains fast convergence and constraint satisfaction with sample-based gradient estimation (generative trajectories).
  • Deep-RL CMDP (Acrobot-v1, Hopper-v3): Applied with actor-critic neural softmax policies and constraints on link angles/velocities, ARNPG-EPD (1–5 inner loops) achieves higher cumulative reward while satisfying safety constraints than TRPO-based FOCOPS, CRPO, and NPG-PD.
  • Smooth & Max–Min in Tabular/Deep RL: ARNPG-IMD (scalarization) and ARNPG-OMDA (max–min) demonstrate rapid $O(1/T)$ convergence and outperform subgradient-based multi-objective NPG in both exact and sample-driven settings.

Critical implementation practices include:

  • Selecting $\alpha$ large enough to absorb the smoothness constants,
  • Updating the anchor every $t_k > 1$ inner steps,
  • Tuning the EPD step-size $\eta'$ to balance objective progress against feasibility in CMDPs,
  • Employing optimistic mirror descent for the dual variables in max–min problems.
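On the dual side, a minimal projected dual-ascent step for the CMDP multipliers can be sketched as below. This is a hedged simplification: the paper's EPD mechanism additionally uses an extrapolated ("extra") evaluation point, which is omitted here:

```python
import numpy as np

def dual_ascent_step(lam, V_constraints, b, eta_dual):
    """Projected gradient ascent on CMDP Lagrange multipliers:
    lambda_i grows when constraint V_i >= b_i is violated, and shrinks
    (projected to stay >= 0) when it holds with slack.
    lam, V_constraints, b: (m-1,) arrays; eta_dual: dual step-size."""
    return np.maximum(0.0, lam + eta_dual * (b - V_constraints))
```

For instance, with $\lambda = 0.2$, $b = 1$, and step-size 1, a constraint value of $0.5$ (violated) raises the multiplier to $0.7$, while a value of $2.0$ (slack) drives it down to the boundary at $0$.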

7. Synthesis and Operational Significance

ARNPG delineates a generalizable “anchor-change + KL + NPG” structure, integrating sophisticated first-order update ideas into multi-objective, constrained, and max–min RL. It offers unified global theoretical guarantees ($\tilde O(1/T)$ convergence rates with exact gradients) and demonstrates robust empirical superiority across tabular and deep RL benchmarks (Zhou et al., 2022). Code and models are accessible at https://github.com/tliu1997/ARNPG-MORL.

A plausible implication is that ARNPG’s anchor-regularized paradigm may serve as an extensible template for future multi-objective policy optimization algorithms needing principled first-order acceleration in RL.
