
$\ell_{1,2}$-Regularized Policy Learning

Updated 20 January 2026
  • $\ell_{1,2}$-regularized policy learning is a reinforcement learning approach that uses the group-Lasso penalty to induce group-level sparsity by zeroing out entire clusters of actions.
  • The method employs proximal techniques and policy mirror descent to optimize policies efficiently while satisfying structured sparsity constraints.
  • Empirical benchmarks demonstrate that this technique achieves high sample efficiency and minimal performance loss across both discrete and continuous RL settings.

$\ell_{1,2}$-regularized policy learning refers to reinforcement learning (RL) algorithms in which the policy is optimized not only for expected cumulative reward, but also under a group-sparsity-inducing regularizer. Specifically, the regularizer is the group-Lasso penalty, or $\ell_{1,2}$ norm, which encourages entire groups of actions—typically defined by domain knowledge or problem structure—to have zero probability, yielding group-level sparsity in the learned policy. This approach is motivated by resource constraints, safety, interpretability, and structured exploration in RL, and it leads to computationally and statistically desirable properties in the resulting policies. The $\ell_{1,2}$-regularized setting can be instantiated within general regularized Markov Decision Process (MDP) frameworks and solved efficiently using proximal methods or generalized policy mirror descent, with provable convergence and sparsity guarantees (Zhan et al., 2021, Li et al., 2019).

1. Mathematical Framework and Regularizer Structure

Let $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,P_0,\gamma)$ denote a discounted infinite-horizon MDP with finite state and action spaces, transition kernel $P$, reward function $r$, initial distribution $P_0$, and discount factor $\gamma\in[0,1)$. A stationary policy $\pi: \mathcal{S} \rightarrow \Delta_{\mathcal{A}}$ is typically represented as a probability vector over actions at each state.

The $\ell_{1,2}$ penalty, or group-Lasso norm, is imposed by partitioning the action set $\mathcal{A} = \bigcup_{g=1}^G \mathcal{A}_g$ into $G$ disjoint groups. For each state $s$, the group-sparse regularizer is defined as

$$\Omega(\pi(\cdot|s)) = \sum_{g=1}^G \|\pi_g(s)\|_2 = \sum_{g=1}^G \sqrt{\sum_{a\in\mathcal{A}_g} \pi(a|s)^2}.$$

The RL objective with $\ell_{1,2}$ regularization is

$$\max_\pi J(\pi) = \mathbb{E}_{s_0,a_0,\ldots}\left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right] - \lambda\, \mathbb{E}_{s\sim d_\pi}\left[ \Omega(\pi(\cdot|s)) \right],$$

where $d_\pi$ is the (discounted) state occupancy measure under policy $\pi$ and $\lambda > 0$ is the regularization weight (Li et al., 2019).
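As a concrete check of the definition above, the penalty $\Omega(\pi(\cdot|s))$ can be computed directly. This is a minimal sketch, not code from either cited paper; the six-action space and its partition into three groups are illustrative.

```python
import numpy as np

def group_lasso_penalty(pi_s, groups):
    """Omega(pi(.|s)) = sum over groups g of ||pi_g(s)||_2."""
    return sum(np.linalg.norm(pi_s[g]) for g in groups)

# Hypothetical 6-action space partitioned into G = 3 disjoint groups.
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# A policy putting all mass on group 0: only one group norm is nonzero,
# so the penalty is ||(0.5, 0.5)||_2 = 0.5 * sqrt(2) ~= 0.707.
pi_s = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
print(group_lasso_penalty(pi_s, groups))
```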

2. Optimality Conditions and Group-Sparsity Characterization

The regularized RL problem leads to a modified Bellman optimality equation. For each state, the maximization inside the Bellman operator becomes a concave program, owing to the (convex) $\ell_{1,2}$ penalty and the simplex constraint:

$$(\mathcal{T}_\lambda V)(s) = \max_{\pi(\cdot|s) \in \Delta_{\mathcal{A}}} \left\{ \sum_{a} \pi(a|s)\left[r(s,a) + \gamma\, \mathbb{E}_{s'|s,a} V(s')\right] - \lambda \sum_{g=1}^G \|\pi_g(s)\|_2 \right\}.$$

The KKT conditions reveal a thresholding phenomenon: for each group $g$, the optimal $\pi^*_\lambda(\cdot|s)$ satisfies

$$Q^*_\lambda(s,a) - \mu(s) - \lambda\, \zeta_g(s) = 0, \quad \forall a\in\mathcal{A}_g,$$

for some dual variable $\mu(s)$ and subgradient $\zeta_g(s)\in\partial_{\pi_g}\|\pi_g(s)\|_2$. In particular, if all $Q^*_\lambda(s,a)$ in a group fall below a threshold, that group is zeroed out entirely, inducing block sparsity at the group level (Li et al., 2019).

Since the group-Lasso penalty's subgradient at zero is bounded, for sufficiently small $\lambda$ the optimal policy $\pi^*_\lambda$ zeros out entire action groups in many states, a precise group-sparsity guarantee distinct from elementwise ($\ell_1$) sparsity (Li et al., 2019).

3. Algorithms: Proximal Methods and Policy Mirror Descent

Two primary algorithmic frameworks solve $\ell_{1,2}$-regularized policy optimization efficiently:

(a) Generalized Policy Mirror Descent (GPMD):

GPMD (Zhan et al., 2021) decouples the update per state and uses a group-Euclidean Bregman divergence:

$$D_R(\pi \,\|\, \pi') = \frac{1}{2} \sum_{g=1}^G \|\pi_g - \pi'_g\|_2^2.$$

At each iteration $k$, the statewise update is

$$\pi^{(k+1)}(s) = \arg\min_{p\in\Delta(\mathcal{A})} \left\{ -\langle Q_\lambda^{\pi^{(k)}}(s,\cdot),\, p\rangle + \lambda \|p\|_2 + \frac{1}{2\eta_k} \|p - \pi^{(k)}(s)\|_2^2 \right\}.$$

The unconstrained proximal step is group soft-thresholding:

$$u = \max\!\left(0,\, 1 - \frac{\eta_k \lambda}{\|z\|_2}\right) z, \quad z = \pi^{(k)}(s) + \eta_k\, Q_\lambda^{\pi^{(k)}}(s,\cdot),$$

and the result is projected onto the probability simplex. This composite step is computationally efficient, scaling as $O(|\mathcal{A}|\log|\mathcal{A}|)$ per state (Zhan et al., 2021).
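The composite step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the shrinkage is applied per group with the shared threshold $\eta_k\lambda$, and the sort-based Euclidean simplex projection accounts for the stated $O(|\mathcal{A}|\log|\mathcal{A}|)$ cost. All variable names and the toy problem are assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based, O(n log n))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def gpmd_step(pi_s, q_s, lam, eta, groups):
    """One statewise proximal step: gradient move, group shrinkage, projection."""
    z = pi_s + eta * q_s
    u = np.zeros_like(z)
    for g in groups:  # group soft-thresholding: u_g = max(0, 1 - eta*lam/||z_g||) z_g
        norm = np.linalg.norm(z[g])
        if norm > 0:
            u[g] = max(0.0, 1.0 - eta * lam / norm) * z[g]
    return project_simplex(u)

# Toy example: 4 actions in 2 groups; group {0,1} has clearly higher Q-values,
# so the low-Q group is shrunk to zero and survives projection as exact zeros.
pi_s = np.full(4, 0.25)
q_s = np.array([1.0, 0.9, 0.1, 0.0])
groups = [np.array([0, 1]), np.array([2, 3])]
new_pi = gpmd_step(pi_s, q_s, lam=0.5, eta=1.0, groups=groups)
print(new_pi)  # mass concentrates on the high-Q group
```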

(b) Off-Policy Actor-Critic with Proximal Group-Lasso:

In continuous or large-discrete domains, an actor-critic architecture is used, combining stochastic policy-gradient updates with a post-gradient group-Lasso proximal step. The policy output logits for each group are group-soft-thresholded and then normalized, ensuring group-wise sparsity (Li et al., 2019). The critic's update incorporates the $\ell_{1,2}$ penalty in the value target.
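A simplified sketch of that post-gradient proximal step, operating on action probabilities rather than raw logits and with an illustrative threshold $\tau$ (neither detail is taken from the paper):

```python
import numpy as np

def proximal_policy_step(probs, groups, tau):
    """Group-soft-threshold the action distribution, then renormalize."""
    out = np.zeros_like(probs)
    for g in groups:
        norm = np.linalg.norm(probs[g])
        if norm > tau:  # groups with small total mass are zeroed out entirely
            out[g] = (1.0 - tau / norm) * probs[g]
    total = out.sum()
    return out / total if total > 0 else np.full_like(probs, 1.0 / len(probs))

# Group {2,3} carries little probability mass, so it is switched off.
probs = np.array([0.4, 0.35, 0.15, 0.10])
groups = [np.array([0, 1]), np.array([2, 3])]
sparse = proximal_policy_step(probs, groups, tau=0.25)
print(sparse)
```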

Pseudocode fragments for both approaches are provided in (Zhan et al., 2021) and (Li et al., 2019).

4. Convergence and Theoretical Guarantees

GPMD with $\ell_{1,2}$ regularization achieves global linear convergence to the unique regularized optimum, even though the penalty is neither strongly convex nor smooth. For constant step size $\eta$, the sup-norm error contracts by

$$\rho = 1 - \frac{\eta \lambda}{1 + \eta \lambda}(1-\gamma)$$

per iteration, and an $\epsilon$-accurate solution is reached in

$$O\!\left(\frac{1 + \eta\lambda}{(1-\gamma)\,\eta\lambda} \log \frac{1}{\epsilon}\right)$$

iterations. This convergence rate is independent of the state and action space dimensions (Zhan et al., 2021).
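A back-of-envelope instantiation of this rate; the parameter values below are illustrative, not taken from the paper:

```python
import math

# Contraction factor rho = 1 - (eta*lam/(1 + eta*lam)) * (1 - gamma),
# and the iteration count needed for accuracy eps: log(1/eps) / log(1/rho).
gamma, eta, lam, eps = 0.9, 1.0, 0.1, 1e-6
rho = 1 - (eta * lam / (1 + eta * lam)) * (1 - gamma)
iters = math.ceil(math.log(1 / eps) / math.log(1 / rho))
print(rho, iters)  # rho is close to 1, so convergence is linear but gradual
```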

In the more general regularized MDP framework, the regularized Bellman operator is a $\gamma$-contraction. The performance error between the regularized and unregularized value functions is bounded by

$$\|V^*_\lambda - V^*\|_\infty \le \frac{\lambda G}{(1-\gamma)\sqrt{|\mathcal{A}|/G}},$$

where $G$ is the number of groups and $|\mathcal{A}|$ is the size of the action space (Li et al., 2019).
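For a sense of scale, one can evaluate this bound at the discrete benchmark dimensions reported in Section 5 ($G=2$, $|\mathcal{A}|=10$), with $\gamma=0.9$ and $\lambda=0.1$ chosen as illustrative values:

```python
# Evaluating the error bound lam*G / ((1 - gamma) * sqrt(|A|/G)).
lam, G, A, gamma = 0.1, 2, 10, 0.9
bound = lam * G / ((1 - gamma) * (A / G) ** 0.5)
print(bound)  # a worst-case gap between regularized and unregularized values
```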

There exists a minimal $\lambda_{\min}>0$ such that for all $0<\lambda<\lambda_{\min}$, the optimal policy zeros out at least one entire group in some states. As $\lambda\rightarrow 0$ the policy approaches the deterministic optimum; as $\lambda\rightarrow\infty$ the solution approaches the uniform policy, losing sparsity (Li et al., 2019).

5. Empirical Evaluation and Benchmarks

Benchmarks for $\ell_{1,2}$-regularized policy learning are documented in discrete and continuous RL environments:

  • Discrete settings (e.g., random MDPs with $|\mathcal{S}|=50$, $|\mathcal{A}|=10$; a $4 \times 4$ Gridworld; actions partitioned into $G=2$ groups): Comparisons among $\ell_{1,2}$, Shannon-entropy, Tsallis-entropy, and $\ell_1$ regularizers show that $\ell_{1,2}$ achieves 40–60% group sparsity for moderate $\lambda$ (0.1–1.0), with only a 1–2% reduction in expected return relative to unregularized baselines. Per-action $\ell_1$ and Shannon-entropy regularization either provide no sparsity or induce greater return degradation (Li et al., 2019).
  • Continuous domains (e.g., MuJoCo tasks such as Hopper-v2, Walker2d-v2, Ant-v2, and HalfCheetah-v2, with actions grouped by actuator clusters): $\ell_{1,2}$ regularization induces block sparsity in action means (disabling entire actuator groups) and improves sample efficiency by approximately 15% over Shannon-entropy regularization, with similar final returns. The results show mild sensitivity to $\lambda$ in the range 0.01–1.0 (Li et al., 2019).

Lower values of $\lambda$ encourage higher sparsity, with a risk of under-exploration; higher $\lambda$ yields more uniform (less sparse) policies. Intermediate values (e.g., $\lambda\approx 0.1$) provide the best trade-off. Deeper networks accommodate higher $\lambda$ without loss of sparsity (Li et al., 2019). All experiments directly replace the entropy bonus with the group-Lasso penalty and insert the proximal projection on the actor output.

6. Computational Considerations and Implementation Details

Per-iteration complexity in the tabular GPMD scheme is dominated by policy evaluation, which can be implemented via a linear-system solve in $O(|\mathcal{S}|^2|\mathcal{A}|)$ or iteratively in $O(|\mathcal{S}||\mathcal{A}|/(1-\gamma))$. The group-shrinkage and simplex-projection steps require $O(|\mathcal{A}|\log|\mathcal{A}|)$ per state. Vectorization across states, pre-allocation of memory, and warm-started simplex projection speed up practical performance (Zhan et al., 2021). In large-scale or neural settings, the actor-critic method merely inserts a fast, parallelizable group-soft-threshold and normalization step after each policy update; the overall pipeline remains compatible with established architectures such as Soft Actor-Critic (SAC).
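The vectorization across states mentioned above can be sketched as follows. Shapes and names are illustrative assumptions: one row per state, each group a contiguous column slice, and a shared threshold `tau`.

```python
import numpy as np

def group_shrink_all_states(Z, group_slices, tau):
    """Apply u_g = max(0, 1 - tau/||z_g||) z_g to every state row in parallel."""
    U = np.zeros_like(Z)
    for sl in group_slices:
        norms = np.linalg.norm(Z[:, sl], axis=1, keepdims=True)
        # Guard against division by zero for all-zero groups.
        scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
        U[:, sl] = scale * Z[:, sl]
    return U

# Two states, four actions, two groups: each state keeps its dominant
# group and has the small-mass group shrunk to exactly zero.
Z = np.array([[0.8, 0.6, 0.05, 0.05],
              [0.1, 0.1, 0.70, 0.70]])
U = group_shrink_all_states(Z, [slice(0, 2), slice(2, 4)], tau=0.2)
print(U)
```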

7. Relation to Other Regularization Approaches

$\ell_{1,2}$ group-sparsity regularization generalizes per-action $\ell_1$ sparsity and offers strictly stronger group-sparse control. While Shannon or Tsallis entropy penalties promote diversity and exploration, they do not induce structured sparsity. In contrast, group-Lasso regularization acts directly on predefined action clusters, enabling interpretable “on–off” activation at the group level and facilitating structured resource management or safety constraints. Empirical comparisons indicate that $\ell_{1,2}$ regularization realizes sparsity with minimal loss of performance, outperforming standard entropy-based techniques when group sparsity is required (Li et al., 2019).
