ℓ₁,₂-Regularized Policy Learning
- ℓ₁,₂-regularized policy learning is a reinforcement learning approach that uses the group Lasso penalty to induce group-level sparsity by zeroing out entire clusters of actions.
- The method employs proximal techniques and policy mirror descent to optimize policies efficiently while satisfying structured sparsity constraints.
- Empirical benchmarks demonstrate that this technique achieves high sample efficiency and minimal performance loss across both discrete and continuous RL settings.
ℓ₁,₂-regularized policy learning refers to reinforcement learning (RL) algorithms in which the policy is optimized not only for expected cumulative reward, but also under a group-sparsity-inducing regularizer. Specifically, the regularizer is the group Lasso penalty, or ℓ₁,₂ norm, which encourages entire groups of actions—typically defined by domain knowledge or problem structure—to have zero probability, yielding group-level sparsity in the learned policy. This approach is motivated by resource constraints, safety, interpretability, and structured exploration in RL, and it leads to computationally and statistically desirable properties in the resulting policies. The ℓ₁,₂-regularized setting can be instantiated within general regularized Markov decision process (MDP) frameworks and solved efficiently using proximal methods or generalized policy mirror descent, with provable convergence and sparsity guarantees (Zhan et al., 2021; Li et al., 2019).
1. Mathematical Framework and Regularizer Structure
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \rho, \gamma)$ denote a discounted infinite-horizon MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, transition kernel $P$, reward function $r$, initial distribution $\rho$, and discount factor $\gamma \in [0, 1)$. A stationary policy $\pi$ is typically represented as a probability vector $\pi(\cdot \mid s) \in \Delta(\mathcal{A})$ over actions at each state $s$.
The ℓ₁,₂ penalty, or group-Lasso norm, is imposed by partitioning the action set into $G$ disjoint groups $\mathcal{A}_1, \dots, \mathcal{A}_G$. For each state $s$, the group-sparse regularizer is defined as
$$\Omega\big(\pi(\cdot \mid s)\big) = \sum_{g=1}^{G} \big\|\pi_g(\cdot \mid s)\big\|_2,$$
where $\pi_g(\cdot \mid s)$ denotes the restriction of $\pi(\cdot \mid s)$ to group $\mathcal{A}_g$. The RL objective with ℓ₁,₂ regularization is
$$\max_{\pi} \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right] - \lambda \sum_{s} d^\pi(s)\, \Omega\big(\pi(\cdot \mid s)\big),$$
where $d^\pi$ is the (discounted) state occupancy under policy $\pi$ and $\lambda > 0$ is the regularization weight (Li et al., 2019).
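As a concrete instance of the regularizer above, a minimal NumPy sketch; the grouping and the example distributions are illustrative, not taken from the cited papers:

```python
import numpy as np

def group_lasso_penalty(pi, groups):
    """Omega(pi) = sum_g ||pi_g||_2, where `groups` lists the action
    indices belonging to each disjoint group."""
    return sum(np.linalg.norm(pi[g]) for g in groups)

# Illustrative example: 4 actions partitioned into 2 groups of 2.
groups = [[0, 1], [2, 3]]
uniform = np.array([0.25, 0.25, 0.25, 0.25])
deterministic = np.array([1.0, 0.0, 0.0, 0.0])

omega_uniform = group_lasso_penalty(uniform, groups)        # 2 * ||[0.25, 0.25]||_2
omega_det = group_lasso_penalty(deterministic, groups)      # ||[1, 0]||_2 + 0
```

Note that the penalty decomposes over groups, which is what makes the blockwise proximal treatment in later sections possible.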
2. Optimality Conditions and Group-Sparsity Characterization
The regularized RL problem leads to a modified Bellman optimality equation. For each state $s$, the maximization within the Bellman operator becomes a concave program due to the (convex) ℓ₁,₂ penalty and the simplex constraint:
$$\max_{\pi(\cdot \mid s) \in \Delta(\mathcal{A})} \; \big\langle Q(s, \cdot),\, \pi(\cdot \mid s) \big\rangle - \lambda \sum_{g=1}^{G} \big\|\pi_g(\cdot \mid s)\big\|_2.$$
The KKT conditions show a thresholding phenomenon: for each group $g$, the optimal $\pi_g^*(\cdot \mid s)$ satisfies
$$Q_g(s, \cdot) - \mu(s)\,\mathbf{1} + \eta_g(s, \cdot) = \lambda\, u_g(s, \cdot)$$
for some dual variables $\mu(s)$ (simplex constraint), $\eta_g \ge 0$ (nonnegativity), and subgradient $u_g \in \partial \|\pi_g^*(\cdot \mid s)\|_2$. Specifically, if all $Q$-values in a group are below a threshold, that group is entirely zeroed out, inducing block sparsity at the group level (Li et al., 2019).
Since the group-Lasso penalty's subgradient at zero is bounded (any $u_g$ with $\|u_g\|_2 \le 1$ is a valid subgradient of $\|\cdot\|_2$ at zero), for a sufficiently large regularization weight $\lambda$ the optimal policy zeros out entire action groups in many states—a precise group-sparsity guarantee, distinct from elementwise (ℓ₁) sparsity (Li et al., 2019).
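The block-thresholding behavior described above is visible directly in the proximal operator of the penalty; a short sketch in which the threshold and input values are illustrative:

```python
import numpy as np

def prox_group_lasso(z, groups, tau):
    """Proximal operator of tau * sum_g ||z_g||_2 (group soft-thresholding):
    each group block is shrunk toward zero, and zeroed entirely when its
    l2 norm falls below the threshold tau."""
    out = np.zeros_like(z, dtype=float)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > tau:
            out[g] = (1.0 - tau / norm) * z[g]
    return out

groups = [[0, 1], [2, 3]]
z = np.array([2.0, 1.0, 0.1, 0.1])
p = prox_group_lasso(z, groups, tau=0.5)
# Group [2, 3] has norm sqrt(0.02) < 0.5, so it is zeroed as a block,
# while group [0, 1] is only shrunk.
```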
3. Algorithms: Proximal Methods and Policy Mirror Descent
Two primary algorithmic frameworks solve ℓ₁,₂-regularized policy optimization efficiently:
(a) Generalized Policy Mirror Descent (GPMD):
GPMD (Zhan et al., 2021) decouples the update per state and uses a Euclidean Bregman divergence $D(\pi, \pi') = \tfrac{1}{2}\|\pi - \pi'\|_2^2$. At each iteration $k$, the statewise update is
$$\pi^{(k+1)}(\cdot \mid s) = \arg\max_{\pi \in \Delta(\mathcal{A})} \left\{ \big\langle Q^{(k)}(s, \cdot),\, \pi \big\rangle - \lambda\, \Omega(\pi) - \frac{1}{\eta}\, D\big(\pi,\, \pi^{(k)}(\cdot \mid s)\big) \right\}.$$
The unconstrained proximal step is the group soft-thresholding
$$\big(\operatorname{prox}_{\eta\lambda\Omega}(z)\big)_g = \left(1 - \frac{\eta\lambda}{\|z_g\|_2}\right)_{\!+} z_g,$$
and the result is projected onto the probability simplex. This composite step is computationally efficient, scaling as $O(|\mathcal{A}| \log |\mathcal{A}|)$ per state (Zhan et al., 2021).
(b) Off-Policy Actor-Critic with Proximal Group-Lasso:
In continuous or large-discrete domains, an actor-critic architecture is used, combining stochastic policy gradient updates with a post-gradient group-Lasso proximal step. The policy output logits for each group are group-soft-thresholded and then normalized, ensuring group-wise sparsity (Li et al., 2019). The critic's update incorporates the penalty in the value target.
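A hedged sketch of the threshold-then-normalize step on a discrete policy head; the function name and the softmax parameterization are illustrative assumptions, not the exact architecture of Li et al. (2019):

```python
import numpy as np

def proximal_policy_head(logits, groups, tau):
    """Post-gradient proximal step on a discrete policy head:
    softmax -> group soft-threshold -> renormalize. Groups whose
    probability-mass block has l2 norm below tau are switched off."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    out = np.zeros_like(probs)
    for g in groups:
        norm = np.linalg.norm(probs[g])
        if norm > tau:
            out[g] = (1.0 - tau / norm) * probs[g]
    s = out.sum()
    # Fall back to uniform if every group was thresholded away.
    return out / s if s > 0 else np.full_like(probs, 1.0 / len(probs))

# Illustrative: the second group carries negligible probability mass.
result = proximal_policy_head(np.array([3.0, 3.0, -2.0, -2.0]),
                              groups=[[0, 1], [2, 3]], tau=0.2)
```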
Pseudocode fragments for both approaches are provided in Zhan et al. (2021) and Li et al. (2019).
4. Convergence and Theoretical Guarantees
GPMD with ℓ₁,₂ regularization achieves global linear convergence to the unique regularized optimum, even though the penalty is neither strongly convex nor smooth. For a constant step size $\eta$, the sup-norm error contracts geometrically per iteration at a rate governed by the discount factor $\gamma$, and an $\varepsilon$-accurate solution is achieved in $O\!\big(\tfrac{1}{1-\gamma} \log \tfrac{1}{\varepsilon}\big)$ iterations. This convergence rate is independent of the action and state space dimensions (Zhan et al., 2021).
In the more general regularized MDP framework, the regularized Bellman operator is a $\gamma$-contraction. The performance error between regularized and unregularized value functions is bounded by
$$\|V^* - V^*_\lambda\|_\infty \;\le\; \frac{\lambda \sqrt{G}}{1 - \gamma} \;\le\; \frac{\lambda \sqrt{|\mathcal{A}|}}{1 - \gamma},$$
where $G$ is the number of groups and $|\mathcal{A}|$ is the action space size (Li et al., 2019).
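One way a $\sqrt{G}$ dependence of this kind can arise is that the penalty is bounded on the simplex; by Cauchy–Schwarz,

```latex
\Omega(\pi) \;=\; \sum_{g=1}^{G} \|\pi_g\|_2
\;\le\; \sqrt{G}\,\Big( \sum_{g=1}^{G} \|\pi_g\|_2^2 \Big)^{1/2}
\;=\; \sqrt{G}\, \|\pi\|_2
\;\le\; \sqrt{G}\, \|\pi\|_1
\;=\; \sqrt{G},
```

so the per-step regularization cost is at most $\lambda\sqrt{G}$, and summing the discounted series gives a value gap of at most $\lambda\sqrt{G}/(1-\gamma)$.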
There exists a minimal $\bar{\lambda} > 0$ such that for all $\lambda \ge \bar{\lambda}$, the optimal policy zeros out at least one entire group in some states. As $\lambda \to 0$, the policy approaches the deterministic unregularized optimum; as $\lambda \to \infty$, the solution approaches a uniform distribution within the surviving groups, losing elementwise sparsity (Li et al., 2019).
5. Empirical Evaluation and Benchmarks
Benchmarks for ℓ₁,₂-regularized policy learning are documented in discrete and continuous RL environments:
- Discrete settings (e.g., random MDPs and Gridworld, with actions partitioned into groups): comparison between ℓ₁,₂, Shannon-entropy, Tsallis-entropy, and ℓ₁ regularizers shows that ℓ₁,₂ achieves 40–60% group sparsity for moderate $\lambda$ (0.1–1.0), with only a 1–2% reduction in expected return relative to unregularized baselines. Per-action ℓ₁ and Shannon-entropy regularization either provide no group sparsity or induce greater return degradation (Li et al., 2019).
- Continuous domains (e.g., MuJoCo tasks such as Hopper-v2, Walker2d-v2, Ant-v2, and HalfCheetah-v2, with actions grouped by actuator clusters): ℓ₁,₂ regularization induces block sparsity in the action means (disabling entire actuator groups) and improves sample efficiency by approximately 15% over Shannon-entropy regularization, with similar final returns. The results show mild sensitivity to $\lambda$ in the range 0.01–1.0 (Li et al., 2019).
The regularization weight $\lambda$ governs a sparsity–exploration trade-off: an overly aggressive penalty risks under-exploration, while too weak a penalty yields dense policies with little group structure. Intermediate values (e.g., $\lambda \approx 0.1$) provide the best trade-off, and deeper networks accommodate larger $\lambda$ without loss of sparsity (Li et al., 2019). All experiments use direct replacement of the entropy bonus with the group-Lasso penalty and insertion of the proximal projection on the actor output.
6. Computational Considerations and Implementation Details
Per-iteration complexity in the tabular GPMD scheme is dominated by policy evaluation, which can be implemented via a direct linear-system solve in $O(|\mathcal{S}|^3)$ time or approximately via iterative dynamic programming sweeps. The group-shrinkage and simplex-projection steps require $O(|\mathcal{A}| \log |\mathcal{A}|)$ per state. Vectorization across states, pre-allocation of memory, and warm-started simplex projection speed up practical performance (Zhan et al., 2021). In large-scale or neural settings, the actor-critic method merely inserts a fast, parallelizable group-soft-threshold and normalization step after each policy update; the overall pipeline remains compatible with established architectures such as Soft Actor-Critic (SAC).
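The vectorization across states mentioned above can be realized with a single matrix product; a sketch in which the array shapes and the one-hot aggregation trick are implementation choices assumed here, not details from the paper:

```python
import numpy as np

def prox_group_lasso_batch(Z, group_ids, tau):
    """Vectorized group soft-thresholding across all states at once.
    Z: (num_states, num_actions) array; group_ids: (num_actions,)
    integer labels mapping each action to its group."""
    num_groups = group_ids.max() + 1
    onehot = np.eye(num_groups)[group_ids]      # (A, G) group membership
    # Per-state, per-group l2 norms via aggregation of squared entries.
    norms = np.sqrt((Z ** 2) @ onehot)          # (S, G)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Z * scale[:, group_ids]              # broadcast scales to actions

# One state, two groups: the second group falls below the threshold.
Z = np.array([[2.0, 1.0, 0.1, 0.1]])
group_ids = np.array([0, 0, 1, 1])
out = prox_group_lasso_batch(Z, group_ids, tau=0.5)
```

This replaces the per-state Python loop with dense array operations, which is the form that parallelizes well on accelerators.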
7. Relation to Other Regularization Approaches
ℓ₁,₂ group-sparsity regularization generalizes ℓ₁ per-action sparsity and offers strictly stronger control at the group level. While Shannon or Tsallis entropy penalties promote diversity and exploration, they do not induce structured sparsity. In contrast, group-Lasso regularization acts directly on predefined action clusters, enabling interpretable “on–off” activation at the group level and facilitating structured resource management or safety constraints. Empirical comparisons indicate that ℓ₁,₂ regularization realizes sparsity with minimal loss of performance, outperforming standard entropy-based techniques when group sparsity is required (Li et al., 2019).