ℓ₁,₂-Regularized Policy Learning
- ℓ₁,₂-regularized policy learning is a reinforcement learning approach that uses the group Lasso penalty to induce group-level sparsity by zeroing out entire clusters of actions.
- The method employs proximal techniques and policy mirror descent to optimize policies efficiently while satisfying structured sparsity constraints.
- Empirical benchmarks demonstrate that this technique achieves high sample efficiency and minimal performance loss across both discrete and continuous RL settings.
ℓ₁,₂-regularized policy learning refers to reinforcement learning (RL) algorithms in which the policy is optimized not only for expected cumulative reward, but also under a group-sparsity-inducing regularizer. Specifically, the regularizer is the group Lasso penalty, or ℓ₁,₂ norm, which encourages entire groups of actions—typically defined by domain knowledge or problem structure—to have zero probability, yielding group-level sparsity in the learned policy. This approach is motivated by resource constraints, safety, interpretability, and structured exploration in RL, and it leads to computationally and statistically desirable properties in the resulting policies. The ℓ₁,₂-regularized setting can be instantiated within general regularized Markov decision process (MDP) frameworks and solved efficiently using proximal methods or generalized policy mirror descent, with provable convergence and sparsity guarantees (Zhan et al., 2021; Li et al., 2019).
1. Mathematical Framework and Regularizer Structure
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \rho, \gamma)$ denote a discounted infinite-horizon MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, transition kernel $P$, reward function $r$, initial distribution $\rho$, and discount factor $\gamma \in [0, 1)$. A stationary policy $\pi$ is typically represented as a probability vector $\pi(\cdot \mid s) \in \Delta(\mathcal{A})$ over actions at each state $s$.
The ℓ₁,₂ penalty, or group-Lasso norm, is imposed by partitioning the action set into $G$ disjoint groups $\mathcal{A}_1, \dots, \mathcal{A}_G$. For each state $s$, the group-sparse regularizer is defined as
$$\Omega\big(\pi(\cdot \mid s)\big) = \sum_{g=1}^{G} \big\|\pi_g(\cdot \mid s)\big\|_2,$$
where $\pi_g(\cdot \mid s)$ denotes the restriction of $\pi(\cdot \mid s)$ to group $\mathcal{A}_g$. The RL objective with ℓ₁,₂ regularization is
$$\max_{\pi} \; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right] - \lambda \sum_{s} d^\pi(s)\, \Omega\big(\pi(\cdot \mid s)\big),$$
where $d^\pi$ is the (discounted) state occupancy under policy $\pi$ and $\lambda > 0$ is the regularization weight (Li et al., 2019).
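As a concrete instance of the regularizer above, a minimal NumPy sketch; the grouping and the example distributions are illustrative, not taken from the cited papers:

```python
import numpy as np

def group_lasso_penalty(pi, groups):
    """Omega(pi) = sum_g ||pi_g||_2, where `groups` lists the action
    indices belonging to each disjoint group."""
    return sum(np.linalg.norm(pi[g]) for g in groups)

# Illustrative example: 4 actions partitioned into 2 groups of 2.
groups = [[0, 1], [2, 3]]
uniform = np.array([0.25, 0.25, 0.25, 0.25])
deterministic = np.array([1.0, 0.0, 0.0, 0.0])

omega_uniform = group_lasso_penalty(uniform, groups)        # 2 * ||[0.25, 0.25]||_2
omega_det = group_lasso_penalty(deterministic, groups)      # ||[1, 0]||_2 + 0
```

Note that the penalty decomposes over groups, which is what makes the blockwise proximal treatment in later sections possible.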
2. Optimality Conditions and Group-Sparsity Characterization
The regularized RL problem leads to a modified Bellman optimality equation. For each state $s$, the maximization within the Bellman operator becomes a concave program due to the (convex) ℓ₁,₂ penalty and the simplex constraint:
$$\max_{\pi(\cdot \mid s) \in \Delta(\mathcal{A})} \; \big\langle Q(s, \cdot),\, \pi(\cdot \mid s) \big\rangle - \lambda \sum_{g=1}^{G} \big\|\pi_g(\cdot \mid s)\big\|_2.$$
The KKT conditions show a thresholding phenomenon: for each group $g$, the optimal $\pi_g^*(\cdot \mid s)$ satisfies
$$Q_g(s, \cdot) - \mu(s)\,\mathbf{1} + \eta_g(s, \cdot) = \lambda\, u_g(s, \cdot)$$
for some dual variables $\mu(s)$ (simplex constraint), $\eta_g \ge 0$ (nonnegativity), and subgradient $u_g \in \partial \|\pi_g^*(\cdot \mid s)\|_2$. Specifically, if all $Q$-values in a group are below a threshold, that group is entirely zeroed out, inducing block sparsity at the group level (Li et al., 2019).
Since the group-Lasso penalty's subgradient at zero is bounded (any $u_g$ with $\|u_g\|_2 \le 1$ is a valid subgradient of $\|\cdot\|_2$ at zero), for a sufficiently large regularization weight $\lambda$ the optimal policy zeros out entire action groups in many states—a precise group-sparsity guarantee, distinct from elementwise (ℓ₁) sparsity (Li et al., 2019).
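The block-thresholding behavior described above is visible directly in the proximal operator of the penalty; a short sketch in which the threshold and input values are illustrative:

```python
import numpy as np

def prox_group_lasso(z, groups, tau):
    """Proximal operator of tau * sum_g ||z_g||_2 (group soft-thresholding):
    each group block is shrunk toward zero, and zeroed entirely when its
    l2 norm falls below the threshold tau."""
    out = np.zeros_like(z, dtype=float)
    for g in groups:
        norm = np.linalg.norm(z[g])
        if norm > tau:
            out[g] = (1.0 - tau / norm) * z[g]
    return out

groups = [[0, 1], [2, 3]]
z = np.array([2.0, 1.0, 0.1, 0.1])
p = prox_group_lasso(z, groups, tau=0.5)
# Group [2, 3] has norm sqrt(0.02) < 0.5, so it is zeroed as a block,
# while group [0, 1] is only shrunk.
```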
3. Algorithms: Proximal Methods and Policy Mirror Descent
Two primary algorithmic frameworks solve ℓ₁,₂-regularized policy optimization efficiently:
(a) Generalized Policy Mirror Descent (GPMD):
GPMD (Zhan et al., 2021) decouples the update per state and uses a Euclidean Bregman divergence $D(\pi, \pi') = \tfrac{1}{2}\|\pi - \pi'\|_2^2$. At each iteration $k$, the statewise update is
$$\pi^{(k+1)}(\cdot \mid s) = \arg\max_{\pi \in \Delta(\mathcal{A})} \left\{ \big\langle Q^{(k)}(s, \cdot),\, \pi \big\rangle - \lambda\, \Omega(\pi) - \frac{1}{\eta}\, D\big(\pi,\, \pi^{(k)}(\cdot \mid s)\big) \right\}.$$
The unconstrained proximal step is the group soft-thresholding
$$\big(\operatorname{prox}_{\eta\lambda\Omega}(z)\big)_g = \left(1 - \frac{\eta\lambda}{\|z_g\|_2}\right)_{\!+} z_g,$$
and the result is projected onto the probability simplex. This composite step is computationally efficient, scaling as $O(|\mathcal{A}| \log |\mathcal{A}|)$ per state (Zhan et al., 2021).
(b) Off-Policy Actor-Critic with Proximal Group-Lasso:
In continuous or large-discrete domains, an actor-critic architecture is used, combining stochastic policy gradient updates with a post-gradient group-Lasso proximal step. The policy output logits for each group are group-soft-thresholded and then normalized, ensuring group-wise sparsity (Li et al., 2019). The critic's update incorporates the penalty in the value target.
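A hedged sketch of the threshold-then-normalize step on a discrete policy head; the function name and the softmax parameterization are illustrative assumptions, not the exact architecture of Li et al. (2019):

```python
import numpy as np

def proximal_policy_head(logits, groups, tau):
    """Post-gradient proximal step on a discrete policy head:
    softmax -> group soft-threshold -> renormalize. Groups whose
    probability-mass block has l2 norm below tau are switched off."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    out = np.zeros_like(probs)
    for g in groups:
        norm = np.linalg.norm(probs[g])
        if norm > tau:
            out[g] = (1.0 - tau / norm) * probs[g]
    s = out.sum()
    # Fall back to uniform if every group was thresholded away.
    return out / s if s > 0 else np.full_like(probs, 1.0 / len(probs))

# Illustrative: the second group carries negligible probability mass.
result = proximal_policy_head(np.array([3.0, 3.0, -2.0, -2.0]),
                              groups=[[0, 1], [2, 3]], tau=0.2)
```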
Pseudocode fragments for both approaches are provided in Zhan et al. (2021) and Li et al. (2019).
4. Convergence and Theoretical Guarantees
GPMD with ℓ₁,₂ regularization achieves global linear convergence to the unique regularized optimum, even though the penalty is neither strongly convex nor smooth. For a constant step size $\eta$, the sup-norm error contracts geometrically per iteration at a rate governed by the discount factor $\gamma$, and an $\varepsilon$-accurate solution is achieved in $O\!\big(\tfrac{1}{1-\gamma} \log \tfrac{1}{\varepsilon}\big)$ iterations. This convergence rate is independent of the action and state space dimensions (Zhan et al., 2021).
In the more general regularized MDP framework, the regularized Bellman operator is a $\gamma$-contraction. The performance error between regularized and unregularized value functions is bounded by
$$\|V^* - V^*_\lambda\|_\infty \;\le\; \frac{\lambda \sqrt{G}}{1 - \gamma} \;\le\; \frac{\lambda \sqrt{|\mathcal{A}|}}{1 - \gamma},$$
where $G$ is the number of groups and $|\mathcal{A}|$ is the action space size (Li et al., 2019).
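One way a $\sqrt{G}$ dependence of this kind can arise is that the penalty is bounded on the simplex; by Cauchy–Schwarz,

```latex
\Omega(\pi) \;=\; \sum_{g=1}^{G} \|\pi_g\|_2
\;\le\; \sqrt{G}\,\Big( \sum_{g=1}^{G} \|\pi_g\|_2^2 \Big)^{1/2}
\;=\; \sqrt{G}\, \|\pi\|_2
\;\le\; \sqrt{G}\, \|\pi\|_1
\;=\; \sqrt{G},
```

so the per-step regularization cost is at most $\lambda\sqrt{G}$, and summing the discounted series gives a value gap of at most $\lambda\sqrt{G}/(1-\gamma)$.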
There exists a minimal $\bar{\lambda} > 0$ such that for all $\lambda \ge \bar{\lambda}$, the optimal policy zeros out at least one entire group in some states. As $\lambda \to 0$, the policy approaches the deterministic unregularized optimum; as $\lambda \to \infty$, the solution approaches a uniform distribution within the surviving groups, losing elementwise sparsity (Li et al., 2019).
5. Empirical Evaluation and Benchmarks
Benchmarks for ℓ₁,₂-regularized policy learning are documented in discrete and continuous RL environments:
- Discrete settings (e.g., random MDPs and Gridworld, with actions partitioned into groups): comparison between ℓ₁,₂, Shannon-entropy, Tsallis-entropy, and ℓ₁ regularizers shows that ℓ₁,₂ achieves 40–60% group sparsity for moderate $\lambda$ (0.1–1.0), with only a 1–2% reduction in expected return relative to unregularized baselines. Per-action ℓ₁ and Shannon-entropy regularization either provide no group sparsity or induce greater return degradation (Li et al., 2019).
- Continuous domains (e.g., MuJoCo tasks such as Hopper-v2, Walker2d-v2, Ant-v2, and HalfCheetah-v2, with actions grouped by actuator clusters): ℓ₁,₂ regularization induces block sparsity in the action means (disabling entire actuator groups) and improves sample efficiency by approximately 15% over Shannon-entropy regularization, with similar final returns. The results show mild sensitivity to $\lambda$ in the range 0.01–1.0 (Li et al., 2019).
The regularization weight $\lambda$ governs a sparsity–exploration trade-off: an overly aggressive penalty risks under-exploration, while too weak a penalty yields dense policies with little group structure. Intermediate values (e.g., $\lambda \approx 0.1$) provide the best trade-off, and deeper networks accommodate larger $\lambda$ without loss of sparsity (Li et al., 2019). All experiments use direct replacement of the entropy bonus with the group-Lasso penalty and insertion of the proximal projection on the actor output.
6. Computational Considerations and Implementation Details
Per-iteration complexity in the tabular GPMD scheme is dominated by policy evaluation, which can be implemented via a direct linear-system solve in $O(|\mathcal{S}|^3)$ time or approximately via iterative dynamic programming sweeps. The group-shrinkage and simplex-projection steps require $O(|\mathcal{A}| \log |\mathcal{A}|)$ per state. Vectorization across states, pre-allocation of memory, and warm-started simplex projection speed up practical performance (Zhan et al., 2021). In large-scale or neural settings, the actor-critic method merely inserts a fast, parallelizable group-soft-threshold and normalization step after each policy update; the overall pipeline remains compatible with established architectures such as Soft Actor-Critic (SAC).
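The vectorization across states mentioned above can be realized with a single matrix product; a sketch in which the array shapes and the one-hot aggregation trick are implementation choices assumed here, not details from the paper:

```python
import numpy as np

def prox_group_lasso_batch(Z, group_ids, tau):
    """Vectorized group soft-thresholding across all states at once.
    Z: (num_states, num_actions) array; group_ids: (num_actions,)
    integer labels mapping each action to its group."""
    num_groups = group_ids.max() + 1
    onehot = np.eye(num_groups)[group_ids]      # (A, G) group membership
    # Per-state, per-group l2 norms via aggregation of squared entries.
    norms = np.sqrt((Z ** 2) @ onehot)          # (S, G)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Z * scale[:, group_ids]              # broadcast scales to actions

# One state, two groups: the second group falls below the threshold.
Z = np.array([[2.0, 1.0, 0.1, 0.1]])
group_ids = np.array([0, 0, 1, 1])
out = prox_group_lasso_batch(Z, group_ids, tau=0.5)
```

This replaces the per-state Python loop with dense array operations, which is the form that parallelizes well on accelerators.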
7. Relation to Other Regularization Approaches
ℓ₁,₂ group-sparsity regularization generalizes ℓ₁ per-action sparsity and offers strictly stronger control at the group level. While Shannon or Tsallis entropy penalties promote diversity and exploration, they do not induce structured sparsity. In contrast, group-Lasso regularization acts directly on predefined action clusters, enabling interpretable “on–off” activation at the group level and facilitating structured resource management or safety constraints. Empirical comparisons indicate that ℓ₁,₂ regularization realizes sparsity with minimal loss of performance, outperforming standard entropy-based techniques when group sparsity is required (Li et al., 2019).