Proximal Policy Distillation (PPD) in RL
- Proximal Policy Distillation (PPD) is a reinforcement learning technique that transfers high-capacity teacher policies to efficient student models using PPO clipping and KL regularization.
- PPD reduces inference costs and improves sample efficiency by enabling online student data collection alongside guided policy matching.
- PPD demonstrates robustness to imperfect teachers and outperforms traditional distillation methods in both discrete and continuous control benchmarks.
Proximal Policy Distillation (PPD) is a reinforcement learning (RL) technique combining policy distillation with trust-region style regularization, based on the Proximal Policy Optimization (PPO) framework. PPD enables efficient transfer of policies from high-capacity teacher networks to lower-capacity (or otherwise chosen target) student networks while preserving the stability and data efficiency inherent in PPO. The method is designed to address both the inference costs associated with large policies and the sample-efficiency bottlenecks of traditional policy distillation, and it demonstrates strong empirical results across discrete and continuous control domains, as well as robustness to imperfect teachers (Green et al., 2019, Spigler, 2024).
1. Conceptual Foundations and Motivation
PPD emerges from two strands in RL: (1) actor-critic policy optimization, especially with PPO's clipped surrogate objective, and (2) policy distillation, where a student model absorbs the action distribution of a pre-trained teacher via a supervised loss. The motivation for PPD includes:
- High-capacity teachers (large CNNs) yield strong PPO policies but are expensive to deploy due to high inference costs. Distillation offers a hardware-agnostic means to transfer these policies into smaller, faster students.
- Classic distillation approaches (DQN distillation, supervised KL-matching) decouple policy learning from the RL environment, potentially leading to sub-optimal student policies, especially if the student is used to collect new data.
- PPD aims for a hybrid scheme: the student collects data, exploits its own environment returns, but regularizes its policy toward the teacher via a clipped KL penalty, enforcing policy similarity while allowing beneficial divergence when advantageous.
2. Methodological Structure and Algorithmic Description
PPD can be instantiated in both offline and online settings, and has been formalized in two primary variants.
2.1. Teacher-Student Framework
- Teacher: High-capacity PPO-trained network (e.g., 3 conv layers with 32→64→64 filters, kernels 8x8→4x4→3x3, strides 4→2→1, followed by FC(512); ≈1.68M parameters) (Green et al., 2019).
- Student: Lower-capacity (e.g., 8→16→16, FC(128), 0.11M params), medium (16→32→32, FC(256), 0.42M), or even larger than teacher. Student may mimic teacher or, with sufficient training, outperform it (Green et al., 2019, Spigler, 2024).
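The capacities quoted above can be checked with a quick parameter count (a back-of-the-envelope sketch assuming Atari-style 84×84×4 inputs and valid, no-padding convolutions; the helper functions are our own, not from either paper):

```python
def conv_params(c_in, c_out, k):
    """Weights + biases of a 2D conv layer with a square kernel of size k."""
    return c_in * c_out * k * k + c_out

def conv_out(size, k, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - k) // stride + 1

def network_params(filters, fc_width, in_channels=4, in_size=84):
    """Total parameters of the conv stack plus one fully connected layer."""
    kernels, strides = (8, 4, 3), (4, 2, 1)  # layout from the text above
    total, c, s = 0, in_channels, in_size
    for f, k, st in zip(filters, kernels, strides):
        total += conv_params(c, f, k)
        s = conv_out(s, k, st)
        c = f
    total += c * s * s * fc_width + fc_width  # flatten -> FC(fc_width)
    return total

teacher = network_params((32, 64, 64), 512)  # ≈ 1.68M parameters
small   = network_params((8, 16, 16), 128)   # ≈ 0.11M parameters
medium  = network_params((16, 32, 32), 256)  # ≈ 0.42M parameters
```

The counts reproduce the ≈1.68M / 0.11M / 0.42M figures cited above (policy and value heads add only a small number of extra parameters).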
2.2. Loss Functions
PPD combines multiple objectives in a unified loss, balancing between PPO’s policy improvement and KL-based imitation:
- PPO Surrogate Loss:
  $L^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$,
  where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio, $\epsilon$ is the clip range, and $\hat{A}_t$ is the (GAE-estimated) advantage.
- Distillation (Actor) KL Loss:
  $L^{\mathrm{distill}}(\theta) = \mathbb{E}_{s}\left[D_{\mathrm{KL}}\big(\pi_T(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right]$,
  the expected KL divergence from the teacher policy $\pi_T$ to the student policy $\pi_\theta$ over visited states.
- Clipped Distillation Loss (online PPD): the KL penalty above with PPO-style clipping applied to the distillation term, so that a single update cannot move the student relative to the teacher beyond the trust region; the exact form is given in Algorithm 1 of (Spigler, 2024).
- Student Objective (Full):
  $L(\theta) = L^{\mathrm{PPO}}(\theta) + \lambda\, L^{\mathrm{distill}}(\theta)$,
  with the distillation weight $\lambda$ typically between 0.5 and 5 (see Section 5). In online PPD, the loss includes the clipped distillation penalty and the PPO loss, jointly maximizing RL returns and policy matching (Spigler, 2024).
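The combined objective can be sketched for a discrete-action batch as follows (a minimal numpy sketch; for clarity the distillation term is left unclipped, unlike Algorithm 1 of Spigler, 2024, and all array names are illustrative):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative PPO clipped surrogate (a quantity to be minimized)."""
    ratio = np.exp(logp_new - logp_old)  # pi_theta / pi_theta_old
    surr = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    return -surr.mean()

def kl_teacher_student(p_teacher, p_student):
    """Mean KL(teacher || student) over a batch of categorical distributions.

    Both arrays have shape (batch, n_actions) and rows summing to 1.
    """
    return (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=1).mean()

def ppd_loss(logp_new, logp_old, adv, p_teacher, p_student, lam=1.0, eps=0.2):
    """Joint PPD objective: PPO surrogate plus lambda-weighted distillation."""
    return (ppo_clip_loss(logp_new, logp_old, adv, eps)
            + lam * kl_teacher_student(p_teacher, p_student))
```

When the student matches the teacher exactly, the distillation term vanishes and the objective reduces to plain PPO; larger `lam` pulls updates more strongly toward the teacher.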
2.3. High-level Algorithm Steps
The implementation is structured in phases:
| Phase | Key Steps | Environment Interaction |
|---|---|---|
| Teacher Training | PPO on large network, collect rollouts, freeze weights | Yes (full PPO training) |
| Teacher Data Collection | Run teacher, store in buffer | Yes (teacher only) |
| Offline Distillation | Student fit to teacher policy via KL distillation on the stored teacher rollouts | No (supervised only) |
| Student Fine-tuning (optional) | Online PPO updates on distilled student | Yes (student only) |
| Online PPD (joint) | Student collects rollouts, optimizes PPO + clipped distillation | Yes (student only) |
Pseudocode and precise parameterization are provided in (Green et al., 2019) and Algorithm 1 of (Spigler, 2024).
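The online PPD phase in the last table row can be sketched as a simple outer loop (an illustrative skeleton; `collect_rollout` and `ppd_update` are hypothetical placeholders standing in for the papers' actual rollout and update routines):

```python
def online_ppd(student, teacher, env, n_iterations, n_epochs,
               collect_rollout, ppd_update):
    """Outer loop of online PPD: the student drives data collection,
    while the teacher is only queried for distillation targets."""
    for _ in range(n_iterations):
        # 1. The student (not the teacher) interacts with the environment.
        batch = collect_rollout(env, student)
        # 2. The teacher is queried on the student's visited states to
        #    provide action distributions for the KL penalty.
        batch["teacher_probs"] = [teacher(s) for s in batch["states"]]
        # 3. Several PPO-style epochs over the batch, each minimizing the
        #    joint PPO + clipped-distillation objective.
        for _ in range(n_epochs):
            ppd_update(student, batch)
    return student
```

Note that the teacher never touches the environment in this phase, matching the "Yes (student only)" entry in the table.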
3. Empirical Evaluation
PPD has been extensively evaluated on discrete-action (Atari 2600, Procgen) and continuous-control (MuJoCo) benchmarks:
- Student types: Smaller (≈25%), same, or larger (3–7×) models relative to teacher (Spigler, 2024).
- Metrics: Episodic return vs. environment steps (sample efficiency), final performance as a fraction/multiple of teacher score, robustness to imperfect teachers.
- Offline Distillation (Green et al., 2019):
- Medium student distilled from teacher achieves ≈94% of teacher’s performance vs. 88% for PPO-from-scratch.
- Low-capacity student achieves ≈85% of teacher (vs. 84% for scratch).
- Fine-tuning enables full teacher parity at ≈27% of original training cost.
- Distilled PPO students outperform DQN baselines (160% of DQN mean score), consistent with the PPO teacher itself exceeding DQN (169% of DQN mean).
- Online PPD (Spigler, 2024):
- Larger students trained with PPD can exceed teacher performance: 1.25× on Atari, 1.02× on Procgen, similar in MuJoCo.
- PPD reaches teacher performance faster than student-distill or teacher-distill baselines (Figure 1, (Spigler, 2024)).
- PPD exhibits greater robustness to imperfect/corrupted teachers (e.g., when teacher parameters are perturbed): PPD achieves 0.59× (Atari) vs. 0.45× for teacher-distill and 0.46× for student-distill when teacher is reduced to 0.41× baseline.
4. Comparative Analysis: Baselines and Ablations
PPD is analyzed relative to two canonical baselines:
- Student-distill (SD): Student collects data, trained solely by supervised KL to teacher, ignores environment rewards during distillation (Spigler, 2024).
- Teacher-distill (TD): Teacher collects data, student trains via supervised KL; yields faster KL convergence but is sample-inefficient and overfits, producing poor downstream RL performance.
Online PPD improves on both by leveraging environment returns during student-driven learning, while maintaining teacher regularization to preserve skill transfer and policy stability.
5. Architectural and Hyperparameter Considerations
- Teacher and student networks use the same architectural pattern but differ in capacity, scaling convolutional filters and linear head size accordingly.
- Key hyperparameters include: PPO clip ratio (0.1 or 0.2), distillation weight λ (0.5–5), entropy regularization coefficient (0.01 for discrete actions, 0 for continuous), the discount factor, the GAE parameter, rollout length, batch size, and number of PPO epochs per update.
- No temperature scaling is applied to the teacher policy during distillation (Green et al., 2019); stochasticity is preserved natively.
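Gathered as a config fragment, these settings might look like the following (the key names are our own, and values marked "assumed" are standard PPO defaults rather than figures reported in the papers):

```python
# Illustrative PPD hyperparameter bundle; key names are hypothetical.
ppd_config = {
    "clip_range": 0.2,           # PPO clip ratio (papers use 0.1 or 0.2)
    "distill_weight": 1.0,       # lambda, swept over 0.5-5 in the papers
    "ent_coef": 0.01,            # 0.01 for discrete actions, 0.0 for continuous
    "gamma": 0.99,               # assumed: standard PPO default
    "gae_lambda": 0.95,          # assumed: standard PPO default
    "n_epochs": 10,              # assumed: standard PPO default
    "teacher_temperature": None, # no temperature scaling of the teacher policy
}
```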
6. Implementation, Software, and Extensions
PPD has been implemented in the sb3-distill Python library, built on Stable-Baselines3 (Spigler, 2024). The framework supports distillation for any on-policy algorithm via a PolicyDistillationAlgorithm interface. Default settings and practical tips (e.g., omitting critic distillation for larger students; using dynamic schedules in multi-teacher/multitask settings) are included. Value distillation is optional (its loss weight can be set to zero) and was omitted in most experiments.
7. Broader Implications, Generalization, and Robustness
PPD generalizes beyond PPO and discrete actions; the same proximal-distillation structure applies to other actor-critic methods (TRPO, A2C) and can handle continuous action spaces via appropriate divergence choices (e.g., Gaussian KL). Robustness evaluations demonstrate PPD’s ability to surpass imperfect teachers and recover performance, overcoming a principal shortcoming of pure supervised distillation. The general methodology—teacher pretraining, offline or online KL-regularized distillation, followed by (optional) environment-based fine-tuning—provides a scalable blueprint for sample-efficient, high-performance RL policies with reduced inference demands (Green et al., 2019, Spigler, 2024).