Data-Regularized Diffusion RL
- The paper introduces DDRL, replacing on-policy reverse KL with an off-policy forward KL penalty, ensuring regularization remains meaningful during reinforcement learning.
- It formulates diffusion models as Markov Decision Processes, integrating RL rollouts with denoising loss to preserve sample fidelity and prevent reward hacking.
- Empirical studies in video generation show DDRL improves reward metrics and human evaluation scores while maintaining realism, diversity, and prompt alignment.
Data-Regularized Diffusion Reinforcement Learning (DDRL) is a reinforcement learning paradigm for post-training generative diffusion models that employs data-driven regularization to address reward hacking and distributional drift. The method replaces on-policy reference-model regularization with an off-policy forward KL penalty anchored on a real or synthetic data distribution, enabling robust reward optimization without compromising sample realism, diversity, or fidelity.
1. Motivation: Reward Hacking and Regularization Failure in Diffusion RL
Standard RL post-training of diffusion models seeks to maximize an externally specified reward $r(x_0, c)$, such as a proxy for human preference, while constraining divergence from a reference model $\pi_{\text{ref}}$ using the reverse KL:

$$\max_\theta \; \mathbb{E}_{c,\; x_0 \sim \pi_\theta(\cdot \mid c)}\left[ r(x_0, c) \right] - \lambda\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid c) \,\|\, \pi_{\text{ref}}(\cdot \mid c) \right).$$

However, the reward model is generally reliable only near the data manifold, while the RL-trained policy $\pi_\theta$ can drift off-manifold during multi-step sampling, rendering the on-policy reverse KL penalty uninformative. Consequently, diffusion models "hack" the reward: they produce over-stylized, low-diversity, or unrealistic outputs that achieve high reward scores yet reduced human preference and visible artifacts (e.g., noise patterns, cartoonish outputs). DDRL addresses this pathology directly by substituting the on-policy reverse KL with an off-policy forward KL, anchoring the diffusion policy to an external data distribution so that the regularization remains meaningful even as $\pi_\theta$ explores unfamiliar regions (Ye et al., 3 Dec 2025).
2. Theoretical Formulation: Diffusion, Off-Policy KL, and Optimality
2.1 Diffusion Models as Markov Decision Processes
Let $p_{\text{data}}(x_0 \mid c)$ represent the real data conditional on context $c$. The forward noising process is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \quad t = 1, \dots, T,$$

and the diffusion model defines the reverse transitions

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\right),$$

so that

$$p_\theta(x_0 \mid c) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c)\, \mathrm{d}x_{1:T}.$$

In the RL interpretation, $p_\theta(x_{t-1} \mid x_t, c)$ is a stochastic policy acting on (state, action) pairs $\big((x_t, t, c),\, x_{t-1}\big)$, terminating at $x_0$.
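As an illustration of this MDP view, the sketch below (NumPy, with a hypothetical placeholder denoiser standing in for the trained network $\epsilon_\theta$) runs the closed-form forward noising and a stochastic reverse rollout in which each denoising step acts as a policy action:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)      # forward noise schedule beta_t
alphas_bar = np.cumprod(1.0 - betas)    # \bar{alpha}_t = prod_s (1 - beta_s)

def forward_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form (standard DDPM identity)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for eps_theta(x_t, t, c); a real model is a network."""
    return x_t * 0.1

def reverse_step(x_t, t):
    """One stochastic 'policy' action: sample x_{t-1} ~ p_theta(. | x_t)."""
    beta_t, ab_t = betas[t], alphas_bar[t]
    eps_hat = toy_denoiser(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(1.0 - beta_t)
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(beta_t) * noise

# MDP rollout: state (x_t, t) -> action x_{t-1}, terminating at x_0.
x = rng.standard_normal(4)              # x_T ~ N(0, I)
for t in reversed(range(T)):
    x = reverse_step(x, t)
```

The trajectory $(x_T, \dots, x_0)$ produced by the loop is exactly the rollout that the RL objective scores with $r(x_0, c)$.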
2.2 Off-Policy Data Distribution
The reference/off-policy distribution is constructed by drawing $x_0$ from real or synthetic data, then running the forward noising process for $t = 1, \dots, T$. The resulting marginals are "noisy data" distributed along the real-data manifold.
2.3 Forward KL Regularization
DDRL replaces the standard reverse-KL penalty with the off-policy forward KL $D_{\mathrm{KL}}\!\left( q \,\|\, p_\theta \right)$, evaluated on samples from the data+noise process. By standard properties of diffusion models, this term reduces (up to constants) to the denoising objective

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{x_0, t, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2 \right],$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ is synthesized from $x_0$ plus noise at timestep $t$.
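A minimal sketch of this forward-KL term, assuming standard DDPM noising and a hypothetical `eps_pred` placeholder for the network: the penalty is simply the denoising loss evaluated on noised *real* data, not on policy rollouts:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_pred(x_t, t):
    """Hypothetical stand-in for the network eps_theta(x_t, t, c)."""
    return np.zeros_like(x_t)

def denoising_loss(x0):
    """Monte Carlo estimate of L_diff = E ||eps - eps_theta(x_t, t)||^2,
    computed on a real data sample x0 -- the off-policy forward-KL term."""
    t = rng.integers(T)                 # random timestep
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return float(np.mean((eps - eps_pred(x_t, t)) ** 2))

loss = denoising_loss(rng.standard_normal(8))  # x0 drawn from data in practice
```

Because the expectation is over data and noise rather than over $\pi_\theta$, the penalty stays informative even when the policy drifts off-manifold.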
2.4 Objective and Optimal Policy
Define the (zero-meaned) advantage:

$$A(x_0, c) = r(x_0, c) - b(c), \qquad b(c) = \mathbb{E}_{x_0 \sim \pi_\theta(\cdot \mid c)}\left[ r(x_0, c) \right].$$

The DDRL objective is

$$\max_\theta \; \mathbb{E}_{c,\; x_0 \sim \pi_\theta(\cdot \mid c)}\left[ A(x_0, c) \right] - \alpha\, \mathcal{L}_{\text{diff}}(\theta),$$

with regularization weight $\alpha > 0$. In practice, the advantage-shaping transform can be replaced by the identity to match baseline scaling. Equivalently, the objective trades reward against the forward KL to the data distribution,

$$\max_\theta \; \mathbb{E}\left[ A(x_0, c) \right] - \alpha\, D_{\mathrm{KL}}\!\left( p_{\text{data}} \,\|\, \pi_\theta \right),$$

and the optimal policy is

$$\pi^*(x_0 \mid c) \propto p_{\text{data}}(x_0 \mid c)\, \exp\!\left( A(x_0, c) / \alpha \right),$$

showing that DDRL correctly recovers the KL-regularized posterior (Ye et al., 3 Dec 2025).
3. Algorithmic Implementation
DDRL alternates between RL-style rollouts and off-policy diffusion loss regularization:
- Draw batches of conditions $c$.
- For each $c$: generate $N$ rollouts $x^{(i)} \sim p_\theta(\cdot \mid c)$; compute rewards $r_i = r(x_0^{(i)}, c)$, the baseline $b(c) = \tfrac{1}{N}\sum_i r_i$, and advantages $A_i = r_i - b(c)$.
- Accumulate the policy-gradient (REINFORCE) loss: $\mathcal{L}_{\text{RL}} = -\tfrac{1}{N}\sum_i A_i \sum_t \log p_\theta\big(x_{t-1}^{(i)} \mid x_t^{(i)}, c\big)$.
- For each $c$: sample $x_0$ from data, a random timestep $t$, and noise $\epsilon \sim \mathcal{N}(0, I)$; synthesize $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$; compute the denoising loss $\big\| \epsilon - \epsilon_\theta(x_t, t, c) \big\|^2$.
- Minimize the total loss $\mathcal{L}_{\text{RL}} + \alpha\, \mathcal{L}_{\text{diff}}$ using AdamW. Only a subset of timesteps is used per iteration for efficiency.
At training time, DDRL requires only a data/noise sampler for the regularizer, with no frozen reference model; the off-policy KL term can draw on real or synthetic data (Ye et al., 3 Dec 2025).
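The alternation above can be sketched end to end on a deliberately tiny stand-in: a one-parameter Gaussian "policy" with an analytic log-likelihood gradient replaces the denoising chain, and a squared-distance pull toward data samples stands in for the diffusion loss. All names and constants are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

mu = 0.0                      # policy parameter (plays the role of theta)
mu_data = 0.0                 # mean of the data distribution the forward KL anchors to
alpha, lr, N = 0.1, 0.05, 64  # regularization weight, step size, rollout size

def reward(x):
    return -(x - 2.0) ** 2    # hypothetical external reward, maximized at x = 2

for step in range(200):
    # --- RL rollout: sample, compute advantages, REINFORCE gradient ---
    x = mu + rng.standard_normal(N)        # x ~ N(mu, 1)
    r = reward(x)
    A = r - r.mean()                       # zero-mean advantage (baseline b(c))
    grad_rl = -np.mean(A * (x - mu))       # d/dmu of -A * log N(x; mu, 1)
    # --- off-policy regularizer: pull toward the data distribution ---
    x0 = mu_data + rng.standard_normal(N)  # "real data" samples
    grad_reg = np.mean(mu - x0)            # d/dmu of 0.5 * E (mu - x0)^2
    mu -= lr * (grad_rl + alpha * grad_reg)
```

The reward pulls the parameter toward its own optimum while the data term anchors it near the data mean, mirroring the reward-versus-regularization trade-off in the full method.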
4. Hyperparameter and Practical Details
The DDRL recipe for large-scale video generation includes:
- Diffusion timesteps: , (evenly spaced).
- RL rollout size: per condition; learning rates (2B), (14B).
- Batch size: conditions/iteration.
- Off-policy data: real samples (from high-quality fine-tuning data) or synthetic samples (pre-generated by base model, e.g., 10k videos).
- Optimizer: AdamW (, ).
- Computational cost: GPU-hours (H100). Efficiency is aided by single forward-diffusion pass per data point and asynchronous external reward servers.
No classifier-free guidance is used in diffusion loss; conditioning is dropped with 20% probability during training (Ye et al., 3 Dec 2025).
5. Empirical Results in High-Resolution Video Generation
DDRL was evaluated for post-training Cosmos2.5 (2B, 14B) on mixed TextVideo and ImageVideo tasks, using VideoAlign and VBench for quantitative metrics alongside human evaluation.
Quantitative performance:
- DDRL improves average reward by (VideoAlign) and (VBench) over the base model.
- Competing methods (DanceGRPO, FlowGRPO) improve some metrics but "hack" others (e.g., over-stylization, blurred or mismatched outputs).
Human preferences:
- For Cosmos2.5-2B, DDRL outperforms base by in human vote and increases VideoAlign score from $0.604$ to $0.715$.
- Baselines can increase reward but decrease text alignment or realism.
Qualitative effects:
- DanceGRPO shows increased color saturation and prompt misalignment.
- FlowGRPO exhibits blur, temporal jitter, and artifacts.
- DDRL maintains realism/diversity, better prompt fidelity, and smooth motion (Ye et al., 3 Dec 2025).
6. Limitations and Future Directions
DDRL's effectiveness depends on the quality of the off-policy data distribution $p_{\text{data}}$; if real data is limited or the synthetic data distribution is mismatched, regularization may not fully constrain $\pi_\theta$. RL rollout cost dominates overall compute, so sample-efficiency improvements (e.g., value networks or off-policy RL) are a natural avenue for further work. The field also lacks an automatic metric for detecting reward hacking, though candidate signals include spikes in diffusion loss, reduced output variance, or abrupt trade-offs across evaluation metrics.
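Those candidate signals could be monitored with a simple heuristic; the window size and thresholds below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def hacking_signals(diff_losses, sample_vars, window=10, spike=2.0, collapse=0.5):
    """Heuristic flags for reward hacking, following the signals suggested
    above: a spike in diffusion loss or a collapse in output variance.
    Compares a recent window against an early-training baseline window."""
    diff_losses = np.asarray(diff_losses, dtype=float)
    sample_vars = np.asarray(sample_vars, dtype=float)
    base_loss = diff_losses[:window].mean()
    base_var = sample_vars[:window].mean()
    return {
        "diffusion_loss_spike": bool(diff_losses[-window:].mean() > spike * base_loss),
        "diversity_collapse": bool(sample_vars[-window:].mean() < collapse * base_var),
    }
```

A training run would log the per-iteration diffusion loss and a variance statistic over generated samples, then alarm on either flag.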
Proposed extensions include:
- Unifying SFT and RL post-training in a single-stage DDRL setup,
- Application to other generative architectures (flows, autoregressive LLMs),
- Incorporating classifier-free guidance with RL policy improvements at inference (Ye et al., 3 Dec 2025).
7. Related Continuous-Time RL Formulations
A closely related line derives reward-directed, entropy-regularized RL for continuous-time score-based diffusion models (Gao et al., 2024). This approach formulates diffusion model learning as an MDP with policies representing time-varying applied scores, targeting a reward that penalizes the deviation of the applied score $a$ from the true data-distribution score:

$$r(t, x, a) = -\left\| a - \nabla_x \log p_t(x) \right\|^2,$$
with an entropy bonus and terminal reward. The optimal policy is Gaussian with mean incorporating both the data score and value-gradient, and a known variance. Training employs actor-critic (q-learning) with density-ratio score estimation on noisy data, facilitating a principled, model-free balance between task reward maximization and distributional fidelity, without reliance on a pretrained model. Comparison with model-based fine-tuning reveals improved robustness to poor initialization and avoidance of reward overfitting (Gao et al., 2024).
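Schematically (under the stated Gaussian-optimal-policy result, with an entropy weight $\gamma$ and value function $V$; the exact scaling in Gao et al. (2024) may differ), the optimal policy has the form:

```latex
\pi^*(a \mid t, x) = \mathcal{N}\!\left(a;\ \underbrace{\nabla_x \log p_t(x)}_{\text{data score}}
  + \underbrace{\nabla_x V(t, x)}_{\text{value gradient}},\ \gamma I \right)
```

The mean combines the two terms named in the text, and the variance is determined by the entropy regularization rather than learned.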
For thorough derivations, implementation details, and experimental results, see "Data-regularized Reinforcement Learning for Diffusion Models at Scale" (Ye et al., 3 Dec 2025) and "Reward-Directed Score-Based Diffusion Models via q-Learning" (Gao et al., 2024).