Data-Regularized Diffusion RL

Updated 13 February 2026

The paper introduces DDRL, replacing on-policy reverse KL with an off-policy forward KL penalty, ensuring regularization remains meaningful during reinforcement learning.
It formulates diffusion models as Markov Decision Processes, integrating RL rollouts with denoising loss to preserve sample fidelity and prevent reward hacking.
Empirical studies in video generation show DDRL improves reward metrics and human evaluation scores while maintaining realism, diversity, and prompt alignment.

Data-Regularized Diffusion Reinforcement Learning (DDRL) is a reinforcement learning paradigm for post-training generative diffusion models that employs data-driven regularization to address reward hacking and distributional drift. The method replaces on-policy reference-model regularization with an off-policy forward KL penalty anchored on a real or synthetic data distribution, enabling robust reward optimization without compromising sample realism, diversity, or fidelity.

1. Motivation: Reward Hacking and Regularization Failure in Diffusion RL

Standard RL post-training of diffusion models seeks to maximize an externally specified reward $r(x)$ , such as a proxy for human preference, while constraining divergence from a reference model $p_\text{ref}$ using the reverse KL:

$J_\text{RL}(\theta) = \mathbb{E}_{p_\theta}[r(x)/\beta] - \mathrm{KL}(p_\theta \Vert p_\text{ref}).$

However, the reward model is generally only reliable near the data manifold, while $p_\theta$ —the policy learned via RL—can drift off-manifold in multi-step sampling, rendering the on-policy reverse KL penalty uninformative. Consequently, diffusion models "hack" the reward: they produce over-stylized, low-diversity, or unrealistic outputs that nevertheless achieve high reward scores but reduced human preference and increased visible artifacts (e.g., noise patterns, cartoonish outputs). DDRL directly addresses this pathology by substituting the on-policy reverse KL with an off-policy forward KL, anchoring the diffusion policy to an external data distribution so that the regularization remains meaningful even as $p_\theta$ explores unfamiliar regions (Ye et al., 3 Dec 2025).

2. Theoretical Formulation: Diffusion, Off-Policy KL, and Optimality

2.1 Diffusion Models as Markov Decision Processes

Let $x_0 \sim p_\text{data}(\cdot\,|\,c)$ represent the real data conditional on context $c$ . The forward noising process is

$q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I), \qquad t = 1 \ldots T,$

and the diffusion model $\epsilon_\theta(x_t, t, c)$ defines the reverse transitions

$p_\theta(x_{t-1}|x_t, c) = \mathcal{N}(\mu_\theta(x_t, t, c), \sigma_t^2 I),$

so that

$p_\theta(x_{0:T}|c) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t, c).$

In RL interpretation, $p_\theta$ is a stochastic policy acting in (state, action) pairs given by $(x_t, t, c) \to x_{t-1}$ , terminating at $x_0$ .

2.2 Off-Policy Data Distribution

The reference/off-policy distribution $\tilde p_\text{data}(x_{0:T}|c)$ is constructed by: drawing $x_0$ from real or synthetic data, then running the forward noising process for $t = 1, \ldots, T$ . The resulting marginals $\tilde p_\text{data}(x_t)$ are "noisy data" distributed along the real-data manifold.

2.3 Forward KL Regularization

DDRL replaces the standard penalty $\mathrm{KL}(p_\theta \Vert p_\text{ref})$ with the off-policy $\mathrm{KL}(\tilde p_\text{data} \Vert p_\theta)$ , evaluated on samples from the data+noise process. By properties of diffusion models,

$\mathrm{KL}(\tilde p_\text{data} \Vert p_\theta) \equiv L_\text{diff}(\theta; \tilde p_\text{data}) = \mathbb{E}_{t, x_0, \epsilon} [ w_t \|\epsilon_\theta(x_t, t, c) - \epsilon\|^2 ]$

where $x_t$ is synthesized from $x_0$ plus noise at $t$ .

2.4 Objective and Optimal Policy

Define the (zero-meaned) advantage:

$A(x_0, c) = \frac{r(x_0, c) - Z}{\beta} \qquad\text{where}\qquad Z = \beta \log\, \mathbb{E}_{p_\text{ref}}[\exp(r(x_0, c)/\beta)]$

The DDRL objective is

$J_\text{DDRL}(\theta) = \mathbb{E}_{x_0 \sim p_\theta} [\lambda(A(x_0, c))] - \mathrm{KL}(\tilde p_\text{data} \Vert p_\theta)$

with $\lambda(u) = -\exp(-u)$ . In practice, $\lambda$ can be replaced by identity to match baseline scaling.

Equivalently,

$\tilde J_\text{DDRL}(\theta) = \mathbb{E}_{x_0 \sim p_\theta} [A(x_0, c)] - L_\text{diff}(\theta; \tilde p_\text{data}),$

and the optimal policy is

$p_\theta^*(x_0|c) \propto \tilde p_\text{data}(x_0|c)\; \exp(r(x_0, c)/\beta),$

showing that DDRL correctly recovers the KL-regularized posterior (Ye et al., 3 Dec 2025).

3. Algorithmic Implementation

DDRL alternates between RL-style rollouts and off-policy diffusion loss regularization:

Draw batches of conditions $c_1, \ldots, c_B$ .
For each $c_i$ : generate $N$ samples $x_0^n \sim p_\theta(\cdot\,|\,c_i)$ ; compute rewards $r^n$ , the baseline $Z$ , and advantages $A^n$ .
Accumulate the policy-gradient (REINFORCE) loss: $L_\text{RL} = -\!\frac1N\sum_n A^n \log p_\theta(x_0^n|c_i)$ .
For each $c_i$ : sample $x_0 \sim \tilde p_\text{data}(\cdot|c_i)$ , random $t \in S$ , noise $\epsilon \sim \mathcal{N}(0, I)$ ; synthesize $x_t$ ; compute denoising loss $\|\epsilon_\theta(x_t, t, c_i) - \epsilon\|^2$ as $L_\text{diff}$ .
Minimize total loss $L_\text{total} = L_\text{diff} - L_\text{RL}$ using AdamW. Only a subset $S$ of timesteps is used for efficiency.

DDRL requires only a data/noise sampler at training, without reference to a pretrained model; it can leverage real or synthetic data for the off-policy KL term (Ye et al., 3 Dec 2025).

4. Hyperparameter and Practical Details

The DDRL recipe for large-scale video generation includes:

Diffusion timesteps: $T = 20$ , $|S| = 10$ (evenly spaced).
RL rollout size: $N = 8$ per condition; learning rates $10^{-5}$ (2B), $3 \times 10^{-6}$ (14B).
Batch size: $B = 16$ conditions/iteration.
Off-policy data: real samples (from high-quality fine-tuning data) or synthetic samples (pre-generated by base model, e.g., 10k videos).
Optimizer: AdamW ( $\beta_1 = 0.9$ , $\beta_2 = 0.99$ ).
Computational cost: $\sim 10^6$ GPU-hours (H100). Efficiency is aided by single forward-diffusion pass per data point and asynchronous external reward servers.

No classifier-free guidance is used in diffusion loss; conditioning is dropped with 20% probability during training (Ye et al., 3 Dec 2025).

5. Empirical Results in High-Resolution Video Generation

DDRL was evaluated in post-training Cosmos2.5 (2B, 14B) over mixed Text $\to$ Video and Image $\to$ Video tasks, using VideoAlign and VBench for quantitative and human-eval metrics.

Quantitative performance:

DDRL improves average reward by $+0.20$ (VideoAlign) and $+0.05$ (VBench) over the base model.
Competing methods (DanceGRPO, FlowGRPO) improve some metrics but "hack" others (e.g., over-stylization, blurred or mismatched outputs).

Human preferences:

For Cosmos2.5-2B, DDRL outperforms base by $+22.9\%$ in human vote and increases VideoAlign score from $0.604$ to $0.715$.
Baselines can increase reward but decrease text alignment or realism.

Qualitative effects:

DanceGRPO shows increased color saturation and prompt misalignment.
FlowGRPO exhibits blur, temporal jitter, and artifacts.
DDRL maintains realism/diversity, better prompt fidelity, and smooth motion (Ye et al., 3 Dec 2025).

6. Limitations and Future Directions

DDRL effectiveness depends on the quality of $\tilde p_\text{data}$ ; if real data is limited or synthetic data distribution is mismatched, regularization may not fully constrain $p_\theta$ . RL rollout cost dominates overall compute—sample efficiency improvements (e.g., using value networks or off-policy RL) are a potential avenue for further work. The field lacks an automatic metric for detecting reward hacking, though signals could include spikes in diffusion loss, lower variance in outputs, or abrupt trade-offs in evaluation metrics.

Proposed extensions include:

Unifying SFT and RL post-training in a single-stage DDRL setup,
Application to other generative architectures (flows, autoregressive LLMs),
Incorporating classifier-free guidance with RL policy improvements at inference (Ye et al., 3 Dec 2025).

A closely related line derives reward-directed, entropy-regularized RL for continuous-time score-based diffusion models (Gao et al., 2024). This approach formulates diffusion model learning as an MDP with policies representing time-varying applied scores, targeting a reward that penalizes deviation from the true data distribution score:

$r(t, y, a) = - (g(T-t))^2 \| \nabla_x\log p_{T-t}(y) - a \|^2,$

with an entropy bonus and terminal reward. The optimal policy is Gaussian with mean incorporating both the data score and value-gradient, and a known variance. Training employs actor-critic (q-learning) with density-ratio score estimation on noisy data, facilitating a principled, model-free balance between task reward maximization and distributional fidelity, without reliance on a pretrained model. Comparison with model-based fine-tuning reveals improved robustness to poor initialization and avoidance of reward overfitting (Gao et al., 2024).

For thorough derivations, implementation details, and experimental results, see "Data-regularized Reinforcement Learning for Diffusion Models at Scale" (Ye et al., 3 Dec 2025) and "Reward-Directed Score-Based Diffusion Models via q-Learning" (Gao et al., 2024).

Markdown Report Issue Upgrade to Chat

References (2)

Data-regularized Reinforcement Learning for Diffusion Models at Scale (2025)

Reward-Directed Score-Based Diffusion Models via q-Learning (2024)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Data-Regularized Diffusion Reinforcement Learning (DDRL).