
Data-Regularized Diffusion RL

Updated 13 February 2026
  • The paper introduces DDRL, replacing on-policy reverse KL with an off-policy forward KL penalty, ensuring regularization remains meaningful during reinforcement learning.
  • It formulates diffusion models as Markov Decision Processes, integrating RL rollouts with denoising loss to preserve sample fidelity and prevent reward hacking.
  • Empirical studies in video generation show DDRL improves reward metrics and human evaluation scores while maintaining realism, diversity, and prompt alignment.

Data-Regularized Diffusion Reinforcement Learning (DDRL) is a reinforcement learning paradigm for post-training generative diffusion models that employs data-driven regularization to address reward hacking and distributional drift. The method replaces on-policy reference-model regularization with an off-policy forward KL penalty anchored on a real or synthetic data distribution, enabling robust reward optimization without compromising sample realism, diversity, or fidelity.

1. Motivation: Reward Hacking and Regularization Failure in Diffusion RL

Standard RL post-training of diffusion models seeks to maximize an externally specified reward $r(x)$, such as a proxy for human preference, while constraining divergence from a reference model $p_\text{ref}$ using the reverse KL:

$$J_\text{RL}(\theta) = \mathbb{E}_{p_\theta}[r(x)/\beta] - \mathrm{KL}(p_\theta \Vert p_\text{ref}).$$

However, the reward model is generally reliable only near the data manifold, while $p_\theta$, the policy learned via RL, can drift off-manifold during multi-step sampling, rendering the on-policy reverse-KL penalty uninformative. Consequently, diffusion models "hack" the reward: they produce over-stylized, low-diversity, or unrealistic outputs that achieve high reward scores yet lower human preference and show visible artifacts (e.g., noise patterns, cartoonish outputs). DDRL addresses this pathology by substituting an off-policy forward KL for the on-policy reverse KL, anchoring the diffusion policy to an external data distribution so that the regularization remains meaningful even as $p_\theta$ explores unfamiliar regions (Ye et al., 3 Dec 2025).

2. Theoretical Formulation: Diffusion, Off-Policy KL, and Optimality

2.1 Diffusion Models as Markov Decision Processes

Let $x_0 \sim p_\text{data}(\cdot \mid c)$ denote real data conditional on a context $c$. The forward noising process is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \qquad t = 1, \ldots, T,$$

and the diffusion model $\epsilon_\theta(x_t, t, c)$ defines the reverse transitions

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(\mu_\theta(x_t, t, c),\ \sigma_t^2 I\right),$$

so that

$$p_\theta(x_{0:T} \mid c) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t, c).$$

In the RL interpretation, $p_\theta$ is a stochastic policy over (state, action) pairs $(x_t, t, c) \to x_{t-1}$, terminating at $x_0$.
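This episode view can be sketched in a few lines of NumPy. Here `mu_theta` and the fixed per-step variance are illustrative stand-ins, not the paper's model; the point is that each denoising step is one Gaussian action whose log-probability can be accumulated for policy gradients:

```python
import numpy as np

def mu_theta(x_t, t, c):
    # Toy stand-in for the learned posterior mean; a real model
    # would compute this from eps_theta(x_t, t, c).
    return 0.9 * x_t

def rollout(T=5, dim=4, sigma=0.1, c=None, rng=None):
    """Sample a trajectory x_T -> ... -> x_0 from the Gaussian policy
    p_theta(x_{t-1} | x_t, c) = N(mu_theta(x_t, t, c), sigma^2 I),
    accumulating the log-prob of each action for REINFORCE."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.standard_normal(dim)        # x_T ~ p(x_T) = N(0, I)
    logp = 0.0
    for t in range(T, 0, -1):
        mean = mu_theta(x, t, c)
        x_prev = mean + sigma * rng.standard_normal(dim)   # one action
        # log N(x_prev; mean, sigma^2 I)
        logp += (-0.5 * np.sum((x_prev - mean) ** 2) / sigma**2
                 - dim * np.log(sigma * np.sqrt(2 * np.pi)))
        x = x_prev
    return x, logp   # terminal state x_0 and trajectory log-prob
```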

2.2 Off-Policy Data Distribution

The reference (off-policy) distribution $\tilde p_\text{data}(x_{0:T} \mid c)$ is constructed by drawing $x_0$ from real or synthetic data and then running the forward noising process for $t = 1, \ldots, T$. The resulting marginals $\tilde p_\text{data}(x_t)$ are "noisy data" distributed along the real-data manifold.
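This construction is closed-form: $t$ forward Gaussian steps collapse into a single one, $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$. A minimal sketch with a toy schedule (all values illustrative):

```python
import numpy as np

def noisy_data_sample(x0, t, betas, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I),
    the closed form of t forward noising steps."""
    alpha_bar = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 20)   # toy variance schedule, T = 20
x0 = rng.standard_normal(8)           # a "real" data point
x_t, eps = noisy_data_sample(x0, t=10, betas=betas, rng=rng)
```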

2.3 Forward KL Regularization

DDRL replaces the standard penalty $\mathrm{KL}(p_\theta \Vert p_\text{ref})$ with the off-policy $\mathrm{KL}(\tilde p_\text{data} \Vert p_\theta)$, evaluated on samples from the data-plus-noise process. By standard properties of diffusion models,

$$\mathrm{KL}(\tilde p_\text{data} \Vert p_\theta) \equiv L_\text{diff}(\theta; \tilde p_\text{data}) = \mathbb{E}_{t, x_0, \epsilon}\left[ w_t\, \|\epsilon_\theta(x_t, t, c) - \epsilon\|^2 \right],$$

where $x_t$ is synthesized from $x_0$ by adding noise at level $t$; the equivalence holds up to the weighting $w_t$ and $\theta$-independent constants.
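This expectation is estimated by Monte Carlo over $(t, x_0, \epsilon)$. A sketch with an illustrative stand-in for $\epsilon_\theta$ (uniform timesteps, unit weights; none of these choices are claimed to be the paper's):

```python
import numpy as np

def eps_theta(x_t, t, c):
    # Illustrative stand-in for the noise-prediction network.
    return np.zeros_like(x_t)

def diffusion_loss(x0_batch, betas, rng, w=None, c=None):
    """Monte Carlo estimate of E_{t,x0,eps}[ w_t ||eps_theta(x_t,t,c) - eps||^2 ]
    over a batch of clean data samples x0."""
    T = len(betas)
    alpha_bars = np.cumprod(1.0 - betas)
    total = 0.0
    for x0 in x0_batch:
        t = rng.integers(1, T + 1)                        # uniform t in {1..T}
        eps = rng.standard_normal(x0.shape)
        ab = alpha_bars[t - 1]
        x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps    # forward noising
        w_t = 1.0 if w is None else w[t - 1]
        total += w_t * np.sum((eps_theta(x_t, t, c) - eps) ** 2)
    return total / len(x0_batch)
```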

2.4 Objective and Optimal Policy

Define the (zero-meaned) advantage:

$$A(x_0, c) = \frac{r(x_0, c) - Z}{\beta}, \qquad\text{where}\qquad Z = \beta \log \mathbb{E}_{p_\text{ref}}\!\left[\exp\!\left(r(x_0, c)/\beta\right)\right].$$

The DDRL objective is

$$J_\text{DDRL}(\theta) = \mathbb{E}_{x_0 \sim p_\theta}\!\left[\lambda\!\left(A(x_0, c)\right)\right] - \mathrm{KL}(\tilde p_\text{data} \Vert p_\theta),$$

with $\lambda(u) = -\exp(-u)$. In practice, $\lambda$ can be replaced by the identity to match baseline scaling.

Equivalently,

$$\tilde J_\text{DDRL}(\theta) = \mathbb{E}_{x_0 \sim p_\theta}\!\left[A(x_0, c)\right] - L_\text{diff}(\theta; \tilde p_\text{data}),$$

and the optimal policy is

$$p_\theta^*(x_0 \mid c) \propto \tilde p_\text{data}(x_0 \mid c)\, \exp\!\left(r(x_0, c)/\beta\right),$$

showing that DDRL correctly recovers the KL-regularized posterior (Ye et al., 3 Dec 2025).
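The baseline $Z$ is a log-mean-exp of rewards, and one natural estimator (an assumption here; the paper's exact estimator may differ) computes it over the $N$ rollouts of a single condition. With this choice the advantages satisfy $\text{mean}_n[\exp(A^n)] = 1$ on the batch:

```python
import numpy as np

def advantages(rewards, beta):
    """Compute A^n = (r^n - Z) / beta with the log-mean-exp baseline
    Z = beta * log mean_n exp(r^n / beta), estimated over the N
    rollouts of one condition (numerically stable form)."""
    r = np.asarray(rewards, dtype=float)
    s = r / beta
    m = s.max()                                   # shift for stability
    Z = beta * (m + np.log(np.mean(np.exp(s - m))))
    return (r - Z) / beta

A = advantages([1.0, 2.0, 0.5], beta=0.5)
```

By construction the exponentiated advantages average to one, which is what makes $\lambda(u) = -\exp(-u)$ well scaled.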

3. Algorithmic Implementation

DDRL alternates between RL-style rollouts and off-policy diffusion loss regularization:

  • Draw a batch of conditions $c_1, \ldots, c_B$.
  • For each $c_i$: generate $N$ samples $x_0^n \sim p_\theta(\cdot \mid c_i)$; compute rewards $r^n$, the baseline $Z$, and advantages $A^n$.
  • Accumulate the policy-gradient (REINFORCE) loss $L_\text{RL} = -\frac{1}{N}\sum_n A^n \log p_\theta(x_0^n \mid c_i)$.
  • For each $c_i$: sample $x_0 \sim \tilde p_\text{data}(\cdot \mid c_i)$, a random timestep $t \in S$, and noise $\epsilon \sim \mathcal{N}(0, I)$; synthesize $x_t$; compute the denoising loss $\|\epsilon_\theta(x_t, t, c_i) - \epsilon\|^2$ as $L_\text{diff}$.
  • Minimize the total loss $L_\text{total} = L_\text{RL} + L_\text{diff}$ using AdamW. Only a subset $S$ of timesteps is used for efficiency.

DDRL requires only a data/noise sampler at training time, with no reference to a pretrained model; the off-policy KL term can use either real or synthetic data (Ye et al., 3 Dec 2025).
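The loop above can be sketched end to end with toy stand-ins for the model, reward, and data (no autograd; a real implementation would backpropagate the combined loss through $\epsilon_\theta$ and the trajectory log-probabilities; every name here is illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(x0):          # illustrative external reward model
    return -np.sum(x0 ** 2)

def log_p_theta(x0):     # illustrative trajectory log-prob under p_theta
    return -0.5 * np.sum(x0 ** 2)

def sample_p_theta():    # illustrative rollout returning a terminal x_0
    return rng.standard_normal(4)

def ddrl_step(data, beta=0.5, N=4, B=2):
    """One DDRL iteration: REINFORCE loss over N rollouts per condition,
    plus the off-policy denoising loss on real/synthetic data. Returns
    the two scalar losses to be combined into the total loss."""
    L_rl = L_diff = 0.0
    for _ in range(B):                          # loop over conditions c_i
        xs = [sample_p_theta() for _ in range(N)]
        r = np.array([reward(x) for x in xs])
        s = r / beta
        Z = beta * (s.max() + np.log(np.mean(np.exp(s - s.max()))))
        A = (r - Z) / beta                      # advantages
        L_rl += -np.mean([a * log_p_theta(x) for a, x in zip(A, xs)])
        x0 = data[rng.integers(len(data))]      # off-policy data sample
        eps = rng.standard_normal(x0.shape)
        ab = 0.5                                # toy alpha_bar_t
        x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
        L_diff += np.mean((np.zeros_like(x_t) - eps) ** 2)  # eps_theta := 0
    return L_rl / B, L_diff / B
```

Minimizing the REINFORCE term raises the log-probability of high-advantage rollouts, while the denoising term anchors the model to the data distribution.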

4. Hyperparameter and Practical Details

The DDRL recipe for large-scale video generation includes:

  • Diffusion timesteps: $T = 20$, $|S| = 10$ (evenly spaced).
  • RL rollout size: $N = 8$ per condition; learning rates $10^{-5}$ (2B model), $3 \times 10^{-6}$ (14B model).
  • Batch size: $B = 16$ conditions per iteration.
  • Off-policy data: real samples (from high-quality fine-tuning data) or synthetic samples (pre-generated by base model, e.g., 10k videos).
  • Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.99$).
  • Computational cost: $\sim 10^6$ GPU-hours (H100). Efficiency is aided by a single forward-diffusion pass per data point and asynchronous external reward servers.

No classifier-free guidance is used in the diffusion loss; conditioning is dropped with 20% probability during training (Ye et al., 3 Dec 2025).

5. Empirical Results in High-Resolution Video Generation

DDRL was evaluated for post-training Cosmos2.5 (2B, 14B) on mixed Text-to-Video and Image-to-Video tasks, using VideoAlign and VBench metrics together with human evaluation.

Quantitative performance:

  • DDRL improves average reward by +0.20 (VideoAlign) and +0.05 (VBench) over the base model.
  • Competing methods (DanceGRPO, FlowGRPO) improve some metrics but "hack" others (e.g., over-stylization, blurred or mismatched outputs).

Human preferences:

  • For Cosmos2.5-2B, DDRL outperforms the base model by +22.9% in human votes and increases the VideoAlign score from 0.604 to 0.715.
  • Baselines can increase reward but decrease text alignment or realism.

Qualitative effects:

  • DanceGRPO shows increased color saturation and prompt misalignment.
  • FlowGRPO exhibits blur, temporal jitter, and artifacts.
  • DDRL maintains realism/diversity, better prompt fidelity, and smooth motion (Ye et al., 3 Dec 2025).

6. Limitations and Future Directions

DDRL's effectiveness depends on the quality of $\tilde p_\text{data}$; if real data is scarce or the synthetic data distribution is mismatched, the regularization may not fully constrain $p_\theta$. RL rollout cost dominates overall compute, so sample-efficiency improvements (e.g., value networks or off-policy RL) are a potential avenue for future work. The field lacks an automatic metric for detecting reward hacking, though candidate signals include spikes in the diffusion loss, reduced output variance, or abrupt trade-offs among evaluation metrics.

Proposed extensions include:

  • Unifying SFT and RL post-training in a single-stage DDRL setup,
  • Application to other generative architectures (flows, autoregressive LLMs),
  • Incorporating classifier-free guidance with RL policy improvements at inference (Ye et al., 3 Dec 2025).

A closely related line derives reward-directed, entropy-regularized RL for continuous-time score-based diffusion models (Gao et al., 2024). This approach formulates diffusion model learning as an MDP with policies representing time-varying applied scores, targeting a reward that penalizes deviation from the true data distribution score:

$$r(t, y, a) = -\big(g(T-t)\big)^2\, \|\nabla_x \log p_{T-t}(y) - a\|^2,$$

with an entropy bonus and terminal reward. The optimal policy is Gaussian with mean incorporating both the data score and value-gradient, and a known variance. Training employs actor-critic (q-learning) with density-ratio score estimation on noisy data, facilitating a principled, model-free balance between task reward maximization and distributional fidelity, without reliance on a pretrained model. Comparison with model-based fine-tuning reveals improved robustness to poor initialization and avoidance of reward overfitting (Gao et al., 2024).
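For intuition, the score-matching reward above can be evaluated in closed form in a toy case: if the data is standard Gaussian, the marginal under the OU forward process stays $\mathcal{N}(0, I)$, so the true score is simply $-y$ at every noise level. This is only an illustration of the reward's shape, not the paper's density-ratio estimator:

```python
import numpy as np

def score_matching_reward(t, y, a, g, T):
    """r(t, y, a) = -(g(T - t))^2 * ||true_score(y) - a||^2 for data
    ~ N(0, I): the OU forward process keeps the marginal N(0, I)
    (unit variance assumed here), so the true score is -y throughout."""
    true_score = -y
    return -(g(T - t)) ** 2 * np.sum((true_score - a) ** 2)

g = lambda s: 1.0                  # toy constant diffusion coefficient
y = np.array([0.5, -1.0])
r_opt = score_matching_reward(0.3, y, -y, g, T=1.0)   # optimal action a = -y
r_bad = score_matching_reward(0.3, y, np.zeros(2), g, T=1.0)
```

The reward is maximized (at zero) exactly when the applied score matches the true score, so reward maximization and distributional fidelity coincide in this case.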


For thorough derivations, implementation details, and experimental results, see "Data-regularized Reinforcement Learning for Diffusion Models at Scale" (Ye et al., 3 Dec 2025) and "Reward-Directed Score-Based Diffusion Models via q-Learning" (Gao et al., 2024).
