
Reward-Guided Diffusion as Stochastic Control

Updated 23 December 2025
  • The paper reformulates reward-guided diffusion as an optimal drift control problem in reversed SDEs, integrating reward maximization with energy regularization.
  • It derives closed-form controllers using the Hamilton–Jacobi–Bellman equation, ensuring principled solutions across continuous and discrete diffusion schemes.
  • The framework unifies diverse techniques such as Doob’s h-transform, classifier guidance, reinforcement learning, and fine-tuning to enhance control over generative models.

Reward-guided diffusion as stochastic optimal control (SOC) is the rigorous reformulation of reward-seeking inference or fine-tuning procedures in diffusion models as stochastic control problems. This framework interprets the addition of reward alignment or endpoint constraints to diffusion generation as the imposition of an optimally chosen, state- and time-dependent drift control on the underlying (reversed) stochastic differential equation (SDE), subject to both reward-maximization and energy/entropy regularization constraints. SOC-based diffusion systematically unifies previously disparate heuristics—including Doob's $h$-transform, classifier guidance, value-based resampling, reinforcement learning, and fine-tuning—under the principles of dynamic programming and the Hamilton–Jacobi–Bellman (HJB) equation, providing closed-form solutions and principled algorithms for both continuous- and discrete-time diffusion models.

1. Mathematical Formulation: Diffusion as a Stochastic Control Problem

Diffusion generative models are typically constructed by reversing a known SDE,

dx_t = f(x_t, t)\,dt + g(t)\,dW_t,

with $x_0 \sim p_{\mathrm{prior}}$, where $f$ is the drift function, $g$ is the (scalar or matrix-valued) diffusion coefficient, and $W_t$ is a standard Wiener process on $\mathbb{R}^d$ (Uehara et al., 16 Jan 2025). In reward-guided or endpoint-constrained settings, the goal is to drive the generated trajectory toward a high-reward or endpoint structure $x_T^\star$ by modifying the reverse-time SDE through an additional control drift $u(x_t, t)$:

dx_t = f(x_t, t)\,dt + u(x_t, t)\,dt + g(t)\,dW_t.

The control objective is the maximization of a reward or terminal constraint, regularized by a quadratic energy term that penalizes aggressive departures from the prior dynamics. For reward-guided generation:

J(u) = \mathbb{E}\left[ R(x_T) - \frac{1}{2} \int_0^T \|u(x_t, t)\|^2 \, dt \right],

where $R(x_T)$ is the terminal reward, which may encode classifier confidence, structural similarity, or a user-chosen metric (Uehara et al., 2024, Uehara et al., 16 Jan 2025, Zhu et al., 9 Feb 2025).
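To make the objective concrete, the sketch below estimates $J(u)$ by Euler–Maruyama simulation of the controlled SDE for a one-dimensional toy problem. The prior drift $f \equiv 0$, noise scale $g \equiv 1$, reward $R(x) = -(x-1)^2$, and the linear feedback control are illustrative assumptions, not choices taken from the cited papers:

```python
import numpy as np

def estimate_J(u_fn, R, n_paths=20_000, n_steps=100, T=1.0, seed=0):
    """Monte-Carlo estimate of J(u) = E[R(x_T) - 0.5 * int |u|^2 dt]
    for dx = u(x, t) dt + dW (prior drift f = 0, g = 1, x_0 = 0)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    energy = np.zeros(n_paths)       # running 0.5 * int |u|^2 dt per path
    for k in range(n_steps):
        u = u_fn(x, k * dt)
        energy += 0.5 * u**2 * dt
        x += u * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    return (R(x) - energy).mean()

R = lambda x: -(x - 1.0) ** 2                           # reward peaked at x = 1
J_off = estimate_J(lambda x, t: np.zeros_like(x), R)    # uncontrolled dynamics
J_on = estimate_J(lambda x, t: 2.0 * (1.0 - x), R)      # naive pull toward 1
```

Even this hand-picked feedback control improves the regularized objective over the uncontrolled SDE (roughly $-1.1$ versus $-2$ here), illustrating the reward-versus-energy tradeoff that the optimal $u^*$ resolves exactly.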

For endpoint-constrained ("diffusion bridge") problems, the functional becomes:

J(u; \gamma) = \mathbb{E}\left[ \int_0^T \frac{1}{2}\|u_t\|^2 \, dt + \gamma \, \|x_T^u - x_T^\star\|^2 \right],

with $\gamma > 0$ quantifying the tradeoff between control energy and endpoint accuracy (Zhu et al., 9 Feb 2025).

2. Hamilton–Jacobi–Bellman Equation and Closed-Form Controllers

The infinitesimal dynamic programming principle yields the HJB partial differential equation:

-\partial_t V(x, t) = \min_u \left\{ \frac{1}{2}\|u\|^2 + \langle \nabla_x V, f(x, t) + u \rangle + \frac{1}{2} g(t)^2 \Delta_x V \right\},

with terminal condition $V(x, T) = R(x)$ (or the terminal penalty) (Zhu et al., 9 Feb 2025, Uehara et al., 16 Jan 2025, Uehara et al., 2024). The unique minimizer is

u^*(x, t) = -\nabla_x V(x, t),

consistent with the additive control above; when the control instead enters through the diffusion coefficient, as $dx_t = f\,dt + g(t)\,u\,dt + g(t)\,dW_t$, the minimizer becomes $u^*(x, t) = -g(t)^{\mathsf{T}} \nabla_x V(x, t)$.

In the linear-quadratic case relevant to Gaussian bridges or time-homogeneous noise schedules, this yields a closed-form controller—see Theorem 4.1 of (Zhu et al., 9 Feb 2025)—parameterized by the terminal weight $\gamma$, where the controller smoothly interpolates between uncontrolled dynamics ($\gamma \to 0$) and the hard Doob $h$-transform bridge ($\gamma \to \infty$).
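As a concrete instance, take the one-dimensional case with $f \equiv 0$, $g \equiv 1$, and terminal penalty $\gamma (x_T - x^\star)^2$. A quadratic ansatz $V(x, t) = \tfrac{1}{2} a(t) (x - x^\star)^2 + c(t)$ reduces the HJB to the Riccati equation $a' = a^2$ with $a(T) = 2\gamma$, giving $a(t) = 2\gamma / (1 + 2\gamma (T - t))$. The sketch below implements this controller; it is an illustrative special case, not the paper's Theorem 4.1 verbatim:

```python
import numpy as np

def bridge_gain(t, T, gamma):
    # Riccati solution a(t) = 2*gamma / (1 + 2*gamma*(T - t)), with a(T) = 2*gamma
    return 2.0 * gamma / (1.0 + 2.0 * gamma * (T - t))

def u_star(x, t, T, gamma, x_star):
    # optimal drift u*(x, t) = -a(t) * (x - x_star)
    return -bridge_gain(t, T, gamma) * (x - x_star)
```

As $\gamma \to \infty$ the drift approaches $(x^\star - x)/(T - t)$, the Doob $h$-transform (Brownian bridge) drift, while $\gamma \to 0$ recovers the uncontrolled dynamics.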

For reward-guided generation, the only change is the terminal condition $V(x, T) = R(x)$. The optimal controlled SDE samples from a law whose marginals are

p_t^*(x) \propto e^{V(x, t)/\alpha} \, p_{\mathrm{data}, t}(x),

with $\alpha$ an entropy-regularization parameter (Uehara et al., 2024).
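The tilted law can be visualized without solving any SDE: given draws from the base distribution, self-normalized importance resampling with weights $\propto e^{R(x)/\alpha}$ produces approximate samples from the tilted marginal. The Gaussian base and quadratic reward below are illustrative assumptions:

```python
import numpy as np

def tilted_resample(samples, reward, alpha, rng):
    """Resample base draws with weights proportional to exp(R(x)/alpha)."""
    logw = reward(samples) / alpha
    w = np.exp(logw - logw.max())    # stabilize in log space before normalizing
    w /= w.sum()
    idx = rng.choice(len(samples), size=len(samples), p=w)
    return samples[idx]

rng = np.random.default_rng(0)
base = rng.normal(size=100_000)            # stand-in for p_data
reward = lambda x: -(x - 2.0) ** 2         # reward peaked at x = 2
tilted = tilted_resample(base, reward, alpha=1.0, rng=rng)
```

For this Gaussian base with quadratic reward the tilted law is again Gaussian with mean $4/3$: the samples shift toward the reward peak but remain anchored by the base density.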

3. Algorithmic Instantiations: Guidance, Fine-Tuning, and Bridge Construction

Reward-guided diffusion as SOC underpins a diverse range of both inference-time and fine-tuning algorithms.

  • Guidance scheduling: In classifier- or reward-guided generation, a scalar or vector-valued guidance weight $w_t$ multiplies the reward gradient $\nabla_x r(x)$ and modulates the drift (Azangulov et al., 25 May 2025). Rather than using heuristic schedules, SOC prescribes $w^*(t, x)$ via the state- and time-dependent solution of the HJB (Azangulov et al., 25 May 2025).
  • Diffusion bridges (UniDB framework): Endpoint conditioning or image translation via diffusion bridges is recovered as the infinite-penalty ($\gamma \to \infty$) limit of the SOC problem, yielding the Doob $h$-transform. The UniDB framework generalizes this, realizing improved detail/hardness tradeoffs by tuning $\gamma$ (Zhu et al., 9 Feb 2025).
  • Reward-guided model fine-tuning: Fine-tuning a pretrained diffusion model to maximize rewards is posed as entropy-regularized SOC (Uehara et al., 2024, Keramati et al., 2 Aug 2025). The optimal control is a learned drift $u^*(t, x) \propto \nabla_x V(x, t)$, with the value function $V$ parameterized by a neural network; training minimizes the negative reward penalized by energy (Uehara et al., 2024). Actor–critic, value-based sampling, and policy-gradient methods (e.g., q-learning (Gao et al., 2024), weighted denoising cross-entropy (Tang et al., 29 Sep 2025)) instantiate this principle in both continuous and discrete settings.
  • Trajectory optimal control for reward-guided editing: Image editing, viewed as controlled SDE trajectory optimization, is solved via Pontryagin's Minimum Principle and iterative adjoint/gradient-based updates, achieving a balance between fidelity and user-specified reward (Chang et al., 30 Sep 2025).
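The adjoint-based iterative updates in the last bullet can be illustrated on a deterministic discrete-time toy problem: dynamics $x_{k+1} = x_k + u_k \, \Delta t$, cost $\tfrac{1}{2} \sum_k u_k^2 \, \Delta t + (x_N - 1)^2$, and costate $\lambda = 2(x_N - 1)$ propagated backward (constant here because the dynamics are state-independent). All specifics below are toy assumptions for illustration, not the algorithm of Chang et al.:

```python
import numpy as np

def optimize_control(n_steps=50, T=1.0, iters=500, lr=1.0):
    """Gradient descent on an open-loop control via forward/backward passes."""
    dt = T / n_steps
    u = np.zeros(n_steps)
    x = 0.0
    for _ in range(iters):
        # forward pass: roll out the deterministic dynamics from x_0 = 0
        x = 0.0
        for k in range(n_steps):
            x += u[k] * dt
        # backward pass: costate is 2*(x_N - 1) at every step here,
        # since dx_N/dx_k = 1 for these linear dynamics
        lam = 2.0 * (x - 1.0)
        grad = (u + lam) * dt        # dJ/du_k = u_k * dt + lam * dt
        u -= lr * grad
    return u, x

u, x_final = optimize_control()
```

The optimum is the straight-line control $u \equiv 2/3$ ending at $x_N = 2/3$, balancing terminal accuracy against control energy.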

4. Theoretical Guarantees and the Diversity–Reward Tradeoff

SOC-based formulations yield rigorous guarantees:

  • Marginal matching: Under optimal control, the terminal marginal satisfies $p_T^*(x) \propto \exp(R(x)/\alpha) \, p_{\mathrm{data}}(x)$, with support containment and tail behavior controlled by the reward weight $\alpha$ (Uehara et al., 2024, Azangulov et al., 25 May 2025).
  • Monotonicity and support preservation: For classifier-guided SDEs, any nonnegative guidance schedule strictly increases terminal confidence and, under mild boundedness assumptions, preserves support, via martingale arguments (Azangulov et al., 25 May 2025).
  • Detail–fidelity tradeoff: For bridges, infinite penalty ($\gamma \to \infty$) strictly enforces the endpoint at the cost of smoothed detail; finite $\gamma$ yields strictly improved cost and more realistic samples (Zhu et al., 9 Feb 2025).
  • Soft-optimal policies: Entropy-regularized objectives interpolate between reward maximization (low $\alpha$, collapsing onto the argmax of the reward under the data prior) and data realism (large $\alpha$) (Uehara et al., 2024).
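The interpolation in $\alpha$ is easy to verify on a discrete distribution by tilting $p^*(x) \propto e^{R(x)/\alpha} \, p_{\mathrm{data}}(x)$ directly; the three-point distribution and reward values below are made up for illustration:

```python
import numpy as np

def tilt(p_data, rewards, alpha):
    """Exponentially tilted distribution p*(x) ∝ exp(R(x)/alpha) p_data(x)."""
    logits = rewards / alpha + np.log(p_data)
    w = np.exp(logits - logits.max())   # stabilize before normalizing
    return w / w.sum()

p = np.array([0.5, 0.3, 0.2])       # toy data distribution
R = np.array([0.0, 1.0, 2.0])       # toy rewards

greedy = tilt(p, R, alpha=0.01)     # collapses onto the reward argmax
faithful = tilt(p, R, alpha=1e6)    # recovers p_data
```

Small $\alpha$ places essentially all mass on the highest-reward outcome, while large $\alpha$ leaves the data distribution untouched.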

5. Implementation and Loss Function Taxonomy

SOC for diffusion provides a taxonomy of algorithmic losses:

  • Adjoint matching: Pathwise, variance-minimizing, and unbiased loss that efficiently computes the SOC gradient via backward adjoint ODE, yielding zero-variance at the optimum in high-dimensional stochastic systems (Domingo-Enrich, 2024).
  • Path-integral/likelihood-ratio: Standard REINFORCE or control cost, unbiased but high-variance due to reliance on importance weights.
  • Value and Q-learning based: Policy evaluation and improvement using Bellman (or soft-Bellman) recursions for continuous/discrete Markov processes (Gao et al., 2024, Keramati et al., 2 Aug 2025, Tang et al., 29 Sep 2025).
  • Weighted denoising and cross-entropy: Importance-weighted (exponential reward) noise-matching or token cross-entropy losses for both continuous and discrete diffusion (Zhang et al., 2023, Keramati et al., 2 Aug 2025, Tang et al., 29 Sep 2025).

Recommended practice is to use adjoint-matching losses for scalability and variance reduction (Domingo-Enrich, 2024).
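A minimal sketch of the importance-weighted (exponential-reward) noise-matching loss from the last category, using self-normalized weights; the array shapes, batch, and reward values are illustrative assumptions:

```python
import numpy as np

def weighted_denoising_loss(eps_pred, eps_true, rewards, alpha):
    """Per-sample MSE on predicted noise, weighted by exp(R/alpha), normalized."""
    logw = rewards / alpha
    w = np.exp(logw - logw.max())
    w /= w.sum()
    per_sample = ((eps_pred - eps_true) ** 2).mean(axis=-1)
    return float(w @ per_sample)

# two samples with per-sample squared errors 1.0 and 4.0
eps_true = np.zeros((2, 4))
eps_pred = np.array([[1.0] * 4, [2.0] * 4])
uniform = weighted_denoising_loss(eps_pred, eps_true, np.array([0.0, 0.0]), 1.0)
skewed = weighted_denoising_loss(eps_pred, eps_true, np.array([0.0, 100.0]), 1.0)
```

With equal rewards the loss reduces to the unweighted mean (2.5 here); a dominant reward concentrates the loss on that sample's denoising error (≈ 4.0).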

6. Applications and Empirical Outcomes

Reward-guided SOC frameworks have been empirically validated for a variety of tasks:

  • Image translation and restoration: UniDB yields enhanced structure preservation and detail in image-restoration bridges with minimal code modifications (Zhu et al., 9 Feb 2025).
  • Reward-directed design: SOC-guided diffusion achieves >25% reduction in resistance (ship hull) and >10% lift-to-drag ratio improvement (airfoil) relative to training-set maxima (Keramati et al., 2 Aug 2025).
  • Molecule and 3D shape generation: RGDM improves over 3D-DDPM, with up to 60% relative gain in MMD and improved coverage, generating samples with higher reward-aligned metrics (Zhang et al., 2023).
  • Image editing: Trajectory-optimal control matches or exceeds inversion- and guidance-based baselines across multiple domains, balancing reward alignment and input fidelity without reward hacking (Chang et al., 30 Sep 2025).
  • Discrete biological sequences: MCTS-based trajectory replay for discrete diffusion yields Pareto-optimal tradeoffs in multi-objective peptide design (Tang et al., 29 Sep 2025).
Domain | SOC-based method | Empirical gain
--- | --- | ---
Image restoration | UniDB bridge (Zhu et al., 9 Feb 2025) | Sharper outputs, detail–fidelity balance
3D shape/molecule | RGDM (Zhang et al., 2023) | Higher fidelity, reward-aligned samples
Design optimization | Reward-directed DDPM (Keramati et al., 2 Aug 2025) | Large improvements in engineering metrics
Sequence generation | TR2-D2 (Tang et al., 29 Sep 2025) | Hypervolume dominance in multi-objective regimes
Image editing | Trajectory OC (Chang et al., 30 Sep 2025) | Best reward–fidelity tradeoff, avoids reward hacking

7. Connection to Classical Stochastic Control and Future Directions

Reward-guided diffusion as SOC is deeply rooted in well-established stochastic control theory, connecting with classical notions such as the linear Bellman equation, desirability functions, path-integral control, exponential tilting, and Doob's $h$-transform for bridge processes (Ha et al., 2016, Zhu et al., 9 Feb 2025). The emergence of multiscale planning via diffusion wavelets (Ha et al., 2016) and fine-grained control via soft-value Bellman recursions (Keramati et al., 2 Aug 2025) reflects a convergence of control theory, reinforcement learning, and deep generative modeling.

Ongoing directions include efficient solutions to the HJB in high dimensions (e.g., policy-gradient with adjoint-state variance reduction), integration of non-differentiable or black-box rewards with replay and search-based fine-tuning (Tang et al., 29 Sep 2025), and theoretical analysis of reward–diversity tradeoffs in the overoptimization regime (Uehara et al., 2024).
