
Trajectory-Differentiable Reward Functions

Updated 5 February 2026
  • Trajectory-differentiable reward functions are mappings from trajectories to real-valued scores that enable direct gradient-based optimization.
  • They facilitate inverse reinforcement learning, diffusion-model policy improvement, and preference-based reward learning by allowing reward gradients to be aligned with expert score functions or human preference data.
  • Empirical results show significant gains in control tasks and image generation through methods like DRaFT, leveraging full trajectory backpropagation.

A trajectory-differentiable reward function is any mapping from trajectories (or sampled outputs) to real-valued scores that is differentiable with respect to the underlying trajectory or model parameters, allowing the direct computation of gradients for optimization or model steering. This property enables both extraction and optimization of reward functions within learning frameworks that treat generated data—such as state-action trajectories, denoising chains, or output sequences—as differentiable objects. Such functions have become critical in applications spanning inverse reinforcement learning (IRL), diffusion-model policy improvement, preference-based reward learning, and direct conditional generation.

1. Formalization and Differentiable Reward Extraction

Let $\tau=(s_0,a_0,\ldots,s_T,a_T)$ denote a finite-horizon trajectory in state–action space. Suppose there are two pretrained diffusion models over trajectories: a base $p_{\text{low}}(\tau)$ and an expert $p_{\text{high}}(\tau)$, with score networks $s_\phi(\tau,t)$ and $s_\Theta(\tau,t)$ that approximate $\nabla_\tau \log p_{\text{low}}$ and $\nabla_\tau \log p_{\text{high}}$ at time $t$, respectively.

A reward function $R(\tau)$ is called trajectory-differentiable if one can compute $\nabla_\tau R(\tau)$ (and, via autodiff, higher-order derivatives with respect to model or reward parameters). For the relative reward extraction setting, one seeks $R$ such that

$$\nabla_\tau R(\tau) \approx s_\Theta(\tau, t) - s_\phi(\tau, t).$$

The canonical approach is to parameterize $R$ (or a time-indexed reward $\rho_\phi(x, t)$) as a neural network and fit it by minimizing the $L^2$ distance between its gradient and the difference of score networks:

$$L(\phi) = \mathbb{E}_{t \sim U(0,T),\ x_t \sim p_t}\left\|\nabla_x \rho_\phi(x_t, t) - [s_\Theta(x_t, t) - s_\phi(x_t, t)]\right\|_2^2.$$

This setup yields a trajectory-differentiable reward function whose gradients can steer the base diffusion model or inform other model-based controllers (Nuti et al., 2023).
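
A minimal PyTorch sketch of this gradient-matching objective is given below. It is an illustration rather than the released implementation of Nuti et al. (2023); the interfaces `reward_net`, `score_expert`, and `score_base` are assumed placeholders for the time-indexed reward and the two score networks.

```python
import torch

def reward_matching_loss(reward_net, score_expert, score_base, x_t, t):
    """L2 loss between grad_x rho_phi(x_t, t) and the score difference (sketch)."""
    x_t = x_t.detach().requires_grad_(True)
    # Sum the scalar rewards over the batch; autograd then returns per-sample gradients.
    rho = reward_net(x_t, t).sum()
    # create_graph=True keeps the graph so the loss can be backpropagated
    # into the parameters of reward_net (a second-order derivative).
    grad_rho = torch.autograd.grad(rho, x_t, create_graph=True)[0]
    with torch.no_grad():
        target = score_expert(x_t, t) - score_base(x_t, t)
    return ((grad_rho - target) ** 2).flatten(1).sum(dim=1).mean()
```

A training loop would then sample $t$ uniformly, forward-noise clean trajectories to obtain $x_t$, and minimize this loss with a standard optimizer such as Adam.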

2. Existence, Uniqueness, and Theoretical Foundations

Existence and uniqueness of trajectory-differentiable reward functions in the context of diffusion models (specifically, for the relative reward) are grounded in the theory of SDE drift matching and conservative vector field projection. Under Lipschitz drift and bounded diffusion noise (with $g(0)>0$), there exists a unique vector field $h(x,t)=f^2(x,t)-f^1(x,t)$ such that the “guided” SDE matches the expert model (Nuti et al., 2023):

$$dx_t = [f^1(x_t, t) + h(x_t, t)]\, dt + g(t)\, dw_t.$$

When the score difference is not conservative, one projects onto the set of gradients of smooth potentials in $L^2(\mathbb{R}^n, \mathbb{R}^n)$. For any $\epsilon>0$, a smooth potential $\Phi_\epsilon(x, t)$ can be constructed such that $\int \|\nabla_x \Phi_\epsilon - h_t\|^2 < \epsilon$, making an $\epsilon$-relative reward both well-defined and unique up to an additive constant (Nuti et al., 2023).
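
Once a relative reward has been fit, its gradient can stand in for the guidance drift $h$ in the guided SDE above. The Euler–Maruyama step below is a hedged sketch under that assumption; the names `f_base`, `g`, `reward_net`, and the guidance scale `omega` are illustrative interfaces, not identifiers from the cited papers.

```python
import torch

def guided_euler_step(x, t, dt, f_base, g, reward_net, omega=1.0):
    """One Euler-Maruyama step of dx = [f1 + h] dt + g dw, with h taken to be
    the gradient of the fitted relative reward (sketch)."""
    x = x.detach().requires_grad_(True)
    rho = reward_net(x, t).sum()
    h = torch.autograd.grad(rho, x)[0]          # plays the role of h(x, t)
    drift = f_base(x, t) + omega * h
    noise = torch.randn_like(x)
    return (x + drift * dt + g(t) * noise * (dt ** 0.5)).detach()
```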

3. Trajectory-Differentiable Reward Optimization Methods

Direct optimization of differentiable rewards through generative or sequential models relies on backpropagation through entire trajectories, which is possible because the reward is differentiable with respect to the trajectory and, by the chain rule, with respect to the parameters of the generator (e.g., a diffusion model).

  • In the context of diffusion models, this paradigm is embodied in the DRaFT (“Direct Reward Fine-Tuning”) family of algorithms (Clark et al., 2023). Let $J(\theta) = \mathbb{E}_{c,x_T}[r(x_0, c)]$ denote the expected reward under the generative model. Backpropagation yields

$$\nabla_\theta J(\theta) = \mathbb{E}_{c,x_T}\left[\nabla_{x_0} r(x_0, c) \cdot \frac{\partial x_0}{\partial \theta}\right],$$

where the gradient with respect to $x_0$ is backpropagated through every denoising step (full DRaFT) or through only the last $K$ steps (DRaFT-$K$).

  • For reward models based on human preferences, the Trajectory Alignment Coefficient (TAC) and its differentiable surrogate, Soft-TAC, provide a differentiable loss for fitting reward models so as to maximize their agreement with human-labeled preference data. Soft-TAC relaxes the non-differentiable sign operator in Kendall’s $\tau$ via a $\tanh$ transformation, directly enabling end-to-end backpropagation over trajectory pairs (Muslimani et al., 23 Jan 2026):

$$\tilde{\sigma}_{\mathrm{TAC},\alpha}(\mathcal{D}_h; R) = \mathbb{E}_{(\tau_i,\tau_j,y_{ij}) \sim \mathcal{D}_h}\left[ y_{ij} \tanh\big(\alpha\,[G_R(\tau_i) - G_R(\tau_j)]\big) \right]$$

Loss minimization then proceeds by differentiating through all per-step contributions along each trajectory pair.
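
A surrogate of this form is straightforward to express in PyTorch. The sketch below assumes returns $G_R(\tau)$ are obtained by summing a per-step reward model over each trajectory; the tensor shapes and helper names are illustrative assumptions, not code from Muslimani et al.

```python
import torch

def soft_tac_loss(reward_net, traj_i, traj_j, labels, alpha=1.0):
    """Negative Soft-TAC objective over a batch of labelled trajectory pairs (sketch).

    traj_i, traj_j: (batch, T, feat) per-step features of each trajectory (assumed shape)
    labels:         y_ij in {-1, +1}; +1 means tau_i is preferred over tau_j
    alpha:          sensitivity of the tanh relaxation of the sign operator
    """
    # Return G_R(tau): sum of per-step predicted rewards along the trajectory.
    g_i = reward_net(traj_i).squeeze(-1).sum(dim=1)
    g_j = reward_net(traj_j).squeeze(-1).sum(dim=1)
    agreement = labels * torch.tanh(alpha * (g_i - g_j))
    return -agreement.mean()   # maximizing agreement = minimizing its negative
```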

4. Model Architectures and Practical Algorithms

Trajectory-differentiable reward functions are typically implemented as neural networks that take as input complete trajectories, partial segments (e.g., sliding windows), or denoised states at various diffusion steps. Architectural choices in published works include:

| Domain | Reward Network Architecture | Input |
|---|---|---|
| Maze2D | Small MLP | State–action windows |
| Locomotion (MuJoCo) | UNet encoder (mirroring diffusion value net) | Full trajectory segments |
| Image (Stable Diffusion) | CLIP-latent UNet encoder | Image latents |

Gradient-based alignment or backpropagation is performed via automatic differentiation frameworks (e.g., PyTorch), allowing for second-order derivatives where required. Optimization typically uses Adam with batch sizes in the range $64$–$256$, learning rates of $10^{-4}$ to $5 \times 10^{-5}$, and $50$K–$100$K steps (Nuti et al., 2023).
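
As a purely illustrative instantiation of the Maze2D row in the table above, a small MLP reward network over state–action windows with the reported optimizer settings might look as follows; the window length, state and action dimensions, and hidden width are assumptions, not values from the cited papers.

```python
import torch
import torch.nn as nn

WINDOW, STATE_DIM, ACTION_DIM = 8, 4, 2          # illustrative Maze2D-like sizes

class WindowRewardNet(nn.Module):
    """Small MLP reward over a flattened state-action window plus a diffusion time."""
    def __init__(self, hidden=256):
        super().__init__()
        in_dim = WINDOW * (STATE_DIM + ACTION_DIM) + 1   # +1 for the time input
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, window, t):
        # window: (batch, WINDOW, STATE_DIM + ACTION_DIM); t: (batch,)
        flat = window.flatten(1)
        return self.net(torch.cat([flat, t[:, None].float()], dim=-1))

reward_net = WindowRewardNet()
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-4)  # lr in the reported range
```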

In practice, further tricks include:

  • Uniform random sampling of diffusion times and forward-noising of data.
  • Normalization of trajectory or latent inputs.
  • Use of truncated or low-variance backpropagation to reduce memory and variance costs (e.g., DRaFT-K, DRaFT-LV) (Clark et al., 2023).
  • Incorporation of hyperparameters such as the guidance scale $\omega$ and stop-gradient schedules.

The full differentiability of reward models (e.g., feed-forward nets in DRaFT or Soft-TAC) permits trajectory-level optimization and classifier guidance (Nuti et al., 2023, Clark et al., 2023, Muslimani et al., 23 Jan 2026).

5. Applications and Empirical Results

Trajectory-differentiable rewards enable both extraction of underlying preferences or objectives and direct fine-tuning of generative models for improved task performance:

  • Inverse reward extraction: Reward gradients extracted from comparative diffusion models can accurately recover navigation goals (Maze2D: reward heatmaps peak at the true goal in 78±6% of cases) and support policy retraining to recover approximately 74% of ground-truth performance (Nuti et al., 2023).
  • Locomotion and control: Steering diffusion-based policies using gradient-based reward guidance yields quantifiable improvements (e.g., Walker2D: score increases from $28.20$ to $38.40$, a 36% relative gain), outperforming discriminator baselines and generalizing across initializations (Nuti et al., 2023).
  • Reward learning with preference data: Soft-TAC-optimized reward models produce more distinct, goal-aligned policies than cross-entropy (CE) alternatives in both synthetic (Lunar Lander; landing pad success: 0.72±0.16 vs. CE 0.60±0.24) and large-scale (Gran Turismo 7; aggressive/timid driving styles) scenarios (Muslimani et al., 23 Jan 2026).
  • Image generation: Differentiable reward fine-tuning (DRaFT) yields marked improvements in aesthetic and safety rewards for large-scale diffusion-based generators, with empirical superiority in both query efficiency and final reward compared to RL-style baselines (Clark et al., 2023, Nuti et al., 2023).

6. Algorithmic and Theoretical Connections

Trajectory-differentiable reward methods unify and generalize both RL and generative-model fine-tuning approaches. The DRaFT family (DRaFT, DRaFT-K, DRaFT-LV) and related algorithms can be characterized by the placement of stop-gradient operations in the sampling loop, spanning:

  • Full untruncated direct backpropagation (no stop-gradients)
  • Truncated (most recent $K$-step) backpropagation for computational efficiency (see the sketch after this list)
  • Per-pair gradient matching or preference-based surrogates (Soft-TAC)
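
The placement of stop-gradients can be made concrete with a short sketch: the sampling loop runs without gradients until the last $K$ denoising steps, after which the differentiable reward is backpropagated into the sampler parameters. The `denoise_step` and `reward_fn` interfaces below are assumptions for illustration, not the DRaFT reference code.

```python
import torch

def draft_k_loss(denoise_step, reward_fn, x_T, cond, timesteps, k):
    """DRaFT-K style objective: backpropagate the reward through only the last k steps (sketch).

    denoise_step(x, t, cond): one step of the parameterized sampler     (assumed interface)
    reward_fn(x0, cond):      differentiable reward on the final sample (assumed interface)
    """
    x = x_T
    cutoff = len(timesteps) - k
    for i, t in enumerate(timesteps):
        if i < cutoff:
            with torch.no_grad():             # stop-gradient: early steps treated as constants
                x = denoise_step(x, t, cond)
        else:
            x = denoise_step(x, t, cond)      # last k steps remain on the autodiff tape
    return -reward_fn(x, cond).mean()         # maximizing reward = minimizing the negative
```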

Deterministic Policy Gradient (DPG) and standard RL approaches can, in principle, be adapted to these differentiable setups, but they have been shown to be empirically less efficient than the direct methods when reward gradients are available (Clark et al., 2023). In the context of IRL, projecting score differences onto conservative fields ensures the theoretical correctness of the extracted reward (Nuti et al., 2023). Preference-based optimization via Soft-TAC directly targets Kendall’s $\tau$, promoting robustness and symmetric treatment of labels compared to the Bradley–Terry CE loss (Muslimani et al., 23 Jan 2026).

7. Open Directions and Limitations

Current research on trajectory-differentiable reward functions identifies several notable limitations and open directions:

  • The use of simple linear reward models in preference-based learning (Soft-TAC) highlights the need to extend to deep, non-linear networks (Muslimani et al., 23 Jan 2026).
  • Preference elicitation remains largely fixed and offline; online or active preference-query strategies for continuous improvement of reward models are not yet standard (Muslimani et al., 23 Jan 2026).
  • Theoretical guarantees for existence and uniqueness hold only under specific smoothness and support conditions; the handling of non-conservative or highly noisy gradients remains only partially addressed (Nuti et al., 2023).
  • Tuning the sensitivity parameter (e.g., $\alpha$ in Soft-TAC) and developing robust approaches to non-transitive or noisy preference data represent ongoing challenges (Muslimani et al., 23 Jan 2026).

Nonetheless, trajectory-differentiable reward functions have proven to be a foundational tool for model-based learning, model steering, and inverse reward inference across a range of modalities and domains.
