Trajectory-Differentiable Reward Functions
- Trajectory-differentiable reward functions are mappings from trajectories to real-valued scores that enable direct gradient-based optimization.
- They facilitate inverse reinforcement learning, diffusion-model policy improvement, and preference-based reward learning by aligning model outputs with expert scores.
- Empirical results show significant gains in control tasks and image generation through methods like DRaFT, leveraging full trajectory backpropagation.
A trajectory-differentiable reward function is any mapping from trajectories (or sampled outputs) to real-valued scores that is differentiable with respect to the underlying trajectory or model parameters, allowing the direct computation of gradients for optimization or model steering. This property enables both extraction and optimization of reward functions within learning frameworks that treat generated data—such as state-action trajectories, denoising chains, or output sequences—as differentiable objects. Such functions have become critical in applications spanning inverse reinforcement learning (IRL), diffusion-model policy improvement, preference-based reward learning, and direct conditional generation.
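The defining property, differentiability of the score with respect to the trajectory itself, can be sketched in a few lines of PyTorch. The quadratic goal-reaching reward and the shapes below are illustrative assumptions, not a method from the cited works:

```python
import torch

def goal_reward(traj: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    # Toy trajectory-level reward: negative squared distance of every
    # state to a goal, summed over the horizon.
    return -((traj - goal) ** 2).sum()

traj = torch.randn(16, 2, requires_grad=True)   # horizon 16, 2-D states
goal = torch.tensor([1.0, 1.0])

r = goal_reward(traj, goal)
grad = torch.autograd.grad(r, traj)[0]          # d r / d tau, shape (16, 2)
```

Because `grad` has the same shape as the trajectory, it can be used directly as a steering signal for a sampler or controller.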
1. Formalization and Differentiable Reward Extraction
Let $\tau = (s_0, a_0, \ldots, s_H, a_H)$ denote a finite-horizon trajectory in state–action space. Suppose there are two pretrained diffusion models over trajectories: a base model $p^{\text{base}}$ and an expert $p^{\text{exp}}$, with score networks $s^{\text{base}}_\theta$ and $s^{\text{exp}}_\theta$ that approximate $\nabla_{\tau_t} \log p^{\text{base}}_t(\tau_t)$ and $\nabla_{\tau_t} \log p^{\text{exp}}_t(\tau_t)$ at diffusion time $t$, respectively.
A reward function $r$ is called trajectory-differentiable if one can compute $\nabla_\tau r(\tau)$ (and, via autodiff, higher-order derivatives with respect to model or reward parameters). For the relative reward extraction setting, one seeks $r$ such that

$$\nabla_\tau r(\tau) = \nabla_\tau \log p^{\text{exp}}(\tau) - \nabla_\tau \log p^{\text{base}}(\tau).$$
The canonical approach is to parameterize $r_\phi(\tau)$ (or a time-indexed reward $r_\phi(\tau_t, t)$) as a neural network and fit it by minimizing the distance between its gradient and the difference of score networks:

$$\min_\phi \; \mathbb{E}_{t,\,\tau_t} \left\| \nabla_{\tau_t} r_\phi(\tau_t, t) - \left( s^{\text{exp}}_\theta(\tau_t, t) - s^{\text{base}}_\theta(\tau_t, t) \right) \right\|^2.$$
This setup yields a trajectory-differentiable reward function whose gradients can steer the base diffusion model or inform other model-based controllers (Nuti et al., 2023).
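A minimal sketch of this gradient-matching objective, assuming a small MLP reward network and stand-in score estimates (the architecture and names are illustrative, not those of the cited implementation):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    # Time-indexed reward r_phi(x, t): concatenate state and time, map to a scalar.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1)).squeeze(-1)

def score_matching_loss(reward, x, t, s_exp, s_base):
    # || grad_x r(x, t) - (s_exp - s_base) ||^2, averaged over the batch.
    x = x.detach().requires_grad_(True)
    r = reward(x, t).sum()
    # create_graph=True keeps the gradient differentiable w.r.t. phi,
    # so the loss itself can be backpropagated into the reward parameters.
    grad_r = torch.autograd.grad(r, x, create_graph=True)[0]
    return ((grad_r - (s_exp - s_base)) ** 2).sum(-1).mean()

reward = RewardNet(dim=4)
x, t = torch.randn(8, 4), torch.rand(8, 1)
loss = score_matching_loss(reward, x, t, torch.randn(8, 4), torch.randn(8, 4))
loss.backward()   # gradients flow into the reward parameters
```

Note the second-order structure: the loss depends on $\nabla_x r_\phi$, so optimizing $\phi$ requires differentiating through a gradient, which autodiff handles via `create_graph=True`.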
2. Existence, Uniqueness, and Theoretical Foundations
Existence and uniqueness of trajectory-differentiable reward functions in the context of diffusion models (specifically, for the relative reward) are grounded in the theory of SDE drift matching and conservative vector field projection. Under Lipschitz drift and bounded diffusion noise (with $\sigma > 0$), there exists a unique vector field $v(x_t, t)$ such that the "guided" SDE, with $f^{\text{base}}$ the drift of the base model's sampling SDE,

$$\mathrm{d}x_t = \left[ f^{\text{base}}(x_t, t) + v(x_t, t) \right] \mathrm{d}t + \sigma\, \mathrm{d}W_t,$$

matches the expert model (Nuti et al., 2023).
When the score difference is not conservative, one projects it onto the set of gradients of smooth potentials in $L^2$. For any $\varepsilon > 0$, a smooth potential $\psi$ can be constructed such that $\left\| \nabla \psi - \left( s^{\text{exp}} - s^{\text{base}} \right) \right\| \le \varepsilon$, making $\psi$ an $\varepsilon$-relative reward that is both well-defined and unique up to an additive constant (Nuti et al., 2023).
3. Trajectory-Differentiable Reward Optimization Methods
Direct optimization of differentiable rewards through generative or sequential models draws on techniques that utilize backpropagation through entire trajectories, as enabled by the differentiability of the reward function with respect to trajectories—and, by the chain rule, with respect to generator (e.g., diffusion model) parameters.
- In the context of diffusion models, this paradigm is embodied in the DRaFT ("Direct Reward Fine-Tuning") family of algorithms (Clark et al., 2023). Let $J(\theta) = \mathbb{E}_{x_0 \sim p_\theta}\left[ r(x_0) \right]$ denote the expected reward under the generative model. Backpropagation yields

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_{x_0} r(x_0)\, \frac{\partial x_0}{\partial \theta} \right],$$

where the gradient is propagated back through every denoising step in DRaFT, or only through the last $K$ steps in DRaFT-$K$.
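The truncation idea can be sketched with a toy denoising loop: everything before the final $K$ steps is detached, so reward gradients only flow through a short suffix of the chain. The one-layer "denoiser" and quadratic reward below are illustrative assumptions, not the DRaFT implementation:

```python
import torch
import torch.nn as nn

denoiser = nn.Linear(2, 2)   # stand-in for a denoising network
K = 2                        # steps kept in the graph (DRaFT-K)
T = 10                       # total denoising steps

x = torch.randn(16, 2)       # initial noise
for step in range(T):
    x = x - 0.1 * denoiser(x)
    if step < T - K:
        x = x.detach()       # stop-gradient: truncate the backprop window

reward = -(x ** 2).sum()     # toy differentiable reward on the final sample
reward.backward()            # gradients reach the denoiser via the last K steps only
```

Memory cost scales with $K$ rather than $T$, which is the practical motivation for the truncated variants.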
- For reward models based on human preferences, the Trajectory Alignment Coefficient (TAC) and its differentiable surrogate, Soft-TAC, provide a differentiable loss for fitting reward models so as to maximize their agreement with human-labeled preference data. Soft-TAC relaxes the non-differentiable sign operator in Kendall's $\tau$ via a smooth transformation, directly enabling end-to-end backpropagation over trajectory pairs (Muslimani et al., 23 Jan 2026).
Loss minimization then proceeds by differentiating through all per-step contributions along each trajectory pair.
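A hedged sketch of such a relaxation: replace each pairwise `sign` in Kendall's $\tau$ with a `tanh` at temperature `temp`. The `tanh` choice and the `temp` parameter are assumptions for illustration; Soft-TAC's exact surrogate may differ in detail:

```python
import torch

def soft_kendall_tau(pred: torch.Tensor, target: torch.Tensor,
                     temp: float = 1.0) -> torch.Tensor:
    # Differentiable relaxation of Kendall's tau over all pairs (i, j):
    # sign(pred_i - pred_j) is replaced by tanh((pred_i - pred_j) / temp).
    n = pred.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    c_pred = torch.tanh((pred[i] - pred[j]) / temp)
    c_true = torch.sign(target[i] - target[j])
    return (c_pred * c_true).mean()
```

Fitting then maximizes this surrogate (i.e., minimizes its negation) over batches of ranked trajectory pairs; as `temp` decreases, the relaxation approaches the hard sign-based coefficient.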
4. Model Architectures and Practical Algorithms
Trajectory-differentiable reward functions are typically implemented as neural networks that take as input complete trajectories, partial segments (e.g., sliding windows), or denoised states at various diffusion steps. Architectural choices in published works include:
| Domain | Reward Network Architecture | Input |
|---|---|---|
| Maze2D | Small MLP | State–action windows |
| Locomotion (MuJoCo) | UNet encoder (mirroring diffusion value net) | Full trajectory segments |
| Image (Stable Diffusion) | CLIP-latent UNet encoder | Image latents |
Gradient-based alignment or backpropagation is performed via automatic differentiation frameworks (e.g., PyTorch), allowing for second-order derivatives where required. Optimization typically uses Adam with batch sizes in the range $64$–$256$ and $50$K–$100$K training steps (Nuti et al., 2023).
In practice, further tricks include:
- Uniform random sampling of diffusion times and forward-noising of data.
- Normalization of trajectory or latent inputs.
- Use of truncated or low-variance backpropagation to reduce memory and variance costs (e.g., DRaFT-K, DRaFT-LV) (Clark et al., 2023).
- Incorporation of hyperparameters like guidance scale and stop-gradient schedules.
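The first two tricks above can be combined in a single data-preparation step: sample a uniform diffusion time, then forward-noise the clean trajectory before it is fed to the reward or score networks. The linear variance-preserving schedule below is an assumption for this sketch:

```python
import torch

def forward_noise(x0: torch.Tensor):
    # Uniform random diffusion time per batch element, then forward-noising
    # under a toy linear alpha schedule (an illustrative assumption).
    t = torch.rand(x0.shape[0], 1)
    alpha = 1.0 - t
    noise = torch.randn_like(x0)
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise
    return x_t, t, noise

x0 = torch.randn(32, 4)          # batch of normalized trajectory segments
x_t, t, noise = forward_noise(x0)
```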
The full differentiability of reward models (e.g., feed-forward nets in DRaFT or Soft-TAC) permits trajectory-level optimization and classifier guidance (Nuti et al., 2023, Clark et al., 2023, Muslimani et al., 23 Jan 2026).
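Classifier-style guidance with a differentiable reward amounts to adding a scaled reward gradient to each sampling update. The toy reward, guidance scale, and additive update rule below are illustrative assumptions rather than a specific published sampler:

```python
import torch

def guided_step(x: torch.Tensor, base_update: torch.Tensor,
                reward_fn, guidance_scale: float = 1.0) -> torch.Tensor:
    # One guided sampling step: follow the base model's update, nudged
    # in the direction that increases the differentiable reward.
    x = x.detach().requires_grad_(True)
    r = reward_fn(x).sum()
    grad_r = torch.autograd.grad(r, x)[0]
    return (x + base_update + guidance_scale * grad_r).detach()

x = torch.randn(8, 2)
x_next = guided_step(x, base_update=torch.zeros(8, 2),
                     reward_fn=lambda z: -(z ** 2).sum(-1))
```

With the zero base update and the quadratic reward above, the gradient term $-2x$ pulls samples toward the reward's maximizer at the origin.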
5. Applications and Empirical Results
Trajectory-differentiable rewards enable both extraction of underlying preferences or objectives and direct fine-tuning of generative models for improved task performance:
- Inverse reward extraction: Reward gradients extracted from comparative diffusion models can accurately recover navigation goals (Maze2D: reward heatmaps peak at true goal in 78±6% of cases) and support policy retraining to recover 74% of ground-truth performance (Nuti et al., 2023).
- Locomotion and control: Steering diffusion-based policies using gradient-based reward guidance yields quantifiable improvements (e.g., Walker2D: score increases from $28.20$ to $38.40$, a relative gain of roughly 36%), outperforming discriminator baselines and generalizing across initialization (Nuti et al., 2023).
- Reward learning with preference data: Soft-TAC-optimized reward models produce more distinct, goal-aligned policies than cross-entropy (CE) alternatives in both synthetic (Lunar Lander; landing pad success: 0.72±0.16 vs. CE 0.60±0.24) and large-scale (Gran Turismo 7; aggressive/timid driving styles) scenarios (Muslimani et al., 23 Jan 2026).
- Image generation: Differentiable reward fine-tuning (DRaFT) yields marked improvements in aesthetic and safety rewards for large-scale diffusion-based generators, with empirical superiority in both query efficiency and final reward compared to RL-style baselines (Clark et al., 2023, Nuti et al., 2023).
6. Algorithmic and Theoretical Connections
Trajectory-differentiable reward methods unify and generalize both RL and generative modeling fine-tuning approaches. The DRaFT family (DRaFT, DRaFT-K, DRaFT-LV) and related algorithms can be interpreted by the placement of stop-gradient operations in the sampling loop, spanning:
- Full untruncated direct backpropagation (no stop-gradients)
- Truncated (most recent $K$-step) backpropagation for computational efficiency
- Per-pair gradient matching or preference-based surrogates (Soft-TAC)
Deterministic Policy Gradient (DPG) and standard RL approaches can, in principle, be adapted to these differentiable setups, but have been shown to be empirically less efficient than the direct methods when reward gradients are available (Clark et al., 2023). In the context of IRL, projecting score differences onto conservative fields ensures theoretical correctness of the extracted reward (Nuti et al., 2023). Preference-based optimization via Soft-TAC directly targets Kendall's $\tau$, promoting robustness and symmetric label treatment compared to the Bradley–Terry CE loss (Muslimani et al., 23 Jan 2026).
7. Open Directions and Limitations
Current research on trajectory-differentiable reward functions identifies several notable limitations and open directions:
- The use of simple linear reward models in preference-based learning (Soft-TAC) highlights the need to extend to deep, non-linear networks (Muslimani et al., 23 Jan 2026).
- Preference elicitation remains largely fixed and offline; online or active preference-query strategies for continuous improvement of reward models are not yet standard (Muslimani et al., 23 Jan 2026).
- Theoretical guarantees exist for existence/uniqueness only under specific smoothness and support conditions; handling of non-conservative or highly noisy gradients remains partially addressed (Nuti et al., 2023).
- Tuning the sensitivity parameter (e.g., in Soft-TAC) and developing robust approaches to non-transitive or noisy preference data represent ongoing challenges (Muslimani et al., 23 Jan 2026).
Nonetheless, trajectory-differentiable reward functions have proven to be a foundational tool for model-based learning, model steering, and inverse reward inference across a range of modalities and domains.