Trajectory-Differentiable Reward Functions
- Trajectory-differentiable reward functions are mappings from trajectories to real-valued scores that enable direct gradient-based optimization.
- They facilitate inverse reinforcement learning, diffusion-model policy improvement, and preference-based reward learning by aligning model outputs with expert scores.
- Empirical results show significant gains in control tasks and image generation through methods like DRaFT, leveraging full trajectory backpropagation.
A trajectory-differentiable reward function is any mapping from trajectories (or sampled outputs) to real-valued scores that is differentiable with respect to the underlying trajectory or model parameters, allowing the direct computation of gradients for optimization or model steering. This property enables both extraction and optimization of reward functions within learning frameworks that treat generated data—such as state-action trajectories, denoising chains, or output sequences—as differentiable objects. Such functions have become critical in applications spanning inverse reinforcement learning (IRL), diffusion-model policy improvement, preference-based reward learning, and direct conditional generation.
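The defining property, differentiability of the score with respect to the trajectory itself, can be sketched in a few lines of PyTorch. The quadratic goal-reaching reward and the shapes below are illustrative assumptions, not a method from the cited works:

```python
import torch

def goal_reward(traj: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
    # Toy trajectory-level reward: negative squared distance of every
    # state to a goal, summed over the horizon.
    return -((traj - goal) ** 2).sum()

traj = torch.randn(16, 2, requires_grad=True)   # horizon 16, 2-D states
goal = torch.tensor([1.0, 1.0])

r = goal_reward(traj, goal)
grad = torch.autograd.grad(r, traj)[0]          # d r / d tau, shape (16, 2)
```

Because `grad` has the same shape as the trajectory, it can be used directly as a steering signal for a sampler or controller.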
1. Formalization and Differentiable Reward Extraction
Let $\tau = (s_0, a_0, \ldots, s_H, a_H)$ denote a finite-horizon trajectory in state–action space. Suppose there are two pretrained diffusion models over trajectories: a base model $p^{\text{base}}$ and an expert $p^{\text{exp}}$, with score networks $s^{\text{base}}_\theta$ and $s^{\text{exp}}_\theta$ that approximate $\nabla_{\tau_t} \log p^{\text{base}}_t(\tau_t)$ and $\nabla_{\tau_t} \log p^{\text{exp}}_t(\tau_t)$ at diffusion time $t$, respectively.
A reward function $r$ is called trajectory-differentiable if one can compute $\nabla_\tau r(\tau)$ (and, via autodiff, higher-order derivatives with respect to model or reward parameters). For the relative reward extraction setting, one seeks $r$ such that

$$\nabla_\tau r(\tau) = \nabla_\tau \log p^{\text{exp}}(\tau) - \nabla_\tau \log p^{\text{base}}(\tau).$$
The canonical approach is to parameterize $r_\phi(\tau)$ (or a time-indexed reward $r_\phi(\tau_t, t)$) as a neural network and fit it by minimizing the distance between its gradient and the difference of score networks:

$$\min_\phi \; \mathbb{E}_{t,\,\tau_t} \left\| \nabla_{\tau_t} r_\phi(\tau_t, t) - \left( s^{\text{exp}}_\theta(\tau_t, t) - s^{\text{base}}_\theta(\tau_t, t) \right) \right\|^2.$$
This setup yields a trajectory-differentiable reward function whose gradients can steer the base diffusion model or inform other model-based controllers (Nuti et al., 2023).
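A minimal sketch of this gradient-matching objective, assuming a small MLP reward network and stand-in score estimates (the architecture and names are illustrative, not those of the cited implementation):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    # Time-indexed reward r_phi(x, t): concatenate state and time, map to a scalar.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1)).squeeze(-1)

def score_matching_loss(reward, x, t, s_exp, s_base):
    # || grad_x r(x, t) - (s_exp - s_base) ||^2, averaged over the batch.
    x = x.detach().requires_grad_(True)
    r = reward(x, t).sum()
    # create_graph=True keeps the gradient differentiable w.r.t. phi,
    # so the loss itself can be backpropagated into the reward parameters.
    grad_r = torch.autograd.grad(r, x, create_graph=True)[0]
    return ((grad_r - (s_exp - s_base)) ** 2).sum(-1).mean()

reward = RewardNet(dim=4)
x, t = torch.randn(8, 4), torch.rand(8, 1)
loss = score_matching_loss(reward, x, t, torch.randn(8, 4), torch.randn(8, 4))
loss.backward()   # gradients flow into the reward parameters
```

Note the second-order structure: the loss depends on $\nabla_x r_\phi$, so optimizing $\phi$ requires differentiating through a gradient, which autodiff handles via `create_graph=True`.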
2. Existence, Uniqueness, and Theoretical Foundations
Existence and uniqueness of trajectory-differentiable reward functions in the context of diffusion models (specifically, for the relative reward) are grounded in the theory of SDE drift matching and conservative vector field projection. Under Lipschitz drift and bounded diffusion noise (with $\sigma > 0$), there exists a unique vector field $v(x_t, t)$ such that the "guided" SDE, with $f^{\text{base}}$ the drift of the base model's sampling SDE,

$$\mathrm{d}x_t = \left[ f^{\text{base}}(x_t, t) + v(x_t, t) \right] \mathrm{d}t + \sigma\, \mathrm{d}W_t,$$

matches the expert model (Nuti et al., 2023).
When the score difference is not conservative, one projects it onto the set of gradients of smooth potentials in $L^2$. For any $\varepsilon > 0$, a smooth potential $\psi$ can be constructed such that $\left\| \nabla \psi - \left( s^{\text{exp}} - s^{\text{base}} \right) \right\| \le \varepsilon$, making $\psi$ an $\varepsilon$-relative reward that is both well-defined and unique up to an additive constant (Nuti et al., 2023).
3. Trajectory-Differentiable Reward Optimization Methods
Direct optimization of differentiable rewards through generative or sequential models draws on techniques that utilize backpropagation through entire trajectories, as enabled by the differentiability of the reward function with respect to trajectories—and, by the chain rule, with respect to generator (e.g., diffusion model) parameters.
- In the context of diffusion models, this paradigm is embodied in the DRaFT ("Direct Reward Fine-Tuning") family of algorithms (Clark et al., 2023). Let $J(\theta) = \mathbb{E}_{x_0 \sim p_\theta}\left[ r(x_0) \right]$ denote the expected reward under the generative model. Backpropagation yields

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_{x_0} r(x_0)\, \frac{\partial x_0}{\partial \theta} \right],$$

where the gradient is propagated back through every denoising step in DRaFT, or only through the last $K$ steps in DRaFT-$K$.
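The truncation idea can be sketched with a toy denoising loop: everything before the final $K$ steps is detached, so reward gradients only flow through a short suffix of the chain. The one-layer "denoiser" and quadratic reward below are illustrative assumptions, not the DRaFT implementation:

```python
import torch
import torch.nn as nn

denoiser = nn.Linear(2, 2)   # stand-in for a denoising network
K = 2                        # steps kept in the graph (DRaFT-K)
T = 10                       # total denoising steps

x = torch.randn(16, 2)       # initial noise
for step in range(T):
    x = x - 0.1 * denoiser(x)
    if step < T - K:
        x = x.detach()       # stop-gradient: truncate the backprop window

reward = -(x ** 2).sum()     # toy differentiable reward on the final sample
reward.backward()            # gradients reach the denoiser via the last K steps only
```

Memory cost scales with $K$ rather than $T$, which is the practical motivation for the truncated variants.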
- For reward models based on human preferences, the Trajectory Alignment Coefficient (TAC) and its differentiable surrogate, Soft-TAC, provide a differentiable loss for fitting reward models so as to maximize their agreement with human-labeled preference data. Soft-TAC relaxes the non-differentiable sign operator in Kendall's $\tau$ via a smooth transformation, directly enabling end-to-end backpropagation over trajectory pairs (Muslimani et al., 23 Jan 2026).
Loss minimization then proceeds by differentiating through all per-step contributions along each trajectory pair.
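A hedged sketch of such a relaxation: replace each pairwise `sign` in Kendall's $\tau$ with a `tanh` at temperature `temp`. The `tanh` choice and the `temp` parameter are assumptions for illustration; Soft-TAC's exact surrogate may differ in detail:

```python
import torch

def soft_kendall_tau(pred: torch.Tensor, target: torch.Tensor,
                     temp: float = 1.0) -> torch.Tensor:
    # Differentiable relaxation of Kendall's tau over all pairs (i, j):
    # sign(pred_i - pred_j) is replaced by tanh((pred_i - pred_j) / temp).
    n = pred.shape[0]
    i, j = torch.triu_indices(n, n, offset=1)
    c_pred = torch.tanh((pred[i] - pred[j]) / temp)
    c_true = torch.sign(target[i] - target[j])
    return (c_pred * c_true).mean()
```

Fitting then maximizes this surrogate (i.e., minimizes its negation) over batches of ranked trajectory pairs; as `temp` decreases, the relaxation approaches the hard sign-based coefficient.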
4. Model Architectures and Practical Algorithms
Trajectory-differentiable reward functions are typically implemented as neural networks that take as input complete trajectories, partial segments (e.g., sliding windows), or denoised states at various diffusion steps. Architectural choices in published works include:
| Domain | Reward Network Architecture | Input |
|---|---|---|
| Maze2D | Small MLP | State–action windows |
| Locomotion (MuJoCo) | UNet encoder (mirroring diffusion value net) | Full trajectory segments |
| Image (Stable Diffusion) | CLIP-latent UNet encoder | Image latents |
Gradient-based alignment or backpropagation is performed via automatic differentiation frameworks (e.g., PyTorch), allowing for second-order derivatives where required. Optimization typically uses Adam with batch sizes in the range $64$–$256$ and $50$K–$100$K training steps (Nuti et al., 2023).
In practice, further tricks include:
- Uniform random sampling of diffusion times and forward-noising of data.
- Normalization of trajectory or latent inputs.
- Use of truncated or low-variance backpropagation to reduce memory and variance costs (e.g., DRaFT-K, DRaFT-LV) (Clark et al., 2023).
- Incorporation of hyperparameters like guidance scale and stop-gradient schedules.
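The first two tricks above can be combined in a single data-preparation step: sample a uniform diffusion time, then forward-noise the clean trajectory before it is fed to the reward or score networks. The linear variance-preserving schedule below is an assumption for this sketch:

```python
import torch

def forward_noise(x0: torch.Tensor):
    # Uniform random diffusion time per batch element, then forward-noising
    # under a toy linear alpha schedule (an illustrative assumption).
    t = torch.rand(x0.shape[0], 1)
    alpha = 1.0 - t
    noise = torch.randn_like(x0)
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise
    return x_t, t, noise

x0 = torch.randn(32, 4)          # batch of normalized trajectory segments
x_t, t, noise = forward_noise(x0)
```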
The full differentiability of reward models (e.g., feed-forward nets in DRaFT or Soft-TAC) permits trajectory-level optimization and classifier guidance (Nuti et al., 2023, Clark et al., 2023, Muslimani et al., 23 Jan 2026).
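Classifier-style guidance with a differentiable reward amounts to adding a scaled reward gradient to each sampling update. The toy reward, guidance scale, and additive update rule below are illustrative assumptions rather than a specific published sampler:

```python
import torch

def guided_step(x: torch.Tensor, base_update: torch.Tensor,
                reward_fn, guidance_scale: float = 1.0) -> torch.Tensor:
    # One guided sampling step: follow the base model's update, nudged
    # in the direction that increases the differentiable reward.
    x = x.detach().requires_grad_(True)
    r = reward_fn(x).sum()
    grad_r = torch.autograd.grad(r, x)[0]
    return (x + base_update + guidance_scale * grad_r).detach()

x = torch.randn(8, 2)
x_next = guided_step(x, base_update=torch.zeros(8, 2),
                     reward_fn=lambda z: -(z ** 2).sum(-1))
```

With the zero base update and the quadratic reward above, the gradient term $-2x$ pulls samples toward the reward's maximizer at the origin.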
5. Applications and Empirical Results
Trajectory-differentiable rewards enable both extraction of underlying preferences or objectives and direct fine-tuning of generative models for improved task performance:
- Inverse reward extraction: Reward gradients extracted from comparative diffusion models can accurately recover navigation goals (Maze2D: reward heatmaps peak at true goal in 78±6% of cases) and support policy retraining to recover 74% of ground-truth performance (Nuti et al., 2023).
- Locomotion and control: Steering diffusion-based policies using gradient-based reward guidance yields quantifiable improvements (e.g., Walker2D: score increases from $28.20$ to $38.40$, a relative gain of roughly 36%), outperforming discriminator baselines and generalizing across initialization (Nuti et al., 2023).
- Reward learning with preference data: Soft-TAC-optimized reward models produce more distinct, goal-aligned policies than cross-entropy (CE) alternatives in both synthetic (Lunar Lander; landing pad success: 0.72±0.16 vs. CE 0.60±0.24) and large-scale (Gran Turismo 7; aggressive/timid driving styles) scenarios (Muslimani et al., 23 Jan 2026).
- Image generation: Differentiable reward fine-tuning (DRaFT) yields marked improvements in aesthetic and safety rewards for large-scale diffusion-based generators, with empirical superiority in both query efficiency and final reward compared to RL-style baselines (Clark et al., 2023, Nuti et al., 2023).
6. Algorithmic and Theoretical Connections
Trajectory-differentiable reward methods unify and generalize both RL and generative modeling fine-tuning approaches. The DRaFT family (DRaFT, DRaFT-K, DRaFT-LV) and related algorithms can be interpreted by the placement of stop-gradient operations in the sampling loop, spanning:
- Full untruncated direct backpropagation (no stop-gradients)
- Truncated (most recent $K$-step) backpropagation for computational efficiency
- Per-pair gradient matching or preference-based surrogates (Soft-TAC)
Deterministic Policy Gradient (DPG) and standard RL approaches can, in principle, be adapted to these differentiable setups, but have been shown to be empirically less efficient than the direct methods when reward gradients are available (Clark et al., 2023). In the context of IRL, projecting score differences onto conservative fields ensures theoretical correctness of the extracted reward (Nuti et al., 2023). Preference-based optimization via Soft-TAC directly targets Kendall's $\tau$, promoting robustness and symmetric label treatment compared to the Bradley–Terry CE loss (Muslimani et al., 23 Jan 2026).
7. Open Directions and Limitations
Current research on trajectory-differentiable reward functions identifies several notable limitations and open directions:
- The use of simple linear reward models in preference-based learning (Soft-TAC) highlights the need to extend to deep, non-linear networks (Muslimani et al., 23 Jan 2026).
- Preference elicitation remains largely fixed and offline; online or active preference-query strategies for continuous improvement of reward models are not yet standard (Muslimani et al., 23 Jan 2026).
- Theoretical guarantees exist for existence/uniqueness only under specific smoothness and support conditions; handling of non-conservative or highly noisy gradients remains partially addressed (Nuti et al., 2023).
- Tuning the sensitivity parameter (e.g., in Soft-TAC) and developing robust approaches to non-transitive or noisy preference data represent ongoing challenges (Muslimani et al., 23 Jan 2026).
Nonetheless, trajectory-differentiable reward functions have proven to be a foundational tool for model-based learning, model steering, and inverse reward inference across a range of modalities and domains.