Q-learning with Adjoint Matching (QAM)
- The paper introduces QAM, which leverages adjoint matching to optimize flow-matching policies while avoiding the numerical instability of multi-step backpropagation.
- It employs a lean adjoint technique that preserves first-order critic gradients, achieving unbiased, behavior-constrained optimal policies in continuous-action domains.
- QAM demonstrates state-of-the-art performance on offline and offline-to-online RL benchmarks, outperforming traditional methods in stability and sample efficiency.
Q-learning with Adjoint Matching (QAM) is a temporal-difference-based reinforcement learning method addressing the challenge of optimizing expressive diffusion or flow-matching policies with respect to a parameterized Q-function in continuous-action domains. QAM incorporates adjoint matching, a technique originally from generative modeling, to preserve first-order critic information for policy improvement while entirely circumventing the numerical instability associated with backpropagation through multi-step denoising processes inherent in flow or diffusion policy classes. In tandem with standard TD backup for critic learning, QAM yields provably unbiased, behavior-constrained optimal policies and demonstrates state-of-the-art empirical performance in offline and offline-to-online RL benchmarks (Li et al., 20 Jan 2026).
1. Formal Problem Setup
QAM operates within a continuous-action Markov decision process (MDP) specified by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{A} \subseteq \mathbb{R}^d$ is continuous and $\gamma \in [0, 1)$ is the discount factor. The primary objective in offline RL is to learn a policy $\pi$ maximizing the expected discounted return, given a static dataset $\mathcal{D} = \{(s, a, r, s')\}$.
Q-learning employs a parameterized critic $Q_\phi(s, a)$, trained to minimize the standard TD error
$$\mathcal{L}_Q(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D},\; a' \sim \pi_\theta(\cdot \mid s')}\big[(Q_\phi(s,a) - y)^2\big], \qquad y = r + \gamma\, Q_{\bar\phi}(s', a'),$$
where $\bar\phi$ tracks $\phi$ via a Polyak average.
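The TD target and Polyak target-tracking above can be sketched in a few lines; a minimal sketch assuming scalar stand-ins for the critic values, with `GAMMA` and `TAU` as illustrative values rather than the paper's settings:

```python
# Minimal sketch of the critic-side TD machinery; GAMMA and TAU are
# illustrative constants, not taken from the paper.
GAMMA, TAU = 0.99, 0.005

def td_target(r, q_target_next, gamma=GAMMA):
    # y = r + gamma * Q_target(s', a'), with a' sampled from the current policy
    return r + gamma * q_target_next

def polyak_update(phi_bar, phi, tau=TAU):
    # Target parameters slowly track the online parameters:
    # phi_bar <- (1 - tau) * phi_bar + tau * phi
    return (1.0 - tau) * phi_bar + tau * phi
```

In practice `phi_bar` and `phi` are full parameter vectors and the Polyak update is applied elementwise after each critic step.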
Departing from conventional Gaussian policies, QAM utilizes a flow-matching policy parameterized by a time-indexed velocity field $v_\theta(t, a^t \mid s)$. This defines a memoryless SDE
$$da^t = b_\theta(t, a^t \mid s)\,dt + \sigma(t)\,dW_t, \qquad a^0 \sim \mathcal{N}(0, I),$$
whose drift $b_\theta$ is determined by $v_\theta$ and the noise schedule $\sigma(t)$. Sampling the endpoint $a^1$ (via ODE integration of $\tfrac{da^t}{dt} = v_\theta(t, a^t \mid s)$) yields actions from $\pi_\theta(\cdot \mid s)$. The base (behavior) policy is learned with standard flow matching:
$$\mathcal{L}_{\mathrm{FM}}(\bar\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; (s, a^1) \sim \mathcal{D},\; a^0 \sim \mathcal{N}(0, I)}\big\|v_{\bar\theta}(t, a^t \mid s) - (a^1 - a^0)\big\|^2, \qquad a^t = (1 - t)\,a^0 + t\,a^1.$$
QAM aims to extract the optimal policy constrained by behavior,
$$\pi^*(a \mid s) \propto \pi_{\bar\theta}(a \mid s)\,\exp\big(\alpha\, Q(s, a)\big),$$
by fine-tuning $v_\theta$ without unstable backpropagation through the SDE.
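A minimal sketch of the two flow-matching ingredients above, assuming the standard linear-interpolation path (the network, the conditioning on $s$, and the noise schedule are omitted; all names are illustrative):

```python
import numpy as np

def fm_target(a0, a1):
    # Conditional flow-matching regression target for the linear path
    # a^t = (1 - t) a0 + t a1: the conditional velocity is simply a1 - a0.
    return a1 - a0

def euler_sample(v, a0, n_steps=10):
    # Integrate the probability-flow ODE da/dt = v(t, a) from t = 0 to t = 1.
    a, dt = a0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        a = a + dt * v(k * dt, a)
    return a

# Degenerate check: if the field equals the (constant) conditional velocity
# for a single (a0, a1) pair, the Euler rollout lands exactly on a1.
a0, a1 = np.zeros(2), np.array([1.0, -2.0])
endpoint = euler_sample(lambda t, a: fm_target(a0, a1), a0)
```

A trained $v_{\bar\theta}$ averages these conditional targets over the dataset, so its ODE endpoint distribution approximates the behavior policy rather than a single action.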
2. Derivation of the Adjoint-Matching Objective
2.1 Critic Gradients and BPTT Instability
Naive actor-critic approaches hill-climb the Q-function via the policy gradient $\nabla_\theta\, \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi_\theta(\cdot \mid s)}\big[Q_\phi(s, a)\big]$. For multi-step flow or diffusion policies, this requires backpropagation through all intermediate ODE/SDE steps and the associated noise, leading to compounding of ill-conditioned Jacobians and significant numerical instability.
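A toy illustration (not from the paper) of the compounding effect: the end-to-end Jacobian of a $K$-step rollout is a product of $K$ per-step Jacobians of the form $I + \Delta t\, \nabla_a v$, so any mismatch between expanding and contracting directions grows geometrically with depth:

```python
import numpy as np

# End-to-end Jacobian of K Euler steps with a fixed per-step Jacobian that
# has one mildly expanding and one mildly contracting direction. Its
# condition number grows geometrically: (1.1 / 0.9)**50, roughly 2e4.
K = 50
per_step = np.diag([1.1, 0.9])           # stand-in for I + dt * dv/da
J = np.linalg.matrix_power(per_step, K)  # Jacobian of the full rollout
cond = np.linalg.cond(J)
```

Even this mild per-step anisotropy yields a four-orders-of-magnitude conditioning gap after fifty steps, which is the instability that QAM's adjoint construction is designed to avoid.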
2.2 From Stochastic Optimal Control to Adjoint Matching
Stochastic Optimal Control (SOC) Formulation:
Optimal flow-based policy extraction can be cast as minimizing the SOC loss
$$\mathcal{L}_{\mathrm{SOC}}(\theta) = \mathbb{E}\left[\int_0^1 \tfrac{1}{2}\big\|u_\theta(t, a^t \mid s)\big\|^2\,dt \;-\; \alpha\, Q_\phi(s, a^1)\right],$$
where $u_\theta$ is the control (the rescaled residual between $v_\theta$ and the fixed base field $v_{\bar\theta}$) and trajectories evolve under the memoryless SDE. Naive differentiation with respect to $\theta$ entails gradient backpropagation through the entire stochastic trajectory.
Basic Adjoint Matching:
Define the adjoint (co-state) $\tilde{a}(t)$ via the backward ODE
$$\frac{d\tilde{a}(t)}{dt} = -\big(\nabla_a b_\theta(t, a^t \mid s)\big)^{\top}\tilde{a}(t), \qquad \tilde{a}(1) = -\alpha\,\nabla_{a^1} Q_\phi(s, a^1),$$
where $b_\theta$ is the SDE drift. Domingo-Enrich et al. (2025) established the gradient equivalence $\nabla_\theta \mathcal{L}_{\mathrm{SOC}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{AM}}(\theta)$, with
$$\mathcal{L}_{\mathrm{AM}}(\theta) = \mathbb{E}\left[\int_0^1 \big\|u_\theta(t, a^t \mid s) + \sigma(t)\,\tilde{a}(t)\big\|^2\,dt\right],$$
where the trajectory and adjoint are held fixed (stop-gradient) in the regression.
Lean Adjoint Matching (QAM):
At optimality ($\theta = \theta^*$), the control-dependent term vanishes in the adjoint ODE, permitting a "lean" adjoint computed purely through the fixed base velocity field $v_{\bar\theta}$:
$$\frac{d\tilde{a}(t)}{dt} = -\big(\nabla_a v_{\bar\theta}(t, a^t \mid s)\big)^{\top}\tilde{a}(t), \qquad \tilde{a}(1) = -\alpha\,\nabla_{a^1} Q_\phi(s, a^1).$$
The resulting QAM objective circumvents unstable backpropagation through $v_\theta$:
$$\mathcal{L}_{\mathrm{QAM}}(\theta) = \mathbb{E}\left[\int_0^1 \big\|u_\theta(t, a^t \mid s) + \sigma(t)\,\tilde{a}(t)\big\|^2\,dt\right],$$
with $u_\theta$ the rescaled residual between $v_\theta$ and $v_{\bar\theta}$, and with stop-gradients on the trajectory and adjoint. This construction exactly targets the desired behavior-constrained optimal policy.
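The lean adjoint recursion can be sketched under stated simplifying assumptions: the base velocity is a known linear field $v(t, a) = A a$, so its Jacobian is just $A$, and `grad_q` stands in for $\alpha\,\nabla_{a^1} Q_\phi$ at the endpoint (all names illustrative):

```python
import numpy as np

# Lean adjoint sketch: the base field is a fixed skew-symmetric (rotation)
# linear map, so its Jacobian A is constant along the trajectory.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])

def lean_adjoint(grad_q, n_steps=100):
    # Backward ODE d a~/dt = -(dv/da)^T a~ with terminal a~(1) = -grad_q,
    # stepped with explicit Euler from t = 1 down to t = 0.
    dt = 1.0 / n_steps
    a_tilde = -grad_q
    for _ in range(n_steps):
        a_tilde = a_tilde + dt * (A.T @ a_tilde)
    return a_tilde

# For a skew-symmetric Jacobian the adjoint is rotated rather than scaled,
# so the backward pass (approximately) preserves its norm.
adj0 = lean_adjoint(np.array([1.0, 0.0]))
```

Because only the frozen base Jacobian enters the recursion, no gradient ever flows back through the trainable field, which is the source of QAM's stability.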
3. Algorithmic Workflow
QAM alternates between critic TD-updates and policy updates using adjoint matching.
- Critic update:
- Sample batch $(s, a, r, s') \sim \mathcal{D}$
- Compute $a' \sim \pi_\theta(\cdot \mid s')$
- Calculate target $y \gets r + \gamma\,\big[Q_\text{mean}(s', a') - \rho\, Q_\text{std}(s', a')\big]$
- Update $\phi$ to minimize the squared Bellman error $(Q_\phi(s, a) - y)^2$
- Update target $\bar\phi \gets (1 - \tau)\,\bar\phi + \tau\,\phi$
- Policy update:
- Forward: Sample $s \sim \mathcal{D}$, $a^0 \sim \mathcal{N}(0, I)$; roll out $\{a^t\}$ through the SDE under $v_\theta$ (no gradient tracking)
- Adjoint: Initialize $\tilde{a}(1) \gets -\alpha\,\nabla_{a^1} Q_\phi(s, a^1)$; backward step $\tilde{a}(t - \Delta t) \gets \tilde{a}(t) + \Delta t\,\big(\nabla_a v_{\bar\theta}(t, a^t \mid s)\big)^{\top}\tilde{a}(t)$
- Calculate the lean-adjoint regression targets as above
- Update $\theta$ to minimize $\mathcal{L}_{\mathrm{QAM}}$
Convergence is reached when the adjoint-matching residual vanishes, ensuring that $v_\theta$ induces the behavior-constrained optimal policy without gradient bias while retaining full flow expressivity.
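The alternating workflow above can be sketched end-to-end. This is a toy, assumption-laden illustration rather than the paper's implementation: linear velocity fields $W a$, a constant noise scale `SIGMA`, a quadratic critic, and explicit Euler discretization, with every name (`qam_update`, `q_grad`, etc.) chosen here for exposition:

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA, ALPHA, N_STEPS = 0.5, 1.0, 20
DT = 1.0 / N_STEPS
W_BASE = -0.5 * np.eye(2)  # frozen base velocity field v_base(a) = W_BASE @ a

def q_grad(a, a_star=np.array([1.0, 0.0])):
    # alpha * dQ/da for a toy quadratic critic Q(a) = -||a - a_star||^2 / 2.
    return ALPHA * (a_star - a)

def qam_update(W, lr=0.05):
    # 1) Forward: roll out the SDE under v_theta (no gradient tracking).
    a = rng.standard_normal(2)
    traj = [a.copy()]
    for _ in range(N_STEPS):
        a = a + DT * (W @ a) + SIGMA * np.sqrt(DT) * rng.standard_normal(2)
        traj.append(a.copy())
    # 2) Backward: lean adjoint through the fixed base Jacobian only.
    a_tilde = -q_grad(traj[-1])
    adj = [a_tilde.copy()]
    for _ in range(N_STEPS):
        a_tilde = a_tilde + DT * (W_BASE.T @ a_tilde)
        adj.append(a_tilde.copy())
    adj = adj[::-1]  # adj[k] now pairs with traj[k]
    # 3) Regress: one gradient step on 0.5 * ||u + SIGMA * adj||^2 over the
    #    trajectory, where u = (v_theta - v_base) / SIGMA is the residual control.
    grad_W = np.zeros_like(W)
    for k in range(N_STEPS):
        u = (W - W_BASE) @ traj[k] / SIGMA
        resid = u + SIGMA * adj[k]
        grad_W += np.outer(resid, traj[k]) / (SIGMA * N_STEPS)
    return W - lr * grad_W

W_new = qam_update(W_BASE.copy())
```

Note that step 3 touches only recorded states and adjoints, so no computation graph through the SDE rollout is ever needed, mirroring the stability argument above.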
4. Theoretical Analysis
Unique Optimality and Unbiasedness:
Proposition 4.1 (extension of Domingo-Enrich et al., 2025) asserts that the QAM objective admits a unique minimizer $\theta^*$ such that $\pi_{\theta^*}(a \mid s) \propto \pi_{\bar\theta}(a \mid s)\,\exp\big(\alpha\, Q_\phi(s, a)\big)$, and in the limit (with sufficient function-approximation capacity) QAM recovers the exact behavior-constrained optimum.
The "lean adjoint" propagates the Q-function's boundary gradient backward according to the fixed $v_{\bar\theta}$, rigorously linking the residual flow velocity to the SOC solution; the unbiasedness of the adjoint-matching gradient for the SOC objective follows directly. Variance is controlled by the SDE discretization and the sampling noise in the forward rollout; a moderate number of integration steps is effective in practice.
5. Empirical Results and Comparative Evaluation
QAM was benchmarked on OGBench—ten long-horizon, sparse-reward environments (antmaze, humanoidmaze, cube, puzzle, scene)—against 17 baseline algorithms spanning Gaussian policies (ReBRAC), backprop-through-time flows (FBRAC, BAM), one-step distillations (FQL), advantage-weighted flows (FAWAC), gradient guidance methods (QSM, DAC, CGQL variants), and post-hoc editors (DSRL, FEdit, IFQL).
- Offline RL (1M updates):
QAM, with its inverse-temperature hyperparameter tuned per domain, attained a 44/50 normalized score across tasks, outperforming the nearest baseline (40). Discarding first-order critic gradients (FAWAC) and backpropagating through time (BAM) both scored markedly lower, demonstrating the performance and stability benefit of the lean adjoint technique. Reducing flows to one step (FQL) markedly degraded performance on multi-modal tasks. Guided and post-processed methods were consistently outperformed.
- Offline-to-online fine-tuning (0.5M environment steps):
The QAM-EDIT variant, introducing a minor relaxation of the behavior constraint, realized the fastest score improvements and highest asymptotic returns while being more sample-efficient than all considered baselines, including RLPD (behavior-unconstrained), QSM, and FQL.
- Ablation and sensitivity analyses:
Comparing basic adjoint matching (BAM) against the lean adjoint (QAM) revealed higher stability and performance for QAM. The inverse temperature is the most critical hyperparameter; results are robust to moderate numbers of flow steps and to gradient clipping. In noise and data-stitching robustness evaluations, QAM maintained near-optimal performance, unlike most baselines, which degraded significantly.
6. Related Methodologies and Context
QAM subsumes and generalizes prior approaches for flow-based policy optimization in offline RL. Existing methods are limited either by their exclusion of first-order critic information, reliance on actor-critic updates susceptible to gradient bias or numerical instability (due to backpropagation through multiple ODE/SDE integration steps), or by constraining policy expressivity to simplify gradients. QAM, by leveraging adjoint matching, uniquely preserves full generative policy expressivity and unbiased policy gradients while entirely sidestepping backpropagation through noisy, multi-step diffusion processes. This positions QAM as a principled policy-extraction method for complex action distributions in RL, with guaranteed convergence to the optimal behavior-constrained distribution (Li et al., 20 Jan 2026).