
Q-learning with Adjoint Matching (QAM)

Updated 21 January 2026
  • The paper introduces QAM, which leverages adjoint matching to optimize flow-matching policies while avoiding the numerical instability of multi-step backpropagation.
  • It employs a lean adjoint technique that preserves first-order critic gradients, achieving unbiased, behavior-constrained optimal policies in continuous-action domains.
  • QAM demonstrates state-of-the-art performance on offline and offline-to-online RL benchmarks, outperforming traditional methods in stability and sample efficiency.

Q-learning with Adjoint Matching (QAM) is a temporal-difference-based reinforcement learning method addressing the challenge of optimizing expressive diffusion or flow-matching policies with respect to a parameterized Q-function in continuous-action domains. QAM incorporates adjoint matching, a technique originally from generative modeling, to preserve first-order critic information for policy improvement while entirely circumventing the numerical instability associated with backpropagation through multi-step denoising processes inherent in flow or diffusion policy classes. In tandem with standard TD backup for critic learning, QAM yields provably unbiased, behavior-constrained optimal policies and demonstrates state-of-the-art empirical performance in offline and offline-to-online RL benchmarks (Li et al., 20 Jan 2026).

1. Formal Problem Setup

QAM operates within a continuous-action Markov decision process (MDP) specified by $(\mathcal{S}, \mathcal{A}, P, \gamma, R, \mu)$, where $\mathcal{A} \subseteq \mathbb{R}^d$ and $\gamma \in [0,1)$ is the discount factor. The primary objective in offline RL is to learn a policy $\pi(a|s)$ maximizing the expected return, given a static dataset $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}$.

Q-learning employs a parameterized critic $Q_\phi(s,a)$, trained to minimize the standard TD error:
$$L(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigl[(Q_\phi(s, a) - (r + \gamma Q_{\bar{\phi}}(s', a')))^2\bigr]$$
where $a' \sim \pi_\theta(\cdot | s')$ and $\bar{\phi}$ tracks $\phi$ via a Polyak average.
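As a concrete sanity check, the critic objective above can be sketched in a few lines of numpy. The linear critic and batch layout below are illustrative assumptions, not the paper's architecture; the key detail is that the target uses the Polyak-averaged parameters and is treated as a constant.

```python
import numpy as np

def q(phi, s, a):
    # Toy linear critic Q_phi(s, a) = phi . [s; a], standing in for a neural net.
    return float(phi @ np.concatenate([s, a]))

def td_loss(phi, phi_bar, batch, next_actions, gamma=0.99):
    # Mean squared TD error over a batch. Targets use the Polyak-averaged
    # parameters phi_bar and next actions a' sampled from the current policy;
    # no gradient flows through the target.
    err = 0.0
    for (s, a, r, s2), a2 in zip(batch, next_actions):
        target = r + gamma * q(phi_bar, s2, a2)
        err += (q(phi, s, a) - target) ** 2
    return err / len(batch)
```

With both parameter vectors at zero and a reward of 1, the loss is exactly the squared reward, which makes the estimator easy to unit-test.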

Departing from conventional Gaussian policies, QAM utilizes a flow-matching policy $\pi_\theta(a|s)$ parameterized by a time-indexed velocity field $f_\theta: \mathcal{S} \times \mathcal{A} \times [0,1] \rightarrow \mathcal{A}$. This defines a memoryless SDE:
$$da_t = \Bigl(2 f_\theta(s, a_t, t) - \frac{a_t}{t}\Bigr)\, dt + \sqrt{\frac{2(1-t)}{t}}\, dB_t, \quad a_0 \sim \mathcal{N}(0, I),\; a_1 \sim \text{data}$$
Integrating from the noise sample $a_0$ to the endpoint $a_1$ (via ODE integration) yields actions from $\pi_\theta(a|s)$. The base (behavior) policy $\pi_\beta$ is learned with standard flow matching:
$$L_{FM}(\beta) = \mathbb{E}_{(s, a) \sim \mathcal{D},\, t \sim U[0,1],\, z \sim \mathcal{N}} \bigl[\| f_\beta(s, (1-t)z + ta, t) - (a - z) \|^2\bigr]$$
QAM aims to extract the optimal policy constrained by behavior:
$$\pi^*(a|s) \propto \pi_\beta(a|s) \exp[\tau Q_\phi(s,a)]$$
by fine-tuning $f_\theta$ without unstable backpropagation through the SDE.
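The flow-matching regression for the behavior policy is simple enough to sketch directly. The snippet below is a minimal Monte Carlo estimate of $L_{FM}$ under illustrative assumptions (a plain numpy velocity-field callable, uniform sampling over the dataset); it is not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(f_beta, s, actions, n_samples=256):
    # Monte Carlo estimate of L_FM: draw (a, t, z), build the interpolant
    # a_t = (1 - t) z + t a, and regress the velocity field onto the
    # straight-line target (a - z).
    d = actions.shape[1]
    total = 0.0
    for _ in range(n_samples):
        a = actions[rng.integers(len(actions))]
        t = rng.uniform()                    # t in [0, 1)
        z = rng.standard_normal(d)
        a_t = (1 - t) * z + t * a
        total += np.sum((f_beta(s, a_t, t) - (a - z)) ** 2)
    return total / n_samples
```

As a sanity check, for a dataset containing a single action $a_0$, the field $f(s, a_t, t) = (a_0 - a_t)/(1 - t)$ reproduces the target $a_0 - z$ exactly and drives the loss to zero.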

2. Derivation of the Adjoint-Matching Objective

2.1 Critic Gradients and BPTT Instability

Naive actor-critic approaches hill-climb the Q-function via the action gradient $\nabla_a Q_\phi(s, a)$. For multi-step flow or diffusion policies, this requires backpropagation through all intermediate ODE/SDE steps and the associated noise, compounding ill-conditioned Jacobians and causing significant numerical instability.
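The compounding effect is easy to demonstrate numerically. The toy below (step Jacobians modeled as random perturbations of the identity, an illustrative assumption rather than any particular policy's Jacobians) shows how the conditioning of the product of per-step Jacobians, which is what BPTT multiplies through, degrades with rollout depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_jacobian_cond(n_steps, d=4, noise=0.5):
    # Backprop through an n-step denoising rollout multiplies n per-step
    # Jacobians. Each step here is a random perturbation of the identity;
    # the condition number of the product tends to blow up with depth.
    J = np.eye(d)
    for _ in range(n_steps):
        J = (np.eye(d) + noise * rng.standard_normal((d, d))) @ J
    return np.linalg.cond(J)
```

A 50-step product is typically far worse conditioned than a single step, which is exactly the regime where BPTT gradients through a flow or diffusion sampler become unreliable.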

2.2 From Stochastic Optimal Control to Adjoint Matching

Stochastic Optimal Control (SOC) Formulation:

Optimal flow-based policy extraction can be cast as minimizing the SOC loss:
$$L_{SOC}(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a_{(\cdot)}} \Bigl[\int_0^1 \tfrac{1}{2} \|f_\theta(s, a_t, t) - f_\beta(s, a_t, t)\|^2 dt - \tau Q_\phi(s, a_1)\Bigr]$$
where trajectories $a_t$ evolve under the memoryless SDE. Naive differentiation with respect to $\theta$ entails backpropagating gradients through the entire stochastic trajectory.

Basic Adjoint Matching:

Define the adjoint (co-state) $g(s,a_t,t)$ via the backward ODE:
$$\frac{dg}{dt} = -\bigl[\nabla_{a_t}(2 f_\theta - a_t/t)\bigr]g + \frac{2}{\sigma_t^2}\nabla_{a_t}\|f_\theta - f_\beta\|^2, \qquad g(s,a_1,1) = -\tau \nabla_a Q_\phi(s,a_1)$$
Domingo-Enrich et al. (2025) established the equivalence $L_{SOC}(\theta) = L_{BAM}(\theta)$, with

$$L_{BAM}(\theta) = \mathbb{E} \biggl[\int_0^1 \bigl\| 2(f_\theta - f_\beta)/\sigma_t + \sigma_t g \bigr\|^2 dt\biggr]$$

Lean Adjoint Matching (QAM):

At optimality ($f_\theta = f^*$), the $\nabla_{a_t} f_\theta$ term drops out of the adjoint ODE, permitting a "lean" adjoint $\tilde{g}$ computed purely through the fixed base velocity field $f_\beta$:
$$\frac{d\tilde{g}}{dt} = - \nabla_{a_t}[2 f_\beta(s,a_t,t) - a_t/t] \cdot \tilde{g}, \qquad \tilde{g}(s,a_1,1) = -\tau \nabla_a Q_\phi(s,a_1)$$
The resulting QAM objective circumvents unstable backpropagation through $f_\theta$:
$$L_{AM}(\theta) = \mathbb{E}_{s, \{a_t\}} \Bigl[\int_0^1 \|2(f_\theta(s, a_t, t) - f_\beta(s, a_t, t))/\sigma_t + \sigma_t \tilde{g}_t\|^2 dt\Bigr]$$
This construction exactly targets the desired optimal policy.
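A discretized numpy sketch of $L_{AM}$ follows; the function names, time grid, and flat trajectory layout are illustrative assumptions. The adjoint values are passed in as constants, mirroring the fact that no gradient flows through $\tilde{g}$.

```python
import numpy as np

def qam_objective(f_theta, f_beta, s, traj, ts, g_tilde):
    # Discretized adjoint-matching loss: at each grid time t, match the
    # scaled residual velocity 2 (f_theta - f_beta) / sigma_t against
    # -sigma_t * g_tilde_t, where sigma_t is the SDE's diffusion coefficient.
    loss = 0.0
    for k in range(len(ts) - 1):
        t, a = ts[k], traj[k]
        h = ts[k + 1] - ts[k]
        sigma = np.sqrt(2 * (1 - t) / t)     # sigma_t of the memoryless SDE
        resid = 2 * (f_theta(s, a, t) - f_beta(s, a, t)) / sigma + sigma * g_tilde[k]
        loss += h * np.sum(resid ** 2)
    return loss
```

Setting $f_\theta = f_\beta - (\sigma_t^2/2)\,\tilde{g}_t$ zeroes the loss pointwise, matching the intuition that the optimal residual velocity is proportional to the adjoint.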

3. Algorithmic Workflow

QAM alternates between critic TD-updates and policy updates using adjoint matching.

  • Critic update:
    • Sample a batch $(s, a, r, s') \sim \mathcal{D}$
    • Compute $a' \gets \text{ODE}(f_\theta(s',\cdot,\cdot),\, z \sim \mathcal{N})$
    • Calculate the target $y \gets r + \gamma\,[\mu\, Q_{\text{mean}}(s', a') - \rho\, Q_{\text{std}}(s', a')]$
    • Update $Q_\phi$ to minimize the squared Bellman error
    • Update the target network $\bar{\phi}$
  • Policy update:
    • Forward: sample $z \sim \mathcal{N}$, roll out $a_t$ through the SDE under $f_\theta$
    • Adjoint: initialize $\tilde{g}(1) = -\tau \nabla_a Q_\phi(s,a_1)$; step backward via $\tilde{g}(t-h) = \tilde{g}(t) + h \cdot \mathrm{VJP}\bigl(\nabla_a [2 f_\beta - a/t],\, \tilde{g}(t)\bigr)$
    • Calculate $L_{AM}$ as above
    • Update $\theta$ to minimize $L_{AM}$

Convergence is defined by $\partial L_{AM}/\partial f_\theta = 0$, ensuring that $f_\theta$ induces the policy $\pi^* \propto \pi_\beta \exp(\tau Q)$ without gradient bias while retaining full flow expressivity.
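The backward adjoint pass in the policy update can be sketched with finite-difference vector-Jacobian products through the frozen base field. This is a hypothetical stand-in for reverse-mode autodiff (names and the Euler discretization are illustrative); its point is that only $f_\beta$, never the trainable $f_\theta$, appears in the backward recursion.

```python
import numpy as np

def lean_adjoint_path(f_beta, s, traj, ts, grad_q_terminal, tau):
    # Integrate the lean adjoint ODE backward from t = 1 with Euler steps:
    # g(t - h) = g(t) + h * J(t)^T g(t), where J is the Jacobian of the
    # drift b(a, t) = 2 f_beta(s, a, t) - a / t in a (frozen base field only).
    d = len(traj[-1])
    g = -tau * grad_q_terminal               # terminal condition g(1) = -tau * grad_a Q
    gs = [g]
    eps = 1e-5
    for k in range(len(ts) - 1, 0, -1):
        t, a = ts[k], traj[k]
        h = ts[k] - ts[k - 1]
        # Finite-difference Jacobian of the drift (stand-in for an autodiff VJP).
        J = np.zeros((d, d))
        for j in range(d):
            e = np.zeros(d); e[j] = eps
            bp = 2 * f_beta(s, a + e, t) - (a + e) / t
            bm = 2 * f_beta(s, a - e, t) - (a - e) / t
            J[:, j] = (bp - bm) / (2 * eps)
        g = g + h * (J.T @ g)                # one backward Euler step
        gs.append(g)
    return gs[::-1]                          # adjoints aligned with ts
```

For $f_\beta \equiv 0$ the drift Jacobian is $-I/t$, so the recursion contracts the terminal gradient by a factor $(1 - h/t)$ per step, which gives a closed-form check on the implementation.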

4. Theoretical Analysis

Unique Optimality and Unbiasedness:

Proposition 4.1 (an extension of Domingo-Enrich et al., 2025) asserts that $L_{AM}$ admits a unique minimizer $f_\theta^*$ such that
$$\forall s \in \mathrm{supp}(\mathcal{D}):\quad \pi^*_\theta(\cdot | s) \propto \pi_\beta(\cdot|s)\exp[\tau Q_\phi(s,\cdot)]$$
and in the limit $L_{AM} \rightarrow 0$ (with sufficient function-approximation capacity) QAM recovers the exact behavior-constrained optimum.
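On a discrete toy action set, the target distribution in Proposition 4.1 is just an exponentially tilted, renormalized copy of the behavior policy. The tiny numpy check below (names illustrative) makes the two limiting behaviors concrete.

```python
import numpy as np

def tilted_policy(pi_beta, q_values, tau):
    # pi*(a) proportional to pi_beta(a) * exp(tau * Q(a)),
    # renormalized over the discrete action set.
    w = pi_beta * np.exp(tau * q_values)
    return w / w.sum()
```

At $\tau = 0$ this returns $\pi_\beta$ unchanged; as $\tau$ grows it concentrates on the highest-Q action within the behavior support, which is the behavior-constrained optimum QAM's flow policy is trained to realize in continuous action spaces.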

The "lean adjoint" g~\tilde{g} propagates the Q-function’s boundary gradient according to the fixed fβf_\beta, rigorously linking the residual flow velocity to the SOC solution. The unbiasedness of the adjoint matching gradient for the SOC objective follows directly. Variance is controlled by the discretization steps TT of the SDE and the sampling noise in g~\tilde{g}; T10T \approx 10 is effective in practice.

5. Empirical Results and Comparative Evaluation

QAM was benchmarked on OGBench—ten long-horizon, sparse-reward environments (antmaze, humanoidmaze, cube, puzzle, scene)—against 17 baseline algorithms spanning Gaussian policies (ReBRAC), backprop-through-time flows (FBRAC, BAM), one-step distillations (FQL), advantage-weighted flows (FAWAC), gradient guidance methods (QSM, DAC, CGQL variants), and post-hoc editors (DSRL, FEdit, IFQL).

  • Offline RL (1M updates):

QAM, with $\tau$ tuned per domain, attained $\approx 44/50$ normalized score across tasks, outperforming the nearest baseline ($\approx 40$). Discarding $\nabla_a Q$ (FAWAC) scored $\approx 8$, while backprop-based flows (BAM) scored $\approx 35$, demonstrating the performance and stability benefit of the lean adjoint technique. Reducing flows to one step (FQL) markedly degraded performance on multi-modal tasks. Guided and post-processed methods were consistently outperformed.

  • Offline-to-online fine-tuning (0.5M environment steps):

The QAM-EDIT variant, which introduces a minor relaxation beyond $\pi_\beta \exp(\tau Q)$, realized the fastest score improvements and highest asymptotic returns while being more sample-efficient than all considered baselines, including RLPD (behavior-unconstrained), QSM, and FQL.

  • Ablation and sensitivity analyses:

Comparing basic adjoint matching (BAM) against the lean adjoint (QAM) revealed higher stability and performance for QAM. The inverse temperature $\tau$ is the most critical hyperparameter; results are robust to moderate numbers of flow steps ($T \approx 10$) and to gradient clipping. In noise and data-stitching robustness evaluations, QAM maintained near-optimality, unlike most baselines, which degraded significantly.

QAM subsumes and generalizes prior approaches for flow-based policy optimization in offline RL. Existing methods are limited by their exclusion of first-order critic information, by actor-critic updates susceptible to gradient bias or numerical instability (due to backpropagation through multiple ODE/SDE integration steps), or by constrained policy expressivity adopted to simplify gradients. By leveraging adjoint matching, QAM uniquely preserves full generative policy expressivity and unbiased policy gradients while entirely sidestepping backpropagation through noisy, multi-step diffusion processes. This positions QAM as a principled policy-extraction method for complex action distributions in RL, with guaranteed convergence to the optimal behavior-constrained distribution (Li et al., 20 Jan 2026).
