
Q-learning with Adjoint Matching (QAM)

Updated 21 January 2026
  • The paper introduces QAM, which leverages adjoint matching to optimize flow-matching policies while avoiding the numerical instability of multi-step backpropagation.
  • It employs a lean adjoint technique that preserves first-order critic gradients, achieving unbiased, behavior-constrained optimal policies in continuous-action domains.
  • QAM demonstrates state-of-the-art performance on offline and offline-to-online RL benchmarks, outperforming traditional methods in stability and sample efficiency.

Q-learning with Adjoint Matching (QAM) is a temporal-difference-based reinforcement learning method addressing the challenge of optimizing expressive diffusion or flow-matching policies with respect to a parameterized Q-function in continuous-action domains. QAM incorporates adjoint matching, a technique originally from generative modeling, to preserve first-order critic information for policy improvement while entirely circumventing the numerical instability associated with backpropagation through multi-step denoising processes inherent in flow or diffusion policy classes. In tandem with standard TD backup for critic learning, QAM yields provably unbiased, behavior-constrained optimal policies and demonstrates state-of-the-art empirical performance in offline and offline-to-online RL benchmarks (Li et al., 20 Jan 2026).

1. Formal Problem Setup

QAM operates within a continuous-action Markov decision process (MDP) specified by $(\mathcal{S}, \mathcal{A}, P, \gamma, R, \mu)$, where $\mathcal{A} \subseteq \mathbb{R}^d$ and $\gamma \in [0,1)$ is the discount factor. The primary objective in offline RL is to learn a policy $\pi(a|s)$ maximizing the expected return, given a static dataset $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}$.

Q-learning employs a parameterized critic $Q_\phi(s,a)$, trained to minimize the standard TD error:
$$L(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigl[(Q_\phi(s, a) - (r + \gamma Q_{\bar{\phi}}(s', a')))^2\bigr]$$
where $a' \sim \pi_\theta(\cdot | s')$ and $\bar{\phi}$ tracks $\phi$ via a Polyak average.
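As a concrete sanity check, the critic objective above can be sketched in a few lines of numpy. The linear critic and batch layout below are illustrative assumptions, not the paper's architecture; the key detail is that the target uses the Polyak-averaged parameters and is treated as a constant.

```python
import numpy as np

def q(phi, s, a):
    # Toy linear critic Q_phi(s, a) = phi . [s; a], standing in for a neural net.
    return float(phi @ np.concatenate([s, a]))

def td_loss(phi, phi_bar, batch, next_actions, gamma=0.99):
    # Mean squared TD error over a batch. Targets use the Polyak-averaged
    # parameters phi_bar and next actions a' sampled from the current policy;
    # no gradient flows through the target.
    err = 0.0
    for (s, a, r, s2), a2 in zip(batch, next_actions):
        target = r + gamma * q(phi_bar, s2, a2)
        err += (q(phi, s, a) - target) ** 2
    return err / len(batch)
```

With both parameter vectors at zero and a reward of 1, the loss is exactly the squared reward, which makes the estimator easy to unit-test.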

Departing from conventional Gaussian policies, QAM utilizes a flow-matching policy $\pi_\theta(a|s)$ parameterized by a time-indexed velocity field $f_\theta: \mathcal{S} \times \mathcal{A} \times [0,1] \rightarrow \mathcal{A}$. This defines a memoryless SDE:
$$da_t = \Bigl(2 f_\theta(s, a_t, t) - \frac{a_t}{t}\Bigr)\, dt + \sqrt{\frac{2(1-t)}{t}}\, dB_t, \quad a_0 \sim \mathcal{N}(0, I),\; a_1 \sim \text{data}$$
Integrating from the noise sample $a_0$ to the endpoint $a_1$ (via ODE integration) yields actions from $\pi_\theta(a|s)$. The base (behavior) policy $\pi_\beta$ is learned with standard flow matching:
$$L_{FM}(\beta) = \mathbb{E}_{(s, a) \sim \mathcal{D},\, t \sim U[0,1],\, z \sim \mathcal{N}} \bigl[\| f_\beta(s, (1-t)z + ta, t) - (a - z) \|^2\bigr]$$
QAM aims to extract the optimal policy constrained by behavior:
$$\pi^*(a|s) \propto \pi_\beta(a|s) \exp[\tau Q_\phi(s,a)]$$
by fine-tuning $f_\theta$ without unstable backpropagation through the SDE.
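The flow-matching regression for the behavior policy is simple enough to sketch directly. The snippet below is a minimal Monte Carlo estimate of $L_{FM}$ under illustrative assumptions (a plain numpy velocity-field callable, uniform sampling over the dataset); it is not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(f_beta, s, actions, n_samples=256):
    # Monte Carlo estimate of L_FM: draw (a, t, z), build the interpolant
    # a_t = (1 - t) z + t a, and regress the velocity field onto the
    # straight-line target (a - z).
    d = actions.shape[1]
    total = 0.0
    for _ in range(n_samples):
        a = actions[rng.integers(len(actions))]
        t = rng.uniform()                    # t in [0, 1)
        z = rng.standard_normal(d)
        a_t = (1 - t) * z + t * a
        total += np.sum((f_beta(s, a_t, t) - (a - z)) ** 2)
    return total / n_samples
```

As a sanity check, for a dataset containing a single action $a_0$, the field $f(s, a_t, t) = (a_0 - a_t)/(1 - t)$ reproduces the target $a_0 - z$ exactly and drives the loss to zero.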

2. Derivation of the Adjoint-Matching Objective

2.1 Critic Gradients and BPTT Instability

Naive actor-critic approaches hill-climb the Q-function via the action gradient $\nabla_a Q_\phi(s, a)$. For multi-step flow or diffusion policies, this requires backpropagation through all intermediate ODE/SDE steps and the associated noise, compounding ill-conditioned Jacobians and causing significant numerical instability.
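The compounding effect is easy to demonstrate numerically. The toy below (step Jacobians modeled as random perturbations of the identity, an illustrative assumption rather than any particular policy's Jacobians) shows how the conditioning of the product of per-step Jacobians, which is what BPTT multiplies through, degrades with rollout depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_jacobian_cond(n_steps, d=4, noise=0.5):
    # Backprop through an n-step denoising rollout multiplies n per-step
    # Jacobians. Each step here is a random perturbation of the identity;
    # the condition number of the product tends to blow up with depth.
    J = np.eye(d)
    for _ in range(n_steps):
        J = (np.eye(d) + noise * rng.standard_normal((d, d))) @ J
    return np.linalg.cond(J)
```

A 50-step product is typically far worse conditioned than a single step, which is exactly the regime where BPTT gradients through a flow or diffusion sampler become unreliable.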

2.2 From Stochastic Optimal Control to Adjoint Matching

Stochastic Optimal Control (SOC) Formulation:

Optimal flow-based policy extraction can be cast as minimizing the SOC loss:
$$L_{SOC}(\theta) = \mathbb{E}_{s \sim \mathcal{D},\, a_{(\cdot)}} \Bigl[\int_0^1 \tfrac{1}{2} \|f_\theta(s, a_t, t) - f_\beta(s, a_t, t)\|^2 dt - \tau Q_\phi(s, a_1)\Bigr]$$
where trajectories $a_t$ evolve under the memoryless SDE. Naive differentiation with respect to $\theta$ entails backpropagating gradients through the entire stochastic trajectory.

Basic Adjoint Matching:

Define the adjoint (co-state) $g(s,a_t,t)$ via the backward ODE:
$$\frac{dg}{dt} = -\bigl[\nabla_{a_t}(2 f_\theta - a_t/t)\bigr]g + \frac{2}{\sigma_t^2}\nabla_{a_t}\|f_\theta - f_\beta\|^2, \qquad g(s,a_1,1) = -\tau \nabla_a Q_\phi(s,a_1)$$
Domingo-Enrich et al. (2025) established the equivalence $L_{SOC}(\theta) = L_{BAM}(\theta)$, with

$$L_{BAM}(\theta) = \mathbb{E} \biggl[\int_0^1 \bigl\| 2(f_\theta - f_\beta)/\sigma_t + \sigma_t g \bigr\|^2 dt\biggr]$$

Lean Adjoint Matching (QAM):

At optimality ($f_\theta = f^*$), the $\nabla_{a_t} f_\theta$ term drops out of the adjoint ODE, permitting a "lean" adjoint $\tilde{g}$ computed purely through the fixed base velocity field $f_\beta$:
$$\frac{d\tilde{g}}{dt} = - \nabla_{a_t}[2 f_\beta(s,a_t,t) - a_t/t] \cdot \tilde{g}, \qquad \tilde{g}(s,a_1,1) = -\tau \nabla_a Q_\phi(s,a_1)$$
The resulting QAM objective circumvents unstable backpropagation through $f_\theta$:
$$L_{AM}(\theta) = \mathbb{E}_{s, \{a_t\}} \Bigl[\int_0^1 \|2(f_\theta(s, a_t, t) - f_\beta(s, a_t, t))/\sigma_t + \sigma_t \tilde{g}_t\|^2 dt\Bigr]$$
This construction exactly targets the desired optimal policy.
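A discretized numpy sketch of $L_{AM}$ follows; the function names, time grid, and flat trajectory layout are illustrative assumptions. The adjoint values are passed in as constants, mirroring the fact that no gradient flows through $\tilde{g}$.

```python
import numpy as np

def qam_objective(f_theta, f_beta, s, traj, ts, g_tilde):
    # Discretized adjoint-matching loss: at each grid time t, match the
    # scaled residual velocity 2 (f_theta - f_beta) / sigma_t against
    # -sigma_t * g_tilde_t, where sigma_t is the SDE's diffusion coefficient.
    loss = 0.0
    for k in range(len(ts) - 1):
        t, a = ts[k], traj[k]
        h = ts[k + 1] - ts[k]
        sigma = np.sqrt(2 * (1 - t) / t)     # sigma_t of the memoryless SDE
        resid = 2 * (f_theta(s, a, t) - f_beta(s, a, t)) / sigma + sigma * g_tilde[k]
        loss += h * np.sum(resid ** 2)
    return loss
```

Setting $f_\theta = f_\beta - (\sigma_t^2/2)\,\tilde{g}_t$ zeroes the loss pointwise, matching the intuition that the optimal residual velocity is proportional to the adjoint.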

3. Algorithmic Workflow

QAM alternates between critic TD-updates and policy updates using adjoint matching.

  • Critic update:
    • Sample a batch $(s, a, r, s') \sim \mathcal{D}$
    • Compute $a' \gets \text{ODE}(f_\theta(s',\cdot,\cdot),\, z \sim \mathcal{N})$
    • Calculate the target $y \gets r + \gamma\,[\mu\, Q_{\text{mean}}(s', a') - \rho\, Q_{\text{std}}(s', a')]$
    • Update $Q_\phi$ to minimize the squared Bellman error
    • Update the target network $\bar{\phi}$
  • Policy update:
    • Forward: sample $z \sim \mathcal{N}$, roll out $a_t$ through the SDE under $f_\theta$
    • Adjoint: initialize $\tilde{g}(1) = -\tau \nabla_a Q_\phi(s,a_1)$; step backward via $\tilde{g}(t-h) = \tilde{g}(t) + h \cdot \mathrm{VJP}\bigl(\nabla_a [2 f_\beta - a/t],\, \tilde{g}(t)\bigr)$
    • Calculate $L_{AM}$ as above
    • Update $\theta$ to minimize $L_{AM}$

Convergence is defined by $\partial L_{AM}/\partial f_\theta = 0$, ensuring that $f_\theta$ induces the policy $\pi^* \propto \pi_\beta \exp(\tau Q)$ without gradient bias while retaining full flow expressivity.
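The backward adjoint pass in the policy update can be sketched with finite-difference vector-Jacobian products through the frozen base field. This is a hypothetical stand-in for reverse-mode autodiff (names and the Euler discretization are illustrative); its point is that only $f_\beta$, never the trainable $f_\theta$, appears in the backward recursion.

```python
import numpy as np

def lean_adjoint_path(f_beta, s, traj, ts, grad_q_terminal, tau):
    # Integrate the lean adjoint ODE backward from t = 1 with Euler steps:
    # g(t - h) = g(t) + h * J(t)^T g(t), where J is the Jacobian of the
    # drift b(a, t) = 2 f_beta(s, a, t) - a / t in a (frozen base field only).
    d = len(traj[-1])
    g = -tau * grad_q_terminal               # terminal condition g(1) = -tau * grad_a Q
    gs = [g]
    eps = 1e-5
    for k in range(len(ts) - 1, 0, -1):
        t, a = ts[k], traj[k]
        h = ts[k] - ts[k - 1]
        # Finite-difference Jacobian of the drift (stand-in for an autodiff VJP).
        J = np.zeros((d, d))
        for j in range(d):
            e = np.zeros(d); e[j] = eps
            bp = 2 * f_beta(s, a + e, t) - (a + e) / t
            bm = 2 * f_beta(s, a - e, t) - (a - e) / t
            J[:, j] = (bp - bm) / (2 * eps)
        g = g + h * (J.T @ g)                # one backward Euler step
        gs.append(g)
    return gs[::-1]                          # adjoints aligned with ts
```

For $f_\beta \equiv 0$ the drift Jacobian is $-I/t$, so the recursion contracts the terminal gradient by a factor $(1 - h/t)$ per step, which gives a closed-form check on the implementation.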

4. Theoretical Analysis

Unique Optimality and Unbiasedness:

Proposition 4.1 (an extension of Domingo-Enrich et al., 2025) asserts that $L_{AM}$ admits a unique minimizer $f_\theta^*$ such that
$$\forall s \in \mathrm{supp}(\mathcal{D}):\quad \pi^*_\theta(\cdot | s) \propto \pi_\beta(\cdot|s)\exp[\tau Q_\phi(s,\cdot)]$$
and in the limit $L_{AM} \rightarrow 0$ (with sufficient function-approximation capacity) QAM recovers the exact behavior-constrained optimum.
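On a discrete toy action set, the target distribution in Proposition 4.1 is just an exponentially tilted, renormalized copy of the behavior policy. The tiny numpy check below (names illustrative) makes the two limiting behaviors concrete.

```python
import numpy as np

def tilted_policy(pi_beta, q_values, tau):
    # pi*(a) proportional to pi_beta(a) * exp(tau * Q(a)),
    # renormalized over the discrete action set.
    w = pi_beta * np.exp(tau * q_values)
    return w / w.sum()
```

At $\tau = 0$ this returns $\pi_\beta$ unchanged; as $\tau$ grows it concentrates on the highest-Q action within the behavior support, which is the behavior-constrained optimum QAM's flow policy is trained to realize in continuous action spaces.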

The "lean adjoint" g~\tilde{g} propagates the Q-function’s boundary gradient according to the fixed fβf_\beta, rigorously linking the residual flow velocity to the SOC solution. The unbiasedness of the adjoint matching gradient for the SOC objective follows directly. Variance is controlled by the discretization steps TT of the SDE and the sampling noise in g~\tilde{g}; T10T \approx 10 is effective in practice.

5. Empirical Results and Comparative Evaluation

QAM was benchmarked on OGBench—ten long-horizon, sparse-reward environments (antmaze, humanoidmaze, cube, puzzle, scene)—against 17 baseline algorithms spanning Gaussian policies (ReBRAC), backprop-through-time flows (FBRAC, BAM), one-step distillations (FQL), advantage-weighted flows (FAWAC), gradient guidance methods (QSM, DAC, CGQL variants), and post-hoc editors (DSRL, FEdit, IFQL).

  • Offline RL (1M updates):

QAM, with $\tau$ tuned per domain, attained $\approx 44/50$ normalized score across tasks, outperforming the nearest baseline ($\approx 40$). Discarding $\nabla_a Q$ (FAWAC) scored $\approx 8$, while backprop-based flows (BAM) scored $\approx 35$, demonstrating the performance and stability benefit of the lean adjoint technique. Reducing flows to one step (FQL) markedly degraded performance on multi-modal tasks. Guided and post-processed methods were consistently outperformed.

  • Offline-to-online fine-tuning (0.5M environment steps):

The QAM-EDIT variant, which introduces a minor relaxation beyond $\pi_\beta \exp(\tau Q)$, realized the fastest score improvements and highest asymptotic returns while being more sample-efficient than all considered baselines, including RLPD (behavior-unconstrained), QSM, and FQL.

  • Ablation and sensitivity analyses:

Comparing basic adjoint matching (BAM) against the lean adjoint (QAM) revealed higher stability and performance for QAM. The inverse temperature $\tau$ is the most critical hyperparameter; results are robust to moderate numbers of flow steps ($T \approx 10$) and to gradient clipping. In noise and data-stitching robustness evaluations, QAM maintained near-optimality, unlike most baselines, which degraded significantly.

QAM subsumes and generalizes prior approaches for flow-based policy optimization in offline RL. Existing methods are limited by their exclusion of first-order critic information, by actor-critic updates susceptible to gradient bias or numerical instability (due to backpropagation through multiple ODE/SDE integration steps), or by constrained policy expressivity adopted to simplify gradients. By leveraging adjoint matching, QAM uniquely preserves full generative policy expressivity and unbiased policy gradients while entirely sidestepping backpropagation through noisy, multi-step diffusion processes. This positions QAM as a principled policy-extraction method for complex action distributions in RL, with guaranteed convergence to the optimal behavior-constrained distribution (Li et al., 20 Jan 2026).
