Formal relationship between noise-expectation and gradient-expectation objectives

Determine the formal relationship between the noise-expectation and gradient-expectation training objectives for diffusion policies that target the Boltzmann distribution over actions induced by a Q-function in online reinforcement learning, and ascertain whether these objectives can be synthesized into a single unified general formulation.

Background

In online reinforcement learning with diffusion policies, two families of training objectives have been proposed for targeting the Boltzmann action distribution defined by the learned Q-function. The noise-expectation family constructs training targets via self-normalized importance sampling (SNIS) over noise samples, weighted by exponentiated Q-values; the gradient-expectation family instead performs SNIS over Q-function gradients.
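The two SNIS constructions can be sketched side by side. The following is a minimal illustration, not the paper's method: it assumes a toy quadratic Q-function, a unit temperature-like coefficient alpha, and uses the raw noise samples directly as candidate actions, so the diffusion model's forward process is elided.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5  # assumed temperature of the Boltzmann target exp(Q / alpha)
N = 1024     # number of Monte Carlo action samples

# Toy quadratic Q-function peaked at a = 1, with its analytic gradient.
def q(a):
    return -np.sum((a - 1.0) ** 2, axis=-1)

def grad_q(a):
    return -2.0 * (a - 1.0)

# Sample noise; here the noise doubles as the candidate actions,
# standing in for samples from the diffusion policy.
eps = rng.standard_normal((N, 2))
actions = eps

# Self-normalized importance weights proportional to exp(Q / alpha).
logits = q(actions) / alpha
w = np.exp(logits - logits.max())
w /= w.sum()

# Noise-expectation target: SNIS average of the noise samples.
noise_target = (w[:, None] * eps).sum(axis=0)

# Gradient-expectation target: SNIS average of the Q-gradients.
grad_target = (w[:, None] * grad_q(actions)).sum(axis=0)

print(noise_target, grad_target)
```

Because the toy Q is quadratic and actions coincide with the noise here, the two targets are affinely related (grad_target = 2 - 2 * noise_target), which hints at why a formal connection between the two families is plausible in the first place.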

Despite the empirical success of both families, the paper notes that it was previously unclear how the two objectives are formally related or whether they can be unified into a general formulation, a gap the authors aim to close through their Reverse Flow Matching framework with control variates.
