Dispersive MeanFlow Policy Optimization

Updated 31 January 2026
  • Dispersive MeanFlow Policy Optimization is a unified framework for generative policy learning that enables mathematically exact one-step action generation.
  • It leverages dispersive regularization to maintain high entropy in internal representations, preventing collapse and ensuring robust multimodal behavior.
  • Empirical benchmarks across manipulation, locomotion, and reasoning tasks show DMPO achieves superior success rates and sub-10 ms inference latency.

Dispersive MeanFlow Policy Optimization (DMPO) is a unified framework for generative policy learning that enables mathematically exact one-step action generation, robust multimodal distribution modeling, and stable reinforcement learning (RL) fine-tuning. DMPO leverages MeanFlow for single-step inference, dispersive regularization to maintain representation entropy and prevent collapse, and RL objectives (notably Q-learning and PPO) to optimize beyond expert performance. It has been evaluated across manipulation, locomotion, and high-dimensional reasoning tasks, yielding state-of-the-art metrics for inference latency and task success.

1. Mathematical Foundations and Generative Architecture

DMPO builds upon MeanFlow generative modeling, which learns a time-dependent velocity field $v_\phi(a_t, t)$ mapping from noise $e \sim \mathcal{N}(0, I)$ to actions via the interpolation $a_t = (1-t)a + te$. The interval-averaged velocity over $[b, t]$ is

$$u_\phi(a_t, b, t) := \frac{1}{t-b} \int_{b}^{t} v_\phi(a_\tau, \tau)\, d\tau$$

Setting $b = 0$, $t = 1$ enables one-step sampling: $\hat{a} = e - u_\phi(e, 0, 1)$. The residual reformulation collapses the velocity and its integration into a single network, $g_\theta(a_t, b, t) = a_t - u_\theta(a_t, b, t)$. For one-step inference, the policy network produces

$$a = g_\theta(e, 0, 1)$$

This architecture is compatible with both behavior cloning and RL objectives, supporting direct mapping from noise and state to multimodal actions.
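As a minimal sketch of the one-step residual parameterization above (an untrained NumPy stand-in; the network, dimensions, and names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 8, 4

def make_mlp(sizes):
    """Random-weight MLP parameters (an untrained stand-in for u_theta)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

# u_theta(state, a_t, b, t) -> interval-averaged velocity over [b, t]
u_params = make_mlp([STATE_DIM + ACTION_DIM + 2, 64, ACTION_DIM])

def g(state, a_t, b, t):
    """Residual policy g_theta(a_t, b, t) = a_t - u_theta(state, a_t, b, t)."""
    x = np.concatenate([state, a_t, np.array([b, t])])
    return a_t - mlp(u_params, x)

def act(state):
    """One-step inference: a = g_theta(e, 0, 1) with e ~ N(0, I)."""
    e = rng.normal(size=ACTION_DIM)
    return g(state, e, 0.0, 1.0)

action = act(rng.normal(size=STATE_DIM))  # a single forward pass, no ODE solve
```

The key point is that `act` performs exactly one network evaluation per action, which is what yields the sub-10 ms latencies reported later in this article.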

2. Dispersive Regularization and Representation Stability

MeanFlow-based policies are susceptible to representation collapse—where encoders map diverse observations to indistinguishable features—degrading policy expressivity in high-precision control domains. DMPO introduces "dispersive" regularizers, which maximize the entropy of the internal batch-wise representations. Canonical forms include:

  • InfoNCE-L2 and -Cosine losses: contrastive objectives maximizing angular or Euclidean separation.
  • Hinge margin penalty: enforces minimum separation for all pairs of representations.
  • Covariance decorrelation: penalizes off-diagonal covariance, ensuring full-rank encoding.

Formally, for batch embeddings $h_i$,

$$\mathcal{L}_{\mathrm{Disp}} = \frac{1}{N(N-1)} \sum_{i \neq j} \exp\left(-\frac{\|h_i - h_j\|^2}{\tau^2}\right)$$

with temperature $\tau$ controlling repulsion strength. These losses attach to intermediate MLP, ViT, or UNet layers, preserving feature diversity without inference overhead.
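The loss above is straightforward to compute on a batch of embeddings. A NumPy sketch (function name and the toy check are illustrative):

```python
import numpy as np

def dispersive_loss(h: np.ndarray, tau: float = 1.0) -> float:
    """L_Disp = mean over i != j of exp(-||h_i - h_j||^2 / tau^2).

    h: (N, d) batch of intermediate embeddings. The loss approaches 1 when
    all embeddings collapse to a point and 0 when they are well separated,
    so minimizing it pushes representations apart.
    """
    n = h.shape[0]
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(h ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * h @ h.T
    np.fill_diagonal(d2, np.inf)  # exclude i == j terms (exp(-inf) = 0)
    return float(np.sum(np.exp(-d2 / tau ** 2)) / (n * (n - 1)))

# Collapsed embeddings score ~1; widely spread embeddings score ~0.
collapsed = np.zeros((16, 32))
spread = np.random.default_rng(0).normal(size=(16, 32)) * 10.0
```

In training, this penalty would be added (with a small weight) to the MeanFlow reconstruction loss, evaluated on one or more intermediate feature maps.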

3. Integrated RL Fine-Tuning: Q-Learning and PPO

DMPO couples the generative MeanFlow architecture with RL policy optimization, enabling policies to surpass imitation performance. Two principal variants are described:

  • Q-learning actor–critic (Wang et al., 17 Nov 2025): The residual policy $g_\theta$ is trained to minimize a weighted sum of the MeanFlow identity loss ($L_{\mathrm{MFI}}$) and the Q-learning loss ($L_Q$):

$$\theta \leftarrow \theta - \eta \nabla_\theta \left[ L_Q(\theta) + \alpha\, L_{\mathrm{MFI}}(\theta) \right]$$

Critic updates use value-guided best-of-K sampling, and an adaptive coefficient $\alpha$ dynamically balances fidelity and policy improvement.

  • PPO with BC regularization (Zou et al., 28 Jan 2026): The PPO policy-gradient loss $\mathcal{L}_{\mathrm{PG}}$ and value loss $\mathcal{L}_V$ are augmented with entropy and behavior-cloning regularizers:

$$\mathcal{L}_{\mathrm{Stage2}} = \mathcal{L}_{\mathrm{PG}} + \lambda_V \mathcal{L}_V + \lambda_{\mathrm{ent}}\, \mathcal{L}_{\mathrm{ent}} + \lambda_{\mathrm{BC}}(n)\, \mathcal{L}_{\mathrm{BC}}$$

This enables rapid RL fine-tuning, maintaining stability via dispersive representations and supporting sample-efficient online adaptation.
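Both update rules can be sketched schematically (plain NumPy scalars; the adaptive-$\alpha$ rule and the $\lambda_{\mathrm{BC}}(n)$ anneal below are assumed illustrations, not the schedules published in the cited papers):

```python
import numpy as np

def actor_update(theta, grad_L_Q, grad_L_MFI, eta=3e-4, alpha=1.0):
    """Q-learning variant: theta <- theta - eta * grad[L_Q + alpha * L_MFI]."""
    return theta - eta * (grad_L_Q + alpha * grad_L_MFI)

def adapt_alpha(alpha, L_Q, L_MFI, target_ratio=1.0, lr=0.01):
    """Assumed multiplicative rule: nudge alpha so the fidelity term stays
    near a target fraction of the Q-learning term (illustrative only)."""
    ratio = (alpha * L_MFI) / max(abs(L_Q), 1e-8)
    return alpha * float(np.exp(lr * (target_ratio - ratio)))

def stage2_loss(L_pg, L_v, L_ent, L_bc, n,
                lam_v=0.5, lam_ent=0.01, lam_bc0=1.0, decay=0.99):
    """PPO variant: L_Stage2 = L_PG + lam_V*L_V + lam_ent*L_ent + lam_BC(n)*L_BC,
    with lam_BC(n) annealed over iterations n (exponential decay assumed)."""
    return L_pg + lam_v * L_v + lam_ent * L_ent + (lam_bc0 * decay ** n) * L_bc
```

In practice the gradients come from autodiff over the full networks; the point of the sketch is only the shape of the two weighted objectives.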

4. Algorithmic Workflow and Implementation

A typical DMPO training pipeline consists of:

  1. Data encoding: State, visual, and temporal features are embedded via lightweight ViT, Transformer-MLP, or PointNet++ architectures.
  2. Noise-action trajectory formation: For each training example, noise is sampled and interpolated to form ata_t.
  3. Velocity prediction and target computation: MeanFlow identity and, where relevant, Jacobian-vector products (JVPs) or Differential Derivation Equations (DDE; finite-differencing variants) compute targets without explicit backprop through ODE solvers.
  4. Loss aggregation: MeanFlow reconstruction, dispersive penalties (on multiple intermediate features), and RL improvement targets are combined in an SGD update.
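Step 3 above avoids backpropagating through an ODE solver by differentiating the averaged-velocity network directly. A finite-difference (DDE-style) sketch of the target computation, assuming the standard MeanFlow identity $u = v - (t-b)\,\frac{d}{dt}u(a_t, b, t)$ with the total derivative taken along the trajectory (illustrative, not the exact published variant):

```python
def dde_target(u_fn, v, a_t, b, t, eps=1e-3):
    """Finite-difference stand-in for the JVP in the MeanFlow identity.

    The total derivative d/dt u(a_t, b, t) is taken in the tangent
    direction (da_t/dt, db/dt, dt/dt) = (v, 0, 1); a central difference
    in that direction approximates the JVP without any ODE solve.
    Works on scalars or NumPy arrays.
    """
    du_dt = (u_fn(a_t + eps * v, b, t + eps)
             - u_fn(a_t - eps * v, b, t - eps)) / (2.0 * eps)
    return v - (t - b) * du_dt
```

For a linear toy field such as `u_fn = lambda a, b, t: 2.0 * a + t` with $v = 1$, the central difference recovers the exact directional derivative, so `dde_target(u_fn, 1.0, 0.3, 0.0, 1.0)` evaluates to $1 - 1 \cdot (2 + 1) = -2$.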

At inference, a single network evaluation (or a small number of them) generates actions at sub-10 ms latency ($>120$ Hz), supporting real-time deployment on various hardware platforms.

5. Empirical Benchmarks and Comparative Metrics

DMPO has been evaluated across 73 benchmark tasks spanning OGBench (state-based, pixel-based, multimodal), D4RL (antmaze, adroit), RoboMimic (Lift, Can, Square, Transport), Meta-World, and physical control scenarios. Representative metrics (all per cited paper):

| Task/Benchmark | SOTA Baseline | DMPO Success Rate / Reward | Latency |
|---|---|---|---|
| OGBench antmaze-large | FQL: ~81% | ~81–95% | <10 ms |
| D4RL pen-expert | FQL: 149 | 151 | 6–7 ms |
| RoboMimic Lift | ShortCut: 85% | 97–100% | 2.4–6.8 ms |
| Meta-World Medium | FlowPolicy: 58% | 68–77% | 6–7 ms |
| Real-robot (Franka) | Baseline: failure | 90% (Lift), 75% (Can) | 2.6–9.6 ms |

Success rate gains of 10–30 percentage points, 5–20× inference speedup, and best-of-K multimodal sampling are consistently observed. Ablations confirm performance drops upon removal of dispersive regularization or MeanFlow ratio averaging.

6. Theoretical Rationale for "Dispersive" Formulation

"Dispersive" in DMPO denotes the policy’s capacity to allocate probability mass across multiple high-Q regions rather than concentrating on unimodal or collapsed actions. Theoretically, the residual MeanFlow parameterization $g_\theta(a_t, b, t) = a_t - u_\theta(a_t, b, t)$, anchored by the MeanFlow identity, ensures the output action distribution smoothly interpolates between noise and expert data, supporting arbitrary multimodal mappings. Information-theoretic analysis (maximization of encoder entropy) justifies dispersive regularization as a mechanism to maintain representation diversity and generalization capacity, especially in few-shot and high-precision domains.

Empirical ablations demonstrate:

  • Naive one-step MeanFlow ($e - u_\phi$) produces out-of-bound actions and unstable training.
  • Simplistic residuals ($g(e) = e - u(e)$) fail to recover modes in multimodal environments.
  • Best-of-K sampling improves dispersion versus greedy selection, with $K \approx 5$ balancing exploration and exploitation.

This structure enables stable, expressive, and efficient generative policy learning in both offline and online RL.
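The best-of-K procedure from the ablations can be sketched as follows (the policy and Q-function here are hypothetical stand-ins for the trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_k(state, one_step_policy, q_fn, noises):
    """Value-guided best-of-K: generate one candidate action per noise
    draw with the one-step policy, then keep the highest-Q candidate."""
    candidates = [one_step_policy(state, e) for e in noises]
    scores = [q_fn(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy check: with Q(s, a) = -||a||^2 and an identity policy, best-of-K
# keeps the smallest-norm draw among the K = 5 candidates.
noises = [rng.normal(size=4) for _ in range(5)]
pick = best_of_k(np.zeros(4),
                 one_step_policy=lambda s, e: e,
                 q_fn=lambda s, a: -float(np.sum(a ** 2)),
                 noises=noises)
```

Because each candidate costs only one forward pass, $K \approx 5$ keeps total latency in the same sub-10 ms regime as single-sample inference.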

7. Impact, Deployment, and Future Directions

DMPO frameworks have realized real-world deployments on high-DOF robot arms (Franka-Emika-Panda) at $>100$ Hz control frequency, substantiating the design’s suitability for latency-critical control. Compared to multi-step diffusion or flow-based baselines, DMPO's one-step inference routinely achieves superior performance at a fraction of the computational cost.

Potential future work includes adaptive dispersive weighting scaled to task complexity, extension of MeanFlow architectures to temporal sequence generation, and cross-domain applications such as reasoning in diffusion LLMs via distribution matching objectives (Zhu et al., 9 Oct 2025)—showing analogous success in high-dimensional, multimodal settings.

Together, Dispersive MeanFlow Policy Optimization establishes a rigorous, practical framework for high-speed, multimodal generative policies in control and reasoning applications, underpinned by robust mathematical derivations and empirical validation across diverse regimes (Wang et al., 17 Nov 2025, Zou et al., 28 Jan 2026, Fang et al., 22 Dec 2025, Zou et al., 9 Oct 2025, Sheng et al., 14 Jul 2025).
