
Pixel MeanFlow: One-Step Image Generation

Updated 1 February 2026
  • Pixel MeanFlow is a one-step, latent-free generative model that maps Gaussian noise directly to images via ODE-based flow matching.
  • It decouples the prediction (image manifold) and supervision (velocity space) to efficiently train a ViT-based network for robust image synthesis.
  • The method achieves state-of-the-art FID scores on ImageNet, offering high-quality, pixel-space image generation in a single neural network evaluation.

Pixel MeanFlow (pMF) is a framework for one-step, latent-free generative modeling that produces high-quality images directly in pixel space by leveraging the MeanFlow formulation for ODE-based flow matching. pMF is designed to map Gaussian noise to an image in a single neural network evaluation, sidestepping the need for iterative sampling, denoising steps, or learned latent bottlenecks. It achieves this via a novel architectural decoupling of the prediction (network output) space and the loss (supervision) space, harnessing a manifold-aligned prediction target and a velocity-space loss to ensure consistency with the underlying flow ODE trajectory (Lu et al., 29 Jan 2026).

1. Motivation and Conceptual Framework

Pixel MeanFlow addresses the two main limitations of conventional diffusion and flow-based models: reliance on multi-step sampling algorithms and operations in latent rather than pixel space. The methodology’s core objective is to develop a model that generates images from standard Gaussian noise without iterative procedures or dependency on a decoder for latent variable representations. This is realized by directly training a neural network

x_\theta(z_t, r, t)

to predict a denoised image-like field from a noisy input, where the output is structured to lie on or near the image manifold, while the loss is defined on the flow velocity induced by the denoising ODE dynamics. This separation is critical: network prediction targets the data manifold; supervision occurs in the physically meaningful velocity space (Lu et al., 29 Jan 2026).

2. Formulation of Prediction and Loss Spaces

pMF’s central design is the decoupling of the prediction target (“image manifold”) and the supervision objective (“velocity space”). The prediction space is

x_\theta(z_t, r, t) \in \mathbb{R}^{H \times W \times 3},

where H, W are the spatial resolution, and the network output is image-like (potentially blurred but manifold-aligned).

Supervision is conducted via MeanFlow in velocity space. Let z_t = (1-t)x + t\epsilon, with \epsilon standard Gaussian noise. The instantaneous velocity is given by

v(z_t, t) = \frac{d}{dt} z_t = \epsilon - x,

while the average velocity over [r, t] is

u(z_t, r, t) = \frac{1}{t-r} \int_r^t v(z_\tau, \tau)\, d\tau.

The MeanFlow identity relates these:

v(z_t, t) = u(z_t, r, t) + (t - r) \frac{\partial}{\partial t} u(z_t, r, t).

The network ultimately predicts the denoised image x, which is then mapped (via invertible algebraic transforms) to an average velocity field for use in the MeanFlow loss (Lu et al., 29 Jan 2026).
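The identity can be checked numerically on the linear interpolation path, where v = \epsilon - x is constant in t, so the average velocity equals v and its time derivative vanishes. A minimal sketch with toy scalar values standing in for an image and its noise:

```python
# Numeric check of the MeanFlow identity on the linear path z_t = (1-t)x + t*eps.
# Toy scalar samples stand in for an image and a noise draw (illustrative values).
x, eps = 0.7, -1.3
v = eps - x                                    # instantaneous velocity dz_t/dt

def z(tau):
    return (1.0 - tau) * x + tau * eps         # linear interpolation path

def u(r, t):
    return (z(t) - z(r)) / (t - r)             # average velocity = displacement / time

r, t = 0.2, 0.9
h = 1e-5
du_dt = (u(r, t + h) - u(r, t - h)) / (2 * h)  # finite-difference d/dt u
assert abs(v - (u(r, t) + (t - r) * du_dt)) < 1e-6   # v = u + (t - r) * du/dt
```

Here the check is trivial (u = v and du/dt = 0 on the linear path); for a learned, imperfect u the identity is what the loss enforces.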

3. Loss Function and Training Objectives

The pixel MeanFlow loss is formulated as

\mathcal{L}_{\text{pMF}} = \mathbb{E}_{t, r, x, \epsilon} \left\| V_\theta(z_t, r, t) - v(z_t, t) \right\|^2,

where the network-derived estimate of instantaneous velocity,

V_\theta = u_\theta + (t-r)\, \mathtt{JVP}_{\text{sg}}(\partial_t u_\theta),

incorporates both the predicted mean velocity uθu_\theta and its time derivative, with the latter computed by a Jacobian-vector-product (JVP) under stop-gradient to stabilize learning.

A “generalized denoised image” field is defined via

x(z_t, r, t) \triangleq z_t - t \cdot u(z_t, r, t),

with invertible mappings:

u = \frac{1}{t}(z_t - x), \quad v = u + (t-r) \frac{\partial}{\partial t} u.

The proposed network backbone is a patch-based Vision Transformer (ViT); input tokens encode both the noisy input and conditioning information (r, t), and the output is mapped to an average velocity and then transformed via the MeanFlow identity for loss computation. Training uses the Muon optimizer (parameters (\beta_1, \beta_2) = (0.9, 0.95), LR 10^{-3}), with batch size 1024 and 160–360 epochs depending on ablation or final training (Lu et al., 29 Jan 2026).
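A hedged sketch of evaluating this loss once on toy scalars: the one-weight linear "network" is purely illustrative, and a central finite difference along the trajectory stands in for the paper's stop-gradient JVP.

```python
# Toy pMF loss evaluation: the model predicts x directly, the prediction is
# mapped to an average velocity u = (z_t - x)/t, and the instantaneous-velocity
# estimate V = u + (t - r) * d/dt u is regressed onto the target v = eps - x.
# The time derivative is taken along the trajectory (z_t moves with velocity v),
# approximated here by central differences instead of a stop-gradient JVP.
w = 0.5                                        # single illustrative weight

def x_pred(z_t, r, t):
    return w * z_t * (1.0 - t)                 # toy manifold-aligned x-prediction

def u_pred(z_t, r, t):
    return (z_t - x_pred(z_t, r, t)) / t       # invertible map: x -> u

x, eps = 0.3, -0.8
r, t = 0.1, 0.7
z_t = (1.0 - t) * x + t * eps
v = eps - x                                    # velocity-space supervision target

h = 1e-4
du_dt = (u_pred(z_t + v * h, r, t + h) - u_pred(z_t - v * h, r, t - h)) / (2 * h)
V = u_pred(z_t, r, t) + (t - r) * du_dt        # MeanFlow identity applied to u_theta
loss = (V - v) ** 2                            # squared error in velocity space
assert loss >= 0.0
```

In training, `loss` would be averaged over a batch and differentiated with respect to the network parameters, with the derivative term held out of the gradient.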

An optional perceptual loss combines LPIPS-VGG and LPIPS-ConvNeXt features (weights 0.4 and 0.1) and is selectively applied for early noise schedules (ttthrt \le t_{\text{thr}}).

4. Implementation and Sampling

Key implementation parameters include:

  • Patch size of image_size/16 at 256×256 resolution (patch dim 768);
  • Model variants spanning {B/16, L/16, H/16} with depths {16, 32, 48}, widths {768, 1024, 1280}, and parameter counts from 119M to 956M;
  • Time pairs (t, r) sampled from a logit-normal distribution over the triangle 0 \le r \le t \le 1, with forced diversity (50% of samples with r \neq t).
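The (t, r) sampling in the last bullet can be sketched as follows; the logit-normal parameters (mu, sigma) and the exact 50/50 split are assumptions for illustration:

```python
import math
import random

# Draw a (t, r) pair from a logit-normal distribution over the triangle
# 0 <= r <= t <= 1, collapsing r to t for roughly half of the samples
# (r = t reduces the objective to instantaneous flow matching).
def sample_tr(mu=-0.4, sigma=1.0, p_distinct=0.5):
    def logit_normal():
        return 1.0 / (1.0 + math.exp(-random.gauss(mu, sigma)))
    a, b = logit_normal(), logit_normal()
    t, r = max(a, b), min(a, b)
    if random.random() >= p_distinct:
        r = t
    return t, r

pairs = [sample_tr() for _ in range(1000)]
assert all(0.0 <= r <= t <= 1.0 for t, r in pairs)
```

Ordering two independent draws keeps the pair inside the triangle without rejection sampling.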

Sampling is a single forward pass: draw z_1 \sim \mathcal{N}(0, I), then output

\hat{x} = x_\theta(z_1, r=0, t=1).

There is no iterative scheduler, denoising loop, or latent variable decoding.
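The entire sampling procedure above can be sketched in a few lines; the zero-returning `x_theta` is a placeholder for a trained network (an assumption for illustration only):

```python
import random

# One-step sampling: draw z1 ~ N(0, I) and evaluate the network exactly once
# at (r, t) = (0, 1). The placeholder x_theta returns zeros; a real model
# would map the noise to an image in this same single call, with no
# scheduler, denoising loop, or latent decoder.
def x_theta(z, r, t):
    return [0.0 for _ in z]                    # placeholder "network"

H, W = 4, 4                                    # tiny toy resolution
z1 = [random.gauss(0.0, 1.0) for _ in range(H * W * 3)]
x_hat = x_theta(z1, r=0.0, t=1.0)              # the entire sampling procedure
assert len(x_hat) == H * W * 3
```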

5. Empirical Performance and Comparison

pMF achieves state-of-the-art FID for one-step latent-free image synthesis:

  • ImageNet 256×256: 2.22 FID (H/16 variant, 360 epochs);
  • ImageNet 512×512: 2.48 FID (H/32, scaled patch, 360 epochs).

Comparative context:

Model type                     1-step FID (256×256)   Latent decoder   Network FLOPs (M)
DiT/SiT+REPA (multi-step)      ~1.4 (best)            Yes              300–400 per sample
One-step GAN (StyleGAN-XL)     ~2.3                   No               ~1574
EPG (one-step diffusion)       8.82                   No               Not specified
pMF-H/16                       2.22                   No               271

pMF occupies a unique regime: direct pixel-space synthesis in one call, no decoder, and substantially lower compute load than existing GANs (Lu et al., 29 Jan 2026).

6. Architectural and Methodological Innovations

Pixel MeanFlow’s distinctive contributions are:

  • Use of manifold-aligned prediction (direct x-prediction) rather than velocity field output;
  • Decoupling between network output and supervision spaces, enabling the network to focus learning capacity on the data manifold;
  • Explicit invertible transformations linking image manifold and velocity space for principled supervision;
  • Incorporation of time-interval diversity via random sampling of (t, r) for improved target geometry and gradient flow;
  • Efficient ViT-based architecture designed for high-resolution pixel space synthesis.

These design choices facilitate robust one-step generation while remaining tractable at the high dimensions of pixel space (Lu et al., 29 Jan 2026).

7. Limitations and Future Directions

Major limitations include:

  • The requirement for carefully tuned sampling of (r, t): restricting to only r = 0 or only r = t leads to failure.
  • JVP-based time-derivative computation increases complexity and compute overhead compared to strictly image-prediction models.
  • At very high resolutions (512×512), patch sizes must be aggressively enlarged (e.g., 32×32), and performance, while strong, still lags multi-step latent models (best ~1.1 FID).

Future work could explore learned continuous-time samplers, more efficient manifold parameterizations, adversarial/perceptual hybrid training, or lightweight latent decoders to push FID below 2 (Lu et al., 29 Jan 2026).

The pMF (“pixel MeanFlow”) method is distinct from the Re-MeanFlow and MeanFlow models described in earlier literature (Zhang et al., 28 Nov 2025). The latter operate on the full sample vector (or latent codes at higher spatial resolutions) and do not define per-pixel MeanFlow. In contrast, pMF directly parameterizes its output in pixel space, obviating the need for a latent bottleneck or external decoder. The pMF loss formulation, architectural decoupling, and strictly single-step synthesis in pixel space distinguish it from these predecessors and position it as a uniquely effective approach for end-to-end, latent-free, one-step generative modeling (Zhang et al., 28 Nov 2025, Lu et al., 29 Jan 2026).
