One-step Latent-free Image Generation with Pixel Mean Flows

Published 29 Jan 2026 in cs.CV | (2601.22158v1)

Abstract: Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel model, Pixel MeanFlow (pMF), that eliminates latent space for one-step image generation.
The methodology maps noisy inputs directly to pixel space using a transformation between image and velocity spaces to achieve high fidelity with FID scores as low as 2.22.
The results suggest that pMF enhances computational efficiency and training speed while maintaining image diversity on challenging datasets like ImageNet.

One-step Latent-free Image Generation with Pixel Mean Flows

Introduction

The paper "One-step Latent-free Image Generation with Pixel Mean Flows" (2601.22158) presents a novel approach for image generation using diffusion and flow-based models. Traditional models typically rely on multi-step sampling and operation within a latent space, which involves breaking down the complex problem of image generation into simpler subproblems. This paper proposes Pixel MeanFlow (pMF), which aims to achieve one-step image generation directly in the pixel space without relying on latent representations.

Methodology

The Pixel MeanFlow (pMF) model operates on the premise of establishing a clear separation between the network's output space and the loss space. The output is designed to lie on a low-dimensional image manifold (referred to as $x$ -prediction), while the loss is defined in the velocity space using a MeanFlow framework. By introducing a transformation between these two spaces, the researchers enable the neural network to perform one-step image generation efficiently.

In pixel MeanFlow, the network learns to map noisy inputs directly to image pixels, ensuring that the output maintains a "what-you-see-is-what-you-get" characteristic, thus enhancing generation quality through the introduction of perceptual loss.

Results

The experimental results demonstrate the superiority of pMF in generating high-quality images in a single step without relying on latent spaces. The model achieves an FID score of 2.22 at a resolution of 256×256 and 2.48 at 512×512 on the ImageNet dataset, indicating strong performance in terms of both image fidelity and diversity when compared to existing methods.

Figure 1: Muon converges faster and achieves better FID. At 320 epochs, Adam reaches 11.86 FID, while Muon achieves 8.71 FID. (Settings: pMF-B/16, MSE loss).

Implications

The implications of this research are significant for the field of image generation, as it provides a viable pathway towards simplifying the generative process by removing the need for latent variables. This contributes to computational efficiency and potentially faster training times due to the reduced complexity.

Furthermore, pMF's approach is in line with the manifold hypothesis, suggesting that images can be effectively modeled as lying on a low-dimensional manifold. This perspective may inspire further exploration into end-to-end learning architectures that leverage pixel-based representations directly.

Conclusion

In conclusion, the Pixel MeanFlow (pMF) model presents a promising direction for the advancement of image generation techniques, demonstrating that it is feasible to achieve competitive performance with a one-step, latent-free approach. This study provides valuable insights into the potential for simplifying generative models, aligning with the broader goals of developing efficient and effective deep learning frameworks. Future research may expand on this work by exploring more complex datasets and further optimizing the network architecture for different applications.