Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Published 13 Mar 2026 in cs.CV, cs.AI, cs.LG, cs.NE, and stat.ML | (2603.12893v1)

Abstract: Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision LLMs and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

Summary

  • The paper presents a novel finite difference flow optimization (FDFO) method that improves RL post-training of text-to-image models by effectively aligning rewards.
  • It replaces traditional MDP policy gradients with a paired trajectory finite difference approach to reduce variance, suppress reward-neutral noise, and minimize artifacts.
  • Empirical validations on Stable Diffusion indicate that FDFO achieves faster convergence and higher quality outputs with enhanced prompt alignment and robustness.

Overview

The paper "Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models" (2603.12893) introduces a novel reinforcement learning (RL) framework—Finite Difference Flow Optimization (FDFO)—for post-training conditional diffusion-based text-to-image models. The approach addresses significant limitations in prior RL-based post-training methods by reducing variance in flow updates and increasing convergence speed, output quality, and prompt alignment.

Motivation and Problem Formulation

Post-training of diffusion-based generative models is essential for refining aspects such as image quality and prompt alignment, which are not directly quantifiable in pre-training objectives. Reinforcement learning enables optimization with respect to arbitrary, often non-differentiable reward functions, including human preference models like PickScore and vision-language model (VLM) alignment metrics.

Existing RL post-training methods, such as Flow-GRPO, formulate the sampling process as a Markov Decision Process (MDP), treating each discretization step as a separate action and optimizing with stochastic policy gradients. However, these approaches suffer from substantial variance, reward-neutral noise, and uncontrolled drift in the model’s output space, which can exacerbate reward hacking and limit the quality gains achievable through fine-tuning.

The goal of FDFO is to algorithmically improve the signal-to-noise ratio in flow parameter updates during RL post-training, directly targeting image improvement via the reward signal while mitigating unwanted drift and artifacts.

Methodology

Flow Matching Paradigm

Diffusion models generate images by applying a learned probability flow (parameterized by a neural velocity field $v_\theta$) to iteratively denoise random Gaussian samples. Flow matching parameterizes this process as a deterministic ODE, with stochastic noise injection used for trajectory diversification.
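The deterministic side of this process can be sketched as plain Euler integration of the probability-flow ODE $dx/dt = v_\theta(x, t, c)$. Here `v_theta` is a placeholder for the learned velocity network; its `(x, t, cond)` signature and the step count are illustrative assumptions, not the paper's exact API:

```python
import numpy as np

def sample_ode(v_theta, x0, cond, num_steps=28):
    """Euler integration of the probability-flow ODE dx/dt = v_theta(x, t, c).

    v_theta stands in for the learned velocity network; the (x, t, cond)
    signature is an illustrative assumption.
    """
    x = x0.copy()  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * v_theta(x, t, cond)  # deterministic Euler update
    return x
```

In practice samplers use non-uniform time grids and higher-order solvers; the uniform Euler schedule above is the simplest correct instance.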

Finite Difference Flow Optimization

FDFO departs from the MDP policy gradient paradigm; instead, it considers the entire image generation trajectory as a single action. For each prompt, it samples paired trajectories diverging from the same initial noise, perturbed by minimal stochasticity. The pairwise difference in output images, weighted by the reward difference, is used as an approximate gradient directing the flow velocity update across all trajectory steps toward higher-reward regions.

Formally, the update leverages:

  • Image difference vector $\Delta x = x_T' - x_T$
  • Reward difference $\Delta R = R(x_T') - R(x_T)$
  • The normalized training signal $\Delta R \cdot \Delta x$
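The quantities above can be assembled into a single paired-trajectory training signal, sketched below. `sample_fn` and `reward_fn` are hypothetical placeholders for the full sampler and reward model, and the normalization is illustrative rather than the paper's exact scheme:

```python
import numpy as np

def fdfo_signal(sample_fn, reward_fn, x0, eps_scale=0.05, rng=None):
    """Compute one paired-trajectory finite-difference training signal.

    sample_fn(x0, noise) runs a full sampling trajectory from shared
    initial noise x0 with injected perturbation `noise`; reward_fn scores
    the final image. Both are placeholders for the paper's components.
    """
    rng = rng or np.random.default_rng(0)
    noise_a = eps_scale * rng.standard_normal(x0.shape)
    noise_b = eps_scale * rng.standard_normal(x0.shape)
    x_a = sample_fn(x0, noise_a)               # trajectory 1, shared x0
    x_b = sample_fn(x0, noise_b)               # trajectory 2, shared x0
    delta_x = x_b - x_a                        # image difference
    delta_r = reward_fn(x_b) - reward_fn(x_a)  # reward difference
    # Normalized surrogate gradient pulling the flow toward the better image.
    return delta_r * delta_x / (np.linalg.norm(delta_x) + 1e-8)
```

Because the two trajectories share the initial noise, the difference isolates the effect of the injected perturbation, which is what keeps the signal low-variance.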

This approach ensures every update directionally benefits reward maximization, suppressing reward-neutral noise and reducing variance compared to Flow-GRPO, where individual step updates can be reward-neutral or even detrimental.

The stochastic sampling procedure is adapted with principles from the EDM sampler [24], introducing calibrated noise injections proportional to the current noise level to generate meaningful trajectory variations.
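A minimal sketch of such calibrated noise injection, assuming the standard EDM-style "churn" formulation (the constants and schedule here are illustrative, not the paper's):

```python
import numpy as np

def churn_step(x, sigma, s_churn=0.5, num_steps=28, rng=None):
    # Inject noise proportional to the current noise level sigma, in the
    # spirit of EDM-style "churn": raise sigma to sigma_hat, then add the
    # matching amount of fresh Gaussian noise so the marginal is preserved.
    rng = rng or np.random.default_rng(0)
    gamma = min(s_churn / num_steps, np.sqrt(2.0) - 1.0)
    sigma_hat = sigma * (1.0 + gamma)
    noise = rng.standard_normal(x.shape)
    x_hat = x + np.sqrt(sigma_hat**2 - sigma**2) * noise
    return x_hat, sigma_hat
```

Scaling the injected noise with the current noise level keeps perturbations meaningful early in sampling while avoiding artifacts near the clean image.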

Implementation details include normalized gradient signals, on-policy optimization with policy ratio clipping (SPO), batching for memory efficiency, and compatibility with KL regularization and classifier-free guidance (CFG).
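Of these, policy-ratio clipping can be illustrated with the generic PPO-style form; the SPO variant referenced above may differ in detail, so treat this as a baseline sketch:

```python
import numpy as np

def clipped_ratio_loss(logp_new, logp_old, advantage, eps=0.2):
    # Generic PPO-style policy-ratio clipping: cap the importance ratio so a
    # single update cannot move the policy too far from the sampling policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()
```

The clip range `eps=0.2` is the common default from PPO, not a value reported in the paper.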

Empirical Evaluation

Reward Maximization

Extensive experiments with Stable Diffusion 3.5 Medium and LoRA adaptation show that FDFO converges faster and achieves higher rewards for both human preference and VLM-based prompt alignment objectives. In challenging reward scenarios (VLM alignment and combined rewards), FDFO maintains superior convergence and output fidelity relative to Flow-GRPO.

Quality, Alignment, and Diversity Metrics

Evaluations with external metrics (OneIG-Bench, HPSv2, CLIP, DreamSim) confirm that FDFO yields higher prompt alignment, better match to human preferences, and reduced reward hacking artifacts. Diversity decay is equivalent between methods when matched for reward levels, but FDFO’s rapid convergence enables more efficient exploration of the reward-diversity tradeoff.

Grid-like reward hacking artifacts and style drift, observed in Flow-GRPO outputs, do not emerge in FDFO, highlighting improved stability and artifact suppression.

Adding VLM-based reward components further enhances prompt alignment, with additive formulation preferred over combined prompt objectives to maintain a balanced optimization across multiple reward criteria.

Ablation Studies

Ablations show that each component of FDFO—stochastic sampling, pairwise finite-difference gradient updates, shared initial noise for paired trajectories, batch normalization, and SPO clipping—contributes incrementally to convergence speed and output quality. Using deterministic sampling for one trajectory or PPO-style clipping yields similar results given tuned stochasticity schedules. Direct backpropagation through the reward or SDE steps is less effective than the finite-difference approach employed by FDFO.

Wall-clock Performance

Benchmarks on NVIDIA H200 GPUs demonstrate that FDFO is 5x–19x faster in reaching high reward levels versus Flow-GRPO, accounting for both implementation efficiency and intrinsic algorithmic convergence.

Theoretical Analysis

The finite-difference update scheme approximates the gradient of a smoothed reward function (via pairwise perturbations). Under mild assumptions (positive semi-definite Jacobians of the flow mapping, inspired by optimal transport theory), this direction is expected to be a reward ascent direction, offering theoretical justification for convergence and stability observed empirically.
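This matches the standard two-point (antithetic) smoothed-gradient identity; a sketch under Gaussian smoothing with scale $\sigma$ (the generic identity, not the paper's exact derivation):

```latex
% Gaussian-smoothed reward: R_sigma(x) = E_{eps ~ N(0, sigma^2 I)}[R(x + eps)].
% Its gradient admits the two-point (antithetic) estimator:
\nabla_x R_\sigma(x)
  = \frac{1}{\sigma^2}\,
    \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2 I)}
      \bigl[ R(x+\epsilon)\,\epsilon \bigr]
  = \frac{1}{2\sigma^2}\,
    \mathbb{E}_{\epsilon}
      \bigl[ \bigl(R(x+\epsilon) - R(x-\epsilon)\bigr)\,\epsilon \bigr].
```

With paired trajectories sharing initial noise, $\Delta x$ plays the role of the paired perturbation and $\Delta R$ the reward difference, so $\Delta R \cdot \Delta x$ is, up to scale and the stated Jacobian assumption, an ascent direction for the smoothed reward.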

Practical and Theoretical Implications

Practically, FDFO enables more robust RL post-training for text-to-image diffusion models, mitigating reward hacking and improving sample efficiency. The algorithm provides a drop-in replacement to state-of-the-art methods like Flow-GRPO and DanceGRPO with reduced variance and artifact risk.

Theoretically, the departure from MDP-based policy gradients shortens the reward attribution horizon and expands the class of feasible RL objectives. The approach is agnostic to differentiability of reward functions and integrates seamlessly with existing RL techniques (e.g., KL regularization) and emerging reward models (VLMs).

Speculatively, as vision-language models improve, richer reward compositions leveraging VLM proxies can enhance prompt alignment and visual quality, further benefiting from the stable and efficient optimization provided by FDFO.

Future directions include direct diversity-targeting reward components, improved trade-off mechanisms for alignment-diversity, and integration with richer downstream evaluation protocols.

Conclusion

The paper delineates a rigorous and efficient RL post-training paradigm for diffusion-based text-to-image models, substantially advancing convergence speed, output quality, and robustness against reward-neutral drift and artifacts. The finite difference flow optimization framework, underpinned by theoretical and empirical analysis, sets a new standard in RL post-training, enabling scalable and stable deployment of generative image models aligned with complex reward signals.
