
Semantic Relative Preference Optimization (SRPO)

Updated 15 February 2026
  • Semantic Relative Preference Optimization (SRPO) is a suite of training techniques that align text-to-image diffusion models with human semantic preferences using relative preference signals and semantic-aware weighting.
  • It leverages full diffusion trajectory reward propagation with prompt-controlled adjustments to achieve substantial gains in realism and style fidelity, evidenced by improved human ratings and faster convergence.
  • SRPO integrates contrastive learning with semantic regularization and embedding-based cross-prompt weighting to dynamically guide the model from coarse structure to fine detail.

Semantic Relative Preference Optimization (SRPO) is a suite of training and fine-tuning techniques for aligning text-to-image diffusion models with human semantic preferences. SRPO integrates contrastive preference learning with semantic-aware weighting and prompt-controlled online relative rewards, supporting efficient optimization across the full diffusion trajectory and providing substantial gains in perceived realism, style fidelity, and user alignment. Notably, SRPO formalizes the preference signal as a semantic differential, rather than as an absolute reward, and leverages prompt augmentation and CLIP-style joint embedding for dynamic, gradient-based model alignment (Gu et al., 2024; Shen et al., 8 Sep 2025).

1. Core Formulation and Relative Preference Signal

SRPO defines a relative preference signal as the central supervisory feedback during fine-tuning of a generative diffusion network. Given a batch of human-annotated prompt–image pairs

$$D = \{ (p_i, x^w_i, x^l_i) \}_{i=1}^M$$

each instance provides a prompt $p_i$, a "winner" image $x^w_i$ (preferred by humans), and a "loser" image $x^l_i$. At every diffusion timestep $t$, the SRPO objective is defined over pairs (or generalized cross-prompt pairs) as:

$$L_{RPO}(\theta) = -\mathbb{E}_{t;\,i,j} \left[ \omega_{i,j} \cdot \log \sigma \left( r_\theta(x^w_i; t, p_i) - r_\theta(x^l_j; t, p_j) \right) \right]$$

where $r_\theta$ is a log-likelihood ratio (see Section 3), $\omega_{i,j}$ are semantic similarity weights, and $\sigma$ is the sigmoid function. This loss contrasts the degree of model alignment for preferred and unpreferred samples and dynamically incorporates semantic proximity via precomputed joint embeddings.
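Under the definitions above, the weighted pairwise objective can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's implementation: the rewards $r_\theta$ and weights $\omega_{i,j}$ are taken as precomputed inputs, and a mean over pairs stands in for the expectation.

```python
import math

def srpo_pairwise_loss(r_win, r_lose, weights):
    """Semantically weighted pairwise SRPO loss (minimal sketch).

    r_win[i]      : reward r_theta(x_i^w; t, p_i) for winner i
    r_lose[j]     : reward r_theta(x_j^l; t, p_j) for loser j
    weights[i][j] : semantic similarity weight omega_{i,j}
    """
    def log_sigmoid(z):
        # numerically stable log(sigmoid(z))
        return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

    total, count = 0.0, 0
    for i, rw in enumerate(r_win):
        for j, rl in enumerate(r_lose):
            # -omega_{i,j} * log sigma(r_w - r_l), summed over all pairs
            total += -weights[i][j] * log_sigmoid(rw - rl)
            count += 1
    return total / count
```

With equal winner and loser rewards the loss reduces to $\log 2$ per pair, the usual neutral point of a Bradley–Terry-style objective.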

In prompt-augmentation SRPO (Shen et al., 8 Sep 2025), the same prompt is expanded with positive and negative control tokens (e.g., "Realistic photo of" vs. "CG render of"), and the model is trained to produce outputs that maximize the difference in CLIP-based reward between the two controlled generations for a single noisy latent $x_t$:

$$r_{\mathrm{SRP}}(x) = \mathrm{RM}(x, p_c^+ \,\|\, p) - \mathrm{RM}(x, p_c^- \,\|\, p)$$
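The control-token reward differential is straightforward to sketch. The control tokens below are the paper's examples; the `reward_model` callable is a hypothetical stand-in for a CLIP-style reward model scoring an image against a prompt, and simple string concatenation stands in for the $\|$ operator.

```python
def semantic_relative_reward(image, prompt, reward_model,
                             pos_token="Realistic photo of",
                             neg_token="CG render of"):
    """r_SRP(x) = RM(x, p_c^+ || p) - RM(x, p_c^- || p)  (sketch).

    reward_model(image, prompt) -> float is an assumed interface,
    not a real library API.
    """
    r_pos = reward_model(image, f"{pos_token} {prompt}")
    r_neg = reward_model(image, f"{neg_token} {prompt}")
    # positive when the image reads as more "realistic photo"
    # than "CG render" under the reward model
    return r_pos - r_neg
```

Because both terms share the same reward model, any constant bias of the model cancels, which is the point of the relative (differential) formulation.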

2. Semantic Weighting and Cross-Prompt Structure

To extend preference informativeness, SRPO generalizes the pairwise loss with semantic weighting:

  • For each batch, all winner–loser pairs $(x^w_i, x^l_j)$ are compared.
  • Cosine similarity between the CLIP joint embeddings $f_{\mathrm{CLIP}}(p_i, x^w_i)$ and $f_{\mathrm{CLIP}}(p_j, x^l_j)$ provides a semantic distance, converted via temperature scaling to raw weights:

$$\tilde{\omega}_{i,j} = \exp\left( - \frac{1 - \cos(\cdot)}{\tau} \right), \qquad \omega_{i,j} = \frac{\tilde{\omega}_{i,j}}{\sum_{j'} \tilde{\omega}_{i,j'}}$$

  • The weighting matrix $\omega_{i,j}$ focuses optimization on semantically similar cross-prompt comparisons, amplifying the learning signal where the user preference structure is richest (Gu et al., 2024).

This semantic contrastive weighting scheme increases the model's sensitivity to both prompt-specific and global style/generalization axes, governed by the temperature parameter $\tau$: lower $\tau$ emphasizes fine prompt distinctions, while higher $\tau$ encourages broad visual alignment.
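The weight computation can be sketched directly from the formula above. This is an illustrative pure-Python version: the embeddings are taken as precomputed vectors (in practice they would come from a CLIP joint encoder), and weights are row-normalized per winner as in the normalization term.

```python
import math

def semantic_weights(win_embs, lose_embs, tau=0.1):
    """omega_{i,j} from joint embeddings (minimal sketch).

    win_embs[i]  : embedding f_CLIP(p_i, x_i^w)
    lose_embs[j] : embedding f_CLIP(p_j, x_j^l)
    Returns row-normalized weights: sum_j omega_{i,j} = 1 for each i.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    weights = []
    for u in win_embs:
        # exp(-(1 - cos) / tau): distance 0 -> weight 1, larger
        # distances decay exponentially at rate set by tau
        raw = [math.exp(-(1.0 - cos(u, v)) / tau) for v in lose_embs]
        z = sum(raw)
        weights.append([w / z for w in raw])
    return weights
```

With a small $\tau$, nearly all of each row's mass concentrates on the semantically closest loser, matching the "fine prompt distinctions" regime described above.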

3. Integration with the Diffusion Framework

SRPO utilizes the full diffusion process for reward propagation, avoiding reward overfitting at late denoising steps ("reward hacking"). The model is parameterized as a standard DDPM or latent U-Net (e.g., Stable Diffusion, FLUX.1.dev). For a timestep $t$, the noisy latent $x_t$ is generated by interpolating $x_0$ with Gaussian noise. Preference-guided updates are injected via log-likelihood ratios between the online model $\pi_\theta$ and a frozen reference $\pi_{ref}$:

$$r_\theta(x; t, p) = \beta \left[ \log \pi_\theta(x_t \mid x_{t+1}, p) - \log \pi_{ref}(x_t \mid x_{t+1}, p) \right]$$

For Gaussian diffusion with score-matching, this reduces to an energy difference over predicted vs. true noise, scaled by step-dependent constants.
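A sketch of this reduction, under the assumption that online and reference transitions share the same covariance so the log-ratio collapses to a difference of squared noise-prediction errors; the `scale` argument is a stand-in for the step-dependent constants, not the paper's exact coefficient.

```python
def gaussian_diffusion_reward(eps_theta, eps_ref, eps_true,
                              beta=1.0, scale=1.0):
    """Per-step reward as an energy difference (illustrative sketch).

    eps_theta : noise predicted by the online model pi_theta
    eps_ref   : noise predicted by the frozen reference pi_ref
    eps_true  : the actual noise injected at this timestep
    """
    err_theta = sum((a - b) ** 2 for a, b in zip(eps_theta, eps_true))
    err_ref = sum((a - b) ** 2 for a, b in zip(eps_ref, eps_true))
    # reward is positive when the online model predicts the true
    # noise better than the reference does
    return beta * scale * (err_ref - err_theta)
```

The sign convention follows the log-ratio: a smaller online error (higher online likelihood) yields a higher reward.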

In Direct-Align SRPO (Shen et al., 8 Sep 2025), the reward is computed at multiple timesteps with a monotonic late-step discount $\lambda(t)$, promoting corrections throughout the full denoising trajectory:

$$R(t) = \lambda(t) \left[ r_1(\hat{x}_0^{\mathrm{denoise}}) - r_2(\tilde{x}_0^{\mathrm{inv}}) \right]$$

This enables optimization beyond the final denoising stage, increasing stability and suppressing adversarial reward gaming.
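The discounted reward is simple to sketch; the geometric form of $\lambda(t)$ below is an assumed illustration (the paper's discount may differ), chosen only to show the monotone down-weighting of late steps.

```python
def direct_align_reward(r_denoise, r_inverted, step, gamma=0.9):
    """R(t) = lambda(t) * [r1(x0_denoise) - r2(x0_inv)]  (sketch).

    r_denoise  : reward of the one-step denoised estimate x0_hat
    r_inverted : reward of the inverted estimate x0_tilde
    step       : 0 for the earliest step, increasing toward the end
    gamma      : assumed geometric decay standing in for lambda(t)
    """
    # discount shrinks as denoising progresses, so late steps cannot
    # dominate the objective (suppressing reward hacking)
    lam = gamma ** step
    return lam * (r_denoise - r_inverted)
```

Summing `direct_align_reward` over all steps gives a trajectory-wide signal in which early, coarse-structure corrections carry the most weight.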

4. Semantic Regularization and Style Generalization

SRPO introduces an explicit semantic regularizer for aligning model outputs from semantically proximate prompt–image pairs:

$$L_{sem}(\theta) = \mathbb{E}_{i,j} \left[ S(p_i, p_j) \cdot \lVert f_{\mathrm{CLIP}}(p_i, \hat{x}_i) - f_{\mathrm{CLIP}}(p_j, \hat{x}_j) \rVert_2 \right]$$

where $S(p_i, p_j)$ is a softmaxed similarity between prompts. This regularizer, weighted by a hyperparameter $\lambda$, penalizes embedding drift between generations conditioned on related textual content, supporting improved style and global visual character retention.
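A minimal sketch of the regularizer, taking the softmaxed prompt similarities $S(p_i, p_j)$ and the generation embeddings $f_{\mathrm{CLIP}}(p_i, \hat{x}_i)$ as precomputed inputs (both would come from the text and joint encoders in practice), with a mean over pairs standing in for the expectation:

```python
import math

def semantic_regularizer(prompt_sims, gen_embs):
    """L_sem sketch: similarity-weighted embedding distance.

    prompt_sims[i][j] : softmaxed prompt similarity S(p_i, p_j)
    gen_embs[i]       : embedding f_CLIP(p_i, x_hat_i)
    """
    n = len(gen_embs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            # L2 distance between the two generations' embeddings
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(gen_embs[i], gen_embs[j])))
            total += prompt_sims[i][j] * dist
            count += 1
    return total / count
```

Identical embeddings incur zero penalty; the penalty grows only where similar prompts produce divergent generations, which is exactly the drift the regularizer targets.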

Empirical ablations demonstrate that this regularizer, together with cross-prompt semantic weighting in $L_{RPO}$ and temperature tuning, governs a trade-off between prompt adherence and broader style matching.

5. Training Dynamics and Algorithmic Structure

The SRPO training protocol operates as follows:

  • For each batch $(p_i, x^w_i, x^l_i)$, noisy latents are synthesized via forward diffusion.
  • For each timestep $t$, model and reference outputs are scored, semantically weighted pairwise losses are accumulated, and the semantic regularization loss is added.
  • The optimizer (AdamW or Adafactor) updates all U-Net parameters against $L_{total} = L_{RPO} + \lambda L_{sem}$, with reference weights frozen.
  • In prompt-augmentation SRPO (used with Direct-Align), batches are constructed using positive and negative control tokens, and reward differentials between the two augmentations provide single-pass, trajectory-wide preference guidance.

Pseudocode for the core control-token SRPO (with Direct-Align) is given in (Shen et al., 8 Sep 2025), covering minibatch sampling, per-$t$ noise injection, paired stepwise reward computation, and discounted per-batch updates.

6. Empirical Results and Evaluation Metrics

SRPO has demonstrated improvements over prior preference alignment approaches on both human preference and style alignment benchmarks. Key evaluated metrics include:

| Method | HPSv2 (SDXL) | PickScore (SDXL) | FID (Van Gogh, SDXL) | Realism (FLUX, human)* | Aesthetic (FLUX, human)* | Convergence time (FLUX) |
|--------|--------------|------------------|----------------------|------------------------|--------------------------|-------------------------|
| Base   | 27.900       | 22.694           | —                    | 8.2%                   | 9.8%                     | —                       |
| DPO    | 28.082       | 23.159           | 152.35               | —                      | —                        | —                       |
| SRPO   | 28.658       | 23.208           | 97.57                | 38.9%                  | 40.5%                    | 5.3 GPU hrs             |

* Human "Excellent" ratings for realism/aesthetics (FLUX.1.dev, HPDv2 five-rater majority) (Shen et al., 8 Sep 2025).

SRPO achieves a marked increase in human-rated realism and style capture relative to baselines, with ~3.7× improvements reported for certain subjective criteria and over 75× speedup compared to DanceGRPO.

7. Design Insights and Comparative Analysis

SRPO's innovation lies in combining three technical pillars:

  1. Semantic Cross-Prompt Weighting: Cross-batch, embedding-based weighting surfaces nuanced preference signals and supports style/fidelity trade-offs via the temperature $\tau$.
  2. Full-Trajectory Reward Propagation: Applying discounted SRPO rewards across all diffusion timesteps ensures correction of both early (coarse structure) and late (fine detail) generations, mitigating overfitting to reward model pathologies.
  3. Relative, Prompt-Conditioned Reward Design: Leveraging control tokens and semantic-differential rewards enables rapid, online-alterable preference axes—reducing sensitivity to absolute reward model bias and eliminating the need for reward model retraining.

Empirical ablations confirm that omitting semantic weights, early trajectory rewards, or inverse regularization sharply degrades both alignment and generalization performance. SRPO is adaptable to standard text-to-image U-Nets (e.g., Stable Diffusion, FLUX.1.dev) and requires only the addition of a differentiable reward and CLIP embedding infrastructure.

References

  • "Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization" (Gu et al., 2024)
  • "Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference" (Shen et al., 8 Sep 2025)
