- The paper introduces a hybrid perceptual reward that combines full-reference and no-reference IQA metrics to drive preference-based optimization.
- It employs hierarchical preference optimization to adaptively weight training pairs based on reward disparities and perceptual diversity.
- Empirical results show significant improvements in perceptual quality, reduced variability, and superior performance on diffusion and flow-based ISR benchmarks.
Direct Perceptual Preference Optimization for Real-World Image Super-Resolution
Overview
DP2O-SR introduces a preference-driven optimization framework for real-world image super-resolution (Real-ISR) leveraging the stochasticity of pre-trained text-to-image (T2I) diffusion and flow-based models. The method aligns generative models with perceptual preferences by constructing a hybrid reward signal from full-reference (FR) and no-reference (NR) image quality assessment (IQA) models, trained on large-scale human preference datasets. This reward signal is used to curate multiple preference pairs from diverse model outputs, enabling hierarchical preference optimization (HPO) that adaptively weights training pairs for efficient and stable learning. The framework is validated on both diffusion (C-SD2) and flow-based (C-FLUX) backbones, demonstrating significant improvements in perceptual quality and generalization to real-world benchmarks.
Motivation and Background
Traditional ISR methods focus on pixel-level accuracy, often resulting in over-smoothed outputs lacking realistic textures. GAN-based approaches improve perceptual quality but suffer from unstable training and artifacts. Recent advances in T2I diffusion models (e.g., Stable Diffusion, FLUX) have shown strong potential for Real-ISR due to their ability to synthesize plausible and diverse details. However, the inherent stochasticity of these models leads to output variability for the same input, which is typically viewed as a limitation. DP2O-SR exploits this stochasticity as a source of supervision, enabling preference-driven optimization to better utilize the generative capacity of T2I models.
Perceptual Reward Construction
The perceptual reward aggregates multiple IQA metrics:
- Full-Reference (FR): LPIPS, TOPIQ-FR, AFINE-FR
- No-Reference (NR): MANIQA, MUSIQ, CLIPIQA+, TOPIQ-NR, AFINE-NR, Q-Align
Each candidate output is scored by these metrics, direction-aligned, and normalized. The final reward for a sample is computed as the average of normalized FR and NR scores, with equal weighting. This hybrid reward balances structural fidelity (FR) and natural appearance (NR), avoiding the oversmoothing of FR-only training and hallucinations from NR-only training.
```python
import numpy as np

def compute_perceptual_reward(fr_scores, nr_scores):
    # Scores are assumed already direction-aligned (higher = better),
    # e.g., LPIPS negated before being passed in.
    fr_norm = (fr_scores - fr_scores.min()) / (fr_scores.max() - fr_scores.min())
    nr_norm = (nr_scores - nr_scores.min()) / (nr_scores.max() - nr_scores.min())
    # Equal weighting balances structural fidelity (FR) and naturalness (NR).
    reward = 0.5 * fr_norm.mean() + 0.5 * nr_norm.mean()
    return reward
```
Preference Pair Curation
For each LR input, M outputs are sampled from the same model using different noise seeds. Outputs are ranked by perceptual reward, and the top-N and bottom-N samples are selected to form $N^2$ preference pairs. This approach provides richer supervision than the conventional best-vs-worst selection, capturing finer perceptual distinctions and better exploiting model diversity.
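The top-N/bottom-N pairing can be sketched as follows; this is a minimal illustration of the curation step, with the function name and index-pair representation chosen here for clarity rather than taken from the paper:

```python
import numpy as np

def curate_preference_pairs(rewards, n):
    """Form N^2 (winner, loser) pairs from the M sampled outputs of one LR input.

    rewards: 1-D array of perceptual rewards, one per sampled output.
    Returns a list of (winner_idx, loser_idx) index pairs.
    """
    order = np.argsort(rewards)  # ascending by reward
    bottom = order[:n]           # n lowest-reward samples (losers)
    top = order[-n:]             # n highest-reward samples (winners)
    return [(int(w), int(l)) for w in top for l in bottom]
```

Every winner is paired with every loser, so a single LR input contributes $N^2$ training pairs instead of the single best-vs-worst pair.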
Trade-offs in M and N/M
- Larger M: Increases perceptual diversity and training stability, but with diminishing returns.
- Selection Ratio N/M: Smaller models benefit from broader coverage (higher N/M), while larger models respond better to stronger contrast (lower N/M).
Empirical results indicate optimal N/M is architecture-dependent: $1/4$ for C-SD2 (0.8B UNet), $1/16$ for C-FLUX (12B DiT).
Hierarchical Preference Optimization (HPO)
HPO adaptively weights training pairs at two levels:
- Intra-group: Emphasizes pairs with larger reward gaps within each group.
- Inter-group: Prioritizes groups with greater perceptual spread (higher standard deviation of rewards).
The final loss is a weighted sum of Diff-DPO losses for each pair:
$\mathcal{L}_{\mathrm{HPO}} = \sum_{(x_0^w,\, x_0^l)} w_{\mathrm{intra}} \cdot w_{\mathrm{inter}} \cdot \ell(x_0^w, x_0^l; \theta)$
This selective weighting improves training efficiency and perceptual alignment.
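A sketch of the two-level weighting is below. The softmax form, the temperature `tau`, and the use of per-group reward-gap standard deviation as the spread measure are assumptions made for illustration; the paper only specifies that larger reward gaps and larger perceptual spread receive more weight:

```python
import numpy as np

def hpo_weights(reward_gaps_per_group, tau=1.0):
    """Illustrative intra- and inter-group pair weighting.

    reward_gaps_per_group: list of 1-D arrays; entry g holds the reward gaps
    r(x_w) - r(x_l) of the preference pairs curated for LR input g.
    Returns one array of combined weights w_intra * w_inter per group.
    """
    # Inter-group: favor groups whose pairs show greater perceptual spread.
    spreads = np.array([gaps.std() for gaps in reward_gaps_per_group])
    w_inter = np.exp(spreads / tau) / np.exp(spreads / tau).sum()

    weights = []
    for gaps, wg in zip(reward_gaps_per_group, w_inter):
        # Intra-group: softmax over reward gaps emphasizes clearer preferences.
        w_intra = np.exp(gaps / tau) / np.exp(gaps / tau).sum()
        weights.append(w_intra * wg)
    return weights
```

The combined weights sum to one across all pairs, so the weighting redistributes gradient mass toward informative pairs rather than rescaling the overall loss.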
Implementation Details
- Backbones: C-SD2 (UNet, 0.8B) and C-FLUX (MMDiT, 12B) with ControlNet paradigm.
- Training: Batch size 1024, learning rate $2\times10^{-5}$, $\beta=5000$, 1000 iterations, 8×A800 GPUs.
- Sampling: Up to M=64 outputs per LR image, 25/50 inference steps, CFG scales 2.5/3.5.
- Offline Candidate Generation: 1.92M images labeled by IQA models, requiring significant computational resources (168h for C-SD2, 432h for C-FLUX).
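The offline candidate generation exploits seed-controlled stochasticity: the same LR input is passed through the sampler M times, varying only the noise seed. A minimal sketch, where `sample_fn` is a hypothetical wrapper around the diffusion/flow sampler (not an API from the paper):

```python
import numpy as np

def generate_candidates(sample_fn, lr_image, m=64, seed0=0):
    """Sample m diverse outputs for one LR image by varying only the seed.

    sample_fn: hypothetical callable (lr_image, rng) -> restored image.
    """
    outputs = []
    for i in range(m):
        rng = np.random.default_rng(seed0 + i)  # distinct seed per candidate
        outputs.append(sample_fn(lr_image, rng))
    return outputs
```

Each candidate set is then scored by the IQA ensemble, which is what makes the pipeline expensive (1.92M labeled images in total).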
Experimental Results
Quantitative
- Perceptual Metrics: DP2O-SR consistently improves trained FR/NR metrics (e.g., LPIPS, TOPIQ-FR, MANIQA, CLIPIQA+, Q-Align) and generalizes to untrained metrics (VQ-R1, NIMA, TOPIQ-IAA).
- Generalization: On RealSR, DP2O-SR (SD2) achieves highest MANIQA (0.7031) and CLIPIQA+ (0.7852), outperforming SOTA methods.
- Stability: DP2O-SR raises the quality floor (Worst@M), reduces output variability, and achieves lower variance in perceptual scores.
Qualitative
- Artifact Suppression: DP2O-SR removes semantic and structural artifacts present in baseline outputs.
- Localized Refinement: Despite the reward being computed at the image level, DP2O-SR enhances local details (e.g., wing venation, text restoration) without explicit local supervision.
Ablation
- HPO: Both intra-group and inter-group weighting contribute to performance; their combination yields the highest perceptual reward.
User Study
- Preference Alignment: DP2O-SR is consistently preferred over baselines and SOTA methods in pairwise and multi-way comparisons.
Practical and Theoretical Implications
DP2O-SR demonstrates that stochasticity in generative models can be harnessed for preference-driven optimization, improving perceptual quality and robustness in Real-ISR. The hybrid reward design and hierarchical weighting provide a scalable framework for aligning model outputs with human-like preferences without costly manual annotation. The method generalizes across architectures and evaluation metrics, indicating strong transferability.
Limitations
- Reward Interpretability: IQA-based reward correlates with human preference but lacks interpretability and may not fully capture subjective quality.
- Offline Pipeline: Current training is fully offline; online or iterative preference optimization could further enhance adaptability.
Future Directions
- Reward Model Development: More accurate and explainable perceptual reward models are needed.
- Online Preference Optimization: Integrating online or iterative preference feedback could improve performance and adaptability.
- Extension to Other Generative Tasks: The framework is applicable to other domains (e.g., video SR, restoration, synthesis) where perceptual diversity is present.
Conclusion
DP2O-SR provides a robust, scalable, and effective framework for aligning generative Real-ISR models with perceptual preferences, leveraging model stochasticity and hybrid IQA rewards. The approach yields substantial improvements in perceptual quality, generalization, and stability, and sets a new standard for preference-driven optimization in image restoration tasks.