- The paper introduces a hybrid perceptual reward that combines full-reference and no-reference IQA metrics to drive preference-based optimization.
- It employs hierarchical preference optimization to adaptively weight training pairs based on reward disparities and perceptual diversity.
- Empirical results show significant improvements in perceptual quality, reduced variability, and superior performance on diffusion and flow-based ISR benchmarks.
Direct Perceptual Preference Optimization for Real-World Image Super-Resolution
Overview
DP2O-SR introduces a preference-driven optimization framework for real-world image super-resolution (Real-ISR) leveraging the stochasticity of pre-trained text-to-image (T2I) diffusion and flow-based models. The method aligns generative models with perceptual preferences by constructing a hybrid reward signal from full-reference (FR) and no-reference (NR) image quality assessment (IQA) models, trained on large-scale human preference datasets. This reward signal is used to curate multiple preference pairs from diverse model outputs, enabling hierarchical preference optimization (HPO) that adaptively weights training pairs for efficient and stable learning. The framework is validated on both diffusion (C-SD2) and flow-based (C-FLUX) backbones, demonstrating significant improvements in perceptual quality and generalization to real-world benchmarks.
Motivation and Background
Traditional ISR methods focus on pixel-level accuracy, often resulting in over-smoothed outputs lacking realistic textures. GAN-based approaches improve perceptual quality but suffer from unstable training and artifacts. Recent advances in T2I diffusion models (e.g., Stable Diffusion, FLUX) have shown strong potential for Real-ISR due to their ability to synthesize plausible and diverse details. However, the inherent stochasticity of these models leads to output variability for the same input, which is typically viewed as a limitation. DP2O-SR exploits this stochasticity as a source of supervision, enabling preference-driven optimization to better utilize the generative capacity of T2I models.
Perceptual Reward Construction
The perceptual reward aggregates multiple IQA metrics:
- Full-Reference (FR): LPIPS, TOPIQ-FR, AFINE-FR
- No-Reference (NR): MANIQA, MUSIQ, CLIPIQA+, TOPIQ-NR, AFINE-NR, Q-Align
Each candidate output is scored by these metrics, direction-aligned, and normalized. The final reward for a sample is computed as the average of normalized FR and NR scores, with equal weighting. This hybrid reward balances structural fidelity (FR) and natural appearance (NR), avoiding the oversmoothing of FR-only training and hallucinations from NR-only training.
```python
import numpy as np

def compute_perceptual_reward(fr_scores, nr_scores):
    # Scores are assumed already direction-aligned (higher = better),
    # e.g., LPIPS negated before being passed in.
    fr_norm = (fr_scores - fr_scores.min()) / (fr_scores.max() - fr_scores.min())
    nr_norm = (nr_scores - nr_scores.min()) / (nr_scores.max() - nr_scores.min())
    # Equal weighting balances structural fidelity (FR) and naturalness (NR).
    reward = 0.5 * fr_norm.mean() + 0.5 * nr_norm.mean()
    return reward
```
Preference Pair Curation
For each LR input, M outputs are sampled from the same model using different noise seeds. Outputs are ranked by perceptual reward, and the top-N and bottom-N samples are selected to form $N^2$ preference pairs. This approach provides richer supervision than the conventional best-vs-worst selection, capturing finer perceptual distinctions and better exploiting model diversity.
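The top-N/bottom-N pairing can be sketched as follows; this is a minimal illustration of the curation step, with the function name and index-pair representation chosen here for clarity rather than taken from the paper:

```python
import numpy as np

def curate_preference_pairs(rewards, n):
    """Form N^2 (winner, loser) pairs from the M sampled outputs of one LR input.

    rewards: 1-D array of perceptual rewards, one per sampled output.
    Returns a list of (winner_idx, loser_idx) index pairs.
    """
    order = np.argsort(rewards)  # ascending by reward
    bottom = order[:n]           # n lowest-reward samples (losers)
    top = order[-n:]             # n highest-reward samples (winners)
    return [(int(w), int(l)) for w in top for l in bottom]
```

Every winner is paired with every loser, so a single LR input contributes $N^2$ training pairs instead of the single best-vs-worst pair.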
Trade-offs in M and N/M
- Larger M: Increases perceptual diversity and training stability, but with diminishing returns.
- Selection Ratio N/M: Smaller models benefit from broader coverage (higher N/M), while larger models respond better to stronger contrast (lower N/M).
Empirical results indicate optimal N/M is architecture-dependent: $1/4$ for C-SD2 (0.8B UNet), $1/16$ for C-FLUX (12B DiT).
Hierarchical Preference Optimization (HPO)
HPO adaptively weights training pairs at two levels:
- Intra-group: Emphasizes pairs with larger reward gaps within each group.
- Inter-group: Prioritizes groups with greater perceptual spread (higher standard deviation of rewards).
The final loss is a weighted sum of Diff-DPO losses for each pair:
$\mathcal{L}_{\mathrm{HPO}} = \sum_{(x_0^w,\, x_0^l)} w_{\mathrm{intra}} \cdot w_{\mathrm{inter}} \cdot \ell(x_0^w, x_0^l; \theta)$
This selective weighting improves training efficiency and perceptual alignment.
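A sketch of the two-level weighting is below. The softmax form, the temperature `tau`, and the use of per-group reward-gap standard deviation as the spread measure are assumptions made for illustration; the paper only specifies that larger reward gaps and larger perceptual spread receive more weight:

```python
import numpy as np

def hpo_weights(reward_gaps_per_group, tau=1.0):
    """Illustrative intra- and inter-group pair weighting.

    reward_gaps_per_group: list of 1-D arrays; entry g holds the reward gaps
    r(x_w) - r(x_l) of the preference pairs curated for LR input g.
    Returns one array of combined weights w_intra * w_inter per group.
    """
    # Inter-group: favor groups whose pairs show greater perceptual spread.
    spreads = np.array([gaps.std() for gaps in reward_gaps_per_group])
    w_inter = np.exp(spreads / tau) / np.exp(spreads / tau).sum()

    weights = []
    for gaps, wg in zip(reward_gaps_per_group, w_inter):
        # Intra-group: softmax over reward gaps emphasizes clearer preferences.
        w_intra = np.exp(gaps / tau) / np.exp(gaps / tau).sum()
        weights.append(w_intra * wg)
    return weights
```

The combined weights sum to one across all pairs, so the weighting redistributes gradient mass toward informative pairs rather than rescaling the overall loss.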
Implementation Details
- Backbones: C-SD2 (UNet, 0.8B) and C-FLUX (MMDiT, 12B) with ControlNet paradigm.
- Training: Batch size 1024, learning rate $2\times10^{-5}$, $\beta=5000$, 1000 iterations, 8×A800 GPUs.
- Sampling: Up to M=64 outputs per LR image, 25/50 inference steps, CFG scales 2.5/3.5.
- Offline Candidate Generation: 1.92M images labeled by IQA models, requiring significant computational resources (168h for C-SD2, 432h for C-FLUX).
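The offline candidate generation exploits seed-controlled stochasticity: the same LR input is passed through the sampler M times, varying only the noise seed. A minimal sketch, where `sample_fn` is a hypothetical wrapper around the diffusion/flow sampler (not an API from the paper):

```python
import numpy as np

def generate_candidates(sample_fn, lr_image, m=64, seed0=0):
    """Sample m diverse outputs for one LR image by varying only the seed.

    sample_fn: hypothetical callable (lr_image, rng) -> restored image.
    """
    outputs = []
    for i in range(m):
        rng = np.random.default_rng(seed0 + i)  # distinct seed per candidate
        outputs.append(sample_fn(lr_image, rng))
    return outputs
```

Each candidate set is then scored by the IQA ensemble, which is what makes the pipeline expensive (1.92M labeled images in total).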
Experimental Results
Quantitative
- Perceptual Metrics: DP2O-SR consistently improves trained FR/NR metrics (e.g., LPIPS, TOPIQ-FR, MANIQA, CLIPIQA+, Q-Align) and generalizes to untrained metrics (VQ-R1, NIMA, TOPIQ-IAA).
- Generalization: On RealSR, DP2O-SR (SD2) achieves highest MANIQA (0.7031) and CLIPIQA+ (0.7852), outperforming SOTA methods.
- Stability: DP2O-SR raises the quality floor (Worst@M), reduces output variability, and achieves lower variance in perceptual scores.
Qualitative
- Artifact Suppression: DP2O-SR removes semantic and structural artifacts present in baseline outputs.
- Localized Refinement: Despite the reward being computed at the image level, DP2O-SR enhances local details (e.g., wing venation, text restoration) without explicit local supervision.
Ablation
- HPO: Both intra-group and inter-group weighting contribute to performance; their combination yields the highest perceptual reward.
User Study
- Preference Alignment: DP2O-SR is consistently preferred over baselines and SOTA methods in pairwise and multi-way comparisons.
Practical and Theoretical Implications
DP2O-SR demonstrates that stochasticity in generative models can be harnessed for preference-driven optimization, improving perceptual quality and robustness in Real-ISR. The hybrid reward design and hierarchical weighting provide a scalable framework for aligning model outputs with human-like preferences without costly manual annotation. The method generalizes across architectures and evaluation metrics, indicating strong transferability.
Limitations
- Reward Interpretability: IQA-based reward correlates with human preference but lacks interpretability and may not fully capture subjective quality.
- Offline Pipeline: Current training is fully offline; online or iterative preference optimization could further enhance adaptability.
Future Directions
- Reward Model Development: More accurate and explainable perceptual reward models are needed.
- Online Preference Optimization: Integrating online or iterative preference feedback could improve performance and adaptability.
- Extension to Other Generative Tasks: The framework is applicable to other domains (e.g., video SR, restoration, synthesis) where perceptual diversity is present.
Conclusion
DP2O-SR provides a robust, scalable, and effective framework for aligning generative Real-ISR models with perceptual preferences, leveraging model stochasticity and hybrid IQA rewards. The approach yields substantial improvements in perceptual quality, generalization, and stability, and sets a new standard for preference-driven optimization in image restoration tasks.