- The paper introduces a disentangled reward framework that decouples response consistency from inter-sample preference alignment for robust visual quality assessment.
- It employs a two-stage fine-tuning process with multi-prompt exploration and single-prompt stability to improve calibration and overall performance.
- Experiments report state-of-the-art SRCC and PLCC, validating the approach's effectiveness on both IQA and VQA benchmarks.
Fine-Grained Reinforcement Learning for Visual Quality Assessment: PreResQ-R1
Visual Quality Assessment (QA) remains a fundamental challenge in computer vision, particularly in scenarios lacking pristine references, where no-reference (NR) methods must generalize across diverse distortions and domains. Classical approaches often fall short either in robustness (handcrafted models) or in calibration and interpretability (deep learning regressors and rank-only methods). Recent advances in multimodal large language models (MLLMs) have enabled chain-of-thought (CoT) reasoning for quality assessment, but supervised fine-tuning is prone to shallow reasoning and dataset-specific overfitting. Reinforcement learning (RL)-based techniques offer improved alignment with human perceptual judgment but frequently rely on monolithic reward structures, neglecting the interplay between intra-sample coherence (response consistency) and inter-sample preference alignment.
PreResQ-R1 introduces a principled, reasoning-driven optimization framework that explicitly decouples these two axes—response and preference—within a fine-grained RL paradigm. The central premise is that human quality evaluation is both context-aware (consistent within a perceptual context) and comparative (aligned with global preference), requiring a reward structure that supports both absolute and relative calibration.
Methodology: Preference–Response Disentangled Policy Optimization
Reward Disentanglement
PreResQ-R1 employs Preference–Response Disentangled Policy Optimization (PRPO), in which the reward is bifurcated:
- Response-based Ranking Reward (RR): Measures intra-sample coherence by aligning multiple CoT generations of the same input, leveraging a geometric median-centered stabilizer to regularize the score predictions across five perceptual dimensions (Saturation, Granularity, Sharpness, Foreground, Background).
- Preference-based Ranking–Score Reward (PRS): Captures inter-sample relationships using both pairwise ordinal correctness and magnitude-aware score alignment, as well as triplet-based transitivity constraints for global ranking stability.
The total reward is a convex combination of these components, weighted to balance local response fidelity and global perceptual regression.
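The disentangled reward described above can be sketched in simplified form. This is a minimal illustration, not the paper's exact formulation: the geometric-median stabilizer uses a standard Weiszfeld iteration, and the exponential deviation/magnitude terms, the 0.5 pairwise weights, and the mixing weight `lam` are all assumptions made for clarity.

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld iteration for the geometric median of row vectors."""
    y = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

def response_reward(rollout_scores):
    """Intra-sample coherence (hypothetical form): rollout_scores has shape
    (n_rollouts, 5), one row of aspect scores per CoT generation. Rollouts
    that agree with the geometric median earn a higher reward."""
    med = geometric_median(rollout_scores)
    dev = np.linalg.norm(rollout_scores - med, axis=1).mean()
    return float(np.exp(-dev))

def preference_reward(pred_a, pred_b, mos_a, mos_b):
    """Inter-sample alignment (hypothetical form): pairwise ordinal
    correctness plus a magnitude-aware term on the score gap."""
    ordinal = 1.0 if (pred_a - pred_b) * (mos_a - mos_b) > 0 else 0.0
    magnitude = np.exp(-abs((pred_a - pred_b) - (mos_a - mos_b)))
    return 0.5 * ordinal + 0.5 * magnitude

def total_reward(rr, prs, lam=0.5):
    """Convex combination balancing response fidelity (rr) and
    preference alignment (prs)."""
    return lam * rr + (1.0 - lam) * prs
```

In this sketch, perfectly consistent rollouts yield a response reward of 1.0, and a correctly ordered pair whose predicted gap matches the MOS gap yields a preference reward of 1.0.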
Reasoning and Data Flow
For IQA, structured prompts elicit CoT responses over five aspects, each rated with a score in [1,5]. For VQA, a global–temporal representation aggregates all video frames, while randomly sampled local–spatial frames preserve spatial detail, forming input branches for unified reasoning across temporal and spatial axes. Multiple generations per sample encode uncertainty, facilitating robust rank-and-score estimation.
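The dual-branch VQA input described above can be sketched as follows. The 2x2 average pooling for the global-temporal branch and the `n_local` sample count are illustrative assumptions; the paper's actual preprocessing may differ.

```python
import numpy as np

def build_vqa_branches(frames, n_local=4, seed=0):
    """Split a video into the two branches used for unified reasoning.

    frames: array of shape (T, H, W, C).
    Global-temporal branch: every frame, spatially downsampled (here a
    hypothetical 2x2 average pool) so the full temporal extent fits.
    Local-spatial branch: a few randomly sampled full-resolution frames
    that preserve spatial detail.
    """
    t, h, w, c = frames.shape
    h2, w2 = h // 2 * 2, w // 2 * 2  # crop to even size for pooling
    glob = frames[:, :h2, :w2].reshape(t, h2 // 2, 2, w2 // 2, 2, c).mean(axis=(2, 4))
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(t, size=min(n_local, t), replace=False))
    local = frames[idx]
    return glob, local
```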
Exploration-to-Stability Fine-tuning
Training proceeds in two stages:
- Exploration: Multi-prompt perturbations promote diverse reasoning trajectories, with standard deviation penalties enforcing granularity differentiation across aspect scores.
- Stability: Single-prompt refinement further regularizes and calibrates the outputs, suppressing overconfident predictions and consolidating prediction stability.
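The standard-deviation penalty from the exploration stage can be sketched as below. The threshold `min_std` and the hinge form are assumptions for illustration; the idea is that five aspect scores collapsing to one value get penalized, pushing the policy toward differentiated ratings.

```python
import numpy as np

def granularity_penalty(aspect_scores, min_std=0.25):
    """Hypothetical exploration-stage penalty: if the std of the five
    aspect scores falls below min_std (scores are too uniform), return a
    positive penalty proportional to the shortfall; otherwise zero."""
    std = float(np.std(np.asarray(aspect_scores, dtype=float)))
    return max(0.0, min_std - std)
```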
Optimization utilizes Group Relative Policy Optimization (GRPO) with reward-weighted likelihood and KL regularization to prevent catastrophic drift from the pretrained reference policy.
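The GRPO update can be sketched in simplified form: advantages are computed by normalizing rewards within each group of rollouts (no learned value function), and the policy objective combines a clipped likelihood-ratio term with a KL penalty toward the frozen reference. The `beta` and `clip` values are illustrative, not the paper's settings.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one group of
    rollouts sampled for the same input."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(logprobs, old_logprobs, advantages, kl_to_ref, beta=0.04, clip=0.2):
    """Clipped reward-weighted likelihood objective plus a KL penalty
    that prevents drift from the pretrained reference policy."""
    ratio = np.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean() + beta * kl_to_ref
```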
Experimental Results
PreResQ-R1 is fine-tuned on 6K images (KADID-10K) and 28K videos (LSVQ) and evaluated zero-shot across ten IQA and five VQA benchmarks. It achieves state-of-the-art performance:
- IQA: Average SRCC of 0.811 and PLCC of 0.790, surpassing previous methods by margins of 5.3% (SRCC) and 2.15% (PLCC). It also exhibits tighter error distributions and better score calibration than strong prior MLLM-based approaches.
- VQA: Average SRCC of 0.850 and PLCC of 0.863 across all datasets, consistently outperforming baselines when trained with minimal data.
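The two headline metrics above are standard rank and linear correlations between predicted scores and human mean opinion scores (MOS), computable with SciPy:

```python
from scipy.stats import pearsonr, spearmanr

def qa_metrics(pred, mos):
    """SRCC (Spearman rank correlation) and PLCC (Pearson linear
    correlation) between predictions and MOS labels."""
    srcc = spearmanr(pred, mos).correlation
    plcc = pearsonr(pred, mos)[0]
    return srcc, plcc
```

SRCC rewards getting the ordering right regardless of scale, while PLCC also requires the predicted magnitudes to track MOS linearly, which is why the paper treats them as complementary measures of ranking and calibration.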
Ablation studies demonstrate the complementarity of the response and preference rewards; removing either branch degrades both performance and reasoning interpretability. The fine-grained CoT yields more human-aligned, stable reasoning traces that expose the perceptual cues underlying quality judgments.
Implications and Future Directions
PreResQ-R1 advances RL-based QA by formalizing a disentangled reward structure, enabling MLLMs to perform calibrated, interpretable quality assessment across domain shifts with limited supervision. Practically, the framework improves the reliability of automated QA systems for media generation, enhancement, and compression, lowering annotation costs and improving transparency. Theoretically, it offers a scalable approach for aligning perceptual models with human MOS through minimal RL intervention, mitigating reward hacking and overfitting.
Future research will expand PreResQ-R1 toward unified frameworks combining image quality assessment with text-to-image generation, promoting joint optimization of generative and evaluative capabilities, and deeper integration of fine-grained perceptual reasoning within foundation models. Complementary optimization of generation and evaluation—particularly in the AIGC pipeline—could yield more consistent, robust, and human-aligned visual understanding (2511.05393).
Conclusion
PreResQ-R1 sets a new benchmark for fine-grained rank-and-score reinforcement learning in visual quality assessment, establishing that reward disentanglement is essential for stable, human-aligned perceptual reasoning in large multimodal models. This work provides both a methodological and a practical foundation for scalable, interpretable quality assessment systems with broad applicability across AI-driven visual domains.