- The paper introduces a disentangled reward framework that decouples response consistency from inter-sample preference alignment for robust visual quality assessment.
- It employs a two-stage fine-tuning process with multi-prompt exploration and single-prompt stability to improve calibration and overall performance.
- Experiments report state-of-the-art SRCC and PLCC, validating the approach's effectiveness on both IQA and VQA benchmarks.
Fine-Grained Reinforcement Learning for Visual Quality Assessment: PreResQ-R1
Visual Quality Assessment (QA) remains a fundamental challenge in computer vision, particularly in scenarios lacking pristine references, where no-reference (NR) methods must generalize across diverse distortions and domains. Classical approaches often fall short either in robustness (handcrafted models) or in calibration and interpretability (deep learning regressors and rank-only methods). Recent advances in multimodal large language models (MLLMs) have enabled chain-of-thought (CoT) reasoning for quality assessment, but supervised fine-tuning is prone to shallow reasoning and dataset-specific overfitting. Reinforcement learning (RL)-based techniques offer improved alignment with human perceptual judgment but frequently rely on monolithic reward structures, neglecting the interplay between intra-sample coherence (response consistency) and inter-sample preference alignment.
PreResQ-R1 introduces a principled, reasoning-driven optimization framework that explicitly decouples these two axes—response and preference—within a fine-grained RL paradigm. The central premise is that human quality evaluation is both context-aware (consistent within a perceptual context) and comparative (aligned with global preference), requiring a reward structure that supports both absolute and relative calibration.
Methodology: Preference–Response Disentangled Policy Optimization
Reward Disentanglement
PreResQ-R1 employs Preference–Response Disentangled Policy Optimization (PRPO), in which the reward is bifurcated:
- Response-based Ranking Reward (RR): Measures intra-sample coherence by aligning multiple CoT generations of the same input, leveraging a geometric median-centered stabilizer to regularize the score predictions across five perceptual dimensions (Saturation, Granularity, Sharpness, Foreground, Background).
- Preference-based Ranking–Score Reward (PRS): Captures inter-sample relationships using both pairwise ordinal correctness and magnitude-aware score alignment, as well as triplet-based transitivity constraints for global ranking stability.
The total reward is a convex combination of these components, weighted to balance local response fidelity and global perceptual regression.
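The disentangled reward described above can be sketched in simplified form. This is a minimal illustration, not the paper's exact formulation: the geometric-median stabilizer uses a standard Weiszfeld iteration, and the exponential deviation/magnitude terms, the 0.5 pairwise weights, and the mixing weight `lam` are all assumptions made for clarity.

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld iteration for the geometric median of row vectors."""
    y = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

def response_reward(rollout_scores):
    """Intra-sample coherence (hypothetical form): rollout_scores has shape
    (n_rollouts, 5), one row of aspect scores per CoT generation. Rollouts
    that agree with the geometric median earn a higher reward."""
    med = geometric_median(rollout_scores)
    dev = np.linalg.norm(rollout_scores - med, axis=1).mean()
    return float(np.exp(-dev))

def preference_reward(pred_a, pred_b, mos_a, mos_b):
    """Inter-sample alignment (hypothetical form): pairwise ordinal
    correctness plus a magnitude-aware term on the score gap."""
    ordinal = 1.0 if (pred_a - pred_b) * (mos_a - mos_b) > 0 else 0.0
    magnitude = np.exp(-abs((pred_a - pred_b) - (mos_a - mos_b)))
    return 0.5 * ordinal + 0.5 * magnitude

def total_reward(rr, prs, lam=0.5):
    """Convex combination balancing response fidelity (rr) and
    preference alignment (prs)."""
    return lam * rr + (1.0 - lam) * prs
```

In this sketch, perfectly consistent rollouts yield a response reward of 1.0, and a correctly ordered pair whose predicted gap matches the MOS gap yields a preference reward of 1.0.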
Reasoning and Data Flow
For IQA, structured prompts elicit CoT responses over five aspects, each rated with a score in [1,5]. For VQA, a global–temporal representation aggregates all video frames, while randomly sampled local–spatial frames preserve spatial detail, forming input branches for unified reasoning across temporal and spatial axes. Multiple generations per sample encode uncertainty, facilitating robust rank-and-score estimation.
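The dual-branch VQA input described above can be sketched as follows. The 2x2 average pooling for the global-temporal branch and the `n_local` sample count are illustrative assumptions; the paper's actual preprocessing may differ.

```python
import numpy as np

def build_vqa_branches(frames, n_local=4, seed=0):
    """Split a video into the two branches used for unified reasoning.

    frames: array of shape (T, H, W, C).
    Global-temporal branch: every frame, spatially downsampled (here a
    hypothetical 2x2 average pool) so the full temporal extent fits.
    Local-spatial branch: a few randomly sampled full-resolution frames
    that preserve spatial detail.
    """
    t, h, w, c = frames.shape
    h2, w2 = h // 2 * 2, w // 2 * 2  # crop to even size for pooling
    glob = frames[:, :h2, :w2].reshape(t, h2 // 2, 2, w2 // 2, 2, c).mean(axis=(2, 4))
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(t, size=min(n_local, t), replace=False))
    local = frames[idx]
    return glob, local
```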
Exploration-to-Stability Fine-tuning
Training proceeds in two stages:
- Exploration: Multi-prompt perturbations promote diverse reasoning trajectories, with standard deviation penalties enforcing granularity differentiation across aspect scores.
- Stability: Single-prompt refinement further regularizes and calibrates the outputs, suppressing overconfident predictions and consolidating prediction stability.
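The standard-deviation penalty from the exploration stage can be sketched as below. The threshold `min_std` and the hinge form are assumptions for illustration; the idea is that five aspect scores collapsing to one value get penalized, pushing the policy toward differentiated ratings.

```python
import numpy as np

def granularity_penalty(aspect_scores, min_std=0.25):
    """Hypothetical exploration-stage penalty: if the std of the five
    aspect scores falls below min_std (scores are too uniform), return a
    positive penalty proportional to the shortfall; otherwise zero."""
    std = float(np.std(np.asarray(aspect_scores, dtype=float)))
    return max(0.0, min_std - std)
```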
Optimization utilizes Group Relative Policy Optimization (GRPO) with reward-weighted likelihood and KL regularization to prevent catastrophic drift from the pretrained reference policy.
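The GRPO update can be sketched in simplified form: advantages are computed by normalizing rewards within each group of rollouts (no learned value function), and the policy objective combines a clipped likelihood-ratio term with a KL penalty toward the frozen reference. The `beta` and `clip` values are illustrative, not the paper's settings.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one group of
    rollouts sampled for the same input."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_loss(logprobs, old_logprobs, advantages, kl_to_ref, beta=0.04, clip=0.2):
    """Clipped reward-weighted likelihood objective plus a KL penalty
    that prevents drift from the pretrained reference policy."""
    ratio = np.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean() + beta * kl_to_ref
```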
Experimental Results
PreResQ-R1 is fine-tuned on 6K images (KADID-10K) and 28K videos (LSVQ) and evaluated zero-shot across ten IQA and five VQA benchmarks. It achieves state-of-the-art performance:
- IQA: Average SRCC of 0.811 and PLCC of 0.790, surpassing previous methods by margins of 5.3% (SRCC) and 2.15% (PLCC). It also exhibits tighter error distributions and better score calibration than strong prior MLLM-based approaches.
- VQA: Average SRCC of 0.850 and PLCC of 0.863 across all datasets, consistently outperforming baselines when trained with minimal data.
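The two headline metrics above are standard rank and linear correlations between predicted scores and human mean opinion scores (MOS), computable with SciPy:

```python
from scipy.stats import pearsonr, spearmanr

def qa_metrics(pred, mos):
    """SRCC (Spearman rank correlation) and PLCC (Pearson linear
    correlation) between predictions and MOS labels."""
    srcc = spearmanr(pred, mos).correlation
    plcc = pearsonr(pred, mos)[0]
    return srcc, plcc
```

SRCC rewards getting the ordering right regardless of scale, while PLCC also requires the predicted magnitudes to track MOS linearly, which is why the paper treats them as complementary measures of ranking and calibration.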
Ablation studies demonstrate the complementarity of the response and preference rewards; removing either branch degrades both performance and reasoning interpretability. The fine-grained CoT yields more human-aligned, stable reasoning traces that expose the perceptual cues underlying quality judgments.
Implications and Future Directions
PreResQ-R1 advances RL-based QA by formalizing a disentangled reward structure, enabling MLLMs to perform calibrated, interpretable quality assessment across domain shifts with limited supervision. Practically, the framework improves the reliability of automated QA systems for media generation, enhancement, and compression, lowering annotation costs and improving transparency. Theoretically, it offers a scalable approach for aligning perceptual models with human MOS through minimal RL intervention, mitigating reward hacking and overfitting.
Future research will expand PreResQ-R1 toward unified frameworks combining image quality assessment with text-to-image generation, promoting joint optimization of generative and evaluative capabilities, and deeper integration of fine-grained perceptual reasoning within foundation models. Complementary optimization of generation and evaluation—particularly in the AIGC pipeline—could yield more consistent, robust, and human-aligned visual understanding (2511.05393).
Conclusion
PreResQ-R1 sets a new benchmark for fine-grained rank-and-score reinforcement learning in visual quality assessment, establishing that reward disentanglement is essential for stable, human-aligned perceptual reasoning in large multimodal models. This work provides both a methodological and a practical foundation for scalable, interpretable quality assessment systems with broad applicability across AI-driven visual domains.