Q-Hawkeye: RL for Visual IQA
- Q-Hawkeye is a reinforcement learning-based framework that formulates no-reference image quality assessment as a Markov Decision Process.
- It employs Uncertainty-Aware Dynamic Optimization (UADO) to scale policy updates based on sample reliability and reduce unstable predictions.
- The approach integrates Perception-Aware Optimization (PAO) by contrasting pristine and degraded images to enforce visual grounding.
Q-Hawkeye denotes a reinforcement learning-based framework for reliable visual policy optimization in no-reference image quality assessment (NR-IQA). The framework explicitly addresses sample-level reliability and perceptual grounding limitations that afflict prior multimodal LLM (MLLM) approaches to IQA. Q-Hawkeye introduces two architectural and algorithmic advances: Uncertainty-Aware Dynamic Optimization (UADO) and Perception-Aware Optimization (PAO), formulating policy learning as a Markov Decision Process and optimizing with respect to both predictive stability and perceptual sensitivity (Xie et al., 30 Jan 2026).
1. NR-IQA as a Vision-Language Markov Decision Process
Q-Hawkeye formulates NR-IQA as an episodic MDP, where the policy is a vision-language transformer $\pi_\theta$ that generates a concise reasoning-and-answer token sequence $y$ in response to an image $I$ and a fixed prompt $p$. Each MDP state $s_t$ consists of $(I, p)$ and the partially generated output tokens $y_{<t}$. Actions correspond to emitting the next token from the vocabulary $\mathcal{V}$, with deterministic token transitions until a special terminal token. Reward feedback is assigned only at episode completion:

$$R(y) = R_{\text{acc}} + R_{\text{fmt}},$$

where $R_{\text{acc}} = \mathbb{1}\!\left[\,|\hat{s} - s| \le \epsilon\,\right]$ (with $s$ being the ground-truth mean opinion score (MOS), $\hat{s}$ the scalar IQA score parsed from the predicted answer, and $\epsilon$ the reward tolerance), and $R_{\text{fmt}} = 1$ if the output matches the required formatting, else $0$. The only stochasticity arises from policy sampling.
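A minimal sketch of this sparse terminal reward, assuming an indicator-with-tolerance accuracy term and a unit format bonus (the helper `parse_score`, the `eps` default, and the exact 0/1 split are illustrative assumptions, not the paper's verbatim definition):

```python
import re

def parse_score(answer):
    """Extract the first numeric IQA score from the model's answer text."""
    m = re.search(r"[-+]?\d+(?:\.\d+)?", answer)
    return float(m.group()) if m else None

def terminal_reward(answer, mos, eps=0.35):
    """Sparse terminal reward: accuracy term plus format term.

    The reward is assigned only once the full answer is generated.
    `eps` is a hypothetical reward-tolerance value.
    """
    s_hat = parse_score(answer)
    r_fmt = 1.0 if s_hat is not None else 0.0                      # format reward
    r_acc = 1.0 if (s_hat is not None and abs(s_hat - mos) <= eps) else 0.0
    return r_acc + r_fmt
```

Because the reward is terminal and binary in both components, all learning signal inside an episode comes from credit assignment over the sampled token sequence.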
2. Uncertainty-Aware Dynamic Optimization (UADO)
Q-Hawkeye samples $N$ rollouts for each $(I, p)$ pair under the current policy $\pi_\theta$, parsing predicted scores $\{\hat{s}_i\}_{i=1}^{N}$ and computing the sample mean and variance:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} \hat{s}_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (\hat{s}_i - \mu)^2.$$

This variance operationalizes predictive uncertainty. To normalize, note that the maximal attainable variance for MOS values in $[1, 5]$ is $4$, so $\sigma^2$ is clipped and rescaled:

$$u = \frac{\min(\sigma^2, 4)}{4} \in [0, 1].$$

A downweighting factor $w = \exp(-u/\tau)$ with temperature $\tau$ rescales the policy-gradient update strength for the entire sample, yielding the weighted advantage $\tilde{A}_i = w \cdot A_i$ for each rollout, where $A_i$ uses within-sample group-relative normalization (subtracting the group mean reward and dividing by the group standard deviation, as in GRPO). All subsequent policy-gradient and GRPO surrogate-loss terms use these weighted advantages, thereby reducing the impact of high-uncertainty (unstable) samples on learning.
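The uncertainty-aware downweighting can be sketched as follows; the `exp(-u/tau)` weight form and group-relative normalization follow the description above, while the default `tau` and the biased variance estimator are illustrative assumptions:

```python
import math

def uado_weighted_advantages(scores, rewards, tau=0.5, v_max=4.0):
    """Down-weight group-relative advantages by rollout-score variance.

    scores:  parsed IQA scores from N rollouts of one (image, prompt) pair.
    rewards: terminal rewards of the same N rollouts.
    v_max:   maximal attainable variance for MOS in [1, 5].
    """
    n = len(scores)
    mu = sum(scores) / n
    var = sum((s - mu) ** 2 for s in scores) / n
    u = min(var, v_max) / v_max                  # clip & rescale to [0, 1]
    w = math.exp(-u / tau)                       # sample-level downweight

    r_mean = sum(rewards) / n
    r_std = (sum((r - r_mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    advantages = [(r - r_mean) / r_std for r in rewards]   # group-relative (GRPO-style)
    return [w * a for a in advantages], w
```

When the N parsed scores agree exactly, `w = 1` and the update proceeds at full strength; when they scatter across the MOS range, the whole sample's gradient contribution is exponentially suppressed.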
3. Perception-Aware Optimization via Implicit Perception Loss
To enforce true visual grounding, for each pristine input image $I$, Q-Hawkeye synthesizes a randomly degraded counterpart $I_{\text{deg}}$ using noise, blur, JPEG compression, or darkening operators.
The implicit perception loss is defined as the divergence between the policy's output distributions under the pristine and degraded inputs:

$$\mathcal{L}_{\text{IP}} = D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid I, p) \,\big\|\, \pi_\theta(\cdot \mid I_{\text{deg}}, p)\right).$$

Maximizing this KL divergence encourages the policy’s output distribution to diverge when degradation alters visual quality, discouraging text-only or language-prior-driven judgments. To avoid trivial high-entropy solutions, a double-entropy regularization term penalizes excessive randomness under both pristine and degraded conditions. The full PAO objective term becomes:

$$\mathcal{J}_{\text{PAO}} = \lambda_{\mathrm{KL}}\,\mathcal{L}_{\text{IP}} - \lambda_{H}\left[\mathcal{H}\!\left(\pi_\theta(\cdot \mid I, p)\right) + \mathcal{H}\!\left(\pi_\theta(\cdot \mid I_{\text{deg}}, p)\right)\right],$$

with weighting coefficients $\lambda_{\mathrm{KL}}$ and $\lambda_{H}$.
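A toy sketch of the perception-aware term on categorical next-token distributions; the weight defaults and the exact combination of KL reward and double-entropy penalty are assumptions based on the description, not the paper's reported values:

```python
import math

def kl_div(p, q):
    """KL(p || q) for two categorical distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def pao_objective(p_clean, p_deg, lam_kl=1.0, lam_h=0.1):
    """Reward divergence between the policy's output distributions on
    pristine vs. degraded inputs, minus a double-entropy penalty that
    rules out trivially random (high-entropy) outputs.
    """
    return lam_kl * kl_div(p_clean, p_deg) - lam_h * (entropy(p_clean) + entropy(p_deg))
```

Note that without the entropy penalty, a policy could inflate the KL term by simply becoming noisier; the double-entropy regularizer makes divergence profitable only when both distributions stay confident.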
4. End-to-End Optimization Algorithm and Implementation Details
The training routine iteratively processes each mini-batch through uncertainty-aware sampling (UADO) and degraded-input resampling (PAO), updating the model via AdamW. All experiments use a Qwen2.5-VL-7B backbone. Key hyperparameters include the reward tolerance $\epsilon$, PPO-style clipping, KL regularization, and a batch size of 32 for 15 epochs.
All degradation augmentations are human- and GPT-4o-filtered to ensure perceptual relevance; the operators include Gaussian noise, blur, JPEG compression at quality 30, and darkening with factor 0.6.
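A minimal sketch of such degradation operators on a grayscale image represented as a 2-D list of floats in $[0, 1]$. The darkening factor 0.6 matches the value above; the noise sigma and blur width are illustrative assumptions, and JPEG compression is omitted since it requires a codec:

```python
import random

def degrade(img, kind, rng=None):
    """Apply one synthetic degradation to a grayscale image.

    img:  2-D list of pixel intensities in [0, 1].
    kind: one of "noise", "darken", "blur".
    """
    rng = rng or random.Random(0)
    if kind == "noise":                       # additive Gaussian noise, clipped to [0, 1]
        return [[min(1.0, max(0.0, p + rng.gauss(0, 0.05))) for p in row] for row in img]
    if kind == "darken":                      # scale luminance by 0.6
        return [[p * 0.6 for p in row] for row in img]
    if kind == "blur":                        # horizontal box blur, width 3
        out = []
        for row in img:
            n = len(row)
            blurred = []
            for j in range(n):
                window = row[max(0, j - 1):min(n, j + 2)]
                blurred.append(sum(window) / len(window))
            out.append(blurred)
        return out
    raise ValueError(f"unknown degradation: {kind}")
```

In practice such operators would run on tensors via an image library; the point here is only that each operator is a deterministic or seeded map from $I$ to $I_{\text{deg}}$ used for the PAO contrast.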
5. Empirical Evaluation and Comparative Results
Q-Hawkeye is trained only on the KonIQ-10k dataset and tested in-distribution (KonIQ test) and out-of-distribution (SPAQ, LIVE-Wild, FLIVE, KADID-10K, CSIQ, PIPAL, AGIQA-3K) without fine-tuning. Metrics are PLCC and SRCC between predicted and human MOS.
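For reference, the two evaluation metrics can be computed as follows (a self-contained sketch; production code would typically call `scipy.stats.pearsonr`/`spearmanr`, and this simple rank assignment ignores tie handling):

```python
def plcc(x, y):
    """Pearson linear correlation coefficient between predictions and MOS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def srcc(x, y):
    """Spearman rank correlation: PLCC computed on the ranks (ties ignored)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return plcc(ranks(x), ranks(y))
```

PLCC measures linear agreement with human scores, while SRCC measures monotonic agreement, so a model can score perfectly on SRCC while remaining miscalibrated in absolute value.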
Across these benchmarks, Q-Hawkeye achieves 80.0 PLCC and 76.2 SRCC—outperforming all prior MLLM-based IQA methods trained solely on KonIQ. Even compared to multi-dataset state-of-the-art approaches like DeQA-Score and VisualQuality-R1, the single-dataset Q-Hawkeye remains competitive or superior.
Ablation studies demonstrate that both UADO and PAO contribute significant, complementary improvements: UADO alone yields 77.5/73.8 (PLCC/SRCC), PAO variants alone yield 76.9–78.9 PLCC and 73.4–75.5 SRCC, and the integrated framework attains 80.0/76.6.
6. Analysis: Reliability, Robustness, and Generalization
Module ablations confirm that UADO stabilizes training, reducing reward variance and raising average reward. Hyperparameter sweeps for the temperature $\tau$, the perception-loss weight, and the entropy weights reveal well-behaved optima near the defaults. Increasing the rollout count $N$ brings rapidly diminishing returns beyond moderate values.
The perception gap (the score differential between $I$ and $I_{\text{deg}}$) monotonically increases during training under PAO, providing direct evidence that the policy becomes more visually sensitive rather than language-prior-driven. Crucially, the framework generalizes across highly diverse out-of-distribution datasets despite training on a single set.
7. Contributions and Broader Impact
Q-Hawkeye advances RL-based IQA by reformulating reliability as a function of sample-wise prediction stability and perceptual grounding. The UADO mechanism adaptively scales gradient updates to filter unreliable samples, while PAO enforces dependence on visual evidence via a pristine-versus-degraded contrast. This yields substantial improvements in robustness, cross-dataset generalization, and ultimately visually grounded quantitative IQA performed by modern MLLM systems (Xie et al., 30 Jan 2026).