Q-Hawkeye: RL for Visual IQA

Updated 6 February 2026

Q-Hawkeye is a reinforcement learning-based framework that formulates no-reference image quality assessment as a Markov Decision Process.
It employs Uncertainty-Aware Dynamic Optimization (UADO) to scale policy updates based on sample reliability and reduce unstable predictions.
The approach integrates Perception-Aware Optimization (PAO) by contrasting pristine and degraded images to enforce visual grounding.

Q-Hawkeye denotes a reinforcement learning-based framework for reliable visual policy optimization in no-reference image quality assessment (NR-IQA). The framework explicitly addresses sample-level reliability and perceptual grounding limitations that afflict prior multimodal LLM (MLLM) approaches to IQA. Q-Hawkeye introduces two architectural and algorithmic advances: Uncertainty-Aware Dynamic Optimization (UADO) and Perception-Aware Optimization (PAO), formulating policy learning as a Markov Decision Process and optimizing with respect to both predictive stability and perceptual sensitivity (Xie et al., 30 Jan 2026).

1. NR-IQA as a Vision-Language Markov Decision Process

Q-Hawkeye formulates NR-IQA as an episodic MDP, where the policy $\pi_\theta$ is a vision-language transformer generating a concise reasoning-and-answer sequence $o$ string in response to an image $I$ and a fixed prompt $q$ . Each MDP state consists of $(I,q)$ and the partially generated output tokens $o_{<t}$ . Actions correspond to generating the next token $o_t$ from vocabulary $V$ , with deterministic token transitions until a special terminal token. Reward feedback is assigned only at episode completion, with:

$R = R_{\mathrm{acc}} + R_{\mathrm{fmt}},$

where $R_{\mathrm{acc}} = \exp(-|\hat{y} - y|/\alpha)$ (with $y$ being the ground-truth mean opinion score (MOS) and $\hat{y}$ the scalar parsed IQA score from the predicted answer), and $R_{\mathrm{fmt}} = 1$ if the output matches required formatting, else $0$. The only stochasticity arises from policy sampling.

2. Uncertainty-Aware Dynamic Optimization (UADO)

Q-Hawkeye samples $K=8$ rollouts for each $(I,q)$ pair under the current policy $\pi_{\theta_\mathrm{old}}$ , parsing predicted scores $\hat{y}_k$ and computing the sample mean and variance:

$\mu = \frac{1}{K} \sum_{k=1}^K \hat{y}_k, \quad u = \frac{1}{K} \sum_{k=1}^K (\hat{y}_k - \mu)^2$

This variance $u$ operationalizes predictive uncertainty. To normalize, the maximal attainable variance for MOS in $[1,5]$ is $4$, so $u$ is clipped and rescaled:

$\bar{u} = \min\left(\frac{u}{4+\epsilon_u}, 1 \right),\quad \epsilon_u=10^{-5}$

A downweighting factor $w(u) = \exp(-T \cdot \bar{u})$ with temperature $T=0.2$ is applied to rescale the policy gradient update strength for the entire sample, yielding for advantage $A_k$ :

$\bar{A}_k = w(u) \cdot A_k, \quad A_k = \frac{r_k - \mu_r}{\sigma_r}$

where $A_k$ uses within-sample group relative normalization. All subsequent policy gradient and GRPO surrogate loss terms are modified to use these weighted advantages, thereby reducing the impact of high-uncertainty (unstable) samples on learning.

3. Perception-Aware Optimization via Implicit Perception Loss

To enforce true visual grounding, for each pristine input image $I$ , Q-Hawkeye synthesizes a randomly degraded image counterpart $I_\mathrm{deg}$ using noise, blur, JPEG compression, or darkening operators.

The implicit perception loss is defined as

$D_{\mathrm{KL}} [\pi_\theta(\cdot|I,q)\,||\,\pi_\theta(\cdot|I_\mathrm{deg},q)]$

Maximizing this KL divergence encourages the policy’s output distribution to diverge when degradation alters visual quality, discouraging text-only or language-prior-driven judgments. To avoid trivial high-entropy solutions, a double-entropy regularization term penalizes excessive randomness in both pristine and degraded conditions.

The full PAO objective term becomes: $L_{\mathrm{PAO}}(\theta) = -\lambda_{\mathrm{per}} D_{\mathrm{KL}} + \gamma_1 H(\pi_\theta(\cdot|I,q)) + \gamma_2 H(\pi_\theta(\cdot|I_\mathrm{deg},q))$ with $\lambda_{\mathrm{per}}=5\times10^{-4}$ , $\gamma_1=\gamma_2=10^{-4}$ .

4. End-to-End Optimization Algorithm and Implementation Details

The training routine iteratively processes each mini-batch through uncertainty-aware sampling (UADO), degraded-input resampling (PAO), and updates the model via AdamW. All optimizations are conducted on Qwen2.5-VL-7B backbones. Key hyperparameters include $\alpha=0.3$ for reward tolerance, PPO-style clipping ( $\epsilon=0.2$ ), KL-regularization ( $\beta=10^{-3}$ ), and batch size 32 for 15 epochs.

All data augmentations for degradation are human- and GPT-4o–filtered to ensure perceptual relevance; e.g., Gaussian noise ( $\sigma=0.05$ ), blur ( $\sigma=2.0$ ), JPEG quality=30, darken=0.6.

5. Empirical Evaluation and Comparative Results

Q-Hawkeye is trained only on the KonIQ-10k dataset and tested in-distribution (KonIQ test) and out-of-distribution (SPAQ, LIVE-Wild, FLIVE, KADID-10K, CSIQ, PIPAL, AGIQA-3K) without fine-tuning. Metrics are PLCC and SRCC between predicted and human MOS.

Across these benchmarks, Q-Hawkeye achieves 80.0 PLCC and 76.2 SRCC—outperforming all prior MLLM-based IQA methods trained solely on KonIQ. Even compared to multi-dataset state-of-the-art approaches like DeQA-Score and VisualQuality-R1, the single-dataset Q-Hawkeye remains competitive or superior.

Ablation studies demonstrate that both UADO and PAO contribute significant, complementary improvements. UADO alone yields 77.5/73.8 (PLCC/SRCC), PAO variants alone yield 76.9–78.9/73.4–75.5, but the integrated framework attains 80.0/76.6.

6. Analysis: Reliability, Robustness, and Generalization

Module ablations confirm that UADO stabilizes training, reducing reward variance and raising average reward. Hyperparameter sweeps for temperature $T$ , perception regularization $\lambda_{\mathrm{per}}$ , and entropic weights $\gamma_i$ reveal well-behaved optima near defaults. Increasing rollout count $K$ brings rapidly diminishing returns beyond $K=8$ .

The perception gap (score differential between $I$ and $I_\mathrm{deg}$ ) monotonically increases during training under PAO, providing direct evidence that the policy becomes more visually sensitive rather than language-prior-driven. Crucially, the framework generalizes across highly diverse out-of-distribution datasets despite training on a single set.

7. Contributions and Broader Impact

Q-Hawkeye advances RL-based IQA by reformulating reliability as a function of sample-wise prediction stability and perception grounding. The UADO mechanism adaptively scales gradient updates to filter unreliable samples, while PAO enforces visual evidence dependence via original/vs/degraded contrast. This results in substantial improvements in robustness, cross-dataset generalization, and ultimately, visually grounded quantitative IQA performed by modern MLLM systems (Xie et al., 30 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Q-Hawkeye.

Q-Hawkeye: RL for Visual IQA

1. NR-IQA as a Vision-Language Markov Decision Process

2. Uncertainty-Aware Dynamic Optimization (UADO)

3. Perception-Aware Optimization via Implicit Perception Loss

4. End-to-End Optimization Algorithm and Implementation Details

5. Empirical Evaluation and Comparative Results

6. Analysis: Reliability, Robustness, and Generalization

7. Contributions and Broader Impact

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Q-Hawkeye: RL for Visual IQA

1. NR-IQA as a Vision-Language Markov Decision Process

2. Uncertainty-Aware Dynamic Optimization (UADO)

3. Perception-Aware Optimization via Implicit Perception Loss

4. End-to-End Optimization Algorithm and Implementation Details

5. Empirical Evaluation and Comparative Results

6. Analysis: Reliability, Robustness, and Generalization

7. Contributions and Broader Impact

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research