
Q-Hawkeye: RL for Visual IQA

Updated 6 February 2026
  • Q-Hawkeye is a reinforcement learning-based framework that formulates no-reference image quality assessment as a Markov Decision Process.
  • It employs Uncertainty-Aware Dynamic Optimization (UADO) to scale policy updates based on sample reliability and reduce unstable predictions.
  • The approach integrates Perception-Aware Optimization (PAO) by contrasting pristine and degraded images to enforce visual grounding.

Q-Hawkeye is a reinforcement learning-based framework for reliable visual policy optimization in no-reference image quality assessment (NR-IQA). The framework explicitly addresses the sample-level reliability and perceptual grounding limitations that afflict prior multimodal LLM (MLLM) approaches to IQA. Q-Hawkeye introduces two algorithmic advances, Uncertainty-Aware Dynamic Optimization (UADO) and Perception-Aware Optimization (PAO), formulating policy learning as a Markov Decision Process and optimizing with respect to both predictive stability and perceptual sensitivity (Xie et al., 30 Jan 2026).

1. NR-IQA as a Vision-Language Markov Decision Process

Q-Hawkeye formulates NR-IQA as an episodic MDP, in which the policy $\pi_\theta$ is a vision-language transformer generating a concise reasoning-and-answer sequence $o$ in response to an image $I$ and a fixed prompt $q$. Each MDP state consists of $(I, q)$ and the partially generated output tokens $o_{<t}$. Actions correspond to generating the next token $o_t$ from vocabulary $V$, with deterministic token transitions until a special terminal token. Reward feedback is assigned only at episode completion, with:

$$R = R_{\mathrm{acc}} + R_{\mathrm{fmt}},$$

where $R_{\mathrm{acc}} = \exp(-|\hat{y} - y|/\alpha)$ (with $y$ the ground-truth mean opinion score (MOS) and $\hat{y}$ the scalar IQA score parsed from the predicted answer), and $R_{\mathrm{fmt}} = 1$ if the output matches the required formatting, else $0$. The only stochasticity arises from policy sampling.
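As an illustration, this episode-level reward can be sketched as follows. The score-parsing pattern (`Score: <number>`) is an assumption for the example; the paper's exact answer format is not specified here.

```python
import math
import re

def compute_reward(output_text, mos, alpha=0.3):
    """Sketch of R = R_acc + R_fmt (the 'Score:' answer format is hypothetical)."""
    match = re.search(r"Score:\s*([0-9.]+)", output_text)
    if match is None:
        return 0.0  # no parsable score: neither accuracy nor format reward
    y_hat = float(match.group(1))
    r_acc = math.exp(-abs(y_hat - mos) / alpha)  # closeness to ground-truth MOS
    r_fmt = 1.0  # output matched the required format
    return r_acc + r_fmt

print(compute_reward("Reasoning... Score: 3.7", mos=3.7))  # prints 2.0
```

A perfect prediction earns the maximal reward $1 + 1 = 2$; a badly wrong but well-formatted answer still earns the format reward of 1.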

2. Uncertainty-Aware Dynamic Optimization (UADO)

Q-Hawkeye samples $K = 8$ rollouts for each $(I, q)$ pair under the current policy $\pi_{\theta_\mathrm{old}}$, parsing predicted scores $\hat{y}_k$ and computing the sample mean and variance:

$$\mu = \frac{1}{K} \sum_{k=1}^K \hat{y}_k, \quad u = \frac{1}{K} \sum_{k=1}^K (\hat{y}_k - \mu)^2$$

This variance $u$ operationalizes predictive uncertainty. To normalize, the maximal attainable variance for MOS in $[1, 5]$ is $4$, so $u$ is clipped and rescaled:

$$\bar{u} = \min\left(\frac{u}{4 + \epsilon_u}, 1\right), \quad \epsilon_u = 10^{-5}$$

A downweighting factor $w(u) = \exp(-T \cdot \bar{u})$ with temperature $T = 0.2$ is applied to rescale the policy-gradient update strength for the entire sample, yielding for advantage $A_k$:

$$\bar{A}_k = w(u) \cdot A_k, \quad A_k = \frac{r_k - \mu_r}{\sigma_r}$$

where $A_k$ uses within-sample group-relative normalization. All subsequent policy-gradient and GRPO surrogate loss terms are modified to use these weighted advantages, thereby reducing the impact of high-uncertainty (unstable) samples on learning.
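A minimal NumPy sketch of the UADO weighting above, assuming the parsed scores and episode rewards for the $K$ rollouts of one $(I, q)$ sample are available:

```python
import numpy as np

def uado_advantages(y_hats, rewards, T=0.2, eps_u=1e-5):
    """Uncertainty-weighted, group-normalized advantages (UADO sketch).

    y_hats: parsed scores from the K rollouts of one (I, q) pair.
    rewards: the corresponding episode rewards r_k.
    """
    y_hats = np.asarray(y_hats, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Predictive uncertainty: within-group variance of parsed scores,
    # normalized by the maximal variance 4 for MOS in [1, 5].
    u = y_hats.var()
    u_bar = min(u / (4.0 + eps_u), 1.0)
    w = np.exp(-T * u_bar)  # downweighting factor for the whole sample

    # Group-relative advantage normalization (GRPO-style).
    A = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return w * A
```

With identical parsed scores the weight is $\exp(0) = 1$ and the advantages are unchanged; as score variance grows toward the maximum, the whole sample's update strength shrinks by up to the factor $\exp(-T)$.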

3. Perception-Aware Optimization via Implicit Perception Loss

To enforce true visual grounding, for each pristine input image $I$, Q-Hawkeye synthesizes a randomly degraded counterpart $I_\mathrm{deg}$ using noise, blur, JPEG compression, or darkening operators.

The implicit perception loss is defined as

$$D_{\mathrm{KL}}\left[\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I_\mathrm{deg}, q)\right]$$

Maximizing this KL divergence encourages the policy’s output distribution to diverge when degradation alters visual quality, discouraging text-only or language-prior-driven judgments. To avoid trivial high-entropy solutions, a double-entropy regularization term penalizes excessive randomness in both pristine and degraded conditions.

The full PAO objective term becomes
$$L_{\mathrm{PAO}}(\theta) = -\lambda_{\mathrm{per}} D_{\mathrm{KL}} + \gamma_1 H(\pi_\theta(\cdot \mid I, q)) + \gamma_2 H(\pi_\theta(\cdot \mid I_\mathrm{deg}, q))$$
with $\lambda_{\mathrm{per}} = 5 \times 10^{-4}$ and $\gamma_1 = \gamma_2 = 10^{-4}$.
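A sketch of this objective for a single pair of categorical distributions; the actual objective applies it over the policy's output distributions for the pristine and degraded inputs:

```python
import numpy as np

def pao_loss(p_pristine, p_degraded, lam_per=5e-4, gamma1=1e-4, gamma2=1e-4):
    """PAO term sketch: minimizing this maximizes KL(pristine || degraded)
    while the two entropy terms penalize trivially high-entropy outputs."""
    p = np.asarray(p_pristine, dtype=float)
    q = np.asarray(p_degraded, dtype=float)
    eps = 1e-12  # numerical floor to avoid log(0)
    kl = float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
    h_p = float(-np.sum(p * np.log(p + eps)))
    h_q = float(-np.sum(q * np.log(q + eps)))
    return -lam_per * kl + gamma1 * h_p + gamma2 * h_q
```

When the two distributions diverge (the policy reacts to the degradation), the negative KL term dominates and the loss drops; when they coincide, only the entropy penalties remain.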

4. End-to-End Optimization Algorithm and Implementation Details

The training routine iteratively processes each mini-batch through uncertainty-aware sampling (UADO), degraded-input resampling (PAO), and updates the model via AdamW. All optimizations are conducted on Qwen2.5-VL-7B backbones. Key hyperparameters include $\alpha = 0.3$ for reward tolerance, PPO-style clipping ($\epsilon = 0.2$), KL regularization ($\beta = 10^{-3}$), and batch size 32 for 15 epochs.

All data augmentations for degradation are human- and GPT-4o-filtered to ensure perceptual relevance; e.g., Gaussian noise ($\sigma = 0.05$), blur ($\sigma = 2.0$), JPEG quality $= 30$, darken factor $= 0.6$.
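A rough sketch of these degradation operators on a grayscale float image. The parameter values follow the text, but the box blur here is only a stand-in for a true Gaussian blur, and JPEG compression (quality $= 30$) is omitted since it requires an image codec:

```python
import numpy as np

def degrade(img, kind, rng=None):
    """Sketch of operators for synthesizing I_deg from a 2D float image in [0, 1].

    Noise sigma=0.05 and darken factor=0.6 follow the text; the box blur
    approximates the Gaussian blur (sigma=2.0); JPEG is omitted.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if kind == "noise":
        out = img + rng.normal(0.0, 0.05, img.shape)
    elif kind == "darken":
        out = img * 0.6
    elif kind == "blur":
        k = 5  # kernel width roughly matching sigma = 2.0
        pad = np.pad(img, k // 2, mode="edge")
        out = np.zeros_like(img)
        for dy in range(k):
            for dx in range(k):
                out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
        out /= k * k
    else:
        raise ValueError(f"unknown degradation: {kind}")
    return np.clip(out, 0.0, 1.0)
```

Each training image would be paired with one randomly chosen operator to produce the degraded counterpart used in the PAO contrast.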

5. Empirical Evaluation and Comparative Results

Q-Hawkeye is trained only on the KonIQ-10k dataset and tested in-distribution (KonIQ test) and out-of-distribution (SPAQ, LIVE-Wild, FLIVE, KADID-10K, CSIQ, PIPAL, AGIQA-3K) without fine-tuning. Metrics are PLCC and SRCC between predicted and human MOS.

Across these benchmarks, Q-Hawkeye achieves 80.0 PLCC and 76.2 SRCC—outperforming all prior MLLM-based IQA methods trained solely on KonIQ. Even compared to multi-dataset state-of-the-art approaches like DeQA-Score and VisualQuality-R1, the single-dataset Q-Hawkeye remains competitive or superior.

Ablation studies demonstrate that both UADO and PAO contribute significant, complementary improvements. UADO alone yields 77.5/73.8 (PLCC/SRCC), PAO variants alone yield 76.9–78.9/73.4–75.5, but the integrated framework attains 80.0/76.6.

6. Analysis: Reliability, Robustness, and Generalization

Module ablations confirm that UADO stabilizes training, reducing reward variance and raising average reward. Hyperparameter sweeps for the temperature $T$, perception regularization $\lambda_{\mathrm{per}}$, and entropy weights $\gamma_i$ reveal well-behaved optima near the defaults. Increasing the rollout count $K$ brings rapidly diminishing returns beyond $K = 8$.

The perception gap (score differential between $I$ and $I_\mathrm{deg}$) increases monotonically during training under PAO, providing direct evidence that the policy becomes more visually sensitive rather than language-prior-driven. Crucially, the framework generalizes across highly diverse out-of-distribution datasets despite training on a single set.

7. Contributions and Broader Impact

Q-Hawkeye advances RL-based IQA by reformulating reliability as a function of sample-wise prediction stability and perceptual grounding. The UADO mechanism adaptively scales gradient updates to filter unreliable samples, while PAO enforces dependence on visual evidence via a pristine-versus-degraded contrast. This yields substantial improvements in robustness, cross-dataset generalization, and ultimately visually grounded quantitative IQA performed by modern MLLM systems (Xie et al., 30 Jan 2026).
