
LLaVA-Critic: Open-Source Multimodal Evaluation Models

Updated 1 February 2026
  • The paper establishes that integrating critic-specific tuning with reinforcement learning on preference data enables LLaVA-Critic to serve as both an evaluator and generative policy model.
  • LLaVA-Critic employs LLaVA-OneVision and Qwen-2.5-VL-7B backbones to transform multimodal inputs into scalar scores and detailed text explanations through unified prompt engineering.
  • Empirical results reveal significant improvements in Pearson correlations and pairwise accuracies, reflecting enhanced alignment with human judgments and robust self-critique capabilities.

LLaVA-Critic refers to a family of open-source multimodal models designed to act as generalist evaluators, or critics, for vision-language outputs. Originating with "LLaVA-Critic: Learning to Evaluate Multimodal Models" (Xiong et al., 2024) and extended by "LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model" (Wang et al., 31 Aug 2025), this line of research establishes that, with properly curated training data and objectives, large multimodal models (LMMs) can support not only generative answering but also robust, model-based evaluation of outputs across diverse multimodal tasks. Critically, reinforcement learning (RL) on preference-labeled critic data enables dual usage: the same model serves both as a reward model for alignment and as a high-performance generative policy in its own right.

1. Model Architectures and Critic Integration

LLaVA-Critic is based on the LLaVA-OneVision (LLaVA-OV) model, with variants at 7B and 72B parameter scales, augmented through critic-specific instruction tuning. The later LLaVA-Critic-R1 uses Qwen-2.5-VL-7B (a two-stage vision-language model with a frozen ViT/Swin vision encoder feeding a Qwen-2.5 causal LLM) as its base. No additional trainable "value head" is introduced: both evaluation (critic) and generation (policy) tasks use the same generative head, with the roles enforced through prompt engineering.

When prompted, the model receives as input the multimodal context (image, question), candidate responses, and a task description. For evaluation, it outputs either a numerical score plus textual reasoning (pointwise) or a structured preference/ranking (pairwise), often enclosed within a <think>…</think> chain-of-thought followed by a boxed final decision. This architectural unification allows the model to support both end-to-end answer generation and explicit critique within the same transformer stack (Xiong et al., 2024, Wang et al., 31 Aug 2025).

| Model Variant | Backbone | Critic Output Format |
|---|---|---|
| LLaVA-Critic-OV-7B/72B | LLaVA-OneVision | Score + explanation / preference + reasoning |
| LLaVA-Critic-R1 | Qwen-2.5-VL-7B | `<think>` CoT + `\boxed{…}` decision |
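As a minimal sketch of this unified interface, the following shows how a pairwise critique query could be built and how a `<think>` + `\boxed{…}` response could be parsed. The prompt wording and helper names are illustrative assumptions, not the papers' exact templates:

```python
import re

def build_pairwise_prompt(question: str, resp_a: str, resp_b: str) -> str:
    """Hypothetical pairwise-critique prompt; the real templates differ."""
    return (
        "You are an impartial judge of two answers to a visual question.\n"
        f"Question: {question}\n"
        f"Response 1: {resp_a}\n"
        f"Response 2: {resp_b}\n"
        "Think step by step inside <think>...</think>, then give the final "
        "verdict as \\boxed{1} or \\boxed{2}."
    )

def parse_critic_output(text: str):
    """Extract the chain-of-thought and the boxed final decision."""
    cot = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    verdict = re.search(r"\\boxed\{(.+?)\}", text)
    return (cot.group(1).strip() if cot else None,
            verdict.group(1) if verdict else None)

out = "<think>Response 2 cites the chart values correctly.</think> \\boxed{2}"
print(parse_critic_output(out))
# → ('Response 2 cites the chart values correctly.', '2')
```

Because critique is expressed through the same generative head, evaluation is just another text-completion task: no architectural change is needed to switch between answering and judging.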

2. Dataset Construction and Training Objectives

LLaVA-Critic employs the LLaVA-Critic-113k dataset (113,000 annotated samples over 46,000 unique images), mixing 72.8k pointwise (scored) and 40.1k pairwise (preference) examples. Construction follows a GPT-4o–assisted pipeline, aggregating contexts from eight public vision instruction tuning sets (e.g., LLaVA-Instruction, SVIT, ComVint, LLaVAR, LRV) and drawing candidate outputs from twelve competitive LMMs plus GPT-4o references.
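To make the two data formats concrete, here is a hedged illustration of what pointwise and pairwise records might look like; all field names and values are hypothetical, not the released dataset schema:

```python
# Hypothetical record layouts for the two LLaVA-Critic-113k data types.
# Field names and values are illustrative only.
pointwise_example = {
    "image": "coco/000000123456.jpg",   # visual context
    "question": "What is the man holding?",
    "response": "A red umbrella.",
    "score": 85,                        # GPT-4o-assisted quality score
    "reason": "Accurate and concise, but omits the background scene.",
}

pairwise_example = {
    "image": "svit/img_0042.jpg",
    "question": "Describe the chart.",
    "response_1": "The chart shows sales rising.",
    "response_2": "Sales rise from 10 to 40 between 2020 and 2023.",
    "preference": "2",                  # which response is judged better
    "reason": "Response 2 reads the axis values correctly.",
}

print(pointwise_example["score"], pairwise_example["preference"])
```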

Evaluation prompts cover a rich spectrum—captioning, visual reasoning, robustness, hallucination avoidance, ethical safety—and are sourced/adapted from seven widely used benchmarks, including MMHal-Bench, WildVision-Bench, and LLaVA-in-the-Wild.

Supervised loss is computed over the full output stream (score/preference plus justification) using token-level cross-entropy:

$$\mathcal{L}_{\text{sup}} = -\sum_{t=1}^{T} \log P_\theta\bigl(y_t \mid y_{<t}, x\bigr)$$

For reinforcement learning in LLaVA-Critic-R1, pairwise preference data is reformulated into verifiable RL signals. The reward combines correctness of the preference judgment ($r_{\mathrm{pref}}$) with output-format adherence ($r_{\mathrm{format}}$), weighted by $\alpha = 0.9$:

$$r(\tau) = \alpha\, r_{\mathrm{pref}}(\tau) + (1-\alpha)\, r_{\mathrm{format}}(\tau)$$

The RL objective employs KL-regularized policy-gradient optimization (akin to PPO or Group Relative Policy Optimization), applied directly to the base model without any warm-start supervised fine-tuning (Wang et al., 31 Aug 2025).
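The reward combination above can be sketched directly, assuming binary correctness and format signals (the exact scoring of each component is not specified in the source):

```python
def combined_reward(pred: str, gold: str, well_formatted: bool,
                    alpha: float = 0.9) -> float:
    """r(tau) = alpha * r_pref + (1 - alpha) * r_format.

    pred/gold are the critic's and reference preference labels; both
    component rewards are assumed binary for this sketch.
    """
    r_pref = 1.0 if pred == gold else 0.0
    r_format = 1.0 if well_formatted else 0.0
    return alpha * r_pref + (1.0 - alpha) * r_format

print(combined_reward("2", "2", True))    # correct verdict, valid format
print(combined_reward("1", "2", True))    # wrong verdict: format credit only
print(combined_reward("1", "2", False))   # no credit at all
```

With $\alpha = 0.9$, preference correctness dominates the signal while format adherence contributes a small but verifiable bonus.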

3. Training Procedures and Optimization

LLaVA-Critic is trained via standard supervised learning on GPT-4o-labeled datasets. For preference-learning applications, the model provides reward signals for ranking-based methods such as Direct Preference Optimization (DPO):

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(y^+, y^-)}\bigl[\log \sigma\bigl(\pi_\theta(y^+ \mid x) - \pi_\theta(y^- \mid x)\bigr)\bigr]$$

where $(y^+, y^-)$ are the preferred and rejected responses in a candidate pair.
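A single-pair numerical sketch of this objective follows; the reference-policy term of full DPO is omitted as in the simplified formula above, and the `beta` scaling constant is an assumption for illustration:

```python
import math

def dpo_loss(logp_pos: float, logp_neg: float, beta: float = 0.1) -> float:
    """Simplified DPO loss on one pair: -log sigmoid(beta * margin).

    logp_pos / logp_neg are the policy log-probabilities of the preferred
    and rejected responses; the reference-policy term is omitted.
    """
    margin = beta * (logp_pos - logp_neg)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favour of the preferred response drives the loss down:
print(dpo_loss(-5.0, -20.0))   # preferred response already likelier: small loss
print(dpo_loss(-20.0, -5.0))   # rejected response likelier: large loss
```

Minimizing this loss pushes the policy to assign higher probability to critic-preferred responses.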

For LLaVA-Critic-R1, RL is applied from a cold start to Qwen-2.5-VL-7B, guided purely by formatted preference labels (outputs "1", "2", or "Two responses are equally good.") and strict format constraints. Hyperparameters include a learning rate of $5\times10^{-6}$, batch size 256, about 40K training updates, and a KL penalty coefficient $\beta \approx 0.1$. Output format is rigorously enforced during both training and evaluation via prompt design (Wang et al., 31 Aug 2025).
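A hypothetical format-reward check consistent with these constraints might look like the following; the exact enforcement logic is not specified in the source, only the allowed verdict labels:

```python
import re

# Verdict labels quoted from the paper's description of the preference data.
VALID_VERDICTS = {"1", "2", "Two responses are equally good."}

def format_reward(output: str) -> float:
    """Hypothetical r_format: require one <think> block followed by a
    verdict drawn from the allowed label set."""
    m = re.fullmatch(r"<think>.+?</think>\s*(.+)", output.strip(), re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() in VALID_VERDICTS else 0.0

print(format_reward("<think>Both miss the caption.</think> "
                    "Two responses are equally good."))  # valid format
print(format_reward("Answer: 1"))                        # no <think> block
```

Because the check is a deterministic string match, the format component of the reward is fully verifiable without any learned reward head.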

4. Evaluation Protocols and Empirical Results

LLaVA-Critic assessments are conducted across two axes: (i) evaluator performance (“LMM-as-a-Judge”), and (ii) reward model effectiveness for preference alignment.

Evaluation as Generalist Critic:

On pointwise scoring against GPT-4o, LLaVA-Critic-7B achieves Pearson correlation coefficients up to 0.732 (72B: 0.754), with Kendall’s Tau at 0.911 (72B: 0.933). For pairwise ranking on 2,000 WildVision pairs, LLaVA-Critic-72B attains 0.736 accuracy without ties, exceeding GPT-4V (0.708) (Xiong et al., 2024).
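These agreement metrics follow standard formulas; a self-contained sketch of Pearson correlation and Kendall's tau (the tau-a variant, which ignores ties) on toy score lists:

```python
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    c = d = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            c += 1
        elif s < 0:
            d += 1
    n = len(xs)
    return (c - d) / (n * (n - 1) / 2)

critic_scores = [85, 60, 92, 40, 70]   # toy critic outputs
gpt4o_scores = [80, 65, 95, 35, 75]    # toy reference scores
print(round(pearson_r(critic_scores, gpt4o_scores), 3),
      round(kendall_tau(critic_scores, gpt4o_scores), 3))
```

Pearson captures linear agreement on absolute scores, while Kendall's tau captures rank agreement, which is why both are reported.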

Alignment with Human Judgments:

On out-of-domain LMM-as-a-Judge benchmarks, LLaVA-Critic-72B improves human score correlation from 0.287 (LLaVA-OV-72B) to 0.393, with pairwise-with-tie accuracy rising to 0.578—closing much of the gap with GPT-4V (0.636/0.773) (Xiong et al., 2024).

| Evaluator | Pearson-r | Pairwise Acc. (no tie) |
|---|---|---|
| LLaVA-OV-72B | 0.287 | 0.701 |
| LLaVA-Critic-72B | 0.393 | 0.715 |
| GPT-4V | 0.490 | 0.773 |

Policy Emergence in LLaVA-Critic-R1:

Although trained as a critic, LLaVA-Critic-R1 exhibits emergent answer-generation ability, with an average gain of +5.7% over Qwen-2.5-VL-7B across 26 visual reasoning benchmarks (e.g., VQA, chart understanding, video reasoning). As a critic, it improves scores on VLRewardBench by +9.4 points and on MM-RLHF by +2.9 points. Notably, all improvements are achieved without separate RLHF reward heads or extra trainable parameters (Wang et al., 31 Aug 2025).

Extension via Critic RL (LLaVA-Critic-R1+):

Applying the RL optimization to a stronger reasoning base (ThinkLite-VL-7B) yields LLaVA-Critic-R1+, which further increases performance, e.g., achieving 71.9% on the MMMU benchmark at 7B scale—a new state-of-the-art (Wang et al., 31 Aug 2025).

5. Self-Critique and Preference Learning Applications

LLaVA-Critic models are central to preference-based alignment cycles. When used as reward models for DPO, they surpass human-supervised models in RLHF benchmarks. After three DPO iterations with LLaVA-Critic-7B as reward model, accuracy on WildVision-Bench rises from 54.0 to 67.3; other benchmarks show similar or greater improvements (Xiong et al., 2024).

In the inference phase, LLaVA-Critic-R1 demonstrates effective self-critique via recursive best-of-$n$ selection: across five reasoning tasks, the average self-critique improvement is +13.8% over single-pass inference. The model samples $n$ candidate answers, then recursively applies its own critic capacity to select the best one, offering a scalable path toward self-improving inference (Wang et al., 31 Aug 2025).
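A minimal sketch of recursive best-of-$n$ selection, assuming a pairwise judge interface that returns "1" or "2"; the paper's exact tournament scheme may differ:

```python
def self_critique_select(candidates, pairwise_judge):
    """Keep a running winner and challenge it with each remaining candidate,
    using the model's own pairwise critic (hypothetical interface)."""
    best = candidates[0]
    for challenger in candidates[1:]:
        if pairwise_judge(best, challenger) == "2":  # critic prefers challenger
            best = challenger
    return best

# Toy judge that prefers the longer (more detailed) answer, for illustration:
judge = lambda a, b: "2" if len(b) > len(a) else "1"
answers = ["42", "The answer is 42.", "It is 42 because 6 x 7 = 42."]
print(self_critique_select(answers, judge))
# → 'It is 42 because 6 x 7 = 42.'
```

Since selection reuses the same model as judge, the quality of the final answer scales with sampling budget at no extra training cost.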

| Benchmark | Majority Vote | Base Critic | Self-Critic |
|---|---|---|---|
| MathVista | 76.4 | 77.1 | 78.9 |
| MathVerse | 54.7 | 55.2 | 60.9 |
| MathVision | 32.9 | 34.7 | 44.1 |
| MMMU | 60.0 | 61.9 | 66.4 |
| MMStar | 67.4 | 67.9 | 69.7 |

6. Limitations and Future Directions

LLaVA-Critic and its RL-extended variants remain reliant on GPT-4o–generated labels, inheriting any upstream biases and coverage gaps, especially in specialist or novel domains. Prompt and scenario diversity is likewise limited, motivating extensions such as multi-image and video settings, meta-learning of new evaluation criteria, and ensemble "critic of critics" systems to push feedback quality beyond existing LMMs.

Planned directions include broadening the scope of self-critique, advancing transparency via multi-modal chain-of-thought in explanations, and adapting critic learning to less explored benchmarks or domains using accelerated, adaptive paradigms (Xiong et al., 2024).

7. Scientific Significance and Impact

LLaVA-Critic formally extends LMMs from generative “assistant” roles to reliable, generalist evaluators, providing open-source, transparent scoring with GPT-4V–competitive performance (Xiong et al., 2024). The LLaVA-Critic-R1 paradigm demonstrates that, under RL from verifiable preference data, a model can unify both critic and policy capacities within a single architecture, simplifying evaluation pipelines and supporting self-improving inference (Wang et al., 31 Aug 2025). This work enables scalable, reproducible evaluation in research pipelines, serves as a high-quality signal for policy alignment via DPO/RLHF, and marks a notable advance toward self-reflective, continually improving multimodal AI systems.
