
Multimodal Reward Models

Updated 9 February 2026
  • Multimodal reward models are functions that map visual and textual inputs to quantitative scores, enabling refined assessment of complex outputs.
  • They integrate vision encoders and language models with cross-modal fusion adapters to deliver either outcome or process-level rewards.
  • Training strategies combining supervised fine-tuning, reinforcement learning, and entropy-guided data curation boost efficiency and model robustness.

A multimodal reward model is a learned function that quantitatively evaluates the quality of responses involving both visual (e.g., images, video frames) and textual data, given a candidate solution or reasoning trajectory. These models are central to alignment, training, and inference in Multimodal LLMs (MLLMs). By providing scalar, process-level, or structured feedback on outputs conditioned on complex mixed-modality inputs, they enable learning from human preferences, reinforcement learning (RL), best-of-N candidate selection, and fine-grained error detection in chain-of-thought (CoT) reasoning (Wang et al., 13 Mar 2025, Hu et al., 18 Dec 2025, Wang et al., 6 May 2025, Li et al., 4 Feb 2026).

1. Foundational Taxonomy and Formalism

Multimodal reward models (MM-RMs) are parameterized functions $R_\theta$ mapping inputs $(I, q, s)$ to a quantitative reward, where $I$ is a visual input (image or video), $q$ is a textual query or instruction, and $s$ is a candidate response or reasoning trace. There are two principal types:

  • Outcome Reward Models (ORMs): Assign a single scalar reward to the complete response:

$$R_{\mathrm{out}}(I, q, s) = f_\theta(I, q, s)$$

  • Process Reward Models (PRMs): Decompose supervision over reasoning steps, assigning rewards at each intermediate step and aggregating:

$$y_i = g_\theta(I, q, s_{\le i}), \quad R_{\mathrm{proc}}(I, q, s) = \frac{1}{n+1}\sum_{i=0}^{n} y_i$$

where $s_{\le i}$ is the $i$-th step prefix and $y_i$ is typically discretized.
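To make the two formulations concrete, here is a minimal sketch of process-level scoring. `process_reward` and the toy scorer are illustrative stand-ins for $g_\theta$, not any published implementation:

```python
from typing import Callable, Sequence

def process_reward(
    step_scorer: Callable[[str, str, Sequence[str]], float],
    image: str,
    query: str,
    steps: Sequence[str],
) -> float:
    """Average the per-step scores y_i = g_theta(I, q, s_<=i) over all step prefixes."""
    scores = [step_scorer(image, query, steps[: i + 1]) for i in range(len(steps))]
    return sum(scores) / len(scores)

# Toy stand-in scorer: marks a step correct (1.0) unless it contains "error".
toy_scorer = lambda img, q, prefix: 0.0 if "error" in prefix[-1] else 1.0

r = process_reward(toy_scorer, "img.png", "What is 2+3?", ["2+3=5", "error: 2+3=6", "answer 5"])
```

An ORM, by contrast, would call the scorer once on the full response rather than averaging over prefixes.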

Some frameworks extend these with structured, verifiable, or agentic reward models, which provide multi-dimensional, sub-question, or tool-grounded feedback (Zhang et al., 7 Aug 2025, Ding et al., 4 Dec 2025).

PRMs are especially critical for enabling step-level critique, i.e., assessing not only whether the final output is correct but also which reasoning steps (textual or visual) are correctly executed, allowing dense supervision for RL-style optimization and error diagnosis (Wang et al., 13 Mar 2025, Li et al., 4 Feb 2026, Wang et al., 11 Jun 2025).

2. Architectures and Training Paradigms

Core Backbones

Typical MM-RMs are built atop large vision-language transformers that integrate a vision encoder, a language model backbone, and cross-modal fusion adapters.

Reward Paradigms

  • Naive-RM: Directly predicts a scalar reward via an MLP head, typically trained with a pairwise ranking loss (e.g., the Bradley–Terry loss): $\mathcal{L}_{\text{Naive}} = -\log \sigma\big(R_\theta(y_w \mid x) - R_\theta(y_l \mid x)\big)$
  • Critic-Based RM: Generates critiques and scores them; the loss is defined over the critique ranking or scoring.
  • Generative RM: Outputs a preference decision ("1"/"2"), optimized with a generative cross-entropy loss (Zhang et al., 19 Sep 2025).
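The Bradley–Terry pairwise loss can be sketched numerically in a few lines of pure Python (no framework assumed); `bradley_terry_loss` is an illustrative helper, not a library API:

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_w - r_l): small when the chosen response outscores the rejected one.
    Computed via log1p for numerical stability at large margins."""
    margin = reward_chosen - reward_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

correct_ranking = bradley_terry_loss(2.0, 0.0)   # small loss: preference respected
inverted_ranking = bradley_terry_loss(0.0, 2.0)  # large loss: preference violated
```

Minimizing this loss pushes the reward margin between preferred and dispreferred responses apart, which is the training signal behind the Naive-RM head.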

Process/Chain-of-Thought RMs: Incorporate long-form reasoning traces, either by maximizing the likelihood of stepwise CoT outputs or leveraging RL to reward/penalize per-step correctness (Wang et al., 6 May 2025, Gao et al., 9 Apr 2025).

Structured/Verifier RMs: Use an auxiliary verifier head to assess per-subquestion correctness, semantic/mathematical equivalence, or other fine-grained criteria (Zhang et al., 7 Aug 2025, Gao et al., 9 Apr 2025).

Training Strategies

Data Curation

State-of-the-art MM-RMs are trained on mixtures of human-curated, automated, and synthetic paired-comparison data across images, videos, interleaved sequences, reasoning chains, and step-level correctness (Wang et al., 13 Mar 2025, Hu et al., 18 Dec 2025, Wang et al., 6 May 2025, Zhang et al., 19 Sep 2025).

3. Benchmarks and Metrics

The performance, alignment, and reliability of MM-RMs are validated using increasingly rigorous, large-scale, and fine-grained benchmarks:

| Benchmark Name | Focus | Key Statistics & Metrics |
| --- | --- | --- |
| VisualProcessBench | Step-level multimodal reasoning | 26,950 steps; Macro-F1, accuracy |
| Multimodal RewardBench 2 (MMRB2) | Interleaved text-image; four subtasks (T2I, editing, reasoning, interleaved generation) | 4,000 pairs; judge–human agreement, accuracy, Pearson $\rho$ (Hu et al., 18 Dec 2025) |
| VideoRewardBench | Multi-aspect video understanding | 1,563 samples; accuracy by dimension (perception, knowledge, reasoning, safety) (Zhang et al., 30 Aug 2025) |
| Agent-RewardBench | Step-level agentic planning, perception, safety | 1,136 high-quality pairs; per-dimension accuracy (Men et al., 26 Jun 2025) |

Metrics: Macro-F1, pairwise accuracy, best-of-N reranking gain, human–model agreement, and specialized sub-scores (e.g., hallucination, safety, reasoning).

Benchmarks such as VisualProcessBench and MMRB2 provide human-annotated ground truth for stepwise correctness or pairwise preferences, allowing fine-grained ranking and error localization (Wang et al., 13 Mar 2025, Hu et al., 18 Dec 2025, Li et al., 4 Feb 2026). Correlation between reward model performance and downstream task success is also quantified; for example, MMRB2 accuracy correlates with best-of-N (BoN) candidate improvement ($\rho > 0.8$) (Hu et al., 18 Dec 2025).

4. Data-Efficiency, Generalization, and Robustness

Data-Efficient Process Reward Modeling

Training large process reward models on full-scale MC-annotated corpora (e.g., VisualPRM400K with 400K–565K rollouts) is known to saturate quickly under random subsampling: only 10%–25% of the data is needed to match full-data performance due to redundancy and label noise (Li et al., 4 Feb 2026). The Balanced-Information Score (BIS), combining step mixture (uncertainty) and MC label reliability at the rollout level, selects informative rollouts and achieves equivalent or superior downstream performance at small data fractions.
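The exact BIS formula is not reproduced here; the following is a hypothetical sketch of the idea, combining a step-label mixture term (uncertainty is highest when a rollout's step labels are evenly split) with a Monte-Carlo sample-count term (reliability). The name `bis_score` and its specific weighting are illustrative assumptions:

```python
def bis_score(step_labels: list, mc_samples: int, max_samples: int = 16) -> float:
    """Illustrative balanced-information scoring: prefer rollouts whose step labels
    are mixed (informative) and whose MC estimates rest on enough samples (reliable)."""
    if not step_labels:
        return 0.0
    frac_correct = sum(step_labels) / len(step_labels)
    mixture = 4.0 * frac_correct * (1.0 - frac_correct)  # peaks at 1 for a 50/50 split
    reliability = min(mc_samples, max_samples) / max_samples
    return mixture * reliability

uniform = bis_score([1, 1, 1, 1], mc_samples=16)  # all-correct: little signal
mixed = bis_score([1, 0, 1, 0], mc_samples=16)    # mixed and well-sampled: high score
```

Selecting the top-scoring rollouts under such a criterion is what lets a small data fraction match full-data training in the cited results.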

Large redundancy implies that careful data curation, prioritizing uncertain-but-reliable steps, is critical for maximizing model training efficiency and information content.

Shortcut Mitigation and Generalization

Naive training of MM-RMs on unfiltered datasets results in over-reliance on textual shortcuts (e.g., length bias, hallucination tokens), harming out-of-distribution (OOD) robustness (Li et al., 5 Mar 2025). Shortcut-aware learning dynamically downweights samples predictable by text-only proxies, markedly improving OOD accuracy (gap reduction from 23.3% to 11.7%) and decreasing Shortcut-Failure Degradation (SFD) (39.5 to 18.4). Practitioners are recommended to include unimodal baselines and dynamic weighting schemes during training to guarantee genuine multimodal discrimination.
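A hedged sketch of the dynamic-downweighting idea: if a text-only proxy already separates a preference pair by a large reward margin, the pair likely admits a textual shortcut and should contribute less to the multimodal loss. The sigmoid weighting and the `sample_weight` helper are illustrative, not the exact scheme from the cited paper:

```python
import math

def sample_weight(text_only_margin: float, temperature: float = 1.0) -> float:
    """Weight a preference pair by 1 - sigmoid(margin / T), where `margin` is the
    text-only proxy's reward gap. Text-predictable (shortcut-prone) pairs get
    small weights; ambiguous pairs keep weight near 0.5."""
    return 1.0 - 1.0 / (1.0 + math.exp(-text_only_margin / temperature))

shortcut_pair = sample_weight(4.0)   # proxy confident -> heavily downweighted
ambiguous_pair = sample_weight(0.0)  # proxy uncertain -> genuine multimodal signal
```

Pairs the unimodal baseline cannot resolve are exactly those that force the model to use visual evidence.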

Entropy-Guided Training

Response entropy serves as an effective, unsupervised proxy for annotation noise and sample complexity in preference data. Entropy-guided data curation—retaining low-entropy (less ambiguous) samples—and an easy-to-hard curriculum improve reasoning accuracy and efficiency. Training on the bottom 15% entropy subset suffices to match full-data accuracy while yielding +3–4 points over prior best models (Yang et al., 2 Feb 2026).
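One way to sketch entropy-guided curation in code; the exact entropy definition in the cited work may differ, and `response_entropy` here is a simple length-normalized negative log-probability proxy over the sampled tokens:

```python
import math

def response_entropy(token_probs: list) -> float:
    """Mean -log p over the sampled tokens: a length-normalized uncertainty proxy."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def bottom_entropy_subset(samples: dict, frac: float = 0.15) -> list:
    """Keep the lowest-entropy (least ambiguous) fraction of preference samples."""
    ranked = sorted(samples, key=lambda key: response_entropy(samples[key]))
    keep = max(1, int(len(ranked) * frac))
    return ranked[:keep]

# Toy corpus: per-sample token probabilities; "a" is the most confident response.
samples = {"a": [0.9, 0.9], "b": [0.5, 0.5], "c": [0.2, 0.2]}
kept = bottom_entropy_subset(samples, frac=0.34)
```

An easy-to-hard curriculum would then feed the retained samples in increasing entropy order.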

5. Structured, Stepwise, and Generative Process Reward Models

Advanced MM-RMs increasingly focus on providing fine-grained feedback and corrections at the reasoning step or sub-question level:

  • Process Reward Models (PRMs): VisualPRM, Athena-PRM, and GM-PRM output per-step correctness probabilities, enabling step-level supervision and BoN solution reranking (Wang et al., 13 Mar 2025, Wang et al., 11 Jun 2025, Zhang et al., 6 Aug 2025).
  • Structured/Verifier RMs: StructVRM computes k-dimensional binary or fractional rewards for sub-questions within a response, leveraging semantic/mathematical equivalence rather than string match. This allows for partial-credit and compositional feedback, which is especially effective for complex, multi-part STEM problems (Zhang et al., 7 Aug 2025, Gao et al., 9 Apr 2025).
  • Tri-Head and Multi-Dimensional Models: SVIP-Reward and similar frameworks model relevance, logic, and attribute correctness via multi-head attention, strengthening chain-of-thought verification (Gao et al., 9 Apr 2025).
  • Generative Corrective PRMs: GM-PRM not only diagnoses but also generates corrected reasoning steps, supporting refined best-of-N reranking (Refined-BoN). This converts the critic from a binary verifier into an active collaborator, improving solution diversity and accuracy with remarkable data efficiency (20K samples) (Zhang et al., 6 Aug 2025).

Key findings confirm that process-level (stepwise) supervision consistently outperforms outcome-only and self-consistency baselines for best-of-N scaling—e.g., VisualPRM yields +5.9 points on InternVL2.5-78B over seven benchmarks using BoN=8 (Wang et al., 13 Mar 2025); Athena-PRM achieves +10.2 on WeMath and SoTA results (+3.9 F1) on VisualProcessBench with only 5K samples (Wang et al., 11 Jun 2025).
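Best-of-N selection itself is simple to sketch: score every sampled candidate with the reward model and keep the argmax. The reward function below is a toy stand-in for a trained MM-RM:

```python
from typing import Callable, List

def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """BoN reranking: return the candidate the reward model scores highest."""
    return max(candidates, key=reward_fn)

# Toy reward: prefer the candidate containing the correct answer "=5".
toy_reward = lambda s: 1.0 if "=5" in s else 0.0
picked = best_of_n(["2+3=6", "2+3=5", "2+3=23"], toy_reward)
```

With a PRM, `reward_fn` would be the averaged step score $R_{\mathrm{proc}}$, which is where the step-level gains reported above come from.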

6. Agentic, Unified, and Application-Specific Models

Agentic Reward Models

Agentic MM-RMs such as ARM-Thinker augment reward models with explicit tool-calling and verification modules, transitioning from static scoring to interactive, evidence-grounded judgment. ARM-Thinker leverages ReAct-style think–act–observe loops with image cropping, document retrieval, and instruction checkers, allowing for fine-grained tool-assisted reward modeling. It achieves +16.2% average improvement on reward modeling tasks and +9.6% on tool-use benchmarks. Joint RL optimization (SFT, staged GRPO) is used to learn when and how to invoke agentic tools within the reward modeling process (Ding et al., 4 Dec 2025).

Unified and Multi-purpose Reward Models

UnifiedReward, Skywork-VL Reward, and BaseReward exemplify MM-RMs designed for joint assessment of understanding and generation tasks over images and videos. These models are trained on broad, multi-domain preference datasets, support pairwise and pointwise heads, and are validated via both static benchmarks (MMRB2, VLRewardBench) and downstream DPO/RL tasks (Wang et al., 7 Mar 2025, Wang et al., 12 May 2025, Zhang et al., 19 Sep 2025).

BaseReward provides a systematic recipe for state-of-the-art MM-RM construction, adopting a Qwen2.5-VL backbone with an optimized two-layer SiLU-MLP reward head and a curated mixture of 2.8M preference pairs. It exceeds previous SOTA by >10 points on MM-RLHF-Reward Bench and +14.2 points on VL-Reward Bench (Zhang et al., 19 Sep 2025).

Application: Reward-Guided Decoding and Control

Reward models are increasingly deployed not only as training critics but for run-time control of model outputs. Reward-guided decoding combines the LLM probability with one or more reward functions (e.g., object hallucination, recall), enabling explicit trade-offs between precision and recall, as well as compute-vs-grounding quality. This setting allows user-controllable, on-the-fly adjustment of MLLM behavior at inference (Mañas et al., 15 Aug 2025).
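A minimal sketch of reward-guided scoring, assuming a linear combination of the model log-probability and weighted reward terms; the `guided_score` helper and the specific weights are illustrative, not the cited method's exact formulation:

```python
def guided_score(log_prob: float, rewards: dict, weights: dict) -> float:
    """Combine the LLM log-probability of a candidate continuation with weighted
    reward terms; the weights expose precision/recall trade-offs at inference time."""
    return log_prob + sum(weights[k] * rewards[k] for k in rewards)

# Candidate A is more fluent but flagged for hallucination; B is safer but less likely.
low = {"halluc": 0.1, "recall": 0.1}    # mild grounding pressure
high = {"halluc": 1.0, "recall": 0.1}   # strong hallucination penalty
a_low = guided_score(-1.0, {"halluc": -2.0, "recall": 1.0}, low)
b_low = guided_score(-1.5, {"halluc": 0.0, "recall": 0.5}, low)
a_high = guided_score(-1.0, {"halluc": -2.0, "recall": 1.0}, high)
b_high = guided_score(-1.5, {"halluc": 0.0, "recall": 0.5}, high)
# Under mild weights A wins; raising the hallucination weight flips the choice to B.
```

This is what makes the behavior user-controllable: the same frozen model and reward heads yield different decodes as the weights change.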

7. Limitations, Open Problems, and Future Directions

Headroom and Open Challenges

  • State-of-the-art reward models, including proprietary models (Gemini 3 Pro, GPT-5), lag human-level agreement by 14–25 points on challenging benchmarks (MMRB2 >90% human vs. ~76% best model).
  • Inadequate performance in fine-grained, same-model comparisons and persistent modality biases (overweighing image-containing responses) (Hu et al., 18 Dec 2025).
  • Safety and robust generalization remain acute bottlenecks in agentic and long-horizon settings (Agent-RewardBench: best models ~61.6% accuracy, safety lags far behind perception/planning) (Men et al., 26 Jun 2025).
  • Substantial redundancy and label noise exist in process reward corpora; only a small, carefully selected subset is necessary for optimal performance (Li et al., 4 Feb 2026, Yang et al., 2 Feb 2026).
  • Most models and data pipelines rely heavily on MC-annotated or LLM-annotated synthetic data; further progress relies on scaling high-fidelity human judgments and process-level error collections.

Recommendations and Directions

  • Scaling to larger backbones (20B+), joint training of critic–policy models, and hybrid outcome–process reward formulations.
  • Process/safety-critical reward models: Expanding and balancing benchmarks for safety, multi-turn dialogue, and stepwise agent planning.
  • Debiasing and generalization: Incorporate shortcut-aware and entropy-based data selection pipelines; deploy adversarial or invariant risk minimization strategies.
  • Multi-modal, structured, agentic signals: Leverage structured verifier heads, generative correction, multi-head attention, and tool-assisted agentic modules for comprehensive stepwise supervision.

In sum, multimodal reward models anchor the emerging field of preference-aligned, step-level, and agentic evaluation in MLLMs. State-of-the-art MM-RMs display rapid progress in architectural sophistication, data-efficiency, and application breadth, but significant scientific, engineering, and annotation challenges persist, particularly in generalization, safety, and open-ended reasoning (Wang et al., 13 Mar 2025, Hu et al., 18 Dec 2025, Li et al., 4 Feb 2026, Li et al., 5 Mar 2025, Zhang et al., 7 Aug 2025, Zhang et al., 6 Aug 2025, Wang et al., 6 May 2025, Gao et al., 9 Apr 2025, Wang et al., 11 Jun 2025, Zhang et al., 19 Sep 2025, Ding et al., 4 Dec 2025, Wang et al., 12 May 2025, Yang et al., 2 Feb 2026).
