MLLM-as-a-Judge Benchmark
- MLLM-as-a-Judge Benchmark is an evaluation suite that tests multimodal LLMs’ ability to score, rank, and compare candidate outputs across various modalities.
- It employs a two-tier design—TaskAnything and JudgeAnything—with a 15-way dataset that rigorously stresses cross-modal understanding and generation tasks.
- Quantitative findings highlight strong alignment in multimodal understanding tasks and expose challenges in scoring generation tasks, urging improvements in rubric design and adaptive criteria.
A Multimodal LLM-as-a-Judge Benchmark is an evaluation suite designed to rigorously assess the capability of advanced Multimodal LLMs (MLLMs) to function as automated judges—assigning scores, ranking, and discriminating between candidate outputs—across open-ended tasks spanning text, image, audio, video, and their combinations. Such benchmarks directly target the acute challenges of human preference alignment, cross-modal bias, and reliability in both multimodal understanding (MMU) and multimodal generation (MMG), providing robust testbeds and protocols for comparative research and practical deployment (Pu et al., 21 Mar 2025).
1. Benchmark Design and Dataset Construction
The MLLM-as-a-Judge Benchmark, as instantiated in "Judge Anything: MLLM as a Judge Across Any Modality," employs a two-tier design: TaskAnything and JudgeAnything. TaskAnything is a 15-way, any-to-any multimodal understanding and generation corpus, systematically constructed to stress-test every plausible input–output cross-modal mapping encountered in modern foundation models.
The modalities and combinations evaluated are:
| Input→Output Modality | Example Tasks | # Prompts |
|---|---|---|
| Text→Text, Image→Text, ... | Captioning, VQA | 100 per task |
| ... (15 pairs in total, enumerated below) | Editing, generation, etc. | 1,500 total |
Enumerated, the 15 input–output pairs are:
1. Text→Text
2. Image→Text
3. Video→Text
4. Audio→Text
5. Audio+Video→Text
6. Text→Image
7. Text→Video
8. Text→Audio
9. Image→Image
10. Image→Video
11. Image→Audio
12. Video→Video
13. Video→Audio
14. Audio→Video
15. Audio→Audio
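For concreteness, the 15-way taxonomy above can be written out as explicit (input, output) modality pairs and checked against the stated 100-prompts-per-task budget. This is only an illustrative data-layout sketch; the tuple encoding is an assumption, not the benchmark's own schema.

```python
# The 15 any-to-any task types, grouped by input modality as in the
# enumeration above. Each task type carries 100 prompts, giving the
# 1,500-prompt TaskAnything pool.
TASKS = (
    [(x, "Text") for x in ("Text", "Image", "Video", "Audio", "Audio+Video")]
    + [("Text", y) for y in ("Image", "Video", "Audio")]
    + [("Image", y) for y in ("Image", "Video", "Audio")]
    + [("Video", y) for y in ("Video", "Audio")]
    + [("Audio", y) for y in ("Video", "Audio")]
)

PROMPTS_PER_TASK = 100
assert len(TASKS) == 15
assert len(TASKS) * PROMPTS_PER_TASK == 1500
```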
Prompt construction follows a four-stage curation workflow:
- Sample open-ended instructions from canonical benchmarks, followed by aggressive filtering (removal of fixed-format, NSFW, repetitive queries).
- Target underrepresented cross-modal pairs (e.g., image→audio) using YouTube-sourced datasets (AVSYNC15, VGGSound), custom-crafting prompts as needed.
- Filter and verify data for safety: manual review of video/audio; classifier-based filtering for images.
- Assemble the final dataset 𝒬: 1,500 prompts, distributed evenly across the 15 task types (100 per task).
This approach ensures controlled coverage, empirical challenge, and minimal contamination or data leakage relative to model pretraining (Pu et al., 21 Mar 2025).
2. Judgment Task Taxonomy and Protocols
The evaluation of judge models—referred to hereafter as "evaluators"—is formalized in JudgeAnything via two central protocols:
a. Score Evaluation: For each pair (Qᵢ, Rⱼ), where Qᵢ is a prompt and Rⱼ a candidate model response, an MLLM judge assigns an integer score from 1 to 5, with ground truth provided by expert annotators. Alignment is measured by exact score match, Pearson correlation (r), Spearman rank correlation (ρ), and mean absolute error (MAE).
b. Pair Comparison: For every (Qᵢ, Rⱼ¹, Rⱼ²) tuple—two candidate responses per prompt—an MLLM judge must select "first," "second," or "tie." Correctness is assessed by exact match with human label; accuracy is calculated as the percentage of model choices matching human choices. Evaluator-fusion (ensemble via majority voting/score averaging) is used for robust alignment estimation.
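The two protocols reduce to standard agreement statistics. The sketch below, using only the standard library, computes the score-evaluation metrics (exact match, Pearson r, Spearman ρ via rank correlation, MAE) and pair-comparison accuracy with majority-vote evaluator fusion; the function names and data layouts are illustrative, not the benchmark's actual code.

```python
# Alignment metrics for Score Evaluation and Pair Comparison (sketch).
from math import sqrt
from collections import Counter

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    # 1-based ranks, averaging over tie blocks.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    # Spearman rho = Pearson correlation of the rank vectors.
    return pearson(ranks(xs), ranks(ys))

def score_alignment(judge, human):
    n = len(judge)
    return {
        "exact_match": sum(j == h for j, h in zip(judge, human)) / n,
        "pearson": pearson(judge, human),
        "spearman": spearman(judge, human),
        "mae": sum(abs(j - h) for j, h in zip(judge, human)) / n,
    }

def fused_pair_accuracy(per_judge_choices, human_choices):
    # per_judge_choices: one choice sequence per evaluator, each entry
    # "first", "second", or "tie"; fusion is by majority vote per sample.
    correct = 0
    for i, gold in enumerate(human_choices):
        votes = Counter(j[i] for j in per_judge_choices)
        if votes.most_common(1)[0][0] == gold:
            correct += 1
    return correct / len(human_choices)
```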
Checklist-of-Thought paradigm: Judges utilize sample-specific, high-level rubrics (Overall, Relevance, Trustworthiness, Creativity, Clarity, Coherence, Completeness). Gemini-1.5-Pro generates tailored checklists, pruned by human annotators to isolate the most salient evaluation axes for each sample. Three judgment paradigms are supported:
- Overall (direct judgment)
- Rubric (evaluate each top-level criterion, then summarize)
- Checklist (evaluate using a sample-specific checklist, then average to overall)
Judgment stability is quantified by the Majority Consistency Criterion (MCC) over repeated runs, providing a measure of intra-model robustness.
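One plausible reading of the MCC, sketched below under an explicit assumption: a sample counts as "consistent" when a strict majority of its K repeated judgments agree, and MCC is the fraction of consistent samples. The source does not spell out the exact formula, so treat this as an interpretation.

```python
# Majority Consistency Criterion over repeated runs (assumed formulation).
from collections import Counter

def mcc(repeated_judgments):
    # repeated_judgments: one list per sample, holding the K judgments
    # (scores or pairwise choices) from K repeated runs on that sample.
    consistent = 0
    for runs in repeated_judgments:
        top_count = Counter(runs).most_common(1)[0][1]
        if top_count > len(runs) / 2:  # a strict majority agrees
            consistent += 1
    return consistent / len(repeated_judgments)
```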
3. Quantitative Evaluation and Empirical Findings
Extensive evaluation of five state-of-the-art MLLM judges (notably Gemini-1.5-Pro, GPT-4o, and Gemini-2.0-Flash) on the 1,500-prompt / 6,000-response TaskAnything/JudgeAnything pool reveals:
- MMU (Understanding) vs. MMG (Generation):
- Judges align substantially better with human ratings on understanding tasks (MMU) than generation (MMG).
- Checklist-of-Thought yields the highest human-model agreement.
- Best single-evaluator alignment:
- Gemini-1.5-Pro (Checklist setting, MMU): Pair Comparison (with ties) = 70.6%; Score Evaluation Pearson r = 0.745.
- Highest overall ensemble alignment:
- MMU: Pair Comparison (with ties) = 66.55%; Score Evaluation Pearson r = 0.687.
- MMG: Pair Comparison (with ties) = 53.37%; Score Evaluation Pearson r = 0.562.
- Score evaluation instability: MCC for Gemini-1.5-Pro falls to ~0.76 on Score Evaluation, compared to >0.9 for Pair Comparison.
GPT-4o systematically underperforms on MMG and exhibits hallucinations in fine-grained, rubric-driven scoring. Output–modality bias is pronounced: text→* and image→* tasks yield higher agreement, while video→* and audio→* tasks show graded performance degradation (Pu et al., 21 Mar 2025).
4. Error Modes, Biases, and Limitations
Several structural failure modes inherent to current MLLM judge architectures are exposed:
- Task complexity bias: Judges perform best on task types they can "do" themselves; performance deteriorates on audio, video, and multi-modal generation.
- Output-modality bias: Accuracy and correlation systematically decrease with the complexity of the output modality.
- Rubric-induced hallucinations: Excessively granular checklists can precipitate hallucinated failures (e.g., flagging benign props as violent, under-detecting relevant modal changes).
- Poor cross-modal reasoning: Judges may declare ties even where critical cross-modal differences are salient, or may reward mere verbosity.
- Score evaluation inconsistency: Stability erodes in numeric scoring; pairwise comparative judgment is more repeatable.
- Human–MLLM divergences: MLLMs may overweight superficial features (section completeness, verbosity) and neglect logical or factual soundness, particularly in legal and multi-step reasoning domains (Karp et al., 6 Nov 2025).
5. The OmniArena Platform and Benchmark Infrastructure
OmniArena operationalizes the benchmark by automating:
- Batch inference for omni-models on all 15 any-to-any modality tasks.
- Automated MLLM-based pairwise judging (Gemini-1.5-Pro by default).
- Dynamic performance ranking via ELO (see Appendix for formulas), enabling a competitive, reproducible leaderboard.
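The ranking step uses the standard Elo update; the benchmark's own constants live in its appendix, so the K-factor and ratings below are illustrative assumptions.

```python
# Standard Elo rating update for one pairwise comparison (sketch).
def elo_update(r_a, r_b, outcome, k=32.0):
    # outcome: 1.0 if model A wins, 0.0 if B wins, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

With equal starting ratings and K = 32, a win shifts 16 points from loser to winner, while a tie leaves both ratings unchanged.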
The arena reveals that broad omni-models are currently suboptimal relative to domain-specialized models in generation-heavy tasks (MMG), while Gemini-1.5-Pro consistently leads on MMU.
The entire pipeline (code, dataset, protocols) is distributed for public access and reproducibility at https://urrealhero.github.io/judgeanythingweb/ (Pu et al., 21 Mar 2025).
6. Recommendations for Development and Future Practices
The systematic evaluation reveals that:
- Diverse, cross-modal, and context-sensitive judging data is essential for mitigating output-modality bias. Static rubrics should be replaced or complemented by dynamic, sample-adaptive criteria—especially in open-ended MMG.
- Human–MLLM agreement depends on combining both modes of annotation (sample-specific checklists plus calibrated human labels) and on jointly optimizing alignment against both human-fused and MLLM-derived metrics.
- Standardized, open, scalable testbeds like OmniArena are necessary to foster reproducible, cross-institutional progress and to ensure benchmarks reflect evolving model capabilities and real usage.
- Current reliability is insufficient for fully automated scoring of open-ended generation; keeping humans in the loop on ambiguous cases remains necessary.
All evidence advocates for continued deepening of cross-modal judgment alignment, improved rubric design, and systematic human–MLLM co-calibration efforts (Pu et al., 21 Mar 2025).
References
- "Judge Anything: MLLM as a Judge Across Any Modality" (Pu et al., 21 Mar 2025)
- "MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark" (Chen et al., 2024)
- "LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal" (Karp et al., 6 Nov 2025)