LVLM-eHub: Vision-Language Benchmark
- LVLM-eHub is a unified benchmarking framework for large vision-language models, integrating quantitative metrics with human-in-the-loop assessment.
- It systematically evaluates models over 47 diverse benchmarks spanning six multimodal capabilities to diagnose performance and error modes.
- The platform features an online arena for direct model competitions, promoting robust, real-world assessments and guiding future refinements.
The LVLM Evaluation Hub (LVLM-eHub) is a comprehensive benchmarking framework that systematically quantifies and compares the capabilities of publicly available large vision-language models (LVLMs) across a diverse spectrum of multimodal tasks. Designed to address the absence of a unified, large-scale evaluation platform for LVLMs, LVLM-eHub combines rigorous quantitative testing with real-world human-in-the-loop assessment, providing a robust infrastructure for diagnosing and advancing multimodal learning systems (Xu et al., 2023).
1. Motivation and Design Principles
The rapid emergence of LVLMs such as Flamingo, BLIP-2, InstructBLIP, and MiniGPT-4 prompted the need for a holistic evaluation standard that extends beyond isolated task performance. The motivation for LVLM-eHub is threefold:
- To cover the full spectrum of multimodal capabilities, including zero-shot evaluation, generalization, and open-ended scenario handling.
- To deliver robust, reproducible benchmarks that span both established and frontier datasets (47 vision-language benchmarks across six categories).
- To integrate end-user, real-world assessment through a dynamic “Arena” platform, enabling human-centric evaluation that moves beyond static benchmark scores.
LVLM-eHub explicitly quantifies six core multimodal capabilities, investigates real generalization versus in-domain overfitting, and exposes systematic error modes such as object hallucination, proposing evidence-based pipeline refinements (Xu et al., 2023).
2. Evaluated Model Suite
LVLM-eHub benchmarks eight representative LVLM architectures, all under 10B parameters, capturing a wide range of adapter strategies and instruction tuning regimes:
| Model Name | Vision Backbone / Adapter | Language Module | Key Training Data |
|---|---|---|---|
| BLIP-2 | EVA-CLIP ViT-g/14 + Q-Former (107M) | FlanT5-XL (frozen) | 129M image-text pairs; no instruction data |
| LLaVA | CLIP ViT-L/14 + 1 FC layer | Vicuna-7B | 595K image-text; 158K multimodal instructions |
| LLaMA-Adapter V2 | CLIP ViT-L/14 + B-tuning (63M) | LLaMA-7B (frozen) | 200M caption; 158K multimodal + 52K GPT-4 instructions |
| MiniGPT-4 | BLIP-2 vision path + FC | Vicuna-7B (frozen) | 5M image-text pairs; 3.5K self-instruct |
| mPLUG-Owl | ViT-L/14 + Perceiver, LoRA (388M) | LLaMA-7B (frozen) | 204M image-text; 158K multimodal instructions |
| Otter | ViT-L/14 + 1.3B adapters (OpenFlamingo) | LLaMA-7B (frozen) | 158K instructions |
| InstructBLIP | ViT-g/14 + Q-Former (107M) | Vicuna-7B (frozen) | 16M VQA instructions |
| VPGTrans | BLIP-2 visual prompting to Vicuna | Vicuna-7B (frozen) | 13.8M COCO/VG/SBU pairs; 3.5K self-instruct |
These models range from adapter-light to heavily instruction-tuned, supporting broad coverage of current LVLM design paradigms (Xu et al., 2023).
3. Quantitative Evaluation Framework
LVLM-eHub assesses performance across six categories of multimodal capability:
- Visual Perception: Classification (ImageNet1K, CIFAR-10, Oxford datasets), multi-class detection (COCO-MCI, VCR-MCI), counting (COCO-OC, VCR-OC).
- Visual Knowledge Acquisition: Optical character recognition (IIIT5K, IC13, Total-Text, etc.), key information extraction (SROIE, FUNSD), image captioning (NoCaps, Flickr30K, WHOOPS).
- Visual Reasoning: Visual question answering (DocVQA, OKVQA, TextVQA, etc.), grounded image description (ScienceQA-IMG, VizWiz), visual entailment (SNLI-VE).
- Visual Commonsense: ImageNetVC (color, shape, material, etc.), visual commonsense reasoning (VCR).
- Object Hallucination: POPE pipeline probes (MSCOCO: random, popular, adversarial splits).
- Embodied Intelligence: High-level planning in simulated environments (MineDojo, VirtualHome, Meta-World, Franka Kitchen).
Primary evaluation metrics include:
- Accuracy: For classification, visual question answering, and other discriminative tasks.
- CIDEr: For image captioning; measures consensus with reference captions.
- F1 score: For object hallucination detection in the POPE pipeline.
- MRR, entity-level F1, “Yes” rate: Task-specific auxiliary metrics (Xu et al., 2023).
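As a concrete illustration of the hallucination metrics, the sketch below computes precision, recall, F1, and the “Yes” rate from POPE-style binary probes (each probe asks whether a given object is present in the image). The input format and function name are illustrative assumptions, not the benchmark's actual code.

```python
# Sketch of POPE-style hallucination metrics. Inputs are parallel lists of
# ground-truth and predicted booleans ("is the object present?").

def pope_metrics(truths, preds):
    """Return precision, recall, F1, and the overall "Yes" rate."""
    tp = sum(t and p for t, p in zip(truths, preds))
    fp = sum((not t) and p for t, p in zip(truths, preds))
    fn = sum(t and (not p) for t, p in zip(truths, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    yes_rate = sum(preds) / len(preds)
    return precision, recall, f1, yes_rate

# A model that always answers "yes" attains perfect recall but a 100% "Yes"
# rate and low precision on the negative (absent-object) probes.
truths = [True, True, False, False]
preds = [True, True, True, True]
p, r, f1, yes = pope_metrics(truths, preds)
```

This is why the “Yes” rate is reported alongside F1: an inflated “Yes” rate flags the over-affirmative answering pattern that the findings below attribute to moderately instruction-tuned models.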
4. Online Arena and Human-in-the-Loop Assessment
A distinguishing feature of LVLM-eHub is its online Arena platform, which orchestrates direct, open-world competitions between LVLMs. The platform workflow is as follows:
- Two anonymized models face off on a user-submitted image and open-ended text query.
- Each model provides a single-turn response.
- The user selects a winner (A, B, tie, or both bad), directly updating the models’ Elo ratings.
This system yields a crowdsourced, dynamic leaderboard reflecting human judgment on diverse, unconstrained inputs—exposing brittleness, overfitting, and emergent strengths that may not surface in traditional, closed benchmarks (Xu et al., 2023).
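The rating update behind the leaderboard can be sketched as a standard Elo step; the K-factor and starting ratings below are illustrative assumptions, not the platform's actual parameters.

```python
# Minimal sketch of an Elo update after a single Arena matchup.

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie."""
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Equally rated models, user picks A: A gains k/2 points, B loses them.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Because the update is zero-sum and anonymized pairings are drawn across the model pool, the ratings converge toward a ranking driven purely by user preference.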
5. Key Empirical Findings
- Overfitting in Massively Instruction-Tuned LVLMs: InstructBLIP, tuned on 16M VQA pairs, achieves state-of-the-art scores on in-domain benchmarks but underperforms in the Arena and embodied planning scenarios. This indicates overfitting and limited generalization outside curated, familiar tasks.
- Object Hallucination with Moderate Instruction Data: LVLMs with moderate or imbalanced instruction tuning (LLaVA, LLaMA-Adapter V2, MiniGPT-4, mPLUG-Owl) exhibit elevated “Yes” rates in the POPE pipeline. These models over-predict object presence, resulting in artificially high CIDEr scores and spurious answers, revealing a key evaluation weakness.
- Multi-Turn Reasoning as a Mitigation Strategy: A multi-stage evaluation employing an “IdealGPT”-style pipeline (question decomposition, subquestion answering, reasoning over consistency) significantly reduces hallucination rates. Concretely, a ChatGPT-based Questioner decomposes the query, the LVLM answers each part, and a ChatGPT Reasoner evaluates consistency before returning a verdict or requesting clarification. This multistep protocol systematically exposes and suppresses hallucinated content (Xu et al., 2023).
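The Questioner/LVLM/Reasoner loop described above can be sketched schematically as follows. The three model-calling functions (`ask_questioner`, `ask_lvlm`, `ask_reasoner`) are hypothetical placeholders for the actual ChatGPT and LVLM APIs, and the round budget is an illustrative assumption.

```python
# Schematic sketch of an IdealGPT-style multi-turn evaluation protocol.
# The callables passed in stand in for real model endpoints.

def decompose_and_answer(query, image, ask_questioner, ask_lvlm, ask_reasoner,
                         max_rounds=3):
    """Iterate until the Reasoner reaches a verdict or the budget runs out."""
    evidence = []
    for _ in range(max_rounds):
        # 1) Questioner breaks the query into atomic sub-questions,
        #    conditioned on the evidence gathered so far.
        sub_questions = ask_questioner(query, evidence)
        # 2) The LVLM answers each sub-question against the image.
        evidence += [(q, ask_lvlm(image, q)) for q in sub_questions]
        # 3) Reasoner checks the answers for consistency; it returns a
        #    verdict, or None to request another round of clarification.
        verdict = ask_reasoner(query, evidence)
        if verdict is not None:
            return verdict
    return None  # no confident verdict within the round budget
```

Forcing the LVLM to commit to small, checkable sub-answers is what lets the Reasoner catch contradictions that would otherwise surface as fluent hallucinated captions.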
6. Future Directions and Related Work
LVLM-eHub establishes a rigorous foundation for future LVLM evaluation and development. Planned directions include:
- Development of learned, model-based evaluation metrics (e.g., multimodal BERTScore) to supplement existing task scores.
- Extension of the Arena to multi-turn dialog and enrichment with real user scenarios.
- Broader coverage of embodied and real-world robotic tasks.
- Scaling benchmarks to include models exceeding 10B parameters and tracking progress on GPT-4–backed LVLMs (Xu et al., 2023).
Related work: TinyLVLM-eHub (Shao et al., 2023) offers a lightweight variant, compressing the 47-benchmark evaluation into a 2.1K image–text suite for fast, local assessment. It introduces ChatGPT Ensemble Evaluation (CEE), achieving 88–90% alignment with human judgments, and finds that Bard outperforms open-source LVLMs on several benchmarks but remains susceptible to object hallucination.
LVLM-eHub’s comprehensive methodology has guided refinements across benchmarking, model selection, and error diagnosis for the vision-language evaluation community.
7. Impact and Significance
LVLM-eHub serves as the standard reference for the multimodal field’s transition toward robust, task-wide, and open-ended evaluation. By quantifying six distinct axes of zero-shot vision–language capability and integrating user-centric assessment, it enables precise tracing of trade-offs such as utility versus overfitting and hallucination versus fluency. The benchmark has influenced best practices in evaluation pipeline design, data curation, and model selection in multimodal research (Xu et al., 2023).
LVLM-eHub resources are available at https://github.com/OpenGVLab/Multi-Modality-Arena.