Evaluation setting of proprietary multimodal LLMs on common benchmarks

Determine whether proprietary multimodal large language models such as GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Grok-2 were evaluated on ChartQA, DocVQA, VQAv2, TextVQA, and AI2D in a zero-shot or a fine-tuning setting. Identify their training datasets and documented evaluation protocols, and assess whether the available evidence settles the question conclusively.

Background

In the discussion of supervised fine-tuning (SFT) data, the authors include the training splits of several widely used benchmarks (ChartQA, DocVQA, VQAv2, TextVQA, and AI2D) in their SFT blend, while reserving the corresponding test splits for evaluation. Their own models are therefore evaluated on these benchmarks in a fine-tuning setting, as sketched below.
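To make the split convention concrete, here is a minimal Python sketch of the setup described above. It assumes a generic loader `load_split(name, split)` supplied by the caller; this is a hypothetical helper for illustration, not part of the NVLM codebase.

```python
# A minimal sketch of the split convention: training splits feed the SFT
# blend, test splits are held out for evaluation. `load_split` is a
# hypothetical loader passed in by the caller.

BENCHMARKS = ["ChartQA", "DocVQA", "VQAv2", "TextVQA", "AI2D"]

def build_sft_blend(load_split):
    """Collect the *training* splits of each benchmark for the SFT blend."""
    return {name: load_split(name, "train") for name in BENCHMARKS}

def build_eval_suite(load_split):
    """Hold out the *test* splits strictly for evaluation."""
    return {name: load_split(name, "test") for name in BENCHMARKS}

# Usage with a stub loader, for illustration only:
blend = build_sft_blend(lambda name, split: f"{name}/{split}")
assert blend["ChartQA"] == "ChartQA/train"
```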

They explicitly state that it is unknown whether leading proprietary multimodal LLMs are evaluated zero-shot or after fine-tuning on the same benchmarks, because neither their training data nor their evaluation details are disclosed. They hypothesize a fine-tuning setting, citing observed accuracy gaps between training and test sets for some proprietary models, but emphasize that this inference is not conclusive.
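The gap heuristic can be stated as a short sketch. The 5-point threshold and the accuracy values below are illustrative assumptions, not numbers from the paper, and, as the authors stress, a large gap is suggestive rather than conclusive.

```python
# A hedged sketch of the gap heuristic: a model that scores markedly higher
# on a benchmark's training split than on its test split is weak evidence of
# fine-tuning on (or contamination by) that training split. The threshold
# and accuracies here are illustrative assumptions.

def train_test_gap(acc_train: float, acc_test: float) -> float:
    """Accuracy gap in percentage points (train minus test)."""
    return acc_train - acc_test

def flags_possible_finetuning(acc_train: float, acc_test: float,
                              threshold: float = 5.0) -> bool:
    """Flag a suspicious gap; as the authors note, this is not conclusive."""
    return train_test_gap(acc_train, acc_test) > threshold

print(flags_possible_finetuning(acc_train=92.3, acc_test=85.1))  # True
print(flags_possible_finetuning(acc_train=86.0, acc_test=84.5))  # False
```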

References

Note that it is unknown whether the proprietary multimodal LLMs are being evaluated on these benchmarks in a zero-shot or fine-tuning setting, as no information is provided regarding their training datasets. We hypothesize that it is a fine-tuning setting, based on observed accuracy gaps between the training and test sets for some proprietary models; however, this is not conclusive.

NVLM: Open Frontier-Class Multimodal LLMs (Dai et al., 2024, arXiv:2409.11402), Section 5.2 (Multimodal SFT Data).