Evaluation setting of proprietary multimodal LLMs on common benchmarks
Determine whether proprietary multimodal large language models such as GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Grok-2 were evaluated on ChartQA, DocVQA, VQAv2, TextVQA, and AI2D in a zero-shot or a fine-tuning setting. Identify their training datasets and documented evaluation protocols to provide conclusive evidence of the evaluation setting.
Note that it is unknown whether the proprietary multimodal LLMs are evaluated on these benchmarks in a zero-shot or a fine-tuning setting, since no information about their training datasets is publicly available. We hypothesize a fine-tuning setting, based on accuracy gaps observed between the training and test splits for some proprietary models; however, this evidence is not conclusive.
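The train-test accuracy gap mentioned above can be turned into a simple heuristic. The sketch below is illustrative only: the accuracy values and the threshold are hypothetical placeholders, not reported results for any model, and a large gap is merely suggestive of fine-tuning or train-split contamination, not proof.

```python
# Hypothetical sketch: flag a possible fine-tuning/contamination signal by
# comparing a model's accuracy on a benchmark's train split vs. its test split.
# All numbers and the threshold are illustrative placeholders.

def train_test_gap(train_acc: float, test_acc: float) -> float:
    """Return the train-minus-test accuracy gap in percentage points."""
    return train_acc - test_acc

def looks_fine_tuned(train_acc: float, test_acc: float,
                     threshold: float = 5.0) -> bool:
    """Heuristic: a large positive gap suggests the train split was seen
    during training; a near-zero gap is consistent with zero-shot use."""
    return train_test_gap(train_acc, test_acc) > threshold

# Illustrative placeholder accuracies (percent) on a ChartQA-style benchmark.
print(looks_fine_tuned(train_acc=92.0, test_acc=80.0))  # gap = 12.0 -> True
print(looks_fine_tuned(train_acc=81.0, test_acc=80.0))  # gap = 1.0  -> False
```

In practice such a gap can only be measured when per-split results are reported, which is rarely the case for proprietary models; hence the conclusion above remains tentative.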