
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Published 2 Jul 2025 in cs.CV, cs.AI, and cs.LG | (2507.01955v1)

Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.

Summary

  • The paper introduces a prompt chaining framework to adapt GPT-4o’s text outputs for structured vision tasks such as object detection and segmentation.
  • It shows that while GPT-4o achieves competitive results on semantic tasks, its performance on geometric tasks remains significantly below that of vision specialists.
  • Findings suggest that optimized prompt design and reasoning augmentation can partially bridge the performance gap with dedicated vision models.

Evaluation of GPT-4o and Multimodal Foundation Models on Standard Computer Vision Tasks

This paper presents a comprehensive empirical evaluation of leading multimodal foundation models (MFMs), with a particular focus on GPT-4o, across a suite of standard computer vision tasks. The study addresses a critical gap in the literature: while MFMs have demonstrated strong performance on language-centric multimodal benchmarks, their capabilities on classical vision tasks—such as semantic segmentation, object detection, image classification, depth estimation, and surface normal prediction—remain underexplored. The authors introduce a standardized benchmarking framework that enables direct, quantitative comparison between MFMs and state-of-the-art vision specialist models, even when MFMs are only accessible via text-based APIs.

Methodology

A central challenge in benchmarking MFMs on vision tasks is the mismatch between the output modalities: most MFMs are designed to produce text, whereas standard vision tasks require structured outputs (e.g., segmentation masks, bounding boxes, or dense depth maps). The authors address this by developing a prompt chaining framework, which decomposes each vision task into a sequence of text-promptable sub-tasks. For example:

  • Object Detection: The image is recursively partitioned into grids, and the model is prompted to identify the presence of target objects in each cell, progressively narrowing down the location.
  • Semantic Segmentation: Images are over-segmented into superpixels, and the model is prompted to classify each superpixel, leveraging multi-scale context to improve accuracy.
  • Depth and Surface Normal Prediction: The model is queried for pairwise depth or normal relationships between superpixels, and global rankings are inferred via optimization.

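The recursive-grid idea for detection can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cell_contains` is a hypothetical stand-in for the MFM API call ("is the target object visible in this crop?"), stubbed here with a ground-truth box so the chain runs end to end.

```python
def cell_contains(cell, box):
    """Stub for the MFM prompt asking whether the target object is in a crop.
    Here it simply checks overlap with a hypothetical ground-truth box."""
    cx0, cy0, cx1, cy1 = cell
    bx0, by0, bx1, by1 = box
    return cx0 < bx1 and bx0 < cx1 and cy0 < by1 and by0 < cy1

def refine(cell, box, depth, grid=2):
    """Recursively split `cell` into a grid x grid layout and keep only the
    sub-cells where the model reports the object."""
    if depth == 0:
        return [cell]
    x0, y0, x1, y1 = cell
    w, h = (x1 - x0) / grid, (y1 - y0) / grid
    hits = []
    for i in range(grid):
        for j in range(grid):
            sub = (x0 + i * w, y0 + j * h, x0 + (i + 1) * w, y0 + (j + 1) * h)
            if cell_contains(sub, box):  # one API call per cell in practice
                hits.extend(refine(sub, box, depth - 1, grid))
    return hits

def localize(image_size, box, depth=3):
    """Union of all positively answered leaf cells gives the predicted box."""
    cells = refine((0, 0, *image_size), box, depth)
    xs0, ys0, xs1, ys1 = zip(*cells)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

pred = localize((640, 480), box=(100, 120, 220, 260), depth=4)
```

At depth 4 on a 640x480 image the leaf cells are 40x30 pixels, so the predicted box bounds the true one to within one cell; this also illustrates why the approach incurs many API calls per image.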
This approach enables the evaluation of MFMs on standard datasets (COCO, ImageNet, Hypersim, etc.) using established metrics, and allows for direct comparison with vision specialists under controlled algorithmic constraints.
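The pairwise-to-global step for depth can be sketched in the same spirit. The paper infers global rankings via optimization; the hypothetical sketch below substitutes a simple Borda-style win count, with `pairwise_farther` standing in for the MFM comparison prompt and stubbed with made-up ground-truth depths.

```python
from itertools import combinations

def pairwise_farther(i, j, true_depth):
    """Stub for the MFM prompt 'which of these two superpixels is farther?'.
    Uses hypothetical ground-truth depths so the example runs end to end."""
    return true_depth[i] > true_depth[j]

def rank_by_pairwise(n, true_depth):
    """Aggregate pairwise answers into a global ordering: each superpixel's
    score is the number of comparisons it 'wins' (is judged farther)."""
    wins = [0] * n
    for i, j in combinations(range(n), 2):  # one API call per pair in practice
        if pairwise_farther(i, j, true_depth):
            wins[i] += 1
        else:
            wins[j] += 1
    # sort superpixel ids from nearest to farthest
    return sorted(range(n), key=lambda k: wins[k])

depths = [2.0, 0.5, 3.1, 1.2]  # hypothetical per-superpixel depths
order = rank_by_pairwise(len(depths), depths)
```

Note the quadratic number of comparisons, which is one source of the API cost discussed below; the paper's optimization-based aggregation serves the same role as the win count here.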

Key Findings

The experimental results yield several important insights:

  1. MFMs as Generalists: While MFMs, particularly GPT-4o, do not match the performance of state-of-the-art vision specialists on any task, they achieve nontrivial results across all evaluated tasks. GPT-4o secures the top position among MFMs in 4 out of 6 tasks.
  2. Semantic vs. Geometric Tasks: MFMs consistently perform better on semantic tasks (classification, segmentation, grouping) than on geometric tasks (depth, surface normals). The performance gap is especially pronounced for 3D understanding.
  3. Prompt Chaining Efficacy: The prompt chaining framework significantly improves performance over direct prompting, especially for tasks requiring structured outputs. However, the approach incurs substantial computational and API costs due to the large number of required queries.
  4. Prompt Sensitivity: Model performance is sensitive to prompt design, but higher-performing models (e.g., GPT-4o) exhibit reduced sensitivity, indicating more robust internal representations.
  5. Reasoning Models: Recent reasoning-augmented MFMs (e.g., o1, o3, o4-mini) show improved performance on geometric tasks, suggesting that explicit reasoning capabilities can partially compensate for the lack of direct supervision on dense vision tasks.
  6. Image Generation Outputs: Preliminary analysis of MFMs with native image generation (e.g., GPT-4o) reveals that generated outputs often constitute semantic recreations rather than precise, spatially aligned edits, leading to hallucinations and misalignments.
  7. Data Contamination and Generalization: The authors address concerns about potential data contamination by evaluating models on "in-the-wild" images released after the models' knowledge cutoffs, finding that MFMs generalize reasonably well to novel data.

Numerical Results

The paper provides detailed quantitative comparisons. For example, on ImageNet classification, GPT-4o achieves 77.2% top-1 accuracy, trailing the specialist Model Soups ViT-G (90.94%). On COCO object detection, GPT-4o attains 60.62 AP50, compared to 91.3 for Co-DETR. For semantic segmentation on COCO, GPT-4o reaches 44.89 mIoU, while OneFormer achieves 65.52. On depth estimation (Hypersim), GPT-4o's δ1 is 0.461, compared to Omnidata's 0.768. These results consistently show a substantial but not insurmountable gap between MFMs and vision specialists.

Implications and Future Directions

Practical Implications:

  • Generalist Utility: MFMs like GPT-4o can serve as competent generalists for a wide range of vision tasks, especially in scenarios where deploying multiple specialist models is impractical.
  • Prompt Engineering: Careful prompt design and chaining are essential for extracting maximal performance from MFMs on non-textual tasks.
  • API Constraints: The reliance on API-based access and text-only outputs imposes significant computational and cost overheads, limiting the scalability of such benchmarking for large-scale or real-time applications.

Theoretical Implications:

  • Representation Learning: The superior performance of MFMs on semantic tasks suggests that their pretraining on image-text pairs yields strong semantic representations, but geometric and spatial reasoning remain underdeveloped.
  • Reasoning Augmentation: The observed improvements from reasoning-augmented models indicate that integrating explicit reasoning mechanisms or training on structured vision tasks could bridge the gap to specialists.

Future Developments:

  • Direct Supervision: Training MFMs directly on dense vision tasks, or equipping them with structured output heads, may substantially improve their geometric understanding.
  • Unified Benchmarks: The open-sourcing of the prompt chaining framework provides a standardized platform for future MFM evaluation, facilitating progress tracking and fair comparison.
  • Efficient Inference: Research into more efficient prompting strategies or hybrid architectures (combining MFMs with lightweight vision specialists) could mitigate the high inference costs.
  • Robustness and Generalization: Further work is needed to assess and improve the robustness of MFMs to distribution shifts, adversarial inputs, and real-world deployment scenarios.

Conclusion

This study establishes a rigorous, extensible benchmark for evaluating the vision capabilities of MFMs, revealing both their current limitations and their potential as generalist models. While MFMs like GPT-4o are not yet competitive with vision specialists on classical tasks, their respectable performance across diverse domains, especially with prompt chaining, underscores their promise as flexible, multi-purpose AI systems. The findings motivate future research on integrating dense vision supervision, enhancing geometric reasoning, and developing more efficient and robust multimodal architectures.
