CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Published 13 Dec 2022 in cs.CL and cs.CV | (2212.07796v3)

Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that: across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs with nine different complexities plus $183K$ hard negative captions with atomic, swapping and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
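The abstract reports results as Recall@1 over retrieval sets that contain hard negative captions. A minimal sketch of that metric follows, assuming each image query is scored against one ground-truth caption plus its hard negatives. The function names and the similarity scores are hypothetical illustrations and are not taken from the CREPE codebase.

```python
from typing import Sequence, Tuple

Example = Tuple[float, Sequence[float]]  # (score of true caption, scores of hard negatives)


def recall_at_k(positive_score: float, negative_scores: Sequence[float], k: int = 1) -> bool:
    """Return True if the ground-truth caption ranks in the top k of its retrieval set."""
    # Rank is 1 plus the number of negatives scoring at least as high.
    # Ties are counted against the model (an assumption of this sketch).
    rank = 1 + sum(1 for score in negative_scores if score >= positive_score)
    return rank <= k


def mean_recall_at_k(examples: Sequence[Example], k: int = 1) -> float:
    """Average Recall@k over all queries."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = [recall_at_k(pos, negs, k) for pos, negs in examples]
    return sum(hits) / len(hits)


def chance_recall_at_k(num_negatives: int, k: int = 1) -> float:
    """Recall@k of a model that ranks the retrieval set uniformly at random."""
    return min(1.0, k / (num_negatives + 1))


# Made-up similarity scores: one true caption and three hard negatives per image.
examples = [
    (0.31, [0.12, 0.29, 0.05]),  # true caption ranked first: hit
    (0.22, [0.25, 0.10, 0.08]),  # a hard negative scores higher: miss
    (0.40, [0.40, 0.15, 0.02]),  # tie with a negative: counted as a miss
    (0.18, [0.02, 0.11, 0.09]),  # hit
]

print(mean_recall_at_k(examples))  # 0.5
print(chance_recall_at_k(3))       # 0.25
```

Comparing the measured value with `chance_recall_at_k` gives the "random chance" baseline the abstract refers to: with three hard negatives per query, a model that ranks at random scores 0.25.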


Authors (6)

Citations (100)


Summary

The paper introduces a novel benchmark to assess compositional reasoning in vision-language foundation models.
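It evaluates 7 architectures trained with 4 algorithms and finds that they struggle with compositionality: Recall@1 drops by up to 12% when novel compositions dominate the retrieval set, and retrieval success frequently nears random chance at high complexity.
Because these results hold regardless of model and training dataset size, the findings point to a need for improvements beyond scale to strengthen multimodal understanding.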

Overview of CVPR LaTeX Author Guidelines

The document provides guidelines for preparing and submitting manuscripts for the CVPR proceedings, with instructions and conventions for authors using the LaTeX document preparation system. This guidance keeps conference submissions consistent, readable, and in line with the venue's standards of academic integrity.

Key Aspects of the Guidelines

Manuscript Structure and Format: The document specifies the structural components of a compliant paper, including an abstract, numbered sections and equations, and structured references. It upholds a stringent limit on the primary content length—excluding references—to facilitate concise and focused submissions.
Dual Submission Policy: Authors are reminded that dual submission is restricted under CVPR's policies, so that each submitted paper is original work.
Paper ID and Anonymous Submission: Authors must ensure their submission is identifiable by a paper ID and that it adheres to the blind review process. This involves omitting self-identifying references while ensuring past work is cited appropriately without implying authorship.
Technical Formatting Details: This includes specifics on typefaces, font sizes, and margins that keep the look of the CVPR proceedings uniform. Particular details, such as the use of rulers in review copies and their exclusion from the final copy, are also outlined.
Mathematics and Equations: The instructions call for numbered equations, so that reviewers and later readers can refer to a specific equation easily.
Illustrations and Graphics Handling: The guidelines stress that figures must be clear and appropriately scaled so that they remain legible in print. They suggest sizing figures relative to the line width.
Use of Colors: Authors are encouraged to consider color vision deficiencies when designing figures to ensure the conveyance of information is not hindered by reliance on color contrast alone.
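The equation-numbering and figure-scaling points above can be illustrated with a short LaTeX fragment. This is a minimal sketch, not the CVPR template itself: it assumes the plain `article` class, a hypothetical file name `figure.pdf`, and the `demo` option of `graphicx` so that it compiles without the image file.

```latex
\documentclass{article}
% The demo option draws a placeholder box in place of the image file,
% so this sketch compiles without figure.pdf (a hypothetical file name).
\usepackage[demo]{graphicx}

\begin{document}

% A numbered display equation with a label, so the text can refer to it by number.
\begin{equation}
  E = m c^{2}
  \label{eq:example}
\end{equation}
Equation~\ref{eq:example} can now be cited by its number.

% Figure width given as a fraction of the line width rather than in absolute units.
\begin{figure}
  \centering
  \includegraphics[width=0.8\linewidth]{figure.pdf}
  \caption{A placeholder figure scaled relative to the line width.}
  \label{fig:example}
\end{figure}

\end{document}
```

Specifying the width as a fraction of `\linewidth` lets the figure adapt if the column width changes, which is why relative sizing is generally preferred over fixed measurements.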

Implications and Prospective Directions

Adhering to these guidelines serves multiple functions: it streamlines the review process by maintaining uniformity in presentation, and it potentially amplifies the quality and impact of the research disseminated. For researchers, these conventions are critical in aiding the clear presentation of technical content, an essential component of effective scholarly communication.

On a forward-looking note, as document preparation tools evolve, there may be opportunities for further automation in adhering to such guidelines, thereby optimizing the submission process and reducing the load on authors. This may also include intelligent tools for checking document compliance prior to submission.

Overall, although the guidelines are technical in nature, adherence to them reflects an unwritten contract among researchers to maintain the credibility, reproducibility, and readability of academic output at a prestigious venue such as CVPR.
