
IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

Published 25 Oct 2021 in cs.CV, cs.AI, cs.CL, and cs.LG | arXiv:2110.13214v4

Abstract: Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images. However, aside from natural images, abstract diagrams with semantic richness are still understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models to learn semantic representations for icon images, we further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.

Citations (141)

Summary

  • The paper introduces IconQA, a dataset comprising over 107,000 questions that advance abstract diagram understanding and visual language reasoning.
  • The methodology highlights a pyramid cross-modal transformer, Patch-TRM, which segments diagrams into coherent patches for more robust feature extraction.
  • The dataset and associated Icon645 support domain-specific pre-training, paving the way for improved educational tools and advanced multimodal AI research.

Analyzing IconQA: Benchmarking Abstract Diagram Understanding and Visual Language Reasoning

The paper under discussion introduces IconQA, a comprehensive benchmark dataset designed to advance research in visual question answering (VQA) with a focus on abstract diagrams. Traditional VQA tasks predominantly use natural images, which do not fully capture the complexity of understanding abstract, semantically rich diagrams. The authors address this gap with IconQA, a dataset of over 107,000 questions spanning three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Unlike preceding datasets, IconQA draws on real-world scenarios typical of educational math word problems, requiring not only perceptual comprehension but also a range of cognitive reasoning skills, including geometric, commonsense, and arithmetic reasoning.

Dataset Composition and Challenges

IconQA represents a significant expansion in the field of visual comprehension tasks due to its emphasis on abstract diagrams. The dataset encompasses a substantial variety of icons, categorized into 377 diverse classes, requiring systems to develop robust pattern-recognition capabilities that are less reliant on realism-driven biases. This facet is particularly important because it challenges existing models largely trained on natural image datasets, potentially reshaping traditional training paradigms.

The authors also introduce Icon645, an auxiliary dataset of approximately 645,000 colored icons aimed at supporting semantic representation learning for icon imagery. This step is crucial because existing foundational models such as ResNet are typically pre-trained on natural-scene imagery; such models can underperform on abstract diagram comprehension due to the lack of domain-specific pre-training.
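The pre-train-then-reuse pattern described above can be sketched as follows. This is an illustrative PyTorch example, not the paper's implementation: `IconEncoder` is a hypothetical stand-in for the ResNet backbone, trained here on a dummy batch rather than on Icon645's 377-way icon classification task.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a ResNet backbone (a tiny CNN for brevity).
# In the paper, a ResNet is pre-trained on Icon645's 377 icon classes.
class IconEncoder(nn.Module):
    def __init__(self, num_classes=377, feat_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

model = IconEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One pre-training step on a dummy batch of 32x32 "icons".
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 377, (8,))
loss = loss_fn(model(images), labels)
opt.zero_grad()
loss.backward()
opt.step()

# After pre-training, the classification head is dropped and `features`
# is reused as the patch embedder for the downstream VQA model.
embeddings = model.features(images)  # shape: (8, 64)
```

The key design choice this illustrates is that only the feature extractor survives pre-training; the 377-way classification head exists solely to force the backbone to learn icon-specific representations.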

Methodology and Baseline Models

To benchmark the IconQA dataset, the paper evaluates well-established VQA models and proposes a new model, Patch-TRM. This model uses a pyramid cross-modal transformer built on a hierarchical diagram parsing method that segments inputs into coherent patches. This approach enhances the interpretative capacity of the system by preserving the integrity of semantic objects within patches. The patches are then embedded by a ResNet pre-trained on the icon classification task, a stratified feature extraction pipeline intended to improve inference accuracy on abstract imagery.
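The hierarchical parsing step can be illustrated with a minimal sketch: cut the diagram into an n x n grid at each pyramid level and collect all patches for downstream embedding. The split sizes here are assumptions for illustration; the paper's actual pyramid layout may differ.

```python
import numpy as np

def pyramid_patches(image, levels=(1, 2, 4)):
    """Split an HxWxC diagram into a pyramid of patches.

    For each level n, the image is cut into an n x n grid, loosely
    following Patch-TRM's hierarchical parsing. The level sizes
    (1, 2, 4) are illustrative, not the paper's exact configuration.
    """
    h, w = image.shape[:2]
    patches = []
    for n in levels:
        ph, pw = h // n, w // n
        for i in range(n):
            for j in range(n):
                patches.append(image[i * ph:(i + 1) * ph,
                                     j * pw:(j + 1) * pw])
    return patches

diagram = np.zeros((64, 64, 3))
patches = pyramid_patches(diagram)
# 1 + 4 + 16 = 21 patches; each would then be embedded (e.g. by the
# icon-pretrained ResNet) and fed to the cross-modal transformer.
print(len(patches))  # 21
```

Keeping coarse whole-image patches alongside fine-grained ones lets the transformer attend to both global diagram layout and individual icons, which is the intuition behind the pyramid structure.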

The proposed Patch-TRM model outperformed several existing attention-based and transformer-based models, achieving better results across the different sub-tasks and reasoning skills. This suggests that domain-specific pre-training, coupled with a tailored model architecture, can yield substantial improvements in VQA tasks involving abstract diagrams.

Implications and Future Directions

From a practical perspective, IconQA sets the stage for developing more nuanced educational tools, such as intelligent tutoring systems capable of understanding and interacting through abstract diagrams. This is particularly relevant for STEM education, where diagrammatic interpretations are often required for understanding complex concepts. Theoretically, the introduction of datasets like IconQA fosters broader research into domain-specific comprehension, possibly accelerating advancements in multimodal learning, abstraction handling in AI, and transfer learning applications.

In conclusion, the IconQA dataset and its auxiliary Icon645 are timely contributions that underscore the necessity of evolving current AI paradigms beyond natural image-centric datasets to accommodate the richness of abstract visual reasoning. The insights gained here could extend far beyond educational purposes, igniting further explorations into the nature of visual abstractions and their cognitive implications in AI systems. As AI progresses towards more generalized intelligence, the ability to interpret abstract diagrams will likely become a critical component of future AI capabilities.
