
SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset

Published 30 Oct 2024 in cs.CV (arXiv:2410.22648v1)

Abstract: Visual Question Answering (VQA) has emerged as a promising research area for building AI systems that enable interactive and immersive learning. Numerous VQA datasets have been introduced to support tasks such as answering questions or identifying unanswerable ones. However, most of these datasets are built from real-world images, leaving the performance of existing models on cartoon images largely unexplored. Hence, in this paper we present "SimpsonsVQA", a novel VQA dataset derived from The Simpsons TV show and designed to promote inquiry-based learning. Beyond the traditional VQA task, the dataset supports identifying questions that are irrelevant to an image, as well as the reverse scenario in which a user supplies an answer to a question that the system must evaluate (e.g., as correct, incorrect, or ambiguous). It aims to serve a variety of visual applications, harnessing the visual content of "The Simpsons" to create engaging and informative interactive systems. SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments (https://simpsonsvqa.org). Our experiments show that current large vision-language models such as GPT-4o underperform in zero-shot settings across all three tasks, highlighting the dataset's value for improving model performance on cartoon images. We anticipate that SimpsonsVQA will inspire further research, innovation, and advancements in inquiry-based learning VQA.

