
Explicit Knowledge-based Reasoning for Visual Question Answering

Published 9 Nov 2015 in cs.CV and cs.CL | (1511.02570v2)

Abstract: We describe a method for visual question answering which is capable of reasoning about contents of an image on the basis of information extracted from a large-scale knowledge base. The method not only answers natural language questions using concepts not contained in the image, but can provide an explanation of the reasoning by which it developed its answer. The method is capable of answering far more complex questions than the predominant long short-term memory-based approach, and outperforms it significantly in the testing. We also provide a dataset and a protocol by which to evaluate such methods, thus addressing one of the key issues in general visual question answering.

Citations (246)

Summary

  • The paper introduces Ahab, a system that integrates DBpedia knowledge for enhanced visual question answering.
  • It employs natural language parsing and RDF graph queries to generate interpretable reasoning paths for answering image-based questions.
  • It establishes the KB-VQA dataset and demonstrates improved accuracy over traditional CNN-LSTM models in complex reasoning scenarios.

Overview

The paper presents a novel approach to Visual Question Answering (VQA) that integrates explicit knowledge-based reasoning, contrasting with conventional methods that primarily rely on convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) networks. This method enables the generation of answers to natural language questions regarding image content by leveraging an extensive knowledge base, specifically DBpedia, which provides structured world knowledge. Furthermore, the proposed system, named Ahab, provides the reasoning path leading to each answer, addressing the opacity of decision-making inherent in typical neural methods.
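To make the knowledge-base lookup concrete, the sketch below builds a SPARQL query linking a detected visual concept to its DBpedia entry. The concept-to-URI mapping and the query shape are illustrative assumptions, not Ahab's actual implementation:

```python
# Sketch: construct a SPARQL query that asks DBpedia for the categories
# (dct:subject links) of a visually detected concept. The URI scheme and
# predicate choice are assumptions for illustration only.

def dbpedia_category_query(concept: str) -> str:
    """Build a SPARQL query for the dct:subject categories of the
    DBpedia resource whose name matches `concept`."""
    resource = "http://dbpedia.org/resource/" + concept.replace(" ", "_")
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?category WHERE {\n"
        f"  <{resource}> dct:subject ?category .\n"
        "}"
    )

print(dbpedia_category_query("Border Collie"))
```

A query like this, sent to the public DBpedia SPARQL endpoint, would return encyclopedic categories (breed, species, and so on) that the system can then reason over alongside the visual detections.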

Key Contributions

  1. Integration of Knowledge Base: Ahab's distinctive feature lies in its integration of structured knowledge bases to facilitate more sophisticated reasoning. By mapping detected image concepts to equivalent knowledge in DBpedia, the system can process questions necessitating external common-sense or encyclopedic knowledge, significantly expanding the scope of addressable inquiries beyond the visually explicit.
  2. Question Processing and Reasoning: The system employs natural language parsing tools to decompose each question and identify the core concepts it must reason about. It then formulates queries over an RDF graph that combines image-derived facts with DBpedia knowledge, yielding a reasoning path that is interpretable to the user.
  3. KB-VQA Dataset and Evaluation Protocol: The paper introduces a new dataset, KB-VQA, curated to evaluate VQA systems adept in handling questions demanding high-level reasoning through external knowledge. This dataset is characterized by questions labeled across three knowledge levels: Visual, Common-sense, and KB-knowledge. The dataset aids in benchmarking methodologies in scenarios closer to real-world image-based question answering challenges.
  4. Methodological Performance: Ahab demonstrates significantly improved performance over LSTM-based models across varying levels of knowledge requirements, particularly excelling in scenarios that require tapping into large external knowledge bases. The accuracy on complex question types and the provision of logical reasoning trails highlight the system’s robustness and transparency.
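The reasoning-path idea from point 2 can be sketched with a toy in-memory triple store. The triples and the breadth-first search below are hypothetical stand-ins for Ahab's actual RDF graph and SPARQL machinery:

```python
from collections import deque

# Toy RDF-style graph: (subject, predicate, object) triples mixing
# "image-derived" facts with "knowledge-base" facts. Purely illustrative.
triples = [
    ("image:obj1", "rdf:type", "dbr:Dog"),          # detected in the image
    ("dbr:Dog", "rdfs:subClassOf", "dbr:Mammal"),   # from the knowledge base
    ("dbr:Mammal", "dbp:trait", "dbr:Fur"),         # from the knowledge base
]

def reasoning_path(start, goal):
    """Breadth-first search over the triples; returns the chain of
    (subject, predicate, object) hops linking start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for s, p, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                queue.append((o, path + [(s, p, o)]))
    return None

for s, p, o in reasoning_path("image:obj1", "dbr:Fur"):
    print(f"{s} --{p}--> {o}")
```

The printed chain of hops is the kind of human-readable justification the paper emphasizes: each step names the fact (visual or encyclopedic) that carried the inference forward.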

Implications and Future Directions

The approach outlines a potential shift in VQA methodologies toward more comprehensive systems that utilize both visual data and extensive knowledge repositories dynamically. This shift could herald advancements in AI systems’ capacity to process nuanced and context-rich scenarios, aligning more closely with human-like understanding and reasoning.

For future developments, extending knowledge base integration to incorporate multiple, diverse datasets could further enhance system capabilities. Additionally, refining the mechanisms for reasoning transparency may bolster user trust and system interpretability. The potential interlinking of knowledge bases across domains opens avenues for VQA applications in specialized sectors, such as technical troubleshooting, educational tools, and intelligent virtual assistants, fostering AI advancements with far-reaching impacts.

In conclusion, the paper sets a substantial precedent for the incorporation of structured knowledge reasoning in VQA systems, offering a robust framework to address broader and more complex image-based inquiries while maintaining transparency in AI decision-making processes.
