Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
Abstract: Compositional spatio-temporal reasoning poses a significant challenge in video question answering (VideoQA). Existing approaches struggle to establish the effective symbolic reasoning structures that are crucial for answering compositional spatio-temporal questions. To address this challenge, we propose a neural-symbolic framework, Neural-Symbolic VideoQA (NS-VideoQA), designed for real-world VideoQA tasks. The uniqueness and superiority of NS-VideoQA are two-fold: 1) a Scene Parser Network (SPN) transforms static-dynamic video scenes into a Symbolic Representation (SR), structuring persons, objects, relations, and action chronologies; 2) a Symbolic Reasoning Machine (SRM) performs top-down question decomposition and bottom-up compositional reasoning. Specifically, a polymorphic program executor is constructed for internally consistent reasoning from the SR to the final answer. As a result, NS-VideoQA not only improves compositional spatio-temporal reasoning in real-world VideoQA tasks but also enables step-by-step error analysis by tracing intermediate results. Experimental evaluations on the AGQA Decomp benchmark demonstrate the effectiveness of the proposed framework. Empirical studies further confirm that NS-VideoQA exhibits internal consistency in answering compositional questions and significantly improves spatio-temporal and logical inference for VideoQA.
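To make the parse-then-execute pipeline concrete, the following is a minimal, hypothetical sketch of the idea: a frame-indexed symbolic representation (as a scene parser might emit) queried bottom-up by a few composable operators. The SR schema and the operator names (`filter_relation`, `exists`, `before`) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical SR: frame-indexed (frame, subject, relation, object) triples,
# standing in for the Scene Parser Network's output on a short clip.
sr = [
    (0, "person", "holding", "cup"),
    (1, "person", "holding", "cup"),
    (2, "person", "putting_down", "cup"),
    (3, "person", "touching", "table"),
]

def filter_relation(sr, relation):
    """Bottom-up step: keep triples whose relation matches."""
    return [t for t in sr if t[2] == relation]

def exists(triples):
    """Terminal step: does any triple survive the preceding filters?"""
    return len(triples) > 0

def before(sr, rel_a, rel_b):
    """Temporal step: does rel_a first occur in an earlier frame than rel_b?"""
    a = [t[0] for t in filter_relation(sr, rel_a)]
    b = [t[0] for t in filter_relation(sr, rel_b)]
    return bool(a) and bool(b) and min(a) < min(b)

# Top-down decomposition of "Did the person hold the cup before touching
# the table?" into a one-step program, then bottom-up execution over the SR.
program = [("before", "holding", "touching")]
answer = all(before(sr, a, b) for _, a, b in program)
```

Because every intermediate result (the filtered triple sets, the frame indices) is an explicit value, a wrong answer can be traced back to the exact operator that produced it, which is the step-by-step error analysis the abstract refers to.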