Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering

Published 5 Apr 2024 in cs.CV (arXiv:2404.04007v1)

Abstract: Compositional spatio-temporal reasoning poses a significant challenge in video question answering (VideoQA). Existing approaches struggle to establish effective symbolic reasoning structures, which are crucial for answering compositional spatio-temporal questions. To address this challenge, we propose a neural-symbolic framework, Neural-Symbolic VideoQA (NS-VideoQA), designed for real-world VideoQA tasks. The uniqueness and superiority of NS-VideoQA are two-fold: 1) it introduces a Scene Parser Network (SPN) that transforms static-dynamic video scenes into a Symbolic Representation (SR), structuring persons, objects, relations, and action chronologies; 2) a Symbolic Reasoning Machine (SRM) performs top-down question decomposition and bottom-up compositional reasoning. Specifically, a polymorphic program executor is constructed for internally consistent reasoning from the SR to the final answer. As a result, NS-VideoQA not only improves compositional spatio-temporal reasoning in real-world VideoQA tasks, but also enables step-by-step error analysis by tracing intermediate results. Experimental evaluations on the AGQA Decomp benchmark demonstrate the effectiveness of the proposed framework. Empirical studies further confirm that NS-VideoQA exhibits internal consistency when answering compositional questions and significantly improves spatio-temporal and logical inference for VideoQA tasks.
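The abstract's pipeline (parse the video into a symbolic representation, decompose the question into a program, execute it bottom-up while tracing intermediates) can be illustrated with a minimal sketch. This is not the paper's actual SPN/SRM implementation; the SR schema, primitive names (`filter_relation`, `actions_during`, `exists`), and the toy program are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical symbolic representation (SR): grounded relations per frame
# and an action chronology. The real paper's schema may differ.
@dataclass
class SymbolicRepresentation:
    relations: list  # (frame, subject, predicate, object) tuples
    actions: list    # (action, start_frame, end_frame) tuples

# A tiny "polymorphic" executor: each primitive consumes the previous
# intermediate result, so a compositional question becomes a program whose
# steps can be traced for step-by-step error analysis.
class ProgramExecutor:
    def __init__(self, sr):
        self.sr = sr
        self.trace = []  # (op, args, intermediate result) for each step

    def run(self, program):
        result = None
        for op, *args in program:
            result = getattr(self, op)(result, *args)
            self.trace.append((op, args, result))
        return result

    # Primitive: frames in which (subject, predicate, object) holds.
    def filter_relation(self, _, subj, pred, obj):
        return {f for f, s, p, o in self.sr.relations
                if s == subj and p == pred and o == obj}

    # Primitive: actions whose temporal span overlaps any given frame.
    def actions_during(self, frames):
        return [a for a, s, e in self.sr.actions
                if any(s <= f <= e for f in frames)]

    # Primitive: does the intermediate result contain the queried item?
    def exists(self, items, target):
        return target in items

sr = SymbolicRepresentation(
    relations=[(3, "person", "holding", "cup"), (7, "person", "holding", "cup")],
    actions=[("drinking", 2, 5), ("walking", 8, 9)],
)
executor = ProgramExecutor(sr)
# Toy compositional question: "Was the person drinking while holding the cup?"
program = [
    ("filter_relation", "person", "holding", "cup"),
    ("actions_during",),
    ("exists", "drinking"),
]
answer = executor.run(program)  # True: "drinking" overlaps frame 3
```

Because every primitive's output is recorded in `executor.trace`, a wrong final answer can be localized to the first step whose intermediate result diverges from the expected one, which mirrors the internal-consistency analysis the abstract describes.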

