ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

Published 19 Mar 2024 in cs.CV (arXiv:2403.12534v1)

Abstract: Event cameras have recently been shown to benefit practical vision tasks such as action recognition, thanks to their high temporal resolution, power efficiency, and reduced privacy concerns. However, current research is hindered by 1) the difficulty of processing events because of their prolonged duration and dynamic actions with complex and ambiguous semantics, and 2) the redundant action depiction of event-frame representations with fixed stacks. We find that language naturally conveys abundant semantic information, making it remarkably effective at reducing semantic uncertainty. In light of this, we propose ExACT, a novel approach that, for the first time, tackles event-based action recognition from a cross-modal conceptualizing perspective. ExACT makes two technical contributions. First, we propose an adaptive fine-grained event (AFE) representation that adaptively filters out repeated events from stationary objects while preserving dynamic ones; this subtly enhances the performance of ExACT without extra computational cost. Second, we propose a conceptual reasoning-based uncertainty estimation module that simulates the recognition process to enrich the semantic representation. In particular, conceptual reasoning builds temporal relations based on the action semantics, and uncertainty estimation tackles the semantic uncertainty of actions based on a distributional representation. Experiments show that ExACT achieves superior recognition accuracy of 94.83% (+2.23%), 90.10% (+37.47%), and 67.24% on the PAF, HARDVS, and our SeAct datasets, respectively.
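The AFE idea in the abstract, filtering out repeated events from stationary regions while keeping dynamic ones, can be illustrated with a minimal sketch. This is not the paper's actual AFE algorithm: it assumes events are (x, y, t, p) tuples, and the window length and per-pixel cap are made-up parameters chosen only for illustration.

```python
import numpy as np

def adaptive_event_filter(events, h, w, win_us=10_000, max_per_pixel=3):
    """Illustrative sketch (not the paper's AFE): within each time window,
    keep at most `max_per_pixel` events per pixel, discarding the redundant
    repeats that near-stationary regions tend to produce.

    events: array of rows (x, y, t, p); h, w: sensor resolution.
    """
    events = events[np.argsort(events[:, 2])]        # sort by timestamp
    kept = []
    t0 = events[0, 2]                                # start of current window
    counts = np.zeros((h, w), dtype=np.int32)        # events seen per pixel
    for x, y, t, p in events:
        if t - t0 >= win_us:                         # open a new time window
            t0 = t
            counts[:] = 0
        if counts[int(y), int(x)] < max_per_pixel:   # pixel not saturated yet
            counts[int(y), int(x)] += 1
            kept.append((x, y, t, p))
    return np.asarray(kept)
```

A pixel on a moving edge fires at scattered times and survives the cap, while a flickering stationary pixel is truncated after its first few events per window, which is the intuition behind discarding "repeated events for stationary objects".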

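The "distributional representation" used for uncertainty estimation is commonly realized as a Gaussian embedding: the mean carries the action semantics and the variance expresses semantic ambiguity. The sketch below shows that general pattern under those assumptions; the projection matrices `w_mu` and `w_logvar` are hypothetical stand-ins, not the paper's module.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_embedding(feat, w_mu, w_logvar):
    """Sketch of a distributional (Gaussian) embedding head.

    mu = semantic mean, diag-variance = semantic uncertainty; a sampled
    embedding z uses the reparameterization trick, as in VAEs.
    """
    mu = feat @ w_mu
    log_var = feat @ w_logvar
    # Reparameterization: z ~ N(mu, sigma^2), differentiable through mu/log_var.
    z = mu + rng.standard_normal(mu.shape) * np.exp(0.5 * log_var)
    # Mean variance per sample: larger => more ambiguous action semantics.
    uncertainty = np.exp(log_var).mean(axis=-1)
    return mu, z, uncertainty
```

In a recognition pipeline, an ambiguous clip (e.g. "walking" vs. "wandering") would be pushed toward a wider Gaussian, so its uncertainty score can down-weight or flag the prediction.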