Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception

Published 26 May 2024 in cs.CV (arXiv:2405.16493v2)

Abstract: Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns, sometimes as minimal as those depicted on point-light displays. While humans excel at these tasks without any prior training, current AI models struggle with poor generalization performance. To close this research gap, we propose the Motion Perceiver (MP). MP solely relies on patch-level optical flows from video clips as inputs. During training, it learns prototypical flow snapshots through a competitive binding mechanism and integrates invariant motion representations to predict action labels for the given video. During inference, we evaluate the generalization ability of all AI models and humans on 62,656 video stimuli spanning 24 BMP conditions using point-light displays in neuroscience. Remarkably, MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy on these conditions. Moreover, we benchmark all AI models in point-light displays of two standard video datasets in computer vision. MP also demonstrates superior performance in these cases. More interestingly, via psychophysics experiments, we found that MP recognizes biological movements in a way that aligns with human behaviors. Our data and code are available at https://github.com/ZhangLab-DeepNeuroCogLab/MotionPerceiver.
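The abstract's pipeline (patch-level optical flow in, competitive binding of frames to learned prototypical flow snapshots, temporal pooling into an invariant representation, linear read-out to action labels) can be illustrated with a minimal numpy sketch. This is a toy stand-in, not the paper's implementation: the dimensions, the function `motion_perceiver_sketch`, and the random parameters are all hypothetical, and the competition is a plain softmax over prototype similarities rather than the paper's trained binding mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's actual sizes differ).
T, P, D = 8, 16, 2      # frames, flow patches per frame, flow dims (dx, dy)
K, C = 4, 3             # prototype flow snapshots, action classes

# Learned parameters would come from training; random stand-ins here.
prototypes = rng.normal(size=(K, P * D))   # prototypical flow snapshots
W_cls = rng.normal(size=(K, C))            # linear read-out to action labels

def motion_perceiver_sketch(flow_clip):
    """flow_clip: (T, P, D) patch-level optical flow for one video clip."""
    feats = flow_clip.reshape(T, -1)                     # (T, P*D) per-frame flow
    # Competitive binding: each frame's flow competes for prototypes via a
    # softmax over similarities (a slot-attention-style competition).
    logits = feats @ prototypes.T                        # (T, K)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # Temporal pooling yields a representation invariant to when each
    # snapshot occurs within the clip.
    invariant_repr = attn.mean(axis=0)                   # (K,)
    scores = invariant_repr @ W_cls                      # (C,) action scores
    return int(scores.argmax())

clip = rng.normal(size=(T, P, D))
label = motion_perceiver_sketch(clip)
print(label)
```

Because classification rests only on flow statistics, the sketch would behave identically on full video and on point-light renderings that preserve the same motion, which is the property the paper's benchmarks probe.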
