
C3T: Cross-modal Transfer Through Time for Sensor-based Human Activity Recognition

Published 23 Jul 2024 in cs.CV, cs.AI, cs.HC, cs.LG, and eess.SP (arXiv:2407.16803v3)

Abstract: In order to unlock the potential of diverse sensors, we investigate a method to transfer knowledge between time-series modalities using a multimodal temporal representation space for Human Activity Recognition (HAR). Specifically, we explore the setting where the modality used in testing has no labeled data during training, which we refer to as Unsupervised Modality Adaptation (UMA). We categorize existing UMA approaches as Student-Teacher or Contrastive Alignment methods. These methods typically compress continuous-time data samples into single latent vectors during alignment, inhibiting their ability to transfer temporal information through real-world temporal distortions. To address this, we introduce Cross-modal Transfer Through Time (C3T), which preserves temporal information during alignment to handle dynamic sensor data better. C3T achieves this by aligning a set of temporal latent vectors across sensing modalities. Our extensive experiments on various camera+IMU datasets demonstrate that C3T outperforms existing methods in UMA by at least 8% in accuracy and shows superior robustness to temporal distortions such as time-shift, misalignment, and dilation. Our findings suggest that C3T has significant potential for developing generalizable models for time-series sensor data, opening new avenues for various multimodal applications.
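The core contrast in the abstract — pooling a clip into a single latent vector versus aligning one latent vector per time step — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the linear encoders, latent dimensions, and cosine-similarity alignment score are all assumptions made for demonstration.

```python
import numpy as np

def encode(x, W):
    """Toy per-time-step linear encoder: (T, d_in) -> (T, d_latent)."""
    return x @ W

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def single_vector_alignment(z_a, z_b):
    """Baseline style (Student-Teacher / Contrastive Alignment as the
    abstract characterizes them): pool time away first, then compare one
    latent vector per modality."""
    return cosine(z_a.mean(axis=0), z_b.mean(axis=0))

def temporal_alignment(z_a, z_b):
    """C3T-style (sketch): keep one latent vector per time step and align
    the two modalities step by step, preserving temporal structure."""
    return float(np.mean([cosine(za, zb) for za, zb in zip(z_a, z_b)]))

# Synthetic "camera" and "IMU" feature streams for the same activity clip
# (dimensions chosen arbitrarily for the demo).
rng = np.random.default_rng(0)
T, d_cam, d_imu, d_lat = 8, 16, 6, 4
shared = rng.normal(size=(T, d_lat))                      # common temporal signal
cam = np.hstack([shared, rng.normal(size=(T, d_cam - d_lat))])
imu = np.hstack([shared, rng.normal(size=(T, d_imu - d_lat))])

W_cam = rng.normal(size=(d_cam, d_lat))
W_imu = rng.normal(size=(d_imu, d_lat))
z_cam, z_imu = encode(cam, W_cam), encode(imu, W_imu)

print("pooled alignment:  ", single_vector_alignment(z_cam, z_imu))
print("temporal alignment:", temporal_alignment(z_cam, z_imu))
```

The pooled score collapses the clip before comparing modalities, so any time-shift or dilation in one stream is invisible to it; the per-time-step score is computed in the shared space where such distortions change the alignment directly, which is the property C3T exploits.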

