
A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives

Published 5 Mar 2024 in cs.CV and cs.LG | arXiv:2403.03037v1

Abstract: Human comprehension of a video stream is naturally broad: in a few instants, we are able to understand what is happening, the relevance and relationship of objects, and forecast what will follow in the near future, all at once. We believe that, to effectively transfer such a holistic perception to intelligent machines, an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks, so that they can be exploited synergistically when learning novel skills. To accomplish this, we seek a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead, to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack, a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights: a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks, outperforming current state-of-the-art methods.
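
The abstract sketches an architecture: a shared temporal backbone whose per-task "perspectives" are retained and re-queried when learning a novel task. The paper's code is not reproduced here, so the snippet below is only a minimal PyTorch sketch of that idea under stated assumptions; the class name EgoPackSketch, the GRU backbone, the prototype banks, and the cross-attention fusion are all illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn


class EgoPackSketch(nn.Module):
    """Minimal sketch of the 'backpack of skills' idea: a shared temporal
    backbone plus per-task prototype banks that a novel-task head queries
    via cross-attention. All names, dimensions, and the fusion scheme are
    illustrative assumptions, not the paper's implementation."""

    def __init__(self, feat_dim=256, num_prev_tasks=3, protos_per_task=64):
        super().__init__()
        # Shared temporal modelling of the clip features (a GRU is assumed here).
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # "Backpack": one bank of learned prototypes per previously learned task.
        self.backpack = nn.ParameterList(
            [nn.Parameter(torch.randn(protos_per_task, feat_dim))
             for _ in range(num_prev_tasks)]
        )
        # Cross-attention lets the novel task draw insights from each perspective.
        self.attend = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, feat_dim)  # placeholder novel-task head

    def forward(self, clip_feats):
        # clip_feats: (batch, time, feat_dim) pre-extracted video features.
        h, _ = self.temporal(clip_feats)
        fused = h
        for protos in self.backpack:
            # Broadcast each prototype bank across the batch and attend to it.
            kv = protos.unsqueeze(0).expand(h.size(0), -1, -1)
            msg, _ = self.attend(fused, kv, kv)
            fused = fused + msg  # residual fusion of that task perspective
        return self.head(fused.mean(dim=1))  # temporally pooled prediction


# Example: 2 clips of 16 timesteps with 256-dim features.
model = EgoPackSketch()
out = model(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 256])
```

In this sketch the frozen-size prototype banks stand in for the "backpack": compact, task-specific summaries that can be carried across downstream tasks and queried cheaply, which is one plausible reading of the "minimal overhead" the abstract refers to.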
