Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Published 30 Nov 2023 in cs.CV (arXiv:2311.18773v3)

Abstract: Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize these understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvement without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
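
The abstract highlights an "adaptation via summarization" technique: summarize a video segment into text, then match that summary against candidate step labels instead of fine-tuning the video model on the new domain. The sketch below is a toy illustration of that idea under loose assumptions: a fixed caption string stands in for a real video captioner, and a bag-of-words cosine similarity stands in for a language model's matching step. None of the function names or data here come from the paper.

```python
# Toy sketch of step recognition via summarization (illustrative only, not the
# authors' implementation): summarize a video segment into text, then pick the
# candidate procedure step whose description best matches the summary.
from collections import Counter
import math

def bow_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def recognize_step(segment_summary: str, step_descriptions: list[str]) -> int:
    """Return the index of the step description that best matches the summary."""
    scores = [bow_similarity(segment_summary, s) for s in step_descriptions]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical usage: in the real setting, the summary would be produced by a
# multimodal captioner over video frames and the speech transcript.
steps = [
    "egress the airlock and configure safety tethers",
    "remove and stow the failed pump module",
    "install the replacement battery unit on the truss",
]
summary = "the astronaut installs a new battery on the station truss"
print(recognize_step(summary, steps))  # -> 2
```

In the benchmark setting, the candidate steps would come from the labeled procedure of a spacewalk task, and the matching would also condition on long temporal context rather than a single caption; the point of the sketch is only the shape of the pipeline, which requires no fine-tuning of the underlying models.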
