Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Abstract: Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must extract structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and generalize that understanding to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that yields significant performance improvements without model fine-tuning. The Spacewalk-18 benchmark is released at https://brown-palm.github.io/Spacewalk-18/.
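To make the adaptation-via-summarization idea concrete, below is a minimal sketch of one plausible instantiation: a video window's noisy multimodal evidence (frame captions and speech transcript) is condensed by a language model into a short summary, which is then matched against candidate step descriptions by text-embedding similarity, with no fine-tuning of any model. All names here (summarize_window, recognize_step, the llm and embed_text interfaces) are hypothetical placeholders for illustration, not the paper's actual pipeline or API.

```python
# Hedged sketch of summarization-based adaptation for step recognition.
# Assumptions: `llm.generate(prompt)` returns a string, and `embed_text(text)`
# returns a unit-normalized NumPy vector; both are hypothetical interfaces.
from typing import Callable, List
import numpy as np

def summarize_window(captions: List[str], transcript: str, llm) -> str:
    """Compress per-frame captions and speech into one short summary."""
    prompt = (
        "Summarize what the astronauts are doing in this video segment.\n"
        f"Frame captions: {captions}\n"
        f"Speech transcript: {transcript}\n"
        "Summary:"
    )
    return llm.generate(prompt)

def recognize_step(summary: str,
                   step_descriptions: List[str],
                   embed_text: Callable[[str], np.ndarray]) -> int:
    """Return the index of the step description closest to the summary."""
    s = embed_text(summary)                                        # (d,)
    steps = np.stack([embed_text(d) for d in step_descriptions])   # (K, d)
    return int(np.argmax(steps @ s))                               # cosine sim
```

In this sketch the summarization step acts as the domain adapter: the pretrained captioner, LLM, and text encoder stay frozen, and only the intermediate textual summary bridges the gap between the novel spacewalk domain and the step descriptions.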