
Valley: Video Assistant with Large Language model Enhanced abilitY

Published 12 Jun 2023 in cs.CV, cs.AI, and cs.CL | arXiv:2306.07207v3

Abstract: LLMs, with their remarkable conversational capabilities, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored. In this paper, we introduce Valley, a multi-modal foundation model designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, Valley-702k and Valley-instruct-73k, covering a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captioning, long-video description, action recognition, and causal inference. We then adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase trains only the projection module, giving the LLM the capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios. Our code and data are published anonymously at https://github.com/valley-vl/Valley.
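The abstract's pipeline (per-frame ViT features, a temporal modeling module, then a projection into the LLM's token space) can be sketched in miniature. This is an illustrative sketch, not the authors' code: it assumes mean pooling as one of the possible temporal modules, and all dimensions, function names, and the random weights are placeholders for illustration.

```python
# Illustrative sketch (not the Valley implementation): one simple
# temporal modeling choice -- mean-pooling frame-level features across
# time -- followed by a linear projection into an assumed LLM embedding
# space. Dimensions (D, H) and names here are hypothetical.
import random


def mean_pool_frames(frame_feats):
    """Average T per-frame feature vectors (each of length D) into one
    D-dim video-level feature."""
    T = len(frame_feats)
    D = len(frame_feats[0])
    return [sum(f[d] for f in frame_feats) / T for d in range(D)]


def project(video_feat, weight):
    """Linear projection: map a D-dim video feature to an H-dim token
    via an H x D weight matrix."""
    return [sum(w_row[d] * video_feat[d] for d in range(len(video_feat)))
            for w_row in weight]


# Toy example: 4 frames of 8-dim "ViT" features, projected to a 16-dim
# "LLM" embedding (real dimensions would be far larger, e.g. 1024 -> 4096).
random.seed(0)
T, D, H = 4, 8, 16
frames = [[random.random() for _ in range(D)] for _ in range(T)]
W = [[random.random() for _ in range(D)] for _ in range(H)]

pooled = mean_pool_frames(frames)   # D-dim video-level feature
token = project(pooled, W)          # H-dim pseudo-token for the LLM
print(len(pooled), len(token))      # 8 16
```

In the two-phase scheme the abstract describes, only `W` (the projection) would be trained in phase one, with the vision encoder and LLM frozen; phase two would update the projection and the LLM jointly on instruction data.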
