PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Abstract: Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, pre-training for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-LLMs. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-LLMs with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Further investigation shows that this is largely attributable to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy that smooths the feature distribution along the temporal dimension and thus reduces the dominant impact of extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks. Notably, on the popular VideoChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5, averaged across five evaluated dimensions, exceeding the previous SOTA result from GPT4V (IG-VLM) by 9%. On the recent multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://pllava.github.io/
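The pooling idea can be sketched in a few lines. The snippet below is a minimal illustration of the strategy described above, adaptive average pooling over per-frame visual features to smooth the temporal feature distribution before handing tokens to the LLM; it is not the authors' implementation, and the function name `pool_video_features`, the tensor shapes, and the target grid size are illustrative assumptions.

```python
# Minimal sketch (not PLLaVA's official code): adaptive average pooling
# over per-frame patch features along the temporal and spatial dimensions.
# Assumed input shape: (B, T, H, W, C), i.e. an (H, W) grid of C-dim patch
# embeddings from a frozen image encoder for each of T sampled frames.
import torch
import torch.nn.functional as F

def pool_video_features(frame_features: torch.Tensor,
                        target_shape=(16, 12, 12)) -> torch.Tensor:
    """Average-pool features to a (T', H', W') grid, flatten to LLM tokens."""
    x = frame_features.permute(0, 4, 1, 2, 3)     # (B, C, T, H, W)
    x = F.adaptive_avg_pool3d(x, target_shape)    # (B, C, T', H', W')
    return x.flatten(2).transpose(1, 2)           # (B, T'*H'*W', C)

# Example: 16 frames of 24x24 patches with 1024-dim features -> 2304 tokens.
tokens = pool_video_features(torch.randn(2, 16, 24, 24, 1024))
assert tokens.shape == (2, 16 * 12 * 12, 1024)
```

Because each output token averages a window of neighboring features, any single high-norm feature contributes only a fraction of a pooled token, which is one way to read the smoothing effect the abstract describes.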
References
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
- VideoLLM: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Zero-shot video question answering with procedural programs. arXiv preprint arXiv:2312.00937, 2023.
- The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
- Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
- CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914, 2023.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- VTimeLLM: Empower LLM to grasp video moments, 2023.
- LITA: Language instructed temporal-localization assistant. arXiv preprint arXiv:2403.19046, 2024.
- Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2024.
- The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- An image grid can be worth a video: Zero-shot video question answering using a VLM. arXiv preprint arXiv:2403.18406, 2024.
- Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 2, 1989.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
- MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.
- LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
- TGIF: A new dataset and benchmark on animated GIF description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2016.
- LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
- Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
- VILA: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
- Improved baselines with visual instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- One for all: Video conversation is feasible without video instruction tuning. arXiv preprint arXiv:2309.15785, 2023.
- ST-LLM: Large language models are effective temporal learners. arXiv preprint arXiv:2404.00308, 2024.
- Vista-LLaMA: Reliable video narrator via equal distance to visual tokens. arXiv preprint arXiv:2312.08870, 2023.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- MovieChat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
- AdaPool: Exponential adaptive pooling for information-retaining downsampling. IEEE Transactions on Image Processing, 32:251–266, 2022.
- ViperGPT: Visual inference via Python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- A large cross-modal video retrieval dataset with reading comprehension. arXiv preprint arXiv:2305.03347, 2023.
- NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9777–9786, 2021.
- Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1645–1653, 2017.
- Zero-shot video question answering via frozen bidirectional language models. Advances in Neural Information Processing Systems, 35:124–141, 2022.
- CAT: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios. arXiv preprint arXiv:2403.04640, 2024.
- CLEVRER: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2020.
- ActivityNet-QA: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9127–9134, 2019.
- A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Conference on Empirical Methods in Natural Language Processing, pages 543–553, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024.