A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval
Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that the sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even when audio class labels are present, they are commonly not detailed enough for text-audio retrieval. To exploit the relevant audio information in video-text datasets, we introduce a methodology for generating audio-centric descriptions using large language models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions yields significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that, using the same prompts, LLMs can improve retrieval on EpicSounds compared to using the dataset's original audio class labels. Finally, we confirm that LLMs can be used to determine the difficulty of identifying the action associated with a sound.
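The core idea can be sketched as follows: a visual-centric caption is rewritten by an LLM into a description of what the clip would sound like, and the rewritten text is then used as the query for zero-shot text-audio retrieval. The snippet below is a minimal sketch of that pipeline, not the paper's exact setup; the prompt wording, the `gpt-4o-mini` model name, and the use of a pretrained LAION-CLAP model for the text-audio embedding space are illustrative assumptions.

```python
# Sketch: LLM-generated audio-centric captions for zero-shot text-audio retrieval.
# Assumptions (not from the paper): prompt wording, OpenAI chat model, LAION-CLAP checkpoint.
import numpy as np
from openai import OpenAI   # any chat-capable LLM client would work here
import laion_clap           # pip install laion-clap

client = OpenAI()

def audio_centric_caption(visual_caption: str) -> str:
    """Ask an LLM to describe only what the clip would sound like."""
    prompt = (
        "Rewrite the following video description so that it describes only "
        f"the sounds one would hear: '{visual_caption}'"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Zero-shot text-audio retrieval with a pretrained CLAP model.
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()  # default pretrained checkpoint

def rank_audio(caption: str, audio_paths: list[str]) -> list[int]:
    """Return indices of audio clips sorted by similarity to the caption."""
    text_emb = clap.get_text_embedding([caption])                      # (1, D)
    audio_emb = clap.get_audio_embedding_from_filelist(audio_paths)    # (N, D)
    # Normalize so the dot product is a cosine similarity.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    sims = (audio_emb @ text_emb.T).squeeze(-1)
    return list(np.argsort(-sims))

# Example usage: the audio-centric query (e.g. "rhythmic knocking of a knife
# against a wooden board") is intended to match the audio track better than
# the original visual-centric caption would.
query = audio_centric_caption("The person chops an onion on a wooden board")
ranking = rank_audio(query, ["clip_000.wav", "clip_001.wav", "clip_002.wav"])
```

The same prompting step can be applied to short class labels (as in EpicSounds) rather than full captions, which is how a label-only dataset could be enriched into retrieval-ready descriptions.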