Papers
Topics
Authors
Recent
Search
2000 character limit reached

What Are They Doing? Joint Audio-Speech Co-Reasoning

Published 22 Sep 2024 in cs.SD, cs.CL, and eess.AS | (2409.14526v2)

Abstract: In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory LLMs (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 15 likes about this paper.