Papers
Topics
Authors
Recent
Search
2000 character limit reached

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs

Published 3 Aug 2025 in cs.SD and eess.AS | (2508.01659v1)

Abstract: Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of multimodal LLMs (MLLMs). To further strengthen this alignment, recent works have proposed Audio Difference Captioning (ADC), which takes multiple audio inputs and encourages the model to describe their differences, thereby promoting fine-grained audio discrimination. However, despite its effectiveness in enabling difference-telling and detailed discrimination, ADC introduces a notable semantic gap between the input audios-often rich in diverse sound events-and the relatively brief, difference-focused output captions. This deviation from AC-style descriptions leads to a mismatch with the pretraining objective, resulting in catastrophic forgetting during finetuning. To mitigate this issue, we propose Audio Commonality Captioning (ACC), a comparably challenging but gentler alternative that encourages the model to capture the shared semantics across audio clips rather than emphasizing their detailed differences. Experimental results demonstrate that ACC not only effectively enhances audio-text understanding on primary captioning benchmarks but also better preserves general capabilities across diverse speech and music-related downstream tasks, such as vocal sound classification (VSC), speech emotion recognition (SER), musical instrument classification (MIC), and music genre classification (MGC), compared to ADC. These findings validate that ACC contributes to more robust cross-modal understanding and achieves a better balance between generalization and task-specific performance in the context of MLLMs.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (3)

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 1 like about this paper.