- The paper introduces AudioChat, a multi-modal model that unifies audio storytelling, editing, and understanding, using an Audio Transfusion Forcing objective to achieve state-of-the-art performance.
- It pairs a Self-Cascaded Transformer with AudioCopilot, an LLM-based agent that synthesizes 6 million multi-turn dialogues to boost fine-grained, mixed-modal reasoning.
- Novel metrics such as editFLAM, together with human evaluations, demonstrate superior contextual grounding and precise audio generation and editing.
Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing: An Expert Analysis of "AudioChat" (2602.17097)
Introduction
Multi-modal foundation models have made significant progress on tasks involving single-source speech and non-speech processing. However, modeling, reasoning about, generating, and editing complex polyphonic scenes that integrate multi-source speech, sound effects, and ambient cues (a domain termed audio stories) remains a fundamental open challenge. The "AudioChat" framework confronts this challenge with a unified architecture for fine-grained audio story understanding, reasoning, editing, and generation, leveraging innovations at the data, modeling, and evaluation levels.
Problem Setting and Motivations
Audio story processing tasks, wherein a system must understand, generate, and edit temporally and semantically entangled mixtures of speech and non-speech (e.g., dialogue with background ambience and sound effects), encode far richer constraints than canonical TTS, T2A, or audio captioning. The computational complexity arises both from the need for structured reasoning (planning and decomposing the auditory scene) and from fine-grained audio manipulation (precise edits without over-modification). Prior approaches have relied on cascaded agent-based systems or discrete tokenization pipelines, and lack end-to-end fine-grained editability and interpretability.
Data Synthesis via AudioCopilot
Manual creation of large-scale, fine-grained annotated datasets for audio stories is infeasible. To overcome this, AudioChat introduces AudioCopilot, a tool-calling agent built on a large language model (Gemma 3 27B) that simulates user-system conversations and synthesizes audio stories from scratch. Interfacing with zero-shot TTS and T2A models, it constructs 6 million multi-turn, polyphonic audio stories with paired structured reasoning traces, decomposing high-level instructions into structured descriptions of the soundscape. AudioCopilot's data includes both narration and fine-grained sound parameters (panning, loudness, onset, duration) and supports edit simulation (addition, removal, and transformation of sound elements), yielding a dataset that enables direct supervision of structured audio story reasoning and manipulation.
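As a rough illustration, the kind of structured soundscape description AudioCopilot produces might be modeled as follows. The class and field names here are hypothetical, not the paper's actual schema; only the parameter list (panning, loudness, onset, duration) and the turn structure come from the source.

```python
from dataclasses import dataclass, field

@dataclass
class SoundEvent:
    # One sound source in the scene, with the fine-grained parameters
    # the paper lists (panning, loudness, onset, duration).
    caption: str          # e.g. "distant thunder"
    onset_s: float        # start time within the story, seconds
    duration_s: float
    loudness_db: float    # relative loudness
    pan: float            # stereo position, -1.0 (left) .. +1.0 (right)

@dataclass
class AudioStoryTurn:
    # One user-system exchange in a simulated multi-turn dialogue.
    instruction: str                           # user request for this turn
    reasoning: str                             # structured reasoning trace
    events: list = field(default_factory=list)

turn = AudioStoryTurn(
    instruction="Add distant thunder under the dialogue.",
    reasoning="The scene already has rain; thunder should be quiet and panned left.",
)
turn.events.append(SoundEvent("distant thunder", onset_s=2.5,
                              duration_s=3.0, loudness_db=-18.0, pan=-0.4))
```

An edit turn (addition, removal, transformation) would then be expressed as a further `AudioStoryTurn` whose events describe the delta against the previous scene.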
Model Architecture
Tokenization and Representation
AudioChat employs a continuous audio tokenizer operating on stereo 48 kHz waveforms, producing 40 Hz latents per channel. This design enables dense, information-preserving embeddings suitable for both understanding and high-fidelity generation, improving upon approaches that rely on discrete tokenizers or downsampled audio (cf. Ming-UniAudio, UALM, Bagpiper).
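The stated rates imply a fixed temporal compression. A quick sketch of the shape arithmetic, assuming exactly 48 kHz in and 40 latent frames per second out, per channel:

```python
# Shape arithmetic for the continuous tokenizer: stereo 48 kHz waveform
# in, 40 continuous latent frames per second per channel out.
SAMPLE_RATE = 48_000   # audio samples per second
LATENT_RATE = 40       # latent frames per second, per channel
CHANNELS = 2           # stereo

def latent_frames(duration_s: float) -> int:
    """Latent frames per channel for a clip of the given duration."""
    return int(duration_s * LATENT_RATE)

downsample_factor = SAMPLE_RATE // LATENT_RATE  # samples folded into one latent frame
frames_10s = latent_frames(10.0)                # per-channel sequence length
total_10s = frames_10s * CHANNELS               # frames across both channels
```

So each latent frame summarizes 1,200 waveform samples, and a 10-second stereo clip yields 400 frames per channel (800 in total), a sequence length short enough for transformer context windows while remaining continuous rather than quantized.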
Self-Cascaded Transformer
A novel Self-Cascaded Transformer (SCT) structure splits the Transformer layers between understanding and generation. Initial layers (initialized from a text-only LLM, Gemma 2) are optimized for language modeling and comprehension, while later layers perform diffusion-based generation. This separation is more parameter- and computation-efficient than designs such as Mixture-of-Transformers (MoT) and does not require parallel modal towers as in BAGEL, simplifying cross-modal interaction and improving editability.
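A minimal structural sketch of the split. The layer counts, interfaces, and the early-exit behaviour for understanding-only queries are illustrative assumptions, not details taken from the paper:

```python
# Structural sketch of the Self-Cascaded Transformer (SCT) split.
class SelfCascadedTransformer:
    def __init__(self, n_layers: int, split: int):
        # Early layers: initialized from a text-only LLM (Gemma 2 in the
        # paper), responsible for comprehension and reasoning.
        self.understanding_layers = list(range(split))
        # Late layers: perform diffusion-based generation in latent space.
        self.generation_layers = list(range(split, n_layers))

    def forward(self, seq, generate: bool) -> int:
        """Returns how many layers the request traverses (stand-in for
        actually running attention/MLP and diffusion blocks)."""
        layers_run = 0
        for _ in self.understanding_layers:
            layers_run += 1          # stand-in for an attention+MLP block
        if generate:                 # generation traverses the full stack
            for _ in self.generation_layers:
                layers_run += 1      # stand-in for a diffusion block
        return layers_run

model = SelfCascadedTransformer(n_layers=32, split=16)
```

Under this (assumed) routing, an understanding-only query could stop at the split point rather than paying for the generation stack, which is one plausible source of the efficiency advantage over parallel modal towers.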
Audio Transfusion Forcing
A central modeling innovation is the Audio Transfusion Forcing objective, which extends Transfusion (previously vision-only) to audio and multi-turn dialogic tasks. The model jointly minimizes a weighted sum of a causal language modeling loss (over structured chain-of-thought decompositions) and a multi-step diffusion loss (generation in latent space). Critically, diffusion forcing stochastically applies independent noise schedules to context and target audio latents in each editing turn, mitigating shortcut learning in which the model copies input latents rather than applying the specified edits. Masking is carefully controlled: causal masking for text ensures unidirectional reasoning, audio latents attend bidirectionally within an editing turn, and turns attend sequentially across the dialogue, enabling multi-turn interaction with accurate edit propagation.
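A hedged sketch of how such an objective could be assembled, drawing an independent noise level per editing turn. The function names, the uniform sampling of the noise level, and the simple linear interpolation toward Gaussian noise are assumptions for illustration, not the paper's exact formulation:

```python
import random

def transfusion_forcing_loss(text_tokens, audio_turns, lm_loss_fn,
                             diffusion_loss_fn, lambda_diff=1.0):
    """Weighted sum of a causal LM loss over text and a diffusion loss
    over audio latents, with an independent noise level per editing turn
    so clean context latents cannot simply be copied into the target."""
    loss = lm_loss_fn(text_tokens)            # causal next-token loss over text
    for latents in audio_turns:
        t = random.random()                   # independent noise level for this turn
        noisy = [(1 - t) * x + t * random.gauss(0.0, 1.0) for x in latents]
        loss += lambda_diff * diffusion_loss_fn(noisy, latents, t)
    return loss
```

Because each turn's latents are corrupted independently, the model cannot rely on pristine context audio being available verbatim, and must instead learn to reconstruct the target from the instruction and the (possibly noised) context.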
Evaluation Paradigm and Metrics
Recognizing that FAD, KAD, and CLAP are inadequate for scene-level or fine-grained semantic edits, the authors propose three task-driven metrics:
- multiFLAM: Uses OpenFLAM to evaluate the likelihood that each intended sound-source caption is realized in the output audio, supporting detection at frame-level resolution.
- AmultiFLAM: Measures edit consistency by quantifying how much non-target audio content is changed during editing.
- editFLAM: Directly quantifies the fidelity of edit operations (addition, removal, modification, parametric changes) by comparing frame-level sound caption presence before and after the edit.
These metrics provide direct, interpretable, and instruction-aligned evaluation, correlating closely with human judgment.
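In the same spirit, a toy version of such frame-level edit metrics can be sketched as follows. Here "presence" is a boolean per frame rather than an OpenFLAM likelihood score, and all names are hypothetical:

```python
def edit_fidelity(after, target_caption, expected_present):
    """Fraction of frames where the edited caption matches the intended
    post-edit state (present after an addition, absent after a removal)."""
    frames = after[target_caption]
    hits = sum(1 for present in frames if present == expected_present)
    return hits / len(frames)

def collateral_change(before, after, target_caption):
    """Fraction of non-target (caption, frame) cells that changed during
    the edit; lower means less extraneous modification."""
    changed = total = 0
    for caption in before:
        if caption == target_caption:
            continue
        for b, a in zip(before[caption], after[caption]):
            changed += (b != a)
            total += 1
    return changed / total if total else 0.0

# Toy example: removing "rain" from a four-frame clip, leaving speech alone.
before = {"rain": [True, True, True, True], "speech": [True, True, False, False]}
after = {"rain": [False, False, False, True], "speech": [True, True, False, False]}
fidelity = edit_fidelity(after, "rain", expected_present=False)   # one frame leaked
collateral = collateral_change(before, after, "rain")             # speech untouched
```

The real metrics replace the booleans with frame-level detection likelihoods, but the before/after comparison over target and non-target captions is the same basic shape.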
Experimental Results
Editing
AudioChat establishes SOTA performance for open-ended multi-source audio editing, outperforming diffusion transformers, classical cascades, and ablations lacking structured reasoning or context access. Results show low AmultiFLAM (minimal extraneous change), high editFLAM (faithful instruction execution), and high human evaluation ratings across all edit types (addition, removal, parametric adjustment).
Storytelling and Understanding
On the StoryGen-Eval benchmark, AudioChat substantially outperforms SOTA T2A and agent-based pipelines (e.g., Stable Audio Open, WavJourney) in output quality (KAD), semantic completeness (multiFLAM), and generation latency. Unlike traditional systems, which suffer from poor decomposition or an inability to integrate abstract scene instructions, AudioChat efficiently grounds compositional user queries into coherent, interpretable audio stories. Its understanding capabilities approach those of fine-tuned expert systems (Whisper-Story), with low tcpWER for speech and high multiFLAM for non-speech sources.
Architectural Ablations
Ablation studies confirm the superiority of the SCT over both vanilla dense and Mixture-of-Transformers architectures for multitask joint modeling. Editing and understanding performance degrades in the absence of structured chain-of-thought and of end-to-end training on simulated multi-turn dialogues.
Theoretical and Practical Implications
AudioChat advances the paradigm of unified, interpretable, fine-grained audio LLMs by demonstrating that:
- End-to-end training on simulated interactions with explicit reasoning traces enables superior editability, controllability, and task generalization.
- Joint modeling of understanding, reasoning, and generation within a single architecture supports compositional tasks that are intractable for prior discrete or agent-based frameworks.
- The proposed evaluation metrics provide a practical template for future multi-source audio model evaluation, offering interpretability and direct alignment with user instruction following.
Practically, these advances have immediate application in audio post-production, accessible content creation, generative entertainment, and assistive technologies requiring contextual audio synthesis and understanding. The method's extensibility, highlighted by the authors' stated plans to incorporate visual modalities, positions it toward genuinely general-purpose, omni-modal foundation models.
Limitations and Future Directions
The current instantiation of AudioChat is limited by its dependence on proprietary datasets and a training scope that is primarily English. Future work will require scaling both cross-lingual and cross-modal (e.g., vision-conditioned audio) generalization, and closing the fidelity gap observed in nuanced or highly dynamic edits. Addressing ethical risks, such as misuse for forgery or impersonation, remains critical, especially given the fine control and high realism enabled by the framework.
Conclusion
AudioChat presents a comprehensive, unified architecture for polyphonic audio story reasoning, generation, and editing, underpinned by innovations in LLM-simulated training data generation, multi-modal SCT architecture, and task-aligned evaluation metrics. By structurally integrating chain-of-thought reasoning with controllable generation and editing, it achieves state-of-the-art empirical performance and sets methodological baselines for subsequent foundation models in audio and broader multi-modal domains. The approach's decomposition and editability mechanisms are likely to inform both foundational research and real-world system deployment in rich auditory environments.