
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Published 26 Jun 2025 in eess.AS, cs.CV, and cs.SD (arXiv:2506.21448v3)

Abstract: While end-to-end video-to-audio generation has greatly improved, producing high-fidelity audio that authentically captures the nuances of visual content remains challenging. As with professional sound design in the creative industries, this task requires sophisticated reasoning about visual dynamics, acoustic environments, and temporal relationships. We present ThinkSound, a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: foundational foley generation that creates semantically coherent soundscapes, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. At each stage, a multimodal LLM generates contextually aligned CoT reasoning that guides a unified audio foundation model. Furthermore, we introduce AudioCoT, a comprehensive dataset with structured reasoning annotations that establishes connections between visual content, textual descriptions, and sound synthesis. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels in the out-of-distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound-Project.github.io.

Summary

  • The paper introduces a novel three-stage chain-of-thought framework for video-to-audio synthesis that enhances semantic coherence and temporal precision.
  • The work presents AudioCoT, the first large-scale multimodal dataset with detailed CoT annotations to enable controllable and context-aware sound editing.
  • ThinkSound’s architecture combines a fine-tuned VideoLLaMA2 with adaptive fusion modules, outperforming previous methods in both objective metrics and interactive editing.

Chain-of-Thought Reasoning for Multimodal Audio Generation: An Expert Analysis of ThinkSound

Introduction

ThinkSound introduces a novel paradigm for video-to-audio (V2A) synthesis, integrating Chain-of-Thought (CoT) reasoning with multimodal LLMs (MLLMs) to address the fine-grained compositional and temporal complexities inherent in realistic sound generation. Unlike prior end-to-end V2A systems, which largely rely on black-box generative models with limited reasoning capacity or multi-stage pipelines with insufficient integration, ThinkSound operationalizes a three-stage interactive framework. This framework facilitates semantically coherent, temporally precise, and user-controllable audio generation and editing, targeting the central pain points of existing approaches: contextual fidelity and user intent alignment.

AudioCoT Dataset: Foundation for Structured Audio Reasoning

The AudioCoT dataset is a pivotal contribution: the first large-scale multimodal resource with detailed CoT reasoning annotations that interlink video, text, and audio modalities for controllable sound synthesis.

Figure 1: The AudioCoT dataset pipeline extracts multimodal pairs and generates structured CoT annotations as the supervision signal for model training.

The construction pipeline involves three automated annotation stages: foundational Foley CoT, object-centric CoT with ROI-based grounding, and instruction-driven editing CoT. The resulting dataset supports not only dense conditioning with explicit stepwise rationales but also robust data curation through multimodal filtering, region-of-interest tracking, and expert-in-the-loop validation. This resource is critical for endowing MLLMs with fine-grained temporal, semantic, and causal reasoning capabilities for downstream generative tasks.
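The paper does not publish a schema for AudioCoT records, but a hypothetical sketch of what one training example might carry helps make the three annotation stages concrete. All field names here are illustrative assumptions, not the dataset's actual format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioCoTRecord:
    """Hypothetical training example linking video, text, and audio with CoT supervision."""
    video_clip: str                 # path or ID of the source video segment
    caption: str                    # short textual description of the scene
    cot_steps: List[str]            # ordered reasoning steps (event chronology, acoustics)
    roi: List[float] = field(default_factory=list)  # optional [x, y, w, h] for object-centric CoT
    edit_instruction: str = ""      # optional natural-language edit for editing CoT

record = AudioCoTRecord(
    video_clip="clip_0001.mp4",
    caption="A dog barks twice as a door slams",
    cot_steps=[
        "Identify sound sources: dog, door",
        "Order events: bark, bark, door slam",
        "Assign acoustic traits: sharp transients, indoor reverb",
    ],
)
print(len(record.cot_steps))  # 3
```

A foundational-foley record would leave `roi` and `edit_instruction` empty, while the object-centric and editing stages would populate them respectively.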

ThinkSound Architecture

Multimodal CoT Reasoning Engine

The core reasoning engine in ThinkSound is a fine-tuned instantiation of VideoLLaMA2, adapted for the audio-visual domain via pretraining on AudioCoT. This model generates context-aware CoT chains that explicitly decompose acoustic scenes into temporally aligned sound events and actionable attributes. The resulting structured output is then used to condition both initial audio generation and subsequent editing steps.

Figure 2: The ThinkSound pipeline—VideoLLaMA2 generates CoT reasoning, which, through an enhanced MM-DiT multimodal transformer, conditions high-fidelity, temporally accurate audio synthesis driven by both user interaction and text instructions.

Critically, this approach couples global and local reasoning: CoT chains capture the macro-level event chronology while ROI-aware processing enables micro-level, object-centric sound synthesis. This duality enables compositionality and supports interactive refinement loops.

Unified Audio Foundation Model

The generative backbone of ThinkSound is a multimodal flow-matching audio transformer based on conditional flow matching and a Variational Autoencoder (VAE)-derived audio latent space. The architecture integrates video, text (captions, CoT), and audio context representations via modality-dedicated encoder pathways (MetaCLIP for captions, T5 for CoT, and per-modality feature streams), interconnected with shared attention and adaptive gate fusion modules.

Experiments demonstrate that dual-pathway encoding (CLIP for captions, T5 for CoT) and gated video-audio fusion outperform naive concatenation or single-path integration, particularly with respect to semantic alignment (CLAP) and temporal synchronization (DeSync).

Figure 3: ThinkSound’s multi-stream transformer blocks enable parallel and shared processing of audio, video, and text, supporting flexible input configurations and robust cross-modal reasoning.
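The adaptive gate fusion described above can be sketched minimally. The scalar sigmoid gate and weight shapes below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio_feat, video_feat, W, b):
    """Fuse per-timestep audio and video features with a learned scalar gate.

    gate = sigmoid([audio; video] @ W + b); the output is a convex mix of the
    two streams, so the model can lean on video only where it is informative.
    """
    concat = np.concatenate([audio_feat, video_feat], axis=-1)  # (T, 2D)
    gate = sigmoid(concat @ W + b)                              # (T, 1)
    return gate * audio_feat + (1.0 - gate) * video_feat        # (T, D)

rng = np.random.default_rng(0)
T, D = 8, 16
audio = rng.normal(size=(T, D))
video = rng.normal(size=(T, D))
W = rng.normal(size=(2 * D, 1)) * 0.1
fused = gated_fusion(audio, video, W, b=0.0)
print(fused.shape)  # (8, 16)
```

Because the gate lies in (0, 1), every fused value stays between the corresponding audio and video features, which is what makes this strictly gentler than naive concatenation followed by a projection.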

CoT-Guided Generation and Editing Pipeline

The generation interface decomposes V2A into three stages:

  1. CoT-Guided Foley Generation: Stepwise breakdown of the visual input yields explicit event chronologies, acoustic properties, and spatial/environmental context, directly informing the soundscape.
  2. Object-Centric Refinement: User interactions (e.g., clicks) select target regions, triggering ROI-based event extraction and localized sound design.
  3. Instruction-Based Editing: Natural language modifiers are mapped, via CoT, to specific edit actions (addition, inpainting, extension, removal), allowing high-level user control without sacrificing alignment.

The pipeline is context-preserving: at each stage, audio context is maintained and augmented, not replaced, and all editing is guided by semantically structured CoT control rather than opaque latent modification.
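The context-preserving staged interface can be caricatured in a few lines. The stage functions and the event-list representation of audio context below are hypothetical, chosen only to show that each stage augments rather than replaces prior context:

```python
# Hypothetical sketch of the three-stage, context-preserving interface.
# The audio context is modeled as a list of (kind, description) events.

def foley_stage(cot_plan):
    """Stage 1: generate the foundational soundscape from the CoT plan."""
    return [("ambient", cot_plan["scene"])]

def refine_stage(audio_ctx, roi_event):
    """Stage 2: augment (never replace) the context with an object-centric event."""
    return audio_ctx + [("object", roi_event)]

def edit_stage(audio_ctx, instruction):
    """Stage 3: apply a CoT-derived edit action; removals filter, additions append."""
    if instruction.startswith("remove "):
        target = instruction[len("remove "):]
        return [e for e in audio_ctx if e[1] != target]
    return audio_ctx + [("edit", instruction)]

ctx = foley_stage({"scene": "rainy street"})
ctx = refine_stage(ctx, "dog bark")
ctx = edit_stage(ctx, "add distant thunder")
print(ctx)
```

Note that every stage returns a new context derived from the previous one, mirroring the paper's claim that editing is structured CoT control rather than opaque latent modification.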

Experimental Results

On the VGGSound test set and the challenging MovieGen Audio Bench, ThinkSound yields state-of-the-art results across all key objective metrics (FD, KL divergence, DeSync) and subjective metrics (MOS-Q for quality, MOS-A for alignment). Most notably, ThinkSound achieves a $\text{CLAP}_{\text{CoT}}$ score of 0.46 (VGGSound) and 0.51 (MovieGen), reflecting superior semantic and temporal grounding. Ablation studies indicate that removal of CoT reasoning, degradation of structured encoding, or verbosity in CoT chains all produce substantial performance declines, confirming the necessity of focused, explicit reasoning for fidelity and alignment.

Figure 4: Case study—spectrograms from ThinkSound display precise temporal alignment and event detection, avoiding the spurious or delayed sounds present in baseline generative systems.
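CLAP-style alignment metrics are, in their general form, a cosine similarity between audio and text embeddings. A minimal sketch follows; the vectors here are toy stand-ins, not real CLAP features:

```python
import numpy as np

def clap_like_score(audio_emb, text_emb):
    """Cosine similarity between audio and text embeddings, the general
    form of CLAP-style semantic alignment scores (higher = better aligned)."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)

score = clap_like_score(np.array([1.0, 0.0]), np.array([0.6, 0.8]))
print(score)  # 0.6
```

In the paper's CoT variant, the text side of the comparison is the generated reasoning chain rather than a plain caption, which is why verbose or unfocused chains degrade the score.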

Object-centric and text-conditioned editing tasks further illustrate that CoT-driven generation not only enhances mainline synthesis but also enables fine-grained, high-level user manipulation, outperforming baselines such as MMAudio, AudioLDM-2, and DDPM-edit across MOS and objective metrics.

Theoretical and Practical Implications

The ThinkSound approach substantiates that explicit, stepwise multimodal reasoning—implemented via large-scale, structured CoT supervision and MLLM fine-tuning—is essential for overcoming the compositional bottlenecks of standard V2A synthesis. Architecturally, modality-dedicated processing with adaptive fusion and global conditioning enhances both controllability and generalization. Practically, this enables a new class of user-interactive, editable sound generation tools for creative industries, film, and gaming, with demonstrated out-of-distribution robustness.

The orthogonality between reasoning quality and generative capacity is experimentally confirmed: advances in either domain (richer CoT structure, increased backbone size) yield independently measurable improvement. In addition, the work lays a theoretical foundation for explicit reasoning control in future multimodal generative models—suggesting that incorporation of physical acoustic modeling, more elaborate event-entity relationships, and finer spatial-temporal manipulations are natural future directions.

Conclusion

ThinkSound marks a significant advancement in controllable, high-fidelity video-to-audio generation by integrating chain-of-thought reasoning with a unified foundation model architecture. Explicit decomposition of sound design into structured, semantically rich CoT steps, tightly coupled to multimodal inputs and augmented by interactive editing capabilities, sets a new standard for synthesis quality, contextual fidelity, and user control. With AudioCoT, ThinkSound also positions itself as a cornerstone for future progress in reasoning-driven, multimodal generative AI for audio applications.


Explain it Like I'm 14

Overview

This paper introduces ThinkSound, a smart system that adds realistic sound to videos. It does this by “thinking” step-by-step, like a human sound designer, to figure out what sounds should happen, when they should happen, and how they should fit the scene. ThinkSound can also let users refine sounds by clicking on objects in the video and edit sounds using plain-language instructions.

Key Objectives

Here are the main goals of the research, explained simply:

  • Make video sound more realistic and better timed with what’s happening on screen, not just generic noise.
  • Teach an AI to “think in steps” about sound, so it can handle complex scenes with multiple actions and events.
  • Create an easy way for users to control sounds in a video by pointing to specific objects and using natural language (like “make the footsteps louder”).
  • Build a single powerful audio model that can handle many tasks: making new sounds, improving them, and editing them.
  • Provide a new dataset called AudioCoT that pairs videos, text descriptions, and sound reasoning to train better models.

Methods and Approach

ThinkSound works in three stages, guided by a “Chain-of-Thought” (CoT) plan—this is just the AI explaining its own step-by-step reasoning.

  1. Foundational foley generation: “Foley” means sound effects added to movies (like footsteps, door clicks, wind). First, the AI watches the whole video and writes a clear plan: what sounds should happen, in what order, how loud they should be, and how they interact. Then it generates the main soundscape that matches the scene.
  2. Interactive object-focused refinement: If you click on an object in the video (say, a dog), the AI zooms in on that area and refines the sound just for that object. It thinks about what the dog should sound like right now (barking? paw steps?) and blends that sound smoothly with the existing audio.
  3. Instruction-based audio editing: You can edit the audio by typing instructions like “remove car engine noise,” “extend the rain,” “fill in missing audio,” or “add a door slam.” The AI translates your request into a step-by-step plan and applies it without breaking the overall soundtrack.

How it’s built:

  • Multimodal LLM (MLLM): This is an AI that understands videos, text, and audio together. ThinkSound fine-tunes an existing model (VideoLLaMA2) to produce clear CoT steps for sound generation and editing.
  • Unified audio foundation model: This is the “sound maker” that turns the step-by-step plan into actual audio. It uses a technique called “flow matching,” which you can think of like guiding random noise to slowly become a clear, correct sound by following a path. It also uses tools like a VAE (a smart compressor for sound) and special encoders for text and video features.
  • AudioCoT dataset: A large collection of video-and-audio clips, audio-and-text pairs, and detailed reasoning steps. It teaches the AI how to connect visuals, descriptions, and sounds in a structured way.
  • Click-based object selection: The system can find and track objects in the video (like cars or doors) and focus sound changes on them.

Simple analogy: Imagine a chef reading a step-by-step recipe. The LLM writes the recipe (“first footsteps, then door opens, then wind gets louder”), and the audio model cooks the sound following that recipe. If you point to the “salt” (an object) or say “make it spicier” (an instruction), the chef adjusts just that part without ruining the whole dish.
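The “guiding noise along a path” intuition for flow matching can be made concrete with a toy example, assuming the common straight-line path x_t = (1 − t)·x0 + t·x1, whose target velocity is x1 − x0 (a simplification of what the paper's model learns):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy flow matching: move random noise x0 toward data x1 along a straight path.
x0 = rng.normal(size=4)               # random noise sample
x1 = np.array([1.0, -2.0, 0.5, 3.0])  # the "clear, correct sound" (toy data)

t = 0.25
x_t = (1 - t) * x0 + t * x1           # point on the interpolation path at time t
v_target = x1 - x0                    # velocity a trained model should predict on this path

# A perfect velocity model lets us integrate from noise to data with Euler steps:
x = x0.copy()
for _ in range(10):
    x = x + 0.1 * v_target            # 10 steps of size 1/10 traverse the whole path
print(np.allclose(x, x1))             # True: the noise has flowed into the data point
```

A real model predicts the velocity from (x_t, t) plus the video, caption, and CoT conditioning; training minimizes the gap between its prediction and v_target.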

Main Findings

  • Better sound quality and timing: ThinkSound creates audio that matches video events more accurately than other systems. Sounds start and stop when they should and feel more “in the scene.”
  • Stronger understanding of complex scenes: Because it reasons step-by-step, ThinkSound handles tricky moments—like a car door closing, opening, then closing again—without mixing up the order.
  • Improved alignment with text and visuals: Tests show ThinkSound’s audio fits the video and the CoT plan more closely.
  • Works well on new, unseen videos: On an outside benchmark it hadn’t trained on, ThinkSound still performed the best or tied for best in key measures.
  • CoT really helps: When the researchers removed the step-by-step reasoning, the results clearly got worse—proof that “thinking out loud” guides better sound.

Implications and Impact

ThinkSound brings us closer to truly realistic and controllable sound for videos. This can help:

  • Filmmakers and game developers create rich, believable soundtracks faster.
  • Creators on social media make better videos without needing expert audio skills.
  • Education and accessibility tools that need clear audio synced to visuals.

In the future, the team plans to add more knowledge about real-world acoustics (like echoes and how sound travels in rooms) and improve reasoning for scenes with many moving objects. Overall, ThinkSound shows that when AI explains its thinking and follows a clear plan, it can make smarter, more convincing audio for almost any video.
