
SALM: Spatial Audio Language Model with Structured Embeddings for Understanding and Editing

Published 22 Jul 2025 in cs.SD and eess.AS | arXiv:2507.16724v1

Abstract: Spatial audio understanding is essential for accurately perceiving and interpreting acoustic environments. However, existing audio large language models struggle to process spatial audio and to perceive spatial acoustic scenes. We introduce the Spatial Audio Language Model (SALM), a novel framework that bridges spatial audio and language via multi-modal contrastive learning. SALM consists of a text encoder and a dual-branch audio encoder that decomposes spatial sound into semantic and spatial components through structured audio embeddings. Key features of SALM include seamless alignment of spatial and text representations, separate and joint extraction of spatial and semantic information, zero-shot direction classification, and robust support for spatial audio editing. Experimental results demonstrate that SALM effectively captures and aligns cross-modal representations. Furthermore, it supports advanced editing capabilities, such as altering directional audio using text-based embeddings.
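The abstract's structured-embedding idea can be illustrated with a minimal sketch. The paper's actual encoder architectures and training details are not given here, so the code below uses random projections as hypothetical stand-ins for the semantic and spatial branches; all names (`encode_audio`, `W_sem`, `W_spa`, the direction prompts) are illustrative assumptions, not SALM's real API. It shows the three mechanisms the abstract describes: decomposing an audio embedding into semantic and spatial components, zero-shot direction classification by similarity against text embeddings of direction prompts, and editing by swapping the spatial component with a text-derived one while keeping the semantic component fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    """Project onto the unit sphere, as is standard before cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical stand-ins for SALM's dual-branch audio encoder:
# random linear projections play the role of trained branch networks.
D_AUDIO, D_SEM, D_SPA = 32, 16, 8
W_sem = rng.normal(size=(D_AUDIO, D_SEM))   # semantic branch (what is sounding)
W_spa = rng.normal(size=(D_AUDIO, D_SPA))   # spatial branch (where it is)

def encode_audio(feat):
    """Decompose an audio feature into semantic and spatial components,
    returning the structured (concatenated) embedding plus each part."""
    sem = l2norm(feat @ W_sem)
    spa = l2norm(feat @ W_spa)
    return np.concatenate([sem, spa]), sem, spa

# Zero-shot direction classification: score the spatial component against
# text embeddings of direction prompts (random placeholders here for a
# trained text encoder's outputs).
direction_prompts = ["left", "right", "front", "back"]
text_spatial = l2norm(rng.normal(size=(len(direction_prompts), D_SPA)))

audio_feat = rng.normal(size=(D_AUDIO,))
structured, sem_emb, spa_emb = encode_audio(audio_feat)

scores = text_spatial @ spa_emb                      # cosine similarities
pred = direction_prompts[int(np.argmax(scores))]
print("predicted direction:", pred)

# Text-based editing: replace the spatial component with the embedding of a
# target direction prompt, leaving the semantic component untouched.
target = text_spatial[direction_prompts.index("right")]
edited = np.concatenate([sem_emb, target])
```

Because the spatial information lives in a separate, identifiable slice of the structured embedding, both classification and editing reduce to operations on that slice — which is the property the abstract's "separate and joint extraction" claim points to.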


Authors (5)
