Spatial Audio Tokens
- Spatial Audio Tokens are learned representations that abstract essential spatial cues, such as direction, distance, and localization, from multichannel audio signals.
- They are generated using methods like FOA-Tokenizer's discrete vector quantization and PhaseCoder's transformer-based continuous encoding, enabling efficient audio compression and multimodal reasoning.
- By employing spatial consistency losses and geometry-aware encodings, these tokens preserve spatial fidelity, improving performance in sound localization and downstream tasks.
Spatial audio tokens are compact, discrete or continuous learned representations that encapsulate the spatial cues present in multichannel audio signals. Designed to abstract information about the direction, distance, and overall localization of sound sources, these tokens provide a medium for spatial audio understanding, compression, and multimodal reasoning without dependence on specific array geometries or traditional waveform representations. Their development enables efficient coding, transmission, and downstream consumption of spatial audio by neural models and LLMs.
1. Definition and Conceptual Framework
Spatial audio tokens, as defined in recent research, are learned embeddings or indices that encode the essential directional and localization information embedded in multichannel audio. In the context of neural codecs, these tokens can be discrete indices selected from a vector quantization (VQ) codebook as in FOA-Tokenizer (Sudarsanam et al., 25 Oct 2025), or continuous vector outputs as in transformer-based encoders like PhaseCoder (Dementyev et al., 28 Jan 2026).
In both architectures, the core idea is to distill the complex, high-dimensional multichannel waveform into a sequence of tokens that preserve salient spatial characteristics—such as interchannel phase differences, time delays, and magnitude variations—that can subsequently be used for tasks like compression, sound event localization, and spatial reasoning within multimodal systems.
2. Architectures for Generating Spatial Audio Tokens
Two representative approaches have been described in the literature:
FOA-Tokenizer
- Architecture: An extension of the WavTokenizer U-Net-style encoder to support 4-channel first-order ambisonics (FOA), followed by a single-layer vector quantizer and an asymmetric Vocos decoder with inverse STFT (iSTFT) output head.
- Encoding pipeline: Four-channel FOA signals at 24 kHz are ingested; strided convolutions downsample by a factor of 320 to yield 75 latent steps/sec. At each time step, the encoder produces a 512-dimensional vector that is quantized via nearest-neighbor lookup in a codebook of size 4096.
- Tokenization: The quantized token sequence consists of one token per frame, yielding a bitrate of $0.9$ kbps (75 tokens/s × 12 bits/token).
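The quantization step above amounts to a nearest-neighbor codebook lookup. A minimal numpy sketch, with a random codebook standing in for the learned one (shapes and sizes taken from the description above):

```python
import numpy as np

# Sketch of FOA-Tokenizer-style quantization (random codebook standing in
# for the learned one): each 512-dim encoder latent, produced at 75 steps/s,
# is snapped to its nearest codeword in a 4096-entry codebook, and the token
# is that codeword's 12-bit index.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((4096, 512))

def tokenize(latents: np.ndarray) -> np.ndarray:
    """Map (T, 512) encoder latents to (T,) discrete token indices."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, avoiding a (T, K, D) tensor.
    d2 = ((latents ** 2).sum(1, keepdims=True)
          - 2.0 * latents @ codebook.T
          + (codebook ** 2).sum(1))
    return d2.argmin(axis=1)

latents = rng.standard_normal((75, 512))      # one second of audio
tokens = tokenize(latents)
bits_per_token = int(np.log2(len(codebook)))  # 12
bitrate_kbps = 75 * bits_per_token / 1000     # 0.9 kbps, matching the paper
```

Decoding would look the indices back up (`codebook[tokens]`) before the Vocos/iSTFT head reconstructs the waveform.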
PhaseCoder
- Architecture: A purely transformer-based encoder that operates on raw multichannel waveforms plus microphone coordinates, agnostic to microphone geometry.
- Tokenization process: A short-time Fourier transform (STFT) is computed per microphone channel, producing magnitude and phase features that are embedded into a patch sequence. Three kinds of positional encodings—temporal, sequential, and geometry-aware—are applied to produce geometry-sensitive patch embeddings. A learnable [CLS] token aggregates global context, and patch outputs can be reshaped into a sequence of spatial audio tokens (up to ≈188 for a 30 s window).
- Token structure: Each spatial token is a fixed-length vector (typically 256-dimensional), which can be projected for consumption by LLMs (Dementyev et al., 28 Jan 2026).
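The patch-embedding pipeline can be sketched as follows. This is a simplified illustration, not the paper's exact configuration: the patch projection, the sinusoidal encodings, and the use of per-microphone elevation as the geometry signal are all assumptions.

```python
import numpy as np

# Hypothetical sketch of PhaseCoder-style tokenization: per-microphone STFT
# magnitude and phase are flattened into patches, summed with temporal,
# sequential, and geometry-aware positional encodings, and prefixed with a
# learnable [CLS] vector. All shapes and encodings are illustrative.
D = 256                                  # token dimension

def sinusoid(pos: np.ndarray) -> np.ndarray:
    """Standard sinusoidal encoding of scalar positions into D dims."""
    i = np.arange(D // 2)
    ang = pos[:, None] / (10000.0 ** (2 * i / D))
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def to_tokens(stft_mag, stft_phase, mic_elev):
    """(M, T, F) magnitude/phase + per-mic elevation -> (1 + M*T, D) tokens."""
    M, T, F = stft_mag.shape
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((2 * F, D)) * 0.02            # learned in practice
    feats = np.concatenate([stft_mag, stft_phase], axis=-1)  # (M, T, 2F)
    patches = feats.reshape(M * T, 2 * F) @ proj             # (M*T, D)
    patches += sinusoid(np.tile(np.arange(T), M).astype(float))    # temporal
    patches += sinusoid(np.repeat(np.arange(M), T).astype(float))  # sequential
    patches += sinusoid(np.repeat(mic_elev, T))                    # geometry-aware
    cls = np.zeros((1, D))                                   # learnable [CLS]
    return np.concatenate([cls, patches], axis=0)

rng = np.random.default_rng(1)
mag = np.abs(rng.standard_normal((4, 20, 65)))
phase = rng.uniform(-np.pi, np.pi, (4, 20, 65))
seq = to_tokens(mag, phase, np.array([0.1, 0.2, 0.3, 0.4]))  # (81, 256)
```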
3. Loss Functions and Preservation of Spatial Cues
The preservation of spatial fidelity is central to spatial audio tokens.
Spatial Consistency Loss in FOA-Tokenizer
- Motivation: In FOA, directional cues are represented by active intensity vectors. Alignment between input and reconstructed intensity vectors is enforced by a spatial consistency loss:
$$\mathcal{L}_{\mathrm{spatial}} = \sum_{t,f} w_{t,f}\,\bigl(1 - \mathrm{sim}(\mathbf{I}_{t,f}, \hat{\mathbf{I}}_{t,f})\bigr),$$
where $\mathrm{sim}(\mathbf{I}_{t,f}, \hat{\mathbf{I}}_{t,f})$ is the cosine similarity of the input and reconstructed intensity vectors at time-frequency bin $(t,f)$, and $w_{t,f}$ weights energy-dense, non-diffuse time-frequency bins.
- Complete objective:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{VQ}} + \mathcal{L}_{\mathrm{acoustic}} + \mathcal{L}_{\mathrm{adv}} + \mathcal{L}_{\mathrm{feat}} + \lambda\,\mathcal{L}_{\mathrm{spatial}},$$
which combines VQ commitment, acoustic, adversarial, feature-matching, and spatial consistency losses, with $\lambda$ a weighting hyperparameter.
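The spatial consistency term can be sketched directly from its description: per time-frequency bin, penalize misalignment (one minus cosine similarity) between input and reconstructed intensity vectors, weighted toward energetic bins. The energy-normalized weighting used here is an assumption.

```python
import numpy as np

# Minimal sketch of a spatial consistency loss over active intensity vectors.
# The weighting scheme (energy-normalized) is illustrative, not the paper's.
def spatial_consistency_loss(I, I_hat, energy, eps=1e-8):
    """I, I_hat: (T, F, 3) intensity vectors; energy: (T, F) bin energies."""
    cos = (I * I_hat).sum(-1) / (
        np.linalg.norm(I, axis=-1) * np.linalg.norm(I_hat, axis=-1) + eps)
    w = energy / (energy.sum() + eps)     # emphasize energy-dense bins
    return float((w * (1.0 - cos)).sum())

rng = np.random.default_rng(0)
I = rng.standard_normal((10, 5, 3))
energy = np.abs(rng.standard_normal((10, 5)))
perfect = spatial_consistency_loss(I, I, energy)    # ~0: directions preserved
flipped = spatial_consistency_loss(I, -I, energy)   # ~2: directions inverted
```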
Classification and Alignment Losses in PhaseCoder
- Losses: Azimuth, elevation, and distance heads are trained using cross-entropy over discretized bins, with overall loss
$$\mathcal{L} = \mathcal{L}_{\mathrm{az}} + \mathcal{L}_{\mathrm{el}} + \mathcal{L}_{\mathrm{dist}}.$$
- Geometry Encoding: Microphone positions are mapped to spherical coordinates and injected as sinusoidal position embeddings, yielding geometry-awareness.
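A minimal sketch of this objective, assuming an unweighted sum of the three cross-entropy heads (bin counts are illustrative):

```python
import numpy as np

# Sketch of the PhaseCoder training objective as described: azimuth,
# elevation, and distance are each discretized into bins and trained with
# cross-entropy; the unweighted sum below is an assumption.
def cross_entropy(logits, target):
    z = logits - logits.max()             # numerically stable log-softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def phasecoder_loss(az_logits, el_logits, dist_logits, az, el, dist):
    return (cross_entropy(az_logits, az)
            + cross_entropy(el_logits, el)
            + cross_entropy(dist_logits, dist))

rng = np.random.default_rng(0)
loss = phasecoder_loss(rng.standard_normal(72),   # e.g. 5-degree azimuth bins
                       rng.standard_normal(36),
                       rng.standard_normal(10),
                       az=12, el=5, dist=3)
```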
4. Quantization, Sequence Format, and Bitrate
Spatial audio tokens are defined by their quantization schemes, dimensions, and integration into larger inference pipelines.
| System | Token Type | Token Rate | Token Bitwidth | Output Dim | Bitrate |
|---|---|---|---|---|---|
| FOA-Tokenizer | Discrete VQ | 75 tokens/s | 12 bits | Index | 0.9 kbps |
| PhaseCoder | Continuous Vec | ≈188 tokens per 30 s window | — | 256 | N/A (continuous) |
- FOA-Tokenizer: Tokens are indices into a 4096-entry codebook, supporting codec and downstream feature applications.
- PhaseCoder: Tokens are learned, fixed-length vectors, projected as necessary to match LLM embedding spaces. Special [BSA] and [ESA] markers delimit the audio token span, enabling multimodal integration.
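For the discrete case, the 12-bit indices can be serialized compactly for transmission. The packing scheme below (two tokens per three bytes) is purely illustrative; neither paper specifies a wire format.

```python
import numpy as np

# Hypothetical serialization of 12-bit codebook indices: two tokens are
# packed into three bytes, and unpacked back losslessly.
def pack12(tokens: np.ndarray) -> bytes:
    assert len(tokens) % 2 == 0, "pad to an even count for this sketch"
    out = bytearray()
    for a, b in zip(tokens[0::2], tokens[1::2]):
        out += bytes([int(a) >> 4,
                      ((int(a) & 0xF) << 4) | (int(b) >> 8),
                      int(b) & 0xFF])
    return bytes(out)

def unpack12(data: bytes) -> np.ndarray:
    toks = []
    for i in range(0, len(data), 3):
        x, y, z = data[i], data[i + 1], data[i + 2]
        toks += [(x << 4) | (y >> 4), ((y & 0xF) << 8) | z]
    return np.array(toks)

tokens = np.array([0, 4095, 123, 2048])   # four 12-bit indices -> 6 bytes
packed = pack12(tokens)
restored = unpack12(packed)
```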
5. Evaluation Metrics and Empirical Results
Spatial audio token systems are evaluated using both acoustic fidelity and spatial localization accuracy.
FOA-Tokenizer Results
- Acoustic fidelity: Metrics include CLAP similarity, STFT and Mel-L1 distance, and speech word error rate (WER).
- Spatial accuracy (mean angular error in degrees):
- In-domain simulated reverberant: 13.76° ($0.9$ kbps)
- SpatialVCTK clean speech: 3.96°
- MEIR (real RIRs/noise): 25.83°
- Multichannel Opus (24 kbps): 22.47°, 17.23°, 40.17° (same tasks)
Table: FOA-Tokenizer Angular Error vs. Baselines
| Dataset | FOA-Tokenizer | Opus 24 kbps | Opus 32 kbps |
|---|---|---|---|
| In-domain Sim | 13.76° | 22.47° | 8.06° |
| SpatialVCTK | 3.96° | 17.23° | 1.02° |
| MEIR | 25.83° | 40.17° | 13.28° |
- Ablations: Removing spatial loss increases angular error to 87.32°, confirming its role in conveying spatial cues.
PhaseCoder Results
- Microphone-invariant localization:
- RSL2019: Mean absolute error 4.33°, Acc@10°=95.5%
- LOCATA: 7.44°, 86.96%
- Downstream LLM (Gemma 3n):
- Speaker localization: (azimuth), 3° (elevation), 0.75m (distance)
- Spatial reasoning: 76.8% accuracy on yes/no tasks (baseline: ∼50%)
- Targeted speech transcription: WER reduced from ∼30% to ∼10% on synthetic QA.
- Qualitative: Correct annotation of transcripts with locations, and successful spatial comparison responses (Dementyev et al., 28 Jan 2026).
6. Downstream Applications and Multimodal Integration
Spatial audio tokens are directly deployable as features for spatial audio perception and reasoning tasks:
- Sound Event Localization and Detection (SELD): Discrete token indices from FOA-Tokenizer are fed into convolutional SELD networks in Multi-ACCDOA format, achieving competitive F-scores and localization errors on STARSS23 (Sudarsanam et al., 25 Oct 2025).
- Multimodal LLM Integration: Spatial tokens can be concatenated/prepended to LLMs (e.g., Gemma 3n), enabling tasks such as targeted transcription (e.g., “What is the person on the left saying?”) and spatial reasoning—empowering embodied agents and AR/VR systems with spatial perception beyond mono audio (Dementyev et al., 28 Jan 2026).
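Splicing continuous spatial tokens into an LLM prompt can be sketched as a linear projection plus special-marker embeddings. The dimensions, random projection, and marker handling below are illustrative assumptions, not Gemma 3n's actual interface.

```python
import numpy as np

# Hypothetical sketch of multimodal integration: 256-dim spatial tokens are
# projected to the LLM embedding width and wrapped between [BSA]/[ESA]
# marker embeddings ahead of the text embeddings. All sizes are assumptions.
rng = np.random.default_rng(0)
D_TOK, D_LLM = 256, 2048
W = rng.standard_normal((D_TOK, D_LLM)) * 0.02    # learned projection
bsa, esa = rng.standard_normal((2, D_LLM))        # [BSA] / [ESA] embeddings

def build_inputs(spatial_tokens, text_embeds):
    """(N, 256) spatial tokens + (T, D_LLM) text -> (N+2+T, D_LLM) sequence."""
    audio = spatial_tokens @ W
    return np.concatenate([bsa[None], audio, esa[None], text_embeds], axis=0)

seq = build_inputs(rng.standard_normal((188, D_TOK)),  # ~188 tokens / 30 s
                   rng.standard_normal((16, D_LLM)))   # prompt embeddings
```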
7. Extensions, Challenges, and Future Directions
PhaseCoder proposes several directions for spatial audio token research:
- Extension to dynamic/moving sources via temporal attention.
- Adoption of joint SELD frameworks for non-speech sounds.
- Fusion with visual or depth modalities for richer multimodal scene understanding.
- Embedding of device-specific acoustic transfer functions to account for array effects on mobile hardware.
This suggests that the universal and geometry-agnostic nature of spatial audio tokens positions them as foundational for future spatially-aware AI agents operating in varied environments and using heterogeneous sensor configurations.
References:
- FOA-Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss (Sudarsanam et al., 25 Oct 2025)
- PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs (Dementyev et al., 28 Jan 2026)