Spatial Audio Tokens
- Spatial Audio Tokens are learned representations that abstract essential spatial cues, such as direction, distance, and localization, from multichannel audio signals.
- They are generated using methods like FOA-Tokenizer's discrete vector quantization and PhaseCoder's transformer-based continuous encoding, enabling efficient audio compression and multimodal reasoning.
- By employing spatial consistency losses and geometry-aware encodings, these tokens preserve spatial fidelity, improving performance in sound localization and downstream tasks.
Spatial audio tokens are compact, discrete or continuous learned representations that encapsulate the spatial cues present in multichannel audio signals. Designed to abstract information about the direction, distance, and overall localization of sound sources, these tokens provide a medium for spatial audio understanding, compression, and multimodal reasoning without dependence on specific array geometries or traditional waveform representations. Their development enables efficient coding, transmission, and downstream consumption of spatial audio by neural models and LLMs.
1. Definition and Conceptual Framework
Spatial audio tokens, as defined in recent research, are learned embeddings or indices that encode the essential directional and localization information embedded in multichannel audio. In the context of neural codecs, these tokens can be discrete indices selected from a vector quantization (VQ) codebook as in FOA-Tokenizer (Sudarsanam et al., 25 Oct 2025), or continuous vector outputs as in transformer-based encoders like PhaseCoder (Dementyev et al., 28 Jan 2026).
In both architectures, the core idea is to distill the complex, high-dimensional multichannel waveform into a sequence of tokens that preserve salient spatial characteristics—such as interchannel phase differences, time delays, and magnitude variations—that can subsequently be used for tasks like compression, sound event localization, and spatial reasoning within multimodal systems.
2. Architectures for Generating Spatial Audio Tokens
Two representative approaches have been described in the literature:
FOA-Tokenizer
- Architecture: An extension of the WavTokenizer U-Net-style encoder to support 4-channel first-order ambisonics (FOA), followed by a single-layer vector quantizer and an asymmetric Vocos decoder with inverse STFT (iSTFT) output head.
- Encoding pipeline: Four-channel FOA signals at 24 kHz are ingested; strided convolutions downsample by a factor of 320 to yield 75 latent steps/sec. At each time step, the encoder produces a 512-dimensional vector that is quantized via nearest-neighbor lookup in a codebook of size 4096.
- Tokenization: The quantized token sequence consists of one token per frame, yielding a bitrate of $0.9$ kbps (75 tokens/s × 12 bits/token).
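The quantization step above amounts to a nearest-neighbor codebook lookup. A minimal numpy sketch, with a random codebook standing in for the learned one (shapes and sizes taken from the description above):

```python
import numpy as np

# Sketch of FOA-Tokenizer-style quantization (random codebook standing in
# for the learned one): each 512-dim encoder latent, produced at 75 steps/s,
# is snapped to its nearest codeword in a 4096-entry codebook, and the token
# is that codeword's 12-bit index.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((4096, 512))

def tokenize(latents: np.ndarray) -> np.ndarray:
    """Map (T, 512) encoder latents to (T,) discrete token indices."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, avoiding a (T, K, D) tensor.
    d2 = ((latents ** 2).sum(1, keepdims=True)
          - 2.0 * latents @ codebook.T
          + (codebook ** 2).sum(1))
    return d2.argmin(axis=1)

latents = rng.standard_normal((75, 512))      # one second of audio
tokens = tokenize(latents)
bits_per_token = int(np.log2(len(codebook)))  # 12
bitrate_kbps = 75 * bits_per_token / 1000     # 0.9 kbps, matching the paper
```

Decoding would look the indices back up (`codebook[tokens]`) before the Vocos/iSTFT head reconstructs the waveform.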
PhaseCoder
- Architecture: A purely transformer-based encoder that operates on raw multichannel waveforms plus microphone coordinates, agnostic to microphone geometry.
- Tokenization process: A short-time Fourier transform (STFT) is computed per microphone channel, producing magnitude and phase features that are embedded into a patch sequence. Three kinds of positional encodings—temporal, sequential, and geometry-aware—are applied to produce geometry-sensitive patch embeddings. A learnable [CLS] token aggregates global context, and patch outputs can be reshaped into a sequence of spatial audio tokens (up to ≈188 for a 30 s window).
- Token structure: Each spatial token is a fixed-length vector (typically 256-dimensional), which can be projected for consumption by LLMs (Dementyev et al., 28 Jan 2026).
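The patch-embedding pipeline can be sketched as follows. This is a simplified illustration, not the paper's exact configuration: the patch projection, the sinusoidal encodings, and the use of per-microphone elevation as the geometry signal are all assumptions.

```python
import numpy as np

# Hypothetical sketch of PhaseCoder-style tokenization: per-microphone STFT
# magnitude and phase are flattened into patches, summed with temporal,
# sequential, and geometry-aware positional encodings, and prefixed with a
# learnable [CLS] vector. All shapes and encodings are illustrative.
D = 256                                  # token dimension

def sinusoid(pos: np.ndarray) -> np.ndarray:
    """Standard sinusoidal encoding of scalar positions into D dims."""
    i = np.arange(D // 2)
    ang = pos[:, None] / (10000.0 ** (2 * i / D))
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def to_tokens(stft_mag, stft_phase, mic_elev):
    """(M, T, F) magnitude/phase + per-mic elevation -> (1 + M*T, D) tokens."""
    M, T, F = stft_mag.shape
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((2 * F, D)) * 0.02            # learned in practice
    feats = np.concatenate([stft_mag, stft_phase], axis=-1)  # (M, T, 2F)
    patches = feats.reshape(M * T, 2 * F) @ proj             # (M*T, D)
    patches += sinusoid(np.tile(np.arange(T), M).astype(float))    # temporal
    patches += sinusoid(np.repeat(np.arange(M), T).astype(float))  # sequential
    patches += sinusoid(np.repeat(mic_elev, T))                    # geometry-aware
    cls = np.zeros((1, D))                                   # learnable [CLS]
    return np.concatenate([cls, patches], axis=0)

rng = np.random.default_rng(1)
mag = np.abs(rng.standard_normal((4, 20, 65)))
phase = rng.uniform(-np.pi, np.pi, (4, 20, 65))
seq = to_tokens(mag, phase, np.array([0.1, 0.2, 0.3, 0.4]))  # (81, 256)
```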
3. Loss Functions and Preservation of Spatial Cues
The preservation of spatial fidelity is central to spatial audio tokens.
Spatial Consistency Loss in FOA-Tokenizer
- Motivation: In FOA, directional cues are represented by active intensity vectors. Alignment between input and reconstructed intensity vectors is enforced by a spatial consistency loss:
$$\mathcal{L}_{\mathrm{spatial}} = \sum_{t,f} w_{t,f}\,\bigl(1 - \mathrm{sim}(\mathbf{I}_{t,f}, \hat{\mathbf{I}}_{t,f})\bigr),$$
where $\mathrm{sim}(\mathbf{I}_{t,f}, \hat{\mathbf{I}}_{t,f})$ is the cosine similarity of the input and reconstructed intensity vectors at time-frequency bin $(t,f)$, and $w_{t,f}$ weights energy-dense, non-diffuse time-frequency bins.
- Complete objective:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{VQ}} + \mathcal{L}_{\mathrm{acoustic}} + \mathcal{L}_{\mathrm{adv}} + \mathcal{L}_{\mathrm{feat}} + \lambda\,\mathcal{L}_{\mathrm{spatial}},$$
which combines VQ commitment, acoustic, adversarial, feature-matching, and spatial consistency losses, with $\lambda$ a weighting hyperparameter.
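The spatial consistency term can be sketched directly from its description: per time-frequency bin, penalize misalignment (one minus cosine similarity) between input and reconstructed intensity vectors, weighted toward energetic bins. The energy-normalized weighting used here is an assumption.

```python
import numpy as np

# Minimal sketch of a spatial consistency loss over active intensity vectors.
# The weighting scheme (energy-normalized) is illustrative, not the paper's.
def spatial_consistency_loss(I, I_hat, energy, eps=1e-8):
    """I, I_hat: (T, F, 3) intensity vectors; energy: (T, F) bin energies."""
    cos = (I * I_hat).sum(-1) / (
        np.linalg.norm(I, axis=-1) * np.linalg.norm(I_hat, axis=-1) + eps)
    w = energy / (energy.sum() + eps)     # emphasize energy-dense bins
    return float((w * (1.0 - cos)).sum())

rng = np.random.default_rng(0)
I = rng.standard_normal((10, 5, 3))
energy = np.abs(rng.standard_normal((10, 5)))
perfect = spatial_consistency_loss(I, I, energy)    # ~0: directions preserved
flipped = spatial_consistency_loss(I, -I, energy)   # ~2: directions inverted
```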
Classification and Alignment Losses in PhaseCoder
- Losses: Azimuth, elevation, and distance heads are trained using cross-entropy over discretized bins, with overall loss
$$\mathcal{L} = \mathcal{L}_{\mathrm{az}} + \mathcal{L}_{\mathrm{el}} + \mathcal{L}_{\mathrm{dist}}.$$
- Geometry Encoding: Microphone positions are mapped to spherical coordinates and injected as sinusoidal position embeddings, yielding geometry-awareness.
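A minimal sketch of this objective, assuming an unweighted sum of the three cross-entropy heads (bin counts are illustrative):

```python
import numpy as np

# Sketch of the PhaseCoder training objective as described: azimuth,
# elevation, and distance are each discretized into bins and trained with
# cross-entropy; the unweighted sum below is an assumption.
def cross_entropy(logits, target):
    z = logits - logits.max()             # numerically stable log-softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def phasecoder_loss(az_logits, el_logits, dist_logits, az, el, dist):
    return (cross_entropy(az_logits, az)
            + cross_entropy(el_logits, el)
            + cross_entropy(dist_logits, dist))

rng = np.random.default_rng(0)
loss = phasecoder_loss(rng.standard_normal(72),   # e.g. 5-degree azimuth bins
                       rng.standard_normal(36),
                       rng.standard_normal(10),
                       az=12, el=5, dist=3)
```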
4. Quantization, Sequence Format, and Bitrate
Spatial audio tokens are defined by their quantization schemes, dimensions, and integration into larger inference pipelines.
| System | Token Type | Token Rate | Token Bitwidth | Output Dim | Bitrate |
|---|---|---|---|---|---|
| FOA-Tokenizer | Discrete VQ | 75 tokens/s | 12 bits | Index | 0.9 kbps |
| PhaseCoder | Continuous Vec | ≈188 tokens per 30 s window | — | 256 | N/A (continuous) |
- FOA-Tokenizer: Tokens are indices into a 4096-entry codebook, supporting codec and downstream feature applications.
- PhaseCoder: Tokens are learned, fixed-length vectors, projected as necessary to match LLM embedding spaces. Special [BSA] and [ESA] markers delimit the audio token span, enabling multimodal integration.
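For the discrete case, the 12-bit indices can be serialized compactly for transmission. The packing scheme below (two tokens per three bytes) is purely illustrative; neither paper specifies a wire format.

```python
import numpy as np

# Hypothetical serialization of 12-bit codebook indices: two tokens are
# packed into three bytes, and unpacked back losslessly.
def pack12(tokens: np.ndarray) -> bytes:
    assert len(tokens) % 2 == 0, "pad to an even count for this sketch"
    out = bytearray()
    for a, b in zip(tokens[0::2], tokens[1::2]):
        out += bytes([int(a) >> 4,
                      ((int(a) & 0xF) << 4) | (int(b) >> 8),
                      int(b) & 0xFF])
    return bytes(out)

def unpack12(data: bytes) -> np.ndarray:
    toks = []
    for i in range(0, len(data), 3):
        x, y, z = data[i], data[i + 1], data[i + 2]
        toks += [(x << 4) | (y >> 4), ((y & 0xF) << 8) | z]
    return np.array(toks)

tokens = np.array([0, 4095, 123, 2048])   # four 12-bit indices -> 6 bytes
packed = pack12(tokens)
restored = unpack12(packed)
```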
5. Evaluation Metrics and Empirical Results
Spatial audio token systems are evaluated using both acoustic fidelity and spatial localization accuracy.
FOA-Tokenizer Results
- Acoustic fidelity: Metrics include CLAP similarity, STFT and Mel-L1 distance, and speech word error rate (WER).
- Spatial accuracy (mean angular error in degrees):
- In-domain simulated reverberant: 13.76° ($0.9$ kbps)
- SpatialVCTK clean speech: 3.96°
- MEIR (real RIRs/noise): 25.83°
- Multichannel Opus (24 kbps): 22.47°, 17.23°, 40.17° (same tasks)
Table: FOA-Tokenizer Angular Error vs. Baselines
| Dataset | FOA-Tokenizer | Opus 24 kbps | Opus 32 kbps |
|---|---|---|---|
| In-domain Sim | 13.76° | 22.47° | 8.06° |
| SpatialVCTK | 3.96° | 17.23° | 1.02° |
| MEIR | 25.83° | 40.17° | 13.28° |
- Ablations: Removing spatial loss increases angular error to 87.32°, confirming its role in conveying spatial cues.
PhaseCoder Results
- Microphone-invariant localization:
- RSL2019: Mean absolute error 4.33°, Acc@10°=95.5%
- LOCATA: 7.44°, 86.96%
- Downstream LLM (Gemma 3n):
- Speaker localization: (azimuth), 3° (elevation), 0.75m (distance)
- Spatial reasoning: 76.8% accuracy on yes/no tasks (baseline: ∼50%)
- Targeted speech transcription: WER reduced from ∼30% to ∼10% on synthetic QA.
- Qualitative: Correct annotation of transcripts with locations, and successful spatial comparison responses (Dementyev et al., 28 Jan 2026).
6. Downstream Applications and Multimodal Integration
Spatial audio tokens are directly deployable as features for spatial audio perception and reasoning tasks:
- Sound Event Localization and Detection (SELD): Discrete token indices from FOA-Tokenizer are fed into convolutional SELD networks in Multi-ACCDOA format, achieving competitive F-scores and localization errors on STARSS23 (Sudarsanam et al., 25 Oct 2025).
- Multimodal LLM Integration: Spatial tokens can be concatenated/prepended to LLMs (e.g., Gemma 3n), enabling tasks such as targeted transcription (e.g., “What is the person on the left saying?”) and spatial reasoning—empowering embodied agents and AR/VR systems with spatial perception beyond mono audio (Dementyev et al., 28 Jan 2026).
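Splicing continuous spatial tokens into an LLM prompt can be sketched as a linear projection plus special-marker embeddings. The dimensions, random projection, and marker handling below are illustrative assumptions, not Gemma 3n's actual interface.

```python
import numpy as np

# Hypothetical sketch of multimodal integration: 256-dim spatial tokens are
# projected to the LLM embedding width and wrapped between [BSA]/[ESA]
# marker embeddings ahead of the text embeddings. All sizes are assumptions.
rng = np.random.default_rng(0)
D_TOK, D_LLM = 256, 2048
W = rng.standard_normal((D_TOK, D_LLM)) * 0.02    # learned projection
bsa, esa = rng.standard_normal((2, D_LLM))        # [BSA] / [ESA] embeddings

def build_inputs(spatial_tokens, text_embeds):
    """(N, 256) spatial tokens + (T, D_LLM) text -> (N+2+T, D_LLM) sequence."""
    audio = spatial_tokens @ W
    return np.concatenate([bsa[None], audio, esa[None], text_embeds], axis=0)

seq = build_inputs(rng.standard_normal((188, D_TOK)),  # ~188 tokens / 30 s
                   rng.standard_normal((16, D_LLM)))   # prompt embeddings
```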
7. Extensions, Challenges, and Future Directions
PhaseCoder proposes several directions for spatial audio token research:
- Extension to dynamic/moving sources via temporal attention.
- Adoption of joint SELD frameworks for non-speech sounds.
- Fusion with visual or depth modalities for richer multimodal scene understanding.
- Embedding of device-specific acoustic transfer functions to account for array effects on mobile hardware.
This suggests that the universal and geometry-agnostic nature of spatial audio tokens positions them as foundational for future spatially-aware AI agents operating in varied environments and using heterogeneous sensor configurations.
References:
- FOA-Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss (Sudarsanam et al., 25 Oct 2025)
- PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs (Dementyev et al., 28 Jan 2026)