
UniSRCodec: Unified Low-Bitrate Neural Audio Codec

Updated 13 January 2026
  • UniSRCodec is a unified, low-bitrate neural audio codec that leverages a single-codebook VQ architecture with sub-band reconstruction to compress audio at ultra-low bitrates.
  • It employs Mel-spectrogram-based time-frequency compression, a streamlined encoder-decoder pipeline, and neural vocoder-based phase synthesis to ensure high-fidelity audio generation.
  • Performance benchmarks demonstrate superior fidelity and robustness across speech, music, and general audio, achieving state-of-the-art metrics at 0.52 kbps.

UniSRCodec is a unified, low-bitrate neural audio codec built on a single-codebook vector quantization (VQ) architecture with sub-band reconstruction. Its design addresses limitations in both multi-codebook and conventional single-codebook neural audio codecs (NACs), namely the structural complexity of the former and the fidelity and frequency coverage challenges of the latter. By integrating Mel-spectrogram-based time–frequency compression, a single codebook ("SimVQ"), sub-band loss scaling, and neural vocoder-based phase synthesis, UniSRCodec achieves high fidelity and robustness across speech, music, and general audio at bitrates as low as 0.52 kbps, and demonstrates state-of-the-art (SOTA) performance among cross-domain single-codebook codecs (Zhang et al., 6 Jan 2026).

1. Model Architecture and Signal Pathway

UniSRCodec employs a pipeline that transitions audio from the time domain to a compressed, quantized latent space and back, leveraging spectral representations and modern VQ techniques.

  • Time–Frequency Compression: The input waveform x(t) at 44.1 kHz is first transformed via the short-time Fourier transform (STFT) into complex spectra X(ω, τ). The magnitude spectrum is mapped to a 128-band Mel-spectrogram via a filterbank H_{n,k}:

M(n, \tau) = \sum_{k} |X(\omega_k, \tau)| \cdot H_{n,k}, \quad n = 1, \ldots, 128,\ \tau = 1, \ldots, T

Only the magnitude is retained; phase is discarded.
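As a rough illustration, the magnitude-only Mel analysis above can be sketched in NumPy. The helpers below (`stft_mag`, `mel_filterbank`), the window size, and the hop length are illustrative choices, not the paper's exact front-end:

```python
import numpy as np

def stft_mag(x, n_fft=2048, hop=512):
    """Magnitude STFT via a simple framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * win
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))      # (n_fft//2 + 1, n_frames)

def mel_filterbank(sr=44100, n_fft=2048, n_mels=128):
    """Triangular Mel filterbank H[n, k] mapping linear FFT bins to 128 bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for n in range(1, n_mels + 1):
        l, c, r = bins[n - 1], bins[n], bins[n + 1]
        if c > l:
            H[n - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            H[n - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return H

x = np.random.randn(65536)            # one ~1.49 s segment at 44.1 kHz
M = mel_filterbank() @ stft_mag(x)    # M(n, tau): 128 Mel bands x T frames
```

Only `M` (the nonnegative magnitude projection) is passed downstream; the STFT phase is dropped here, exactly as in the pipeline described above.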

  • Encoder: Utilizes fully convolutional 2D ResBlocks (GroupNorm → SiLU → Conv), adapted from Open-MagViT2. The network expands channel dimensionality (1 → 128 → 256 → 512) while downsampling over both time and frequency through strides [2, 2, 4], giving a final latent shape of 512 × 8 × 8.
  • Single-Codebook Quantizer (SimVQ): Adopts a frame-wise flattening strategy, grouping channels across the 8 temporal and 8 Mel-frequency axes to produce 8 × 8 = 64 latent vectors per segment. Vectors {z_e^i} are assigned to the nearest of K = 8192 codebook vectors:

z_q^i = e_k, \quad \text{where}\ k = \arg\min_j \| z_e^i - e_j \|_2^2

Commitment loss is used to anchor encoder outputs to their assigned codebook entries.
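A minimal NumPy sketch of this nearest-neighbor assignment and the commitment term follows. The function name `simvq_quantize` and the random data are illustrative, and SimVQ's trainable codebook reparameterization (and the stop-gradient, meaningless outside autodiff) are omitted:

```python
import numpy as np

def simvq_quantize(z_e, codebook):
    """Assign each latent vector to its nearest codebook entry under L2."""
    # Squared distances via the expansion |a - b|^2 = |a|^2 - 2 a.b + |b|^2
    d = ((z_e ** 2).sum(1)[:, None]
         - 2.0 * z_e @ codebook.T
         + (codebook ** 2).sum(1)[None, :])        # (64, K)
    idx = d.argmin(axis=1)                          # one 13-bit token per vector
    z_q = codebook[idx]                             # quantized latents
    # Commitment term anchoring encoder outputs to their codebook entries
    commit = ((z_e - z_q) ** 2).sum(axis=1).mean()
    return z_q, idx, commit

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 512))   # K = 8192 entries, 512-dim
z_e = rng.normal(size=(64, 512))          # 64 latent vectors per 8x8 segment
z_q, idx, commit = simvq_quantize(z_e, codebook)
```

The 64 indices in `idx` are the only information transmitted per segment; the decoder side looks the vectors back up in the shared codebook.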

  • Decoder: Mirrors the encoder in reverse (channels [512, 256, 128, 1]; upsampling factors [4, 2, 2]), reconstructing the 1 × 128 × 128 Mel-spectrogram segment.
  • Sub-Band Reconstruction: The Mel-spectrogram target x ∈ ℝ^{128×128} is split along the Mel axis into low- and high-frequency halves, x_low ∈ ℝ^{64×128} and x_high ∈ ℝ^{64×128}. A weighted L1 loss emphasizes the lower frequencies:

L_{\text{sr}} = \frac{ \alpha_{\text{low}} \| x_{\text{low}} - \hat{x}_{\text{low}} \|_1 + \alpha_{\text{high}} \| x_{\text{high}} - \hat{x}_{\text{high}} \|_1 }{ \alpha_{\text{low}} + \alpha_{\text{high}} }

where α_low = 2 and α_high = 1.
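The sub-band loss is simple enough to state directly in code. This NumPy sketch assumes the 128 × 128 layout described above, with the Mel axis first:

```python
import numpy as np

def subband_l1(x, x_hat, a_low=2.0, a_high=1.0):
    """Weighted L1 over the low/high halves of the 128-band Mel axis."""
    low = np.abs(x[:64] - x_hat[:64]).sum()      # Mel bands 1..64
    high = np.abs(x[64:] - x_hat[64:]).sum()     # Mel bands 65..128
    return (a_low * low + a_high * high) / (a_low + a_high)
```

With the paper's weights (2, 1), an error in the lower half of the spectrum costs twice as much as the same error in the upper half, biasing the model toward the perceptually dominant low frequencies.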

  • Phase Recovery: The reconstructed Mel-spectrogram is synthesized into waveform audio using BigVGAN-v2, a pre-trained universal neural vocoder, which implicitly recovers the phase.

2. Training Objectives and Optimization

Multiple loss terms shape the training process:

  • Sub-Band Reconstruction Loss (L_sr): As described above, accentuates low-frequency fidelity while preserving high-frequency detail.
  • Discriminator Loss (L_disc): A single-scale spectral discriminator (adapted from DAC) distinguishes real from reconstructed Mel-spectrograms.
  • Adversarial Loss (L_adv): Standard GAN-style generator loss against the discriminator.
  • Feature-Matching Loss (L_fm): Encourages generator output to match the discriminator's intermediate features.
  • Commitment Loss (L_cm): Prevents codebook collapse:

L_{\text{cm}} = \| \mathrm{sg}[z_e] - e_k \|_2^2

with sg[·] the stop-gradient operator.

The aggregate training objective is:

L = \lambda_{\text{sr}} L_{\text{sr}} + \lambda_{\text{disc}} L_{\text{disc}} + \lambda_{\text{adv}} L_{\text{adv}} + \lambda_{\text{fm}} L_{\text{fm}} + \lambda_{\text{cm}} L_{\text{cm}}

with (λ_sr, λ_disc, λ_adv, λ_fm, λ_cm) = (15, 1, 1, 1, 1).
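The aggregate objective is a plain weighted sum; a one-function sketch with the weights stated above (the function name is illustrative):

```python
def total_loss(l_sr, l_disc, l_adv, l_fm, l_cm,
               weights=(15.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five training terms; default weights from the text."""
    terms = (l_sr, l_disc, l_adv, l_fm, l_cm)
    return sum(w * l for w, l in zip(weights, terms))
```

The 15:1 weighting on L_sr reflects that spectrogram reconstruction, rather than the adversarial terms, carries most of the training signal.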

3. Compression Parameters and Bitrate Analysis

UniSRCodec achieves high compression rates by minimizing token rates and exploiting codebook size.

  • Token Rate: Each 65,536-sample (~1.49 s) segment is represented by a 128 × 128 Mel-spectrogram, then compressed to an 8 × 8 latent for SimVQ, yielding 64 quantized tokens per segment (≈43 tokens/s, treated as a nominal 40 tokens/s in the bitrate analysis).
  • Bitrate Calculation:
    • K = 8192 codebook entries ⇒ log₂ 8192 = 13 bits/token
    • Bitrate = 40 tokens/s × 13 bits/token = 520 bps = 0.52 kbps
  • Compression Ratio: Compared to 16-bit PCM audio at 44.1 kHz (≈705.6 kbps), this yields a compression ratio of roughly 1,357:1.
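The bitrate arithmetic above can be checked in a few lines:

```python
tokens_per_s = 40                    # nominal token rate
bits_per_token = 13                  # log2(8192) for the K = 8192 codebook
bitrate_bps = tokens_per_s * bits_per_token   # 520 bps = 0.52 kbps

pcm_bps = 44100 * 16                 # 16-bit mono PCM at 44.1 kHz: 705,600 bps
ratio = pcm_bps / bitrate_bps        # ~1,357:1 compression
```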

4. Experimental Evaluation and Benchmarks

Extensive evaluation was conducted on speech, music, and general audio:

  • Datasets: Training spanned ~10,000 hours from VCTK, LibriTTS, CommonVoice (speech), MUSDB, Jamendo (music), and AudioSet (general audio). Testing used 1,000 clips sampled from LibriTTS-test-clean, MUSDB-test, and AudioSet-eval.
  • Objective Metrics:
    • Music & Audio: Mel-L1 distance and STFT distance at 44.1 kHz ("Mel-44", "STFT-44") and 16 kHz ("Mel-16", "STFT-16").
    • Speech: Short-Time Objective Intelligibility (STOI, higher is better); Perceptual Evaluation of Speech Quality (PESQ, higher is better).

Summary Table: Objective Results

| Model | TPS | kbps / Nq | Mel-44 ↓ | STFT-44 ↓ | Mel-16 ↓ | STFT-16 ↓ | STOI ↑ | PESQ ↑ |
|---|---|---|---|---|---|---|---|---|
| WavTokenizer-Unified | 40 | 0.48 / 1q | 1.505 | 5.242 | 1.130 | 2.634 | 0.875 | 1.912 |
| UniCodec (single) | 75 | 1.30 / 1q | 1.376 | 5.169 | 0.903 | 2.401 | 0.940 | 2.870 |
| UniSRCodec-Base | 40 | 0.52 / 1q | 0.904 | 2.250 | 0.900 | 2.330 | 0.875 | 1.836 |
| UniSRCodec-Large | 176 | 2.29 / 1q | 0.729 | 2.049 | 0.692 | 2.093 | 0.941 | 2.727 |

At 40 TPS, UniSRCodec-Base outperforms WavTokenizer-Unified and other single-codebook models across all Mel/STFT metrics. UniSRCodec-Large surpasses or matches leading multi-codebook methods at a lower bitrate.

Subjective Listening (MUSHRA) Results

| Domain | UniSR-Base | UniSR-Large | UniCodec |
|---|---|---|---|
| General | 53.4 | 72.7 | 46.9 |
| Music | 62.8 | 81.0 | 33.9 |
| Speech | 76.9 | 83.2 | 83.5 |

Even at 40 TPS, UniSR-Base exceeds previous single-codebook models on general audio and music, and closely approaches their quality on speech.

5. Comparative Analysis with Prior Approaches

Complexity and Efficiency

UniSRCodec's single-codebook design (SimVQ) yields a straightforward inference pipeline and enhances feasibility for integration within audio-LLMs, contrasting with the structural overhead and slower adaptation of multi-codebook (e.g., RVQ) codecs.

Training convergence is attainable within 12 hours on 8×RTX 4090 GPUs, demonstrating the method's efficiency.

Bandwidth and Reconstruction Quality

  • Bandwidth efficiency: Reduces required bitrate to 0.52 kbps, compared to 0.9 kbps (WavTokenizer), 3.4 kbps (MelCap), and 2.88 kbps (SNAC).
  • Fidelity: Achieves comparable or superior high-frequency (Mel-44, STFT-44) reconstruction relative to multi-codebook benchmarks.

Unified Audio Modeling

By training on a diverse audio corpus covering speech, music, and general audio, UniSRCodec attains robust cross-domain generalization. The single codebook design minimizes domain bias and improves adaptation to downstream tasks.

6. Significance and Applications

UniSRCodec demonstrates that spectrogram-domain single-codebook quantization, combined with neural vocoder-based phase synthesis and frequency-aware reconstruction objectives, can achieve SOTA performance in ultra-low bitrate coding, expanding the practical applicability of neural audio codecs in bandwidth-constrained, cross-domain environments. Its simplicity of architecture and training, together with its performance advantages, position it as a strong candidate for integration in speech, music, and general audio coding pipelines, as well as downstream generative or audio-LLMs (Zhang et al., 6 Jan 2026).
