
UniSRCodec: Unified Low-Bitrate Neural Audio Codec

Updated 13 January 2026
  • UniSRCodec is a unified, low-bitrate neural audio codec that leverages a single-codebook VQ architecture with sub-band reconstruction to compress audio at ultra-low bitrates.
  • It employs Mel-spectrogram-based time-frequency compression, a streamlined encoder-decoder pipeline, and neural vocoder-based phase synthesis to ensure high-fidelity audio generation.
  • Performance benchmarks demonstrate superior fidelity and robustness across speech, music, and general audio, achieving state-of-the-art metrics at 0.52 kbps.

UniSRCodec is a unified, low-bitrate neural audio codec built on a single-codebook vector quantization (VQ) architecture with sub-band reconstruction. Its design addresses limitations in both multi-codebook and conventional single-codebook neural audio codecs (NACs), namely the structural complexity of the former and the fidelity and frequency coverage challenges of the latter. By integrating Mel-spectrogram-based time–frequency compression, a single codebook ("SimVQ"), sub-band loss scaling, and neural vocoder-based phase synthesis, UniSRCodec achieves high fidelity and robustness across speech, music, and general audio at bitrates as low as 0.52 kbps, and demonstrates state-of-the-art (SOTA) performance among cross-domain single-codebook codecs (Zhang et al., 6 Jan 2026).

1. Model Architecture and Signal Pathway

UniSRCodec employs a pipeline that transitions audio from the time domain to a compressed, quantized latent space and back, leveraging spectral representations and modern VQ techniques.

  • Time–Frequency Compression: The input waveform x(t) at 44.1 kHz is first transformed via the short-time Fourier transform (STFT) into complex spectra X(ω, τ). The magnitude spectrum is mapped to a 128-band Mel-spectrogram via a filterbank H_{n,k}:

M(n, \tau) = \sum_{k} |X(\omega_k, \tau)| \cdot H_{n,k}, \quad n = 1, \ldots, 128,\ \tau = 1, \ldots, T

Only the magnitude is retained; phase is discarded.
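As a rough illustration, the magnitude-only Mel analysis above can be sketched in NumPy. The helpers below (`stft_mag`, `mel_filterbank`), the window size, and the hop length are illustrative choices, not the paper's exact front-end:

```python
import numpy as np

def stft_mag(x, n_fft=2048, hop=512):
    """Magnitude STFT via a simple framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * win
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))      # (n_fft//2 + 1, n_frames)

def mel_filterbank(sr=44100, n_fft=2048, n_mels=128):
    """Triangular Mel filterbank H[n, k] mapping linear FFT bins to 128 bands."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for n in range(1, n_mels + 1):
        l, c, r = bins[n - 1], bins[n], bins[n + 1]
        if c > l:
            H[n - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising edge
        if r > c:
            H[n - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling edge
    return H

x = np.random.randn(65536)            # one ~1.49 s segment at 44.1 kHz
M = mel_filterbank() @ stft_mag(x)    # M(n, tau): 128 Mel bands x T frames
```

Only `M` (the nonnegative magnitude projection) is passed downstream; the STFT phase is dropped here, exactly as in the pipeline described above.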

  • Encoder: Utilizes fully convolutional 2D ResBlocks (GroupNorm → SiLU → Conv), adapted from Open-MagViT2. The network expands channel dimensionality (1 → 128 → 256 → 512) while downsampling over both time and frequency through strides [2, 2, 4], giving a final latent shape of 512 × 8 × 8.
  • Single-Codebook Quantizer (SimVQ): Adopts a frame-wise flattening strategy, grouping channels across the 8 temporal and 8 Mel-frequency axes to produce 8 × 8 = 64 latent vectors per segment. Vectors {z_e^i} are assigned to the nearest of K = 8192 codebook vectors:

z_q^i = e_k, \quad \text{where}\ k = \arg\min_j \| z_e^i - e_j \|_2^2

Commitment loss is used to anchor encoder outputs to their assigned codebook entries.
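A minimal NumPy sketch of this nearest-neighbor assignment and the commitment term follows. The function name `simvq_quantize` and the random data are illustrative, and SimVQ's trainable codebook reparameterization (and the stop-gradient, meaningless outside autodiff) are omitted:

```python
import numpy as np

def simvq_quantize(z_e, codebook):
    """Assign each latent vector to its nearest codebook entry under L2."""
    # Squared distances via the expansion |a - b|^2 = |a|^2 - 2 a.b + |b|^2
    d = ((z_e ** 2).sum(1)[:, None]
         - 2.0 * z_e @ codebook.T
         + (codebook ** 2).sum(1)[None, :])        # (64, K)
    idx = d.argmin(axis=1)                          # one 13-bit token per vector
    z_q = codebook[idx]                             # quantized latents
    # Commitment term anchoring encoder outputs to their codebook entries
    commit = ((z_e - z_q) ** 2).sum(axis=1).mean()
    return z_q, idx, commit

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 512))   # K = 8192 entries, 512-dim
z_e = rng.normal(size=(64, 512))          # 64 latent vectors per 8x8 segment
z_q, idx, commit = simvq_quantize(z_e, codebook)
```

The 64 indices in `idx` are the only information transmitted per segment; the decoder side looks the vectors back up in the shared codebook.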

  • Decoder: Mirrors the encoder in reverse (channels [512, 256, 128, 1]; upsampling factors [4, 2, 2]), reconstructing the 1 × 128 × 128 Mel-spectrogram segment.
  • Sub-Band Reconstruction: The Mel-spectrogram target x ∈ ℝ^{128×128} is split along the Mel axis into low- and high-frequency halves, x_low ∈ ℝ^{64×128} and x_high ∈ ℝ^{64×128}. A weighted L1 loss emphasizes the lower frequencies:

L_{\text{sr}} = \frac{ \alpha_{\text{low}} \| x_{\text{low}} - \hat{x}_{\text{low}} \|_1 + \alpha_{\text{high}} \| x_{\text{high}} - \hat{x}_{\text{high}} \|_1 }{ \alpha_{\text{low}} + \alpha_{\text{high}} }

where α_low = 2 and α_high = 1.
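The sub-band loss is simple enough to state directly in code. This NumPy sketch assumes the 128 × 128 layout described above, with the Mel axis first:

```python
import numpy as np

def subband_l1(x, x_hat, a_low=2.0, a_high=1.0):
    """Weighted L1 over the low/high halves of the 128-band Mel axis."""
    low = np.abs(x[:64] - x_hat[:64]).sum()      # Mel bands 1..64
    high = np.abs(x[64:] - x_hat[64:]).sum()     # Mel bands 65..128
    return (a_low * low + a_high * high) / (a_low + a_high)
```

With the paper's weights (2, 1), an error in the lower half of the spectrum costs twice as much as the same error in the upper half, biasing the model toward the perceptually dominant low frequencies.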

  • Phase Recovery: The reconstructed Mel-spectrogram is synthesized into waveform audio using BigVGAN-v2, a pre-trained universal neural vocoder, which implicitly recovers the phase.

2. Training Objectives and Optimization

Multiple loss terms shape the training process:

  • Sub-Band Reconstruction Loss (L_sr): As described above, accentuates low-frequency fidelity while preserving high-frequency detail.
  • Discriminator Loss (L_disc): A single-scale spectral discriminator (adapted from DAC) distinguishes real from reconstructed Mel-spectrograms.
  • Adversarial Loss (L_adv): Standard GAN-style generator loss against the discriminator.
  • Feature-Matching Loss (L_fm): Encourages generator output to match the discriminator's intermediate features.
  • Commitment Loss (L_cm): Prevents codebook collapse:

L_{\text{cm}} = \| \mathrm{sg}[z_e] - e_k \|_2^2

with sg[·] the stop-gradient operator.

The aggregate training objective is:

L = \lambda_{\text{sr}} L_{\text{sr}} + \lambda_{\text{disc}} L_{\text{disc}} + \lambda_{\text{adv}} L_{\text{adv}} + \lambda_{\text{fm}} L_{\text{fm}} + \lambda_{\text{cm}} L_{\text{cm}}

with (λ_sr, λ_disc, λ_adv, λ_fm, λ_cm) = (15, 1, 1, 1, 1).
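The aggregate objective is a plain weighted sum; a one-function sketch with the weights stated above (the function name is illustrative):

```python
def total_loss(l_sr, l_disc, l_adv, l_fm, l_cm,
               weights=(15.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the five training terms; default weights from the text."""
    terms = (l_sr, l_disc, l_adv, l_fm, l_cm)
    return sum(w * l for w, l in zip(weights, terms))
```

The 15:1 weighting on L_sr reflects that spectrogram reconstruction, rather than the adversarial terms, carries most of the training signal.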

3. Compression Parameters and Bitrate Analysis

UniSRCodec achieves high compression rates by minimizing token rates and exploiting codebook size.

  • Token Rate: Each 65,536-sample (~1.49 s) segment is represented by a 128 × 128 Mel-spectrogram, then compressed to an 8 × 8 latent for SimVQ, yielding 64 quantized tokens per segment (≈43 tokens/s, treated as a nominal 40 tokens/s in the bitrate analysis).
  • Bitrate Calculation:
    • K = 8192 codebook entries ⇒ log₂ 8192 = 13 bits/token
    • Bitrate = 40 tokens/s × 13 bits/token = 520 bps = 0.52 kbps
  • Compression Ratio: Compared to 16-bit PCM audio at 44.1 kHz (≈705.6 kbps), this yields a compression ratio of roughly 1,357:1.
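The bitrate arithmetic above can be checked in a few lines:

```python
tokens_per_s = 40                    # nominal token rate
bits_per_token = 13                  # log2(8192) for the K = 8192 codebook
bitrate_bps = tokens_per_s * bits_per_token   # 520 bps = 0.52 kbps

pcm_bps = 44100 * 16                 # 16-bit mono PCM at 44.1 kHz: 705,600 bps
ratio = pcm_bps / bitrate_bps        # ~1,357:1 compression
```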

4. Experimental Evaluation and Benchmarks

Extensive evaluation was conducted on speech, music, and general audio:

  • Datasets: Training spanned ~10,000 hours from VCTK, LibriTTS, CommonVoice (speech), MUSDB, Jamendo (music), and AudioSet (general audio). Testing used 1,000 clips sampled from LibriTTS-test-clean, MUSDB-test, and AudioSet-eval.
  • Objective Metrics:
    • Music & Audio: Mel-L1 distance and STFT distance at 44.1 kHz ("Mel-44", "STFT-44") and 16 kHz ("Mel-16", "STFT-16").
    • Speech: Short-Time Objective Intelligibility (STOI, higher is better); Perceptual Evaluation of Speech Quality (PESQ, higher is better).

Summary Table: Objective Results

| Model | TPS | kbps / Nq | Mel-44 ↓ | STFT-44 ↓ | Mel-16 ↓ | STFT-16 ↓ | STOI ↑ | PESQ ↑ |
|---|---|---|---|---|---|---|---|---|
| WavTokenizer-Unified | 40 | 0.48 / 1q | 1.505 | 5.242 | 1.130 | 2.634 | 0.875 | 1.912 |
| UniCodec (single) | 75 | 1.30 / 1q | 1.376 | 5.169 | 0.903 | 2.401 | 0.940 | 2.870 |
| UniSRCodec-Base | 40 | 0.52 / 1q | 0.904 | 2.250 | 0.900 | 2.330 | 0.875 | 1.836 |
| UniSRCodec-Large | 176 | 2.29 / 1q | 0.729 | 2.049 | 0.692 | 2.093 | 0.941 | 2.727 |

At 40 TPS, UniSRCodec-Base outperforms WavTokenizer-Unified and other single-codebook models across all Mel/STFT metrics. UniSRCodec-Large surpasses or matches leading multi-codebook methods at a lower bitrate.

Subjective Listening (MUSHRA) Results

| Domain | UniSR-Base | UniSR-Large | UniCodec |
|---|---|---|---|
| General | 53.4 | 72.7 | 46.9 |
| Music | 62.8 | 81.0 | 33.9 |
| Speech | 76.9 | 83.2 | 83.5 |

Even at 40 TPS, UniSR-Base exceeds previous single-codebook models on general audio and music, and closely approaches their quality on speech.

5. Comparative Analysis with Prior Approaches

Complexity and Efficiency

UniSRCodec's single-codebook design (SimVQ) yields a straightforward inference pipeline and enhances feasibility for integration within audio-LLMs, contrasting with the structural overhead and slower adaptation of multi-codebook (e.g., RVQ) codecs.

Training convergence is attainable within 12 hours on 8×RTX 4090 GPUs, demonstrating the method's efficiency.

Bandwidth and Reconstruction Quality

  • Bandwidth efficiency: Reduces required bitrate to 0.52 kbps, compared to 0.9 kbps (WavTokenizer), 3.4 kbps (MelCap), and 2.88 kbps (SNAC).
  • Fidelity: Achieves comparable or superior high-frequency (Mel-44, STFT-44) reconstruction relative to multi-codebook benchmarks.

Unified Audio Modeling

By training on a diverse audio corpus covering speech, music, and general audio, UniSRCodec attains robust cross-domain generalization. The single codebook design minimizes domain bias and improves adaptation to downstream tasks.

6. Significance and Applications

UniSRCodec demonstrates that spectrogram-domain single-codebook quantization, combined with neural vocoder-based phase synthesis and frequency-aware reconstruction objectives, can achieve SOTA performance in ultra-low bitrate coding, expanding the practical applicability of neural audio codecs in bandwidth-constrained, cross-domain environments. Its simplicity of architecture and training, together with its performance advantages, position it as a strong candidate for integration in speech, music, and general audio coding pipelines, as well as downstream generative or audio-LLMs (Zhang et al., 6 Jan 2026).
