UniSRCodec: Unified Low-Bitrate Neural Audio Codec
- UniSRCodec is a unified, low-bitrate neural audio codec that leverages a single-codebook VQ architecture with sub-band reconstruction to compress audio at ultra-low bitrates.
- It employs Mel-spectrogram-based time-frequency compression, a streamlined encoder-decoder pipeline, and neural vocoder-based phase synthesis to ensure high-fidelity audio generation.
- Performance benchmarks demonstrate superior fidelity and robustness across speech, music, and general audio, achieving state-of-the-art metrics at 0.52 kbps.
UniSRCodec is a unified, low-bitrate neural audio codec built on a single-codebook vector quantization (VQ) architecture with sub-band reconstruction. Its design addresses limitations in both multi-codebook and conventional single-codebook neural audio codecs (NACs), namely the structural complexity of the former and the fidelity and frequency coverage challenges of the latter. By integrating Mel-spectrogram-based time–frequency compression, a single codebook ("SimVQ"), sub-band loss scaling, and neural vocoder-based phase synthesis, UniSRCodec achieves high fidelity and robustness across speech, music, and general audio at bitrates as low as 0.52 kbps, and demonstrates state-of-the-art (SOTA) performance among cross-domain single-codebook codecs (Zhang et al., 6 Jan 2026).
1. Model Architecture and Signal Pathway
UniSRCodec employs a pipeline that transitions audio from the time domain to a compressed, quantized latent space and back, leveraging spectral representations and modern VQ techniques.
- Time–Frequency Compression: The input waveform $x$ at 44.1 kHz is first transformed via the short-time Fourier transform (STFT) into a complex spectrum $S = \mathrm{STFT}(x)$. The magnitude spectrum $|S|$ is mapped to a 128-band Mel-spectrogram via a Mel filterbank $\mathcal{M}$:

$$X_{\text{mel}} = \mathcal{M}\,|S|$$

Only the magnitude is retained; the phase is discarded.
- Encoder: Utilizes fully convolutional 2D ResBlocks (GroupNorm → SiLU → Conv), adapted from Open-MagViT2. The network expands channel dimensionality (1 → 128 → 256 → 512) while downsampling over both time and frequency through strides [2, 2, 4] (a 16× reduction per axis), giving a final latent of shape 512 × 8 × 8 (channels × time × frequency).
- Single-Codebook Quantizer (SimVQ): Adopts a frame-wise flattening strategy, grouping the 512 channels at each position of the 8 temporal and 8 Mel-frequency axes to produce 64 latent vectors $z \in \mathbb{R}^{512}$ per segment. Each vector is assigned to the nearest of the $2^{13} = 8192$ codebook vectors:

$$q(z) = \arg\min_{c_k \in \mathcal{C}} \lVert z - c_k \rVert_2$$

A commitment loss anchors encoder outputs to their assigned codebook entries.
- Decoder: Mirrors the encoder in reverse (channels [512, 256, 128, 1]; upsample factors [4, 2, 2]), reconstructing the Mel-spectrogram segment.
- Sub-Band Reconstruction: The Mel-spectrogram target is split into low- and high-frequency halves along the Mel axis, $X_{\text{low}}$ and $X_{\text{high}}$. A weighted L1 loss emphasizes the lower frequencies:

$$\mathcal{L}_{\text{sub}} = \lambda_{\text{low}} \lVert X_{\text{low}} - \hat{X}_{\text{low}} \rVert_1 + \lambda_{\text{high}} \lVert X_{\text{high}} - \hat{X}_{\text{high}} \rVert_1$$

where the weights satisfy $\lambda_{\text{low}} > \lambda_{\text{high}}$.
- Phase Recovery: The reconstructed Mel-spectrogram is synthesized into waveform audio using BigVGAN-v2, a pre-trained universal neural vocoder, which implicitly recovers the phase.
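The shape bookkeeping in this pipeline can be sanity-checked with a short sketch. This is illustrative only, not the authors' code: the 128 frames per segment, the random weights, and the distance computation are assumptions consistent with the dimensions stated above (128 Mel bands, strides [2, 2, 4] on both axes, 512 latent channels, a $2^{13}$-entry codebook).

```python
import numpy as np

# Trace tensor shapes through the encoder/quantizer path (hypothetical sketch).
mel_bands, mel_frames = 128, 128                # frames/segment: assumed
downsample = int(np.prod([2, 2, 4]))            # 16x reduction per axis
freq, time = mel_bands // downsample, mel_frames // downsample
assert (freq, time) == (8, 8)                   # 8 x 8 latent grid

channels, book_size = 512, 2 ** 13              # 8192 entries -> 13 bits/token
rng = np.random.default_rng(0)
codebook = rng.standard_normal((book_size, channels))
latent = rng.standard_normal((channels, time, freq))  # encoder output (C, T, F)

# Frame-wise flattening: one 512-dim vector per latent grid position.
vectors = latent.reshape(channels, -1).T        # shape (64, 512)

# Nearest-neighbour assignment via the expanded squared-L2 distance.
d2 = ((vectors ** 2).sum(1, keepdims=True)
      - 2.0 * vectors @ codebook.T
      + (codebook ** 2).sum(1))
tokens = d2.argmin(axis=1)                      # 64 integer tokens per segment
print(tokens.shape)                             # (64,)
```

The 8 × 8 grid flattened to 64 tokens per 1.49 s segment is what yields the ~43 tokens/s rate discussed in Section 3.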
2. Training Objectives and Optimization
Multiple loss terms shape the training process:
- Sub-Band Reconstruction Loss ($\mathcal{L}_{\text{sub}}$): As described above, accentuates low-frequency fidelity while preserving high-frequency detail.
- Discriminator Loss ($\mathcal{L}_{\text{disc}}$): A single-scale spectral discriminator (adapted from DAC) distinguishes real vs. reconstructed Mel-spectrograms.
- Adversarial Loss ($\mathcal{L}_{\text{adv}}$): Standard for GAN-style training against the discriminator.
- Feature-Matching Loss ($\mathcal{L}_{\text{fm}}$): Encourages generator output to match discriminator intermediate features.
- Commitment Loss ($\mathcal{L}_{\text{commit}}$): Anchors the encoder output $z_e$ to its assigned codebook entry $z_q$, preventing codebook collapse:

$$\mathcal{L}_{\text{commit}} = \lVert z_e - \mathrm{sg}[z_q] \rVert_2^2$$

with $\mathrm{sg}[\cdot]$ the stop-gradient operator.
The aggregate training objective is a weighted sum:

$$\mathcal{L} = \mathcal{L}_{\text{sub}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{fm}} \mathcal{L}_{\text{fm}} + \lambda_{\text{c}} \mathcal{L}_{\text{commit}}$$

with positive weighting coefficients $\lambda_{\text{adv}}$, $\lambda_{\text{fm}}$, $\lambda_{\text{c}}$.
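The two non-adversarial terms can be sketched numerically. In this sketch the weight values (1.0 vs. 0.5), array sizes, and variable names are illustrative assumptions, not the paper's coefficients; the stop-gradient only matters under autograd, so plain arrays suffice here.

```python
import numpy as np

rng = np.random.default_rng(0)
mel_true = rng.random((128, 128))                    # target Mel-spectrogram (F x T)
mel_pred = mel_true + 0.01 * rng.standard_normal((128, 128))

# Sub-band loss: split along the Mel axis, weight the low band more heavily.
low_t, high_t = np.split(mel_true, 2, axis=0)        # lower / upper 64 Mel bands
low_p, high_p = np.split(mel_pred, 2, axis=0)
lam_low, lam_high = 1.0, 0.5                         # assumed: lam_low > lam_high
l_sub = (lam_low * np.abs(low_t - low_p).mean()
         + lam_high * np.abs(high_t - high_p).mean())

# Commitment loss: pull encoder output z_e toward its codebook entry z_q
# (sg[z_q] would block gradients into the codebook during training).
z_e = rng.standard_normal(512)
z_q = z_e + 0.1 * rng.standard_normal(512)           # assigned codebook entry
l_commit = ((z_e - z_q) ** 2).mean()

print(f"sub-band L1: {l_sub:.4f}, commitment: {l_commit:.4f}")
```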
3. Compression Parameters and Bitrate Analysis
UniSRCodec achieves high compression rates by minimizing token rates and exploiting codebook size.
- Token Rate: Each 65,536-sample (1.49 s) segment is represented as a 128-band Mel-spectrogram, then compressed to an 8 × 8 latent grid for SimVQ, yielding 64 quantized tokens per segment (about 43 tokens/s, treated as a nominal 40 tokens/s).
- Bitrate Calculation:
- Codebook size $2^{13} = 8192$ → $\log_2 8192 = 13$ bits/token
- Bitrate: 40 tokens/s × 13 bits/token = 520 bps = 0.52 kbps
- Compression Ratio: Compared to 16-bit PCM audio at 44.1 kHz (705.6 kbps), this yields a compression ratio of approximately 1,356:1.
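The bitrate and compression-ratio arithmetic above can be verified directly:

```python
# Check the bitrate figures quoted in the section above.
tokens_per_s = 40
bits_per_token = 13                 # log2 of an 8192-entry codebook
bitrate_bps = tokens_per_s * bits_per_token
assert bitrate_bps == 520           # 0.52 kbps

pcm_bps = 44_100 * 16               # 16-bit mono PCM at 44.1 kHz
assert pcm_bps == 705_600           # 705.6 kbps

ratio = pcm_bps / bitrate_bps
print(int(ratio))                   # 1356
```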
4. Experimental Evaluation and Benchmarks
Extensive evaluation was conducted on speech, music, and general audio:
- Datasets: Training spanned 10,000 hours from VCTK, LibriTTS, CommonVoice (speech), MUSDB, Jamendo (music), and AudioSet (general audio). Testing used 1,000 clips each from LibriTTS-test-clean, MUSDB-test, and AudioSet-eval.
- Objective Metrics:
- Music & Audio: Mel-L1 distance and STFT distance at 44.1 kHz ("Mel-44", "STFT-44") and 16 kHz ("Mel-16", "STFT-16").
- Speech: Short-Time Objective Intelligibility (STOI, higher is better); Perceptual Evaluation of Speech Quality (PESQ, higher is better).
Summary Table: Objective Results
| Model | TPS | kbps/Nq | Mel-44 ↓ | STFT-44 ↓ | Mel-16 ↓ | STFT-16 ↓ | STOI ↑ | PESQ ↑ |
|---|---|---|---|---|---|---|---|---|
| WavTokenizer-Unified | 40 | 0.48/1q | 1.505 | 5.242 | 1.130 | 2.634 | 0.875 | 1.912 |
| UniCodec (single) | 75 | 1.30/1q | 1.376 | 5.169 | 0.903 | 2.401 | 0.940 | 2.870 |
| UniSRCodec-Base | 40 | 0.52/1q | 0.904 | 2.250 | 0.900 | 2.330 | 0.875 | 1.836 |
| UniSRCodec-Large | 176 | 2.29/1q | 0.729 | 2.049 | 0.692 | 2.093 | 0.941 | 2.727 |
At 40 TPS, UniSRCodec-Base outperforms WavTokenizer-Unified and other single-codebook models across all Mel/STFT metrics. UniSRCodec-Large surpasses or matches leading multi-codebook methods at a lower bitrate.
Subjective Listening (MUSHRA) Results
| Domain | UniSR-Base | UniSR-Large | UniCodec |
|---|---|---|---|
| General | 53.4 | 72.7 | 46.9 |
| Music | 62.8 | 81.0 | 33.9 |
| Speech | 76.9 | 83.2 | 83.5 |
Even at 40 TPS, UniSR-Base exceeds previous single-codebook models on general audio and music domains, and closely matches speech quality.
5. Comparative Analysis with Prior Approaches
Complexity and Efficiency
UniSRCodec's single-codebook design (SimVQ) yields a straightforward inference pipeline and enhances feasibility for integration within audio-LLMs, contrasting with the structural overhead and slower adaptation of multi-codebook (e.g., RVQ) codecs.
Training convergence is attainable within 12 hours on 8×RTX 4090 GPUs, demonstrating the method's efficiency.
Bandwidth and Reconstruction Quality
- Bandwidth efficiency: Reduces required bitrate to 0.52 kbps, compared to 0.9 kbps (WavTokenizer), 3.4 kbps (MelCap), and 2.88 kbps (SNAC).
- Fidelity: Achieves comparable or superior high-frequency (Mel-44, STFT-44) reconstruction relative to multi-codebook benchmarks.
Unified Audio Modeling
By training on a diverse audio corpus covering speech, music, and general audio, UniSRCodec attains robust cross-domain generalization. The single codebook design minimizes domain bias and improves adaptation to downstream tasks.
6. Significance and Applications
UniSRCodec demonstrates that spectrogram-domain single-codebook quantization, combined with neural vocoder-based phase synthesis and frequency-aware reconstruction objectives, can achieve SOTA performance in ultra-low bitrate coding, expanding the practical applicability of neural audio codecs in bandwidth-constrained, cross-domain environments. Its simplicity of architecture and training, together with its performance advantages, position it as a strong candidate for integration in speech, music, and general audio coding pipelines, as well as downstream generative or audio-LLMs (Zhang et al., 6 Jan 2026).