X-Codec-2.0: Neural Audio Codec Overview
- X-Codec-2.0 is a neural audio codec featuring frozen HuBERT, transformer encoders/decoders, and a shared vector quantization codebook for multilingual speech modeling.
- An added pooling layer and increased decoder hop size reduce the latent rate from 50 Hz to 25 Hz while boosting temporal efficiency and perceptual clarity.
- Experimental results indicate a significant MOS improvement across languages, facilitating scalable LLM integration and enhanced multilingual applications.
X-Codec-2.0 is a neural audio codec designed to perform high-fidelity compression and modeling of multilingual speech. The core architecture employs frozen HuBERT features as semantic representations, a transformer-based encoder–decoder setup, and a single shared vector quantization codebook. X-Codec-2.0 originally operated at a 50 Hz latent rate and a 16 kHz audio sampling rate, enabling effective speech modeling across languages but exhibiting limitations in temporal efficiency and audio clarity. A recent modification introduces an additional pooling layer and increases the decoder hop size, reducing the latent rate to 25 Hz and increasing the sampling rate to 24 kHz. This adjustment substantially improves efficiency and perceptual performance while maintaining architectural simplicity and compatibility with LLM pipelines (Zolkepli, 28 Jan 2026).
1. Architectural Components and Signal Flow
X-Codec-2.0’s foundational architecture integrates several sequential modules, optimized for neural speech coding:
- Frozen HuBERT Semantic Encoder: A pretrained HuBERT model processes 16 kHz raw audio into frame-level, language-agnostic semantic features, capturing high-level phonetic and acoustic content. All HuBERT weights remain frozen to ensure stable and transferable representation.
- Transformer Codec Encoder: These semantic representations are fed into a stack of transformer layers, refining contextual dependencies before quantization.
- Vector Quantization: Outputs from the codec encoder are discretized using a single codebook of 65,536 entries at a fixed frame rate.
- Transformer Codec Decoder (Vocoder-Style): The quantized tokens are decoded by a separate transformer, operating within a GAN-like loss framework, into waveform segments reconstructed by a waveform generator.
Textual block diagram:
[16 kHz waveform] → [HuBERT encoder] → [Transformer codec encoder] → [Quantizer @ 50 Hz] → [Transformer decoder] → [Waveform @ 16 kHz]
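The signal flow above can be traced at the level of frame counts. A minimal Python sketch (module internals are placeholders; only the temporal bookkeeping reflects the described architecture):

```python
# Shape-level sketch of the baseline X-Codec-2.0 signal flow (50 Hz latents).
# Module bodies are dummies; the hop-size arithmetic matches the description.

SAMPLE_RATE = 16_000   # Hz, baseline input rate
HOP = 320              # samples per latent frame -> 16,000 / 320 = 50 Hz

def hubert_encode(num_samples: int) -> int:
    """Frozen HuBERT stage: waveform samples -> frame count at 50 Hz."""
    return num_samples // HOP

def quantize(num_frames: int) -> list:
    """Single-codebook VQ: one token per frame (dummy indices; 65,536 entries)."""
    return [0] * num_frames

def decode(tokens: list) -> int:
    """Vocoder-style decoder: tokens -> reconstructed sample count."""
    return len(tokens) * HOP

frames = hubert_encode(SAMPLE_RATE)   # one second of audio
tokens = quantize(frames)
print(frames, decode(tokens))         # 50 tokens/s, 16,000 samples back
```

This makes the round-trip invariant explicit: one second of 16 kHz audio maps to 50 tokens and back to 16,000 samples.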
2. Temporal and Sampling Parameters
The original X-Codec-2.0 attains its 50 Hz latent rate with a hop size of 320 samples at a $16{,}000$ Hz sampling rate, yielding:

$$f_{\text{latent}} = \frac{16{,}000\ \text{Hz}}{320\ \text{samples}} = 50\ \text{Hz}$$
Thus, 50 discrete latent tokens per second are emitted for downstream tasks—impacting both temporal granularity and token sequence length.
3. Modified Configuration: 25 Hz Latent Rate and 24 kHz Sampling
To address efficiency and perceptual limits, two main modifications are introduced:
- Pooling Layer: A 1D average pooling layer (AvgPool1d with kernel_size=2, stride=2) is inserted before quantization, halving the temporal resolution of the feature map (50 Hz → 25 Hz).
- Decoder Hop Size Increase: The hop size is increased from 320 to 960 samples, and the sampling rate is raised to 24 kHz. The new base frame rate becomes:

$$f_{\text{latent}} = \frac{24{,}000\ \text{Hz}}{960\ \text{samples}} = 25\ \text{Hz}$$
The pooling factor and the increased hop size together yield a consistent effective latent rate of 25 Hz on both the encoder and decoder sides.
- Decoder Weight Interpolation: To accommodate the new hop size, the pretrained decoder’s generator-head weights are linearly resampled:

$$W' = \operatorname{interp}\left(W,\ d_{\text{old}} \rightarrow d_{\text{new}}\right)$$

where $d_{\text{old}}$ and $d_{\text{new}}$ are the old and new output dimensionalities, respectively.
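The three modifications can be sketched numerically. The AvgPool1d parameters and the use of `numpy.interp` for the weight resampling are assumptions for illustration; the source states only that pooling halves the rate and that the generator-head weights are "linearly resampled":

```python
import numpy as np

# 1) AvgPool1d(kernel_size=2, stride=2): halve a 50 Hz feature sequence
#    (1 s of encoder features, hypothetical feature dim 1024).
feats_50hz = np.random.randn(50, 1024)
feats_25hz = feats_50hz.reshape(25, 2, -1).mean(axis=1)   # average frame pairs

# 2) New base frame rate from the larger hop at 24 kHz.
frame_rate = 24_000 / 960                                 # 25.0 Hz

# 3) Linear resampling of a pretrained generator-head weight row from the
#    old output dimensionality (320) to the new one (960).
w_old = np.random.randn(320)
x_old = np.linspace(0.0, 1.0, w_old.size)
x_new = np.linspace(0.0, 1.0, 960)
w_new = np.interp(x_new, x_old, w_old)                    # shape (960,)

print(feats_25hz.shape, frame_rate, w_new.shape)
```

Pooling pairs of consecutive frames preserves content while halving sequence length, and the resampled weights let the pretrained decoder start from an informed initialization rather than random weights.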
4. Implementation Details
- Training Corpus: 16,000 hours of multilingual speech (over 100 languages), resampled to 24 kHz, in 5-second random crops without transcripts.
- Optimization: Adam optimizer with a cyclic learning rate schedule, batch size 20 per GPU, and 3 million fine-tuning steps on 2× RTX 3090 Ti GPUs in BF16 precision.
- Loss: Composite weighted loss; the mel-spectrogram loss is computed at 24 kHz.
- Validation Protocol: Every 4000 steps on 2560 held-out samples, gradient clipping at 1.0.
- Test Set: Common Voice 17 covering 116 languages; each language is evaluated over its top 500 longest VAD-trimmed utterances (20 s after VAD), totaling 48,489 clips.
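The stated settings can be collected into a minimal configuration sketch. Field names are illustrative, and hyperparameters not given above (Adam betas, cyclic learning-rate bounds, loss weights) are left as `None` rather than invented:

```python
# Training-configuration sketch assembling only the values stated above.
config = {
    "sample_rate_hz": 24_000,
    "crop_seconds": 5,            # random crops, no transcripts
    "optimizer": "adam",          # betas/eps not specified
    "lr_schedule": "cyclic",      # max/min learning rates not specified
    "lr_max": None,
    "lr_min": None,
    "batch_size_per_gpu": 20,
    "train_steps": 3_000_000,
    "precision": "bf16",
    "grad_clip_norm": 1.0,
    "validate_every_steps": 4_000,
    "val_samples": 2_560,
}

# Each training example is a fixed-length waveform crop:
samples_per_crop = config["sample_rate_hz"] * config["crop_seconds"]
print(samples_per_crop)   # 120,000 samples per 5-second crop at 24 kHz
```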
5. Quantitative Results and Comparative Analysis
UTMOSv2 mean opinion score (MOS) evaluation:
| Model | Latent Rate (Hz) | NL | EN | FR | IT | PL | PT | ES |
|---|---|---|---|---|---|---|---|---|
| DAC | 86 | 2.32 | … | … | … | … | … | 2.20 |
| DistilCodec | 93 | 2.42 | … | … | … | … | … | 2.19 |
| X-Codec-2.0 (baseline) | 50 | 2.17 | … | … | … | … | … | 2.14 |
| Ours (25 Hz, 24 kHz) | 25 | 2.46 | 2.25 | 2.22 | 2.30 | 2.39 | 2.39 | 2.31 |
- The 25 Hz, 24 kHz configuration improves MOS by +0.29 over the 50 Hz, 16 kHz baseline on Dutch (2.46 vs. 2.17), with competitive scores across the other representative languages.
- Aggregate evaluation over all 116 languages indicates superior UTMOSv2 performance compared to competing codecs at the same latent rate constraint.
- Listening tests and spectral analyses reveal enhanced clarity, particularly in high-frequency regions (above 8 kHz), and reduced perceptual muffling relative to baseline.
6. Trade-offs and Implications
- Token Sequence Efficiency: Halving the latent token rate from 50 Hz to 25 Hz halves the length of the discrete sequences, reducing LLM integration overhead and memory footprint.
- Perceptual Quality: Increasing the sampling rate to 24 kHz allows the model to capture and reproduce content up to 12 kHz, improving speech naturalness and intelligibility, especially in high-frequency bands.
- Preservation of Architecture: The introduced modifications apply only to the pooling layer and decoder hop size, leaving HuBERT and the codec encoder unchanged to enable efficient transfer learning and minimal retuning.
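The first two trade-offs reduce to simple arithmetic, sketched below (the 10-second utterance duration is an arbitrary example):

```python
# Token-budget and bandwidth arithmetic behind the trade-offs above.

def num_tokens(duration_s: float, latent_rate_hz: int) -> int:
    """Discrete tokens an LLM must attend over for a given utterance."""
    return int(duration_s * latent_rate_hz)

baseline = num_tokens(10.0, 50)   # 10 s at 50 Hz -> 500 tokens
modified = num_tokens(10.0, 25)   # 10 s at 25 Hz -> 250 tokens

# Nyquist limit: a 24 kHz sampling rate can represent content up to 12 kHz,
# versus 8 kHz at the baseline's 16 kHz rate.
nyquist_new = 24_000 // 2         # 12,000 Hz
nyquist_old = 16_000 // 2         #  8,000 Hz

print(baseline, modified, nyquist_new, nyquist_old)
```

Sequence length scales linearly with the latent rate, so every downstream attention cost over these tokens is cut in half while audible bandwidth grows by 4 kHz.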
7. Application Domains and Future Potential
- Low-Latency Multilingual TTS and LLM Pipelines: Shorter token sequences with maintained or improved quality are beneficial for speech synthesis pipelines coupled to LLMs.
- Speech-to-Speech Translation: The use of coarse, language-agnostic tokens at higher audio fidelity is conducive to end-to-end multilingual translation architectures.
- Unified Speech-Language Pretraining: The modified X-Codec-2.0 setup supports scalable research on joint modeling of speech and language, leveraging efficient representation learning and high-fidelity synthesis.
A plausible implication is that further advances in neural codec architectures could leverage similar pooling and hop-size adjustments to optimize both efficiency and perceptual quality, particularly for resource-constrained multilingual applications. The released source code and checkpoints facilitate reproduction and adoption within the research community (Zolkepli, 28 Jan 2026).