Base-Token Disentanglement (BTD) Module
- The BTD module is a neural network component that disentangles mixed-speech mel-spectrograms into distinct token streams for the individual speakers.
- It employs a modular architecture with mel downsampling, intra-source transformers, and anti-consistency inter-source layers to achieve effective separation and compression at ultra-low bitrates.
- Empirical results and ablation studies validate its design, supporting efficient speech transmission in scenarios like real-time online meetings.
The Base-Token Disentanglement (BTD) module is a neural network component designed to factorize mixed-speech representations into discrete token streams, each corresponding to an underlying speaker, while jointly supporting ultra-low bitrate speech compression within the CodeSep framework. BTD operates directly on mel-spectrograms of overlapped speech, synthesizing source-differentiated “base tokens” suitable for subsequent expansion and waveform reconstruction via a codec-driven pathway. This capability enables the simultaneous achievement of speech separation and transmission efficiency, with only the base token streams requiring communication or storage, supporting transmission bitrates as low as 1 kbps (Du et al., 19 Jan 2026).
1. Module Placement and Function within CodeSep
BTD occupies a central position in the inference chain of CodeSep, which is structured as four sequential modules:
- Codec Encoder (MDCTCodec encoder): Converts input audio to latent, framewise representations.
- Base-Token Disentanglement (BTD): Receives mixed-speech mel-spectrograms and emits two base-token sequences, one per embedded speaker.
- Auxiliary-Token Serial Prediction (ATSP): Independently expands each base-token stream into a full multi-stage sequence of codec tokens via serial prediction.
- Codec Decoder (MDCTCodec decoder): Reconstructs separated speaker waveforms from combined base and auxiliary tokens.
During inference, a mixture signal is first transformed into an 80-dimensional mel-spectrogram. BTD processes this spectrogram to produce, for each frame, two sequences of discrete base tokens, each intended to represent a different speaker. These base tokens are then expanded by ATSP modules to full N-stage code token sequences, which the decoder ultimately maps back to time-domain waveforms. In deployment, the system transmits or stores only the base tokens—thereby leveraging the ensuing stages for local or downstream expansion—thus directly contributing to extremely low bitrate operation (Du et al., 19 Jan 2026).
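The dataflow above can be sketched at the shape level. This is a minimal, purely illustrative skeleton: every module body is a placeholder, and the constants `N_STAGES` (the "N" stages of codec tokens) and the eightfold downsampling factor are assumptions drawn from the architecture description below, not values confirmed beyond the source text.

```python
import numpy as np

SR = 16_000          # sampling rate (Hz)
FRAME_SHIFT = 80     # samples per mel frame -> 200 frames/s
N_MELS = 80
N_STAGES = 4         # assumed number of RVQ token stages ("N" in the text)

def mel_spectrogram(wav):
    """Placeholder: a real front end uses an STFT + mel filterbank."""
    n_frames = len(wav) // FRAME_SHIFT
    return np.zeros((n_frames, N_MELS))

def btd(mel):
    """Placeholder BTD: one base token per downsampled frame per speaker."""
    n_tokens = mel.shape[0] // 8          # three stride-2 convs -> /8 in time
    return [np.zeros(n_tokens, dtype=int) for _ in range(2)]

def atsp(base_tokens):
    """Placeholder ATSP: expand one base stage into N_STAGES token stages."""
    return np.stack([base_tokens] * N_STAGES)   # (N_STAGES, n_tokens)

mel = mel_spectrogram(np.zeros(SR))       # 1 second of audio
streams = [atsp(b) for b in btd(mel)]     # one token grid per speaker
print([s.shape for s in streams])         # [(4, 25), (4, 25)]
```

Only the `btd` outputs (one index per frame per speaker) would cross the channel; the `atsp` expansion and decoding happen at the receiver.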
2. Input and Output Structures
The BTD module works with strictly defined framing and encoding conventions:
- Input: 80-dimensional mel-spectrogram frames extracted using an 80-sample frame shift at a 16 kHz sampling rate, delivering high temporal resolution.
- Mel Downsampling: the Mel Downsampling Block consists of strided 1D convolutions (stride = 2) that reduce the temporal frame rate and project the features to 256 dimensions per frame.
- Output: for each time frame $t$ and each speaker $s \in \{1, 2\}$, BTD yields a categorical probability vector $\mathbf{p}_t^{(s)} \in \mathbb{R}^K$ over the $K$ first-stage codebook entries. The argmax index defines the discrete base token: $b_t^{(s)} = \arg\max_k \, p_t^{(s)}(k)$.
This design keeps the emitted stream compact: only one token per frame per speaker is output, substantially limiting the required transmission bandwidth.
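The framing conventions above fix the token rate by simple arithmetic, which the following sketch makes explicit. The bits-per-token figure depends on the codebook size, which the source does not state, so it is passed in as an explicit parameter rather than assumed.

```python
SR = 16_000
FRAME_SHIFT = 80            # mel frames at SR / FRAME_SHIFT = 200 Hz
DOWNSAMPLE = 2 ** 3         # three stride-2 convolutional layers

mel_rate = SR / FRAME_SHIFT         # 200.0 mel frames per second
token_rate = mel_rate / DOWNSAMPLE  # 25.0 base tokens per second per speaker

def bitrate_bps(bits_per_token, n_speakers=2):
    """Total transmitted bitrate for one base token per frame per speaker."""
    return token_rate * bits_per_token * n_speakers

print(mel_rate, token_rate)             # 200.0 25.0
print(bitrate_bps(bits_per_token=10))   # 500.0 bps with a 1024-entry codebook
```

The overall 1 kbps operating point quoted in the text would thus correspond to a particular codebook size and token budget; the values fed to `bitrate_bps` here are illustrative only.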
3. Internal Neural Architecture
The BTD is structurally modular, integrating downsampling, “intra-source” contextualization, and cross-source disentanglement:
- Mel Downsampling: three sequential 1D convolutional layers, each with stride 2, compress the input spectrogram eightfold in time, producing a 256-dimensional feature sequence $H$.
- Source-Intra Transformers: four Transformer layers with multi-head self-attention and feed-forward sublayers further encode $H$ into a contextualized representation $H^{\text{intra}}$.
- Anti-Consistency Source-Inter Transformer:
  - The Anti-Consistency Bias Generator (ACBG) yields two trainable bias vectors $\delta_1, \delta_2$, used to perturb the single intra-source representation into two disequilibrium states: $H_s = H^{\text{intra}} + \delta_s$, $s \in \{1, 2\}$.
  - The two perturbed representations are processed in parallel by four cross-attention Transformer blocks, producing speaker-specific representations $Z_1$ and $Z_2$.
- Token Heads: independent linear-plus-softmax layers map each $Z_s$ to per-frame probability vectors $\mathbf{p}_t^{(s)}$; discrete tokens are obtained via $b_t^{(s)} = \arg\max_k \, p_t^{(s)}(k)$.
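The anti-consistency idea can be sketched in a few lines: a single shared representation, two learned bias vectors that push it toward different states, and separate token heads. This is a toy numpy sketch, not the paper's implementation; the cross-attention blocks are omitted, and all dimensions and weights are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 25, 256, 1024   # frames, feature dim, codebook size (all assumed)

# Shared intra-source representation (stand-in for the transformer output).
h_intra = rng.standard_normal((T, D))

# ACBG: two trainable bias vectors pushing the shared representation
# into two different "disequilibrium" states, one per speaker.
bias = rng.standard_normal((2, D)) * 0.1

def token_head(z, W, b):
    """Linear + softmax token head; argmax gives the discrete base token."""
    logits = z @ W + b
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs.argmax(-1)               # (T,) token indices

W = [rng.standard_normal((D, K)) / np.sqrt(D) for _ in range(2)]
b = [np.zeros(K) for _ in range(2)]

tokens = [token_head(h_intra + bias[s], W[s], b[s]) for s in range(2)]
print(tokens[0].shape)                    # (25,)
```

The key design point is that the two streams share all context encoding and diverge only through the biases and heads, which is why removing the ACBG collapses the separation, as the ablation below reports.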
The anti-consistency design is empirically critical for effective disentanglement: ablation of the ACBG component (“w/o ACBG”) leads to significant degradation in objective speaker similarity metrics (Du et al., 19 Jan 2026).
4. Mathematical Formulation and Training Loss
The BTD’s learning objective explicitly addresses the arbitrary ordering of speakers in mixture data by adopting a permutation-invariant cross-entropy loss (PI-CE):
Let $X$ denote the mixed spectrogram, and let $y^{(1)}$ and $y^{(2)}$ be the ground-truth first-stage codebook index sequences from the Residual Vector Quantizer (RVQ), obtained by separately encoding the unmixed sources. The network computes two framewise distributions, $\big(\mathbf{p}^{(1)}, \mathbf{p}^{(2)}\big) = \mathrm{BTD}(X)$. The loss is then

$$\mathcal{L}_{\text{PI-CE}} = \min_{\pi \in \{(1,2),\,(2,1)\}} \; \sum_{s=1}^{2} \mathrm{CE}\!\left(\mathbf{p}^{(s)},\, y^{(\pi(s))}\right),$$

where $\mathrm{CE}$ denotes cross-entropy averaged over frames.
This enforces that each predicted base-token stream matches one of the ground-truth speaker token streams, regardless of their ordering in the mixture (Du et al., 19 Jan 2026).
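A permutation-invariant cross-entropy of this kind is short to write down. The sketch below is a generic two-speaker PI-CE in numpy, assuming framewise probability matrices of shape `(T, K)`; the toy check shows that predictions matching the *swapped* speaker order still achieve a low loss, because the minimum runs over both assignments.

```python
import numpy as np
from itertools import permutations

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of target indices under probs (T, K)."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

def pi_ce_loss(probs_pair, targets_pair):
    """Permutation-invariant CE over the two possible speaker assignments."""
    return min(
        sum(cross_entropy(probs_pair[s], targets_pair[perm[s]]) for s in range(2))
        for perm in permutations(range(2))
    )

# Toy check with a tiny codebook: near-one-hot predictions in swapped order.
T, K = 4, 8
targets = [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 0])]
one_hot = lambda idx: np.eye(K)[idx] * 0.99 + 0.01 / K
probs_swapped = [one_hot(targets[1]), one_hot(targets[0])]
print(round(pi_ce_loss(probs_swapped, targets), 3))   # 0.018
```

With more than two speakers the same `permutations` loop applies, at factorial cost; two-speaker PI training only ever compares the two orderings.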
5. Integration with Residual Vector Quantizer and Auxiliary-Token Serial Prediction
Training leverages the RVQ module: the ground truth for BTD consists of the first-stage RVQ indices obtained by encoding each single-speaker utterance. During validation or deployment:
- Only the base-token sequence is transmitted or stored for each speaker (one token per frame per speaker, about 1 kbps in total).
- At the receiver, an ATSP branch per speaker takes the corresponding base-token stream as input and autoregressively predicts the remaining $N-1$ token indices per frame via a sequence-modeling stack (LSTM + Conformer).
- The complete token streams, per source, are mapped via codebook lookups and summed, then passed to the MDCTCodec decoder for waveform reconstruction.
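The codebook-lookup-and-sum step is the standard residual vector quantization decode: each stage contributes one codebook vector per frame, and the latents are their sum. A toy numpy sketch, with all sizes invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STAGES, K, D, T = 4, 16, 8, 5   # stages, codebook size, code dim, frames (toy)

# One codebook per RVQ stage; the decoder sums the looked-up vectors
# across stages to reconstruct each frame's latent vector.
codebooks = rng.standard_normal((N_STAGES, K, D))

def rvq_decode(token_grid):
    """token_grid: (N_STAGES, T) index grid -> (T, D) summed latents."""
    return sum(codebooks[n][token_grid[n]] for n in range(N_STAGES))

tokens = rng.integers(0, K, size=(N_STAGES, T))
latents = rvq_decode(tokens)
print(latents.shape)              # (5, 8)
```

In CodeSep the first row of `token_grid` comes from BTD (transmitted) and the remaining rows from ATSP (predicted locally), after which the summed latents go to the MDCTCodec decoder.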
This design achieves the simultaneous goals of multi-speaker disentanglement (through parallel BTD branches and cross-stream regularization) and lossy coded transmission at ultra-low bitrates (Du et al., 19 Jan 2026).
6. Empirical Validation and Ablation
Extensive objective and subjective evaluations substantiate the crucial role of BTD and its anti-consistency mechanism. At 1 kbps, CodeSep demonstrates:
| Metric | CodeSep (BTD) | FCTS (1 kbps) | FSTC (1 kbps) |
|---|---|---|---|
| UTMOS | 3.14 | 1.34 | 1.99 |
| DNSMOS | 3.67 | 3.03 | 3.33 |
| NMOS | 3.65 | 2.96 | 3.24 |
| SMOS | 3.43 | 2.86 | 3.15 |
A/B/X speaker similarity tests confirm the necessity of the anti-consistency bias: ablation (“w/o ACBG”) reduces listener preference from 54.23% to 38.33%, a statistically significant drop. These results indicate that BTD’s combination of downsampling, intra- and inter-source Transformer processing, and the PI-CE loss robustly extracts speaker-partitioned base tokens, yielding separation and compression performance unattainable by baseline methods lacking such disentanglement (Du et al., 19 Jan 2026).
7. Context and Significance
The BTD module exemplifies an integrated approach to discrete-representation disentanglement under the coupled constraints of separation and rate efficiency. Its architecture combines shared intra-source encoding with explicit cross-source disentanglement, and its training incorporates permutation invariance to accommodate the inherent ordering ambiguity of multi-source mixtures. This approach is particularly relevant for scenarios such as real-time online meetings and dialogue archiving, where efficient, privacy-preserving, and accurate utterance separation and storage are paramount. A plausible implication is the generalization of BTD-style modules to multi-modal or generalized source-separation tasks where compact, interpretable, and disentangled representations are operationally beneficial (Du et al., 19 Jan 2026).