MMS-TTS-Ara: Arabic TTS Transformer
- The paper introduces a unified transformer-based TTS model for Arabic that shares encoder-decoder parameters between text and speech streams.
- It employs joint self-supervised objectives like masked-span prediction, denoising autoencoding, and cross-modal VQ alignment to enhance latent representation.
- It achieves near-ground-truth MOS scores on high-quality datasets and shows robust generalization in low-resource settings with additional ASR pre-fine-tuning.
MMS-TTS-Ara refers to a system for Arabic text-to-speech (TTS) synthesis, exemplified in current research by ArTST, a pre-trained Arabic text and speech transformer. ArTST is built upon the unified-modal SpeechT5 framework, originally developed for English, and rigorously adapted for Modern Standard Arabic (MSA) through large-scale, monolingual pre-training. The system covers the full TTS pipeline from raw undiacritized Arabic text input to naturalistic speech output via a neural vocoder, with a focus on modular architecture, cross-modal representation learning, and evaluation under both rich and low-resource conditions (Toyin et al., 2023).
1. Unified Transformer-Based Architecture
At the core of MMS-TTS-Ara, as realized in ArTST, is a single encoder–decoder Transformer. The model replicates the configuration of SpeechT5's "base" model: 12 self-attention/feed-forward layers per encoder and decoder, a feed-forward size of roughly 2048, and 12 attention heads per layer. A distinctive feature is strict parameter sharing between the text and speech streams: both modalities use the same encoder, decoder, and core transformer stack, with modality-specific "prenet" and "postnet" adapters attached at either end. The prenets project raw text tokens or 80-dimensional log-mel spectrogram frames into the transformer's input space, while the postnets (including a 5-layer 1D convolutional stack for speech) refine the outputs.
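The parameter-sharing idea can be sketched as follows. This is a minimal illustration, not the ArTST implementation: the hidden size of 768, the toy vocabulary size, and the single linear layer standing in for the full transformer stack are all assumptions.

```python
import numpy as np

D_MODEL = 768   # assumed hidden size (SpeechT5-base style)
N_MELS = 80     # log-mel bins, as stated in the text
VOCAB = 100     # hypothetical character-vocabulary size

rng = np.random.default_rng(0)

# Modality-specific prenets map either stream into the shared model space.
W_text = rng.standard_normal((VOCAB, D_MODEL)) * 0.02     # char embedding table
W_speech = rng.standard_normal((N_MELS, D_MODEL)) * 0.02  # mel-frame projection

# One weight matrix stands in for the shared encoder-decoder stack:
# both modalities pass through the *same* parameters.
W_shared = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

def encode_text(char_ids):
    """Text prenet + shared stack (ReLU as a stand-in nonlinearity)."""
    return np.maximum(W_text[char_ids] @ W_shared, 0.0)

def encode_speech(mel_frames):
    """Speech prenet + the very same shared stack."""
    return np.maximum(mel_frames @ W_speech @ W_shared, 0.0)

h_txt = encode_text(np.array([3, 17, 42]))          # (3, 768)
h_sp = encode_speech(rng.standard_normal((5, N_MELS)))  # (5, 768)
```

Both calls route through `W_shared`, mirroring how text and speech share one transformer while only the prenets differ.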
The model incorporates a VQ-VAE quantizer and codebook (size 500, inherited from HuBERT clustering) for enforcing discrete cross-modal representations. During pre-training, 10% of encoder outputs are swapped with quantized embeddings before decoder cross-attention, promoting robust codebook usage and facilitating latent alignment between text and speech.
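The 10% quantized-substitution step can be sketched as below, assuming nearest-neighbour lookup into the codebook; the toy embedding dimension and sequence length are illustrative, with only the 500-entry codebook size and the 10% swap rate taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOK_SIZE, D = 500, 16      # 500 codes as stated; D is a toy dimension

codebook = rng.standard_normal((CODEBOOK_SIZE, D))
enc_out = rng.standard_normal((40, D))   # 40 encoder time-steps

# Nearest-neighbour quantization of each encoder state.
dists = ((enc_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
quantized = codebook[dists.argmin(axis=1)]   # (40, D)

# Randomly substitute ~10% of positions with their quantized proxies
# before they reach decoder cross-attention.
swap = rng.random(enc_out.shape[0]) < 0.10
mixed = np.where(swap[:, None], quantized, enc_out)
```

Feeding the mixed sequence to the decoder forces the codebook entries to remain usable stand-ins for continuous encoder states, which is what promotes cross-modal latent alignment.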
Key pre-training hyperparameters are as follows:
| Hyperparameter | Value | Notes |
|---|---|---|
| Speech max len (batch) | 250k samples (≈15.6 s at 16 kHz) | |
| Text max len (batch) | 600 characters | |
| Pre-training steps | 200k | |
| Warm-up steps | 64k | |
| Learning rate | (Adam; value not given) | |
| Hardware | 4 × A100 GPUs | ~14 days wall-clock |
2. Joint Self-Supervised Pre-Training Objectives
Pre-training employs four self-supervised objectives to align and enrich both text and speech modalities:
- Speech masked-span prediction ($\mathcal{L}_{\text{MSP}}$): The model predicts HuBERT-derived cluster labels for randomly masked time-steps in the input speech, via a cross-entropy objective $\mathcal{L}_{\text{MSP}} = -\sum_{t \in \mathcal{M}} \log p(z_t \mid \hat{\mathbf{x}})$ over the masked set $\mathcal{M}$.
- Speech denoising auto-encoding ($\mathcal{L}_{\text{SDAE}}$): The model reconstructs clean speech from masked and pre-processed spectrograms, with an L1 loss on the decoder postnet output, $\mathcal{L}_{\text{SDAE}} = \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_1$.
- Text denoising auto-encoding ($\mathcal{L}_{\text{TDAE}}$): Analogous masking is performed in the text domain, with a cross-entropy reconstruction loss $\mathcal{L}_{\text{TDAE}} = -\sum_{t} \log p(c_t \mid \hat{\mathbf{c}})$.
- Cross-modal VQ alignment and diversity ($\mathcal{L}_{\text{VQ}} + \mathcal{L}_{\text{div}}$): Latent vectors are explicitly aligned to their quantized codebook proxies, with random substitution of 10% of encoder activations by codebook samples before cross-attention; a diversity term encourages uniform codebook usage.
This joint training is designed to promote shared semantic and acoustic structure while enforcing codebook variety across the latent space (Toyin et al., 2023).
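The joint objective amounts to a weighted sum of these terms. A trivial sketch follows; the equal default weights are an assumption, since the paper's weighting scheme is not reproduced here.

```python
def pretrain_loss(l_msp, l_sdae, l_tdae, l_vq, l_div,
                  weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Combine the pre-training objectives (VQ split into alignment and
    diversity terms) into one scalar training loss."""
    terms = (l_msp, l_sdae, l_tdae, l_vq, l_div)
    return sum(w * t for w, t in zip(weights, terms))

total = pretrain_loss(1.0, 2.0, 3.0, 4.0, 5.0)  # → 15.0 with unit weights
```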
3. Fine-Tuning for Arabic TTS
The Arabic TTS fine-tuning pipeline utilizes exclusively MSA data, exploiting both the Arabic Speech Corpus (ASC: 3.8 h train, 0.28 h test, 1,400 reference words) and the Classical Arabic TTS dataset (ClArTTS: 11.16 h train, 0.24 h test, 76k words). Input representation is strictly at the character level, using undiacritized Arabic letters, numerals, select English characters, and special symbols—yielding a vocabulary consisting of all necessary Arabic characters and limited extras.
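A hedged sketch of such a character-level encoder is given below; the exact symbol inventory and special tokens are assumptions, keeping only the undiacritized-letter input and 600-character cap stated above.

```python
# Hypothetical character vocabulary: undiacritized Arabic letters
# (U+0621..U+064A, i.e. stopping before the diacritic range at U+064B),
# digits, and a few punctuation extras.
ARABIC = [chr(c) for c in range(0x0621, 0x064B)]
EXTRA = list("0123456789 .!?")
vocab = {ch: i for i, ch in enumerate(["<pad>", "<unk>"] + ARABIC + EXTRA)}

def encode(text, max_len=600):
    """Map a string to character ids, truncating at the 600-char cap."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text[:max_len]]

ids = encode("مرحبا")   # 5 Arabic letters -> 5 ids, no unknowns
```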
For TTS, input sequences (max 600 characters) are projected via the text prenet, encoded, and decoded to an output sequence of 80-bin mel-spectrogram frames (capped at the 250k-sample / ≈15.6 s length used in pre-training), which are further refined by the speech postnet. Training employs standard seq2seq teacher forcing; dropout schedules and fine-tuning optimizer specifics are not specified, though pre-training used the Adam optimizer.
The TTS objective used during fine-tuning is strictly the reconstruction L1 (or L2) loss between predicted and ground-truth spectrograms:

$$\mathcal{L}_{\text{TTS}} = \frac{1}{T}\sum_{t=1}^{T} \lVert \mathbf{y}_t - \hat{\mathbf{y}}_t \rVert_1,$$

where $\mathbf{y}_t \in \mathbb{R}^{80}$ is the ground-truth log-mel frame at step $t$ and $\hat{\mathbf{y}}_t$ is the postnet-refined prediction.
There is no introduction of adversarial losses, explicit duration predictors, or guided attention penalties.
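This loss reduces to a frame-wise mean absolute error (L1 shown; swapping in a squared difference gives the L2 variant):

```python
import numpy as np

def tts_l1_loss(pred_mel, gt_mel):
    """Mean absolute error over all frames and all 80 mel bins."""
    assert pred_mel.shape == gt_mel.shape and pred_mel.shape[1] == 80
    return float(np.abs(pred_mel - gt_mel).mean())

loss = tts_l1_loss(np.ones((10, 80)), np.zeros((10, 80)))  # → 1.0
```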
4. Neural Vocoder and Synthesis Pipeline
For waveform synthesis, MMS-TTS-Ara leverages an off-the-shelf HiFi-GAN neural vocoder, using a pre-trained checkpoint originally developed for SpeechT5. The full synthesis chain is therefore: text input → ArTST model → 80-bin mel spectrogram → HiFi-GAN generator → 16 kHz waveform. This allows for robust mapping from encoded text to high-fidelity, natural speech, without the need for specialized Arabic waveform models.
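The chain can be sketched with stub stages; the frame-expansion factor and the hop size of 256 are illustrative assumptions, and the stubs merely stand in for the real ArTST and HiFi-GAN checkpoints.

```python
import numpy as np

def artst_text_to_mel(char_ids):
    """Stub for ArTST: maps a character sequence to 80-bin mel frames."""
    n_frames = 5 * len(char_ids)   # arbitrary expansion factor for the sketch
    return np.zeros((n_frames, 80))

def hifigan_vocoder(mel, hop_length=256):
    """Stub for HiFi-GAN: upsamples mel frames to a 16 kHz waveform."""
    return np.zeros(mel.shape[0] * hop_length)

mel = artst_text_to_mel([7, 3, 9])   # 3 chars -> 15 mel frames
wav = hifigan_vocoder(mel)           # 15 frames -> 3840 samples
```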
5. Quantitative Evaluation and Generalization
Performance is evaluated using the Mean Opinion Score (MOS) metric (1–5 scale) on held-out test utterances, as rated by 15 native Arabic speakers. For fully fine-tuned ArTST:
| Model | Corpus | MOS | Notes |
|---|---|---|---|
| Ground-truth (recorded) | ASC | 4.31 | |
| Ground-truth (recorded) | ClArTTS | 4.64 | |
| SpeechT5 (English pre-trained) | ClArTTS | 1.88 | Robotic/mispronounced speech |
| ArTST (Arabic pre-trained) | ClArTTS | 4.11 | |
| ArTST (Arabic pre-trained) | ASC | 3.44 | |
| ArTST* (MGB2 ASR pre-fine-tune) | ClArTTS | 4.31 | Approaches ground-truth |
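MOS figures like these are means over per-utterance ratings. A small sketch of the aggregation follows; the ratings are invented for illustration and are not the study's data.

```python
import math
import statistics

def mos_with_ci(ratings):
    """Mean opinion score with a normal-approximation 95% half-interval."""
    mean = statistics.mean(ratings)
    half = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half

# One rating per listener (15 raters, toy values on the 1-5 scale).
ratings = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 4, 4, 5]
mean, half = mos_with_ci(ratings)
```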
Key findings include:
- Monolingual Arabic pre-training is essential; models initialized from English SpeechT5 produce very poor Arabic speech (MOS < 2).
- ArTST achieves high-quality, near-ground-truth results on ClArTTS (MOS > 4) and demonstrates generalization to low-resource settings on the much smaller ASC.
- Additional pre-fine-tuning on large unaligned ASR data (MGB2) further boosts synthesis naturalness.
- Implicit diacritization is observed: the model correctly infers and renders vowelization and prosody despite undiacritized input text.
No Mel-cepstral distortion (MCD) scores are provided; MOS is the only reported quantitative metric.
6. Limitations and Future Directions
Several limitations and avenues for extension are highlighted:
- Pre-training is limited to MSA from a single broadcast dataset (MGB2), so dialectal and code-switched Arabic are not yet addressed.
- TTS fine-tuning details such as learning rate schedules and teacher-forcing ratios are not fully specified.
- No explicit duration modeling or adversarial objectives are included, which may limit prosodic expressivity.
- Future work could include:
- Dialect-specific pre-training and fine-tuning
- Integration of joint diacritization modules or grapheme-to-phoneme adapters
- Incorporation of learned duration predictors or prosody encoders
A plausible implication is that expanding coverage to dialectal and spontaneous speech, adding explicit prosodic modules, and detailed hyperparameter reporting may further improve quality and applicability (Toyin et al., 2023).