MMS-TTS-Ara: Arabic TTS Transformer

Updated 8 February 2026
  • The paper introduces a unified transformer-based TTS model for Arabic that shares encoder-decoder parameters between text and speech streams.
  • It employs joint self-supervised objectives like masked-span prediction, denoising autoencoding, and cross-modal VQ alignment to enhance latent representation.
  • It achieves near-ground-truth MOS scores on high-quality datasets and shows robust generalization in low-resource settings with additional ASR pre-fine-tuning.

MMS-TTS-Ara refers to a system or model for Arabic text-to-speech (TTS) synthesis, with particular relevance to research exemplified by ArTST, a pre-trained Arabic text and speech transformer. ArTST is built upon the unified-modal SpeechT5 framework, originally developed for English, and rigorously adapted for Modern Standard Arabic (MSA) through large-scale, monolingual pre-training. The system robustly covers the full TTS pipeline from raw undiacritized Arabic text input to naturalistic speech output via a neural vocoder, with a focus on modular architecture, cross-modal representation learning, and evaluation under both rich and low-resource conditions (Toyin et al., 2023).

1. Unified Transformer-Based Architecture

At the core of MMS-TTS-Ara, as realized in ArTST, is a single encoder–decoder Transformer architecture. The model replicates the configuration of SpeechT5’s “base” model: 12 self-attention and feed-forward layers in each of the encoder and decoder, a hidden size of $d_{model} \approx 768$, a feed-forward dimension of ~2048, and 12 attention heads per layer. A distinctive feature is the strict parameter sharing between the text and speech streams: both modalities use the same encoder, decoder, and core transformer stack, with modality-specific “prenet” and “postnet” adapters attached at either end. The prenets project raw text tokens or 80-dimensional log-mel spectrogram frames into the input space of the transformer layers, while the postnets (including a 5-layer 1D convolutional stack for speech) refine the outputs.
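The dimensions above can be collected into a small configuration sketch. The field names here are our own shorthand for illustration, not the actual ArTST/SpeechT5 configuration API:

```python
from dataclasses import dataclass

# Illustrative configuration collecting the "base" dimensions quoted above.
# Field names are assumptions, not the actual ArTST/SpeechT5 config class.
@dataclass
class ArTSTBaseConfig:
    encoder_layers: int = 12        # self-attention + feed-forward blocks
    decoder_layers: int = 12
    d_model: int = 768              # shared hidden size for text and speech
    ffn_dim: int = 2048
    attention_heads: int = 12
    n_mels: int = 80                # log-mel bins fed to the speech prenet
    speech_postnet_conv_layers: int = 5
    codebook_size: int = 500        # VQ codebook inherited from HuBERT clustering

cfg = ArTSTBaseConfig()
head_dim = cfg.d_model // cfg.attention_heads   # 64-dimensional attention heads
```

Note that the parameter sharing described above means a single `d_model`-wide stack serves both modalities; only the prenet/postnet adapters are modality-specific.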

The model incorporates a VQ-VAE quantizer and codebook (size 500, inherited from HuBERT clustering) for enforcing discrete cross-modal representations. During pre-training, 10% of encoder outputs are swapped with quantized embeddings before decoder cross-attention, promoting robust codebook usage and facilitating latent alignment between text and speech.
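The 10% substitution of encoder outputs with quantized embeddings can be sketched as follows. This is a minimal NumPy re-implementation of the idea, not the actual ArTST code; the shapes and the nearest-neighbour lookup are assumptions:

```python
import numpy as np

def mix_in_codebook(encoder_out, codebook, p=0.1, rng=None):
    """Swap a fraction p of encoder time-steps for their nearest codebook
    embeddings before decoder cross-attention (illustrative sketch of the
    10% substitution described above, not the actual ArTST implementation)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # squared Euclidean distance from every time-step to every codebook entry
    dists = ((encoder_out[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = codebook[dists.argmin(axis=1)]       # (T, d) quantized proxies
    swap = rng.random(encoder_out.shape[0]) < p    # which time-steps to replace
    return np.where(swap[:, None], nearest, encoder_out), swap

enc = np.random.default_rng(1).normal(size=(50, 8))    # toy encoder outputs
book = np.random.default_rng(2).normal(size=(16, 8))   # toy 16-entry codebook
mixed, swapped = mix_in_codebook(enc, book, p=0.1)
```

Feeding the decoder a stochastic mix of continuous and quantized representations in this way is what pressures the codebook to stay in active use.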

Key pre-training hyperparameters are as follows:

| Hyperparameter | Value | Notes |
|---|---|---|
| Speech max length (batch) | 250k samples | ≈15.6 s of 16 kHz audio |
| Text max length (batch) | 600 characters | |
| Pre-training steps | 200k | |
| Warm-up steps | 64k | |
| Learning rate | 2 × 10⁻⁴ | Adam optimizer |
| Hardware | 4 × A100 GPUs | ≈14 days of training |

2. Joint Self-Supervised Pre-Training Objectives

Pre-training employs four self-supervised objectives to align and enrich both text and speech modalities:

  • Speech masked-span prediction (MSP): the model predicts HuBERT-derived cluster labels $c_i$ for randomly masked time-steps $M$ in the input speech, with objective $\mathcal{L}_{MSP} = -\sum_{i \in M} \log p(c_i \mid E_{speech\_prenet}(x_{speech})_{\setminus M})$.
  • Speech denoising auto-encoding ($\mathcal{L}_{DAE\_speech}$): the model reconstructs clean speech from masked and pre-processed spectrograms, with a decoder-postnet output loss of $\| D_{speech\_postnet} \circ D(E_{speech\_prenet}(x_{speech})_{\setminus M}) - x_{speech} \|^2$.
  • Text denoising auto-encoding ($\mathcal{L}_{DAE\_text}$): similar masking is performed in the text domain, with the loss $-\sum_{t \in T} \log p(y_t \mid D_{text\_decoder}(y_{<t}, E_{text\_prenet}(x_{text})_{\setminus T}))$.
  • Cross-modal VQ alignment and diversity ($\mathcal{L}_{CML}$): latent vectors are explicitly aligned to their quantized proxies via $\sum_i \| E(\cdot)_i - \hat{z}_i \|^2 + \lambda L_{diversity}(\hat{z})$, with random substitution of 10% of encoder activations by codebook samples before cross-attention.

This joint training is designed to promote shared semantic and acoustic structure while enforcing codebook variety across the latent space (Toyin et al., 2023).
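As a concrete illustration, the masked-span prediction term reduces to a cross-entropy computed only over masked frames. A minimal NumPy sketch with toy shapes (averaging over masked positions rather than summing, which changes only the scale):

```python
import numpy as np

def masked_span_loss(logits, cluster_ids, mask):
    """Cross-entropy over masked time-steps only, mirroring L_MSP above.

    logits:      (T, K) unnormalized scores over K HuBERT clusters
    cluster_ids: (T,)   target cluster labels c_i
    mask:        (T,)   True where the input frame was masked
    """
    z = logits - logits.max(axis=1, keepdims=True)          # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(cluster_ids)), cluster_ids]  # per-frame NLL
    return nll[mask].mean()   # paper sums over M; the mean has the same optimum

rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 500))      # 500-entry cluster inventory, as above
targets = rng.integers(0, 500, size=20)
mask = np.arange(20) % 2 == 0            # pretend the even frames were masked
loss = masked_span_loss(logits, targets, mask)
```

Restricting the loss to masked frames is what forces the encoder to infer the missing acoustic content from surrounding context rather than copy its input.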

3. Fine-Tuning for Arabic TTS

The Arabic TTS fine-tuning pipeline uses exclusively MSA data, drawing on both the Arabic Speech Corpus (ASC: ≈3.8 h train, 0.28 h test, 1,400 reference words) and the Classical Arabic TTS dataset (ClArTTS: 11.16 h train, 0.24 h test, ≈76k words). Input representation is strictly at the character level, using undiacritized Arabic letters, numerals, select English characters, and special symbols, yielding a compact vocabulary of the necessary Arabic characters plus a few extras.
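A character-level front end of the kind described can be sketched in a few lines. The special tokens and character inventory here are illustrative assumptions, not the actual ArTST vocabulary:

```python
# Illustrative character-level front end in the spirit described above:
# undiacritized Arabic letters, digits, and a few extras, no phonemizer.
# Special tokens and the character inventory are assumptions for this sketch.
def build_vocab(chars):
    specials = ["<pad>", "<s>", "</s>", "<unk>"]
    return {c: i for i, c in enumerate(specials + sorted(set(chars)))}

def encode(text, vocab, max_len=600):
    """Map a string to character ids, truncating at the 600-character limit."""
    return [vocab.get(c, vocab["<unk>"]) for c in text[:max_len]]

arabic_letters = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
vocab = build_vocab(arabic_letters + "0123456789 .")
ids = encode("مرحبا", vocab)   # 5 undiacritized characters -> 5 ids
```

Because the input is undiacritized, any vowelization must be inferred by the model itself, which is exactly the implicit-diacritization behaviour reported in the evaluation below.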

For TTS, input sequences (max 600 characters) are projected through the text prenet, encoded, and decoded into an output sequence of 80-bin mel-spectrogram frames (max speech length 250k samples, ≈15.6 s), which the speech postnet then refines. Training employs standard seq2seq teacher forcing; dropout schedules and fine-tuning optimizer specifics are not reported, but pre-training used the Adam optimizer with a learning rate of $2 \times 10^{-4}$.

The TTS objective used during fine-tuning is strictly the reconstruction L1 (or L2) loss between predicted and ground-truth spectrograms:

$$\mathcal{L}_{TTS} = \| \hat{Y}_{speech} - Y_{speech} \|_1$$

where $\hat{Y}_{speech} = D_{speech\_postnet} \circ D \circ E_{text\_prenet}(x_{text})$.

There is no introduction of adversarial losses, explicit duration predictors, or guided attention penalties.
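The fine-tuning objective itself is a one-liner; a minimal NumPy sketch of the L1 spectrogram loss with toy shapes:

```python
import numpy as np

def tts_l1_loss(pred_mel, target_mel):
    """Plain L1 reconstruction loss between predicted and ground-truth
    80-bin mel spectrograms -- the only fine-tuning objective reported."""
    assert pred_mel.shape == target_mel.shape    # (frames, 80)
    return np.abs(pred_mel - target_mel).mean()

rng = np.random.default_rng(0)
target = rng.normal(size=(120, 80))              # toy ground-truth mel
loss_perfect = tts_l1_loss(target, target)       # exact prediction -> 0.0
loss_random = tts_l1_loss(rng.normal(size=(120, 80)), target)
```

The absence of adversarial or duration terms keeps this loss surface simple; all prosodic behaviour must therefore emerge from the pre-trained representations.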

4. Neural Vocoder and Synthesis Pipeline

For waveform synthesis, MMS-TTS-Ara leverages an off-the-shelf HiFi-GAN neural vocoder, using a pre-trained checkpoint originally developed for SpeechT5. The full synthesis chain is therefore: text input → ArTST model → 80-bin mel spectrogram → HiFi-GAN generator → 16 kHz waveform. This provides a robust mapping from encoded text to high-fidelity, natural speech without requiring a specialized Arabic waveform model.
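The synthesis chain is a straightforward composition of the two models; a hedged sketch with placeholder callables standing in for the real ArTST and HiFi-GAN checkpoints (the hop length and frames-per-character here are invented for illustration):

```python
import numpy as np

def synthesize(text, acoustic_model, vocoder, sr=16000):
    """Compose the pipeline: text -> mel spectrogram -> waveform.
    `acoustic_model` and `vocoder` are placeholder callables, not the
    actual ArTST / HiFi-GAN checkpoints."""
    mel = acoustic_model(text)        # (frames, 80) log-mel spectrogram
    wav = vocoder(mel)                # 1-D waveform at 16 kHz
    return wav, len(wav) / sr         # waveform and its duration in seconds

# Dummy stand-ins: ~10 mel frames per character, upsampled by a hop of 256.
fake_artst = lambda text: np.zeros((10 * len(text), 80))
fake_hifigan = lambda mel: np.zeros(mel.shape[0] * 256)

wav, seconds = synthesize("مرحبا", fake_artst, fake_hifigan)
```

Keeping the vocoder as a separate, frozen stage is what allows the English SpeechT5 HiFi-GAN checkpoint to be reused unchanged for Arabic.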

5. Quantitative Evaluation and Generalization

Performance is evaluated using the Mean Opinion Score (MOS) metric (1–5 scale) on held-out test utterances, as rated by 15 native Arabic speakers. For fully fine-tuned ArTST:

| Model | Corpus | MOS | Notes |
|---|---|---|---|
| Ground truth (recorded) | ASC | 4.31 | |
| Ground truth (recorded) | ClArTTS | 4.64 | |
| SpeechT5 (English pre-trained) | ClArTTS | 1.88 | Robotic, mispronounced speech |
| ArTST (Arabic pre-trained) | ClArTTS | 4.11 | |
| ArTST (Arabic pre-trained) | ASC | 3.44 | |
| ArTST* (MGB2 ASR pre-fine-tuned) | ClArTTS | 4.31 | Matches ground truth |

Key findings include:

  • Monolingual Arabic pre-training is essential; models initialized from English SpeechT5 produce very poor Arabic speech (MOS < 2).
  • ArTST achieves high-quality, near-ground-truth results on ClArTTS (MOS > 4) and demonstrates generalization to low-resource settings on the much smaller ASC.
  • Additional pre-fine-tuning on large unaligned ASR data (MGB2) further boosts synthesis naturalness.
  • Implicit diacritization is observed: the model correctly infers and renders vowelization and prosody despite undiacritized input text.

No Mel-cepstral distortion (MCD) scores are provided; MOS is the only reported quantitative metric.

6. Limitations and Future Directions

Several limitations and avenues for extension are highlighted:

  • Pre-training is limited to MSA from a single broadcast dataset (MGB2), so dialectal and code-switched Arabic are not yet addressed.
  • TTS fine-tuning details such as learning rate schedules and teacher-forcing ratios are not fully specified.
  • No explicit duration modeling or adversarial objectives are included, which may limit prosodic expressivity.
  • Future work could include:
    • Dialect-specific pre-training and fine-tuning
    • Integration of joint diacritization modules or grapheme-to-phoneme adapters
    • Incorporation of learned duration predictors or prosody encoders

A plausible implication is that expanding coverage to dialectal and spontaneous speech, adding explicit prosodic modules, and detailed hyperparameter reporting may further improve quality and applicability (Toyin et al., 2023).
