Zero-Shot Duration Prediction
- Zero-shot duration prediction is a technique to infer duration variables in speech and time-series tasks without in-domain training.
- Transformer-based, flow-based, and reinforcement-learning architectures enable robust inference for natural prosody and precise timing.
- Empirical studies demonstrate reduced MAE and improved WER, highlighting its effectiveness in diverse TTS and forecasting applications.
Zero-shot duration prediction refers to the assignment of durations—typically acoustic frame counts or inter-event intervals—without access to any training data or explicit fine-tuning for the target speaker, time series domain, or context. In text-to-speech (TTS), this capability is critical for both speaker identity preservation and speech intelligibility, especially when generating utterances for previously unseen speakers or new language domains. In time-series applications, zero-shot duration forecasting generalizes to producing future duration or interval sequences without in-domain fitting. Research in this area now spans speech synthesis (TTS), forecasting architectures, diffusion and flow-based models, retrieval-augmented neural pipelines, and transformer-based in-context learners.
1. Foundational Definitions and Motivations
Zero-shot duration prediction is defined operationally as the inference of duration variables (phoneme durations in speech, inter-event intervals in time series) for domains, speakers, or contexts that were entirely absent during model training. In TTS, the predictor must generalize learned prosodic and timing priors to new voices, styles, or linguistic environments (Pandey et al., 22 Jul 2025). The importance lies in rhythm modeling (speech prosody, timing, naturalness), speaker-specific traits (rate, emphasis), and minimal data regimes.
Application domains include:
- Speech editing and controlled TTS (where durations align new text into existing audio (Tang et al., 2021, Peng et al., 26 May 2025, Li et al., 20 Jul 2025)).
- Time-series forecasting (event intervals, durations between critical phenomena (Huang et al., 4 Mar 2025, Ma et al., 9 Aug 2025, Auer et al., 3 Jun 2025)).
- Voice conversion and speaker adaptation (where duration style is key to identity (Pandey et al., 22 Jul 2025)).
Zero-shot duration prediction is a recurring bottleneck for intelligibility and speaker fidelity, as naive heuristics (e.g., global speaking rate) or generic predictors substantially degrade speech quality and temporal coherence (Li et al., 20 Jul 2025).
2. Model Architectures for Zero-Shot Duration Prediction
Speech Synthesis: Transformer-Based and Flow/Policy Models
- Transformer-based modules: In zero-shot TTS editing, the input phoneme sequence is first embedded via a small CNN and transformer encoder, producing context-aware phoneme embeddings. Reference durations, zeroed for newly inserted phonemes, are projected and summed with these embeddings. A further transformer encoder and two-layer MLP then regress the predicted durations. The output is fed into a length regulator to synchronize text and speech embeddings for mel-spectrogram synthesis (Tang et al., 2021).
- Non-autoregressive Continuous Normalizing Flow (CNF) predictors: Infilling-style predictors model the duration vector as a conditional flow between prior and empirical durations. CNF drives a parallel prediction scheme where missing durations are predicted given masked context durations and phonetic embeddings. The flow dynamics are governed by an ODE and trained via Conditional Flow Matching with optimal transport velocity targets (Pandey et al., 22 Jul 2025).
- Speaker-prompted predictors: Speaker-specific duration style is modeled by text-prompt cross-attention against a short mel-spectrogram sample, outputting speaker-conditioned duration estimates via MLP heads (Pandey et al., 22 Jul 2025).
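The length-regulation step that follows duration prediction can be sketched in a few lines. This is a minimal NumPy illustration (the function name `length_regulate` is ours, not from the cited work): each phoneme embedding is repeated for its predicted number of acoustic frames, producing a frame-level sequence aligned with the mel-spectrogram.

```python
import numpy as np

def length_regulate(phoneme_embeddings, durations):
    """Expand phoneme-level embeddings to frame level by repeating each
    embedding for its predicted number of acoustic frames."""
    return np.repeat(phoneme_embeddings, durations, axis=0)

# three phonemes with 4-dim embeddings and predicted durations 2, 1, 3 frames
H = np.arange(12, dtype=float).reshape(3, 4)
frames = length_regulate(H, [2, 1, 3])
print(frames.shape)  # (6, 4): one row per synthesized frame
```

A zero duration (e.g. a reference slot for newly inserted text before prediction) simply contributes no frames, which is why `np.repeat` is a natural fit here.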
Diffusion and Policy Optimization
- Duration policies as MDPs: DMOSpeech 2 formulates duration generation as a stochastic policy in a Markov Decision Process. The policy, implemented as a transformer encoder-decoder, outputs a distribution over duration classes, conditioned on text and a speech prompt (Li et al., 20 Jul 2025).
- Reinforcement learning optimization: The duration policy is trained to maximize an expected reward combining ASR word error rate and speaker similarity (cosine similarity of speaker embeddings). Optimization is performed using Group Relative Policy Optimization (GRPO), which mixes clipped PPO updates with KL regularization to stabilize training and anchor the policy to an initial supervised version (Li et al., 20 Jul 2025).
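The group-relative update can be sketched as follows. This is a simplified NumPy sketch under our own naming assumptions (the real system computes rewards from ASR and speaker-embedding models): rewards for a group of sampled duration sequences are normalized within the group, then fed into a PPO-style clipped surrogate with a KL penalty.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize rewards within a group of sampled duration sequences
    (one group per prompt), as in group-relative policy optimization."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2, beta=0.04, kl=0.0):
    """PPO-style clipped objective plus a KL term anchoring the policy
    to its supervised initialization (beta and clip_eps are illustrative)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)) - beta * kl)

# rewards blending (negated) WER and speaker similarity for 4 sampled durations
adv = group_relative_advantages([0.9, 0.4, 0.7, 0.1])
print(adv.round(2))  # zero-mean, unit-scale advantages within the group
```

Normalizing within the group removes the need for a learned value baseline, which is the main practical appeal of GRPO over vanilla PPO in this setting.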
Encoder-Decoder Predictors with Progress-Encoded Positioning
- Progress-Monitoring Rotary Position Embedding (PM-RoPE): VoiceStar uses a mechanism that normalizes positional encoding by fractional progress relative to the user-specified target duration. The decoder attends to the normalized position p/T, where T is the target length; when the progress ratio reaches unity, an end-of-sequence token is emitted, ensuring precise duration control (Peng et al., 26 May 2025). PM-RoPE also aligns text and speech tokens during cross-attention and enables robust extrapolation to durations well beyond those seen in training, without explicit changes to the model architecture or retraining.
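The progress-normalization idea can be sketched as a scalar toy (our simplification; the actual PM-RoPE rotates query/key pairs by angles derived from this progress ratio):

```python
import numpy as np

def progress_positions(step_count, target_len):
    """Fractional progress (step / target length) used to scale positional
    angles, so one schedule covers any user-specified duration."""
    return np.arange(step_count) / float(target_len)

def emit_eos(step, target_len):
    """Emit end-of-sequence once the progress ratio reaches unity."""
    return step / float(target_len) >= 1.0

pos = progress_positions(5, target_len=10)
print(pos)               # [0.  0.1 0.2 0.3 0.4]
print(emit_eos(10, 10))  # True
```

Because positions are expressed as fractions of the target length rather than absolute indices, the same encoding range is reused for a 5 s or a 50 s utterance, which is what enables the extrapolation behavior described above.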
Time-Series and Event Duration Forecasting
- PTM Sequential Fusion: SeqFusion enables zero-shot interval forecasting by selecting and fusing predictions from multiple pre-trained models. It projects time-series histories and each PTM’s domain into a common embedding space, selecting the top-k closest PTMs, running recursive blockwise prediction, and aggregating outputs via softmax-weighted fusion. This approach preserves privacy, avoids target-domain training, and readily adapts to duration-series tasks (Huang et al., 4 Mar 2025).
- Retrieval-Augmented and Covariate-Aware Transformers: QuiZSF and COSMIC employ large-scale retrieval and in-context covariate learning. QuiZSF uses a hierarchical index (CRB), multi-grained relational feature extraction (MSIL), and dual-branch cooperation for numerical and textual TS foundation models, optimizing prediction with MSE and MMD losses (Ma et al., 9 Aug 2025). COSMIC combines context patching, rotary position encodings, and quantile-output heads with a novel covariate augmentation regimen, enabling robust zero-shot duration forecasting incorporating static and dynamic contextual signals (Auer et al., 3 Jun 2025).
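The SeqFusion-style select-then-fuse step can be sketched with NumPy (a toy version under our own names; the real system embeds series and PTM domains with learned encoders and predicts blockwise):

```python
import numpy as np

def fuse_ptm_forecasts(series_emb, ptm_embs, ptm_preds, k=2):
    """Pick the k pre-trained models whose domain embeddings are closest to
    the series embedding, then softmax-weight their forecasts by -distance."""
    dists = np.linalg.norm(ptm_embs - series_emb, axis=1)
    top = np.argsort(dists)[:k]
    w = np.exp(-dists[top])
    w /= w.sum()
    return w @ ptm_preds[top]

series = np.array([0.0, 1.0])                            # query embedding
embs = np.array([[0.0, 1.1], [5.0, 5.0], [0.2, 0.9]])    # 3 PTM domain embeddings
preds = np.array([[1.0, 2.0], [9.0, 9.0], [1.2, 1.8]])   # each PTM's forecast
fused = fuse_ptm_forecasts(series, embs, preds, k=2)
print(fused)  # weighted blend of the two nearest PTMs' forecasts
```

Only the compact embedding of the target series leaves the client, which is the privacy property noted above: no raw history or fine-tuning gradients are exchanged.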
3. Training Objectives and Loss Functions
- Regression losses: Speech models frequently use L₁ or MSE losses on durations, either directly on frame counts (Tang et al., 2021, Pandey et al., 22 Jul 2025) or log-transformed durations to stabilize scale across phonemes and speakers (Pandey et al., 22 Jul 2025).
- Flow matching objectives: CNF-based predictors apply a squared error between flow field and optimal transport velocity along conditional paths, enforcing constant-speed transitions between context and target durations (Pandey et al., 22 Jul 2025).
- Policy gradients and metric optimization: Reinforcement signals in DMOSpeech 2 combine log-probability of transcription (via ASR) and cosine speaker similarity, normalized and blended at the instance level (Li et al., 20 Jul 2025). The GRPO surrogate loss combines importance-sampled group preference scores and a KL term anchoring to a supervised base.
- Composite loss functions: TTS editing frameworks combine duration losses with reconstruction losses on mel-spectrograms, balancing naturalness with rhythmic precision (Tang et al., 2021). Time-series systems blend MSE with distributional regularizers such as MMD to encourage output diversity and alignment with true temporal statistics (Ma et al., 9 Aug 2025).
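The MMD regularizer mentioned above can be sketched with an RBF kernel (our simplification; kernel choice and bandwidth in the actual pipeline may differ):

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared maximum mean discrepancy between two sample sets under an
    RBF kernel; blended with MSE as a distributional regularizer."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())

rng = np.random.default_rng(0)
pred_close = rng.standard_normal((64, 1))        # matches target distribution
pred_far = rng.standard_normal((64, 1)) + 5.0    # shifted distribution
target = rng.standard_normal((64, 1))
print(rbf_mmd2(pred_close, target) < rbf_mmd2(pred_far, target))  # True
```

Unlike MSE, which compares predictions pointwise, MMD penalizes mismatch between the *distributions* of predicted and true durations, discouraging the collapsed, low-variance outputs that plain regression tends to produce.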
4. Data Sources, Inference Protocols, and Evaluation
- Speech TTS datasets: Indian language corpora (IndicVoices, FLEURS), LibriSpeech, Seed-TTS, and EMILIA are commonly used for zero-shot splits, where test speakers and text never appear in training (Pandey et al., 22 Jul 2025, Peng et al., 26 May 2025, Li et al., 20 Jul 2025).
- Forecasting datasets: ETTh1/2, ECL, Traffic, Exchange-Rate, and ILI are established targets for zero-shot duration prediction in event and time series domains (Huang et al., 4 Mar 2025, Ma et al., 9 Aug 2025).
- Inference: TTS systems typically input an edited transcript and speaker prompt (mel-spectrogram or codec tokens), predict durations, regulate lengths in embedding space, and decode speech via vocoders (Griffin-Lim, Encodec) (Tang et al., 2021, Peng et al., 26 May 2025).
- Evaluation metrics:
- Speech: Phoneme-level/word-level MAE (frames), ASR word error rate (WER), speaker similarity (cosine Sim-o), Quality MOS (QMOS), Subjective similarity MOS (SMOS), prosody diversity (CV_f₀) (Tang et al., 2021, Pandey et al., 22 Jul 2025, Li et al., 20 Jul 2025, Peng et al., 26 May 2025).
- Time-series: MSE, RMSE, MAPE, SMAPE, MASE, weighted quantile loss (WQL) (Huang et al., 4 Mar 2025, Ma et al., 9 Aug 2025, Auer et al., 3 Jun 2025).
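The weighted quantile loss (WQL) listed above builds on the per-quantile pinball loss, sketched here (NumPy; WQL additionally averages over several quantile levels and normalizes by target magnitude, which we omit):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss at level q in (0, 1): under-prediction is
    weighted by q, over-prediction by (1 - q)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

y = np.array([10.0, 12.0, 9.0])
print(pinball_loss(y, y, q=0.5))        # 0.0: perfect median forecast
print(pinball_loss(y, y - 2.0, q=0.9))  # 1.8: under-prediction is costly at q=0.9
```

This asymmetry is what lets quantile-output heads (as in COSMIC) express calibrated uncertainty over durations rather than a single point estimate.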
| Model/Framework | Application Domain | Key Architecture |
|---|---|---|
| DMOSpeech 2 (Li et al., 20 Jul 2025) | Zero-shot TTS | Policy RL, GRPO |
| VoiceStar (Peng et al., 26 May 2025) | Zero-shot TTS w/ control | PM-RoPE, CPM training |
| SeqFusion (Huang et al., 4 Mar 2025) | Duration/interval forecasting | PTM fusion (embedding) |
| QuiZSF (Ma et al., 9 Aug 2025) | Zero-shot TS forecasting | RAG, MSIL, MCC |
| COSMIC (Auer et al., 3 Jun 2025) | Zero-shot forecasting + cov. | Patch-transformer, quantile head |
5. Comparative Results and Trade-offs
Empirical studies consistently demonstrate that high-precision zero-shot duration prediction is indispensable for intelligibility and speaker similarity in TTS. Notably:
- Transformer-based predictors with context-aware embeddings achieve a phoneme-level MAE of 22 ms, reducing word-level errors by nearly an order of magnitude over two-stage baselines (Tang et al., 2021).
- Speaker-prompted CNF predictors best preserve speaker traits for languages with high prosodic variance, while infill-style CNFs yield best overall intelligibility in more regular languages. No single duration strategy is optimal across all tasks or language domains (Pandey et al., 22 Jul 2025).
- RL-optimized duration policies in DMOSpeech 2 attain WER and speaker-similarity scores near the oracle best-of-8 and outperform SOTA systems on both metrics at a real-time factor of 0.032 (Li et al., 20 Jul 2025).
- PM-RoPE in VoiceStar delivers robust zero-shot extrapolation, maintaining naturalness and intelligibility at utterance durations up to 50 s without explicit duration predictors (Peng et al., 26 May 2025).
- In time-series and event forecasting, both fused PTM selectors (SeqFusion) and retrieval-augmented pipelines (QuiZSF) achieve top-1 accuracy in 75–87% of zero-shot settings, with COSMIC delivering state-of-the-art quantile loss and competitive pointwise metrics across datasets with covariates (Huang et al., 4 Mar 2025, Ma et al., 9 Aug 2025, Auer et al., 3 Jun 2025).
6. Design Guidelines, Limitations, and Future Directions
Duration prediction modules remain lightweight and interpretable, adaptable across speech and temporal domains, but require careful tuning to maximize both intelligibility and identity preservation. Recommended practices include:
- Use infill-style predictors when intelligibility on unseen tasks is paramount; prefer speaker-prompted approaches for maximal speaker-specificity (Pandey et al., 22 Jul 2025).
- For models such as VoiceStar, set the target duration precisely via the encoded length; errors in the duration estimate degrade transcription accuracy (Peng et al., 26 May 2025).
- When covariates are present (static or dynamic), in-context learning and patch-based transformers remain competitive with dedicated supervised models for interval/duration forecasting (Auer et al., 3 Jun 2025).
- Privacy is inherently preserved in sequential PTM selection and fusion approaches, as only compact summaries are transmitted and no fine-tuning occurs on the target (Huang et al., 4 Mar 2025).
- Hybrid strategies combining global prosodic style (prompt) and local timing (forced-aligned context) may further improve trade-offs in multilingual and data-scarce settings (Pandey et al., 22 Jul 2025).
Research in zero-shot duration prediction is converging on architectures that exploit cross-modal alignment, policy optimization, and representation learning to unify speech and time-series domains, with emphasis on robustness, scalability, and interpretable control.