Non-Autoregressive Forced Alignment with LLMs
This presentation explores LLM-ForcedAligner, a novel approach that reformulates speech forced alignment as a slot-filling task. By leveraging large language models' strengths in long-context consistency and slot filling, this method achieves significant improvements in multilingual and long-form speech alignment accuracy while enabling non-autoregressive inference for faster processing.
Imagine trying to sync subtitles to a 5-hour multilingual podcast where every word must hit its precise timing mark. Traditional forced alignment systems struggle with this challenge, accumulating errors that create noticeable timing drift over long recordings.
Building on this challenge, forced alignment is the fundamental task of predicting start and end timestamps for each word in speech given its transcript. This technology enables everything from automated subtitling to building high-quality speech datasets for AI training.
Let's examine why current forced alignment methods fall short for modern applications.
Current methods face three critical limitations. They depend on language-specific phoneme models that are costly to maintain, they accumulate alignment errors that create noticeable drift in long recordings, and speech language models often hallucinate non-monotonic timestamps during autoregressive generation.
Traditional methods like MFA use statistical models and dynamic programming, while newer speech language models offer impressive multilingual and long-context abilities. However, their autoregressive nature creates new problems for precise timestamp prediction.
The authors introduce a clever reformulation that leverages language model strengths while avoiding their weaknesses.
The key insight is reformulating forced alignment as slot filling rather than text generation. They insert special time tokens into transcripts and train the model to predict discrete timestamp indices for these slots, enabling parallel prediction of all timestamps.
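To make this reformulation concrete, here is a minimal sketch of how time tokens might be interleaved into a transcript. The token name `<|time|>` and the one-slot-per-word layout are illustrative assumptions, not the paper's exact tokenization.

```python
# Hypothetical sketch of the slot-filling reformulation: insert a special
# time token after each word; the model later predicts a discrete
# timestamp index at every slot position, all in parallel.
TIME_TOKEN = "<|time|>"  # placeholder name; the actual special token may differ

def insert_time_slots(words):
    """Interleave a time slot after each word of the transcript."""
    tokens = []
    for word in words:
        tokens.append(word)
        tokens.append(TIME_TOKEN)  # slot to be filled with this word's timestamp index
    return tokens

print(insert_time_slots(["hello", "world"]))
# → ['hello', '<|time|>', 'world', '<|time|>']
```

Because the slots are ordinary input positions rather than generated tokens, every slot can be classified simultaneously in a single forward pass.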
This diagram illustrates the complete approach. During training, they replace timestamps with special time tokens and use dynamic slot insertion to improve generalization. The model learns to predict timestamp indices only at slot positions, and during inference, users can insert time tokens anywhere to get precise timestamps through non-autoregressive decoding.
The implementation uses an Audio Transformer encoder running at 12.5 Hz, creating 80-millisecond time bins that match the encoder frame rate. A linear classifier predicts among 3,750 possible timestamp classes, enabling alignment for recordings up to 500 seconds long.
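The 12.5 Hz frame rate implies the 80-millisecond quantization described above. A simple sketch of that mapping, with the class count taken from the presentation and the clamping behavior as an assumption:

```python
# Sketch of timestamp quantization at a 12.5 Hz encoder frame rate:
# a continuous timestamp maps to one of a fixed number of 80 ms bins,
# which the linear classifier predicts over. Clamping out-of-range
# values is an assumed detail.
FRAME_RATE_HZ = 12.5
BIN_MS = 1000 / FRAME_RATE_HZ      # 80 ms per time bin
NUM_CLASSES = 3750                 # timestamp classes, per the presentation

def timestamp_to_class(seconds):
    idx = int(seconds * FRAME_RATE_HZ)   # which 80 ms bin the timestamp falls in
    return min(idx, NUM_CLASSES - 1)     # clamp to the classifier's range

def class_to_timestamp(idx):
    return idx / FRAME_RATE_HZ           # bin start time in seconds

print(timestamp_to_class(1.0))   # → 12
print(class_to_timestamp(12))    # → 0.96
```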
The training strategy is crucial for enabling non-autoregressive inference. Unlike standard language model training, they use non-shifted targets where each slot predicts its own timestamp based on preceding context, with dynamic slot insertion during training to improve generalization.
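The non-shifted targets can be illustrated with a small sketch: each slot position carries its own timestamp class as the label, and every non-slot position is masked out of the loss. The names and the ignore index are illustrative conventions, not the paper's code.

```python
# Minimal sketch of non-shifted slot targets: unlike next-token LM
# training, position i predicts a label aligned with position i itself,
# and the loss is computed only at slot positions.
IGNORE = -100  # conventional ignore index for masked (non-slot) positions

def build_targets(tokens, timestamp_classes, time_token="<|time|>"):
    """Return per-position targets: a timestamp class at each slot,
    IGNORE everywhere else (non-shifted alignment of inputs and targets)."""
    targets, slots = [], iter(timestamp_classes)
    for tok in tokens:
        targets.append(next(slots) if tok == time_token else IGNORE)
    return targets

tokens = ["hello", "<|time|>", "world", "<|time|>"]
print(build_targets(tokens, [12, 25]))
# → [-100, 12, -100, 25]
```

Masking the loss this way lets a standard cross-entropy objective train the slot classifier without teaching the model to generate text.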
Let's examine how this approach performs against established baselines across multiple challenging scenarios.
They trained on an impressive 56,000-hour multilingual dataset spanning 10 languages, combining established corpora like LibriSpeech with internally collected data. Importantly, they created long-form test scenarios by concatenating speeches up to 500 seconds to evaluate drift resistance.
They used Accumulated Averaging Shift as their primary metric, measuring mean absolute timestamp error across all predicted positions. Testing included both pseudo-labeled data from established forced aligners and critically important human-labeled Chinese data for real-world validation.
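One plain reading of the metric, for intuition, is the mean absolute difference between predicted and reference timestamps over all aligned positions; the paper's exact formulation may differ in detail.

```python
# Illustrative version of Accumulated Averaging Shift (AAS): mean
# absolute timestamp error, in milliseconds, across all predicted
# positions. An assumed simplification of the paper's metric.
def accumulated_averaging_shift(pred_ms, ref_ms):
    assert len(pred_ms) == len(ref_ms) and pred_ms, "need matched, non-empty lists"
    return sum(abs(p - r) for p, r in zip(pred_ms, ref_ms)) / len(pred_ms)

print(accumulated_averaging_shift([100, 540], [80, 500]))
# → 30.0
```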
The results show dramatic improvements across multilingual raw speech tests. While baseline methods average around 130 milliseconds error, their approach achieves just 42.9 milliseconds average error, representing a 69 to 78 percent relative improvement with a single unified model.
Perhaps most impressively, the method maintains low error rates on long-form speech where traditional methods fail catastrophically. On 300-second recordings, they achieve 52.9 milliseconds error while baselines often drift by hundreds of milliseconds, with one test showing 24.8 versus 410.8 milliseconds error.
Although the method runs slightly slower than some baselines, with a real-time factor of 0.0159, the non-autoregressive inference provides crucial benefits: it prevents the timestamp hallucinations common in autoregressive models while predicting all timestamps in parallel.
Their ablation studies reveal important design choices. The 80-millisecond resolution matches encoder frame rates while balancing accuracy and generalization, and dynamic slot insertion at 50 percent rates during training significantly improves long-form performance.
Like any research, this work has important limitations that point toward future improvements.
The authors acknowledge several limitations. The method relies heavily on pseudo-labels from existing aligners like MFA for training, human evaluation covers only Chinese, and uneven language distribution in training data may impact performance for lower-resource languages.
This work has significant practical implications for content creators, researchers, and AI developers. It enables scalable multilingual subtitle generation, improves speech corpus construction for training better AI systems, and demonstrates slot filling as a powerful new paradigm for structured prediction tasks.
The researchers have shown that reformulating forced alignment as slot filling unlocks the multilingual and long-context strengths of language models while avoiding their autoregressive pitfalls. This elegant solution transforms a classic speech problem into a modern language modeling success story, achieving remarkable accuracy improvements across languages and durations. Visit EmergentMind.com to explore more cutting-edge research that's reshaping how we think about AI and speech processing.