VibeVoice-ASR: Unified Speech Processing
- VibeVoice-ASR is a unified speech understanding framework that processes long-form, multilingual audio in a single pass, merging ASR, diarization, and timestamping.
- It employs a dual-tokenizer strategy and prompt-based context injection to maintain global coherence and handle challenges like code-switching and overlapping speech.
- Empirical results show significant word error rate improvements on datasets such as AISHELL-4 over conventional chunked ASR systems, and large diarization error rate gains over Gemini baselines on multi-speaker benchmarks.
VibeVoice-ASR is a general-purpose speech understanding framework engineered to address limitations in long-form audio processing, particularly context fragmentation and multi-speaker complexity, which are inadequately handled by conventional chunked and pipelined automatic speech recognition (ASR) systems. It is built atop the VibeVoice architecture and is designed for unified, single-pass processing of up to 60 minutes of continuous audio, supporting multilingual input, seamless code-switching, and direct context injection for domain adaptation (Peng et al., 26 Jan 2026).
1. Unified End-to-End Formulation and Architecture
VibeVoice-ASR reformulates the speech understanding pipeline—merging ASR, speaker diarization, and timestamping—into a single auto-regressive generation task. Instead of modular cascades, the system ingests raw audio and maps it via two specialized pre-trained tokenizers (acoustic and semantic) into a joint continuous latent sequence. This latent representation is then consumed by a decoder-only LLM backbone, such as Qwen-2.5, that generates a structured “Rich Transcription” stream consisting of:
- Speaker identity ("Who")
- Timestamp boundaries ("When")
- Content tokens ("What")
The training objective maximizes the joint likelihood $P(\text{Who}, \text{When}, \text{What} \mid \text{audio})$, resulting in a model that produces speaker-attributed and time-aligned transcripts in a unified manner. The decoder attends jointly over encoded audio and injected prompts, producing outputs auto-regressively. The architectural compression allows mapping one hour of audio to approximately 27,000 latent tokens, enabling end-to-end processing within the context window of modern LLMs (up to 65,536 tokens) (Peng et al., 26 Jan 2026).
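The token budget above follows directly from the figures in the paper (24 kHz input, 3,200× downsampling) and can be checked with simple arithmetic:

```python
# Back-of-envelope token budget for single-pass long-form processing,
# using the figures stated in the paper.
SAMPLE_RATE_HZ = 24_000
DOWNSAMPLE_FACTOR = 3_200

tokens_per_second = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR  # 7.5 latent tokens/sec

def latent_tokens(minutes: float) -> int:
    """Approximate number of latent tokens for a given audio duration."""
    return round(minutes * 60 * tokens_per_second)

print(tokens_per_second)   # 7.5
print(latent_tokens(60))   # 27000 tokens for one hour, well under 65,536
```

One hour of audio thus occupies roughly 27,000 of the 65,536-token context window, leaving ample room for prompt tokens and the generated Rich Transcription stream.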
2. Prompt-Based Context Injection
The framework incorporates a prompt injection mechanism to integrate user-supplied context, improving accuracy for domain-specific terminology and polyphonic entity disambiguation. User prompts—comprising hotwords, keyword lists, or background descriptions—are tokenized and prepended to the audio latents, resulting in input of the form $[p_{1:m};\, a_{1:n};\, s_{1:n}]$, where $p$ are prompt tokens, $a$ are acoustic latents, and $s$ are semantic latents.
A single embedding matrix is used for all token types, with no increase in parameters. The attention mask is structured such that prompts are visible to all subsequent audio tokens, whereas acoustic latents observe only prior context. This design allows the decoder to dynamically condition output on supplied prompts, directly influencing transcription in a context-aware manner (Peng et al., 26 Jan 2026).
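The described visibility pattern can be sketched as a prefix-LM-style attention mask. Treating the prompt block as fully visible to itself is an assumption here (a plain causal prefix would also satisfy the description); the audio latents attend only to the prompt and to earlier audio positions:

```python
import numpy as np

def build_mask(n_prompt: int, n_audio: int) -> np.ndarray:
    """Boolean attention mask: entry (i, j) is True if position i may attend
    to position j. Prompt tokens occupy positions [0, n_prompt)."""
    n = n_prompt + n_audio
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[:n_prompt, :n_prompt] = True            # bidirectional prompt block (assumption)
    return mask

m = build_mask(3, 4)
assert m[3:, :3].all()   # every audio token sees the whole prompt
assert not m[3, 4]       # audio tokens remain causal among themselves
```

Because the mask only gates attention, this conditioning requires no extra parameters, consistent with the shared embedding matrix described above.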
3. Long-Form, Single-Pass Audio Processing
VibeVoice-ASR employs a dual-tokenizer strategy with an ultra-low frame rate, downsampling 24 kHz audio by 3,200× (≈7.5 tokens/sec), thereby compressing 60 minutes of continuous audio to roughly 27,000 tokens suitable for dense LLM attention. Unlike systems that rely on windowed chunking or stitching outputs, VibeVoice-ASR operates in a truly single-pass mode across the entire audio span.
To maintain global transcriptional coherence and prevent context fragmentation, the decoder leverages unrestricted self-attention over the input sequence and generates explicit speaker and timestamp tokens that enforce consistency of attribution and timing. Curriculum learning is employed during training to accommodate progressively longer input sequences, from 8,192 to 65,536 tokens (Peng et al., 26 Jan 2026).
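The curriculum over context lengths can be sketched as a simple doubling schedule. The doubling steps are an assumption for illustration; the paper states only the 8,192-to-65,536 range:

```python
def curriculum_context_lengths(start: int = 8_192, end: int = 65_536):
    """Yield a doubling schedule of training context lengths from `start`
    up to and including `end` (hypothetical schedule; only the endpoints
    come from the paper)."""
    length = start
    while length < end:
        yield length
        length *= 2
    yield end

print(list(curriculum_context_lengths()))  # [8192, 16384, 32768, 65536]
```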
4. Multilingualism and Code-Switching
The model is pre-trained on data spanning over 50 languages, mapped into a common latent space without explicit language tags. The semantic tokenizer ensures language-agnostic content alignment into the shared vocabulary. Synthetic multi-speaker, multi-lingual datasets—including engineered English/Chinese/code-switched dialogues generated with GPT-5 scripting and audio synthesis—are used to instill the ability to handle intra- and inter-utterance code-switching. The model transcribes mixed-language utterances dynamically, inferring language boundaries and content solely from audio latents (Peng et al., 26 Jan 2026).
5. Handling Domain Terminology and Polyphonic Speakers
Prompt-based context injection is central to the preservation and prioritization of domain-specific vocabulary. Hotword lists, entity gazetteers, and descriptive background are explicitly injected as prompts, and synthetic training scripts instruct the decoder to attend to these tokens. For overlapping speakers and diarization, the pre-training pipeline extracts speaker embeddings over overlapping 1.5 s windows (0.75 s hop), applies HDBSCAN clustering, and merges clusters if centroid cosine similarity exceeds 0.67. At inference, the “Rich Transcription” stream interleaves diarized speakers but serializes overlapping speech to the dominant speaker by design. Explicit separation-aware heads are identified as a target for future work (Peng et al., 26 Jan 2026).
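The cluster-merging step of the diarization pipeline can be sketched as follows. HDBSCAN clustering of the windowed speaker embeddings is assumed to have run upstream; only the centroid-similarity merge (threshold 0.67, per the paper) is shown, and the greedy relabeling strategy is an illustrative choice:

```python
import numpy as np

def merge_clusters(centroids: dict[int, np.ndarray], threshold: float = 0.67) -> dict[int, int]:
    """Greedily merge speaker clusters whose centroid cosine similarity
    exceeds `threshold`. Returns a mapping from original cluster id to
    merged cluster id."""
    labels = {k: k for k in centroids}
    keys = sorted(centroids)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            va, vb = centroids[a], centroids[b]
            cos = float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if cos > threshold:
                labels[b] = labels[a]  # fold cluster b into a's cluster
    return labels

labels = merge_clusters({
    0: np.array([1.0, 0.0]),
    1: np.array([0.99, 0.14]),  # nearly parallel to cluster 0 -> merged
    2: np.array([0.0, 1.0]),    # orthogonal -> kept separate
})
```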
6. Training Methodology and Evaluation
Data Pipeline
The training data pipeline comprises pseudo-labeled “in-the-wild” audio—segmented via VAD into sub-30s clips, transcribed and timestamped using Whisper-large-v3-turbo, diarized using WeSpeaker vblinkp embeddings with HDBSCAN, and subject to stringent quality filters. Synthetic datasets (≈6,000 hours) are created via a GPT-5 ↔ VibeVoice closed-loop for multi-speaker and code-switched dialog, with WER-based filtering. Long-form transcription restoration routines stitch and refine transcripts, and non-speech events ([Silence], [Music]) are tagged using GPT-Audio. The supervised fine-tuning dataset is weighted as follows: Standard Benchmarks:Music:Synthetic:Long-Form = 0.5:0.1:0.1:0.3.
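The stated mixture weights can be expressed as a simple weighted sampler over data sources (the sampling mechanism itself is an illustrative sketch; only the weights come from the paper):

```python
import random

# SFT mixture weights from the paper; they sum to 1.0.
MIXTURE = {
    "standard_benchmarks": 0.5,
    "music": 0.1,
    "synthetic": 0.1,
    "long_form": 0.3,
}

def sample_source(rng: random.Random) -> str:
    """Draw one data source according to the mixture weights."""
    return rng.choices(list(MIXTURE), weights=MIXTURE.values(), k=1)[0]

assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9
```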
Optimization and Objective
The model is optimized via cross-entropy loss (negative log-likelihood), specifically minimizing $\mathcal{L} = -\sum_{t} \log P(y_t \mid y_{<t}, X)$, where $y_t$ ranges over the output tokens of the Rich Transcription stream and $X$ denotes the concatenated prompt and audio latents. Optimization is conducted using the AdamW optimizer and standard hyperparameters, with curriculum learning applied to context window expansion (Peng et al., 26 Jan 2026).
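The sequence negative log-likelihood reduces to summing per-step target log-probabilities, as in this toy sketch (the token names and probabilities are illustrative only):

```python
import math

def nll_loss(logprobs: list[dict[str, float]], targets: list[str]) -> float:
    """Sequence NLL: -sum_t log P(y_t | y_<t, X).
    `logprobs[t]` maps each candidate token to its log-probability at step t,
    already conditioned on the prefix and the input latents."""
    return -sum(lp[y] for lp, y in zip(logprobs, targets))

# Toy example: two decoding steps, target tokens "a" then "b".
steps = [
    {"a": math.log(0.5),  "b": math.log(0.5)},
    {"a": math.log(0.25), "b": math.log(0.75)},
]
loss = nll_loss(steps, ["a", "b"])
```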
Performance Benchmarks
VibeVoice-ASR demonstrates strong performance across multiple tasks and datasets. On AISHELL-4, the system achieves DER 16.93% and WER 18.99%, compared to WhisperX (DER 14.55%, WER 29.69%). For multi-speaker transcription across AISHELL-4, AMI-IHM/SDM, AliMeeting, and 12 languages in the MLC-Challenge, VibeVoice-ASR delivers mean DER ≈3.4% (versus Gemini-2.5 DER 16.3%, Gemini-3 DER 33%), and tcpWER 15.7% (versus Gemini-2.5 28.9%, Gemini-3 58.8%), with observed absolute WER improvements of 1–2% on numerous languages (Peng et al., 26 Jan 2026).
7. Significance and Outlook
VibeVoice-ASR represents an advance in unified, large-scale speech understanding, illustrating that direct mapping of long-form audio to LLM-compatible latent streams, with explicit context injection, enables accurate, time-aligned transcription with integrated diarization and multi-lingual code-switching—all in a single shot, without recourse to sliding windows or pipelined post-processing. The framework establishes a foundation for extended applications in automated meeting transcription, podcast analysis, and other domains involving complex conversational audio. Future work includes explicit separation-aware modeling for enhanced polyphonic diarization (Peng et al., 26 Jan 2026).