
VibeVoice-ASR: Unified Speech Processing

Updated 30 January 2026
  • VibeVoice-ASR is a unified speech understanding framework that processes long-form, multilingual audio in a single pass, merging ASR, diarization, and timestamping.
  • It employs a dual-tokenizer strategy and prompt-based context injection to maintain global coherence and handle challenges like code-switching and overlapping speech.
  • Empirical results show significant improvements in diarization error rate and word error rate on datasets such as AISHELL-4 compared to conventional chunked ASR systems.

VibeVoice-ASR is a general-purpose speech understanding framework engineered to address limitations in long-form audio processing, particularly context fragmentation and multi-speaker complexity, which are inadequately handled by conventional chunked and pipelined automatic speech recognition (ASR) systems. It is built atop the VibeVoice architecture and is designed for unified, single-pass processing of up to 60 minutes of continuous audio, supporting multilingual input, seamless code-switching, and direct context injection for domain adaptation (Peng et al., 26 Jan 2026).

1. Unified End-to-End Formulation and Architecture

VibeVoice-ASR reformulates the speech understanding pipeline—merging ASR, speaker diarization, and timestamping—into a single auto-regressive generation task. Instead of modular cascades, the system ingests raw audio x and maps it via two specialized pre-trained tokenizers (acoustic and semantic) into a joint continuous latent sequence z. This latent representation is then consumed by a decoder-only LLM backbone, such as Qwen-2.5, that generates a structured “Rich Transcription” stream consisting of:

  • Speaker identity s_t (“Who”)
  • Timestamp boundaries τ_t (“When”)
  • Content tokens y_t (“What”)

The training objective maximizes the joint likelihood p(s_{1:T}, τ_{1:T}, y_{1:T} | z), resulting in a model that produces speaker-attributed and time-aligned transcripts in a unified manner. The decoder attends jointly over encoded audio and injected prompts, producing outputs auto-regressively. The architectural compression allows mapping one hour of audio to approximately 27,000 latent tokens, enabling end-to-end processing within the context window of modern LLMs (up to 65,536 tokens) (Peng et al., 26 Jan 2026).
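The interleaved “Who/When/What” output above can be sketched as a small parser for the decoded stream. The exact token syntax (`<spk:…>`, `<t:…-…>`) is an assumption for illustration; the paper specifies only that speaker, timestamp, and content tokens are interleaved.

```python
from dataclasses import dataclass
import re

@dataclass
class Segment:
    speaker: str   # "Who"
    start: float   # "When" (seconds)
    end: float
    text: str      # "What"

# Hypothetical serialization of the "Rich Transcription" stream.
PATTERN = re.compile(r"<spk:(\w+)><t:([\d.]+)-([\d.]+)>\s*([^<]*)")

def parse_rich_transcription(stream: str) -> list[Segment]:
    """Parse a decoded interleaved token string into speaker-attributed,
    time-aligned segments."""
    return [
        Segment(speaker=m[0], start=float(m[1]), end=float(m[2]),
                text=m[3].strip())
        for m in PATTERN.findall(stream)
    ]

segments = parse_rich_transcription(
    "<spk:A><t:0.00-2.40> Hello everyone. <spk:B><t:2.40-5.10> Hi, thanks."
)
```

Because the decoder emits speaker and timestamp tokens inline, no separate diarization or alignment pass is needed to recover this structure.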

2. Prompt-Based Context Injection

The framework incorporates a prompt injection mechanism to integrate user-supplied context, improving accuracy for domain-specific terminology and polyphonic entity disambiguation. User prompts—comprising hotwords, keyword lists, or background descriptions—are tokenized and prepended to the audio latents, resulting in input of the form [p_1, ..., p_P, a_1, ..., a_N, c_1, ..., c_M], where p_i are prompt tokens, a_j are acoustic latents, and c_k are semantic latents.

A single embedding matrix is used for all token types, with no increase in parameters. The attention mask is structured such that prompts are visible to all subsequent audio tokens, whereas acoustic latents observe only prior context. This design allows the decoder to dynamically condition output on supplied prompts, directly influencing transcription in a context-aware manner (Peng et al., 26 Jan 2026).
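The masking scheme above can be sketched concretely: because prompt tokens are prepended, a plain causal mask already gives the described visibility pattern. A minimal NumPy sketch (the mask convention, 1 = “may attend”, is an assumption):

```python
import numpy as np

def build_mask(num_prompt: int, num_audio: int) -> np.ndarray:
    """mask[i, j] = 1 iff position i may attend to position j.
    With prompt tokens prepended, a lower-triangular (causal) mask
    yields the stated behaviour: every audio latent sees all prompt
    tokens, while audio latents see only prior audio context."""
    n = num_prompt + num_audio
    return np.tril(np.ones((n, n), dtype=np.int8))

mask = build_mask(num_prompt=3, num_audio=5)
```

Note that this adds no parameters: the same embedding matrix and attention machinery serve prompt and audio positions alike, matching the paper's design.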

3. Long-Form, Single-Pass Audio Processing

VibeVoice-ASR employs a dual-tokenizer strategy with an ultra-low frame rate, downsampling 24 kHz audio by 3,200× (≈7.5 tokens/sec), thereby compressing 60 minutes of continuous audio to roughly 27,000 tokens suitable for dense LLM attention. Unlike systems that rely on windowed chunking or stitching outputs, VibeVoice-ASR operates in a truly single-pass mode across the entire audio span.
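The compression arithmetic above can be checked directly from the stated figures (24 kHz input, 3,200× downsampling, 65,536-token context):

```python
SAMPLE_RATE_HZ = 24_000     # input audio sample rate
DOWNSAMPLE_FACTOR = 3_200   # dual-tokenizer compression ratio

tokens_per_second = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR   # 7.5 tokens/sec
tokens_per_hour = int(tokens_per_second * 3_600)         # 27,000 tokens

LLM_CONTEXT = 65_536        # decoder context window
fraction_used = tokens_per_hour / LLM_CONTEXT            # well under half
```

At roughly 27,000 tokens per hour, a full 60-minute recording occupies less than half the context window, leaving room for prompts and the generated transcription stream.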

To maintain global transcriptional coherence and prevent context fragmentation, the decoder leverages unrestricted self-attention over the input sequence and generates explicit speaker and timestamp tokens that enforce consistency of attribution and timing. Curriculum learning is employed during training to accommodate progressively longer input sequences, from 8,192 to 65,536 tokens (Peng et al., 26 Jan 2026).
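A curriculum schedule over the stated range might look like the following sketch. The paper gives only the endpoints (8,192 and 65,536 tokens); doubling between stages is an assumption for illustration:

```python
def curriculum_lengths(start: int = 8_192, end: int = 65_536) -> list[int]:
    """Progressively longer context windows for curriculum learning.
    The doubling schedule is assumed; only the 8,192 -> 65,536 range
    is stated in the paper."""
    lengths, n = [], start
    while n <= end:
        lengths.append(n)
        n *= 2
    return lengths

stages = curriculum_lengths()
```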

4. Multilingualism and Code-Switching

The model is pre-trained on data spanning over 50 languages, mapped into a common latent space without explicit language tags. The semantic tokenizer ensures language-agnostic content alignment into the shared vocabulary. Synthetic multi-speaker, multi-lingual datasets—including engineered English/Chinese/code-switched dialogues generated with GPT-5 scripting and audio synthesis—are used to instill the ability to handle intra- and inter-utterance code-switching. The model transcribes mixed-language utterances dynamically, inferring language boundaries and content solely from audio latents (Peng et al., 26 Jan 2026).

5. Handling Domain Terminology and Polyphonic Speakers

Prompt-based context injection is central to the preservation and prioritization of domain-specific vocabulary. Hotword lists, entity gazetteers, and descriptive background are explicitly injected as prompts, and synthetic training scripts instruct the decoder to attend to these tokens. For overlapping speakers and diarization, the pre-training pipeline extracts speaker embeddings over overlapping 1.5 s windows (0.75 s hop), applies HDBSCAN clustering, and merges clusters if centroid cosine similarity exceeds 0.67. At inference, the “Rich Transcription” stream interleaves diarized speakers but serializes overlapping speech to the dominant speaker by design. Explicit separation-aware heads are identified as a target for future work (Peng et al., 26 Jan 2026).
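The diarization pre-processing described above (1.5 s windows, 0.75 s hop, merge at centroid cosine similarity > 0.67) can be sketched as follows; the HDBSCAN step itself is omitted, and the window-generation details are assumptions around the stated parameters:

```python
import numpy as np

WINDOW_S, HOP_S, MERGE_COS = 1.5, 0.75, 0.67

def embedding_windows(duration_s: float) -> list[tuple[float, float]]:
    """(start, end) times of the overlapping 1.5 s windows (0.75 s hop)
    over which speaker embeddings are extracted."""
    starts = np.arange(0.0, max(duration_s - WINDOW_S, 0.0) + 1e-9, HOP_S)
    return [(float(s), float(s + WINDOW_S)) for s in starts]

def should_merge(centroid_a, centroid_b) -> bool:
    """Merge two speaker clusters when the cosine similarity of their
    centroids exceeds the 0.67 threshold."""
    a = np.asarray(centroid_a, dtype=float)
    b = np.asarray(centroid_b, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos > MERGE_COS

windows = embedding_windows(3.0)
```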

6. Training Methodology and Evaluation

Data Pipeline

The training data pipeline comprises pseudo-labeled “in-the-wild” audio—segmented via VAD into sub-30s clips, transcribed and timestamped using Whisper-large-v3-turbo, diarized using WeSpeaker vblinkp embeddings with HDBSCAN, and subject to stringent quality filters. Synthetic datasets (≈6,000 hours) are created via a GPT-5 ↔ VibeVoice closed-loop for multi-speaker and code-switched dialog, with WER-based filtering. Long-form transcription restoration routines stitch and refine transcripts, and non-speech events ([Silence], [Music]) are tagged using GPT-Audio. The supervised fine-tuning dataset is weighted as follows: Standard Benchmarks:Music:Synthetic:Long-Form = 0.5:0.1:0.1:0.3.
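The stated 0.5:0.1:0.1:0.3 mixture can be expressed as a simple weighted sampler; the source names below follow the paper's categories, while the sampling mechanism itself is an illustrative assumption:

```python
import random

# SFT mixing weights exactly as stated:
# Standard Benchmarks : Music : Synthetic : Long-Form = 0.5 : 0.1 : 0.1 : 0.3
SFT_WEIGHTS = {
    "standard_benchmarks": 0.5,
    "music": 0.1,
    "synthetic": 0.1,
    "long_form": 0.3,
}

def sample_source(rng: random.Random) -> str:
    """Draw a training-data source according to the mixing weights."""
    names = list(SFT_WEIGHTS)
    return rng.choices(names, weights=[SFT_WEIGHTS[n] for n in names])[0]
```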

Optimization and Objective

The model is optimized via cross-entropy loss (negative log-likelihood), specifically minimizing

L = −Σ_{t=1}^{T} log p(s_t, τ_t, y_t | y_{<t}, z).

Optimization is conducted using the AdamW optimizer and standard hyperparameters, with curriculum learning applied to context window expansion (Peng et al., 26 Jan 2026).
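As a worked example of the objective, the loss is the sum over decoding steps of the negative log-probability assigned to the target joint (speaker, timestamp, content) token. A toy sketch with per-step probabilities supplied as plain numbers:

```python
import math

def sequence_nll(step_probs: list[float]) -> float:
    """Negative log-likelihood of a decoded sequence: the sum over
    steps t of -log p(s_t, τ_t, y_t | y_{<t}, z). Here the per-step
    probabilities are given directly, purely for illustration."""
    return -sum(math.log(p) for p in step_probs)

# toy three-step example: higher probabilities -> lower loss
loss = sequence_nll([0.9, 0.8, 0.95])
```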

Performance Benchmarks

VibeVoice-ASR demonstrates strong performance across multiple tasks and datasets. On AISHELL-4, the system achieves DER 16.93% and WER 18.99%, compared to WhisperX (DER 14.55%, WER 29.69%). For multi-speaker transcription across AISHELL-4, AMI-IHM/SDM, AliMeeting, and 12 languages in the MLC-Challenge, VibeVoice-ASR delivers mean DER ≈3.4% (versus Gemini-2.5 DER 16.3%, Gemini-3 DER 33%), and tcpWER 15.7% (versus Gemini-2.5 28.9%, Gemini-3 58.8%), with observed absolute WER improvements of 1–2% on numerous languages (Peng et al., 26 Jan 2026).

7. Significance and Outlook

VibeVoice-ASR represents an advance in unified, large-scale speech understanding, illustrating that direct mapping of long-form audio to LLM-compatible latent streams, with explicit context injection, enables accurate, time-aligned transcription with integrated diarization and multi-lingual code-switching—all in a single shot, without recourse to sliding windows or pipelined post-processing. The framework establishes a foundation for extended applications in automated meeting transcription, podcast analysis, and other domains involving complex conversational audio. Future work includes explicit separation-aware modeling for enhanced polyphonic diarization (Peng et al., 26 Jan 2026).
