VoxServe: Unified Streaming Inference
- VoxServe is a unified, streaming-centric serving system for SpeechLMs that decouples model logic from system-level optimizations.
- It implements asynchronous inference, dynamic batching, and CUDA resource management to achieve low end-to-end latency and high throughput.
- Experimental evaluations showed up to 13.4× performance gains over baselines, supporting diverse TTS and STS pipelines in real-time deployment.
VoxServe is a unified, streaming-centric serving system for speech language models (SpeechLMs) designed to deliver low end-to-end latency, high throughput, and robust streamability—a set of guarantees indispensable for production-grade, real-time speech model deployment. The system introduces a strict separation between model logic (tokenization, LLM steps, detokenization) and system-level optimizations (dynamic batching, streaming-aware scheduling, CUDA resource management). VoxServe supports diverse SpeechLM architectures, including both text-to-speech (TTS) and speech-to-speech (STS) pipelines, implementing highly performant asynchronous inference and scheduling paradigms (Kamahori et al., 30 Jan 2026).
1. Model-Execution Abstraction and System Architecture
VoxServe’s architecture is comprised of two distinct processes: the Interface process and the Execution process. The Interface process exposes HTTP/gRPC endpoints and relays prompt requests (optionally with reference audio) to the Execution process. Within the Execution process, three principal components coordinate inference (see ASCII diagram below):
```
+------------------+       +------------+       +---------------+
|    Scheduler     |-----> |   Worker   |-----> |     Model     |
| - Tracks status  |       | - Manages  |       | - Implements  |
|   of each req.   |       |   CUDA     |       |   preprocess, |
| - Picks next     |       |   streams  |       |   llm_forward,|
|   tasks          |       | - Launches |       |   sampling,   |
+------------------+       |   kernels  |       |   postprocess |
                           +------------+       +---------------+
```
The core system abstraction is a unified model interface representing any SpeechLM as a subclass with the following methods:
- `preprocess(RequestState rs)`: prepares input tensors and state, allocates caches.
- `llm_forward(BatchTensor ids, masks, feats)`: runs the LLM backbone, typically as a CUDA graph.
- `sampling(logits, RequestState)`: applies sampling strategies (e.g., top-k, top-p) and updates request state.
- `postprocess(BatchTokens)`: produces audio waveform chunks and maintains the detokenizer cache.
- Optionally, `depth_forward`/`depth_sampling` for codebook-by-codebook decoding.
All models implement these methods, facilitating model-agnostic, batched, and CUDA-optimized execution by the Scheduler and Worker. This explicit interface enables VoxServe to batch and schedule requests efficiently regardless of SpeechLM architectural idiosyncrasies.
Model Interface Pseudocode:
```python
class SpeechLMModel:
    def preprocess(self, request):
        # e.g. text_tokenize, audio_encode, allocate caches
        return rs

    def llm_forward(self, batch_ids, batch_masks, batch_feats):
        # runs LLM backbone (CUDA graph)
        return batch_logits

    def sampling(self, batch_logits, batch_rs):
        # e.g. top-k, top-p, repetition penalty
        # updates rs.next_ids, rs.next_masks, rs.next_feats
        return

    def postprocess(self, batch_tokens, batch_rs):
        # detokenizer forward (CUDA graph) in fixed-size chunks
        # returns waveform bytes, updates rs.detok_cache
        return batch_waves
```
2. Streaming-Centric Performance Metrics
VoxServe is evaluated against three primary metrics suited to streaming inference settings:
- End-to-First-Audio Latency, or "Time-To-First-Audio" (TTFA):

  $$\mathrm{TTFA} = t_{\text{first}} - t_{\text{submit}}$$

  where $t_{\text{submit}}$ is the request submission time and $t_{\text{first}}$ is the receipt of the first playable audio chunk.
- Throughput ($T$): the number of requests served per second under a latency constraint.
- Streaming Viability ($S$): for chunk $i$ with arrival time $a_i$ and playback duration $d_i$, a stream is viable if every chunk arrives before uninterrupted playback (starting at $a_1$) needs it:

  $$a_i \le a_1 + \sum_{k=1}^{i-1} d_k \quad \text{for all } i$$

  Overall streamability is the fraction of requests whose streams are viable:

  $$S = \frac{|\{\text{viable requests}\}|}{|\{\text{all requests}\}|}$$
Streaming viability reflects the strictness of timing needed for uninterrupted real-time playback; the system prioritizes both the initial latency and maintaining timely delivery of subsequent waveform chunks.
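The chunk-deadline condition above can be checked with a short helper. This is a minimal sketch; the function and variable names are illustrative, not from VoxServe:

```python
def stream_viable(arrivals, durations):
    """Check that every chunk arrives before uninterrupted playback
    (started at the first chunk's arrival) needs it."""
    deadline = arrivals[0]  # playback starts when chunk 0 arrives
    for arrival, duration in zip(arrivals, durations):
        if arrival > deadline:  # chunk is late: playback would stall
            return False
        deadline += duration    # next chunk is due once this one plays out
    return True

# A 0.5 s-per-chunk stream that keeps up vs. one whose third chunk is late.
print(stream_viable([0.2, 0.6, 1.1], [0.5, 0.5, 0.5]))  # True
print(stream_viable([0.2, 0.6, 1.4], [0.5, 0.5, 0.5]))  # False
```

Note that viability is binary per request: a single late chunk anywhere in the stream makes the whole stream non-viable.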
3. Scheduling Algorithm with Streaming Awareness
VoxServe’s Scheduler operates an infinite loop in which it dynamically prioritizes subtasks (LLM forward vs. detokenization) for all active requests, keyed to the current streaming phase:
- Phase 1 (Startup): Requests before first audio are latency-critical and granted maximum priority to minimize their TTFA.
- Phase 2 (Steady-State): after the first audio chunk, production of chunk $i$ carries a soft deadline

  $$D_i = t_{\text{first}} + \sum_{k=1}^{i-1} d_k$$

  and the request's current slack is $\mathrm{slack}_j = D_i - t_{\text{now}}$. Priority is a decreasing function of slack, for example $p_j = \exp(-\mathrm{slack}_j / \beta)$, where the parameter $\beta$ controls the bias toward near-deadline streams.
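One possible realization of this slack-based priority rule follows; the exponential form, parameter name, and function signature are illustrative, not VoxServe's exact implementation:

```python
import math

def compute_priority(deadline, now, beta=0.25, started=True):
    """Slack-based priority: streams close to (or past) their chunk
    deadline get exponentially higher priority; requests still in the
    startup phase (no audio emitted yet) get maximum priority."""
    if not started:
        return float("inf")   # Phase 1: minimize TTFA above all else
    slack = deadline - now    # seconds until the next chunk is due
    return math.exp(-slack / beta)

# A request 50 ms from its deadline outranks one with 400 ms of slack.
p_tight = compute_priority(deadline=10.05, now=10.0)
p_loose = compute_priority(deadline=10.40, now=10.0)
print(p_tight > p_loose)  # True
```

A smaller $\beta$ sharpens the preference for near-deadline streams; a larger one flattens priorities toward round-robin behavior.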
Scheduler loop (simplified):
```python
while True:
    now = current_time()
    ready_tasks = gather_ready_tasks()  # list of (req_id, task_type)
    priorities = {j: compute_priority(j, now) for j in active_requests}
    scheduled = sorted(ready_tasks, key=lambda task: priorities[task[0]],
                       reverse=True)
    # Batch up to B LLM tasks and D detokenizer tasks; issue GPU batches
    issue_gpu_batches(scheduled[:batch_limit])
    wait_for_next_event()  # I/O or GPU-done callback
```
4. Asynchronous Inference and Resource Utilization
VoxServe decouples CPU-side scheduling and sampling from GPU-side LLM and detokenizer execution, enabling high concurrency and device utilization. The Worker process maintains two CUDA streams:
- Stream 0: Handles LLM forward tasks.
- Stream 1: Handles detokenizer (waveform synthesis) tasks.
Each GPU task is tagged with a CUDA event dependency on the latest request state. CPU threads, using Python and PyTorch’s CUDA constructs, queue GPU work as soon as input tensors are ready, then immediately proceed to next-stage sampling or batch preparation without blocking on device availability.
This overlap yields highly efficient resource usage and minimal idle periods, contributing to the substantial throughput and latency improvements observed experimentally.
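The overlap pattern can be illustrated with ordinary worker threads standing in for the two CUDA streams. This is a simplification for exposition; the real Worker uses PyTorch CUDA streams and events, not Python threads:

```python
import queue
import threading

def make_stream():
    """A worker thread draining a FIFO queue: a stand-in for a CUDA stream."""
    q = queue.Queue()
    def drain():
        while True:
            task = q.get()
            task()  # the "kernel" runs asynchronously w.r.t. the submitter
            q.task_done()
    threading.Thread(target=drain, daemon=True).start()
    return q

llm_stream = make_stream()    # Stream 0: LLM forward tasks
detok_stream = make_stream()  # Stream 1: detokenizer tasks

results = []
lock = threading.Lock()

def record(tag):
    with lock:
        results.append(tag)

# The scheduler thread enqueues work and immediately moves on,
# never blocking on "device" availability.
for step in range(3):
    llm_stream.put(lambda s=step: record(f"llm:{s}"))
    detok_stream.put(lambda s=step: record(f"detok:{s}"))
    # ...CPU-side sampling / batch prep for the next step happens here...

llm_stream.join()
detok_stream.join()
print(sorted(results))
```

The key property mirrored here is that submission returns immediately, so CPU-side sampling and batch preparation for step $n{+}1$ proceed while step $n$ is still executing on the device.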
5. Experimental Evaluation and Results
Evaluation was conducted on a single NVIDIA H100-80GB node (PyTorch 2.1, CUDA 12.1, FlashInfer attention kernel), comparing VoxServe to prevailing baselines across three SpeechLMs:
| Model | Baseline max throughput @ 500 ms p90 TTFA | VoxServe max throughput @ 500 ms p90 TTFA | Speed-up |
|---|---|---|---|
| CosyVoice | 0.4 req/s (100% S) | 4.0 req/s (100% S) | 10× |
| Orpheus | 0.8 req/s (100% S) | 10.0 req/s (≳99% S) | 12.5× |
| Step-Audio | 0.3 req/s (100% S) | 3.5 req/s (100% S) | 11.7× |
Multi-GPU data-parallel scaling produced approximately linear throughput gains up to four GPUs for CosyVoice, and system performance improved equivalently in disaggregated scenarios (LLM on GPU₀, detok on GPU₁) despite inter-GPU transfer. In a throughput-oriented (non-streaming) scenario with 1000 concurrent CosyVoice requests, baseline achieved ~10× real-time throughput, while VoxServe (optimized) achieved ~134× real-time throughput, yielding a 13.4-fold performance gain over baseline at comparable latency and viability (Kamahori et al., 30 Jan 2026).
6. Implementation and Optimization Strategies
VoxServe’s optimization suite includes:
- Unified model interface: permitting shared, prebuilt CUDA graphs across multiple model families and variants.
- Stable tensor shapes: achieved through fixed chunk sizes and batch dimensions, resulting in high CUDA graph cache hit-rates.
- FlashInfer attention kernels: for highly optimized LLM forward passes.
- Detokenizer batching: chunks are processed in batches with per-request cache for KV/conv states, maximizing device efficiency.
- Asynchronous execution: multiple CUDA streams and light-weight per-request state objects allow fine-grained concurrency.
- Shape-constrained dynamic batching: Scheduler groups tasks by compatible tensor shapes, further boosting throughput.
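Shape-constrained grouping can be sketched as bucketing ready tasks by their padded tensor shape, so that each batch shares one shape and can reuse a prebuilt CUDA graph. The bucketing scheme here is illustrative:

```python
from collections import defaultdict

def group_by_shape(tasks, bucket=64):
    """Group ready tasks whose padded sequence length falls in the same
    bucket; each group then shares one tensor shape (and one CUDA graph)."""
    groups = defaultdict(list)
    for req_id, seq_len in tasks:
        padded = ((seq_len + bucket - 1) // bucket) * bucket  # round up
        groups[padded].append(req_id)
    return dict(groups)

tasks = [("a", 30), ("b", 61), ("c", 64), ("d", 100)]
print(group_by_shape(tasks))  # {64: ['a', 'b', 'c'], 128: ['d']}
```

Coarser buckets raise CUDA graph cache hit-rates at the cost of more padding waste; the fixed chunk sizes mentioned above play the same role on the detokenizer side.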
Cumulatively, these optimizations enable uniform performance improvements across at least seven open-source SpeechLM families with an implementation size of approximately 20K Python+PyTorch lines.
7. Deployment Considerations
Practical deployment requires alignment of configuration, hardware, and integration pipelines:
- Configurable parameters (per model):
  - `chunk_size`: tokens per detokenizer call (e.g., 15 for CosyVoice, 28/21 overlap for Orpheus, 25 + 3 lookahead for Step-Audio).
  - `sampling` parameters: temperature, top_k, top_p, repetition_penalty.
  - `max_batch_size` for the LLM and detokenizer.
  - CUDA device allocation, particularly for multi-GPU or disaggregated setups.
- Scheduler thresholds: startup concurrency cap, slack threshold, and the bias parameter in the priority formula.
- Hardware requirements: At least one H100-class GPU (≥32 GB); large SpeechLMs (9B+ params) may require the full memory capacity or even necessitate model-disaggregated deployment.
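Taken together, the tunables above might be collected into a per-model configuration like the following. All key names (and the sampling values) are hypothetical placeholders, not VoxServe's actual schema; the chunk size is the CosyVoice value from the list above:

```python
# Hypothetical per-model configuration; key names are illustrative,
# not VoxServe's actual schema.
cosyvoice_config = {
    "chunk_size": 15,                 # tokens per detokenizer call
    "sampling": {
        "temperature": 0.8,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.1,
    },
    "max_batch_size": {"llm": 32, "detok": 8},
    "devices": {"llm": "cuda:0", "detok": "cuda:0"},  # disaggregate by
    "scheduler": {                                    # splitting devices
        "startup_concurrency_cap": 16,
        "slack_threshold_s": 0.2,
    },
}
print(cosyvoice_config["chunk_size"])  # 15
```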
- Integration steps:
  1. Install with `pip install voxserve` or clone the repository.
  2. If using a new SpeechLM, subclass the model interface and implement `preprocess`, `llm_forward`, `sampling`, and `postprocess`.
  3. Configure the scheduler for streaming or throughput-optimized mode.
  4. Launch the Execution process with the selected model configuration.
  5. Submit HTTP/gRPC inference requests (JSON with optional audio); receive output as streaming WAV audio chunks.
The modular design allows rapid onboarding of new SpeechLMs with minimal integration overhead. All system-level optimizations, including batching, CUDA graphing, and scheduling, are automatically leveraged by models conforming to the unified interface (Kamahori et al., 30 Jan 2026).