VoxServe: Unified Streaming Inference
- VoxServe is a unified, streaming-centric serving system for SpeechLMs that decouples model logic from system-level optimizations.
- It implements asynchronous inference, dynamic batching, and CUDA resource management to achieve low end-to-end latency and high throughput.
- Experimental evaluations showed up to 13.4× performance gains over baselines, supporting diverse TTS and STS pipelines in real-time deployment.
VoxServe is a unified, streaming-centric serving system for speech language models (SpeechLMs) designed to deliver low end-to-end latency, high throughput, and robust streamability—a set of guarantees indispensable for production-grade, real-time speech model deployment. The system introduces a strict separation between model logic (tokenization, LLM steps, detokenization) and system-level optimizations (dynamic batching, streaming-aware scheduling, CUDA resource management). VoxServe supports diverse SpeechLM architectures, including both text-to-speech (TTS) and speech-to-speech (STS) pipelines, implementing highly performant asynchronous inference and scheduling paradigms (Kamahori et al., 30 Jan 2026).
1. Model-Execution Abstraction and System Architecture
VoxServe’s architecture is comprised of two distinct processes: the Interface process and the Execution process. The Interface process exposes HTTP/gRPC endpoints and relays prompt requests (optionally with reference audio) to the Execution process. Within the Execution process, three principal components coordinate inference (see ASCII diagram below):
```
+------------------+       +------------+       +---------------+
|    Scheduler     |-----> |   Worker   |-----> |     Model     |
| - Tracks status  |       | - Manages  |       | - Implements  |
|   of each req.   |       |   CUDA     |       |   preprocess, |
| - Picks next     |       |   streams  |       |   llm_forward,|
|   tasks          |       | - Launches |       |   sampling,   |
+------------------+       |   kernels  |       |   postprocess |
                           +------------+       +---------------+
```
The core system abstraction is a unified model interface representing any SpeechLM as a subclass with the following methods:
- `preprocess(RequestState rs)`: prepares input tensors and state, allocates caches.
- `llm_forward(BatchTensor ids, masks, feats)`: runs the LLM backbone, typically as a CUDA graph.
- `sampling(logits, RequestState)`: applies sampling strategies (e.g., top-k, top-p) and updates request state.
- `postprocess(BatchTokens)`: produces audio waveform chunks and maintains the detokenizer cache.
- Optionally, `depth_forward`/`depth_sampling` for codebook-by-codebook decoding.
All models implement these methods, facilitating model-agnostic, batched, and CUDA-optimized execution by the Scheduler and Worker. This explicit interface enables VoxServe to batch and schedule requests efficiently regardless of SpeechLM architectural idiosyncrasies.
Model Interface Pseudocode:
```python
class SpeechLMModel:
    def preprocess(self, request):
        # e.g. text_tokenize, audio_encode, allocate caches
        return rs

    def llm_forward(self, batch_ids, batch_masks, batch_feats):
        # runs LLM backbone (CUDA graph)
        return batch_logits

    def sampling(self, batch_logits, batch_rs):
        # e.g. top-k, top-p, repetition penalty
        # updates rs.next_ids, rs.next_masks, rs.next_feats
        return

    def postprocess(self, batch_tokens, batch_rs):
        # detokenizer forward (CUDA graph) in fixed-size chunks
        # returns waveform bytes, updates rs.detok_cache
        return batch_waves
```
2. Streaming-Centric Performance Metrics
VoxServe is evaluated against three primary metrics suited to streaming inference settings:
- End-to-First-Audio Latency, or "Time-To-First-Audio" (TTFA):

  $$\mathrm{TTFA} = t_{\text{first}} - t_{\text{submit}}$$

  where $t_{\text{submit}}$ is the request submission time and $t_{\text{first}}$ is the receipt of the first playable audio chunk.
- Throughput ($T$): the number of requests served per second under a latency constraint.
- Streaming Viability ($S$): for chunk $i$ with arrival time $a_i$ and playback duration $d_i$, a stream is viable if every chunk arrives before uninterrupted playback (starting at $a_1$) needs it:

  $$a_i \le a_1 + \sum_{k=1}^{i-1} d_k \quad \text{for all } i$$

  Overall streamability is the fraction of requests whose streams are viable:

  $$S = \frac{|\{\text{viable requests}\}|}{|\{\text{all requests}\}|}$$
Streaming viability reflects the strictness of timing needed for uninterrupted real-time playback; the system prioritizes both the initial latency and maintaining timely delivery of subsequent waveform chunks.
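The chunk-deadline condition above can be checked with a short helper. This is a minimal sketch; the function and variable names are illustrative, not from VoxServe:

```python
def stream_viable(arrivals, durations):
    """Check that every chunk arrives before uninterrupted playback
    (started at the first chunk's arrival) needs it."""
    deadline = arrivals[0]  # playback starts when chunk 0 arrives
    for arrival, duration in zip(arrivals, durations):
        if arrival > deadline:  # chunk is late: playback would stall
            return False
        deadline += duration    # next chunk is due once this one plays out
    return True

# A 0.5 s-per-chunk stream that keeps up vs. one whose third chunk is late.
print(stream_viable([0.2, 0.6, 1.1], [0.5, 0.5, 0.5]))  # True
print(stream_viable([0.2, 0.6, 1.4], [0.5, 0.5, 0.5]))  # False
```

Note that viability is binary per request: a single late chunk anywhere in the stream makes the whole stream non-viable.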
3. Scheduling Algorithm with Streaming Awareness
VoxServe’s Scheduler operates an infinite loop in which it dynamically prioritizes subtasks (LLM forward vs. detokenization) for all active requests, keyed to the current streaming phase:
- Phase 1 (Startup): Requests before first audio are latency-critical and granted maximum priority to minimize their TTFA.
- Phase 2 (Steady-State): after the first audio chunk, production of chunk $i$ carries a soft deadline

  $$D_i = t_{\text{first}} + \sum_{k=1}^{i-1} d_k$$

  and the request's current slack is $\mathrm{slack}_j = D_i - t_{\text{now}}$. Priority is a decreasing function of slack, for example $p_j = \exp(-\mathrm{slack}_j / \beta)$, where the parameter $\beta$ controls the bias toward near-deadline streams.
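One possible realization of this slack-based priority rule follows; the exponential form, parameter name, and function signature are illustrative, not VoxServe's exact implementation:

```python
import math

def compute_priority(deadline, now, beta=0.25, started=True):
    """Slack-based priority: streams close to (or past) their chunk
    deadline get exponentially higher priority; requests still in the
    startup phase (no audio emitted yet) get maximum priority."""
    if not started:
        return float("inf")   # Phase 1: minimize TTFA above all else
    slack = deadline - now    # seconds until the next chunk is due
    return math.exp(-slack / beta)

# A request 50 ms from its deadline outranks one with 400 ms of slack.
p_tight = compute_priority(deadline=10.05, now=10.0)
p_loose = compute_priority(deadline=10.40, now=10.0)
print(p_tight > p_loose)  # True
```

A smaller $\beta$ sharpens the preference for near-deadline streams; a larger one flattens priorities toward round-robin behavior.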
Scheduler loop (simplified):
```python
while True:
    now = current_time()
    ready_tasks = gather_ready_tasks()  # list of (req_id, task_type)
    priorities = {j: compute_priority(j, now) for j in active_requests}
    scheduled = sorted(ready_tasks, key=lambda task: priorities[task[0]],
                       reverse=True)
    # Batch up to B LLM tasks and D detokenizer tasks; issue GPU batches
    issue_gpu_batches(scheduled[:batch_limit])
    wait_for_next_event()  # I/O or GPU-done callback
```
4. Asynchronous Inference and Resource Utilization
VoxServe decouples CPU-side scheduling and sampling from GPU-side LLM and detokenizer execution, enabling high concurrency and device utilization. The Worker process maintains two CUDA streams:
- Stream 0: Handles LLM forward tasks.
- Stream 1: Handles detokenizer (waveform synthesis) tasks.
Each GPU task is tagged with a CUDA event dependency on the latest request state. CPU threads, using Python and PyTorch’s CUDA constructs, queue GPU work as soon as input tensors are ready, then immediately proceed to next-stage sampling or batch preparation without blocking on device availability.
This overlap yields highly efficient resource usage and minimal idle periods, contributing to the substantial throughput and latency improvements observed experimentally.
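The overlap pattern can be illustrated with ordinary worker threads standing in for the two CUDA streams. This is a simplification for exposition; the real Worker uses PyTorch CUDA streams and events, not Python threads:

```python
import queue
import threading

def make_stream():
    """A worker thread draining a FIFO queue: a stand-in for a CUDA stream."""
    q = queue.Queue()
    def drain():
        while True:
            task = q.get()
            task()  # the "kernel" runs asynchronously w.r.t. the submitter
            q.task_done()
    threading.Thread(target=drain, daemon=True).start()
    return q

llm_stream = make_stream()    # Stream 0: LLM forward tasks
detok_stream = make_stream()  # Stream 1: detokenizer tasks

results = []
lock = threading.Lock()

def record(tag):
    with lock:
        results.append(tag)

# The scheduler thread enqueues work and immediately moves on,
# never blocking on "device" availability.
for step in range(3):
    llm_stream.put(lambda s=step: record(f"llm:{s}"))
    detok_stream.put(lambda s=step: record(f"detok:{s}"))
    # ...CPU-side sampling / batch prep for the next step happens here...

llm_stream.join()
detok_stream.join()
print(sorted(results))
```

The key property mirrored here is that submission returns immediately, so CPU-side sampling and batch preparation for step $n{+}1$ proceed while step $n$ is still executing on the device.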
5. Experimental Evaluation and Results
Evaluation was conducted on a single NVIDIA H100-80GB node (PyTorch 2.1, CUDA 12.1, FlashInfer attention kernel), comparing VoxServe to prevailing baselines across three SpeechLMs:
| Model | Baseline max throughput @ 500 ms p90 TTFA | VoxServe max throughput @ 500 ms p90 TTFA | Speed-up |
|---|---|---|---|
| CosyVoice | 0.4 req/s (100% S) | 4.0 req/s (100% S) | 10× |
| Orpheus | 0.8 req/s (100% S) | 10.0 req/s (≳99% S) | 12.5× |
| Step-Audio | 0.3 req/s (100% S) | 3.5 req/s (100% S) | 11.7× |
Multi-GPU data-parallel scaling produced approximately linear throughput gains up to four GPUs for CosyVoice, and system performance improved equivalently in disaggregated scenarios (LLM on GPU₀, detok on GPU₁) despite inter-GPU transfer. In a throughput-oriented (non-streaming) scenario with 1000 concurrent CosyVoice requests, baseline achieved ~10× real-time throughput, while VoxServe (optimized) achieved ~134× real-time throughput, yielding a 13.4-fold performance gain over baseline at comparable latency and viability (Kamahori et al., 30 Jan 2026).
6. Implementation and Optimization Strategies
VoxServe’s optimization suite includes:
- Unified model interface: permitting shared, prebuilt CUDA graphs across multiple model families and variants.
- Stable tensor shapes: achieved through fixed chunk sizes and batch dimensions, resulting in high CUDA graph cache hit-rates.
- FlashInfer attention kernels: for highly optimized LLM forward passes.
- Detokenizer batching: chunks are processed in batches with per-request cache for KV/conv states, maximizing device efficiency.
- Asynchronous execution: multiple CUDA streams and light-weight per-request state objects allow fine-grained concurrency.
- Shape-constrained dynamic batching: Scheduler groups tasks by compatible tensor shapes, further boosting throughput.
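Shape-constrained grouping can be sketched as bucketing ready tasks by their padded tensor shape, so that each batch shares one shape and can reuse a prebuilt CUDA graph. The bucketing scheme here is illustrative:

```python
from collections import defaultdict

def group_by_shape(tasks, bucket=64):
    """Group ready tasks whose padded sequence length falls in the same
    bucket; each group then shares one tensor shape (and one CUDA graph)."""
    groups = defaultdict(list)
    for req_id, seq_len in tasks:
        padded = ((seq_len + bucket - 1) // bucket) * bucket  # round up
        groups[padded].append(req_id)
    return dict(groups)

tasks = [("a", 30), ("b", 61), ("c", 64), ("d", 100)]
print(group_by_shape(tasks))  # {64: ['a', 'b', 'c'], 128: ['d']}
```

Coarser buckets raise CUDA graph cache hit-rates at the cost of more padding waste; the fixed chunk sizes mentioned above play the same role on the detokenizer side.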
Cumulatively, these optimizations enable uniform performance improvements across at least seven open-source SpeechLM families with an implementation size of approximately 20K Python+PyTorch lines.
7. Deployment Considerations
Practical deployment requires alignment of configuration, hardware, and integration pipelines:
- Configurable parameters (per model):
  - `chunk_size`: tokens per detokenizer call (e.g., 15 for CosyVoice, 28/21 overlap for Orpheus, 25 + 3 lookahead for Step-Audio).
  - `sampling` parameters: temperature, top_k, top_p, repetition_penalty.
  - `max_batch_size` for the LLM and detokenizer.
  - CUDA device allocation, particularly for multi-GPU or disaggregated setups.
- Scheduler thresholds: startup concurrency cap, slack threshold, and the bias parameter in the priority formula.
- Hardware requirements: At least one H100-class GPU (≥32 GB); large SpeechLMs (9B+ params) may require the full memory capacity or even necessitate model-disaggregated deployment.
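Taken together, the tunables above might be collected into a per-model configuration like the following. All key names (and the sampling values) are hypothetical placeholders, not VoxServe's actual schema; the chunk size is the CosyVoice value from the list above:

```python
# Hypothetical per-model configuration; key names are illustrative,
# not VoxServe's actual schema.
cosyvoice_config = {
    "chunk_size": 15,                 # tokens per detokenizer call
    "sampling": {
        "temperature": 0.8,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.1,
    },
    "max_batch_size": {"llm": 32, "detok": 8},
    "devices": {"llm": "cuda:0", "detok": "cuda:0"},  # disaggregate by
    "scheduler": {                                    # splitting devices
        "startup_concurrency_cap": 16,
        "slack_threshold_s": 0.2,
    },
}
print(cosyvoice_config["chunk_size"])  # 15
```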
- Integration steps:
  1. Install with `pip install voxserve` or clone the repository.
  2. If using a new SpeechLM, subclass the model interface and implement `preprocess`, `llm_forward`, `sampling`, and `postprocess`.
  3. Configure the scheduler for streaming or throughput-optimized mode.
  4. Launch the Execution process with the selected model configuration.
  5. Submit HTTP/gRPC inference requests (JSON with optional audio); receive output as streaming WAV audio chunks.
The modular design allows rapid onboarding of new SpeechLMs with minimal integration overhead. All system-level optimizations, including batching, CUDA graphing, and scheduling, are automatically leveraged by models conforming to the unified interface (Kamahori et al., 30 Jan 2026).