WESR-Bench: Word-Level ASR & Vocal Event Benchmark
- WESR-Bench is a comprehensive framework that introduces a fine-grained vocal event taxonomy and a position-aware protocol to disentangle ASR errors from event detection.
- It compiles a bilingual, richly annotated dataset with 21 vocal event categories divided into discrete and continuous types, facilitating robust model training and comparison.
- Benchmark results show that fine-tuned models achieve marked gains in event-detection precision and recall over few-shot baselines, with minimal impact on overall ASR transcription quality.
WESR-Bench is a rigorously constructed evaluation framework and dataset for word-level event-speech recognition, with a primary focus on detecting and localizing non-verbal vocal events alongside speech transcription. Developed to address the lack of standardized benchmarks for this task, WESR-Bench introduces a position-aware protocol to disentangle automatic speech recognition (ASR) errors from event detection/localization, supports a fine-grained taxonomy of vocal events, and supplies both an annotated test set and a large-scale training corpus for model development and comparison (Yang et al., 8 Jan 2026).
1. Taxonomy of Vocal Events
WESR-Bench operationalizes a refined taxonomy of 21 vocal event categories, split into discrete (instantaneous, point-like) and continuous (interval, word-spanning) types:
- Discrete events (15), marked using square brackets and assigned to inter-word positions, include [laughs], [chuckle], [giggle], [crowd_laughter], [cry], [sobbing], [cough], [clear_throat], [scream], [roar], [shout], [breathing], [inhale], [exhale], [sigh].
- Continuous events (6), marked using angle brackets spanning one or more words, include <laughing>…</laughing>, <crying>…</crying>, <shouting>…</shouting>, <panting>…</panting>, <whispering>…</whispering>, <singing>…</singing>.
For each tag, annotators were provided with a succinct definition and audio exemplars (e.g., [chuckle]: “Laugh quietly”; <singing>…</singing>: “Singing voice, spanning a sequence of words.”). This systematic annotation enables fine-grained analyses and lays the groundwork for precise benchmarking.
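The two tag conventions can be illustrated with a minimal parser. This is a hypothetical sketch (not the benchmark's tooling) that assumes tags appear as whitespace-delimited tokens; it splits a tagged transcript into plain words, discrete tags at inter-word slots, and continuous spans over word indices.

```python
import re

# Hypothetical parser illustrating the two tag conventions: square-bracket
# discrete tags between words, angle-bracket continuous tags spanning words.
# Assumes tags are whitespace-delimited tokens.
TAG_RE = re.compile(r"\[(\w+)\]|<(/?)(\w+)>")

def parse_tagged(text):
    """Split a WESR-style transcript into plain words, discrete tags
    (event, inter-word slot), and continuous spans (event, word indices)."""
    words, discrete, spans, open_spans = [], [], [], {}
    for token in text.split():
        m = TAG_RE.fullmatch(token)
        if m is None:
            words.append(token)
            for ev in open_spans:                  # word falls inside open spans
                open_spans[ev].append(len(words) - 1)
        elif m.group(1):                           # [discrete] tag at an inter-word slot
            discrete.append((m.group(1), len(words)))
        elif m.group(2):                           # </closing> tag ends a span
            spans.append((m.group(3), open_spans.pop(m.group(3))))
        else:                                      # <opening> tag starts a span
            open_spans[m.group(3)] = []
    return words, discrete, spans

words, discrete, spans = parse_tagged(
    "well [laughs] that was <singing> la la la </singing> fun")
```

Here `[laughs]` attaches to the slot after word 0 (“well”), while `<singing>…</singing>` covers word indices 3–5.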
2. Dataset Construction and Annotation Protocol
WESR-Bench’s benchmark set was compiled through hybrid acoustic-textual retrieval over web audio sources (movies, dramas, podcasts, audiobooks). The retrieval pipeline used BEATs embeddings for acoustic queries and AF-CLAP embeddings for text-based search; for each event class, three audio and three text queries were issued, and the top-k utterances were retrieved. After review and filtering:
- The final evaluation set contains 927 utterances (≈ 3 hours, 42% English/58% Chinese).
- Event occurrence statistics: 1,918 tags, with 58.8% continuous and 41.2% discrete; 29.1% of utterances have ≥2 tags, and 24.6% involve ≥2 distinct event types.
After qualification and training, three annotators independently inserted event tags at precise positions (compensated at market rate); all utterances were then subject to senior expert review for label accuracy and boundary alignment.
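The embedding-based retrieval step amounts to nearest-neighbor search by cosine similarity. The sketch below uses toy 3-d vectors in place of real BEATs/AF-CLAP embeddings and is only an illustration of the top-k mechanism, not the paper's pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, corpus, k):
    """Return the k corpus ids most similar to the query embedding."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [uid for uid, _ in scored[:k]]

# Toy 3-d embeddings standing in for BEATs (audio) or AF-CLAP (text) vectors.
corpus = {"utt1": [1.0, 0.0, 0.0],
          "utt2": [0.9, 0.1, 0.0],
          "utt3": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.05, 0.0], corpus, k=2))   # → ['utt1', 'utt2']
```

In the actual pipeline, each event class would issue three audio and three text queries against such an index, with the retrieved candidates then filtered and reviewed.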
3. Position-Aware Evaluation Protocol
The central methodological contribution is a position-aware protocol explicitly designed to separate ASR (lexical) from event-detection errors:
- Event-Preserving Alignment: Event tags are removed from the reference to produce a plain-text sequence, and the hypothesized output is tokenized with event tags kept atomic. SequenceMatcher (Python's difflib) computes the alignment, and edit operations are applied such that event tags are never deleted; replacements cause tags to be re-inserted at the closest matching word positions.
- Event-to-Position Mapping: For a sequence of N reference words, 2N+1 distinct positions are defined: N word-positions (continuous events) and N+1 inter-word positions (discrete events). Discrete tags are mapped to inter-word positions while continuous tags span word-positions.
- Metrics: For each event type, the predicted and reference position sets are compared, yielding per-type precision, recall, and F1.
Overlaps of continuous spans receive partial credit, ensuring recall and F1 better reflect localization fidelity.
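A minimal sketch of the position scheme and partial-credit scoring, under stated assumptions: the exact credit formula is not reproduced here, so overlap is credited as overlapping words divided by reference-span length, which may differ from the benchmark's definition.

```python
def positions(n_words):
    """For N reference words, 2N+1 positions: indices 0, 2, ..., 2N are the
    N+1 inter-word slots (discrete events); indices 1, 3, ..., 2N-1 are the
    N word positions (continuous events)."""
    return list(range(2 * n_words + 1))

def span_credit(pred, ref):
    """Partial credit for a continuous span, given as inclusive word-index
    intervals: overlap length / reference length. (Assumed credit formula;
    the benchmark's exact definition may differ.)"""
    lo, hi = max(pred[0], ref[0]), min(pred[1], ref[1])
    return max(0, hi - lo + 1) / (ref[1] - ref[0] + 1)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A predicted <singing> span over words 3-5 vs. a reference span over words 3-6:
credit = span_credit((3, 5), (3, 6))   # 3 overlapping words / 4 reference words
```

Because discrete tags must land on exactly the right inter-word slot while continuous spans earn graded credit for overlap, the protocol naturally rewards approximate localization of intervals but not of point events.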
4. Training Corpus and Model Development
WESR-Train provides 1,767 hours of labeled speech with vocal events, supporting multilinguality (English and Chinese). Data sources include:
| Source | Approx. Hours | Description |
|---|---|---|
| NonverbalTTS | 14 | Human-inspired pipeline |
| NVSpeech-170k | 332 | Model-expanded corpus |
| NonVerbalSpeech-38K | 87 | Pipeline-annotated |
| SMIIP-NV (in-house) | 35 | Mandarin corpus |
| Gemini-annotated Web mining | 1,299 | Large-scale web audio annotation |
Preprocessing included denoising (MossFormer2, DNSMOS≥2.0), hybrid retrieval, Gemini-2.5-Pro automatic word-level annotation to the WESR taxonomy, and deduplication against the benchmark set. All tags from external sources were manually remapped to conform with the defined taxonomy.
Model fine-tuning was conducted on three backbone architectures: Whisper-Large-v3 (1.5B params), Kimi-Audio-7B-Instruct (7B), Qwen3-Omni-30B-A3B-Instruct (30B). Event tokens were added to tokenizers, with the following learning configurations:
- Whisper: LR=1e−5, batch size 8, warmup 0.1, 3 epochs (H100×8, 4h).
- Kimi/Qwen: LR=1e−6, sequence length 4,096, 3 epochs (H100×8/H200×8, 5h).
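Registering event tags as atomic tokens can be sketched with a toy vocabulary. This is an illustration only: the actual work extends the Whisper/Kimi/Qwen subword tokenizers (and resizes the model's embedding matrix accordingly), which this stand-in dict does not do.

```python
# Toy vocabulary standing in for a real subword tokenizer: each event tag is
# registered as a single atomic token so it is never split into subwords.
DISCRETE = ["[laughs]", "[cough]", "[sigh]"]
CONTINUOUS = ["<singing>", "</singing>", "<whispering>", "</whispering>"]

def add_event_tokens(vocab, tags):
    """Append each unseen tag to the vocabulary with a fresh id. In a real
    setup the model's embedding matrix would be resized to match."""
    for tag in tags:
        if tag not in vocab:
            vocab[tag] = len(vocab)
    return vocab

vocab = {"hello": 0, "world": 1}
add_event_tokens(vocab, DISCRETE + CONTINUOUS)
```

Keeping tags atomic is what allows the evaluation protocol to treat them as indivisible units during alignment.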
ASR quality retention was evaluated via Common Voice WER: e.g., Qwen3-Omni 7.2% (en), 6.0% (zh-CN); WESR-Qwen 8.6% (en), 7.2% (zh-CN).
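These WER figures follow the standard definition (word-level Levenshtein distance over reference length, computed on plain transcripts); a minimal implementation, assuming that standard definition:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j], rolled row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (free if equal)
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```

Event tags would be stripped before scoring, consistent with the protocol's separation of lexical and event errors.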
5. Benchmarking Results and Analyses
WESR-Bench enables standardized model comparison by reporting micro/macro-averaged F1 scores on all 21 tags. Fine-tuned models achieve substantial gains over few-shot LLM baselines:
| Model | Micro F1 | Macro F1 |
|---|---|---|
| Kimi-Audio (2-shot) | 7.3% | 10.3% |
| Gemini-2.5-Pro (2-shot) | 53.9% | 29.9% |
| Gemini-3-Pro (2-shot) | 58.9% | 32.3% |
| WESR-Whisper | 70.6% | 37.7% |
| WESR-Kimi | 70.5% | 37.8% |
| WESR-Qwen | 71.4% | 38.0% |
Per-type analyses show that continuous events (F1 = 64.1%, WESR-Qwen) are detected with higher recall and F1 than discrete events (F1 = 28.2%), consistent with partial credit in interval scoring and the greater difficulty of precise inter-word localization. Aggregated event categories (e.g., all laugh variants) show absolute improvements of 20–40 percentage points over few-shot configurations. Adding event tagging increases WER on general ASR tests by only 1.4–1.5 pp, confirming that event recognition does not materially degrade lexical transcription. Error analysis indicates persistent difficulty with low-energy or brief events (e.g., breathing, sighing).
6. Applications and Research Trajectories
WESR-Bench is designed to catalyze progress in several downstream and cross-disciplinary domains:
- Downstream applications: Enhanced subtitling (e.g., laugh/cry cues), expressive TTS, sentiment analysis at word/event granularity, speaker state estimation, and improved dialogue system context tracking.
- Research directions: Expansion beyond English and Chinese (including code-mixed and low-resource languages), scaling taxonomy to environmental and non-vocal acoustic events, model adaptation and compression for edge settings, and integration with multimodal (e.g., video, physiological) signals to boost robustness and granularity in event detection.
A plausible implication is that WESR-Bench, through its position-aware protocol and comprehensive taxonomy, establishes a new standard for fine-grained vocal event recognition and localization, facilitating precise comparison and accelerating model development (Yang et al., 8 Jan 2026).