WESR-Bench: Word-Level ASR & Vocal Event Benchmark
- WESR-Bench is a comprehensive framework that introduces a fine-grained vocal event taxonomy and a position-aware protocol to disentangle ASR errors from event detection.
- It compiles a bilingual, richly annotated dataset with 21 vocal event categories divided into discrete and continuous types, facilitating robust model training and comparison.
- Benchmark results show that fine-tuned models achieve marked gains in event-detection precision and recall over few-shot baselines, with minimal impact on overall ASR transcription quality.
WESR-Bench is a rigorously constructed evaluation framework and dataset for word-level event-speech recognition, with a primary focus on detecting and localizing non-verbal vocal events alongside speech transcription. Developed to address the lack of standardized benchmarks for this task, WESR-Bench introduces a position-aware protocol to disentangle automatic speech recognition (ASR) errors from event detection/localization, supports a fine-grained taxonomy of vocal events, and supplies both an annotated test set and a large-scale training corpus for model development and comparison (Yang et al., 8 Jan 2026).
1. Taxonomy of Vocal Events
WESR-Bench operationalizes a refined taxonomy of 21 vocal event categories, split into discrete (instantaneous, point-like) and continuous (interval, word-spanning) types:
- Discrete events (15), marked using square brackets and assigned to inter-word positions, include [laughs], [chuckle], [giggle], [crowd_laughter], [cry], [sobbing], [cough], [clear_throat], [scream], [roar], [shout], [breathing], [inhale], [exhale], [sigh].
- Continuous events (6), marked using angle brackets spanning one or more words, include <laughing>…</laughing>, <crying>…</crying>, <shouting>…</shouting>, <panting>…</panting>, <whispering>…</whispering>, <singing>…</singing>.
For each tag, annotators were provided with a succinct definition and audio exemplars (e.g., [chuckle]: “Laugh quietly”; <singing>…</singing>: “Singing voice, spanning a sequence of words.”). This systematic annotation enables fine-grained analyses and lays the groundwork for precise benchmarking.
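The two tag conventions can be illustrated with a minimal parser. This is a hypothetical sketch (not the benchmark's tooling) that assumes tags appear as whitespace-delimited tokens; it splits a tagged transcript into plain words, discrete tags at inter-word slots, and continuous spans over word indices.

```python
import re

# Hypothetical parser illustrating the two tag conventions: square-bracket
# discrete tags between words, angle-bracket continuous tags spanning words.
# Assumes tags are whitespace-delimited tokens.
TAG_RE = re.compile(r"\[(\w+)\]|<(/?)(\w+)>")

def parse_tagged(text):
    """Split a WESR-style transcript into plain words, discrete tags
    (event, inter-word slot), and continuous spans (event, word indices)."""
    words, discrete, spans, open_spans = [], [], [], {}
    for token in text.split():
        m = TAG_RE.fullmatch(token)
        if m is None:
            words.append(token)
            for ev in open_spans:                  # word falls inside open spans
                open_spans[ev].append(len(words) - 1)
        elif m.group(1):                           # [discrete] tag at an inter-word slot
            discrete.append((m.group(1), len(words)))
        elif m.group(2):                           # </closing> tag ends a span
            spans.append((m.group(3), open_spans.pop(m.group(3))))
        else:                                      # <opening> tag starts a span
            open_spans[m.group(3)] = []
    return words, discrete, spans

words, discrete, spans = parse_tagged(
    "well [laughs] that was <singing> la la la </singing> fun")
```

Here `[laughs]` attaches to the slot after word 0 (“well”), while `<singing>…</singing>` covers word indices 3–5.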
2. Dataset Construction and Annotation Protocol
WESR-Bench’s benchmark set was compiled through hybrid acoustic-textual retrieval over web audio sources (movies, dramas, podcasts, audiobooks). The retrieval pipeline used BEATs embeddings for acoustic queries and AF-CLAP embeddings for text-based search; for each event class, three audio and three text queries were issued, and the top-k utterances were retrieved. After review and filtering:
- The final evaluation set contains 927 utterances (≈ 3 hours, 42% English/58% Chinese).
- Event occurrence statistics: 1,918 tags, with 58.8% continuous and 41.2% discrete; 29.1% of utterances have ≥2 tags, and 24.6% involve ≥2 distinct event types.
After qualification and training, three annotators independently inserted event tags at precise positions (compensated at market rate); all utterances were then subject to senior expert review for label accuracy and boundary alignment.
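The embedding-based retrieval step amounts to nearest-neighbor search by cosine similarity. The sketch below uses toy 3-d vectors in place of real BEATs/AF-CLAP embeddings and is only an illustration of the top-k mechanism, not the paper's pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, corpus, k):
    """Return the k corpus ids most similar to the query embedding."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [uid for uid, _ in scored[:k]]

# Toy 3-d embeddings standing in for BEATs (audio) or AF-CLAP (text) vectors.
corpus = {"utt1": [1.0, 0.0, 0.0],
          "utt2": [0.9, 0.1, 0.0],
          "utt3": [0.0, 1.0, 0.0]}
print(top_k([1.0, 0.05, 0.0], corpus, k=2))   # → ['utt1', 'utt2']
```

In the actual pipeline, each event class would issue three audio and three text queries against such an index, with the retrieved candidates then filtered and reviewed.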
3. Position-Aware Evaluation Protocol
The central methodological contribution is a position-aware protocol explicitly designed to separate ASR (lexical) from event-detection errors:
- Event-Preserving Alignment: Event tags are removed from the reference to produce a plain-text sequence, and the hypothesized output is tokenized with event tags kept atomic. SequenceMatcher (Python's difflib) computes the alignment, and edit operations are applied such that event tags are never deleted; replacements cause tags to be re-inserted at the closest matching word positions.
- Event-to-Position Mapping: For a sequence of N reference words, 2N+1 distinct positions are defined: N word-positions (continuous events) and N+1 inter-word positions (discrete events). Discrete tags are mapped to inter-word positions while continuous tags span word-positions.
- Metrics: For each event type, the predicted and reference position sets are compared, yielding per-type precision, recall, and F1.
Overlaps of continuous spans receive partial credit, ensuring recall and F1 better reflect localization fidelity.
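A minimal sketch of the position scheme and partial-credit scoring, under stated assumptions: the exact credit formula is not reproduced here, so overlap is credited as overlapping words divided by reference-span length, which may differ from the benchmark's definition.

```python
def positions(n_words):
    """For N reference words, 2N+1 positions: indices 0, 2, ..., 2N are the
    N+1 inter-word slots (discrete events); indices 1, 3, ..., 2N-1 are the
    N word positions (continuous events)."""
    return list(range(2 * n_words + 1))

def span_credit(pred, ref):
    """Partial credit for a continuous span, given as inclusive word-index
    intervals: overlap length / reference length. (Assumed credit formula;
    the benchmark's exact definition may differ.)"""
    lo, hi = max(pred[0], ref[0]), min(pred[1], ref[1])
    return max(0, hi - lo + 1) / (ref[1] - ref[0] + 1)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A predicted <singing> span over words 3-5 vs. a reference span over words 3-6:
credit = span_credit((3, 5), (3, 6))   # 3 overlapping words / 4 reference words
```

Because discrete tags must land on exactly the right inter-word slot while continuous spans earn graded credit for overlap, the protocol naturally rewards approximate localization of intervals but not of point events.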
4. Training Corpus and Model Development
WESR-Train provides 1,767 hours of labeled speech with vocal events, supporting multilinguality (English and Chinese). Data sources include:
| Source | Approx. Hours | Description |
|---|---|---|
| NonverbalTTS | 14 | Human-inspired pipeline |
| NVSpeech-170k | 332 | Model-expanded corpus |
| NonVerbalSpeech-38K | 87 | Pipeline-annotated |
| SMIIP-NV (in-house) | 35 | Mandarin corpus |
| Gemini-annotated Web mining | 1,299 | Large-scale web audio annotation |
Preprocessing included denoising (MossFormer2, DNSMOS≥2.0), hybrid retrieval, Gemini-2.5-Pro automatic word-level annotation to the WESR taxonomy, and deduplication against the benchmark set. All tags from external sources were manually remapped to conform with the defined taxonomy.
Model fine-tuning was conducted on three backbone architectures: Whisper-Large-v3 (1.5B params), Kimi-Audio-7B-Instruct (7B), Qwen3-Omni-30B-A3B-Instruct (30B). Event tokens were added to tokenizers, with the following learning configurations:
- Whisper: LR=1e−5, batch size 8, warmup 0.1, 3 epochs (H100×8, 4h).
- Kimi/Qwen: LR=1e−6, sequence length 4,096, 3 epochs (H100×8/H200×8, 5h).
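Registering event tags as atomic tokens can be sketched with a toy vocabulary. This is an illustration only: the actual work extends the Whisper/Kimi/Qwen subword tokenizers (and resizes the model's embedding matrix accordingly), which this stand-in dict does not do.

```python
# Toy vocabulary standing in for a real subword tokenizer: each event tag is
# registered as a single atomic token so it is never split into subwords.
DISCRETE = ["[laughs]", "[cough]", "[sigh]"]
CONTINUOUS = ["<singing>", "</singing>", "<whispering>", "</whispering>"]

def add_event_tokens(vocab, tags):
    """Append each unseen tag to the vocabulary with a fresh id. In a real
    setup the model's embedding matrix would be resized to match."""
    for tag in tags:
        if tag not in vocab:
            vocab[tag] = len(vocab)
    return vocab

vocab = {"hello": 0, "world": 1}
add_event_tokens(vocab, DISCRETE + CONTINUOUS)
```

Keeping tags atomic is what allows the evaluation protocol to treat them as indivisible units during alignment.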
ASR quality retention was evaluated via Common Voice WER: e.g., Qwen3-Omni 7.2% (en), 6.0% (zh-CN); WESR-Qwen 8.6% (en), 7.2% (zh-CN).
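These WER figures follow the standard definition (word-level Levenshtein distance over reference length, computed on plain transcripts); a minimal implementation, assuming that standard definition:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j], rolled row by row
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (free if equal)
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```

Event tags would be stripped before scoring, consistent with the protocol's separation of lexical and event errors.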
5. Benchmarking Results and Analyses
WESR-Bench enables standardized model comparison by reporting micro/macro-averaged F1 scores on all 21 tags. Fine-tuned models achieve substantial gains over few-shot LLM baselines:
| Model | Micro F1 | Macro F1 |
|---|---|---|
| Kimi-Audio (2-shot) | 7.3% | 10.3% |
| Gemini-2.5-Pro (2-shot) | 53.9% | 29.9% |
| Gemini-3-Pro (2-shot) | 58.9% | 32.3% |
| WESR-Whisper | 70.6% | 37.7% |
| WESR-Kimi | 70.5% | 37.8% |
| WESR-Qwen | 71.4% | 38.0% |
Per-type analyses show that continuous events (F1 = 64.1%, WESR-Qwen) are detected with higher recall and F1 than discrete events (F1 = 28.2%), consistent with partial credit in interval scoring and the greater difficulty of precise inter-word localization. Aggregated event categories (e.g., all laugh variants) show absolute improvements of 20–40 percentage points over few-shot configurations. Adding event tagging increases WER on general ASR tests by only 1.4–1.5 pp, confirming that event recognition does not materially degrade lexical transcription. Error analysis indicates persistent difficulty with low-energy or brief events (e.g., breathing, sighing).
6. Applications and Research Trajectories
WESR-Bench is designed to catalyze progress in several downstream and cross-disciplinary domains:
- Downstream applications: Enhanced subtitling (e.g., laugh/cry cues), expressive TTS, sentiment analysis at word/event granularity, speaker state estimation, and improved dialogue system context tracking.
- Research directions: Expansion beyond English and Chinese (including code-mixed and low-resource languages), scaling taxonomy to environmental and non-vocal acoustic events, model adaptation and compression for edge settings, and integration with multimodal (e.g., video, physiological) signals to boost robustness and granularity in event detection.
A plausible implication is that WESR-Bench, through its position-aware protocol and comprehensive taxonomy, establishes a new standard for fine-grained vocal event recognition and localization, facilitating precise comparison and accelerating model development (Yang et al., 8 Jan 2026).