AudioSet-R: Enhanced Audio Event Corpus
- AudioSet-R is a refined audio event corpus that leverages a three-stage LLM-assisted relabeling process to improve label accuracy and semantic granularity.
- The framework sequentially applies audio comprehension, guided tag synthesis, and semantic alignment to ensure detailed and reliable annotations.
- Empirical evaluations show mAP improvements of 0.035–0.050 across diverse models, demonstrating significant gains in weakly supervised and multi-label classification.
AudioSet-R is a refined audio event corpus constructed by systematically relabeling Google’s widely used AudioSet benchmark using a multi-stage framework that leverages audio–language foundation models and prompt chaining. AudioSet-R improves upon longstanding limitations in label accuracy and completeness that have constrained downstream model performance, particularly in weakly supervised and multi-label audio classification. The three-stage pipeline—comprising audio comprehension, guided label synthesis, and semantic alignment—yields a large, high-quality relabeled dataset validated by superior empirical results across both supervised and self-supervised learning architectures.
1. Multi-Stage Label Reannotation Framework
The core contribution of AudioSet-R is its three-stage relabeling architecture, which decomposes automated annotation into modular subtasks, inspired by prompt-chaining methodologies. Each 10-second audio clip undergoes the following sequential processes:
Stage I: Audio Comprehension (Semantic Extraction)
- Objective: Produce a fine-grained textual summary for each clip, including primary content description, human vocal attributes (if present), and musical elements (if present).
- Inputs: Raw waveform or mel-spectrogram of the 10-second clip.
- Model: Qwen-Audio through multi-turn chat prompts.
- Prompts:
- T₁,₁ requests a concise description of the main sound events (≤50 words).
- T₁,₂ is triggered if human voice is detected, requesting speaker attributes (gender, emotional state, language).
- T₁,₃ is activated if music is detected, requesting genre and instrument identification.
- Output: Aggregated tuple (T₁,₁, T₁,₂, T₁,₃) of stage responses, with fault-tolerant retries if output constraints are violated.
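The conditional prompt chain of Stage I can be sketched as follows. Here `chat` is a hypothetical wrapper around a Qwen-Audio-style multi-turn interface, and the keyword triggers and retry limit are illustrative stand-ins, not the paper's actual detection logic:

```python
def audio_comprehension(clip, chat, max_retries=3):
    """Return the (T11, T12, T13) summary tuple for one 10-second clip.

    `chat(clip, messages)` is a hypothetical multi-turn audio-LLM call.
    """
    history = []

    def ask(prompt):
        # Fault-tolerant retry: re-ask if the reply violates the length constraint.
        for _ in range(max_retries):
            reply = chat(clip, history + [prompt])
            if len(reply.split()) <= 50:
                history.append(prompt)
                return reply
        return ""  # fallback: leave the field empty after repeated violations

    # T1,1: always issued.
    t11 = ask("Describe the main sound events in at most 50 words.")
    # T1,2: only if human voice was detected in the first response.
    has_voice = any(k in t11.lower() for k in ("voice", "speech", "speaking"))
    t12 = ask("Describe speaker gender, emotional state, and language.") if has_voice else None
    # T1,3: only if music was detected.
    t13 = ask("Identify the music genre and instruments.") if "music" in t11.lower() else None
    return (t11, t12, t13)
```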
Stage II: Label Synthesis (Guided Tag Generation)
- Objective: Synthesize a small set of candidate tags from the structured summary, adhering to AudioSet’s 527-class taxonomy.
- Model: Mistral LLM with explicit prompt restrictions.
- Output: Free-form candidate tag list L_cand = {l₁, …, lₖ}, with k kept small.
Stage III: Semantic Alignment (Ontology Mapping & Filtering)
- Objective: Map candidate labels to exact AudioSet classes, filtering out mismatches.
- Methods:
- DeepSeek R1 taxonomic mapping (exact, fuzzy, synonym matching).
- CLAP semantic filtering, retaining only labels l with cosine similarity sim(f_a(x), f_t(l)) ≥ τ, where f_a and f_t are the CLAP audio and text encoders.
- Output: Final aligned label set L_final per clip.
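A minimal sketch of the Stage III similarity filter, assuming precomputed CLAP-style embeddings. The real system uses CLAP's audio and text encoders; the vectors and threshold here are toy stand-ins:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def filter_labels(audio_emb, label_embs, labels, tau=0.3):
    """Keep only labels whose embedding similarity to the clip meets tau."""
    return [lab for lab, emb in zip(labels, label_embs)
            if cosine(audio_emb, emb) >= tau]
```

Labels whose text embedding points away from the clip's audio embedding fall below τ and are discarded as spurious.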
Pipeline Summary (Algorithm 1):
For each clip x, the framework executes:
- AudioComprehension(x) ⇒ structured summary S
- LabelSynthesis(S) ⇒ candidate tags L_cand
- DeepSeekAlign(L_cand) ⇒ taxonomy-mapped labels L_mapped
- Filter L_mapped to the final L_final using CLAP similarity
- Store (x, L_final) for the relabeled corpus
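The pipeline summary reduces to a thin orchestration loop. In this sketch the four stages are injected as plain callables so any model backend can be plugged in; every name is a stand-in, not the released implementation:

```python
def relabel_corpus(clips, comprehend, synthesize, align, clap_filter):
    """Run the three-stage relabeling pipeline over (clip_id, audio) pairs."""
    corpus = {}
    for clip_id, audio in clips:
        summary = comprehend(audio)        # Stage I: audio comprehension
        candidates = synthesize(summary)   # Stage II: guided tag synthesis
        mapped = align(candidates)         # Stage III: taxonomy mapping
        final = clap_filter(audio, mapped) # Stage III: CLAP similarity filter
        corpus[clip_id] = final
    return corpus
```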
2. Dataset Construction and Statistical Attributes
AudioSet-R relabeling was performed on AudioSet's balanced training and validation splits, yielding a fully reannotated balanced corpus.
Label Density and Distribution:
| Split | Original AudioSet (Mean Labels/Clip) | AudioSet-R (Mean Labels/Clip) |
|---|---|---|
| Train | 2.39 | ≈ 3.1 |
| Val | 2.55 | ≈ 3.4 |
Per-clip label counts increase by roughly 25–30%. AudioSet-R exhibits improved coverage of under-represented fine-grained classes (e.g., "baby crying," "electric guitar") and a reduced bias toward broad classes ("Speech," "Music"). Manual spot checks on 1,000 relabeled clips report >90% agreement with expert annotation, versus ~75% for the original labels, indicating notable gains in label reliability and semantic density.
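The labels-per-clip statistic above is just a mean over a clip→labels mapping; a small sketch with toy annotations (not the released corpus):

```python
def mean_labels_per_clip(annotations):
    """Mean number of labels per clip for a {clip_id: [labels]} mapping."""
    counts = [len(labels) for labels in annotations.values()]
    return sum(counts) / len(counts)
```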
3. Evaluation Protocol and Metrics
AudioSet-R’s impact on classification performance was substantiated using both supervised (AST, PANNs CNN6, CNN14) and self-supervised (SSAST, AudioMAE) model architectures:
- Metrics:
- Class-wise Average Precision (AP): AP_c = ∫₀¹ p_c(r) dr, where p_c(r) is the precision at recall r for class c.
- mean Average Precision (mAP): mAP = (1/C) Σ_c AP_c, averaged over the C = 527 classes.
- Precision, Recall, and F1 for error analysis.
- Experimental Setup:
- AST: 87M parameters, trained from scratch (25 epochs, lr=1e-5, batch=16).
- SSAST: 89M params, fine-tuned (30 epochs, lr=3e-4).
- AudioMAE: 86M params, fine-tuned (60 epochs, lr=1e-3).
- Evaluations covered all four train→eval combinations of the original balanced splits and the AudioSet-R relabeled splits (Orig→Orig, Orig→Relabel, Relabel→Orig, Relabel→Relabel).
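The AP and mAP definitions above can be computed directly from ranked scores. This pure-Python sketch uses the standard discrete form of AP (mean precision at each positive's rank) rather than the literal integral:

```python
def average_precision(scores, targets):
    """AP for one class: mean precision at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_by_class, targets_by_class):
    """Unweighted mean of per-class AP values."""
    aps = [average_precision(s, t)
           for s, t in zip(scores_by_class, targets_by_class)]
    return sum(aps) / len(aps)
```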
4. Empirical Results
Comprehensive experiments revealed consistent mAP improvements ranging from +0.035 to +0.050 when both training and evaluating on AudioSet-R, validating increased label quality and downstream generalization. Representative mAP results:
| Model | Orig→Orig | Orig→Relabel | Relabel→Orig | Relabel→Relabel |
|---|---|---|---|---|
| SSAST | 0.2528 | 0.2603 | 0.2472 | 0.2989 |
| AudioMAE | 0.3408 | 0.3123 | 0.2917 | 0.3425 |
| AST | 0.0979 | 0.1078 | 0.0936 | 0.1310 |
| CNN6 | 0.2530 | 0.2310 | 0.2140 | 0.2620 |
| CNN14 | 0.2630 | 0.2340 | 0.2230 | 0.2690 |
For the transformer models SSAST and AST, checkpoints trained on the original AudioSet score roughly +0.01 mAP higher when evaluated on the AudioSet-R validation split, consistent with higher annotation quality. The increased label count and expanded class coverage underpin these performance gains (Sun et al., 21 Aug 2025).
5. Analytical Insights and Limitations
Several factors underpin the observed improvements:
- Richer semantic summaries facilitate fine-grained tag selection during relabeling.
- Prompt chaining ameliorates hallucinations and enforces hierarchical consistency.
- CLAP-based semantic filtering eliminates spurious or imprecise labels, enhancing reliability.
Error analysis highlights remaining challenges:
- Persistent confusion among acoustically similar subclasses (e.g., "crying" vs. "sobbing").
- Ambiguity in environmental sound categories.
- Self-supervised models (such as AudioMAE) show comparatively weak intrinsic fine-grained semantic discrimination.
Relabeling to date focuses on the balanced subset; scaling to the full unbalanced AudioSet (~2 million clips) is noted as a future direction requiring further optimization.
Limitations include:
- Manual template design for prompts constrains consistency; automated template search is an avenue for future work.
- Model biases in Qwen-Audio and Mistral can introduce systematic labeling errors, suggesting iterative human–LLM collaboration for correction.
6. Significance and Generality in Audio Research
AudioSet-R demonstrates that multi-stage LLM-assisted relabeling substantially advances the accuracy, density, and semantic granularity of large-scale weakly annotated audio corpora. The framework generalizes effectively across diverse classifier architectures, as substantiated by experiments. A plausible implication is broader applicability of foundation model-driven relabeling for other multimodal, weakly annotated datasets in machine listening and audio event analysis.
The code and relabeled dataset are publicly available (https://github.com/colaudiolab/AudioSet-R), offering a vetted resource for benchmarking and further research (Sun et al., 21 Aug 2025).