AudioSet-R: Enhanced Audio Event Corpus
- AudioSet-R is a refined audio event corpus that leverages a three-stage LLM-assisted relabeling process to improve label accuracy and semantic granularity.
- The framework sequentially applies audio comprehension, guided tag synthesis, and semantic alignment to ensure detailed and reliable annotations.
- Empirical evaluations show mAP improvements of 0.035–0.050 across diverse models, demonstrating significant gains in weakly supervised and multi-label classification.
AudioSet-R is a refined audio event corpus constructed by systematically relabeling Google’s widely used AudioSet benchmark using a multi-stage framework that leverages audio–language foundation models and prompt chaining. AudioSet-R improves upon longstanding limitations in label accuracy and completeness that have constrained downstream model performance, particularly in weakly supervised and multi-label audio classification. The three-stage pipeline—comprising audio comprehension, guided label synthesis, and semantic alignment—yields a large, high-quality relabeled dataset validated by superior empirical results across both supervised and self-supervised learning architectures.
1. Multi-Stage Label Reannotation Framework
The core contribution of AudioSet-R is its three-stage relabeling architecture, which decomposes automated annotation into modular subtasks, inspired by prompt-chaining methodologies. Each 10-second audio clip undergoes the following sequential processes:
Stage I: Audio Comprehension (Semantic Extraction)
- Objective: Produce a fine-grained textual summary for each clip, including primary content description, human vocal attributes (if present), and musical elements (if present).
- Inputs: Raw waveform or mel-spectrogram of the 10-second clip.
- Model: Qwen-Audio through multi-turn chat prompts.
- Prompts:
- T₁,₁ requests a concise description of the main sound events (≤50 words).
- T₁,₂ is triggered if human voice is detected, requesting speaker attributes (gender, emotional state, language).
- T₁,₃ is activated if music is detected, requesting genre and instrument identification.
- Output: Aggregated tuple (T₁,₁, T₁,₂, T₁,₃) of stage responses, with fault-tolerant retries if output constraints are violated.
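The conditional prompt chain of Stage I can be sketched as follows. Here `chat` is a hypothetical wrapper around a Qwen-Audio-style multi-turn interface, and the keyword triggers and retry limit are illustrative stand-ins, not the paper's actual detection logic:

```python
def audio_comprehension(clip, chat, max_retries=3):
    """Return the (T11, T12, T13) summary tuple for one 10-second clip.

    `chat(clip, messages)` is a hypothetical multi-turn audio-LLM call.
    """
    history = []

    def ask(prompt):
        # Fault-tolerant retry: re-ask if the reply violates the length constraint.
        for _ in range(max_retries):
            reply = chat(clip, history + [prompt])
            if len(reply.split()) <= 50:
                history.append(prompt)
                return reply
        return ""  # fallback: leave the field empty after repeated violations

    # T1,1: always issued.
    t11 = ask("Describe the main sound events in at most 50 words.")
    # T1,2: only if human voice was detected in the first response.
    has_voice = any(k in t11.lower() for k in ("voice", "speech", "speaking"))
    t12 = ask("Describe speaker gender, emotional state, and language.") if has_voice else None
    # T1,3: only if music was detected.
    t13 = ask("Identify the music genre and instruments.") if "music" in t11.lower() else None
    return (t11, t12, t13)
```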
Stage II: Label Synthesis (Guided Tag Generation)
- Objective: Synthesize a small set of candidate tags from the structured summary, adhering to AudioSet’s 527-class taxonomy.
- Model: Mistral LLM with explicit prompt restrictions.
- Output: Free-form candidate tag list L_cand = {l₁, …, lₖ}, with k kept small.
Stage III: Semantic Alignment (Ontology Mapping & Filtering)
- Objective: Map candidate labels to exact AudioSet classes, filtering out mismatches.
- Methods:
- DeepSeek R1 taxonomic mapping (exact, fuzzy, synonym matching).
- CLAP semantic filtering, retaining only labels l with cosine similarity sim(f_a(x), f_t(l)) ≥ τ, where f_a and f_t are the CLAP audio and text encoders.
- Output: Final aligned label set L_final per clip.
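A minimal sketch of the Stage III similarity filter, assuming precomputed CLAP-style embeddings. The real system uses CLAP's audio and text encoders; the vectors and threshold here are toy stand-ins:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def filter_labels(audio_emb, label_embs, labels, tau=0.3):
    """Keep only labels whose embedding similarity to the clip meets tau."""
    return [lab for lab, emb in zip(labels, label_embs)
            if cosine(audio_emb, emb) >= tau]
```

Labels whose text embedding points away from the clip's audio embedding fall below τ and are discarded as spurious.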
Pipeline Summary (Algorithm 1):
For each clip x, the framework executes:
- AudioComprehension(x) ⇒ structured summary S
- LabelSynthesis(S) ⇒ candidate tags L_cand
- DeepSeekAlign(L_cand) ⇒ taxonomy-mapped labels L_mapped
- Filter L_mapped to the final L_final using CLAP similarity
- Store (x, L_final) for the relabeled corpus
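The pipeline summary reduces to a thin orchestration loop. In this sketch the four stages are injected as plain callables so any model backend can be plugged in; every name is a stand-in, not the released implementation:

```python
def relabel_corpus(clips, comprehend, synthesize, align, clap_filter):
    """Run the three-stage relabeling pipeline over (clip_id, audio) pairs."""
    corpus = {}
    for clip_id, audio in clips:
        summary = comprehend(audio)        # Stage I: audio comprehension
        candidates = synthesize(summary)   # Stage II: guided tag synthesis
        mapped = align(candidates)         # Stage III: taxonomy mapping
        final = clap_filter(audio, mapped) # Stage III: CLAP similarity filter
        corpus[clip_id] = final
    return corpus
```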
2. Dataset Construction and Statistical Attributes
AudioSet-R relabeling was performed on AudioSet's balanced training and validation splits, yielding a fully reannotated balanced corpus.
Label Density and Distribution:
| Split | Original AudioSet (Mean Labels/Clip) | AudioSet-R (Mean Labels/Clip) |
|---|---|---|
| Train | 2.39 | ≈ 3.1 |
| Val | 2.55 | ≈ 3.4 |
Per-clip label counts increase by roughly 25–30%. AudioSet-R exhibits improved coverage of under-represented fine-grained classes (e.g., "baby crying," "electric guitar") and a reduced bias toward broad classes ("Speech," "Music"). Manual spot checks on 1,000 relabeled clips report >90% agreement with expert annotation, versus ~75% for the original labels, indicating notable gains in label reliability and semantic density.
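The labels-per-clip statistic above is just a mean over a clip→labels mapping; a small sketch with toy annotations (not the released corpus):

```python
def mean_labels_per_clip(annotations):
    """Mean number of labels per clip for a {clip_id: [labels]} mapping."""
    counts = [len(labels) for labels in annotations.values()]
    return sum(counts) / len(counts)
```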
3. Evaluation Protocol and Metrics
AudioSet-R’s impact on classification performance was substantiated using both supervised (AST, PANNs CNN6, CNN14) and self-supervised (SSAST, AudioMAE) model architectures:
- Metrics:
- Class-wise Average Precision (AP): AP_c = ∫₀¹ p_c(r) dr, where p_c(r) is the precision at recall r for class c.
- mean Average Precision (mAP): mAP = (1/C) Σ_c AP_c, averaged over the C = 527 classes.
- Precision, Recall, and F1 for error analysis.
- Experimental Setup:
- AST: 87M parameters, trained from scratch (25 epochs, lr=1e-5, batch=16).
- SSAST: 89M params, fine-tuned (30 epochs, lr=3e-4).
- AudioMAE: 86M params, fine-tuned (60 epochs, lr=1e-3).
- Evaluations covered all four train→eval combinations of the original balanced splits and the AudioSet-R relabeled splits (Orig→Orig, Orig→Relabel, Relabel→Orig, Relabel→Relabel).
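The AP and mAP definitions above can be computed directly from ranked scores. This pure-Python sketch uses the standard discrete form of AP (mean precision at each positive's rank) rather than the literal integral:

```python
def average_precision(scores, targets):
    """AP for one class: mean precision at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_by_class, targets_by_class):
    """Unweighted mean of per-class AP values."""
    aps = [average_precision(s, t)
           for s, t in zip(scores_by_class, targets_by_class)]
    return sum(aps) / len(aps)
```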
4. Empirical Results
Comprehensive experiments revealed consistent mAP improvements ranging from +0.035 to +0.050 when both training and evaluating on AudioSet-R, validating increased label quality and downstream generalization. Representative mAP results:
| Model | Orig→Orig | Orig→Relabel | Relabel→Orig | Relabel→Relabel |
|---|---|---|---|---|
| SSAST | 0.2528 | 0.2603 | 0.2472 | 0.2989 |
| AudioMAE | 0.3408 | 0.3123 | 0.2917 | 0.3425 |
| AST | 0.0979 | 0.1078 | 0.0936 | 0.1310 |
| CNN6 | 0.2530 | 0.2310 | 0.2140 | 0.2620 |
| CNN14 | 0.2630 | 0.2340 | 0.2230 | 0.2690 |
For the transformer models SSAST and AST, checkpoints trained on the original AudioSet score roughly +0.01 mAP higher when evaluated on the AudioSet-R validation split, consistent with higher annotation quality. The increased label count and expanded class coverage underpin these performance gains (Sun et al., 21 Aug 2025).
5. Analytical Insights and Limitations
Several factors underpin the observed improvements:
- Richer semantic summaries facilitate fine-grained tag selection during relabeling.
- Prompt chaining ameliorates hallucinations and enforces hierarchical consistency.
- CLAP-based semantic filtering eliminates spurious or imprecise labels, enhancing reliability.
Error analysis highlights remaining challenges:
- Persistent confusion among acoustically similar subclasses (e.g., "crying" vs. "sobbing").
- Ambiguity in environmental sound categories.
- Self-supervised models (such as AudioMAE) show comparatively weak intrinsic fine-grained semantic discrimination.
Relabeling to date focuses on the balanced subset; scaling to the full unbalanced AudioSet (~2 million clips) is noted as a future direction requiring further optimization.
Limitations include:
- Manual template design for prompts constrains consistency; automated template search is an avenue for future work.
- Model biases in Qwen-Audio and Mistral can introduce systematic labeling errors, suggesting iterative human–LLM collaboration for correction.
6. Significance and Generality in Audio Research
AudioSet-R demonstrates that multi-stage LLM-assisted relabeling substantially advances the accuracy, density, and semantic granularity of large-scale weakly annotated audio corpora. The framework generalizes effectively across diverse classifier architectures, as substantiated by experiments. A plausible implication is broader applicability of foundation model-driven relabeling for other multimodal, weakly annotated datasets in machine listening and audio event analysis.
The code and relabeled dataset are publicly available (https://github.com/colaudiolab/AudioSet-R), offering a vetted resource for benchmarking and further research (Sun et al., 21 Aug 2025).