USAD: Universal Speech and Audio Representation via Distillation

Published 23 Jun 2025 in cs.SD, cs.CL, and eess.AS | (2506.18843v1)

Abstract: Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

Summary

  • The paper introduces USAD, a novel framework that distills knowledge from separate speech and general audio self-supervised models into a single encoder capable of creating universal representations for diverse audio tasks.
  • USAD employs a dual-teacher, sparse layer-to-layer distillation strategy with an efficient L1-cosine loss, significantly reducing computational overhead by approximately 75% compared to dense distillation.
  • Experimental results show that USAD achieves competitive performance across a wide range of speech and non-speech audio benchmarks (SUPERB, HEAR, AudioSet), demonstrating that unified audio representation learning is both effective and efficient while simplifying downstream integration.

Universal Speech and Audio Distillation: A Unified Approach to Audio Representation Learning

The paper "USAD: Universal Speech and Audio Representation via Distillation" (2506.18843) addresses the persistent fragmentation in self-supervised audio representation learning, where models are typically specialized for either speech or non-speech (sound/music) domains. The authors propose Universal Speech and Audio Distillation (USAD), a unified framework that leverages knowledge distillation from domain-specific self-supervised learning (SSL) models to train a single encoder capable of extracting general-purpose representations across speech, sound, and music.

Methodology

USAD is built upon the insight that while speech and non-speech audio share underlying signal characteristics, existing SSL models are optimized for their respective domains, leading to suboptimal cross-domain generalization. The USAD framework employs a dual-teacher, sparse layer-to-layer (L2L) distillation strategy:

  • Dual-Teacher Distillation: USAD simultaneously distills knowledge from two pre-trained SSL models—one specialized in speech (e.g., WavLM Base+) and one in general audio (e.g., ATST Frame). Both teachers process the same mixed-domain input, and the student model is trained to match their intermediate representations.
  • Sparse L2L Distillation: Instead of dense, computationally expensive layer-wise matching, USAD distills only from a subset of layers (e.g., 4 out of 12), leveraging the redundancy between adjacent transformer layers. This reduces computational overhead by approximately 75% compared to dense L2L approaches.
  • Loss Function: The distillation objective combines L1 distance and cosine similarity between the student’s predicted features and the teachers’ feed-forward network (FFN) outputs, eschewing contrastive losses and negative sampling for efficiency (see the sketch after this list).
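
To make the objective concrete, here is a minimal PyTorch sketch of the dual-teacher, sparse L2L loss. It is illustrative only: the matched layer indices, feature dimensions, and the equal weighting of the L1 and cosine terms are assumptions rather than the paper's released configuration, and random tensors stand in for real teacher and student features.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, D = 100, 768  # frames, feature dimension (illustrative)

def l1_cosine_loss(pred, target):
    """L1 distance plus (1 - cosine similarity), averaged over frames;
    one plausible form of the paper's L1-cosine objective."""
    l1 = F.l1_loss(pred, target)
    cos = F.cosine_similarity(pred, target, dim=-1).mean()
    return l1 + (1.0 - cos)

# Stand-ins for per-layer FFN outputs of the two frozen teachers, and the
# student's predictions of them (in practice, lightweight heads project
# student layers into each teacher's feature space).
speech_teacher = [torch.randn(T, D) for _ in range(12)]
audio_teacher = [torch.randn(T, D) for _ in range(12)]
pred_speech = [torch.randn(T, D, requires_grad=True) for _ in range(12)]
pred_audio = [torch.randn(T, D, requires_grad=True) for _ in range(12)]

# Sparse layer-to-layer matching: distill only a subset of layers
# (these indices are illustrative, not taken from the paper).
matched = [2, 5, 8, 11]

loss = sum(
    l1_cosine_loss(pred_speech[i], speech_teacher[i])
    + l1_cosine_loss(pred_audio[i], audio_teacher[i])
    for i in matched
)
loss.backward()
print(f"total distillation loss: {loss.item():.3f}")
```

In a real training loop, the teacher features would come from frozen forward passes of the speech and audio teachers on the same mixed-domain batch, and the student predictions from small projection heads on the selected student layers.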

A critical design choice is the use of frame-based feature extraction for both teachers and the student, ensuring temporal alignment and preserving fine-grained information necessary for speech tasks, while maintaining sufficient generality for non-speech audio.
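
Frame-based teachers make this alignment essentially free, since teacher and student features share a time grid. For context, if the frame rates did differ (say a 25 Hz teacher against a 50 Hz student; both rates are hypothetical here), a standard remedy is to interpolate teacher features along the time axis, as in this sketch:

```python
import torch
import torch.nn.functional as F

def align_frames(feat: torch.Tensor, target_len: int) -> torch.Tensor:
    """Linearly interpolate a (T, D) feature sequence along time so a
    teacher's frame rate matches the student's."""
    x = feat.t().unsqueeze(0)  # (1, D, T): interpolate expects channels-first
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=False)
    return x.squeeze(0).t()    # (target_len, D)

# e.g. a 25 Hz teacher over 10 s of audio, matched to a 50 Hz student
teacher_feat = torch.randn(250, 768)
print(align_frames(teacher_feat, 500).shape)  # torch.Size([500, 768])
```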

Experimental Results

USAD is evaluated on a comprehensive suite of benchmarks, including SUPERB (speech), HEAR (holistic audio), AudioSet (audio tagging), and ESC-50 (sound classification). The training corpus, Mix126k-B, is a balanced mixture of large-scale speech, sound, and music datasets, with upsampling to ensure domain parity.
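
One simple way to approximate such domain parity is inverse-frequency upsampling with a weighted sampler. The sketch below uses PyTorch's WeightedRandomSampler with hypothetical clip counts; the paper's actual balancing recipe may differ.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical per-clip domain labels; the counts are illustrative and
# deliberately imbalanced, as raw speech/sound/music corpora tend to be.
domain_of = ["speech"] * 9000 + ["sound"] * 2000 + ["music"] * 1000
counts = {d: domain_of.count(d) for d in set(domain_of)}

# Inverse-frequency weights upsample the smaller domains so each domain is
# drawn with roughly equal probability, approximating domain parity.
weights = torch.tensor([1.0 / counts[d] for d in domain_of])
sampler = WeightedRandomSampler(weights, num_samples=len(domain_of),
                                replacement=True)

drawn = list(sampler)
print(Counter(domain_of[i] for i in drawn))  # roughly equal per domain
```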

Key findings include:

  • Competitive Performance Across Domains: USAD Base (94M parameters) achieves a SUPERB average score of 787.0, outperforming all audio SSL baselines and closely matching or surpassing domain-specific teacher models in both speech and audio tasks.
  • Scalability: Increasing model size (USAD Large, 330M parameters) yields further gains, with an average SUPERB score of 851.7 and strong results on HEAR, closing the gap with state-of-the-art task-specific models.
  • Efficiency: Sparse L2L distillation and the L1-cosine loss enable USAD to reach high performance with significantly reduced computational cost compared to dense distillation or contrastive learning approaches.
  • Ablation Studies: The choice of teacher models, data distribution, and distillation strategy are all shown to impact downstream performance. Notably, frame-based teachers and balanced training data are essential for robust cross-domain generalization.

Notable Numerical Results

  • On SUPERB, USAD Base achieves scores of 868.9 (frame-level speech), 938.0 (instance-level speech), and 554.2 (audio), for an overall average of 787.0.
  • On HEAR, USAD Large attains an average score of 79.7, surpassing the concatenated teacher topline (78.5) and approaching the best per-task results on several benchmarks.
  • USAD models consistently outperform single-domain SSL models in joint evaluations, demonstrating the effectiveness of the unified approach.

Implications and Future Directions

USAD demonstrates that a single encoder, distilled from multiple domain-specific SSL experts, can achieve near state-of-the-art performance across a wide range of speech and audio tasks. This unification has several practical and theoretical implications:

  • Simplified Downstream Integration: Multimodal and audio-enabled systems (e.g., audio-LLMs, speech-to-audio generation) can leverage a single, general-purpose encoder, reducing system complexity and maintenance overhead.
  • Resource Efficiency: Sparse distillation and unified training reduce the need for maintaining and deploying multiple large models, which is particularly beneficial for edge and real-time applications.
  • Foundation for Multimodal AI: As audio becomes increasingly central in multimodal AI, universal representations such as those produced by USAD are likely to become foundational components for large-scale, cross-domain models.

The authors identify several avenues for future work, including extending USAD to multilingual speech, improving robustness to domain shifts, and integrating the framework into large audio-LLMs. The demonstrated scalability and efficiency of USAD suggest that further gains are possible with larger models and more diverse training data.

Conclusion

USAD provides a principled and efficient solution to the challenge of universal audio representation learning. By distilling from multiple domain-specific SSL models using a sparse, frame-aligned strategy, USAD achieves strong, balanced performance across speech, sound, and music tasks. The approach offers a practical path toward unified audio encoders, with significant implications for the development of generalist AI systems and multimodal applications.
