Sampling-Frequency-Independent Universal Sound Separation

Published 22 Sep 2023 in eess.AS, cs.LG, and cs.SD | (2309.12581v1)

Abstract: This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). The USS aims at separating arbitrary sources of different types and can be the key technique to realize a source separator that can be universally used as a preprocessor for any downstream tasks. To realize a universal source separator, there are two essential properties: universalities with respect to source types and recording conditions. The former property has been studied in the USS literature, which has greatly increased the number of source types that can be handled by a single neural network. However, the latter property (e.g., SF) has received less attention despite its necessity. Since the SF varies widely depending on the downstream tasks, the universal source separator must handle a wide variety of SFs. In this paper, to encompass the two properties, we propose an SF-independent (SFI) extension of a computationally efficient USS network, SuDoRM-RF. The proposed network uses our previously proposed SFI convolutional layers, which can handle various SFs by generating convolutional kernels in accordance with an input SF. Experiments show that signal resampling can degrade the USS performance and the proposed method works more consistently than signal-resampling-based methods for various SFs.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel SFI convolutional layer that adapts to various sampling frequencies without requiring resampling.
It integrates these layers into the SuDoRM-RF network, enabling effective separation of diverse sound sources under different recording conditions.
Experimental results on the FUSS48k dataset demonstrate significant improvements in SI-SDR metrics compared to traditional resampling-based methods.

Sampling-Frequency-Independent Universal Sound Separation

Introduction

The paper "Sampling-Frequency-Independent Universal Sound Separation" introduces a universal sound separation (USS) method that is capable of handling untrained sampling frequencies (SFs) through the incorporation of sampling-frequency-independent (SFI) convolutional layers. The primary motivation is to develop a sound separator that universally applies to diverse source types and recording conditions. While previous efforts have emphasized the need for handling various source types, executing separation across different SFs has remained underexplored. Recognizing the SF's variable nature across tasks, the proposed work integrates SFI convolutional layers within the SuDoRM-RF network, demonstrating notable improvements over traditional signal-resampling techniques.

Figure 1: Network architectures of the proposed SFI version of SuDoRM-RF, original SuDoRM-RF, and mask predictor.

SuDoRM-RF Network Overview

SuDoRM-RF is a state-of-the-art USS network characterized by its computational efficiency and superior performance in handling a diverse range of source types. It employs a one-dimensional convolutional layer architecture with trainable analysis and synthesis filterbanks, often referred to as the encoder and decoder. The encoder transfigures input signals into a pseudo time-frequency domain, which the mask predictor uses to derive separation masks for each source. The network innovatively addresses scenarios with variable numbers of sources, allowing for robust assignment and permutation invariant training through a negative average SNR-based loss function.

SFI Convolutional Layer

The paper introduces a novel extension, the SFI convolutional layer, which generates convolutional kernels compatible with the SF of the input. By employing analog filters as archtypes, the digital filter analogously adapts through a frequency-domain filter design method, making it invariant to the input SF. This adaptability is pivotal in environments with a wide distribution of SFs, enabling the network to maintain consistent performance without the inefficiencies introduced by signal resampling.

The SFI convolutional layer replaces traditional convolution layers within SuDoRM-RF, ensuring the network's seamless transition between different SF environments with negligible computational overhead, beyond the initial generation of weights per unique SF.

Figure 2: Architecture of SFI convolutional layer using frequency-domain digital filter design.

Experimental Evaluation

Experiments conducted using the FUSS48k dataset affirm the effectiveness of the SFI approach. By resampling the dataset at various SFs, the proposed method significantly outperforms traditional methods which rely on signal resampling. This performance superiority is evident in both SI-SDR and SI-SDR improvement metrics across all evaluated N sources.

Figure 3: Separation performances of proposed and signal-resampling-based methods for each N. Red dotted lines denote trained SF. Error bars show standard errors.

The results underscore the degradation introduced by resampling-based methods, particularly at lower SFs, which contrasts sharply with the consistent results obtained using the proposed SFI layers. The SFI extension of SuDoRM-RF demonstrates maintained efficacy across a diverse SF spectrum, validating the utility of the proposed framework for universal separation tasks.

Conclusion

This research delivers a practical and efficient solution for universal sound separation in environments with variable SFs. By innovatively integrating SFI convolutional layers within the SuDoRM-RF structure, the approach elegantly circumvents challenges posed by SF inconsistencies, ensuring high performance across various recording contexts. The paper paves the way for universally applicable sound separation models, providing a crucial step toward seamless preprocessing capabilities across myriad audio signal processing applications. Future research may explore further optimizations for real-time applications and broader testing across additional datasets.

Markdown Report Issue