Multimodal Urban Sound Tagging with Spatiotemporal Context

Published 31 Oct 2020 in eess.AS | (2011.00175v2)

Abstract: Noise pollution significantly affects our daily life and urban development. Urban Sound Tagging (UST), which aims to analyze and monitor urban noise pollution, has attracted much attention recently. One weakness of previous UST studies is that the spatial and temporal context of sound signals, which carries complementary information about when and where the audio was recorded, has not been investigated. To address this problem, we propose a multimodal UST system that jointly mines the audio and its spatiotemporal context. To incorporate the characteristics of different acoustic features, two sets of four spectrograms are first extracted as the inputs of residual neural networks. The spatiotemporal context is then encoded and combined with the acoustic features to explore the efficiency of multimodal learning for discriminating sound signals. Moreover, a data filtering approach is adopted in text processing to further improve multimodal performance. We evaluate the proposed method on the UST challenge (Task 5) of DCASE2020. Experimental results demonstrate the effectiveness of the proposed method.
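The abstract does not say how the spatiotemporal context is encoded or how it is combined with the acoustic features. As an illustration only, the sketch below shows one common way such metadata could be handled. It assumes cyclical sine/cosine encoding for time fields, a one-hot vector for a location identifier, and simple concatenation with an audio embedding. All function names, field names, and periods here are hypothetical and are not taken from the paper.

```python
import math


def cyclical(value, period):
    """Map a periodic quantity onto the unit circle so that the end of a
    cycle lands next to its start (e.g. hour 23 is close to hour 0)."""
    angle = 2.0 * math.pi * (value % period) / period
    return [math.sin(angle), math.cos(angle)]


def one_hot(index, size):
    """One-hot vector for a categorical value such as a location ID."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_context(hour, day, week, location, n_locations):
    """Assumed context encoding: three cyclical time fields followed by a
    one-hot location. The periods (24, 7, 52) are illustrative choices."""
    return (
        cyclical(hour, 24)
        + cyclical(day, 7)
        + cyclical(week, 52)
        + one_hot(location, n_locations)
    )


def fuse(audio_embedding, context_vector):
    """Assumed fusion step: concatenate the audio embedding (e.g. the
    pooled output of a residual network) with the context vector."""
    return list(audio_embedding) + list(context_vector)


# Example: a hypothetical 8-dimensional audio embedding fused with context.
context = encode_context(hour=23, day=6, week=10, location=2, n_locations=4)
features = fuse([0.1] * 8, context)
```

In a setup like this, the fused vector would feed a classification layer that predicts the sound tags. Cyclical encoding is chosen in this sketch so that recordings made just before and just after midnight receive similar time features.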