Graph-Enhanced Dual-Stream Feature Fusion with Pre-Trained Model for Acoustic Traffic Monitoring

Published 26 Dec 2024 in eess.AS and eess.SP | (2412.19078v1)

Abstract: Microphone array techniques are widely used in sound source localization and smart city acoustic-based traffic monitoring, but these applications face significant challenges due to the scarcity of labeled real-world traffic audio data and the complexity and diversity of application scenarios. The DCASE Challenge's Task 10 focuses on using multi-channel audio signals to count vehicles (cars or commercial vehicles) and identify their directions (left-to-right or vice versa). In this paper, we propose a graph-enhanced dual-stream feature fusion network (GEDF-Net) for acoustic traffic monitoring, which simultaneously considers vehicle type and direction to improve detection. We propose a graph-enhanced dual-stream feature fusion strategy which consists of a vehicle type feature extraction (VTFE) branch, a vehicle direction feature extraction (VDFE) branch, and a frame-level feature fusion module to combine the type and direction features for enhanced performance. A pre-trained model (PANNs) is used in the VTFE branch to mitigate data scarcity and enhance the type features, followed by a graph attention mechanism to exploit temporal relationships and highlight important audio events within these features. The frame-level fusion of direction and type features enables fine-grained feature representation, resulting in better detection performance. Experiments demonstrate the effectiveness of our proposed method. GEDF-Net is our submission that achieved 1st place in the DCASE 2024 Challenge Task 10.


Summary

The paper demonstrates that GEDF-Net improves vehicle detection by fusing pre-trained audio features with graph-based attention to overcome data scarcity.
It employs a dual-stream architecture with VTFE and VDFE branches to extract both vehicle type and direction features for precise categorization.
Experimental results on the DCASE 2024 dataset show superior performance, with improvements in Kendall’s Tau and RMSE, achieving first place in the challenge.


The paper introduces GEDF-Net (Graph-Enhanced Dual-Stream Feature Fusion Network), an approach to acoustic traffic monitoring that targets two obstacles: the scarcity of labeled real-world traffic audio and the diversity of monitoring scenarios. The model improves vehicle detection by considering vehicle type and direction simultaneously.

Methodology

GEDF-Net incorporates a dual-stream feature fusion strategy, encompassing:

Vehicle Type Feature Extraction (VTFE) Branch: This branch utilizes a pre-trained audio model (PANNs) to enhance feature representation, thereby mitigating the data scarcity issue. To further refine these features, a graph attention mechanism is applied, capturing temporal relationships and emphasizing important audio events.
Vehicle Direction Feature Extraction (VDFE) Branch: This branch uses GCC-PHAT (generalized cross-correlation with phase transform) to extract features related to the direction of vehicle movement, which is critical for accurate traffic monitoring.
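
To make the direction cue concrete, here is a minimal NumPy sketch of GCC-PHAT for a single pair of microphone signals. It is an illustration only, not the authors' implementation: the function names `gcc_phat` and `estimate_delay`, the zero-padding length, and the `max_lag` parameter are assumptions introduced here. In a frame-level pipeline the same computation would presumably be applied to each short frame.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """GCC-PHAT of x and y for integer lags in [-max_lag, max_lag]."""
    n = len(x) + len(y)                   # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)                # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

def estimate_delay(x, y, max_lag):
    """Lag (in samples) of the correlation peak; positive if x lags y."""
    return int(np.argmax(gcc_phat(x, y, max_lag))) - max_lag

# Usage: a noise burst reaches microphone x five samples after microphone y.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
x_delayed = np.concatenate((np.zeros(5), s[:-5]))
delay = estimate_delay(x_delayed, s, max_lag=16)
```

For a vehicle passing the array, the sign of the estimated lag changes as the source moves from one side to the other, which is what makes this feature informative about travel direction.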

The distinct features extracted by these branches are fused using a frame-level feature fusion module. This integration allows for a fine-grained representation of traffic events that takes into account both vehicle type and travel direction. The final component of the model, a category count predictor, estimates the counts of vehicles categorized by both type and direction.
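
The fusion and counting steps can be sketched as follows. This summary does not state which fusion operator or predictor the paper uses, so the per-frame concatenation, the temporal mean pooling, the single linear layer, and the name `fuse_and_count` are all assumptions made for illustration. The four outputs stand for the four type/direction combinations (car or commercial vehicle, left-to-right or right-to-left).

```python
import numpy as np

def fuse_and_count(type_feats, dir_feats, w, b):
    """Hypothetical frame-level fusion followed by a count predictor.

    type_feats: (T, Dt) per-frame type features
    dir_feats:  (T, Dd) per-frame direction features
    w, b:       (Dt + Dd, 4) weights and (4,) bias of an assumed linear head
    """
    if type_feats.shape[0] != dir_feats.shape[0]:
        raise ValueError("both streams must have the same number of frames")
    # Fuse frame by frame, so each time step keeps both kinds of information.
    fused = np.concatenate([type_feats, dir_feats], axis=1)   # (T, Dt + Dd)
    pooled = fused.mean(axis=0)                               # average over time
    counts = np.maximum(pooled @ w + b, 0.0)                  # counts cannot be negative
    return fused, counts

# Usage with random placeholder features and weights.
rng = np.random.default_rng(0)
type_feats = rng.standard_normal((50, 8))
dir_feats = rng.standard_normal((50, 3))
w = rng.standard_normal((11, 4))
b = np.zeros(4)
fused, counts = fuse_and_count(type_feats, dir_feats, w, b)
```

Fusing at the frame level, rather than after pooling each stream separately, lets the predictor associate a type cue with the direction cue observed at the same moment.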

Experimental Results

The evaluation uses the DCASE 2024 Challenge Task 10 dataset. Performance is measured with Kendall's Tau rank correlation and RMSE, and GEDF-Net improves on the baseline methods in both metrics. The authors report that the system achieved first place in the challenge.
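
For readers unfamiliar with the two metrics, the following pure-Python sketch computes them for one vehicle category. It assumes the tie-corrected tau-b variant of Kendall's Tau, which is a choice made here because vehicle counts contain many ties; the summary does not say which variant the challenge uses. The function names are hypothetical.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted counts."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Usage: true vs. predicted counts for one category over four clips.
true_counts = [1, 2, 2, 3]
pred_counts = [1, 2, 3, 3]
tau = kendall_tau_b(true_counts, pred_counts)
err = rmse(true_counts, pred_counts)
```

Kendall's Tau rewards getting the ordering of busy and quiet clips right, while RMSE penalizes the size of the count errors, so the two metrics are complementary.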

GEDF-Net's effectiveness is attributed mainly to the pre-trained model and the graph attention mechanism, which together enhance feature representation and offset the scarcity of labeled traffic data. Ablation studies support this, showing the benefit of integrating the pre-trained model and graph attention within the VTFE branch.

Implications and Future Directions

The findings from this study have several key implications for the field of acoustic traffic monitoring. Foremost is the utility of pre-trained models in scenarios characterized by data scarcity, where external knowledge sources such as PANNs can significantly enhance model performance. The employment of graph attention mechanisms to capture contextual relationships between audio frames further illustrates the potential for refined temporal feature modeling in similar applications.
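
As an illustration of attention over a graph of audio frames, here is a minimal single-head, GAT-style layer in NumPy. It is a sketch under stated assumptions, not the paper's architecture: the graph links each frame to neighbors within a fixed `window`, and the names `frame_graph_attention`, `w`, and `a` are hypothetical.

```python
import numpy as np

def frame_graph_attention(h, w, a, window=2, slope=0.2):
    """Single-head GAT-style attention over audio frames.

    h: (T, D) frame features, w: (D, F) projection, a: (2F,) attention vector.
    Assumed graph: frame i is connected to frames j with |i - j| <= window.
    """
    z = h @ w                                             # (T, F) projected frames
    f = z.shape[1]
    # Score e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into two dot products.
    e = (z @ a[:f])[:, None] + (z @ a[f:])[None, :]       # (T, T)
    e = np.where(e > 0, e, slope * e)
    idx = np.arange(h.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= window  # adjacency incl. self-loops
    e = np.where(mask, e, -np.inf)                        # ignore non-neighbors
    e = e - e.max(axis=1, keepdims=True)                  # numerically stable softmax
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha                               # attended features, weights

# Usage with random placeholder features and parameters.
rng = np.random.default_rng(0)
h = rng.standard_normal((10, 6))
w = rng.standard_normal((6, 4))
a = rng.standard_normal(8)
out, alpha = frame_graph_attention(h, w, a, window=2)
```

Each output frame is a weighted mix of its neighbors, so frames that carry a salient event can receive larger weights and dominate the refined representation.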

Future work could extend GEDF-Net to other domains where data scarcity is a prominent issue, using a similar dual-stream architecture with graph-based attention to enhance feature extraction. Exploring alternative pre-training datasets and architectures may yield further gains.

In conclusion, GEDF-Net represents a step forward in acoustic traffic monitoring, combining pre-trained audio features, graph attention, and frame-level fusion of type and direction features to improve vehicle counting by type and direction. As smart cities and automated traffic systems continue to evolve, methods like this one are likely to help improve their efficiency and accuracy.
