Audio-Visual Event Localization on Portrait Mode Short Videos: A Review
The paper "Audio-Visual Event Localization on Portrait Mode Short Videos" introduces AVE-PM, a novel dataset built for audio-visual event localization (AVEL) on mobile-centric video content. The paper traces the shift in online video formats driven by widespread smartphone usage and addresses the distinct challenges posed by short, portrait-oriented videos. AVE-PM comprises 25,335 clips spanning 86 fine-grained categories, each with precise frame-level annotations.
Dataset Characteristics and Methodological Framework
The AVE-PM dataset is explicitly designed to fill a gap in existing AVEL datasets, which predominantly consist of landscape-oriented long videos with relatively simple audio. In contrast, AVE-PM captures the characteristics of portrait mode short videos, whose soundtracks often layer overlapping elements such as sound effects, voiceovers, and music. This audio complexity, together with the distinct spatial biases introduced by portrait framing, poses a significant challenge for conventional AVEL models.
The dataset's coverage across domains, organized by the taxonomy used in its construction, yields a representative sample of the real-world short videos found on platforms like Douyin. Importantly, every clip is human-annotated with event onsets and offsets, giving the data high reliability at frame-level granularity. A per-video haveBGM flag further marks whether background music is present, making it possible to isolate the effect of music interference on AVEL models.
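To make the annotation scheme concrete, the sketch below shows one plausible way to represent a record and expand its onset/offset span into per-frame labels. The field names (clip_id, onset_sec, offset_sec, have_bgm) and the dataclass layout are illustrative assumptions, not the paper's released file format.

```python
# A minimal sketch of how one AVE-PM annotation record could be represented.
# Field names and layout are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class AvePmAnnotation:
    clip_id: str        # identifier of the short-video clip
    category: str       # one of the 86 fine-grained event categories
    onset_sec: float    # human-annotated event start time, in seconds
    offset_sec: float   # human-annotated event end time, in seconds
    have_bgm: bool      # True if the clip contains background music

    def frame_labels(self, fps: float, num_frames: int) -> list[int]:
        """Expand the onset/offset span into per-frame binary labels."""
        return [
            1 if self.onset_sec <= i / fps < self.offset_sec else 0
            for i in range(num_frames)
        ]


# Usage: a 10-second clip at 1 fps with an event from 2s to 5s.
ann = AvePmAnnotation("clip_0001", "dog_barking", 2.0, 5.0, have_bgm=False)
print(ann.frame_labels(fps=1.0, num_frames=10))  # [0, 0, 1, 1, 1, 0, ...]
```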
Empirical Findings and Model Evaluations
Through rigorous empirical testing, the paper reports an average performance drop of 18.66% when state-of-the-art AVEL methods are evaluated across video formats, i.e., when models trained on one orientation are tested on the other. This confirms that models trained on landscape videos do not generalize well to portrait videos, revealing a substantial domain gap. Two primary causes are identified: the spatial bias inherent in portrait framing, and the complex, often noisy audio typical of short videos. These insights underscore the need for tailored preprocessing and model adjustments to perform well on such content.
Key strategies include comparing resizing and cropping schemes to mitigate spatial bias, and excluding audio tracks with background music to reduce interference from non-event-related sounds. Notably, models trained with random cropping showed significant performance improvements, suggesting that exposing the model to diverse regions of the tall frame is vital for accurate localization in portrait videos; a sketch of such a preprocessing step follows.
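The snippet below sketches the kind of preprocessing being compared: naively squashing a portrait (9:16) frame into the square input most AVEL backbones expect, versus resizing the short side and taking a random square crop. The 224x224 target size and the torchvision-based pipeline are assumptions for illustration, not the paper's exact configuration.

```python
# A minimal sketch contrasting aspect-ratio-distorting resize with random
# cropping of a portrait frame. Target size and library choice are assumed.
import torch
from torchvision import transforms

portrait_frame = torch.rand(3, 1280, 720)  # dummy 9:16 frame (C, H, W)

# Naive baseline: squash the whole frame, distorting its aspect ratio.
resize_only = transforms.Resize((224, 224), antialias=True)

# Random crop: resize the short side, then sample a random square window,
# exposing the model to visual content away from the frame's center.
random_crop = transforms.Compose([
    transforms.Resize(224, antialias=True),  # short side -> 224 pixels
    transforms.RandomCrop(224),              # random 224x224 window
])

squashed = resize_only(portrait_frame)  # (3, 224, 224), distorted
cropped = random_crop(portrait_frame)   # (3, 224, 224), undistorted patch
```

Applied at training time, the random crop acts as augmentation over the tall frame, which is consistent with the reported benefit of capturing diverse visual information.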
Implications and Future Directions
The introduction of the AVE-PM dataset and the associated analysis have several implications for both practical applications and theoretical advancements in AVEL. Practically, the development of preprocessing techniques and model adaptations for short video formats can lead to better multimodal scene understanding in social media contexts, mobile video applications, and real-time event detection systems. Theoretically, these findings encourage further investigation into cross-modal interactions in complex audio settings and highlight the need for models that can effectively adapt to varied content formats, especially as mobile device usage proliferates.
The paper's exploration of portrait mode videos invites future research to refine and extend AVEL methods for the data characteristics of emerging video formats. Integrating adaptive fusion mechanisms for auditory and visual signals could further improve model robustness in complex audio environments; a rough sketch of one such mechanism follows. This work serves as a stepping stone toward AVEL techniques that keep pace with the evolving nature of digital content consumption.
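As a purely hypothetical illustration of the adaptive fusion direction, and not the paper's method, the sketch below gates the audio stream with a learned per-timestep weight, so the model can down-weight audio when it is uninformative (e.g., dominated by background music). All module names and dimensions are assumptions.

```python
# A rough, hypothetical sketch of gated audio-visual fusion; dimensions
# and architecture are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class GatedAudioVisualFusion(nn.Module):
    def __init__(self, audio_dim: int = 128, visual_dim: int = 512,
                 hidden_dim: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Gate predicts, per time step, how much to trust the audio stream.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim)
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        g = self.gate(torch.cat([a, v], dim=-1))  # (B, T, 1) in [0, 1]
        return g * a + v                          # audio contribution is gated


fusion = GatedAudioVisualFusion()
fused = fusion(torch.rand(2, 10, 128), torch.rand(2, 10, 512))  # (2, 10, 256)
```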