- The paper introduces a dual-branch design that separates object-agnostic and object-specific embeddings, achieving superior segmentation accuracy.
- It leverages a Gated Propagation Module with single-head attention to overcome computational bottlenecks and enable real-time performance.
- Empirical results show significant improvements on benchmarks like DAVIS and YouTube-VOS, with high accuracy and efficient processing rates.
Decoupling Features in Hierarchical Propagation for Video Object Segmentation
This paper introduces a novel approach to improving the efficiency and accuracy of semi-supervised Video Object Segmentation (VOS) through the Decoupling Features in Hierarchical Propagation (DeAOT) framework. Building on the Associating Objects with Transformers (AOT) approach, which propagates information hierarchically from past frames to the current one, DeAOT addresses a limitation inherent in that design: because AOT carries object-agnostic visual features and object-specific identification (ID) information through a single shared stream, the gradually accumulated object-specific information can overshadow the object-agnostic information in deeper propagation layers, leading to inefficient learning of visual embeddings.
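To make the coupling problem concrete, here is a minimal NumPy sketch of a single AOT-style propagation step in which the attention readout mixes visual and ID embeddings in one shared stream. The function names and the additive mixing are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coupled_propagation(query, mem_key, mem_visual, mem_id):
    """One coupled propagation step (illustrative): the memory values
    carry both object-agnostic (visual) and object-specific (ID)
    embeddings, so every layer's output entangles the two."""
    d = query.shape[-1]
    attn = softmax(query @ mem_key.T / np.sqrt(d))  # (n_query, n_memory)
    # Coupled readout: visual and ID information share one stream, so
    # ID signals accumulated over layers can drown out visual cues.
    return attn @ (mem_visual + mem_id)
```

Stacking several such layers illustrates why the visual signal degrades: each readout re-injects ID information into the one embedding the next layer consumes.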
DeAOT's central contribution is decoupling hierarchical propagation into two independent branches, one for object-agnostic and one for object-specific embeddings. This separation preserves visual information and allows visual features to be refined throughout the propagation layers. To keep the dual-branch design computationally tractable, the authors propose the Gated Propagation Module (GPM), which replaces the multi-head attention used in AOT, a recognized efficiency bottleneck, with gated single-head attention. This change reduces the cost of the propagation module without sacrificing its matching functionality.
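Under the same simplifying assumptions, the decoupled design can be sketched as follows: a single-head attention map is computed once and shared by both branches, and each branch gates its own readout. The weight names (`w_gate_vis`, `w_gate_id`) and the sigmoid gating form are hypothetical; the actual GPM comprises additional components (self-, long-term, and short-term propagation) not shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_propagation(query, mem_key, mem_visual, mem_id,
                      w_gate_vis, w_gate_id):
    """Sketch of a decoupled, gated propagation step: one single-head
    attention map serves both branches, so the visual and ID streams
    never mix, and each branch modulates its readout with its own gate."""
    d = query.shape[-1]
    attn = softmax(query @ mem_key.T / np.sqrt(d))  # computed once, shared
    # Object-agnostic branch: visual features stay free of ID signals.
    visual_out = sigmoid(query @ w_gate_vis) * (attn @ mem_visual)
    # Object-specific branch: ID embeddings propagate in parallel.
    id_out = sigmoid(query @ w_gate_id) * (attn @ mem_id)
    return visual_out, id_out
```

Sharing one attention map across both branches is what keeps the dual-branch design roughly as cheap as a single coupled branch.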
The empirical results underscore the efficacy of the DeAOT framework. It consistently outperforms AOT across benchmarks, reaching accuracy of up to 86.2% on YouTube-VOS and 92.9% on DAVIS 2016, and delivering strong performance on the VOT 2020 benchmark with an EAO score of 0.622. The framework is also efficient: depending on the model variant, it runs at 22.4fps or 53.4fps, highlighting its suitability for real-time applications.
These findings also carry theoretical implications for the design of propagation mechanisms in vision transformers. By processing object-agnostic and object-specific information in separate streams within the visual hierarchy, DeAOT not only improves accuracy and efficiency on VOS tasks but also suggests a pattern of explicit feature separation that could transfer to other deep learning applications.
Looking forward, the DeAOT framework points to ways of scaling model architectures to more complex segmentation tasks without sacrificing computational feasibility. The dual-branch structure invites further research into dynamic weighting schemes and adaptation mechanisms between branches, which could improve robustness across diverse visual scenes. Moreover, the GPM could be explored in contexts beyond VOS, such as multi-task learning frameworks and complex scene understanding, where detail preservation must be balanced against object specificity.
In conclusion, DeAOT finely orchestrates the balance between object-specific detailing and object-agnostic abstraction in VOS applications, achieving tangible performance gains. This research not only advances the field of video object segmentation but also illustrates the importance of hierarchical feature management and the potential of efficient gating strategies in deep learning models across various domains.