TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation

Published 12 Apr 2022 in cs.CV (arXiv:2204.05525v1)

Abstract: Although vision transformers (ViTs) have achieved great success in computer vision, the heavy computational cost hampers their applications to dense prediction tasks such as semantic segmentation on mobile devices. In this paper, we present a mobile-friendly architecture named Token Pyramid Vision Transformer (TopFormer). The proposed TopFormer takes tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation. Experimental results demonstrate that our method significantly outperforms CNN- and ViT-based networks across several semantic segmentation datasets and achieves a good trade-off between accuracy and latency. On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device. Furthermore, the tiny version of TopFormer achieves real-time inference on an ARM-based mobile device with competitive results. The code and models are available at: https://github.com/hustvl/TopFormer

Citations (164)

Summary

  • The paper introduces a novel token pyramid transformer that integrates MobileNetV2-inspired blocks with transformer layers for efficient mobile semantic segmentation.
  • It leverages multi-scale feature tokens to achieve a superior accuracy-latency trade-off, outperforming MobileNetV3 by 5% mIoU on ADE20K.
  • TopFormer’s design paves the way for scalable mobile AI solutions by combining dense prediction capabilities with a significantly reduced computational footprint.

The paper introduces TopFormer, a novel vision transformer architecture designed to address the computational constraints of mobile devices in dense prediction tasks, particularly semantic segmentation. The architecture aims to surpass the performance of both traditional Convolutional Neural Networks (CNNs) and existing Vision Transformer (ViT) models by achieving an optimal trade-off between accuracy and computational latency.

Architecture and Methodology

TopFormer is built around a token pyramid produced by mobile-friendly convolutional layers inspired by MobileNetV2, which downsample high-resolution input images into compact, informative tokens at several scales. These multi-scale tokens are pooled to a common resolution and fed to the transformer blocks, preserving semantic richness while minimizing computational load.
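The pooling-and-concatenation step can be illustrated with a minimal NumPy sketch. The channel counts and spatial scales below are hypothetical, chosen only for illustration; the actual TopFormer configuration differs.

```python
import numpy as np

def avg_pool2d(x, stride):
    # x: (C, H, W); non-overlapping average pooling, assuming H and W divide by stride
    c, h, w = x.shape
    return x.reshape(c, h // stride, stride, w // stride, stride).mean(axis=(2, 4))

def build_token_pyramid(features, target_hw):
    # Pool each multi-scale feature map (C_i, H_i, W_i) down to a shared
    # target resolution, then concatenate along the channel axis.
    pooled = [avg_pool2d(f, f.shape[1] // target_hw[0]) for f in features]
    return np.concatenate(pooled, axis=0)  # (sum of C_i, target_H, target_W)

# Hypothetical pyramid: four stages with growing channels and shrinking resolution.
feats = [np.random.rand(c, s, s) for c, s in [(32, 64), (64, 32), (128, 16), (160, 8)]]
tokens = build_token_pyramid(feats, (8, 8))
print(tokens.shape)  # (384, 8, 8)
```

Pooling every stage to the smallest resolution before the transformer is what keeps the attention cost low: the token count is fixed by the target grid, not by the input image size.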

Key components include:

  • Token Pyramid Module: Efficiently constructed using lightweight MobileNetV2 blocks, this module generates multi-scale feature representations, which are progressively pooled and concatenated to form a reduced representation for subsequent transformer operations.
  • Semantics Extractor: This segment utilizes several transformer blocks incorporating Multi-Head Self-Attention (MHSA) to provide scale-aware global semantics. These blocks are optimized with batch normalization and a reduced number of channels in keys and queries, significantly decreasing the computational footprint.
  • Semantics Injection Module: Combines scale-aware semantics with localized token features, enhancing representation in a computationally efficient manner. By applying operations such as sigmoid attention and feature concatenation, this module enables robust hierarchical feature formation conducive to dense prediction tasks.
  • Segmentation Head: Utilizes upscaled augmented tokens for final segmentation map production, presenting significant improvements over existing CNN and ViT techniques in terms of precision and efficiency.
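The reduced key/query width mentioned for the Semantics Extractor can be sketched as a plain multi-head self-attention in NumPy. The dimensions and weight initialization here are illustrative assumptions, not the paper's configuration; the point is that projecting keys and queries to a width smaller than the channel count shrinks the score computation.

```python
import numpy as np

def mhsa_reduced_kq(x, wq, wk, wv, num_heads):
    # x: (N, C) tokens; wq, wk project to d_k < C to cut attention cost; wv keeps C.
    q, k, v = x @ wq, x @ wk, x @ wv
    dk = wq.shape[1] // num_heads
    dv = wv.shape[1] // num_heads
    out = np.empty_like(v)
    for h in range(num_heads):
        qh = q[:, h * dk:(h + 1) * dk]
        kh = k[:, h * dk:(h + 1) * dk]
        vh = v[:, h * dv:(h + 1) * dv]
        scores = qh @ kh.T / np.sqrt(dk)
        # numerically stable softmax over each row of attention scores
        scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn = scores / scores.sum(axis=1, keepdims=True)
        out[:, h * dv:(h + 1) * dv] = attn @ vh
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128))          # 64 tokens, 128 channels (illustrative)
wq = rng.standard_normal((128, 32)) * 0.1   # keys/queries reduced to 32 dims
wk = rng.standard_normal((128, 32)) * 0.1
wv = rng.standard_normal((128, 128)) * 0.1
y = mhsa_reduced_kq(x, wq, wk, wv, num_heads=4)
print(y.shape)  # (64, 128)
```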
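The sigmoid-attention injection described for the Semantics Injection Module can be sketched as follows. This is a simplified sketch under stated assumptions (nearest-neighbor upsampling, matching channel counts, no learned projections), not the exact module from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(x, factor):
    # x: (C, H, W) -> (C, H * factor, W * factor) by repeating pixels
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def inject_semantics(local_tokens, global_semantics):
    # local_tokens: (C, H, W) from the pyramid; global_semantics: (C, h, w) with h <= H.
    factor = local_tokens.shape[1] // global_semantics.shape[1]
    g = upsample_nearest(global_semantics, factor)
    # Gate the local features with a sigmoid attention map derived from the
    # global semantics, then add the semantics back in residually.
    return local_tokens * sigmoid(g) + g

local = np.random.rand(64, 32, 32)      # illustrative local tokens
semantics = np.random.rand(64, 8, 8)    # illustrative scale-aware semantics
out = inject_semantics(local, semantics)
print(out.shape)  # (64, 32, 32)
```

The gating lets the global semantics decide how strongly each local feature contributes, while the additive term carries the semantics through even where the gate is near zero.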

Experimental Results

TopFormer demonstrates substantial gains over prior models across several datasets, including ADE20K, Pascal Context, and COCO-Stuff, achieving higher mean Intersection over Union (mIoU) scores at lower latency and computational cost. Notably, on ADE20K, TopFormer surpasses MobileNetV3 by 5% mIoU with reduced latency on ARM-based devices. The tiny variant of TopFormer achieves real-time segmentation on these devices, underscoring its practicality in mobile contexts.

Implications and Future Directions

The proposed architecture has significant implications for mobile AI applications where efficiency is crucial. It sets a benchmark for balancing accuracy with resource constraints, paving the way for further exploration into lightweight transformers for various vision tasks. Subsequent research could enhance object detection capabilities and refine model scalability across different device profiles.

In summary, TopFormer represents a significant stride in vision transformer design for mobile applications. By effectively leveraging the strengths of both CNNs and transformers and emphasizing scale-aware processing, it addresses the challenges inherent in deploying dense prediction models on resource-constrained platforms. Future iterations may focus on expanding application domains and refining architecture components for broader AI deployment scenarios.
