- The paper introduces MMNet, a novel model that employs a multi-branch architecture and depthwise separable convolutions to balance computational efficiency with high-quality portrait matting.
- It demonstrates real-time processing at 30 FPS on mobile devices by leveraging an adaptive width multiplier and enhanced loss functions to reduce gradient error.
- The study’s extensive ablation experiments validate the effectiveness of its design choices, paving the way for efficient deployment in edge computing scenarios.
An Examination of Real-Time Automatic Portrait Matting on Mobile Devices
The paper "Towards Real-Time Automatic Portrait Matting on Mobile Devices" makes a notable contribution to image matting by optimizing computational efficiency for real-time use on mobile devices. The proposed model, MMNet, automates portrait matting while remaining competitive with state-of-the-art models on standard quality metrics, directly addressing the trade-off between model complexity and inference speed.
In portrait matting, the difficulty arises from having to estimate seven unknowns per pixel (three foreground color channels, three background color channels, and an alpha value) from only the three observed RGB values. Traditional approaches often depend on additional inputs such as trimaps or scribbles, and their high computational demands make real-time use difficult. The novelty of MMNet lies in its encoder-decoder architecture, which incorporates multi-scale feature extraction and linear bottleneck blocks to achieve substantial speedups and a reduced model size.
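The under-constrained nature of the problem follows from the standard compositing model, which can be sketched in a few lines (a minimal illustration; the function name is mine, not from the paper):

```python
def composite(fg, bg, alpha):
    """Blend one pixel: I = alpha * F + (1 - alpha) * B.
    fg and bg are RGB triples; alpha is the per-pixel opacity in [0, 1].
    Matting inverts this: 7 unknowns (F, B, alpha) from 3 observed values."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))

# A half-transparent red foreground over a blue background:
observed = composite((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.5)  # (0.5, 0.0, 0.5)
```

Given only `observed`, infinitely many (F, B, alpha) triples are consistent with it, which is why learned priors (or trimaps, in traditional methods) are needed.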
Key Methodological Features
- Multi-Branch Architecture: MMNet uses multi-branch dilated convolutions within the encoder block, which allow various scales of spatial information to be extracted simultaneously. This design aggregates multi-scale features without a significant increase in computational complexity.
- Efficient Convolutions: By employing depthwise separable convolutions—a core technique in lightweight models like MobileNet—the model significantly reduces computational overhead without sacrificing effectiveness in feature extraction. These convolutions are integrated into the linear bottleneck structure to maintain a low-dimensional representation, crucial for efficient processing.
- Adaptive Width Multiplier: The model introduces a width multiplier, granting the ability to control the trade-off between model performance and computational cost. This flexibility is particularly valuable for catering to the varied resource constraints of different mobile devices.
- Enhanced Loss Function: A nuanced loss function, integrating gradient and auxiliary losses, is devised to enhance the model's capability to accurately differentiate fine-grained edges, a critical aspect in high-fidelity image matting.
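To make the efficiency argument above concrete, the parameter counts of a standard versus a depthwise separable convolution, with a MobileNet-style width multiplier applied to the channel counts, can be sketched as follows (a back-of-the-envelope illustration; the function names and layer sizes are mine, not from the paper):

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) plus 1x1 pointwise."""
    return k * k * c_in + c_in * c_out

def scale(c, width_multiplier):
    """Shrink a channel count by the width multiplier (at least 1 channel)."""
    return max(1, int(round(c * width_multiplier)))

# A 3x3 layer mapping 32 -> 64 channels:
full = conv_params(3, 32, 64)        # 18432 weights
sep = separable_params(3, 32, 64)    # 288 + 2048 = 2336 weights, ~7.9x fewer
thin = separable_params(3, scale(32, 0.5), scale(64, 0.5))  # 16 -> 32 channels
```

The width multiplier shrinks every layer's channel counts by a constant factor, which is how a single architecture can be tuned to different device budgets.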
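Likewise, the gradient term of such a loss can be illustrated by comparing finite-difference gradients of the predicted and ground-truth alpha mattes (a simplified sketch of the general idea; the paper's exact formulation may differ):

```python
def gradient_loss(pred, true):
    """Sum of absolute differences between the horizontal and vertical
    finite-difference gradients of two equal-sized 2-D alpha mattes
    (given as lists of rows). Penalizes blurry or misplaced edges."""
    h, w = len(pred), len(pred[0])
    loss = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal gradient difference
                loss += abs((pred[y][x + 1] - pred[y][x])
                            - (true[y][x + 1] - true[y][x]))
            if y + 1 < h:  # vertical gradient difference
                loss += abs((pred[y + 1][x] - pred[y][x])
                            - (true[y + 1][x] - true[y][x]))
    return loss

sharp = [[0.0, 1.0], [0.0, 1.0]]      # crisp edge (ground truth)
blurred = [[0.25, 0.75], [0.25, 0.75]]  # softened prediction of the same edge
```

A plain per-pixel loss underweights thin boundary regions; matching gradients pushes the network toward the crisp edges that matting quality depends on.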
Experimental Insights
The experiments demonstrate MMNet's advantage in latency and gradient error when benchmarked against state-of-the-art models such as Mobile DeepLabv3 and LDN+FB. Notably, MMNet achieves real-time performance at 30 FPS on devices like the Xiaomi Mi 5, showcasing its potential for consumer-facing applications.
The study methodically explores architectural variations through ablation studies to quantify the contribution of individual components such as the multi-branch dilated convolutions and enhancement blocks. These evaluations reveal that the components collectively contribute to a notable reduction in gradient error.
Implications and Future Directions
MMNet’s architecture paves the way for further enhancements in real-time image processing applications. The reduction in model size without substantial loss of quality suggests suitability for edge computing scenarios, where bandwidth and local processing constraints are prominent. Moreover, quantization-aware training offers an avenue for integrating neural network models into resource-constrained environments without additional computational infrastructure.
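The arithmetic that quantization-aware training simulates can be sketched as a uniform affine mapping of float weights to 8-bit integers (a generic illustration, not the paper's specific scheme; the function names are mine):

```python
def quantize(values, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1] via a uniform
    affine transform; return the ints plus (scale, offset) for recovery."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** num_bits - 1)
    if scale == 0.0:  # all values identical
        scale = 1.0
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Recover approximate floats from the quantized integers."""
    return [v * scale + lo for v in q]

weights = [-0.5, 0.0, 0.25, 0.5]
q, s, off = quantize(weights)
recovered = dequantize(q, s, off)  # close to the original weights
```

Training with this rounding in the loop lets the network adapt to the quantization error, so the deployed integer model loses little accuracy while shrinking storage and speeding up integer-only inference.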
Looking ahead, extending the framework of MMNet to handle generic image matting could yield significant advancements in automatic saliency matting and video matting tasks. Incorporating distillation techniques and exploring lower-bit quantization could provide further speed-ups and efficiency gains.
In conclusion, the paper's central contribution is the efficient use of computational resources to achieve real-time portrait matting on mobile hardware. By balancing computational demands against careful design choices, MMNet represents a step forward in bringing AI-driven image processing to real-world applications, and it lays a foundation for future work on even more efficient and versatile models for mobile and edge deployment.