
Real-time deep hair matting on mobile devices

Published 19 Dec 2017 in cs.CV (arXiv:1712.07168v2)

Abstract: Augmented reality is an emerging technology in many application domains. Among them is the beauty industry, where live virtual try-on of beauty products is of great importance. In this paper, we address the problem of live hair color augmentation. To achieve this goal, hair needs to be segmented quickly and accurately. We show how a modified MobileNet CNN architecture can be used to segment the hair in real-time. Instead of training this network using large amounts of accurate segmentation data, which is difficult to obtain, we use crowd sourced hair segmentation data. While such data is much simpler to obtain, the segmentations there are noisy and coarse. Despite this, we show how our system can produce accurate and fine-detailed hair mattes, while running at over 30 fps on an iPad Pro tablet.

Citations (23)

Summary

  • The paper introduces a novel hair matting method by adapting MobileNet for real-time segmentation on mobile devices.
  • It refines network architecture with skip connections and an edge-consistent loss function to produce high-quality hair mattes.
  • Experimental results demonstrate state-of-the-art accuracy and speed, achieving over 30 fps on devices like the iPad Pro.

Real-time Deep Hair Matting on Mobile Devices

This paper introduces a real-time hair matting method suitable for mobile devices by adapting the MobileNet CNN architecture for accurate hair segmentation, even with noisy, crowd-sourced training data. The system achieves over 30 fps on an iPad Pro and addresses the challenges of hair's complex structure and limited computational resources in mobile environments.

Addressing Real-Time Hair Segmentation Challenges

The paper highlights the difficulties in achieving real-time, fine-grained hair segmentation on mobile devices, noting that traditional methods and high-capacity CNNs like VGG16 are computationally expensive and memory-intensive. To overcome these limitations, the authors propose a modified MobileNet architecture, known for its efficiency and compactness, for hair segmentation. The modification enables the network to run in real-time on mobile devices, addressing a critical need in beauty applications where live hair color augmentation demands both speed and accuracy.

Methodology: Adapting MobileNets and Refining Hair Mattes

The authors detail their two-fold approach to achieving accurate hair segmentation and matting. First, they modify the MobileNet architecture into a fully convolutional network (HairSegNet) by removing the final classification layers and reducing the stride of the later layers to preserve fine spatial detail. They then construct a decoder, built from the same principles as MobileNet but run in reverse, to upsample the CNN features into a hair mask. To tackle the challenge of training with noisy, crowd-sourced data, they introduce a method for real-time hair matting that does not rely on precise matting training data. This involves refining the network architecture with skip connections and incorporating a secondary loss function that promotes perceptually appealing matting results by ensuring consistency between image and mask edges.
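
The building block of the MobileNet encoder described above is the depthwise separable convolution: a per-channel spatial filter followed by a 1x1 pointwise convolution that mixes channels. The following is an illustrative NumPy sketch (function and variable names are ours, not the paper's); the `stride` parameter corresponds to the step size the authors reduce to preserve spatial resolution:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w, stride=1):
    """Depthwise separable convolution, MobileNet's core block.
    x: (H, W, C_in) input, depthwise_k: (k, k, C_in) per-channel filters,
    pointwise_w: (C_in, C_out) 1x1 channel mixer. 'Same' zero padding."""
    k = depthwise_k.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, C_in = x.shape
    out_h = (H + stride - 1) // stride  # ceil(H / stride)
    out_w = (W + stride - 1) // stride
    dw = np.zeros((out_h, out_w, C_in))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            dw[i, j] = np.sum(patch * depthwise_k, axis=(0, 1))  # filter each channel separately
    return dw @ pointwise_w  # 1x1 convolution mixes channels

# usage: stride=1 keeps resolution, stride=2 halves it
x = np.random.rand(8, 8, 4)
y_full = depthwise_separable_conv(x, np.random.rand(3, 3, 4), np.random.rand(4, 6))
y_half = depthwise_separable_conv(x, np.random.rand(3, 3, 4), np.random.rand(4, 6), stride=2)
```

Compared with a standard convolution, this factorization cuts the parameter count from k·k·C_in·C_out to k·k·C_in + C_in·C_out, which is the source of MobileNet's efficiency on mobile hardware.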

Implementation Details and Architectural Choices

The implementation involved several key steps:

  • MobileNet Modification: Adapting MobileNet into a fully convolutional network by removing classification layers and adjusting stride lengths.
  • Decoder Construction: Building a decoder using transposed convolutions to upsample feature maps.
  • Skip Connections: Adding skip connections to preserve high-resolution details.
  • Loss Function: Combining binary cross-entropy loss with a mask-image gradient consistency loss to refine matting results.
  • Data Preprocessing: Implementing face detection and cropping to focus on relevant regions.
  • Training Regime: Using Adadelta optimizer with specific parameters for learning rate and regularization.
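
Adadelta, used in the training regime above, needs no hand-tuned global learning rate: it scales each step by the ratio of the RMS of past updates to the RMS of past gradients. A minimal NumPy sketch of one update step (the paper's exact hyperparameters are not reproduced here; `rho` and `eps` below are the optimizer's common defaults):

```python
import numpy as np

def adadelta_step(param, grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update (Zeiler, 2012). `state` holds running
    averages of squared gradients (Eg2) and squared updates (Edx2)."""
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta ** 2
    return param + delta

# usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([3.0, -2.0])
state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
for _ in range(500):
    w = adadelta_step(w, 2 * w, state)
```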

The network architecture, illustrated in Figure 1 of the paper, includes skip connections between the encoder and decoder layers to capture high-resolution details. The mask-image gradient consistency loss, defined in Equation 1, encourages mask edges to align with image edges, enhancing the perceptual quality of the matting. Models were trained with the Adadelta optimizer, balancing the trade-off between accuracy and generalization.
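
The combined loss can be sketched as follows. This is a hedged reading, not the paper's exact Equation 1: the Sobel-based gradients, the normalization, and the weight `w` below are our assumptions, chosen to match the stated idea that predicted mask edges should align in direction with image edges:

```python
import numpy as np

def sobel_grads(a):
    """Sobel x/y gradients of a 2D array with 'same' zero padding."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    ap = np.pad(a, 1)
    H, W = a.shape
    gx, gy = np.zeros((H, W)), np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = ap[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return gx, gy

def matting_loss(pred, target, image, w=0.5, eps=1e-6):
    """Binary cross-entropy plus a mask-image gradient consistency term:
    where the predicted mask has strong edges, their direction should
    agree with the image's edges (normalized gradients nearly parallel)."""
    bce = -np.mean(target * np.log(pred + eps)
                   + (1 - target) * np.log(1 - pred + eps))
    ix, iy = sobel_grads(image)
    mx, my = sobel_grads(pred)
    imag = np.sqrt(ix ** 2 + iy ** 2) + eps
    mmag = np.sqrt(mx ** 2 + my ** 2) + eps
    # 1 - cos^2(angle between gradients), weighted by mask edge strength
    cos2 = ((ix / imag) * (mx / mmag) + (iy / imag) * (my / mmag)) ** 2
    consistency = np.sum(mmag * (1 - cos2)) / np.sum(mmag)
    return bce + w * consistency

# usage: a mask thresholded from the image has edges aligned with it
rng = np.random.default_rng(0)
img = rng.random((8, 8))
mask = (img > 0.5).astype(float)
loss = matting_loss(np.clip(mask, 0.01, 0.99), mask, img)
```

The consistency term is bounded in [0, 1] and penalizes mask edges that cut across image edges, which is why it suppresses the coarse, blocky boundaries learned from noisy crowd-sourced annotations.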

Experimental Results and Performance Metrics

The method was evaluated on a crowd-sourced dataset, the LFW Parts dataset, and a hair dataset from Guo and Aarabi. Performance metrics included F1-score, Performance, IoU, Accuracy, and mask-image gradient consistency loss. The results demonstrated that HairMatteNet achieves state-of-the-art accuracy while running in real-time on a mobile device. The authors found that HairMatteNet consistently outperformed HairSegNet and HairSegNet with Guided Filter post-processing in terms of gradient consistency loss, indicating better adherence to image edges. The quantitative results are presented in Table 1, showcasing the effectiveness of the proposed method across different datasets.

Comparative Analysis and Ablation Studies

The authors conducted a comparative analysis against existing methods, including those using VGG16-based architectures and CRF post-processing, highlighting the advantages of their MobileNet-based approach in terms of speed and memory usage. Ablation studies assessed the impact of different architectural choices, such as the number of channels in the decoder layers; 64 channels gave slightly better performance, balancing complexity and accuracy. Additionally, experiments with different input image resolutions showed that while higher resolutions could capture finer details, they also exacerbated issues such as non-homogeneous masks and increased computational cost.

Conclusion

The paper makes a significant contribution by demonstrating real-time hair matting on mobile devices using a modified MobileNet architecture. The approach effectively addresses the challenges of noisy training data and limited computational resources. Future work could explore fully automatic training methods from noisy data and further improvements to matting quality while maintaining real-time performance.
