Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach

Published 6 Oct 2024 in cs.LG, cs.AI, and cs.DC | (2410.05338v1)

Abstract: Recent advances in Deep Neural Networks (DNNs) have demonstrated outstanding performance across various domains. However, their large size is a challenge for deployment on resource-constrained devices such as mobile, edge, and IoT platforms. To overcome this, a distributed inference setup can be used where a small-sized DNN (initial few layers) can be deployed on mobile, a bigger version on the edge, and the full-fledged, on the cloud. A sample that has low complexity (easy) could be then inferred on mobile, that has moderate complexity (medium) on edge, and higher complexity (hard) on the cloud. As the complexity of each sample is not known beforehand, the following question arises in distributed inference: how to decide complexity so that it is processed by enough layers of DNNs. We develop a novel approach named DIMEE that utilizes Early Exit (EE) strategies developed to minimize inference latency in DNNs. DIMEE aims to improve the accuracy, taking into account the offloading cost from mobile to edge/cloud. Experimental validation on GLUE datasets, encompassing various NLP tasks, shows that our method significantly reduces the inference cost (> 43%) while maintaining a minimal drop in accuracy (< 0.3%) compared to the case where all the inference is made in cloud.

Abstract PDF HTML Upgrade to Chat

Summary

The paper introduces DIMEE, a novel framework that applies early exit strategies to allocate DNN layers across mobile, edge, and cloud devices.
It employs clustering based on sample complexity, processing easy cases on mobile and harder ones on cloud to reduce computation costs.
Experimental results reveal a cost reduction of over 43% with less than 0.3% drop in accuracy compared to cloud-only inference.

Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach

The paper presents DIMEE, a novel method leveraging early exit strategies to optimize distributed inference across mobile, edge, and cloud platforms. It addresses the challenge of deploying large DNNs on resource-constrained devices through a scalable inference model that dynamically allocates computational resources based on sample complexity.

Overview of DIMEE

DIMEE utilizes distributed inference by deploying partial DNN layers on mobile and edge devices and the full DNN on the cloud. Each component processes samples based on their computational complexity determined by early exits. This strategic layer deployment allows easy samples to be processed on mobile devices, moderately complex samples on edge devices, and complex samples in the cloud, optimizing inference cost and latency.

Figure 1: The DNN model is partitioned across devices, processing samples optimally based on complexity.

Experimental results on GLUE datasets demonstrate DIMEE's effectiveness in reducing inference costs significantly while maintaining high accuracy. The method achieves a cost reduction of over 43% while incurring less than 0.3% drop in accuracy compared to cloud-only inference.

Early Exit Strategy and Complexity Pools

Early exit strategies are integral to DIMEE, enabling adaptive processing by assessing sample complexity through intermediate classifiers. DIMEE clusters samples into easy, moderate, and hard pools using embedding distances calculated during training. These pools guide the assignment of computational resources during deployment.

Figure 2: The accuracy profiles of inference across mobile, edge and cloud devices.

The clustering approach optimizes inference by dynamically assessing complexity, thereby minimizing resources for simpler tasks and utilizing cloud computing for complex samples. This method effectively balances processing efficiency and accuracy.

Implementation Details

Key parameters governing DIMEE's performance include the deployment of layers at mobile and edge devices, denoted as $m$ and $n$ layers, respectively. The choice of threshold $α$ for early exits is optimized to maximize confidence while accounting for processing and offloading costs.

The layer configuration impacts the trade-off between device processing costs and cloud offloading costs. Higher values of $m$ and $n$ reduce offloading to the cloud, increasing processing requirements on local devices but lowering latency costs.

Cost Structure and Optimization

DIMEE's cost structure includes processing costs ( $\lambda_m$ , $\lambda_e$ ), offloading costs ( $o_e$ , $o_c$ ), and cloud charges ( $\gamma$ ). The method defines a reward function, optimizing the threshold $α$ to balance confidence and costs effectively.

Figure 3: Cost analysis illustrating the impact of cost variations on accuracy and efficiency.

The adaptive summation of costs for mobile processing, edge processing, and cloud offloading underpins DIMEE's performance optimization, guiding resource allocation precisely based on sample complexity.

Conclusion

DIMEE introduces a robust framework for distributed inference across mobile, edge, and cloud platforms. By strategically deploying layers and utilizing early exits, the approach significantly reduces inference costs and maintains high accuracy, offering a practical solution for deploying DNNs on resource-constrained devices. Future work may extend this methodology to other domains and refine the clustering mechanism for enhanced performance across diverse tasks.

Markdown Report Issue