
Device–Edge Synergy in AI Inference

Updated 5 February 2026
  • Device–edge synergy is a paradigm that coordinates sensing, computation, communication, and storage between end devices and edge servers for efficient AI inference.
  • It employs hierarchical partitioning, split inference, and federated learning to dynamically balance latency, energy use, and communication bandwidth.
  • Empirical studies show significant speedups, energy savings, and accuracy gains, validating the effectiveness of integrated device and edge processing.

Device–edge synergy refers to the coordinated, often dynamically optimized, allocation and orchestration of sensing, computation, communication, and storage resources between end devices and edge servers (or clusters), to achieve efficient, adaptive, and reliable AI inference or learning tasks under real-world constraints. Rather than isolating intelligence either on the device (with severe resource bottlenecks) or on the server (with communication or privacy limitations), device–edge synergy exploits the strengths of both layers by adaptively partitioning models, correlating data flows, and optimizing resource allocation. This paradigm is foundational to edge intelligence, collaborative inference, federated learning, and integrated sensing-communication-computation (ISCC) systems.

1. Fundamental Principles and Architectural Patterns

Device–edge synergy systematically addresses the joint optimization of latency, energy consumption, communication bandwidth, and inference accuracy through multi-level system partitioning and resource sharing. Key architectural motifs include:

  • Hierarchical Compute Partitioning: Systems are organized into device, edge, (and possibly cloud) tiers, where each layer provides progressively greater compute and memory capacity (Ali et al., 1 Oct 2025). Compute partitioning is dynamically adapted depending on workload, network conditions, and real-time constraints.
  • Model Partitioning / Split Inference: Deep models (DNNs, GNNs, Transformers) are partitioned at specific layers or blocks, with upstream layers computed on-device and downstream layers on the edge. The partition point is selected based on minimizing composite latency, energy, or communication load while meeting accuracy targets (Li et al., 2018, Zhang et al., 2021, Zhou et al., 5 Dec 2025).
  • Early Exit / Right-Sizing: Branchy networks provide multiple exit points, allowing for adaptive inference termination at earlier layers on resource-constrained devices to save processing and communication time under strict latency/deadline budgets (Li et al., 2018).
  • Federated and Collaborative Learning: Devices perform online or streaming updates locally, periodically synchronizing parameters or summary statistics with peers or central servers to enable rapid adaptation and privacy-preserving learning (Ito et al., 2020, Ali et al., 1 Oct 2025).
  • Multi-Device Aggregation and Personal Edge: Clusters of heterogeneous on-body/near-body devices (e.g., SensiX (Min et al., 2020)) or ad hoc ensembles of IoT nodes cooperate to improve inference resilience and accuracy, using translation and selection mechanisms to cope with device and data variability.
  • Resource Cooperative Scheduling (3C): Edge networks jointly share communication, computation, and caching resources using D2D communication to flexibly allocate sub-tasks, generalizing earlier isolated communication- or computation-centric models (Tang et al., 2018).
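The early-exit pattern above can be sketched in a few lines. This is a minimal illustration, not code from any cited system: the two exits below are hypothetical stand-ins for a real branchy network's branches, and the confidence threshold is an assumed tuning knob.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def branchy_infer(x, exits, threshold=0.9):
    """Run exit branches cheapest-first; stop at the first confident one.

    exits: callables mapping the input to class logits, ordered from the
    earliest (cheapest) exit to the full network. The last exit always
    answers, so a prediction is returned even when nothing is confident.
    Returns (predicted_class, index_of_exit_used).
    """
    for i, branch in enumerate(exits):
        probs = softmax(branch(x))
        conf = max(probs)
        if conf >= threshold or i == len(exits) - 1:
            return probs.index(conf), i

# Hypothetical stand-ins for a real branchy network's exits.
early_exit = lambda x: [2.0 * x, 0.0]    # cheap probe: confident for large x
full_model = lambda x: [x, 1.0 - x]

label, exit_used = branchy_infer(3.0, [early_exit, full_model])  # exit 0 fires
```

Under a tight deadline the threshold can be lowered at runtime, trading accuracy for guaranteed early termination, which is exactly the right-sizing knob Edgent exposes.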

2. Optimization Frameworks and Mathematical Models

Device–edge synergy relies on explicit mathematical optimization to drive system configuration:

  • Latency and Resource-Constrained Partitioning: Objective functions commonly minimize end-to-end latency,

T_{\text{total}}(p) = \sum_{k=1}^{p-1} T_{\text{edge}}(k) + T_{\text{comm}}(s_{\text{in}}) + T_{\text{comm}}(s_{p-1}) + \sum_{k=p}^{n} T_{\text{dev}}(k)

where p is the partition point of an n-layer model, T_edge(k) and T_dev(k) denote the latency of layer k on the edge server and on the device, and T_comm(·) is the transmission time of the raw input s_in (uplink) and of the intermediate feature s_{p-1} (downlink). The minimization is subject to resource (compute, memory, bandwidth), accuracy, and deadline constraints (Li et al., 2018, Makaya et al., 2024).
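Because an n-layer model admits only n+1 candidate split points, the objective lends itself to a direct one-dimensional search. A minimal sketch, assuming hypothetical per-layer latency profiles and link bandwidth (all numbers below are illustrative, not measurements from any cited paper):

```python
def t_comm(size_bytes, bandwidth_bps):
    """Transmission time (s) for a payload over the device-edge link."""
    return 8.0 * size_bytes / bandwidth_bps

def best_partition(t_edge, t_dev, out_sizes, input_size, bandwidth_bps):
    """Minimize T_total(p) over all split points p = 1..n+1.

    Layers 1..p-1 run on the edge, layers p..n on the device, matching
    the objective above. p == 1 is device-only (no communication at
    all); p == n+1 is edge-only.
    t_edge[k-1], t_dev[k-1]: latency (s) of layer k on edge / device.
    out_sizes[k-1]: output size (bytes) of layer k.
    Returns (best_p, best_latency).
    """
    n = len(t_edge)
    candidates = []
    for p in range(1, n + 2):
        latency = sum(t_edge[:p - 1]) + sum(t_dev[p - 1:])
        if p > 1:  # input goes up, intermediate feature s_{p-1} comes back
            latency += t_comm(input_size, bandwidth_bps)
            latency += t_comm(out_sizes[p - 2], bandwidth_bps)
        candidates.append((latency, p))
    best_latency, best_p = min(candidates)
    return best_p, best_latency

# Illustrative 4-layer profile: the edge is ~10x faster per layer.
p, latency = best_partition(
    t_edge=[0.002, 0.004, 0.003, 0.001],
    t_dev=[0.020, 0.040, 0.030, 0.010],
    out_sizes=[400_000, 100_000, 20_000, 4_000],
    input_size=600_000,
    bandwidth_bps=80e6,
)
```

In practice the per-layer latencies are not measured at runtime but predicted by regression models, which keeps the whole search fast enough to rerun whenever bandwidth or load changes.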

  • Reinforcement Learning-Based AutoML: The search for optimal split points, pruning ratios, and feature compression levels is cast as a sequential decision process solved with DDPG or other deep RL methods to find Pareto-frontier solutions balancing computation and communication (Zhang et al., 2021).
  • Integer Linear Programming (ILP) and Heuristic Scheduling: Joint scheduling of sub-tasks, resource assignments, and D2D data relay is formulated as an ILP or approximated with scalable LP-based heuristics for practical deployment at moderate scale (Tang et al., 2018, Li et al., 2023).
  • Information Bottleneck and Rate–Distortion Theoretic Approaches: Multi-device feature encoding uses variational (distributed) information bottleneck objectives to ensure transmitted representations preserve maximal task-relevant information under channel or representational constraints (Shao et al., 2021, Ke et al., 2024, Yang et al., 2024).
  • Max–min Pairwise Task Gain: Task-aware over-the-air inference couples device sensing, feature computation, and joint resource optimization, maximizing the discriminant gain between the most-confusable classes (Zhuang et al., 2023).
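For the information-bottleneck line of work, the "rate" term is typically a closed-form KL divergence between the encoder's Gaussian posterior and a standard-normal prior. A minimal sketch of the objective (the beta weight and toy statistics are illustrative, not taken from any of the cited papers):

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def vib_objective(task_loss, mu, log_var, beta=1e-3):
    """Variational IB: task distortion + beta * communication rate.

    Larger beta compresses the transmitted feature harder; smaller beta
    preserves more task-relevant information.
    """
    return task_loss + beta * gaussian_kl(mu, log_var)

# A standard-normal encoding incurs zero rate penalty.
zero_rate = gaussian_kl([0.0, 0.0], [0.0, 0.0])
```

In the distributed variant, each device contributes its own rate term, and the decoder's task loss is computed on the jointly decoded features.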

3. Partitioning Algorithms and Co-Inference Design

Device–edge co-inference frameworks operationalize the above principles via runtime selection, system profiling, and/or machine learning-driven scheduling:

  • Fast Partition Selection: Edgent evaluates all candidate DNN split points using per-layer regression models (<1 ms) and returns the partition that minimizes predicted latency under resource and accuracy constraints (Li et al., 2018).
  • Integrated GNN Architecture-Mapping Search: GCoDE searches a unified supernet, where each communication node defines a potential split, and uses a GIN predictor to estimate energy and latency, finding optimal device–edge mapped architectures (Zhou et al., 5 Dec 2025).
  • Digital Twin-Assisted Collaboration: In AIoT scenarios, digital twins emulate both the device's real-time inference progress and the joint system workload evolution for every candidate offloading point, feeding dense augmented data to an ML-assisted optimal stopping policy that adapts to system dynamics (Hu et al., 2024).
  • Multi-Tier Resource Orchestration: KubeEdge.AI and EdgeSphere employ Kubernetes/Mesos-inspired scheduling to partition workloads across device, edge, and cloud, using both general resource constraints and attribute- or context-based offer matching, with periodic monitoring and dynamic reassignment (Wang et al., 2020, Makaya et al., 2024).
  • Federated Model Merge: Online learning methods, e.g., OS-ELM, facilitate "one-shot" cooperative aggregation by exchanging internal statistics (U,V matrices) rather than full models or raw data, minimizing communication and enabling rapid convergence (Ito et al., 2020).
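The per-layer regression models behind fast partition selection can be as simple as ordinary least squares from a layer's compute cost to its measured latency. A sketch on hypothetical profiling samples (the GFLOPs/latency pairs below are made up for illustration):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a*x + b (e.g. latency vs. FLOPs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Hypothetical profiling samples: (layer GFLOPs, measured ms) on one device.
gflops = [0.1, 0.5, 1.0, 2.0, 4.0]
ms     = [0.8, 2.1, 3.9, 7.5, 15.2]

a, b = fit_linear(gflops, ms)
predict_ms = lambda g: a * g + b  # cheap enough to query at partition time
```

One such model per device tier and per layer type is enough to score every candidate split point without executing the network, which is what makes sub-millisecond partition selection feasible.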

4. Multi-Device and Multi-View Cooperative Inference

Device–edge synergy extends beyond device–server splits to fully leverage spatially distributed, multi-view, or multi-modal sensor deployments:

  • Task-Oriented Encoding for Distributed Sensing: Variational distributed information bottleneck objectives encode each device's local sensory input into compressed, task-relevant representations that are jointly decoded at the edge to maximize mutual information with the target task (Shao et al., 2021, Yang et al., 2024).
  • Selective Communication and Redundancy Pruning: Selective transmission schemes (e.g., selective retransmission (SR) (Shao et al., 2021), redundancy pruning via embeddings or histograms (Palena et al., 2024)) adaptively reduce communication based on confidence or context similarity, saving 18–74% of data with negligible accuracy loss.
  • Over-the-Air Aggregation and ISCC: ISCC designs, e.g., with AirComp over OFDM, combine the analog aggregation of features across devices with beamforming and joint resource control to efficiently realize distributed inference under strict latency and energy constraints (Zhuang et al., 2023).
  • Multi-Level Orchestration: DAG-based task orchestration frameworks schedule application graphs over personal and commercial edge devices, managing interference, redundancy, and reliability through dynamic, availability-aware application placement and replication (Li et al., 2023).
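Embedding-based redundancy pruning can be sketched in a few lines: a device transmits a frame only when its embedding has drifted from the last one actually sent. The similarity threshold and the toy 2-D embeddings below are illustrative assumptions, not values from the cited work.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def prune_redundant(embeddings, sim_threshold=0.95):
    """Return indices of frames worth transmitting.

    A frame is skipped when its embedding is nearly identical to the
    last transmitted one, so stable scenes cost almost no uplink.
    """
    sent = []
    last = None
    for i, emb in enumerate(embeddings):
        if last is None or cosine(emb, last) < sim_threshold:
            sent.append(i)
            last = emb
    return sent

# Toy 2-D embeddings: frames 1 and 3 duplicate their predecessors.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 1.0]]
to_send = prune_redundant(frames)
```

The same confidence- or similarity-gated pattern underlies selective retransmission: cheap local statistics decide per sample whether the edge needs to see it at all.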

5. Hardware-Software Co-Design and System Realizations

Effective device–edge synergy mandates co-design across hardware accelerators, runtime platforms, and AI models:

  • Specialized Edge Hardware: Deployments leverage NPUs, tiny AI accelerators (MAX78xxx, Loihi), or FPGAs for on-device partitioned execution, with resource profiling and virtualization for dynamic scheduling (e.g., Synergy for wearables (Gong et al., 2023)).
  • Lightweight AI Runtimes: Modularity is achieved through containerized inference runtimes (ONNX, TFLite), pipelined or DAG-based task graphs, and quantized/structured models for low-power operation (Wang et al., 2020, Ren et al., 2021).
  • Adaptive Communication Protocols: MQTT, CoAP, and custom RPC or direct IR/impulse radio protocols facilitate lightweight, reliable device–edge synchrony (Ke et al., 2024).
  • Dynamic Model Management: Model updates, cache refreshing, and event-driven policy logic are handled at the edge with rule engines, federated learning aggregation, and dataflow orchestration (Ren et al., 2021, Wang et al., 2020).
  • Energy and Latency Profiling: Automated, per-layer or per-operator regression models estimate latency and energy to inform partitioning or scheduling at millisecond resolution (Li et al., 2018, Zhou et al., 5 Dec 2025).
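A per-operator latency profile of the kind these schedulers consume can be collected with a trivial timing harness. This is a generic sketch, not the profiler of any cited system; the operator pipeline below is a placeholder for real model layers.

```python
import time

def profile_ops(ops, sample, repeats=50):
    """Median wall-clock latency (seconds) per named operator.

    ops: list of (name, callable) executed in sequence; each callable
    consumes the previous operator's output, mimicking a layer-by-layer
    pipeline. The median over repeated runs damps scheduler jitter.
    """
    results = {}
    for name, fn in ops:
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            out = fn(sample)
            times.append(time.perf_counter() - t0)
        times.sort()
        results[name] = times[len(times) // 2]
        sample = out  # feed this operator's output to the next one
    return results

# Placeholder two-stage pipeline standing in for real layers.
pipeline = [
    ("square", lambda xs: [v * v for v in xs]),
    ("sum",    lambda xs: sum(xs)),
]
profile = profile_ops(pipeline, [1.0] * 1000, repeats=10)
```

Feeding such measurements into the regression models of Section 3 closes the loop between profiling and partition selection.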

6. Performance Gains, Trade-offs, and Empirical Results

Experimental evaluations across diverse platforms, models, and tasks substantiate device–edge synergy’s concrete benefits:

| Paper / Method | Key Metric(s) | Quantitative Gains / Findings |
|---|---|---|
| Edgent (Li et al., 2018) | Latency, accuracy | Meets deadlines; 10–15 ppt higher accuracy than edge-only or device-only baselines |
| GCoDE (Zhou et al., 5 Dec 2025) | Latency, device energy, accuracy | 44.9× speedup, 98.2% energy reduction (vs. DGCNN), accuracy equal or better |
| SensiX (Min et al., 2020) | Dynamic accuracy, robustness | +13 ppt (HAR), +7 ppt (keyword spotting), +30% accuracy at low device availability |
| Neuromorphic IB (Ke et al., 2024) | Spike rate, error, bits sent | >95% accuracy with ≤20% of bits; outperforms digital SSCC by 3–8% error; <2% error rise under SNR mismatch |
| SR/DDIB (Shao et al., 2021) | Bit-rate, latency, accuracy | <1 KB per view, sub-10 ms latency; 91.6% (3D recognition @ 360 bits) vs. 89.5% baseline |
| 3C Framework (Tang et al., 2018) | System energy (simulation) | 83.8% reduction vs. 1C/2C; 27.5% still saved when D2D is expensive |
| Edge-AI Scheduling (Makaya et al., 2024) | Bandwidth savings, detection | 80% link savings, 10× latency cut, 30% wearable energy reduction (safety use case) |
| Synergy (Gong et al., 2023) | Throughput, latency, power | 8× throughput, >50× latency reduction, up to 5× energy savings (multi-tenant) |
| DSSD (Ning et al., 16 Jul 2025) | LLM throughput, accuracy | 1.14×–2.31× speedup over DSD, identical LLM accuracy, 10–100× lower uplink cost |

Overall, device–edge synergy systematically reduces latency and energy, shrinks communication burdens by orders of magnitude, increases robustness to device/network churn, and often delivers higher accuracy than any non-collaborative baseline (Li et al., 2018, Zhou et al., 5 Dec 2025, Shao et al., 2021, Min et al., 2020, Palena et al., 2024).

7. Challenges, Best Practices, and Future Directions

Despite substantial progress, open challenges remain, and a set of best practices has emerged:

Best practices include dynamic partitioning based on current system metrics, model compression adapted to bandwidth, selective data transmission, confidence-based redundancy pruning, event-triggered compute pipelines, and continual system- and task-level profiling.

Future advancements will be driven by cross-layer co-design, neuromorphic and event-driven hardware, robust federated protocols, and automated, intelligent orchestration layers that exploit device–edge synergy to its full potential in next-generation pervasive intelligent systems.
