
Hierarchical Client-Edge-Cloud Federated Learning

Updated 11 January 2026
  • Client-Edge-Cloud Hierarchical Federated Learning is a distributed machine learning framework that organizes training into three tiers to handle diverse resources and non-IID data.
  • It employs edge-level distance-weighted aggregation and cloud-level layer-wise fusion to accelerate convergence and reduce communication overhead.
  • Advanced personalization and optimal client–edge assignment strategies significantly improve accuracy and resource efficiency, achieving gains up to 83% under extreme non-IID conditions.

Client-Edge-Cloud Hierarchical Federated Learning (HFL) is a multi-tier distributed machine learning paradigm enabling scalable, privacy-preserving training across resource- and data-heterogeneous Internet of Things (IoT), cyber-physical, and edge-cloud environments. It extends classical Federated Learning by organizing collaborative workflows into client (device), edge (aggregator/server), and cloud (global/model service) tiers. This structure has become essential for addressing communication bottlenecks, data privacy mandates, statistical and architectural heterogeneity, and dynamic resource limitations in large-scale intelligent systems.

1. System Architecture and Model Heterogeneity

Hierarchical Federated Learning frameworks organize training across three principal tiers:

  • Tier 1 (Clients): Edge devices (e.g., sensors, smartphones, IoT nodes) each train local models $L_k$ on private data $D_k$; architectural diversity is intrinsic, with widely varying numbers of layers $m_k$ and parameter dimensionalities $d_{k,j}$. Clients communicate with edge servers via local wireless channels (Bluetooth, ZigBee).
  • Tier 2 (Edge Aggregators): Edge servers $i$ orchestrate clusters $K_i$ of clients grouped by model architecture. They execute intra-cluster aggregation, producing edge models $E_i^t$ at every round $t$. Communication with clients uses low-latency links; cloud communication traverses Internet backhaul.
  • Tier 3 (Cloud Server): The cloud aggregates models $E_i^t$ received from all edge servers, performing layer-wise aggregation and redistributing global models $G_i^t$ conforming to respective cluster architectures (Gao et al., 2024, Liu et al., 2019).

Model heterogeneity is formalized by $L_k^t = (w_k^{t,(1)}, \dots, w_k^{t,(m_k)})$ with $w_k^{t,(j)} \in \mathbb{R}^{d_{k,j}}$, where $m_k$ and $d_{k,j}$ are non-uniform across $k$. Non-IIDness is maximal; each $D_k$ is drawn from an arbitrary label-skewed distribution $P_k$.
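As a minimal sketch of this formalization, a heterogeneous client model can be represented as a list of per-layer weight arrays, with both the layer count $m_k$ and the per-layer dimensionality $d_{k,j}$ varying across clients (all names here are illustrative, not from any of the cited frameworks):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_client_model(layer_dims):
    """Return a model L_k as a list of per-layer weight vectors;
    len(layer_dims) is m_k and layer_dims[j] is d_{k,j}."""
    return [rng.standard_normal(d) for d in layer_dims]

# Three clients with heterogeneous architectures: m_k and d_{k,j} differ.
clients = {
    "k1": make_client_model([64, 32]),         # m_k = 2
    "k2": make_client_model([64, 32, 16]),     # m_k = 3
    "k3": make_client_model([128, 32, 16]),    # different d_{k,1}
}

for k, model in clients.items():
    print(k, "layers:", len(model), "dims:", [w.size for w in model])
```

This layer-list representation is what makes the layer-wise cloud aggregation in Section 2.2 possible, since clients with different depths can still share their common layers.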

2. Aggregation and Communication Protocols

2.1 Edge-Level Aggregation

  • First round ($t=1$): Quantity-weighted averaging

$$E_i^1 = \sum_{k\in K_i} \frac{N_k}{\sum_{\ell \in K_i} N_\ell} L_k^1$$

with $N_k = |D_k|$.

  • Subsequent rounds ($t>1$): Distance-weighted averaging

$$d(L_k^t, G_i^{t-1}) = \| L_k^t - G_i^{t-1} \|_2$$

$$E_i^t = \sum_{k\in K_i} \frac{d(L_k^t, G_i^{t-1})}{\sum_{\ell \in K_i} d(L_\ell^t, G_i^{t-1})} L_k^t$$
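The two edge-level rules above can be sketched directly in NumPy, treating each local model as a flattened parameter vector within an architecture-homogeneous cluster (function names and the uniform fallback for zero distances are illustrative assumptions, not from the cited papers):

```python
import numpy as np

def edge_aggregate_first_round(local_models, n_samples):
    """Quantity-weighted average: E_i^1 = sum_k (N_k / sum_l N_l) L_k^1."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    return sum(w * m for w, m in zip(weights, local_models))

def edge_aggregate_distance(local_models, prev_global):
    """Distance-weighted average: weight each L_k^t proportionally to
    its L2 distance from the previous global model G_i^{t-1}."""
    dists = np.array([np.linalg.norm(m - prev_global) for m in local_models])
    if dists.sum() == 0.0:            # all clients match the previous global;
        dists = np.ones_like(dists)   # fall back to uniform weights (assumed)
    weights = dists / dists.sum()
    return sum(w * m for w, m in zip(weights, local_models))

# Toy usage: three clients in one cluster, 8 parameters each.
rng = np.random.default_rng(1)
models = [rng.standard_normal(8) for _ in range(3)]
e1 = edge_aggregate_first_round(models, n_samples=[100, 200, 300])
e2 = edge_aggregate_distance(models, prev_global=e1)
```

Note that, unlike FedAvg, the distance rule gives *more* weight to clients that deviate further from the previous global model, which the cited work interprets as prioritizing more informative updates under non-IID data.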

2.2 Cloud-Level Aggregation (“MaxCommon”)

For each layer $j = 1, \dots, n$:

  • Extract all edge-aggregated models $l_j^i$ that possess layer $j$
  • Aggregate layer $j$:

$$Gl_j^t = \sum_{i\in X_j} \frac{\sum_{k\in K_i} N_k}{\sum_{p\in X_j} \sum_{k\in K_p} N_k}\, l_j^i$$

  • Assemble $G_i^t = Gl_1^t \oplus \dots \oplus Gl_{m_i}^t$ matching edge $i$'s architecture

This enables layer-wise knowledge transfer without public data or distillation.
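A minimal sketch of this layer-wise scheme, assuming that layers sharing an index $j$ across edges have matching shapes (so they can be averaged), and with cluster sample counts $\sum_{k\in K_i} N_k$ supplied per edge (the function name `maxcommon_aggregate` is illustrative):

```python
import numpy as np

def maxcommon_aggregate(edge_models, cluster_sizes):
    """Layer-wise cloud aggregation sketch.
    edge_models[i]   : list of layer arrays for edge model E_i (length m_i)
    cluster_sizes[i] : total sample count of cluster K_i
    Returns one global model G_i per edge, assembled to match m_i."""
    max_layers = max(len(m) for m in edge_models)
    global_layers = []
    for j in range(max_layers):
        # X_j: the edges whose architecture includes layer j
        owners = [i for i, m in enumerate(edge_models) if j < len(m)]
        total = sum(cluster_sizes[i] for i in owners)
        layer = sum((cluster_sizes[i] / total) * edge_models[i][j]
                    for i in owners)
        global_layers.append(layer)
    # Each edge receives the first m_i global layers, matching its depth.
    return [global_layers[: len(m)] for m in edge_models]

# Toy usage: two edges with depths 2 and 3; shared layers have equal shapes.
rng = np.random.default_rng(2)
edges = [
    [rng.standard_normal(4), rng.standard_normal(3)],
    [rng.standard_normal(4), rng.standard_normal(3), rng.standard_normal(2)],
]
globals_per_edge = maxcommon_aggregate(edges, cluster_sizes=[300, 500])
```

Because layer 3 exists only at the second edge, it is returned unchanged there, while the shared shallow layers are fused across both clusters.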

2.3 Efficient Communication

Per round, the main communication volumes are:

| Direction | Volume |
|---|---|
| Clients → Edge | $\sum_i \sum_{k\in K_i} M_k$ |
| Edge → Cloud | $\sum_i \lvert E_i \rvert$ |
| Cloud → Edge | $\sum_i \lvert G_i \rvert$ |

Hierarchical aggregation reduces Internet backbone load compared to “flat FL,” i.e., all clients communicating directly with the cloud.
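The backbone saving is simple arithmetic: in flat FL every client model crosses the Internet backhaul, whereas in HFL only one aggregated model per edge server does. A toy comparison with assumed numbers:

```python
# Assumed toy numbers: K clients, S edge servers, model size M in MB.
K, S, M = 1000, 10, 5.0

flat_backhaul = K * M   # flat FL: every client uploads to the cloud
hier_backhaul = S * M   # HFL: only edge-aggregated models cross the backbone

print(flat_backhaul / hier_backhaul)  # -> 100.0 (K/S reduction factor)
```

The reduction factor is simply the client-to-edge ratio $K/S$; client-to-edge traffic remains, but travels over cheap local wireless links rather than the Internet backbone.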

3. Non-IID Data Handling and Personalization

HFL is specifically designed for non-IID scenarios—client label spaces and marginal distributions diverge significantly. Edge clustering by model architecture and geographic affinity, selective weighting in aggregation (distance-based), and layer-wise aggregation mitigate model divergence and staleness (Gao et al., 2024, Lee et al., 11 Apr 2025).

Advanced schemes such as Personalized Hierarchical Edge-enabled Federated Learning (PHE-FL) interpolate per-edge models, measuring the local/global generalization trade-off via per-edge test splits and data-driven weighting $\alpha_k$:

$$PEAM_k^t = \alpha_k\, EAM_k^t + (1-\alpha_k)\, CAM_k^t$$

where $EAM_k^t$ is the local aggregation, $CAM_k^t$ is the averaged non-local knowledge, and $\alpha_k$ is computed dynamically (Lee et al., 11 Apr 2025). This personalization yields up to an 83% absolute accuracy gain under extreme non-IID settings.

4. Theoretical Analysis: Convergence, Complexity, and Scheduling

No explicit formal proof is given for HAF-Edge, but empirical observations (and classical FL theory) indicate that:

  • Distance-based weighting at the edge layer accelerates global convergence by prioritizing updates with larger deviation—often associated with higher local model “informativeness” under non-IID data;
  • Hierarchical two-level aggregation (client-edge, edge-cloud) lowers global model variance and accelerates convergence compared to flat FL;
  • Model aggregation intervals: exactly one edge and one cloud aggregation per round induce rapid convergence, with diminishing returns for more frequent cloud aggregation (Gao et al., 2024, Liu et al., 2019).

For flat and hierarchical architectures, theoretical bounds for convex and non-convex objectives depend on key scheduling parameters (edge aggregation interval $\kappa_1$, cloud aggregation interval $\kappa_2$) and gradient-divergence measures (quantifying non-IIDness as $\delta$, $\Delta$) (Liu et al., 2019, Mhaisen et al., 2020); optimal convergence is achieved when edge data is nearly IID, allowing infrequent global synchronization.

5. Resource Allocation and Client–Edge Association

Optimal client–edge association mitigates statistical skew and balances computational load. Formalized as an integer program minimizing per-edge class distribution L1-divergence $\theta = \sum_{n=1}^N r^{(n)}\|D^{(n)}\|_1$, tractable heuristics (group equalization, branch-and-bound) recover near-centralized performance with as few as two multi-edge candidate assignments per client (Mhaisen et al., 2020). Balanced classes at the edge, together with selective client assignment and resource-aware scheduling, yield speed and accuracy gains of up to 56% and 99%, respectively, over naive partitioning.
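A greedy sketch of the balancing idea (illustrative only, not the branch-and-bound of Mhaisen et al.): each client, described by its class-label histogram, is assigned to the edge whose aggregate class distribution would end up closest to uniform after the assignment.

```python
import numpy as np

def greedy_balanced_assignment(client_hists, n_edges):
    """Assign each client (row of class counts) to the edge that, after
    assignment, minimizes the L1 distance between the edge's aggregate
    class distribution and the uniform distribution."""
    n_classes = client_hists.shape[1]
    edge_hists = np.zeros((n_edges, n_classes))
    assignment = []
    for hist in client_hists:
        best_edge, best_div = 0, np.inf
        for e in range(n_edges):
            cand = edge_hists[e] + hist
            p = cand / max(cand.sum(), 1e-12)      # normalized class mix
            div = np.abs(p - 1.0 / n_classes).sum()  # L1 divergence to uniform
            if div < best_div:
                best_edge, best_div = e, div
        edge_hists[best_edge] += hist
        assignment.append(best_edge)
    return assignment, edge_hists

# Toy usage: 12 clients, 5 classes, 3 edge servers.
rng = np.random.default_rng(3)
hists = rng.integers(0, 50, size=(12, 5)).astype(float)
assignment, edge_hists = greedy_balanced_assignment(hists, n_edges=3)
```

Driving each edge's class mix toward uniform is what makes per-edge data "nearly IID," which Section 4 identifies as the regime permitting infrequent cloud synchronization.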

Further, hierarchical aggregation supports advanced orchestration methodologies—dynamic client selection under intermittent participation (Plan A/B stagewise decision-making) (Wu et al., 13 Feb 2025), resource allocation (convex programs for delay/energy minimization) (Luo et al., 2020), and communication-efficient model compression via adaptive clustering, local aggregation with sparsified random projections (Zhu et al., 2024).

6. Empirical Evaluation and Benchmarking

Multiple datasets (MNIST, FMNIST, CIFAR-10) and model architectures (1nn through 5nn) underpin experimental analyses:

  • HAF-Edge achieves superior accuracy and much faster convergence than FedAvg and MaxCommon in both IID and heavy non-IID conditions. For 1nn models on MNIST, 80% test accuracy is reached in ~10 rounds (versus roughly 35 and 40 rounds for the baselines), with final accuracy up to 85% (Gao et al., 2024).
  • Under extensive model and data heterogeneity, hierarchical aggregation schemes consistently outperform traditional two-level FL, achieving up to 2×–3× speedup and 3× client energy savings (Liu et al., 2019).
  • Personalized edge models resolve instability and accuracy degradation in severe hierarchical non-IID (Lee et al., 11 Apr 2025).
  • Communication and energy costs are reduced by clustering, core-selection, and local compression, with accuracy preserved within 1–3% (Zhu et al., 2024).

7. Key Insights, Limitations, and Future Directions

  • Architectural clustering: Enables local aggregation over homogeneous models, avoiding parameter misalignment (Gao et al., 2024).
  • Selective, layer-wise aggregation: Layer-wise MaxCommon facilitates knowledge transfer when architectures diverge; avoids public data or distillation requirements.
  • Aggregation interval optimization: One edge and one cloud aggregation round per global update is optimal; more frequent cloud sync yields diminishing returns.
  • Statistical balancing and assignment: Simple algorithms for client–edge partitioning close the gap to centralized performance under non-IIDness (Mhaisen et al., 2020).
  • Limitations: Many frameworks lack rigorous convergence proofs under general non-IID, heterogeneity, and asynchronous settings; straggler and resource heterogeneity effects require further study (Liu et al., 2019, Gao et al., 2024).
  • Future work: Extending to deeper hierarchies (multi-tier beyond three levels), asynchronous aggregation, incentivization and trust mechanisms, secure/hardened aggregation against poisoning, and empirical adaptation to dynamic and intermittent participation (Plan A/B).

In summary, client–edge–cloud hierarchical federated learning is a robust, extensible class of distributed machine learning systems that supports resource and model heterogeneity, tolerates severe statistical non-IIDness, and enables multi-tier privacy-preserving aggregation. The HAF-Edge family (Gao et al., 2024), together with associated frameworks (Liu et al., 2019; Lee et al., 11 Apr 2025; Mhaisen et al., 2020), provides foundational advances in communication efficiency, convergence acceleration, personalization, and theoretical design guidelines for next-generation edge-cloud intelligence.
