Federated Learning with Personalization Layers

Updated 21 February 2026
  • Federated Learning with Personalization Layers is a framework that divides model parameters into a global base and local personalization layers to address non-IID data challenges.
  • It employs strategies like adaptive layer selection, clustering, and hypernetwork-based personalization to enhance model specialization while reducing communication costs.
  • Empirical results show test-accuracy improvements of 5–9% over FedAvg and significant bandwidth savings, demonstrating practical efficiency across diverse applications.

Federated Learning with Personalization Layers refers to a class of federated learning (FL) algorithms and neural network architectures that partition model parameters into shared (global) and client-specific (personalization) layers, enabling efficient model specialization on heterogeneous clients without undermining collaborative representation learning. The canonical instantiation is FedPer, in which a “base” block is globally aggregated across clients, while the “personalization” layers are optimized purely locally and never communicated. Numerous variants and refinements, spanning client- and layer-level adaptation, adaptive layer selection, clustering, hypernetwork-based personalization, and rigorous privacy analysis, have emerged to address challenges of non-IID data, communication efficiency, and model diversity across clients.

1. Formal Definitions and Core Formulation

Given $K$ clients, each with local dataset $D_k$ sampled from distribution $\mathcal{D}_k$, and model parameters $w$, the global objective is as in FedAvg:

$$\min_{w} F(w) := \sum_{k=1}^K p_k F_k(w), \qquad p_k = \frac{|D_k|}{\sum_j |D_j|},$$

where $F_k(w) = \mathbb{E}_{(x,y)\sim \mathcal{D}_k}[\ell(w; x, y)]$ for loss $\ell$.

Personalization layers induce a decomposition $w = (w_b, w_p^k)$:

  • $w_b \in \mathbb{R}^{d_b}$: shared “base” parameters aggregated globally.
  • $w_p^k \in \mathbb{R}^{d_p}$: client $k$’s private parameters, never communicated.

The objective is

$$\min_{w_b,\,\{w_p^k\}} \;\; \sum_{k=1}^K p_k\, F_k(w_b,\,w_p^k)$$

with $F_k(w_b, w_p^k) = \mathbb{E}_{(x,y)\sim \mathcal{D}_k}[\ell(f(x; w_b, w_p^k), y)]$ (Arivazhagan et al., 2019).

This bi-level optimization allows $w_b$ to capture globally useful representations, and $w_p^k$ to specialize to the idiosyncrasies of client $k$’s data distribution.
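The decomposition above can be sketched numerically. This is a minimal illustration, not an implementation from the cited papers: the dimensions, the per-client surrogate for $F_k$, and the dataset sizes are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical setup: each client k holds the shared base w_b plus its own
# private head w_p[k]; client losses are weighted by p_k = |D_k| / sum_j |D_j|.
rng = np.random.default_rng(0)

n_clients = 3
d_b, d_p = 4, 2                                          # base / head dimensions
w_b = rng.normal(size=d_b)                               # shared base parameters
w_p = [rng.normal(size=d_p) for _ in range(n_clients)]   # private heads, one per client
sizes = np.array([50, 30, 20])                           # |D_k| (assumed)
p = sizes / sizes.sum()                                  # aggregation weights p_k

def local_loss(w_b, w_p_k, k):
    """Stand-in for F_k(w_b, w_p^k); a real client would average the loss over D_k."""
    return float(np.sum(w_b**2) + (k + 1) * np.sum(w_p_k**2))

# Global objective: sum_k p_k F_k(w_b, w_p^k)
F = sum(p[k] * local_loss(w_b, w_p[k], k) for k in range(n_clients))
print(round(F, 4))
```

Note that only $w_b$ is shared across the sum; each term sees a different $w_p^k$, which is exactly what makes the problem bi-level.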

2. Canonical Algorithms and Architectures

FedPer (Feature-Transfer Approach)

  • Model: Deep feedforward network, e.g., a convolutional backbone (6–8 layers) plus two fully-connected (FC) layers. Personalization may cover the final FC layer or the last two layers.
  • Training workflow: The server broadcasts $w_b$; each client performs local SGD on $(w_b, w_p^k)$ (updating both) but uploads only the shared update $\Delta w_b^k = w_b - w_b^{t}$ to the server, which aggregates by weighted average:

$$w_b^{t+1} = w_b^{t} + \sum_{k\in S_t} p_k \Delta w_b^k$$

The personalization block $w_p^k$ is always retained locally and never averaged (Arivazhagan et al., 2019).
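A toy round of this workflow can be sketched as follows. Everything here is illustrative rather than the authors' code: the quadratic surrogate loss, the step counts, and the client weights are assumptions.

```python
import numpy as np

# Toy FedPer round: each client runs local SGD on (w_b, w_p^k), uploads only
# its base update delta_w_b, and keeps w_p^k on-device. The server applies
# the p_k-weighted average of the uploaded deltas.
rng = np.random.default_rng(1)
n_clients, d_b, d_p, lr = 3, 4, 2, 0.1
w_b_global = np.zeros(d_b)
w_p = [rng.normal(size=d_p) for _ in range(n_clients)]        # private heads
targets_b = [rng.normal(size=d_b) for _ in range(n_clients)]  # stand-in for local data
p = np.array([0.5, 0.3, 0.2])                                 # aggregation weights

def client_update(w_b, w_p_k, target_b, steps=5):
    """Local SGD on a quadratic surrogate loss ||w_b - target||^2 + ||w_p^k||^2."""
    w_b = w_b.copy()
    for _ in range(steps):
        w_b -= lr * 2 * (w_b - target_b)   # gradient w.r.t. the base block
        w_p_k -= lr * 2 * w_p_k            # gradient w.r.t. the head (never uploaded)
    return w_b, w_p_k

deltas = []
for k in range(n_clients):
    w_b_new, w_p[k] = client_update(w_b_global, w_p[k], targets_b[k])
    deltas.append(w_b_new - w_b_global)    # only the base update leaves the device

# Server step: w_b^{t+1} = w_b^t + sum_k p_k * delta_k
w_b_global = w_b_global + sum(pk * d for pk, d in zip(p, deltas))
```

The key property to notice is that `w_p[k]` evolves during the local loop but never appears in the uploaded `deltas`.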

Variants and Extensions

  • Exact SGD with Personalization Layers (PFLEGO): Updates the shared parameters $w$ at the server with unbiased stochastic gradients, while updating the personal heads $h_i$ entirely on-device, ensuring unbiasedness and reduced per-round compute (Nikoloutsopoulos et al., 2022).
  • Post-hoc Fine-Tuning: Train global model to convergence, then locally fine-tune only the personalization block on each client; improves local test accuracy where heterogeneity is severe (Kulkarni et al., 2020).
  • Layer-wise and Adaptive Personalization: Methods such as PLayer-FL (Elhussein et al., 12 Feb 2025), FedLAG (Nguyen et al., 2024), and pMixFed (Saadati et al., 19 Jan 2025) assign layers to be personalized or exchanged by various metrics (federation sensitivity, gradient conflict, adaptive mixing).
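The post-hoc fine-tuning variant can be sketched as freezing the converged base and taking a few local gradient steps on the head alone. The function name, the quadratic surrogate gradient, and the hyperparameters below are hypothetical:

```python
import numpy as np

# Sketch of post-hoc fine-tuning: after federated training converges, the
# shared base w_b is frozen and only the personalization head w_p is updated
# with local gradients on the client's own data.
def finetune_head(w_b, w_p, grad_head_fn, lr=0.1, steps=10):
    """w_b stays frozen; only w_p takes gradient steps."""
    w_p = w_p.copy()
    for _ in range(steps):
        w_p -= lr * grad_head_fn(w_b, w_p)
    return w_p

# Quadratic surrogate: the head's local optimum sits at target_p.
target_p = np.array([0.5, -1.0])
grad = lambda w_b, w_p: 2 * (w_p - target_p)
w_p_final = finetune_head(np.zeros(4), np.array([0.0, 0.0]), grad)
```

With this surrogate, each step contracts the head toward `target_p`, mimicking how local fine-tuning pulls the head toward the client's own distribution while leaving the base untouched.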

3. Adaptive and Data-Driven Layer Selection

A recognized weakness of enforcing a fixed split (“last $L$ layers are always personalized”) is the inability to adjust the granularity of personalization to the actual statistical divergence at each layer. Recent advancements include:

  • Federation Sensitivity (PLayer-FL): Uses a first-order “sensitivity” metric

$$\mathcal{F}_\ell(\Theta) = \sum_{k=1}^\ell \frac{1}{n_k}\sum_{p\in\text{layer}\,k} (\theta_p\nabla\theta_p)^2$$

where a spike marks the optimal base-to-head transition. Empirically, this correlates strongly with gradient variance and Hessian trace (Elhussein et al., 12 Feb 2025).

  • Gradient Conflict (FedLAG): Measures pairwise angles between client layer-update vectors; layers with high conflict (obtuse inter-client gradient angles) are excluded from global aggregation and treated as personalization layers (Nguyen et al., 2024).

Both methods realize significantly better fairness and average accuracy than heuristic layer cuts.
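The gradient-conflict criterion can be sketched as follows. This is a simplified stand-in for FedLAG, assuming each layer's update is a plain vector and using the mean pairwise cosine as the conflict score; the function name and threshold are illustrative.

```python
import numpy as np

# For each layer, compute pairwise cosines between client update vectors;
# layers whose updates point in conflicting (obtuse) directions on average
# are excluded from aggregation and treated as personalization layers.
def conflicting_layers(updates, threshold=0.0):
    """updates[k][l] is client k's update vector for layer l."""
    n_clients, n_layers = len(updates), len(updates[0])
    personal = []
    for l in range(n_layers):
        cosines = []
        for i in range(n_clients):
            for j in range(i + 1, n_clients):
                u, v = updates[i][l], updates[j][l]
                cosines.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        if np.mean(cosines) < threshold:       # obtuse on average -> conflict
            personal.append(l)
    return personal

aligned = np.array([1.0, 1.0])
updates = [
    [aligned, np.array([ 1.0, 0.0])],   # client 0: layer 0 and layer 1 updates
    [aligned, np.array([-1.0, 0.1])],   # client 1: layer 1 points the other way
]
print(conflicting_layers(updates))  # → [1]
```

Here the clients agree on layer 0 (cosine 1) but pull layer 1 in nearly opposite directions, so only layer 1 is flagged for personalization.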

4. Layer-wise Model Aggregation and Hypernetwork Approaches

  • Layer-wise Personalized Aggregation (pFedLA): Maintains a client-specific weight $\alpha_i^{\ell, j}$ for each layer $\ell$ and peer client $j$ to optimally combine peer parameters:

$$\bar{\theta}_i^\ell = \sum_{j=1}^N \alpha_i^{\ell,j} \theta_j^\ell$$

These weights are generated by a hypernetwork conditioned on client embeddings and trained end-to-end for optimal per-client performance (Ma et al., 2022).

  • Multi-branch Architecture (pFedMB): Each layer has $B$ branches; clients learn convex combinations of branches via client-specific vectors $\{\alpha^i_{b,\ell}\}$. Aggregation uses $\alpha$-weighted FedAvg, fostering implicit clustering of clients with similar data (Mori et al., 2022).
  • Feature Fusion and Relation Networks (pFedPM): Uploads feature prototypes instead of gradients, enabling model heterogeneity and label skew adaptation with dramatically reduced uplink (Xing et al., 2024).
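The layer-wise aggregation rule for client $i$ can be sketched directly. In this toy version the $\alpha$ weights are fixed by hand for illustration; in pFedLA a hypernetwork conditioned on client embeddings generates them.

```python
import numpy as np

# Layer-wise personalized aggregation: client i's layer l is a weighted
# combination of all peers' layer-l parameters, with per-layer weights alpha.
def aggregate_for_client(alpha_i, thetas):
    """alpha_i[l][j]: weight client i puts on peer j at layer l;
    thetas[j][l]: peer j's parameter vector for layer l."""
    n_layers = len(alpha_i)
    return [sum(alpha_i[l][j] * thetas[j][l] for j in range(len(thetas)))
            for l in range(n_layers)]

# 3 peers, 2 layers; peer j's parameters are all (j + 1) for easy checking.
thetas = [[np.ones(3) * (j + 1) for _ in range(2)] for j in range(3)]
alpha_i = [[0.2, 0.3, 0.5],     # layer 0: blend all peers
           [1.0, 0.0, 0.0]]     # layer 1: keep client 0's own parameters
combined = aggregate_for_client(alpha_i, thetas)
```

Setting a row of $\alpha$ to a one-hot vector recovers full personalization for that layer, while a uniform row recovers plain FedAvg, so this rule interpolates between the two extremes per layer.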

5. Practical Impact, Computation, and Communication

  • Empirical Results: On non-IID CIFAR-10/100 and Flickr Aesthetics, FedPer improves over FedAvg by 5–9% test accuracy, with two personalized layers giving the best results in highly heterogeneous regimes (Arivazhagan et al., 2019).
  • Bandwidth and Cost: Personalization layers substantially reduce communication since only the shared backbone is synchronized. For instance, PL-FL reduces per-round bandwidth by 65% compared to full-model FL on LSTM-based forecasting (Bose et al., 2023, Bose et al., 2024).
  • Computation: Approaches like PFLEGO minimize full-network passes per round, e.g., two passes per round regardless of local step count, versus linear scaling in FedAvg (Nikoloutsopoulos et al., 2022). Sequential layer expansion further reduces computation (down to ~36% of FedAvg) by “unfreezing” base sub-layers according to scheduling (Jang et al., 2024).
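A back-of-envelope calculation shows why syncing only the base block saves bandwidth. The parameter counts below are assumed for illustration, not taken from the cited papers:

```python
# Per-round upload: base-only synchronization vs. the full model,
# for a hypothetical split of 8M shared and 2M personal parameters.
d_base, d_personal = 8_000_000, 2_000_000   # parameter counts (assumed)
bytes_per_param = 4                          # float32
full = (d_base + d_personal) * bytes_per_param
base_only = d_base * bytes_per_param
savings = 1 - base_only / full
print(f"per-round upload: {base_only/1e6:.0f} MB vs {full/1e6:.0f} MB "
      f"({savings:.0%} saved)")
```

The savings scale directly with the fraction of parameters placed in the personalization block, which is why methods that personalize many layers (or upload only prototypes, as in pFedPM) report much larger reductions.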

6. Privacy, Security, and Information Leakage

  • Privacy Advantages: Personalization layers are never transmitted, shrinking the dimensionality of communicated updates and hiding client-specific features (e.g., final task-specific heads), directly reducing the effectiveness of membership and attribute inference attacks (Jourdan et al., 2021).
  • Empirical Assessments: In activity recognition, FedPer improves not only accuracy (by 1–7%) but also reduces attribute inference attack accuracy by 10–20 percentage points and membership inference attack success to near chance, outperforming local differential privacy noise-injection (Jourdan et al., 2021).

7. Limitations, Open Challenges, and Future Directions

  • Scalability: The memory footprint of personal heads or layers grows linearly with the number of clients and the size of $w_p^k$. For very deep networks or massive populations, techniques such as hypernetwork-based head generation [FedTP, (Li et al., 2022)] or Bayesian parameter selection (Luo et al., 2024) are proposed.
  • Layer/personalization allocation: Automatically determining which layers (or even individual elements) to personalize is an active research area. Bayesian uncertainty quantification provides an element-level mask that maximizes personalization tolerance with minimal global impact (Luo et al., 2024). Data-driven or gradient-based split methods outperform ad-hoc rules.
  • Clustered and Hierarchical Personalization: Several works propose dynamically clustering clients by model weights, inference outputs, or measured distributional shifts, with shared sub-personalization between similar clients (e.g., FedTSDP (Zhu et al., 2023)).
  • Meta-Learning and Hyperparameter Personalization: Meta-nets for learning batch normalization reweighting and local learning rates by client statistics demonstrably improve multi-domain generalization (Lee et al., 2023). Cross-domain and speech recognition studies confirm substantial accuracy improvements over classical fine-tuning and hand-crafted strategies.

The personalization layer paradigm provides theoretical robustness and strong empirical utility gains for federated learning under statistical heterogeneity, non-IID data, and strict communication constraints. Its evolving ecosystem includes adaptive split policies, hypernetwork and meta-learning–powered parameterization, and provable privacy and convergence properties (Arivazhagan et al., 2019, Jourdan et al., 2021, Elhussein et al., 12 Feb 2025, Nguyen et al., 2024, Ma et al., 2022).
