Federated Clustering: Privacy-Preserving Learning
- Federated clustering is a decentralized, privacy-preserving unsupervised learning approach that aggregates locally computed models to infer global data partitions.
- It employs methods like federated k-means, fuzzy c-means, and deep clustering techniques enhanced with secure aggregation and differential privacy.
- Recent advances focus on robustness to non-IID data, asynchronous updates, and efficient machine unlearning for fast, privacy-compliant cluster updates.
Federated clustering (FC) is a class of decentralized unsupervised learning protocols that enable multiple clients, each holding private, typically heterogeneous, unlabeled data, to jointly infer global data groupings or cluster structures while preventing direct exchange of raw data. Motivated by privacy regulations and the prevalence of distributed data (e.g., in healthcare, banking, IoT), FC generalizes classic clustering objectives (e.g., k-means, spectral clustering, deep clustering) to architectures that restrict information sharing to models or privatized summaries. The resulting landscape features algorithmic innovations, formal privacy guarantees, and substantial advances in robust unsupervised inference under non-IID conditions.
1. Core Principles and Problem Formulation
In FC, $M$ clients possess local datasets $D_1, \dots, D_M$ (each drawn from a client-specific distribution $\mathcal{P}_j$), and the collective goal is to recover a global partition, typically minimizing a centralized cost such as

$$ J(\{c_k\}) = \sum_{j=1}^{M} \sum_{x \in D_j} \ell\big(x,\, c_{\pi(x)}\big), $$

where $\{c_k\}_{k=1}^{K}$ are cluster centroids (or prototypes), $\pi(x)$ is $x$'s cluster assignment, and $\ell$ is an affinity or distortion loss (Euclidean distance, negative log-likelihood, or, more generally, a deep embedding loss as in representation clustering).
A principal challenge is the inability to compute affinities or means across client boundaries for arbitrary point pairs without explicit data sharing. Data heterogeneity (the non-IID condition) further complicates inference: a client may observe only a subspace or modal subset, so local optima diverge from global solutions. Privacy constraints demand protocols in which exchanged information (model weights, cluster statistics, synthetic data, summary graphs) does not expose individual data points or sensitive attributes (Li et al., 2022, Yan et al., 19 May 2025, He et al., 14 Nov 2025).
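As a concrete reference point, the centralized objective that FC approximates can be evaluated directly when data are pooled. The sketch below (illustrative names, squared Euclidean loss assumed) computes the global k-means cost over per-client shards:

```python
import numpy as np

def global_kmeans_cost(client_data, centroids):
    """Sum of squared distances from every point (across all client
    shards) to its nearest global centroid."""
    cost = 0.0
    for X in client_data:  # X: (n_j, d) local dataset of one client
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        cost += d2.min(axis=1).sum()  # nearest-centroid distortion
    return cost

# Two synthetic "clients" with well-separated local data
rng = np.random.default_rng(0)
clients = [rng.normal(size=(50, 2)), rng.normal(loc=5.0, size=(50, 2))]
C = np.array([[0.0, 0.0], [5.0, 5.0]])
cost = global_kmeans_cost(clients, C)  # lower than for badly placed centroids
```

In a federated setting no single party can run this loop over all shards, which is precisely what the protocols in the following sections work around.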
2. Classical and Model-Driven Federated Clustering Algorithms
Early federated clustering approaches adapt centralized algorithms to the federated regime by leveraging communication-efficient proxies.
Federated k-Means (k-FED): Each client runs Lloyd's k-means locally and sends its centroids to the server, which aggregates them (by averaging or a further round of k-means) and broadcasts updated centroids. This process repeats for several rounds. The global centroid update is typically

$$ c_k = \frac{\sum_{j=1}^{M} n_{j,k}\, c_{j,k}}{\sum_{j=1}^{M} n_{j,k}}, $$

where $c_{j,k}$ is the $k$-th local centroid on client $j$ and $n_{j,k}$ is the number of points locally assigned to it. To address local/global mismatch in assignments, weighted updates and robust centroid matching are employed (Holzer et al., 2023, Xu et al., 2024).
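The count-weighted server aggregation step can be sketched as follows (a minimal illustration, not any specific paper's implementation; robust centroid matching across clients is omitted):

```python
import numpy as np

def aggregate_centroids(local_centroids, local_counts):
    """Count-weighted average of per-client centroids.

    local_centroids: list of (K, d) arrays, one per client (assumed
    pre-matched so that row k refers to the same cluster everywhere).
    local_counts: list of (K,) arrays of local assignment counts.
    """
    num = sum(n[:, None] * c for c, n in zip(local_centroids, local_counts))
    den = sum(local_counts)[:, None]
    return num / np.maximum(den, 1)  # guard against empty clusters
```

For example, a cluster seen by client A (centroid 0.0, 1 point) and client B (centroid 2.0, 3 points) aggregates to 1.5, not the naive average 1.0.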
Federated Fuzzy c-Means (FFCM): Each client applies fuzzy assignments, returning a weighted set of cluster memberships (membership matrices), which the server aggregates into fuzzy centroids. FFCM improves cluster flexibility but remains sensitive to severe heterogeneity (Stallmann et al., 2022, Yan et al., 2022).
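For reference, the standard (centralized) fuzzy c-means updates that FFCM federates look like this (a textbook sketch with fuzzifier m, not the FFCM aggregation rule itself):

```python
import numpy as np

def fuzzy_memberships(X, C, m=2.0):
    """Fuzzy c-means membership matrix U (n, K): each row sums to 1,
    with weight decaying in the distance to each centroid."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_centroids(X, U, m=2.0):
    """Membership-weighted centroid update."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]
```

In the federated variant, clients compute memberships locally and the server combines the membership-weighted sums rather than the raw points.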
Secure Federated Clustering (SecFC, OmniFC): To achieve central-optimal performance with strong privacy, SecFC (Li et al., 2022) and the more general OmniFC (Yan et al., 19 May 2025) leverage Lagrange coded computing or secret sharing. Clients encode their quantized data as evaluations of secret polynomials, transmitting only shares to the server (and potentially peers). The server reconstructs exact global distance matrices or Lloyd's k-means updates through multi-party secure computation, ensuring information-theoretic privacy against server or client collusion.
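The core idea of computing exact aggregates from shares can be illustrated with plain additive secret sharing (a deliberately simplified stand-in for the Lagrange coded computing used by SecFC/OmniFC; the modulus is illustrative):

```python
import random

P = 2_147_483_647  # a large prime modulus (illustrative choice)

def share(x, n):
    """Split integer x into n additive shares mod P; any subset of
    n-1 shares is uniformly random and reveals nothing about x."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three clients secret-share their values; each of three parties sums
# its share index across clients, and only the total is recoverable.
xs = [12, 30, 7]
per_client = [share(x, 3) for x in xs]
partials = [sum(s[i] for s in per_client) % P for i in range(3)]
total = reconstruct(partials)  # 49 == sum(xs)
```

Lagrange coded computing generalizes this from sums to polynomial functions of the data (e.g., pairwise distances), which is what enables exact distance-matrix reconstruction.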
Dynamic and Asynchronous Protocols: Solutions such as Dynamically Weighted Federated -Means (Holzer et al., 2023) and Asynchronous Federated Cluster Learning (AFCL) (Zhang et al., 2024) introduce adaptive weighting/momentum schemes, robust aggregation steps, and asynchronous updates to improve convergence and cope with varying participation and unknown cluster numbers.
3. Representation Learning and Deep Federated Clustering
High-dimensional, non-vectorial, or multimodal data require distributed representation learning for effective FC.
Cluster-Contrastive Federated Clustering (CCFC, CCFC++): CCFC (Liu et al., 2024) operates by sharing cluster-friendly encoders and predictors (often deep nets) across clients, with each client optimizing a cluster-contrastive loss based on global centroids. The protocol alternates server-side aggregation of model parameters and centroids with local contrastive learning. CCFC++ (Yan et al., 2024) introduces a decorrelation regularizer penalizing covariance off-diagonals, mitigating “dimensional collapse” under non-IID splits, and empirically boosting NMI by up to 0.34.
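The decorrelation idea, penalizing off-diagonal entries of the embedding covariance, can be sketched as follows (an illustration of the principle, not CCFC++'s exact regularizer):

```python
import numpy as np

def decorrelation_penalty(Z):
    """Squared Frobenius norm of the off-diagonal part of the feature
    covariance of a batch of embeddings Z (n, d). Driving this to zero
    discourages features from collapsing onto a low-dim subspace."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = (Zc.T @ Zc) / (len(Z) - 1)
    off = cov - np.diag(np.diag(cov))
    return (off ** 2).sum()
```

Duplicated (perfectly correlated) feature dimensions incur a large penalty, while independent dimensions incur almost none.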
Federated Deep Subspace Clustering (FDSC): FDSC (Zhang et al., 2024) introduces a federated deep subspace clustering network with an encoder (shared, communicated), self-expressive layer (private, modeling intra-client affinities), and decoder (private). Local neighborhood-preserving regularization enhances the self-expressiveness property, and global encoder aggregation occurs via FedAvg. FDSC empirically outperforms centralized deep subspace clustering on various image sets.
Privacy-Preserving Deep Clustering with Synthetic Data: Multiple works (Yan et al., 2022, Yan et al., 2022) propose protocols where clients train generative models (GANs), transmit only synthetic samples to the server, which then performs deep (e.g., autoencoder-based) clustering. This synthetic data proxy scheme improves privacy—since server and peers never see original samples—and is robust to non-IID client distributions. Deep clustering (e.g., via deep clustering networks/DCN) further refines the pseudolabels sent back to clients. Federated cluster-wise refinement frameworks (Nardi et al., 2024) combine autoencoders, cluster-based FL, and cluster association graphs for highly heterogeneous, overlapping global/local cluster sets.
4. Graph- and Structure-Based Federated Clustering
Recent advances leverage structural data representations and private aggregation of graph structures or prototype hierarchies.
Private Federated Graph Clustering (SPP-FGC): Clients encode local data relationships as private structural graphs (e.g., via GMMs, sparse graphical models) and transmit these to the server. The server aggregates block-wise local graphs, aligning cluster-structures via KL divergence and constructing an integrated global graph on which block-diagonalization and spectral embedding yield the global clustering (He et al., 14 Nov 2025). SPP-FGC guarantees differential privacy on model parameters (via Laplace mechanism) and restricts shared information to low-entropy structure.
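The server-side spectral step on an aggregated affinity matrix can be sketched with a generic normalized-Laplacian embedding (a standard building block, not SPP-FGC's exact procedure):

```python
import numpy as np

def spectral_embedding(W, k):
    """Embed nodes of a symmetric, nonnegative affinity matrix W via
    the k eigenvectors of the normalized Laplacian with smallest
    eigenvalues; the rows can then be clustered with k-means."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return vecs[:, :k]
```

On a (near) block-diagonal global graph, rows belonging to the same block map to (near) identical embedding vectors, which is what makes block-diagonalization of the aggregated graph sufficient for recovering the global clustering.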
One-Shot and Hierarchical Federated Clustering: Fed-HIRE (Cai et al., 10 Jan 2026) adopts a client prototype-level communication model: each client discovers fine-grained “clusterlets” using competition-based partitioning, communicates them in a single round, and the server recursively fuses these into a hierarchy of cluster representations (multi-granular clustering). This modular paradigm achieves SOTA results across a wide range of tabular benchmarks.
Collaborative (Vertical/Horizontal) Representations: DC-Clustering (Kawamata et al., 11 Jun 2025) addresses complex, mixed vertical/horizontal data splits by sharing only dimensionality-reduced intermediate representations (constructed using local PCA or learned mappings) and collaboratively constructing a common embedding space via a shared anchor set and affine transformation. Subsequent clustering proceeds centrally on these collaborative representations, matching centralized performance across various real-world scenarios.
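The anchor-based alignment idea can be sketched as a least-squares affine fit on a shared anchor set (illustrative names; DC-Clustering's actual construction may differ):

```python
import numpy as np

def align_to_anchor(anchor_local, anchor_global, local_repr):
    """Fit an affine map sending a client's representation of the
    shared anchor set (anchor_local) to the common anchor embedding
    (anchor_global), then apply it to the client's own data
    representations so all clients land in one embedding space."""
    A = np.hstack([anchor_local, np.ones((len(anchor_local), 1))])
    M, *_ = np.linalg.lstsq(A, anchor_global, rcond=None)
    X = np.hstack([local_repr, np.ones((len(local_repr), 1))])
    return X @ M
```

Because only anchor images and low-dimensional representations are exchanged, the raw features never leave the client.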
5. Privacy and Security Guarantees
A core constraint in FC is the rigorous protection of local data privacy.
- Information-Theoretic Security: Protocols such as SecFC (Li et al., 2022) and OmniFC (Yan et al., 19 May 2025) use polynomial secret sharing, achieving information-theoretic security against up to a threshold number of colluding parties; server and peers never recover raw data or cluster assignments beyond the revealed clustering result.
- Differential Privacy (DP): DP-FedC (Li et al., 2023) and SPP-FGC (He et al., 14 Nov 2025) inject calibrated noise (e.g., Gaussian or Laplace) into all local updates or shared parameters, providing (ε, δ)-DP or ε-DP guarantees, respectively. A direct privacy analysis is also provided for GAN-based synthetic data sharing, bounding the probability of recovering any individual sample in terms of the synthetic dataset size and the original client dataset size (Yan et al., 2022).
- Local Differential Privacy: Several protocols (Masuyama et al., 2023) use Laplace perturbation at the client on feature values or representative nodes, enabling instance-level privacy in continual learning scenarios.
- Compressed Secure Aggregation: SCMA (Pan et al., 2022) applies Reed–Solomon–encoded mask sharing for secure and communication-efficient aggregation of sparse cluster prototypes/counts, supporting both federated learning and machine unlearning.
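The Laplace perturbation used by several of the protocols above (e.g., for ε-DP on shared parameters) follows the standard mechanism, sketched here:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Add Laplace(sensitivity/epsilon) noise to a scalar or array,
    giving epsilon-DP for a query with the given L1 sensitivity."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return value + rng.laplace(0.0, scale, size=np.shape(value))
```

Smaller ε means a larger noise scale and hence stronger privacy at the cost of utility, which is the trade-off the empirical sections below quantify.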
6. Robustness, Unlearning, and Adaptivity
Recent expansions address robustness to extreme heterogeneity, device drop-out, asynchronous settings, and user-driven unlearning.
- Robustness to Non-IID Data: Protocols such as SDA-FC (Yan et al., 2022) and AFCL (Zhang et al., 2024) sustain high clustering accuracy and stability even as clients become highly non-IID or only cover partial modality subsets. Empirically, traditional centroid-aggregation methods (k-FED, FFCM) collapse under such scenarios.
- Fault and Drop-out Tolerance: Many one-shot and graph-based protocols (e.g., SDA-FC, SPP-FGC) tolerate up to 50% client failures in practice, maintaining clustering performance by leveraging invariance of global synthetic or graph-based representations.
- Machine Unlearning: MUFC (Pan et al., 2022) introduces a federated protocol for exact machine unlearning—efficiently removing data points or clients and recomputing clusterings consistent with the reduced dataset, leveraging K-means++ reseeding and SCMA for fast, privacy-preserving updates with up to 84× speed-up over retraining.
- Asynchronous Convergence and Unknown Cluster Number: AFCL (Zhang et al., 2024) and ART-based continual clustering (Masuyama et al., 2023) address practical requirements: handling an unknown number of clusters, unbalanced client participation, and continually evolving or streaming data.
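The statistics-based flavor of exact unlearning can be illustrated by subtracting a departing client's contribution from per-cluster sufficient statistics (a simplified sketch, not the full MUFC protocol with K-means++ reseeding and SCMA):

```python
import numpy as np

def unlearn_client(sum_x, count, client_sum, client_count):
    """Exactly remove one client's contribution from aggregated
    per-cluster statistics (sum of points per cluster, assignment
    counts), then recompute centroids from what remains.

    sum_x: (K, d) global per-cluster point sums.
    count: (K,) global per-cluster counts.
    client_sum / client_count: the departing client's contributions.
    """
    new_sum = sum_x - client_sum
    new_count = count - client_count
    return new_sum / np.maximum(new_count, 1)[:, None]
```

Because sums and counts decompose additively over clients, this update is exact and far cheaper than retraining from scratch, which is the source of the reported speed-ups.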
7. Empirical Benchmarks and Future Directions
Empirical evaluations of FC algorithms benchmark clustering performance using metrics including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), clustering accuracy, and Kappa statistics. SOTA frameworks (Fed-HIRE (Cai et al., 10 Jan 2026), SPP-FGC (He et al., 14 Nov 2025), OmniFC (Yan et al., 19 May 2025), FDSC (Zhang et al., 2024), CCFC++ (Yan et al., 2024)) achieve near-centralized accuracy and consistently outperform classic federated baselines (k-FED, FFCM) under both IID and non-IID benchmarks on image, tabular, time-series, and genomics data.
Key empirical findings:
| Approach | NMI improvement (vs. baseline) | Non-IID robustness | Provable Privacy |
|---|---|---|---|
| CCFC++ | up to 0.34 | Yes | Standard FL |
| SPP-FGC | up to 10% | Yes | ε-DP |
| Fed-HIRE | 3–15% | Yes | Yes |
| OmniFC | 0.2–0.4 (Kappa) | Yes | Info-theoretic |
| DP-FedC | up to 10% | Yes | (ε,δ)-DP |
| SDA-FC/PPFC-GAN | stable/robust | Yes | Synthetic-data proxy |
Future research directions include scaling FC to streaming clients and dynamic data, algorithm-agnostic deep clustering, secure aggregation layered atop DP, and federated clustering under horizontal, vertical, or hybrid data splits. Theoretical challenges include characterizing tight privacy–utility trade-offs, convergence under strong adversaries, and robust model selection in heterogeneous federated ecosystems.
References
- (Stallmann et al., 2022) Towards Federated Clustering: A Federated Fuzzy c-Means Algorithm (FFCM)
- (Holzer et al., 2023) Dynamically Weighted Federated k-Means
- (Zhang et al., 2024) Federated Deep Subspace Clustering
- (Yan et al., 2024) CCFC++: Enhancing Federated Clustering through Feature Decorrelation
- (Liu et al., 2024) CCFC: Bridging Federated Clustering and Contrastive Learning
- (Yan et al., 2022) Privacy-Preserving Federated Deep Clustering based on GAN
- (Yan et al., 19 May 2025) OmniFC: Rethinking Federated Clustering via Lossless and Secure Distance Reconstruction
- (Li et al., 2022) Secure Federated Clustering
- (He et al., 14 Nov 2025) Towards Federated Clustering: A Client-wise Private Graph Aggregation Framework
- (Zhang et al., 2024) Asynchronous Federated Clustering with Unknown Number of Clusters
- (Pan et al., 2022) Machine Unlearning of Federated Clusters
- (Kawamata et al., 11 Jun 2025) A new type of federated clustering: A non-model-sharing approach
- (Cai et al., 10 Jan 2026) One-Shot Hierarchical Federated Clustering
- (Nardi et al., 2024) Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions