Anchor-based Fair Clustering Framework
- The paper presents AFCF, a scalable algorithm that ensures exact per-cluster fairness by matching demographic proportions via novel anchor selection and constrained optimization.
- It employs the FDAS mechanism to select representative anchors that maintain both spatial coverage and group balance, significantly reducing computational overhead.
- The framework utilizes an ADMM-based solver to efficiently handle fairness-preserving label propagation, achieving linear scalability on large datasets.
The Anchor-based Fair Clustering Framework (AFCF) enables linear-time scalable fair clustering on large datasets, rigorously preserving demographic group fairness properties while drastically accelerating existing fair clustering algorithms. AFCF integrates novel fair sampling for anchor selection, a fairness-preserving label propagation mechanism grounded in constrained optimization, and an efficient ADMM solver, demonstrating consistent empirical efficacy across large benchmark datasets (Wei et al., 13 Nov 2025).
1. Fair Anchor Selection: FDAS Mechanism
AFCF achieves both spatial and demographic representativeness of a subset of anchors through the Fair Directly Alternate Sampling (FDAS) algorithm. Given a dataset of $n$ points, a partition of the data into protected groups, and the corresponding group proportions, FDAS selects $m$ anchors according to the following two-phase procedure:
(A) Quota Computation. Each group receives an integer anchor quota based on its proportion of the data, with the remainder allocated iteratively to those groups underrepresented relative to their proportion. This guarantees that the quotas sum to $m$ and that every group's share of the anchors stays close to its share of the full dataset.
(B) Within-Group Spatial Coverage. Within each group, every point is assigned a score that is normalized across the group. Iteratively, the highest-scoring point in the group is selected as an anchor. After each selection, the scores are decayed to promote spatial dispersion within the group. The process continues until every group's quota has been filled.
The FDAS approach ensures that the selected anchors reflect both the global group proportions and the spatial distribution of the data, at a cost that depends on the number of points $n$, the ambient dimensionality $d$, and the anchor count $m \ll n$ (Wei et al., 13 Nov 2025).
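The two phases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the quota step uses floor-then-remainder allocation as described, but the row-sum score, the Gaussian distance decay, the `bandwidth` parameter, and the function names `fair_quotas` and `select_fair_anchors` are all hypothetical stand-ins for formulas not reproduced here.

```python
import numpy as np


def fair_quotas(group_sizes, m):
    """Split m anchors across groups in proportion to group size."""
    group_sizes = np.asarray(group_sizes, dtype=float)
    p = group_sizes / group_sizes.sum()
    quotas = np.floor(m * p).astype(int)
    # Hand out the remainder one anchor at a time to the group that is
    # currently most underrepresented relative to its proportion.
    while quotas.sum() < m:
        deficit = m * p - quotas
        quotas[np.argmax(deficit)] += 1
    return quotas


def select_fair_anchors(X, groups, m, bandwidth=1.0):
    """Pick m anchor indices, filling each group's quota in turn."""
    labels, sizes = np.unique(groups, return_counts=True)
    quotas = fair_quotas(sizes, m)
    anchors = []
    for g, q in zip(labels, quotas):
        idx = np.flatnonzero(groups == g)
        Xg = X[idx]
        # ASSUMPTION: placeholder score = min-max normalized feature sum.
        raw = Xg.sum(axis=1)
        span = raw.max() - raw.min()
        score = (raw - raw.min()) / span if span > 0 else np.ones(len(idx))
        taken = np.zeros(len(idx), dtype=bool)
        for _ in range(q):
            # Highest-scoring point that has not been selected yet.
            j = int(np.argmax(np.where(taken, -np.inf, score)))
            taken[j] = True
            anchors.append(idx[j])
            # ASSUMPTION: placeholder decay that shrinks scores near the
            # new anchor so later picks land elsewhere in the group.
            d2 = ((Xg - Xg[j]) ** 2).sum(axis=1)
            score = score * (1.0 - np.exp(-d2 / (2.0 * bandwidth ** 2)))
    return np.array(anchors)
```

For example, with group sizes 60, 30, and 10 and a budget of 7 anchors, the floor step gives quotas 4, 2, and 0, and the single leftover anchor goes to the smallest group, which has the largest deficit.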
2. Anchor Graph Construction and Fairness-Preserving Label Propagation
After anchor selection, any fair clustering algorithm is applied to the $m$-anchor set, yielding a cluster label for each anchor. The challenge is then to transfer these cluster assignments to the full dataset while preserving fairness. This is done by constructing an $n \times m$ nonnegative affinity matrix between data points and anchors so that cluster label propagation maintains demographic parity.
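The propagation problem is formalized as a constrained quadratic program over the affinity matrix, subject to
- each row of the affinity matrix lying on the $m$-simplex (one row per data point $i$),
- a joint group-label constraint for each cluster and each protected group.
The formulation is expressed in terms of the anchor feature matrix, the index set of anchors assigned to each cluster, and the index set of data points belonging to each group.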
Fairness Preservation: The joint group-label constraint enforces that the final per-cluster group proportions on all data points match exactly those observed among the anchor assignments, so that the balance of the anchor clustering carries over to the full clustering.
Label propagation computes the final soft assignments by combining the affinity matrix with the one-hot anchor cluster matrix, and obtains hard cluster labels by assigning each data point to the cluster with its largest soft assignment (Wei et al., 13 Nov 2025).
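The propagation step can be illustrated with a short Python sketch. It assumes the soft assignments are the matrix product of the affinity matrix and the one-hot anchor label matrix, followed by a row-wise argmax; the names `propagate_labels` and `cluster_group_proportions` are hypothetical helpers, with the second included only to inspect per-cluster group proportions after propagation.

```python
import numpy as np


def propagate_labels(Z, anchor_labels, n_clusters):
    """Spread anchor cluster labels to all points through affinities Z (n x m)."""
    Y = np.eye(n_clusters)[anchor_labels]  # one-hot anchor labels, m x K
    F = Z @ Y                              # soft assignments, n x K
    return F, F.argmax(axis=1)             # hard label = largest soft score


def cluster_group_proportions(labels, groups, n_clusters, n_groups):
    """Return a K x G table of group shares inside each cluster."""
    counts = np.zeros((n_clusters, n_groups))
    np.add.at(counts, (labels, groups), 1)
    totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # avoid 0/0
    return counts / totals


# Toy usage: four points, two anchors (one per cluster), two groups.
Z_demo = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
F_demo, labels_demo = propagate_labels(Z_demo, np.array([0, 1]), 2)
props_demo = cluster_group_proportions(labels_demo, np.array([0, 0, 1, 1]), 2, 2)
```

Comparing the resulting table against the same table computed on the anchors alone is one way to check whether the proportions were carried over.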
3. ADMM-Based Optimization
To efficiently solve the constrained quadratic problem, AFCF employs an Alternating Direction Method of Multipliers (ADMM) framework. A slack variable and a dual variable are introduced, and the augmented Lagrangian of the constrained problem is formed.
Each iteration alternately minimizes over the affinity matrix (simplex-constrained QPs), updates the slack variable (in closed form within each block, to enforce the fairness constraint), and performs dual ascent on the dual variable. The per-iteration cost depends on the number of anchors $m$, which is typically small, on the order of $100$ or fewer.
Convergence is monitored via the primal and dual residuals, which empirically decrease as the algorithm proceeds. Adaptive step-size schemes (e.g., updating the step size every 10 iterations) are used to speed up convergence (Wei et al., 13 Nov 2025).
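The three-step update pattern can be seen on a toy problem. The sketch below is not AFCF's solver: as a loudly stated assumption, it replaces the paper's objective with the simpler task of finding the matrix closest to a given matrix `V` whose rows lie on the simplex and whose (group, cluster) blocks sum to prescribed totals `target`. The names `project_rows_to_simplex`, `admm_fair_affinity`, `V`, and `target` are hypothetical. It uses scaled-form ADMM with a fixed penalty `rho`.

```python
import numpy as np


def project_rows_to_simplex(V):
    """Euclidean projection of every row onto the probability simplex."""
    n, m = V.shape
    U = np.sort(V, axis=1)[:, ::-1]            # each row sorted descending
    css = np.cumsum(U, axis=1) - 1.0
    ks = np.arange(1, m + 1)
    support = (U - css / ks > 0).sum(axis=1)   # size of the positive support
    theta = css[np.arange(n), support - 1] / support
    return np.maximum(V - theta[:, None], 0.0)


def admm_fair_affinity(V, groups, anchor_clusters, target, rho=1.0, iters=500):
    """Toy ADMM: rows of Z on the simplex, block sums of S equal to target."""
    Z = project_rows_to_simplex(V)
    S = Z.copy()
    U = np.zeros_like(Z)                       # scaled dual variable
    primal = dual = np.inf
    for _ in range(iters):
        # Z-step: simplex-constrained quadratic, solved by projection.
        Z = project_rows_to_simplex((V + rho * (S - U)) / (1.0 + rho))
        # S-step: closed-form shift of each (group, cluster) block so that
        # its total mass matches the target.
        S_prev = S
        S = Z + U
        for g in range(target.shape[0]):
            rows = np.flatnonzero(groups == g)
            for k in range(target.shape[1]):
                cols = np.flatnonzero(anchor_clusters == k)
                block = S[np.ix_(rows, cols)]
                S[np.ix_(rows, cols)] = block + (target[g, k] - block.sum()) / block.size
        # Dual ascent on the consensus constraint Z = S.
        U = U + Z - S
        primal = np.linalg.norm(Z - S)
        dual = rho * np.linalg.norm(S - S_prev)
    return Z, primal, dual
```

The returned `primal` and `dual` values are the residual norms from the last iteration. For the targets to be feasible, the totals for each group must sum to that group's size, since every row carries unit mass.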
4. Theoretical Guarantees
AFCF provides two formal guarantees:
(a) Fairness Equivalence: Under the formulated group-label joint constraint, the final clustering of all points recovers the exact per-cluster demographic group proportions present in the anchor clustering. This implies preservation of standard fairness metrics, including balance and disparate impact.
(b) Linear-Time Scalability: For a fixed anchor count $m$, dimensionality $d$, and cluster count $K$, the total computational complexity of AFCF is linear in the number of samples $n$, plus the complexity of the fair clustering subroutine on the $m$ anchors. With $m \ll n$, this yields overall linear scaling in $n$, a substantial reduction from the quadratic or higher costs of many existing fair clustering frameworks (Wei et al., 13 Nov 2025).
5. Empirical Evaluation
AFCF was benchmarked on five real-world datasets:
| Dataset | Size | # Clusters | Sensitive Attribute |
|---|---|---|---|
| Law School | 18,000 | 2 | Gender |
| Credit Card | 29,000 | 5 | Gender |
| Bank | 41,000 | 2 | Marital Status |
| Zafar | 100,000 | 2 | Binary Sensitive |
| Census II | 2,460,000 | 5 | Gender |
Performance metrics included clustering quality (Accuracy and Normalized Mutual Information, NMI) and fairness (Balance and Minimal Normalized Conditional Entropy, MNCE). Five representative state-of-the-art methods (SpFC, VFC, FFC, FMSC, and fairletFC) were integrated into the AFCF pipeline.
Key empirical findings:
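- Computational Speedup: On Census II, VFC alone required ≈1,500 s, while VFC-AF (the AFCF version) ran in ≈918 s, a roughly 1.6× speedup. SpFC could not complete within 30 minutes on Bank, whereas SpFC-AF finished in 35 s, a speedup of more than 50×. In general, AFCF is reported to enable one to two orders of magnitude acceleration, although the Census II figures correspond to a more modest gain.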
- Clustering Quality and Fairness Preservation: Clustering accuracy and NMI varied by only a few percentage points; fairness metrics such as balance and MNCE were preserved within 1–2% of anchor clustering levels, consistent with the theoretical guarantee.
- Ablation Analysis: Replacing FDAS with random or vanilla DAS anchor sampling resulted in degenerate clusters or substantial fairness loss. Excluding the group-label joint constraint in the graph update ("AC" ablation) degraded balance by up to 10% (Wei et al., 13 Nov 2025).
6. Significance and Implications
AFCF decouples scalability from the core fair clustering algorithm: any fair clustering routine applied to the anchor subset inherits AFCF’s linear-time scalability and exact fairness preservation when extended to the whole dataset. This modularity allows rapid experimentation and deployment across large-scale, high-stakes environments requiring fairness guarantees in unsupervised learning. The systematic empirical and theoretical analysis demonstrates AFCF’s ability to bridge the computational gap in fair clustering, establishing it as a plug-and-play, practical framework for scalable fair learning (Wei et al., 13 Nov 2025).