
Anchor-based Fair Clustering Framework

Updated 20 November 2025
  • The paper presents AFCF—a scalable algorithm that ensures exact per-cluster fairness by matching demographic proportions via novel anchor selection and constrained optimization.
  • It employs the FDAS mechanism to select representative anchors that maintain both spatial coverage and group balance, significantly reducing computational overhead.
  • The framework utilizes an ADMM-based solver to efficiently handle fairness-preserving label propagation, achieving linear scalability on large datasets.

The Anchor-based Fair Clustering Framework (AFCF) enables linear-time scalable fair clustering on large datasets, rigorously preserving demographic group fairness properties while drastically accelerating existing fair clustering algorithms. AFCF integrates novel fair sampling for anchor selection, a fairness-preserving label propagation mechanism grounded in constrained optimization, and an efficient ADMM solver, demonstrating consistent empirical efficacy across large benchmark datasets (Wei et al., 13 Nov 2025).

1. Fair Anchor Selection: FDAS Mechanism

AFCF achieves both spatial and demographic representativeness of the anchor subset through the Fair Directly Alternate Sampling (FDAS) algorithm. Given a dataset $\mathbf{X}\in\mathbb{R}^{d\times n}$, a partition of the data into $t$ protected groups $\mathcal{G}=\{G_1, \dots, G_t\}$, and group proportions $\rho_r=|G_r|/n$, FDAS selects $m\ll n$ anchors according to the following two-phase procedure:

(A) Quota Computation. Each group receives a quota $q_r = \lfloor m \rho_r \rfloor$, with the remainder $\Delta = m - \sum_{r=1}^t q_r$ allocated iteratively to the groups most underrepresented relative to $m\rho_r$. This guarantees $\sum_r q_r = m$ and $|q_r - m\rho_r| < 1$ for all $r$.
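
The quota step is a largest-remainder allocation and can be sketched in a few lines; the function name and list-based interface below are illustrative, not from the paper:

```python
import math

def fair_quotas(group_sizes, m):
    """Phase (A) of FDAS: split m anchor slots across groups so that
    sum(q) == m and |q_r - m*rho_r| < 1 for every group r.
    Floor each ideal share m*rho_r, then hand the leftover slots to the
    groups with the largest fractional shortfall."""
    n = sum(group_sizes)
    ideal = [m * g / n for g in group_sizes]
    q = [math.floor(x) for x in ideal]
    leftover = m - sum(q)
    # most underrepresented groups (largest ideal - q) get one extra slot
    order = sorted(range(len(q)), key=lambda r: ideal[r] - q[r], reverse=True)
    for r in order[:leftover]:
        q[r] += 1
    return q
```

For example, with group sizes (6, 3, 1) and $m=5$ the ideal shares are (3.0, 1.5, 0.5); flooring leaves one slot, which goes to the most shortchanged group, giving quotas (3, 2, 0), each within one of its ideal share.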

(B) Within-Group Spatial Coverage. For each group $r$, points are scored via $s_i = \sum_{p=1}^d X_{p,i}$ and normalized as $s \leftarrow s/\max(s)$. Iteratively, the highest-scoring point within group $r$ is selected as an anchor. After each selection, scores are decayed as $s \leftarrow s \odot (1-s)/\max(s)$ to promote spatial dispersion within the group. The process continues until $q_r$ anchors have been chosen from every group.

The FDAS approach ensures that the selected anchors reflect both the global group proportions and the spatial distribution, with computational complexity $O(nd)$, where $d$ is the ambient dimensionality and $m\ll n$ (Wei et al., 13 Nov 2025).
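
A minimal sketch of the score-and-decay phase, assuming NumPy and taking the per-group quotas from phase (A) as input; the helper name `fdas_anchors` and the toy interface are ours, not the paper's reference implementation:

```python
import numpy as np

def fdas_anchors(X, groups, quotas):
    """Phase (B) of FDAS: pick quotas[r] anchors from each group by
    score-and-decay sampling.  X is (d, n); groups is a length-n array
    of group ids; quotas holds the per-group anchor counts."""
    anchors = []
    for r, q in enumerate(quotas):
        idx = np.flatnonzero(groups == r)          # columns in group r
        s = X[:, idx].sum(axis=0).astype(float)    # s_i = sum_p X_{p,i}
        s /= s.max()                               # normalize to max 1
        chosen = set()
        for _ in range(q):
            # highest-scoring point in group r not yet selected
            j = next(k for k in np.argsort(-s) if k not in chosen)
            chosen.add(j)
            anchors.append(idx[j])
            peak = s.max()
            if peak > 0:                           # decay: s <- s*(1-s)/max(s)
                s = s * (1 - s) / peak
    return np.array(anchors)
```

Because the just-selected point has score 1 after normalization, the decay $s \odot (1-s)$ zeroes it out and suppresses its close (high-scoring) neighbors, which is what drives the spatial dispersion.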

2. Anchor Graph Construction and Fairness-Preserving Label Propagation

After anchor selection, any fair clustering algorithm $\mathcal{F}$ is applied to the $m$-anchor set, yielding cluster labels $\ell\in\{1, \dots, c\}^m$. The challenge is then to transfer these cluster assignments to the full dataset while preserving fairness. This is mediated by constructing an $m\times n$ nonnegative affinity matrix $\mathbf{Z}$ so that cluster-label propagation maintains demographic parity.

The propagation problem is formalized as
$$\min_{\mathbf{Z}\in\mathbb{R}^{m\times n}} \|\mathbf{X} - \mathbf{H}\mathbf{Z}\|_F^2 + \alpha\|\mathbf{Z}\|_F^2$$
subject to

  • $\mathbf{Z}_{:,i} \in \Delta^m$ (the $m$-simplex) for each $i$,
  • for each cluster $l$ and group $r$,

$$\sum_{j\in \mathcal{C}_l}\sum_{i\in G_r} Z_{j,i} = t_{l,r},\qquad t_{l,r} = |\{j \in \mathcal{C}_l \cap G_r\}|\cdot \frac{n}{m},$$

where $\mathbf{H}$ is the anchor feature matrix, $\mathcal{C}_l$ indexes the anchors in cluster $l$, and $G_r$ indexes the data points in group $r$.

Fairness Preservation: The joint group-label constraint enforces that the final per-cluster group proportions on all $n$ data points exactly match those observed among the anchor assignments: $\text{balance}(\mathcal{C}_{\text{final}}) = \text{balance}(\mathcal{C}_{\text{anchor}})$, where $\rho_r^{(l)} = \frac{|\mathcal{C}_l\cap G_r|}{|\mathcal{C}_l|}$ and balance is defined as $\min_l\min_{r\neq r'} \frac{\rho_r^{(l)}}{\rho_{r'}^{(l)}}$.
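
The balance metric itself is straightforward to compute: since the minimum of $\rho_r^{(l)}/\rho_{r'}^{(l)}$ over ordered pairs $r \neq r'$ equals the smallest group proportion divided by the largest, the sketch below (our naming, NumPy assumed) uses that shortcut:

```python
import numpy as np

def balance(labels, groups):
    """balance = min_l min_{r != r'} rho_r^(l) / rho_{r'}^(l), where
    rho_r^(l) is the fraction of cluster l drawn from group r.
    Returns 0.0 if any cluster misses a group entirely."""
    group_ids = np.unique(groups)
    worst = 1.0
    for l in np.unique(labels):
        members = groups[labels == l]
        props = np.array([np.mean(members == r) for r in group_ids])
        if props.min() == 0:
            return 0.0
        # min over pairs r != r' of the ratio is just min/max
        worst = min(worst, props.min() / props.max())
    return worst
```

A balance of 1 means every cluster reproduces the group proportions of every other group exactly; values near 0 indicate a cluster dominated by one group.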

Label propagation computes final soft assignments $\mathbf{Y} = \mathbf{Z}^\top \mathbf{L}$ (with $\mathbf{L}$ the one-hot matrix of anchor cluster labels), and hard cluster labels via $\hat{y}_i = \arg\max_l Y_{i,l}$ (Wei et al., 13 Nov 2025).
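
The propagation step amounts to a single matrix product; a minimal NumPy sketch under the paper's notation ($\mathbf{Z}$ is $m \times n$, $\mathbf{L}$ is $m \times c$):

```python
import numpy as np

def propagate_labels(Z, anchor_labels, c):
    """Soft labels Y = Z^T L, with L the (m x c) one-hot matrix of anchor
    cluster labels; hard labels are the row-wise argmax of the (n x c) Y."""
    m = Z.shape[0]
    L = np.zeros((m, c))
    L[np.arange(m), anchor_labels] = 1.0   # one-hot encode anchor clusters
    Y = Z.T @ L                            # (n x c) soft assignments
    return Y.argmax(axis=1)
```

Each data point thus inherits, for every cluster, the total affinity it places on that cluster's anchors, and is assigned to the cluster with the largest total.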

3. ADMM-Based Optimization

To efficiently solve the constrained quadratic problem, AFCF employs an Alternating Direction Method of Multipliers (ADMM) framework. Introducing a slack variable $\mathbf{E}$ and dual variable $\mathbf{\Lambda}$, the augmented Lagrangian is

$$\mathcal{L}_\rho(\mathbf{Z}, \mathbf{E}, \mathbf{\Lambda}) = \|\mathbf{X} - \mathbf{H}\mathbf{Z}\|_F^2 + \alpha\|\mathbf{Z}\|_F^2 + \langle \mathbf{\Lambda}, \mathbf{Z} - \mathbf{E}\rangle + \frac{\rho}{2} \|\mathbf{Z} - \mathbf{E}\|_F^2.$$

Iterative updates alternately minimize over $\mathbf{Z}$ (simplex-constrained quadratic programs), update $\mathbf{E}$ (in closed form within each block to enforce the fairness constraint), and perform dual ascent on $\mathbf{\Lambda}$. Each ADMM iteration costs $O(nm^2)$, with the number of anchors $m$ typically in the range 10–100.

Convergence is measured via the primal and dual residuals $r_k = \|\mathbf{Z}^{k+1} - \mathbf{E}^{k+1}\|_F$ and $s_k = \rho\|\mathbf{E}^{k+1} - \mathbf{E}^k\|_F$, which empirically decrease as the algorithm proceeds. Adaptive step-size schemes (e.g., updating $\rho$ every 10 steps) are used to accelerate convergence (Wei et al., 13 Nov 2025).
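
The residual computation and a residual-balancing $\rho$ update can be sketched as follows; the source only states that $\rho$ is updated periodically, so the specific thresholds $\mu$ and $\tau$ below are conventional defaults assumed for illustration, not the paper's rule:

```python
import numpy as np

def admm_residuals(Z_new, E_new, E_old, rho):
    """Primal residual r_k = ||Z^{k+1} - E^{k+1}||_F and
    dual residual s_k = rho * ||E^{k+1} - E^k||_F."""
    r = np.linalg.norm(Z_new - E_new, "fro")
    s = rho * np.linalg.norm(E_new - E_old, "fro")
    return r, s

def adapt_rho(rho, r, s, mu=10.0, tau=2.0):
    """Residual balancing: grow rho when the primal residual dominates,
    shrink it when the dual residual dominates, else leave it alone."""
    if r > mu * s:
        return rho * tau
    if s > mu * r:
        return rho / tau
    return rho
```

Keeping the two residuals within a factor $\mu$ of each other is a standard heuristic for stabilizing ADMM step sizes, and would be invoked once every fixed number of iterations (e.g., every 10 steps, as the paper reports).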

4. Theoretical Guarantees

AFCF provides two formal guarantees:

(a) Fairness Equivalence: Under the formulated group-label joint constraint, the final clustering of all $n$ points recovers the exact per-cluster demographic group proportions present in the anchor clustering. This implies preservation of standard fairness metrics, including balance and disparate impact.

(b) Linear-Time Scalability: For fixed $d$, $m$, and cluster count $c$, the total computational complexity of AFCF is

$$O(nd + f(m) + nm^2 + nmc),$$

where $f(m)$ is the complexity of the fair clustering subroutine on the $m$ anchors. With $m\ll n$, this yields overall linear scaling in the number of samples $n$, a substantial reduction from the quadratic or higher costs of many existing fair clustering frameworks (Wei et al., 13 Nov 2025).

5. Empirical Evaluation

AFCF was benchmarked on five real-world datasets:

| Dataset    | Size      | # Clusters | Sensitive Attribute |
|------------|-----------|------------|---------------------|
| Law School | 18,000    | 2          | Gender              |
| Credit Card| 29,000    | 5          | Gender              |
| Bank       | 41,000    | 2          | Marital Status      |
| Zafar      | 100,000   | 2          | Binary Sensitive    |
| Census II  | 2,460,000 | 5          | Gender              |

Performance metrics included clustering quality (Accuracy, Normalized Mutual Information) and fairness (Balance, Minimal Normalized Conditional Entropy). Representative state-of-the-art methods—SpFC, VFC, FFC, FMSC, and fairletFC—were integrated into the AFCF pipeline.

Key empirical findings:

  • Computational Speedup: On Census II, VFC alone required ≈1,500s; VFC-AF (AFCF version) executed in ≈918s. SpFC could not complete within 30 minutes on Bank, whereas SpFC-AF finished in 35s. In general, AFCF enabled one to two orders of magnitude acceleration.
  • Clustering Quality and Fairness Preservation: Clustering accuracy and NMI varied by only a few percentage points; fairness metrics such as balance and MNCE were preserved within 1–2% of anchor clustering levels, consistent with the theoretical guarantee.
  • Ablation Analysis: Substituting FDAS for random or vanilla DAS anchor sampling resulted in degenerate clusters or substantial fairness loss. Excluding the group-label joint constraint in the graph update ("AC" ablation) degraded balance by up to 10% (Wei et al., 13 Nov 2025).

6. Significance and Implications

AFCF decouples scalability from the core fair clustering algorithm: any fair clustering routine applied to the anchor subset inherits AFCF’s linear-time scalability and exact fairness preservation when extended to the whole dataset. This modularity allows rapid experimentation and deployment across large-scale, high-stakes environments requiring fairness guarantees in unsupervised learning. The systematic empirical and theoretical analysis demonstrates AFCF’s ability to bridge the computational gap in fair clustering, establishing it as a plug-and-play, practical framework for scalable fair learning (Wei et al., 13 Nov 2025).

