
Sub-center ArcFace Loss

Updated 27 January 2026
  • The paper introduces a loss function that extends ArcFace by representing each class with K sub-centers, effectively capturing intra-class variations and mitigating label noise.
  • It employs a max-over-K strategy for pooling cosine similarities, isolating anomalies to non-dominant sub-centers while reinforcing dominant modes with clean data.
  • Experimental results demonstrate a 3–6% performance boost on noisy benchmarks and improved generalization in both face and landmark recognition tasks.

Sub-center ArcFace Loss is a generalization of ArcFace (Additive Angular Margin Loss), used primarily in deep face recognition and related embedding-learning tasks. It extends ArcFace by representing each class with $K$ sub-centers (prototypes) instead of a single center, permitting more flexible modeling of intra-class variation and conferring substantial robustness to label noise and outliers in large-scale, unconstrained datasets.

1. Mathematical Definition

Given a sample $x_i \in \mathbb{R}^d$ (ℓ₂-normalized, scaled by a constant $s > 0$, typically $s = 64$) with ground-truth class $y_i \in \{1, \ldots, N\}$, and learnable sub-centers $W \in \mathbb{R}^{d \times N \times K}$, where $W_{j,k}$ is the $k$-th sub-center for class $j$, the Sub-center ArcFace loss is defined as follows (Deng et al., 2018, Ha et al., 2020):

  • All $W_{j,k}$ are ℓ₂-normalized: $\|W_{j,k}\| = 1$.
  • Cosine similarity between $x_i$ and $W_{j,k}$:

$$S_{j,k}(x_i) = W_{j,k}^\top x_i$$

  • For each class, pool over sub-centers:

$$S_j(x_i) = \max_{1 \leq k \leq K} S_{j,k}(x_i)$$

  • Compute the positive angle:

$$\theta_{y_i} = \arccos S_{y_i}(x_i)$$

  • Define the target logit with angular margin $m > 0$:

$$\text{logit}_j(x_i) = \begin{cases} s \cdot \cos(\theta_{y_i} + m), & \text{if } j = y_i \\ s \cdot S_j(x_i), & \text{otherwise} \end{cases}$$

  • Compute cross-entropy loss:

$$L_i = -\log \frac{\exp[s \cdot \cos(\theta_{y_i} + m)]}{\exp[s \cdot \cos(\theta_{y_i} + m)] + \sum_{j \neq y_i} \exp[s \cdot S_j(x_i)]}$$

For mini-batch training, this loss is averaged over all $i$. The formulation strictly generalizes standard ArcFace, which is recovered by setting $K = 1$.
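The definition above can be sketched compactly in NumPy. This is a minimal, framework-agnostic illustration of the formulas, not the authors' reference implementation; the function and argument names are chosen here for clarity. For simplicity the embeddings are kept at unit norm and the scale $s$ is applied to the logits, which is equivalent to the scaled-embedding formulation.

```python
import numpy as np

def subcenter_arcface_loss(x, W, y, s=64.0, m=0.5):
    """Sub-center ArcFace loss over a mini-batch.

    x: (B, d) raw embeddings; W: (d, N, K) sub-centers; y: (B,) labels.
    """
    B = len(y)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # l2-normalize embeddings
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # l2-normalize each sub-center
    # Cosine similarity to every sub-center, then max-over-K pooling: (B, N)
    S = np.einsum('bd,dnk->bnk', x, W).max(axis=2)
    # Additive angular margin on the ground-truth logit
    theta = np.arccos(np.clip(S[np.arange(B), y], -1 + 1e-7, 1 - 1e-7))
    logits = s * S
    logits[np.arange(B), y] = s * np.cos(theta + m)
    # Numerically stable softmax cross-entropy, averaged over the batch
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(B), y].mean()
```

Passing a weight tensor of shape $(d, N, 1)$ makes the max-over-$K$ a no-op and recovers standard ArcFace.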

2. Geometric and Statistical Intuition

In standard ArcFace ($K = 1$), each class is restricted to a single prototype on the hypersphere, compelling all class samples—regardless of pose, lighting, or noise—to cluster about a single vector. This renders ArcFace susceptible to degradation in the presence of outliers and noisy labels, which can distort the class prototype.

With $K > 1$ sub-centers, each class forms up to $K$ distinct “modes” on the unit hypersphere. Clean, frontal, or canonical examples self-organize around the main (dominant) sub-center, while atypical, hard, or mislabeled instances attach to secondary (non-dominant) sub-centers. Each sample is assigned to its closest sub-center via $\max_k S_{j,k}$, isolating anomalies from the dominant mode and preserving intra-class compactness within meaningful sub-clusters (Deng et al., 2018).

This mechanism naturally encourages self-organizing cluster separation: the dominant sub-center accrues clean data, non-dominant sub-centers attract residual variation (e.g., pose, occlusion), and the angular margin $m$ maintains local angular discriminability. In effect, most network gradient updates from clean samples reinforce the dominant mode, while gradients from outliers are sequestered to non-dominant modes, mitigating distortion of embeddings.

3. Algorithmic Implementation

A typical training pipeline for Sub-center ArcFace comprises the following steps (Deng et al., 2018, Ha et al., 2020):

  1. Produce feature embeddings $z_i = f(\mathrm{image}_i; \theta)$; normalize and scale: $x_i = s \cdot z_i / \|z_i\|$.
  2. ℓ₂-normalize all sub-centers: $\widehat{W}_{j,k} = W_{j,k} / \|W_{j,k}\|$.
  3. For each input $x_i$, compute $S_{j,k}^{(i)} = \widehat{W}_{j,k}^\top x_i / s$ for all $j, k$.
  4. For each class $j$, pool $S_j^{(i)} = \max_k S_{j,k}^{(i)}$.
  5. Compute the positive logit for class $y_i$ using the angular margin: $s \cdot \cos(\arccos S_{y_i}^{(i)} + m)$; negatives receive $s \cdot S_j^{(i)}$.
  6. Apply softmax and cross-entropy to compute the loss for $x_i$.
  7. Back-propagate the loss, update $\theta$ and $W$ via SGD/Adam, and re-normalize $W_{j,k}$.

For scalable training (millions of classes), a center-parallel strategy can distribute $W$ across GPUs. In frameworks such as PyTorch or TensorFlow, replace the final weight tensor $W \in \mathbb{R}^{d \times N}$ with $W \in \mathbb{R}^{d \times N \times K}$ and insert a max-over-$K$ pooling step before logit computation.
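The weight-tensor replacement described above amounts to a small drop-in classification head. The sketch below uses NumPy for brevity (class and method names are illustrative; in PyTorch or TensorFlow the same shapes and pooling apply, with autograd handling the backward pass):

```python
import numpy as np

class SubCenterHead:
    """Replaces a (d, N) classifier weight with (d, N, K) sub-centers."""

    def __init__(self, d, n_classes, k, seed=0):
        rng = np.random.default_rng(seed)
        # Xavier-style initialization; sub-center diversity emerges during training
        self.W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, n_classes, k))

    def logits(self, x, s=64.0):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        W = self.W / np.linalg.norm(self.W, axis=0, keepdims=True)
        # (B, N, K) cosines -> max over K -> scaled (B, N) logits
        return s * np.einsum('bd,dnk->bnk', x, W).max(axis=2)
```

Because both factors are unit-normalized, every pooled logit lies in $[-s, s]$, exactly as in the single-center head.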

4. Selection and Effects of Sub-center Count $K$

| $K$ value | Typical regime | Effect on learning |
|---|---|---|
| 1 | Low-noise, small-scale | Reduces to ArcFace (single center) |
| 2–5 | Medium-to-high noise | Enhances robustness, preserves compactness |
| ≥10 | Large data, low utility | Sub-centers sparsely used, performance degrades |

Experiments advocate $K = 3$ as a practical default on noisy data (label noise $\gtrsim 50\%$) (Deng et al., 2018, Ha et al., 2020). For web-crawled datasets or massive class counts, $K = 3$–$5$ yields the best trade-off between noise isolation and discriminative power. $K \geq 10$ is discouraged due to weakened intra-class margins and sparse sub-center assignment, while for clean, small-scale regimes $K = 1$ or $2$ is sufficient. Cross-validation over $K \in \{2, 3, 5\}$ on a representative validation subset is recommended.
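The recommended cross-validation reduces to a one-pass sweep. In this sketch, `train_and_eval` is a hypothetical callable mapping a candidate $K$ to a validation metric (higher is better); it stands in for your own training pipeline:

```python
def select_k(candidates, train_and_eval):
    """Train/evaluate once per candidate K; return the best K and all scores."""
    scores = {k: train_and_eval(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

For example, `select_k([2, 3, 5], train_and_eval)` runs three trainings and picks the $K$ with the highest validation score.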

5. Empirical and Comparative Results

Sub-center ArcFace consistently outperforms standard ArcFace ($K = 1$) in both face and landmark recognition under noisy or imbalanced conditions. Reported improvements, using verification TPR@FPR=$10^{-4}$ on IJB-C and retrieval GAP for landmarks, are tabulated below.

| Task/Dataset | Baseline (K=1) | Sub-center ArcFace | Post-Drop/Filtering |
|---|---|---|---|
| MS1MV0 (noisy), face (IJB-C) | 90.27% | 93.72% (+3.45) | 95.92% (+5.65) |
| MS1MV3 (clean), face | | 96.50% | |
| Celeb500K (50% noise) | 92.15% | 96.91% | |
| Google Landmark, val GAP | ∼0.84 | ∼0.85 | |
| Google Landmark, dynamic m | | 0.8671 | |

These results demonstrate that sub-center ArcFace nearly recovers or exceeds the performance of manually cleaned training, with a ≈3–6% boost on standard benchmarks and robust generalization under label noise (Deng et al., 2018, Ha et al., 2020). On the highly imbalanced GLDv2 dataset, sub-center ArcFace with $K = 3$ and dynamic margin yielded a +0.026 validation GAP over constant-margin ArcFace.

6. Extensions: Dynamic Margin Schedules

To address extreme class imbalance (e.g., long-tail distributions in landmark recognition), a “dynamic margin” strategy modulates the angular margin $m$ per class according to the class sample count $n$, via $m(n) = a n^{-\lambda} + b$ (clipped to $[m_\text{min}, m_\text{max}]$). The hyperparameters $(a, b, \lambda)$ are set so that $m(n_\text{min}) = m_\text{max}$ and $m(n_\text{max}) = m_\text{min}$, with recommended $\lambda = 1/4$ and $m \in [0.05, 0.50]$ for large-scale, imbalanced datasets (Ha et al., 2020). This approach improves generalization, notably for tail classes.
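Given the two boundary conditions, $(a, b)$ have a closed form. A small sketch (the function name is assumed here for illustration):

```python
import numpy as np

def dynamic_margin(n, n_min, n_max, m_min=0.05, m_max=0.50, lam=0.25):
    """Per-class margin m(n) = a * n**(-lam) + b, with (a, b) solved so that
    m(n_min) = m_max and m(n_max) = m_min, then clipped to [m_min, m_max]."""
    a = (m_max - m_min) / (n_min ** -lam - n_max ** -lam)
    b = m_max - a * n_min ** -lam
    return float(np.clip(a * n ** -lam + b, m_min, m_max))
```

Rare tail classes (small $n$) receive the largest margin, while frequent head classes are pushed toward $m_\text{min}$, so hard classes get stronger angular separation without over-penalizing well-populated ones.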

7. Practical Considerations and Best Practices

  • Initialization: Sub-centers can be initialized using standard Xavier/He initialization; no extra cluster-collapse or orthogonality regularization is required, as diversity emerges during learning (Ha et al., 2020).
  • Embedding normalization and scaling: Set $s = 64$, $m = 0.5$ by default, in line with ArcFace settings (Deng et al., 2018).
  • Training schedule: Progressive fine-tuning over increasing image resolutions and class subsets is beneficial for complex tasks (Ha et al., 2020).
  • Noise filtering: After initial training, discard samples with angular distance $> 75^\circ$ to the nearest sub-center. Optionally, drop non-dominant sub-centers and retrain on the purified data (Deng et al., 2018).
  • Hardware scaling: For million-class regimes, distribute sub-centers with center-parallel sharding (Deng et al., 2018).
  • Data augmentation: In landmark recognition, excessive augmentation can harm retrieval accuracy (Ha et al., 2020).
  • Memory trade-offs: Higher $K$ increases memory demands and may require careful balancing against batch size and class coverage.
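The noise-filtering rule above (discard samples more than 75° from the nearest sub-center of their labeled class) can be sketched as a mask over the training set. The function name and shapes here mirror the earlier notation and are illustrative:

```python
import numpy as np

def clean_sample_mask(x, W, y, max_angle_deg=75.0):
    """True for samples within max_angle_deg of the nearest sub-center
    of their labeled class; False marks likely-noisy samples to discard."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    # Best cosine to any sub-center of the ground-truth class: (B,)
    cos_best = np.einsum('bd,dnk->bnk', x, W)[np.arange(len(y)), y].max(axis=1)
    angle = np.degrees(np.arccos(np.clip(cos_best, -1.0, 1.0)))
    return angle <= max_angle_deg
```

Retraining on `x[mask]`, `y[mask]` then approximates the manually cleaned-data setting reported in the results section.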

Sub-center ArcFace can be seamlessly integrated into existing deep face recognition and descriptor learning pipelines, yielding substantial noise robustness and easy adaptation for class-imbalanced applications (Deng et al., 2018, Ha et al., 2020).
