Sub-center ArcFace Loss
- The paper introduces a loss function that extends ArcFace by representing each class with K sub-centers, effectively capturing intra-class variations and mitigating label noise.
- It employs a max-over-K strategy for pooling cosine similarities, isolating anomalies to non-dominant sub-centers while reinforcing dominant modes with clean data.
- Experimental results demonstrate a 3–6% performance boost on noisy benchmarks and improved generalization in both face and landmark recognition tasks.
Sub-center ArcFace Loss is a generalization of the ArcFace (Additive Angular Margin Loss) used primarily in deep face recognition and related embedding learning tasks. It extends ArcFace by representing each class with sub-centers (prototypes) instead of a single center, permitting more flexible modeling of intra-class variation and conferring substantial robustness to label noise and outliers in large-scale, unconstrained datasets.
1. Mathematical Definition
Given a sample with embedding $x_i \in \mathbb{R}^d$ (ℓ₂-normalized, scaled by a constant $s$, typically $s = 64$) with ground-truth class $y_i$, and learnable sub-centers $W_{j,k} \in \mathbb{R}^d$, where $W_{j,k}$ is the $k$-th sub-center ($k = 1, \dots, K$) for class $j$ ($j = 1, \dots, N$), the Sub-center ArcFace loss is defined as follows (Deng et al., 2018, Ha et al., 2020):
- All $W_{j,k}$ are ℓ₂-normalized: $\lVert W_{j,k} \rVert = 1$.
- Cosine similarity between $x_i$ and $W_{j,k}$: $\cos\theta_{j,k} = x_i^{\top} W_{j,k}$.
- For each class, pool over sub-centers: $\cos\theta_j = \max_{k \in \{1,\dots,K\}} \cos\theta_{j,k}$.
- Compute the positive angle: $\theta_{y_i} = \arccos(\cos\theta_{y_i})$.
- Define the target logit with angular margin $m$ (typically $m = 0.5$): $s\cos(\theta_{y_i} + m)$.
- Compute cross-entropy loss: $L_i = -\log \dfrac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j \neq y_i} e^{s\cos\theta_j}}$.
For mini-batch training, this loss is averaged over all samples $i$ in the batch. The formulation strictly generalizes standard ArcFace, which is recovered by setting $K = 1$.
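As a minimal illustration of this definition, the following numpy sketch computes the per-sample loss. The function name and argument shapes are illustrative choices, not a reference implementation; production training would use a GPU framework such as PyTorch.

```python
import numpy as np

def subcenter_arcface_loss(x, W, y, s=64.0, m=0.5):
    """Per-sample Sub-center ArcFace loss (illustrative numpy sketch).

    x : (d,) embedding    W : (N, K, d) sub-centers    y : true class index
    """
    x = x / np.linalg.norm(x)                          # l2-normalize the embedding
    W = W / np.linalg.norm(W, axis=-1, keepdims=True)  # l2-normalize each sub-center
    cos = W @ x                                        # (N, K) cosine similarities
    cos = cos.max(axis=1)                              # max-over-K pooling -> (N,)
    theta_y = np.arccos(np.clip(cos[y], -1.0, 1.0))    # positive angle
    logits = s * cos                                   # negatives keep s*cos(theta_j)
    logits[y] = s * np.cos(theta_y + m)                # additive angular margin on target
    logits -= logits.max()                             # numerical stability for softmax
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y])                               # cross-entropy on the true class
```

Setting `m=0.0` removes the margin penalty, so the loss with the default margin is strictly larger for the same inputs.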
2. Geometric and Statistical Intuition
In standard ArcFace ($K = 1$), each class is restricted to a single prototype on the hypersphere, compelling all class samples—regardless of pose, lighting, or noise—to cluster about a single vector. This renders ArcFace susceptible to degradation in the presence of outliers and noisy labels, which can distort the class prototype.
With $K$ sub-centers, each class forms up to $K$ distinct “modes” on the unit hypersphere. Clean, frontal, or canonical examples self-organize around the main (dominant) sub-center, while atypical, hard, or mislabeled instances attach to secondary (non-dominant) sub-centers. Each sample is assigned to its closest sub-center via the $\max_k$ pooling, isolating anomalies from the dominant mode and preserving intra-class compactness within meaningful sub-clusters (Deng et al., 2018).
This mechanism naturally encourages self-organizing cluster separation: the dominant sub-center accrues clean data, non-dominant sub-centers attract residual variation (e.g., pose, occlusion), and the angular margin maintains local angular discriminability. In effect, most network gradient updates from clean samples reinforce the dominant mode, while gradients from outliers are sequestered to non-dominant modes, mitigating distortion of embeddings.
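A toy example of this assignment mechanism, with invented 2-D vectors standing in for embeddings: a sample near the dominant mode and an atypical sample each attach to the sub-center of maximal cosine similarity.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two sub-centers for one class: a dominant mode and a secondary mode.
subcenters = normalize(np.array([[1.0, 0.0],    # dominant (clean/canonical mode)
                                 [0.0, 1.0]]))  # non-dominant (hard/noisy mode)

clean   = normalize(np.array([0.95, 0.10]))     # close to the dominant mode
outlier = normalize(np.array([0.15, 0.90]))     # atypical / mislabeled-looking

# Each sample attaches to its closest sub-center via argmax of cosine similarity.
assign_clean   = int(np.argmax(subcenters @ clean))    # -> 0 (dominant)
assign_outlier = int(np.argmax(subcenters @ outlier))  # -> 1 (non-dominant)
```

Gradients from the outlier therefore only update the non-dominant sub-center, leaving the dominant prototype undistorted.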
3. Algorithmic Implementation
A typical training pipeline for Sub-center ArcFace comprises the following steps (Deng et al., 2018, Ha et al., 2020):
- Produce feature embeddings $x_i$ with the backbone network; ℓ₂-normalize: $x_i \leftarrow x_i / \lVert x_i \rVert$, with logits scaled by $s$.
- ℓ₂-normalize all sub-centers: $W_{j,k} \leftarrow W_{j,k} / \lVert W_{j,k} \rVert$.
- For each input $x_i$, compute $\cos\theta_{j,k} = x_i^{\top} W_{j,k}$ for all $j, k$.
- For each class $j$, pool $\cos\theta_j = \max_k \cos\theta_{j,k}$.
- Compute the positive logit for class $y_i$ using the angular margin: $s\cos(\theta_{y_i} + m)$; negative classes receive $s\cos\theta_j$.
- Apply softmax and cross-entropy to compute the loss for $x_i$.
- Back-propagate the loss, update the backbone parameters and $W$ via SGD/Adam, and re-normalize $W$.
For scalable training (millions of classes), a center-parallel strategy can distribute the sub-center tensor across GPUs. In frameworks such as PyTorch or TensorFlow, replace the final weight tensor $W \in \mathbb{R}^{d \times N}$ with $W \in \mathbb{R}^{d \times N \times K}$, inserting a max-over-$K$ pooling step prior to logit computation.
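The batched forward pass described above can be sketched as follows; the function name and tensor layout are assumptions for illustration, and the returned logits would be fed to a standard softmax cross-entropy.

```python
import numpy as np

def subcenter_logits(X, W, labels, s=64.0, m=0.5):
    """Batched Sub-center ArcFace logits (illustrative numpy sketch).

    X      : (B, d) batch of embeddings
    W      : (d, N, K) weight tensor -- ArcFace's (d, N) with an extra sub-center axis
    labels : (B,) ground-truth class indices
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # l2-normalize embeddings
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # l2-normalize each sub-center
    cos = np.einsum('bd,dnk->bnk', X, W)               # (B, N, K) cosine similarities
    cos = cos.max(axis=2)                              # max-over-K pooling -> (B, N)
    rows = np.arange(len(X))
    theta = np.arccos(np.clip(cos[rows, labels], -1.0, 1.0))
    logits = s * cos                                   # negatives: s*cos(theta_j)
    logits[rows, labels] = s * np.cos(theta + m)       # targets: s*cos(theta_y + m)
    return logits
```

The margin only lowers the target-class logit; all negative-class logits are identical to those of margin-free softmax.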
4. Selection and Effects of Sub-center Count
| $K$ | Typical Regime | Effect on Learning |
|---|---|---|
| 1 | Low-noise, small-scale | Reduces to ArcFace (single center) |
| 2–5 | Medium-to-high noise | Enhances robustness, preserves compactness |
| ≥10 | Large data, low utility | Sub-centers sparsely used, performance degrades |
Experiments advocate $K = 3$ as a practical default on noisy data with substantial label noise (Deng et al., 2018, Ha et al., 2020). For web-crawled datasets or massive class counts, $K = 3$–$5$ yields the best trade-off between noise isolation and discriminative power. $K \geq 10$ is discouraged due to weakened intra-class margins and sparse assignment, while for clean, small-scale regimes $K = 1$ or $2$ is sufficient. Cross-validation for $K$ on a representative validation subset is recommended.
5. Empirical and Comparative Results
Sub-center ArcFace consistently outperforms standard ArcFace ($K = 1$) in both face and landmark recognition under noisy or imbalanced conditions. Reported improvements, using verification TPR at a fixed FPR on IJB-C and retrieval GAP for landmarks, are tabulated below.
| Task/Dataset | Baseline (K=1) | Sub-center ArcFace | Post-Drop/Filtering |
|---|---|---|---|
| MS1MV0 (noisy), face (IJB-C) | 90.27% | 93.72% (+3.45) | 95.92% (+5.65) |
| MS1MV3 (clean), face | 96.50% | — | — |
| Celeb500K (50% noise) | 92.15% | 96.91% | — |
| Google Landmark, val GAP | ∼0.84 | ∼0.85 | — |
| Google Landmark, dynamic margin | — | 0.8671 | — |
These results demonstrate that Sub-center ArcFace nearly recovers or exceeds the performance of training on manually cleaned data, with a ≈3–6% boost on standard benchmarks and robust generalization under label noise (Deng et al., 2018, Ha et al., 2020). On the highly imbalanced GLDv2 dataset, Sub-center ArcFace with a dynamic margin yielded a +0.026 validation GAP over constant-margin ArcFace.
6. Extensions: Dynamic Margin Schedules
To address extreme class imbalance (e.g., long-tail distributions in landmark recognition), a “dynamic margin” strategy modulates the angular margin per class according to its sample count $n_j$, via $m_j = a \cdot n_j^{-\lambda} + b$ (clipped to $[m_{\min}, m_{\max}]$). The hyperparameters $a$ and $b$ are set so that the most frequent classes receive the smallest margin $m_{\min}$ and the rarest classes the largest margin $m_{\max}$, with $\lambda$ and the margin range tuned on a validation set for large-scale, imbalanced datasets (Ha et al., 2020). This approach improves generalization, notably in tail classes.
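A sketch of such a schedule, assuming the power-law form above; the decay exponent and clipping bounds here are illustrative assumptions, not prescribed values.

```python
import numpy as np

def dynamic_margins(class_counts, lam=0.25, m_min=0.05, m_max=0.5):
    """Per-class margins m_j = a * n_j**(-lam) + b, clipped to [m_min, m_max].

    lam, m_min, m_max are illustrative assumptions; a and b are solved so the
    largest class receives m_min and the smallest class receives m_max.
    """
    n = np.asarray(class_counts, dtype=float)
    f = n ** (-lam)                             # power-law decay in class size
    a = (m_max - m_min) / (f.max() - f.min())   # scale to span the margin range
    b = m_min - a * f.min()                     # offset so the largest class hits m_min
    return np.clip(a * f + b, m_min, m_max)
```

Rare classes thereby receive the strongest angular push away from their neighbors, which is where generalization in the tail is won.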
7. Practical Considerations and Best Practices
- Initialization: Sub-centers can be initialized using standard Xavier/He initialization; no extra cluster collapse or orthogonality regularization is required, as diversity emerges during learning (Ha et al., 2020).
- Embedding normalization and scaling: Set $s = 64$ and $m = 0.5$ by default, in line with ArcFace settings (Deng et al., 2018).
- Training schedule: Progressive fine-tuning over increasing image resolutions and class subsets is beneficial for complex tasks (Ha et al., 2020).
- Noise filtering: After initial training, discard samples whose angular distance to the nearest sub-center of their class exceeds a fixed threshold. Optionally, drop non-dominant sub-centers and retrain on the purified data (Deng et al., 2018).
- Hardware scaling: For million-class regimes, distribute sub-centers with center-parallel sharding (Deng et al., 2018).
- Data augmentation: In landmark recognition, excessive augmentation can harm retrieval accuracy (Ha et al., 2020).
- Memory trade-offs: Higher increases memory demands and may require careful balancing against batch size and class coverage.
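The noise-filtering step listed above might be sketched as follows; the function name and the 75° threshold are illustrative assumptions rather than prescribed settings.

```python
import numpy as np

def filter_noisy_samples(X, W_class, max_angle_deg=75.0):
    """Keep samples whose angle to the nearest sub-center of their class is small.

    X       : (B, d) embeddings of samples all labeled with one class
    W_class : (K, d) that class's sub-centers
    The threshold is an illustrative choice, not a prescribed value.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wc = W_class / np.linalg.norm(W_class, axis=1, keepdims=True)
    cos_near = (X @ Wc.T).max(axis=1)                 # cosine to nearest sub-center
    angles = np.degrees(np.arccos(np.clip(cos_near, -1.0, 1.0)))
    return angles <= max_angle_deg                    # boolean keep-mask
```

The resulting mask selects the purified subset for retraining.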
Sub-center ArcFace can be seamlessly integrated into existing deep face recognition and descriptor learning pipelines, yielding substantial noise robustness and easy adaptation for class-imbalanced applications (Deng et al., 2018, Ha et al., 2020).