
Sub-center ArcFace Loss

Updated 27 January 2026
  • The paper introduces a loss function that extends ArcFace by representing each class with K sub-centers, effectively capturing intra-class variations and mitigating label noise.
  • It employs a max-over-K strategy for pooling cosine similarities, isolating anomalies to non-dominant sub-centers while reinforcing dominant modes with clean data.
  • Experimental results demonstrate a 3–6% performance boost on noisy benchmarks and improved generalization in both face and landmark recognition tasks.

Sub-center ArcFace Loss is a generalization of ArcFace (Additive Angular Margin Loss), used primarily in deep face recognition and related embedding-learning tasks. It extends ArcFace by representing each class with $K$ sub-centers (prototypes) instead of a single center, permitting more flexible modeling of intra-class variation and conferring substantial robustness to label noise and outliers in large-scale, unconstrained datasets.

1. Mathematical Definition

Given a sample $x_i \in \mathbb{R}^d$ (ℓ₂-normalized, scaled by a constant $s > 0$, typically $s = 64$) with ground-truth class $y_i \in \{1, \ldots, N\}$, and learnable sub-centers $W \in \mathbb{R}^{d \times N \times K}$, where $W_{j,k}$ is the $k$-th sub-center for class $j$, the Sub-center ArcFace loss is defined as follows (Deng et al., 2018, Ha et al., 2020):

  • All $W_{j,k}$ are ℓ₂-normalized: $\|W_{j,k}\| = 1$.
  • Cosine similarity between $x_i$ and $W_{j,k}$:

$$S_{j,k}(x_i) = W_{j,k}^\top x_i$$

  • For each class, pool over sub-centers:

$$S_j(x_i) = \max_{1 \leq k \leq K} S_{j,k}(x_i)$$

  • Compute the positive angle:

$$\theta_{y_i} = \arccos S_{y_i}(x_i)$$

  • Define the target logit with angular margin $m > 0$:

$$\text{logit}_j(x_i) = \begin{cases} s \cdot \cos(\theta_{y_i} + m), & \text{if } j = y_i \\ s \cdot S_j(x_i), & \text{otherwise} \end{cases}$$

  • Compute cross-entropy loss:

$$L_i = -\log \frac{\exp[s \cdot \cos(\theta_{y_i} + m)]}{\exp[s \cdot \cos(\theta_{y_i} + m)] + \sum_{j \neq y_i} \exp[s \cdot S_j(x_i)]}$$

For mini-batch training, this loss is averaged over all $i$. The formulation strictly generalizes standard ArcFace, which is recovered by setting $K = 1$.
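The definition above can be sketched compactly in NumPy. This is a minimal, framework-agnostic illustration of the formulas, not the authors' reference implementation; the function and argument names are chosen here for clarity. For simplicity the embeddings are kept at unit norm and the scale $s$ is applied to the logits, which is equivalent to the scaled-embedding formulation.

```python
import numpy as np

def subcenter_arcface_loss(x, W, y, s=64.0, m=0.5):
    """Sub-center ArcFace loss over a mini-batch.

    x: (B, d) raw embeddings; W: (d, N, K) sub-centers; y: (B,) labels.
    """
    B = len(y)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)   # l2-normalize embeddings
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # l2-normalize each sub-center
    # Cosine similarity to every sub-center, then max-over-K pooling: (B, N)
    S = np.einsum('bd,dnk->bnk', x, W).max(axis=2)
    # Additive angular margin on the ground-truth logit
    theta = np.arccos(np.clip(S[np.arange(B), y], -1 + 1e-7, 1 - 1e-7))
    logits = s * S
    logits[np.arange(B), y] = s * np.cos(theta + m)
    # Numerically stable softmax cross-entropy, averaged over the batch
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(B), y].mean()
```

Passing a weight tensor of shape $(d, N, 1)$ makes the max-over-$K$ a no-op and recovers standard ArcFace.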

2. Geometric and Statistical Intuition

In standard ArcFace ($K = 1$), each class is restricted to a single prototype on the hypersphere, compelling all class samples—regardless of pose, lighting, or noise—to cluster about a single vector. This renders ArcFace susceptible to degradation in the presence of outliers and noisy labels, which can distort the class prototype.

With $K > 1$ sub-centers, each class forms up to $K$ distinct “modes” on the unit hypersphere. Clean, frontal, or canonical examples self-organize around the main (dominant) sub-center, while atypical, hard, or mislabeled instances attach to secondary (non-dominant) sub-centers. Each sample is assigned to its closest sub-center via $\max_k S_{j,k}$, isolating anomalies from the dominant mode and preserving intra-class compactness within meaningful sub-clusters (Deng et al., 2018).

This mechanism naturally encourages self-organizing cluster separation: the dominant sub-center accrues clean data, non-dominant sub-centers attract residual variation (e.g., pose, occlusion), and the angular margin $m$ maintains local angular discriminability. In effect, most network gradient updates from clean samples reinforce the dominant mode, while gradients from outliers are sequestered to non-dominant modes, mitigating distortion of embeddings.

3. Algorithmic Implementation

A typical training pipeline for Sub-center ArcFace comprises the following steps (Deng et al., 2018, Ha et al., 2020):

  1. Produce feature embeddings $z_i = f(\mathrm{image}_i; \theta)$; normalize and scale: $x_i = s \cdot z_i / \|z_i\|$.
  2. ℓ₂-normalize all sub-centers: $\widehat{W}_{j,k} = W_{j,k} / \|W_{j,k}\|$.
  3. For each input $x_i$, compute $S_{j,k}^{(i)} = \widehat{W}_{j,k}^\top x_i / s$ for all $j, k$.
  4. For each class $j$, pool $S_j^{(i)} = \max_k S_{j,k}^{(i)}$.
  5. Compute the positive logit for class $y_i$ using the angular margin: $s \cdot \cos(\arccos S_{y_i}^{(i)} + m)$; negatives receive $s \cdot S_j^{(i)}$.
  6. Apply softmax and cross-entropy to compute the loss for $x_i$.
  7. Back-propagate the loss, update $\theta$ and $W$ via SGD/Adam, and re-normalize $W_{j,k}$.

For scalable training (millions of classes), a center-parallel strategy can distribute $W$ across GPUs. In frameworks such as PyTorch or TensorFlow, replace the final weight tensor $W \in \mathbb{R}^{d \times N}$ with $W \in \mathbb{R}^{d \times N \times K}$ and insert a max-over-$K$ pooling step before logit computation.
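The weight-tensor replacement described above amounts to a small drop-in classification head. The sketch below uses NumPy for brevity (class and method names are illustrative; in PyTorch or TensorFlow the same shapes and pooling apply, with autograd handling the backward pass):

```python
import numpy as np

class SubCenterHead:
    """Replaces a (d, N) classifier weight with (d, N, K) sub-centers."""

    def __init__(self, d, n_classes, k, seed=0):
        rng = np.random.default_rng(seed)
        # Xavier-style initialization; sub-center diversity emerges during training
        self.W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, n_classes, k))

    def logits(self, x, s=64.0):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        W = self.W / np.linalg.norm(self.W, axis=0, keepdims=True)
        # (B, N, K) cosines -> max over K -> scaled (B, N) logits
        return s * np.einsum('bd,dnk->bnk', x, W).max(axis=2)
```

Because both factors are unit-normalized, every pooled logit lies in $[-s, s]$, exactly as in the single-center head.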

4. Selection and Effects of Sub-center Count $K$

| $K$ value | Typical regime | Effect on learning |
|---|---|---|
| 1 | Low-noise, small-scale | Reduces to ArcFace (single center) |
| 2–5 | Medium-to-high noise | Enhances robustness, preserves compactness |
| ≥10 | Large data, low utility | Sub-centers sparsely used, performance degrades |

Experiments advocate $K = 3$ as a practical default on noisy data (label noise $\gtrsim 50\%$) (Deng et al., 2018, Ha et al., 2020). For web-crawled datasets or massive class counts, $K = 3$–$5$ yields the best trade-off between noise isolation and discriminative power. $K \geq 10$ is discouraged due to weakened intra-class margins and sparse sub-center assignment, while for clean, small-scale regimes $K = 1$ or $2$ is sufficient. Cross-validation over $K \in \{2, 3, 5\}$ on a representative validation subset is recommended.
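The recommended cross-validation reduces to a one-pass sweep. In this sketch, `train_and_eval` is a hypothetical callable mapping a candidate $K$ to a validation metric (higher is better); it stands in for your own training pipeline:

```python
def select_k(candidates, train_and_eval):
    """Train/evaluate once per candidate K; return the best K and all scores."""
    scores = {k: train_and_eval(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

For example, `select_k([2, 3, 5], train_and_eval)` runs three trainings and picks the $K$ with the highest validation score.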

5. Empirical and Comparative Results

Sub-center ArcFace consistently outperforms standard ArcFace ($K = 1$) in both face and landmark recognition under noisy or imbalanced conditions. Reported improvements, using verification TPR@FPR=$10^{-4}$ on IJB-C and retrieval GAP for landmarks, are tabulated below.

| Task/Dataset | Baseline (K=1) | Sub-center ArcFace | Post-Drop/Filtering |
|---|---|---|---|
| MS1MV0 (noisy), face (IJB-C) | 90.27% | 93.72% (+3.45) | 95.92% (+5.65) |
| MS1MV3 (clean), face | | 96.50% | |
| Celeb500K (50% noise) | 92.15% | 96.91% | |
| Google Landmark, val GAP | ∼0.84 | ∼0.85 | |
| Google Landmark, dynamic m | | 0.8671 | |

These results demonstrate that sub-center ArcFace nearly recovers or exceeds the performance of manually cleaned training, with a ≈3–6% boost on standard benchmarks and robust generalization under label noise (Deng et al., 2018, Ha et al., 2020). On the highly imbalanced GLDv2 dataset, sub-center ArcFace with $K = 3$ and dynamic margin yielded a +0.026 validation GAP over constant-margin ArcFace.

6. Extensions: Dynamic Margin Schedules

To address extreme class imbalance (e.g., long-tail distributions in landmark recognition), a “dynamic margin” strategy modulates the angular margin $m$ per class according to the class sample count $n$, via $m(n) = a n^{-\lambda} + b$ (clipped to $[m_\text{min}, m_\text{max}]$). The hyperparameters $(a, b, \lambda)$ are set so that $m(n_\text{min}) = m_\text{max}$ and $m(n_\text{max}) = m_\text{min}$, with recommended $\lambda = 1/4$ and $m \in [0.05, 0.50]$ for large-scale, imbalanced datasets (Ha et al., 2020). This approach improves generalization, notably for tail classes.
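Given the two boundary conditions, $(a, b)$ have a closed form. A small sketch (the function name is assumed here for illustration):

```python
import numpy as np

def dynamic_margin(n, n_min, n_max, m_min=0.05, m_max=0.50, lam=0.25):
    """Per-class margin m(n) = a * n**(-lam) + b, with (a, b) solved so that
    m(n_min) = m_max and m(n_max) = m_min, then clipped to [m_min, m_max]."""
    a = (m_max - m_min) / (n_min ** -lam - n_max ** -lam)
    b = m_max - a * n_min ** -lam
    return float(np.clip(a * n ** -lam + b, m_min, m_max))
```

Rare tail classes (small $n$) receive the largest margin, while frequent head classes are pushed toward $m_\text{min}$, so hard classes get stronger angular separation without over-penalizing well-populated ones.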

7. Practical Considerations and Best Practices

  • Initialization: Sub-centers can be initialized using standard Xavier/He initialization; no extra cluster-collapse or orthogonality regularization is required, as diversity emerges during learning (Ha et al., 2020).
  • Embedding normalization and scaling: Set $s = 64$, $m = 0.5$ by default, in line with ArcFace settings (Deng et al., 2018).
  • Training schedule: Progressive fine-tuning over increasing image resolutions and class subsets is beneficial for complex tasks (Ha et al., 2020).
  • Noise filtering: After initial training, discard samples with angular distance $> 75^\circ$ to the nearest sub-center. Optionally, drop non-dominant sub-centers and retrain on the purified data (Deng et al., 2018).
  • Hardware scaling: For million-class regimes, distribute sub-centers with center-parallel sharding (Deng et al., 2018).
  • Data augmentation: In landmark recognition, excessive augmentation can harm retrieval accuracy (Ha et al., 2020).
  • Memory trade-offs: Higher $K$ increases memory demands and may require careful balancing against batch size and class coverage.
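The noise-filtering rule above (discard samples more than 75° from the nearest sub-center of their labeled class) can be sketched as a mask over the training set. The function name and shapes here mirror the earlier notation and are illustrative:

```python
import numpy as np

def clean_sample_mask(x, W, y, max_angle_deg=75.0):
    """True for samples within max_angle_deg of the nearest sub-center
    of their labeled class; False marks likely-noisy samples to discard."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    # Best cosine to any sub-center of the ground-truth class: (B,)
    cos_best = np.einsum('bd,dnk->bnk', x, W)[np.arange(len(y)), y].max(axis=1)
    angle = np.degrees(np.arccos(np.clip(cos_best, -1.0, 1.0)))
    return angle <= max_angle_deg
```

Retraining on `x[mask]`, `y[mask]` then approximates the manually cleaned-data setting reported in the results section.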

Sub-center ArcFace can be seamlessly integrated into existing deep face recognition and descriptor learning pipelines, yielding substantial noise robustness and easy adaptation for class-imbalanced applications (Deng et al., 2018, Ha et al., 2020).
