r100 Series Models for Face Recognition
- r100 Series Models are deep CNN backbones with a 100-layer ResNet design that excel in both unmasked and masked face recognition.
- They employ bottleneck residual blocks and margin-based loss training to achieve over 99% unmasked and 90% masked verification accuracy at 0.01% FAR.
- Designed for high-throughput scenarios like civil aviation security, they balance substantial computational demands with superior recognition performance.
The r100 series designates a family of deep convolutional neural network (CNN) backbones specifically tailored for high-accuracy face recognition, particularly in operationally challenging environments such as civil aviation security where masked face prevalence is high. The r100 models are canonical 100-layer residual networks derived from the ResNet architecture, incorporating bottleneck residual blocks and margin-based classification heads. Both standard and masked-specific trained variants exist, with the latter adapted through data augmentation strategies to maximize robustness to face occlusions due to masks. The r100 series is characterized by high verification and retrieval performance at stringent false acceptance rates (FAR), and serves as a leading backbone for large-scale face recognition systems (Zhang et al., 23 Jan 2026).
1. Network Architecture: r100 Series and Masked Variants
The r100 backbone is defined as a 100-layer residual network, a deeper analogue of ResNet-50, built exclusively from standard bottleneck residual blocks. The canonical building block applies

$$y = x + W_3\,\sigma(W_2\,\sigma(W_1 x)),$$

where each $W_i$ is a convolution followed by batch normalization, and $\sigma$ is a ReLU nonlinearity. The global network follows the prototypical ResNet pattern: an initial 7×7 convolution, batch normalization, ReLU, and max pooling (the "stem"), followed by four sequential stages of bottleneck blocks, ending with global average pooling, a fully-connected layer, and softmax output. The precise allocation of blocks among the four stages and the full parameter count are not provided; the model is generically referenced as a "100-layer" ResNet (Zhang et al., 23 Jan 2026).
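As an illustration of the "weighted layer" accounting, the sketch below counts convolutions for a hypothetical stage allocation; [3, 4, 23, 3] is ResNet-101's standard split and is used here only because the source omits r100's actual allocation:

```python
def conv_layer_count(stage_blocks):
    """Count convolutional layers in a bottleneck ResNet:
    one 7x7 stem conv, plus three convs (1x1, 3x3, 1x1) per
    bottleneck block. The final FC layer is excluded."""
    return 1 + 3 * sum(stage_blocks)

# Hypothetical allocation (ResNet-101's split); the source does
# not report r100's actual per-stage block counts.
print(conv_layer_count([3, 4, 23, 3]))  # -> 100
```

With this accounting, a [3, 4, 6, 3] allocation (ResNet-50's split) gives 49 convolutions, matching the intuition that r100 roughly doubles the depth of r50.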
The r100_mask_v2 variant preserves identical architecture but is distinguished by its training data: 15% of examples in the WebFace42M source dataset are masked faces (synthetic or real). No change to layer types, residual block structure, or channel widths is reported.
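The sampling mechanism behind the 15% masked-data figure is not described in the source; one plausible sketch replaces each training example with a masked face at that probability:

```python
import random

def build_epoch(unmasked, masked_pool, masked_ratio=0.15, seed=0):
    """Hypothetical sampler: swap roughly masked_ratio of the
    epoch's examples for masked faces. The source states the 15%
    ratio but not how the mixing is performed."""
    rng = random.Random(seed)
    epoch = []
    for example in unmasked:
        if rng.random() < masked_ratio:
            epoch.append(rng.choice(masked_pool))
        else:
            epoch.append(example)
    return epoch
```

In practice the masked examples could be synthetic (mask overlays) or real, per the dataset description; the sampler above is agnostic to that distinction.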
2. Training Protocols and Hyperparameterization
Training of both r100 and r100_mask_v2 is performed on WebFace42M with 100,000 "live-ID" samples added, employing a classification-style margin loss of the combined-margin form

$$L = -\log \frac{e^{s\,(\cos(m_1\theta_y + m_2) - m_3)}}{e^{s\,(\cos(m_1\theta_y + m_2) - m_3)} + \sum_{j \neq y} e^{s\,\cos\theta_j}},$$

where $\theta_j$ is the angle between the embedding and the $j$-th class center; the specific scale $s$ and margin values are not reported. This margin configuration matches typical CosFace ($m_3 > 0$) or ArcFace ($m_2 > 0$) output heads. The training routines differ as follows:
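As a minimal sketch, an additive-margin softmax loss of the CosFace/ArcFace family can be implemented over precomputed class-center cosines; the defaults s=64 and m=0.5 are illustrative, since the source omits the actual hyperparameter values:

```python
import math

def margin_softmax_loss(cosines, target, s=64.0, m=0.5, margin="arc"):
    """Cross-entropy over scaled cosines with a margin applied to
    the target class. s and m are illustrative defaults, not the
    source's values."""
    adjusted = list(cosines)
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    if margin == "arc":          # ArcFace: cos(theta + m)
        adjusted[target] = math.cos(theta + m)
    else:                        # CosFace: cos(theta) - m
        adjusted[target] = cosines[target] - m
    logits = [s * c for c in adjusted]
    mx = max(logits)             # subtract max for numerical stability
    exps = [math.exp(z - mx) for z in logits]
    return -math.log(exps[target] / sum(exps))
```

Either margin penalizes the target logit, so the margined loss is strictly larger than the plain softmax loss on the same cosines, which is what forces tighter intra-class clustering during training.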
| Model | Init. Learning Rate | Epochs | LR Decay Type | Masked Data Ratio |
|---|---|---|---|---|
| r100 (v1) | 0.30 | — | linear decay | 0% |
| r100_mask_v2 | 0.20 | 30 | step decay | 15% |
No explicit information is available regarding batch size, optimizer, weight decay, or detailed data loading protocol. The r100_mask_v2 version (as opposed to earlier v1 or alternative v3) is recommended as the primary masked-face model.
3. Quantitative Evaluation: Verification and Retrieval Performance
The series demonstrates strong performance at standard evaluation points: verification at 0.01% FAR on 100k-pair test sets, and face search (top-n accuracy) across galleries up to 100k distractors.
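One way such an operating point is obtained, sketched under the assumption that the threshold is calibrated empirically on impostor-pair similarity scores (the source does not describe its calibration procedure):

```python
def threshold_at_far(impostor_scores, far=1e-4):
    """Pick a similarity threshold so that at most far * N impostor
    pairs score strictly above it (N = number of impostor pairs).
    Illustrative: the source reports thresholds (~0.30) but not how
    they were derived."""
    scores = sorted(impostor_scores, reverse=True)
    k = int(far * len(scores))   # false accepts allowed at this FAR
    return scores[k]
```

At FAR = 0.01% on a 100k-pair impostor set, this permits at most 10 false accepts; genuine-pair accuracy at that threshold then gives the verification numbers tabulated below.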
Verification Accuracy
| Model | Threshold (0.01% FAR) | Unmasked Acc. (%) | Masked Acc. (%) |
|---|---|---|---|
| r100 | 0.2996 | 99.11 | 80.93 |
| r100_mask_v1 | — | — | 88.98 |
| r100_mask_v2 | 0.2991 | 99.06 | 90.07 |
| r100_mask_v3 | — | — | 89.70 |
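A verification decision at the tabulated operating point reduces to a cosine-similarity comparison; the sketch below uses r100's reported unmasked threshold of 0.2996:

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def same_identity(emb_a, emb_b, threshold=0.2996):
    """Accept the pair as same-identity if similarity clears the
    operating threshold (r100's reported 0.01% FAR point)."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

The 2-D embeddings used below for illustration stand in for the high-dimensional face embeddings a real deployment would compare.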
Search (Retrieval) Performance: r100 vs. r100_mask_v2
Top-1 Accuracy (%)
| Gallery Size (×10⁴) | Unmasked r100 | Unmasked r100_mask_v2 | Masked r100_mask_v2 |
|---|---|---|---|
| 1 | 98.18 | 98.15 | 89.60 |
| 2 | 97.40 | 97.32 | 85.33 |
| 3 | 96.96 | 96.85 | 83.39 |
| 10 | 96.15 | 96.05 | 80.22 |
Top-5 Accuracy (%)
| Gallery Size (×10⁴) | Unmasked r100 | Unmasked r100_mask_v2 | Masked r100_mask_v2 |
|---|---|---|---|
| 1 | 99.81 | 99.79 | 96.55 |
| 2 | 99.59 | 99.60 | 94.06 |
| 3 | 99.44 | 99.43 | 92.54 |
| 10 | 99.18 | 99.15 | 90.30 |
No confidence intervals or statistical significance tests are provided in the source.
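The top-n search protocol above can be sketched as a ranked nearest-neighbor lookup over gallery embeddings; the identities and 2-D embeddings here are purely illustrative:

```python
def top_k_hit(probe, gallery, true_id, k=5):
    """Rank gallery entries by cosine similarity to the probe and
    report whether the true identity appears in the top k.
    gallery is a list of (identity, embedding) pairs."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)
    ranked = sorted(gallery, key=lambda g: cosine(probe, g[1]), reverse=True)
    return true_id in [gid for gid, _ in ranked[:k]]
```

Top-1 and Top-5 accuracy are then the fraction of probes for which this returns True at k=1 and k=5, averaged over the probe set; the tables above show that masked probes degrade this hit rate much faster than growing the distractor gallery does.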
4. Comparative Performance Analysis Against Other Backbones
The r100 family consistently outperforms the r50 and r34_mask_v1 backbones and achieves superior results to ViT-Tiny across all relevant metrics. At 0.01% FAR, the r100 backbone yields 99.11% unmasked accuracy, an absolute gain of 2.18 percentage points over r50, 2.09 over r34_mask_v1, and 2.88 over ViT-Tiny. For masked face verification, r100_mask_v2 reaches 90.07%, outperforming r50_mask_v2 by 5.27 points, r34_mask by 10.20, and ViT-Tiny_mask by 0.64. Vision Transformer (ViT) models show competitive Top-5 recall but incur higher computational and memory cost due to their global attention mechanisms.
5. Operational Deployment and Trade-offs
The r100 series is intended for deployment in high-throughput environments, such as airport security, where achieving >99% accuracy at low FAR is a priority and compute resource constraints can be relaxed. The series’ substantial computational and storage requirements (on the order of tens of millions of parameters) favor server-class GPU or high-performance CPU implementations with batch processing.
For edge or mobile scenarios, the deployment of r50-mask or ViT-Tiny variants is suggested as a practical trade-off, accepting a minor loss in recognition accuracy in exchange for improved inference speed and reduced memory requirements. The compute/memory demands of r100 preclude routine real-time operation on low-power platforms.
6. Recommendations and Context for Civil Aviation and Masked Face Scenarios
The r100_mask_v2 model is preferred for environments where masked faces are common, such as in post-pandemic civil aviation, as it demonstrates a 5–10% absolute gain over baseline models at the operationally relevant 0.01% FAR. The addition of 15% masked-face examples during training confers substantial robustness, without necessitating architectural modifications. ViT models may be considered where highest Top-5 recall is needed, but impose higher hardware burdens.
The series remains the default recommendation for both masked and unmasked face recognition in aviation security, balancing operational accuracy requirements against hardware availability. Where strict real-time constraints exist, practitioners can scale down to r50_mask or ViT-Tiny, accepting a modest degradation in performance (Zhang et al., 23 Jan 2026).