r100_mask_v2: CTTA & Masked Face Recognition

Updated 30 January 2026
  • r100_mask_v2 is a framework featuring continual test-time adaptation via spatial patch masking on CIFAR100C and a ResNet-100 backbone for masked face recognition.
  • It employs dual loss strategies, using mask consistency and entropy minimization for CTTA and an additive cosine margin softmax loss for handling occlusions in face recognition.
  • Empirical results demonstrate state-of-the-art performance with significant accuracy improvements over baselines in both severe corruption and high-occlusion scenarios.

r100_mask_v2 denotes two distinct but prominent deep learning variants within vision research: (1) a continual test-time adaptation (CTTA) method instantiated on CIFAR100C (severity 5) under the Mask to Adapt (M2A) framework, and (2) a dedicated ResNet-100 backbone trained for robust masked face recognition under high-occlusion conditions. Both leverage state-of-the-art backbone architectures and masking strategies, but differ markedly in their adaptation paradigms, loss constructions, and targeted evaluation domains. Their empirical superiority among tested baselines for their respective tasks makes r100_mask_v2 a critical reference point for test-time adaptation and identity recognition under occlusion.

1. Network and Adaptation Architecture

For CTTA (M2A), r100_mask_v2 applies spatial patch-based random masking to inputs of a fixed backbone (typically a ResNet variant pretrained on source distribution). The masking is performed at test time and is not integrated into the backbone design itself.

In masked face verification, r100_mask_v2 employs a standard ResNet-100 backbone. The architecture consists of:

  • A 7×7 convolution stem
  • Four stages of “bottleneck” residual blocks, totaling 33 blocks
  • Global average pooling followed by a 512-dimensional embedding
  • Each residual block implements:

$y = \mathrm{ReLU}(F(x_l; W_l) + x_l)$

where $F(x)$ is a stacked sequence of 1×1 (C→C/4), 3×3 (C/4→C/4), and 1×1 (C/4→C) convolutions with intermediate ReLU activations.

  • The embedding is projected via a fully connected layer and $\ell_2$-normalized.

The face backbone's "mask-specific" modifications consist solely of training-time mask augmentation (random overlays on input pixels), not structural changes to the network itself (Zhang et al., 23 Jan 2026).
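
As a concrete illustration, the residual computation above can be sketched in NumPy. This is a simplified sketch: the weight shapes are arbitrary, and batch normalization, strides, and projection shortcuts of the real backbone are omitted for clarity.

```python
import numpy as np

def conv1x1(x, w):
    # Pointwise convolution: x is (C_in, H, W), w is (C_out, C_in).
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    # 3x3 convolution with zero padding 1: x is (C_in, H, W), w is (C_out, C_in, 3, 3).
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def bottleneck(x, w_reduce, w_spatial, w_expand):
    # y = ReLU(F(x) + x) with F = 1x1 reduce -> 3x3 -> 1x1 expand.
    relu = lambda z: np.maximum(z, 0.0)
    f = relu(conv1x1(x, w_reduce))    # C   -> C/4
    f = relu(conv3x3(f, w_spatial))   # C/4 -> C/4
    f = conv1x1(f, w_expand)          # C/4 -> C
    return relu(f + x)                # identity shortcut + final ReLU
```

Because $F$ returns to the input channel count, the identity shortcut adds without any projection, which is what makes the block's output shape equal its input shape.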

2. Masking Schemes and Schedule

In the M2A CTTA paradigm (Doloriel, 8 Dec 2025), r100_mask_v2 utilizes:

  • Spatial Patch Masking: The $H \times W$ input grid is divided into axis-aligned square patches. For each view $t$, $P_t$ patches are sampled to mask a fraction $m_t$ of all pixels.
  • Masking schedule: For $n = 3$ views and fixed step $\alpha = 0.1$, the sequence of masked fractions is $m = (0.0, 0.1, 0.2)$, ensuring progressive corruption from the unmasked anchor.
  • Mask application: At each view, a binary mask $M^{(t)}$ is sampled; the masked input is $x^{(t)} = x \odot (1 - \mathrm{broadcast}_C(M^{(t)}))$.

Frequency masking and alternative spatial strategies perform worse: patch-based masking achieves a mean error of 19.8%, versus 23.5% for pixel-wise masking and 34.4% for frequency masking.
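
The patch masking and progressive schedule above can be sketched in NumPy as follows; the patch size of 4 and the divisible-grid assumption are illustrative choices, not values from the paper.

```python
import numpy as np

def spatial_patch_mask(x, mask_frac, patch=4, rng=None):
    """Zero out a fraction mask_frac of the pixels of x (C, H, W) by
    sampling random axis-aligned square patches; returns the masked input
    x * (1 - broadcast(M)) and the pixel-level binary mask M (H, W)."""
    rng = rng or np.random.default_rng()
    _, h, w = x.shape
    gh, gw = h // patch, w // patch               # patch grid dimensions
    n_patches = gh * gw
    n_masked = int(round(mask_frac * n_patches))  # patches to mask this view
    m = np.zeros((gh, gw))
    idx = rng.choice(n_patches, size=n_masked, replace=False)
    m.flat[idx] = 1.0
    mask = np.kron(m, np.ones((patch, patch)))    # upsample grid to pixel mask
    return x * (1.0 - mask[None]), mask

# Progressive schedule m = (0.0, 0.1, 0.2): view 0 is the unmasked anchor.
views = [spatial_patch_mask(np.ones((3, 32, 32)), f, rng=np.random.default_rng(0))[0]
         for f in (0.0, 0.1, 0.2)]
```

Each view draws its own mask, so later views are not supersets of earlier ones; only the masked fraction follows the fixed schedule.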

In robust face recognition, mask augmentation is applied by overlaying synthetic mask regions to 15% of the images per epoch during training, with no specification of mask shape or type (Zhang et al., 23 Jan 2026).
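
Since the overlay shape and type are unspecified, the following is a purely hypothetical sketch of such a training-time augmentation, using a flat rectangle over the lower face region of each selected image.

```python
import numpy as np

def maybe_mask_augment(img, p=0.15, rng=None):
    """With probability p, overlay a synthetic flat-colored 'mask' rectangle
    on the lower part of a (H, W, 3) uint8 face crop. The placement and
    flat-color overlay are illustrative assumptions, not the paper's recipe."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    if rng.random() < p:
        h, _, _ = img.shape
        top = rng.integers(h // 2, int(0.7 * h))  # start below mid-face
        color = rng.integers(0, 256, size=3)      # random flat overlay color
        out[top:, :, :] = color                   # cover mouth/chin region
    return out
```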

3. Loss Functions and Optimization Objectives

For r100_mask_v2 (M2A), two key losses are imposed:

  • Mask Consistency Loss ($L_{\mathrm{cons}}$):

$L_{\mathrm{cons}} = \sum_{t=1}^{n-1} H(p^{(t)}, \mathrm{sg}(p^{(0)})) + \sum_{1 \leq r < t \leq n-1} H(p^{(t)}, \mathrm{sg}(p^{(r)}))$

where $p^{(t)}$ is the softmax prediction for masked view $t$, $p^{(0)}$ is the anchor (unmasked) prediction, $H(p, q)$ is cross-entropy, and $\mathrm{sg}$ denotes stop-gradient.

  • Entropy Minimization Loss ($L_{\mathrm{ent}}$):

$L_{\mathrm{ent}} = \frac{1}{n} \sum_{t=0}^{n-1} H(p^{(t)})$

which penalizes output entropy to drive prediction confidence.

The total objective is $L_{\mathrm{total}} = L_{\mathrm{cons}} + \lambda L_{\mathrm{ent}}$, with $\lambda = 1.0$ for CIFAR100C and related benchmarks.
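
The two losses and their combination can be sketched numerically as follows. This is a NumPy illustration of the loss values only; in practice they are computed in an autodiff framework, where $\mathrm{sg}$ actually blocks gradients (here it is a no-op).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_k p_k log q_k, averaged over the batch.
    return float(-(p * np.log(q + eps)).sum(axis=-1).mean())

def m2a_losses(logits_per_view, lam=1.0):
    """logits_per_view: list of (B, K) logit arrays, view 0 = unmasked anchor.
    Returns L_total = L_cons + lam * L_ent as a scalar."""
    probs = [softmax(z) for z in logits_per_view]
    n = len(probs)
    l_cons = 0.0
    for t in range(1, n):
        l_cons += cross_entropy(probs[t], probs[0])      # vs. the anchor
        for r in range(1, t):
            l_cons += cross_entropy(probs[t], probs[r])  # vs. earlier masked views
    l_ent = sum(cross_entropy(p, p) for p in probs) / n  # H(p) = H(p, p)
    return l_cons + lam * l_ent
```

For $n = 3$, $L_{\mathrm{cons}}$ has three terms (views 1 and 2 against the anchor, plus view 2 against view 1), matching the double sum above.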

In face recognition, r100_mask_v2 uses an additive cosine margin softmax (CosFace) loss:

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s\,[\cos(\theta_{y_i}) - m])}{\exp(s\,[\cos(\theta_{y_i}) - m]) + \sum_{j \neq y_i} \exp(s\,\cos(\theta_j))}$

with margin $m = 0.4$, a scale $s$ left implicit (e.g., $s = 64$), and no multiplicative margin (Zhang et al., 23 Jan 2026).
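
A NumPy sketch of this loss; the scale `s = 64` is an assumed default, since the paper leaves it implicit.

```python
import numpy as np

def cosface_loss(cos_theta, labels, s=64.0, m=0.4):
    """Additive cosine margin softmax (CosFace). cos_theta: (N, K) cosines
    between l2-normalized embeddings and class weights; labels: (N,) ints.
    The scale s = 64 is an assumed value, not specified in the source."""
    n = len(labels)
    rows = np.arange(n)
    logits = s * cos_theta.copy()
    logits[rows, labels] = s * (cos_theta[rows, labels] - m)  # subtract margin on target
    # Negative log-softmax of the margin-penalized target logit (stable logsumexp).
    mx = logits.max(axis=1)
    log_z = np.log(np.exp(logits - mx[:, None]).sum(axis=1)) + mx
    return float(np.mean(log_z - logits[rows, labels]))
```

Because the margin is subtracted from the cosine before scaling, the target class must beat every other class by $m$ in cosine space to achieve low loss, which tightens intra-class clustering of the embeddings.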

4. Training Details and Evaluation Protocols

CTTA (M2A) on CIFAR100C:

  • Pretrained source model adapted on test stream.
  • Adam optimizer; learning rate $1 \times 10^{-3}$; batch size 20.
  • Masking: grid-aligned spatial patches (default).
  • Each batch processes $n = 3$ masked views per sample (one optimizer step per batch).
  • Data preloaded, streaming in fixed order.
  • Performance metric: mean error on severity 5 CIFAR100C corruptions.

Masked Face Recognition:

  • Dataset: Webface42M (ca. 42M images), with 15% mask augmentation; additional 100,000 live ID samples.
  • Preprocessing: resize (112×112), random horizontal flip, pixel normalization.
  • SGD (momentum 0.9, weight decay $5 \times 10^{-4}$), LR = 0.20, 30 epochs.
  • Face comparison: 100,000 genuine + 100,000 impostor pairs per masked/unmasked, threshold selected at FAR = 0.01% for accuracy reporting.
  • Face search: top-1/top-5 retrieval at gallery sizes up to 100,000.
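
The threshold-at-FAR protocol above can be sketched as follows; `verification_accuracy` is a hypothetical helper, and the scores are assumed to be cosine similarities between embedding pairs.

```python
import numpy as np

def verification_accuracy(genuine, impostor, far=1e-4):
    """Pick the score threshold at which a fraction `far` of impostor pairs
    is accepted (FAR = 0.01% -> far = 1e-4), then report the genuine-pair
    accept rate at that threshold. A sketch of the evaluation protocol."""
    impostor = np.sort(impostor)
    # Threshold above which exactly ~far of the impostor scores fall.
    thr = impostor[int(np.ceil((1.0 - far) * len(impostor))) - 1]
    return float((genuine > thr).mean()), float(thr)
```

With 100,000 impostor pairs, FAR = 0.01% corresponds to accepting the 10 highest-scoring impostor pairs, so the threshold estimate is sensitive to the tail of the impostor distribution.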

5. Empirical Results and Comparative Analysis

CTTA (M2A, r100_mask_v2, spatial-patch):

  • CIFAR100C (severity 5): mean error 19.8%, outperforming all compared CTTA methods.
  • Frequency masking underperforms: e.g., all-frequency 34.4% error, low/high-frequency >35%.
  • Loss ablation: both mask consistency and entropy losses are necessary—removal of either causes catastrophic failure (95.8% and 98.3% error, respectively).
  • Supervised upper bound: 16.1% error (Doloriel, 8 Dec 2025).

Masked Face Recognition (r100_mask_v2):

  • Unmasked test (FAR = 0.01%): 99.06% accuracy (baseline r100 = 99.11%).
  • Masked test (FAR = 0.01%): 90.07% accuracy (vs. r100_mask_v1: 88.98%, r50_mask_v3: 85.00%, ViT-Tiny-mask: 89.43%).
  • Masked face search (gallery = 10,000): Top-1 = 89.60%, Top-5 = 96.55%.
  • The combination of 30 epochs and LR=0.20 achieves the best trade-off; lower initial LR and increased epochs improve masked recognition performance.

Summary Table: Key Results

| Task | Variant | Masked Acc./Error (%) | Baseline Comparison |
|---|---|---|---|
| CTTA, CIFAR100C-5 | r100_mask_v2 | 19.8 (mean error) | Next best: REM 23.4, SAR 26.2 |
| Face Verification | r100_mask_v2 | 90.07 (masked acc.) | r100_mask_v1 88.98, r50 85.00 |
| Face Verification | r100_mask_v2 | 99.06 (unmasked) | r100 99.11 |

Both instantiations yield state-of-the-art performance for their respective settings.

6. Ablations and Implementation Considerations

CTTA Ablations (Doloriel, 8 Dec 2025):

  • Spatial patch masking > spatial pixel > frequency masking (mean errors: 19.8%, ~23.5%, ≥34.4%).
  • Mask-view count $n = 3$ is optimal.
  • Mask-step $\alpha = 0.1$ or $0.2$ is optimal; $\alpha = 0.3$ degrades performance.

Masked Face Recognition Ablations (Zhang et al., 23 Jan 2026):

  • LR/epoch schedule: Intermediate values (LR=0.20, 30 ep) optimal. Higher mask proportion in training could further improve masked performance but may harm unmasked accuracy.
  • The margin ($m = 0.4$) follows CosFace best practice and is held constant.

Implementation tips for CTTA:

  • Deterministic seeds across frameworks.
  • Use existing REM codebase with modified masking module.
  • No additional GPU memory requirements; three forwards per batch suffice.
  • Log per-batch loss terms to verify expected training trends during debugging.

7. Significance and Future Directions

r100_mask_v2 demonstrates two complementary findings:

  • For CTTA, straightforward random spatial patch masking (without reliance on attention or uncertainty signals), combined with dual consistency and entropy objectives, is sufficient to drive robust continual adaptation under severe corruption.
  • For masked face recognition, large-capacity ResNet backbones augmented with moderate mask exposure during training, informed margin softmax losses, and suitable optimization regimes, can yield high-accuracy recognition under heavy occlusion, bridging most of the performance gap induced by masks.

Both lines of work highlight the value of simplicity in design: minimal modifications to backbone or adaptation routines produce pronounced robustness gains. Future work in face recognition may examine higher mask ratios and margin scaling; in CTTA, more sophisticated mask schedules or adaptive masking strategies may further enhance adaptation without additional reliance on prediction-level calibration or complex priors.
