AffectNet+: Advanced FER Benchmark
- AffectNet+ is an advanced benchmark for facial expression recognition that integrates soft-label annotations, enriched metadata, and synthetic augmentation to address ambiguous and compound expressions.
- It employs a dual-method soft-label construction using ensemble binary classifiers and AU-based techniques to generate smooth probability distributions over eight primary emotion classes.
- The resource leverages photorealistic synthetic augmentation via 3D morphable models and multi-task learning frameworks, significantly improving FER performance and addressing class imbalance.
AffectNet+ is an advanced benchmark and resource for facial expression recognition (FER) research. It builds on the foundational AffectNet dataset by introducing soft-label annotations, enriched metadata, synthetic augmentation, and multi-task learning strategies that jointly leverage categorical and dimensional representations of affect. AffectNet+ supports robust FER by modeling ambiguous and compound expressions more accurately, mitigating class imbalance, and enabling fine-grained evaluation across demographic and data-complexity subsets. Its construction and associated methodologies are detailed in several recent works, notably "AffectNet+: A Database for Enhancing Facial Expression Recognition with Soft-Labels" (Fard et al., 2024), "Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition" (Antoniadis et al., 2021), and the augmentation methodology of "Deep Neural Network Augmentation: Generating Faces for Affect Analysis" (Kollias et al., 2018).
1. Dataset Composition, Labeling, and Metadata
AffectNet+ is derived from the publicly available AffectNet dataset, which contains approximately one million web-crawled face images, of which ∼456,000 carry at least one human-provided emotion label. AffectNet+ retains only eight emotion classes (Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, and Contempt); the “Other” category is removed. The training partition consists of 287,651 manually labeled images; the validation set has 4,000 balanced images (500 per class).
Each facial image is annotated with:
- Discrete categorical emotion: one of the eight primary classes.
- Valence and arousal: continuous labels in $[-1, 1]$.
- Soft-labels: An eight-dimensional vector $\mathbf{s}_i = (s_i^1, \dots, s_i^8)$, where $s_i^c$ indicates the estimated probability that emotion $c$ is present in image $i$, with $s_i^c \in [0, 1]$ (Fard et al., 2024).
- Demographic and geometric metadata: Age (regression), gender, ethnicity (Indian, Black, White, Middle-Eastern, Hispanic), head pose (yaw, pitch, roll), and 68- and 28-point facial landmarks.
Data is further stratified by complexity: “Easy” samples (67.5% of train) are those where the top-1 soft-label agrees with the original hard-label; “Challenging” (19.96%) and “Difficult” (12.5%) subsets are defined by lower soft/hard agreement.
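The "Easy" criterion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, and because the exact thresholds separating "Challenging" from "Difficult" are not given here, the sketch only reports the rank of the hard label inside the soft-label vector as a rough agreement measure.

```python
import numpy as np

def easy_mask(soft_labels: np.ndarray, hard_labels: np.ndarray) -> np.ndarray:
    """True where the top-1 soft-label class equals the hard label."""
    return soft_labels.argmax(axis=1) == hard_labels

def hard_label_rank(soft_labels: np.ndarray, hard_labels: np.ndarray) -> np.ndarray:
    """Rank (0 = highest score) of the hard-label class in each soft-label vector.

    ASSUMPTION: rank is only an illustrative agreement measure; the actual
    Challenging/Difficult rule is defined in the source paper.
    """
    order = np.argsort(-soft_labels, axis=1, kind="stable")
    return np.argmax(order == hard_labels[:, None], axis=1)

# Toy soft-labels over the eight classes (made-up values).
soft = np.array([
    [0.70, 0.10, 0.05, 0.05, 0.03, 0.03, 0.02, 0.02],
    [0.20, 0.45, 0.15, 0.05, 0.05, 0.04, 0.03, 0.03],
    [0.10, 0.10, 0.40, 0.30, 0.04, 0.03, 0.02, 0.01],
])
hard = np.array([0, 0, 3])

print(easy_mask(soft, hard))        # only the first sample is "Easy"
print(hard_label_rank(soft, hard))  # ranks: 0, 1, 1
```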
2. Soft-Label Construction and Annotation Protocols
AffectNet+ departs from the classical “hard-label” (one-hot) protocol by assigning each image a probability vector over the possible emotions. Soft-labels are calculated by fusing two statistically grounded annotators:
- Ensemble of Binary Classifiers (EBC): For each emotion $c$, three binary (one-vs-rest) CNNs (ResNet-50, EfficientNet-B3, XceptionNet) are trained on a multi-annotated subset. At inference, each network produces a probability that $c$ is present in image $i$, and each classifier carries a confidence score (Eq. 4 (Fard et al., 2024)). A semantic score is computed per network, and the final EBC score is the mean over the three networks.
- Action-Unit (AU)–Based Classifier: Each emotion is represented by a binary 21-dimensional AU vector. For each emotion, a ResNet-50 predicts both the one-vs-rest classification and the AUs. The soft-label for emotion $c$ in image $i$ is scored by weighted AU similarity and an artificial softmax, then averaged with the network’s binary output. The final AU-based score (Eq. 11 (Fard et al., 2024)) is modulated by a per-class AU confidence.
- Final Fusion: For each image $i$, the soft-label entry for class $c$ is obtained by fusing the EBC and AU-based scores (Eq. 12 (Fard et al., 2024)).
This protocol is designed to mitigate single-annotator bias, model compound expressions, and produce smooth decision boundaries.
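The data flow of the two-annotator fusion might look roughly like the sketch below. The exact formulas (Eqs. 4 to 12 of Fard et al., 2024) are not reproduced in this text, so two parts are explicit assumptions: the semantic score is stood in for by probability times confidence, and the final fusion by an equal-weight average. Only the "mean over three networks" step comes from the description above. All names are hypothetical.

```python
import numpy as np

def ebc_score(probs: np.ndarray, confidences: np.ndarray) -> np.ndarray:
    """probs, confidences: shape (3 networks, 8 classes). Returns shape (8,)."""
    # ASSUMPTION: probability * confidence stands in for the semantic score.
    semantic = probs * confidences
    # Mean over the three networks, as described in the text.
    return semantic.mean(axis=0)

def fuse(ebc: np.ndarray, au: np.ndarray) -> np.ndarray:
    """ASSUMPTION: an equal-weight average stands in for the fusion of Eq. 12."""
    return 0.5 * (ebc + au)

rng = np.random.default_rng(0)
probs = rng.uniform(size=(3, 8))                  # placeholder network outputs
confidences = rng.uniform(0.5, 1.0, size=(3, 8))  # placeholder confidences
au_scores = rng.uniform(size=8)                   # placeholder AU-based scores

soft_label = fuse(ebc_score(probs, confidences), au_scores)
print(soft_label.round(3))
```

Because every input lies in $[0, 1]$, the fused entries stay in $[0, 1]$ under these stand-in formulas; they are not forced to sum to one.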
3. Advanced Augmentation: Synthetic Data Generation
A complementary strategy for augmenting AffectNet to form AffectNet+ utilizes photorealistic face synthesis via 3D Morphable Model (3DMM) deformation and Poisson blending (Kollias et al., 2018). The workflow involves:
- 3DMM Fitting: Fitting LSFM-based 3D shape, blendshape-based expression, pose, and texture models to a neutral AffectNet face by minimizing feature-space photometric and landmark error (Eqns. (6), (13)).
- Affect-Driven Deformation: Mapping either valence-arousal (VA) coordinates or basic expression labels to specific blendshapes or mean meshes, using precomputed clusters from 600K annotated 4DFAB frames partitioned into 550 VA cells.
- Image Synthesis: The deformed mesh is rendered with the source texture into the original image frame, and composited with Poisson blending to ensure seamless photorealism.
- Augmented Dataset: The process produces, e.g., 2.5M VA-synthesized images and 176K basic-expression images, expanding AffectNet for robust FER model training.
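The affect-driven deformation step can be pictured as a lookup from a target VA point to per-cell expression parameters. The sketch below is a loose illustration under stated assumptions: the cell centers and blendshape vectors are random placeholders, the blendshape count is hypothetical, and nearest-center lookup is a simplification rather than the procedure of Kollias et al. (2018). Only the figure of 550 cells comes from the text.

```python
import numpy as np

N_CELLS = 550        # number of VA cells, from the text
N_BLENDSHAPES = 28   # HYPOTHETICAL blendshape count, for illustration only

rng = np.random.default_rng(0)
# Placeholder data: real cell centers and mean blendshape parameters would be
# precomputed from the annotated 4DFAB frames.
cell_centers = rng.uniform(-1.0, 1.0, size=(N_CELLS, 2))
cell_blendshapes = rng.normal(size=(N_CELLS, N_BLENDSHAPES))

def blendshapes_for_va(valence: float, arousal: float):
    """Return (cell index, blendshape vector) of the cell nearest to the VA point.

    ASSUMPTION: nearest-center lookup stands in for the actual cell assignment.
    """
    query = np.array([valence, arousal])
    dists = np.linalg.norm(cell_centers - query, axis=1)
    idx = int(dists.argmin())
    return idx, cell_blendshapes[idx]

idx, params = blendshapes_for_va(0.6, 0.3)
print(idx, params.shape)
```

The returned parameters would then drive the fitted 3DMM before rendering and Poisson blending.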
Reported quantitative comparisons show this approach outperforming GAN-based augmentation in both expression classification and VA regression, as measured by CCC, Pearson-R, MSE, and binary accuracy (Table 1 (Kollias et al., 2018)).
4. Network Architectures and Multi-Task Learning
AffectNet+ catalyzed methodological advances in multi-task affect modeling, notably combining discrete and continuous annotations in unified learning frameworks. A prominent architecture consists of:
- Shared Backbone: e.g., DenseNet, which extracts global facial features, producing a $1024$-dimensional feature vector for each image (Antoniadis et al., 2021).
- Graph Convolutional Network (GCN): Nodes correspond to seven categorical emotions plus valence and arousal ($9$ total), with initial features as 300-D GloVe embeddings. The two-layer GCN captures empirical interdependencies using a sparsified adjacency matrix computed from Cat–Dim Spearman correlations, combined with self-loops and edge re-weighting for stability.
- Task Heads: The first seven rows of the final GCN output matrix provide the weights of the categorical classifiers, applied to the backbone feature vector via a dot product; the remaining two rows serve as regressors for valence and arousal.
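The shapes involved in this architecture can be checked with a small NumPy sketch. It is a minimal illustration under several assumptions: the hidden width of 512 is hypothetical, random vectors stand in for GloVe embeddings and for the backbone feature, the adjacency is random rather than correlation-based, and the standard symmetric normalization with self-loops replaces the paper's specific edge re-weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, EMB_DIM, FEAT_DIM = 9, 300, 1024  # from the text
HIDDEN_DIM = 512                            # HYPOTHETICAL hidden width

def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (standard GCN form)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(node_feats, adj_norm, w1, w2):
    """Two-layer graph convolution with a ReLU in between."""
    hidden = np.maximum(adj_norm @ node_feats @ w1, 0.0)
    return adj_norm @ hidden @ w2

node_feats = rng.normal(size=(N_NODES, EMB_DIM))  # stand-in for GloVe vectors
adj = (rng.uniform(size=(N_NODES, N_NODES)) > 0.6).astype(float)
adj = np.maximum(adj, adj.T)                      # make the graph undirected
np.fill_diagonal(adj, 0.0)                        # self-loops added in normalization
w1 = rng.normal(scale=0.05, size=(EMB_DIM, HIDDEN_DIM))
w2 = rng.normal(scale=0.05, size=(HIDDEN_DIM, FEAT_DIM))

gcn_out = gcn_forward(node_feats, normalize_adjacency(adj), w1, w2)  # (9, 1024)
image_feat = rng.normal(size=FEAT_DIM)            # stand-in for the backbone output

logits = gcn_out[:7] @ image_feat                 # seven categorical scores
valence, arousal = gcn_out[7:] @ image_feat       # two dimensional outputs
print(logits.shape, float(valence), float(arousal))
```

The point of the design is that the classifier weights are not free parameters: they are generated from label embeddings, so related emotions and VA dimensions share structure through the graph.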
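Training uses a combined multi-task loss, with class-weighted cross-entropy for classification and a negated-CCC loss for regression, where CCC is the Concordance Correlation Coefficient between predictions $x$ and labels $y$:
$$\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$
Here $s_{xy}$ is the covariance of $x$ and $y$, $s_x^2$ and $s_y^2$ are their variances, and $\bar{x}$, $\bar{y}$ are their means. This MTL–GCN scheme yields state-of-the-art discrete accuracy (66.46% mean class accuracy, surpassing previous bests in the low-to-mid 60s) and strong VA prediction (CCC of $0.767$ for valence and $0.649$ for arousal; Antoniadis et al., 2021).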
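For reference, the Concordance Correlation Coefficient and a regression loss built on it can be written as below. The CCC definition is the standard one; the loss form (one minus the mean CCC over valence and arousal) is a common convention and is an assumption here, since the exact weighting used by Antoniadis et al. (2021) is not reproduced in this text.

```python
import numpy as np

def ccc(pred, target) -> float:
    """Concordance Correlation Coefficient between two 1-D sequences."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mean_p, mean_t = pred.mean(), target.mean()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return float(2.0 * cov / (pred.var() + target.var() + (mean_p - mean_t) ** 2))

def ccc_loss(pred_v, true_v, pred_a, true_a) -> float:
    """ASSUMPTION: loss = 1 - mean CCC over valence and arousal."""
    return 1.0 - 0.5 * (ccc(pred_v, true_v) + ccc(pred_a, true_a))

labels = np.array([-0.5, 0.0, 0.3, 0.8])
print(ccc(labels, labels))        # perfect agreement gives 1.0
print(ccc(labels + 0.5, labels))  # a constant offset lowers CCC below 1
```

Unlike Pearson correlation, CCC penalizes shifts in mean and scale, which is why it is favored for valence and arousal regression.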
5. Data Complexity, Bias Mitigation, and Evaluation
AffectNet+ explicitly addresses label and class imbalance by a combination of negative sampling, complexity-aware partitioning, and balanced evaluation metrics:
- Negative Sampling for EBC: For emotion $c$, negatives comprise 20% uniformly sampled from other classes and 80% sampled proportionally to AU-intersection counts, prioritizing confusable negatives (Fard et al., 2024).
- Complexity Subsets: Training/test splits stratified as Easy, Challenging, Difficult enable granular generalization analysis.
- Metrics: Baselines report both raw and average accuracy for hard-label tasks; Soft-FER uses weighted MAE (W-MAE, Eq. 11) and weighted failure rate (W-FR), which reflect the fidelity of the predicted soft-label distribution.
- Ablation: Loss choices (CCC vs. MSE), MTL vs. single-task training, and explicit Cat–Dim modeling are compared. For instance, MTL yields +1.3% accuracy, the CCC loss a further ≈1%, and the GCN an additional +0.8%, so each component adds to the baseline (Antoniadis et al., 2021).
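The 20%/80% negative-sampling split described in this list could be sketched as follows. This is an illustration under assumptions: "uniform" is read as uniform over images of the other classes, the AU-intersection counts are made-up numbers, the function name is hypothetical, and a tiny constant keeps every weight positive so sampling without replacement is always possible.

```python
import numpy as np

def sample_negatives(labels, target, au_overlap, n_neg, rng, uniform_frac=0.2):
    """Pick n_neg negative indices for a one-vs-rest classifier of `target`.

    labels:     (N,) integer class ids
    au_overlap: (n_classes,) AU-intersection count of each class with `target`
    """
    neg_idx = np.flatnonzero(labels != target)
    n_uniform = int(round(uniform_frac * n_neg))
    n_weighted = n_neg - n_uniform
    # 20%: uniform over all images of the other classes (ASSUMED reading).
    uniform_part = rng.choice(neg_idx, size=n_uniform, replace=False)
    remaining = np.setdiff1d(neg_idx, uniform_part)
    # 80%: probability proportional to the AU overlap of each image's class.
    weights = au_overlap[labels[remaining]].astype(float) + 1e-9
    weights /= weights.sum()
    weighted_part = rng.choice(remaining, size=n_weighted, replace=False, p=weights)
    return np.concatenate([uniform_part, weighted_part])

rng = np.random.default_rng(0)
labels = rng.integers(0, 8, size=2000)
au_overlap = np.array([0, 1, 4, 2, 6, 3, 5, 2])  # made-up counts vs. class 0
negatives = sample_negatives(labels, target=0, au_overlap=au_overlap,
                             n_neg=500, rng=rng)
print(np.bincount(labels[negatives], minlength=8))
```

Classes with larger AU overlap end up over-represented among the negatives, which is the intended bias toward confusable examples.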
6. Performance Benchmarks and Use Cases
Quantitative benchmarks on AffectNet+ demonstrate the efficacy of these strategies:
- Hard-label FER (ResNet-50): 52.06% overall accuracy on validation; 85.86% for Easy, 51.62% for Challenging, 34.34% for Difficult samples.
- Soft-label regression: W-MAE of 17.30%, with W-FR of 10.85% (across all val images). Notably, Easy cases achieve W-FR = 8.00%, Difficult = 18.66% (Fard et al., 2024).
- EBC/AU fusion: Average per-class accuracy rises from 79.5% (plain classifier) to 88.5% (with AU head).
- Synthetic augmentation: CCC scores (AffectNet, VGG-FACE backbone) improve from 0.50/0.37 to 0.62/0.54 (valence/arousal), outstripping GAN-based methods by a wide margin (Kollias et al., 2018).
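Figures like the overall, per-class-average, and per-subset accuracies above are straightforward to compute once each sample carries a complexity tag. The sketch below uses toy arrays and hypothetical function names; it shows how raw accuracy and mean class accuracy can diverge on imbalanced data.

```python
import numpy as np

def accuracy(pred: np.ndarray, true: np.ndarray) -> float:
    """Raw accuracy: fraction of correct predictions."""
    return float((pred == true).mean())

def mean_class_accuracy(pred, true, n_classes: int = 8) -> float:
    """Average of per-class recalls, skipping classes absent from `true`."""
    per_class = [(pred[true == c] == c).mean()
                 for c in range(n_classes) if (true == c).any()]
    return float(np.mean(per_class))

def accuracy_by_subset(pred, true, subset) -> dict:
    """Raw accuracy within each complexity subset."""
    return {str(name): accuracy(pred[subset == name], true[subset == name])
            for name in np.unique(subset)}

# Toy data: class 0 dominates, and the single class-1 sample is misclassified.
true = np.array([0, 0, 0, 0, 0, 0, 1, 2])
pred = np.array([0, 0, 0, 0, 0, 0, 0, 2])
subset = np.array(["easy"] * 5 + ["challenging", "difficult", "easy"])

print(accuracy(pred, true))             # 0.875
print(mean_class_accuracy(pred, true))  # about 0.667
print(accuracy_by_subset(pred, true, subset))
```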
AffectNet+ with its accompanying protocols supports:
- Compound/multi-label expression modeling
- Model uncertainty quantification
- Fairness-by-metadata and subgroup generalization studies
- Domain adaptation and pose-aware learning
- Intensity-aware and subset-specialized loss designs
- Class imbalance resilience via negative sampling and subset evaluation
Public availability of images, annotations, soft-labels, subsets, and metadata positions AffectNet+ as a definitive resource for FER research across static and dynamic domains.