AffectNet+: Advanced FER Benchmark

Updated 17 February 2026

AffectNet+ is an advanced benchmark for facial expression recognition that integrates soft-label annotations, enriched metadata, and synthetic augmentation to address ambiguous and compound expressions.
It employs a dual-method soft-label construction using ensemble binary classifiers and AU-based techniques to generate smooth probability distributions over eight primary emotion classes.
The resource leverages photorealistic synthetic augmentation via 3D morphable models and multi-task learning frameworks, significantly improving FER performance and addressing class imbalance.

AffectNet+ is an advanced benchmark and resource for facial expression recognition (FER) research. It builds on the foundational AffectNet dataset by introducing soft-label annotations, enriched metadata, synthetic augmentation, and multi-task learning strategies that jointly leverage categorical and dimensional representations of affect. AffectNet+ supports robust FER by modeling ambiguous and compound expressions more accurately, mitigating class imbalance, and enabling evaluation across demographic and data-complexity subsets. Its construction and associated methodologies are detailed in several recent works, notably "AffectNet+: A Database for Enhancing Facial Expression Recognition with Soft-Labels" (Fard et al., 2024), "Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition" (Antoniadis et al., 2021), and the augmentation methodology of "Deep Neural Network Augmentation: Generating Faces for Affect Analysis" (Kollias et al., 2018).

1. Dataset Composition, Labeling, and Metadata

AffectNet+ is derived from the publicly available AffectNet dataset, which contains approximately one million web-crawled face images, of which ∼456,000 carry at least one human-provided emotion label. AffectNet+ retains only eight emotion classes (Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, and Contempt) and removes the “Other” category. The training partition consists of 287,651 manually labeled images; the validation set has 4,000 balanced images (500 per class).

Each facial image is annotated with:

Discrete categorical emotion: one of the eight primary classes.
Valence and arousal: continuous labels in $[-1,1]$.
Soft-labels: an eight-dimensional vector $\mathbf{SL}_k=[P_0,\dots,P_7]$, indicating the estimated probability $P_i$ that emotion $i$ is present in image $k$, with $\sum_i P_i \approx 1$ (Fard et al., 2024).
Demographic and geometric metadata: age (regression), gender, ethnicity (Indian, Black, White, Middle-Eastern, Hispanic), head pose (yaw, pitch, roll), and 68- and 28-point facial landmarks as $(x,y)$ coordinates.

Data is further stratified by complexity: “Easy” samples (67.5% of train) are those where the top-1 soft-label agrees with the original hard-label; “Challenging” (19.96%) and “Difficult” (12.5%) subsets are defined by lower soft/hard agreement.
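As a rough illustration of the annotation scheme and the “Easy” criterion described above, the following sketch models one record and checks whether the top-1 soft-label matches the hard label. The class name `AffectRecord`, its field names, the class index order (taken from the order in which the classes are listed above), and the tolerance on the probability sum are all assumptions for illustration, not the dataset's actual file format. The thresholds separating “Challenging” from “Difficult” are not given here, so only the “Easy” test is shown.

```python
from dataclasses import dataclass
from typing import List

# Assumed index order, following the order in which the classes are listed in the text.
EMOTIONS = ["Neutral", "Happy", "Sad", "Surprise", "Fear", "Disgust", "Anger", "Contempt"]


@dataclass
class AffectRecord:
    """Hypothetical container for one annotated image (field names are illustrative)."""
    hard_label: int          # index into EMOTIONS
    valence: float           # continuous, in [-1, 1]
    arousal: float           # continuous, in [-1, 1]
    soft_label: List[float]  # eight probabilities, summing to roughly 1

    def validate(self, tol: float = 0.05) -> None:
        """Raise ValueError if the record violates the stated ranges (tol is an assumption)."""
        if not 0 <= self.hard_label < len(EMOTIONS):
            raise ValueError("hard_label out of range")
        if len(self.soft_label) != len(EMOTIONS):
            raise ValueError("soft_label must have eight entries")
        if not (-1.0 <= self.valence <= 1.0 and -1.0 <= self.arousal <= 1.0):
            raise ValueError("valence/arousal must lie in [-1, 1]")
        if abs(sum(self.soft_label) - 1.0) > tol:
            raise ValueError("soft_label should sum to approximately 1")

    def is_easy(self) -> bool:
        """True when the most probable soft-label class equals the hard label."""
        top1 = max(range(len(self.soft_label)), key=lambda i: self.soft_label[i])
        return top1 == self.hard_label


record = AffectRecord(
    hard_label=EMOTIONS.index("Happy"),
    valence=0.6,
    arousal=0.3,
    soft_label=[0.10, 0.70, 0.00, 0.10, 0.00, 0.00, 0.05, 0.05],
)
record.validate()
print(EMOTIONS[record.hard_label], "easy" if record.is_easy() else "not easy")
```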

2. Soft-Label Construction and Annotation Protocols

AffectNet+ departs from the classical “hard-label” (one-hot) protocol by assigning each image a probability vector over possible emotions. Soft-labels are calculated by fusing two statistically grounded annotators:

Ensemble of Binary Classifiers (EBC): For each emotion $i$, three binary (one-vs-rest) CNNs (ResNet-50, EfficientNet-B3, XceptionNet) are trained on a multi-annotated subset. At inference, each network $j$ produces a probability $P_j(\mathrm{emo}_i|img_k)$. Each classifier has a confidence score $CS^{EB}_j(i) = \frac{1}{2}(\text{TPR} + \text{TNR})$ (Eq. 4 in Fard et al., 2024). Semantic scores are $SC^{EB}_j(i,k) = CS^{EB}_j(i) \cdot P_j(\mathrm{emo}_i|img_k)$, and the final EBC score $SC^{EB}_{\mathrm{Mean}}(i,k)$ is the mean over the three networks.
Action-Unit (AU)–Based Classifier: Each emotion is represented by a binary 21-dimensional AU vector ($\mathbf{AU}_i$). For each emotion, a ResNet-50 predicts both the one-vs-rest classification and the AUs. The soft-label for emotion $i$ in image $k$ is scored by weighted AU similarity and an artificial softmax, then averaged with the network’s binary output. The final AU-based score is $P_{AU}(i, k) = \frac{1}{2}(\mathbf{BPV}_k(i) + \mathbf{APV}_k(i))$ (Eq. 11 in Fard et al., 2024), modulated by a per-class AU confidence $CS^{AU}(i)$.
Final Fusion: For each image, the soft-label entry for class $i$ is $sl(i,k) = \frac{1}{2}(SC^{EB}_{\mathrm{Mean}}(i,k) + CS^{AU}(i)P_{AU}(i,k))$ (Eq. 12).
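The arithmetic of the three steps above can be traced with a small sketch for a single emotion $i$ and image $k$. The function names and all numeric inputs are made up for illustration. The reading of $\mathbf{BPV}$ as the binary-head output and $\mathbf{APV}$ as the AU-similarity score is an assumption inferred from the description, and the sketch applies no renormalization across classes because none is stated.

```python
from statistics import fmean
from typing import Sequence


def confidence_score(tpr: float, tnr: float) -> float:
    """CS = (TPR + TNR) / 2, the per-classifier confidence."""
    return 0.5 * (tpr + tnr)


def ebc_mean_score(confidences: Sequence[float], probs: Sequence[float]) -> float:
    """Mean over networks of the semantic scores SC_j = CS_j * P_j."""
    return fmean([c * p for c, p in zip(confidences, probs)])


def au_score(bpv: float, apv: float) -> float:
    """P_AU = (BPV + APV) / 2; bpv/apv are assumed to be the binary-head and AU-based scores."""
    return 0.5 * (bpv + apv)


def fuse(sc_eb_mean: float, cs_au: float, p_au: float) -> float:
    """sl(i, k) = (SC_mean + CS_AU * P_AU) / 2."""
    return 0.5 * (sc_eb_mean + cs_au * p_au)


# Illustrative numbers only: (TPR, TNR) for three hypothetical one-vs-rest networks.
confidences = [confidence_score(0.90, 0.80), confidence_score(0.85, 0.85), confidence_score(0.80, 0.90)]
probs = [0.60, 0.70, 0.50]  # each network's probability for emotion i on image k
sc_mean = ebc_mean_score(confidences, probs)
p_au = au_score(bpv=0.65, apv=0.55)
soft_entry = fuse(sc_mean, cs_au=0.80, p_au=p_au)
print(f"SC_mean={sc_mean:.3f}  P_AU={p_au:.3f}  sl={soft_entry:.3f}")
```

Repeating the computation for each of the eight emotions would give the full soft-label vector for the image.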

This protocol is designed to mitigate single-annotator bias, model compound expressions, and produce smooth decision boundaries.

3. Advanced Augmentation: Synthetic Data Generation

A complementary strategy for augmenting AffectNet to form AffectNet+ utilizes photorealistic face synthesis via 3D Morphable Model (3DMM) deformation and Poisson blending (Kollias et al., 2018). The workflow involves:

3DMM Fitting: Fitting LSFM-based 3D shape, blendshape-based expression, pose, and texture models to a neutral AffectNet face by minimizing feature-space photometric and landmark error (Eqns. (6), (13)).
Affect-Driven Deformation: Mapping either $(v,a)$ coordinates or basic expression labels to specific blendshapes or mean meshes, using precomputed clusters from 600K annotated 4DFAB frames partitioned into 550 VA cells.
Image Synthesis: The deformed mesh is rendered with the source texture into the original image frame, and composited with Poisson blending to ensure seamless photorealism.
Augmented Dataset: The process produces, e.g., 2.5M VA-synthesized images and 176K basic-expression images, expanding AffectNet for robust FER model training.

Quantitative comparisons reported in Table 1 of (Kollias et al., 2018) favor this approach over GAN-based augmentation in both expression classification and VA regression, as measured by CCC, Pearson-R, MSE, and binary accuracy.

4. Network Architectures and Multi-Task Learning

AffectNet+ catalyzed methodological advances in multi-task affect modeling, notably combining discrete and continuous annotations in unified learning frameworks. A prominent architecture consists of:

Shared Backbone: e.g., DenseNet, which extracts global facial features and produces a 1024-dimensional feature vector for each image (Antoniadis et al., 2021).
Graph Convolutional Network (GCN): Nodes correspond to seven categorical emotions plus valence and arousal ($n=9$ total), with 300-D GloVe embeddings as initial features. The two-layer GCN ($512\rightarrow 1024$ hidden) captures empirical interdependencies using a sparsified adjacency matrix computed from Cat–Dim Spearman correlations, combined with self-loops and edge re-weighting for stability.
Task Heads: The first seven rows of the final GCN output matrix provide the weights for the categorical classifiers via $\hat{y}_i = \mathrm{softmax}(w^c_i \cdot x)$; the remaining two rows serve as regressors for valence and arousal.
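The way GCN outputs act as classifier and regressor weights can be sketched in plain Python with toy dimensions. This is a minimal sketch under several assumptions: the symmetric normalization $D^{-1/2}(A+I)D^{-1/2}$ stands in for the self-loop and edge re-weighting scheme, the adjacency is a non-negative toy chain graph rather than one derived from Spearman correlations, node features and weights are random rather than GloVe embeddings and trained parameters, and the sizes 6, 8, 10 replace the 300, 512, 1024 of the text. All function names are illustrative.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and apply D^{-1/2} (A + I) D^{-1/2}; assumes non-negative entries."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_heads(adj, node_feats, w1, w2):
    """Two propagation layers: ReLU(A X W1), then A H W2. Returns one weight row per node."""
    a = normalize_adjacency(adj)
    hidden = [[max(0.0, v) for v in row] for row in matmul(matmul(a, node_feats), w1)]
    return matmul(matmul(a, hidden), w2)


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(heads, x, n_classes=7):
    """Dot each head row with the image feature x: first rows classify, the rest regress."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in heads]
    return softmax(scores[:n_classes]), scores[n_classes:]


random.seed(0)
N_NODES, D_IN, D_HID, D_OUT = 9, 6, 8, 10  # toy sizes; the text uses 300 -> 512 -> 1024


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy chain graph over the 9 nodes (7 emotions + valence + arousal).
adj = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(N_NODES)] for i in range(N_NODES)]
heads = gcn_heads(adj, rand_matrix(N_NODES, D_IN), rand_matrix(D_IN, D_HID), rand_matrix(D_HID, D_OUT))
x = [random.random() for _ in range(D_OUT)]  # stand-in for the backbone feature vector
probs, va = predict(heads, x)
print(len(probs), len(va))
```

The point of the design is that the classifier weights are not free parameters: they are produced by message passing over the label graph, so related categories and dimensions share information.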

Training uses a combined multi-task loss,

$L = L^c + L^r$

with class-weighted cross-entropy for the classification term $L^c$ and a CCC-based loss for the regression term $L^r$,

$L^r = 1 - \frac{\rho_v + \rho_a}{2}$

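where $\rho_v$ and $\rho_a$ are the Concordance Correlation Coefficients (CCC) for valence and arousal. For predictions $x$ and targets $y$, the CCC is $\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$, where $s_{xy}$ is the covariance, $s_x^2$ and $s_y^2$ are the variances, and $\bar{x}$, $\bar{y}$ are the means.

This MTL–GCN scheme reports state-of-the-art discrete accuracy (66.46% mean class accuracy, surpassing previous bests in the low-to-mid 60s) and strong VA prediction (CCC of $0.767$ for valence and $0.649$ for arousal; Antoniadis et al., 2021).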
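The regression loss defined above can be checked numerically with a short sketch. It uses population (divide-by-$n$) variance and covariance, which is an assumption since the normalization is not stated here, and the function names are illustrative.

```python
from statistics import fmean
from typing import Sequence


def ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Concordance Correlation Coefficient: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    mx, my = fmean(x), fmean(y)
    s_xy = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    s_x2 = fmean([(a - mx) ** 2 for a in x])
    s_y2 = fmean([(b - my) ** 2 for b in y])
    denom = s_x2 + s_y2 + (mx - my) ** 2
    # denom is zero only when x and y are the same constant, i.e. perfect agreement.
    return 2.0 * s_xy / denom if denom > 0 else 1.0


def regression_loss(v_pred, v_true, a_pred, a_true) -> float:
    """L^r = 1 - (rho_v + rho_a) / 2."""
    return 1.0 - 0.5 * (ccc(v_pred, v_true) + ccc(a_pred, a_true))


# Illustrative values only.
v_true, v_pred = [-0.5, 0.0, 0.5, 0.8], [-0.4, 0.1, 0.4, 0.9]
a_true, a_pred = [0.2, 0.6, -0.1, 0.3], [0.3, 0.4, 0.0, 0.2]
print(f"L^r = {regression_loss(v_pred, v_true, a_pred, a_true):.4f}")
```

Unlike MSE, the CCC penalizes both low correlation and systematic shifts in mean or scale: a prediction offset by a constant from the target has perfect Pearson correlation but a CCC below 1.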

5. Data Complexity, Bias Mitigation, and Evaluation

AffectNet+ explicitly addresses label and class imbalance by a combination of negative sampling, complexity-aware partitioning, and balanced evaluation metrics:

Negative Sampling for EBC: For emotion $i$, negatives comprise 20% sampled uniformly from the other classes and 80% sampled proportionally to AU-intersection counts, prioritizing confusable negatives (Fard et al., 2024).
Complexity Subsets: Training/test splits stratified as Easy, Challenging, Difficult enable granular generalization analysis.
Metrics: Baselines report both raw and average accuracy $(\overline{Acc} = \frac{1}{2}(TPR+TNR))$ for hard-label tasks; Soft-FER leverages weighted MAE (Eq. 11) and weighted failure rate (W-FR), which reflect the fidelity of the predicted soft-label distribution.
Ablation: Loss choices (CCC vs. MSE), MTL vs. single-task training, and explicit Cat–Dim modeling are compared. For instance, MTL yields +1.3% accuracy, the CCC loss a further ≈1%, and the GCN an additional +0.8%, so each component improves on the configuration without it (Antoniadis et al., 2021).
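The 20%/80% negative-sampling split for the EBC can be sketched as follows. Several details are assumptions: the uniform portion is taken to mean that source classes are drawn uniformly, sampling is done with replacement, and the AU-intersection counts and image identifiers are toy values. The function name `sample_negatives` and its parameters are hypothetical.

```python
import random
from typing import Dict, List, Optional


def sample_negatives(
    target: str,
    pool: Dict[str, List[str]],
    au_overlap: Dict[str, int],
    n_neg: int,
    uniform_frac: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw n_neg negative image ids for the one-vs-rest classifier of `target`.

    pool maps each class to its image ids; au_overlap maps each non-target class to
    its AU-intersection count with the target (at least one count must be positive).
    """
    rng = rng or random.Random()
    others = [c for c in pool if c != target]
    n_uniform = round(uniform_frac * n_neg)
    n_weighted = n_neg - n_uniform
    # Uniform portion: every other class is equally likely.
    classes = rng.choices(others, k=n_uniform)
    # Weighted portion: classes sharing more AUs with the target are drawn more often.
    weights = [au_overlap[c] for c in others]
    classes += rng.choices(others, weights=weights, k=n_weighted)
    # Pick one image (with replacement) from each chosen class.
    return [rng.choice(pool[c]) for c in classes]


# Toy data: ids and overlap counts are invented for illustration.
pool = {c: [f"{c}_{i}" for i in range(50)] for c in ["Fear", "Surprise", "Happy", "Sad"]}
au_overlap = {"Surprise": 3, "Happy": 0, "Sad": 1}
negatives = sample_negatives("Fear", pool, au_overlap, n_neg=100, rng=random.Random(0))
counts = {c: sum(n.startswith(c + "_") for n in negatives) for c in au_overlap}
print(counts)
```

In this toy setup a class with zero AU overlap, such as "Happy", can only enter through the uniform portion, while the most confusable class dominates the weighted portion.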

6. Performance Benchmarks and Use Cases

Quantitative benchmarks on AffectNet+ demonstrate the efficacy of these strategies:

Hard-label FER (ResNet-50): 52.06% overall accuracy on validation; 85.86% for Easy, 51.62% for Challenging, 34.34% for Difficult samples.
Soft-label regression: W-MAE of 17.30%, with W-FR of 10.85% (across all val images). Notably, Easy cases achieve W-FR = 8.00%, Difficult = 18.66% (Fard et al., 2024).
EBC/AU fusion: Average per-class accuracy rises from 79.5% (plain classifier) to 88.5% (with AU head).
Synthetic augmentation: CCC scores (AffectNet, VGG-FACE backbone) improve from 0.50/0.37 to 0.62/0.54 (valence/arousal), outperforming GAN-based methods (Kollias et al., 2018).

AffectNet+ with its accompanying protocols supports:

Compound/multi-label expression modeling
Model uncertainty quantification
Fairness-by-metadata and subgroup generalization studies
Domain adaptation and pose-aware learning
Intensity-aware and subset-specialized loss designs
Class imbalance resilience via negative sampling and subset evaluation

Public availability of images, annotations, soft-labels, subsets, and metadata makes AffectNet+ a comprehensive resource for FER research across static and dynamic domains.