RAF-DB and AffectNet Datasets for FER
- RAF-DB and AffectNet are foundational facial expression datasets featuring in-the-wild images and detailed discrete and continuous emotion annotations.
- They address challenges such as class imbalance, label noise, and demographic bias, providing crucial benchmarks for FER model evaluation.
- Advanced methods like 3D fusion, dynamic sampling, and cross-dataset adaptation boost FER accuracy and generalization across complex real-world conditions.
RAF-DB (Real-world Affective Faces Database) and AffectNet are foundational benchmark datasets for facial expression recognition (FER) in unconstrained settings. Both provide large-scale, diverse samples of human faces labeled with discrete emotional categories, and AffectNet further supports continuous valence-arousal (VA) annotation. Their influence derives from their scale, real-world complexity, and annotation protocols, as well as from the persistent challenges of class imbalance and label noise they embody. These datasets underpin state-of-the-art FER algorithm development, evaluation, and cross-dataset generalization analysis.
1. Dataset Composition and Annotation Protocols
RAF-DB contains 29,672 in-the-wild face images (commonly rounded to ~30,000) with seven basic expression categories: Neutral, Happy, Sad, Surprise, Fear, Disgust, and Anger. Images are collected via Internet queries and manually annotated: each image receives labels from up to 40 raters, with majority vote typically resolving the final label. The single-label subset is most common in FER research; the compound-label subset is occasionally used for analyzing expression mixtures. The official split typically comprises 12,271 training and 3,068–3,368 test images, with substantial class imbalance favoring Neutral and Happy.
AffectNet aggregates over 1 million human faces harvested by keyword queries in multiple languages. Of these, roughly 287,000–450,000 receive manual categorical annotation: eight discrete categories (the seven basic emotions plus Contempt). AffectNet further provides continuous VA scores (range [–1, +1]) for valence (pleasantness) and arousal (intensity), making it the most comprehensive FER dataset for both categorical classification and regression tasks. Data splits generally comprise 283,901 training and 3,500–4,000 validation images. A supplementary subset (AffectNet_Auto) includes ≈350,000 automatically labeled faces, with documented annotation uncertainty (≈35% label noise).
Both datasets have rigorous annotation protocols but manifest ambiguous cases; for example, fine-grained differences between Neutral and slight Happiness, or overlapping cues between Surprise and Fear, lead to estimated real label-noise rates of 10–20% (Liu et al., 2022).
2. Statistical and Demographic Properties
RAF-DB and AffectNet are both affected by class and demographic imbalances. The majority of samples are young adults (16–53 years), with children (0–15) and seniors (54+) comprising <5% of their respective datasets. Gender distribution skews slightly male (RAF-DB: ~60%, AffectNet: ~54%), and both datasets are predominantly White (RAF-DB: ~55%, AffectNet: ~57%), with other races (Asian, Black, Indian, Latinx, Middle Eastern) underrepresented (Hosseini et al., 16 Feb 2025).
Class distributions are highly imbalanced: Neutral and Happy constitute 30–40% apiece; rarer emotions (Fear, Disgust, Contempt) are typically <5%. Diversity metrics such as Richness, Evenness, and Dominance, as defined in recent fairness analyses, confirm that AffectNet offers greater attribute richness (score 100 vs. RAF-DB 5.3), but both datasets display significant dominance (RAF-DB: 76.1, AffectNet: 69.1) by majority groups, limiting representational fairness (Hosseini et al., 16 Feb 2025).
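The diversity metrics above can be illustrated with a small sketch. This is one common formulation (richness as the number of distinct groups, evenness as normalized Shannon entropy, dominance as the share of the largest group); the exact definitions and scalings used in the cited fairness analysis may differ.

```python
import math
from collections import Counter

def diversity_metrics(labels):
    """Richness, evenness, and dominance for a list of group labels.

    Richness: number of distinct groups.
    Evenness: Shannon entropy normalized by log(richness), in [0, 1].
    Dominance: fraction of samples in the single largest group.
    """
    counts = Counter(labels)
    n = sum(counts.values())
    richness = len(counts)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    evenness = entropy / math.log(richness) if richness > 1 else 1.0
    dominance = max(probs)
    return richness, evenness, dominance

# Toy example: a heavily imbalanced class distribution
r, e, d = diversity_metrics(["Neutral"] * 70 + ["Happy"] * 25 + ["Fear"] * 5)
```

On this toy distribution, dominance is 0.70 (Neutral) and evenness is well below 1, mirroring the kind of majority-group dominance reported for both datasets.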
Environmental variation (illumination, pose, occlusion) is substantial, especially in AffectNet, which further complicates FER tasks and exacerbates distributional discrepancies between benchmarks.
3. Evaluation Protocols and Metrics
Standard preprocessing for both RAF-DB and AffectNet involves face detection (typically MTCNN or RetinaFace), cropping, alignment to a canonical eye axis, resizing to the network input size (often 224×224), and normalization either to [–1, 1] or with ImageNet channel statistics (Dong et al., 2024, Ma et al., 2021). Data splits consistently follow the original authors' recommendations; cross-validation is rare due to dataset scale.
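A minimal sketch of the crop-resize-normalize portion of this pipeline, assuming a bounding box already produced by a detector such as MTCNN or RetinaFace (detection and eye-axis alignment are omitted; nearest-neighbor resizing stands in for the interpolation a real pipeline would use):

```python
import numpy as np

def preprocess_face(img, box, size=224):
    """Crop a detected face box from an HWC uint8 image, resize to
    size x size via nearest-neighbor indexing, normalize to [-1, 1],
    and return a CHW float32 array ready for a CNN input."""
    left, top, right, bottom = box
    face = img[top:bottom, left:right]
    ys = np.arange(size) * face.shape[0] // size   # row index map
    xs = np.arange(size) * face.shape[1] // size   # column index map
    face = face[ys][:, xs]
    x = face.astype(np.float32) / 255.0 * 2.0 - 1.0  # [0,255] -> [-1,1]
    return np.transpose(x, (2, 0, 1))                # HWC -> CHW

# Synthetic gray image standing in for a detected face crop
dummy = np.full((300, 300, 3), 128, dtype=np.uint8)
tensor = preprocess_face(dummy, box=(50, 50, 250, 250))
```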
Performance is quantified using several metrics:
- For discrete expressions, overall accuracy, per-class accuracy, mean class accuracy, precision, recall, and F1 scores are standard (Ma et al., 2021, Dong et al., 2024). Macro and weighted averages are used, particularly in RAF-DB, to mitigate class imbalance effects.
- For VA regression in AffectNet, metrics include mean squared error (MSE), mean absolute error (MAE), root mean square error (RMSE), and concordance correlation coefficient (CCC), calculated separately for valence and arousal (Dong et al., 2024).
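Of these regression metrics, CCC is the least standard and worth spelling out: it penalizes both poor correlation and systematic bias between predicted and ground-truth scores. A minimal implementation:

```python
import numpy as np

def ccc(pred, true):
    """Concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var_x + var_y + (mean_x - mean_y)^2).
    Computed separately for valence and arousal in AffectNet evaluation.
    """
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    mx, my = pred.mean(), true.mean()
    vx, vy = pred.var(), true.var()
    cov = ((pred - mx) * (true - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

perfect = ccc([0.1, -0.5, 0.8], [0.1, -0.5, 0.8])  # identical -> 1.0
```

Unlike Pearson correlation, CCC drops below 1 whenever predictions are shifted or rescaled relative to the targets, which is why it is preferred for VA regression.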
Recent results utilizing EMOCA-based 3D classifiers yield RAF-DB discrete accuracy as high as 79.27% (3D only) and 94.0% (late fusion with 2D backbone), surpassing previous SOTA (FMAE: 93.09%) (Dong et al., 2024). AffectNet sees SOTA VA scoring (CCC_val=0.724, CCC_aro=0.650) with late fusion architectures. VTFF visual transformer methods attain 88.14% accuracy on RAF-DB and 61.85% mean class accuracy on AffectNet (validation set), representing top-tier performance (Ma et al., 2021).
4. Biases, Fairness, and Cross-Dataset Generalization
Both RAF-DB and AffectNet display pronounced capture and category biases. Marginal distribution shifts (P_S(X) ≠ P_T(X)) arise from variation in data capture (source, pose, lighting), while category bias (P_S(Y) ≠ P_T(Y)) results from differing annotation standards and class-prior mismatch (Li et al., 2019). MMD-based analyses reveal significant divergence, with ResNet-50 features showing cross-dataset accuracy drops of −17.1 pp (RAF-DB→AffectNet) and −6.5 pp (AffectNet→RAF-DB) for seven-way classification. By mitigating both forms of bias via emotion-conditional adaptation networks (ECAN), researchers have achieved up to ~10 pp accuracy gains in unsupervised transfer tasks (Li et al., 2019).
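The MMD statistic underlying such analyses can be sketched in its simplest form. With a linear kernel, squared MMD reduces to the squared distance between feature means of the two datasets; the cited ECAN-style work uses kernelized (e.g. RBF) variants, so this is an illustration of the quantity, not the exact estimator:

```python
import numpy as np

def linear_mmd2(X, Y):
    """Squared MMD with a linear kernel: the squared Euclidean distance
    between the mean feature vectors of source (X) and target (Y)."""
    return float(np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2))

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 16))  # stand-in for RAF-DB features
tgt = rng.normal(0.5, 1.0, size=(500, 16))  # stand-in for shifted AffectNet features
gap = linear_mmd2(src, tgt)    # large: distributions differ
same = linear_mmd2(src, src)   # zero: identical distributions
```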
Fairness evaluations utilize metrics such as demographic parity difference, equalized odds, equal opportunity, and treatment equality. Analyses on RAF-DB and AffectNet underscore that majority classes (Neutral, Happy) and White subjects drive the largest fairness gaps. Model performance versus fairness trade-offs are evident: transformer models (GPT-4o-mini, ViT) excel in accuracy yet amplify demographic/attribute bias, while CNNs (ResNet) are "fairer" by these metrics. Recommendations include attribute-based oversampling, explicit fairness regularization, and balanced splits (Hosseini et al., 16 Feb 2025).
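As a concrete instance of these metrics, demographic parity difference measures the largest gap in positive-prediction rate across demographic groups; equalized odds and equal opportunity additionally condition on the true label. A minimal sketch of the parity metric (toy data, not from either dataset):

```python
import numpy as np

def demographic_parity_difference(pred, group):
    """Maximum gap in positive-prediction rate across groups.
    0 means all groups receive positive predictions at the same rate."""
    pred, group = np.asarray(pred), np.asarray(group)
    rates = [pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Toy binary "Happy vs. not" predictions over two demographic groups
dpd = demographic_parity_difference(
    pred=[1, 1, 1, 0, 1, 0, 0, 0],
    group=["A", "A", "A", "A", "B", "B", "B", "B"],
)
```

Here group A receives positive predictions at rate 0.75 versus 0.25 for group B, giving a parity gap of 0.5.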
5. Robustness to Label Noise and Ambiguity
Label uncertainty is pervasive in in-the-wild FER datasets. ULC-AG (Uncertain Label Correction via Auxiliary Action Unit Graphs) achieves robust FER by combining a weighted regularization target branch with an auxiliary action unit branch (leveraging semantic AU–emotion graphs via GCN). Empirical results demonstrate substantial accuracy improvements: RAF-DB rises from 85.82% (baseline) to 89.31% (ULC-AG); AffectNet from 57.94% (baseline) to 61.57% (ULC-AG) (Liu et al., 2022).
ULC-AG is especially resilient under synthetic noise (label randomization up to 30%) and real-world ambiguity (AffectNet_Auto: ULC-AG accuracy 57.37% vs. baseline 53.23%). Dynamic batch-wise sampling, confidence-based re-labeling, and multi-label auxiliary tasks emerge as best practices for FER when annotation reliability is compromised.
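The confidence-based re-labeling practice can be illustrated generically. Note that ULC-AG's actual correction leverages AU-emotion graphs; the sketch below shows only the simpler idea of flipping a label when the model confidently disagrees with it, with a hypothetical threshold:

```python
import numpy as np

def relabel_uncertain(probs, labels, threshold=0.9):
    """Generic confidence-based re-labeling: when the model's top class
    disagrees with the given label and its confidence exceeds the
    threshold, replace the label with the model's prediction."""
    probs = np.asarray(probs, float)
    labels = np.asarray(labels).copy()
    conf = probs.max(axis=1)        # top-class confidence per sample
    pred = probs.argmax(axis=1)     # predicted class per sample
    flip = (conf >= threshold) & (pred != labels)
    labels[flip] = pred[flip]
    return labels, flip

new_labels, flipped = relabel_uncertain(
    probs=[[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]],
    labels=[1, 1, 1],
)
```

Only the first sample is flipped: the model is confident (0.95) and disagrees with the label; the second is too uncertain, and the third already agrees.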
6. Integration with Deep Learning and 3D Inference
Recent advances implement fusion strategies for FER: intermediate and late fusion of 2D CNN/transformer features with 3D morphable model parameters (FLAME, EMOCA, SMIRK) have set new SOTA in both discrete expression classification and VA regression (Dong et al., 2024). Late fusion, in which each modality is inferred independently and the outputs are then aggregated, outperforms feature-level (intermediate) fusion, suggesting modality complementarity and reduced redundant encoding. The extraction of detailed 3D shape, expression, and pose coefficients provides expressive geometric information, boosting classifier sensitivity and error correction.
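The late-fusion scheme can be sketched as a weighted average of per-branch softmax outputs; the fusion weight below is a hypothetical hyperparameter, and the cited paper's exact aggregation may differ:

```python
import numpy as np

def late_fusion(logits_2d, logits_3d, w=0.5):
    """Late fusion sketch: a 2D backbone and a 3D-parameter classifier
    each predict independently; their softmax distributions are combined
    by weighted averaging (w is an assumed, tunable weight)."""
    def softmax(z):
        z = np.asarray(z, float)
        e = np.exp(z - z.max(axis=-1, keepdims=True))  # stable softmax
        return e / e.sum(axis=-1, keepdims=True)
    return w * softmax(logits_2d) + (1 - w) * softmax(logits_3d)

fused = late_fusion([2.0, 0.5, 0.1], [1.5, 1.4, 0.0])
pred = int(np.argmax(fused))  # fused class decision
```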
Visual transformer approaches (VTFF) translate face images into attention-driven sequences of "visual words," capturing complex spatial and contextual cues needed for FER under real-world occlusion, pose, and illumination variation (Ma et al., 2021).
7. Recommendations and Future Directions
FER research leveraging RAF-DB and AffectNet should consistently:
- Report and address dataset attribute distributions using diversity metrics (Richness, Evenness, Dominance).
- Mitigate imbalances and label uncertainty via balanced splits, oversampling, auxiliary multi-label side tasks, and per-sample re-labeling.
- Conduct fairness analyses using demographic parity, equalized odds, and treatment equality metrics on both models and data.
- Employ cross-dataset adaptation (e.g., ECAN, MMD-based alignment) when transferring models between benchmarks.
- Release demographic-balanced splits, document annotation noise, and evaluate models for both accuracy and fairness.
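One concrete way to implement the oversampling recommendation is inverse-frequency sample weighting, which (fed to a framework sampler such as PyTorch's WeightedRandomSampler) draws rare classes like Fear, Disgust, and Contempt at higher probability. A minimal sketch:

```python
import numpy as np
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-sample draw probabilities inversely proportional to class
    frequency, normalized to sum to 1. Sampling with these weights
    yields approximately uniform class frequencies per batch."""
    counts = Counter(labels)
    w = np.array([1.0 / counts[y] for y in labels], dtype=float)
    return w / w.sum()

labels = ["Happy"] * 40 + ["Neutral"] * 40 + ["Fear"] * 5
w = inverse_frequency_weights(labels)
```

With these weights, each class carries equal total sampling mass (1/3 here) despite Fear having eight times fewer samples.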
A plausible implication is that as FER models grow in capacity and dataset scale increases, mechanism-based corrections (3D fusion, AU graphs, fairness constraints) will become mandatory to ensure robust, unbiased, and high-fidelity emotion inference across real-world populations, poses, and environments (Dong et al., 2024, Liu et al., 2022, Hosseini et al., 16 Feb 2025).