BLEMORE Dataset: Multimodal Emotion Analysis
- BLEMORE is a multimodal emotion recognition dataset that captures both single and blended emotional expressions using meticulously annotated audio-video clips from professional actors.
- It supports key tasks such as emotion presence detection and relative salience estimation, employing actor-disjoint splits and rigorous cross-validation protocols.
- Baseline models, including advanced multimodal approaches like HiCMAE, highlight improved performance in detecting emotions, though challenges remain in accurately estimating salience ratios.
BLEMORE is a multimodal emotion recognition dataset designed to capture the complexity of blended affective states with precise control over relative emotional salience. Unlike traditional resources that constrain emotional analysis to single, discrete labels, BLEMORE addresses the need in affective computing and emotion understanding for data reflecting simultaneous—often competing—emotional processes as expressed in both facial and vocal modalities. The dataset provides over 3,000 video and audio clips sourced from professional actors, each systematically annotated for both the set of emotions expressed and their proportional prominence within each blend, thereby enabling rigorous evaluation of models on the tasks of emotion presence recognition and relative salience estimation (Lachmann et al., 19 Jan 2026).
1. Dataset Composition and Structure
BLEMORE comprises 3,050 annotated clips, sourced from 58 professional actors (28 male, 30 female; aged 21–77, mean 36 years). The dataset is partitioned into 1,390 single-emotion clips and 1,660 blended-emotion clips. The emotional taxonomy includes six basic classes: anger, disgust, fear, happiness, sadness, and neutral. Blends are exclusively pairwise combinations of the five non-neutral emotions, resulting in 10 canonical pairs: anger–disgust, anger–fear, anger–happiness, anger–sadness, disgust–fear, disgust–happiness, disgust–sadness, fear–happiness, fear–sadness, and happiness–sadness.
Each blended-emotion clip follows one of three relative-salience configurations for the emotion pair, denoted (A, B); the resulting label space is enumerated in the sketch after this list:
- 50/50: perfect balance
- 70/30: A dominant
- 30/70: B dominant
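As a concrete illustration of the label space, the sketch below enumerates the 6 single-emotion labels, the 10 canonical pairs, and the 30 blend/salience combinations; the emotion ordering and tuple conventions are our own, not the dataset's packaging.

```python
# Hypothetical enumeration of BLEMORE's label space (conventions assumed, not official).
from itertools import combinations

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "neutral"]
BLENDABLE = EMOTIONS[:5]                      # neutral never participates in blends
SALIENCE_CONFIGS = [(0.5, 0.5), (0.7, 0.3), (0.3, 0.7)]

singles = [(e,) for e in EMOTIONS]            # 6 single-emotion labels
pairs = list(combinations(BLENDABLE, 2))      # 10 canonical emotion pairs
blends = [(a, b, sal) for (a, b) in pairs for sal in SALIENCE_CONFIGS]

print(len(singles), len(pairs), len(blends))  # 6 10 30
```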
Clip durations range from 1 to 30 seconds. BLEMORE employs an actor-disjoint split: 43 actors allocated to training and 15 to test (balanced gender). Training data is further divided into five cross-validation folds, each actor-disjoint and balanced for gender and clip counts, ensuring rigorous benchmarking and generalization assessment.
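A minimal sketch of an actor-disjoint split using scikit-learn's GroupKFold (our tooling choice for illustration); the published folds are additionally balanced for gender and clip counts, which this sketch does not enforce.

```python
# Actor-disjoint cross-validation folds: no actor appears on both sides of a split.
from sklearn.model_selection import GroupKFold

def actor_disjoint_folds(clip_ids, actor_ids, n_splits=5):
    """Yield (train_idx, val_idx) index pairs grouped by actor identity."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(clip_ids, groups=actor_ids)
```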
2. Annotation Protocol and Ground Truth
Annotations derive from actor instructions: participants were tasked to recall and enact single or blended emotional scenarios, explicitly targeting prescribed salience ratios for blends. For single-emotion clips, labels are one-hot vectors over the six classes; for blends, two emotions are marked "present" (e.g., happiness + sadness). Salience annotation records the intended proportionality (50/50, 70/30, 30/70).
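For illustration, a ground-truth target might be materialized as a soft vector over the six classes as below; the class ordering and helper name are assumptions rather than the dataset's actual encoding.

```python
import numpy as np

CLASSES = ["anger", "disgust", "fear", "happiness", "sadness", "neutral"]
IDX = {c: i for i, c in enumerate(CLASSES)}

def target_vector(emotions, saliences):
    """Build a soft label: one-hot for singles, e.g. 0.7/0.3 for a dominant blend."""
    y = np.zeros(len(CLASSES))
    for emo, s in zip(emotions, saliences):
        y[IDX[emo]] = s
    return y

target_vector(["happiness"], [1.0])                  # single emotion
target_vector(["happiness", "sadness"], [0.7, 0.3])  # 70/30 blend
```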
A human-validation subset (clips from 18 actors) underwent further scrutiny: evaluators viewing both audio and video identified the emotions present in each blend; presence-accuracy in this assessment exceeded chance levels, corroborating the clarity of actor portrayals and the integrity of the annotations.
3. Task Definitions and Evaluation Metrics
BLEMORE supports two principal predictive tasks: emotion presence detection and relative salience estimation. Each clip is associated with:
- $\hat{y}$: the predicted salience vector over the six emotion classes
- $y$: the ground-truth vector of the same structure
Auxiliary functions formalize evaluation:
- $\mathrm{pres}(\cdot)$: projects a salience vector to presence only, e.g., assigning equal weight to both emotions of a blend and full weight to the single emotion otherwise
- $\mathbb{1}[\cdot]$: the indicator function, equal to $1$ if its argument holds, else $0$
Metrics:
- Presence-accuracy ($\mathrm{ACC}_{\mathrm{pres}}$): $\mathrm{ACC}_{\mathrm{pres}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\mathrm{pres}(\hat{y}_i) = \mathrm{pres}(y_i)\right]$
Requires exact match of present emotions with no extraneous or missing elements.
- Salience-accuracy ($\mathrm{ACC}_{\mathrm{sal}}$): $\mathrm{ACC}_{\mathrm{sal}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right]$
Demands not only correct emotion-set, but also correct 50/50 vs. 70/30 (or 30/70) configuration.
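Both metrics can be sketched in a few lines, assuming predictions have already been discretized to salience vectors over the six classes; the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def presence_set(y):
    """Indices of emotions marked present (non-zero salience)."""
    return frozenset(np.nonzero(y)[0])

def presence_accuracy(preds, targets):
    """Exact match of the present-emotion sets, averaged over clips."""
    return np.mean([presence_set(p) == presence_set(t) for p, t in zip(preds, targets)])

def salience_accuracy(preds, targets):
    """Exact match of the full salience configuration (e.g. 0.7/0.3 vs. 0.5/0.5)."""
    return np.mean([np.allclose(p, t) for p, t in zip(preds, targets)])
```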
4. Baseline Architectures and Training Regimes
Three classes of encoders operationalize the data: video, audio, and multimodal.
- Video only: OpenFace 2.0 (action-unit and head/gaze features), CLIP-Vision, ImageBind-Vision, VideoMAEv2 (ViT-B/16 masked autoencoder), and Video Swin Transformer.
- Audio only: HuBERT and WavLM, both large masked-prediction speech encoders.
- Multimodal: HiCMAE, a pre-trained audio-visual masked autoencoder geared for emotion tasks.
Encoding outputs are aggregated either by computing seven summary statistics per feature (mean, std, percentiles at 10, 25, 50, 75, 90) for frame-level representations, or via temporal subsampling into 16-frame tubes, with per-tube encodings aggregated at the decision stage.
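A minimal sketch of the seven-statistic aggregation over a frame-level feature matrix; the array shapes and function name are our own.

```python
import numpy as np

def aggregate_frames(features):
    """features: (num_frames, feature_dim) -> clip-level vector of shape (7 * feature_dim,)."""
    stats = [features.mean(axis=0),
             features.std(axis=0),
             *np.percentile(features, [10, 25, 50, 75, 90], axis=0)]
    return np.concatenate(stats)
```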
Classifier heads include simple linear projections and single-hidden-layer MLPs (256 or 512 units, ReLU, dropout). HiCMAE employs a single linear layer with softmax activation.
The loss is Kullback-Leibler divergence to the ground-truth distribution (one-hot for single emotions, soft for blends); HiCMAE instead utilizes cross-entropy. Optimization applies Adam (with the reported learning rate and weight decay), batch sizes of 32 (aggregation) or 512 (subsampling), and training for up to 200–300 epochs (the epoch count varies by method, with early stopping via cross-validation). HiCMAE fine-tuning spans 50–100 epochs with a cosine learning-rate schedule and warm-up.
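A minimal PyTorch sketch of the soft-label objective on top of a single-hidden-layer head; the feature dimensionality, hidden size, and hyperparameter values are placeholders, not the reported configuration.

```python
import torch
import torch.nn.functional as F

def kl_loss(logits, soft_targets):
    # kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")

head = torch.nn.Sequential(                 # placeholder dimensions
    torch.nn.Linear(7 * 768, 256), torch.nn.ReLU(), torch.nn.Dropout(0.3),
    torch.nn.Linear(256, 6))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4, weight_decay=1e-5)

logits = head(torch.randn(32, 7 * 768))     # batch of aggregated clip features
targets = torch.tensor([[0.0, 0.0, 0.0, 0.7, 0.3, 0.0]] * 32)  # 70/30 blend targets
kl_loss(logits, targets).backward()
optimizer.step()
```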
Post-processing thresholds for presence and salience are selected per cross-validation fold by grid search; test-set evaluation uses the thresholds from the best-performing validation fold.
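One plausible reading of this discretization step is sketched below; the decision rule, parameter names, and top-2 heuristic are assumptions rather than the published procedure.

```python
import numpy as np

def discretize(probs, tau_presence, tau_salience):
    """Map a softmax output over the 6 classes to a discrete single or blend label."""
    a, b = np.argsort(probs)[::-1][:2]        # two most probable emotions
    if probs[b] < tau_presence:               # second emotion too weak: single label
        return {a: 1.0}
    if abs(probs[a] - probs[b]) < tau_salience:
        return {a: 0.5, b: 0.5}               # balanced blend
    return {a: 0.7, b: 0.3}                   # a is dominant (probs sorted descending)
```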
5. Results and Comparative Analysis
BLEMORE evaluations reveal that multimodal fusion consistently outperforms single-modality approaches. The following table summarizes validation and test results for leading models:
| Encoder | Val ACC (presence) | Val ACC (salience) | Test ACC (presence) | Test ACC (salience) |
|---|---|---|---|---|
| CLIP | 0.266 ± 0.021 | 0.105 ± 0.012 | 0.258 | 0.096 |
| ImageBind | 0.290 ± 0.028 | 0.130 ± 0.008 | 0.261 | 0.087 |
| OpenFace | 0.228 ± 0.014 | 0.119 ± 0.014 | 0.226 | 0.081 |
| VideoMAEv2 | 0.273 ± 0.025 | 0.106 ± 0.014 | 0.293 | 0.054 |
| HuBERT | 0.243 ± 0.023 | 0.104 ± 0.024 | 0.274 | 0.120 |
| WavLM | 0.265 ± 0.027 | 0.121 ± 0.012 | 0.311 | 0.084 |
| ImageBind+WavLM | 0.345 ± 0.035 | 0.170 ± 0.055 | 0.327 | 0.114 |
| ImageBind+HuBERT | 0.339 ± 0.023 | 0.158 ± 0.053 | 0.298 | 0.084 |
| VideoMAEv2+WavLM | 0.343 ± 0.022 | 0.140 ± 0.028 | 0.332 | 0.102 |
| VideoMAEv2+HuBERT | 0.332 ± 0.016 | 0.138 ± 0.012 | 0.332 | 0.114 |
| HiCMAE | 0.298 ± 0.025 | 0.180 ± 0.036 | 0.268 | 0.180 |
Unimodal classifiers max out at ≈ 29% presence and ≈ 13% salience accuracy on validation. Multimodal models deliver substantial improvements: ImageBind+WavLM (val presence 0.345), HiCMAE (val salience 0.180). On the held-out test set, the best models reach 0.332 (VideoMAEv2+HuBERT) for presence and 0.180 (HiCMAE) for salience. Task complexity is underscored by low scores for trivial baselines (single-emotion: 0.074 presence, 0.000 salience; blend: 0.056 presence, 0.033 salience).
Comparison of aggregation versus subsampling (CV only) for selected encoders reveals minor performance variations, with aggregation and subsampling each viable depending on input modality and downstream architecture.
6. Insights, Limitations, and Methodological Considerations
Multimodal fusion yields best-in-class results, confirming the complementary nature of facial and vocal cues for emotion perception. Notably, predicting the presence of blended emotions is markedly easier than estimating their relative prominence; while ImageBind+WavLM outperforms in presence accuracy, HiCMAE provides superior salience accuracy but remains below 20%.
Fixed-threshold strategies for presence and salience are sensitive: marginal output changes can shift discrete predictions, yielding a pronounced validation–test gap. Modeling salient blended emotions as multi-label soft classification is inherently brittle under threshold-based discretization. Alternatives such as continuous regression, ranking losses, multi-task architectures, or bi-center loss objectives are suggested as prospective avenues for more stable estimation.
7. Impact and Research Significance
BLEMORE addresses a critical void in affective computing by providing:
- An extensive (3,050 clips), meticulously curated dataset of single and blended emotional expressions, covering six basic emotion classes, 10 blend categories, and three salience conditions.
- High-fidelity, multimodal (audio and video) data collected under standardized conditions.
- Rigorous train/test protocols (actor-disjoint, reproducible splits and folds).
- Granular ground truth annotations capturing both the identities and relative proportions of blended emotions.
By enabling robust benchmarking across presence and salience recognition tasks, BLEMORE facilitates novel algorithmic research that more accurately models the complex, interleaved affective signals found in naturalistic human communication, and offers a foundation for future methodological advances in emotion analysis (Lachmann et al., 19 Jan 2026).