BLEMORE Dataset: Multimodal Emotion Analysis
- BLEMORE is a multimodal emotion recognition dataset that captures both single and blended emotional expressions using meticulously annotated audio-video clips from professional actors.
- It supports key tasks such as emotion presence detection and relative salience estimation, employing actor-disjoint splits and rigorous cross-validation protocols.
- Baseline models, including advanced multimodal approaches like HiCMAE, highlight improved performance in detecting emotions, though challenges remain in accurately estimating salience ratios.
BLEMORE is a multimodal emotion recognition dataset designed to capture the complexity of blended affective states with precise control over relative emotional salience. Unlike traditional resources that constrain emotional analysis to single, discrete labels, BLEMORE addresses the need in affective computing and emotion understanding for data reflecting simultaneous—often competing—emotional processes as expressed in both facial and vocal modalities. The dataset provides over 3,000 video and audio clips sourced from professional actors, each systematically annotated for both the set of emotions expressed and their proportional prominence within each blend, thereby enabling rigorous evaluation of models on the tasks of emotion presence recognition and relative salience estimation (Lachmann et al., 19 Jan 2026).
1. Dataset Composition and Structure
BLEMORE comprises 3,050 annotated clips, sourced from 58 professional actors (28 male, 30 female; aged 21–77, mean 36 years). The dataset is partitioned into 1,390 single-emotion clips and 1,660 blended-emotion clips. The emotional taxonomy includes six basic classes: anger, disgust, fear, happiness, sadness, and neutral. Blends are exclusively pairwise combinations of the five non-neutral emotions, resulting in 10 canonical pairs: anger–disgust, anger–fear, anger–happiness, anger–sadness, disgust–fear, disgust–happiness, disgust–sadness, fear–happiness, fear–sadness, and happiness–sadness.
Each blended-emotion clip follows one of three relative-salience configurations for the emotion pair, denoted (A, B); the resulting label space is enumerated in the sketch after this list:
- 50/50: perfect balance
- 70/30: A dominant
- 30/70: B dominant
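As a concrete illustration of the label space, the sketch below enumerates the 6 single-emotion labels, the 10 canonical pairs, and the 30 blend/salience combinations; the emotion ordering and tuple conventions are our own, not the dataset's packaging.

```python
# Hypothetical enumeration of BLEMORE's label space (conventions assumed, not official).
from itertools import combinations

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "neutral"]
BLENDABLE = EMOTIONS[:5]                      # neutral never participates in blends
SALIENCE_CONFIGS = [(0.5, 0.5), (0.7, 0.3), (0.3, 0.7)]

singles = [(e,) for e in EMOTIONS]            # 6 single-emotion labels
pairs = list(combinations(BLENDABLE, 2))      # 10 canonical emotion pairs
blends = [(a, b, sal) for (a, b) in pairs for sal in SALIENCE_CONFIGS]

print(len(singles), len(pairs), len(blends))  # 6 10 30
```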
Clip durations range from 1 to 30 seconds. BLEMORE employs an actor-disjoint split: 43 actors allocated to training and 15 to test (balanced gender). Training data is further divided into five cross-validation folds, each actor-disjoint and balanced for gender and clip counts, ensuring rigorous benchmarking and generalization assessment.
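A minimal sketch of an actor-disjoint split using scikit-learn's GroupKFold (our tooling choice for illustration); the published folds are additionally balanced for gender and clip counts, which this sketch does not enforce.

```python
# Actor-disjoint cross-validation folds: no actor appears on both sides of a split.
from sklearn.model_selection import GroupKFold

def actor_disjoint_folds(clip_ids, actor_ids, n_splits=5):
    """Yield (train_idx, val_idx) index pairs grouped by actor identity."""
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(clip_ids, groups=actor_ids)
```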
2. Annotation Protocol and Ground Truth
Annotations derive from actor instructions: participants were tasked to recall and enact single or blended emotional scenarios, explicitly targeting prescribed salience ratios for blends. For single-emotion clips, labels are one-hot vectors over the six classes; for blends, two emotions are marked "present" (e.g., happiness + sadness). Salience annotation records the intended proportionality (50/50, 70/30, 30/70).
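For illustration, a ground-truth target might be materialized as a soft vector over the six classes as below; the class ordering and helper name are assumptions rather than the dataset's actual encoding.

```python
import numpy as np

CLASSES = ["anger", "disgust", "fear", "happiness", "sadness", "neutral"]
IDX = {c: i for i, c in enumerate(CLASSES)}

def target_vector(emotions, saliences):
    """Build a soft label: one-hot for singles, e.g. 0.7/0.3 for a dominant blend."""
    y = np.zeros(len(CLASSES))
    for emo, s in zip(emotions, saliences):
        y[IDX[emo]] = s
    return y

target_vector(["happiness"], [1.0])                  # single emotion
target_vector(["happiness", "sadness"], [0.7, 0.3])  # 70/30 blend
```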
A human-validation subset (clips from 18 actors) underwent further scrutiny: evaluators viewing both audio and video identified the emotions present in each blend; presence-accuracy in this assessment exceeded chance levels, corroborating the clarity of actor portrayals and the integrity of the annotations.
3. Task Definitions and Evaluation Metrics
BLEMORE supports two principal predictive tasks: emotion presence detection and relative salience estimation. Each clip is associated with:
- $\hat{y}$: the predicted salience vector over the six emotion classes
- $y$: the ground-truth vector of the same structure
Auxiliary functions formalize evaluation:
- $\mathrm{pres}(\cdot)$: projects a salience vector to presence only, e.g., assigning equal weight to both emotions of a blend and full weight to the single emotion otherwise
- $\mathbb{1}[\cdot]$: the indicator function, equal to $1$ if its argument holds, else $0$
Metrics:
- Presence-accuracy ($\mathrm{ACC}_{\mathrm{pres}}$): $\mathrm{ACC}_{\mathrm{pres}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\mathrm{pres}(\hat{y}_i) = \mathrm{pres}(y_i)\right]$
Requires exact match of present emotions with no extraneous or missing elements.
- Salience-accuracy ($\mathrm{ACC}_{\mathrm{sal}}$): $\mathrm{ACC}_{\mathrm{sal}} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right]$
Demands not only correct emotion-set, but also correct 50/50 vs. 70/30 (or 30/70) configuration.
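Both metrics can be sketched in a few lines, assuming predictions have already been discretized to salience vectors over the six classes; the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def presence_set(y):
    """Indices of emotions marked present (non-zero salience)."""
    return frozenset(np.nonzero(y)[0])

def presence_accuracy(preds, targets):
    """Exact match of the present-emotion sets, averaged over clips."""
    return np.mean([presence_set(p) == presence_set(t) for p, t in zip(preds, targets)])

def salience_accuracy(preds, targets):
    """Exact match of the full salience configuration (e.g. 0.7/0.3 vs. 0.5/0.5)."""
    return np.mean([np.allclose(p, t) for p, t in zip(preds, targets)])
```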
4. Baseline Architectures and Training Regimes
Three classes of encoders operationalize the data: video, audio, and multimodal.
- Video only: OpenFace 2.0 (action-unit and head/gaze features), CLIP-Vision, ImageBind-Vision, VideoMAEv2 (ViT-B/16 masked autoencoder), and Video Swin Transformer.
- Audio only: HuBERT and WavLM, both large masked-prediction speech encoders.
- Multimodal: HiCMAE, a pre-trained audio-visual masked autoencoder geared for emotion tasks.
Encoding outputs are aggregated either by computing seven summary statistics per feature (mean, std, percentiles at 10, 25, 50, 75, 90) for frame-level representations, or via temporal subsampling into 16-frame tubes, with per-tube encodings aggregated at the decision stage.
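A minimal sketch of the seven-statistic aggregation over a frame-level feature matrix; the array shapes and function name are our own.

```python
import numpy as np

def aggregate_frames(features):
    """features: (num_frames, feature_dim) -> clip-level vector of shape (7 * feature_dim,)."""
    stats = [features.mean(axis=0),
             features.std(axis=0),
             *np.percentile(features, [10, 25, 50, 75, 90], axis=0)]
    return np.concatenate(stats)
```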
Classifier heads include simple linear projections and single-hidden-layer MLPs (256 or 512 units, ReLU, dropout). HiCMAE employs a single linear layer with softmax activation.
The loss is Kullback-Leibler divergence to the ground-truth distribution (one-hot for single emotions, soft for blends); HiCMAE instead utilizes cross-entropy. Optimization applies Adam (with the reported learning rate and weight decay), batch sizes of 32 (aggregation) or 512 (subsampling), and training for up to 200–300 epochs (the epoch count varies by method, with early stopping via cross-validation). HiCMAE fine-tuning spans 50–100 epochs with a cosine learning-rate schedule and warm-up.
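A minimal PyTorch sketch of the soft-label objective on top of a single-hidden-layer head; the feature dimensionality, hidden size, and hyperparameter values are placeholders, not the reported configuration.

```python
import torch
import torch.nn.functional as F

def kl_loss(logits, soft_targets):
    # kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(F.log_softmax(logits, dim=-1), soft_targets, reduction="batchmean")

head = torch.nn.Sequential(                 # placeholder dimensions
    torch.nn.Linear(7 * 768, 256), torch.nn.ReLU(), torch.nn.Dropout(0.3),
    torch.nn.Linear(256, 6))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4, weight_decay=1e-5)

logits = head(torch.randn(32, 7 * 768))     # batch of aggregated clip features
targets = torch.tensor([[0.0, 0.0, 0.0, 0.7, 0.3, 0.0]] * 32)  # 70/30 blend targets
kl_loss(logits, targets).backward()
optimizer.step()
```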
Post-processing thresholds for presence and salience are selected per cross-validation fold by grid search; test-set evaluation uses the thresholds from the best-performing validation fold.
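One plausible reading of this discretization step is sketched below; the decision rule, parameter names, and top-2 heuristic are assumptions rather than the published procedure.

```python
import numpy as np

def discretize(probs, tau_presence, tau_salience):
    """Map a softmax output over the 6 classes to a discrete single or blend label."""
    a, b = np.argsort(probs)[::-1][:2]        # two most probable emotions
    if probs[b] < tau_presence:               # second emotion too weak: single label
        return {a: 1.0}
    if abs(probs[a] - probs[b]) < tau_salience:
        return {a: 0.5, b: 0.5}               # balanced blend
    return {a: 0.7, b: 0.3}                   # a is dominant (probs sorted descending)
```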
5. Results and Comparative Analysis
BLEMORE evaluations reveal that multimodal fusion consistently outperforms single-modality approaches. The following table summarizes validation and test results for leading models:
| Encoder | Val ACC (presence) | Val ACC (salience) | Test ACC (presence) | Test ACC (salience) |
|---|---|---|---|---|
| CLIP | 0.266 ± 0.021 | 0.105 ± 0.012 | 0.258 | 0.096 |
| ImageBind | 0.290 ± 0.028 | 0.130 ± 0.008 | 0.261 | 0.087 |
| OpenFace | 0.228 ± 0.014 | 0.119 ± 0.014 | 0.226 | 0.081 |
| VideoMAEv2 | 0.273 ± 0.025 | 0.106 ± 0.014 | 0.293 | 0.054 |
| HuBERT | 0.243 ± 0.023 | 0.104 ± 0.024 | 0.274 | 0.120 |
| WavLM | 0.265 ± 0.027 | 0.121 ± 0.012 | 0.311 | 0.084 |
| ImageBind+WavLM | 0.345 ± 0.035 | 0.170 ± 0.055 | 0.327 | 0.114 |
| ImageBind+HuBERT | 0.339 ± 0.023 | 0.158 ± 0.053 | 0.298 | 0.084 |
| VideoMAEv2+WavLM | 0.343 ± 0.022 | 0.140 ± 0.028 | 0.332 | 0.102 |
| VideoMAEv2+HuBERT | 0.332 ± 0.016 | 0.138 ± 0.012 | 0.332 | 0.114 |
| HiCMAE | 0.298 ± 0.025 | 0.180 ± 0.036 | 0.268 | 0.180 |
Unimodal classifiers max out at ≈ 29% presence and ≈ 13% salience accuracy on validation. Multimodal models deliver substantial improvements: ImageBind+WavLM (val presence 0.345), HiCMAE (val salience 0.180). On the held-out test set, the best models reach 0.332 (VideoMAEv2+HuBERT) for presence and 0.180 (HiCMAE) for salience. Task complexity is underscored by low scores for trivial baselines (single-emotion: 0.074 presence, 0.000 salience; blend: 0.056 presence, 0.033 salience).
Comparison of aggregation versus subsampling (CV only) for selected encoders reveals minor performance variations, with aggregation and subsampling each viable depending on input modality and downstream architecture.
6. Insights, Limitations, and Methodological Considerations
Multimodal fusion yields best-in-class results, confirming the complementary nature of facial and vocal cues for emotion perception. Notably, predicting the presence of blended emotions is markedly easier than estimating their relative prominence; while ImageBind+WavLM outperforms in presence accuracy, HiCMAE provides superior salience accuracy but remains below 20%.
Fixed-threshold strategies for presence and salience are sensitive: marginal output changes can shift discrete predictions, yielding a pronounced validation–test gap. Modeling salient blended emotions as multi-label soft classification is inherently brittle under threshold-based discretization. Alternatives such as continuous regression, ranking losses, multi-task architectures, or bi-center loss objectives are suggested as prospective avenues for more stable estimation.
7. Impact and Research Significance
BLEMORE addresses a critical void in affective computing by providing:
- An extensive (3,050 clips), meticulously curated dataset of single and blended emotional expressions, covering six basic emotion classes, 10 blend categories, and three salience conditions.
- High-fidelity, multimodal (audio and video) data collected under standardized conditions.
- Rigorous train/test protocols (actor-disjoint, reproducible splits and folds).
- Granular ground truth annotations capturing both the identities and relative proportions of blended emotions.
By enabling robust benchmarking across presence and salience recognition tasks, BLEMORE facilitates novel algorithmic research that more accurately models the complex, interleaved affective signals found in naturalistic human communication, and offers a foundation for future methodological advances in emotion analysis (Lachmann et al., 19 Jan 2026).