
EMOTIC Dataset: Emotion Recognition in the Wild

Updated 23 January 2026
  • EMOTIC Dataset is a large-scale image resource with 23,571 images and 34,320 person instances annotated using 26 emotion categories and continuous VAD metrics.
  • It integrates personal, bodily, and contextual cues to enable robust emotion recognition in unconstrained, real-world scenarios.
  • Annotation protocols use crowd-sourced multi-labels and VAD mean-aggregation, with evaluation metrics addressing class imbalance and supporting advanced model architectures.

The EMOTIC dataset is a large-scale resource for the study of emotion recognition in images "in the wild," with rigorous multi-label categorical and continuous affective annotation schemes. Designed to support machine learning approaches that jointly exploit personal, bodily, and rich contextual cues, EMOTIC has become a primary benchmark in affective computing for recognizing human emotions beyond facial expression—enabling models to interpret affect in unconstrained, real-world scenarios and across a broad spectrum of emotion categories and affective states (Ninh et al., 2023, Wang et al., 2023, Costa et al., 2023, Kosti et al., 2020).

1. Dataset Composition and Annotation Schema

EMOTIC comprises approximately 23,571 natural, unconstrained images, sourced from web, social media, and public image collections; these images are annotated with 34,320 person instances, where each individual is independently labeled (Kosti et al., 2020, Ninh et al., 2023). For each annotated person, the dataset provides:

  • Discrete Labels: Multi-label assignments over 26 emotion categories, including but not limited to Peace, Affection, Engagement, Disapproval, Anger, Sensitivity, Fear, and Suffering. Label selection is non-exclusive, with each person typically assigned 1–9 applicable categories (Ninh et al., 2023, Mehra et al., 8 Feb 2025).
  • Continuous Labels: Three-dimensional Valence–Arousal–Dominance (VAD) annotations, each on an integer scale from 1 to 10, reflecting the person's pleasantness (V), activation (A), and control (D). Some papers rescale VAD to [–1, 1] or [0, 1], depending on paper-specific normalization (Ninh et al., 2023, Costa et al., 2023, Mehra et al., 8 Feb 2025).
  • Metadata: Per-person gender (M/F) and age group (child, teenager, adult) labels are also recorded. Global context metadata such as image source (e.g., COCO, ADE20K) is included for each sample (Costa et al., 2023).
  • Localization: Each person is localized by an axis-aligned bounding box in COCO format, with uncropped full-scene context images available for context-aware modeling (Ninh et al., 2023, Kosti et al., 2020).
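Putting the schema above together, a single person instance can be represented as a small record. The field names below are illustrative, not the dataset's official keys, and the rescaling helper assumes the 1–10 integer VAD convention described above:

```python
# Illustrative per-person EMOTIC record (hypothetical field names,
# not the dataset's official schema).
sample = {
    "image_source": "COCO",
    "bbox_xywh": [110, 45, 80, 210],               # COCO-style [x, y, w, h]
    "categories": ["Engagement", "Anticipation"],  # non-exclusive multi-label
    "vad": {"valence": 7, "arousal": 6, "dominance": 5},  # integers in 1-10
    "gender": "F",
    "age_group": "adult",
}

def vad_to_unit_range(v, lo=1, hi=10):
    """Map a VAD score in [lo, hi] to [-1, 1], one common normalization."""
    return 2 * (v - lo) / (hi - lo) - 1
```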

Annotation is performed by crowdworkers (typically via Amazon Mechanical Turk). For each person, annotators are presented with both the cropped bounding box and the whole-scene context; discrete categories are collected via checklist, and VAD scores via sliders or integer selection, with the number of annotators per instance varying by data split (see Section 3). Final labels are obtained by per-category voting (for discrete) and mean-aggregation (for VAD) (Ninh et al., 2023, Kosti et al., 2020).
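The aggregation step described above (per-category voting for discrete labels, mean for VAD) can be sketched as follows; the `min_votes` threshold is an illustrative assumption, since the exact voting rule varies by split:

```python
from collections import Counter

def aggregate_discrete(annotator_labels, min_votes=2):
    # annotator_labels: one category checklist per annotator.
    # A category survives if at least min_votes annotators selected it.
    votes = Counter(cat for labels in annotator_labels for cat in labels)
    return sorted(cat for cat, n in votes.items() if n >= min_votes)

def aggregate_vad(annotator_vads):
    # annotator_vads: one (V, A, D) tuple per annotator;
    # the ground truth is the per-dimension mean.
    n = len(annotator_vads)
    return tuple(sum(dim) / n for dim in zip(*annotator_vads))
```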

2. Emotion Representations and Label Distribution

The 26 discrete emotion categories are derived to represent a broad, nuanced taxonomy beyond basic Ekman emotions, encompassing both primary and social/complex states. The full category list includes: Affection, Anger, Annoyance, Anticipation, Aversion, Confidence, Disapproval, Disconnection, Disquietment, Doubt/Confusion, Embarrassment, Engagement, Esteem, Excitement, Fatigue, Fear, Happiness, Pain, Peace, Pleasure, Sadness, Sensitivity, Suffering, Surprise, Sympathy, Yearning (Ninh et al., 2023, Kosti et al., 2020).

Label distribution is highly imbalanced, with “Engagement” the most prevalent (≈55% of person-instances) and “Embarrassment” among the least common (≈1%), resulting in a long-tailed frequency structure (Ninh et al., 2023, Kosti et al., 2020). VAD annotations approximately uniformly cover the 1–10 integer range, but clustering around moderate (central) values is observed (Kosti et al., 2020).

| Category | Example Frequency (Person-Instances) |
|---|---|
| Engagement | ~28,000 |
| Confidence | ~22,000 |
| Excitement | ~20,000 |
| Embarrassment | ~1,500 |
| Pain | ~1,900 |

A plausible implication is that evaluation protocols and losses must explicitly address this class imbalance to avoid bias in multi-label settings.

3. Data Splits and Annotation Protocol

While some works treat EMOTIC as a single corpus, the canonical split (Kosti et al. 2019) is widely used and consists of:

  • Training Set: ≈18,300 person-instances (or ~12,821 images), annotated by a single worker per person.
  • Validation Set: ≈2,860 person-instances (or ~2,350 images), each labeled by five independent annotators for inter-rater agreement estimation.
  • Test Set: ≈6,900 person-instances (or ~4,700 images), with three annotators per person; this split is held out for standardized benchmarking and challenge submissions (Ninh et al., 2023, Costa et al., 2023, Mehra et al., 8 Feb 2025, Kosti et al., 2020).

Quality control includes annotator qualification tests, in-batch sentinels, and outlier filtering. Ground-truth discrete labels are set by union or mode across annotators; for VAD, the (integer or continuous) mean is used (Kosti et al., 2020).

4. Evaluation Metrics and Loss Functions

The standard metric for discrete emotion prediction on EMOTIC is mean Average Precision (mAP) over the 26 categories. For each category $i$, the average precision $\mathrm{AP}_i$ is computed from its precision–recall curve; then

$$\mathrm{mAP} = \frac{1}{26} \sum_{i=1}^{26} \mathrm{AP}_i$$
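A minimal per-class AP and mAP computation consistent with this definition is sketched below (ranking-based AP, i.e., mean precision at each true positive):

```python
def average_precision(y_true, scores):
    # AP for one category: rank instances by descending score and
    # average the precision at each true-positive position.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_class_targets, per_class_scores):
    # Unweighted mean of AP over all categories (26 for EMOTIC).
    aps = [average_precision(t, s)
           for t, s in zip(per_class_targets, per_class_scores)]
    return sum(aps) / len(aps)
```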

For continuous VAD regression, mean absolute error (MAE) per dimension is used:

$$\mathrm{MAE} = \frac{1}{M} \sum_{j=1}^{M} \left| \hat{y}_j - y_j \right|$$

Some works also report the Jaccard Index for multilabel accuracy, Average Absolute Error (AAE), or use margin/Hinge/Smooth-L1 (Huber) variants for regression objectives (Ninh et al., 2023, Wang et al., 2023, Kosti et al., 2020, Costa et al., 2023).
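The MAE and Jaccard Index mentioned above are straightforward to state in code; this is a generic sketch, not a reproduction of any particular paper's evaluation script:

```python
def mae(preds, targets):
    # Mean absolute error over one VAD dimension.
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def jaccard(pred_cats, true_cats):
    # Multi-label set overlap: |intersection| / |union|.
    pred_cats, true_cats = set(pred_cats), set(true_cats)
    union = pred_cats | true_cats
    return len(pred_cats & true_cats) / len(union) if union else 1.0
```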

Weighted Euclidean loss ($L^{\mathrm{disc}}$) for the discrete branch uses per-class weights $w_i = 1 / \ln(c + p_i)$, where $p_i$ is the class frequency and $c$ is a smoothing hyperparameter, enforcing greater penalty for underrepresented classes (Ninh et al., 2023, Costa et al., 2023).
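The weighting scheme is a one-liner; the value of the smoothing hyperparameter below is an illustrative choice, not one fixed by the dataset:

```python
import math

def class_weights(class_freqs, c=1.2):
    # w_i = 1 / ln(c + p_i): rare classes (small p_i) get larger weights.
    # c = 1.2 is an illustrative smoothing value, not a canonical setting.
    return {name: 1.0 / math.log(c + p) for name, p in class_freqs.items()}
```

With the long-tailed EMOTIC frequencies, a rare class like Embarrassment receives a substantially larger weight than Engagement, which is the point of the scheme.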

5. Impact on Affective Computing and Model Development

The EMOTIC dataset enables and benchmarks a range of architectures that incorporate both personal and contextual cues. Baseline systems typically employ dual-branch convolutional networks, processing person and scene context separately before fusion (Kosti et al., 2020). Recent advances include:

  • Multi-branch architectures leveraging face, body, and context branches independently (input sizes e.g., 224×224 for context/body, 48×48 for face crops), supporting robust multi-cue fusion (Ninh et al., 2023).
  • Context-aware models integrating high-level semantic scene encoders (e.g., ViT-style and GIN context encoders) and object/attribute relationships to boost performance on ambiguous cases (Wang et al., 2023, Costa et al., 2023).
  • Pose/depth augmentation for handling occlusions and partial person visibility, often employing graph-based methods (e.g., spatial–temporal GCNs) and EmbraceNet-style random-modality fusion (Wang et al., 2023).
  • Loss weighting and ablation for addressing severe class imbalance and quantifying the contributions of body, face, context, scene, and depth streams.
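The dual-branch baseline pattern (a person branch and a context branch fused before two prediction heads) can be sketched with stand-in linear encoders. Real systems use CNN or ViT encoders; the projections, dimensions, and function names here are purely illustrative:

```python
import random

def linear_branch(x, out_dim, seed):
    # Stand-in for a learned encoder: a fixed random linear projection.
    rng = random.Random(seed)
    W = [[rng.uniform(-1, 1) for _ in x] for _ in range(out_dim)]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dual_branch_predict(person_feats, context_feats):
    fp = linear_branch(person_feats, 8, seed=0)     # person/body branch
    fc = linear_branch(context_feats, 8, seed=1)    # whole-scene context branch
    fused = fp + fc                                 # concatenation fusion
    disc_logits = linear_branch(fused, 26, seed=2)  # 26 discrete categories
    vad_preds = linear_branch(fused, 3, seed=3)     # valence, arousal, dominance
    return disc_logits, vad_preds
```

The two heads share the fused representation, mirroring how EMOTIC baselines train the discrete and continuous objectives jointly.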

An explicit finding is that context features (global scene, object semantics, depth) confer the largest mAP gains (+5–7%) over body/face-only models, validating the psychological insight that emotion perception in real-world images crucially depends on situational context (Wang et al., 2023).

Quantitative performance has advanced from mean AP of 27.38% (body+context CNN, Kosti et al. 2019) to over 40% with modern multi-modal and scene-semantic architectures (Wang et al., 2023, Ninh et al., 2023).

6. Specialized Applications and Benchmark Usage

EMOTIC is routinely used to benchmark:

  • Emotion recognition models: discrete category multi-label and VAD regression, in person-localized and context-integrated settings (Ninh et al., 2023, Wang et al., 2023, Costa et al., 2023).
  • Vision-language systems: recent works investigate the capacity of LLMs to infer emotions from structured VAD values (e.g., using EMOTIC to map valence/arousal to category labels or semantic captions) (Mehra et al., 8 Feb 2025).
  • Real-time and on-device inference: efficient single-stream context models achieve >90 frames/sec on consumer GPUs, supporting edge deployment scenarios (Costa et al., 2023).

Notably, LLMs exhibit poor discrete category mapping from VAD alone but perform well on free-text affective description when conditioned solely on VAD (Mehra et al., 8 Feb 2025). A plausible implication is that nuanced emotion understanding requires richer multimodal input than raw VAD, but VAD still serves as a privacy-preserving input for generative affective description.

7. Access, Licensing, and Extensions

The EMOTIC dataset and baseline code are open-sourced for non-commercial research via the project repository (http://sunai.uoc.edu/emotic/, https://github.com/rkosti/emotic) (Kosti et al., 2020). Community extensions include fine-tuning splits, specialized context encoders, and expanded annotation studies in combination with other emotion recognition resources.

A plausible implication is that EMOTIC will remain central to advancing context-rich computer vision models for affective computing, and its annotation schema serves as a template for future datasets that capture the nuanced, socially embedded nature of emotion "in the wild."
