
GoEmotions: Fine-Grained Emotion Dataset

Updated 2 February 2026
  • GoEmotions is a manually annotated dataset comprising 58,009 Reddit comments labeled across 27 discrete emotion categories plus a neutral class.
  • The dataset uses a rigorous three-rater annotation protocol with majority voting, ensuring high retention and consistency despite class imbalances.
  • Its application in transfer learning, data augmentation, and neuroanatomical mapping has established it as a benchmark in fine-grained emotion detection research.

GoEmotions is a large-scale, manually annotated dataset of English Reddit comments designed for fine-grained emotion detection and classification tasks in natural language processing. With 58,009 labeled examples spanning 27 discrete emotion categories plus Neutral, GoEmotions provides both the coverage and granularity required for rigorous multi-label and transfer learning benchmarks. This resource has supported diverse research across emotion taxonomy evaluation, neural modeling, transfer protocols, and even mapping language-derived emotional content to neuroanatomical regions.

1. Dataset Composition and Emotion Taxonomy

GoEmotions contains 58,009 Reddit comments (sampled from popular English subreddits circa 2019), curated via length filtering, de-duplication, and light cleaning procedures (Demszky et al., 2020). Each comment is annotated for 27 fine-grained emotions—admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise—plus a Neutral label serving as a catch-all for non-emotional text (Demszky et al., 2020, Singh et al., 2021, Lecourt et al., 5 Mar 2025). These categories are grounded in established psychological taxonomies (Cowen & Keltner, 2017; Plutchik; Ekman), with annotation definitions explicitly provided to raters.

Label distribution exhibits pronounced class imbalance: Neutral is by far the most frequent label, followed by high-frequency categories such as admiration, approval, and gratitude, whereas grief, pride, relief, and nervousness are rare (Demszky et al., 2020). The multi-label schema permits annotators to select any number of labels per instance: roughly 83% of examples carry a single label, 15% two, 2% three, and 0.2% four or more (Singh et al., 2021, Alvarez-Gonzalez et al., 2021).

2. Annotation Protocols and Quality Measures

Annotation was crowdsourced, with every comment assigned to three independent raters. Each annotator selects all applicable emotions, or Neutral if none apply. Labels are finalized per category by majority (≥2/3) vote (Demszky et al., 2020). Comments on which no label reaches agreement are discarded, yielding high retention (94%) and effective redundancy (Alvarez-Gonzalez et al., 2021). Short definitions and examples for every emotion are shown to annotators, standardizing interpretation (Singh et al., 2021).
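The majority rule described above can be sketched directly; the rater sets below are illustrative toy data, and the released aggregation scripts may differ in detail:

```python
from collections import Counter

def aggregate_labels(rater_selections, min_votes=2):
    """Keep each emotion chosen by at least `min_votes` of the raters;
    return None when no label reaches the threshold (such comments
    are discarded from the corpus)."""
    votes = Counter(label
                    for selection in rater_selections
                    for label in set(selection))
    agreed = sorted(label for label, n in votes.items() if n >= min_votes)
    return agreed or None

# Three raters annotate one comment, all-that-apply:
raters = [{"joy", "admiration"}, {"joy"}, {"joy", "gratitude"}]
print(aggregate_labels(raters))  # ['joy']
```

Because each annotator picks a set of labels, the ≥2/3 rule naturally produces multi-label outputs when two raters share more than one emotion.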

Inter-annotator agreement is moderate: pairwise Cohen’s κ ≈ 0.44, Krippendorff’s α ≈ 0.31, and Fleiss’ κ ≈ 0.33–0.38 across labels (Demszky et al., 2020, Lecourt et al., 5 Mar 2025). Principal Preserved Component Analysis (PPCA) reveals that annotation structure is stable across rater groups: e.g., preserved variance R₁ > 0.82, R₅ > 0.93 for top principal directions (Demszky et al., 2020). Multi-label co-occurrence patterns indicate substantial overlap within affective clusters (e.g., Joy/Admiration/Amusement) (Alvarez-Gonzalez et al., 2021).
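Pairwise Cohen's κ for a single binary label is observed agreement corrected for chance agreement; the toy votes below are illustrative, not dataset statistics:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two raters' binary votes on one label.
    Chance agreement is estimated from each rater's marginal rate of
    choosing the label."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    if p_chance == 1.0:  # degenerate case: both raters are constant
        return 1.0
    return (p_obs - p_chance) / (1 - p_chance)

# Two raters agree on 8 of 10 comments for one label:
r1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
r2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
print(round(cohens_kappa(r1, r2), 3))  # 0.583
```

Note how 80% raw agreement shrinks to κ ≈ 0.58 once chance agreement on a sparse label is discounted, which is why moderate κ values are typical for fine-grained emotion annotation.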

Profanity, names, and religion terms are masked in the corpus, with non-English or ambiguous comments removed (Alvarez-Gonzalez et al., 2021, Lecourt et al., 5 Mar 2025). Comments are truncated to fit model input bounds where required.

3. Experimental Protocols, Baseline Models, and Evaluation Metrics

GoEmotions underpins a variety of modeling paradigms, including classical baselines (bag-of-words or TF-IDF features with logistic regression or random forests), neural methods (FastText embeddings with DNNs or bi-LSTMs), and transformer architectures (BERT, RoBERTa) (Alvarez-Gonzalez et al., 2021, Alaeddini, 2024, Wang et al., 2024). The canonical baseline is BERT-base-uncased, using the [CLS] token representation and a linear output layer spanning all 28 classes (Demszky et al., 2020).

Datasets are typically split with an 80/10/10 train/validation/test ratio. No label-level balancing is performed in the original splits; experiments on class-imbalance mitigation have employed oversampling, selective augmentation, and synthetic data infusion (Su et al., 18 Nov 2025, Wang et al., 2024).
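A minimal sketch of such a split follows; the seed and shuffling are illustrative, and for comparability the canonical released splits should be used rather than re-splitting:

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle and split into train/validation/test at an 80/10/10 ratio;
    no label-level balancing is applied, matching the original splits."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_80_10_10(list(range(58009)))
print(len(train), len(val), len(test))  # 46407 5800 5802
```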

Evaluation relies on the multi-label binary cross-entropy loss

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N\sum_{c=1}^C\Big[y_{i,c}\ln\sigma(z_{i,c}) + (1-y_{i,c})\ln\big(1-\sigma(z_{i,c})\big)\Big]$$

with per-label thresholding, where $z_{i,c}$ is the logit for example $i$ and label $c$. Reported metrics include macro and micro F1, precision, recall, and accuracy (multi-label or single-label, depending on mapping), with formulas detailed in the source papers (Demszky et al., 2020, Alaeddini, 2024, Alvarez-Gonzalez et al., 2021, Lecourt et al., 5 Mar 2025).
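A literal implementation of this loss and of per-label thresholding; the logits and thresholds below are illustrative values, not tuned ones:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(logits, targets):
    """Binary cross-entropy summed over C labels and averaged over N
    examples, matching the loss formula above."""
    total = 0.0
    for z_row, y_row in zip(logits, targets):
        for z, y in zip(z_row, y_row):
            p = sigmoid(z)
            total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(logits)

def predict(logits, thresholds):
    """Per-label thresholding on sigmoid probabilities."""
    return [[int(sigmoid(z) >= t) for z, t in zip(z_row, thresholds)]
            for z_row in logits]

logits = [[2.0, -1.0, 0.2], [-0.5, 1.5, -2.0]]   # N=2 examples, C=3 labels
targets = [[1, 0, 0], [0, 1, 0]]
print(round(multilabel_bce(logits, targets), 4))    # 1.0204
print(predict(logits, thresholds=[0.5, 0.5, 0.3]))  # [[1, 0, 1], [0, 1, 0]]
```

Each label gets an independent sigmoid, so a comment can exceed the threshold for several emotions at once, which is what makes the task multi-label rather than softmax classification.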

BERT-based baselines reach 0.46 micro-F1 and 0.43 macro-F1 across 28 labels (Demszky et al., 2020). Classical ML baselines (TF-IDF + LR) achieve nearly 0.53 micro-F1, evidencing strong lexical signal (Alvarez-Gonzalez et al., 2021). Augmentation and architectural variants—multi-task heads (CDP, MLM), minority-class boosting, cascaded bi-LSTMs—push macro-F1 above 0.52 (Singh et al., 2021, Wang et al., 2024).
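Macro-F1 averages per-label F1 so that rare classes weigh equally, while micro-F1 pools counts over all labels; a minimal sketch on toy indicator matrices:

```python
def f1_scores(y_true, y_pred):
    """Macro- and micro-averaged F1 from multi-label 0/1 indicator
    matrices (rows = examples, columns = labels)."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[c] += 1
            elif p:
                fp[c] += 1
            elif t:
                fn[c] += 1

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(n_labels)) / n_labels
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro

y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [1, 1], [1, 0]]
macro, micro = f1_scores(y_true, y_pred)
print(round(macro, 3), round(micro, 3))  # 0.733 0.75
```

On skewed data the two diverge: a model that ignores rare labels can keep micro-F1 high while macro-F1 collapses, which is why both are reported.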

4. Methods for Imbalance, Data Extension, and Transfer Learning

GoEmotions’ long-tail distribution necessitates explicit approaches for rare emotion detection. Targeted data balancing merges the original corpus with Sentiment140-labeled tweets (using a RoBERTa classifier) and GPT-4 mini-generated synthetic samples, selected and audited via classifier confidence and manual review. Coverage is equalized to ≈4,013 samples per class, greatly increasing minority-class performance (e.g., “grief” F1 from 0.45 → 0.67) and improving overall micro-F1 by 12 percentage points (Su et al., 18 Nov 2025).
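The balancing step alone can be sketched as oversampling or downsampling toward a per-class target; the class map below is toy data, and the cited pipeline additionally draws on classifier-filtered tweets and audited synthetic samples rather than simple duplication:

```python
import random

def equalize_classes(by_label, target_per_class, seed=0):
    """Downsample frequent classes and oversample (with replacement)
    rare ones so each label has exactly `target_per_class` examples.
    `by_label` maps label -> list of texts."""
    rng = random.Random(seed)
    balanced = {}
    for label, items in by_label.items():
        if len(items) >= target_per_class:
            balanced[label] = rng.sample(items, target_per_class)
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            balanced[label] = items + extra
    return balanced

corpus = {"grief": ["g1", "g2"], "joy": ["j1", "j2", "j3", "j4", "j5"]}
sizes = {k: len(v) for k, v in equalize_classes(corpus, 4).items()}
print(sizes)  # {'grief': 4, 'joy': 4}
```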

Augmentation methods explored include:

  • Duplication Data Augmentation (DDA): WordNet-based synonym replacement plus random swap and random deletion, following EDA (Wang et al., 2024).
  • Contextual Embedding Replacement: BERT MLM top-k token substitution.
  • Paraphrase-based ProtAugment: BART-based models for diverse sentence-level rephrasing.
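The EDA-style operations above can be sketched with the standard library; the tiny synonym table stands in for a WordNet lookup and is purely illustrative:

```python
import random

def random_swap(tokens, n_swaps, rng):
    """EDA random swap: exchange two random token positions n_swaps times."""
    out = tokens[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p, rng):
    """EDA random deletion: drop each token with probability p."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]  # never return an empty sentence

def synonym_replacement(tokens, synonyms, rng):
    """Replace each token that has an entry in `synonyms` (a stand-in
    for a WordNet lookup) with one of its listed synonyms."""
    return [rng.choice(synonyms[t]) if t in synonyms else t for t in tokens]

rng = random.Random(7)
sent = "i am so happy about this".split()
print(random_swap(sent, 1, rng))
print(random_deletion(sent, 0.2, rng))
print(synonym_replacement(sent, {"happy": ["glad", "joyful"]}, rng))
```

Applying such operations only to minority-class examples mirrors the selective-augmentation setting found to help most.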

Minority-only augmentation consistently improves F1 for low-frequency classes, whereas full-dataset amplification can worsen imbalance (Wang et al., 2024). Mixed precision training (float16 activations, float32 gradients/losses) accelerates experiments by ~1.8×, permitting larger batch sizes (Su et al., 18 Nov 2025).

Transfer learning protocols initialize on other emotion datasets (CARER, ISEAR, Emotion-Stimulus) before GoEmotions fine-tuning. These strategies yield 1–4 percentage point macro-F1 improvements over from-scratch baselines, especially for low-resource classes and benchmark adaptation (Singh et al., 2021, Wang et al., 2024, Demszky et al., 2020). BERT+MLM transfer is robust across domains (Singh et al., 2021).

5. Applications, Limitations, and Benchmark Impact

GoEmotions supports modeling for empathetic dialogue agents, content moderation, mental-health analysis, and social-science inquiry (Demszky et al., 2020, Alvarez-Gonzalez et al., 2021, Lecourt et al., 5 Mar 2025). Its granularity enables benchmarking across fine-grained categories, label remapping (e.g., to Ekman's six basic emotions), and extension to neuroanatomical mapping frameworks (Vos et al., 12 Aug 2025).

Limitations include pronounced class imbalance, only moderate inter-annotator agreement, and the Reddit-specific domain and register. Macro-F1 metrics emphasize rare classes, revealing gaps not visible in micro-F1. Zero-shot LLMs (GPT family) underperform fine-tuned classifiers: a fine-tuned BERT reaches 52.75% macro-F1 versus 25.55% for ChatGPT zero-shot (Lecourt et al., 5 Mar 2025). Prompt engineering marginally improves LLM performance, but supervised fine-tuning remains necessary for competitive fine-grained emotion detection.

6. Neuroanatomical and Cognitive Modeling with GoEmotions

GoEmotions has underpinned recent research linking text-derived emotional content to brain regions (Vos et al., 12 Aug 2025). Comments are chunked, embedded (OpenAI text-embedding-ada-002, 1,536 dimensions), reduced via PCA, clustered (K-means, 18 clusters), and greedily mapped to 18 predefined emotion-related regions (Montreal Neurological Institute coordinates). Mean “emotional intensity” per label (lexicon scores, 0.1–2.0 scale) ranks Love (0.709), Joy (0.593), Relief (0.560), Sadness (0.486), Fear (0.412), and Anger (0.390) from highest to lowest (Vos et al., 12 Aug 2025). These gradients align with meta-analytic neuroimaging findings: Love correlates with ventral tegmental area and anterior cingulate activation, whereas Fear and Disgust align with right-hemispheric structures.
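One plausible reading of the greedy mapping step is to repeatedly pair the globally closest still-unassigned cluster centroid and region coordinate; the 2-D points below are toy values, not the study's MNI coordinates:

```python
import math

def greedy_map(cluster_centroids, region_coords):
    """Greedy one-to-one assignment of cluster centroids to region
    coordinates: sort all (distance, cluster, region) pairs and take
    each pair whose cluster and region are both still unassigned."""
    pairs = sorted(
        (math.dist(c, r), ci, ri)
        for ci, c in enumerate(cluster_centroids)
        for ri, r in enumerate(region_coords)
    )
    assigned_c, assigned_r, mapping = set(), set(), {}
    for _, ci, ri in pairs:
        if ci not in assigned_c and ri not in assigned_r:
            mapping[ci] = ri
            assigned_c.add(ci)
            assigned_r.add(ri)
    return mapping

clusters = [(0.0, 0.0), (5.0, 5.0)]
regions = [(4.0, 4.5), (0.5, -0.2)]
print(greedy_map(clusters, regions))  # {0: 1, 1: 0}
```

An optimal assignment (e.g., the Hungarian algorithm) could replace the greedy pass, at higher cost; with 18 clusters and 18 regions either is cheap.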

LLM-generated texts match human distribution for “basic” emotions but show attenuated empathy and self-referential region activation (e.g., medial prefrontal cortex), yielding a brain-based benchmark for evaluating AI emotional expressiveness. The pipeline offers scalable, cost-effective means of analyzing naturalistic language and distinguishing clinical populations.

7. Prospects and Future Directions

Research continues to address class skew through augmentation, cost-sensitive loss functions, and advanced transfer learning (Wang et al., 2024, Su et al., 18 Nov 2025). Incorporating annotator-level uncertainty, emotion taxonomies, and hierarchical structures is advocated for more nuanced modeling (Singh et al., 2021). Prospective efforts include multimodal corpus extension, cross-domain adaptation, and formal statistical hypothesis testing for benchmark evaluations (Alaeddini, 2024, Lecourt et al., 5 Mar 2025).

GoEmotions remains an authoritative benchmark for multi-label emotion detection, supporting advances in model architecture, data balancing, transfer learning, and cognitive-affective mapping. Its design enables rigorous, reproducible experiments and sets a robust foundation for further work in affective computing and computational social science.
