
HICEM-15: Cross-Cultural Emotion Model

Updated 9 February 2026
  • HICEM-15 is a data-driven emotion model featuring 15 semantically orthogonal labels derived from unsupervised word embedding clustering.
  • The methodology combines FastText embeddings, UMAP dimensionality reduction, and agglomerative clustering to capture cross-lingual affective semantics.
  • Validation with large-scale human annotation datasets demonstrates HICEM-15’s high semantic coverage and efficient information recovery.

HICEM-15 is a data-driven, high-coverage model of discrete human emotions, optimized for cross-cultural artificial emotional intelligence (AEI) systems via unsupervised analysis of word embeddings. Designed to maximize semantic coverage while minimizing category count, HICEM-15 defines a set of 15 semantically orthogonal emotion labels, systematically validated for alignment across six major world languages and for fidelity to human-perceived affective distinctions in real-world affect annotation tasks (Wortman et al., 2022).

1. Model Definition and Taxonomy

HICEM-15 (HIgh-Coverage EMotion model, 15 components) is constructed to provide a minimal yet comprehensive basis for the annotation and computational modeling of human emotions. Its set of 15 summary categories—Neutral, Happiness, Sadness, Anger, Fear, Surprise, Pain, Pleasure, Annoyance, Confusion, Doubt, Discomfort, Awe, Enjoyment, and Bizarre—was determined by agglomerative clustering in word embedding space, with cultural alignment such that each concept can be realized across Arabic, Chinese, English, French, Spanish, and Russian. Table 1 presents the full cross-lingual mapping:

| English | Arabic | Chinese | French | Spanish | Russian |
|---------|--------|---------|--------|---------|---------|
| Neutral | حيادي | 中性 | Neutre | Neutral | Нейтрально |
| Happiness | سعادة | 幸福 | Joie | Felicidad | Счастье |
| Sadness | حزن | 悲伤 | Tristesse | Tristeza | Печаль |
| Anger | غضب | 愤怒 | Colère | Ira | Гнев |
| Fear | خوف | 害怕 | Peur | Miedo | Страх |
| Surprise | دهشة | 惊讶 | Surprise | Sorpresa | Удивление |
| Pain | ألم | 疼痛 | Douleur | Dolor | Боль |
| Pleasure | متعة | 愉悦 | Plaisir | Placer | Удовольствие |
| Annoyance | إزعاج | 烦恼 | Agacement | Fastidio | Раздражение |
| Confusion | ارتباك | 困惑 | Confusion | Confusión | Замешательство |
| Doubt | شك | 怀疑 | Doute | Duda | Сомнение |
| Discomfort | انزعاج | 不适 | Malaise | Malestar | Дискомфорт |
| Awe | رهبة | 敬畏 | Émerveillement | Asombro | Трепет |
| Enjoyment | استمتاع | 享受 | Bonheur | Disfrute | Наслаждение |
| Bizarre | غريب | 奇怪 | Bizarre | Extraño | Странный |

These labels are not intended to represent a theory of emotion but function as empirically grounded semantic centroids in affective space (Wortman et al., 2022).

2. Construction Pipeline: Embedding, Dimensionality Reduction, and Clustering

The HICEM-15 set is derived via the following unsupervised pipeline:

  • Emotion-Concept List Construction: Begins with 1,720 English emotion-concept words (from sources such as Ekman, Plutchik, and online affective lexicons). Expansion uses pre-trained Word2Vec to identify semantic neighbors, followed by manual pruning to exclude adverbs, non-affective terms, and archaisms.
  • Embedding and UMAP Reduction: All words are embedded with FastText (300d, trained on CommonCrawl and Wikipedia). The UMAP (Uniform Manifold Approximation and Projection) algorithm reduces the space to two dimensions, using a cosine metric, $n_{\mathrm{neighbors}} \approx 15$, and $\mathrm{min\_dist} \approx 0.1$, to preserve semantic-similarity and antonymy distinctions.
  • Agglomerative Clustering: Ward-linkage clustering in UMAP space identifies semantically compact clusters, with the distortion "elbow" method applied to select $k$. Individual-language results consistently indicate $k \approx 11$–$15$ as optimal.
  • Cross-Lingual Aggregation: The entire process is repeated post-translation in each target language, followed by global reclustering of the top centroids (drawn from the top-50 summary words per language) to $k = 15$. The resulting centroids yield the final HICEM-15 summary terms in each language (Wortman et al., 2022).
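The pipeline above can be sketched on synthetic data. This is an illustrative minimal version, not the released code: the vectors are random stand-ins for FastText embeddings, PCA substitutes for UMAP (to avoid the `umap-learn` dependency), and names like `concept_vecs` are assumptions.

```python
# Minimal sketch of the embed -> reduce -> cluster pipeline on random
# 300-d vectors standing in for FastText concept embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
concept_vecs = rng.normal(size=(200, 300))  # stand-in for the pruned concept list

# 2-D reduction; the paper uses UMAP (cosine metric, n_neighbors ~ 15,
# min_dist ~ 0.1) -- PCA here is only a dependency-light placeholder.
reduced = PCA(n_components=2).fit_transform(concept_vecs)

# Ward-linkage agglomerative clustering at the chosen k.
k = 15
labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(reduced)

# A summary word per cluster: the member nearest its cluster centroid.
members = [np.flatnonzero(labels == c) for c in range(k)]
centroids = np.stack([reduced[m].mean(axis=0) for m in members])
summary_idx = [m[np.argmin(np.linalg.norm(reduced[m] - centroids[c], axis=1))]
               for c, m in enumerate(members)]
```

In the real pipeline, `summary_idx` would index back into the vocabulary to produce one summary term per cluster.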

3. Semantic Coverage and Information Recovery Metrics

Model evaluation emphasizes two primary quantitative metrics:

  • Average Coverage (AvgCov): For a model label set $M$ and the set of all embedded concepts $W$, the average cosine similarity is computed from each $w \in W \setminus M$ to its nearest $m \in M$:

$$\mathrm{AvgCov}(M) = \frac{1}{n}\sum_{w \in W \setminus M} \max_{m \in M} \mathrm{CosSim}(w, m),$$

where $n = |W \setminus M|$.

Aggregated across the six languages, HICEM-15 achieves $\mathrm{AvgCov}_{\text{total}} = 0.416$, which is comparable to Cowen (27 labels) and GoEmotions (28), but with roughly half the number of categories.

  • Recoverable Information (AvgRec): Measures fidelity to human annotation by simulating an oracle projecting ground-truth instance embeddings into HICEM-15’s space, then reconstructing the original embedding via a learned regression function. The metric is the average cosine similarity between ground-truth and reconstructed vectors:

$$\mathrm{AvgRec}(M) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{CosSim}\big(w_i,\; G(\mathrm{CosSim}(w_i, M))\big)$$

HICEM-15 yields $\mathrm{AvgRec} = 0.464$, outperforming random 15-category selections and approaching models with notably more labels. Plutchik-32 achieves the highest values, but at the cost of increased annotation complexity (see the table below).
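Both metrics can be illustrated on toy data. The sketch below assumes unit-normalized embedding rows; the set sizes, the ridge `alpha`, and names like `W` and `M` are illustrative choices, not taken from the paper's implementation.

```python
# Toy AvgCov and AvgRec on synthetic unit vectors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

W = unit(rng.normal(size=(60, 16)))   # all embedded concept vectors
M = W[:15]                            # pretend the first 15 are the model labels
rest = W[15:]                         # W \ M

# AvgCov: mean cosine similarity of each remaining concept to its nearest
# label (for unit vectors, the dot product is the cosine similarity).
avg_cov = (rest @ M.T).max(axis=1).mean()

# AvgRec: project instances onto label similarities, learn a ridge map G
# back to the original space, and score reconstruction by mean cosine.
proj = rest @ M.T                        # CosSim(w_i, M) similarity profiles
G = Ridge(alpha=1.0).fit(proj, rest)     # learned reconstruction function
rec = unit(G.predict(proj))
avg_rec = np.mean(np.sum(rest * rec, axis=1))
```

With real embeddings, a higher `avg_rec` indicates that the 15-dimensional similarity profile retains more of the information in the original vectors.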

| Model | #Labels | AvgCov (total) | AvgRec (total) |
|-------|---------|----------------|----------------|
| Ekman | 7 | 0.314 | 0.327 |
| Plutchik | 32 | 0.428 | 0.552 |
| HICEM-15 | 15 | 0.416 | 0.464 |
| HICEM-25 | 25 | 0.444 | 0.528 |

4. Cross-Lingual and Cross-Cultural Alignment

Cross-lingual methodology ensures each centroid corresponds to a semantically stable affective direction in embedding space across the six target languages. The initial English master list is machine-translated, embedded with native-language FastText models, UMAP-reduced, and then clustered analogously to the English pipeline. The top-50 summary labels are pooled and globally reclustered to define k=15 centroids, yielding a final set of semantically equivalent affective categories attested in all six languages. Only post hoc manual filtering for rare or archaic terms is performed, with semantic directions otherwise algorithmically anchored (Wortman et al., 2022).
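The aggregation step can be sketched as follows. The pooled vectors here are random stand-ins for the per-language top-50 summary-word positions, assumed already projected into a shared reduced space.

```python
# Sketch of cross-lingual aggregation: pool the top-50 summary vectors
# from each of six languages and recluster the pool to k = 15.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
langs = ["en", "ar", "zh", "fr", "es", "ru"]
pooled = np.concatenate([rng.normal(size=(50, 2)) for _ in langs])  # 300 x 2

labels = AgglomerativeClustering(n_clusters=15, linkage="ward").fit_predict(pooled)
# Each final cluster then contributes one summary term per language,
# producing a cross-lingual mapping like Table 1.
```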

5. Validation with Large-Scale Human Annotation Datasets

Empirical validation uses two major corpora:

  • BoLD (Body Language Dataset): ≈20,000 video clips with human annotations for 26 emotions plus Valence–Arousal–Dominance (VAD) dimensions.
  • EMOTIC: ≈34,000 images with the same categorical and VAD labels.
Recoverable information is computed as an upper bound, using ground-truth instance labels and ridge regression in the HICEM-15 embedding; the results indicate that 15 categories support information-recovery levels well beyond random baselines, approaching those of much larger discrete sets. Moreover, projecting dataset annotations onto HICEM-15 and then into VAD space reproduces the expected psychological structure (e.g., Russell's circumplex), with valence as the principal axis and dominance strongly correlated with valence.
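The VAD projection check can be sketched with a least-squares linear map; the data below is synthetic, and the map `B` is an illustrative stand-in for whatever regression the validation actually used.

```python
# Sketch: map per-instance HICEM-15 category scores into valence-
# arousal-dominance (VAD) with a least-squares linear map, then inspect
# the correlation between the valence and dominance axes.
import numpy as np

rng = np.random.default_rng(4)
scores = rng.random(size=(500, 15))   # annotation strength per category
vad = rng.random(size=(500, 3))       # ground-truth VAD ratings in [0, 1]

# Least-squares linear map from category space to VAD.
B, *_ = np.linalg.lstsq(scores, vad, rcond=None)
vad_hat = scores @ B

# Correlation between the predicted valence (axis 0) and dominance (axis 2).
r = np.corrcoef(vad_hat[:, 0], vad_hat[:, 2])[0, 1]
```

On the real BoLD/EMOTIC annotations, this correlation is what motivates treating dominance as largely redundant with valence.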

6. Implications and Application Domains

HICEM-15 is directly applicable to AEI (artificial emotional intelligence) for next-generation tasks requiring interpretability and efficiency:

  • Annotation Efficiency and Disagreement Reduction: By identifying a minimal, non-redundant set of categories, HICEM-15 minimizes both labelling costs and inter-annotator disagreement.
  • Cross-Cultural Deployment: Semantic centroids defined in six major languages support transfer learning and robust affect annotation in global systems.
  • Modularity and Hierarchy: The underlying clustering admits extensions (e.g., HICEM-25, HICEM-30) for domains requiring additional granularity.
  • Affective Computing Applications: Immediate uses include social robotics (affect tagging), human–machine dialogue systems (empathetic response), and digital phenotyping in mental healthcare.
  • Model Complementarity: HICEM-15's discrete labels should be paired with continuous valence–arousal embeddings, since dominance is highly correlated with valence in the data.

7. Comparative Position among Emotion Models

HICEM-15 improves on older emotion models by optimizing the trade-off between semantic coverage and category count. Unlike Ekman's and Plutchik's paradigms, HICEM-15 categories are empirically derived, embedding-grounded, language-independent centroids. The model outperforms random category subsets and achieves competitive coverage and information recovery vis-à-vis much larger models, while retaining the interpretability and efficiency crucial for scalable affective annotation (Wortman et al., 2022).
