Graph-Based Emotion Prediction

Updated 31 December 2025
  • Graph-based emotion prediction is a research area that employs graph representations and GNNs to capture local and global emotional dependencies across modalities like text, audio, and social signals.
  • It utilizes diverse graph construction strategies—ranging from conversational line graphs to dynamic EEG and facial landmark graphs—to encode semantic, temporal, and relational cues for effective emotion recognition.
  • State-of-the-art results on benchmarks such as IEMOCAP and MELD highlight its potential in advancing applications in mental health, human–robot interaction, and social network analytics.

Graph-based emotion prediction encompasses a body of research exploiting graph representations and Graph Neural Network (GNN) architectures to model, infer, and classify affective states from diverse data modalities including text, audio, video, physiological signals, and social relationships. The paradigm leverages explicit graph structures to encode local and global relational dependencies—temporal, interpersonal, contextual, and multimodal—that are intrinsic to emotional phenomena in conversation, social contexts, brain signals, and bodily expressions. Recent advances have demonstrated substantial gains over sequence-based and hand-crafted approaches, achieving new state-of-the-art results in Emotion Recognition in Conversation (ERC), speech emotion recognition, multimodal fusion, physiological emotion decoding, and social network emotion propagation. This article systematically reviews graph construction strategies, architectural patterns, learning principles, experimental protocols, and current limitations as exemplified in leading research (Krishnan et al., 2023, Khalid et al., 2022, Wang et al., 2014, Meng et al., 2024, Shirian et al., 2020, Zha et al., 2024, Ghosal et al., 2019, Li et al., 2022, Xu et al., 2020, Liu et al., 2024, Bhattacharyya et al., 13 Jan 2025, Shirian et al., 2020, Li et al., 2022, Narayanan et al., 2020, Liang et al., 2020, Maji et al., 28 Aug 2025, Saravia et al., 2018, Nguyen et al., 2022).

1. Graph Construction Strategies Across Modalities

The foundational step in graph-based emotion prediction is the definition of nodes, edges, and edge types that encode the relevant semantic, syntactic, social, or physiological relationships. In conversational emotion recognition, nodes are typically utterances; edges model local context (adjacent utterances) or speaker interactions. In the LineConGraph framework, each conversation is rendered as a line graph G = (V, E) with utterance nodes u_i and edges (u_i, u_{i+1}) plus self-loops, enforcing a short-term, speaker-independent context window (Krishnan et al., 2023).
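The line-graph construction above can be sketched in a few lines; the function name is illustrative, not from the cited paper:

```python
# Sketch of a LineConGraph-style edge set: each utterance is a node, edges
# connect adjacent utterances, and every node receives a self-loop.

def build_line_graph(num_utterances: int) -> set:
    """Return the edge set {(u_i, u_i)} ∪ {(u_i, u_{i+1})}."""
    edges = {(i, i) for i in range(num_utterances)}           # self-loops
    edges |= {(i, i + 1) for i in range(num_utterances - 1)}  # adjacent context
    return edges

# A 4-utterance conversation yields 3 context edges plus 4 self-loops.
edges = build_line_graph(4)
```

Because the context window is fixed and speaker-independent, the same construction applies unchanged to any conversation length.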

For speech signals, utterances are partitioned into frames; nodes correspond to frames and edges follow a cycle or line topology, enabling graph convolution over temporal structure (Shirian et al., 2020). In EEG emotion recognition, nodes are recording channels and edges encode physical distance, functional connectivity, or learned attention between regions, with graphs often made dynamic across time windows (Liu et al., 2024). Gait-based emotion systems build skeleton graphs with joints as nodes and biomechanics as edges (Narayanan et al., 2020), while facial expression recognition relies on landmark graphs (nodes = landmarks, edges = spatial proximity or anatomical relations) with hierarchical region coarsening (Maji et al., 28 Aug 2025).

Social emotion contagion is modeled with users as nodes and weighted social ties (calls, messages, co-occurrences) as edges, capturing the propagation of affective states through networks (Khalid et al., 2022, Wang et al., 2014, Bhattacharyya et al., 13 Jan 2025). Text emotion graphs can exploit words as nodes, co-occurrences or syntactic templates as edges, or model higher-order dependencies via multi-layered networks comprising hashtags, keywords, and tweets (Nguyen et al., 2022, Saravia et al., 2018).
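As a minimal illustration of the word-as-node construction (assumed for exposition, not taken from the cited papers), co-occurrence edges can be collected with a sliding window over tokens:

```python
# Words are nodes; an undirected, weighted edge is added whenever two words
# appear within the same fixed-size window of a tokenized text.
from collections import Counter

def cooccurrence_edges(tokens, window=2):
    """Count co-occurring word pairs within `window` tokens to the right;
    pairs are sorted so the resulting graph is undirected."""
    edges = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                edges[tuple(sorted((w, v)))] += 1
    return edges

edges = cooccurrence_edges("so happy so very happy".split())
```

Edge weights from repeated co-occurrence then serve as the adjacency input to downstream GNN layers.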

Edge types vary by task and modality. Speaker-aware ERC embeddings include relational labels for speaker combinations; multimodal approaches encode modality pairs and intra-modal temporal dependencies; certain frameworks enrich edges with external resources (commonsense, emotional states, or physical features) (Zha et al., 2024, Liang et al., 2020).

2. Core Graph Neural Architectures and Learning Principles

Graph-based emotion models deploy a variety of GNN layers to aggregate information. The most prevalent patterns are:

  • Graph Convolutional Networks (GCN): Each layer propagates features via normalized adjacency matrices, either for node classification (utterance/segment-level) or graph classification (dialogue-level). Many models use the first-order spectral propagation X^{(l+1)} = \hat{A} X^{(l)} W^{(l)}, with the dialog graph A encoding context (Krishnan et al., 2023, Shirian et al., 2020, Ghosal et al., 2019, Liu et al., 2024). Exact spectral convolution can leverage fast transforms when graph topology is regular (line or cycle), yielding computational efficiency (Shirian et al., 2020).
  • Graph Attention Networks (GAT): Attention coefficients \alpha_{ij} dynamically re-weight neighbors per node and edge, computed via trainable functions of node features and optionally edge features (relation, sentiment, modality) (Krishnan et al., 2023, Zha et al., 2024, Li et al., 2022, Li et al., 2022). Multi-head, residual, and layer-aggregated GATs are adopted for rich feature fusion and over-smoothing mitigation (Li et al., 2022, Li et al., 2022).
  • Relational GCN: Edges carry types for distinct context/speaker relations (e.g., inter-/intra-speaker), each with its own trainable linear transformation (Ghosal et al., 2019).
  • EdgeConv and Hierarchical Pooling: For structured domains (faces, skeletons), EdgeConv combines local feature differences for equivariant modeling. Hierarchical region pooling over quotient graphs further distills high-level representations (Maji et al., 28 Aug 2025). Learnable pooling functions and graph-level embeddings enable robust graph classification (Shirian et al., 2020).
  • Graph Spectrum and Frequency-Domain Operators: GS-MCC introduces graph-Fourier operators to explicitly disentangle low-frequency (consistency) and high-frequency (complementarity) components, leveraging filtered Laplacians and contrastive learning to jointly optimize collaboration between them (Meng et al., 2024).
  • Dynamic/Learnable Graphs: L-GrIN learns adjacency matrices concurrently with classification, permitting adaptation to modality and sample-specific structure (Shirian et al., 2020, Liu et al., 2024).
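The first-order GCN propagation named in the list above can be sketched in numpy; the symmetric normalization and ReLU choice are standard but illustrative:

```python
# Minimal sketch of X^{(l+1)} = Â X^{(l)} W^{(l)}, where Â is the
# symmetrically normalized adjacency with self-loops added.
import numpy as np

def gcn_layer(A: np.ndarray, X: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # Â = D^{-1/2}(A + I)D^{-1/2}
    return np.maximum(A_norm @ X @ W, 0.0)      # ReLU nonlinearity

# 3-utterance line graph, 4-dim input features, 2 output channels
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(3, 4))
W = np.random.default_rng(1).normal(size=(4, 2))
H = gcn_layer(A, X, W)   # shape (3, 2)
```

For the line or cycle topologies noted above, A is banded, which is what makes the fast-transform implementations of the spectral convolution possible.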

3. Emotion Prediction Workflows and Integration of Context

Prediction pipelines operate at the node or graph level, commonly via a categorical cross-entropy classification head. In ERC and multimodal fusion, utterance nodes are passed through stacked GNNs (GCN, GAT, RelGCN), followed by an MLP and softmax for emotion prediction (Krishnan et al., 2023, Li et al., 2022, Ghosal et al., 2019). Integration of additional context—speaker features, multimodal cues (audio, visual, physiology), sentiment shifts, or knowledge-enriched edge labels—further improves accuracy (Zha et al., 2024, Liang et al., 2020, Li et al., 2022).
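A hedged sketch of the node-level prediction head: node embeddings from the GNN stack are mapped by a linear layer to emotion logits, and a softmax yields per-utterance class probabilities. The label set here is an example, not fixed by the cited works:

```python
import numpy as np

EMOTIONS = ["neutral", "joy", "sadness", "anger"]   # illustrative label set

def predict_emotions(H: np.ndarray, W: np.ndarray, b: np.ndarray):
    """H: (n_nodes, d) GNN output embeddings; W, b: linear head parameters.
    Returns predicted labels and the full softmax distribution."""
    logits = H @ W + b
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs = exp / exp.sum(axis=1, keepdims=True)
    return [EMOTIONS[i] for i in probs.argmax(axis=1)], probs

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                  # 5 utterance embeddings
W, b = rng.normal(size=(8, 4)), np.zeros(4)
labels, probs = predict_emotions(H, W, b)
```

Training would minimize categorical cross-entropy between `probs` and gold labels, as stated above.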

Dynamic models for emotion propagation across social graphs combine per-user temporal LSTM embeddings and graph convolution over social adjacency for next-day affect prediction, sometimes fusing physiological, behavioral, and environment features (Khalid et al., 2022). For image-based emotion, factor-graph models encode user emotions, image-level features, and social influence variables, jointly inferring who influences whom and propagating affect (Wang et al., 2014).
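The two-stage pattern above (per-user temporal embedding, then aggregation over social ties) can be sketched as follows; the function name and concatenation-based fusion are assumptions for illustration, not the cited paper's exact design:

```python
# Combine each user's temporal summary (e.g. an LSTM final state) with a
# tie-weighted mean over social neighbors before next-day affect prediction.
import numpy as np

def fuse_social_context(user_emb: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """user_emb: (n_users, d) temporal embeddings;
    adj: (n_users, n_users) weighted social ties (calls, messages, ...).
    Returns each user's embedding concatenated with its neighbor mean."""
    row_sums = adj.sum(axis=1, keepdims=True)
    weights = np.divide(adj, row_sums, out=np.zeros_like(adj),
                        where=row_sums > 0)       # row-normalize ties
    neighbor_mean = weights @ user_emb
    return np.concatenate([user_emb, neighbor_mean], axis=1)

emb = np.ones((3, 4))
adj = np.array([[0, 1, 0], [1, 0, 2], [0, 2, 0]], dtype=float)
fused = fuse_social_context(emb, adj)   # shape (3, 8)
```

The fused features would then feed the per-user affect classifier, optionally alongside physiological and environmental features.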

Multimodal approaches such as GraphMFT and GraphCFC construct multiple heterogeneous graphs to capture cross-modal and intra-modal dependencies, applying improved attention networks and specialized subspace fusion strategies (Li et al., 2022, Li et al., 2022, Meng et al., 2024). Multilayered text analysis uses network-of-networks for hashtags, keywords, and tweets, each processed with separate GNN blocks and pooled for group-level emotion classification (Nguyen et al., 2022).

EEG graphs encode spatial and functional connectivity, processed by spatial or spatio-temporal GNNs for node/graph-level emotion prediction; dynamic graph modeling remains an open problem (Liu et al., 2024). Pattern-based emotion extraction relies on graph-mined, syntactically constrained templates, later enriched with semantic embeddings, for robust classification (Saravia et al., 2018).
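As one concrete (assumed, not from the cited survey) instance of the spatial-connectivity encoding above, a distance-based EEG channel graph can weight edges with a Gaussian kernel of inter-electrode distance:

```python
# Channels are nodes; edge weight decays with squared Euclidean distance
# between electrode positions, a common spatial-connectivity choice.
import numpy as np

def distance_adjacency(positions: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """positions: (n_channels, 3) electrode coordinates."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    A = np.exp(-dist2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)   # no self-edges; GNN layers add them as needed
    return A

pos = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0]], dtype=float)
A = distance_adjacency(pos)
```

Functional-connectivity or learned-attention graphs replace this fixed kernel with data-driven weights, which is where the dynamic-graph problem noted above arises.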

4. Experimental Results, Benchmarking, and Ablations

Across benchmarks, graph-based architectures yield consistent improvements over sequential, statistical, or multimodal baselines.

  • ERC benchmarks (MELD, IEMOCAP): LineConGAT (speaker-independent, local context) achieves state-of-the-art F1 (64.58% IEMOCAP, 76.50% MELD), outperforming prior works by up to +10.7% (Krishnan et al., 2023). Speaker-independence is validated via minimal benefit from speaker embeddings.
  • Social emotion propagation: GCN-LSTM outperforms LSTM-only and Conv-LSTM approaches in stress/happiness prediction (F1: 0.69–0.72 vs. 0.57–0.65) (Khalid et al., 2022). Performance saturates at ≈15 neighbors, with overly central nodes showing higher prediction error.
  • Multimodal conversation: GS-MCC reports highest W-F1 (73.9% IEMOCAP, 69.0% MELD) via spectral collaborative learning (Meng et al., 2024). Multimodal fusion delivers incrementally higher accuracy; ablations confirm necessity of all modalities (text, audio, visual).
  • Speech/gesture/facial emotion: Compact graph architectures surpass RNNs and CNNs with far fewer parameters (on the order of 30K–120K), offering weighted accuracy up to 65% and mAP of 82% for gait-based emotion (Shirian et al., 2020, Narayanan et al., 2020, Shirian et al., 2020, Maji et al., 28 Aug 2025).
  • Emotion correlation learning: EmoGraph graphs over label co-occurrences boost macro-F1 by 2–8 points in both multi-label and single-label text emotion recognition (Xu et al., 2020).
  • Heterogeneous graph fusion: HMG-Emo achieves 0.77 weighted F1, outperforming prior multimodal and fusion-based baselines in social network settings (Bhattacharyya et al., 13 Jan 2025).
  • EEG emotion decoding: Spatio-temporal GNNs and minimum-spanning-tree variants yield accuracy >85% on SEED, with attention-based spatial connections further improving performance (Liu et al., 2024).
  • Ablations: All cited works report systematic drops with removal of graph context, edge enrichment, modality, or attention, indicating that model gains are contingent on principled graph construction and integration.

5. Limitations, Open Challenges, and Future Directions

Several research threads remain open in graph-based emotion prediction:

  • Dynamic Graph Topologies: Static edge assignment can limit representational fidelity in rapidly evolving contexts (brain activity, social ties, conversation turns). Adapting graphs online, learning personalized or temporally adaptive adjacency, or inferring graph structure jointly remains an unsolved problem (Liu et al., 2024, Shirian et al., 2020).
  • Speaker and Context Complexity: Many ERC systems bypass speaker nodes for deployment on unseen speakers; however, modeling deeper speaker-specific priors, real-time incremental updates, or emotional state graphs per speaker is a frontier (Zha et al., 2024).
  • Multimodal and Heterogeneous Fusion: Existing fusion architectures may suffer from heterogeneity gaps, redundancy, or suboptimal fusion order. Dynamic fusion strategies, external knowledge incorporation, and attention over heterogeneous relations represent important next steps (Li et al., 2022, Li et al., 2022, Liang et al., 2020).
  • Interpretability: Edge-level interpretability, semantic alignment of graph regions (e.g., quotient graph nodes for facial regions), and impact of social ties warrant targeted methods for transparency (Maji et al., 28 Aug 2025, Wang et al., 2014).
  • Benchmarking and Generalization: Scarcity of large public datasets, inconsistent validation splits, and domain adaptation (especially in EEG and social network emotion) hinder cross-study comparability and transfer (Liu et al., 2024, Bhattacharyya et al., 13 Jan 2025).
  • Mixed-emotion Recognition: Most models classify discrete emotions; however, individuals often exhibit blended affective states requiring novel graph or label structures (Liu et al., 2024).
  • Scalability: Large graphs (e.g., social networks, EEG) may challenge memory and computational resources; approaches exploiting graph coarsening, region pooling, or distributed inference become necessary (Maji et al., 28 Aug 2025, Wang et al., 2014, Khalid et al., 2022).

6. Impact and Contextual Significance

Graph-based emotion prediction advances affective computing across domains of conversational AI, mental health, human–robot interaction, social media analytics, and physiological signal analysis. Explicit relational modeling via graphs allows systematic exploitation of context, interaction, and multimodal cues inaccessible to sequence-based or isolated methods. The emergence of GNNs—including relational, attention, spectral, and edge-convolutional variants—provides a unified platform for integrating structured and unstructured data at scale, supporting interpretability and robustness. State-of-the-art emotion recognition performance, efficient architectures, and adaptability across domains validate the impact of the graph-based paradigm as central to the future of emotion-aware technology.
