- The paper introduces a cross-modal framework that fuses EEG and eye movement data to enhance emotion recognition accuracy.
- It employs hybrid feature selection with RFE and PCA alongside a cross-modal attention mechanism to effectively align temporal dependencies.
- The MLP-based classifier, validated on the SEED-IV dataset, achieved 90.62% accuracy and a 90.56% F1-score, outperforming traditional methods.
Cross-modal Emotion Recognition Using EEG and Eye Movement Data
Introduction
The paper "An Emotion Recognition Framework via Cross-modal Alignment of EEG and Eye Movement Data" presents a framework for emotion recognition built on a novel cross-modal alignment of EEG and eye movement data. The approach is motivated by the limitations of unimodal emotion recognition systems, which often struggle to capture complex emotional states accurately. By integrating EEG, which offers high temporal resolution and captures cortical dynamics, with eye movement data, which reflects attentional shifts and autonomic nervous activity, the framework aims to improve the effectiveness of emotion recognition systems.
Methodology
The methodology centers on a hybrid architecture that uses cross-modal attention to dynamically align EEG and eye movement data. This section covers the core steps: data preprocessing, feature selection, the cross-modal attention mechanism, and MLP-based classification.
Data Preprocessing and Feature Extraction
The framework utilizes the SEED-IV dataset, which consists of synchronized EEG and eye movement recordings. The EEG signal is processed into Differential Entropy (DE) features across standard frequency bands, resulting in a high-dimensional feature space. Eye movement data is characterized by various metrics such as pupil diameter and blink rate, contributing additional dimensions to the model.
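The DE feature for a band-filtered EEG segment is commonly computed under a Gaussian assumption, where it reduces to a closed form in the segment's variance. The following is a minimal sketch of that standard computation (the paper does not publish its extraction code, so the function name and simulated signal here are illustrative):

```python
import numpy as np

def differential_entropy(band_signal):
    """DE of a band-passed EEG segment under a Gaussian assumption:
    DE = 0.5 * ln(2 * pi * e * variance)."""
    var = np.var(band_signal)
    return 0.5 * np.log(2 * np.pi * np.e * var)

# One DE value is computed per (channel, frequency band) segment,
# which is what yields the high-dimensional EEG feature space.
rng = np.random.default_rng(0)
segment = rng.normal(scale=2.0, size=1000)  # simulated band-filtered channel
de = differential_entropy(segment)
```

Because DE grows with signal variance, more energetic band activity maps to a larger feature value, which is one reason DE features are a common choice for SEED-style EEG benchmarks.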
Feature Selection
To manage the complexity and reduce noise inherent in multimodal data, a hybrid selection strategy is employed. It combines Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA) to refine feature sets, emphasizing dimensionality reduction and noise elimination. Specifically, significant dimensionality reduction is achieved by selecting 200 EEG and 20 eye movement features, supplemented with PCA-derived components to ensure that discriminative features are retained.
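The hybrid selection step can be sketched with standard scikit-learn components. This is an illustrative reconstruction, not the authors' code: the toy feature dimensions, the logistic-regression RFE estimator, and the PCA component count are assumptions; only the 200 EEG / 20 eye-movement feature counts come from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_eeg = rng.normal(size=(120, 310))  # toy stand-in for DE features
X_eye = rng.normal(size=(120, 33))   # toy stand-in for eye-movement features
y = rng.integers(0, 4, size=120)     # four emotion labels

# RFE keeps a fixed number of discriminative features per modality
# (the paper reports 200 EEG and 20 eye-movement features).
rfe_eeg = RFE(LogisticRegression(max_iter=200),
              n_features_to_select=200, step=10).fit(X_eeg, y)
rfe_eye = RFE(LogisticRegression(max_iter=200),
              n_features_to_select=20).fit(X_eye, y)

# PCA-derived components supplement the selected sets so that
# variance structure lost to elimination is partially retained.
X_all = np.hstack([X_eeg, X_eye])
pca = PCA(n_components=30).fit(X_all)

X_fused = np.hstack([
    rfe_eeg.transform(X_eeg),
    rfe_eye.transform(X_eye),
    pca.transform(X_all),
])
```

The design choice here is complementary: RFE ranks features by their usefulness to a supervised model, while PCA preserves unsupervised variance structure, so combining them hedges against either criterion discarding discriminative information.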
Cross-modal Attention Mechanism
The core of the framework is the cross-modal attention mechanism, which adjusts the alignment between EEG and eye movement data based on learned attention weights. It models temporal dependencies across modalities, using learnable projection matrices to compute attention scores, and thereby unifies the two data types into a coherent joint representation that improves emotion detection.
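The alignment step can be sketched as scaled dot-product attention in which EEG time steps query the eye-movement sequence. This is a minimal numpy sketch under assumed toy dimensions; the paper's exact projection sizes and normalization are not specified here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(eeg, eye, Wq, Wk, Wv):
    """EEG time steps attend over eye-movement time steps.
    eeg: (T_eeg, d_eeg); eye: (T_eye, d_eye); W* are learnable projections."""
    Q, K, V = eeg @ Wq, eye @ Wk, eye @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T_eeg, T_eye) alignment scores
    weights = softmax(scores, axis=-1)       # each EEG step sums to 1 over eye steps
    return weights @ V, weights              # eye context re-aligned to the EEG timeline

rng = np.random.default_rng(0)
d = 16                               # assumed shared attention dimension
eeg = rng.normal(size=(8, 32))       # toy EEG sequence
eye = rng.normal(size=(12, 10))      # toy eye-movement sequence
Wq = rng.normal(size=(32, d))
Wk = rng.normal(size=(10, d))
Wv = rng.normal(size=(10, d))
context, weights = cross_modal_attention(eeg, eye, Wq, Wk, Wv)
```

The output has one context vector per EEG time step, which is what lets the fused representation stay on a single timeline even though the two modalities are sampled differently.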
MLP-based Emotion Recognition
The classification phase uses a deep MLP that operates on the fused features produced by the cross-modal alignment. Structured with residual connections, it predicts one of four emotional states, with regularization techniques such as dropout and batch normalization incorporated to mitigate overfitting and stabilize training.
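A forward pass through such a residual MLP can be sketched as follows. This is an inference-time illustration with assumed layer sizes and random weights; dropout is omitted (it is disabled at inference) and the batch normalization here is a toy stand-in without learned scale and shift:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Toy inference-style normalization over the batch dimension
    # (learned gamma/beta parameters omitted for brevity).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, W1, W2):
    """Two-layer MLP block with a skip connection: x + f(x)."""
    h = relu(batch_norm(x @ W1))
    return x + batch_norm(h @ W2)

def mlp_classify(x, W_in, blocks, W_out):
    h = relu(x @ W_in)
    for W1, W2 in blocks:
        h = residual_block(h, W1, W2)
    logits = h @ W_out              # one logit per emotion class
    return np.argmax(logits, axis=1)

rng = np.random.default_rng(1)
d_in, d_h, n_cls = 250, 64, 4       # assumed fused-feature and hidden sizes
x = rng.normal(size=(16, d_in))     # toy batch of fused features
W_in = rng.normal(size=(d_in, d_h)) * 0.1
blocks = [(rng.normal(size=(d_h, d_h)) * 0.1,
           rng.normal(size=(d_h, d_h)) * 0.1) for _ in range(2)]
W_out = rng.normal(size=(d_h, n_cls)) * 0.1
preds = mlp_classify(x, W_in, blocks, W_out)
```

The skip connections keep gradients flowing through the deeper blocks during training, which is the usual motivation for the residual format the paper adopts.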
Experimental Results
The empirical evaluation utilizes the SEED-IV dataset and employs several performance metrics to validate the proposed framework, contrasting it with baseline methods such as feature fusion with SVMs.
The proposed framework achieved 90.62% accuracy and a 90.56% F1-score, significantly outperforming baselines including early feature-fusion approaches and an ablated variant without feature selection. The integrated feature selection and attention mechanisms were the key contributors to this improvement, ensuring a robust alignment of the multimodal data.
Conclusion
The findings highlight the framework's potential in emotion recognition through its alignment and fusion of EEG and eye movement data. By validating the system's robustness and accuracy on the SEED-IV dataset, the study paves the way for future exploration of advanced multimodal fusion techniques and broader applications in adaptive user interfaces and personalized affective computing systems. Future research directions include improving the feature selection algorithms and extending the temporal modeling to further enhance robustness and accuracy.