Article

A Hierarchical Cross-Modal Spatial Fusion Network for Multimodal Emotion Recognition

Ming Xu (Department of Automation, Tsinghua University, Beijing, P. R. China), Tuo Shi, Hao Zhang, Zeyi Liu (Department of Automation, Tsinghua University, Beijing, P. R. China), Xiao He (Department of Automation, Tsinghua University, Beijing, P. R. China)

2025, en


Abstract

Emotion recognition from physiological data has advanced notably in recent years. However, existing multimodal methods often overlook the interrelations between modalities, such as video and electroencephalography (EEG) data. This paper proposes a hierarchical cross-modal spatial fusion network, based on feature fusion, that effectively integrates EEG and video features. An EEG feature extraction network based on 1D convolution and a video feature extraction network based on 3D convolution extract the features of each modality. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed. To enhance the network's sensitivity to emotion-related features, a multiscale spatial pyramid pooling module is also designed. In addition, a self-distillation method is introduced, which improves performance while reducing the number of network parameters. The network achieved an accuracy of 97.78% on the valence-arousal dimension of the DEAP dataset and 60.59% on the MAHNOB-HCI dataset, reaching the state-of-the-art level.
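To illustrate the general idea behind multiscale spatial pyramid pooling, the following is a minimal sketch in plain Python. It is not the authors' module: the abstract does not specify the pooling operator, the pyramid levels, or where the module sits in the network, so max pooling, the levels `(1, 2, 4)`, and the function names `adaptive_max_pool_2d` and `spatial_pyramid_pool` are all assumptions chosen for illustration. The sketch shows only the property the technique is generally known for: a feature map of any spatial size is summarized at several grid resolutions and concatenated into a fixed-length vector.

```python
def adaptive_max_pool_2d(feature_map, bins):
    """Max-pool a 2D grid (list of rows) into bins * bins values, row-major.

    Bin boundaries use floor for the start and ceil for the end, so the
    bins always cover the whole map and are never empty.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(bins):
        r0 = (i * height) // bins
        r1 = ((i + 1) * height + bins - 1) // bins  # integer ceil
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = ((j + 1) * width + bins - 1) // bins  # integer ceil
            pooled.append(
                max(feature_map[r][c]
                    for r in range(r0, r1)
                    for c in range(c0, c1))
            )
    return pooled


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Concatenate pooled summaries at several grid resolutions.

    The levels are an illustrative assumption, not taken from the paper.
    Output length is sum(b * b for b in levels), independent of input size.
    """
    output = []
    for bins in levels:
        output.extend(adaptive_max_pool_2d(feature_map, bins))
    return output


if __name__ == "__main__":
    small = [[r * 4 + c for c in range(4)] for r in range(4)]
    large = [[0.0] * 5 for _ in range(6)]
    # Both inputs yield vectors of the same length despite different sizes.
    print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

In a real network this operation would typically be applied per channel to a convolutional feature tensor; the coarse level captures global context while the finer levels retain more spatial detail.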


Identifiers

DOI: 10.1109/tai.2024.3523250

Citations and references

Citations: 2
References used: 0
