A Hierarchical Cross-Modal Spatial Fusion Network for Multimodal Emotion Recognition
Abstract
Emotion recognition from physiological data has advanced notably in recent years. However, existing multimodal methods often overlook the interrelations between modalities, such as video and electroencephalography (EEG) data. This paper proposes a hierarchical cross-modal spatial fusion network, based on feature fusion, that effectively integrates EEG and video features. An EEG feature extraction network based on 1D convolution and a video feature extraction network based on 3D convolution thoroughly extract the features of each modality. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed. To enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. In addition, a self-distillation method is introduced, which improves performance while reducing the number of network parameters. The hierarchical cross-modal spatial fusion network achieves an accuracy of 97.78% on the valence-arousal dimension of the DEAP dataset and 60.59% on the MAHNOB-HCI dataset, reaching the state-of-the-art level.
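The abstract does not specify how the multiscale spatial pyramid pooling module is built. As a rough illustration of the general idea only, the sketch below implements classic spatial pyramid pooling on a single 2D feature map in plain Python: the map is max-pooled over grids of several sizes and the results are concatenated into a fixed-length vector. The function name `spatial_pyramid_pool`, the `levels` parameter, and the choice of max pooling are assumptions made for this example, not details taken from the paper.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map over an n x n grid for each n in `levels`.

    Returns a flat list of length sum(n * n for n in levels), regardless of
    the input height and width (hypothetical helper, not the paper's module).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceiling for the end, so bins cover the
            # whole map and are never empty (neighbouring bins may overlap).
            r0 = (i * height) // n
            r1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0 = (j * width) // n
                c1 = ((j + 1) * width + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Example: feature maps of different sizes yield vectors of the same length.
small = [[r * 4 + c for c in range(4)] for r in range(4)]
large = [[r * 7 + c for c in range(7)] for r in range(5)]
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

With the default levels (1, 2, 4), each output vector has 1 + 4 + 16 = 21 entries. This fixed output size is what lets a pyramid pooling stage sit between variable-size feature maps and a fixed-size classifier, while the different grid sizes capture context at several spatial scales.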