GENERAL DEEP LEARNING ARCHITECTURES FOR MULTIMODAL EMOTION DETECTION
Abstract
Multimodal emotion recognition is an important area of artificial intelligence that enables accurate analysis of human emotional states by combining several data sources, such as facial expressions, body movements, speech tone, and physiological signals. This paper studies the application of deep learning architectures to multimodal emotion recognition, with a focus on the effectiveness of the late fusion strategy. The ST-GCN (Spatio-Temporal Graph Convolutional Network) model is used to extract motion features from body movements, and the DeepFaceEmocNet25 model, trained on the FaceEmocDS dataset, is used to extract emotion features from facial expressions. The two models are integrated through late fusion to detect seven emotion classes (happy, angry, sad, surprised, disgusted, fearful, neutral) with high accuracy. Late fusion keeps the features of each modality independent until the final stage, where they are concatenated and passed to a fully connected classifier. The paper presents the mathematical formulation, practical code examples, and the experimental setup, and analyzes the technical details of the system. Multimodal approaches are widely used in healthcare, education, security, and gaming, but they face challenges such as data heterogeneity, limited datasets, and computational cost. Future research will focus on small-data training, real-time analysis, and cultural adaptability. Overall, this work presents a deep learning solution for multimodal emotion recognition based on late fusion of body-movement and facial-expression features.
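As a rough illustration of the late fusion step summarized in the abstract, the following minimal sketch concatenates a body-movement feature vector and a facial-expression feature vector and passes the result through a single fully connected layer followed by softmax over the seven emotion classes. It is not the paper's implementation: the feature dimensions, the random placeholder features and weights, and the function name `late_fusion_predict` are assumptions made for illustration only. In the actual system the two vectors would come from ST-GCN and DeepFaceEmocNet25, and the classifier weights would be learned.

```python
import math
import random

# The seven emotion classes listed in the abstract.
EMOTIONS = ["happy", "angry", "sad", "surprised", "disgusted", "fearful", "neutral"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_predict(body_feat, face_feat, weights, bias):
    """Concatenate per-modality features, then apply one fully connected layer.

    `weights` has one row per emotion class and one column per fused feature.
    """
    fused = list(body_feat) + list(face_feat)  # late fusion by concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs


# Hypothetical feature sizes; the real extractors would define these.
BODY_DIM, FACE_DIM = 4, 3

# Placeholder values standing in for extractor outputs and trained weights.
rng = random.Random(0)
body_feat = [rng.uniform(-1.0, 1.0) for _ in range(BODY_DIM)]
face_feat = [rng.uniform(-1.0, 1.0) for _ in range(FACE_DIM)]
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(BODY_DIM + FACE_DIM)]
    for _ in EMOTIONS
]
bias = [0.0] * len(EMOTIONS)

label, probs = late_fusion_predict(body_feat, face_feat, weights, bias)
print(label, [round(p, 3) for p in probs])
```

Because each modality is encoded separately and only the resulting feature vectors are joined, either extractor could in principle be retrained or replaced without changing the other; only the input width of the fused classifier would need to match.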