Article

Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders

Fazliddin Makhmudov, Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea
Alpamis Kutlimuratov, Department of AI and Software Engineering, Gachon University, Seongnam 13120, Republic of Korea
Farkhod Akhmedov, Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea
Mohamed S. Abdallah, Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea
Young Im Cho, Department of Computer Engineering, Gachon University, Seongnam 1342, Republic of Korea

2022, en

ABI

Abstract

Accurately learning human emotions from speech is an essential function of modern speech emotion recognition (SER) models. Deriving and interpreting the relevant speech features from raw speech data is therefore a difficult modeling task, and one on which performance depends. In this study, we developed a novel SER model based on attention-oriented parallel convolutional neural network (CNN) encoders that acquire, in parallel, the features used for emotion classification. Specifically, MFCC, paralinguistic, and speech spectrogram features were derived and encoded by a separate CNN architecture designed for each feature type; the encoded features were then fed to attention mechanisms for further representation and finally classified. The model was evaluated empirically on the open EMO-DB and IEMOCAP datasets, and the results showed that the proposed model is more efficient than the baseline models. In particular, the weighted accuracy (WA) and unweighted accuracy (UA) of the proposed model were 71.8% and 70.9%, respectively, on the EMO-DB dataset, and 72.4% and 71.1% on the IEMOCAP dataset.
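The abstract reports WA and UA without defining them. The sketch below assumes the convention common in the SER literature: WA is the overall fraction of correctly classified utterances, and UA is the mean of the per-class recalls, so that rare emotion classes count as much as frequent ones. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances.

    Assumed definition of WA; the abstract itself does not spell it out.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every emotion class contributes equally.

    Assumed definition of UA; the abstract itself does not spell it out.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels with an imbalanced class distribution.
y_true = ["angry"] * 6 + ["sad"] * 2 + ["happy"] * 2
y_pred = ["angry"] * 6 + ["sad", "angry"] + ["angry", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two numbers diverge: a classifier that favors the majority emotion keeps WA high while UA drops, which is why SER results are usually reported with both.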


Identifiers

DOI: 10.3390/electronics11234047

Citations and references

Citations: 9
References used: 0

Metrics: AkademScholar