Emotion-Aware Speaker Diarization Based on Prosodic and Deep Embedding Integration
Kamoliddin Shukurov, U U Khasanov, Shokhrukhmirzo Kholdorov, Maftuna Karimova, Lutfulla Murodjonov
TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
2025
ABI
Abstract
Speaker diarization is the process of identifying speech segments in an audio stream and assigning each segment to a specific speaker. Because classical systems do not take prosodic features into account, their accuracy decreases on emotional speech. This study proposes an emotion-aware speaker diarization system. In the model, prosodic vectors derived from prosodic features are combined with ECAPA-TDNN embeddings through modulation. The proposed model reduced the diarization error rate (DER) from 11.6% for a simple baseline model to 7.9%. In addition, it has a low computational cost and performs well in real-time systems.
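The abstract does not specify how the modulation is carried out, so the following is only a minimal sketch under stated assumptions: a feature-wise scale and shift (FiLM-style) of the speaker embedding, driven by the prosodic vector. The function name `modulate_embedding`, the projection matrices `w_gamma` and `w_beta`, the 192-dimensional embedding size, and the four prosodic statistics are all hypothetical and chosen for illustration; random values stand in for a real ECAPA-TDNN embedding and for learned weights.

```python
import numpy as np


def modulate_embedding(speaker_emb, prosodic_vec, w_gamma, w_beta):
    """Scale and shift a speaker embedding using a prosodic vector.

    Assumed FiLM-style form, not confirmed by the paper:
        fused = gamma * speaker_emb + beta
    where gamma and beta are linear projections of the prosodic vector.
    """
    # Per-dimension scale, kept in (0, 2) and equal to 1 for a zero prosodic vector
    gamma = 1.0 + np.tanh(w_gamma @ prosodic_vec)
    # Per-dimension shift
    beta = w_beta @ prosodic_vec
    fused = gamma * speaker_emb + beta
    # Normalize to unit length so cosine-based clustering can be applied
    return fused / np.linalg.norm(fused)


rng = np.random.default_rng(0)
emb_dim, pros_dim = 192, 4  # hypothetical sizes

# Stand-in for an ECAPA-TDNN embedding of one speech segment
speaker_emb = rng.standard_normal(emb_dim)

# Illustrative normalized prosodic statistics for the same segment,
# e.g. pitch mean, pitch range, energy, speaking rate
prosodic_vec = np.array([0.3, -1.2, 0.8, 0.1])

# Stand-ins for learned projection weights
w_gamma = 0.01 * rng.standard_normal((emb_dim, pros_dim))
w_beta = 0.01 * rng.standard_normal((emb_dim, pros_dim))

fused = modulate_embedding(speaker_emb, prosodic_vec, w_gamma, w_beta)
neutral = modulate_embedding(speaker_emb, np.zeros(pros_dim), w_gamma, w_beta)
```

With this form, a zero prosodic vector leaves the embedding direction unchanged (`neutral` is simply the normalized `speaker_emb`), so the prosodic input acts as a correction on top of the speaker embedding rather than a replacement for it. In a trained system the projection weights would be learned jointly with the diarization objective.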