Article

Emotion-Aware Speaker Diarization Based on Prosodic and Deep Embedding Integration

Kamoliddin Shukurov, TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
U U Khasanov, TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
Shokhrukhmirzo Kholdorov, TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
Maftuna Karimova, TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan
Lutfulla Murodjonov, TUIT named after Mukhammad al-Khwarizmi, Department of Robotics and Intelligent Systems, Tashkent, Uzbekistan

2025

Abstract

Speaker diarization is the process of identifying speech segments in an audio stream and assigning each segment to a specific speaker. Because classical systems do not take prosodic features into account, their accuracy degrades on emotional speech. This study proposes an emotion-sensitive speaker diarization system in which prosodic vectors, derived from prosodic features, modulate the embeddings produced by the ECAPA-TDNN model. The proposed model reduced the diarization error rate (DER) from 11.6% for a simple baseline model to 7.9%. It also has a low computational cost and performs well in real-time systems.
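The abstract does not specify how the modulation is carried out. As one possible reading, the sketch below uses a FiLM-style scale-and-shift, in which a prosodic vector rescales and offsets a speaker embedding before length normalization. The function name `modulate_embedding`, the projection matrices `w_scale` and `w_shift`, the 192-dimensional embedding size, and the four-element prosodic vector are all assumptions made for illustration and are not taken from the paper. In a trained system the projections would be learned; here they are random.

```python
import numpy as np

def modulate_embedding(speaker_emb, prosodic_vec, w_scale, w_shift):
    """Hypothetical FiLM-style fusion: scale and shift a speaker embedding
    using projections of a prosodic feature vector, then L2-normalize."""
    gamma = np.tanh(prosodic_vec @ w_scale)   # per-dimension scale in (-1, 1)
    beta = prosodic_vec @ w_shift             # per-dimension shift
    fused = speaker_emb * (1.0 + gamma) + beta
    # Guard against division by zero for a degenerate all-zero vector.
    return fused / max(np.linalg.norm(fused), 1e-12)

rng = np.random.default_rng(0)
emb_dim, pros_dim = 192, 4                    # assumed sizes, not from the paper

speaker_emb = rng.standard_normal(emb_dim)    # stand-in for an ECAPA-TDNN embedding
prosodic_vec = np.array([0.3, -1.2, 0.8, 0.1])  # stand-in for normalized prosodic features

# Random stand-ins for learned projection matrices.
w_scale = 0.1 * rng.standard_normal((pros_dim, emb_dim))
w_shift = 0.1 * rng.standard_normal((pros_dim, emb_dim))

fused = modulate_embedding(speaker_emb, prosodic_vec, w_scale, w_shift)
print(fused.shape)
```

With zero projection matrices the function returns the length-normalized input embedding unchanged, so under this formulation the prosodic branch acts as an adjustment on top of the speaker representation rather than a replacement for it.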

Topics

Emotion and Mood Recognition; Speech Recognition and Synthesis; Speech and Audio Processing

Identifiers

DOI: 10.1109/icaaic64647.2025.11330488

Citations and references

Cited by 0 · 17 references
