Emotion-Aware Speaker Diarization Based on Prosodic and Deep Embedding Integration
Abstract
Speaker diarization is the task of detecting speech segments in an audio stream and assigning each segment to the speaker who produced it. Classical systems do not take prosodic features into account, so their accuracy degrades on emotional speech. This study proposes an emotion-aware speaker diarization system in which prosodic vectors derived from prosodic features modulate the embeddings produced by the ECAPA-TDNN model. The proposed model reduced the diarization error rate (DER) of a simple baseline from 11.6 % to 7.9 %. In addition, it has a low computational cost and performs well in real-time settings.
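The abstract does not say how the modulation is carried out, so the following is only an illustrative sketch of one common way to let a small prosodic vector modulate a speaker embedding: a feature-wise scale and shift. Everything in it is an assumption made for illustration and is not taken from the study: the function name `modulate_embedding`, the 192-dimensional embedding, the four prosodic features, and the random projection matrices, which stand in for parameters that would be learned in practice.

```python
import numpy as np

def modulate_embedding(speaker_emb, prosody_vec, w_gamma, w_beta):
    """Scale and shift a speaker embedding with a prosodic vector.

    Hypothetical helper: the projection matrices would normally be learned.
    """
    # Per-dimension scale centred on 1, so zero prosody leaves the embedding unscaled.
    gamma = 1.0 + np.tanh(prosody_vec @ w_gamma)
    # Per-dimension shift derived from the same prosodic vector.
    beta = prosody_vec @ w_beta
    fused = gamma * speaker_emb + beta
    # L2-normalise so cosine-based clustering can use the result directly.
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
emb_dim, pros_dim = 192, 4                      # assumed sizes, for illustration only
speaker_emb = rng.standard_normal(emb_dim)      # placeholder for an ECAPA-TDNN embedding
prosody_vec = np.array([0.3, -1.2, 0.8, 0.1])   # placeholder prosodic statistics
w_gamma = 0.01 * rng.standard_normal((pros_dim, emb_dim))
w_beta = 0.01 * rng.standard_normal((pros_dim, emb_dim))

fused = modulate_embedding(speaker_emb, prosody_vec, w_gamma, w_beta)
print(fused.shape)
```

With a zero prosodic vector the scale is 1 and the shift is 0, so the output reduces to the normalised speaker embedding; in this sketch the prosodic information only perturbs the speaker representation rather than replacing it.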