Article

Detection of Synthetic Speech Using Spectral-Cepstral Features and BiLSTM Networks

Furkat Rakhmatov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Fakhriddin Abdirazakov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Baxodir Saydullaevich Achilov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Ruslan Baydullayev, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Sultanmurat Nasirov, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Shakhzod Javliev, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan

Informatica (journal), 2025


Abstract

Experimental results demonstrate 93.4% accuracy on the test set. Error analysis shows that misclassifications occur predominantly between the Person and Robot classes, whereas the Emotion class is recognized more reliably. Feature comparison indicates that log-mel provides a robust baseline at minimal computational cost, LFCC better preserves the high-frequency details characteristic of synthetic artifacts, and CQCC is effective at capturing harmonic structure and modulations. Potential directions for improving generalizability and accuracy are discussed, including feature fusion (CQCC/LFCC/log-mel) and statistical pooling for temporal aggregation. The proposed configuration offers a well-balanced trade-off between performance and computational complexity and can serve as a strong baseline for anti-spoofing systems.
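
The two directions named above, feature fusion and statistical pooling, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `fuse_features` and `statistical_pooling` are hypothetical, and the toy inputs stand in for real log-mel and LFCC frames. It also assumes that fusion means per-frame concatenation, that pooling means the per-dimension mean and standard deviation over time, and that streams of different length are truncated to the shortest one. The abstract does not state where in the network pooling would be applied.

```python
import math
from typing import List, Sequence

Frames = List[List[float]]  # one inner list per time frame: (num_frames, feature_dim)


def fuse_features(streams: Sequence[Frames]) -> Frames:
    """Concatenate per-frame vectors from several feature streams.

    Assumption: streams are frame-aligned; any extra trailing frames
    in longer streams are dropped.
    """
    num_frames = min(len(stream) for stream in streams)
    return [
        [value for stream in streams for value in stream[t]]
        for t in range(num_frames)
    ]


def statistical_pooling(frames: Frames) -> List[float]:
    """Collapse the time axis into per-dimension mean and standard deviation."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    # Population standard deviation (divide by n, not n - 1).
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Toy values only: two frames of a 2-dim "log-mel" and a 1-dim "LFCC" stream.
log_mel = [[1.0, 2.0], [3.0, 4.0]]
lfcc = [[0.0], [2.0]]

fused = fuse_features([log_mel, lfcc])   # [[1.0, 2.0, 0.0], [3.0, 4.0, 2.0]]
pooled = statistical_pooling(fused)      # [2.0, 3.0, 1.0, 1.0, 1.0, 1.0]
print(pooled)
```

Pooling turns a variable-length sequence of frames into a fixed-length vector of size twice the fused feature dimension, so utterances of different duration can share one classifier head.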

Topics

Speech Recognition and Synthesis; Speech and Audio Processing; Emotion and Mood Recognition

Identifiers

DOI: 10.31449/inf.v49i36.12281

Citations and references

Cited by: 0. References: 0.
