Article

Linguistic Nuances in Text Analysis: TF-IDF Metric’s Algorithm Implementation for the Karakalpak Language Recognition

Davlatyor Mengliev, Novosibirsk State University, IT Department, Novosibirsk, Russia
Mukhriddin Eshkulov, Jizzakh Polytechnic Institute, Teacher of Department of Physics, Jizzakh, Uzbekistan
Vladimir Barakhnin, Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, IT Department, Urgench, Uzbekistan
Ruslan Abdullayev
Nodirbek Boltayev, Urgench branch of Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Department of Digital Education, Urgench, Uzbekistan
Bahodir Ibragimov, Urgench State University, Department of Mathematics, Urgench, Uzbekistan

2024 · en

ABI

Abstract

This article presents an original approach to calculating the TF-IDF metric for Karakalpak-language documents. It reviews related work, including efforts to extract stop words automatically and to apply a TF-IDF metric tailored to the linguistic characteristics of Karakalpak, and highlights the importance of morphological preprocessing for the accuracy and efficiency of such algorithms. The agglutinative nature of Karakalpak poses challenges, notably the need for extensive morphological preprocessing to identify and analyze word forms accurately. To address them, the study proposes a new algorithm that shows significant potential in handling the complexity of the language. By adapting the TF-IDF metric to account for the morphological structure of Karakalpak, the proposed algorithm marks a significant advance in the computational analysis of agglutinative languages. The algorithm was tested on a diverse set of words: words unique to each dialect, words common to multiple dialects, and misspelled words. It demonstrated high accuracy in identifying dialect-specific words and in processing records written in mixed dialects. In addition, the study contributes to the broader field of Turkic languages by offering insights into the structural and lexical features of the Uzbek language.
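For orientation, the standard TF-IDF weighting that the abstract builds on can be sketched as follows. This is a minimal illustration of the textbook formula, not the algorithm proposed in the paper: the `normalize` function is a hypothetical placeholder for the morphological preprocessing step (here it only lowercases), and whitespace tokenization is an assumption.

```python
import math
from collections import Counter


def normalize(token):
    # Hypothetical stand-in for morphological preprocessing (stemming or
    # lemmatization); this sketch only lowercases the token.
    return token.lower()


def tf_idf(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    # Assumption: tokens are separated by whitespace.
    tokenized = [[normalize(t) for t in doc.split()] for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (relative count) times inverse document frequency.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight zero under this formula, which is why normalizing inflected word forms to a shared base matters for an agglutinative language: without it, each surface form is counted as a separate term.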

Topics

Natural Language Processing Techniques

Identifiers

DOI: 10.1109/usbereit61901.2024.10584051

Citations and references

Citations: 17 · References used: 0
