Linguistic Nuances in Text Analysis: TF-IDF Metric’s Algorithm Implementation for the Karakalpak Language Recognition
Abstract
This article presents an original approach to calculating the TF-IDF metric for documents in the Karakalpak language. The paper reviews related work, including efforts to extract stop words automatically and to apply a TF-IDF metric tailored to the linguistic characteristics of Karakalpak, and highlights the importance of morphological preprocessing for improving the accuracy and efficiency of algorithms. The agglutinative nature of Karakalpak poses challenges, notably the need for extensive morphological preprocessing to identify and analyze word forms accurately. To address them, this study proposes a new algorithm that shows significant potential in handling the complexity of the language. By adapting the TF-IDF metric to account for the morphological structure of Karakalpak, the proposed algorithm advances the computational analysis of agglutinative languages. The algorithm was tested on a diverse set of words, including words unique to each dialect, words common to multiple dialects, and misspelled words. It demonstrated high accuracy in identifying dialect-specific words and in processing records written in mixed dialects. In addition, this study contributes to the broader field of Turkic languages by offering insights into the structural and lexical features of the Uzbek language.
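As a rough illustration of the general idea summarized above, the following minimal sketch computes the standard TF-IDF score after a naive suffix-stripping step. It is not the algorithm proposed in this paper: the suffix list `HYPOTHETICAL_SUFFIXES`, the helper `strip_suffix`, and the toy documents are assumptions made only for this example, and a real system would need a proper morphological analyzer for Karakalpak.

```python
import math
from collections import Counter

# Hypothetical suffix list, for illustration only.
# This is NOT a real Karakalpak morphological analyzer.
HYPOTHETICAL_SUFFIXES = ("lar", "ler")


def strip_suffix(token):
    """Remove one listed suffix if the remaining stem keeps at least 3 characters."""
    for suffix in HYPOTHETICAL_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Uses the textbook definitions (an assumption, not the paper's variant):
    tf = term count / document length, idf = ln(N / document frequency).
    """
    tokenized = [
        [strip_suffix(tok) for tok in doc.lower().split()] for doc in documents
    ]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each normalized term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Toy corpus (illustrative tokens only).
docs = ["kitap kitaplar", "bala kitap", "bala mektep"]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

In this sketch, stripping the suffix first means that different surface forms of one stem are counted as a single term, which is the effect that morphological preprocessing is meant to achieve for an agglutinative language.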