Preprint

Automatic Detection of Stop Words for Texts in the Uzbek Language

Authors: Khabibulla Madatov, Shukurla Bekchanov, Jernej Vičič

Affiliation (as listed): Urgench State University, 14, Kh. Alimdjan str., Urgench city, 220100, Uzbekistan

Preprints.org · repository · 2022 · en


Abstract

Stop words are very important for information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in Uzbek-language texts. Because few methods are available for the automatic detection of stop words in Uzbek texts, we analyzed a newly prepared corpus. Uzbek belongs to the family of agglutinative languages. As in all agglutinative languages, detecting stop words in Uzbek texts is a more complex process than in inflected languages: in inflected languages, words such as auxiliary words, articles, and prepositions can be included in the stop-word group, whereas in agglutinative languages the meanings of such words are hidden in the text. It is therefore not appropriate to apply every known stop-word detection method for inflected languages directly to agglutinative languages. In this work, the "School corpus", which contains 731,156 Uzbek words, was investigated. The bigram method of analysis was applied to the corpus. We proposed a collocation method for detecting stop words in the corpus, and with it a method for automatically detecting stop words in Uzbek texts. The collocation method is shown to be 6 times better than the bigram method.
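The abstract does not say how the bigram analysis scores or filters word pairs, so the following is only a minimal sketch of generic bigram frequency counting, not the authors' procedure. The function names `bigram_counts` and `candidate_words`, the whitespace tokenization, and the toy sentence are all assumptions made for illustration; the "School corpus" itself is not used here.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def candidate_words(tokens, top_n):
    """Collect the words that occur in the top_n most frequent bigrams.

    Hypothetical helper: frequent pairs are treated as a source of
    stop-word candidates, which is an assumption, not the paper's rule.
    """
    words = []
    for (left, right), _count in bigram_counts(tokens).most_common(top_n):
        for word in (left, right):
            if word not in words:
                words.append(word)
    return words


# Toy example sentence (made up for this sketch).
text = "bu kitob va bu daftar va bu qalam"
tokens = text.lower().split()  # naive whitespace tokenization (assumption)

counts = bigram_counts(tokens)
candidates = candidate_words(tokens, top_n=1)
```

On a real corpus the tokenization step would need to handle punctuation and Uzbek orthography, and raw pair frequency alone would likely need an additional threshold or association measure before any word is treated as a stop word.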

Topics

Natural Language Processing Techniques; Translation Studies and Practices; Spam and Phishing Detection

Identifiers

DOI: 10.20944/preprints202204.0234.v1

Citations and references

1 citation, 15 references
