Clustering of Small-Scale Uzbek Texts Using TF-IDF and K-Means: An Empirical Evaluation of Vectorization Parameters

E. H. (Elyor) Egamberdiyev, Tashkent University of Information Technologies

Neliti, repository, 2025, en

Abstract

In this study, we systematically evaluate TF-IDF vectorization parameters for clustering small-scale Uzbek-language text with the K-Means algorithm. TF-IDF is a widely used and computationally efficient technique for text representation, but it does not capture semantic meaning. This limitation matters especially for low-resource languages such as Uzbek, where pretrained semantic models are limited or unavailable. The primary goal of this research is to assess how TF-IDF configuration parameters (n-gram range, maximum and minimum document-frequency thresholds, normalization technique, and custom stopword filtering) affect the quality of clustering short, domain-specific Uzbek texts. We designed a dataset of seven manually curated sentences grouped into three distinct semantic categories: tourism and relaxation, artificial intelligence, and aquatic life.
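To make the parameters named in the abstract concrete, the following is a minimal, dependency-free sketch of TF-IDF vectorization followed by K-Means. It is not the author's implementation. The parameter names (`ngram_range`, `min_df`, `max_df`, `norm`, `stop_words`) and the smoothed IDF formula are modelled on scikit-learn's `TfidfVectorizer`, which is an assumption, since the abstract does not name a library. The six phrases are hypothetical placeholders, not the paper's seven-sentence dataset, and the deterministic farthest-point seeding is chosen here only for reproducibility.

```python
import math
from collections import Counter


def ngrams(tokens, lo, hi):
    """Return all word n-grams with lo <= n <= hi."""
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs, ngram_range=(1, 1), min_df=1, max_df=1.0,
          stop_words=(), norm="l2"):
    """Return (vocabulary, dense TF-IDF rows) for a list of strings."""
    stop = set(stop_words)
    bags = []
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in stop]
        bags.append(Counter(ngrams(tokens, *ngram_range)))
    n = len(docs)
    df = Counter(term for bag in bags for term in bag)
    # Assumed convention: min_df is an absolute count, max_df a fraction.
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed
    rows = []
    for bag in bags:
        vec = [bag[t] * idf[t] for t in vocab]
        if norm == "l2":
            length = math.sqrt(sum(v * v for v in vec)) or 1.0
            vec = [v / length for v in vec]
        rows.append(vec)
    return vocab, rows


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [list(rows[0])]
    while len(centers) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: sq_dist(r, centers[c]))
               for r in rows]
        if new == labels:  # assignments stable: converged
            break
        labels = new
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical placeholder phrases (NOT the paper's dataset).
docs = [
    "dengizda dam olish va sayohat",
    "yozgi sayohat va dam olish",
    "sun'iy intellekt va algoritm",
    "sun'iy intellekt modeli",
    "baliq suvda yashaydi",
    "baliq va suv hayoti",
]
vocab, rows = tfidf(docs, ngram_range=(1, 2), min_df=2, max_df=0.9,
                    stop_words=["va"])
labels = kmeans(rows, k=3)
print(vocab)
print(labels)
```

With so few documents, `min_df` and the stopword list decide almost the entire vocabulary. Here `min_df=2` drops every term that occurs in only one phrase, which shows why small changes to these thresholds can change the clustering outcome on a corpus of this size.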
