Article

Unsupervised Hierarchical Clustering and Rule-Based Labeling of Uzbek Educational Texts

Khabibulla Madatov, Sapura Sattarova, Feruza Abdrimova
Al-Beruni Urgench State University, Department of Computer Science and AI Technologies, Urgench, Uzbekistan
2025
ABI

Abstract

This paper presents an unsupervised clustering approach for organizing Uzbek educational texts into meaningful thematic groups. The corpus comprises approximately 9,700 text segments that were minimally normalized to preserve domain terminology. Texts are represented with term frequency–inverse document frequency (TF–IDF) features over unigrams and bigrams, then reduced via truncated singular value decomposition (SVD) to obtain a compact semantic space. We apply hierarchical agglomerative clustering with Ward linkage and select the number of clusters by scanning candidate values and comparing internal indices (Silhouette, Davies–Bouldin). The analysis indicates that ten clusters provide a balanced trade-off between cohesion and separation. Cluster interpretation, based on representative terms, reveals coherent groups aligned with major subject domains (literature, history, geography, mathematics, physics, biology). Quantitative evaluation confirms the reliability of the solution, showing strong agreement between cluster assignments and rule-based labels. The approach supports efficient cataloging of large-scale Uzbek educational resources and enables downstream applications such as curriculum analysis, semantic search, and intelligent content recommendation.
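
The pipeline summarized in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. The function name `scan_cluster_counts`, the toy segments in `TOY_TEXTS`, the default tokenizer, and the SVD dimensionality are illustrative assumptions, not the authors' settings or data. The sketch builds TF-IDF features over unigrams and bigrams, reduces them with truncated SVD, runs Ward-linkage agglomerative clustering for each candidate number of clusters, and reports both internal indices.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Hypothetical toy segments standing in for the real corpus (assumption).
TOY_TEXTS = [
    "matematika tenglama yechish usuli",
    "matematika kvadrat tenglama ildizi",
    "matematika funksiya grafigi tenglama",
    "tarix davlat hukmdor urush",
    "tarix qadimgi davlat madaniyati",
    "tarix hukmdor davlat siyosati",
    "biologiya hujayra tuzilishi organizm",
    "biologiya organizm hujayra bo'linishi",
    "biologiya o'simlik hujayra to'qima",
]


def scan_cluster_counts(texts, candidate_ks, n_components=100, seed=0):
    """Return {k: {"silhouette", "davies_bouldin", "labels"}} for each candidate k."""
    # TF-IDF over unigrams and bigrams, as described in the abstract.
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    X = tfidf.fit_transform(texts)

    # Cap the SVD rank so the sketch also runs on very small corpora.
    rank = max(1, min(n_components, X.shape[0] - 1, X.shape[1] - 1))
    Z = TruncatedSVD(n_components=rank, random_state=seed).fit_transform(X)

    results = {}
    for k in candidate_ks:
        # Ward linkage merges the pair of clusters with the smallest
        # increase in within-cluster variance (Euclidean distance).
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
        labels = model.fit_predict(Z)
        results[k] = {
            "silhouette": silhouette_score(Z, labels),          # higher is better
            "davies_bouldin": davies_bouldin_score(Z, labels),  # lower is better
            "labels": labels,
        }
    return results


if __name__ == "__main__":
    for k, scores in scan_cluster_counts(TOY_TEXTS, [2, 3, 4]).items():
        print(k, round(scores["silhouette"], 3), round(scores["davies_bouldin"], 3))
```

A higher Silhouette value and a lower Davies–Bouldin value both indicate a better partition, so a candidate on which the two indices agree is a natural choice. On the full corpus the range of candidates would be wider than in this toy run.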
