Article

Unsupervised Hierarchical Clustering and Rule-Based Labeling of Uzbek Educational Texts

Khabibulla Madatov, Al-Beruni Urgench State University, Department of Computer Science and AI Technologies, Urgench, Uzbekistan
Sapura Sattarova, Al-Beruni Urgench State University, Department of Computer Science and AI Technologies, Urgench, Uzbekistan
Feruza Abdrimova, Al-Beruni Urgench State University, Department of Computer Science and AI Technologies, Urgench, Uzbekistan

2025


Abstract

This paper presents an unsupervised clustering approach for organizing Uzbek educational texts into meaningful thematic groups. The corpus comprises approximately 9,700 text segments that were minimally normalized to preserve domain terminology. Texts are represented with term frequency–inverse document frequency (TF–IDF) features over unigrams and bigrams, then reduced via truncated singular value decomposition (SVD) to obtain a compact semantic space. We apply hierarchical agglomerative clustering with Ward linkage and select the number of clusters by scanning candidate values and comparing internal indices (Silhouette, Davies–Bouldin). The analysis indicates that ten clusters provide a balanced trade-off between cohesion and separation. Cluster interpretation, based on representative terms, reveals coherent groups aligned with major subject domains (literature, history, geography, mathematics, physics, biology). Quantitative evaluation confirms the reliability of the solution, showing strong agreement between cluster assignments and rule-based labels. The approach supports efficient cataloging of large-scale Uzbek educational resources and enables downstream applications such as curriculum analysis, semantic search, and intelligent content recommendation.
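The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the function name `scan_cluster_counts`, the toy English corpus `demo_texts`, the number of SVD components, and the candidate range of cluster counts are all assumptions made for the example, since the abstract does not give these details.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import davies_bouldin_score, silhouette_score


def scan_cluster_counts(texts, candidate_ks, n_components=100, seed=0):
    """Return {k: (silhouette, davies_bouldin, labels)} for each candidate k.

    Hypothetical helper: parameter defaults are illustrative assumptions.
    """
    # TF-IDF features over unigrams and bigrams.
    tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

    # Truncated SVD to a compact dense space; cap the dimensionality so the
    # sketch also runs on very small corpora.
    n_components = min(n_components, min(tfidf.shape) - 1)
    reduced = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(tfidf)

    results = {}
    for k in candidate_ks:
        # Hierarchical agglomerative clustering with Ward linkage.
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(reduced)
        # Internal indices: higher Silhouette is better, lower Davies-Bouldin is better.
        sil = silhouette_score(reduced, labels)
        db = davies_bouldin_score(reduced, labels)
        results[k] = (sil, db, labels)
    return results


# Toy stand-in corpus (assumption); the paper's corpus is about 9,700 Uzbek segments.
demo_texts = [
    "the poem uses metaphor and rhyme",
    "the novel has a tragic hero",
    "the poet wrote a famous epic poem",
    "the story and the novel share a narrator",
    "the triangle has three angles",
    "solve the equation for the unknown x",
    "the sum of angles in a triangle",
    "a quadratic equation has two roots",
    "the cell has a nucleus and membrane",
    "plants make energy by photosynthesis",
    "the cell divides during mitosis",
    "photosynthesis happens in plant leaves",
]

results = scan_cluster_counts(demo_texts, candidate_ks=[2, 3, 4], n_components=5)
for k, (sil, db, _) in results.items():
    print(f"k={k}: silhouette={sil:.3f}, Davies-Bouldin={db:.3f}")
```

Comparing the two indices across the candidate values is how a cluster count would be chosen in a setup like this; the abstract reports that ten clusters gave the best balance on the authors' corpus.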

Topics

Advanced Clustering Algorithms Research
Text and Document Classification Technologies
Information Retrieval and Search Behavior

Identifiers

DOI: 10.1109/apeie66761.2025.11289332

Citations and references

Cited by 0 · 21 references
