Automatic Topic Detection in Large Text Data in Uzbek Using Clustering Methods
Abstract
This article presents an approach to automatic topic detection in large volumes of Uzbek-language text using clustering methods. The research develops and applies the “bubble trap” clustering model to segment the vector space of text documents into semantic clusters. The model keeps the volume and center position of each cluster unchanged when new vectors are added, which supports accurate and stable clustering. The methodology comprises preprocessing the Uzbek text, vectorizing it with term frequency-inverse document frequency (TF-IDF) weighting, and clustering the resulting vectors by semantic similarity. Applied to Uzbek text data, the model reveals hidden patterns and structures, demonstrating its effectiveness for natural language processing and text mining. The study concludes that the “bubble trap” algorithm improves automatic topic detection for the Uzbek language, providing a reliable tool for analyzing large text corpora and opening new research directions in computational linguistics. The approach addresses the challenges of clustering unstructured text and contributes a scalable and accurate solution for topic detection in weakly structured text corpora.
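The pipeline named in the abstract (preprocessing, TF-IDF vectorization, similarity-based clustering) can be illustrated with a minimal sketch. This is not the “bubble trap” model, whose details the abstract does not give. The clustering step here is a simple stand-in in which each cluster center is fixed once created, which only loosely echoes the idea of a center that does not move when vectors are added. All function names, the toy documents, and the threshold value are assumptions made for illustration, and the tokenizer is far simpler than real Uzbek preprocessing would need to be.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical minimal preprocessing: lowercase, split on whitespace,
    # strip common punctuation. Apostrophes are kept because they occur
    # inside Uzbek words such as "o'yin".
    tokens = (tok.strip(".,;:!?()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def tfidf_vectors(docs):
    # Sparse TF-IDF vectors as dicts: tf = count / doc length,
    # idf = ln(N / document frequency).
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def fixed_center_clusters(vectors, threshold=0.2):
    # Stand-in clustering (an assumption, not the paper's algorithm):
    # the first vector of each cluster becomes its center and never moves.
    # A vector joins the most similar center if similarity >= threshold,
    # otherwise it starts a new cluster.
    centers, labels = [], []
    for vec in vectors:
        best, best_sim = None, threshold
        for idx, center in enumerate(centers):
            sim = cosine(vec, center)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centers.append(vec)
            best = len(centers) - 1
        labels.append(best)
    return labels

# Toy documents, assumed to be already reduced to base word forms.
docs = [
    "futbol jamoa o'yin gol",
    "futbol o'yin stadion",
    "bank kredit foiz",
    "bank kredit valyuta",
]

vectors = tfidf_vectors(docs)
labels = fixed_center_clusters(vectors, threshold=0.2)
print(labels)  # [0, 0, 1, 1]
```

On this toy input the two sports documents and the two finance documents land in separate clusters. In practice the threshold, the tokenizer, and any stemming or stop-word handling would need tuning for Uzbek morphology.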