Article

Automatic Topic Detection in Large Text Data in Uzbek using Clustering Methods

Umirova Svetlana, Samarkand State Univ., Department of Uzbek Linguistics, Samarkand, Uzbekistan
Kholmuhamedov Bakhtiyor, Samarkand State Univ., Department of Uzbek Linguistics, Samarkand, Uzbekistan
Karimov Suyun Amirovich, Samarkand State Univ., Department of Uzbek Linguistics, Samarkand, Uzbekistan
Narzieva Mamura, Samarkand State Univ., Department of Uzbek Linguistics, Samarkand, Uzbekistan

2024 · en


Abstract

This article presents an approach to automatic topic detection in large volumes of Uzbek text data using clustering methods. The research develops and applies the “bubble trap” clustering model, which segments the vector space of text documents into semantic clusters. The model keeps the volume and center position of each cluster unchanged when new vectors are added, ensuring high clustering accuracy and stability. The methodology consists of preprocessing Uzbek text data, vectorizing it with term frequency-inverse document frequency (TF-IDF), and clustering the resulting vectors by semantic similarity. Applied to Uzbek text data, the model reveals hidden patterns and structures, demonstrating its effectiveness for natural language processing and text mining. The study concludes that the “bubble trap” algorithm significantly improves automatic topic detection for the Uzbek language, providing a reliable tool for analyzing large text corpora and opening new research directions in computational linguistics. The approach addresses the challenges of clustering unstructured text data and offers a scalable, accurate solution for topic detection in weakly structured text corpora.
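The pipeline described in the abstract (preprocessing, TF-IDF vectorization, similarity-based clustering) can be illustrated with a minimal Python sketch. The abstract does not specify the “bubble trap” algorithm, so the clustering step below is a stand-in and not the authors' method: a simple threshold rule in which each cluster center is fixed at creation and never moves when new vectors join, echoing only the "unchanged center" property mentioned above. The function names, the tokenizer, the `threshold` value, and the four sample sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed preprocessing: lowercase, keep runs of Latin letters and
    # apostrophes (Uzbek Latin script writes o' and g' with an apostrophe).
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    # Sparse, L2-normalized TF-IDF vectors stored as {term: weight} dicts.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster_fixed_centers(vectors, threshold=0.2):
    # Stand-in clustering rule (an assumption, not the paper's algorithm):
    # a vector joins the most similar existing center if the similarity
    # reaches `threshold`; otherwise it becomes a new center.
    # Centers are never updated after creation.
    centers, labels = [], []
    for vec in vectors:
        best, best_sim = -1, threshold
        for i, center in enumerate(centers):
            sim = cosine(vec, center)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            centers.append(vec)
            best = len(centers) - 1
        labels.append(best)
    return labels, centers


# Illustrative Uzbek sentences (two about football, two about banking).
docs = [
    "Futbol jamoasi o'yinda g'alaba qozondi",
    "Futbol jamoasi yangi o'yinga tayyorlanmoqda",
    "Bank foiz stavkasini oshirdi",
    "Bank kredit foiz stavkasini pasaytirdi",
]
vectors = tfidf_vectors(docs)
labels, centers = cluster_fixed_centers(vectors)
print(labels)
```

Because centers are fixed, the result depends on document order and on the chosen threshold; a real system would need to tune both, and would likely add stemming or lemmatization for Uzbek's agglutinative morphology.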

Topics

Advanced Text Analysis Techniques
Topic Modeling
Text and Document Classification Technologies

Identifiers

DOI: 10.1109/ubmk63289.2024.10773487

Citations and references

Cited by 1 · 26 references
