Article

Uzbek Public-Sector Text Classification: Naive Bayes, Logistic Regression and SVM Benchmarks

Nilufar Abdurakhmonova, National University of Uzbekistan Named After Mirzo Ulugbek, Tashkent, Uzbekistan
Nilufar Istamovna Adizova, Bukhara State University, Bukhara, Uzbekistan
Dilnoza Sobirova, Bukhara State University, Bukhara, Uzbekistan

2025


Abstract

The authors present an algorithm for classifying Uzbek-language documents in key public-sector domains: economics, legal texts, healthcare, housing and utilities, and energy. The study compares three established linear baselines (multinomial naive Bayes, logistic regression, and linear SVM) within a fixed, source-agnostic evaluation protocol. Texts are collected from government agencies and media outlets, cleaned of template text, normalized for spelling variants and apostrophes, and vectorized using TF-IDF at the word and character levels, including a hybrid representation. A cross-domain analysis reveals systematic confusions between the housing-and-utilities and energy categories, as well as between the legal and economic categories, reflecting shared tariff narratives, outage and maintenance reporting, and regulatory language embedded in economic indicators. Error analysis also shows that character n-grams and script normalization are critical for robustness in agglutinative, mixed-orthography environments. The article also describes the features of the Uzbek language that must be taken into account when developing such tools.
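The preprocessing described in the abstract (apostrophe normalization followed by word-level and character-level features combined into a hybrid representation) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The list of apostrophe variants, the choice of the ASCII apostrophe as the canonical form, the n-gram range of 2 to 4, and the function names `normalize`, `word_tokens`, `char_ngrams` and `hybrid_features` are all assumptions made for illustration. The sketch produces raw feature counts only; TF-IDF weighting and the classifiers themselves are left out.

```python
import re
import unicodedata
from collections import Counter

# Apostrophe-like characters that commonly appear in Uzbek Latin text.
# This list and the ASCII target are assumptions, not taken from the article.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bb\u02bc\u0060\u00b4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize(text):
    """Unify apostrophe variants, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_APOSTROPHE_TABLE).lower()
    return re.sub(r"\s+", " ", text).strip()


def word_tokens(text):
    """Split into words, keeping word-internal apostrophes (o'zbek, ma'no)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max over the string."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def hybrid_features(text):
    """Count word features (prefix 'w:') and character n-grams (prefix 'c:')."""
    norm = normalize(text)
    feats = Counter(f"w:{tok}" for tok in word_tokens(norm))
    feats.update(f"c:{gram}" for gram in char_ngrams(norm))
    return feats
```

With this sketch, spellings that differ only in the apostrophe character (for example U+02BB versus U+2019 in "Oʻzbekiston") map to the same word feature and the same character n-grams, which is the kind of robustness the abstract attributes to script normalization and character n-grams.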

Topics

Text and Document Classification Technologies; Sentiment Analysis and Opinion Mining; Computational and Text Analysis Methods

Identifiers

DOI: 10.1109/apeie66761.2025.11289232

Citations and references

Cited by 0 · 24 references
