Uzbek Public-Sector Text Classification: Naive Bayes, Logistic Regression and SVM Benchmarks
Abstract
The authors present an approach to classifying Uzbek-language documents in key public-sector domains: economics, legal texts, healthcare, housing and utilities, and energy. The study compares three established linear baselines (multinomial naive Bayes, logistic regression, and linear SVM) within a fixed, source-agnostic evaluation protocol. Texts are collected from government agencies and media outlets, cleaned of template text, normalized for spelling variants and apostrophes, and vectorized using TF-IDF at the word level, at the character level, and in a hybrid representation combining the two. A cross-domain analysis reveals systematic confusions between the housing-and-utilities and energy categories, as well as between the legal and economic categories. These confusions reflect shared tariff narratives, outage and maintenance reporting, and regulatory language embedded in economic indicators. Error analysis also shows that character n-grams and script normalization are critical for robustness in agglutinative, mixed-orthography environments. The article also summarizes the features of the Uzbek language that must be taken into account when developing such tools.
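The apostrophe normalization and character-level features mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the list of apostrophe variants, the choice of U+02BB as the single canonical mark, and the function names `normalize` and `char_ngrams` are all assumptions made for illustration. Folding every variant to one character is a simplification, since it also merges the marks used in oʻ/gʻ with the glottal-stop sign.

```python
import re
import unicodedata

# Assumption: these apostrophe-like characters all occur in Uzbek Latin text
# and are folded to one canonical mark (U+02BB). The study may use another scheme.
APOSTROPHE_VARIANTS = "'`\u2018\u2019\u02bc\u00b4"
CANONICAL_APOSTROPHE = "\u02bb"
_APOSTROPHE_TABLE = str.maketrans(
    {ch: CANONICAL_APOSTROPHE for ch in APOSTROPHE_VARIANTS}
)


def normalize(text: str) -> str:
    """Unify apostrophe variants, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_APOSTROPHE_TABLE)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> list[str]:
    """Return all character n-grams of length n_min..n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


if __name__ == "__main__":
    a = normalize("O'zbekiston")        # straight apostrophe
    b = normalize("O\u2018zbekiston")   # left single quotation mark
    print(a == b)                       # both spellings map to the same string
    print(char_ngrams("gaz", 2, 3))
```

After normalization, spellings that differ only in the apostrophe character produce identical n-grams, so a TF-IDF vectorizer built on these features would not split one word into several sparse variants.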