Development of an Algorithm for the Classification of Financial Texts in the Uzbek Language
Abstract
This article proposes a method for the automated classification of texts by document type, based on logistic regression and N-gram vectorization. The goal of the study is an algorithm that reliably processes both “simple” corpora with a clear structure and “complex” collections with heterogeneous topics, spelling errors, and elements of informal speech. The algorithm comprises preprocessing (normalization, tokenization), N-gram feature extraction (unigrams and bigrams) with TF-IDF weighting, and training of a multi-class logistic regression classifier in a one-vs-rest scheme. Three datasets of varying complexity were prepared for objective validation. Performance is measured by Accuracy, Precision, Recall, and F1 (macro- and micro-averaged), and confusion matrices and precision-recall (PR) curves are constructed to analyze model behavior at class boundaries. The experiments show that the algorithm achieves high accuracy on structured texts (e.g., official business documents), with the best results obtained using early stopping and careful hyperparameter tuning. Predictions are returned as categories with associated probabilities, which simplifies threshold tuning and integration into real-world systems such as banking applications, electronic document management, and legal pipelines. A comparative review of alternative solutions was also conducted; it highlights key challenges in the field: sensitivity to code-switching and dialectal forms, dependence on the completeness of the feature vocabulary, and degradation on “dirty” (noisy) data. Planned developments include expanding the training and test corpora, adding normalization modules (morphology, spelling, dialect), calibrating probabilities, and moving to neural network architectures and contextual representations (e.g., transformers and sentence embeddings) to handle large, high-dimensional data and greater language variability.
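The feature-extraction stage described above (normalization, tokenization, unigram and bigram extraction, TF-IDF weighting) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names, the tokenization rule, the smoothed IDF formula, the L2 normalization, and the three short sample strings are all assumptions made for illustration. The classifier stage (one-vs-rest logistic regression) is not shown.

```python
import math
import re
from collections import Counter

# Assumed tokenization rule: runs of letters, optionally joined by an
# apostrophe (as in the Uzbek Latin spelling "to'lov").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text):
    """Normalize to lowercase and split into word tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n_max=2):
    """Return all n-grams for n = 1..n_max (unigrams and bigrams by default)."""
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return feats


def fit_idf(docs):
    """Compute a smoothed IDF weight for every n-gram seen in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(tokenize(doc))))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Map a document to an L2-normalized sparse TF-IDF vector (a dict)."""
    counts = Counter(t for t in ngrams(tokenize(doc)) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


# Hypothetical mini-corpus, for illustration only.
docs = [
    "Kredit shartnomasi",
    "Kredit to'lov jadvali",
    "To'lov topshiriqnomasi",
]
idf = fit_idf(docs)
vec = tfidf_vector("Kredit to'lov", idf)
```

Here `vec` is a sparse vector keyed by n-gram; terms that occur in fewer documents receive a higher IDF weight. In practice such vectors would be the input on which the classifier is trained.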