Article

Document-Level Semantic Analysis of Uzbek Texts— Scheme, Benchmark, and Reproducible Results

Saidbek P. Babayazov, Cyber University, Nurafshon, Uzbekistan
Lola Samandarova, Alfraganus University, Tashkent, Uzbekistan
Mohira MATNAZAROVA, Urgench State University, Urgench, Uzbekistan
Umidbek P. Babayazov, Urgench State University, Urgench, Uzbekistan
Khurshida Qodirova, Urgench State Pedagogical Institute, Urgench, Uzbekistan
Shukhrat Khudayberganov

2025

Abstract

This work presents a practical document-level semantic annotation scheme for the Uzbek language and two target datasets of 1,200 and 1,500 documents designed for three tasks: topic classification (THEME/SUB), communicative function classification (FUNC), and concise one-sentence abstractive summarization. We release a reproducible pipeline (70/15/15 splits, training and evaluation scripts, report templates) and report reference results on a synthetic 2,700-document test bed that mirrors the size and class distributions of the intended corpora. Strong baselines based on TF-IDF with logistic regression (LR) or SVM reach high accuracy for THEME and FUNC; a multilingual Transformer (XLM-R/DistilBERT) further improves performance on borderline cases. For summarization, we compare an extractive centroid baseline against an abstractive mT5-small model constrained to a single sentence. Methodologically, our scheme relies on clear guidelines and a pilot double annotation measured with Cohen’s κ; on the engineering side, it favors transparent protocols and low-cost baselines, making the solution suitable both for production systems and for scholarly assessment.
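As a rough illustration of two protocol steps named in the abstract, the 70/15/15 split and the Cohen's κ agreement check, the following is a minimal sketch and not the authors' released pipeline. The function names `split_70_15_15` and `cohen_kappa`, the fixed seed, and the use of a plain (non-stratified) shuffle are all assumptions made for this example.

```python
import random
from collections import Counter


def split_70_15_15(items, seed=0):
    """Shuffle items and split them into train/dev/test at 70/15/15.

    Hypothetical helper: a plain random split with a fixed seed.
    Whether the released pipeline stratifies by class is not stated.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_dev = n * 15 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to test
    return train, dev, test


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of documents with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label counts.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Under these assumptions, a 1,200-document set splits into 840, 180, and 180 documents, and `cohen_kappa` can be applied to the THEME or FUNC labels produced by two annotators on the pilot sample.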

Topics

Text and Document Classification Technologies; Topic Modeling; Sentiment Analysis and Opinion Mining

Identifiers

DOI: 10.1109/apeie66761.2025.11289323

Citations and references

Cited by: 0
References: 20
