Document-Level Semantic Analysis of Uzbek Texts: Scheme, Benchmark, and Reproducible Results
Abstract
This work presents a practical document-level semantic annotation scheme for Uzbek and two target datasets of 1,200 and 1,500 documents, designed for three tasks: topic classification (THEME/SUB), communicative-function classification (FUNC), and concise abstractive summarization (one sentence). We release a reproducible pipeline (70/15/15 splits, training and evaluation scripts, report templates) and report reference results on a synthetic 2,700-document test bed that mirrors the size and class distributions of the intended corpora. Strong baselines that combine TF-IDF features with logistic regression or an SVM (TF-IDF+LR/SVM) reach high accuracy on THEME and FUNC, and a multilingual Transformer (XLM-R/DistilBERT) further improves performance on borderline cases. For summarization, we compare an extractive centroid baseline against an abstractive mT5-small model constrained to a single sentence. Methodologically, the scheme relies on clear guidelines and a pilot double annotation whose agreement is measured with Cohen's κ. On the engineering side, it favors transparent protocols and low-cost baselines, which makes it suitable for both production systems and scholarly evaluation.
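As an illustration of the agreement measure named above, the following is a minimal sketch of Cohen's κ for two annotators, written in plain Python. It is not the released pipeline: the function name `cohen_kappa` and the example FUNC labels (`inform`, `persuade`) are hypothetical and chosen only for the demonstration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label everywhere; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical FUNC labels for four documents (illustration only).
annotator_1 = ["inform", "inform", "persuade", "persuade"]
annotator_2 = ["inform", "persuade", "persuade", "persuade"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

In this toy case the annotators agree on 3 of 4 documents (p_o = 0.75), chance agreement is p_e = 0.5, and κ = 0.5.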