Article

Document-Level Semantic Analysis of Uzbek Texts: Scheme, Benchmark, and Reproducible Results

Saidbek P. Babayazov, Cyber University, Nurafshon, Uzbekistan
Lola Samandarova, Alfraganus University, Tashkent, Uzbekistan
Mohira MATNAZAROVA, Urgench State University, Urgench, Uzbekistan
Umidbek P. Babayazov, Urgench State University, Urgench, Uzbekistan
Khurshida Qodirova, Urgench State Pedagogical Institute, Urgench, Uzbekistan
Shukhrat Khudayberganov
2025
ABI

Abstract

This work presents a practical document-level semantic annotation scheme for the Uzbek language, together with two target datasets of 1200 and 1500 documents designed for three tasks: topic classification (THEME/SUB), communicative function classification (FUNC), and concise one-sentence abstractive summarization. We release a reproducible pipeline (70/15/15 splits, training and evaluation scripts, report templates) and report reference results on a synthetic 2700-document testbed that mirrors the size and class distributions of the intended corpora. Strong baselines based on TF-IDF+LR/SVM reach high accuracy on THEME and FUNC; a multilingual Transformer (XLM-R/DistilBERT) further improves performance on borderline cases. For summarization, we compare an extractive centroid baseline against an abstractive mT5-small model constrained to a single sentence. Methodologically, the scheme relies on clear guidelines and a pilot double annotation whose agreement is measured with Cohen’s κ; on the engineering side, it favors transparent protocols and low-cost baselines, making the solution suitable both for production systems and for scholarly assessment.
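Two of the protocol elements named in the abstract, the 70/15/15 split and Cohen’s κ for the pilot double annotation, can be illustrated with a minimal standard-library sketch. This is not the authors' released pipeline: the function names, the per-label stratification, the fixed seed, and the example labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=13):
    """Return (train, dev, test) index lists, split per label.

    Hypothetical helper: stratifying by label and the seed value are
    assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_dev = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]  # remainder goes to the test split
    return train, dev, test


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical THEME labels; real label names are not given in the abstract.
    labels = ["news"] * 100 + ["law"] * 100
    train, dev, test = stratified_split(labels)
    print(len(train), len(dev), len(test))
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Splitting within each label keeps the class proportions roughly equal across the three partitions, which matters when some THEME or FUNC classes are small.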
