Article

Neural Sequence Models for Uzbek Morphological Stemming

Ulugbek Salaev, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan

Gayrat Matlatipov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan

2025

ABI

Abstract

Morphological analysis of Uzbek text is a prerequisite for robust normalization and subsequent linguistic processing in Natural Language Processing (NLP). This paper presents a character-level neural network model that maps surface forms to stems using an encoder–decoder architecture with a bidirectional long short-term memory (BiLSTM) encoder and an attention-based decoder. The model is trained on a morphologically annotated corpus extracted from Uzbek news text. Inputs are normalized at the character level, including careful handling of apostrophe variants, and the data are split into training and held-out test partitions. Evaluation uses exact stem accuracy and sequence edit distance. The proposed model achieves a test accuracy of approximately 92.3%, outperforming strong neural baselines (a pure BiLSTM tagger, a character-convolution plus BiLSTM network, and a transformer-based sequence-to-sequence approach) by roughly 3 to 10 percentage points. The experiments show that most remaining errors arise from multi-character affixes and orthographic apostrophes. The results indicate that attention over character sequences effectively captures productive Uzbek morphology and substantially reduces normalization errors. We will release code and scripts to reproduce all experiments for research use.
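The abstract names three concrete steps around the model: apostrophe normalization, exact stem accuracy, and sequence edit distance. The Python sketch below illustrates one plausible reading of them and is not the authors' released code. The set of apostrophe variants, the choice of Levenshtein distance as the edit distance, and the function names `normalize`, `edit_distance`, and `evaluate` are all assumptions made for illustration.

```python
import unicodedata

# Assumption: the abstract does not list the apostrophe variants it handles.
# These are common look-alikes found in Uzbek Latin text.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02BB\u02BC\u0060\u00B4"
_APOSTROPHE_TABLE = {ord(ch): "'" for ch in APOSTROPHE_VARIANTS}


def normalize(word: str) -> str:
    """Apply NFC normalization and map apostrophe variants to ASCII "'"."""
    return unicodedata.normalize("NFC", word).translate(_APOSTROPHE_TABLE)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def evaluate(predictions, references):
    """Return (exact stem accuracy, mean edit distance) over paired lists."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    pairs = [(normalize(p), normalize(r)) for p, r in zip(predictions, references)]
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    mean_distance = sum(edit_distance(p, r) for p, r in pairs) / len(pairs)
    return accuracy, mean_distance
```

For instance, `evaluate(["kitob", "maktab"], ["kitob", "maktabi"])` scores one exact match out of two, with the second prediction one character away from its reference. Normalizing both sides before comparison means a prediction is not penalized merely for using a different apostrophe code point than the gold stem.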

Topics

Natural Language Processing Techniques; Topic Modeling; Text and Document Classification Technologies

Identifiers

DOI: 10.1109/apeie66761.2025.11289244

Citations and references

Citations: 0
References used: 13
