Neural Sequence Models for Uzbek Morphological Stemming
Abstract
Morphological analysis of Uzbek text is a prerequisite for robust normalization and downstream linguistic processing in Natural Language Processing (NLP). This paper presents a character-level neural model that maps surface forms to stems using an encoder–decoder architecture with a bidirectional long short-term memory (BiLSTM) encoder and an attention-based decoder. The model is trained on a morphologically annotated corpus extracted from Uzbek news text. Inputs are normalized at the character level, with careful handling of apostrophe variants, and the data are split into training and held-out test partitions. Evaluation uses exact stem accuracy and sequence edit distance. The proposed model achieves a test accuracy of approximately 92.3%, outperforming strong neural baselines (a pure BiLSTM tagger, a character-convolution plus BiLSTM network, and a transformer-based sequence-to-sequence model) by roughly 3 to 10 percentage points. Our experiments show that most remaining errors arise from multi-character affixes and orthographic apostrophes. The results indicate that attention over character sequences effectively captures productive Uzbek morphology and substantially reduces normalization errors. We will release code and scripts to reproduce all experiments for research use.
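As a rough illustration of the preprocessing and evaluation steps named above, the following minimal sketch shows one way apostrophe variants could be unified and how exact stem accuracy and edit distance could be computed. It is not the paper's implementation: the set of apostrophe variants, the choice of the ASCII apostrophe as the target character, and the function names `normalize_apostrophes`, `edit_distance`, and `exact_accuracy` are all assumptions made for this example, and edit distance is taken here to be plain Levenshtein distance.

```python
# Assumed variant set; the paper does not list which characters it handles.
# U+2018, U+2019 (curly quotes), U+02BB, U+02BC (modifier letters), U+0060 (backtick).
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bb\u02bc\u0060"

# Assumption: every variant is mapped to the ASCII apostrophe U+0027.
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize_apostrophes(word: str) -> str:
    """Replace each assumed apostrophe variant with a plain ASCII apostrophe."""
    return word.translate(_APOSTROPHE_TABLE)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def exact_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predicted stems that equal the gold stem exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Under these assumptions, a predicted stem counts toward exact accuracy only if it matches the gold stem character for character, while edit distance gives partial credit information: a prediction that leaves a three-character suffix attached, such as `kitoblar` for the gold stem `kitob`, has distance 3.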