Article

Neural Sequence Models for Uzbek Morphological Stemming

Ulugbek Salaev, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan
Gayrat Matlatipov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science and Artificial Intelligence Technologies, Urgench, Uzbekistan
2025
ABI

Abstract

Morphological analysis of Uzbek text is a prerequisite for robust normalization and subsequent linguistic processing in Natural Language Processing (NLP). This paper presents a character-level neural network model that maps surface forms to stems using an encoder–decoder architecture with a bidirectional long short-term memory encoder and an attention-based decoder. The model is trained on a morphologically annotated corpus extracted from Uzbek news text. Inputs are normalized at the character level, including careful handling of apostrophe variants, and the data are split into training and held-out test partitions. Evaluation uses exact stem accuracy and sequence edit distance. The proposed model achieves a test accuracy of approximately 92.3%, outperforming strong neural baselines, including a pure bidirectional long short-term memory tagger, a character convolution plus bidirectional long short-term memory network, and a transformer-based sequence-to-sequence approach, by roughly 3 to 10 percentage points. The experiments show that most remaining errors arise from multi-character affixes and orthographic apostrophes. The results indicate that attention over character sequences effectively captures productive Uzbek morphology and substantially reduces normalization errors. We will release code and scripts to reproduce all experiments for research use.
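The abstract names two preprocessing and evaluation ingredients, apostrophe normalization and the pair of metrics (exact stem accuracy and sequence edit distance), without giving details. The following is a minimal sketch of how these could look, not the authors' released code. The set of apostrophe variants, the choice of `'` as the canonical form, the use of NFC, and the helper names `normalize`, `edit_distance`, and `evaluate` are all assumptions made for illustration.

```python
import unicodedata

# Assumed set of apostrophe-like characters found in Uzbek Latin text.
# Illustrative only: the abstract does not list the variants it handles.
APOSTROPHE_VARIANTS = "\u2018\u2019\u02bb\u02bc\u0060\u00b4"
CANONICAL_APOSTROPHE = "'"  # assumed canonical form
_TABLE = str.maketrans({ch: CANONICAL_APOSTROPHE for ch in APOSTROPHE_VARIANTS})


def normalize(word: str) -> str:
    """Apply NFC and map every assumed apostrophe variant to one character."""
    return unicodedata.normalize("NFC", word).translate(_TABLE)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(predicted: list[str], gold: list[str]) -> tuple[float, float]:
    """Return (exact stem accuracy, mean edit distance) over paired stems."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    pairs = [(normalize(p), normalize(g)) for p, g in zip(predicted, gold)]
    exact = sum(p == g for p, g in pairs)
    total = sum(edit_distance(p, g) for p, g in pairs)
    return exact / len(pairs), total / len(pairs)
```

Normalizing both the prediction and the gold stem before comparison keeps a stem from being counted as wrong merely because it uses a different apostrophe code point; whether the paper scores this way is not stated.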
