Article

Design and Implementation of a Markov Model-Based POS Tagger for the Uzbek Language

Maksud Sharipov, Urgench State University Named After Abu Rayhan, Department of Computer Science, Biruni Urgench, Uzbekistan

Manzura Abjalova, Alisher Navo'i Tashkent State University of Uzbek Language and Literature, Department of Uzbek Philology, Tashkent, Uzbekistan

Ogabek Sobirov, Urgench State University Named After Abu Rayhan, Department of Computer Science, Biruni Urgench, Uzbekistan

2025

ABI

Abstract

This paper presents the design and implementation of a Hidden Markov Model (HMM)-based Part-of-Speech (POS) tagger for the Uzbek language. To address Uzbek's agglutinative morphology, rich inflectional patterns, and limited digital resources, the approach uses a first-order HMM to assign grammatical categories to the words of a sentence probabilistically. We describe a custom, pragmatic tagset of 13 POS categories designed to capture the morphosyntactic specifics of Uzbek. The model was trained and evaluated on a manually annotated corpus of 48,079 words spanning 30 domains, collected and refined by computational linguistics experts. The corpus covers a range of formal and semi-formal texts, which is intended to support robust performance across contexts. The Viterbi algorithm decodes the most probable tag sequence using transition and emission probabilities estimated from the corpus. By contributing a statistically grounded POS tagger, this study aims to strengthen foundational Natural Language Processing (NLP) resources for Uzbek and to show that HMMs remain viable and effective for low-resource, morphologically rich languages.
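The decoding step named in the abstract can be illustrated with a minimal sketch of a first-order HMM tagger with Viterbi decoding. It is not the authors' implementation. The three-sentence toy corpus, the tag names (`PRON`, `NOUN`, `ADJ`, `VERB`), the `<s>` start symbol, and the add-alpha smoothing are all assumptions made for illustration; the abstract does not list the paper's 13-tag tagset or say how unseen words are handled.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start symbol


def train_hmm(tagged_sentences, alpha=1.0):
    """Estimate add-alpha smoothed log-probabilities (smoothing is an assumption)."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            word = word.lower()
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        # The extra +1 in the vocabulary size reserves mass for unseen words.
        return math.log(
            (emit[tag][word.lower()] + alpha) / (total + alpha * (len(vocab) + 1))
        )

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for `words` under the HMM."""
    if not words:
        return []
    # best[i][t]: best log-probability of any tag path ending in tag t at position i
    best = [{t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_trans(p, t))
            best[i][t] = best[i - 1][prev] + log_trans(prev, t) + log_emit(t, words[i])
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy annotated corpus, invented for illustration only.
corpus = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("maktabga", "NOUN"), ("bordi", "VERB")],
    [("bola", "NOUN"), ("yangi", "ADJ"), ("kitob", "NOUN"), ("oldi", "VERB")],
]
tags, log_trans, log_emit = train_hmm(corpus)
print(viterbi(["u", "kitob", "oldi"], tags, log_trans, log_emit))
```

Working in log space avoids numerical underflow on long sentences. The decoding cost is proportional to the sentence length times the square of the tagset size, which stays small for a tagset of 13 categories.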

Topics

Educational Technology and Assessment

Identifiers

DOI: 10.1109/ubmk67458.2025.11206943

Citations and references

Citations: 1; References used: 13
