Article

Statistical POS Tagging Algorithms (HMM, CRF)

Odinakhon R. Jamoldinova, Tashkent State University of Uzbek Language and Literature Named Alisher Navo’i, Dept. of Social Sciences and Humanities, Tashkent, Uzbekistan
Elov Botir Boltayevich, Tashkent State University of Uzbek Language and Literature Named Alisher Navo’i, Dept. of Computational Linguistics and Digital Technologies, Tashkent, Uzbekistan
M. Yu. Sharipova, Bukhara State University, Dept. of Uzbek Philology and Journalism, Bukhara, Uzbekistan
Shakhzoda Miralimova, Andijan State Institute of Foreign Languages, Dept. Uzbek Philology, Andijan, Uzbekistan
Zilola Yuldashevna Xusainova, Tashkent State University of Uzbek Language and Literature Named Alisher Navo’i, Dept. of Computational Linguistics and Digital Technologies, Tashkent, Uzbekistan
Nizomaddin Uktambay O‘g‘li Khudayberganov, Tashkent State University of Uzbek Language and Literature Named Alisher Navo’i, Dept. of Computational Linguistics and Digital Technologies, Tashkent, Uzbekistan
Kholmurod Karimov, Karshi State University, Dept. of Psychology, Karshi, Uzbekistan
2025
ABI

Abstract

This paper presents a comprehensive study of two statistical approaches to part-of-speech (POS) tagging in Uzbek, the Hidden Markov Model (HMM) and the Conditional Random Field (CRF), from both mathematical and empirical perspectives. We first formalize each model: transition and emission probabilities for the HMM, and feature functions with weight parameters for the CRF. Both models were trained on a 205k-token (77,821 sentences) CoNLL-U corpus annotated with 15 Uzbek-specific POS tags. The HMM uses Laplace-smoothed probability estimates with Viterbi decoding, and the CRF is optimized with L-BFGS and decoded with Viterbi inference. On the held-out test set, the HMM achieved 82% tagging accuracy, while the CRF reached 88%, outperforming the HMM by six percentage points thanks to its richer contextual and linguistic features. The results confirm that statistical models remain robust for agglutinative, low-resource languages like Uzbek, but are sensitive to feature engineering. We conclude with an error analysis, guidelines for model selection, and perspectives on migrating to neural architectures such as BiLSTM-CRF and BERT-based taggers.
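To make the HMM side of the abstract concrete, the following is a minimal sketch of a bigram HMM tagger with add-alpha (Laplace) smoothing and Viterbi decoding. It is not the authors' implementation: the function names `train_hmm` and `viterbi`, the `<s>` start symbol, the three-tag toy tag set, and the tiny training sentences are all assumptions chosen for illustration, and the paper's 15-tag scheme and corpus are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences, alpha=1.0):
    """Estimate Laplace-smoothed log-probabilities from lists of (word, tag) pairs."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserved for unseen words

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * n_tags))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + alpha) / (total + alpha * n_words))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the highest-scoring tag sequence for `words` under the HMM."""
    if not words:
        return []
    # best[i][t]: best log-score of any tag path that ends in tag t at position i
    best = [{t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}]
    back = []  # back[i][t]: best predecessor of tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: best[-1][q] + log_trans(q, t))
            scores[t] = best[-1][p] + log_trans(p, t) + log_emit(t, word)
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training data with an illustrative three-tag set (not the paper's tag set).
toy_corpus = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("kitob", "NOUN"), ("oldi", "VERB")],
    [("men", "PRON"), ("xat", "NOUN"), ("yozdim", "VERB")],
]
tags, log_trans, log_emit = train_hmm(toy_corpus)
print(viterbi(["u", "xat", "oldi"], tags, log_trans, log_emit))
# prints ['PRON', 'NOUN', 'VERB']
```

Smoothing matters here because an unseen word such as "daftar" would otherwise receive zero emission probability under every tag; with add-alpha smoothing the transition probabilities decide its tag. A CRF replaces these count-based probabilities with weighted feature functions learned by an optimizer such as L-BFGS, while decoding can reuse the same Viterbi recursion.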
