Statistical POS Tagging Algorithms (HMM, CRF)
Abstract
This paper presents a comprehensive study of two statistical approaches to part-of-speech (POS) tagging in Uzbek, the Hidden Markov Model (HMM) and the Conditional Random Field (CRF), from both mathematical and empirical perspectives. We first formalize each model: transition and emission probabilities for the HMM, and feature functions with weight parameters for the CRF. Both models were trained on a 205k-token (77,821 sentences) CoNLL-U corpus annotated with 15 Uzbek-specific POS tags; the HMM uses Laplace-smoothed probability estimates with Viterbi decoding, and the CRF is optimized with L-BFGS and decoded with Viterbi inference. On the held-out test set, the HMM achieved 82% tagging accuracy and the CRF 88%, a gain of six percentage points that we attribute to the CRF's richer contextual and linguistic features. The results confirm that statistical models remain robust for agglutinative, low-resource languages like Uzbek, yet are sensitive to feature engineering. We conclude with an error analysis, guidelines for model selection, and prospects for migrating to neural architectures such as BiLSTM-CRF and BERT-based taggers.
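To make the HMM side of the comparison concrete, the following is a minimal sketch of a bigram HMM tagger with Laplace (add-alpha) smoothing and Viterbi decoding. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names `train_hmm` and `viterbi`, the `<s>` start symbol, the smoothing constant `alpha`, and the three-sentence toy corpus with three tags are all hypothetical and stand in for the real CoNLL-U corpus and its 15-tag set.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, alpha=1.0):
    """Count transitions and emissions; return add-alpha smoothed log-probabilities."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)
    vocab_size = len(vocab) + 1  # one extra slot reserved for unseen words

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + alpha) / (total + alpha * vocab_size))

    return tags, log_trans, log_emit

def viterbi(words, tags, log_trans, log_emit):
    """Return the highest-scoring tag sequence for `words` under the bigram HMM."""
    if not words:
        return []
    # best[i][t]: best log-score of any tag path that ends in tag t at position i
    best = [{t: log_trans("<s>", t) + log_emit(t, words[0]) for t in tags}]
    back = []  # back[i][t]: best predecessor of tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(p, t))
            scores[t] = best[-1][prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag to recover the path.
    tag = max(tags, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Toy corpus (illustrative only; not drawn from the paper's data).
toy_corpus = [
    [("men", "PRON"), ("kitob", "NOUN"), ("o'qidim", "VERB")],
    [("u", "PRON"), ("kitob", "NOUN"), ("oldi", "VERB")],
    [("men", "PRON"), ("keldim", "VERB")],
]
tags, log_trans, log_emit = train_hmm(toy_corpus)
print(viterbi(["u", "kitob", "keldim"], tags, log_trans, log_emit))
```

Working in log space avoids numerical underflow on long sentences, and reserving one extra vocabulary slot lets the smoothed emission model assign non-zero probability to words never seen in training. A CRF replaces these count-based probabilities with weighted feature functions but can reuse the same Viterbi recursion at inference time.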