Developing a Word Classification Algorithm for Turkic Languages with PoS Tagging

Bozorboy Islombekov, Cyber University, Nurafshon, Uzbekistan
Qodirova Guli, Urgench State Pedagogical Institute, Urgench, Uzbekistan
Aybibi R. Iskandarova, National University of Uzbekistan Named After Mirzo Ulugbek, Tashkent, Uzbekistan
Khidirova Gulnora, Non-State Educational Institution "Ma’mun University", Khiva, Uzbekistan
Xudoyor Shonazarov, Urgench Branch of Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Urgench, Uzbekistan
2025
ABI

Abstract

This paper presents a practical approach to developing a word classification algorithm with part-of-speech (PoS) tagging for Turkic languages. These languages are agglutinative and produce many surface forms, which increases data sparsity and out-of-vocabulary rates. Our method combines a compact rule- and lexicon-driven candidate generator with a character-level and word-level neural tagger that selects the best analysis in context. The design supports script-aware normalization for Latin and Cyrillic text, configurable tokenization for clitics, and Universal Dependencies-compliant labels. To address limited supervision, we use weakly supervised labeling with dictionary seeds, heuristic filtering, and targeted human review, together with simple augmentation strategies. We evaluate the approach on Uzbek and Turkish test sets and report strong results in accuracy, precision, recall, and F1. The model consistently surpasses a trigram hidden Markov model baseline and a word-only neural baseline, and its character-level features keep its behavior stable on unseen word forms. The contributions are a reusable pipeline for Turkic morphology, evidence that hybrid generation and selection improves robustness in low-resource conditions, and guidelines for preparing data and reporting results within common conference constraints.
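
The four reported metrics can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the abstract does not say how precision, recall, and F1 are averaged, so macro-averaging over tags is an assumption here, and the function name `tagging_scores` and the sample tag sequences are hypothetical.

```python
from collections import Counter


def tagging_scores(gold, pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over PoS tags.

    Assumption: macro-averaging (each tag counts equally); the source
    abstract does not state which averaging scheme was used.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for tag g
        else:
            fp[p] += 1   # tag p was predicted but wrong
            fn[g] += 1   # tag g was missed

    tags = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for t in tags:
        p_den = tp[t] + fp[t]
        r_den = tp[t] + fn[t]
        prec = tp[t] / p_den if p_den else 0.0
        rec = tp[t] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(tags)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical gold and predicted Universal Dependencies tags for four tokens.
gold_tags = ["NOUN", "VERB", "NOUN", "ADJ"]
pred_tags = ["NOUN", "VERB", "ADJ", "ADJ"]
scores = tagging_scores(gold_tags, pred_tags)
```

Macro-averaging weights rare tags as heavily as frequent ones, which can matter for sparse tag sets; a micro-averaged variant would instead pool the counts across tags before dividing.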
