Developing a Word Classification Algorithm for Turkic Languages with PoS Tagging
Abstract
This paper presents a practical approach to developing a word classification algorithm with part-of-speech tagging for Turkic languages. These languages are agglutinative and produce many surface forms per lemma, which increases data sparsity and out-of-vocabulary rates. Our method combines a compact rule- and lexicon-driven candidate generator with a neural tagger that uses character-level and word-level features to select the best analysis in context. The design supports script-aware normalization for Latin and Cyrillic text, configurable tokenization for clitics, and labels that comply with Universal Dependencies. To address limited supervision, we use weakly supervised labeling with dictionary seeds, heuristic filtering, and targeted human review, together with simple augmentation strategies. We evaluate the approach on Uzbek and Turkish test sets and report strong results in accuracy, precision, recall, and F1. The model consistently outperforms a trigram hidden Markov model baseline and a word-only neural baseline, and it remains stable on unseen word forms, which we attribute to its character-level features. The contributions are a reusable pipeline for Turkic morphology, evidence that hybrid generation and selection improves robustness in low-resource conditions, and guidelines for preparing data and reporting results within common conference constraints.
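To illustrate the generate-then-select design summarized above, the following is a minimal sketch, not the system evaluated in this paper. The lexicon entries, suffix rules, function names (`generate_candidates`, `select_tags`, `toy_scorer`), and the verb-final scoring heuristic are all hypothetical stand-ins chosen for illustration; in particular, `toy_scorer` only takes the place of the neural tagger and does not model it.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy lexicon (illustrative only): stem -> possible UPOS tags.
LEXICON: Dict[str, Set[str]] = {
    "kitob": {"NOUN"},         # Uzbek "book"
    "yoz": {"VERB", "NOUN"},   # Uzbek "write" / "summer" (ambiguous)
    "yaxshi": {"ADJ"},         # Uzbek "good"
}

# Hypothetical suffix rules: suffix -> UPOS tags compatible with that suffix.
SUFFIX_RULES: Dict[str, Set[str]] = {
    "lar": {"NOUN"},  # plural
    "ni": {"NOUN"},   # accusative
    "di": {"VERB"},   # past tense
}

# A scorer receives (tokens, position, candidate tag) and returns a score.
Scorer = Callable[[List[str], int, str], float]


def generate_candidates(word: str) -> Set[str]:
    """Return candidate tags from the lexicon, then from suffix rules."""
    w = word.lower()
    if w in LEXICON:
        return set(LEXICON[w])
    candidates: Set[str] = set()
    for suffix, tags in SUFFIX_RULES.items():
        if w.endswith(suffix) and len(w) > len(suffix):
            stem_tags = LEXICON.get(w[: -len(suffix)])
            if stem_tags is None:
                candidates |= tags              # unknown stem: trust the suffix
            else:
                candidates |= stem_tags & tags  # known stem: keep agreeing tags
    return candidates or {"X"}                  # fall back to the UD "other" tag


def select_tags(tokens: List[str], scorer: Scorer) -> List[str]:
    """Pick the highest-scoring candidate tag for every token in context."""
    result = []
    for i, token in enumerate(tokens):
        candidates = sorted(generate_candidates(token))  # sorted for determinism
        result.append(max(candidates, key=lambda tag: scorer(tokens, i, tag)))
    return result


def toy_scorer(tokens: List[str], i: int, tag: str) -> float:
    """Stand-in for a contextual model: prefer VERB only in final position."""
    if tag == "VERB":
        return 1.0 if i == len(tokens) - 1 else 0.0
    return 0.5


if __name__ == "__main__":
    print(select_tags(["kitobni", "yozdi"], toy_scorer))
    print(select_tags(["yoz", "keldi"], toy_scorer))
```

In this sketch the generator narrows each token to a small set of admissible tags, and the scorer only has to rank those candidates, so the ambiguous form "yoz" is resolved differently depending on its position. A trained contextual model could be passed in place of `toy_scorer` without changing `select_tags`.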