Development of a Rule-based Model and Algorithm for Predicate Identification in Uzbek Language Texts
Abstract
This paper presents a rule-based approach to automatic predicate identification in Uzbek texts. The system encodes five core syntactic patterns that capture verbal-noun + modal constructions, personal endings on non-verbal tokens, auxiliary-verb combinations, and rich tense-mood affixation. These linguistically motivated rules consult two XML resources: mustaqil_fel.xml, which contains 8,500 verb stems, and istisno.xml, which holds rule-specific exception sets. Evaluation on a 300-sentence gold-standard corpus balanced across literary, scientific, and conversational genres yielded an overall F1-score of 0.86. These results show that carefully crafted rules can provide high-confidence predicate detection for Uzbek and supply a transparent baseline for future hybrid or data-driven enhancements in Uzbek NLP. Given the lack of large annotated corpora and the low-resource nature of Uzbek, such rule-based methods remain a viable and interpretable solution that can capture language-specific syntactic regularities with high precision.
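To make the lookup idea concrete, the following minimal Python sketch shows how a single affix rule might consult a stem list and a rule-specific exception set. It is an illustration under stated assumptions, not the system described here: the XML layouts, the rule name tense_affix, the function names, and the three-stem and four-affix inventories are all hypothetical stand-ins, since the actual schemas of mustaqil_fel.xml and istisno.xml are not given in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for mustaqil_fel.xml and istisno.xml.
# The real file schemas are not specified, so these layouts are assumptions.
STEMS_XML = "<stems><stem>kel</stem><stem>yoz</stem><stem>ol</stem></stems>"
EXCEPTIONS_XML = (
    "<exceptions><rule id='tense_affix'><word>olma</word></rule></exceptions>"
)

# Tiny illustrative affix subset; longer affixes come first so they match first.
AFFIXES = ("yapti", "gan", "di", "ma")


def load_stems(xml_text):
    """Collect verb stems from the (assumed) <stem> elements."""
    return {elem.text for elem in ET.fromstring(xml_text).iter("stem")}


def load_exceptions(xml_text):
    """Map each (assumed) rule id to its set of exception words."""
    root = ET.fromstring(xml_text)
    return {
        rule.get("id"): {word.text for word in rule.iter("word")}
        for rule in root.iter("rule")
    }


def is_predicate_candidate(token, stems, exceptions):
    """Apply one hypothetical rule: known verb stem + affix, minus exceptions."""
    word = token.lower()
    if word in exceptions.get("tense_affix", set()):
        return False  # e.g. "olma" may be the noun "apple", not ol + ma
    for affix in AFFIXES:
        if word.endswith(affix) and word[: -len(affix)] in stems:
            return True
    return False


stems = load_stems(STEMS_XML)
exceptions = load_exceptions(EXCEPTIONS_XML)
print([t for t in ["keldi", "kitob", "yozyapti", "olma"]
       if is_predicate_candidate(t, stems, exceptions)])
```

Keeping the exception sets keyed by rule mirrors the description of istisno.xml as holding rule-specific exceptions: a word blocked for one rule can still be accepted by another.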