Development of an Algorithm for Linguistic Analysis of Uzbek Texts
Abstract
This paper presents a linguistic analysis algorithm for the Uzbek language that combines a rule-based morphological module with a BiLSTM-CRF sequence model for syntactic role annotation. The morphological analyzer performs normalization and tokenization, uses a lexicon of lemmas and an inventory of affixes with their allomorphs, and checks the compatibility of affix chains with a finite-state machine. It also accounts for morphophonological alternations and handles exceptions; residual ambiguity is resolved by local rules and lightweight ranking. The syntactic model takes morphological features (Case, TAM, Voice, Neg, key affix indicators, etc.) together with word-level and character-level representations as input and predicts five functional roles (Ega 'subject', Kesim 'predicate', To'ldiruvchi 'object', Aniqlovchi 'attribute', Hol 'adverbial') using the BIOES tagging scheme. Experiments on a corpus of over 20,000 sentences with fixed splits demonstrate fairly good performance. The proposed approach combines interpretability and practicality for resource-constrained settings and can serve as the basis for scalable tools for processing Uzbek texts.
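To make the BIOES role encoding mentioned above concrete, the following is a minimal sketch in Python, not the authors' implementation. The function names `spans_to_bioes` and `bioes_to_spans`, the span representation, and the example sentence with its role assignment are assumptions introduced here for illustration only.

```python
from typing import List, Tuple

# A role span is (start, end_exclusive, role); this layout is an assumption.
Span = Tuple[int, int, str]


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping role spans as one BIOES tag per token."""
    tags = ["O"] * n_tokens
    for start, end, role in spans:
        if end - start == 1:
            tags[start] = f"S-{role}"          # single-token constituent
        else:
            tags[start] = f"B-{role}"          # first token of the span
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{role}"          # interior tokens, if any
            tags[end - 1] = f"E-{role}"        # last token of the span
    return tags


def bioes_to_spans(tags: List[str]) -> List[Span]:
    """Decode a well-formed BIOES sequence back into role spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, role = tag.split("-", 1)
        if prefix == "S":
            spans.append((i, i + 1, role))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, role))
            start = None
    return spans


# Hypothetical example: "Men kitobni kecha o'qib chiqdim"
# ("I read the book through yesterday"); the role spans are illustrative.
tokens = ["Men", "kitobni", "kecha", "o'qib", "chiqdim"]
gold = [(0, 1, "Ega"), (1, 2, "To'ldiruvchi"), (2, 3, "Hol"), (3, 5, "Kesim")]
tags = spans_to_bioes(len(tokens), gold)
```

In this sketch the compound predicate receives the pair `B-Kesim`, `E-Kesim`, while each one-word constituent receives an `S-` tag. With five roles, the tag set a CRF output layer would score contains 4 × 5 + 1 = 21 labels.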