Named Entity Recognition in Uzbek Agriculture: Impact of Script Standardization Across Neural Network Architectures
Abstract
Uzbek agricultural discourse generates large volumes of diverse text (farm reports, agronomic recommendations, news, tenders, regulations, and field trial records), but much of the practical information remains locked in unstructured narrative. This study examines named entity recognition (NER) for Uzbek and how script standardization affects extraction quality. The authors constructed a corpus of 8,700 sentences drawn from farm and agribusiness reports and industry news, and introduced a domain-specific taxonomy of 13 classes (PER, ORG, LOC, DATE, LAW/DOC, CROP, PEST, DISEASE, CHEMICAL, EQUIP, WATER_OBJECT, SOIL, MEASURE), annotated with the BIOES scheme. The study proposes a lightweight script standardization pipeline that unifies scripts, normalizes apostrophe letters and digraphs (g’/o’, sh/ch/ng), preserves hyphens in domain terms, and treats measurement patterns (e.g., kg/ha, l/ha) as atomic tokens. Three model families were compared under a unified training protocol with early stopping and identical data splits: BiLSTM-CRF with character features, mBERT+CRF, and a parameter-efficient T5 fine-tuned with LoRA. The results show that script standardization is a first-order factor for Uzbek NER, and that both encoder-based and seq2seq architectures are viable when fed normalized text. Furthermore, domain-aware tokenization and recommendations significantly reduce boundary errors.
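To make the standardization steps concrete, the following is a minimal Python sketch of two of them: apostrophe normalization for o’/g’ and tokenization that keeps measurement patterns and hyphenated terms whole. It is an illustration under stated assumptions, not the pipeline used in the study. The function names, the set of apostrophe variants, the choice of U+02BB as the canonical character, and the unit list are all hypothetical. Script unification (Cyrillic to Latin) is omitted.

```python
import re

# Assumed set of apostrophe-like characters found after o/g in Uzbek Latin text.
APOSTROPHE_VARIANTS = "'`\u2018\u2019\u02bc\u02bb"
CANONICAL = "\u02bb"  # assumed canonical form (modifier letter turned comma)

# Replace an apostrophe variant only when it directly follows o/O/g/G.
_APOS_RE = re.compile(rf"(?<=[oOgG])[{re.escape(APOSTROPHE_VARIANTS)}]")

# Token alternatives, tried in order:
#   1. numbers with an optional decimal part (e.g. 120 or 2,5)
#   2. measurement patterns kept atomic (hypothetical unit list)
#   3. words, with internal hyphens preserved (e.g. azot-fosfor)
#   4. any other single non-space character (punctuation)
_TOKEN_RE = re.compile(
    r"\d+(?:[.,]\d+)?"
    r"|(?:kg|ml|l|t|g)/ha\b"
    r"|[^\W\d_]+(?:-[^\W\d_]+)*"
    r"|\S"
)


def normalize_apostrophes(text: str) -> str:
    """Map apostrophe variants after o/g to one canonical character."""
    return _APOS_RE.sub(CANONICAL, text)


def tokenize(text: str) -> list[str]:
    """Normalize apostrophes, then split into domain-aware tokens."""
    return _TOKEN_RE.findall(normalize_apostrophes(text))


if __name__ == "__main__":
    print(tokenize("G\u2019o\u2019zaga 120 kg/ha azot-fosfor berildi."))
```

Because U+02BB is a Unicode letter, the word pattern keeps forms such as gʻoʻza as a single token after normalization. An apostrophe that marks a glottal stop after other letters is left untouched in this sketch and would be split off as a separate token.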