Article

Named Entity Recognition in Uzbek Agriculture: Impact of Script Standardization Across Neural Networks Architectures

Nafisa Ganijanova, Urgench State University, Urgench, Uzbekistan
Alisher Abidjanov, Cyber University, Nurafshon, Uzbekistan
Aybibi R. Iskandarova, National University of Uzbekistan named after Mirzo Ulugbek, Tashkent, Uzbekistan
Normurod Rasulov, Samarkand State Institute of Foreign Languages, Samarkand, Uzbekistan
Dilfuza Ruzmatova, Urgench State Pedagogical Institute, Urgench, Uzbekistan
Uktamjon Bituraev, Tashkent State University of Oriental Studies, Tashkent, Uzbekistan

2025

ABI

Abstract

Uzbek agricultural discourse generates large volumes of diverse text (farm reports, agronomic recommendations, news, tenders, regulations, and field trial records), but much of the practical information remains locked in narrative form. This study examines named entity recognition (NER) for Uzbek and how script standardization affects extraction quality. The authors constructed a corpus of 8,700 sentences drawn from farm and agribusiness reports and industry news, and introduced a domain-specific taxonomy of 13 classes (PER, ORG, LOC, DATE, LAW/DOC, CROP, PEST, DISEASE, CHEMICAL, EQUIP, WATER_OBJECT, SOIL, MEASURE), annotated with the BIOES scheme. The study proposes a lightweight alphabetic standardization pipeline that unifies scripts, normalizes apostrophes and digraphs (g’/o’, sh/ch/ng), preserves hyphens across domains, and treats measurement patterns (e.g., kg/ha, l/ha) as atomic tokens. Three model families were compared under a unified training protocol with early stopping and identical data splits: BiLSTM-CRF with character features, mBERT+CRF, and a parameter-efficient T5 fine-tuned with LoRA. The results show that alphabetic standardization is a first-order factor for Uzbek NER, and that both encoder-based and seq2seq architectures are viable when fed normalized text. Furthermore, domain-aware tokenization and recommendations significantly reduce boundary errors.
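The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the set of apostrophe variants, the choice of U+02BB as the canonical mark, the unit list, the helper names (`normalize_apostrophes`, `tokenize`, `to_bioes`), and the sample sentence are all assumptions made for illustration, and Cyrillic-to-Latin script unification is left out entirely.

```python
import re

# Apostrophe look-alikes that may follow g/o in Uzbek Latin text (assumed set).
_APOS_AFTER_GO = re.compile("(?<=[GgOo])['\u2018\u2019\u02BC\u0060\u00B4]")
TURNED_COMMA = "\u02BB"  # assumed canonical mark for the letters o'/g'

# A "letter" is any Unicode letter; U+02BB is listed explicitly for clarity.
_LETTER = "(?:[^\\W\\d_]|\u02BB)"

_TOKEN = re.compile("|".join([
    r"(?:kg|ml|l|t|g)/ha\b",              # units kept atomic (hypothetical list)
    r"\d+(?:[.,]\d+)?",                   # integers and decimals such as 2,5
    _LETTER + "+(?:-" + _LETTER + "+)*",  # words; hyphenated compounds stay whole
    r"\S",                                # any other non-space character
]))


def normalize_apostrophes(text):
    """Map apostrophe variants after g/o to a single code point."""
    return _APOS_AFTER_GO.sub(TURNED_COMMA, text)


def tokenize(text):
    """Split text into tokens, keeping units and hyphenated words intact."""
    return _TOKEN.findall(text)


def to_bioes(n_tokens, spans):
    """Encode (start, end_exclusive, label) token spans as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "S-" + label       # single-token entity
        else:
            tags[start] = "B-" + label       # first token
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label       # inside tokens
            tags[end - 1] = "E-" + label     # last token
    return tags


if __name__ == "__main__":
    # Made-up sample sentence fragment; the span labels are illustrative only.
    raw = "G'o'zaga 120 kg/ha karbamid, 2,5 l/ha gerbitsid; don-dukkakli"
    tokens = tokenize(normalize_apostrophes(raw))
    spans = [(0, 1, "CROP"), (1, 3, "MEASURE"), (3, 4, "CHEMICAL")]
    for token, tag in zip(tokens, to_bioes(len(tokens), spans)):
        print(token, tag)
```

Matching the unit pattern before the generic word pattern is what keeps `kg/ha` from being split at the slash; whether a quantity and its unit form one MEASURE span is an assumption here, since the abstract does not specify the annotation guideline.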

Topics

Topic Modeling; Text and Document Classification Technologies; Biomedical Text Mining and Ontologies

Identifiers

DOI: 10.1109/apeie66761.2025.11289348

Citations and references

Citations: 0
References used: 20
