A Hybrid Named-Entity Recognition Algorithm for Ecological Documentation in Uzbekistan
Abstract
The rapid growth of environmental documentation in Uzbekistan, from parliamentary minutes to environmental impact assessment reports, calls for automated tools that can quickly identify key actors, territories, and institutions. However, existing named-entity recognition systems struggle with the dual script (Cyrillic ↔ Latin) and the agglutinative morphology of the Uzbek language. This paper proposes a hybrid NER algorithm that combines two deterministic preprocessing modules with a spaCy statistical model. The orthographic standardization module converts Cyrillic input to a unified Latin script, normalizing apostrophes and language-specific letters. The morphological corrector performs lemmatization and corrects typical typos, removing suffix "noise" that masks entity boundaries. A corpus of 5,000 environmental sentences, annotated under the BIOES scheme for the PER, ORG, and LOC classes, was compiled for training. Comparative testing shows that the hybrid approach increases the average F1-measure by 4–7% relative to the baseline spaCy model. The developed algorithm is a step towards a reliable infrastructure for analyzing environmental data in the Uzbek language and supports decision-making on the country's sustainable development.
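To make the orthographic standardization step concrete, the following is a minimal Python sketch of what such a module might look like. It is not the implementation evaluated in this paper. The mapping table is a simplified version of the standard Uzbek Cyrillic-to-Latin correspondence, the function names (to_latin, normalize_apostrophes, standardize) are hypothetical, and the apostrophe rule (U+02BB after o or g, U+02BC elsewhere) is an assumed heuristic that a production system would need to refine.

```python
import re

# Simplified Uzbek Cyrillic-to-Latin table (an assumption for illustration;
# context-dependent rules such as word-initial "ц" are not handled).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ё": "yo",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "x", "ц": "ts", "ч": "ch",
    "ш": "sh", "ъ": "\u02bc", "ь": "", "э": "e", "ю": "yu", "я": "ya",
    "ў": "o\u02bb", "қ": "q", "ғ": "g\u02bb", "ҳ": "h",
}
CYR_VOWELS = set("аеёиоуэюяў")

# Apostrophe-like characters commonly typed in place of U+02BB / U+02BC.
APOS_CLASS = "['`\u2018\u2019]"


def to_latin(text: str) -> str:
    """Transliterate Cyrillic characters; leave everything else unchanged."""
    out = []
    prev = ""  # previous source character, lowercased
    for ch in text:
        low = ch.lower()
        if low == "е":
            # "е" becomes "ye" at the start of a word or after a vowel.
            lat = "ye" if (not prev.isalpha() or prev in CYR_VOWELS) else "e"
        elif low in CYR_TO_LAT:
            lat = CYR_TO_LAT[low]
        else:
            lat = ch
        if ch.isupper() and lat:
            # Capitalize only the first letter of a digraph (e.g. "Ш" -> "Sh").
            lat = lat[0].upper() + lat[1:]
        out.append(lat)
        prev = low
    return "".join(out)


def normalize_apostrophes(text: str) -> str:
    """Map typed apostrophe variants to a single convention (heuristic)."""
    # After o/g the mark is assumed to belong to the letters oʻ / gʻ (U+02BB).
    text = re.sub("(?<=[oOgG])" + APOS_CLASS, "\u02bb", text)
    # Any remaining variant is treated as the tutuq belgisi (U+02BC).
    return re.sub(APOS_CLASS, "\u02bc", text)


def standardize(text: str) -> str:
    """Cyrillic-to-Latin conversion followed by apostrophe normalization."""
    return normalize_apostrophes(to_latin(text))


if __name__ == "__main__":
    for sample in ["Ўзбекистон", "Қорақалпоғистон", "O'zbekiston", "ma'lumot"]:
        print(sample, "->", standardize(sample))
```

Because the function leaves Latin input untouched apart from apostrophes, mixed-script text can be passed through it once, so that the downstream statistical model sees a single orthography.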