Extracting Ecological Facts from Karakalpak Texts via Named Entity Recognition
Abstract
Environmental monitoring, conservation journalism, and regulatory enforcement in Karakalpakstan depend on timely access to structured data. At present, much of this information is locked in unstructured prose: field reports, local news, NGO newsletters, impact assessments, and community observations. The authors address this gap by developing a named entity recognition system for environmental entities in Karakalpak texts, built around a compact, practical schema focused on three high-value categories: species, locations, and conservation organizations. The task is complicated by digraphia (Cyrillic and Latin scripts), code-switching involving Uzbek and Russian, multilingual species names (Latin binomials alongside local names), rich hydronymy and microtoponymy, and organizational aliases and abbreviations that change over time. The resulting system enables dynamic tracking of species mentions, geocoded aggregation of location data, and reliable identification of responsible agencies, which supports faster incident triage and closer integration with geospatial layers and field measurements. By focusing the schema on a small set of well-defined entity types and documenting replicable annotation guidelines and evaluation design, this work provides a reusable framework for ecological text analysis in resource-constrained settings and offers a template for extending named entity recognition to neighboring languages and ecological subdomains.