UzNER-100K: A Human-Reviewed Uzbek NER Benchmark with Gazetteer-Augmented Transformer Modeling and Robustness Analysis
Abstract
UzNER-100K is a human-reviewed benchmark dataset for Uzbek named entity recognition (NER). The released collection contains 114,269 sentences, 1,184,426 tokens, and 200,083 entity mentions. The benchmark includes a 100,000-sentence training split annotated with 18 fine-grained entity types under a BIOES tagging scheme, together with development, standard test, gold candidate, and hard-evaluation subsets. The final training configuration follows a mixed-origin design: 70,000 human-reviewed real sentences and 30,000 reviewed synthetic sentences generated through an LLM-assisted pipeline. The benchmark is designed for both standard model comparison and robustness-oriented evaluation. All splits are fully disjoint, and sentence-level auditing confirms zero train–test overlap. The release includes benchmark files, preprocessing utilities, evaluation tools, annotation-support materials, and reproducibility documentation. The resource is intended to support research on Uzbek NER, low-resource NLP, multilingual and monolingual transformer benchmarking, hybrid NER modeling, and robustness analysis.
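For readers unfamiliar with the two mechanisms named above, the following is a minimal illustrative sketch of BIOES tag assignment and of a sentence-level overlap audit between splits. It is not the released tooling: the function names, the whitespace and lowercase normalization, and the `ORG` and `LOC` labels are assumptions made for illustration, and the actual 18-type label set and audit procedure are defined by the benchmark release.

```python
from typing import List, Set, Tuple


def spans_to_bioes(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) into BIOES tags.

    Single-token entities receive S-, multi-token entities receive
    B- (first), I- (inside), and E- (last); all other tokens are O.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"E-{etype}"
    return tags


def normalize(sentence: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(sentence.lower().split())


def split_overlap(train: List[str], test: List[str]) -> Set[str]:
    """Return normalized sentences that occur in both splits."""
    return {normalize(s) for s in train} & {normalize(s) for s in test}


# Hypothetical example sentence and labels, for illustration only.
tokens = ["Toshkent", "davlat", "universiteti", "Toshkentda", "joylashgan"]
tags = spans_to_bioes(len(tokens), [(0, 3, "ORG"), (3, 4, "LOC")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under these assumptions, a split pair passes the audit when `split_overlap(train, test)` returns an empty set. A stricter audit could also strip punctuation or compare token sequences, which this sketch does not attempt.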