Uzbek Legal NER Dataset Package: A Gold-Ready Multi-Format Resource with Core Gold and Extended Augmented Layers
Abstract
This data descriptor presents an Uzbek legal-domain named entity recognition (NER) dataset package designed for low-resource information extraction research. The release covers 12 entity categories (PER, ORG, LOC, DATE, MONEY, POSITION, DOCNO, LAW, COURT, BANK, TIN, and CADASTRE) and is distributed in XLSX, CSV, JSON, and JSONL formats to support reuse across different NLP workflows. The package is organized into two complementary layers: a core subset of manually reviewable open-source records with entity-oriented metadata, and an extended augmented subset that adds synthetic examples to support training on low-resource labels. In addition to the data files, the release provides documentation, split recommendations, data dictionary files, citation guidance, and review-oriented metadata such as provenance, verification status, and quality flags. Character-level start and end offsets are included when they can be recovered automatically, and the package is structured to facilitate later conversion into strict gold-standard span annotations and sequence-labeling formats. The dataset is intended for Uzbek legal NER benchmarking, weakly supervised resource construction, and augmentation-aware experimental design. We explicitly distinguish manually reviewable open-source records from synthetic augmentation rows, and we recommend using only the verified open-source subset for future gold-standard evaluation.
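As an illustration of the conversion from character-level offsets to a sequence-labeling format, the following minimal sketch maps entity spans onto BIO tags. It is not the package's own tooling: the function name `spans_to_bio`, the `(start, end, label)` tuple layout with an exclusive end offset, and the whitespace tokenization are all assumptions made for this example, and the sample sentence is invented rather than taken from the dataset.

```python
import re


def spans_to_bio(text, spans):
    """Map character-offset spans onto whitespace tokens as BIO tags.

    `spans` is a list of (start, end, label) tuples with an exclusive
    end offset. This layout is an assumption for illustration; the
    released files may name and encode these fields differently.
    """
    # Tokenize on whitespace, keeping each token's character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in sorted(spans):
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags


# Invented example sentence; offsets are computed rather than hard-coded.
text = "Toshkent shahar sudi 2021-yil 5-mayda qaror chiqardi"
court = "Toshkent shahar sudi"
date = "2021-yil 5-mayda"
spans = [
    (text.index(court), text.index(court) + len(court), "COURT"),
    (text.index(date), text.index(date) + len(date), "DATE"),
]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Whitespace tokenization keeps punctuation attached to the neighboring word, so a span that ends before a trailing comma still tags the whole token. A stricter gold-standard conversion would likely need a tokenizer aligned with the annotation guidelines.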