Other

Uzbek Legal NER Dataset Package: A Gold-Ready Multi-Format Resource with Core Gold and Extended Augmented Layers

Saidov Bobur, Novosibirsk State University

Zenodo (CERN European Organization for Nuclear Research), repository, 2026, uz


Abstract

This data descriptor presents an Uzbek legal-domain named entity recognition (NER) dataset package designed for low-resource information extraction research. The release covers 12 entity categories (PER, ORG, LOC, DATE, MONEY, POSITION, DOCNO, LAW, COURT, BANK, TIN, and CADASTRE) and is distributed in XLSX, CSV, JSON, and JSONL formats to support reuse across different NLP workflows. The package is organized into two complementary layers: a core subset of manually reviewable open-source records with entity-oriented metadata, and an extended augmented subset of synthetic examples intended to support training on low-resource labels. In addition to the data files, the release provides documentation, split recommendations, data dictionary files, citation guidance, and review-oriented metadata such as provenance, verification status, and quality flags. Character-level start and end offsets are included when they can be recovered automatically, and the package is structured to facilitate later conversion into strict gold-standard span annotations and sequence-labeling formats. The dataset is intended for Uzbek legal NER benchmarking, weakly supervised resource construction, and augmentation-aware experimental design. We explicitly distinguish manually reviewable open-source records from synthetic augmentation rows, and we recommend using only the verified open-source subset for future gold-standard evaluation.
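The conversion from character-level offsets to a sequence-labeling format mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the package's own tooling: the field names `text`, `entities`, `start`, `end`, and `label` are assumptions, the sample sentence is invented for the example, and whitespace tokenization stands in for whatever tokenizer a real pipeline would use. Entities whose offsets were not recoverable (represented here as `null`) are skipped.

```python
import json


def spans_to_bio(text, entities):
    """Map character-offset entity spans onto whitespace tokens as BIO tags.

    Field names ("start", "end", "label") are hypothetical, not taken
    from the released data dictionary.
    """
    tokens, tags = [], []
    cursor = 0
    for token in text.split():
        # Locate the token in the original string to get its offsets.
        tok_start = text.index(token, cursor)
        tok_end = tok_start + len(token)
        cursor = tok_end

        tag = "O"
        for ent in entities:
            # Skip entities without recoverable offsets.
            if ent["start"] is None or ent["end"] is None:
                continue
            # A token belongs to an entity if the two ranges overlap.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                prefix = "B-" if tok_start <= ent["start"] else "I-"
                tag = prefix + ent["label"]
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


# One invented JSONL line; the second entity has no recoverable offsets.
sample_line = (
    '{"text": "Toshkent shahar sudi 2024 yil qaror chiqardi", '
    '"entities": ['
    '{"start": 0, "end": 20, "label": "COURT"}, '
    '{"start": null, "end": null, "label": "DATE"}]}'
)

record = json.loads(sample_line)
tokens, tags = spans_to_bio(record["text"], record["entities"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Rows without offsets produce only `O` tags in this sketch, so a real conversion would probably need to filter or re-align them first. That fits the recommendation to use only the verified open-source subset for gold-standard evaluation.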


Identifiers

DOI: 10.5281/zenodo.19682709

Citations and references

Citations: 1; References used: 0
