UzLegalNER v3_fixed: Uzbek Legal Contracts Named Entity Recognition Dataset (PER/ORG/LOC/POSITION/DATE/MONEY/DOCNO)
Abstract
UzLegalNER v3_fixed is a named entity recognition (NER) dataset for Uzbek legal contracts and related official documents. It uses a seven-label schema: PER, ORG, LOC, POSITION, DATE, MONEY, and DOCNO. We release (i) a master spreadsheet (XLSX) with sentence-level metadata and character-level entity spans, (ii) a JSONL version with span annotations, and (iii) CoNLL BIO splits (train/dev/test) for standard NER training and benchmarking. Key fields are sent_id (unique per sentence), doc_id (a document/group identifier for document-level splitting), doc_type, script (latin), split, text, and entities (start/end/label/text). For CoNLL compatibility, overlapping and nested spans are removed and only the longest span is retained. The dataset is intended for training and evaluating Transformer-based NER models and gazetteer-enhanced methods, with a particular focus on robustness to unseen entity surface forms in legal text.
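The relationship between the span annotations and the BIO splits can be illustrated with a minimal sketch. It is not the release's actual conversion script. It assumes that `end` offsets are exclusive, that tokens are split on whitespace, and that ties between equally long overlapping spans go to the earlier one; none of these details are stated in the description. The function names `drop_overlaps` and `to_bio` and the sample sentence are hypothetical.

```python
import re


def drop_overlaps(entities):
    """Resolve overlapping/nested spans by keeping the longest one.

    Assumption: `end` is an exclusive character offset.
    """
    # Consider the longest spans first; ties go to the earlier start.
    ranked = sorted(entities, key=lambda e: (-(e["end"] - e["start"]), e["start"]))
    kept = []
    for ent in ranked:
        # Keep a span only if it does not overlap any span already kept.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


def to_bio(text, entities):
    """Convert character-level spans to (token, BIO tag) pairs.

    Assumption: whitespace tokenization; the released CoNLL files may
    use a different tokenizer.
    """
    spans = drop_overlaps(entities)
    rows = []
    prev = None  # entity that covered the previous token, if any
    for m in re.finditer(r"\S+", text):
        # First span that shares at least one character with this token.
        hit = next(
            (e for e in spans if m.start() < e["end"] and m.end() > e["start"]),
            None,
        )
        if hit is None:
            tag = "O"
        else:
            tag = ("I-" if hit is prev else "B-") + hit["label"]
        prev = hit
        rows.append((m.group(), tag))
    return rows


# Hypothetical record, made up for illustration (not taken from the dataset).
text = "Toshkent shahar hokimligi 2023-yil 5-mayda shartnoma tuzdi."


def span(surface, label):
    """Build an entity dict from a surface string found in `text`."""
    start = text.index(surface)
    return {"start": start, "end": start + len(surface), "label": label, "text": surface}


entities = [
    span("Toshkent shahar hokimligi", "ORG"),
    span("Toshkent", "LOC"),  # nested inside the ORG span, so it is dropped
    span("2023-yil 5-mayda", "DATE"),
]

for token, tag in to_bio(text, entities):
    print(f"{token}\t{tag}")
```

In this sketch the nested LOC span is discarded in favor of the longer ORG span, so each token receives exactly one tag, which is what the CoNLL BIO format requires.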