Natural Language Processing for Uzbek and Turkmen: A Systematic Review of Resources, Methods, and Benchmarks
Abstract
This review surveys the current state of Natural Language Processing for Uzbek and Turkmen, two closely related Turkic languages with agglutinative morphology, rich derivation, and recent script transitions. We synthesize publicly available resources (monolingual, parallel, and speech corpora), preprocessing practices (script normalization, language identification, and de-duplication), and modeling strategies ranging from finite-state morphology and classical sequence labeling to modern multilingual transformer baselines with task-specific fine-tuning. The tasks covered include morphological analysis, part-of-speech tagging, named-entity recognition, machine translation, and speech technologies such as automatic speech recognition and text-to-speech. We highlight an asymmetry in resource maturity (Uzbek has growing datasets and models, whereas Turkmen remains markedly under-resourced) and examine where cross-lingual transfer, joint subword vocabularies, and bilingual adapter techniques can bridge this gap. The paper contributes (i) a paired, comparative synthesis for Uzbek and Turkmen; (ii) a practical, reproducible baseline recipe that specifies data curation, subword tokenization with the SentencePiece library, multilingual transformer adapters, and evaluation using standard machine-translation scores and token-level scores based on precision and recall; and (iii) a roadmap for community benchmarks with transparent training, development, and test splits and broad multilingual coverage. We also catalog gaps in Turkmen orthographic resources and propose cost-effective data acquisition strategies, including targeted web scraping, community sourcing, and dictionary-guided alignment. We further discuss ethical considerations around dialectal variation, domain bias, and privacy in web-scale data.
Our findings indicate that modest, data-centric interventions—normalization across scripts, active learning, and high-quality filtering—often yield larger gains than architectural complexity, and that carefully designed shared tasks can catalyze comparable progress across both languages.
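As an illustration of the token-level precision-recall evaluation named in contribution (ii), the following minimal sketch computes micro-averaged precision, recall, and F1 over per-token labels. The function name `token_prf`, the `ignore` parameter, and the BIO-style labels in the example are assumptions made for illustration; they are not taken from the baseline recipe itself, which may score spans or use a different averaging scheme.

```python
from typing import Dict, Sequence


def token_prf(gold: Sequence[str], pred: Sequence[str], ignore: str = "O") -> Dict[str, float]:
    """Micro-averaged token-level precision, recall, and F1.

    A token is a true positive only when the predicted label equals the
    gold label and that label is not the `ignore` (background) label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Correctly labeled tokens, excluding the background label.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != ignore)
    # Tokens the system labeled as non-background.
    n_pred = sum(1 for p in pred if p != ignore)
    # Tokens the reference labels as non-background.
    n_gold = sum(1 for g in gold if g != ignore)

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical five-token sentence with named-entity labels.
gold_labels = ["B-LOC", "O", "B-PER", "I-PER", "O"]
pred_labels = ["B-LOC", "B-PER", "B-PER", "O", "O"]
scores = token_prf(gold_labels, pred_labels)
print(scores)
```

In this example two of the three predicted entity tokens are correct and two of the three gold entity tokens are recovered, so precision, recall, and F1 all equal 2/3. Excluding the background label keeps the large number of "O" tokens from inflating the scores, which matters for small test sets of the kind available for Turkmen.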