Availability of Labeled Data Sources in the Uzbek Domain: A Study of Publicly Available Speech, Text, Image, and Multimodal Resources
Abstract
Modern machine learning and natural language processing systems require large labeled datasets. Languages and countries with small digital footprints (so-called low-resource contexts) face structural barriers to participation in AI research. This paper presents a systematic, contemporary survey of publicly available labeled datasets for the Uzbek domain across four modalities: speech, text, images, and multimodal (paired) data. We map existing resources, identify critical gaps (e.g., script variation, code-switching, institutional lock-up), and analyze cross-cutting challenges that inhibit dataset creation and reuse. Finally, we propose concrete, actionable strategies (crowdsourcing campaigns, synthetic data generation, cross-lingual transfer, and a national open dataset hub) to accelerate the growth of Uzbek AI capacity among academic researchers and AI companies.