Availability of Labeled Data Sources in the Uzbek Domain: A Study of Publicly Available Speech, Text, Image, and Multimodal Resources

F. (Fattoev) Azizbek, Bachelor student of Computer Science and Software Engineering, Faculty of Information and Communication Technologies, Inha University in Tashkent

Neliti, repository, 2026, en

ABI

Abstract

Modern machine learning and natural language processing systems require large, labeled datasets. Languages and countries with small digital footprints—so-called low-resource contexts—face structural barriers to participation in AI research. This paper presents a systematic, contemporary survey of publicly available labeled datasets for the Uzbek domain across four modalities: speech, text, images, and multimodal (paired) data. We map existing resources, identify critical gaps (e.g., script variation, code-switching, institutional lock-up), and analyze cross-cutting challenges that inhibit dataset creation and reuse. Finally, we propose concrete, actionable strategies—crowdsourcing campaigns, synthetic data generation, cross-lingual transfer, and a national open dataset hub—to accelerate Uzbek AI capacity for academic researchers and AI companies.
