Article

Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals

Bobur Rashidovich Saidov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia
Vladimir Borisovich Barakhnin, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia
Shohrux Madirimov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia
Umid Ibragimov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia
Shakhboz Meylikulov, Department of Information Technology and Exact Sciences, Termez University of Economics and Service, 38-B, Ibn-Sino str., Termez 190100, Uzbekistan
Sultonbek Normamatov, Department of Computer Linguistics and Digital Technologies, Faculty of Social and Humanitarian Sciences, Alisher Navo′i Tashkent State University of Uzbek Language and Literature, 103, Yusuf Xos Khojib Str., Tashkent 100013, Uzbekistan
Feruza Bahodirova, Department of Interfaculty Foreign Languages, Urgench State University, 14, Kh. Alimdjan str., Urgench 220100, Uzbekistan
Javlonbek Matnazarov, Department of Language and Literature, Mamun University, 2, Bol-xovuz str., Khiva 220901, Uzbekistan
Zarnigor Fayzullaeva, Department of Software Engineering, Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
Data · journal · 2026 · en

Abstract

This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER/ORG/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus contains 3,000 “manual-style” sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, 39.6% of its sentences in the released version contain at least one emoji, which keeps emotional markers comparable across the two corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632/y2d5pcyrzz.3).
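As a rough illustration of how the unified schema and the emoji-presence figure might be checked on the JSONL release, the sketch below parses JSONL records, reports any schema fields that are missing, and computes the share of records with at least one emoji. The field names come from the abstract. Everything else is an assumption made for illustration: the helper names (`read_jsonl`, `missing_fields`, `emoji_presence`, `make_record`), the representation of `emojis` as a list that is empty when no emoji is present, and the two placeholder records, which are not taken from the released corpora.

```python
import json
from io import StringIO

# Field names as listed in the abstract's unified schema.
SCHEMA_FIELDS = [
    "id", "text", "sentiment", "entities", "entity_type",
    "polarity_score", "polarity_source", "token_count", "emojis",
    "emoji_position", "emoji_sentiment", "conflict_flag",
    "sentiment_from_polarity_score", "split",
]


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_fields(record):
    """Return the schema fields that are absent from a record."""
    return [field for field in SCHEMA_FIELDS if field not in record]


def emoji_presence(records):
    """Share of records whose `emojis` value is non-empty.

    Assumption: an empty list, empty string, or null means "no emoji".
    """
    records = list(records)
    if not records:
        return 0.0
    with_emoji = sum(1 for record in records if record.get("emojis"))
    return with_emoji / len(records)


def make_record(record_id, emojis):
    """Build a placeholder record (hypothetical values, not corpus data)."""
    record = {field: None for field in SCHEMA_FIELDS}
    record.update({"id": record_id, "text": "placeholder text", "emojis": emojis})
    return record


# In-memory stand-in for a JSONL file; a real file object works the same way.
sample = "\n".join(
    json.dumps(make_record(record_id, emojis))
    for record_id, emojis in [(1, ["🙂"]), (2, [])]
)
records = list(read_jsonl(StringIO(sample)))

print(missing_fields(records[0]))  # fields absent from the first record
print(emoji_presence(records))     # share of records with at least one emoji
```

To run the same check on a downloaded file, pass an open file handle (opened with `encoding="utf-8"`) to `read_jsonl` in place of the `StringIO` object. The actual value types in the release may differ from the assumptions above, so the emptiness test in `emoji_presence` may need adjusting.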
