Article

Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals

Bobur Rashidovich Saidov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia

Vladimir Borisovich Barakhnin, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia

Shohrux Madirimov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia

Umid Ibragimov, Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia

Shakhboz Meylikulov, Department of Information Technology and Exact Sciences, Termez University of Economics and Service, 38-B, Ibn-Sino str., Termez 190100, Uzbekistan

Sultonbek Normamatov, Department of Computer Linguistics and Digital Technologies, Faculty of Social and Humanitarian Sciences, Alisher Navo′i Tashkent State University of Uzbek Language and Literature, 103, Yusuf Xos Khojib Str., Tashkent 100013, Uzbekistan

Feruza Bahodirova, Department of Interfaculty Foreign Languages, Urgench State University, 14, Kh. Alimdjan str., Urgench 220100, Uzbekistan

Javlonbek Matnazarov, Department of Language and Literature, Mamun University, 2, Bol-xovuz str., Khiva 220901, Uzbekistan

Zarnigor Fayzullaeva, Department of Software Engineering, Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan

Data · journal · 2026 · en

Abstract

This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER/ORG/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus contains 3,000 “manual-style” sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, 39.6% of the sentences in the released version contain at least one emoji, which keeps emotional markers comparable across the two corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632/y2d5pcyrzz.3).
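As a minimal sketch of how the unified schema might be checked after download, the Python snippet below parses JSONL lines, reports which schema fields a record lacks, and computes the share of records with at least one emoji. Only the field names come from the abstract; the helper names (`read_jsonl`, `missing_fields`, `emoji_presence`) are hypothetical, and the snippet assumes that an emoji-free sentence has an empty or null `emojis` value, which should be verified against the released files.

```python
import json

# Field names copied from the schema listed in the abstract.
SCHEMA_FIELDS = [
    "id", "text", "sentiment", "entities", "entity_type",
    "polarity_score", "polarity_source", "token_count", "emojis",
    "emoji_position", "emoji_sentiment", "conflict_flag",
    "sentiment_from_polarity_score", "split",
]


def read_jsonl(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def missing_fields(record):
    """Return the schema fields that are absent from one parsed record."""
    return [field for field in SCHEMA_FIELDS if field not in record]


def emoji_presence(records):
    """Share of records with a non-empty `emojis` value.

    Assumption: emoji-free rows store an empty string, empty list, or null.
    """
    records = list(records)
    if not records:
        return 0.0
    with_emoji = sum(1 for record in records if record.get("emojis"))
    return with_emoji / len(records)
```

With a local copy of one of the JSONL files, passing the open file object to `read_jsonl` and the result to `emoji_presence` would give a figure that can be compared with the 39.6% reported for the manual-style corpus, provided the empty-value assumption holds.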

Topics

Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition · Mental Health via Writing

Identifiers

DOI: 10.3390/data11020028

Citations and references

Cited by 0 · 14 references