Text classification dataset for Uzbek language
Abstract
The dataset consists of text collected from 9 Uzbek news websites and press portals, covering both news articles and press releases. The websites were selected to cover a range of categories such as politics, sports, entertainment, and technology. In total, we collected 512,750 articles containing over 120 million words across 15 distinct categories, providing a large and diverse corpus for text classification. All text in the corpus is written in the Latin script.

<em>Categories (with the name in Uzbek):</em>
Local (Mahalliy)
World (Dunyo)
Sport (Sport)
Society (Jamiyat)
Law (Qonunchilik)
Tech (Texnologiya)
Culture (Madaniyat)
Politics (Siyosat)
Economics (Iqtisodiyot)
Auto (Avto)
Health (Salomatlik)
Crime (Jinoyat)
Photo (Foto)
Women (Ayollar)
Culinary (Pazandachilik)

When referencing this work, please cite it as follows:

BibTeX:
<pre><code>@inproceedings{Kuriyozov2023TextCD,
  title={Text classification dataset and analysis for Uzbek language},
  author={Elmurod Kuriyozov and Ulugbek Salaev and Sanatbek Matlatipov and Gayrat Matlatipov},
  year={2023}
}
</code></pre>

APA:
<pre><code>Kuriyozov, E., Salaev, U., Matlatipov, S., & Matlatipov, G. (2023). Text classification dataset and analysis for Uzbek language.</code></pre>
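For classification experiments, the 15 categories listed above can be turned into a label mapping. The following is a minimal sketch, not part of the dataset itself: the integer ids simply follow the order of the category list and are an assumption, as are the names `CATEGORIES`, `LABEL_TO_ID`, and `encode_label`. The dataset's own files may encode labels differently.

```python
# English -> Uzbek category names, copied from the dataset description.
CATEGORIES = {
    "Local": "Mahalliy",
    "World": "Dunyo",
    "Sport": "Sport",
    "Society": "Jamiyat",
    "Law": "Qonunchilik",
    "Tech": "Texnologiya",
    "Culture": "Madaniyat",
    "Politics": "Siyosat",
    "Economics": "Iqtisodiyot",
    "Auto": "Avto",
    "Health": "Salomatlik",
    "Crime": "Jinoyat",
    "Photo": "Foto",
    "Women": "Ayollar",
    "Culinary": "Pazandachilik",
}

# Hypothetical integer ids assigned in listing order (an assumption,
# not an encoding defined by the dataset).
LABEL_TO_ID = {name: i for i, name in enumerate(CATEGORIES)}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}

# Reverse lookup so that Uzbek category names can be mapped to English ones.
UZBEK_TO_ENGLISH = {uz: en for en, uz in CATEGORIES.items()}


def encode_label(uzbek_name: str) -> int:
    """Map an Uzbek category name to its (assumed) integer id."""
    return LABEL_TO_ID[UZBEK_TO_ENGLISH[uzbek_name]]
```

For example, `encode_label("Dunyo")` looks up the English name `"World"` and returns its id under this assumed ordering.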