Rule-Based Punctuation Algorithm for the Uzbek Language
Abstract
Punctuation analysis plays an important role in natural language processing: a model that can predict correct punctuation is needed in a number of tasks, such as text pre-processing, spell checking, grammar checking, and information retrieval. Predicting the right punctuation marks is context-dependent, which makes it non-trivial to build language-independent, general-purpose punctuation generation tools. Although such tools have already been built for many languages, Uzbek is a low-resource language, and to our knowledge, punctuation analysis and prediction algorithms for Uzbek texts have not yet been developed. In this paper, we propose a rule-based algorithm and a model for the analysis of periods and commas in Uzbek-language texts. While the major contribution of this paper is a rule-based algorithm for determining whether periods and commas are placed correctly in Uzbek text, we also present analysis results on a corpus covering various fields, and we acknowledge the need for further work on the task, including machine learning and deep learning solutions. The proposed rule-based algorithm will not only benefit the processing of Uzbek texts but will, we hope, also serve as a pivot point for other closely related Turkic languages.