Bidirectional LSTM-CRF Models for Punctuation Restoration in Uzbek Texts
Abstract
This study proposes an approach to punctuation restoration in Uzbek texts that combines a Bidirectional Long Short-Term Memory (BiLSTM) network with a Conditional Random Fields (CRF) model. The main objective is to preserve the structural coherence of Uzbek texts by placing punctuation marks accurately. The model analyzes words and their contextual features sequentially to predict where punctuation marks belong. As part of the project, a dedicated Uzbek corpus was constructed in which the relationship between each word and its punctuation mark is explicitly annotated. Texts were annotated according to their morphological and syntactic properties, and a feature set was then defined. The model has two components: two BiLSTM layers learn contextual information, and a CRF layer captures the dependencies between labels. The F1-score was used for evaluation during training and testing. The results show that the BiLSTM+CRF+rule model is effective and reliable for punctuation prediction in Uzbek, achieving an overall accuracy of about 89.5 percent.
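To make the word-level annotation concrete, the following minimal Python sketch shows one way a punctuated sentence could be turned into (word, label) pairs for a sequence labeler, and how predicted labels could be turned back into punctuated text. The label names, the helper functions `to_word_label_pairs` and `restore`, and the example sentence are illustrative assumptions and are not taken from the corpus or model described in this study.

```python
# Hypothetical label set; the actual tag inventory of the corpus is not specified here.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}


def to_word_label_pairs(text):
    """Strip one trailing punctuation mark per token and emit (word, label) pairs."""
    pairs = []
    for token in text.split():
        label = "O"  # "O" means no punctuation mark follows this word
        if token[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that consisted of punctuation only
            pairs.append((token.lower(), label))
    return pairs


def restore(pairs):
    """Inverse step: rebuild punctuated text from (word, label) pairs."""
    inverse = {name: mark for mark, name in PUNCT_LABELS.items()}
    return " ".join(word + inverse.get(label, "") for word, label in pairs)


# Illustrative sentence (assumed example, not drawn from the corpus).
pairs = to_word_label_pairs("Men keldim, sen ketding.")
restored = restore(pairs)
```

In this framing, a tagger such as a BiLSTM with a CRF output layer would receive only the lowercased words and predict the label column; `restore` then reinserts the marks. Capitalization is not recovered in this sketch.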