Article

Bidirectional LSTM-CRF Models for Punctuation Restoration in Uzbek Texts

Maksud Sharipov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science, Urgench, Uzbekistan
Hushnudbek S. Adinaev, Urgench State University Named After Abu Rayhan Biruni, Department of IT, Urgench, Uzbekistan
Ogabek Sobirov, Urgench State University Named After Abu Rayhan Biruni, Department of Computer Science, Urgench, Uzbekistan

2025

Abstract

This study proposes an approach to predicting punctuation in Uzbek texts based on the Conditional Random Fields (CRF) model. The main objective is to preserve the structural coherence of Uzbek texts by placing punctuation marks accurately. The CRF model does this by analyzing words and their contextual features as a sequence and predicting where punctuation marks belong. As part of the project, a dedicated Uzbek corpus was constructed in which the relationship between each word and its punctuation mark is explicitly annotated. Texts were annotated according to their morphological and syntactic properties, and a feature set was then defined. The model uses a two-phase approach to punctuation restoration: two Bidirectional LSTM (Long Short-Term Memory) layers learn contextual information, and a CRF layer captures the dependencies between labels. During training and testing, a standard statistical metric, the F1-score, was used for evaluation. The results show that the BiLSTM+CRF+rule model is both effective and reliable for punctuation prediction in Uzbek, achieving an overall accuracy of about 89.5 percent.
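To make the sequence-labeling framing in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-mark label set (`COMMA`, `PERIOD`, `QUESTION`, plus `O` for "no mark"), shows how punctuated text could be turned into word/label pairs of the kind the corpus is said to contain, and shows how a CRF-style Viterbi decoder combines per-word scores (which a BiLSTM would normally supply) with label-transition scores. All names, labels, and scores here are illustrative assumptions.

```python
# Hypothetical label inventory; the abstract does not list the paper's actual tags.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def text_to_pairs(text):
    """Turn punctuated text into (word, label) pairs.

    The label names the mark that follows the word, or "O" if none does.
    """
    pairs = []
    for token in text.split():
        mark = token[-1] if token[-1] in PUNCT_TO_LABEL else None
        word = token[:-1] if mark else token
        if not word:
            continue  # skip a stray punctuation mark standing alone
        pairs.append((word.lower(), PUNCT_TO_LABEL.get(mark, "O")))
    return pairs


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions:   one dict per position mapping label -> score
                 (stand-in for BiLSTM outputs).
    transitions: dict mapping (previous_label, current_label) -> score;
                 missing pairs count as 0.0 (stand-in for CRF transitions).
    """
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at position t.
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "COMMA", "PERIOD"]
# Illustrative scores only: position-wise, the last two words both prefer PERIOD.
EMISSIONS = [
    {"O": 1.0, "COMMA": 0.0, "PERIOD": 0.0},
    {"O": 0.0, "COMMA": 0.0, "PERIOD": 1.0},
    {"O": 0.4, "COMMA": 0.0, "PERIOD": 0.5},
]
# A transition penalty discourages a period directly after a period.
TRANSITIONS = {("PERIOD", "PERIOD"): -1.0}
```

With the penalty in place, `viterbi(EMISSIONS, TRANSITIONS, LABELS)` picks `O, PERIOD, O` rather than the position-wise choice `O, PERIOD, PERIOD`, which illustrates why modeling dependencies between labels can change the prediction.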

Topics

Speech Recognition and Synthesis

Identifiers

DOI: 10.1109/ubmk67458.2025.11206853

Citations and references

Cited by 05 references
