Article

Development and Evaluation of Morphological-Oriented Models for Determining the Semantic Similarity of Texts (Using the Uzbek Language as an Example)

Bahodir Muminov, Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent, Uzbekistan
Nasiba Muradovna Allaberganova, Department of Artificial Intelligence, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan

2025


Abstract

Determining the semantic similarity of texts, known as Semantic Textual Similarity (STS), is a fundamental task in natural language processing (NLP). However, existing state-of-the-art models, developed primarily for English, are less effective when applied to low-resource and morphologically rich languages such as Uzbek. The agglutinative structure of Uzbek leads to data sparsity and makes standard tokenization methods inefficient, which necessitates specialized approaches. This article proposes a new model for determining semantic similarity that explicitly integrates morphological information into the architecture of a multilingual transformer. The model is a dual encoder consisting of a semantic channel, based on a pre-trained transformer model, and a morphological channel, which uses a bidirectional recurrent neural network (Bi-GRU) to process features extracted by a morphological analyzer. To train and evaluate the model, the first reference dataset of its kind for Uzbek, Uz-STS-B, was created through professional translation and multi-stage validation of the English-language STS Benchmark dataset. Experimental results show that the proposed model significantly outperforms existing baseline approaches, including monolingual and multilingual transformer models, with a 2.23% increase in Spearman's correlation over the stronger baseline model. Qualitative analysis and ablation studies confirm that the performance improvement is directly related to the explicit modeling of the morphological structure of words. The developed models, algorithms, and UzSTS-Toolkit software tools contribute to the development of NLP technologies for Turkic and other agglutinative languages.
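The abstract reports results as Spearman's correlation between model scores and human similarity judgments. As a minimal illustration of that metric only (not the authors' evaluation code or the UzSTS-Toolkit), the sketch below ranks both score lists, averaging tied ranks, and computes the Pearson correlation of the ranks. The function names and the sample scores are hypothetical; the gold values simply follow the 0 to 5 scale used by the STS Benchmark.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(gold, predicted):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(gold), average_ranks(predicted))


# Hypothetical data: human similarity scores (0-5) and model outputs.
gold = [0.0, 1.5, 2.5, 4.0, 5.0]
pred = [0.10, 0.35, 0.30, 0.80, 0.95]
rho = spearman(gold, pred)
print(f"Spearman's rho = {rho:.3f}")
```

Because only the ordering of the scores matters, the model outputs do not need to be on the same scale as the human judgments; a single swapped pair, as in the sample data, is enough to lower the correlation below 1.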

Topics

Topic Modeling; Text and Document Classification Technologies; Sentiment Analysis and Opinion Mining

Identifiers

DOI: 10.1145/3789692.3789735

Citations and Sources

Citations: 0
Sources used: 12
