Article

Sentence-Similarity Pipeline for Paraphrase Detection in Uzbek Texts

Djamshid Sultanov, Department of Infocommunication Engineering, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan

Khusniya Akhmedova, Department of Infocommunication Engineering, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi, Tashkent, Uzbekistan

Hyunchul Ahn, Graduate School of Business IT, College of Business Administration, Kookmin University, Seoul, Republic of Korea

2025

ABI

Abstract

Measuring semantic similarity between sentences is central to plagiarism detection, authorship attribution, question answering, and many other NLP applications. We consolidate existing vector-space and deep neural approaches into a single text-processing pipeline (TPP) for low-resource Uzbek. The TPP supports flexible tokenization, optional emoji handling, and alternative embedding stages (TF-IDF, Word2Vec, Doc2Vec). A lightweight API layer feeds sentence embeddings to a modular paraphrase-classification component that hosts distance-based or neural models. We validate the design on publicly available Uzbek corpora and synthetic sentence pairs; cosine similarity combined with a logistic-regression meta-classifier yields an F-score of up to 0.86 without external WordNet resources. The pipeline is released as an install-and-play prototype for educational use.
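As a rough illustration of the distance-based route named in the abstract (TF-IDF embeddings compared by cosine similarity), the following is a minimal standard-library sketch, not the authors' implementation. The function names, the tokenizer regex, the smoothed IDF formula, the `background` parameter, and the 0.5 threshold are all assumptions made for this example; the fixed threshold merely stands in for the logistic-regression meta-classifier, which is not reproduced here. The sample sentences in the usage note are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters/digits,
    # allowing an internal apostrophe as in Latin-script Uzbek (o', g').
    return re.findall(r"[^\W_]+(?:['ʻ’][^\W_]+)*", text.lower())


def tfidf_vectors(sentences):
    # One sparse TF-IDF vector (dict: term -> weight) per sentence.
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_paraphrase(a, b, background=(), threshold=0.5):
    # `background` holds optional extra sentences used only to estimate IDF.
    # The threshold is an illustrative stand-in for a trained classifier.
    vecs = tfidf_vectors([a, b, *background])
    return cosine(vecs[0], vecs[1]) >= threshold
```

A call such as `is_paraphrase("men kitob o'qiyman", "bugun havo issiq")` compares two sentences with no shared tokens, so the cosine similarity is zero and the pair is rejected. In a fuller pipeline the similarity score would more plausibly be passed as a feature to a trained classifier than compared against a fixed cutoff.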

Topics

Advanced Text Analysis Techniques; Topic Modeling; Sentiment Analysis and Opinion Mining

Identifiers

DOI: 10.1109/iscset65760.2025.11540524

Citations and references

Citations: 0

References used: 9
