Sentence-Similarity Pipeline for Paraphrase Detection in Uzbek Texts
Abstract
Measuring semantic similarity between sentences is central to plagiarism detection, authorship attribution, question answering, and many other NLP applications. We consolidate existing vector-space and deep neural approaches into a single text-processing pipeline (TPP) for Uzbek, a low-resource language. The TPP supports flexible tokenization, optional emoji handling, and alternative embedding stages (TF-IDF, Word2Vec, Doc2Vec). A lightweight API layer feeds sentence embeddings to a modular paraphrase-classification component that hosts either distance-based or neural models. We validate the design on publicly available Uzbek corpora and synthetic sentence pairs; cosine similarity combined with a logistic-regression meta-classifier yields an F-score of up to 0.86 without external WordNet resources. The pipeline is released as an install-and-play prototype for educational use.
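The distance-based route described above can be illustrated with a minimal sketch. It is not the released pipeline: the function names, the whitespace-style tokenizer, the toy sentences, and the fixed threshold of 0.5 are all assumptions made for illustration, and the fixed threshold stands in for the logistic-regression meta-classifier. The sketch builds smoothed TF-IDF vectors with the standard library only and compares two sentences by cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into word tokens; apostrophe variants are kept
    # inside tokens so that Latin-script Uzbek words such as "o'qidim"
    # stay whole. This is a simplification, not the pipeline's tokenizer.
    return re.findall(r"[\w'ʻʼ]+", text.lower())


def fit_idf(corpus):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf(text, idf):
    # Sparse TF-IDF vector as a dict; tokens unseen during fitting are dropped.
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    # Cosine similarity of two sparse vectors; 0.0 if either vector is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(a, b, idf, threshold=0.5):
    # Hypothetical decision rule: a fixed cutoff instead of a trained
    # meta-classifier. The value 0.5 is arbitrary and would need tuning.
    return cosine(tfidf(a, idf), tfidf(b, idf)) >= threshold


if __name__ == "__main__":
    # Toy sentences for illustration only, not drawn from the paper's corpora.
    corpus = ["men kitob o'qiyapman", "men kitob o'qidim", "bugun havo issiq"]
    idf = fit_idf(corpus)
    score = cosine(tfidf(corpus[0], idf), tfidf(corpus[1], idf))
    print(f"similarity: {score:.3f}")
```

In a setup closer to the one described, the cosine score (possibly alongside scores from the Word2Vec and Doc2Vec stages) would be passed as a feature to a trained classifier rather than compared against a hand-picked cutoff.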