Article

Developing and Processing the Uzbek-English Parallel Corpus: Approaches and Computational Tools

Botir Elov Boltayevich; Mukhabbat Kurbanova Matyakubovna; National University of Uzbekistan, Tashkent, Uzbekistan; Tursunay Yusupova Akhmedovna; Marufjon Amirkulov Alikulovich; Zarina Barnoyeva Sayfiddinovna; Bukhara State University, Bukhara, Uzbekistan; Munira Shodmonova Burxonovna

2026

ABI

Abstract

This article presents an applied, methodological study in computational linguistics that focuses on the creation and analysis of an Uzbek–English parallel corpus. While grounded in linguistic theory, the research emphasizes practical implementation with modern NLP tools and corpus-building methodologies. The study outlines each stage of corpus development: text selection, data cleaning, alignment, tokenization, lemmatization, and part-of-speech tagging, using frameworks such as spaCy and Stanza. The alignment process applies one-to-one, one-to-many, and many-to-one strategies to preserve semantic equivalence between the two languages. The paper contributes a resource for low-resource language processing that is both linguistically sound and computationally viable. The resulting corpus supports applications in machine translation, cross-lingual analysis, and corpus-based linguistic research. Theoretical implications are also discussed, with attention to how corpus design principles influence bilingual modeling and translation accuracy.
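The one-to-one, one-to-many, and many-to-one alignment strategies named in the abstract can be illustrated with a small sketch. The abstract does not say how the authors compute alignments, so the following is an assumption: a simplified length-based dynamic program (in the spirit of classic length-based sentence aligners, not the authors' method). The names `BEADS`, `bead_cost`, `align_sentences`, and the `merge_penalty` value are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Allowed alignment patterns: (source sentences consumed, target sentences consumed).
# (1, 1) is one-to-one, (1, 2) is one-to-many, (2, 1) is many-to-one.
BEADS = [(1, 1), (1, 2), (2, 1)]


def bead_cost(src_len: int, tgt_len: int, penalty: float) -> float:
    """Relative character-length mismatch plus a fixed penalty for merged beads."""
    return abs(src_len - tgt_len) / max(src_len, tgt_len, 1) + penalty


def align_sentences(
    src: List[str], tgt: List[str], merge_penalty: float = 0.1
) -> List[Tuple[List[int], List[int]]]:
    """Return a list of (source indices, target indices) covering both texts in order."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this prefix pair cannot be reached
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                penalty = 0.0 if (di, dj) == (1, 1) else merge_penalty
                candidate = cost[i][j] + bead_cost(src_len, tgt_len, penalty)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("no alignment exists with the allowed patterns")

    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align_sentences(uzbek_sentences, english_sentences)` on two sentence-split texts returns index groups that can then be passed to a tokenizer or tagger. Character length is only a rough proxy for semantic equivalence, so a production aligner would likely add lexical or embedding-based evidence.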

Topics

Natural Language Processing Techniques; Economic and Industrial Development; Second Language Acquisition and Learning

Identifiers

DOI: 10.1109/iisec69317.2026.11418506

Citations and references

Citations: 0; References used: 4
