Article

Developing and Processing the Uzbek-English Parallel Corpus: Approaches and Computational Tools

Botir Elov Boltayevich; Mukhabbat Kurbanova Matyakubovna; National University of Uzbekistan, Tashkent, Uzbekistan; Tursunay Yusupova Akhmedovna; Marufjon Amirkulov Alikulovich; Zarina Barnoyeva Sayfiddinovna; Bukhara State University, Bukhara, Uzbekistan; Munira Shodmonova Burxonovna

2026


Abstract

This article presents an applied, methodological study in computational linguistics on the creation and analysis of an Uzbek–English parallel corpus. While grounded in linguistic theory, the research emphasizes practical implementation through modern NLP tools and corpus-building methodologies. The study outlines each stage of corpus development: text selection, data cleaning, alignment, tokenization, lemmatization, and part-of-speech tagging, using frameworks such as spaCy and Stanza. The alignment process applies one-to-one, one-to-many, and many-to-one strategies to preserve semantic equivalence between the two languages. The paper contributes a resource for low-resource language processing that is both linguistically sound and computationally viable. The resulting corpus supports applications in machine translation, cross-lingual analysis, and corpus-based linguistic research. Theoretical implications are also discussed, highlighting how corpus design principles influence bilingual modeling and translation accuracy.
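To make the one-to-one, one-to-many, and many-to-one alignment strategies concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not say how alignment was computed, so this example substitutes a simple character-length dynamic program, loosely in the spirit of length-based sentence aligners. The function name `align`, the constant `MERGE_PENALTY`, the restriction to 1-1, 1-2, and 2-1 groupings, and the sample sentences are all assumptions made for illustration.

```python
from typing import List, Tuple

# Allowed groupings: (source sentences consumed, target sentences consumed).
# (1, 1) is one-to-one, (1, 2) is one-to-many, (2, 1) is many-to-one.
BEADS = [(1, 1), (1, 2), (2, 1)]
MERGE_PENALTY = 3  # assumed small extra cost so one-to-one is preferred on ties


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return groups of (source indices, target indices) with minimal total cost.

    The cost of a group is the absolute difference in character length between
    its source side and its target side, plus MERGE_PENALTY if it is not 1-1.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue  # this prefix pair cannot be reached
            for ds, dt in BEADS:
                if i + ds > n or j + dt > m:
                    continue
                s_len = sum(len(s) for s in src[i:i + ds])
                t_len = sum(len(t) for t in tgt[j:j + dt])
                penalty = 0 if (ds, dt) == (1, 1) else MERGE_PENALTY
                c = cost[i][j] + abs(s_len - t_len) + penalty
                if c < cost[i + ds][j + dt]:
                    cost[i + ds][j + dt] = c
                    back[i + ds][j + dt] = (ds, dt)

    if cost[n][m] == inf:
        raise ValueError("no alignment exists with the allowed groupings")

    # Walk the back-pointers from the end to recover the chosen groups.
    groups = []
    i, j = n, m
    while i > 0 or j > 0:
        ds, dt = back[i][j]
        groups.append((list(range(i - ds, i)), list(range(j - dt, j))))
        i, j = i - ds, j - dt
    groups.reverse()
    return groups


# Illustrative sentences only; they are not taken from the corpus.
uzbek = ["Men kitob o'qidim.", "U uyga keldi va ovqat yedi."]
english = ["I read a book.", "He came home.", "He ate dinner."]
for src_ids, tgt_ids in align(uzbek, english):
    print(src_ids, "<->", tgt_ids)
```

Character length is a crude proxy for semantic equivalence. A production pipeline would more plausibly combine such a length signal with lexical or embedding-based similarity, and would run tokenization, lemmatization, and tagging on the aligned pairs afterwards.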

Topics

Natural Language Processing Techniques; Economic and Industrial Development; Second Language Acquisition and Learning

Identifiers

DOI: 10.1109/iisec69317.2026.11418506

Citations and references

Cited by 04 references
