Article

Developing and Processing the Uzbek-English Parallel Corpus: Approaches and Computational Tools

2026
ABI

Abstract

This article presents an applied, methodological study in computational linguistics on the creation and analysis of an Uzbek–English parallel corpus. While grounded in linguistic theory, the research emphasizes practical implementation with modern NLP tools and corpus-building methodologies. The study outlines each stage of corpus development: text selection, data cleaning, alignment, tokenization, lemmatization, and part-of-speech tagging, using frameworks such as spaCy and Stanza. The alignment process applies one-to-one, one-to-many, and many-to-one strategies to preserve semantic equivalence between the two languages. The paper contributes a resource for low-resource language processing that is both linguistically sound and computationally viable. The resulting corpus supports applications in machine translation, cross-lingual analysis, and corpus-based linguistic research. Theoretical implications are also discussed, highlighting how corpus design principles influence bilingual modeling and translation accuracy.
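The one-to-one, one-to-many, and many-to-one alignment strategies named in the abstract can be illustrated with a small sketch. The abstract does not say how the alignment is computed, so the code below is not the authors' method: it assumes a simple length-based dynamic program (in the spirit of classic length-based sentence aligners) restricted to 1-1, 1-2, and 2-1 groupings. The names `align`, `BEAD_SHAPES`, and `MERGE_PENALTY`, and the penalty value itself, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Allowed groupings as (source sentences, target sentences):
# one-to-one, one-to-many (1-2), many-to-one (2-1).
BEAD_SHAPES = [(1, 1), (1, 2), (2, 1)]
MERGE_PENALTY = 3  # hypothetical constant that discourages merging sentences

Bead = Tuple[List[int], List[int]]  # (source indices, target indices)


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Align two sentence lists by minimizing character-length mismatch."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i source and j target sentences
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEAD_SHAPES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                penalty = 0 if (di, dj) == (1, 1) else MERGE_PENALTY
                candidate = cost[i][j] + abs(s_len - t_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (di, dj)

    if cost[n][m] == inf:
        raise ValueError("no alignment exists with the allowed groupings")

    # Walk the back-pointers from the end to recover the chosen groupings.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    beads.reverse()
    return beads
```

Each returned pair lists the source and target sentence indices grouped together, so a pair such as `([1], [1, 2])` marks a one-to-many alignment. Raw character length is a crude proxy for semantic equivalence; a real pipeline would likely add lexical or embedding-based evidence and manual verification.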

Citations and references

Cited by 04 references