Developing and Processing the Uzbek-English Parallel Corpus: Approaches and Computational Tools
Abstract
This article presents an applied, methodological study in computational linguistics on the creation and analysis of an Uzbek–English parallel corpus. Although grounded in linguistic theory, the research emphasizes practical implementation with modern NLP tools and corpus-building methods. It describes each stage of corpus development: text selection, data cleaning, alignment, tokenization, lemmatization, and part-of-speech tagging, using frameworks such as spaCy and Stanza. Alignment uses one-to-one, one-to-many, and many-to-one strategies so that aligned segments remain semantically equivalent across the two languages. The paper contributes a resource for low-resource language processing that is both linguistically sound and computationally viable. The resulting corpus supports applications in machine translation, cross-lingual analysis, and corpus-based linguistic research. Theoretical implications are also discussed, with attention to how corpus design principles influence bilingual modeling and translation accuracy.
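To make the three alignment strategies concrete, the following is a minimal sketch of how an aligned unit might be represented and classified. It is not the authors' implementation: the `AlignmentUnit` class, the `count_kinds` helper, and the sample sentences are hypothetical, and the sketch assumes Uzbek is treated as the source side, so "one-to-many" means one Uzbek sentence paired with several English sentences.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class AlignmentUnit:
    """Hypothetical container: Uzbek sentences paired with English sentences."""
    uz: Tuple[str, ...]
    en: Tuple[str, ...]

    @property
    def kind(self) -> str:
        # Classify by how many sentences sit on each side of the pair.
        # Assumption: Uzbek is the source side, English the target side.
        if len(self.uz) == 1 and len(self.en) == 1:
            return "one-to-one"
        if len(self.uz) == 1 and len(self.en) > 1:
            return "one-to-many"
        if len(self.uz) > 1 and len(self.en) == 1:
            return "many-to-one"
        # Empty sides or many-to-many fall outside the three named strategies.
        return "other"


def count_kinds(units: Iterable[AlignmentUnit]) -> Counter:
    """Tally how often each alignment type occurs in a collection of units."""
    return Counter(unit.kind for unit in units)


# Illustrative sentences only; they are not taken from the corpus.
units = [
    AlignmentUnit(
        uz=("Men kitob o'qiyapman.",),
        en=("I am reading a book.",),
    ),
    AlignmentUnit(
        uz=("Yomg'ir yog'di, shuning uchun biz uyda qoldik.",),
        en=("It rained.", "So we stayed at home."),
    ),
    AlignmentUnit(
        uz=("U keldi.", "Biz xursand bo'ldik."),
        en=("We were happy that he came.",),
    ),
]

kind_counts = count_kinds(units)
```

Storing the alignment type as a derived property rather than a separate field keeps the label from drifting out of sync with the sentences it describes, and a tally such as `kind_counts` gives a quick check on how much of a corpus departs from simple one-to-one pairing.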