Uzbek Terminology and Linguistic Engineering Integration: Principles of Semantic Tagging of Terms in Uzbek Scientific Discourse for Language Corpora
Abstract
This paper addresses a pressing problem in Uzbek computational linguistics: the absence of a standardized framework for the semantic tagging of scientific and technical terminology. Although corpus linguistics has advanced considerably worldwide, Uzbek still lacks a functional terminological corpus, which limits the development of digital linguistic resources. The study proposes a three-level semantic tagging model comprising conceptual classification, domain-specific differentiation, and ontological mapping of terms. In a pilot experiment, 628 terms from linguistics, the natural sciences, and information technology were semi-manually annotated in a test corpus of approximately 5,000 sentences extracted from Uzbek scientific texts. The results indicate that the model is applicable to disambiguating polysemous terms, structuring terminological semantics, and improving corpus usability. A comparison with international and Turkic-language resources (e.g., WordNet, the Turkish National Corpus, Kazakh scientific corpora) highlights both challenges and opportunities for Uzbek. The study concludes that semantic tagging of terms supports the development of the Uzbek National Corpus and also enables advanced applications in natural language processing, machine translation, and information retrieval.
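As an illustration of the three-level model, the following minimal sketch shows one way a tagged term could be represented and how terms with senses in more than one domain could be flagged as candidates for disambiguation. It is not the annotation scheme used in the study: the class name `TermTag`, its field names, the ontology identifiers, and the sample entries are all assumptions introduced here for illustration and are not drawn from the 628 annotated terms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TermTag:
    """One sense of a term, tagged at three levels (hypothetical schema)."""
    term: str            # surface form as it appears in the corpus
    concept_class: str   # level 1: conceptual classification
    domain: str          # level 2: domain-specific differentiation
    ontology_node: str   # level 3: ontological mapping (made-up identifier)


def group_senses(tags):
    """Group tagged senses by their surface form."""
    senses = defaultdict(list)
    for tag in tags:
        senses[tag.term].append(tag)
    return dict(senses)


def polysemous_terms(tags):
    """Return, in sorted order, terms tagged in more than one domain."""
    return sorted(
        term
        for term, entries in group_senses(tags).items()
        if len({entry.domain for entry in entries}) > 1
    )


# Illustrative entries only; not taken from the study's data.
SAMPLE = [
    TermTag("morfologiya", "discipline", "linguistics", "ling:Morphology"),
    TermTag("morfologiya", "discipline", "natural sciences", "bio:Morphology"),
    TermTag("algoritm", "procedure", "information technology", "it:Algorithm"),
]

if __name__ == "__main__":
    for term in polysemous_terms(SAMPLE):
        domains = [entry.domain for entry in group_senses(SAMPLE)[term]]
        print(term, domains)
```

Keeping each sense as a separate record means the domain level alone can separate the senses of a polysemous term, while the ontology level is free to point to whatever external resource the corpus adopts.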