UzTreebank: Methodological and Practical Issues in Building a Syntactic Treebank for the Uzbek Language
Abstract
In recent years, syntactic and semantic analysis tools have become increasingly important across subfields of Natural Language Processing (NLP). These tools enable automatic parsing of sentences in large-scale language corpora, allowing researchers to uncover the syntactic structures and statistical regularities of a given language. This study focuses on the development and evaluation of syntactic parsing models for the Uzbek language, employing two widely used approaches: constituency parsing and dependency parsing. For constituency parsing, a rule-based system was developed to identify noun and verb phrases along with their internal constituents. For dependency parsing, a set of hand-crafted linguistic rules was created and applied to the syntactic analysis of simple Uzbek sentences. As a result of this work, a dependency-based syntactic treebank for Uzbek, named UzTreebank, was constructed. The treebank includes 20,000 automatically parsed simple sentences, of which 10,000 were manually annotated. Additionally, 36 syntactic templates of simple sentences were identified, and 50 linguistic rules were formalized and integrated into the system. The suboptimal performance of the system at its current stage is primarily attributed to the absence of hybrid modeling approaches and the limited size of the training corpus. The paper presents an overview of the rule-based architecture, the parsing results, and the current state of syntactic resource development for the Uzbek language.