Oghuz Dialect Analysis of the Uzbek Language: Methodological Approach and Experimental Study
Abstract
We present a detector for the Oghuz dialect of Uzbek that is simple to implement yet effective. The method relies only on standard natural language preprocessing: normalization, tokenization, and spelling-aware regular expressions. A carefully selected inventory of diagnostic features (euklama enhancers, connectors, and auxiliary particles) is matched against each text. We score a text by dividing its total number of pattern matches by its number of tokens, and apply a single adjustable threshold to this score to distinguish dialectal from standard Uzbek. On stratified development data, the rule-based system provides strong separability and a practical operating point with high precision and recall. Adding a TF-IDF + logistic regression layer yields a small improvement on edge cases while preserving interpretability and low computational cost. A detailed error analysis identifies the key error types (inter-script variation, colloquial overlaps, and the handling of multi-word and clitic fragments) and motivates targeted corrections to normalization, matching constraints, and multi-word unit (MWU) rules. Beyond classification, the inventory supports corpus construction and training by providing pattern-based diagnostics, which facilitates gradual refinement and, where needed, larger upgrades to contextual encoders.
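The scoring rule described in the abstract (pattern matches divided by token count, compared against one threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The patterns in `PLACEHOLDER_PATTERNS` are invented stand-ins rather than the paper's diagnostic inventory, the apostrophe handling in `normalize` is an assumption about Latin-script Uzbek input, and the default `threshold` value is arbitrary and would need tuning on development data.

```python
import re
import unicodedata

# Hypothetical placeholder patterns, NOT real Oghuz dialect markers.
# A real inventory would hold the paper's diagnostic features.
PLACEHOLDER_PATTERNS = [r"\bmarkera\b", r"\bmarkerb\w*"]

# A token is a run of word characters, optionally containing apostrophes
# (Latin-script Uzbek writes o' and g' with an apostrophe-like sign).
TOKEN_RE = re.compile(r"\w+(?:'\w*)*")


def normalize(text):
    """Lowercase, apply NFKC, and map apostrophe variants to a plain "'"."""
    text = unicodedata.normalize("NFKC", text).lower()
    for variant in ("\u2019", "\u02bb", "\u02bc", "`"):
        text = text.replace(variant, "'")
    return text


def tokenize(text):
    """Split already-normalized text into tokens."""
    return TOKEN_RE.findall(text)


def density_score(text, patterns):
    """Return (total pattern matches) / (number of tokens); 0.0 if no tokens."""
    norm = normalize(text)
    tokens = tokenize(norm)
    if not tokens:
        return 0.0
    hits = sum(
        1 for pattern in patterns for _ in re.finditer(pattern, norm)
    )
    return hits / len(tokens)


def is_dialectal(text, patterns, threshold=0.05):
    """Apply a single adjustable threshold to the density score.

    The default threshold is an arbitrary assumption for illustration.
    """
    return density_score(text, patterns) >= threshold
```

Because the score is a plain ratio, the matched patterns behind each decision can be listed directly, which is what keeps a rule-based layer of this kind interpretable.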