AkademIndex

Article

Investigating linguistic errors in large language model generation of Uzbek text

Dilyorjon Solidjonov, Department of Philology, Kokand University
NAJMIDDINOV MUHAMMADJON, Department of Philology, Kokand University
ABI

Abstract

This study examines how contemporary large language models (LLMs) generate Uzbek, an agglutinative, morphologically rich, and low-resource Turkic language. Using a prompt-controlled corpus of 240 outputs from GPT-5, Gemini 2.0 Pro, and LLaMA 4, we conduct a detailed linguistic error analysis covering orthography, morphotactics, morphophonology, syntax, lexicon, and pragmatic appropriateness. Although the models frequently produce surface-level fluency, they exhibit systematic failures in core grammatical domains such as suffix ordering, vowel harmony, agreement, and speech-level distinctions (e.g. sen/Siz). Morphological errors account for more than one-third of all non-target-like forms, indicating that subword-based tokenisation and English-centric architectural priors remain inadequate for languages with rich inflectional and derivational morphology. These results show that current LLMs distort the typological profile of Uzbek, flattening both its structural complexity and its sociocultural nuance. To address this gap, the article proposes a linguistically grounded evaluation framework for low-resource and morphologically complex languages. It argues that genuine linguistic inclusivity in NLP requires models that respect the structural, typological, and social logics of the languages they aim to serve, rather than simply more data or larger models.
