AkademIndex

Article

Investigating linguistic errors in large language model generation of Uzbek text

Dilyorjon Solidjonov, Department of Philology, Kokand University
Najmiddinov Muhammadjon, Department of Philology, Kokand University
ABI

Abstract

This study examines how contemporary large language models (LLMs) generate Uzbek, an agglutinative, morphologically rich, low-resource Turkic language. Using a prompt-controlled corpus of 240 outputs from GPT-5, Gemini 2.0 Pro, and LLaMA 4, we conduct a detailed linguistic error analysis covering orthography, morphotactics, morphophonology, syntax, lexicon, and pragmatic appropriateness. Although the models often produce superficially fluent text, they fail systematically in core grammatical domains such as suffix ordering, vowel harmony, agreement, and speech-level distinctions (e.g., sen/Siz). Morphological errors account for more than one-third of all non-target-like forms, indicating that subword-based tokenisation and English-centric architectural priors remain inadequate for languages with rich inflectional and derivational morphology. These results show how current LLMs distort the typological profile of Uzbek, flattening both its structural complexity and its sociocultural nuance. To address this gap, the article proposes a linguistically grounded evaluation framework for low-resource and morphologically complex languages. It argues that genuine linguistic inclusivity in NLP requires models to respect the structural, typological, and social logics of the languages they aim to serve, not merely to scale data or model size.
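As a rough illustration of the kind of tally the abstract describes, the sketch below computes each error category's share of all annotated errors. The category labels follow the abstract; the `(output_id, category)` record format, the `error_shares` helper, and the toy annotations are hypothetical and are not taken from the study.

```python
from collections import Counter

# Error categories named in the abstract.
CATEGORIES = (
    "orthography", "morphotactics", "morphophonology",
    "syntax", "lexicon", "pragmatics",
)


def error_shares(annotations):
    """Return each category's share of all annotated errors.

    `annotations` is an iterable of (output_id, category) pairs.
    This record format is an assumption made for illustration.
    """
    counts = Counter(category for _, category in annotations)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        # No errors annotated: every share is zero.
        return {c: 0.0 for c in CATEGORIES}
    return {c: counts[c] / total for c in CATEGORIES}


# Toy, made-up annotations (not data from the study).
toy = [
    ("out-001", "morphotactics"),
    ("out-001", "pragmatics"),
    ("out-002", "morphophonology"),
    ("out-003", "syntax"),
]
shares = error_shares(toy)

# Treat morphotactic and morphophonological errors together as "morphological".
morphological = shares["morphotactics"] + shares["morphophonology"]
print(f"morphological share: {morphological:.2f}")
```

Grouping morphotactics and morphophonology under "morphological" is one possible reading of the abstract's wording; the article's own framework may draw that boundary differently.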
