Conditional Text Generation for Uzbek: A Comparative Study with LSTM and mT5-small
Abstract
We investigate conditional text generation for Uzbek in an open-domain setting spanning news, educational writing, and general prose. To enable supervised, prompt-driven generation, we curate a corpus of 12,078 Uzbek sentences and convert it into pairs in which brief topic and style cues condition target continuations. A recurrent network baseline is compared with a multilingual text-to-text transformer fine-tuned for two epochs, with both models using identical preprocessing and data splits. Quality is evaluated with corpus-level n-gram precision and two recall-oriented overlap measures that capture unigram coverage and longest-subsequence structure. On the held-out test set, the transformer surpasses the recurrent baseline on all three metrics and produces more fluent and stylistically stable continuations: n-gram precision rises from 11.8 to 26.7, unigram overlap from 23.6 to 40.8, and longest-subsequence overlap from 20.1 to 36.9. Error analysis shows that the recurrent model often truncates or repeats its output under longer prompts, whereas the transformer better preserves clause structure but can default to generic or enumerative phrasing when cues are underspecified. Rephrasing prompts toward directive intent and applying mild decoding constraints improve stability and sequence-level structure. These findings indicate that transformer-based conditional generation is a practical and effective approach for Uzbek under low-resource conditions. Future work will expand the corpus, incorporate semantic and human evaluations, and explore parameter-efficient and retrieval-augmented tuning for better grounding.