Article

Restoring Punctuation in Uzbek Texts Using LLM's Fine Tuning Approaches

Muhammadjon Musaev, Department of Artificial Intelligence, Tashkent University of Information Technologies named after Muhammad Al Khwarizmi, Tashkent, Uzbekistan
Mannon Ochilov, Department of Artificial Intelligence, Tashkent University of Information Technologies named after Muhammad Al Khwarizmi, Tashkent, Uzbekistan
Malika Abdullaeva, Department of Artificial Intelligence, Tashkent University of Information Technologies named after Muhammad Al Khwarizmi, Tashkent, Uzbekistan
Rashid Nasimov, Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent, Uzbekistan

2024 · en

ABI

Abstract

This study examines the use of fine-tuned transformer-based models for detecting and restoring punctuation marks in Uzbek texts. BERT and XLM-RoBERTa architectures are used to predict four punctuation marks: commas, full stops, exclamation marks, and question marks. The work also develops a specialized database covering key aspects of Uzbek punctuation, which supports focused research in this area. The models performed well in tests, confirming the potential of transformer models for punctuation restoration in low-resource languages, including Uzbek. The paper also addresses the development of tokenizers built specifically for Uzbek. Methods for restoring punctuation in Uzbek and for improving model accuracy under class label mismatch are discussed as directions for future research. The study reports a mean F1 score of 87.9% for predicting the four punctuation marks and whether the word following a punctuation mark is uppercase or lowercase.
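To make the task framing concrete, the following is a minimal sketch of how punctuated text could be turned into per-word labels and how a mean F1 score could be computed over those labels. It is not the authors' pipeline: the label names, the choice to record the case of every word, the macro-averaged F1, and the helper names `make_training_triples` and `macro_f1` are all assumptions made for illustration.

```python
# Hypothetical label set: the four marks named in the abstract, plus "O" for no mark.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "!": "EXCLAMATION", "?": "QUESTION"}


def make_training_triples(text):
    """Turn punctuated text into (lowercased word, punctuation label, case label) triples.

    The punctuation label describes the mark that follows the word; the case
    label records whether the word started with an uppercase letter.
    """
    triples = []
    for token in text.split():
        word = token.rstrip(",.!?")
        if not word:
            continue  # token consisted only of punctuation
        mark = token[len(word):len(word) + 1]  # first trailing mark, or ""
        punct = PUNCT_LABELS.get(mark, "O")
        case = "UPPER" if word[0].isupper() else "LOWER"
        triples.append((word.lower(), punct, case))
    return triples


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores (an assumed reading of 'mean F1')."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for triple in make_training_triples("Salom, bugun havo yaxshi. Siz keldingizmi?"):
        print(triple)
```

A fine-tuned token classifier would take the lowercased, unpunctuated words as input and predict the two labels for each word; the punctuated text is then rebuilt from the predictions.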

Topics

Natural Language Processing Techniques; Translation Studies and Practices; Topic Modeling

Identifiers

DOI: 10.1145/3726122.3726139

Citations and references

Citations: 2 · References used: 25
