Extraction and Data Analysis Basis Words: Case Study on School Corpus
Abstract
This study analyses a large text corpus built from 142 textbooks used in school education in Uzbekistan. The proposed method, Basis Word Extraction Using Synonym Thesaurus Support, is applied separately to each grade. Its main idea is to extract basis words from the lemma set of each grade using a synonym database. The corpus was studied in three blocks: the Primary School Corpus (grades 1–4), the Basic Secondary School Corpus (grades 5–9), and the Secondary School Corpus (grades 10–11). For each grade, the method yields basis words that differ from those of the general corpus, together with new basis words that do not occur in earlier grades and are specific to that grade. In total, 17599 basis words were extracted from the Uzbek Primary School Corpus, 47203 from the Uzbek Basic Secondary School Corpus, and 20491 from the Uzbek Secondary School Corpus. The method makes it possible to analyse the lexical complexity and the grade-specific vocabulary of texts intended for schoolchildren.
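The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the synonym database can be represented as a mapping from each lemma to a single representative of its synonym group, and that a grade's "new" basis words are those not seen in any earlier grade. The function names, the mapping format, and the English toy words are all hypothetical.

```python
from typing import Dict, Iterable, Set


def extract_basis_words(lemmas: Iterable[str], thesaurus: Dict[str, str]) -> Set[str]:
    """Collapse a lemma set into basis words.

    Assumption: `thesaurus` maps a lemma to the representative of its
    synonym group; lemmas without an entry represent themselves.
    """
    return {thesaurus.get(lemma, lemma) for lemma in lemmas}


def new_basis_words_by_grade(
    grade_lemmas: Dict[int, Set[str]], thesaurus: Dict[str, str]
) -> Dict[int, Set[str]]:
    """For each grade, keep only basis words absent from all earlier grades."""
    seen: Set[str] = set()
    result: Dict[int, Set[str]] = {}
    for grade in sorted(grade_lemmas):
        basis = extract_basis_words(grade_lemmas[grade], thesaurus)
        result[grade] = basis - seen  # specific to this grade
        seen |= basis                 # remember for later grades
    return result


# Toy data, invented for illustration only (not taken from the corpus).
thesaurus = {"large": "big", "huge": "big", "pretty": "beautiful"}
grade_lemmas = {
    1: {"big", "large", "book"},
    2: {"huge", "book", "pretty", "school"},
}

new_words = new_basis_words_by_grade(grade_lemmas, thesaurus)
```

With this toy input, grade 1 contributes the basis words "big" and "book", while grade 2 adds only "beautiful" and "school", because "huge" collapses to the already seen "big". Counting the sizes of such sets per grade block would give figures of the kind reported above, under the stated assumptions.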