Lists of Karakalpak Stopwords
Abstract
The dataset presents three lists of stopwords in the Karakalpak language. The lists were constructed by applying three automatic methods to the same corpus. The corpus, named the "Karakalpak School Corpus", was built from 23 school textbooks. It can be reconstructed from the list of URLs of all files in the corpus, which is included in the dataset (list_of_urls_for_karakalpak_school_corpus.txt).

Description of the methods and the lists:
1. A set of grammar rules and the TF-IDF algorithm were used to automatically collect a list of 4014 single-word stopwords. File name: Karakalpak_stopwords_unigrams.txt.
2. A bigram method was used to extract a list of 3740 bigrams (pairs) of stopwords. File name: Karakalpak_stopwords_bigram.txt.
3. A set of two-word collocations treated as stopwords was also extracted. This list has 20745 pairs. File name: Karakalpak_stopwords_bigrams_with_collocations.txt.
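
As an illustration of how the lists might be used, the following is a minimal Python sketch that filters a token sequence against a unigram list and a bigram list. It rests on assumptions that the dataset description does not confirm: that each file is UTF-8 text with one entry per line, that the two words of a bigram are separated by whitespace, and that matching is done on lowercased tokens. The function names (load_lines, parse_unigrams, parse_bigrams, remove_stopwords) are hypothetical helpers introduced here, and the demo tokens are placeholders, not real Karakalpak stopwords.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def load_lines(path: str) -> List[str]:
    """Read a stopword file as a list of lines (assumes UTF-8 encoding)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def parse_unigrams(lines: Iterable[str]) -> Set[str]:
    """Assumed format: one stopword per line; blank lines are skipped."""
    return {line.strip().lower() for line in lines if line.strip()}


def parse_bigrams(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Assumed format: two whitespace-separated words per line.

    Lines that do not contain exactly two words are ignored.
    """
    pairs = set()
    for line in lines:
        parts = line.strip().lower().split()
        if len(parts) == 2:
            pairs.add((parts[0], parts[1]))
    return pairs


def remove_stopwords(
    tokens: List[str],
    unigrams: Set[str],
    bigrams: Set[Tuple[str, str]],
) -> List[str]:
    """Drop stopword bigrams first, then single stopwords, left to right."""
    kept = []
    i = 0
    while i < len(tokens):
        # Check the adjacent pair before the single token, so that a
        # matching bigram is removed as a unit.
        if i + 1 < len(tokens):
            pair = (tokens[i].lower(), tokens[i + 1].lower())
            if pair in bigrams:
                i += 2
                continue
        if tokens[i].lower() not in unigrams:
            kept.append(tokens[i])
        i += 1
    return kept


if __name__ == "__main__":
    # In-memory placeholders standing in for the real file contents.
    unigrams = parse_unigrams(["w1", "w2"])
    bigrams = parse_bigrams(["w3 w4"])
    tokens = ["w3", "w4", "content", "w1", "word"]
    print(remove_stopwords(tokens, unigrams, bigrams))
```

With the real files, the sets would be built with calls such as parse_unigrams(load_lines("Karakalpak_stopwords_unigrams.txt")), provided the files follow the assumed layout. Checking bigrams before unigrams is a design choice of this sketch, not something the dataset prescribes.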