Building a Sentence-Level Sentiment Classifier for Karakalpak: From Low-Resource Data to Emotion Trajectories
Abstract
This paper presents a compact, fully interpretable pipeline for analyzing sentiment and emotional dynamics in Karakalpak educational texts. In the first stage, a sentence-level sentiment classifier is built from simple, robust features (word and character n-grams) and logistic regression; it outputs a continuous polarity score between -1 and +1 for each sentence. In the second stage, the sequence of polarity scores across a text is summarized by three interpretable indices of "emotional trajectory": volatility (the frequency and strength of polarity switches), stability/self-regulation (the proportion of "calm" stretches and low amplitude), and balance/trend (the mean tone and how it shifts toward the end of the text). Preprocessing includes script standardization (Latin script and apostrophe variants), lightweight lemmatization of productive suffixes, and careful sentence segmentation, which is critical for an agglutinative language. The methodological contribution is the combination of an explainable first-stage model with a deterministic, rule-based analysis of dynamics in the second stage. The design does not require large training corpora and is easy to replicate: every decision is accompanied by interpretable "carriers," and essay-level aggregates are converted into short, practical recommendations for the writing teacher. We describe the annotation and evaluation protocols, discuss limitations (genre sensitivity, code-switching, polysemy), and outline future directions: expanding the corpus, refining the rules for negation and modifiers, soft transfer from related Turkic languages, and integrating lightweight neural modules as the data grows.
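To make the second stage concrete, the following is a minimal sketch of how per-sentence polarity scores could be reduced to trajectory indices. The function name `trajectory_indices`, the `calm_step` threshold, and every formula below are illustrative assumptions, not the definitions used in the paper; the abstract only names the three index families (volatility, stability, balance/trend).

```python
from statistics import mean


def trajectory_indices(polarities, calm_step=0.2):
    """Summarize a sequence of sentence polarities, each in [-1, 1].

    Hypothetical definitions for illustration only:
      volatility -> switch_rate, mean_step
      stability  -> calm_share, amplitude
      balance    -> balance (mean tone), trend (last third minus first third)
    """
    if len(polarities) < 2:
        raise ValueError("need at least two sentence polarities")

    pairs = list(zip(polarities, polarities[1:]))
    # Absolute change in polarity between consecutive sentences.
    steps = [abs(b - a) for a, b in pairs]
    # A "switch" is a pair of neighbours with strictly opposite signs.
    switches = sum(1 for a, b in pairs if a * b < 0)

    # Size of the opening and closing segments compared for the trend.
    third = max(1, len(polarities) // 3)

    return {
        "switch_rate": switches / len(steps),
        "mean_step": mean(steps),
        "calm_share": sum(1 for s in steps if s <= calm_step) / len(steps),
        "amplitude": max(polarities) - min(polarities),
        "balance": mean(polarities),
        "trend": mean(polarities[-third:]) - mean(polarities[:third]),
    }


if __name__ == "__main__":
    # Made-up polarity scores for a six-sentence text.
    example = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5]
    for name, value in trajectory_indices(example).items():
        print(f"{name}: {value:.3f}")
```

Because the indices are plain arithmetic over the polarity sequence, they are deterministic and can be traced back to individual sentences, which matches the rule-based character of the second stage. In practice the threshold and the segment size used for the trend would need to be chosen and validated on annotated texts.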