Layer-Wise Probing of Paralinguistic Attributes in Fine-Tuned Whisper for Kazakh Speech
Abstract
Large pre-trained speech models such as Whisper are now widely used for speech recognition and related tasks. How paralinguistic information, such as emotion and speaker characteristics, is distributed across the layers of these models remains unclear, particularly for low-resource languages. This study probes each layer of a Whisper encoder fine-tuned for Kazakh to determine how well it encodes emotion, speaker identity, age, and gender. We extract fixed representations from every encoder layer and evaluate them with both linear and multilayer perceptron (MLP) probes. Performance is measured with accuracy, macro-averaged F1-score (Macro-F1), and balanced accuracy, and non-parametric statistical tests assess the significance of differences across layers. Emotion experiments use KazEmoTTS, whereas Common Voice (Kazakh) data are used for speaker identification and demographic attribute analysis. The results show that age and gender information is strongly present at all layers, with little change across depth, whereas speaker identity shows statistically significant but small variation between layers. Emotion information is concentrated in the middle layers, where probing is most effective. These findings clarify how Whisper represents Kazakh speech and can guide the choice of layers for paralinguistic speech applications.
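As an illustration of why three metrics are reported rather than accuracy alone, the following minimal sketch computes accuracy, Macro-F1, and balanced accuracy in plain Python. The function names and the toy emotion labels are hypothetical and are not taken from the study's code or data. Balanced accuracy is taken here as the mean per-class recall, and Macro-F1 as the unweighted mean of per-class F1 scores.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    f1_scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that occur in the reference labels."""
    recalls = []
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        if tp + fn > 0:
            recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels: a probe that always predicts
# the majority class looks good on accuracy but not on the other metrics.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f} balanced_accuracy={bal:.3f}")
```

On this toy input a majority-class predictor reaches high accuracy while Macro-F1 and balanced accuracy stay at or below chance level for two classes, which is the kind of discrepancy these metrics are meant to expose when class frequencies are uneven.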