Corpus-based Uncertainty Analysis of Multilingual Media under Language Policy
Abstract
This paper presents a mathematical framework for quantifying graded language mixing in media texts surrounding a policy reform. We model each document as generated by probabilistic n-gram models for two languages, interpret the resulting posterior probabilities as soft-membership degrees, and apply Shannon entropy to measure per-document mixing. A fuzzification exponent controls how sharply documents are assigned to a language, and averaging entropy across documents yields a corpus-level metric that we track over pre- and post-reform intervals. In a case study of 20 headlines, mean entropy rose from 0.52 to 0.68 nats (∆ = 0.16), indicating increased code-mixing after the policy change. A paired t-test (t = 3.27, p < 0.01) and a permutation test (p = 0.005) indicate that this shift is statistically significant. The average English soft membership dropped from 0.77 to 0.52, further illustrating editorial adaptation. The implementation is modular and scales to large corpora, and an open-source toolkit is provided to promote reproducibility and extension to other bilingual or multilingual settings. We discuss limitations related to parameter sensitivity, model assumptions, and sample size, and outline future extensions involving imprecise-probability bounds, contextual embeddings, dynamic time-series modeling, and topic-augmented uncertainty. Our results suggest that information-theoretic tools can detect subtle shifts in media discourse in response to regulatory changes.
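The pipeline described above (per-language likelihoods, soft memberships, per-document entropy, corpus mean) can be illustrated with a minimal Python sketch. It is not the toolkit's code. The function names are hypothetical, the inputs are assumed to be per-language log-likelihoods already produced by some n-gram model, the prior over languages is assumed uniform, and the role of the fuzzification exponent `m` is an assumption borrowed from fuzzy c-means, where memberships scale as the posterior raised to the power 1/(m - 1). The abstract does not state the exact form used.

```python
import math

def soft_memberships(log_likelihoods, m=2.0):
    """Map per-language log-likelihoods to soft-membership degrees.

    Assumption (not taken from the paper): the fuzzification exponent
    m > 1 tempers the posterior as u_k proportional to p_k ** (1 / (m - 1)),
    so m = 2 gives the plain posterior under a uniform prior and larger
    m gives flatter, less sharp memberships.
    """
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    beta = 1.0 / (m - 1.0)
    scaled = [beta * ll for ll in log_likelihoods]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(memberships):
    """Shannon entropy in nats; zero-membership terms contribute nothing."""
    return -sum(u * math.log(u) for u in memberships if u > 0.0)

def corpus_entropy(documents, m=2.0):
    """Mean per-document entropy over a list of log-likelihood vectors."""
    values = [entropy_nats(soft_memberships(doc, m)) for doc in documents]
    return sum(values) / len(values)
```

With two languages the per-document entropy is bounded above by ln 2 ≈ 0.693 nats, reached when both memberships equal 0.5, so the reported means of 0.52 and 0.68 nats both lie within the attainable range.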