Article

Dynamic Calibration and Adversarial Verification in Eight-Model Ensembles: Parameter-Independent Acquiescence, Calibration Homeostasis, and the Wilson Gate Relaxation Threshold

Anatoliy Kremenchutskiy (Bukhara State University), Tursun Shafiyev (Bukhara State University), Kairat Toqmyrza (Silumine of Kazakhstan LLP), Shukrullo Ramazanov (Bukhara State University)

Zenodo (CERN European Organization for Nuclear Research) · repository · 2026 · en

Abstract

We present a systematic evaluation of multi-LLM knowledge graph verification ensembles scaled from 3 to 8 models, using the Adversarial Calibration Framework (ACF) deployed on the PARAM BILIM supercomputer cluster with eight locally hosted open-source models (2B–32B parameters). Evaluating on 1,169 Wikidata-derived triples annotated by four independent human annotators (Fleiss’ κ = 0.344, gold label stability 100%), we report four principal findings. First, Dynamic Weighting achieves calibration homeostasis: its F1 varies by only 0.014 (range 0.924–0.938) across all ensemble sizes from 3 to 8, compared to a baseline variation of 0.058. Second, acquiescence appears to be parameter-independent: Gemma2:2b (2B parameters, acquiescence = 0.481) and Gemma2:9b (9B, 0.528) show similar rates despite a 4.5× parameter difference, while Command-R (32B, 0.481) matches the smaller Gemma model; Fisher’s exact test yields p = 0.575, indicating no statistically significant within-family difference (we frame this finding as preliminary given N = 106 gold-negative triples). Third, the Wilson confidence interval gate relaxes for the first time at n = 8, where 7/8 agreement (Wilson lower bound = 0.529) passes the τ = 0.5 threshold. Fourth, adversarial prompting exhibits monotonic degradation: Adversarial F1 declines steadily from 0.832 (n = 3) to 0.738 (n = 8). We situate these findings within the broader landscape of LLM calibration research, including SelfCheckGPT, multi-agent debate, and Bayesian ensemble methods. Our results suggest that the critical design variable for ensemble verification is not ensemble size per se, but the inclusion of a gold-calibrated weighting mechanism that absorbs heterogeneity regardless of scale.
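The Wilson gate figure quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not the authors' ACF code; it assumes a standard Wilson score interval at the 95% level (z = 1.96, which the abstract does not state but which reproduces the reported 0.529), and the names `wilson_lower_bound` and `TAU` are illustrative. It also assumes that "relaxation" refers to a vote with exactly one dissenting model.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 (95% confidence) is an assumption; the abstract does not
    state the confidence level used by the gate.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


TAU = 0.5  # gate threshold quoted in the abstract

# One dissenting model (n - 1 of n agree) for each ensemble size 3..8.
for n in range(3, 9):
    lb = wilson_lower_bound(n - 1, n)
    print(f"{n - 1}/{n} agreement: lower bound = {lb:.3f}, passes gate: {lb > TAU}")
```

Under these assumptions, a single-dissent vote stays below τ = 0.5 for every ensemble size from 3 to 7 and first clears it at 7/8, which is consistent with the abstract's statement that the gate relaxes at n = 8.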

Topics

Advanced Graph Neural Networks · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning

Identifiers

DOI: 10.5281/zenodo.19465009

Citations and references

Citations: 0 · References used: 0