Dynamic Calibration and Adversarial Verification in Eight-Model Ensembles: Parameter-Independent Acquiescence, Calibration Homeostasis, and the Wilson Gate Relaxation Threshold
Abstract
We present a systematic evaluation of multi-LLM knowledge graph verification ensembles scaled from 3 to 8 models, using the Adversarial Calibration Framework (ACF) deployed on the PARAM BILIM supercomputer cluster with eight locally hosted open-source models (2B–32B parameters). Evaluating on 1,169 Wikidata-derived triples annotated by four independent human annotators (Fleiss’ κ = 0.344, gold label stability 100%), we report four principal findings. First, Dynamic Weighting achieves calibration homeostasis: its F1 varies by only 0.014 (range 0.924–0.938) across all ensemble sizes from 3 to 8, compared to a baseline variation of 0.058. Second, acquiescence is parameter-independent: Gemma2:2b (2B parameters, acquiescence = 0.481) and Gemma2:9b (9B, 0.528) show similar rates despite a 4.5× parameter difference, and Command-R (32B, 0.481) matches the 2B model; Fisher’s exact test yields p = 0.575, so the within-family difference is not statistically significant (we treat this finding as preliminary given N = 106 gold-negative triples). Third, the Wilson confidence interval gate relaxes for the first time at n = 8, where 7/8 agreement (Wilson lower bound = 0.529) clears the τ = 0.5 threshold. Fourth, adversarial prompting degrades monotonically with ensemble size: Adversarial F1 declines steadily from 0.832 (n = 3) to 0.738 (n = 8). We situate these findings within the broader landscape of LLM calibration research, including SelfCheckGPT, multi-agent debate, and Bayesian ensemble methods. Our results suggest that the critical design variable for ensemble verification is not ensemble size per se, but the inclusion of a gold-calibrated weighting mechanism that absorbs heterogeneity regardless of scale.
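The Wilson gate figure quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes the gate uses the standard Wilson score interval at a 95% level (z = 1.96) and a strict comparison against τ, neither of which the abstract states, and the function names `wilson_lower_bound` and `passes_gate` are hypothetical rather than part of ACF.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 (two-sided 95%) is an assumption; the abstract gives no level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


def passes_gate(successes: int, n: int, tau: float = 0.5, z: float = 1.96) -> bool:
    """Hypothetical gate: accept when the Wilson lower bound exceeds tau."""
    return wilson_lower_bound(successes, n, z) > tau


# One dissenting model at each ensemble size from 3 to 8.
for n in range(3, 9):
    k = n - 1
    print(f"{k}/{n}: lower bound = {wilson_lower_bound(k, n):.3f}, "
          f"passes = {passes_gate(k, n)}")
```

Under these assumptions, 7/8 agreement gives a lower bound of about 0.529, matching the value reported above, while 6/7 agreement stays just below 0.5. This is consistent with n = 8 being the first ensemble size at which a single dissenting model does not block acceptance.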