Mapping Epistemic Deficits in LLM-Based Knowledge Graph Verification: Diversity Thresholds, Calibration Dynamics, and the Limits of Adversarial Consensus
Abstract
We investigate how ensemble composition affects the reliability of multi-LLM knowledge graph verification. Using the Adversarial Calibration Framework (ACF) deployed on the PARAM BILIM supercomputer cluster with six locally hosted open-source models (7B–32B parameters), we evaluate verification performance on 1,169 Wikidata-derived triples annotated by four independent human annotators (Fleiss’ κ = 0.345, gold label stability 97.3%). Our experiments reveal three principal findings. First, increasing ensemble diversity from 3 to 6 models produces a non-monotonic performance trajectory under a stress-test ordering in which models are added from high to low acquiescence: baseline F1 drops sharply from 0.954 to 0.652 when a fourth, more skeptical model is added, then partially recovers to 0.833 at n = 6. This discontinuity arises from the interaction between majority-vote thresholds and heterogeneous acquiescence rates. Second, Dynamic Weighting, which calibrates model contributions against a gold-standard subset, is the only ACF component that remains stable across all ensemble sizes (F1 = 0.85–0.88). Third, acquiescence rates span the full range from 0.000 (DeepSeek-R1:8b) to 0.979 (Gemma2:9b), suggesting that the choice of models is itself an implicit epistemic stance. We formalize these observations through the Epistemic Deficit Profile (EDP), a five-dimensional diagnostic vector that characterizes a verification system’s weaknesses across false-positive rate, acquiescence, calibration bias, structural integrity, and inter-annotator agreement. Our results suggest that framing knowledge graph verification as a problem of mapping epistemic deficits, rather than maximizing accuracy, may be a productive direction for future research.
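The interaction between majority-vote thresholds and per-model voting behaviour can be illustrated with a toy sketch. It is not the ACF implementation: the strict-majority tie rule, the accuracy-based weighting, and all function names (`majority_vote`, `calibrate_weights`, `weighted_vote`) are assumptions made for illustration, and the votes are invented rather than taken from the experiments.

```python
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Accept a triple only if strictly more than half of the models accept it.

    Assumption: ties count as rejection.
    """
    return sum(votes) * 2 > len(votes)


def calibrate_weights(
    gold_votes: Sequence[Sequence[bool]], gold_labels: Sequence[bool]
) -> List[float]:
    """Weight each model by its accuracy on a gold-labelled subset.

    gold_votes[m][i] is model m's verdict on gold triple i.
    Assumption: plain accuracy is used as the weight.
    """
    weights = []
    for model_votes in gold_votes:
        correct = sum(v == y for v, y in zip(model_votes, gold_labels))
        weights.append(correct / len(gold_labels))
    return weights


def weighted_vote(votes: Sequence[bool], weights: Sequence[float]) -> bool:
    """Accept if the accepting models hold more than half of the total weight."""
    accept_weight = sum(w for v, w in zip(votes, weights) if v)
    return accept_weight * 2 > sum(weights)


# Adding one rejecting model moves the threshold: 2 of 3 passes, 2 of 4 does not.
three_models = [True, True, False]
four_models = three_models + [False]

# Hypothetical gold subset: two always-accept models, one accurate model,
# and one always-reject model.
gold_labels = [True, False, True, False]
gold_votes = [
    [True, True, True, True],      # highly acquiescent
    [True, True, True, True],      # highly acquiescent
    [True, False, True, False],    # matches the gold labels
    [False, False, False, False],  # never accepts
]
weights = calibrate_weights(gold_votes, gold_labels)
```

In this sketch the unweighted vote flips from accept to reject when the fourth model joins, even though no existing verdict changed, whereas the weighted vote depends on how much each model's verdicts agreed with the gold subset rather than on the head count alone.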