A Hybrid Ensemble Framework for Rare Event Detection in Large-Scale Tabular Data
Abstract
Rare event detection in large tabular data remains a computationally challenging problem due to class imbalance, heterogeneous feature distributions, and unstable thresholds. Traditional machine learning approaches based on individual models and fixed thresholds often exhibit limited robustness and reproducibility in such settings. This paper proposes a hybrid ensemble framework for rare event detection that integrates heterogeneous machine learning models through threshold-aware probabilistic aggregation. The framework combines gradient-boosted decision trees, regularized linear models, and neural networks, leveraging their complementary inductive biases. To ensure reproducibility and robust performance evaluation under severe class imbalance, a leaky-controlled evaluation protocol is employed, including rootwise summation, probability calibration, and validation-based threshold optimization. The proposed approach is evaluated on a large tabular dataset containing approximately 50,000 observations. Experimental results demonstrate improved rare event detection and robust generalization performance compared to individual baseline models. Explainability is achieved through Shapley Additive Explanations (SHAP)-based attribution analysis and clustering in the explanation space, enabling transparent analysis of ensemble decision-making behavior. The proposed framework represents a general-purpose computational solution for rare event detection and can be applied to a wide range of data-driven decision-making and anomaly detection problems.