Model and Algorithms for Classifying Anomalous Phenomena Based on the Convergence of Acoustic-Visual Signals
Abstract
This paper proposes a Context-adaptive Audio-Visual Neural Network (CAVN) model for anomaly detection in public safety systems. Existing approaches rely primarily on visual data and combine modalities with simple fusion strategies, which limits their ability to capture complex semantic relationships between audio and video. The proposed model consists of four main components: a visual feature extraction module based on the SlowFast architecture, an audio feature extraction module based on the Audio Spectrogram Transformer (AST), a fusion module based on a bidirectional cross-attention mechanism, and a temporal context aggregation module based on a Transformer encoder. The main scientific novelty of the model is its adaptive modality balancing mechanism, which dynamically adjusts the relative importance of the modalities under different conditions (dark/bright, noisy/quiet). Experimental results demonstrate that the proposed CAVN model outperforms existing methods both in overall accuracy and in dark conditions. Ablation studies confirm the contribution of each module to the overall performance.
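The fusion and balancing steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the balancing weights are computed, so the softmax gate below, the function names, and the parameters `w_gate` and `b_gate` are all hypothetical. The attention also omits the learned query/key/value projections and multiple heads that a real cross-attention layer would normally have, and random arrays stand in for the SlowFast and AST features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention in which `queries` attend to `context`.

    Simplification: no learned Q/K/V projections and a single head.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (T_q, T_c)
    return softmax(scores, axis=-1) @ context   # (T_q, d)

def bidirectional_fusion(visual, audio):
    """Each modality attends to the other; residual connections keep the originals."""
    v_enriched = visual + cross_attention(visual, audio)
    a_enriched = audio + cross_attention(audio, visual)
    return v_enriched, a_enriched

def modality_weights(visual, audio, w_gate, b_gate):
    """Hypothetical gate: pooled features of both modalities -> two weights summing to 1."""
    pooled = np.concatenate([visual.mean(axis=0), audio.mean(axis=0)])  # (2d,)
    return softmax(w_gate @ pooled + b_gate)                             # (2,)

def fuse(visual, audio, w_gate, b_gate):
    """Cross-attend in both directions, then combine with adaptive weights."""
    v, a = bidirectional_fusion(visual, audio)
    alpha = modality_weights(v, a, w_gate, b_gate)
    fused = alpha[0] * v.mean(axis=0) + alpha[1] * a.mean(axis=0)        # (d,)
    return fused, alpha

# Placeholder inputs: 8 visual and 6 audio time steps, 16-dimensional features.
rng = np.random.default_rng(0)
d = 16
visual = rng.normal(size=(8, d))
audio = rng.normal(size=(6, d))
w_gate = rng.normal(size=(2, 2 * d)) * 0.1  # assumed gate parameters, untrained
b_gate = np.zeros(2)

fused, alpha = fuse(visual, audio, w_gate, b_gate)
print("fused shape:", fused.shape)
print("modality weights (visual, audio):", alpha)
```

In a trained model of this kind, the gate would be expected to shift weight toward audio when the visual stream is uninformative (for example, in dark scenes) and toward video when the audio is noisy; here the weights are untrained and only show the mechanics.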