CROSS-INTERACTION-BASED MULTIMODAL FEATURE COMPARISON FOR MOVING OBJECT IDENTIFICATION IN CROWDED VIDEO SCENES
Abstract
Identifying moving objects in crowded video scenes is difficult because appearance information alone is often unreliable: different people or objects can look alike, while the same object can look different under pose variation, scale changes, partial occlusion, illumination variation, or low visibility. To address this problem, this paper presents a cross-interaction-based multimodal feature comparison method for moving object identification. The method represents each moving object with several complementary modalities: appearance, geometry, spatial position, context, reliability, and clothing-color features. These heterogeneous features are projected into a common latent space before comparison. For each pair of candidate detections, modality-wise comparison features are constructed from the element-wise product and the absolute difference of the projected features. A cross-interaction function then learns relationships between modalities, and a multilayer perceptron (MLP) estimates the final similarity probability. The method is especially useful in difficult cases such as occlusion, lost-track recovery, candidate ambiguity, and object reappearance. Compared with simple feature concatenation, the cross-interaction approach lets the model learn conditional relationships across modalities, which improves the reliability of moving-object identification in crowded scenes.
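The comparison pipeline summarized in the abstract (projection into a common latent space, modality-wise product and absolute difference, cross-interaction, MLP with a probability output) can be illustrated with a minimal pure-Python sketch. The abstract does not specify the form of the cross-interaction function, the feature dimensions, or the layer sizes, so everything concrete here is an assumption made for illustration: the raw feature sizes in `DIMS`, the latent size `D`, the hidden width of 16, and the use of scaled dot-product attention across modality tokens as a stand-in for the cross-interaction function. The weights are random and untrained, so the output is only a well-formed probability, not a meaningful score.

```python
import math
import random

# Hypothetical modalities and raw feature sizes (illustrative assumptions).
DIMS = {"appearance": 8, "geometry": 4, "position": 2,
        "context": 6, "reliability": 1, "color": 3}
D = 4  # size of the common latent space (assumed)

rng = random.Random(0)

def init_linear(n_in, n_out):
    """Random, untrained weight matrix with a zero bias vector."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return W, [0.0] * n_out

def linear(x, params):
    """Affine map y = W x + b on a plain Python list."""
    W, b = params
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

# One projection per modality into the shared D-dimensional space.
PROJ = {m: init_linear(n, D) for m, n in DIMS.items()}
HIDDEN = init_linear(len(DIMS) * 2 * D, 16)
OUT = init_linear(16, 1)

def compare(za, zb):
    """Modality-wise comparison: element-wise product, then absolute difference."""
    return ([a * b for a, b in zip(za, zb)]
            + [abs(a - b) for a, b in zip(za, zb)])

def cross_interaction(tokens):
    """Assumed form: scaled dot-product attention across modality tokens."""
    scale = math.sqrt(len(tokens[0]))
    mixed = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed.append([sum(w / total * k[i] for w, k in zip(weights, tokens))
                      for i in range(len(q))])
    return mixed

def similarity(obj_a, obj_b):
    """Probability-like score that two detections show the same object."""
    tokens = [compare(linear(obj_a[m], PROJ[m]), linear(obj_b[m], PROJ[m]))
              for m in DIMS]
    flat = [v for token in cross_interaction(tokens) for v in token]
    hidden = [max(0.0, h) for h in linear(flat, HIDDEN)]  # ReLU
    logit = linear(hidden, OUT)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

def random_object():
    """Synthetic detection with random features for every modality."""
    return {m: [rng.uniform(-1.0, 1.0) for _ in range(n)]
            for m, n in DIMS.items()}

det_a, det_b = random_object(), random_object()
p_ab = similarity(det_a, det_b)
p_ba = similarity(det_b, det_a)
```

Because both the element-wise product and the absolute difference are symmetric in their arguments, `similarity(det_a, det_b)` and `similarity(det_b, det_a)` agree in this sketch. Plain concatenation of the two detections' features would not guarantee that property.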