Article

Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization

Shakhnoza Muksimova, Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea

Sabina Umirzakova, Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea

Murodjon Sultanov, Department of Information Systems and Technologies, Tashkent State University of Economics, Tashkent 100066, Uzbekistan

Young Im Cho, Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea

Sensors · journal · 2025 · en

ABI

Abstract

Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets, YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.

Topics

Multimodal Machine Learning Applications; Human Pose and Action Recognition; Video Analysis and Summarization

Identifiers

DOI: 10.3390/s25030707

Citations and references

Citations: 4; References used: 29
