Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding

Mohammed Elhenawy (CARRS-Q and Centre for Data Science, Queensland University of Technology, Kelvin Grove, QLD 4059, Australia)
Huthaifa I. Ashqar (Artificial Intelligence Program, Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY 10027, USA)
Andry Rakotonirainy (CARRS-Q and Centre for Data Science, Queensland University of Technology, Kelvin Grove, QLD 4059, Australia)
Taqwa I. Alhadidi (Civil Engineering Department, Al-Ahliyya Amman University, Amman 19328, Jordan)
Ahmed Jaber (Association of Palestinian Local Authorities, Ramallah P600, Palestine)
Mohammad Abu Tami (Natural, Engineering and Technology Sciences Department, Arab American University, Jenin P.O. Box 240, Palestine)
2025, English

Abstract

Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language–Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. Through frame-level analyses on the Honda Scenes Dataset, a collection of about 80 hours of annotated driving videos capturing diverse real-world road and weather conditions, the study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. Fine-tuning CLIP models such as ViT-L/14 and ViT-B/32 (Vision Transformer variants) significantly improved scene classification, achieving a top F1-score of 91.1%. These results demonstrate that the system can deliver rapid and precise scene recognition, meeting critical requirements of advanced driver assistance systems (ADASs), and that CLIP models can provide scalable and efficient frameworks for dynamic scene understanding and classification. The work also lays groundwork for advanced autonomous vehicle technologies by supporting a deeper understanding of driver behavior, road conditions, and safety-critical scenarios.
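The abstract does not spell out how the retrieval step works, so the following is only a minimal sketch of the general CLIP-style idea: a frame embedding is compared with the embeddings of candidate text descriptions by cosine similarity, and the similarities are turned into a ranking. Everything concrete here is an assumption made for illustration: the three-dimensional toy vectors stand in for real encoder outputs, the label prompts are hypothetical and not taken from the Honda Scenes Dataset, and the function names and the `temperature` value are not from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores, temperature):
    """Numerically stable softmax over similarity scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rank_scene_labels(frame_embedding, label_embeddings, temperature=0.01):
    """Rank text labels for one frame, most likely first.

    frame_embedding: vector standing in for an image-encoder output.
    label_embeddings: dict mapping a text prompt to its embedding vector.
    temperature: assumed scaling factor, chosen here only for illustration.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(frame_embedding, label_embeddings[name])
            for name in labels]
    probs = softmax(sims, temperature)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical prompts and toy embeddings (not real CLIP outputs).
LABEL_EMBEDDINGS = {
    "a photo of a rainy road": [0.9, 0.1, 0.0],
    "a photo of a snowy road": [0.1, 0.9, 0.0],
    "a photo of an intersection": [0.0, 0.1, 0.9],
}

frame = [0.8, 0.2, 0.1]  # toy embedding of a single video frame
ranking = rank_scene_labels(frame, LABEL_EMBEDDINGS)
best_label, best_prob = ranking[0]
print(best_label)
```

In a real pipeline the toy vectors would be replaced by the outputs of a CLIP image encoder and text encoder; because the label embeddings can be computed once and reused for every frame, the per-frame cost reduces to one image encoding plus a handful of dot products.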

Citations and sources

Citations: 2. Sources used: 0.