Article

Video Captioning Using Large Language Models

Priyanshu Malaviya, Pandit Deendayal Energy University, Department of Computer Science and Engineering, Gandhinagar, Gujarat, India, 382007
Dhruvit Patel, Pandit Deendayal Energy University, Department of Computer Science and Engineering, Gandhinagar, Gujarat, India, 382007
Santosh Kumar Bharti, Pandit Deendayal Energy University, Department of Computer Science and Engineering, Gandhinagar, Gujarat, India, 382007

2024, en

Abstract

This study addresses video captioning, a core task in multimedia content analysis, by combining convolutional and recurrent neural networks. We evaluate a suite of CNN-LSTM models (EfficientNet-LSTM, InceptionV3-LSTM, VGG16-LSTM, and ResNet50-LSTM), in which the CNN extracts visual features and the LSTM processes the temporal sequence to translate visual inputs into coherent textual descriptions. We also fine-tune the BLIP model to assess how well it bridges visual perception and language interpretation. In a further experiment, we pair ResNet50 with GPT-2, aiming to combine strong image recognition with more capable language processing; this hybrid approach is expected to improve the accuracy and context-awareness of caption generation. Finally, we introduce an attention-based CNN+LSTM model that focuses on the salient content of a video. Because successive frames vary in significance, the attention mechanism helps the captions reflect the most critical content. Together, these analyses and methods contribute to video captioning research at the intersection of computer vision and natural language processing.

Identifiers

DOI: 10.1109/inocon60754.2024.10512233

Citations and references

Citations: 3
References used: 0

Metrics (AkademScholar)