Article

Multiscale Feature Extraction and Fusion of Image and Text in VQA

Siyu Lu (School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China)
Yueming Ding (School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China)
Mingzhe Liu (School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, 325000, China)
Zhengtong Yin (College of Resource and Environment Engineering, Guizhou University, Guiyang, 550025, China)
Lirong Yin (Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA, 70803, USA)
Wenfeng Zheng (School of Automation, University of Electronic Science and Technology of China, Chengdu, 610054, China)

2023 (en)

ABI

Abstract

Visual Question Answering (VQA) is the task of finding information in an image that is relevant to a question in order to answer that question correctly. It can be widely applied in visual assistance, automated security surveillance, and intelligent interaction between robots and humans. However, VQA accuracy remains unsatisfactory. The main difficulty is that image features do not represent scene and object information well, and text information is not fully represented. This paper applies multi-scale feature extraction and fusion to both the image feature representation and the text representation parts of a VQA system to improve its accuracy. First, for image feature representation, the features output by different layers of a pre-trained deep neural network are extracted, and the best-performing feature fusion scheme is identified through experiments. Second, for sentence representation, a multi-scale feature method is introduced to represent and fuse the word-level, phrase-level, and sentence-level features of sentences. Finally, the VQA model is improved using the multi-scale feature extraction and fusion method. The results show that adding multi-scale feature extraction and fusion improves the accuracy of the VQA model.

Identifiers

DOI: 10.1007/s44196-023-00233-6

Citations and references

2 citations, 0 references used

Metrics (AkademScholar)