Article

TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation

Zekun Jiang, College of Computer Science, Sichuan University, Chengdu, 610000, China
Dongjie Cheng, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, 610000, China
Ziyuan Qin, School of Engineering, Case Western Reserve University, Cleveland, OH, USA, 44106
Jun Gao, College of Computer Science, Sichuan University, Chengdu, 610000, China
Qicheng Lao, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China, 100876
Abdullaev Bakhrom Ismoilovich, Urgench State University, Urgench, Uzbekistan, 220100
Urazboev Gayrat, Urgench State University, Urgench, Uzbekistan, 220100
Yuldashov Elyorbek, Urgench State University, Urgench, Uzbekistan, 220100
Bekchanov Habibullo, Urgench State University, Urgench, Uzbekistan, 220100
Defu Tang, College of Animal Science and Technology, Gansu Agricultural University, Lanzhou, China, 730000
Linjing Wei, College of Information Science and Technology, Gansu Agricultural University, Lanzhou, China, 730000
Kang Li, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, 610000, China
Le Zhang, College of Computer Science, Sichuan University, Chengdu, 610000, China

Big Data Mining and Analytics · journal · 2024 · en


Abstract

This study presents the text-visual-prompt segment anything model (TV-SAM), a novel zero-shot segmentation algorithm for multimodal medical images that requires no manual annotation. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and the segment anything model (SAM) to autonomously generate descriptive text prompts and visual bounding box prompts from medical images, thereby enhancing SAM's capability for zero-shot segmentation. Comprehensive evaluations on seven public datasets encompassing eight imaging modalities demonstrate that TV-SAM can effectively segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO ($p < 0.01$) and GSAM ($p < 0.05$), closely matches the performance of SAM BBOX with gold standard bounding box prompts ($p = 0.07$), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). The study indicates that TV-SAM is an effective zero-shot segmentation algorithm for multimodal medical images and highlights the significant contribution of GPT-4 to zero-shot segmentation. More broadly, integrating foundation models such as GPT-4, GLIP, and SAM can enhance the ability to address complex problems in specialized domains.
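The three-stage flow described in the abstract (a language model writes a descriptive text prompt, a vision-language model turns that prompt into bounding boxes, and a box-prompted segmenter produces masks) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: the class `PromptedSegmenter`, its three callable fields, and the `toy_*` functions are hypothetical stand-ins introduced here, and none of them call GPT-4, GLIP, or SAM. It also assumes the description step takes a target name as input, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), inclusive pixel indices
Image = Sequence[Sequence[int]]   # placeholder image type: rows of pixel intensities
Mask = List[List[bool]]


@dataclass
class PromptedSegmenter:
    """Chains three injected stages; every stage is a hypothetical stand-in."""
    describe: Callable[[str], str]             # stand-in for the language-model step
    ground: Callable[[Image, str], List[Box]]  # stand-in for the vision-language step
    segment: Callable[[Image, Box], Mask]      # stand-in for the box-prompted segmenter

    def run(self, image: Image, target: str) -> List[Mask]:
        text_prompt = self.describe(target)      # descriptive text prompt
        boxes = self.ground(image, text_prompt)  # visual bounding box prompts
        return [self.segment(image, box) for box in boxes]


# Toy stages so the sketch runs end to end without any model.
def toy_describe(target: str) -> str:
    return f"{target}: a compact bright region"


def toy_ground(image: Image, prompt: str) -> List[Box]:
    # Ignores the prompt and returns the tight box around all nonzero pixels.
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [(min(xs), min(ys), max(xs), max(ys))]


def toy_segment(image: Image, box: Box) -> Mask:
    # Marks nonzero pixels that fall inside the box.
    x0, y0, x1, y1 = box
    return [[x0 <= x <= x1 and y0 <= y <= y1 and value > 0
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


if __name__ == "__main__":
    image = [[0, 0, 0, 0],
             [0, 5, 7, 0],
             [0, 9, 0, 0],
             [0, 0, 0, 0]]
    pipeline = PromptedSegmenter(toy_describe, toy_ground, toy_segment)
    for mask in pipeline.run(image, "lesion"):
        for row in mask:
            print("".join("#" if cell else "." for cell in row))
```

Passing the stages in as callables keeps the sketch independent of any particular model API; real models would be wrapped behind the same three signatures.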

Topics

Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Advanced Neural Network Applications

Identifiers

DOI: 10.26599/bdma.2024.9020058

Citations and references

Cited by 0 · 46 references
