Article

Large language models encode clinical knowledge

Authors: Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Lee, Hyung Won Chung, Nathan Scales, Ajay Kumar Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry W. Payne, Martin Seneviratne, Paul Gamble, Christopher Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, P. Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale R. Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomašev, Yun Liu, Alvin Rajkomar, Joëlle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan

Affiliations: all authors are with Google Research, Mountain View, CA, USA, except Dina Demner-Fushman (National Library of Medicine, Bethesda, MD, USA) and Nenad Tomašev (DeepMind, London, UK).

2023, en

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model¹ (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM², on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA³, MedMCQA⁴, PubMedQA⁵ and Measuring Massive Multitask Language Understanding (MMLU) clinical topics⁶), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Identifiers

DOI: 10.1038/s41586-023-06291-2

Citations and sources

Citations: 4
Sources used: 0

Metrics (AkademScholar)