Abbreviations

Journal: JMIR mHealth and uHealth (JMIR Mhealth Uhealth; JMU)

ISSN: 2291-5222

Publisher: JMIR Publications, Toronto, Canada

Article ID: v12i1e57978

PMID: 38688841

DOI: 10.2196/57978

Article type: Commentary

The Evaluation of Generative AI Should Include Repetition to Assess Stability

Editor: Lorraine Buis

Lingxuan Zhu, MD (1), https://orcid.org/0009-0001-9077-408X
Weiming Mou, MD (2), https://orcid.org/0009-0007-1089-6516
Chenglin Hong, MD (1), https://orcid.org/0009-0009-3565-3486
Tao Yang, MD (3), https://orcid.org/0009-0007-5246-3284
Yancheng Lai, MD (1), https://orcid.org/0009-0004-8444-7535
Chang Qi, MEng (4), https://orcid.org/0009-0005-3840-550X
Anqi Lin, MD (1), https://orcid.org/0000-0002-6324-0410
Jian Zhang, MD (1), https://orcid.org/0000-0001-7217-0111
Peng Luo, MD (1), https://orcid.org/0000-0002-8215-2045

Correspondence address: Peng Luo, Department of Oncology, Zhujiang Hospital, Southern Medical University, 253 Industrial Avenue, Guangzhou, China. Phone: 86 020 61643888. Email: luopeng@smu.edu.cn

1 Department of Oncology, Zhujiang Hospital, Southern Medical University, Guangzhou, China

2 Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China

3 Department of Medical Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

4 Institute of Logic and Computation, TU Wien, Austria

Corresponding Author: Peng Luo luopeng@smu.edu.cn

2024

6 5 2024

e57978

1 3 2024 30 4 2024

©Lingxuan Zhu, Weiming Mou, Chenglin Hong, Tao Yang, Yancheng Lai, Chang Qi, Anqi Lin, Jian Zhang, Peng Luo. Originally published in JMIR mHealth and uHealth (https://mhealth.jmir.org), 06.05.2024.

2024

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR mHealth and uHealth, is properly cited. The complete bibliographic information, a link to the original publication on https://mhealth.jmir.org/, as well as this copyright and license information must be included.

https://mhealth.jmir.org/2024/1/e51526/

The increasing interest in the potential applications of generative artificial intelligence (AI) models like ChatGPT in health care has prompted numerous studies to explore their performance in various medical contexts. However, evaluating ChatGPT poses unique challenges due to the inherent randomness in its responses. Unlike traditional AI models, ChatGPT can generate different responses to the same input, making it imperative to assess its stability through repetition. This commentary highlights the importance of including repetition in the evaluation of ChatGPT to ensure the reliability of conclusions drawn from its performance. Just as biological experiments often require multiple repetitions for validity, we argue that assessing generative AI models like ChatGPT demands repeated testing. Failure to account for the randomness that repetition reveals can lead to biased conclusions and undermine the credibility of research findings. We urge researchers to incorporate appropriate repetition in their studies from the outset and to report their methods transparently to enhance the robustness and reproducibility of findings in this rapidly evolving field.

large language model; generative AI; ChatGPT; artificial intelligence; health care

Since OpenAI released ChatGPT-3.5, there has been growing interest within the medical community in the prospective applications of this general pretrained model in health care [1-7]. A PubMed search using "ChatGPT" as the keyword shows that 2075 papers discussing ChatGPT were published in 2023. As a leading publisher in the field of digital medicine, JMIR Publications published a total of 115 papers related to ChatGPT in 2023. This quick and simple search may not capture all relevant articles, but it gives a general picture of the growing interest in, and research on, ChatGPT in the medical field. For example, Gilson et al [8] explored the performance of ChatGPT on the United States Medical Licensing Examination (USMLE) step 1 and step 2 exams and found that ChatGPT’s performance on step 1 exceeded the passing score for third-year medical students. Other studies have explored ChatGPT’s performance on further medical exams, such as the Japanese and German Medical Licensing Examinations [9,10], the Otolaryngology-Head and Neck Surgery Certification Examinations [11], and the UK Standardized Admission Tests [12]. Beyond examinations, many articles have discussed the potential applications of ChatGPT in medicine from various perspectives. Shao et al [13] examined the suitability of using ChatGPT for perioperative patient education in thoracic surgery within English and Chinese contexts. Cheng et al [14] investigated whether ChatGPT could be used to generate summaries for medical research, and Hsu et al [15] evaluated whether ChatGPT could correctly answer basic medication consultation questions. However, because generative artificial intelligence (AI) like ChatGPT is a relatively new technology, evaluating its potential applications in health care differs in some respects from evaluating earlier tools, and these differences require additional attention from researchers.

The most significant difference affecting the evaluation of ChatGPT, compared with the traditional AI models most people are familiar with, is the randomness inherent in the responses ChatGPT generates. Common perception holds that for a given input, an AI model should produce the same output every time. However, for natural language models like ChatGPT, this is not the case. ChatGPT generates a response by predicting the next most likely word, followed by each subsequent word, and this generation process involves a certain degree of randomness. When ChatGPT is accessed through the application programming interface (API), the degree of randomness in the generated responses can also be controlled with the temperature parameter. Even with the same input, the responses provided by ChatGPT may differ and can sometimes be completely contradictory. Therefore, when evaluating ChatGPT’s performance, it is necessary to generate multiple responses to the same input and assess these responses collectively; otherwise, there is a high likelihood of drawing biased conclusions. For example, in one of the earliest published studies, Sarraju et al [4] asked the same question three times and assessed whether the three responses given by ChatGPT were consistent. Once OpenAI made the ChatGPT API accessible, it became feasible to ask the same question many more times. In a recent study investigating whether ChatGPT’s peer-review conclusions are influenced by the reputation of the author’s institution, von Wedel et al [16] conducted 250 repeated experiments for each question to mitigate the effects of ChatGPT’s randomness.

However, not all researchers have recognized this aspect. For instance, in a study where ChatGPT was asked to answer the American Heart Association Basic Life Support and Advanced Cardiovascular Life Support exams, the authors found that ChatGPT could not pass either examination [17]. However, that study asked each question only once, which means that the randomness of ChatGPT could have influenced the results and affected the reliability of the conclusions. In a subsequent, improved study, researchers acknowledged the impact of ChatGPT’s randomness and asked each question three times. Compared with the earlier results, ChatGPT’s performance in this study significantly improved, and it could pass the Basic Life Support exam [18], further underscoring the importance of repetition. Therefore, it is inappropriate to evaluate ChatGPT’s performance based on a single response if one aims to draw rigorous, scientifically meaningful conclusions. Biological experiments typically require three repetitions for validity; likewise, without repetition it is difficult to determine whether an observed phenomenon is an inherent characteristic of the model or merely a random occurrence. Additionally, for models intended for use in clinical practice, whether for patient education, diagnosis, or support in writing clinical documentation, we hope that ChatGPT can always provide correct and harmless responses. Repetition also allows us to evaluate the model’s stability and thereby further assess its application value. However, we noticed that the authors of many manuscripts we recently reviewed were not aware of this, which affected the reliability of their conclusions.
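The repetition argued for above can be illustrated with a minimal Python sketch. It is not the method of any cited study: the function names (`repeat_query`, `summarize_stability`, `make_stub_model`) are hypothetical, and the "model" is a random stub standing in for a generative model, so no real API is called. The sketch asks the same question several times and then summarizes per-run accuracy and how often the answers agree.

```python
import random
from collections import Counter
from typing import Callable, List


def repeat_query(ask: Callable[[str], str], question: str, n_repeats: int = 3) -> List[str]:
    """Ask the same question n_repeats times and collect every answer."""
    return [ask(question) for _ in range(n_repeats)]


def summarize_stability(answers: List[str], correct: str) -> dict:
    """Summarize repeated answers: accuracy across runs and agreement between runs."""
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return {
        # Share of repetitions that matched the reference answer.
        "accuracy": sum(a == correct for a in answers) / len(answers),
        # Most frequent answer and the share of repetitions that gave it.
        "majority_answer": majority_answer,
        "agreement": majority_count / len(answers),
        # True only if every repetition produced the same answer.
        "all_consistent": len(counts) == 1,
    }


def make_stub_model(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a generative model (assumption, not ChatGPT):
    it answers 'B' about 70% of the time and another option otherwise."""
    rng = random.Random(seed)

    def ask(question: str) -> str:
        return "B" if rng.random() < 0.7 else rng.choice(["A", "C", "D"])

    return ask


if __name__ == "__main__":
    ask = make_stub_model(seed=0)
    answers = repeat_query(ask, "Which option is correct?", n_repeats=20)
    print(summarize_stability(answers, correct="B"))
```

With a single repetition, the stub would be scored as simply right or wrong on this question; with repeated runs, the same question yields an accuracy estimate and an agreement rate, which is the kind of stability information a single response cannot provide.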

Therefore, research on the application of generative AI like ChatGPT in health care should include appropriate repetition, so that the model’s performance is evaluated comprehensively by assessing its stability in the task set by the authors. This should be considered from the beginning of the research. Models like ChatGPT will continue to be upgraded, so if authors only realize the need for repetition when revising the manuscript, there will be a considerable time gap between the supplementary analysis and the original analysis. The model has likely been upgraded during this period, introducing new uncertainties into the research. The alternative is for the authors to redo the entire analysis from scratch during manuscript revision, which wastes time and effort. We therefore hope that future researchers will recognize the necessity of repeated experiments from the start and report in the manuscript how the repetition was carried out in the study [19].
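One way to make such reporting concrete is to record the repetition protocol alongside the results. The Python sketch below is only an illustration under stated assumptions: the class `RepetitionProtocol` and all of its field names are hypothetical, chosen for this example, and are not taken from the checklist cited above or from any reporting standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RepetitionProtocol:
    """Hypothetical record of how repeated queries were run (illustrative fields)."""
    model_name: str
    model_version: str                    # version label as reported by the authors
    interface: str                        # for example "web" or "API"
    query_date: str                       # date the responses were generated (ISO format)
    n_repeats: int                        # times each question was asked
    temperature: Optional[float] = None   # None when the interface does not expose it

    def __post_init__(self) -> None:
        # A protocol with zero repetitions cannot describe any experiment.
        if self.n_repeats < 1:
            raise ValueError("n_repeats must be at least 1")

    def to_json(self) -> str:
        """Serialize the protocol so it can be stored with the raw responses."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    protocol = RepetitionProtocol(
        model_name="ChatGPT",
        model_version="placeholder-version",  # placeholder, not a real version string
        interface="API",
        query_date="2024-03-01",               # placeholder date
        n_repeats=3,
        temperature=1.0,
    )
    print(protocol.to_json())
```

Recording the model version and query date with every batch of responses makes it visible when a supplementary analysis was run against a different version of the model than the original analysis.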

Abbreviations

AI: artificial intelligence

USMLE: United States Medical Licensing Examination

Conflicts of Interest: None declared.

1. Grünebaum, Chervenak, Pollet, Katz, Chervenak. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023 Jun;228(6):696-705. doi: 10.1016/j.ajog.2023.03.009. PMID: 36924907. PII: S0002-9378(23)00154-0.

2. Howard, Hope, Gerada. ChatGPT and antimicrobial advice: the end of the consulting infection doctor? Lancet Infect Dis. 2023 Apr;23(4):405-406. doi: 10.1016/S1473-3099(23)00113-5. PMID: 36822213. PII: S1473-3099(23)00113-5.

3. Zhu, Mou, Luo. Potential of large language models as tools against medical disinformation. JAMA Intern Med. 2024 Apr 1;184(4):450. doi: 10.1001/jamainternmed.2024.0020. PMID: 38407861. PII: 2815262.

4. Sarraju, Bruemmer, Van Iterson, Cho, Rodriguez, Laffin. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. 2023 Mar 14;329(10):842-844. doi: 10.1001/jama.2023.1044. PMID: 36735264. PII: 2801244. PMCID: PMC10015303.

5. Zhu, Mou, Chen. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med. 2023 Apr 19;21(1):269. doi: 10.1186/s12967-023-04123-5. PMID: 37076876. PMCID: PMC10115367.

6. Ali, Dobbs, Hutchings, Whitaker. Using ChatGPT to write patient clinic letters. Lancet Digit Health. 2023 Apr;5(4):e179-e181. doi: 10.1016/S2589-7500(23)00048-1. PMID: 36894409. PII: S2589-7500(23)00048-1.

7. Patel, Lam. ChatGPT: the future of discharge summaries? Lancet Digit Health. 2023 Mar;5(3):e107-e108. doi: 10.1016/S2589-7500(23)00021-3. PMID: 36754724. PII: S2589-7500(23)00021-3.

8. Gilson, Safranek, Huang, Socrates, Chi, Taylor, Chartash. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312. PMID: 36753318. PII: v9i1e45312. PMCID: PMC9947764.

9. Meyer, Riese, Streichert. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German Medical Licensing Examination: observational study. JMIR Med Educ. 2024 Feb 8;10:e50965. doi: 10.2196/50965. PMID: 38329802. PII: v10i1e50965. PMCID: PMC10884900.

10. Yanagita, Yokokawa, Uchida, Tawara, Ikusaka. Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: evaluation study. JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023. PMID: 37831496. PII: v7i1e48023. PMCID: PMC10612006.

11. Long, Lowe, Zhang, Santos, Alanazi, O'Brien, Wright, Cote. A novel evaluation model for assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: performance study. JMIR Med Educ. 2024 Jan 16;10:e49970. doi: 10.2196/49970. PMID: 38227351. PII: v10i1e49970. PMCID: PMC10828939.

12. Giannos, Delardas. Performance of ChatGPT on UK standardized admission tests: insights from the BMAT, TMUA, LNAT, and TSA examinations. JMIR Med Educ. 2023 Apr 26;9:e47737. doi: 10.2196/47737. PMID: 37099373. PII: v9i1e47737. PMCID: PMC10173042.

13. Shao, Liu, Yang, Zhang, Luo, Zhao. Appropriateness and comprehensiveness of using ChatGPT for perioperative patient education in thoracic surgery in different language contexts: survey study. Interact J Med Res. 2023 Aug 14;12:e46900. doi: 10.2196/46900. PMID: 37578819. PII: v12i1e46900. PMCID: PMC10463083.

14. Cheng, Tsai, Bai, Hsu, Yang, Tsai, Yang, Tseng, Hsu, Liang. Comparisons of quality, correctness, and similarity between ChatGPT-generated and human-written abstracts for basic research: cross-sectional study. J Med Internet Res. 2023 Dec 25;25:e51229. doi: 10.2196/51229. PMID: 38145486. PII: v25i1e51229. PMCID: PMC10760418.

15. Hsu, Hou, Hsieh, Cheng. Examining real-world medication consultations and drug-herb interactions: ChatGPT performance evaluation. JMIR Med Educ. 2023 Aug 21;9:e48433. doi: 10.2196/48433. PMID: 37561097. PII: v9i1e48433. PMCID: PMC10477918.

16. von Wedel, Schmitt, Thiele, Leuner, Shay, Redaelli, Schaefer. Affiliation bias in peer review of abstracts by a large language model. JAMA. 2024 Jan 16;331(3):252-253. doi: 10.1001/jama.2023.24641. PMID: 38150261. PII: 2813511. PMCID: PMC10753437.

17. Fijačko, Gosak, Štiglic, Picard, John Douma. Can ChatGPT pass the life support exams without entering the American Heart Association course? Resuscitation. 2023 Apr;185:109732. doi: 10.1016/j.resuscitation.2023.109732. PMID: 36775020. PII: S0300-9572(23)00045-X.

18. Zhu, Mou, Yang, Chen. ChatGPT can pass the AHA exams: open-ended questions outperform multiple-choice format. Resuscitation. 2023 Jul;188:109783. doi: 10.1016/j.resuscitation.2023.109783. PMID: 37349064. PII: S0300-9572(23)00096-5.

19. Chen, Zhu, Mou, Liu, Cheng, Lin, Zhang, Luo. STAGER checklist: Standardized Testing and Assessment Guidelines for Evaluating Generative AI Reliability. arXiv. 2024. Preprint posted online on December 8, 2023. doi: 10.48550/arXiv.2312.10074.