Dr. AI under the magnifying glass
New studies reveal the strengths and weaknesses of AI-supported tools for digital health advice such as ChatGPT and symptom checker apps
Patients are increasingly turning to digital tools to identify illnesses and receive recommendations for action. Two recent studies by TU Berlin have now examined the quality and effectiveness of such digital health recommendations. The results reveal both potential and risks. The studies have been published in the journals Scientific Reports and npj Health Systems.
In the first study, a team led by Dr. Marvin Kopka from the Department of Ergonomics at TU Berlin developed a new test method to evaluate the accuracy of health recommendations by ChatGPT and other AI models such as Meta's Llama, as well as by specially developed symptom checker apps that query symptoms and provide recommendations for action based on them. While previous tests were based on idealized textbook cases that rarely occur in reality, the new method relies on real patient cases, which the TU Berlin scientists used to test the various digital tools. This allows a more realistic assessment of how precise and helpful the tools are in practice. "Our standardized method can be seen as a kind of 'Stiftung Warentest' (Germany's consumer product testing organization), as it allows us to compare the accuracy of different apps, but also to identify their strengths and weaknesses," says study leader Marvin Kopka.
Symptom checker apps significantly more helpful
The results obtained with the newly developed evaluation method show that symptom checker apps are significantly more helpful for laypeople than ChatGPT, especially when it comes to distinguishing between harmless and serious symptoms. While ChatGPT classifies almost every case as an emergency or highly urgent, the specialized apps make more informed and appropriate recommendations in most cases. As other studies have already shown, ChatGPT can diagnose illnesses well when laboratory values or examination results are available. However, since these are usually missing at home, the diagnosis often remains imprecise, and the list of several possible illnesses suggested by the model is of little help to laypeople. What matters more are recommendations for action such as "go to the doctor" or "call 112" (Germany's emergency number). Here, however, ChatGPT performs poorly, as it classifies almost every case as requiring treatment, even for harmless symptoms.
Laypeople usually recognize medical emergencies reliably
The researchers also found that laypeople usually recognize medical emergencies reliably and call the emergency services in serious cases, such as a severe head injury with vomiting and dizziness. However, they find it more difficult to correctly assess harmless symptoms. For example, many people tend to seek medical help prematurely for minor complaints such as short-term diarrhea or a small skin change, even though this is often not necessary. "The fact that more and more people are using ChatGPT for medical advice is detrimental to the healthcare system. AI often motivates users to go to the doctor or emergency room immediately at the slightest symptom. This can lead to massive overload," warns study leader Dr. Marvin Kopka.
Users do not accept digital recommendations uncritically
The second study not only compared people and technology, but also investigated how people incorporate the recommendations of ChatGPT and symptom checker apps into their own decisions. It showed that users do not accept the recommendations uncritically, but compare them with other sources such as Google searches, advice from friends, or other apps. "On the other hand, there are also cases in which patients receive too much, and sometimes incomprehensible, information from digital tools that they are unable to put into context. This creates anxiety, and they then seek expert advice in the emergency room or from their GP, even for harmless complaints, just as ChatGPT recommends," says Kopka.
For the second study, the researchers first observed 24 people using ChatGPT and created a model of how they make decisions with the help of ChatGPT and apps, which they then tested in a quantitative study with 600 participants. The evaluation again showed that ChatGPT makes self-care more difficult and increases the number of unnecessary visits to the doctor. In contrast, well-functioning symptom checker apps helped users opt for self-care in appropriate cases, thus helping to reduce the burden on the healthcare system. "ChatGPT has many useful applications, but it is not suitable for deciding whether I should go to the doctor; it is far too imprecise for that," Kopka summarizes. "Rather than expecting perfection from an app, we should ask ourselves whether it helps us make good decisions. After all, people already make safe and sensible decisions in most cases. In some situations, however, they can benefit from apps."
The studies show that digital tools can support patients in their decision-making, with specially developed symptom checker apps currently proving more helpful than generative AI models such as ChatGPT. Nevertheless, according to the researchers, a critical approach to digital recommendations remains crucial in order to avoid misjudgments and an unnecessary burden on the healthcare system.
Original publication
Marvin Kopka, Hendrik Napierala, Martin Privoznik, Desislava Sapunova, Sizhuo Zhang, Markus A. Feufel; "The RepVig framework for designing use-case specific representative vignettes and evaluating triage accuracy of laypeople and symptom assessment applications"; Scientific Reports, Volume 14, 23 December 2024
Marvin Kopka, Sonja Mei Wang, Samira Kunz, Christine Schmid, Markus A. Feufel; "Technology-supported self-triage decision making"; npj Health Systems, Volume 2, 25 January 2025