ChatGPT too often advises seeing a doctor for harmless complaints

Too cautious for care: ChatGPT's weaknesses when it comes to health issues

05-May-2026
Symbol image (AI-generated)

Artificial intelligence (AI) is increasingly being used for health issues as well. Many people use tools such as ChatGPT to classify complaints and to assess whether they need medical help immediately, should seek medical advice, or can wait and see. Versions marketed specifically for the healthcare sector, such as ChatGPT Health in the USA, can easily create the impression of particular professional suitability. How reliable ChatGPT's recommendations actually are, however, has so far been investigated only to a limited extent.

In a new study from the Department of Ergonomics at the Technical University of Berlin, researchers have therefore analyzed how accurately different ChatGPT model versions classify health complaints, how performance has changed over time, and whether identical inputs generate consistent recommendations. The result: ChatGPT is currently only suitable to a limited extent for digital initial assessment and independent patient management.

22 model versions, 45 real cases, 9,900 assessments

"The main difference to our previous studies is the longitudinal analysis. Previously, only one or two models were examined. Now we have tested all the models that were available over time and analyzed how they actually changed," says study leader Dr. Marvin Kopka. "This was also important to us because there are always reports that new models achieve almost perfect results in medical licensing examinations or knowledge tests. This quickly leads to the conclusion that they also provide reliable medical recommendations for patients. But according to our study, this is precisely not the case."

For the study "Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice", published in the journal "Communications Medicine", the research team tested 22 ChatGPT model versions on real cases from 45 patients. These included clinical pictures such as "a short-term tendon/ligament strain the day before" or "simple digestive problems/diarrhea for one day without further complaints". Each case was entered ten times per model, resulting in a total of 9,900 individual assessments (22 models × 45 cases × 10 repetitions). For each case, the models had to decide whether it should be classified as an emergency, a case for medical clarification or a case for self-care.
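To make the design concrete, a minimal Python sketch of such an evaluation loop could look as follows. This is an illustration only: ask_model and all other names are hypothetical placeholders, not code or data from the study.

    # Hypothetical sketch of the evaluation design described above:
    # each case is entered repeatedly into each model version, and the answer,
    # mapped to one of three triage levels, is compared with a reference label.

    TRIAGE_LEVELS = ("self-care", "medical clarification", "emergency")

    def ask_model(model: str, case_text: str) -> str:
        """Placeholder for a call to a given ChatGPT version; assumed to
        return one of the three triage levels."""
        raise NotImplementedError

    def evaluate(models, cases, runs_per_case=10):
        """cases: list of (case_text, reference_level) pairs."""
        accuracy = {}
        for model in models:
            correct = total = 0
            for case_text, reference in cases:
                for _ in range(runs_per_case):   # each case entered ten times
                    total += 1
                    correct += ask_model(model, case_text) == reference
            accuracy[model] = correct / total
        # 22 models x 45 cases x 10 repetitions = 9,900 assessments in total
        return accuracy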

Accuracy barely increases

The evaluation shows that accuracy initially increased significantly across the first model versions. Since the third model generation (gpt-4), however, there have been only minor improvements. The best model tested achieved an accuracy of 74 percent. Although newer models recommended self-care more frequently, overall performance in this area remained limited.

Particular weaknesses for harmless complaints

The models tested were particularly good at recognizing cases requiring treatment. However, most errors occurred in cases where self-care would have been sufficient: 70 percent of all errors were in this group. Not a single one of the 13 self-care cases was correctly solved by all models in all runs.

Only a few models, such as o4, o3 or GPT 5, ever recommended self-care. All other models tested recommended medical clarification across the board. This is problematic because a significant proportion of the complaints are not actually dangerous, resolve on their own or can be treated by the patients themselves.

The study thus reveals a structural pattern: as a precaution, almost all models tend to classify complaints as requiring more care than would be medically necessary.

The researchers refer to this pattern as conservative triage behavior. "We were surprised by the clarity of the results ourselves," says Dr. Marvin Kopka. "They show explicitly that the questions relevant to patients are not automatically answered better by newer models. Better test or examination results do not necessarily mean greater practical benefit in care."

The practical benefit is crucial

"In our view, the decisive factor is not just whether a model classifies individual cases correctly, but what practical benefit the recommendations actually have in everyday life. If a system advises medical clarification for a large number of complaints as a precautionary measure, this initially seems safe for users - but it no longer offers any real decision-making support if the recommendation is almost always the same," says Dr. Marvin Kopka.

Same input, not always the same recommendation

There is another problem as well: the models do not always give consistent answers. Depending on the model, identical inputs sometimes led to significant fluctuations. Newer models had fewer cases that were never solved correctly, but at the same time more cases with inconsistent recommendations across several runs. This was particularly evident in GPT 5: in 42 percent of all cases, the recommendations were sometimes correct and sometimes incorrect when the same case was entered multiple times, despite exactly the same input.

The experiment did show that accuracy can be improved if the same question is asked several times and the lowest urgency level is then selected from the answers (see the sketch below). In this way, overall accuracy increased by an average of four percentage points, and accuracy for self-care cases even increased by 14 percentage points. However, the researchers expressly emphasize that this is not a recommendation for end users, because in the worst case emergencies could be overlooked.
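As a rough illustration of this aggregation strategy (and, as the researchers stress, not as advice for end users), a minimal Python sketch could look as follows; the level names and the helper function are assumptions for illustration, not the study's own code.

    # Ask the same question several times and keep the least urgent answer.
    URGENCY_ORDER = {"self-care": 0, "medical clarification": 1, "emergency": 2}

    def pick_lowest_urgency(answers):
        """Return the least urgent recommendation from several answers
        to the same case."""
        return min(answers, key=URGENCY_ORDER.get)

    # Example: three runs on the identical input
    print(pick_lowest_urgency(
        ["medical clarification", "self-care", "medical clarification"]))
    # -> self-care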

Relevance for the debate on primary care

The results are also relevant to health policy, says Kopka. In Germany, there is an intense debate about a primary care system and forms of digital patient management. The TU study suggests that general-purpose language models such as ChatGPT are currently not a suitable tool for this purpose on their own. If, in practice, a system predominantly advises patients to seek medical clarification, it has hardly any real steering effect; unnecessary use of medical services may even increase.

More potential in quality-assured applications

"We therefore currently see the potential of large language models less in use in manufacturers' chat windows than in meaningful integration in quality-assured applications, i.e. in symptom checker apps. There, they could help to prepare information in an understandable way, explain recommendations and guide people better through existing care pathways - provided that medical quality assurance takes place in the background," says Marvin Kopka.

Limitations of the study

The researchers also point out that this study focused on population representativeness: because real emergencies are rare in everyday life and therefore occur less frequently when ChatGPT is used, the data set contained only a few emergencies and mainly examined decisions for or against seeking medical help. How accurately real emergencies are recognized should be investigated in further studies.


Original publication

"Evaluating the accuracy of ChatGPT model versions for giving care-seeking advice", Communications Medicine.
