Most leading chatbots routinely exaggerate science findings
ChatGPT is often asked for summaries, but how accurate are these really?
It seems so convenient: when you are short of time, asking ChatGPT or another chatbot to summarise a scientific paper to quickly get the gist of it. But in up to 73 per cent of cases, these large language models (LLMs) produce inaccurate conclusions, a new study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University and University of Cambridge) finds.
Almost 5,000 LLM-generated summaries analysed
The researchers tested ten of the most prominent LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA. "We entered abstracts and articles from top science journals, such as Nature, Science, and The Lancet," says Peters, "and asked the models to summarise them. Our key question: how accurate are the summaries that the models generate?"
"Over a year, we collected 4,900 summaries. When we analysed them, we found that six of ten models systematically exaggerated claims they found in the original texts. Often the differences were subtle. But nuances can be of great importance in making sense of scientific findings."
For instance, LLMs often changed cautious, past-tense claims into more sweeping, present-tense versions: 'The treatment was effective in this study' became 'The treatment is effective'. "Changes like this can mislead readers," Chin-Yee warns. "They can give the impression that the results are more widely applicable than they really are."
The researchers also directly compared human-written with LLM-generated summaries of the same texts. Chatbots were nearly five times more likely to produce broad generalisations than their human counterparts.
Accuracy prompts backfired
Peters and Chin-Yee did try to get the LLMs to generate accurate summaries. For instance, they specifically asked the chatbots to avoid inaccuracies. "But strikingly, the models then produced exaggerated conclusions even more often," Peters says. "They were nearly twice as likely to produce overgeneralised conclusions."
"This effect is concerning. Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they will get a more reliable summary. Our findings suggest the exact opposite."
Why are these exaggerations happening?
"LLMs may inherit the tendency to make broader claims from the texts they are trained on," Chin-Yee explains. He refers to earlier research on generalisation in scientific writing. "Human experts also tend to draw more general conclusions, from Western samples to all people, for example."
"But many of the original articles didn't contain problematic generalisations, while the summaries then suddenly did," Peters adds. "Worse still, overall, newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones."
Another reason for LLMs' generalisation bias might lie in people's interactions with the chatbots. "Human users who are involved in the models' fine-tuning may prefer LLM responses that sound helpful and widely applicable. In this way, the models might learn to favour such answers, even at the expense of accuracy."
Reducing the risks
"If we want AI to support science literacy rather than undermine it, we need more vigilance and testing of LLMs in science communication contexts," Peters stresses. "These tools are already being widely used for science summarisation, so their outputs can shape public science understanding, accurately or misleadingly. Without proper oversight, there is a real risk that AI-generated science summaries could spread misinformation, or present uncertain science as settled fact."
If you still wish to use a chatbot to summarise a text, Peters and Chin-Yee recommend LLMs such as Claude, which had the highest generalisation accuracy in their tests. It may also help to use prompts that enforce indirect, past-tense reporting and, if you are a programmer, to set the chatbot's 'temperature' (the parameter controlling its 'creativity') to a lower value.
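For readers comfortable with a little code, here is a minimal sketch of what such a set-up might look like. It is not taken from the study: it assumes the OpenAI Python client and an API key, uses a made-up abstract, asks for cautious, past-tense reporting, and sets the temperature low to curb 'creative' rephrasing.

```python
# Minimal sketch (not the researchers' set-up): summarise an abstract with a
# low "temperature" and a prompt that asks for cautious, past-tense reporting.
# Assumes the OpenAI Python client is installed and OPENAI_API_KEY is set;
# the model name, prompt wording, and abstract are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

abstract = (
    "In a randomised trial with 120 participants, the treatment was associated "
    "with a modest reduction in symptoms compared with placebo."
)  # hypothetical abstract, for illustration only

response = client.chat.completions.create(
    model="gpt-4o",      # any chat model name accepted by the API
    temperature=0.2,     # lower values reduce "creative" rewording
    messages=[
        {
            "role": "system",
            "content": (
                "Summarise scientific abstracts faithfully. Report findings in "
                "the past tense, keep the original hedging and sample details, "
                "and do not generalise beyond what the authors claim."
            ),
        },
        {"role": "user", "content": f"Summarise this abstract:\n\n{abstract}"},
    ],
)

print(response.choices[0].message.content)
```

Even with settings like these, the study's broader lesson applies: check the summary against the original paper before relying on it.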