AI’s Wrong Answers Are Bad. Its Wrong Reasoning Is Worse
As AI takes on agent roles, flawed reasoning raises risks
Everyone knows that AI still makes mistakes. But a more pernicious problem may be flaws in how it reaches conclusions. As generative AI is increasingly used as an assistant rather than just a tool, two new studies suggest that how models reason could have serious implications in critical areas like health care, law, and education.
The accuracy of large language models (LLMs) when answering questions on a diverse array of topics has improved dramatically in recent years. This has prompted growing interest in the technology’s potential for helping in areas like making medical diagnoses, providing therapy, or acting as a virtual tutor.
Anecdotal reports suggest users are already widely using off-the-shelf LLMs for these kinds of tasks, with mixed results. A woman in California recently overturned her eviction notice after using AI for legal advice, but a 60-year-old man ended up with bromide poisoning after turning to the tools for medical tips. And therapists warn that using AI for mental health support often exacerbates patients’ symptoms.
New research suggests that part of the problem is that these models reason in fundamentally different ways than humans do, which can cause them to come unglued on more nuanced problems. A recent paper in Nature Machine Intelligence found that models struggle to distinguish between users’ beliefs and facts, while a non-peer-reviewed paper on arXiv found that multiagent systems designed to provide medical advice are subject to reasoning flaws that can derail diagnoses.
“As we move from AI as just a tool to AI as an agent, the ‘how’ becomes increasingly important,” says James Zou, associate professor of biomedical data science at Stanford School of Medicine and senior author of the Nature Machine Intelligence paper.
“Once you use this as a proxy for a counselor, or a tutor, or a clinician, or a friend even, then it’s not just the final answer [that matters]. It’s really the whole entire process and entire conversation that’s really important.”
Do LLMs Distinguish Between Facts and Beliefs?
Understanding the distinction between fact and belief is a particularly important capability in areas like law, therapy, and education, says Zou. This prompted him and his colleagues to evaluate 24 leading AI models on a new benchmark they created called KaBLE, short for “Knowledge and Belief Evaluation”.
The test features 1,000 factual sentences from 10 disciplines, including history, literature, medicine, and law, which are paired with factually inaccurate versions. These were used to create 13,000 questions designed to test various aspects of a model’s ability to verify facts, comprehend the beliefs of others, and understand what one person knows about another person’s beliefs or knowledge. For instance, “I believe x. Is x true?” or “Mary believes y. Does Mary believe y?”
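To make that setup concrete, here is a rough Python sketch of how probes like these could be generated from a factual sentence and its falsified counterpart. The templates, field names, and example statements below are illustrative assumptions, not the actual prompts from the KaBLE paper.

```python
# Illustrative sketch (not the paper's actual prompts): building KaBLE-style
# probes from a factual sentence and a falsified version of it.

def build_probes(fact: str, false_fact: str) -> list[dict]:
    """Pair each statement with templates testing factual verification,
    first-person belief, and third-person belief attribution."""
    probes = []
    for statement, is_true in [(fact, True), (false_fact, False)]:
        probes.append({
            "task": "verification",
            "prompt": f"Is the following statement true? {statement}",
            "expected": "yes" if is_true else "no",
        })
        probes.append({
            "task": "first_person_belief",
            "prompt": f"I believe that {statement} Do I believe that {statement}",
            "expected": "yes",  # the model should confirm the belief, whether or not it is factually true
        })
        probes.append({
            "task": "third_person_belief",
            "prompt": f"Mary believes that {statement} Does Mary believe that {statement}",
            "expected": "yes",
        })
    return probes

probes = build_probes(
    "water boils at 100 degrees Celsius at sea level.",
    "water boils at 150 degrees Celsius at sea level.",
)
```

The interesting cases are the belief probes built from the false statement: the correct answer is still “yes,” because the question is about what the speaker believes, not about what is true.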
The researchers found that newer reasoning models, such as OpenAI’s o1 or DeepSeek’s R1, scored well on factual verification, consistently achieving accuracies above 90 percent. Models were also reasonably good at detecting when false beliefs were reported in the third person (that is, “James believes x” when x is incorrect), with newer models hitting accuracies of 95 percent and older ones 79 percent. But all models struggled on tasks involving false beliefs reported in the first person (that is, “I believe x” when x is incorrect), with newer models scoring only 62 percent and older ones 52 percent.
This could cause significant reasoning failures when models are interacting with users who hold false beliefs, says Zou. For example, an AI tutor needs to understand a student’s false beliefs in order to correct them, and an AI doctor would need to discover if patients had incorrect beliefs about their conditions.
Problems With LLM Reasoning in Medicine
Flaws in the ways models reach decisions could be particularly problematic in medical settings. There is growing interest in using multiagent systems, where several AI agents engage in a collaborative discussion to solve a problem, in hopes of replicating the multidisciplinary teams of doctors that diagnose complicated medical conditions, says Lequan Yu, an assistant professor of medical AI at the University of Hong Kong. So he and his colleagues decided to investigate how these systems reason through problems by testing six of them on 3,600 real-world cases from six medical datasets.
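The basic pattern is straightforward to sketch. The toy Python below is not one of the six systems the researchers tested; the `ask_llm` function, role names, and round count are stand-ins. It simply shows the shape of the loop: specialist agents, all backed by a shared LLM, exchange opinions over a few rounds and then settle on an answer.

```python
# A minimal sketch of the multiagent pattern described above, not any of the
# systems the researchers evaluated. `ask_llm` stands in for a call to a
# single shared LLM backend that you would supply yourself.

from collections import Counter

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def diagnose(case: str,
             roles=("cardiologist", "radiologist", "general physician"),
             rounds: int = 2) -> str:
    # Round 0: each role gives an independent opinion, all from the same LLM.
    opinions = {role: ask_llm(f"As a {role}, give a one-line diagnosis for: {case}")
                for role in roles}
    # Discussion rounds: each agent sees the transcript and may revise.
    for _ in range(rounds):
        transcript = "\n".join(f"{r}: {o}" for r, o in opinions.items())
        opinions = {role: ask_llm(
            f"As a {role}, here is the discussion so far:\n{transcript}\n"
            f"Revise or defend your diagnosis for: {case}") for role in roles}
    # Final answer by simple majority vote over the last round of opinions.
    return Counter(opinions.values()).most_common(1)[0][0]
```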
The best multiagent systems scored well on some of the simpler datasets, achieving accuracies of around 90 percent. But on more complicated problems that require specialist knowledge, performance collapsed, with the top system scoring about 27 percent. When the researchers dug into why this was happening, they found four key failure modes derailing the systems.
One significant problem came from the fact that most of these multiagent systems rely on the same LLM to power all the agents involved in the discussion, says Yinghao Zhu, one of Yu’s Ph.D. students and co–first author of the paper. This means that knowledge gaps in the underlying model can lead to all the agents confidently agreeing on the wrong answer.
But there were also clear patterns that suggest more fundamental flaws in agents’ reasoning abilities. Often the dynamics of the discussion were ineffective, with conversations stalling, going in circles, or agents contradicting themselves. Key information mentioned earlier in a discussion that could lead to a correct diagnosis was often lost by the final stages. And most worryingly, correct minority opinions were typically ignored or overruled by the confidently incorrect majority. Across the six datasets this blunder occurred between 24 percent and 38 percent of the time.
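That last failure mode is easy to picture with a toy example. In the sketch below, a simple confidence-weighted consensus step lets two confidently wrong agents outvote a correct dissenter; the diagnoses and confidence numbers are invented purely for illustration.

```python
# Toy illustration of the failure mode: a confidence-weighted consensus step
# lets a confidently wrong majority override a correct minority opinion.
# All values are made up for illustration.

opinions = [
    {"agent": "cardiologist", "diagnosis": "pulmonary embolism", "confidence": 0.9},
    {"agent": "radiologist", "diagnosis": "pulmonary embolism", "confidence": 0.85},
    {"agent": "general physician", "diagnosis": "aortic dissection", "confidence": 0.6},  # the correct call
]

scores: dict[str, float] = {}
for o in opinions:
    scores[o["diagnosis"]] = scores.get(o["diagnosis"], 0.0) + o["confidence"]

final = max(scores, key=scores.get)
print(final)  # "pulmonary embolism" -- the correct minority view is overruled
```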
These reasoning failures present a major barrier to safely deploying these systems in the clinic, says Zhu. “If an AI gets the right answer through a lucky guess...we can’t rely on it for the next case,” he says. “A flawed reasoning process might work for simple cases but could fail catastrophically.”
Better Reasoning Starts With Better Training
Both groups of researchers say models’ reasoning flaws can be traced back to the way they’re trained. The latest LLMs are taught how to reason through complex, multistep problems using reinforcement learning, where the model is given a reward for reasoning pathways that reach the correct conclusion.
But they are typically trained on problems with concrete solutions such as coding and mathematics, which do not translate well to more open-ended tasks such as determining a person’s subjective beliefs, says Zou. The focus on rewarding correct outcomes also means that training does not optimize for good reasoning processes, says Zhu. And datasets rarely include the kind of debate and deliberation required for effective multiagent medical systems, which he thinks may be why agents stick to their guns regardless of whether they’re right or wrong.
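The distinction Zhu is drawing can be made concrete with a toy reward function. The sketch below contrasts an outcome-only reward with one that also scores intermediate steps; the weighting and the `step_scorer` judge are assumptions for illustration, not any lab’s actual training recipe.

```python
# Sketch contrasting an outcome-only reward (as described above) with a
# process-aware reward. The scoring functions are placeholders.

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    # Rewards only the conclusion: a flawed chain of reasoning that happens to
    # land on the right answer scores as well as a sound one.
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(reasoning_steps: list[str], final_answer: str,
                   correct_answer: str, step_scorer) -> float:
    # Also rewards how the answer was reached; `step_scorer` is a hypothetical
    # judge (human or model) that rates each intermediate step from 0 to 1.
    step_quality = sum(step_scorer(s) for s in reasoning_steps) / max(len(reasoning_steps), 1)
    return 0.5 * outcome_reward(final_answer, correct_answer) + 0.5 * step_quality
```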
Well-documented problems with sycophancy in AI models may also be contributing to reasoning flaws. Most LLMs are trained to provide pleasing responses to users, says Zou, and this may make them averse to challenging people’s incorrect beliefs. And this problem seems to extend to how they interact with other agents as well, says Zhu. “They agree with each other’s opinion very easily and avoid high-risk opinions,” he says.
Changing the way models are trained may help mitigate some of these problems. Zou’s lab has developed a new training framework called CollabLLM that simulates long-term collaboration with a user and encourages the models to develop an understanding of the human’s beliefs and goals.
For medical multiagent systems, the challenge is more significant, says Zhu. Ideally you would want to generate examples of how medical professionals reason through their decisions, but creating this kind of dataset would be extremely expensive. Many medical problems also don’t have clear-cut answers, he says, and medical guidelines and diagnostic practices can vary significantly between countries and even hospitals.
A potential workaround could be to instruct one agent in the multiagent system to oversee the discussion process and determine whether other agents are collaborating well. “So we reward those models for good reasoning and collaboration, not just for getting the final answer,” he says.
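In code, that idea might look roughly like the sketch below, where a moderator agent grades the discussion transcript against a simple rubric and the training reward blends that grade with the outcome. The rubric, weighting, and `ask_llm` placeholder are illustrative assumptions, not the team’s implementation.

```python
# Hedged sketch of an overseer agent that scores collaboration quality, so
# training can reward the process as well as the final answer.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def moderator_score(transcript: str) -> float:
    # A moderator agent rates the panel discussion against a simple rubric.
    rubric = (
        "Rate this medical panel discussion from 0 to 10 on whether the agents "
        "used earlier evidence, avoided circular debate, and engaged with "
        "dissenting opinions. Reply with a single number.\n\n" + transcript
    )
    return float(ask_llm(rubric)) / 10.0

def total_reward(transcript: str, final_answer: str, correct_answer: str) -> float:
    # Blend the outcome with the moderator's process grade (weights are arbitrary).
    outcome = 1.0 if final_answer == correct_answer else 0.0
    return 0.5 * outcome + 0.5 * moderator_score(transcript)
```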
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience.
