By Joachim Klement
LONDON, April 1 (Reuters) - Hundreds of billions of dollars are riding on the assumption that artificial intelligence will be reliable enough for high-stakes work. New research suggests it may never be.
The AI tools that power ChatGPT and its rivals - known as large language models, or LLMs - are a genuine productivity-enhancing innovation. But they have serious shortcomings, most notably their tendency to hallucinate, or make things up.
New research on this widely studied phenomenon indicates that hallucinations occur far more often than most people realise, and the issue gets worse the more information you give the AI model to work with.
A new experiment shows just how serious the problem may be.
JV Roig, an expert in agentic AI at the start-up Kamiwaza AI, tested different LLMs in a controlled environment.
He gave each model input text of 32,000, 128,000 or – when possible – 200,000 words. Then he asked the models to answer questions about these texts. Because the source texts were known in advance, he could measure exactly how often the models hallucinated.
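The logic of such a test is simple enough to sketch. The snippet below is illustrative only: the ask_model function, the grading rule and the prepared test cases are hypothetical placeholders, not Roig's actual setup.

```python
# Minimal sketch of the kind of controlled test described above, assuming
# the experimenter has prepared (source_text, question, gold_answer) cases
# in advance. `ask_model` is a hypothetical stand-in for whichever LLM API
# is being tested.

def answers_match(answer: str, gold_answer: str) -> bool:
    # Placeholder grading rule; a real study would use a stricter rubric
    # (exact match, human grading or an adjudication model).
    return gold_answer.strip().lower() in answer.strip().lower()

def hallucination_rate(ask_model, cases) -> float:
    """Share of answers that contradict the known source text."""
    errors = 0
    for source_text, question, gold_answer in cases:
        answer = ask_model(context=source_text, question=question)
        # The source text, and hence the correct answer, is known in
        # advance, so any mismatch can be counted as a hallucination.
        if not answers_match(answer, gold_answer):
            errors += 1
    return errors / len(cases)

# Running the same cases with 32,000-, 128,000- and 200,000-word contexts
# shows how the error rate grows with input length.
```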
At 32,000 words, the best-performing model - GLM 4.5 from China's Zhipu AI 2513.HK - hallucinated just 1.2% of the time. The average was 6.8%. When the input was pushed to 128,000 words, GLM 4.5's error rate rose to 3.2%, while the average jumped to 10%.
That is a meaningful error rate given the relatively limited input. For those models that could take inputs of 200,000 words, hallucination rates skyrocketed. Some models broke down entirely, hallucinating answers in the majority of cases.
Of course, in the real world, LLMs aren’t trained on a relatively limited set of words but effectively on the whole internet, which means hallucinations are likely an even bigger problem in reality than in this experiment.
BROKEN BY DESIGN
But won’t this issue simply be fixed as the technology improves? Current evidence suggests that might not be the case.
In December, a team from Tsinghua University in Beijing published a paper tracing hallucinations to a tiny fraction – typically less than 0.1% – of an LLM’s neurons, the basic decision-making units that process information and pass signals forward, much like switches in an electrical circuit.
That’s good news, right?
In an ideal world, one would simply identify the faulty neurons, remove them, and fix the problem.
Alas, the study indicates this is impossible. The problematic neurons are formed during the model's initial training, not at a later stage during fine-tuning.
The authors of the study argue that these hallucinating neurons “originate from the inherent characteristics of the next-token prediction objective. This training paradigm does not distinguish between factually correct and incorrect continuations – it merely rewards fluent text generation."
What all that jargon means is that LLMs are inherently probabilistic, not deterministic. They try to guess the answer that sounds best, not the objectively correct one.
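A toy example makes the point concrete. The probabilities below are invented purely for illustration; they show only that the sampling step sees plausibility scores, not facts.

```python
import random

# Toy next-token prediction step. The model scores candidate continuations
# of the prompt "The capital of France is ..." and samples one. The numbers
# are invented for illustration; nothing in this objective checks whether
# the chosen token is true.
next_token_probs = {
    "Paris": 0.55,      # factually correct continuation
    "Lyon": 0.25,       # fluent but wrong
    "Marseille": 0.20,  # fluent but wrong
}

def sample_next_token(probs: dict) -> str:
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Roughly 45% of samples here would be a confident, fluent, wrong answer,
# because sampling only sees probabilities, not facts.
print(sample_next_token(next_token_probs))
```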
And that’s not a problem to be fixed. It’s fundamental to how these models work.
THE COST OF BEING WRONG
These findings should give investors and executives pause – especially when it comes to throwing lots of money at LLMs and generative AI.
These models offer "about right" results. In many everyday tasks, like designing a marketing campaign or drafting internal reports, that may be fine.
But in many of the most sophisticated AI software applications, like accounting and law, "about right" is the same as wrong – and could potentially expose the user to significant legal and financial risk.
Indeed, a March investigation by the New York Times found that LLMs asked to fill out U.S. tax forms produced errors serious enough to constitute tax evasion, if not tax fraud, in an IRS audit.
And these were relatively simple personal tax returns. Business filings are far more complex. Any company using LLMs to fully replace professional accounting software could thus be taking on serious risk. The same is likely true for other mission-critical functions like risk management.
One could argue that these findings simply show that AI models will always require human oversight, which is something AI firms would probably agree with. But if too much checking and fixing is required, investors may need to rethink all those projections for massive efficiency gains.
Technology is not standing still, of course, and the industry is seeking workarounds to the hallucination problem. But those workarounds do not seem possible within LLMs themselves; they require a completely new approach, such as so-called “world models” that are built on some form of “understanding” of the task at hand. And those models are likely years away.
SQUEEZED FROM BOTH SIDES
Ultimately, when it comes to the vast majority of everyday business tasks where total reliability is not required, the available LLMs, including low-cost options like DeepSeek, Meta's META.O Llama, GLM and Alibaba's 9988.HK Qwen, can work just fine.
That’s good news for consumers but not necessarily for firms like Anthropic, OpenAI, and Alphabet GOOGL.O that are behind pricier LLMs. If a low-cost, or even free, model gives you what you need, and no model delivers the precision required for the most sophisticated tasks, why pay up for an advanced option?
While one should never completely discount the possibility of technological breakthroughs, the conclusion from these studies should make AI bulls uncomfortable. The most lucrative applications for LLMs may remain out of reach, even as cheaper AI competition is emerging.
For many AI business models, this is not just a technical problem, but potentially an existential one.
(The views expressed here are those of Joachim Klement, an investment strategist for Panmure Liberum.)