ChatGPT has wowed the world with the depth of its knowledge and the fluency of its responses, but one problem has hobbled its usefulness: It keeps hallucinating.
Yes, large language models (LLMs) hallucinate, a term popularized by Google AI researchers in 2018. Hallucination in this context refers to mistakes in the generated text that are semantically or syntactically plausible but are in fact incorrect or nonsensical. In short, you can’t trust what the machine is telling you.
That’s why, while OpenAI’s Codex or GitHub’s Copilot can write code, an experienced programmer still needs to review the output—approving, correcting, or rejecting it before allowing it to slip into a code base where it might wreak havoc.
High school teachers are learning the same. A ChatGPT-written book report or historical essay may be a breeze to read but could easily contain erroneous “facts” that the student was too lazy to root out.
Hallucinations are a serious problem. Bill Gates has mused that ChatGPT or similar large language models could someday provide medical advice to people without access to doctors. But you can’t trust advice from a machine prone to hallucinations.
OpenAI Is Working to Fix ChatGPT’s Hallucinations
Ilya Sutskever, OpenAI’s chief scientist and one of the creators of ChatGPT, says he’s confident that the problem will disappear with time as large language models learn to anchor their responses in reality. OpenAI has pioneered a technique to shape its models’ behaviors using something called reinforcement learning with human feedback (RLHF).
RLHF was developed by OpenAI and Google’s DeepMind team in 2017 as a way to improve reinforcement learning when a task involves complex or poorly defined goals, making it difficult to design a suitable reward function. Having a human periodically check on the reinforcement learning system’s output and give feedback allows reinforcement-learning systems to learn even when the reward function is hidden.
For ChatGPT, data collected during its interactions is used to train a neural network that acts as a “reward predictor,” which reviews ChatGPT’s outputs and predicts a numerical score that represents how well those outputs align with the system’s desired behavior—in this case, factual, accurate responses.
Periodically, a human evaluator checks ChatGPT’s responses and chooses those that best reflect the desired behavior. That feedback is used to adjust the reward-predictor network, and the updated predictor is in turn used to adjust the behavior of the AI model. The process repeats in an iterative loop, yielding steadily improved behavior. Sutskever believes this process will eventually teach ChatGPT to improve its overall performance.
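The loop described above can be sketched in miniature. The toy Python below uses hypothetical names and numbers (it is not OpenAI’s implementation, and the “human” is simulated by a hard-coded preference table): a policy samples canned responses, a reward predictor scores them, periodic human feedback updates the predictor, and the predictor’s scores reweight the policy.

```python
import random

# Toy sketch of the RLHF loop (hypothetical, NOT OpenAI's implementation).
# A "policy" samples one of three canned responses; a "reward predictor"
# scores each; simulated human preference labels periodically nudge the
# predictor, whose scores in turn reweight the policy.

responses = ["accurate answer", "plausible hallucination", "refusal"]
human_preference = {  # simulated human judgment (1.0 = ideal behavior)
    "accurate answer": 1.0,
    "plausible hallucination": 0.0,
    "refusal": 0.5,
}

reward_estimate = {r: 0.5 for r in responses}  # reward predictor's scores
policy_weight = {r: 1.0 for r in responses}    # policy's sampling weights
lr = 0.3
random.seed(0)

for step in range(200):
    # The policy samples a response in proportion to its current weights.
    sampled = random.choices(
        responses, weights=[policy_weight[r] for r in responses]
    )[0]

    # Periodically, the (simulated) human evaluator labels the response,
    # and the reward predictor is adjusted toward that feedback.
    if step % 5 == 0:
        reward_estimate[sampled] += lr * (
            human_preference[sampled] - reward_estimate[sampled]
        )

    # The updated reward predictor then adjusts the policy: responses the
    # predictor scores above 0.5 are reinforced, others suppressed.
    policy_weight[sampled] *= 1.0 + lr * (reward_estimate[sampled] - 0.5)

# Inspect which response the policy now favors.
favored = max(policy_weight, key=policy_weight.get)
```

In this sketch the predictor’s scores drift toward the human labels and the policy’s weights drift toward the predictor’s scores, which is the essential shape of the iterative loop, minus the actual neural networks on both sides.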
“I’m quite hopeful that by simply improving this subsequent reinforcement learning from the human feedback step, we can teach it to not hallucinate,” said Sutskever, suggesting that the ChatGPT limitations we see today will dwindle as the model improves.
Hallucinations May Be Inherent to Large Language Models
But Yann LeCun, a pioneer in deep learning and the self-supervised learning used in large language models, believes there is a more fundamental flaw that leads to hallucinations.
“Large language models have no idea of the underlying reality that language describes,” he said, adding that most human knowledge is nonlinguistic. “Those systems generate text that sounds fine, grammatically, semantically, but they don’t really have some sort of objective other than just satisfying statistical consistency with the prompt.”
Humans operate on a lot of knowledge that is never written down, such as customs, beliefs, or practices within a community that are acquired through observation or experience. And a skilled craftsperson may have tacit knowledge of their craft that is never written down.
“Language is built on top of a massive amount of background knowledge that we all have in common, that we call common sense,” LeCun said. He believes that computers need to learn by observation to acquire this kind of nonlinguistic knowledge.
“There is a limit to how smart they can be and how accurate they can be because they have no experience of the real world, which is really the underlying reality of language,” said LeCun. “Most of what we learn has nothing to do with language.”
“We learn how to throw a basketball so it goes through the hoop,” said Geoff Hinton, another pioneer of deep learning. “We don’t learn that using language at all. We learn it from trial and error.”
But Sutskever believes that text already expresses the world. “Our pretrained models already know everything they need to know about the underlying reality,” he said, adding that they also have deep knowledge about the processes that produce language.
While learning may be faster through direct observation by vision, he argued, even abstract ideas can be learned through text, given the volume—billions of words—used to train LLMs like ChatGPT.
Neural networks represent words, sentences, and concepts through a machine-readable format called an embedding. An embedding maps each word or concept to a vector, a string of numbers, positioned so that items with similar meanings sit close together in the vector space, making semantic relationships easy to analyze and process.
By looking at those strings of numbers, researchers can see how the model relates one concept to another, Sutskever explained. The model, he said, knows that an abstract concept like purple is more similar to blue than to red, and it knows that orange is more similar to red than purple. “It knows all those things just from text,” he said. While the concept of color is much easier to learn from vision, it can still be learned from text alone, just more slowly.
Whether or not inaccurate outputs can be eliminated through reinforcement learning with human feedback remains to be seen. For now, the usefulness of large language models in generating precise outputs remains limited.
Mathew Lodge, the CEO of Diffblue, a company that uses reinforcement learning to automatically generate unit tests for Java code, said that “reinforcement systems alone are a fraction of the cost to run and can be vastly more accurate than LLMs, to the point that some can work with minimal human review.”
Codex and Copilot, both based on GPT-3, generate possible unit tests that an experienced programmer must review and run before determining which is useful. But Diffblue’s product writes executable unit tests without human intervention.
“If your goal is to automate complex, error-prone tasks at scale with AI—such as writing 10,000 unit tests for a program that no single person understands—then accuracy matters a great deal,” said Lodge. He agrees that LLMs can be great for freewheeling creative interaction, but he cautions that the last decade has taught us that large deep-learning models are highly unpredictable, and making the models larger and more complicated doesn’t fix that. “LLMs are best used when the errors and hallucinations are not high impact,” he said.
Nonetheless, Sutskever said that as generative models improve, “they will have a shocking degree of understanding of the world and many of its subtleties, as seen through the lens of text.”
Craig S. Smith, a former reporter and executive for The New York Times, now works as a freelancer with a special interest in artificial intelligence. He is the founder of Eye on A.I., an artificial-intelligence-focused podcast and newsletter.