The long-anticipated update to OpenAI’s family of large language models (LLMs) is finally here. Early demos suggest GPT-4 is substantially more powerful than its predecessor and competitors. But more significantly, this fourth-generation generative pretrained transformer is now also multimodal—able to process visual input as well as text. The company’s secrecy around the model’s technical details has, however, stirred up controversy.
OpenAI announced the release in a blog post on Tuesday, accompanied by a 98-page technical report, although the document omits critical details such as the model’s size, its architecture, and how it was trained. GPT-4 has already been integrated into the company’s wildly popular chatbot service ChatGPT, though access to it there is currently limited to paying subscribers.
Developers can sign up for access to an API that will allow them to integrate the model into their software, and the company has revealed that partners including Duolingo, Stripe, and Khan Academy are already using the technology in their products. Microsoft also confirmed that its new Bing chatbot has been running on GPT-4 since launching last month.
“It’s pretty close to some weak form of understanding, although it is not understanding the way we know it.”
—Nello Cristianini, University of Bath
In its announcement, OpenAI said the differences between the new model and its predecessor GPT-3.5 are subtle when engaging in general conversation. But on more complex tasks the gulf becomes more evident, with GPT-4 outperforming state-of-the-art models in a wide range of machine-learning benchmarks, including those designed to assess reasoning capabilities. It also performed well in exams designed for humans, scoring near the top of the rankings in the Uniform Bar Exam, college and graduate-school aptitude tests (including the SAT, GRE, and LSAT), as well as a host of professional exams and high school Advanced Placement tests.
GPT-4’s significant jump in reasoning capability, compared to previous generations of the LLM, is the most impressive thing about it, says Jim Fan, an AI research scientist at Nvidia. “For the first time in history, benchmarks for AI will be the same as benchmarks for humans,” he adds.
But GPT-4 also differs significantly from previous iterations, and from the LLMs released by competitors like Google and Meta, in its ability to process images as well as text. These multimodal capabilities have been studied in academia for many years, says Fan, but making them accessible via a commercial API is a significant step, he adds.
These aren’t the first multimodal AI models people have been able to play around with. OpenAI’s DALL-E model and the open-source Stable Diffusion model are both able to convert text into images. But GPT-4 works the other way round, accepting images as input and then answering questions about them or using them as a starting point to generate new ideas.
In demos, OpenAI has shown how the model can explain why image-based jokes are funny, generate recipe ideas from photos of a fridge’s contents, or even code a working website based on nothing more than a rudimentary sketch. These capabilities are not yet accessible to developers or the general public, but the company has already partnered with the app Be My Eyes to use GPT-4 to describe what’s going on in photos for visually impaired people.
Beyond opening up a range of new practical use cases for LLMs, this multimodality could be an important step toward more powerful and more generally capable AI, says Nello Cristianini, a professor of artificial intelligence at the University of Bath, in England, and author of The Shortcut: Why Intelligent Machines Do Not Think Like Us.
For a start, the way these LLMs are trained can benefit significantly from using multiple modalities, he says. One of the main innovations that has made them possible is the idea of self-supervised learning, which did away with the need for humans to painstakingly label training data, and instead allowed AI to essentially teach itself by ingesting vast quantities of text.
When it comes to language models, this is done by getting the AI to guess the next word in a sentence. This doesn’t require any human input, because the model can work out whether it was right or wrong from the data itself. Do this over and over on Web-scale data sets and the AI builds a statistical model of language so sophisticated that emergent capabilities appear, says Cristianini.
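The core idea, that the next word in the data is itself the training label, can be sketched with a toy bigram model. This is purely illustrative (real LLMs use deep transformer networks, and nothing here reflects OpenAI’s actual training code); the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

# Self-supervised next-word prediction in miniature: the "label" for each
# position is simply the word that follows it in the corpus, so no human
# annotation is needed.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    counts[current_word][next_word] += 1

def predict_next(word):
    """Return the word most often seen after `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" -- it follows "the" most often here
```

A transformer replaces the lookup table with a learned function of the whole preceding context, but the supervision signal is exactly this: predict the next token, check against the data, adjust.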
“From a scientist’s point of view, it’s very disappointing. There’s this whole thing about standing on the shoulders of giants, but if we don’t know what kind of shoulders we are standing on, then it’s difficult to build on it.”
—Anthony Cohn, University of Leeds
Making these models multimodal can turbocharge the learning process, because it allows one data source to act as a supervising signal on the other. This not only allows the model to learn from more varied forms of data, but it can also help “ground” the abstract knowledge the models have learned from text in other media such as images. While GPT-4 is only a baby step in that direction, Cristianini says that as more modalities are added AI could start to develop more sophisticated models of reality. “It’s pretty close to some weak form of understanding, although it is not understanding the way we know it,” he says.
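One common way one modality can supervise another, used in CLIP-style contrastive training (an assumption here; OpenAI has not disclosed how GPT-4 handles images), is that each image’s correct caption is simply the one it arrived paired with. The embeddings below are invented 3-D toy vectors.

```python
import math

# Cross-modal supervision in miniature: paired image and text embeddings
# act as labels for each other -- no human annotation needed.
# Hypothetical embeddings for three image/caption pairs (pair i <-> i).
image_embeddings = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
text_embeddings  = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.0, 0.1, 0.9]]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def best_caption(img):
    """Index of the caption most similar to this image embedding."""
    return max(range(len(text_embeddings)),
               key=lambda j: cosine(img, text_embeddings[j]))

matches = [best_caption(img) for img in image_embeddings]
print(matches)  # each image matches its own paired caption: [0, 1, 2]
```

Training would push each image embedding toward its paired caption and away from the others, grounding words in visual data using only the pairing itself as the signal.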
Multimodality could also be a crucial ingredient when it comes to solving a host of scientific problems that have so far proven intractable, says Nathan Benaich, founder of Air Street Capital and coauthor of the influential State of AI Report. “Many grand challenges in science—whether in physics or biology—will require the fusing of as many data modalities as we can get our hands on,” he says.
But expanding to more modalities using these approaches may be harder than it seems, says Anthony Cohn, professor of automated reasoning at the University of Leeds, in England. LLMs require reams of examples to train on, and while the Internet provides an almost limitless supply of text and images, the same is not true of other kinds of data. “That’s always been one of the big criticisms of this kind of technology. It just requires quite an insane amount of training data,” he says.
It’s also important not to overstate the current model’s capabilities, says Cohn. In its announcement, OpenAI admitted that GPT-4 suffers from similar problems to its predecessors, in particular its tendency to “hallucinate,” which refers to it confidently stating as fact things that are actually false. And while the company has put safeguards in place to prevent the model from providing harmful advice or dangerous information, these can still be circumvented. That said, OpenAI claims GPT-4 misleads and misrepresents 82 percent less frequently than GPT-3.5.
Given this, Cohn says he was pleased to see OpenAI framing the model as a tool that needs careful supervision. More problematic, however, is the company’s decision not to release key technical details about the model. That not only makes it harder for others to debug and test the system, but it also means the rest of the AI community can’t build on OpenAI’s work. “From a scientist’s point of view, it’s very disappointing,” Cohn says. “There’s this whole thing about standing on the shoulders of giants, but if we don’t know what kind of shoulders we are standing on then it’s difficult to build on it.”
Ultimately though, the company’s reticence is understandable, says Cristianini. The release of ChatGPT has set off an LLM arms race among the big tech companies, and giving away how it built GPT-4 could be a boon to its rivals. At the same time, even if OpenAI released the details, very few research groups have the expertise or resources to build models of this size or complexity. “It’s not like we can replicate even if the paper was public,” he says. “But the competitors can, so that’s the problem.”
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy, and bioscience. He’s on Twitter at @EddytheGent and email at edd dot gent at outlook dot com.