Rapid improvements in the capabilities of large language models (LLMs) have allowed them to tackle a wide range of tasks, but there are still many problems they can’t address. New research suggests that letting LLMs outsource jobs to smaller, specialized AIs could significantly broaden their scope.
Today’s leading LLMs are capable of some impressive feats, including acing the Uniform Bar Exam and coding video games. But their capabilities are still primarily linguistic in nature. Efforts are underway to get these models to work with more varied types of data and solve a wider range of problems. For example, OpenAI’s largest GPT-4 model is multimodal and can now analyze pictures as well as text, after being trained on both language and image data.
Rather than trying to create an all-singing, all-dancing model that can solve all kinds of problems, researchers at Rutgers University, in New Jersey, think we should piggyback on the wealth of specialized AI systems already optimized for solving narrower problems. Their new approach allows a human to describe a task in natural language, which an LLM then analyzes before piecing together several specialist AIs to provide a solution. This Stickle Brick–like approach to AI could help incorporate all of the breakthroughs made in AI over the decades into a single generalist system, says Yongfeng Zhang, an assistant professor at Rutgers who led the project.
“These LLMs have some basic ability to manipulate different tools and models to solve some basic tasks,” he says. “With more and more modules, tools, and domain-expert models being added, these models will naturally expand their ability to solve more different tasks.”
Zhang and his colleagues have created a software platform called OpenAGI that links together various pretrained LLMs and other domain-specific AI models. In a preprint posted to arXiv, they describe experiments with three LLMs—OpenAI’s GPT-3.5, Meta’s LLaMA, and Google’s FLAN-T5—as well as a host of smaller models specialized in tasks like sentiment analysis, translation, image classification, image deblurring, image captioning, and text-to-image generation.
The user provides the LLM with a natural-language description of a task and the relevant data set. An example might be “Given a blurry grayscale image, how can we answer a written question about it?” The LLM analyzes the task and works out a step-by-step plan for how to solve it, using natural-language descriptions of AI models to work out which to piece together and in what order. In this example, that might involve using a deblurring model to improve image quality, then another model that can colorize the photo, and finally a model that can answer questions about pictures.
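The plan-then-execute flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAGI's actual code: the `Tool` class, the three stand-in "models," and the keyword-based `plan_task` function are all hypothetical. In a real system, an LLM would read the task and the tool descriptions and produce the plan itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A specialist model plus the plain-language description a planner would read."""
    name: str
    description: str
    run: Callable[[dict], dict]


# Hypothetical stand-ins for real specialist models; each one only edits a dict.
def deblur(sample: dict) -> dict:
    return {**sample, "blurry": False}


def colorize(sample: dict) -> dict:
    return {**sample, "grayscale": False}


def answer_question(sample: dict) -> dict:
    return {**sample, "answer": f"(answer to: {sample['question']})"}


TOOLS: Dict[str, Tool] = {
    t.name: t
    for t in [
        Tool("deblur", "Sharpens a blurry image.", deblur),
        Tool("colorize", "Adds color to a grayscale image.", colorize),
        Tool("vqa", "Answers a written question about an image.", answer_question),
    ]
}


def plan_task(task: str) -> List[str]:
    """Stub planner. A real system would prompt an LLM with the task and the
    tool descriptions; here simple keyword checks play that role."""
    text = task.lower()
    steps = []
    if "blurry" in text:
        steps.append("deblur")
    if "grayscale" in text:
        steps.append("colorize")
    if "question" in text:
        steps.append("vqa")
    return steps


def execute(steps: List[str], sample: dict) -> dict:
    """Run the chosen tools in order, feeding each output into the next."""
    for name in steps:
        sample = TOOLS[name].run(sample)
    return sample


task = "Given a blurry grayscale image, how can we answer a written question about it?"
sample = {"blurry": True, "grayscale": True, "question": "What animal is shown?"}
plan = plan_task(task)
result = execute(plan, sample)
print(plan)
print(result["answer"])
```

The key design point is that each tool carries a natural-language description, so the planner can choose among tools it has never seen before without any change to the execution loop.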
The power of the approach, says Zhang, is that it piggybacks on the vast expressive power of human language, which can be used to explain almost any problem or model capability. “To really develop general AI systems, human beings should develop some kind of technical approach to unify the different challenges in one data format,” says Zhang. “Language naturally serves as such a medium to describe many different tasks.”
The Rutgers team isn’t the only one exploring this approach. Last month, researchers from Microsoft and Zhejiang University, in China, unveiled a system called HuggingGPT, which connects OpenAI’s ChatGPT service to Hugging Face’s repository of AI models. The user provides a natural-language explanation of the task they want completed, and ChatGPT devises a plan, selects and runs the models required to complete it, and then compiles the results into a natural-language response for the user.
One significant difference between the approaches, says Zhang, is that HuggingGPT relies on a model that is accessible only through a company API. The Rutgers team’s approach is LLM-agnostic and open source. One advantage of this aspect is that it makes it possible to train the LLM to be better at the task-planning challenge, either using human-devised examples or using feedback on its performance to retrain the model.
In tests, the group showed that the 175-billion-parameter GPT-3.5, which is accessible only through a company API, achieved the best results both when the model was given no prompts about how to solve the problem and when it was given just a handful of examples. But when the much smaller FLAN-T5, which has just 770 million parameters, was retrained using performance feedback, it did significantly better than GPT-3.5 under the no-prompt scenario.
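To give a feel for how performance feedback can steer a planner, here is a toy sketch. It is not the Rutgers team's training procedure. It assumes three hypothetical candidate plans with made-up scores, and it updates a simple softmax policy by gradient ascent on expected reward, so that higher-scoring plans become more likely over time.

```python
import math
from typing import Dict, List

# Made-up task scores for three hypothetical candidate plans (higher is better).
REWARDS: Dict[str, float] = {
    "vqa only": 0.2,
    "deblur -> vqa": 0.5,
    "deblur -> colorize -> vqa": 0.9,
}


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(rewards: List[float], steps: int = 200, lr: float = 1.0) -> List[float]:
    """Gradient ascent on expected reward for a softmax policy over plans.

    The gradient of the expected reward with respect to logit i is
    p_i * (r_i - expected_reward), so plans that score above the current
    average become more likely and the rest become less likely.
    """
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [
            theta + lr * p * (r - expected)
            for theta, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


plans = list(REWARDS)
probs = train([REWARDS[p] for p in plans])
best_plan = plans[probs.index(max(probs))]
print(best_plan)
```

A real system would adjust the weights of the language model itself rather than a three-entry table, and it would score plans by actually running them on task data. The feedback loop is the same in spirit: reward the plans that solve the task well.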
Both OpenAGI and HuggingGPT are part of a recent explosion in efforts to link LLMs up to other AI models and digital tools, often using APIs. Other prominent examples include Microsoft’s TaskMatrix.AI, Meta’s Toolformer, and the Allen Institute’s VISPROG. Whether offloading tasks to other AI models or simpler software, the idea is much the same, says Mahesh Sathiamoorthy, a software engineer at Google Brain. He thinks the approach is likely to be a more promising avenue for boosting the capability of future AIs than multimodal training approaches.
“It is possible that one model that does everything will be better in terms of quality, but it is going to be impractical to train and serve such a model,” he writes in an email to IEEE Spectrum. “We already have a lot of excellent domain-specific models, and domain-specific knowledge stores (e.g. Google). So it is easier to make use of them.”
However, David Schlangen, a professor of computational linguistics at the University of Potsdam, in Germany, takes issue with the name the Rutgers team has given its model. AGI stands for “artificial general intelligence,” which refers to a hypothetical AI system that mimics the kind of general intelligence seen in humans. While these new models are interesting experiments in pushing LLMs beyond simply dealing with text, Schlangen says they still provide no solution to key flaws like the tendency to make up facts. “The framing as having anything to do with ‘artificial general intelligence’ is at best misleading,” he says.
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com. His PGP fingerprint is ABB8 6BB3 3E69 C4A7 EC91 611B 5C12 193D 5DFC C01B.