Building ever larger language models has led to groundbreaking jumps in performance. But it’s also pushing state-of-the-art AI beyond the reach of all but the most well-resourced AI labs. That makes efforts to shrink models down to more manageable sizes more important than ever, say researchers.
In 2020, researchers at OpenAI proposed AI scaling laws that suggested increasing model size led to reliable and predictable improvements in capability. But this trend is quickly putting the cutting edge of AI research out of reach for all but a handful of private labs. While the company has remained tight-lipped on the matter, there is speculation that its latest GPT-4 large language model (LLM) has as many as a trillion parameters, far more than most companies or research groups have the computing resources to train or run. As a result, the only way most people can access the most powerful models is through the APIs of industry leaders.
“We won’t be able to make models bigger forever. There comes a point where even with hardware improvements, given the pace that we’re increasing the model size, we just can’t.”
—Dylan Patel, SemiAnalysis
That’s a problem, says Dylan Patel, chief analyst at the consultancy SemiAnalysis, because it makes it more or less impossible for others to reproduce these models. That means external researchers aren’t able to probe these models for potential safety concerns and that companies looking to deploy LLMs are “tied to the hip” of OpenAI’s data set and model design choices.
There are more practical concerns too. The pace of innovation in the GPU chips used to run AI is lagging behind the growth in model size, meaning that pretty soon we could face a “brick wall” beyond which scaling cannot plausibly go. “We won’t be able to make models bigger forever,” he says. “There comes a point where even with hardware improvements, given the pace that we’re increasing the model size, we just can’t.”
How large do large language models need to be?
Efforts to push back against the logic of scaling are underway, though. Last year, researchers at DeepMind showed that training smaller models on far more data could significantly boost performance. DeepMind’s 70-billion-parameter Chinchilla model outperformed the 175-billion-parameter GPT-3 by training on nearly five times as much data. This February, Meta used the same approach to train much smaller models that could still go toe-to-toe with the biggest LLMs. Its resulting LLaMa model came in a variety of sizes between 7 and 65 billion parameters, with the 13-billion-parameter version outperforming GPT-3 on most benchmarks.
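The Chinchilla result came with a rough rule of thumb: for a fixed training budget, parameter count and training tokens should grow together, with roughly 20 tokens per parameter, and training compute is commonly approximated as about 6 FLOPs per parameter per token. The sketch below applies those two approximations to split a compute budget into a model size and a token count; the specific budget figure is illustrative, chosen to land near Chinchilla's published configuration.

```python
# Sketch of the Chinchilla "compute-optimal" rule of thumb.
# Assumptions (approximations from the scaling-laws literature):
#   training compute C ~ 6 * N * D FLOPs, and optimal D ~ 20 * N tokens,
# where N is parameter count and D is training tokens.

def compute_optimal(flop_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into a model size and a token count."""
    # C = 6 * N * D and D = tokens_per_param * N
    # => C = 6 * tokens_per_param * N**2
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla trained a 70-billion-parameter model on ~1.4 trillion tokens.
params, tokens = compute_optimal(5.9e23)  # illustrative budget near Chinchilla's
print(f"{params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens")
```

Under these approximations, doubling the compute budget grows the optimal model only by a factor of √2, which is one way to read DeepMind's claim that earlier LLMs like GPT-3 were substantially undertrained for their size.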
The company’s stated goal was to make such LLMs more accessible, and so Meta offered the trained model to any researchers who asked for it. This experiment in accessibility quickly got out of control, though, after the model was leaked online. And earlier this month, researchers at Stanford pushed things further: They took the 7-billion-parameter version of LLaMa and fine-tuned it on 52,000 query responses from GPT-3.5, the model that originally powered ChatGPT and (as of press time) still powers OpenAI’s free version. The resulting model, called Alpaca, was able to replicate much of the behavior of the OpenAI model, according to the researchers, who released their data and training recipe so others could replicate it.
“Increasingly, we were finding that there was a gap in the qualitative behavior of models available to the research community and the closed-source models being served by leading LLM providers,” says Tatsunori Hashimoto, an assistant professor at Stanford who led the research. “Our view was that having a capable and accessible model was important to have the academic community engage in analyzing and solving the many deficiencies of instruction-following LLMs.”
Since then, hackers and hobbyists have run with the idea, using the LLaMa weights and the Alpaca training scheme to run their own LLMs on PCs, phones, and even a Raspberry Pi single-board computer. Hashimoto says it’s great to see more people engaging with LLMs, and he’s been surprised at the efficiency people have squeezed out of these models. But he stresses that Alpaca is still very much a research model not suitable for widespread use, and that broad accessibility to LLMs also carries risks.
“If we can take advantage of the knowledge already frozen in these models, we should.”
—Jim Fan, Nvidia
Patel says there are question marks around the way the Stanford researchers evaluated their model, and it’s not clear the performance is as good as that of larger models. But there are plenty of other approaches for boosting efficiency that are also making progress. One promising technique is the “mixture of experts” (MoE), he says, which involves training multiple smaller sub-models specialized for specific tasks rather than using a single large model to solve all of them. The MoE approach makes a lot of sense, says Patel. Our brains follow a similar pattern, with different regions specialized for different tasks.
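The routing idea behind mixture of experts can be sketched in a few lines: a small gating function scores each expert for a given input and sends the input only to the top-scoring ones, so most of the model's weights sit idle on any one token. Everything below is illustrative; the sizes, weights, and top-2 routing are assumptions for the sketch, not the design of any production MoE model.

```python
import math
import random

# Toy mixture-of-experts layer: each "expert" is a small dense matrix,
# and a gating function routes each input to the top_k best-scoring
# experts. Only those experts' weights are used for this input.
random.seed(0)
d_model, n_experts, top_k = 4, 4, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

experts = [rand_matrix(d_model, d_model) for _ in range(n_experts)]
gate = rand_matrix(n_experts, d_model)  # gating weights: one score per expert

def moe_forward(x):
    scores = matvec(gate, x)                                  # one logit per expert
    top = sorted(range(n_experts), key=lambda i: scores[i])[-top_k:]
    exps = [math.exp(scores[i]) for i in top]
    probs = [e / sum(exps) for e in exps]                     # softmax over chosen experts
    out = [0.0] * d_model
    for p, i in zip(probs, top):                              # only top_k experts run
        for j, v in enumerate(matvec(experts[i], x)):
            out[j] += p * v
    return out

result = moe_forward([1.0, 0.5, -0.5, 2.0])
print(len(result))  # 4: output has the same width, but half the experts did work
```

The payoff is that total parameter count can grow with the number of experts while the compute per token stays roughly fixed, since each token only activates a few experts.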
Nvidia recently used the approach to build a vision-language model called Prismer, designed to answer questions about images or provide captions. The company showed that the model could match the performance of models trained on 10 to 20 times as much data. “There are tons of high-quality pretrained models for various tasks like depth estimation, object segmentation, and 3D understanding,” says Jim Fan, AI research scientist at Nvidia. “If we can take advantage of the knowledge already frozen in these models, we should.”
Turning LLM sparsity into opportunity
Another attractive approach to boosting model efficiency is to exploit a property known as sparsity, says Patel. A surprisingly large number of weights in LLMs are set to zero, and performing operations on these values is a waste of computation. Finding ways to remove these zeros could help shrink the size of models and reduce computational costs, says Patel.
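The arithmetic savings from sparsity are easy to see in miniature: if the zero weights are dropped from storage, a matrix-vector product only has to touch the nonzeros. The sketch below uses a simple coordinate-list format with made-up values; real sparse kernels use more compact formats such as CSR, but the principle is the same.

```python
# Illustrative weight matrix in which most entries are zero.
dense = [
    [0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, -2.0],
    [0.5, 0.0, 0.0, 0.0],
]

# Store only (row, col, value) triples for the nonzero weights.
nonzeros = [(i, j, v) for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0.0]

def sparse_matvec(triples, x, n_rows):
    """Matrix-vector product that skips the zeros entirely."""
    y = [0.0] * n_rows
    for i, j, v in triples:  # 3 multiply-adds here instead of 12
        y[i] += v * x[j]
    return y

x = [1.0, 2.0, 3.0, 4.0]
print(sparse_matvec(nonzeros, x, 3))  # [3.0, -8.0, 0.5]
```

The catch Hooker describes is visible even here: the nonzeros land at irregular positions, which is awkward for GPUs built to stream through dense, regularly shaped matrices.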
Sparsity is one of the most promising future directions for compressing models, says Sara Hooker, who leads the research lab Cohere For AI, but current hardware is not well-suited to exploit it. Patterns of sparsity typically don’t have any obvious structure, but today’s GPUs are specialized for processing data in well-defined matrices. This means that even when a weight is zero it still needs to be represented in the matrix, which takes up memory and adds computational overhead. While enforcing structured patterns of sparsity is a partial workaround, the chips can’t take full advantage, and further hardware innovation is probably needed, Hooker says. “The interesting challenge is, how do you represent the absence of something without actually representing it?” she says.
Many of the techniques that are effective at compressing smaller AI models also don’t appear to translate well to LLMs, says Hooker. One popular approach is known as quantization, which reduces data requirements by representing weights using fewer bits—for instance, using 8-bit floating-point numbers rather than 32-bit. Another is knowledge distillation, in which a large teacher model is used to train a smaller one. So far, though, these techniques have had little success when applied to models above 6 billion parameters, says Hooker.
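The core of quantization fits in a few lines: pick a scale factor, snap each floating-point weight to one of 256 integer levels, and multiply back by the scale when the weight is needed. The sketch below shows symmetric per-tensor quantization with made-up weights; real LLM schemes add refinements such as per-channel scales and outlier handling, which is part of why naive quantization struggles at large model sizes.

```python
# Symmetric 8-bit quantization sketch: one scale factor maps floats
# into the signed-integer range [-127, 127]. Weights are illustrative.
weights = [0.31, -1.20, 0.07, 0.88, -0.46]

scale = max(abs(w) for w in weights) / 127          # one scale for the whole tensor
quantized = [round(w / scale) for w in weights]     # 8-bit integer codes
dequantized = [q * scale for q in quantized]        # recovered approximate weights

max_err = max(abs(w - d) for w, d in zip(weights, dequantized))
print(quantized)
assert max_err <= scale / 2  # error is bounded by half a quantization step
```

Stored this way, each weight takes 1 byte instead of 4, cutting memory traffic by 4x at the cost of a small, bounded rounding error per weight.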
The fight against AI scaling laws also faces more prosaic challenges, says Patel. Part of the reason why they’ve proved so enduring is that it’s often easier to throw computing power at a well-understood model architecture than fine-tune new techniques. “If I have 1,000 GPUs for three months, what’s the best model I can make?” he says. “A lot of times, the answer is, unfortunately, that you really can’t get these new architectures to run efficiently.”
That’s not to say that efforts to shrink larger models are a waste of time, says Patel. However, he adds, scaling is likely to continue to be important for setting new states of the art. “The max size is going to continue to grow, and the quality at small sizes is going to continue to grow,” he says. “I think there’s two divergent paths, and you’re kind of following both.”
Edd Gent is a freelance science and technology writer based in Bengaluru, India. His writing focuses on emerging technologies across computing, engineering, energy and bioscience. He's on Twitter at @EddytheGent and email at edd dot gent at outlook dot com. His PGP fingerprint is ABB8 6BB3 3E69 C4A7 EC91 611B 5C12 193D 5DFC C01B. His public key is here. DM for Signal info.