AI Language Models Are Struggling to “Get” Math

Should this be telling us something?

4 min read
The words Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have? in blue and distorted
iStockphoto/IEEE Spectrum

If computers are good at anything, they are good at math. So it may come as a surprise that after much struggling, top machine-learning researchers have recently made breakthroughs in teaching computers math.

Over the past year, researchers from the University of California, Berkeley, OpenAI, and Google have made progress in teaching basic math concepts to natural language generation models—algorithms such as GPT-2/3 and GPT-Neo. However, until recently, language models regularly failed to solve even simple word problems, such as “Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?”

“When we say computers are very good at math, they’re very good at things that are quite specific,” says Guy Gur-Ari, a machine-learning expert at Google. Computers are good at arithmetic—plugging numbers in and calculating is child’s play. But outside of formal structures, computers struggle.

“I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something.”
—Ethan Dyer, Google

Solving word problems, or “quantitative reasoning,” is deceptively tricky because it requires a robustness and rigor that many other problems don’t. If any step during the process goes wrong, the answer will be wrong. “When multiplying really large numbers together…they’ll forget to carry somewhere and be off by one,” says Vineet Kosaraju, a machine-learning expert at OpenAI. Other mistakes made by language models are less human, such as misinterpreting 10 as a 1 and a 0, not 10.

“We work on math because we find it independently very interesting,” says Karl Cobbe, a machine-learning expert at OpenAI. But as Gur-Ari puts it, if it’s good at math, “it’s probably also good at solving many other useful problems.”

As machine-learning models are trained on larger samples of data, they tend to grow more robust and make fewer mistakes. But scaling up seems to go only so far with quantitative reasoning; researchers realized that the mistakes language models make seemed to require a more targeted approach.

Last year, two different teams of researchers, at UC Berkeley and OpenAI, released two data sets, MATH and GSM8K, respectively, which contain thousands of math problems across geometry, algebra, precalculus, and more. “We basically wanted to see if it was a problem with data sets,” says Steven Basart, a researcher at the Center for AI Safety who worked on MATH. Language models were known to be bad at word problems—but how bad were they, and could they be fixed by introducing better formatted, bigger data sets? The MATH group found just how challenging quantitative reasoning is for top-of-the-line language models, which scored less than 7 percent. (A human grad student scored 40 percent, while a math olympiad champ scored 90 percent.)

Models attacking GSM8K problems, which had easier grade-school-level problems, reached about 20 percent accuracy. The OpenAI researchers used two main techniques: fine-tuning and verification. In fine-tuning, researchers take a pretrained language model that includes irrelevant information (Wikipedia articles on zambonis, the dictionary entry for “gusto,” and the like) and then show the model, Clockwork Orange–style, only the relevant information (math problems). Verification, on the other hand, is more of a review session. “The model gets to see a lot of examples of its own mistakes, which is really valuable,” Cobbe says.

At the time, OpenAI predicted a model would need to be trained on 100 times more data to reach 80 percent accuracy on GSM8K. But in June, Google’s Minerva announced 78 percent accuracy with minimal scaling upwards. “It’s ahead of any of the trends that we were expecting,” Cobbe says. Basart agrees. “That’s shocking. I thought it would take longer,” he says.

Minerva uses Google’s own language model, Pathways Language Model (PaLM), which is fine-tuned on scientific papers from the arXiv online preprint server and other sources with formatted math. Two other strategies helped Minerva. In “chain-of-thought prompting,” Minerva was required to break down larger problems into more palatable chunks. The model also used majority voting—instead of being asked for one answer, it was asked to solve the problem 100 times. Of those answers, Minerva picked the most common answer.

The gains from these new strategies were enormous. Minerva shot up to 50 percent accuracy on MATH and nearly 80 percent accuracy on GSM8K, as well as the MMLU, a more general set of STEM questions that included chemistry and biology. When Minerva was asked to redo a random sample of slightly tweaked questions, it performed just as well, suggesting that its capabilities were not from mere memorization.

What Minerva knows—or doesn’t know—about math is fuzzier. Unlike proof assistants, which come with built-in structure, Minerva and other language models have no formal structure. They can have strange, messy reasoning and still arrive at the right answer. As numbers grow larger, the language models’ accuracy falters, something that would never happen on a TI-84.

“Just how smart is it—or isn’t it?” asks Cobbe. Though models like Minerva might arrive at the same answer as a human, the actual process they’re following could be wildly different. On the other hand, chain-of-thought prompting is familiar to any human student who’s been asked to “show your work.”

“I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something,” says Ethan Dyer, a machine-learning expert at Google. But humans give inconsistent answers, make errors, and fail to apply core concepts, too. The borders, at this frontier of machine learning, are blurred.

Update 14 Oct. 2022: A previous version of this story extraneously alluded to the DALL-E/DALL-E 2 art-generation AI in the context of large language generation models being taught to handle math word problems. Of course, neither DALL-E nor DALL-E 2 is a large language generation model. (And it was not studied in the math word problem research.) So to avoid confusion, references to it were cut.

This article appears in the December 2022 print issue as “Machine Learning Rethinks Scientific Thinking.”

The Conversation (2)
Ashok Deobhakta
Ashok Deobhakta11 Jan, 2023


R Watkins
R Watkins27 Oct, 2022

Computers are not good at math. They are good at arithmetic. And neither arithmetic word problems, nor the actual concepts of mathematics, are part of "natural language" Has it escaped notice that these are difficult to teach otherwise literate schoolchildren? Forget incorporating these into natural language models, instead build language models specifically for them, run them in parallel, and defer to the one which best understands the problem.