AI Language Models Are Struggling to “Get” Math

IEEE Spectrum: For the Technology Insider
© Copyright 2024 IEEE — All rights reserved. A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

If computers are good at anything, they are good at math. So it may come as a surprise that after much struggling, top machine-learning researchers have recently made breakthroughs in teaching computers math.

Over the past year, researchers from the University of California, Berkeley, OpenAI, and Google have made progress in teaching basic math concepts to natural language generation models, algorithms such as GPT-2, GPT-3, and GPT-Neo. Until recently, however, language models regularly failed to solve even simple word problems, such as “Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?”
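
For readers who want the arithmetic spelled out, here is a minimal sketch of that word problem in Python. The function and variable names are illustrative assumptions, not anything from the research. The point is that the four balls given to Charlie are a distractor: the problem already states how many balls Bob has after the gift.

```python
# Hypothetical worked version of the word problem quoted above.
def alice_balls(bob_after_gift: int, alice_extra: int) -> int:
    """Alice has `alice_extra` more balls than Bob currently holds."""
    return bob_after_gift + alice_extra

bob_after_gift = 2      # Bob's count after giving balls to Charlie
given_to_charlie = 4    # distractor: not needed for Alice's total
alice_extra = 5         # Alice has five more balls than Bob

alice = alice_balls(bob_after_gift, alice_extra)
bob_before_gift = bob_after_gift + given_to_charlie  # only needed if asked about Bob's past

print(alice)            # 7
print(bob_before_gift)  # 6
```

A solver that adds the four back in before comparing would answer 11, a typical wrong answer for this kind of problem.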

“When we say computers are very good at math, they’re very good at things that are quite specific,” says Guy Gur-Ari, a machine-learning expert at Google. Computers are good at arithmetic—plugging numbers in and calculating is child’s play. But outside of formal structures, computers struggle.

Solving word problems, or “quantitative reasoning,” is deceptively tricky because it requires a robustness and rigor that many other problems don’t: if any step in the process goes wrong, the final answer will be wrong. Some of the mistakes language models make look like human slips. “When multiplying really large numbers together…they’ll forget to carry somewhere and be off by one,” says Vineet Kosaraju, a machine-learning expert at OpenAI. Other mistakes are less human, such as reading 10 as a 1 and a 0 rather than as the number ten.
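
The dropped-carry failure is easy to reproduce by hand. The sketch below is a hypothetical illustration in Python, not a description of how any language model computes: it performs schoolbook multiplication by a single digit and can optionally "forget" the carry at one position. The `drop_carry_at` parameter is an assumption introduced purely for the demonstration.

```python
from typing import Optional

def multiply_by_digit(number: int, digit: int,
                      drop_carry_at: Optional[int] = None) -> int:
    """Schoolbook multiplication of a non-negative `number` by one `digit` (0-9).

    Works right to left. If `drop_carry_at` is given, the carry produced
    at that digit position (0 = ones place) is discarded, imitating a
    forgotten carry.
    """
    digits = [int(ch) for ch in str(number)][::-1]  # least significant first
    carry = 0
    result_digits = []
    for position, d in enumerate(digits):
        total = d * digit + carry
        result_digits.append(total % 10)
        carry = total // 10
        if position == drop_carry_at:
            carry = 0  # the "forgotten" carry
    if carry:
        result_digits.append(carry)  # final carry is at most 8, so one digit
    return int("".join(str(d) for d in reversed(result_digits)))

print(multiply_by_digit(1234, 3))                   # 3702
print(multiply_by_digit(1234, 3, drop_carry_at=1))  # 3602
```

Every other step in the faulty run is correct, yet the hundreds digit comes out one too small and the whole answer is wrong.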

“We work on math because we find it independently very interesting,” says Karl Cobbe, a machine-learning expert at OpenAI. But as Gur-Ari puts it, if a model is good at math, “it’s probably also good at solving many other useful problems.”

As machine-learning models are trained on larger samples of data, they tend to grow more robust and make fewer mistakes. But scaling up seems to go only so far with quantitative reasoning; researchers came to see that the mistakes language models make appear to call for a more targeted approach.

In 2021, two teams of researchers, one at UC Berkeley and one at OpenAI, released the data sets MATH and GSM8K, respectively, which contain thousands of math problems across geometry, algebra, precalculus, and more. “We basically wanted to see if it was a problem with data sets,” says Steven Basart, a researcher at the Center for AI Safety who worked on MATH. Language models were known to be bad at word problems, but how bad were they, and could they be fixed by introducing better-formatted, bigger data sets? The MATH group found just how challenging quantitative reasoning is for top-of-the-line language models, which scored less than 7 percent. (A human grad student scored 40 percent, while a math olympiad champ scored 90 percent.)

Models attacking GSM8K, which consists of easier grade-school-level problems, reached about 20 percent accuracy. The OpenAI researchers used two main techniques: fine-tuning and verification. In fine-tuning, researchers take a pretrained language model that includes irrelevant information (Wikipedia articles on zambonis, the dictionary entry for “gusto,” and the like) and then show the model, Clockwork Orange–style, only the relevant information (math problems). Verification, on the other hand, is more of a review session. “The model gets to see a lot of examples of its own mistakes, which is really valuable,” Cobbe says.

At the time, OpenAI predicted a model would need to be trained on 100 times more data to reach 80 percent accuracy on GSM8K. But in June 2022, Google reported that its Minerva model had reached 78 percent accuracy with minimal scaling up. “It’s ahead of any of the trends that we were expecting,” Cobbe says. Basart agrees. “That’s shocking. I thought it would take longer,” he says.

Minerva uses Google’s own language model, Pathways Language Model (PaLM), which is fine-tuned on scientific papers from the arXiv online preprint server and other sources with formatted math. Two other strategies helped Minerva. In “chain-of-thought prompting,” Minerva was required to break down larger problems into more palatable chunks. The model also used majority voting: instead of being asked for one answer, it was asked to solve the problem 100 times, and the most common answer became its final answer.
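
The voting step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Minerva’s actual code: the `Final answer:` marker, the function names, and the three hand-written sample solutions are all hypothetical stand-ins for real model output.

```python
from collections import Counter
from typing import List

FINAL_MARKER = "Final answer:"  # assumed convention, not Minerva's real format

def extract_final_answer(solution_text: str) -> str:
    """Return the text after the last marker (the whole text if no marker)."""
    return solution_text.rsplit(FINAL_MARKER, 1)[-1].strip()

def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

# Hand-written stand-ins for step-by-step solutions sampled from a model.
sampled_solutions = [
    "Bob has 2 balls now. Alice has 2 + 5 = 7. Final answer: 7",
    "Bob had 2 + 4 = 6 balls. Alice has 6 + 5 = 11. Final answer: 11",
    "Alice has 5 more than Bob's 2 balls, so 7. Final answer: 7",
]

final_answers = [extract_final_answer(s) for s in sampled_solutions]
winner = majority_vote(final_answers)

print(final_answers)  # ['7', '11', '7']
print(winner)         # 7
```

Voting helps only when wrong answers scatter while correct ones agree; if the model makes the same mistake most of the time, the majority will be wrong too.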

The gains from these new strategies were enormous. Minerva shot up to 50 percent accuracy on MATH and nearly 80 percent accuracy on GSM8K, as well as the MMLU, a more general set of STEM questions that included chemistry and biology. When Minerva was asked to redo a random sample of slightly tweaked questions, it performed just as well, suggesting that its capabilities were not from mere memorization.

What Minerva knows—or doesn’t know—about math is fuzzier. Unlike proof assistants, which come with built-in structure, Minerva and other language models have no formal structure. They can have strange, messy reasoning and still arrive at the right answer. As numbers grow larger, the language models’ accuracy falters, something that would never happen on a TI-84.

“Just how smart is it—or isn’t it?” asks Cobbe. Though models like Minerva might arrive at the same answer as a human, the actual process they’re following could be wildly different. On the other hand, chain-of-thought prompting is familiar to any human student who’s been asked to “show your work.”

“I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something,” says Ethan Dyer, a machine-learning expert at Google. But humans give inconsistent answers, make errors, and fail to apply core concepts, too. The borders, at this frontier of machine learning, are blurred.

Update 14 Oct. 2022: A previous version of this story extraneously alluded to the DALL-E/DALL-E 2 art-generation AI in the context of large language generation models being taught to handle math word problems. Of course, neither DALL-E nor DALL-E 2 is a large language generation model. (And it was not studied in the math word problem research.) So to avoid confusion, references to it were cut.

This article appears in the December 2022 print issue as “Machine Learning Rethinks Scientific Thinking.”
