The word problem “Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?” rendered in distorted blue text.

iStockphoto/IEEE Spectrum

If computers are good at anything, they are good at math. So it may come as a surprise that top machine-learning researchers made their recent breakthroughs in teaching computers math only after a long struggle.

Over the past year, researchers from the University of California, Berkeley, OpenAI, and Google have made progress in teaching basic math concepts to natural language generation models—algorithms such as GPT-2/3 and GPT-Neo. However, until recently, language models regularly failed to solve even simple word problems, such as “Alice has five more balls than Bob, who has two balls after he gives four to Charlie. How many balls does Alice have?”

“When we say computers are very good at math, they’re very good at things that are quite specific,” says Guy Gur-Ari, a machine-learning expert at Google. Computers are good at arithmetic—plugging numbers in and calculating is child’s play. But outside of formal structures, computers struggle.

“I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something.”
—Ethan Dyer, Google

Solving word problems, or “quantitative reasoning,” is deceptively tricky because it requires a robustness and rigor that many other problems don’t. If any step in the process goes wrong, the answer will be wrong. “When multiplying really large numbers together…they’ll forget to carry somewhere and be off by one,” says Vineet Kosaraju, a machine-learning expert at OpenAI. Other mistakes made by language models are less human, such as reading the number 10 as a separate 1 and 0 rather than as ten.

“We work on math because we find it independently very interesting,” says Karl Cobbe, a machine-learning expert at OpenAI. But as Gur-Ari puts it, if it’s good at math, “it’s probably also good at solving many other useful problems.”

As machine-learning models are trained on larger samples of data, they tend to grow more robust and make fewer mistakes. But scaling up seems to go only so far with quantitative reasoning; researchers realized that the mistakes language models make seemed to require a more targeted approach.

Last year, two different teams of researchers, at UC Berkeley and OpenAI, released two data sets, MATH and GSM8K, respectively, which contain thousands of math problems across geometry, algebra, precalculus, and more. “We basically wanted to see if it was a problem with data sets,” says Steven Basart, a researcher at the Center for AI Safety who worked on MATH. Language models were known to be bad at word problems—but how bad were they, and could they be fixed with better-formatted, bigger data sets? The MATH group found just how challenging quantitative reasoning is for top-of-the-line language models, which scored less than 7 percent on its problems. (A human grad student scored 40 percent, while a math olympiad champ scored 90 percent.)
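To make that kind of evaluation concrete, here is a minimal sketch of how a model might be scored on GSM8K-style problems. The “#### number” final-answer convention follows the published GSM8K format; `model_solve()` is a hypothetical stand-in for whatever language model is being tested, not any team’s actual code.

```python
import re

def extract_final_answer(solution_text):
    """Pull the final numeric answer out of a GSM8K-style solution string."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution_text)
    return match.group(1).replace(",", "") if match else ""

def score(problems, model_solve):
    """Fraction of problems whose predicted final answer matches the reference."""
    correct = 0
    for problem in problems:
        predicted = extract_final_answer(model_solve(problem["question"]))
        reference = extract_final_answer(problem["answer"])
        correct += int(predicted == reference)
    return correct / len(problems)
```

Grading only the final number, rather than the intermediate reasoning, is what makes these benchmarks easy to score automatically but blind to how the answer was reached.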

Models attacking GSM8K, a set of easier grade-school-level problems, reached about 20 percent accuracy. The OpenAI researchers used two main techniques: fine-tuning and verification. In fine-tuning, researchers take a pretrained language model, one that has absorbed plenty of irrelevant information (Wikipedia articles on zambonis, the dictionary entry for “gusto,” and the like), and then show the model, Clockwork Orange–style, only the relevant information (math problems). Verification, on the other hand, is more of a review session. “The model gets to see a lot of examples of its own mistakes, which is really valuable,” Cobbe says.
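The verification idea can be sketched in a few lines. In this illustrative code—not OpenAI’s actual implementation—`generate_solution()` stands in for the fine-tuned model and `verifier_score()` for a separately trained verifier that rates how likely a candidate solution is to be correct.

```python
def solve_with_verifier(question, generate_solution, verifier_score, num_candidates=100):
    """Sample many candidate solutions and return the one the verifier trusts most."""
    candidates = [generate_solution(question) for _ in range(num_candidates)]
    # The verifier is trained on the model's own correct and incorrect solutions,
    # which is how it "gets to see a lot of examples of its own mistakes."
    return max(candidates, key=lambda solution: verifier_score(question, solution))
```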

At the time, OpenAI predicted a model would need to be trained on 100 times more data to reach 80 percent accuracy on GSM8K. But in June, Google’s Minerva announced 78 percent accuracy with only a modest increase in scale. “It’s ahead of any of the trends that we were expecting,” Cobbe says. Basart agrees. “That’s shocking. I thought it would take longer,” he says.

Minerva uses Google’s own language model, Pathways Language Model (PaLM), fine-tuned on scientific papers from the arXiv online preprint server and other sources with formatted math. Two other strategies helped Minerva. In “chain-of-thought prompting,” Minerva was required to break down larger problems into more palatable chunks. The model also used majority voting—instead of being asked for one answer, it was asked to solve the problem 100 times and then pick the most common of those answers.
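Both tricks are easy to picture in code. The sketch below is an assumption-laden illustration, not Google’s implementation: `sample_chain_of_thought()` is a hypothetical stand-in for Minerva, the prompt wording is invented, and the final line of each sampled solution is assumed to hold the answer.

```python
from collections import Counter

def solve_by_majority_vote(question, sample_chain_of_thought, num_samples=100):
    """Sample many step-by-step solutions and return the most common final answer."""
    prompt = question + "\nLet's work through this step by step."
    answers = []
    for _ in range(num_samples):
        solution = sample_chain_of_thought(prompt)          # one full worked solution
        answers.append(solution.strip().splitlines()[-1])   # assume the last line states the answer
    # Majority voting: the answer the model reaches most often wins.
    return Counter(answers).most_common(1)[0][0]
```

The intuition is that a model is more likely to stumble onto many different wrong answers than to converge on the same wrong answer 100 times, so the most frequent answer is usually the most trustworthy one.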

The gains from these new strategies were enormous. Minerva shot up to 50 percent accuracy on MATH and nearly 80 percent accuracy on GSM8K, as well as the MMLU, a more general set of STEM questions that included chemistry and biology. When Minerva was asked to redo a random sample of slightly tweaked questions, it performed just as well, suggesting that its capabilities were not from mere memorization.

What Minerva knows—or doesn’t know—about math is fuzzier. Unlike proof assistants, which come with built-in structure, Minerva and other language models have no formal structure. They can have strange, messy reasoning and still arrive at the right answer. As numbers grow larger, the language models’ accuracy falters, something that would never happen on a TI-84.

“Just how smart is it—or isn’t it?” asks Cobbe. Though models like Minerva might arrive at the same answer as a human, the actual process they’re following could be wildly different. On the other hand, chain-of-thought prompting is familiar to any human student who’s been asked to “show your work.”

“I think there’s this notion that humans doing math have some rigid reasoning system—that there’s a sharp distinction between knowing something and not knowing something,” says Ethan Dyer, a machine-learning expert at Google. But humans give inconsistent answers, make errors, and fail to apply core concepts, too. The borders, at this frontier of machine learning, are blurred.


Update 14 Oct. 2022: A previous version of this story extraneously alluded to the DALL-E/DALL-E 2 art-generation AI in the context of large language generation models being taught to handle math word problems. Of course, neither DALL-E nor DALL-E 2 is a large language generation model. (And it was not studied in the math word problem research.) So to avoid confusion, references to it were cut.

The Conversation (1)
R Watkins, 27 Oct. 2022

Computers are not good at math. They are good at arithmetic. And neither arithmetic word problems nor the actual concepts of mathematics are part of “natural language.” Has it escaped notice that these are difficult to teach even to otherwise literate schoolchildren? Forget incorporating these into natural language models; instead, build language models specifically for them, run them in parallel, and defer to the one that best understands the problem.

Will AI Steal Submarines’ Stealth?

Better detection will make the oceans transparent—and perhaps doom mutually assured destruction

The Virginia-class fast attack submarine USS Virginia cruises through the Mediterranean in 2010. Back then, it could effectively disappear just by diving.

U.S. Navy

Submarines are valued primarily for their ability to hide. The assurance that submarines would likely survive the first missile strike in a nuclear war and thus be able to respond by launching missiles in a second strike is key to the strategy of deterrence known as mutually assured destruction. Any new technology that might render the oceans effectively transparent, making it trivial to spot lurking submarines, could thus undermine the peace of the world. For nearly a century, naval engineers have striven to develop ever-faster, ever-quieter submarines. But they have worked just as hard at advancing a wide array of radar, sonar, and other technologies designed to detect, target, and eliminate enemy submarines.

The balance seemed to turn with the emergence of nuclear-powered submarines in the early 1960s. In a 2015 study for the Center for Strategic and Budgetary Assessments, Bryan Clark, a naval specialist now at the Hudson Institute, noted that the ability of these boats to remain submerged for long periods of time made them “nearly impossible to find with radar and active sonar.” But even these stealthy submarines produce subtle, very-low-frequency noises that can be picked up from far away by networks of acoustic hydrophone arrays mounted to the seafloor.
