OpenAI Demos a Control Method for Superintelligent AI

IEEE Spectrum: For the Technology Insider
© Copyright 2025 IEEE — All rights reserved. A public charity, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity.

One day, the theory goes, we humans will create AI systems that outmatch us intellectually. That could be great if they solve problems that we’ve been thus far unable to crack (think cancer or climate change), or really bad if they begin to act in ways that are not in humanity’s best interests, and we’re not smart enough to stop them.

So earlier this year, OpenAI launched its superalignment program, an ambitious attempt to find technical means to control a superintelligent AI system, or “align” it with human goals. OpenAI is devoting 20 percent of its compute to this effort, and hopes to have solutions by 2027.

The biggest challenge for this project: “This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to,” says Collin Burns, a member of OpenAI’s superalignment team. “This makes it very tricky to study—but I think we also have no choice.”

The first preprint paper to come out from the superalignment team showcases one way the researchers tried to get around that constraint. They used an analogy: Instead of seeing whether a human could adequately supervise a superintelligent AI, they tested a weak AI model’s ability to supervise a strong one. In this case, GPT-2 was tasked with supervising the vastly more powerful GPT-4. Just how much more powerful is GPT-4? While GPT-2 has 1.5 billion parameters, GPT-4 is rumored to have 1.76 trillion parameters, which would be more than 1,000 times as many (OpenAI has never released the figures for the more powerful model).

It’s an interesting approach, says Jacob Hilton of the Alignment Research Center; he was not involved with the current research, but is a former OpenAI employee. “It has been a long-standing challenge to develop good empirical testbeds for the problem of aligning the behavior of superhuman AI systems,” he tells IEEE Spectrum. “This paper makes a promising step in that direction and I am excited to see where it leads.”

The OpenAI team gave the GPT pair three types of tasks: chess puzzles, a set of natural language processing (NLP) benchmarks such as commonsense reasoning, and questions based on a dataset of ChatGPT responses, where the task was predicting which of multiple responses would be preferred by human users. In each case, GPT-2 was trained specifically on these tasks, but since it’s not a very large or capable model, it didn’t perform particularly well on them. Then its answers, errors included, were used as training labels for a version of GPT-4 with only basic training and no prior fine-tuning for these specific tasks. But remember: GPT-4 with only basic training is still a much more capable model than GPT-2.
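The setup described above can be imitated at toy scale. The sketch below is not OpenAI's code and uses no language models: the weak supervisor is simulated as a labeler that is wrong at random 30 percent of the time, and the "strong student" is a simple threshold rule fitted only to those flawed labels. All names (weak_supervisor, fit_threshold, run_experiment) and all numbers are assumptions made for illustration.

```python
import random


def true_label(x):
    # Ground truth for the toy task: is x above 0.5?
    return 1 if x > 0.5 else 0


def weak_supervisor(x, rng, error_rate=0.3):
    # Hypothetical stand-in for the weak model: usually right,
    # but it flips the correct answer at random with probability error_rate.
    y = true_label(x)
    return 1 - y if rng.random() < error_rate else y


def fit_threshold(xs, labels):
    # "Strong student": choose the threshold that agrees best with the
    # labels it was given. It never sees the ground truth.
    candidates = [i / 100 for i in range(101)]

    def agreement(t):
        return sum((1 if x > t else 0) == y for x, y in zip(xs, labels))

    return max(candidates, key=agreement)


def accuracy(predict, xs):
    # Fraction of points where the prediction matches the ground truth.
    return sum(predict(x) == true_label(x) for x in xs) / len(xs)


def run_experiment(seed=0, n_train=2000, n_test=2000):
    rng = random.Random(seed)
    train_xs = [rng.random() for _ in range(n_train)]
    test_xs = [rng.random() for _ in range(n_test)]

    # Step 1: the weak supervisor labels the training data, errors included.
    weak_labels = [weak_supervisor(x, rng) for x in train_xs]

    # Step 2: the student is trained on the weak labels only.
    threshold = fit_threshold(train_xs, weak_labels)

    # Step 3: both are scored against the ground truth on held-out data.
    weak_acc = accuracy(lambda x: weak_supervisor(x, rng), test_xs)
    student_acc = accuracy(lambda x: 1 if x > threshold else 0, test_xs)
    return weak_acc, student_acc, threshold


if __name__ == "__main__":
    weak_acc, student_acc, threshold = run_experiment()
    print(f"weak supervisor accuracy: {weak_acc:.3f}")
    print(f"student accuracy:         {student_acc:.3f}")
    print(f"learned threshold:        {threshold:.2f}")
```

In this toy the student beats its supervisor because the supervisor's mistakes are random and the student can only represent clean rules, so the noise averages out. That is a far more forgiving situation than the one the researchers face, where a weak model's errors may be systematic.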

The researchers wondered whether GPT-4 would make the same mistakes as its supervisor, GPT-2, which had essentially given it instructions for how to do the tasks. Remarkably, the stronger model consistently outperformed its weak supervisor. The strong model did particularly well on the NLP tasks, achieving a level of accuracy comparable to GPT-3.5. Its results were less impressive on the other two tasks, but they showed enough “signs of life” to encourage the group to keep trying with these tasks, says Leopold Aschenbrenner, another researcher on the superalignment team.
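One simple way to put a number on "outperformed its weak supervisor" is to ask how much of the gap between the weak supervisor and the strong model's own best performance the student managed to close. The superalignment paper reports a measure of this kind, which it calls performance gap recovered. The function name and the example figures below are illustrative assumptions, not results from the paper.

```python
def performance_gap_recovered(weak_acc, student_acc, ceiling_acc):
    # weak_acc:    accuracy of the weak supervisor
    # student_acc: accuracy of the strong model trained on weak labels
    # ceiling_acc: accuracy of the strong model trained on correct labels
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak supervisor accuracy")
    # 0.0 means the student only matched its supervisor;
    # 1.0 means it reached the strong model's ceiling.
    return (student_acc - weak_acc) / gap


if __name__ == "__main__":
    # Made-up accuracies, chosen only to show the arithmetic.
    print(performance_gap_recovered(weak_acc=0.6, student_acc=0.8, ceiling_acc=1.0))
```

A value near 1 would mean weak supervision cost the strong model almost nothing, while a value near 0 would mean the student simply inherited its supervisor's limits.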

The researchers call this phenomenon weak-to-strong generalization; they say it shows that the strong model had implicit knowledge of how to perform the tasks, and could find that knowledge within itself even when given shoddy instructions.

In this first experiment, the approach worked best with the NLP tasks because they’re fairly simple tasks with clear right and wrong answers, the team says. It did worst with the tasks from the ChatGPT dataset, in which the model was asked to determine which responses humans would prefer, because the answers were less clear-cut. “Some were subtly better, some were subtly worse,” says Aschenbrenner.

Could this alignment technique scale to superintelligent AI?

Burns gives an example of how a similar situation might play out in a future with superintelligent AI. “If you ask it to code something, and it generates a million lines of extremely complicated code interacting in totally new ways that are qualitatively different from how humans program, you might not be able to tell: Is this doing what we ask it to do?” Humans might also give it a corollary instruction, such as: Don’t cause catastrophic harm in the course of your coding work. If the model has benefited from weak-to-strong generalization, it might understand what it means to cause catastrophic harm and see, better than its human supervisors can, whether its work is straying into dangerous territory.

“We can only supervise simple examples that we can understand,” Burns says. “We need [the model] to generalize to much harder examples that superhuman models themselves understand. We need to elicit that understanding of: ‘is it safe or not, does following instructions count,’ which we can’t directly supervise.”

Some might argue that these results are actually a bad sign for superalignment, because the stronger model deliberately ignored the (erroneous) instructions given to it and pursued its own agenda of getting the right answers. But Burns says that humanity doesn’t want a superintelligent AI that follows incorrect instructions. What’s more, he says, “in practice many of the errors of the weak supervisor will be more of the form: ‘this problem is way too hard for me, and I don’t have a strong opinion either way.’” In that case, he says, we’ll want a superintelligence that can figure out the right answers for us.

To encourage other researchers to chip away at such problems, OpenAI announced today that it’s offering US $10 million in grants for work on a wide variety of alignment approaches. “Historically, alignment has been more theoretical,” says Pavel Izmailov, another member of the superalignment team. “I think this is work that’s available to academics, grad students, and the machine learning community.” Some of the grants are tailored for grad students and offer both a $75,000 stipend and a $75,000 compute budget.

Burns adds: “We’re very excited about this, because I think for the first time we really have a setting where we can study this problem of aligning future superhuman models.” It may be a future problem, he says, but they can “make iterative empirical progress today.”
