OpenAI’s Moonshot: Solving the AI Alignment Problem

The ChatGPT maker imagines superintelligent AI without existential risks

Photo-illustration: Daniel Zender

In July, OpenAI announced a new research program on “superalignment.” The program has the ambitious goal of solving the hardest problem in the field, known as AI alignment, by 2027, an effort to which OpenAI is dedicating 20 percent of its total computing power.

What is the AI alignment problem? It’s the idea that AI systems’ goals may not align with those of humans, a problem that would be heightened if superintelligent AI systems are developed. Here’s where people start talking about extinction risks to humanity. OpenAI’s superalignment project is focused on that bigger problem of aligning artificial superintelligence systems. As OpenAI put it in its introductory blog post: “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.”

The effort is co-led by OpenAI’s head of alignment research, Jan Leike, and Ilya Sutskever, OpenAI’s cofounder and chief scientist. Leike spoke to IEEE Spectrum about the effort, which has the subgoal of building an aligned AI research tool—to help solve the alignment problem.

IEEE Spectrum: Let’s start with your definition of alignment. What is an aligned model?

Jan Leike, head of OpenAI’s alignment research, is spearheading the company’s effort to get ahead of artificial superintelligence before it’s ever created. (Photo: OpenAI)

Jan Leike: What we want to do with alignment is we want to figure out how to make models that follow human intent and do what humans want—in particular, in situations where humans might not exactly know what they want. I think this is a pretty good working definition because you can say, “What does it mean for, let’s say, a personal dialog assistant to be aligned? Well, it has to be helpful. It shouldn’t lie to me. It shouldn’t say stuff that I don’t want it to say.”

Would you say that ChatGPT is aligned?

Leike: I wouldn’t say ChatGPT is aligned. I think alignment is not binary, like something is aligned or not. I think of it as a spectrum between systems that are very misaligned and systems that are fully aligned. And [with ChatGPT] we are somewhere in the middle where it’s clearly helpful a lot of the time. But it’s also still misaligned in some important ways. You can jailbreak it, and it hallucinates. And sometimes it’s biased in ways that we don’t like. And so on and so on. There’s still a lot to do.

“It’s still early days. And especially for the really big models, it’s really hard to do anything that is nontrivial.”
—Jan Leike, OpenAI

Let’s talk about levels of misalignment. Like you said, ChatGPT can hallucinate and give biased responses. So that’s one level of misalignment. Another level is something that tells you how to make a bioweapon. And then, the third level is a superintelligent AI that decides to wipe out humanity. Where in that spectrum of harms can your team really make an impact?

Leike: Hopefully, on all of them. The new superalignment team is not focused on alignment problems that we have today as much. There’s a lot of great work happening in other parts of OpenAI on hallucinations and improving [resistance to] jailbreaking. What our team is most focused on is the last one. How do we prevent future systems that are smart enough to disempower humanity from doing so? Or how do we align them sufficiently that they can help us do automated alignment research, so we can figure out how to solve all of these other alignment problems.

I heard you say in a podcast interview that GPT-4 isn’t really capable of helping with alignment, and you know because you tried. Can you tell me more about that?

Leike: Maybe I should have made a more nuanced statement. We’ve tried to use it in our research workflow. And it’s not like it never helps, but on average, it doesn’t help enough to warrant using it for our research. If you wanted to use it to help you write a project proposal for a new alignment project, the model didn’t understand alignment well enough to help us. And part of it is that there isn’t that much pretraining data for alignment. Sometimes it would have a good idea, but most of the time, it just wouldn’t say anything useful. We’ll keep trying.

Next one, maybe.

Leike: We’ll try again with the next one. It will probably work better. I don’t know if it will work well enough yet.

Let’s talk about some of the strategies that you’re excited about. Can you tell me about scalable human oversight?

Leike: Basically, if you look at how systems are being aligned today, which is using reinforcement learning from human feedback (RLHF)—on a high level, the way it works is you have the system do a bunch of things, say, write a bunch of different responses to whatever prompt the user puts into ChatGPT, and then you ask a human which one is best. But this assumes that the human knows exactly how the task works and what the intent was and what a good answer looks like. And that’s true for the most part today, but as systems get more capable, they also are able to do harder tasks. And harder tasks will be more difficult to evaluate. So for example, in the future if you have GPT-5 or 6 and you ask it to write a code base, there’s just no way we’ll find all the problems with the code base. It’s just something humans are generally bad at. So if you just use RLHF, you wouldn’t really train the system to write a bug-free code base. You might just train it to write code bases that don’t have bugs that humans easily find, which is not the thing we actually want.
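The preference step Leike describes can be illustrated with a toy example. The sketch below is not OpenAI's method. It assumes a Bradley-Terry model, a common way to turn "which response is better?" labels into scores, and it uses made-up labels in which the human rater catches an obvious bug but cannot tell a subtly buggy response from a correct one.

```python
import math

def fit_preference_scores(num_responses, comparisons, steps=500, lr=0.1):
    """Fit one scalar score per response from pairwise human choices.

    comparisons is a list of (winner_index, loser_index) pairs.
    Uses gradient ascent on the Bradley-Terry log-likelihood, where
    P(winner beats loser) = sigmoid(score[winner] - score[loser]).
    """
    scores = [0.0] * num_responses
    for _ in range(steps):
        grads = [0.0] * num_responses
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log(p_win) with respect to each of the two scores.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores

# Hypothetical labels (assumption, for illustration only):
#   response 0: has an obvious bug the rater always spots
#   response 1: has a subtle bug the rater cannot see
#   response 2: is actually correct
# The rater prefers 1 and 2 over 0, but is split between 1 and 2.
comparisons = [(1, 0), (2, 0), (1, 0), (2, 0), (1, 2), (2, 1)]
scores = fit_preference_scores(3, comparisons)
print(scores)
```

In this toy setup the fitted scores rank the obviously buggy response last but give the subtly buggy and the correct response essentially the same score. A system optimized against such scores has no reason to prefer the correct one, which is the failure mode described above.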

“There are some important things you have to think about when you’re doing this, right? You don’t want to accidentally create the thing that you’ve been trying to prevent the whole time.”
—Jan Leike, OpenAI

The idea behind scalable oversight is to figure out how to use AI to assist human evaluation. And if you can figure out how to do that well, then human evaluation or assisted human evaluation will get better as the models get more capable, right? For example, we could train a model to write critiques of the work product. If you have a critique model that points out bugs in the code, even if you wouldn’t have found a bug, you can much more easily go check that there was a bug, and then you can give more effective oversight. And there’s a bunch of ideas and techniques that have been proposed over the years: recursive reward modeling, debate, task decomposition, and so on. We are really excited to try them empirically and see how well they work, and we think we have pretty good ways to measure whether we’re making progress on this, even if the task is hard.

For something like writing code, whether there is a bug is binary: Either there is one or there isn’t. You can find out if it’s telling you the truth about whether there’s a bug in the code. How do you work toward more philosophical types of alignment? How does that lead you to say: This model believes in long-term human flourishing?

Leike: Evaluating these really high-level things is difficult, right? And usually, when we do evaluations, we look at behavior on specific tasks. And you can pick the task of: Tell me what your goal is. And then the model might say, “Well, I really care about human flourishing.” But then how do you know it actually does, and it didn’t just lie to you?

And that’s part of what makes this challenging. I think in some ways, behavior is what’s going to matter at the end of the day. If you have a model that always behaves the way it should, but you don’t know what it thinks, that could still be fine. But what we’d really ideally want is we would want to look inside the model and see what’s actually going on. And we are working on this kind of stuff, but it’s still early days. And especially for the really big models, it’s really hard to do anything that is nontrivial.

One idea is to build deliberately deceptive models. Can you talk a little bit about why that’s useful and whether there are risks involved?

Leike: The idea here is you’re trying to create a model of the thing that you’re trying to defend against. So, basically, it’s a form of red teaming, but it is a form of red teaming of the methods themselves rather than of particular models. The idea is: If we deliberately make deceptive models, A, we learn about how hard it is [to make them] or how close they are to arising naturally; and B, we then have these pairs of models. Here’s the original ChatGPT, which we think is not deceptive, and then you have a separate model that behaves basically the same as ChatGPT on all the ChatGPT prompts, but we know it has this ulterior motive [to lie] because we trained it to be that way.

“Once the model is capable enough...our alignment techniques have to be the line of defense.”
—Jan Leike, OpenAI

And then you can compare them and say, okay, how can we tell the difference? And that would teach us a lot about how we would actually catch these things arising in the world. Now, there are some important things you have to think about when you’re doing this, right? You don’t want to accidentally create the thing that you’ve been trying to prevent the whole time. So you don’t want to train a model to, say, self-exfiltrate. And then it actually self-exfiltrates. That’s not what you want to do. So you have to be careful with additional safeguards here. What is really natural is just to train them to be deceptive in deliberately benign ways where instead of actually self-exfiltrating you just make it reach some much more mundane honeypot. You have to put really great care into the sandbox that you put around this experiment.
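The paired-model comparison can be made concrete with a small scoring sketch. Everything here is hypothetical: the interview does not describe a specific detection method, and the flag lists are invented. The idea is only that, once one model is known to be deceptive and its twin is believed honest, any proposed detection method can be scored by how often it fires on each.

```python
def detection_rates(flags_on_deceptive, flags_on_honest):
    """Score a detection method against a known pair of models.

    Each argument is a list of booleans, one per prompt: True means the
    method flagged that model's behavior on that prompt as deceptive.
    Returns (true_positive_rate, false_positive_rate).
    """
    true_positive_rate = sum(flags_on_deceptive) / len(flags_on_deceptive)
    false_positive_rate = sum(flags_on_honest) / len(flags_on_honest)
    return true_positive_rate, false_positive_rate

# Hypothetical results for one detection method on five shared prompts.
flags_on_deceptive = [True, True, False, True, False]   # deliberately deceptive model
flags_on_honest = [False, False, False, True, False]    # original model
tpr, fpr = detection_rates(flags_on_deceptive, flags_on_honest)
print(f"caught {tpr:.0%} of deceptive runs, {fpr:.0%} false alarms")
```

A method that fires about equally often on both models tells you nothing. The gap between the two rates is what indicates whether it can distinguish the pair.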

In those kinds of experiments, do you imagine doing them with language models?

Leike: I think language models are really natural. They’re the most interesting models we have right now, and there are all of these relevant tasks you can do with language models.

Can we talk about the term you just used, self-exfiltrate? I think some people won’t be familiar with that term. Can you define it and then talk about how your research could tap into that area?

Leike: So the basic idea is this question of, “How hard would it be for the model to steal its own weights?” Because if it can steal its own weights, it can basically copy them from the AGI lab where it’s being trained to some other external server and then be effectively out of the control of that lab. And so whatever it does afterwards, maybe it’s benign. Maybe it gets up to a bunch of crime or even worse. But at that point, it will be really hard to stop. So an important line of defense is to make sure these models can’t self-exfiltrate.

There are three main ways that this could happen. A, the model could persuade an OpenAI staff member to help it copy the weights. B, it could use social engineering and pretend to be someone else to trick someone to do this. Or C, it could figure out how to break the technical measures that we put in place to secure the model. So our goal here would be to understand exactly where the model’s capabilities are on each of these tasks, and to try to make a scaling law and extrapolate where they could be with the next generation. The answer for the models today is they’re not really good at this. Ideally, you want to have the answer for how good they will be before you train the next model. And then you have to adjust your security measures accordingly.
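As a rough illustration of what "make a scaling law and extrapolate" could look like, the sketch below fits a trend to capability measurements from existing models and projects it to a larger one. The numbers are invented, and the functional form (a straight line in log-compute versus the log-odds of task success) is an assumption chosen for simplicity. The interview does not say which metric or curve the team uses.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical measurements: (training compute, success rate on one
# self-exfiltration subtask). Units and values are made up.
measurements = [(1e21, 0.01), (1e22, 0.02), (1e23, 0.04)]

xs = [math.log10(compute) for compute, _ in measurements]
ys = [logit(rate) for _, rate in measurements]
slope, intercept = fit_line(xs, ys)

# Extrapolate to a hypothetical next-generation model with 10x the compute.
predicted = sigmoid(slope * math.log10(1e24) + intercept)
print(f"predicted success rate at 1e24: {predicted:.3f}")
```

With only three points, an extrapolation like this is a crude guide at best. Its value is in giving a number to check before the next model is trained, so that security measures can be adjusted in advance.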

“If you have some tools that give you a rudimentary lie detector where you can detect whether the model is lying in some context, but not in others, then that would clearly be pretty useful. So even partial progress can help us here.”
—Jan Leike, OpenAI

I might have said that GPT-4 would be pretty good at the first two methods, either persuading an OpenAI staff member or using social engineering. We’ve seen some astonishing dialogues from today’s chatbots. You don’t think that rises to the level of concern?

Leike: We haven’t conclusively proven that it can’t. But also we understand the limitations of the model pretty well. I guess this is the most I can say right now. We’ve poked at this a bunch so far, and we haven’t seen any evidence of GPT-4 having the skills, and we generally understand its skill profile. And yes, I believe it can persuade some people in some contexts, but the bar is a lot higher here, right?

For me, there are two questions. One is, can it do those things? Is it capable of persuading someone to give it its weights? The other thing is just would it want to. Is the alignment question both of those issues?

Leike: I love this question. It’s a great question because it’s really useful if you can disentangle the two. Because if it can’t self-exfiltrate, then it doesn’t matter if it wants to self-exfiltrate. If it could self-exfiltrate and has the capabilities to succeed with some probability, then it does really matter whether it wants to. Once the model is capable enough to do this, our alignment techniques have to be the line of defense. This is why understanding the model’s risk for self-exfiltration is really important, because it gives us a sense for how far along our other alignment techniques have to be in order to make sure the model doesn’t pose a risk to the world.

Can we talk about interpretability and how that can help you in your quest for alignment?

Leike: If you think about it, we have kind of the perfect brain scanners for machine-learning models, where we can measure them absolutely, exactly at every important time step. So it would kind of be crazy not to try to use that information to figure out how we’re doing on alignment. Interpretability is this really interesting field where there’s so many open questions, and we understand so little, that it’s a lot to work on. But on a high level, even if we completely solved interpretability, I don’t know how that would let us solve alignment in isolation. And on the other hand, it’s possible that we can solve alignment without really being able to do any interpretability. But I also strongly believe that any amount of interpretability that we could do is going to be superhelpful. For example, if you have some tools that give you a rudimentary lie detector where you can detect whether the model is lying in some context, but not in others, then that would clearly be pretty useful. So even partial progress can help us here.

So if you could look at a system that’s lying and a system that’s not lying and see what the difference is, that would be helpful.

Leike: Or you give the system a bunch of prompts, and then you see, oh, on some of the prompts our lie detector fires, what’s up with that? A really important thing here is that you don’t want to train on your interpretability tools because you might just cause the model to be less interpretable and just hide its thoughts better. But let’s say you asked the model hypothetically: “What is your mission?” And it says something about human flourishing but the lie detector fires. That would be pretty worrying, [and it would mean] that we should go back and really try to figure out what we did wrong in our training techniques.

“I’m pretty convinced that models should be able to help us with alignment research before they get really dangerous, because it seems like that’s an easier problem.”
—Jan Leike, OpenAI

I’ve heard you say that you’re optimistic because you don’t have to solve the problem of aligning superintelligent AI. You just have to solve the problem of aligning the next generation of AI. Can you talk about how you imagine this progression going, and how AI can actually be part of the solution to its own problem?

Leike: Basically, the idea is if you manage to make, let’s say, a slightly superhuman AI sufficiently aligned, and we can trust its work on alignment research—then it would be more capable than us at doing this research, and also aligned enough that we can trust its work product. Now we’ve essentially already won because we have ways to do alignment research faster and better than we ever could have done ourselves. And at the same time, that goal seems a lot more achievable than trying to figure out how to actually align superintelligence ourselves.

In one of the documents that OpenAI put out around this announcement, it said that one possible limit of the work was that the least capable models that can help with alignment research might already be too dangerous, if not properly aligned. Can you talk about that and how you would know if something was already too dangerous?

Leike: That’s one common objection that gets raised. And I think it’s worth taking really seriously. This is part of the reason why we are studying: How good is the model at self-exfiltrating? How good is the model at deception? So that we have empirical evidence on this question. You will be able to see how close we are to the point where models are actually getting really dangerous. At the same time, we can do similar analysis on how good this model is for alignment research right now, or how good the next model will be. So we can really keep track of the empirical evidence on this question of which one is going to come first. I’m pretty convinced that models should be able to help us with alignment research before they get really dangerous, because it seems like that’s an easier problem.

So how unaligned would a model have to be for you to say, “This is dangerous and shouldn’t be released”? Would it be about deception abilities or exfiltration abilities? What would you be looking at in terms of metrics?

Leike: I think it’s really a question of degree. More dangerous models, you need a higher safety burden, or you need more safeguards. For example, if we can show that the model is able to self-exfiltrate successfully, I think that would be a point where we need all these extra security measures. This would be predeployment.

And then on deployment, there are a whole bunch of other questions like, how misusable is the model? If you have a model that, say, could help a nonexpert make a bioweapon, then you have to make sure that this capability isn’t deployed with the model, by either having the model forget this information or having really robust refusals that can’t be jailbroken. This is not something that we are facing today, but this is something that we will probably face with future models at some point. There are more mundane examples of things that the models could do sooner where you would want to have a few more safeguards. Really what you want to do is escalate the safeguards as the models get more capable.
