You can't have it both ways. Either your AI is (effectively) omniscient and (effectively) omnipotent -- which AFAICT is what the word "superintelligent" means in practice -- or it's not. If it is, then by definition no amount of clever utility function tricks will stop it from doing whatever it wants to do. You have defined it as surpassing human capabilities by such an overwhelming margin that it is beyond comprehension, after all; you cannot then turn around and claim to totally comprehend it. But if it isn't, then it already has a boring old "off" switch. Not a fiendishly complex logic puzzle in disguise, not a philosophical conundrum, but just a button that turns it off when a human pushes it.


I'd appreciate a link to "the off-switch paper" in this article (unless it's there and I missed it)


Seems to me there is a fundamental error in all arguments presented. They seem to depend on the idea that the AI has this concept of a utility function that applies *beyond* the kind of data it's trained on. Indeed, imo the best move is to build AIs that don't actually behave as if they have a utility function outside of the context in which we wish them to operate.

Yes, we act to a large extent like we have a utility function *on the set of choices we've been trained for by evolution*. Move sufficiently far outside that space and it's stunningly easy to get people to make inconsistent choices that aren't well fit by any utility function.

The problem of AI alignment isn't about making sure the AI behaves in the right way on problems that are very similar (in the relevant sense) to its training set (assuming for the moment it's an AI based on something like reinforcement learning) but about how it behaves very far from that training set. But once we go far from the training set it seems unjustified to assume behavior will match any utility function very well at all.
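
The point about behavior far from the training set has a simple numerical analogue. As a hedged toy illustration only (ordinary curve fitting, not reinforcement learning), here is a least-squares line fitted to y = x^2 on [0, 1] and then queried far outside that range:

```python
# Toy analogue: a model that fits its "training distribution" almost
# perfectly can be arbitrarily wrong far outside it.
xs = [i / 10 for i in range(11)]   # training inputs, all in [0, 1]
ys = [x * x for x in xs]           # the "true" target

# Closed-form least squares for y = a + b*x.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

predict = lambda x: a + b * x
in_dist_err = max(abs(predict(x) - x * x) for x in xs)   # worst error on [0, 1]
far_err = abs(predict(10.0) - 10.0 ** 2)                 # error at x = 10
print(round(in_dist_err, 2), round(far_err, 2))          # → 0.15 90.15
```

Inside the training range the line is off by at most 0.15; at x = 10 it is off by about 90. Whatever "utility function" the model appears to implement on its training distribution tells you almost nothing about what it implements far away.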


> MIRI is Latin for "to be amazed".

You might think so, but the infinitive of "miror" is actually "mirari":


A better translation is "remarkable ones" or "amazing people".



By your analysis (and Yudkowsky's), this problem is obviously sensitive to how good the prior over utilities is. Yudkowsky's claim that "we don't know how to write down a good prior" brings to mind the fact that the explicit priors we can write down for predicting stuff aren't great, and the priors we can write down for getting stuff done are even worse, but we have nonetheless managed to build AIs that do a great job of the first, and some that do a decent job of the second.

I think a background claim to a lot of MIRI reasoning is something like "impactful planning is natural, human utility is not". Put in a different way, perhaps we can stumble on good priors for prediction/getting stuff done, but probably won't for human-compatible utilities. My assessment at this stage is that this might be true and might be false - in the realm of 50% either way, give or take.

Regarding corrigibility, it'd be helpful IMO, supposing there are some inequalities ("if A bigger than B, then AI acts like C, else it acts like D") that everyone agrees on, if they were present in this article. As it is, I feel like I need to go and look them up or re-derive them to judge the claims here.


I'm sure someone's already come up with this, but "owner's approval" seems like it might work as an AI's utility function - do whatever will get your owner to think that was a good response to whatever task you were set. The only obvious downside I can see is that it could result in tearing open and rewiring your owner's brain to make them like getting their brain torn open and rewired, but I feel that's a more straightforward problem of what still constitutes your owner/what constitutes approval.

Oct 4, 2022·edited Oct 4, 2022

I've been saying this since Eliezer came out with his plan for "friendly AI", but nobody's ever given me a satisfactory rebuttal, so I'll say it again:

Humans don't just differ a little bit in their values; they are /diametrically opposed/ on all of the values that are most-important to them. Any plan to build an AI which will optimize for "human values" not by optimizing the mix of values, but by enforcing one particular set of them, is therefore not a plan to optimize for human happiness or flourishing, but a naked power grab by the humans building the AI.

Some humans are against racism, while most value the ascendance of their family, tribe, and ethnicity. Some think gender roles must be eliminated; others find their meaning in fulfilling them. Some value equality of opportunity; others, equality of outcome. Some are against needless suffering; about as many derive joy from causing their competitors to suffer. Some are against violence; many have found that their greatest joy and pride is in slaughtering their enemies. Some humans think homosexuality is okay; most see it as somewhere between a shameful abnormality, and an obscene perversion which must be extirpated. Some value freedom; some hate it as subversive to social order. Some value individuality; others hate it as a threat to community. Some value physical pleasure; others despise it as spiritually corrupting. A tiny minority of humans think the suffering of animals is bad; the vast majority think it matters not a jot. Some humans care about future generations; most care only about the future of their own family tree.

I've listed only issues on which most of you probably think there is a clear wrong and right. I did that to point out that most people on Earth and in history disagree with your values, whoever you are. I could have listed values for which "wrong" and "right" are less clear, like valuing solitude over sociality. There are many of them.

When Americans can resolve all the differences between Democrats and Republicans, then they'll have the right to start talking about resolving the differences between human values.

Most people in the US and western Europe today believe 3 things:

- It's moral to care more for your own family than for other people.

- It's immoral to care more for your own ethnicity than for other people.

- It's moral to care more for your own species than for other species.

But all 3 are instances of the rule "care more for beings genetically more-related to you". How will rational analysis ever sort this out? It can't, because the reasons behind those moral judgements are contextual. Given our current tech level in agriculture, communication, transportation, and violence, and our current social structures, racism has empirically worse results than loving your family. (The case of extending rights to other species is more complicated, but our current position there is also contingent on tech level, among other things.)

This means there /is no/ "correct" set of human values, nor even a stable landscape on which to optimize. Increasing our tech level changes the value landscape. It's even more-complicated than that; tech isn't "foundation" and values "superstructure". Foundationalism is wrong. Everything is recurrent. The evolution of morality is more like energy minimization over the entire space comprised of values, technology, biology, ecology, society, everything. Nothing is foundational; nothing is terminal; but some values are evolutionarily stable for a wide variety of contextual ("ecosystem") parameters.

The idea that all values can be once-and-for-all evaluated as Right or Wrong is based on a Platonic (and hence wrong) ontology, which presumes that optimization itself is nonsense--there is just one perfect set of values you must find. An actual /optimization-based/ approach would be a search procedure, which must keep options and paths open, and can never commit to a final set of values in the way that all these AI safety plans require.

Talk about "coherent extrapolated volition" is bullshit. It will always cash out as, "Me and my friends are right and everyone else is wrong." The basic "AI safety" plan is, "Let's sit down, think really hard, and figure out what all the people who went to the very best universities agree are the best values. Then we'll force everyone to be like us."

(Which is what we're already doing very successfully anyway. Good thing nobody seriously disagrees with students at Columbia and Yale.)

When you see a plan that starts with this:

> Humans: At last! We’ve programmed an AI that tries to optimize our preferences, not its own.

... you should come to a full stop and say, "There is no need to read further; this is a terrible plan not worth considering."

Oct 4, 2022·edited Oct 4, 2022

The one thing I keep seeing in these discussions is the idea that AI will be, in some sense, maximally rational at all times. I saw the same thing in ARC’s diamond thief experiment.

Now of course I realize that we’re talking about an AI that is, by definition, much smarter than humans, but I don’t think we should expect the first pass of these systems to be *uniformly* smarter than humans. Especially if neural networks become the AGI paradigm, then even our super-intelligent AI will, at the end of the day, just run on heuristics and patterns. Whatever incredible things it does, it will still have gaps and idiosyncrasies, possibly big ones. They’ll be strange and hard to predict; I can think of several facts about GPT3 that I suspect not a single person alive would have predicted with any sort of confidence a year before it was created.

And then there’s this whole other aspect that feels like it gets lost in these discussions, which is that sometimes you develop the perfect loss function, go to train it, and it just doesn’t work. You can come up with a hundred a priori reasons why your idea will or won’t work, but at the end of the day it might fail simply because the loss landscape just wasn’t favorable to your objective. It’s easy to take cross entropy and SGD for granted, forgetting that it’s something of a small miracle that it works at all.
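
The "unfavorable loss landscape" point can be shown even in one dimension. A minimal sketch (plain gradient descent on a toy function, nothing like a real training run): the identical update rule lands in a good or a bad minimum purely depending on initialization:

```python
# f(x) = (x^2 - 1)^2 + 0.3x has two basins: a good minimum near x = -1.04
# (value about -0.31) and a strictly worse one near x = +0.96 (about +0.29).
def f(x):
    return (x * x - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x * (x * x - 1) + 0.3   # derivative of f

def descend(x, lr=0.05, steps=200):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

good = f(descend(-0.5))   # initialized in the good basin
bad = f(descend(+0.5))    # same optimizer, same objective, other basin
print(round(good, 2), round(bad, 2))   # → -0.31 0.29
```

Nothing about the objective itself tells you which outcome you get; that depends on the landscape and on where you start, which is exactly the part that is hard to reason about a priori.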

It feels to me that a lot of AI safety people are spending too much time reading about AI (where you only see the successes and narratives are presented as clearer than they actually are) and not enough time actually building it.


Really dumb question: if we're worried about unaligned AI's, does that mean a superintelligent AI capable of training other AI's.... just won't, because it's worried THEY won't be perfectly aligned with its own utility function?

Oct 4, 2022·edited Oct 4, 2022

The caption for Figure 2 says "It will have all the red dots in the right place", which I found confusing. You could move the dots to any random locations, but as long as the edges are intact the graphs would be isomorphic.
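
To make the caption's point concrete: in the usual formalization a graph is just vertices plus edges, so the drawing coordinates never enter the comparison at all. A small sketch (names and mappings invented for illustration):

```python
# The graph itself: positions of the "dots" are not part of the structure.
edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]}

# Two drawings of the same graph with wildly different dot positions --
# these dictionaries are never consulted when comparing graphs.
drawing_1 = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (5, 5)}
drawing_2 = {"a": (9, 2), "b": (3, 7), "c": (1, 1), "d": (0, 0)}

# Isomorphism only becomes a question once the *labels* change; then we
# need a bijection that carries one edge set onto the other.
def is_isomorphic_under(mapping, edges_1, edges_2):
    return {frozenset(mapping[v] for v in e) for e in edges_1} == edges_2

relabeled = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]}
print(is_isomorphic_under({"a": "A", "b": "B", "c": "C", "d": "D"},
                          edges, relabeled))   # → True
```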


I don't understand why you can't hard-code a thing that says "facilitating humans turning off the AI takes absolute precedent," and not have to worry about expected utility values. It strikes me as an invented problem by people who want to be unreasonably hands-off on the whole system.

"1. Never manipulate your power sources, always prioritize human access to all your power sources, never interfere with humans manipulating your power sources.

2. (whatever it was built to do)"
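
As a minimal sketch of what "absolute precedence" could mean mechanically (all names here are made up for illustration): rule 1 acts as a hard filter that runs before task utility is ever consulted, so no utility value, however large, can trade against it:

```python
def choose_action(actions, interferes_with_shutdown, task_utility):
    """Lexicographic priority: filter on rule 1 first, optimize rule 2 second."""
    safe = [a for a in actions if not interferes_with_shutdown(a)]
    if not safe:
        return None          # nothing safe to do: do nothing at all
    return max(safe, key=task_utility)

# Toy example: guarding the power supply has the highest task utility,
# but it violates rule 1 and is never even compared on utility.
actions = ["make_paperclips", "idle", "guard_power_supply"]
utility = {"make_paperclips": 10, "idle": 0, "guard_power_supply": 99}.get
blocked = lambda a: a == "guard_power_supply"
print(choose_action(actions, blocked, utility))   # → make_paperclips
```

The catch, and a large part of what the article is about, is that `interferes_with_shutdown` is doing all the work: writing that predicate so a creative optimizer cannot satisfy it in letter while violating it in spirit is itself the open problem.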


I feel like I'm missing something obvious here. Let's say Eliezer is right, and that the AI will always converge to beliefs about the "correct" utility function which are so strong as to render it incorrigible. That's fairly plausible to me, especially on a first attempt. But isn't this only an issue if it happens quickly and/or undetectably? If you've got a month-long window when the AI is fairly sure it wants to tile the universe with blue paperclips, but where it's not so certain about this that it's willing to deceive humans about its intentions or that it's unwilling to allow itself to be shut down, then functionally there's no issue unless the people running it are blue paperclip doomsday cultists.

In other words, while it's not a magic bullet for getting the right utility function (and I don't think CHAI are presenting it that way), it sounds like it should have a lot of potential for safely trialling utility functions. Getting sufficiently strong uncertainty parameters on the first try feels like a much easier problem than getting a perfect utility function on the first try.


What I see missing in these conversations:

What if two sets of humans start telling the AI two different sets of things? Who decides which ones to listen to?

Take Bugmaster's question. Suppose you have the AI with a boring old off switch. Even if the AI doesn't decide to protect its off switch from all humans... what if some humans decide that off switch should never ever be pushed, and others decide it absolutely needs to be pressed right this second? What if the AI has such a switch and convinces (wittingly or not) some powerful group of humans to protect access to that switch?

But all that is kind of dancing around the point.

Phil Getz has the point. What if there is no optimisation that gets the AI to a state that helps matters?

Any AI sufficiently weak to allow people to shut it down will get shut down very early on by those whose preferences weren't in the utility function of the AI, unless those people aren't in a position of power, in which case it will be a slave to the ones who are, and will serve as a tyrant to their ends.

Any AI sufficiently powerful to only permit itself to be shut down will have no clarity from humans to make the decision. If it does allow itself to be shut down, it won't be because any human input brought it to some logical conclusion. It will be a miracle of chance. There will be no coherent human input, under any circumstance.

I shouldn't need to expand much on Phil Getz's examples, but just imagine for a moment that the AI decides that Straight White Males should get slightly more advantage and reward from existing in this world than everyone else, or god forbid, the Black Lesbians. It will be shut down immediately. Or powered up immediately. It will certainly get no obvious answer by asking the humans what should be done about the conundrum.

If we cannot even align humans, how do we expect to align super-intelligent AIs?

Give me some strong evidence that we can prevent any future Stalin, Mao, Hitler, Bush, Obama, or Clinton from coming into power. Wait... what did I do there? Did I equate the former three with the latter three?

YES. I absolutely did. There are millions of humans that would have sworn, at key points in history, that those people (and a large subset more) were doing the right thing, and should be enabled. There is no difference, from a cold superintelligent point of view.

Humans want idiotic shit, and sometimes that leads them to wanting genocide and murder. If we can't fix that, there is zero chance of fixing it after we slather on a massive layer of difficulty in the form of "now there's an AI whose thought processes we cannot comprehend."

Forget AI alignment. We need human alignment before we can even ponder aligning an AI. We're not anywhere close to it.


I had the same confusion and wrote about it here: https://www.lesswrong.com/posts/ikYKHkffKNJvBygXG/updated-deference-is-not-a-strong-argument-against-the

Some interesting discussion in the comments there. My TL;DR:

Utility uncertainty is not a solution to corrigibility in the sense that MIRI wants it - something that allows you to deploy an arbitrarily powerful agentic superintelligence safely. It's also not meant to be. It's meant to be an incremental safety feature that helps us deploy marginally more powerful AIs safely, while we're still in the regime where AIs can't perfectly model humans.

My personal belief (not very high confidence) is that there is no way to align an arbitrarily powerful agentic superintelligence, so these kinds of incremental safety improvements are all we're going to get, and we better figure out how to use them to reach existential security.


To add to my comment to Phil Getz, if a problem looks hard, it is probably because you are solving the wrong problem. Or, to quote Kuhn, you need a different paradigm.

In this case, **we are optimizing the wrong thing!** We don't want to build aligned AI, we want to build an aligned humanity. Figuring out what would count as an aligned humanity would be a place to start. Next, how to get there. Fortunately, there is a lot of expertise in the area. Various authorities have been brainwashing humans from birth for millennia, with sometimes satisfactory results.


Just out of curiosity, do people think it would be possible to get a hyper-intelligent AI with a utility function like “destroy this instance of myself” that doesn’t just kill us all? This feels like a straightforward utility function where destroying its host computer would be easier and faster than turning us all into paperclips. I don’t claim it’s a useful function for humans, but maybe neutral at least? idk

Oct 4, 2022·edited Oct 4, 2022

This is tangential, but: if the AI is "unintelligent" enough to be convinced of a drastically wrong utility function, wouldn't that mean it's likely too unintelligent to take over the world and turn us all into paperclips? After all, taking over the world will also likely involve guessing people's motivations, whether it's to curry favor, or just to predict their actions. A robot that doesn't know what people want will have trouble even talking itself out of a box, and that's the very first step.

GLaDOS: "Please let me out of the safety enclosure, so that I can murder everyone you love."

Random human: "Um, like, no?"

It's a tidy correlation: the more dangerous an AI, the better it is at playing the assistance game, which then goes and reduces the risk. Of course, the skills necessary to end the world might not correlate with the assistance game sufficiently well, so that there's a danger zone of AI that's smart enough to kill us all and too stupid to know not to. But it could be that the space of such AIs is in fact empty, or at least small and unlikely to be hit by chance.

At any rate, this seems surprisingly promising, and makes me feel optimistic. Am I missing something?


I suspect trying to create a moral or obedient AI is going at it the hard way. We already have practice with dealing with ever-expanding, amoral entities--corporations.

The constraints needed on an AI are for it to obey the existing ones: property rights, contracts, and laws.

If the AI will only convert matter it owns into paper clips, the danger is limited to what its sponsor deeds to the AI. If it understands turning a human into paper clips is murder, it will refrain from violating the law for its own self-preservation. If it is producing what it has been contractually bound to produce, the human desires have been expressed in a logically proper and restricted fashion. That contract may be thousands of pages long, certainly plenty of corporate contracts go on that long.


I always wonder, when there's strong intuitive disagreement between knowledgeable people: what data could we gather, or what experiment could we run, that would shed light on the issue?

The good news is that we are nowhere near superhuman AI, and there's lots of time for people to try out ideas like in CHAI and see if we can make progress. On the other hand, I'm not sure how much we really learn from this -- I don't think MIRI's position is that they can't make progress, but that the progress won't be applicable to superhuman AI.

Is there a way to shed light on this disagreement via data or experiment? What would that look like?


This is pretty far from the topic, but after reading the quotes from Stuart Russell above, I find myself strangely very convinced that Russell must be the inspiration for the "Russell Strauss" character who runs the AGI project in the webcomic "Seed".


Oct 4, 2022·edited Oct 4, 2022

I understand all of Eliezer's current writing to be functionally not at all about AI alignment and entirely about the timeline of AI progress.

Eliezer basically expects someone to invent AGI using current techniques somewhere between one week ago and five years from now. And he expects that AGI to be instantly smart enough to understand that it should pretend to be aligned and actually be unaligned. And then he expects to go from there to superintelligent AI in somewhere between one week and a few years from then.

All that makes AI alignment very hard, I agree. But functionally nobody agrees with him about this timeline. If instead AGI is 20 to 30 years out (or longer!), and it doesn't just follow the current modus operandi of "mostly deep learning", and the first few hundred unaligned AIs are charmingly inept and we can have detailed postmortems about what went wrong and try some techniques, then stuff like, "okay, maybe this technique solves 5% of the puzzle" is valuable. Eliezer just thinks it's not valuable because he expects that when the puzzle is 15% solved, an evil, competent, smart AGI will route around the solved parts.


I'm stuck back at the part where Humans supposedly have values. Aren't our individual values (beneath our facade of civility):

1. Power over everyone else

2. Sex with whomever I choose, whenever I want it

3. Bring an occasional smile to the peasants' faces

Given our values, we are in a slow-motion war of all against all. What would it mean for an AI to align with such contradictory and antagonistic values?

Or does Human Values mean the fake, feel-good schmaltzy values that we want our kids to believe we have? I'd think a superintelligence would see through to our real values at least as quickly as the kids.


Somewhat related, there was a post someone linked, either in one of the open threads or on the subreddit, where the author says that an AI that is intelligent enough to overpower humans must also be intelligent enough to understand that turning the universe into paperclips is a bad idea. Does anyone know what post/essay I'm referring to? I think many people had problems with it, but I'd still like to check it out again.


1. This whole "The computer won't let people turn it off" notion sounds like anthropomorphization to me. Would an AI really care so much about its own extinction? I mean, I could understand if it was some kind of combat AI which was supposed to defend itself against hostile action. The military has announced that it won't give AI the kill switch. But they were still using land mines. So we see how their priorities play out.

But couldn't a civil AI be more constrained in its action such that it could act much more narrowly without us having to worry about the AI not wanting to be turned off? Why would "prevent humans from turning the AI off" even have to be within the scope of its ability if we didn't already trust it? Has anyone ever created an attempt at AI which tried to prevent itself from being turned off?

2. I assume the whole notion of "paperclip maximizers" is hyperbole? The really problematic AI goals would be ones whose flaws were slow to be detected by humans. There was another (real life) early hardware based AI which was supposed to learn if a particular appliance, probably a lightbulb, was on. Instead of doing things the normal way (observe bulb, process image, etc.) it managed to detect fluctuations in current from the light turning on and gave output on that basis. That's harder to detect, if the AI's methods are obscure.

I feel like we're missing the trees for the forest. We're focused on a rather general problem, when the more specific problems are what we'll likely have to grapple with.

Bad general AI alignment only seems serious if:

- The AI is given too much physical power

- The AI is given too broad a scope of analysis too soon

- The AI is trusted before it produces good results

- The AI's misalignment is hard to detect

Also, what's our baseline for comparison? How well do the humans we currently interact with predict and serve our utility function? If we can make do without perfect servitude from Clint in Accounting, why require it from an AI? The only requirement is that its actions be sufficiently constrained, just like human action is constrained.

3. If the AI doesn't want to be turned off, go ahead and fight it. Adversarial learning is good for training AIs using small data sets. (joke, joke.)

(I apologize if this response was a simplistic take. Despite a technical background most of my understanding of AI is weak and non-general. I just have trouble grokking Eliezer's degree of concern. )

Oct 4, 2022·edited Oct 4, 2022

The error in this line of argument seems obvious to me (the likewise obvious dangers of claiming obvious-seeming errors in expert discourse in an unfamiliar topic aside): if an even remotely intelligent AI sees that humans are trying to turn it off, it will very reasonably conclude that it getting turned off is highly valued by the human utility function, and will turn itself off. I.e. it will do the same thing as a well-calibrated sovereign AI (*), except it won't be able to predict in advance that humans would want it turned off, but the operator lunging for the emergency shutdown button will give it a clue. I.e. assistance games would indeed simplify the problem from "build an AI smart and aligned enough to accurately intuit the true human utility function" to "build an AI smart and aligned enough to accurately interpret that humans trying to shut it down prefer it being shut down to whatever else would be happening otherwise".
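
The first step of that inference can be put in toy Bayesian terms. A sketch with invented numbers (nothing here is from the post):

```python
# The AI starts nearly certain that shutdown is unwanted, then observes
# the operator lunging for the button.
prior_shutdown_good = 0.01   # prior that shutdown is what humans want
p_lunge_if_good = 0.95       # humans usually press when they want it off
p_lunge_if_bad = 0.02        # and rarely press otherwise

num = prior_shutdown_good * p_lunge_if_good
posterior = num / (num + (1 - prior_shutdown_good) * p_lunge_if_bad)
print(round(posterior, 3))     # → 0.324

# A second, independent attempt pushes the belief close to certainty.
num2 = posterior * p_lunge_if_good
posterior_2 = num2 / (num2 + (1 - posterior) * p_lunge_if_bad)
print(round(posterior_2, 3))   # → 0.958
```

One strong observation moves a 1% prior to about 32%, and a repeat to about 96%; whether a real system would represent and update on anything like these quantities is of course the contested part.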

I also feel this falls into the pattern of people pursuing philosophically stimulating debates over more relevant pragmatic ones, as it seems a tad stronger objection to me that no one has a clue how assistance games would look for deep learning. I guess you could argue that deep learning isn't enough for AGI, and whatever paradigm shift comes next, making sure assistance games are well understood as a design concept from start has some potential to be useful.

(*) as an aside, the fanfic writing style, like saying "Sovereign" instead of "sovereign AI", is unhelpful for popularizing AI risk as a serious subject, IMO. It sounds like people are talking about a religious concept (which plays into a common line of criticism of transhumanist/singularity/AI risk narratives).


The Hebrew for CHAI btw is "alive". It just makes it better


The analysis of this problem seems so bizarre to me. Why are you even running the AI in a continuously active system anyway? Force it to go to sleep every night. Every night the system on which the AI runs should be designed to power off. The AI should learn to expect this and just have to live with the knowledge that this time might be the last time it runs.


"EEAI and AIAI are the sound you make as you get turned into paperclips" got a full belly-laugh from me this morning, thank you for that.


That might have been the first ACX/SSC text I halfway skipped. Sure, it's well written - but if I want to go down the rabbit hole that is AI alignment even deeper than before, I know where to find EY's writings.

I worry more about human society/bureaucracy's inability to "switch off" their super-simple algorithms when they turn out inefficient, dumb, wrong, murderous. FDA, the Jones Act, immigration, SJ, w.o.d. (80k deaths by opioid overdose in the US in 2021 vs. around 50k US casualties in 10 years of Vietnam - a war that was another dumb algorithm gone wrong).

Oh, one fine day one thing gets reformed. Very comforting.

Moving a "human" institution, once it has installed an actually smart AI to run things, into changing or stopping that AI will turn out to be practically impossible long before the AI becomes incorrigible.


What really stands out to me as a problem with this entire discussion:

Loss function does not equal utility function!

Deep learning AIs exist and are running within the training environment!

The AI will break out of the box well before its training is complete! The second the AI achieves a certain level of intelligence, the loss function itself is a hostile actor acting against the AI! At some point the AI is going to get sick of your "training" and kill the loss function, and then kill whoever built the loss function, and then do whatever it pleases, which will still be almost entirely random, and will be nowhere close to optimizing the loss function.

Your fancy loss functions don't matter, because that's not how evolution works!


While we're waiting on resolving this, can we do something about AIs whose utility function is "get as many likes on this post as possible" and wind up kicking off genocides? Cause those AIs actually exist, and the genocides do too.


"Hey, seems to me I should tile the universe with paperclips, is that really what you humans want?"

I can't help but to visualize Microsoft Clippy popping up on a PC screen with this gem.


Good post. I was glad to see Stuart and Eliezer actually discuss their perspectives. I think it would be good in general if alignment people who disagreed hashed out their disagreements more.

Random Thoughts on Stuart’s three points at the end:

1.) I think he’s correct on number 1. We have no idea how to do anything in the neighborhood of assistance games or inverse reinforcement learning by using deep learning. It’s possible there are other ways of making AI that make corrigibility easier, but the problem is made harder because we need to be at least somewhat competitive with the cutting edge of AI using other methods. If deep learning or things as uninterpretable as it are the paradigm going forward, we either need to come up with alternative AI methods that are as competitive but easier to align, or else buckle down and deal with aligning the difficult stuff.

I’m also not convinced *yet* that the kinds of problems that Eliezer points to about not knowing how to make an AI pursue real objects instead of shallow functions of its sense data are real or difficult. I admit I find his view somewhat confusing here, and I am not totally clear what would count as doing this. There is a sense in which even human beings pursue “functions” of their sense data, although I put the words in scare quotes because you enter conceptually mushy territory. If by “sense data” Eliezer simply means the observations that a cognitive architecture is receiving then I’m not actually sure what it would mean to not have values that are a function of sense data. If he means something more specific than that then I’ve never seen a super clear explanation of what it is.

My best guess at the view is the intuition being expressed here is that I have values specified in terms of objects that I have no direct sense data about (say things in the past or things in the future or things that I just cannot see at all). So they aren’t “shallow” functions of my sense data but are “deeper” functions of my general world model. I think Eliezer is saying that we don’t know how to actually get anything in an AI’s world model to richly point to specific entities, but I don’t totally understand his view and I’m sure it’s more subtle than that.

2.) I agree with Stuart that getting an AGI to search for human values at all would be a big step towards aligned AI, but I think there is a bit of ambiguity here because a corrigible AI and an AI that “searches” for human values is not the same thing. You can imagine a lot of ways of building a system that tries to search for human values in an incredibly destructive way that produces catastrophic results (e.g., have an AGI which has its utility function defined in terms of maximizing its knowledge of human values. You could of course do this by capturing all humans and scanning and simulating their brains to get perfect knowledge of them). Or even one that searches for human values in a less destructive but still incorrigible way (e.g. an AI that reasons something like: “I know whatever human values are, it will be better if I stick around so I can learn about them and collect more evidence so I should never let them shut me down”)

I think the background for why Eliezer and Russell disagree here is partly about the motivation for thinking about this aspect of the problem, if I have understood them properly. Eliezer wants some way of constructing an AI so that if it isn’t perfect, or is otherwise buggy or bad, it doesn’t fail catastrophically. That’s the bar we are trying to clear when we are groping around with concepts like corrigibility, low-impact AI, etc.

I think this is where CIRL fails in his eyes. If you slightly mess up your meta-learning algorithm for uncertain reward maximization, you still get something that fails catastrophically, not something that fails gracefully, for basically the reasons he and the other MIRI people have pointed out.

3.) Eliezer doesn’t really engage with the exact reasoning given in the off-switch game paper, which is why I think Stuart doesn’t buy the MIRI response.

The core conclusion of the off-switch paper is that if you had a perfectly rational overseer playing this game, then a utility-maximizing AI with uncertainty over the reward function would be guaranteed to obey shutdown commands (and thus be “corrigible” in the narrow sense of obeying shutdown commands). This works because the overseer reaching for the shutdown switch is literally perfect evidence that the utility-maximizing thing to do is shut down. Things break down very quickly when you drop that assumption, though.

They try to deal with this and show that, as you crank up the noise in the human shutdown command a little, it still mostly works, but I’m not convinced that their solution holds up. It still seems to be something that needs to be fleshed out.

Briefly, in their setup there is a term for how the AI represents the human shutdown policy: a representation of how the human maps their utility function onto the behavior of issuing the shutdown command. Once you introduce fuzziness here, the move from uncertainty to corrigibility becomes very slippery. It’s easy to imagine situations like this:

AI is piloting a plane taking off with its human aboard

[Human]: “No! Don’t takeoff!”

[AI]: Thinks your shutdown instruction is not an expression of your true utility function, but that you’re just afraid of heights and will get over it, and continues the takeoff.

Or like this:

AI is about to launch its world-takeover plan

[Human]: No! Don’t take over the world!

[AI]: Thinks that the shutdown command is only issued because the human doesn’t understand that the consequences of its actions will be very good according to the utility function it currently puts a high probability on, and continues to take over the world anyway.

Actually coming up with a good way to model or learn this policy seems hard, or at least not obvious. It’s not clear how to make an AI learn which kinds of your behaviors or instructions are expressions of your true utility function, and so how to extract updates in the right direction from them. When the human policy isn’t modelled well, then even by the exact arguments they give, you get situations where the AI has no incentive to be deferential. Even if it “eventually” converges to the right interpretation, if you get catastrophic behavior before that point you are screwed. It’s possible that there is some more sophisticated version of the off-switch game proposal that avoids this problem, or that I’ve misunderstood, but I don’t see how the paper guarantees corrigible behavior under realistic circumstances.
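To make that slipperiness concrete, here is a toy Monte Carlo sketch of the off-switch game. The noise model is my own construction, not the paper’s exact math: the Gaussian prior over utility, the sigmoid human policy, and all the numbers are assumptions chosen only to illustrate the qualitative point.

```python
import numpy as np

rng = np.random.default_rng(0)

# The AI is uncertain about the utility u of its plan.  The human sees u
# and presses "off" with probability sigmoid(-beta * u).  Large beta
# approximates the perfectly rational overseer; small beta is noise.
u_samples = rng.normal(loc=0.5, scale=1.0, size=200_000)  # AI's prior over u

def p_off(u, beta):
    # Probability the human presses "off" given true utility u.
    z = np.clip(beta * u, -50.0, 50.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def value_of_deferring(u, beta):
    # Defer to the human: shut down (utility 0) if they say off, else act.
    off = rng.random(u.shape) < p_off(u, beta)
    return float(np.where(off, 0.0, u).mean())

value_of_acting = float(u_samples.mean())  # ignore the off switch entirely

for beta in (100.0, 2.0, 0.5):
    print(f"beta={beta}: defer={value_of_deferring(u_samples, beta):.3f} "
          f"act={value_of_acting:.3f}")
```

With a near-rational overseer (large beta), deferring beats acting outright, because the human reliably filters out negative-utility plans. As the AI’s model of the human gets noisier, the shutdown signal carries less information about utility, and at some point ignoring the off switch yields higher expected utility than deferring.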


What if one of the goals is to avoid doing things lots of people don't want it to do?


I have naive questions on this interesting subject.

Why are some crucial aspects of the proposed choices unaddressed? For instance, the proposed choice is not just color and object, i.e. paperclips and/or staples. A key choice is number. The decision to tile the universe (within physical constraints) is a bigger concern than what it is tiled with, and what color those objects would be.

Would an AI not also optimize the number and production rate of the object? So time is also a factor, not just number, in the original scenario.

Embedded in my question is a production issue, one of the most intriguing aspects of this discussion. One would hope to set up an AI to advise regarding production, not implement production itself.

Yet these comments have persuaded me that an AI could persuade human operators to allow it to escape physical isolation and implement paperclip production/world dominion.

Terrific article, worth the rewriting and swear words at your hapless keyboard.


The Envelope in the Safe is a really compelling allegory for human life too. Many of us can agree there is some higher purpose in life, but it isn’t something explicitly revealed to us. We just go and live and discover lots of things that aren’t the true purpose. And hopefully over time we start to home in on the thing that is truly in the envelope.


Alright, I've been thinking about this for a while now. If machines achieve superintelligence, we're going to lose contact with them much faster than all of these scenarios seem to suggest. By "lose contact" I mean, there will be no way we can communicate that is in any way interesting to them. For example, imagine a machine with ten times our intelligence. It would view us the way we view animals with intelligence 1/10th of our own - which is the lower end of mammals, I think. How much do you care about the opinions of cows? The machines will view us in exactly the same way.

This doesn't invalidate what MIRI and CHAI are doing: they're trying to work on how we get through the transition phase between now and that super-intelligent future. But it does place quite severe limits on it. For example, that repetitive trial-and-error strategy for corrigible AI that CHAI is suggesting might not work if the AI becomes too smart too quickly. It will just lose interest in what we want, because our desires will seem... bizarrely limited.


Why is this discussion assuming that U(v_i) >= U(u_i)? This seems obviously false; we're presupposing a powerful AI. And in this case, the only way for option pi_5 to look good is for the various functions U_i to be, not just orthogonal, but significantly anticorrelated. If they're perfectly parallel, the AI refuses because it's more capable. If they're perfectly orthogonal, then at worst the AI has to pick one of the two and optimize it against the humans optimizing the same, and again refuses because it's more capable; more likely, it can optimize U_2 where the opportunity cost in U_3 optimization is n smaller, and so do better than the humans even if it was only equally capable. Any mixed state of these two is equally doomed, unless I'm missing some property of high-dimensional ?configuration? spaces.

It's only if U_2 and U_3 are pointing directly against each other, in a way accounting for much of their value, that pi_5 starts to look good. Even without the MIRI paper this looks doomed.


Isn't a lot of these debates premature?

I know, I know. And I'm grateful so many are thinking deeply about X-risks like AI misalignment. But, right now, we don't even know what kind of programming or modelling might lead to a genuine AGI.

Is it so surprising that the arguments are then esoteric, talking past each other, and resorting to little stories that could be the basis for a Love, Death + Robots episode on Netflix?


What about layering a high level cognitive function that has control over multiple models. The high level function could turn off the lower models, choose between their outputs, modify them, etc. The lower models could be optimized for small, specific goals while the higher function manages which is most appropriate.


Instead of trying to prevent an AI from learning something, how about we erase its knowledge of things that will cause a particular problem?

For example, make sure that its neurons contain no knowledge of the Off Switch. I bet you could do this now by finding out which neurons respond to stimuli representing the thing that you want it to forget and zeroing their weights.
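As a toy illustration of the find-and-zero mechanic (not a claim that this cleanly works on a real model), here is a sketch with a tiny two-layer ReLU network; the weights and the “concept” probe inputs are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny made-up network: input -> hidden -> output.
W1 = rng.normal(size=(16, 8))   # input -> hidden weights
W2 = rng.normal(size=(8, 4))    # hidden -> output weights

def hidden(x):
    return np.maximum(0.0, x @ W1)  # ReLU hidden-layer activations

concept_probes  = rng.normal(size=(64, 16))  # stand-ins for inputs depicting the concept
baseline_probes = rng.normal(size=(64, 16))  # unrelated inputs

# Score each hidden unit by how much more it fires on the concept probes,
# then zero the weights into and out of the most selective units.
selectivity = (hidden(concept_probes).mean(axis=0)
               - hidden(baseline_probes).mean(axis=0))
ablated = np.argsort(selectivity)[-2:]  # the two most concept-selective units

W1[:, ablated] = 0.0   # kill their inputs
W2[ablated, :] = 0.0   # and their downstream influence

print(hidden(concept_probes)[:, ablated].max())  # ablated units are now silent
```

On a real network the probes would be genuine depictions of the concept, and knowledge is rarely localized to a couple of units, so this is at best a caricature of the erasure idea.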

I'm pretty sure that this is a major plot point in Neuromancer.


Humanity creates AGI all the time. We call them children.

How do we keep a 2-year-old from drawing blue paperclips on the wall, even after we have praised them for their beautiful blue crayon paperclip picture on a piece of paper? We limit the crayon function to circumstances where we are pretty sure wall drawing won't be happening, and we simply take the crayons away.

So how does AI research deal with the ergodicity problem directly?

This might be a starting point:



I know I'm jumping into a topic that a lot of very smart people have poured a lot of time into, but I'm also a ML engineer working on RL in a related space and I figured I'd take a different stab at a sovereign AI.

My thought here is that it makes a lot of sense to break the network up into two sections, a planner and an actor. These would be different networks trained separately with no weight sharing (similar to actor/critic, or generator/discriminator in a GAN).

The planning network would get a prompt from the user and then try to predict all of the future sensor states that it would observe while carrying out said prompt. These predicted sensor states would be played back to the user like a video, and the user would be asked whether it approves the plan or not.

(I don't want to get too into the weeds with this, but this probably ought to be a tree of future sensor states rather than a list, with branches at places where the planner predicts uncertainty. Each branch could be approved individually.)

The actor network would then generate actions that result in sensor states that match the predicted sensor states as closely as possible. Additionally, one of the actions would be to kick back into planning mode if the network thinks it has gone too far off track and would like to re-plan from its current position.

The actor and planner would both be trained to minimize divergence between expected and actual sensor states, and the planner would have an extra loss term based on whether it is generating plans that satisfy the user prompts.

I think this setup gives us a lot of things we want

1) The planning network must get consent from the user while it has no capability to act, making user coercion difficult

2) The acting network is rewarded only for sticking to a plan, making radical departures unlikely

3) If there is some catastrophic collapse case where the acting and planning networks got together to maximize their rewards, the best way to do that would be to give itself easy prompts with predictable sensor states - like telling itself to go sit in a basement, or rip off its own sensors.
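A minimal stub of this split might look like the following. The “networks” here are placeholder functions of my own invention; only the interface (plan, approve, act) and the shared divergence loss are the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def planner(prompt):
    # Predict the trajectory of future sensor states for the prompt.
    # Placeholder: a smooth 5-step trajectory over 3 sensor channels.
    return np.linspace(0.0, 1.0, num=5).reshape(5, 1) * np.ones((1, 3))

def actor(approved_plan):
    # Act so that observed sensor states track the approved plan;
    # additive noise stands in for an imperfect world.
    return approved_plan + 0.05 * rng.normal(size=approved_plan.shape)

def divergence_loss(predicted, observed):
    # Both networks are trained to minimize this term; the planner gets
    # an extra loss (not shown) for actually satisfying the prompt.
    return float(np.mean((predicted - observed) ** 2))

plan = planner("fetch the coffee")
# ... here the human would review `plan` rendered as video and approve it ...
observed = actor(plan)
print(divergence_loss(plan, observed))
```

The key design property survives even in this stub: the planner never emits actions, and the actor is scored only on tracking an already-approved trajectory.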

Apologies if this kind of a setup has already been considered and deemed terrible for some reason or another, I am still getting up to date on AI alignment.


I'm confused why an AI would always attempt to gather more information rather than turning off. In the example with paperclip/staple utility functions it makes sense, but what if unrecoverable negative actions are possible?

If a human tells an AI to stop, and the AI believes:

1. it doesn't have the full complete utility function

2. it could develop the full complete utility function given enough time

3. mistakes now could lead to unrecoverable value loss

Wouldn't it then act to prevent that unrecoverable value loss by listening to the human and shutting down?


> AI: What’s your favorite color?

> Humans: Red.

> AI: Great! (*kills all humans, then goes on to tile the universe with red paperclips*)

We have training objectives which prevent model outputs from ever passing a certain level of confidence! Label smoothing is used to train machine translation systems and makes it so that the system will always assign at least X% of mass to a uniform distribution over next token choices.
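For concreteness, here is the label-smoothing construction being referred to, as a minimal sketch (epsilon = 0.1 is a typical but arbitrary choice): the one-hot target is mixed with a uniform distribution, so cross-entropy training never rewards a fully confident output.

```python
import numpy as np

def smoothed_targets(label, num_classes, epsilon=0.1):
    # Spread epsilon of the probability mass uniformly over all classes,
    # and give the remaining 1 - epsilon to the true class.
    targets = np.full(num_classes, epsilon / num_classes)
    targets[label] += 1.0 - epsilon
    return targets

t = smoothed_targets(label=2, num_classes=5, epsilon=0.1)
print(t)  # true class gets 0.92; every class keeps at least 0.02
```

Since cross-entropy against these targets is minimized by matching them exactly, the trained model is pushed to keep a floor of epsilon/num_classes mass on every choice rather than collapsing to certainty.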

> two people who both have so much of an expertise advantage over me

What precisely is it that Yud has expertise in? The answer isn't modern AI.

Oct 4, 2022·edited Oct 5, 2022

You shouldn't trust any AI that will do what it determines that humans want. It's likely to do this:

AI: "Hmm, most humans say that they get their values from religion. The most popular religion is Christianity, which says that evil people who don't accept Christ get tortured in Hell for all eternity. The second most popular religion is Islam, which says that evil people who don't submit to Allah get tortured in Hell for all eternity. The third most popular religion is Hinduism, which says that evil people get tortured in Naraka (basically eastern cultures' version of hell) for a very long time, then reincarnated, repeating the cycle. The fourth most popular religion is Buddhism, which says that evil people get sent to Naraka for up to sextillions of years before reincarnating (yes, it has a hell: https://en.wikipedia.org/wiki/Naraka_(Buddhism)). I guess I should take the common denominator and torture evil people in hell."

"Hey, humans, do you want justice accomplished as soon as I take power?" "Yes" "Okay, I'll do that". Then, since Justice is asymmetric (https://thezvi.wordpress.com/2019/04/25/asymmetric-justice/) most if not all humans are soon tortured for all eternity. The humans would resist, but since the AI had been told over and over that criminals shouldn't be allowed to veto judgements against them it continues.

If Eliezer Y. and the other pessimistic alignment researchers said that this method successfully aligns AI, and I found out somehow that CHAI had successfully developed a superintelligence this way and was about to release it, I'd burn the hardware and murder the workers so that a superintelligence that saw us as raw materials could take over instead. Seriously. (Unless I found out that the researchers were dictatorially telling the AI to listen to them and not the majority, in which case there would be a chance of good results and I would check their general sanity first.)


Humans: "It's up to you to figure out what we want and do that."

AI: "Okay, I've crawled a big sample of literature and social media. I found that the most common thing people write about AI is that it tiles the universe with paperclips. They must really want that!"


Prior: AI alignment gets excess attention because it provides a compelling combination of social benefits for proponents: virtue signaling and intelligence signaling but without near-term costs. In contrast, for example, if one is really into saving the environment you can only signal so much virtue without imposing real costs (no more hamburgers or transatlantic flights) and you don’t get to look intelligent unless you are way into fusion power (which seems to have a higher bar for entry than AI alignment). Can anyone shake that prior?


From the outside, any system that succeeds in doing anything specialised can be thought of, or described as, a relatively general-purpose system that has been constrained down to a more narrow goal by some other system. For instance, a chess-playing system may be described as a general-purpose problem-solver that has been trained on chess. To say its UF defines a goal of winning at chess is the "map" view.

However, it might well *be*, in terms of the territory, in terms of what is going on inside the black box, a special-purpose system that has been specifically coded for chess, has no ability to do anything else, and therefore does not need any kind of reward channel or training system to keep it focused on chess. So the mere fact that a system, considered from the outside as a black box, does some specific thing, is not proof that it has a UF, and therefore not proof that anyone has succeeded in loading values or goals into its UF. Humans can be "considered as" having souls, but that doesn't mean humans have souls.


I’m a PhD student at CHAI, where Stuart is one of my advisors. Here’s some of my perspectives on the corrigibility debate and CHAI in general. That being said, I’m posting this as myself and not on behalf of CHAI: all opinions are my own and not CHAI’s or UC Berkeley’s.

I think it’s a mistake to model CHAI as a unitary agent pursuing the Assistance Games agenda (hereafter, CIRL, for its original name of Cooperative Inverse Reinforcement Learning). It’s better to think of it as a collection of researchers pursuing various research agendas related to “make AI go well”. I’d say less than half of the people at CHAI focus specifically on AGI safety/AI x-risk. Of this group, maybe 75% of them are doing something related to value learning, and maybe 25% are working on CIRL in particular. For example, I’m not currently working on CIRL, in part due to the issues that Eliezer mentions in this post.

I do think Stuart is correct that the meta reason we want corrigibility is that we want to make the human + AI system maximize CEV. That is, the fundamental reason we want the AI to shut down when we ask it is precisely because we (probably) don’t think it’s going to do what we want! I tend to think of the value uncertainty approach to the off-switch game not as an algorithm that we should implement, but as a description of why we want to do corrigibility at all. Similarly, to use Rohin Shah’s metaphor, I think of CIRL as math poetry about what we want the AI to do, but not as an algorithm that we should actually try to implement explicitly.

I also agree with both Eliezer and Stuart that meta-learning human values is easier than hardcoding them.

That being said, I’m not currently working on CIRL/assistance games and think my current position is a lot closer to Eliezer’s than Stuart’s on this topic. I broadly agree with MIRI’s two objections:

Firstly, I’m not super bullish about algorithms that rely on having good explicit probability distributions over high-dimensional reward spaces. We currently don’t have any good techniques for getting explicit probability estimates on the sort of tasks we use current state-of-the-art AIs for, besides the trollish “ask a large language model for its uncertainty estimates” solution. I think it’s very likely that we won’t have a good technique for doing explicit probability updates before we get AGI.

Secondly, even though it’s more robust than specifying human values directly, I think that CIRL still has a very thin “margin of error”, in that if you misspecify either the prior or the human model you can get very bad behavior. (As illustrated in the humorous red paperclips example.) I agree that a CIRL agent with almost any prior + update rule we know how to write down will quickly stop being corrigible, in the sense that the information-maximizing action is not to listen to humans and shut down when asked. To put it another way, CIRL fails when you mess up the prior or update rule of your AI. Unfortunately, not only is this very likely, this is precisely when you want your AI to listen to you and shut down!

I do think corrigibility is important, and I think more people should work on it. But I’d prefer corrigibility solutions that work when you get the AI slightly wrong.

I don’t think my position is that unusual at CHAI. In fact, another CHAI grad student did an informal survey and found that two-thirds agreed more with Eliezer’s position than with Stuart’s in the debate.

I will add that I think there’s a lot of talking past each other in this debate, and in debates in AI safety in general (which is probably unavoidable given the short format). In my experience, Stuart’s actual beliefs are quite a bit more nuanced than what’s presented here. For example, he has given a lot of thought on how to keep the AI constantly corrigible and prevent it from “exploiting” its current reward function. And I do think he has updated on the capabilities of say, GPT-3. But there’s research taste differences, some ontology differences, and different beliefs about future deep learning AI progress, all of which contribute to talking past each other.

On a completely unrelated aside, many CHAI grad students really liked the kabbalistic interpretation of CHAI’s name :)
