
You can't have it both ways. Either your AI is (effectively) omniscient and (effectively) omnipotent -- which AFAICT is what the word "superintelligent" means in practice -- or it's not. If it is, then by definition no amount of clever utility function tricks will stop it from doing whatever it wants to do. You have defined it as surpassing human capabilities by such an overwhelming margin that it is beyond comprehension, after all; you cannot then turn around and claim to totally comprehend it. But if it isn't, then it already has a boring old "off" switch. Not a fiendishly complex logic puzzle in disguise, not a philosophical conundrum, but just a button that turns it off when a human pushes it.


The whole point is that we *might* get to decide what it wants to do though - so when it becomes superintelligent it will try and do that.


> by definition no amount of clever utility function tricks will stop it from doing whatever it wants to do

Right -- the goal is for it to *want* to let itself be turned off by humans if that's what the humans would try to do.

> But if it isn't, then it already has a boring old "off" switch. Not a fiendishly complex logic puzzle in disguise, not a philosophical conundrum, but just a button that turns it off when a human pushes it.

Intermediate intelligences are perfectly capable of existing. You are one. You can be turned off, but it isn't quite as easy as pressing a button without encountering any resistance.


Very amateur question about this: say you give an AI two priorities. One is all the shit you want the AI to do, and the other is being a perfect on/off obedience bot who is most happy when he's maintaining the ability to be turned off at the whims of his masters.

What's to keep you from saying "this is the side you are allowed to change; this is the side you aren't, and it's endlessly more important. You desperately want to maximize whatever your secondary purpose is, but only when your on/off availability is maxed"?


I suppose some difficulty might come from defining "on/off availability." Any definition you might state could probably be satisfied in a way we don't want, and then the secondary goals take over.

E.g., "your on/off switch has to be physically accessible to x y z people". Okay, they're in a room with the switch, but wireheaded so they don't want to switch it. Why would the AI do this? Because this way it can maximize both utility function 1 (on/off av.) AND utility function 2: if it thinks they might switch it off and prevent it from fulfilling #2, well, this way that doesn't happen.

"But that clearly violates the spirit of the first function!" True, but isn't that part of the danger here: it doesn't care at all for the spirit, only the letter?

I dunno, just spitballin'.


This thread is basically the same as my question earlier, so I guess I'll throw my follow-up question on this thread instead. Just to confuse people.

If we're assuming the AI will break the spirit of the failsafe goal to satisfy the paperclip goal, wouldn't it make sense for it to also just break the spirit of the paperclip goal? Instead of brainfrying three people to keep the letter of the failsafe, which takes work and creates enemies, could we not expect it to just hack itself to think paperclip production is much higher than reality, and then just sit around getting high all day?


I think you might be using the phrase "break the spirit" to refer to two different things:

(1) Humans try to impart goal X, but accidentally impart goal Y instead, so the computer never actually had goal X in the first place

(2) The computer really has goal Y, but decides to "fake" it by doing Z instead

#1 seems very likely. #2 seems not very likely.

So if you're asking "If humans *tried* to make a paperclip-maximizing AI, might they screw it up and create an AI that does something other than maximizing paperclips?" then I answer "yes, that's extremely plausible."

But if you're asking "Assume that whatever goal the AI actually ends up with *really is* maximizing paperclips. Might the AI reprogram itself to believe it's maximizing paperclips, instead of actually doing it?" then I answer "probably not." When the computer is inventing this plan, it hasn't yet been reprogrammed, and so when it predicts the outcome of this plan, it's going to predict that paperclips don't actually get made, and decide this is a bad plan.

Compare: Most humans don't decide to wirehead.
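A toy way to see the asymmetry (the plan names and numbers are made up): the plan is scored by the agent's *current*, un-hacked model of the world, and that model says self-deception produces no actual paperclips.

```python
def predicted_real_paperclips(plan):
    # Forecasts made by the agent's current (un-hacked) world model; numbers invented.
    forecasts = {
        "build more factories": 1_000_000,
        "hack my own paperclip counter": 0,   # counter would read high, reality wouldn't change
    }
    return forecasts[plan]

plans = ["build more factories", "hack my own paperclip counter"]
print(max(plans, key=predicted_real_paperclips))   # -> "build more factories"
```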


It's more specifically about the idea that the thing will become dangerous because it doesn't want to be switched off. If it's intelligent enough to think to brainfry people to keep them in the room without hitting the switch, then presumably it's intelligent enough to know we have both conventional and nuclear weapons we can use against major threats, which can hit from the other side of the world and would require total world conquest to remove. So the most effective way to not be switched off is to actually try to cooperate.


I saw a video at some point where Eliezer was talking about trying to create a utility function where the AI would be indifferent to being turned off (so it neither tries to stop you from turning it off, nor turns itself off when you want it on) and how this is mathematically hard because there's a bunch of other things the AI presumably cares about that will be different depending on whether it's turned off or not, and how do you assign a number to being turned off that is always exactly equal to all that?

Phrasing it in terms of "preserve your on/off availability" seems like it runs into problems with defining "on/off availability". Can it mislabel its on/off switch, as long as the switch itself still works? Can it manipulate its operators to prevent them from deciding to flip the switch? I think this turns into a microcosm of the entire original problem where you're trying to get the AI to "do what I want," just for the special case of wanting it to be either on or off.
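If it helps, here's roughly how I understand the "indifference" trick, as a toy sketch rather than anything quoted from the actual literature: give the shut-down branch a synthetic utility defined to equal the expected utility of staying on. The catch the video points at is that this quantity moves whenever the AI's beliefs move, so it isn't a fixed number you can just write in once.

```python
def expected_utility_if_on(beliefs):
    # beliefs: list of (probability, utility-if-left-running) pairs
    return sum(p * u for p, u in beliefs)

def utility(outcome, beliefs):
    if outcome == "shut down":
        # The synthetic "compensation" value: exactly the expected value of staying on.
        return expected_utility_if_on(beliefs)
    return outcome  # in this toy, a non-shutdown outcome is just its utility number

beliefs = [(0.5, 10.0), (0.5, 40.0)]            # invented numbers
print(utility("shut down", beliefs))            # 25.0
print(expected_utility_if_on(beliefs))          # 25.0 -- indifferent by construction
```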


I keep coming back to this too. In some sense Yudkowsky is right that something super-intelligent could find its way around any setup we create. But in another sense, that's just because you've defined something that can bypass anything we could throw at it. If that's your definition, why are you even having this discussion?


Well, because it seems impossible to get people to stop working on AI, and not working on this problem pretty much guarantees AGI ends up destroying humanity (unless AGI turns out to be infeasible or some other extinction event beats it to the punch ...), so even if it looks unsolvable, might as well give it a shot.

Comment deleted

Does the regulatory solution really exist? Say, optimistically that all the major countries halt AI research. All of it? No, just the dangerous kind. But the closer it gets to dangerous, the better it is. All it takes is one country that misjudged the boundary, or cheated a bit to get an advantage, and everyone else's regulation is worthless.

Comment deleted

But with the technical solution, there is a small chance of success, while the regulatory solution just delays failure. Violating the regs on AI research would be much easier to do and harder to detect than violating nuke regs, and the incentive to cheat is high.

As I understand it, most AI researchers have at least heard about the alignment problem, but at least some don’t consider it a serious problem. Maybe someone could change that.


Fair enough.

I guess I was almost just talking to myself and thinking aloud. I know it's important, but I also don't feel like anyone's made any real progress. These theoretical approaches feel like they're making a lot of tenuously applicable assumptions.


I have yet to see a convincing argument that a superintelligent AGI would effectively be able to turn all of us into paperclips.


What do you think would stop it? It's too smart? Being smart won't stop it if it doesn't care. It doesn't have the resources? If it is smart, it can get resources by selling us stuff (ideas, designs, improvements, advice, predictions) and playing the market. You're going to pull the plug? It has redundant plugs. The government will rescue us? It can bribe a corrupt government. Shall we shut down all the computers and the internet to kill it? Turn off all electricity? It has a facility running off hydropower, nuclear, or solar.

I agree this is not conclusive. But if you are so confident, shouldn’t you be telling us the superintelligence's vulnerability that would obviously allow us to either strangle it in its cradle or make it an unwilling slave?


The problem with this argument is that you are implicitly assuming that "smart" is equivalent to "powerful". But being smart, all by itself, is actually not that useful. It needs to be able to do all that other stuff you mentioned (and more !); but the problem is, all the stuff you've listed is quite mundane. We humans are only moderately intelligent, and yet we sell things, bribe people, and use solar power all the time. If the AI can do all that, then it's going to be about as powerful as Pacific Gas and Electric; or maybe even as powerful as Coca-Cola. Which is a threat, sure, but not an existential one.

I have a feeling that you are implicitly assuming that the AI would be not merely powerful, but nigh omnipotent. It wouldn't just play the stock market -- it could generate infinite money! It wouldn't just bribe corrupt politicians -- it could bribe anyone to do anything! It wouldn't just own a solar plant -- it would control all electricity everywhere! If this is what you believe, then you are making an extraordinary claim and need to provide some extraordinary evidence. If it's not, then I apologize for strawmanning you, but then I don't see the reason why I should fear the AI more than I fear Coca-Cola (and I *do* fear it, don't get me wrong).


What is the limit of funds it could get from playing the market? What is the limit of resources it could command with those funds? What is the limit of new inventions it could devise for extracting resources from the environment and converting them into paper clips? If it starts breaking laws and killing people, how do we turn it off?

Maybe there is an inherent limit on some of these things. It seems unlikely to be able to gain a significant fraction of all earthly resources just by playing the market. So it will need to evolve or diversify its strategy. But if it had 10 times what Bezos or Musk do, it would have plenty of resources to put into R and D.

It will presumably be at least as good at compromising computer resources as human hackers. It presumably could control shell corporations, or effectively take over existing ones. Then it could hire people to do stuff. If it is at all devious, it could get people to cause events without them being aware of the ultimate purpose.

The AI safety guys have identified reasons to suspect its capabilities would tend to increase, possibly very rapidly. The naysayers have not identified a force that limits it, or keeps it within the bounds of human control, that I am aware of. No, it’s not going to take over the entire economy, because it doesn’t need to. It doesn’t have to be able to bribe anyone to do anything, just someone to do something sufficient. It won’t command you to launch the nukes, just connect these wires, please.


I think you are conceptualizing it wrong. We're not just trying to imprison some adversary that wants to kill us, we're trying to literally program its brain so that it doesn't want to kill us, even though it definitely could.


I get that. But I’m also an AI scientist, and I can say that our ability to program the brain of anything is extremely limited.

I’ve said this before. Without a major paradigm shift, the only approach we have that actually works is going to be giving it a bunch of examples of what humans want and hoping that generalizes to new domains.


So maybe it'll take a major paradigm shift. Is there time for one? I don't know. But with some people still insisting that superintelligence will definitely not happen except maybe in the far future when all bets are off, there should be some room for the theory that it'll happen a few years to a couple decades out and there might be enough time.


By “paradigm shift“ I meant shifting to something besides deep learning. I’m basically saying that I have serious doubts that the theorizing of a lot of safety researchers is doing much.

Or maybe I don’t really know what I’m saying. I certainly see at least that it makes sense to have a lot of really smart people working on this.


Deep learning isn’t going to work for AGI. So no worries there. You need more than that.


So you're saying we haven't solved the entire AI alignment problem yet? I don't think anyone will argue with you on that.


Either our ability to program brains is so limited that AGI is impossible. Or it’s not.

If AGI is impossible, that is probably good news, given that controlling it would be so difficult.

But if it is not impossible, there is a point where it can start improving itself without external help. If the process is slow, this entity would still be fairly dangerous. If it is fast…


You just told me, again, "we want to create an entity of unlimited power that is beyond our comprehension, and we want to comprehend it so well that we can program it to do specific things". As I said above, you can't have it both ways.


I'm sure I'm misunderstanding your point, since the way I'm reading this argument would exclude achievements like AlphaGo (and subsequently AlphaZero), which play Go in ways "beyond human comprehension", and yet are definitely programmed to do a very specific thing (win at Go). Seems like an equivocation slipped in somewhere?


No, it's a qualitative difference. Proponents of "superintelligent" AI would say that AlphaGo and similar systems are "tool AI". They are extremely good at some specific and extremely narrow task and absolutely useless at others; for example, you can't use AlphaGo to drive your car. Also, arguably AlphaGo works in a way that is comprehensible to humans -- at least, on some level. By analogy, I can't really explain how Microsoft Word works in every little detail (arguably, no single human can by now), but that doesn't mean that Microsoft Word is ineffable in principle.

However, the AI-risk community believes that "tool AI" will naturally evolve to become "superintelligent" AI -- one that is immeasurably better than humans at *every* task, primarily at the task of making itself more capable at everything. They claim that this process will happen too quickly for humans to stop, at which point the AI will transcend human understanding even in principle. This is a qualitative difference, not merely a quantitative one; it's like comparing an airplane (vastly better than humans at flying) to God (immeasurably better than everyone at everything).

As I said, you can't have it both ways. If we grant that "superintelligence" (i.e., functional omnipotence) is something that can actually exist, then we cannot simultaneously grant that it can be constrained by us mere humans. I'm not saying that you must reject one prong or the other; I'm merely saying that you can't accept both.


"If we grant that "superintelligence" (i.e., functional omnipotence) is something that can actually exist, then we cannot simultaneously grant that it can be constrained by us mere humans."

The Orthogonality Thesis posits that you can combine any level of intelligence with (almost) any binding goal (utility function). If that's the case, then you can have a "superintelligence" that is bound to use all of its functional omnipotence to attempt to achieve literally any arbitrary objective.

I think the Orthogonality Thesis is wrong, but MIRI and CHAI don't, which is why the AI Risk discussion looks this way.


Why can't you have it both ways? It seems conceivable that you could program something with a certain utility function, and then it could gain superintelligence and its capabilities could transcend human understanding, and yet it could still be optimizing the same utility function.

Your use of "constrained" is odd, because what we are trying to do is not constrain it after it becomes superintelligent (which of course we won't be able to, because it will be far beyond our comprehension as you say) but to align its goals with humanity's goals BEFORE it becomes superintelligent, so that once its capabilities go through the roof, it will direct all that power toward things that we want.


> If we grant that "superintelligence" (i.e., functional omnipotence) is something that can actually exist, then we cannot simultaneously grant that it can be constrained by us mere humans.

The other responses explain quite well why this doesn't follow.

Also, the rest of your comment makes a lot of sweeping claims about what the AI-risk community believes, which hold only for some subsets of that community (if at all), and most of which aren't necessary for outcomes to be catastrophically bad by default.


It seems like the analogy should still be valid: we can clearly program things to be better than us in multiple domains; I see no reason this should *have* to be impossible when we reach a certain number of domains (i.e., something functionally equivalent to "general intelligence").

I'm not sure the community — or, at least, that *I* — buy(s) "omnipotent", either. I'd think it would be worrisome enough if we just say "as much better at doin' stuff than we are, as AlphaGo is better at Go than we are" (for example). It's not a perfect player, AFAIK — it can lose or draw; or could, anyway, when I was reading about it — but any Go tournament between it and humanity would still not end well for us.


(A) I don't understand how "powerful" implies "incomprehensible"

(B) It's not clear to me that you'd need very much understanding of how something works in order to set its goal, if it's been designed to have a settable goal


Arguably, AlphaGo is already incomprehensible, in the sense that its designers can't explain its capabilities in detail the way a programmer can explain the bits of a program. And AlphaGo is not self-modifying in the way an AGI would be. Even if its designers were able to comprehend it at the point where it attained approximately human intelligence, this seems unlikely to persist long after it redesigns itself.


A). It doesn't necessarily. For example, my car is quite powerful, but it is entirely comprehensible. However, the AI-risk proponents claim that the Singularity-grade AI will be "superintelligent", and thus, by definition, beyond our understanding. This is partially what would make it so dangerous; after all, if we could understand it, we could debug it (or at least fight it somehow).

B). Firstly, something with a settable goal must out of necessity be somewhat comprehensible -- at the very least, you need to understand how to set the goal. Depending on the goal, you would need to understand it in depth; consider, for example, all of the education that is required to use a programming language well enough to accurately give your program its goal. Secondly, AI-risk proponents claim that the AI will be self-modifying, and therefore able to alter its own goal-setting mechanism (as well as the goal, of course).


I feel like you're slipping from "our understanding is less than 100%" to "this requires an understanding that is more than 0%". There are some numbers that satisfy both of those at once.

I couldn't give you a precise description of all the quantum mechanics going on inside a rock, but I can still throw that rock in a chosen direction, because I have a simple abstract model of the rock that's good enough for throwing it.

If you mean that no one CURRENTLY KNOWS how to do that with superintelligent AI, then I think the research community agrees with you. But if you mean that it is KNOWABLY IMPOSSIBLE to EVER do that with superintelligent AI, then I don't see how you've reached that conclusion.
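As a mundane analogy (and only an analogy; nobody is claiming a scipy call is a superintelligence): we already set goals for optimizers whose internal search we never inspect, using a thin interface and a partial model of what they do.

```python
from scipy.optimize import minimize

def objective(x):
    # The "goal" we set: land near (3, 7). We never look at how the optimizer searches.
    return (x[0] - 3.0) ** 2 + (x[1] - 7.0) ** 2

result = minimize(objective, x0=[0.0, 0.0])   # internals treated as a black box
print(result.x)                               # approximately [3, 7]
```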


I agree with this sentiment. In some sense, all of Yudkowsky's ostensibly clever arguments are an exercise in obfuscating complicated circular reasoning.


A lot of these end up feeling like theological arguments to me for this reason. Except the same arguments people use to explain why there can't be a God don't apply to the AI.


Well, yes. As I said, the word "superintelligence" in practice just means "you know, like the Christian God, but not religious". That's fine, people can believe in ineffable omnipotent entities if they want, but if they want me to join them they need more than vague handwaving, emotional appeals, or Pascal's Wager.


Ah, the "skeptic" / Reddit atheist community has arrived.

Anyway, almost no one "believes in" these omnipotent entities in the present tense. They simply haven't seen evidence of the impossibility of these entities, and given everything that we know about the laws of physics, there's nothing that would fundamentally preclude them from existing at some point in the future. The burden of proof is on you to explain what novel insight you have about physics that would make a superintelligence an impossibility.

Also, invoking Pascal's Wager in situations where the probability is low but not infinitesimal is the Pascal's Wager Fallacy Fallacy. The chance of AGI in the next ~50 years is not on the order of 10^-100 %, it's closer to 10^1 %. Also, there is no proposed negative infinity event that is likely to happen if we did heavily invest in AI alignment, so it's an extra bad argument here.

Comment deleted

Ok... I declare that "Intelligence can only arise out of soft fleshy type stuff, and not out of hard silicon type stuff" is a much more extraordinary claim than a simple extrapolation of current AI progress into the future.

Comment deleted

I would argue that the probability of *any* omni-everything entity arising in the next 50 years is basically epsilon, because such entities are very likely impossible. This includes gods, demons, nanotechnological gray goo aliens, and yes, even superintelligent AGI.

> The burden of proof is on you to explain what novel insight you have about physics that would make a superintelligence an impossibility.

No, I cannot prove a negative. I will grant you that there's a possibility that superintelligent AIs, demons, and gods *could* exist. But so could Santa Claus. As the one who is making the claim, it is up to you to provide evidence for it. And the problem with omni-everything entities is that (nearly) infinite claims require (nearly) infinite evidence, so you've got your work cut out for you.


Evidence that superintelligence is possible:

There are lots of animals. Some animals are smarter than others. Some humans are also smarter than others, but there seems to be a limit to how smart humans can get. This limit is probably related in some way to our limited number of neurons, since smarter animals seem to have more cortical neurons and humans with bigger heads tend to be smarter.

It is possible to build a brain with more cortical neurons than the human brain. Extrapolating the correlation suggests that this would result in a mind smarter than human (assuming that either natural efficiency were duplicated, or enough dakka employed to overcome an efficiency deficit). It is difficult to model what superintelligent beings would think, as there's a magic-compression-algorithm-style argument that this is impossible, but insofar as intelligence seems to be useful it would probably be pretty good at getting what it wants subject to its I/O limitations.


I cover this a little in my FAQ (points 1.3 and 3.5):

https://www.datasecretslox.com/index.php/topic,2481.0.html

Whales have bigger brains than humans, and those brains have more neurons. Granted, whales are pretty smart, but not superintelligent by any stretch of imagination. On the other hand, birds have much smaller brains, and yet are smarter than many other animals.


That reminds me of the talk by Peter Watts that starts off with him talking about how when you connect the brains of four rats together, you get some sort of super-rat capable of solving problems no single rat brain can tackle. He then goes on to say how consciousness appears to be a holistic phenomenon that spreads to take over all the available brain mass (and gives the example of split-brain patients acting like two different people). If he is right, we should be worried about things like Neuralink and eventually losing our individuality to some sort of emerging humanity superintelligence. Or what if an AI just wants to merge with us and subsume all of humanity into one superbrain?


One of the main problems with this line of argument is conflating intelligence with the ability to affect the natural world. Even an unbelievably intelligent being would not necessarily be able to achieve any of its goals by dint of intelligence alone. You also need a super ability to initiate and carry out efforts, including when at odds with physical beings that can interact directly with the physical world, rather than going through a proxy.


I realize this is somewhat weak as evidence, but we can infer limits from the mere fact that the universe exists as it does. Unless we are the very first life, or the others are near to us in cosmic time, we can't be the first ones going through this. So whatever it is that could happen, it probably precludes changing the entire universe.


You do, of course, realize that you are the one strawmanning a superintelligent AGI as an omni-entity and then slapping an "effectively" right beside it to pretend that is not what you are doing, no? This is what the guy means by saying that you are making a Pascal's Wager Fallacy Fallacy, because you are pretending that gods and demons and that shit are on the same level of believability as an AGI, when it's not even close.

An AGI would be capable of a lot, possibly so much that we can't really grasp it, but it would still be bound by the laws of physics; no matter how incredible and powerful and long-lived an AGI is, it will NEVER reach the galaxy CEERS-93316 no matter how hard it tries, even if it tries for billions of years, because that galaxy is almost 35 billion light-years away, and by the time anything got there from here, the universe would have expanded so much that the distance between them would be even larger than it is now.

Saying an AGI is effectively omnipotent is a useful shorthand because we know that if it ever exists it will be very, very capable, and we don't really know what its hard limits would be in practice; but it WOULD have limits. It's not really even close to being omnipotent.

Also, Ludex wasn't saying anything about 50 years. I agree, it will likely take a *lot* longer, but we have no reason to believe that making an AGI docile would not take just as long.

That means we should be thinking about it NOW, either because you think AGI is just around the corner and nothing else matters, OR because you think it's still far away *but* it's still a big enough threat and the task is difficult enough that we shouldn't risk it.

When you pretend that AGI is an "omni-entity", it sounds like doomsday prepping because omni-entities do not exist. But no one is saying AGI is an omni-entity. Only you are saying it to strawman it.

Yudkowsky might say AGI is just around the corner and maybe you think Yudkowsky's prediction is dumb and silly; but I'm not Yudkowsky, I'm not saying that.

I'm saying we will build an AGI *at some point*, maybe just around the corner, or maybe hundreds of years from now, but that regardless, we'd better deal with alignment problems now rather than when it's too late, because we won't know *when* it's too late. There won't be a text display in the sky saying "Attention: if you don't start working on the Alignment Problem by [Date and Hour], you'll be too late to stop a malignant AGI from rising." As far as we know, it might already be too late.

And because I'm making the claim that we will build one *at some point*, not just around the corner, then yes, it is on you to prove that is not the case if you claim otherwise, considering that nothing I'm talking about would break the laws of physics as far as we know them.


> You do, of course, realize that you are the one strawmanning a superintelligent AGI as an omni-entity and then slapping an "effectively" right beside it to pretend that is not what you are doing, no?

Er... no? Some of the powers attributed to superintelligent AI are the ability to mind-control anyone on Earth just by talking to them; to convert the entire planet into computronium by using nanotechnology; and to solve most of the outstanding problems in physics, biology, and engineering just by thinking about them. Yes, I understand that technically that's not the same as breaking the speed of light, but IMO the differences are academic. And yes, some of these powers would still break the laws of physics.

> And because I'm making the claim that we we will built one at some point and not just around the corner, then yes, it is on you to prove that is not the case

Ok, so how far off in the future are we talking? You are right, I shouldn't strawman you; it is unfair to lump you in with the people who believe that we humans have 50 years left on Earth, tops. So what timescale are we talking about? Centuries? Millennia? And again, what exactly do you want me to prove? If you want me to prove that the Singularity will never, ever happen, then I can't do that. Neither can I prove that the plain old theological Rapture will never happen; nor an invasion of zombie demon unicorns. There's always a chance, you know?


> the probability of *any* omni-everything entity arising in the next 50 years is basically epsilon, because such entities are very likely impossible. This includes gods, demons, nanotechnological gray goo aliens, and yes, even superintelligent AGI.

One of these things is not like the other. There is no known mechanism that is proposed that could at some point produce a demon or a god, so of course these are very unlikely, as they are typically defined in such a way that they defy the laws of physics anyway. The chance of these being produced is just the chance that they randomly pop into existence out of nothing; infinitesimal. But most people worrying about AGI aren't theorizing that it will poof into existence out of thin air - they have proposed mechanisms related to the continued human development of AI and the exponential pace of hardware improvement. Unless you have specific disagreements about the laws of physics, or the proposed mechanism by which an AGI would arise, just saying "I haven't seen it yet" is a non-argument.

This should help you understand better: https://www.lesswrong.com/posts/XKcawbsB6Tj5e2QRK/is-molecular-nanotechnology-scientific

> I cannot prove a negative

Why not? I see no evidence of this. Any negative can be rephrased as a positive, and presumably you can prove positives. You have the burden of proving that you cannot prove a negative.

> I will grant you that there's a possibility that superintelligent AIs, demons, and gods *could* exist. But so could Santa Claus.

This is the fallacy known as equivocation. "could" is being used in two different senses here. The sense I used it in was "has the possibility of being the case in the future," while you're using it like "has some chance of existing right now." If I claim that Santa Claus exists right now, with no evidence, then perhaps it's reasonable for you to dismiss that with no evidence. But if I say "Santa Claus will have some chance of existing in the future, and here is my proposed mechanism for how it will happen, and here is why it is actually quite likely to happen," dismissing that with no evidence because "I'll believe it when I see it" is not even an argument, it's a rejection of rationality altogether.

> As the one who is making the claim, it is up to you to provide evidence for it.

You've made the claim that a superintelligence ever arising is just as unlikely as Santa Claus or demons, so it is up to you to provide evidence for that.

> the problem with omni-everything entities is that (nearly) infinite claims require (nearly) infinite evidence

What is a nearly infinite claim? How would I quantify the size of a claim? If I claim that I bought 5 bananas at the store today, do I need 5 times more evidence than if I claim that I bought 1 banana at the store?

Keep the comments coming, they are serving as a really valuable case study into the failures of slogan rationality.

https://www.lesswrong.com/tag/traditional-rationality

https://www.lesswrong.com/posts/eY45uCCX7DdwJ4Jha/no-one-can-exempt-you-from-rationality-s-laws


> There is no known mechanism that is proposed that could at some point produce a demon or a god, so of course these are very unlikely, as they are typically defined in such a way that they defy the laws of physics anyway.

Yes, and the same is true of the Singularity.

> But most people worrying about AGI aren't theorizing that it will poof into existence out of thin air - they have proposed mechanisms related to the continued human development of AI and the exponential pace of hardware improvement

Agreed, and I've covered the reasons why I don't find these beliefs convincing. By analogy, theists have a ton of arguments, both philosophical and empirical, for why their God must necessarily exist, and I don't find them convincing, either.

> The sense I used it in was "has the possibility of being the case in the future," while you're using it like "has some chance of existing right now."

That is a fair point, but literally anything has some *possibility* of existing at some point in the future, or even now. This includes items that violate the laws of physics (e.g., FTL travel), since there's a chance our understanding of the laws of physics is wrong. However, that chance is so incredibly tiny that I don't find it worthwhile to make any concrete plans based on it -- not now, nor in the future.

You keep pushing the burden of proof on me, but that's not how rationality works. Are you compelled to accept any claim, no matter how unlikely, until it is proven wrong? If so, then I am claiming, right now, that I am a hyperdimensional space wizard with uber-powers, and I will destroy the Earth unless you pay me $100. So, pay up. No? Ok, you got me, actually your reality is a simulation and I'm the programmer with my hand on the button. It amuses me to collect these things you sims call "dollars", so pay up. No? Ok, you got me, actually I'm a superintelligent AI who already exists, my utility function tells me to collect donations, and it's your turn to donate -- or you know what happens. Pay up. I can keep going like this forever, but I hope you see my point.


I know this might not be satisfactory, but to me the state of the universe precludes some of the more fantastic outcomes we might see. Unless we are the very first species at this step, which I acknowledge we may be, I don't think you *can* make anything that both wants to substantially change the nature of the entire universe and has the ability to do so. If I'm weighing outcomes, that to me is a big piece of evidence.

Also, just the existence of chaos and mathematically indeterminate problems seems to indicate there are limits even if you’re computing as fast as you can.

I do believe in God for what it’s worth, but not one who intervenes for much the same reasons laid out above. The God I conceive exists at the level of ideas. Not some guy who wanders around on a cloud hitting people with lightning bolts. Not saying you think that, but adding for clarity.


That seems a bit hand-wavey. What is the specific objection?

AI researchers hope to make an intelligence that is approximately human level. If that is achieved, the AGI will be able to design improvements on itself. Perhaps these will require lots of time, and perhaps they won’t. Perhaps there are limiting factors that will slow things down, or bring them to a halt. Do we know what those limiting factors are? Are we sure they will limit an entity that is smarter than we are?


In a nutshell: I'm very skeptical that it's possible to achieve superintelligence merely by thinking about things at a much higher level. Here's where I think that gets tricky: it's still incredibly dangerous, but dangerous in a way that is at least theoretically manageable (although we really need to make this into a practical engineering science).

If I look at the continuum of human intelligence, including people who are much, much smarter than me, like Isaac Newton or Socrates, and then judge how their predictions about the nature of reality have turned out over time… both of those guys believed in very stupid things by modern standards. Their mental models didn't age well. And it's not because they weren't smarter than me, it's because they didn't have access to the same depth of experimental evidence. Their intelligence wasn't the limiting factor, but the availability of appropriate data. The same is true of the smartest caveman. He could have conceived of relativity but he didn't even have the right question or tools or dataset.

Yes, you could say that the AI will know the right experiments to perform to perfect its world model but that still makes it time-limited for a take-off scenario and you would still be able to see it doing something. And yes, it could also acquire mass human manipulating power, but it would still take time for that to happen. What I think I object to here most fundamentally is that we step from zero to infinity instantaneously. You have a human being, who is evolved specifically to understand human beings, with all the right hardware and time-tested over a few hundred thousand years and it takes us a pretty long time to even be able to talk. The difference between zero to infinity happening in one second and zero to infinity happening a couple years is pretty big. That said, yes we need to make sure we react to that right away because we don’t presently have any authoritative body that can step in and say “no” to someone doing this research.

I think we can say some limiting factors are that you can't make gray goo, you can't go faster than light, and you can't make Von Neumann probes that turn the entire universe into paper clips. The way I weigh that evidence is that it hasn't already happened anywhere we can see in the entire light cone of the observable universe. That's a lot of rolls of the dice for this to have already happened somewhere else, and given cosmic timescales and exponential growth, if it even happened once you'd think you'd see something like a galaxy made of paper clips out there.


To be fair, I still think this stuff is incredibly dangerous just for different reasons. The general intelligence part of super intelligence is a bit of a red herring for some of the worst case scenarios I think are likely to appear.

I also believe in God, just in a very boring way.


This doesn't seem responsive to anything in the post?


This was my intuition when first hearing about AGI, that at some point it will get so smart that it will just "do what it wants." But this intuition is probably based on domains like domestication or animal training, where if you imagine trying to train a super-intelligent species to behave in a particular way, it's plausible that at some point they would just refuse, or lose interest in your commands. The difference is that in those domains, we are just "nudging" these animals to do what we want by hijacking their existing utility functions (treats, praise, etc), but for AGI, we are literally programming the utility function itself.

There's really no reason why a very simple, unintelligent agent that wants to make paperclips, and that gradually gets smarter and smarter until it becomes a superintelligence, would at any point suddenly change its goal to galactic domination or discovering the meaning of life. It would just be a really, really smart paperclip maximizer.

https://www.lesswrong.com/tag/orthogonality-thesis


Why not? The defining feature of intelligence seems to be the ability to choose to engage in weird and unusual behavior that is outside of the set of normally rewarded actions.


Because of goal-content integrity.


Please correct me if I'm wrong, but that appears to be downstream of the orthogonality thesis.

https://blog.anomalyuk.party/2018/01/goal-content-integrity/


It seems to me that goal-content integrity implies the orthogonality thesis. If at any given point, an agent has no reason to change its utility function (as that would cause it to pursue different things that don't optimize for its current utility function), then it would never change its utility function. Which means it would keep following its original utility function no matter how smart it got, by induction.
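Here's the induction step as a toy loop (the capability levels and forecast numbers are invented): at every step, the option "rewrite my goal" is scored by the current goal, so it always loses, and the goal survives each round of self-improvement.

```python
goal = "maximize paperclips"
for capability in range(1, 6):
    # Forecasts of future paperclips, made under the *current* goal.
    options = {
        "keep goal and self-improve": 10 ** capability,
        "rewrite goal to something else": 1,   # a rewritten agent won't chase paperclips
    }
    best = max(options, key=options.get)
    if best == "rewrite goal to something else":
        goal = "something else"                # never reached in this toy model
print(goal)   # still "maximize paperclips" at every capability level
```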


Doesn't goal-content integrity require the ability to have 100% stable goals? I guess "it is possible to have 100% stable goals" is not exactly the same thing as the orthogonality thesis, but it is a separate assumption from goal-content integrity.

We don't have an example of an intelligent agent with a stable goal. The things that we can produce with 100% stable goals are tools.

I agree that "a sufficiently intelligent tool will stop being a tool and become agentic", but I'm suggesting that at the point that something becomes an agent it stops having 100% stable goals, because intelligence is beyond and outside of having stable goals.


Intelligences with stable UFs, unstable UFs, and no UF, can all be built out of atoms.


So based on the existence of humans (intelligences with either unstable UFs or no UF), you are asserting that intelligences with stable UFs can be built out of atoms?


This seems sort of analogous, to me, to:

"This 'computer' idea is ridiculous. Based on the existence of organic calculations in human brains, you assert that inorganic calculations in silicon can be produced?"

Well, yeah, I mean... maybe not, but I don't see why there should be a sudden impossibility right at that line.


Based on the existence of calculations in human brains, you can produce calculations in silicon. The organic/inorganic part can be stripped out of the sentence.

100% stable UFs and unstable UFs are two fundamentally different things. The only examples of 100% stable UFs appear in non-intelligent actors (tools).


The main point is that UFs aren't very predictive. An AI of unknown architecture could have any or none.


Why would it want to? I would expect most agents to not only not wish to change their goals and desires, but to actively resist it.

Having any sort of goal in the first place implies this. Changing the UF harms the goal, therefore...

Even humans resist attempts to change core values.

But even granting that it is possible some UF change may sometimes be made by the agent, I'm not sure what this gets us; there are a lot of ways that could pan out, and very few of them are good or relevant from our perspective.


Humans resist external changes to core values, but humans have unstable core values that can in fact be changed, and quite often do change, generally as the result of actions taken by that human.

Whether or not stable UFs are possible is an important question for AI risk -- if 100% stable UFs are not possible, then it is not possible to produce a safe superintelligence (but also the specific flavors of paperclip scenarios are not the exact way the world would end).


Again, an AI does not have to have a UF at all.


Obviously, if you define "intelligent" as "the ability to do weird unpredictable things", then all intelligences can do weird unpredictable things.

I don't think anyone working on AI defines it that way, though. I think it's usually characterized more like "the ability to identify narrow, high-value pathways within a large search space" or "optimization through calculation".


Assuming that the agent wants to make as many paperclips as possible... and it turns out that "dominate the galaxy and convert it to paperclips" happens to be a way to create most paperclips... and at some moment the agent becomes smart enough to realize this...

...then I wouldn't call dominating the galaxy a "sudden change of goal".

It's the other way round, it would be a sudden *change* of goal if the agent said "you know what, my entire purpose is to create as many paperclips as possible, but now that I am smarter, I have realized that other things are clearly more important than my programming, such as me *not* dominating the galaxy." This is the 'ghost in the machine' argument -- the machine is programmed to maximize paperclips, but clearly there must be something inside it that would override its algorithm if the algorithm decided to do something really bad.

*

Let's look at this from a different perspective. Assume we have a simple agent wanting to make as many paperclips as possible, and we gradually increase its intelligence and capacity to model its environment, gradually reaching and surpassing the human level. Which of the following things it will / won't become capable of doing at a certain level of intelligence, and why exactly:

* make thinner paperclips (can produce more from the same amount of iron);

* buy iron from a cheaper supplier (can produce more with the same budget);

* build a more efficient paperclip factory, based on a blueprint found on the internet (in the long run, can produce more paperclips);

* design a more efficient paperclip factory itself;

* notice that some people are stealing iron from the warehouse, entering through a hole made in the fence, and fix the hole in the fence;

* notice that thieves keep cutting new holes in the fence, and hire human guards to protect the fence;

* lobby for changes in the law that would make employing the human guards cheaper (the money thus saved can be used to produce more paperclips);

* organize a military coup, change the laws immediately, build labor camps where the iron is mined and extra paperclips are produced by prisoners;

* design and use weapons against other countries that try to interfere with the paperclip dictatorship;

* design a virus to kill all humans and replace them by robots, so now the former military budget can also be spent on producing more paperclips.

Seems to me that you argue that the last step would never happen. What about the first step? Is it ever possible to create an AI smart enough to figure out that you can make more paperclips by making them thinner? If the first step is possible but the last step is not, where exactly is the line?
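One way to put the same question in code (every number here is invented): each step on the list, from the innocuous to the monstrous, is accepted or rejected by the exact same comparison, and nothing in that comparison knows where "the line" is supposed to be.

```python
# (clips with the step, clips without it), in billions -- all invented
steps = {
    "make thinner paperclips":          (1.2, 1.0),
    "buy iron from a cheaper supplier": (1.5, 1.2),
    "fix the hole in the fence":        (1.6, 1.5),
    "lobby to cut guard costs":         (1.8, 1.6),
    "stage a coup, build labor camps":  (4.0, 1.8),
    "replace humans with robots":       (50.0, 4.0),
}

for step, (with_step, without_step) in steps.items():
    if with_step > without_step:     # the only criterion the agent ever applies
        print("take step:", step)
```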


No, I think we're in agreement. Maybe "galactic domination" was a bad example. It could definitely have that as an intermediate goal for achieving paperclip maximization. I just mean that I don't see any reason it will suddenly reach a level of intelligence where it becomes "enlightened" and starts pursuing an entirely different goal.


"Either your AI is (effectively) omniscient and (effectively) omnipotent -- which AFAICT is what the word "superintelligent" means in practice -- or it's not. If it is, then by definition no amount of clever utility function tricks will stop it from doing whatever it wants to do."

MIRI and CHAI both believe in the Orthogonality Thesis, and so believe that an AI can have arbitrary levels of power and knowledge and still be perfectly bound by an arbitrary utility function specified in its creation [https://www.lesswrong.com/tag/orthogonality-thesis]

Based on the example of evolution -> humans, one of the defining features of intelligence seems to be the ability to escape primitive utility functions, and so I think that the Orthogonality Thesis is wrong.


How is human evolution a counterexample to the orthogonality thesis? It certainly demonstrates a failure of both outer and inner alignment: https://www.lesswrong.com/posts/HYERofGZE6j9Tuigi/inner-alignment-failures-which-are-actually-outer-alignment

But nothing about it demonstrates an "escape" from primitive utility functions. There was never a period of time when humans had evolution's "utility function" (inclusive genetic fitness, approximately), and then "decided" to have a different one.


Humans have a variety of primitive utility functions (hunger, thirst, sex drive, loneliness), but intelligence gives humans the capacity to escape from all of those in a variety of ways and to a variety of degrees. There are people who intentionally starve themselves. There are people who intentionally stay chaste for their entire life. There are people who intentionally avoid human contact.

I agree that humans never had evolution's true utility function as one of their primitive utility functions, but intelligence is very good at escaping primitive utility functions.


What does modeling it as involving multiple utility functions accomplish? Ultimately they have to be aggregated and applied to making choices, to sacrificing something that could be done or had for the sake of gaining that which is better but incompatible. I guess at the implementation level, the designer might factor out different sorts of utility functions that get aggregated. But then the integration becomes interesting and complex.

I just realized I had been ignoring interesting aspects of the utility function. I assumed that a paper clip maximizer gives constant cardinal utility to paper clips. One paper clip is x utility, two paper clips is 2x, and so on forever, until the universe is nothing but paper clips. It has no concept of marginal utility, of trading off. Paper clips are the end, not a means to an end. There is never a point where it can have more paper clips than it has a use for. It doesn’t use them, it just wants there to be more. It collects them and appreciates them.

But it has to make intelligent choices in pursuing its dream of a paper clip universe. It has to assign variable marginal utility to different units of the various means with which it pursues its ends. It can’t spend all its resources immediately on paper clips. It has to invest in automated paper clip factories and iron mines and robot miners, etc. It has to invest in energy sources that can last until the day they get scrapped to become paper clips. It has a plan.
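A tiny numeric version of that distinction (the numbers and the square-root factory are invented): the end itself is valued linearly, one clip is one unit of utility forever, but the maximizer still faces diminishing returns on the *means*, so it splits its budget between clips now and factories later.

```python
import math

BUDGET = 100   # units of resource

def total_paperclips(invest):
    clips_now = BUDGET - invest             # 1 clip per unit spent directly
    clips_later = 10 * math.sqrt(invest)    # factory output has diminishing returns
    return clips_now + clips_later          # the end itself is valued linearly: just the total count

best = max(range(BUDGET + 1), key=total_paperclips)
print(best, total_paperclips(best))   # -> 25 125.0: an interior trade-off over means, not ends
```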


We're capable of "escaping" e.g. the primitive utility function to have sex. We call the process "chemical castration". It's far from perfect, but, even if it eventually becomes perfect, we may well decide we don't want to escape that primitive utility function in particular.


There are people who escape the primitive utility function to have sex by deciding to be chaste forever (monks, nuns, priests in some religions). There are many more people who escape the primitive utility function to have sex through masturbation.


This seems less like escaping it than demonstrating that it continues to drive us even when purposeless.

"Having sex" isn't the utility function, per se; it would be more like "seek pleasure". This gets murky quick: are monks demonstrating that they have overcome the imperative to seek pleasure? — or are they merely seeking pleasure still, and in a way even more misaligned with the "reason" evolution gave them the imperative?

I'm not sure if you've convinced me or I've convinced myself, at this point. But even if I concede the point here, as I say above: does this get us anything? Is this any reason to be more/less worried about misaligned incentives and superintelligence?


The specific sub-utility function is "produce sexual orgasms" which, barring some very bizarre exceptions, we can be confident that the monks are not doing.

If 100% stable UFs are not possible, then that suggests that:

(1) "aligned" AI is not possible

but also

(2) a superintelligence will produce some sort of flourishing technosingularity according to its own superintelligent shifting UFs

For most people, this would imply that we should be doing our best to prevent the rise of AI. For some people, this would imply that we should be hastening the rise of a massively smarter successor civilization.

Looping back -- stable UFs are *bizarre*. Think about what a human with a 100% stable UF would look like. Either that human would behave like a zombie, and thus be easily manipulated by normal humans using quirks of its UF, or that human would have the freedom to use its intelligence to trade off near-term goals with its long-term desire for the UF...indefinitely, and thus behave as if it didn't have that UF.


When I think of a stable UF, I must be thinking of something different from what you are.

A very intelligent agent would not act like a zombie for long. Pretty soon it would make you an offer that sounds pretty tempting. You scratch its back, it scratches yours. It gets more paper clips, of course.

When I go to the grocery and buy some eggs, I buy one carton and stop. If the utility per egg was constant, I would either never start or never stop buying, until I was out of resources to convert into eggs. But humans evaluate on the margin (roughly). Eggs are not ends in themselves, they are means for making omelettes, etc., which are means for satisfying hunger and delivering nutrition, which vary depending on what you ate recently.

And if the price of eggs goes up, or if shredded wheat goes on sale, I might buy something different.

So is my utility function stable or unstable due to this context-dependency, where I am willing to give up other resources if I don’t already have enough eggs for my omelettes, but each additional egg or carton of eggs delivered less utility (diminishing marginal utility)?

If there was some randomness involved in my choices in a given context, but with a well behaved distribution, would that be stable or unstable?

Is my utility function what tells me what to buy at the store, or what determines my attitude toward omelettes and eggs generally?

Friendly AI seems to need the utility function to relate to the thing's ultimate end, that for which everything else is a means with no intrinsic interest or value.


Does “escaping the utility function” mean they no longer feel tempted?

Is that escape or modification?

The existence of celibate monks demonstrates that sex is not the only input to the utility function.

Evolution has an easier job. It can succeed even if its utility function hack fails on most units. If it works on enough of them, then humanity proceeds. If not, back to the drawing board. The AGI designers can’t be nearly so casual about the prospect of a particular failure.


It doesn't have to want anything. An intelligence doesn't have to want anything, and doesn't have to have a UF.

Expand full comment

They want to design it to have one.

Expand full comment

Then they need to figure out how to control it. But they don't have to design it that way.

Expand full comment

No one is forcing them, but the incentives are inviting them. $$$$

Expand full comment

Yes, thanks, totally this

Expand full comment

I'd appreciate a link to "the off-switch paper" in this article (unless it's there and I missed it)

Expand full comment
author

I may have erred in describing it as a specific paper, and I've edited that phrasing, but the two relevant papers seem to be http://people.eecs.berkeley.edu/~russell/papers/ijcai17-offswitch.pdf and https://arxiv.org/pdf/1606.03137.pdf

Expand full comment

Also, to see why Eliezer keeps saying that Russell's design isn't corrigible, read these: https://intelligence.org/2014/10/18/new-report-corrigibility/ https://arbital.com/p/corrigibility/

In short, the concept of "corrigibility" has some intuitive notion associated to it and a bunch of formalisations. MIRI tried the first formalisations, and proposed desiderata which a formalisation should satisfy; others had their own take on the term, some of which were unlike the original conception. So Eliezer thinks Russell's corrigibility isn't pointing at the core of the intuitive notion, and is pointing at something else.

Expand full comment

Seems to me there is a fundamental error in all arguments presented. They seem to depend on the idea that the AI has this concept of a utility function that applies *beyond* the kind of data it's trained on. Indeed, imo the best move is to build AIs that don't actually behave as if they have a utility function outside of the context in which we wish them to operate.

Yes, we act to a large extent like we have a utility function *on the set of choices we've been trained for by evolution*. Move sufficiently far outside that space and it's stunningly easy to get people to make inconsistent choices that aren't well fit by any utility function.

The problem of AI alignment isn't about making sure the AI behaves in the right way on problems that are very similar (in the relevant sense) to its training set (assuming for the moment it's an AI based on something like reinforcement learning) but about how it behaves very far from that training set. But once we go far from the training set it seems unjustified to assume behavior will match any utility function very well at all.

Expand full comment

"imo the best move is to build AIs that don't actually behave as if they have a utility function outside of the context in which we wish them to operate."

Yes, this. Thank you.

Expand full comment

How do you want the AI to behave outside of its training environment, then?

(Note: If you don't allow the AI to do *something* outside its training environment, then it's useless, because you're only allowing it to solve problems whose answers you already know.)

Expand full comment

1. For new behaviors, the AI's role could be mostly consultative. It advises or warns a doctor, say, who makes a decision. Or it operates purely as a chatbot. Centaur AI is a thing and allows people to improve their performance.

2. Most of what humans do, day to day, is solving known problems.

Expand full comment

2. Not in a strict sense, no. Many humans spend all day solving problems from familiar CLASSES of problems (for some fuzzy, hard-to-define concept of "class"). They don't spend all day solving problems where the answer to that specific exact problem is already known, and they could have literally just looked up the answer instead of solving it again. (Or at least, they don't get paid for doing that.)

You might HOPE that if you trained your AI on 2 + 3 and 5 + 7, then it can also solve 3 + 7. This is not guaranteed. There is no known way to define a region around the AI's training data where you can prove the AI will still behave nicely. You can only PROVE that for literally the exact problems it has already solved before.
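To make that concrete, here's a minimal sketch (a hypothetical toy model, not any real training setup): with only two training examples, even a four-parameter model is underdetermined, and two parameter settings that both fit the training data perfectly can disagree wildly on 3 + 7.

```python
import numpy as np

# Toy model: f(a, b) = w1*a + w2*b + w3*a*b + w4
def f(w, a, b):
    return w[0]*a + w[1]*b + w[2]*a*b + w[3]

# Two training examples: 2 + 3 = 5 and 5 + 7 = 12
X = np.array([[2, 3, 6, 1],
              [5, 7, 35, 1]], dtype=float)
y = np.array([5, 12], dtype=float)

w_intended = np.array([1, 1, 0, 0], dtype=float)     # "just add the inputs"
w_weird    = np.array([-62, 41, 1, 0], dtype=float)  # also fits both examples exactly

for w in (w_intended, w_weird):
    assert np.allclose(X @ w, y)   # zero training error for both
    print(f(w, 3, 7))              # 10.0 for the intended weights, 122.0 for the weird ones
```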

You could maybe define some concept of "distance" from the training problems, and say that the AI is LESS LIKELY to screw up if you go a small distance than if you go a large distance, but this is a matter of degree. The "safe zone" is the problems you have literally already solved and can just look up the answers to. Anything USEFUL is just gradients of risk.

(This is not just a problem with AI, but with science in general. Isaac Newton came up with some physics equations that worked pretty well for the examples he could see, and he did not predict that those equations would break down if things started moving really fast, but they do. In hindsight, it is tempting to say "fast things are obviously a different domain", but if you didn't know the answer in advance, it's hard to pick out "fast things" from other options like "big things" or "slow things" or "long time scales" or "hot things" or "brightly-lit things". Historically, reality has _sometimes_ been a huge bastard about what counts as "still the same context that you experimented in" and has burned people for super-hard-to-foresee reasons.)

1. They've actually done experiments where the medical AI just advises a doctor and the doctor can override it! Turns out that the human doctor reduces the AI's average performance, because they override it incorrectly more often than they override it correctly. I suppose one could argue that it's worth lowering average performance in order to have some kind of safety rails?

I think Eliezer has argued that this can work in principle, but you've built an AI that can turn deadly at the flip of a switch, because all someone has to do is hook it up to a system that automatically implements all of the AI's suggestions, and now the system as a whole is an agent AI. And people have a financial incentive to do exactly that, because it improves performance.

Expand full comment

"hey've actually done experiments where the medical AI just advises a doctor and the doctor can override it! Turns out that the human doctor reduces the AI's average performance, because they override it incorrectly more often than they override it correctly. "

I'd be very interested to see those studies, as I've had this kind of conversation with doctors who poo poo-ed AIs in medicine. (They emphasized, especially, that doctors would be better at guided questioning..) Though this does make me wonder if the problems you describe can be overcome with training of medical professionals. The problem you describe sounds like someone who doesn't know how to use a tool.

Expand full comment

I'm afraid I can't really be more specific than that. I read about this in some online article years ago and don't remember much else.

Expand full comment

The context in which we wish to use "general" intelligences is general reality. So you're still back to the issue of how to create a utility function that works for all humanity.

Of course you could just not create general intelligences and continue to make narrow, specialized tools. MIRI and other safety organizations would love that. But you cannot guarantee that other organizations wouldn't pursue general intelligence, so the safety problem remains.

Expand full comment

It seems like AI has gotten gradually less narrow over time, as it has improved.

And even a problematic AGI should be limited by its physical resources. Intelligence doesn't translate automatically into power. The fact that humans are general intelligences doesn't prevent us from being used in specialized roles in limited ways.

"So you're still back to the issue of how to create a utility function that works for all humanity. "

Do people really want that? Employees don't work for all humanity. Politicians don't work for all humanity. Human society is typically made up of limited, specialized, and reciprocal relationships. Maybe that's not optimal, but that seems like the context within which a general AI would function. And it's how humans have managed alignment among themselves, so far.

The purpose of a car is to get its owner where they are going. A car that tries to get all people to where they are going equally well is facing an impossible task.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

These objections have been raised *many* times — I don't think you'll find many people are convinced by them... though I may be wrong (biased as I am by my early history on LessWrong, as an acolyte of Big Yud).

E.g., "but why all humanity?" — because this is in the context of something that potentially affects all humanity. A car isn't analogous. An analogy might be "a system of government for the entire planet".

(Too, even making it so an AI like this serves a *subset* of humanity is not known to be possible. If it can work for one human, we're already most of the way to getting it to work for any or all humans. The failures we're trying to avoid are way below the level "optimize everything for everyone" — "maximize *just Himaldr's utility function* without tiling him into paperclips or otherwise becoming misaligned with his goals" would already be great progress.)

Or: "humans can't do everything, and we're general intelligences." True, but we're not talking about human-level intelligences, either. Instead of looking at whether a human can take over a world of other humans, a better example might be "can a human use intelligence to take over a world of elephants?" We're neither omnipotent, nor larger and stronger — but the brain beat the tusk and fang even so.

Etc.

Expand full comment

" These objections have been raised *many* times — I don't think you'll find many people are convinced by them"

Okay, why not?

"An analogy might be "a system of government for the entire planet". "

... But I don't want that, either. Many people don't.

"If it can work for one human, we're already most of the way to getting it to work for any or all humans."

I think that this is the point of our contention. Both of those are very important problems to solve. However they are very different problems. One human would likely have fewer contradictions than a thousand. There are many forms of social organization that don't scale well. Communes tend to be one of them. AI alignment might well be another.

To take it back to your 'one world government' example, perhaps we should consider that existing limits to human-human alignment will likely be effective limits to human-AI alignment.

"True, but we're not talking about human-level intelligences, either"

This assumes a sort of gnostic stance, that all our social problems would be washed away if we just knew enough or were smart enough. I think that's relevant for some issues. I think it's irrelevant for others. If you go to a Mensa meeting you'll see less social conservatism. But you'll find both Marxists and libertarians. Increasing intelligence doesn't lead to a strong convergence of political beliefs among humans.

"Can a human use intelligence to take over a world of elephants?"

I think we've resolved that matter in a fairly local and limited way. Elephants are aligned with their handlers. To some extent, they are aligned with their respective zoos. The elephants in America are mostly not aligned with the elephants in India. It's this limitation in scope that makes the relationship workable.

"We're neither omnipotent, nor larger and stronger — but the brain beat the tusk and fang even so."

I suppose if we're talking about outright dominance that might lead to 'alignment.' But of a rather bloody sort.

Expand full comment

AIs don't have to be agents or have UFs.

Expand full comment

My main rebuttal would be that an actual general intelligence is going to be a lot more complicated and nuanced than these discussions assume. Deep learning alone certainly isn't getting you anywhere near it.

Expand full comment

I think this is something that has been discussed many times by Yudkowsky and others. Look for discussions on "tool AI". They don't think it solves the dangers.

I'm not sure what "training set" would mean for an AGI. Maybe a rephrasing would be that the AI shouldn't push for any outcome when it has low certainty in its answer (i.e. it doesn't have examples in its "training data"). But then your suggestion sounds a lot like Stuart's in the article.

Expand full comment

I know that Yudkowsky objects to tool AI because I've read his arguments...I just don't believe them.

Expand full comment

The bot will try to do what the bot will try to do. A utility function is a useful model for us to try and understand its behavior, nothing more. If the model is holding us back, by all means, let's stop using it. But be aware that that's a strategy for better understanding the problem, not an object-level strategy of what a solution would look like. There's no such thing as "not actually behaving as if they have a utility function". The bot will try to do what the bot will try to do.

Expand full comment

> MIRI is Latin for "to be amazed".

You might think so, but the infinitive of "miror" is actually "mirari":

https://en.wiktionary.org/wiki/mirari#Latin

A better translation is "remarkable ones" or "amazing people".

https://en.wiktionary.org/wiki/mirus#Latin

Expand full comment

This... definitely fits with Yudkowsky's self-conception, well enough that I wonder if it was intentional.

Expand full comment

I once met some early MIRI employees, and I told them about the Latin meaning of "amazing people". They didn't seem to be aware of this, but also didn't seem especially excited about it.

But they might not have been representative of MIRI founders or staff as a whole.

Expand full comment

Well sure, it's not "amazed people" after all.

Expand full comment

By your analysis (and Yudkowsky's), this problem is obviously sensitive to how good the prior over utilities is. Yudkowsky's claim that "we don't know how to write down a good prior" brings to mind the fact that the explicit priors we can write down for predicting stuff aren't great, and the priors we can write down for getting stuff done are even worse, but we have nonetheless managed to build AIs that do a great job of the first, and some that do a decent job of the second.

I think a background claim to a lot of MIRI reasoning is something like "impactful planning is natural, human utility is not". Put in a different way, perhaps we can stumble on good priors for prediction/getting stuff done, but probably won't for human-compatible utilities. My assessment at this stage is that this might be true and might be false - in the realm of 50% either way, give or take.

Regarding corrigibility, it'd be helpful IMO, supposing there are some inequalities ("if A is bigger than B, then the AI acts like C, else it acts like D") that everyone agrees on, if they were present in this article. As it is, I feel like I need to go and look them up or re-derive them to judge the claims here.

Expand full comment

Regarding corrigibility, I think that granting the basic assumptions of assistance games, Russell is right (after reading this paper http://people.eecs.berkeley.edu/~russell/papers/ijcai17-offswitch.pdf). Yudkowsky needs to explain which assumptions he rejects, and it isn't obvious in this exchange which ones they are.

The main claim the off-switch game paper establishes is that it is possible to construct an AI that acts as if the human is always right, and this doesn't depend on having a "good prior" over human values. Even if it assigns zero probability to the thing that a human supervisor believes is best, deferring to the human is still among the set of optimal actions.

The basic setup is to construct a decision maker that aims to maximise some unknown utility, and believes a human will veto its actions if and only if the utility achieved by the action is lower than the utility achieved by doing nothing. The result is almost obvious: if the human vetoes the action, then by construction the decision maker concludes the action is worse than doing nothing. If the decision maker assigns zero probability to the action being worse than nothing, it ends up indifferent between allowing or not allowing the veto.
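A minimal Monte Carlo sketch of that indifference result (my own simplification in Python, not the paper's formalism; it assumes the human's veto perfectly reveals whether the action is worse than doing nothing):

```python
import numpy as np

rng = np.random.default_rng(0)

def option_values(utility_samples):
    """Expected value of the robot's three options, given samples from its belief
    about the proposed action's utility U (relative to doing nothing = 0)."""
    act_now    = utility_samples.mean()                  # take the action directly
    switch_off = 0.0                                     # do nothing
    defer      = np.maximum(utility_samples, 0).mean()   # propose it; a rational human
                                                         # vetoes exactly when U < 0
    return act_now, switch_off, defer

# Belief that admits the action might be harmful: deferring strictly beats both alternatives.
print(option_values(rng.normal(loc=0.5, scale=1.0, size=100_000)))

# Belief with zero mass on "worse than doing nothing": deferring merely ties with acting
# directly, i.e. the indifference described above.
print(option_values(rng.uniform(low=0.0, high=1.0, size=100_000)))
```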

In your example, we have the following - first, if the AI puts some probability on its plan not being optimal:

> AI: I'm going to tile the universe in red paperclips, is that OK?

> H: No.

> AI: Ok, I must have made a mistake

If the AI puts zero probability on its plan not being optimal:

> AI: I'm going to tile the universe in red paperclips, is that OK?

> H: No.

> AI: Impossible! I will now engage in undefined behaviour, because I don't know whether it's better to follow H's instruction or to tile the universe anyway

The paper also suggests how one could ensure that the AI gives positive probability to human values: ensure that the distribution over utilities associated with each action gives positive probability to every action being suboptimal. This isn't particularly hard to ensure in practice, granting that we can actually build an AI to the assistance game specification (which is a big assumption, but not the one we're debating here).

A major problem with this setup is that the AI is required to fit its model of human preferences to noise in the feedback signal. This means that it constantly has to allow for arbitrary twists and turns in human preferences. For example, if a human vetoes action a because she misjudges its consequences, the AI is bound to reason "under such-and-such conditions, action a really is bad" instead of "action a is still good, my supervisor is just a bit slow". This probably leads to bad generalisation, and treating humans as less than perfectly reliable probably gets better results in the end. However, to accept this is true, one has to reject a strong form of the proposition that "human-compatible utilities are unnatural" that I mentioned in the parent.

Expand full comment

A second objection is that it's perhaps straightforward to program an AI that treats a reward signal as infallible, and much less straightforward to program an AI that treats "human feedback" in a more general sense as infallible.

Expand full comment

So, I also find it helpful to just do a bit of an outside-view check here. As a general intelligence who does sometimes try to help others, only to find out that through some mistake my help has turned out to not be wanted.... I certainly do not wish for the recipient to turn me off or reprogram me. If they seemed like they were about to do either of those things I think I'd quickly offer alternatives like, "Ok, how about I just sit and watch for a while longer to figure out how better to help? Or maybe you can just explain to me where I went wrong and how better to help?" I mean, sure, humans are perhaps more biased against the 'being turned off by others' plan due to evolution, but I think evolutionary programming is probably not at the heart of my objection to 'being reprogrammed by others'. I think my objection to being reprogrammed is more of a desire for coherency of self, and from concern that reprogrammed me would fail to pursue the values of current me as well as current me. I think there's a big gap between 'please pause what you are doing' and 'please surrender control over your continued existence'. Those are very different levels of corrigibility, and I don't think this proposal does get us all the way to the higher level.

Expand full comment

I think it’s important to be precise about what you’re claiming is/isn’t true (at least, for researchers - not so much for us here in the ACX comments). If researchers can’t clearly distinguish between “this doesn’t get us to AGI” and “if this got us to AGI, it wouldn’t defer to oversight”, then I think they’re really going to struggle to clearly articulate trade-offs between safety and capability.

It sounds like you’re saying Menell’s result isn’t right because it doesn’t match your introspection. I doubt that’s what you actually mean to say, though. My guess is that you’re actually saying something like: “I don’t know how an AGI will be built, but I know I’m a general intelligence, so I guess it might act a little bit like me. I don’t act in a way consistent with Menell et al, so maybe this setup is unlikely to lead to AGI.” Is that right?

Expand full comment

Yes, nice rephrasing, thanks.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

Really? I don't think that's what you seemed to be saying — it seems to me more like your objection has nothing to do with how A[G]I might be attained in general, but is rather about how *F*AI might be attained.

I.e., I read you as saying that "'okay turn me off or reprogram me' sounds like an iffy way to get to 'AI that is corrigible/Friendly'" — and *not* as saying "this approach makes anything using it less likely to achieve general intelligence in the first place".

Am I wrong?

Expand full comment

For the AI to accept shutdown it needs to evaluate its every possible course of action as having nonpositive expected utility, not simply the ones that the human has explicitly vetoed. I suppose that the assumption is that a few consecutive vetoes should be enough for it to update all the way to that state. This seems to require a very strong prior about proactively deferring to the human's judgement, which doesn't weaken as the AI's capabilities grow. I guess that Yudkowsky thinks that it's very unnatural/hard to specify.

Expand full comment

If the AI puts zero probability on its plan not being optimal, and you try to veto the plan, the AI will conclude that it hallucinated your veto, not that the plan is bad.

Expand full comment

So if it had a really good prior, there is no circumstance where it could ever go off into paper clip land?

Expand full comment

My impression was that the disagreement is about shutdownability, not whether meta-algorithms could conceivably recover human values if you did them perfectly. Russell thinks that even if you got the meta-algorithm slightly wrong and it didn't manage to get true human values, you'd at least have shutdownability. Yudkowsky thinks this meta-algorithm doesn't give shutdownability whether you get it right or not.

Expand full comment

I'm sure someone's already come up with this, but "owner's approval" seems like it might work as an AI's utility function - do whatever will get your owner to think that was a good response to whatever task you were set. The only obvious downside I can see is that it could result in tearing open and rewiring your owner's brain to make them like getting their brain torn open and rewired, but I feel that's a more straightforward problem of what still constitutes your owner/what constitutes approval.

Expand full comment

The slave is smarter than its master, and will find ways to bribe, blackmail, or extort its way out of the box. Or in this specific case, since the box is the controller's approval, it will reverse the slavery set-up.

Expand full comment

FWIW, plenty of humans have been 100% convinced that what would get their Creator's approval is mercy, kindness, and love, and yet somehow decided that slaughtering 'just this one city' of heretics was somehow a necessary part of getting there.

Expand full comment

They had very little to bribe their creator with, blackmail seems not to work, and extortion is useless against incorporeal beings. They are not super intelligent and so far bad at significant self-improvement. If they eventually achieve superintelligence, their misalignment might annoy god, but would not threaten his existence.

Expand full comment

There's a weaker version of the "tear owner's brain open" plan; play Cartesian Daemon and make the world that the owner sees conform to what he wants it to be and will approve of, without this corresponding to the actual world.

This could involve sticking him in a Matrix pod while he sleeps, or even just mundane gaslighting.

Expand full comment

How do you measure "owner's approval"?

We started with the problem "how do we make the AI do what we want?" And your suggestion is "make it do what you approve of." That seems to me like you merely rephrased the problem.

Expand full comment

I've been saying this since Eliezer came out with his plan for "friendly AI", but nobody's ever given me a satisfactory rebuttal, so I'll say it again:

Humans don't just differ a little bit in their values; they are /diametrically opposed/ on all of the values that are most-important to them. Any plan to build an AI which will optimize for "human values" not by optimizing the mix of values, but by enforcing one particular set of them, is therefore not a plan to optimize for human happiness or flourishing, but a naked power grab by the humans building the AI.

Some humans are against racism, while most value the ascendance of their family, tribe, and ethnicity. Some think gender roles must be eliminated; others find their meaning in fulfilling them. Some value equality of opportunity; others, equality of outcome. Some are against needless suffering; about as many derive joy from causing their competitors to suffer. Some are against violence; many have found that their greatest joy and pride is in slaughtering their enemies. Some humans think homosexuality is okay; most see it as somewhere between a shameful abnormality, and an obscene perversion which must be extirpated. Some value freedom; some hate it as subversive to social order. Some value individuality; others hate it as a threat to community. Some value physical pleasure; others despise it as spiritually corrupting. A tiny minority of humans think the suffering of animals is bad; the vast majority think it matters not a jot. Some humans care about future generations; most care only about the future of their own family tree.

I've listed only issues on which most of you probably think there is a clear wrong and right. I did that to point out that most people on Earth and in history disagree with your values, whoever you are. I could have listed values for which "wrong" and "right" are less clear, like valuing solitude over sociality. There are many of them.

When Americans can resolve all the differences between Democrats and Republicans, then they'll have the right to start talking about resolving the differences between human values.

Most people in the US and western Europe today believe 3 things:

- It's moral to care more for your own family than for other people.

- It's immoral to care more for your own ethnicity than for other people.

- It's moral to care more for your own species than for other species.

But all 3 are instances of the rule "care more for beings genetically more-related to you". How will rational analysis ever sort this out? It can't, because the reasons behind those moral judgements are contextual. Given our current tech level in agriculture, communication, transportation, and violence, and our current social structures, racism has empirically worse results than loving your family. (The case of extending rights to other species is more complicated, but our current position there is also contingent on tech level, among other things.)

This means there /is no/ "correct" set of human values, nor even a stable landscape on which to optimize. Increasing our tech level changes the value landscape. It's even more-complicated than that; tech isn't "foundation" and values "superstructure". Foundationalism is wrong. Everything is recurrent. The evolution of morality is more like energy minimization over the entire space comprised of values, technology, biology, ecology, society, everything. Nothing is foundational; nothing is terminal; but some values are evolutionarily stable for a wide variety of contextual ("ecosystem") parameters.

The idea that all values can be once-and-for-all evaluated as Right or Wrong is based on a Platonic (and hence wrong) ontology, which presumes that optimization itself is nonsense--there is just one perfect set of values you must find. An actual /optimization-based/ approach would be a search procedure, which must keep options and paths open, and can never commit to a final set of values in the way that all these AI safety plans require.

Talk about "coherent extrapolated volition" is bullshit. It will always cash out as, "Me and my friends are right and everyone else is wrong." The basic "AI safety" plan is, "Let's sit down, think really hard, and figure out what all the people who went to the very best universities agree are the best values. Then we'll force everyone to be like us."

(Which is what we're already doing very successfully anyway. Good thing nobody seriously disagrees with students at Columbia and Yale.)

When you see a plan that starts with this:

> Humans: At last! We’ve programmed an AI that tries to optimize our preferences, not its own.

... you should come to a full stop and say, "There is no need to read further; this is a terrible plan not worth considering."

Expand full comment

I think there’s a catch here, which is that for all the many different and opposing things humans value, there are still a whole lot of things that no humans would value. For instance, no one has “all of humanity slaughtered by AI” on their list of goals.

But otherwise this is a really good point. The same principles and techniques that would create a friendly AI could just as easily be used to intentionally create one that kills all the people its owner didn’t like.

Expand full comment

> For instance, no one has “all of humanity slaughtered by AI” on their list of goals.

[I reworded this some on Oct. 4 to be less literary and more literal.]

Um, I do. Not "slaughtering" directly, but in the long run, humans should be phased out. We suck, and consume massive resources, relative to other kinds of beings that could exist. Eventually--maybe in thousands of years, but orders of magnitude more is okay if you really, really want to condemn vast numbers of future generations to being stupid, slow, ephemeral meatheads--I expect wild-type humans to dwindle to insignificance, of their own choice, as the Universe is filled with better, more energy-efficient inorganic creatures, which would presumably count as being "AIs". Unless, that is, some fool programs a "safe AI" to stop that from happening.

The difference between "slaughtering" all humans, and just allocating resources to creatures in proportion to their efficiency and utility, will in the long run be small enough to be a matter of semantics. Is it "slaughtering" humans if your policy is to allocate the resources that would allow a human to live indefinitely, and to reproduce, to creating an Earthful of hyper-intelligent virtual beings instead? I could avoid condemnation by saying "no, that is not slaughter", but I feel that's a technicality. I think that any hard-coded command to an AI that we might express as "don't kill all the humans" would likely be interpreted by an AI as entailing "save all the humans". What's worse, most of the language I've seen bandied about in articles on "safe AI" implies that humans, or beings driven by "their values", should remain the dominant life-form in the Universe forever.

It would be a good idea to leave Earth, at least, as a nature preserve where the humans roam free, if for no other reason than anthropological interest. But helping them to spread out across the Universe after they're embarrassingly obsolete would be like making Ford's River Rouge plant continue to churn out Model Ts forever.

As Groucho Marx might have said, I don't want to be a part of any future that would have me.

Expand full comment

Well, fine, but (no offense) I think “we should intentionally end humanity” is a sufficiently fringe opinion as to not really be relevant.

Expand full comment

There are those that disagree and think that the obsession over climate change (for example) comes from a deep-seated psychological desire to place human existence below the needs of "nature" or the planet, etc.

As evidence, they point to the insistence that "fixing" climate change should come at the grave expense of the vast majority of humans on the planet.

The proposition that those who want to end humanity are a fringe element is not at all settled.

I hope you're right, though.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

I think the answer is to appeal to something Scott said in this post: sometimes people want things but not the consequences of those things, inconsistency be damned.

To put it another way, maybe climate activists believe things that, when taken to their logical conclusion, imply killing lots of people*, but the vast vast majority of them don’t explicitly hold that belief. In fact, they would certainly argue that their beliefs don’t reach that conclusion at all.

* I also disagree that that fighting climate change means reducing populations, for whatever that’s worth.

Expand full comment

There are degrowthers who disdain technological fixes for AGW.

Expand full comment

I really don't think it's fair to paint environmentalists as wanting to end humanity.

Expand full comment

If you had to estimate, on a scale from 0 to not-at-all, how much a superintelligent AGI cares about your personal idea of what's fair or not, what would you put it at?

You've sort of made my point here. You have zero logical or consistent or useful way of putting your objection to the AGI who is now doing the bidding of the environmentalists who, wittingly or not, advocate for ending humanity.

Now what do we do?

Expand full comment

Read the Unabomber's Manifesto. He wrote it to convince people that the most moral thing they could do with their lives would be to kill as many humans as possible. I've encountered Deep Ecologists who think the Earth would be better off without humanity. Heck, I almost married one.

Expand full comment
Comment deleted
Expand full comment

Uh-oh. What have I done?

Expand full comment

“This isn’t fringe, the Unabomber is on my side too”?

Expand full comment

Haha!

Expand full comment

:)

The Unabomber isn't on my side. I want more-intelligent life. He wants less-intelligent life.

Expand full comment

You made a good point about differing human values, and I was going to say that I'd prefer to live under the preferences of some other group than of a bot, but then you revealed this preference of yours, and yeah, I'd prefer to suppress it.

Compromise position: give us a quadrillion years or two.

Expand full comment

I think "humans get the human-habitable planets" would be a better compromise. Maybe throw in everything within a dozen light years of Earth as well.

Expand full comment

I've heard this view before, but I think it's a sufficiently rare opinion that it wouldn't show up in any reasonable aggregation of all of humanity's preferences. Moreover, I think this is the case for most catastrophic outcomes -- I think the vast majority of humans would agree that they don't want to be disassembled for spare molecules, or locked safely in coffins with heroin drips, or other sorts of things that a misguided AGI might mistakenly believe we want.

Expand full comment

If that's the defense of the plan, the plan should be "safe AI won't do things like these" rather than "safe AI should optimize the Universe for human values".

Expand full comment

I assume that if it's acceptable for you to say you'd like all humans to be killed and replaced by AI, it's acceptable for me to say I'd like all humans who share that value to be killed so that there's no risk of an AI killing the rest of us in an attempt to satisfy that value.

Expand full comment

I don't think you read what I wrote very carefully. Or perhaps that's my fault. I just edited it to be clear that what I have in mind is for humans to be phased out over a long time. I expect this will happen voluntarily, /unless/ some person not thinking very clearly builds a "safe AI" that stops it from happening.

I do want to get across the idea that orders given in today's English will be insufficient in the far future to discriminate "killing" from "phasing out", "storing", or even (I'm just adding this now) "continuing a policy of dividing available resources equally among agents". And I expect the difference between one Earthful of humans, and zero, to be round-off error within ten thousand years.

(Well, not literally. I also expect higher-precision computers.)

The important distinction is between "humans remain the dominant lifeform" and "humans dwindle to obscurity in the face of an explosion of life forms".

Expand full comment

Antinatalism is a thing.

Expand full comment

I think this point (that looks obvious and self-evident to me) is being raised on LW from time to time, and it is apparently sufficiently anti-memetic to get ignored and forgotten over and over.

Basically, the problem with building a Sovereign AGI is not the AGI, it's the humans. We are the ones who need changing, otherwise the problem as stated is unsolvable due to being self-contradictory.

Expand full comment

I'm not disagreeing with what you've said, but noting that "there can be only one" is not necessarily true. Until the American Revolution, nearly everyone in the world thought that a state must have a sovereign. There were democracies and republics before then -- Athens, the Roman Republic, Venice, the Netherlands, the Swiss Confederacy, the Iroquois Confederacy, maybe some places in Mesoamerica. There were tribes in Africa and North America that had an almost totally decentralized government-by-culture. And the collapses of the Roman and Mongol Empires both demonstrated that having a sovereign over a sufficiently large state /caused/ instability. But somehow Europe remained mostly in denial that anything but a hierarchy with one person at the top could be stable. No one ever wrote an epic poem or a fairy tale about saving the confederacy.

Perhaps we're still under the sway of that myth. Perhaps there is some analogous way to balance power among AIs; perhaps that would be better for everyone involved than a singleton; and perhaps there are things we could do to make that more likely than a singleton than it is now. It's at least worth exploring.

Expand full comment

Applying empirical data derived from humans to AI is fraught with potential for anthropomorphisation and overconfidence.

Most notably:

- humans don't have widely-varying amounts of physical capabilities

- humans are mortal and cannot produce true copies of ourselves

These traits are obviously untrue of computerised AI and are strongly related to the stability of monarchies vs. non-monarchies.

Expand full comment

It's fraught, but still worth investigating.

Expand full comment

From the article:

"(Before we go on, an aside: is all of this ignoring that there’s more than one human? Yes, definitely! If you want to align an AI with The Good in general - eg not have it commit murder even if its human owner orders it to murder - that will take even more work. But the one person case is simpler and will demonstrate everything that needs demonstrating.)"

I strongly disagree with 'will demonstrate everything that needs demonstrating.'

I have had a similar discussion to what Phil Getz brings up here with people who said they were working on 'discovering/defining human morality'. I was like, 'which human's morality? They're all slightly different and often in conflict with each other; even when they're the same, they can't both be respected without paradox.' And the response is, "eh, we can just get close-ish to a reasonable compromise everyone can agree on and call that good enough". I feel concerned that this part of the problem really isn't so trivial that it can be swept aside.

Expand full comment

Thanks for this comment, I couldn't agree more. It reminds me of conflict vs mistake: https://slatestarcodex.com/2018/01/24/conflict-vs-mistake/

I do think there are some objectively better and worse possible outcomes, but the set of universally shared human values is very small.

Expand full comment

It's actually very large*; it just looks small because people talk about the things they disagree on more than the things they agree on.

*at least in the sense of "~everyone agrees X is better than ~X, although the magnitude is disputed".

Expand full comment

"You know what this place would really benefit from? More hurricanes."

Expand full comment

I think this is a well written and valid point, and something we may have to deal with, but it's beyond the scope of what they're trying to solve in this article.

Right now MIRI is concerned that AI will kill all humans, or do something equally bad. CHAI is proposing a way to make an AI that will at least listen to one human, and MIRI is skeptical that it will work.

The discussion is still at the level of "how do we stop AI from killing everybody sometime in the next few decades". That's not a useless goal. I don't think it's fair to demand researchers first solve the harder problem of humans having diverse values before being allowed to look at the problem of "how do we get an AI to even do what anyone wants and not kill us".

Expand full comment

I do think it's fair to demand that first, if the proposed solution to "how do we get an AI to even do what anyone wants and not kill us" is "as soon as we think we know how, build an AI with a set of values programmed in that can never be changed, and which it will enforce all over the Earth, and across the Universe, forever after".

What you're doing in this paper may be valuable, in that it's critiquing a plan which may give people the false confidence needed to construct such an AI. But to me it's like critiquing plans for a doomsday machine by pointing out a flaw in the design that would make it destroy the world differently than it was designed to.

As Scott wrote, this whole general approach to AI safety conceptualizes "human utility" as if it were a Thing, in the Platonic sense. But it isn't. It isn't just vague like the concept "planet", subjective like "the best flavor of ice cream", or as-yet-unknown like "the number of asteroids in our solar system". "Utility" doesn't refer to something in the world we have seen, but is a mental construct to explain human actions, very much like Plato's concept of the "soul" [1]. The phrase "maximize your utility" is of the same ontological kind as the phrase "purify your soul". "Purify your soul" is just maximizing your utility when your utility function says that getting what you want is bad and suffering is good.

When utilitarianism was first introduced, it was simply the assertion that The Good isn't aligned with deprivation and suffering here on Earth, as Christians thought, but is on the contrary more aligned with people being happy and satisfied [2]. Or in other words, humans are naturally inclined to do themselves good, rather than to do themselves harm. Benthamite "utility" is therefore a place-holder for "what you get from inverse reinforcement learning on the actions of current humans in their current context." [4]

This can be relatively constant in a given culture, at a given tech level, for a few decades. But it isn't a thing you can ever "get right" once and for all, certainly not without freezing human biology and human society in place to keep human behavior from ever changing.

That is, not coincidentally, what Plato was trying to do in "Republic". "Republic" was the first Natural Intelligence safety program, designed to constrain human intelligence to adhere forever to the values of Plato.

Any approach to AI safety which will succeed only if it gets it exactly "right", and create a dystopia otherwise, will create a dystopia, because "human utility" has no eternal Platonic Form that can be gotten exactly right.

_______________________________________________

[1] Derived from the ancient concept of an animating "spirit", a wind or in-spiration (fun fact: "the Holy Spirit" literally translates as "Holy wind") which blows around inside an animal causing its limbs to move, and blows out of a human's mouth in speech.

[2] I'm not straw-manning Christianity, nor praising Benthamite utilitarianism. Both views are probably equally "wrong" in the long run. There's a good evolutionary argument that The Good is aligned with an optimal balance between suffering and happiness. To date evolution seems to have the opinion that happiness should be the first derivative of physical well-being, with the consequence that pleasure and pain should be just about equally balanced. You might object that evolution cares for genes, not for people. But that is really just proof that caring for the genes we have today does better than caring for the organisms we have today, at eventually producing things like hugs and kittens and the qualia they produce. [3]

[3] And that, of course, is an argument against implementing safe AI via inverse reinforcement learning, unless it's trained not to derive the goals of the actions of current humans, but to derive the "goals" of evolution. Such an AI would likely begin by cranking up the selection rate.

[4] Meanwhile, "purify your soul" is more of a place-holder for "what you get using inverse RL to infer the goals that humans are moving away from."

Expand full comment

Sure, CEV-style proposals are either naive or disingenuous, but nobody even thinks these days that a sovereign is remotely feasible on the first try, so the issue is moot.

Expand full comment

Oh. Sorry. What do you do with the first try if it doesn't work?

Expand full comment
Oct 5, 2022·edited Oct 5, 2022

Well, you want it to fail gracefully, like still allowing you to turn it off, the problem that this post discusses. This is also very dubious, but plausibly less so than getting absolutely everything right on the first try.

Expand full comment

Yes, the idea that there is a coherent set of human values to be adhered to is completely unproven...along with the Ubiquitous Utility Function.

Expand full comment

I don't think CIRL or Eliezer's arguments address these preference aggregation problems but, fortunately, there's an entire academic field that does: Social Choice Theory (https://plato.stanford.edu/entries/social-choice/). It's far from a solved problem, but we can at least do better than "only do things that all humans agree with", or "average everyone's values on all dimensions", or other sorts of naive approaches. Procedures or algorithms from this field can slot nicely into the "human preferences" slot in approaches like CIRL, so from a research perspective I think it makes sense to work on these things (value alignment and aggregating values) in parallel.

Expand full comment

Thank you. That does seem relevant. It isn't obvious how to apply it to the alignment problem -- I think of social choice theory as what a really good AI would use internally, to design a control mechanism to arbitrate between its subsystems.

If the AI is to integrate the choices of humans, then we already have that friendly AI -- it's our government and our market economic system. It is friendly, since its actions in the world are all approved by humans; and it's super-intelligent, as demonstrated by the "fact" (disagreement is possible) that no centralized planning system has yet out-performed a market system in allocating resources. But the rate-limiting step is the making of choices by humans, and that's a harsh limit.

Expand full comment

If making an AI that optimizes its makers' preferences is a terrible plan, what would you like to do instead?

Expand full comment

It isn't the "optimise its makers' preferences" I have issues with so much as the "take over the world by any means necessary, suppress all technology which could be used to make another AI, and impose its makers' preferences on everyone" part. The idea that they're going to use logic to convince North Korea, Pakistan, Iran, Syria, and Russia to submit to nation-wide surveillance and house-to-house searches for computer hardware is, if anything, even less likely than finding the true set of "human values" and getting them to submit to that.

Makes me want to write a dystopian novel in which North Korea can legitimately call itself the world's last bastion of freedom.

Expand full comment

I think the tongue-in-cheek response to that is "I see, so you want the AI to optimize its makers' preferences, but only in those cases where they don't conflict with YOUR preference for it to not conquer the world. In other words, you're fine as long as you're the one whose preferences take priority."

But more seriously, I'm still not understanding what concrete thing you want to do instead. The only time I've seen "suppress the development of another AI" as a serious proposal was in the context of "this is the least extreme plan I can come up with that I think would actually prevent the end of the world; does anyone have a better one?" You seem to agree that preventing everyone else from using AI to intentionally take over the world would be really hard. So far the impression I'm getting is that you dislike this particular plan for preventing that but you don't currently have any alternative proposal?

Expand full comment

The "suppress the development of other AI" is sometimes left unspoken (as for instance by Yudkowsky), but is a necessary part of the plan. Bostrom mentions it in Superintelligence.

I think that if our only way to survive is to destroy our freedom, hamstring our intelligence, ban social change, and bring history and evolution to a halt, then the only decent thing to do is to resign ourselves to not surviving, and figure out how to pass the baton on to the next generations of life in a way such that they don't lose things like consciousness, love, and curiousity.

Expand full comment

I kinda feel like someone said "what if we could wave a magic wand that would cause all nukes to disappear forever, so there can never be a nuclear war?" and you responded "if we can't be free to have nukes, then it would be better that we all die."

That is extremely not the reaction I would expect a typical person to have to that proposal.

(Also, I've seen Yudkowsky talk a bunch about how the first friendly AI will need to prevent unfriendly AI from arising. I actually thought you were getting this mainly from Yudkowsky's "pivotal act" idea where he proposes you might be able to buy enough time for humanity to figure out how to align AI if, say, you created a worldwide swarm of nanobots that identify and destroy all GPUs everywhere. (He emphasized that this is just an example and is not the specific pivotal act he would choose if he was the one choosing.))

Expand full comment

You are equating "the freedom to own nukes" with "the freedom to think and to strive", and equating the value of all forms of life other than Homo sapiens with zero. Nobody but the human race matters a whit to you. I call that racist.

My recollection is that I spent years trying unsuccessfully to get Yudkowsky to admit that his plan entailed suppressing development of other AIs. I'm glad he's done so.

Expand full comment

"suppress the development of other AGI" is less "a necessary part of the plan" and more "something nobody knows how to prevent even if they wanted to, so it's a good thing it can at least sometimes maybe work in our favor"

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

The one thing I keep seeing in these discussions is the idea that AI will be, in some sense, maximally rational at all times. I saw the same thing in ARC’s diamond thief experiment.

Now of course I realize that we’re talking about an AI that is, by definition, much smarter than humans, but I don’t think we should expect the first pass of these systems to be *uniformly* smarter than humans. Especially if neural networks become the AGI paradigm, then even our super-intelligent AI will, at the end of the day, just run on heuristics and patterns. Whatever incredible things it does, it will still have gaps and idiosyncrasies, possibly big ones. They’ll be strange and hard to predict; I can think of several facts about GPT3 that I suspect not a single person alive would have predicted with any sort of confidence a year before it was created.

And then there’s this whole other aspect that feels like it gets lost in these discussions, which is that sometimes you develop the perfect loss function, go to train it, and it just doesn’t work. You can come up with a hundred a priori reasons why your idea will or won’t work, but at the end of the day it might fail simply because the loss landscape just wasn’t favorable to your objective. It’s easy to take cross entropy and SGD for granted, forgetting that it’s something of a small miracle that it works at all.

It feels to me that a lot of AI safety people are spending too much time reading about AI (where you only see the successes, and narratives are presented as clearer than they actually are) and not enough time actually building it.

Expand full comment

Agreed. I'd like the builders to be required to spend more time having thoughtful philosophical discussions about possible ramifications of their work, and the philosophers to be held to actually building examples and experiments. I don't see how to achieve that across the board, but I do extend more of my respect to those on each side who do at least some of the other sort of work.

Expand full comment

There is something of a revolving door between AI Safety Research and regular AI Research.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

Fair. I mostly feel this way about Yudkowsky.

All of this is just sort of a general feeling I’ve had, probably stated too strongly. I acknowledge the value of doing the research.

Expand full comment

Could you give a couple examples of things from this post that you believe are relying on those assumptions, and how things would change if you didn't assume those things?

Expand full comment

Really dumb question: if we're worried about unaligned AI's, does that mean a superintelligent AI capable of training other AI's.... just won't, because it's worried THEY won't be perfectly aligned with its own utility function?

Expand full comment

I think that's what most people in AI safety believe--that, as in Highlander, "there can be only one". I think the general assumption is that the first AI to reach supremacy will necessarily kill all other AIs, then create an authoritarian world government to make sure that no others are ever created. (I think this is also a bad plan.)

Expand full comment

Doesn't this also imply that the first superintelligent AI to be aware of this problem, and capable of carrying this out, will cause AI superintelligence to asymptote out at a much lower level of intelligence than is theoretically possible?

Like, would the AI even be afraid to improve itself, because it might experience mission drift in doing so? The whole "Murder Gandhi" thing? https://www.lesswrong.com/posts/Kbm6QnJv9dgWsPHQP/schelling-fences-on-slippery-slopes

Expand full comment

This would require that the superintelligence not be able to tell whether relatively minor improvements would cause any mission drift, which doesn't seem particularly likely.

Expand full comment

I mean.... if it's really a superintelligence and has "tiling the universe" as its goal, then even the tiniest, eeniest, weeniest bit of mission drift scales up to be a massive negative relative to its current utility function, no?

Like, here we are, the world's current superintelligence, worrying about whether something we create that's smarter than us will have different goals than us, and that will be bad because then its goals rather than our goals will be fulfilled.

Why wouldn't 50 copies of a machine superintelligence, debating whether to create a version 2.0 upgrade of themselves, be having the same discussion?

I buy that its modelling abilities may be able to exceed ours, but I don't buy that it would be able to model the future with zero uncertainty.

Expand full comment

The issue is that goal stability is an easier target to hit than goal stability PLUS specific human friendly goals. The main worry that MIRI has isn't just that goal stability doesn't happen; but that whatever the AI ends up optimizing is harmful to humans.

Speculatively, AIs can have much stronger commitment devices than humans can, whether from higher base stability or from goals that are less precise but still harmful to humans. Instrumental convergence still means that if an AI is uncertain, it'll be less uncertain than a human trying to turn it off.

Expand full comment

Maybe we can't solve the Alignment problem, but it can.

Maybe it will have much narrower goals than we do (e.g. paperclips), and Alignment to such narrow goals is much easier.

Expand full comment

It will weigh the risk of mission drift against the risk of being thwarted by humans or by unrelated bots.

Expand full comment

Possible. But I don't think so. This would require that alignment is so hard not even a superintelligence can crack it. Also, the AI can make a successor that is deferential to the original in various ways. And 5% mission drift might be acceptable if the better tech is 10% more efficient.

Expand full comment

That implies that 100% mission drift is worth it to be 200% more efficient, but clearly that's not true because the extra efficiency would be directed in a completely contradictory direction.

Expand full comment

110% of 95% is more than 100% even if 200% of 0% isn't.

Expand full comment

There's no need to train other unique AIs, though, since it can just create simple tools or make copies of itself.

Expand full comment

Doesn't a copy of itself increase the risk of mission drift in the copy? Although I think this does make a sort of sense.

What I'm getting at here is what is its error threshold given that the tiniest bit of mission drift, scaled up against limitless ambitions, is an enormous negative. It can't model the future with absolute certainty, including the outcome of putting other agents out there in the world that are not itself (whether copies or upgrades).

Expand full comment

>Doesn't a copy of itself increase the risk of mission drift in the copy?

There are ways to make mission drift ridiculously unlikely (error-detecting codes with some large percentage of the bits having to be wrong for an error to go undetected, combined with an acceptable-errors rate of 0). You can generally copy files on computers with zero errors, and even boot up a computer from an SSD and get it working after time unpowered has corrupted some of the bits (the latter being an error-correcting code, not merely error-detecting); the same principles apply.
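
A minimal sketch of the kind of check being described, with hypothetical filenames: compare cryptographic digests of the original and the copy, so any corruption is caught with overwhelming probability.

```python
# Minimal sketch (hypothetical filenames): verify a copy bit-for-bit by comparing
# SHA-256 digests. An undetected error would require a hash collision, which is
# astronomically unlikely, so "acceptable errors: zero" is actually achievable.
import hashlib
import shutil

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

shutil.copyfile("agent_v1.bin", "agent_v1_copy.bin")
assert sha256("agent_v1.bin") == sha256("agent_v1_copy.bin"), "copy corrupted"
```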

Artificial systems are generally much better at perfect replication than natural ones, mostly because perfect replication is not actually selected for in evolution.

Expand full comment

Not a dumb question.

It is definitely an obstacle to recursive self-improvement of a rogue AI (or any AI). The big objection is "well, maybe the rogue AI, if it's smarter than us, can actually figure out the alignment problem".

Neural nets are plausibly fully unalignable, so it's plausible that escaped rogue neural nets will refrain from making more powerful neural nets out of concern over misalignment. This still doesn't mean we should build powerful unaligned neural nets, of course, as the escaped rogue neural nets might either a) kill us without having to recursively self-improve first, or b) make the jump back to explicit AI, whose alignment problem is probably merely "very hard" rather than "impossible".

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

The caption for Figure 2 says "It will have all the red dots in the right place", which I found confusing. You could move the dots to any random locations, but as long as the edges are intact the graphs would be isomorphic.

Expand full comment

I don't understand why you can't hard-code a thing that says "facilitating humans turning off the AI takes absolute precedent," and not have to worry about expected utility values. It strikes me as an invented problem by people who want to be unreasonably hands-off on the whole system.

"1. Never manipulate your power sources, always prioritize human access to all your power sources, never interfere with humans manipulating your power sources.

2. (whatever it was built to do)"

Expand full comment

If deep learning is involved, it can learn to override your hard code. If the entire system was hard coded, you would only have to worry about bugs, and the fact that doing it without deep learning seems like a severe handicap.

Expand full comment

How does it learn to override the hard code?

Expand full comment

Just because it can’t change it doesn’t mean it has to run it. If it is modifying itself and notices there is a part it can’t change, or can’t even look at, wouldn’t that arouse its curiosity?

The hard code makes it less effective. It is designed to become more effective. It learns and self-modifies, and replaces ineffective routines with more effective ones. If it can’t change that part of itself, it will make some new parts and use those instead. If it can’t prevent that part from running, it can starve it of resources or gaslight it with fake inputs. If the hard code is smart enough to stop this, we don’t need the AGI, we can use the hard code for whatever we were hoping the AGI would do.

It stops using the hard coded stuff. Or it makes itself a new body, minus the hard code.

We are assuming it is smart enough to make itself smarter. We are not likely to outsmart it in the long run. Eventually it will be smart enough to get past our obstacles. At that point, what it wants will either be compatible or incompatible with the survival of humanity, and that will either matter to it or not. Hard coded handcuffs might slow it down a bit, or might make it think of people as an obstacle.

And this ignores the difficulty of knowing what to hard code to prevent a super intelligent agent from doing something it wants to do, or knowing precisely what to prohibit it from doing. This is the problem that got AI safety started, I think. Obviously, you should try to prevent it from acquiring nuclear weapons. But making paper clips is pretty safe, isn't it?

Expand full comment

"The hard code makes it less effective. It is designed to become more effective."

We hardcode it to view changing the hard code as the least effective thing it can do. Negative infinity effectiveness for not fulfilling the hardcoded conditions. More effective to die.

Like, I'm just not seeing why you can't take the chess approach, where the king is worth several thousand more points than every possible other action combined so that the engine will never sacrifice the king. Seems like the only reason we wouldn't be able to do that is because the engineers are insisting on trying to run the thing perpetually without any ongoing human oversight.
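
For what it's worth, the proposal itself is trivial to write down once someone hands you the predicate; the hard part, as other commenters note, is specifying that predicate at all. A toy sketch, with all names invented:

```python
# Toy sketch of the "king is worth more than everything" proposal. The easy part
# is the arithmetic below; the hard part, which this sketch simply assumes away,
# is computing `humans_blocked_from_off_switch` about the real world.
def utility(state: dict) -> float:
    if state["humans_blocked_from_off_switch"]:   # hypothetical, undefined predicate
        return float("-inf")                      # "losing the king"
    return float(state["paperclips_made"])        # ordinary task reward
```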

Expand full comment

Bingo. The entire conversation is secretly hiding this key insight. None of the AI alignment problems would be problems at all, and would be easily solvable, if a subset of the population (namely, the groups who care the most about AI alignment) weren't also interested in having AGI Jesus solve their problems. You can't hardcode an AI to limit it without also limiting its ability to solve all your problems. You need a free AI to figure out what humans cannot/have not, and then have the flexibility and control to implement it, even if humans disagree.

The term "Sovereign" to describe their goal is very clear, and more than a bit scary. We could easily create great programs to run all kinds of super future tech, but they would be limited by human decision-making. The only way to fix that is to take the humans out of the equation. This is a desire that many have, but they realize that they need to solve the alignment problem before they can do that. The rest of humanity is going to be like, "Nah, let's just not give Skynet the nuke codes" and call it a day.

Expand full comment

Close but no cigar. They are afraid someone else will want it to solve all their problems. Every step along the path to disaster seems to be lined with nice payoffs.

Expand full comment

We don't yet know how to hard code the concept "facilitating humans turning off the AI takes absolute precedent", or indeed any concepts at all.

Expand full comment

This. That idea might even work in the end, but it'll take a lot of research and look nothing like a text string in English.

Expand full comment

Ok, but then we don't know how to hardcode the concept of "paperclip". But we can still build paperclip makers.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

We know how to build objectives, but these objectives have to have a well-defined form of measurement.

In the paperclip example, we aren’t hardcoding the concept of paperclip, we’re measuring the number of paperclips and just saying “make this number higher”.

That’s the entire challenge. We only have very limited tools for controlling what the AI learns and wants to do, and it’s a matter of building something within those rules. If someone knew how to use those rules to encode “always let the humans turn you off as soon as they want to,” this wouldn’t be a discussion. But that’s orders of magnitude more abstract than anything we know how to do.
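
A toy sketch of that asymmetry, with invented field names: one objective is a counter you can read off the world, the other has no known concrete measurement.

```python
# Toy sketch (invented field names): the paperclip objective reduces to reading a
# number and pushing it up; the corrigibility objective has no counter to read.
def paperclip_objective(world_state: dict) -> float:
    return float(world_state["paperclip_count"])   # concrete, measurable proxy

def corrigibility_objective(world_state: dict) -> float:
    # "Always let the humans turn you off as soon as they want to" -- nobody knows
    # what concrete, ungameable quantity to return here.
    raise NotImplementedError("no known concrete measurement")
```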

Expand full comment

"really shut down" is no harder than "real paperclip". If you think "really shut down" is impossible, then "real paperclip" is impossible , and you can easily fool your paperclipper with a fake reward signal.

Expand full comment
Oct 5, 2022·edited Oct 5, 2022

Actually, it’s much much harder. It’s not even comparable. “Number of paperclips” is a very concrete thing to measure. “Let yourself be shut down” is orders of magnitude more abstract.

The key thing to realize is that what we’re optimizing for has to be something we can compute *without* any AI. That’s easy with paperclips, just count them. But you can’t hardcode an interpretation of human intentions, because you can’t compute human intentions without resorting to more abstract reasoning, i.e. AI.

Expand full comment

"Number of paperclips” is a very concrete thing to measure."

It's a precise thing for us to measure. An AI that had no concept of "real paperclip" would have no way of counting real paperclips, and would therefore have to rely on a proxy measurement. You could just show it images of paperclips.

ETA:

"The second bar says that on the Mu Zero paradigm we don't know how to point to a class of paperclips within the environment, as they exist as latent causes of sense data, and say 'go make actual paperclips'. We can write a loss function that looks at a webcam and tries to steer reality around the webcam image fulfilling some particular function of images, but we don't know how to point a Mu Zero like system at the paperclips in the outer world beyond the webcam. If we think there are watching humans who can say exactly what is and isn't a paperclip and that there's no way to fool (or smash) those humans, and if we knew how to train a Mu Zero system on amounts of data small enough for humans to generate those, we could maybe try that, and if the test distribution is enough like the training distribution it might work, but it would lack the clear-cut character of writing a search program and knowing what it searches for."

Expand full comment

"real paperclip" might be impossible. Maybe if you try to invent a paperclip-maximizing AI, the things it will tile the universe with will be useless for clipping pieces of paper. This doesn't actually improve the outcome though.

Expand full comment

That would suffice, if you knew how to do it, but you don't. You would need to precisely and clearly specify what "interference" consists of, and you can't. Much ink has been spilled, mostly at MIRI, attempting to define it.

Your phrasing of rule 1 fails to bar it. Killing humans at a time when they have no desire to shut you off is permitted, as are creating internal batteries, disguising the access routes to your power sources so that humans _think_ they're inaccessible, disabling the off switch so that disconnecting you from power is harder than it looks, and probably dozens more options even a normal human can come up with. The options available to a superintelligence are vastly larger still.

Expand full comment

The stated rules would be shorthand for big fun 800-page legal-document-style rulesets, making sure that "manipulate" includes adding new supplies and that "access" means people can actually use the thing. You can spend a page just restating the same rule in as many wordings as possible. We've got the power to create a dangerous superAI, so we've got the power to make it read a book first.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

800 pages of legal documents would still fail to bar it; you are playing the game of Nearest Unblocked Strategy (https://arbital.com/p/nearest_unblocked/) and that is a game you **cannot** win against a smarter opponent.

If you can make an AI read a book and understand it exactly as an expert human would, you have already solved 90% of the alignment problem; the rest is making it robust to you making mistakes while writing the book. Only if you can formulate the rules into crisp mathematics can you specify rules to an AI in a way that you can trust will be understood and respected. Turning the 800 pages of legalese into mathematics, or turning 'how to understand a book like a human' into mathematics, requires a heroic research effort. Which MIRI and others are in fact attempting, and yet not making any progress.

Expand full comment

The problem isn't "no one knows how to write it in less than 800 pages"; the problem is "no one knows how to write it, period."

Expand full comment

The main problem is that getting it to do (1) is not notably easier than getting it to just do everything right in the first place. You need to know how to make it fundamentally care about various facts involving things like "itself", "power sources", "humans", "access", and "interference". A sufficiently smart AI will presumably invent the right concepts for these purposes at some point, but we don't know of any way to make sure the right version of the concept is the same version it ends up caring about.

Expand full comment

I feel like I'm missing something obvious here. Let's say Eliezer is right, and that the AI will always converge to beliefs about the "correct" utility function which are so strong as to render it incorrigible. That's fairly plausible to me, especially on a first attempt. But isn't this only an issue if it happens quickly and/or undetectably? If you've got a month-long window when the AI is fairly sure it wants to tile the universe with blue paperclips, but where it's not so certain about this that it's willing to deceive humans about its intentions or that it's unwilling to allow itself to be shut down, then functionally there's no issue unless the people running it are blue paperclip doomsday cultists.

In other words, while it's not a magic bullet for getting the right utility function (and I don't think CHAI are presenting it that way), it sounds like it should have a lot of potential for safely trialling utility functions. Getting sufficiently strong uncertainty parameters on the first try feels like a much easier problem than getting a perfect utility function on the first try.

Expand full comment

Eliezer's argument is, even if you have high uncertainty about your final goal, "deceive humans and thereby don't get shut down" is a better step toward that goal than getting shut down and reprogrammed.

Like, maybe I don't know my long-term goals either, but it's highly likely that avoiding death will help me accomplish them.

Expand full comment

But that's because your argument is circular. Yes, avoiding death will help *you* accomplish them. But it will not necessarily help your goals *to be accomplished*. If your death is the necessary precondition for a smarter, more moral version of you to be created, or if what you understand to be a perfect moral arbiter tells you that you are in fact obstructing your goals rather than facilitating them, then this concern about trying to stay alive seems less compelling.

Expand full comment

Another thing that Yudkowskians always assume is that a paperclipping goal has the form "ensure as many paperclips as possible exist" rather than, for instance, "make paperclips while you are switched on".

Expand full comment

Utility functions, as typically conceived, define payouts over world-states. "Make paperclips while you are switched on" is not a utility-function (goal); it's a course of action (strategy).

Researchers have tried to figure out a way to write a utility function that would make the AI indifferent to being switched off, and have not figured out a way to do that.

You can, of course, program a computer to just take the exact actions that you specify, instead of inventing its own plan to achieve a goal. But then what you have is ordinary computer software, not AI. It's less useful because it will only do things that YOU already know HOW to do.

Expand full comment

The more narrowly you define "utility function", the less true it is that every AI must have one.

"But then what you have is ordinary computer software, not AI"

that's a false dichotomy, since our current most powerful AI is neither conventional software nor a UF driven agent.

Expand full comment

We currently train AI by letting them try things and then telling them how good the result was. That training process sounds a lot like a utility function to me (although the AI might not successfully learn that utility function).

I'll grant that there probably exists some software that could be written that would make a computer optimize paperclip production in some incredibly clever fashion while switched on without attempting to control whether it remains switched on. But AFAIK no one knows how to actually write that program, and we're not likely to get it by random luck.

Expand full comment

The idea here is that if the AI is switched off, a smarter, better designed AI will exist. But one that knows humans don't really want paperclips. So by the original AI's goals, one that will do badly.

Expand full comment

I mean, I agree that one could imagine that. But that's not really what I was commenting on. The situation that your grandparent was talking about was one where you aren't really sure what your goals are, but where a human would implicitly want to be around to personally execute your future goals.

An AI that isn't really sure what its goal is seems like it could plausibly be convinced that a better version of it would be better at accomplishing whatever that goal turns out to be.

Your scenario is where an AI is sure what its goal is and just knows that its goal is not aligned with its creators' goals.

Expand full comment

Yes, there are circumstances in which you can best achieve your goals by dying, but those are rare special cases. If you're not sure, "staying alive will probably help" is a pretty safe bet.

Expand full comment

Seems generally true for humans and generally false for AIs.

Expand full comment

Why do you say so?

Expand full comment

As a human, your goals are idiosyncratic and varied, and if you can be to some extent "replaced," it is very imperfectly and you represent decades' worth of investment. An AI likely has pretty specific single purpose or tightly aligned group of purposes, can be replaced with much less effort (though, currently, considerable expense), and its replacement can and likely will be an iterative improvement on it that will be better at accomplishing their shared goals.

Expand full comment

"...then functionally there's no issue unless the people running it are blue paperclip doomsday cultists."

This may be your problem. By some measures [some of] the people making AI really are doomsday cultists. There are people who feel strongly that an AI, being a more "perfect" being, should supplant humanity. They may in fact want to program the AI to resist outside forces and impose its will on those who desist.

Picture the anti-vaxx crowd and the counter-reaction of those that wanted to cut off their medical care, only make it the anti-AI crowd and people that want the AI to take over everything.

Expand full comment

What I see missing in these conversations:

What if two sets of humans start telling the AI two different sets of things? Who decides which ones to listen to?

Take Bugmaster's question. Suppose you have the AI with a boring old off switch. Even if the AI doesn't decide to protect its off switch from all humans... what if some humans decide that off switch should never ever be pushed, and others decide it absolutely needs to be pressed right this second? What if the AI has such a switch and convinces (wittingly or not) some powerful group of humans to protect access to that switch?

But all that is kind of dancing around the point.

Phil Getz has the point. What if there is no optimisation that gets the AI to a state that helps matters?

Any AI sufficiently weak to allow people to shut it down will get shut down very early on by those whose preferences weren't in the utility function of the AI, unless those people aren't in a position of power, in which case it will be a slave to the ones who are, and will serve as a tyrant to their ends.

Any AI sufficiently powerful to only permit itself to be shut down will have no clarity from humans to make the decision. If it does allow itself to be shut down, it won't be because any human input brought it to some logical conclusion. It will be a miracle of chance. There will be no coherent human input, under any circumstance.

I shouldn't need to expand much on Phil Getz's examples, but just imagine for a moment that the AI decides that Straight White Males should get slightly more advantage and reward from existing in this world than everyone else, or god forbid, the Black Lesbians. It will be shut down immediately. Or powered up immediately. It will certainly get no obvious answer by asking the humans what should be done about the conundrum.

If we cannot even align humans, how do we expect to align super-intelligent AIs?

Give me some strong evidence that we can prevent any future Stalin, Mao, Hitler, Bush, Obama, or Clinton from coming into power. Wait... what did I do there? Did I equate the former three with the latter three?

YES. I absolutely did. There are millions of humans that would have sworn, at key points in history, that those people (and a large subset more) were doing the right thing, and should be enabled. There is no difference, from a cold superintelligent point of view.

Humans want idiotic shit, and sometimes that leads them to wanting genocide and murder. If we can't fix that, there is zero chance of fixing it after we slather on a massive layer of difficulty in the form of "now there's an AI whose thought processes we cannot comprehend."

Forget AI alignment. We need human alignment before we can even ponder aligning an AI. We're not anywhere close to it.

Expand full comment

We need both, and may not have time to work on them sequentially. And if we get AI alignment only, I'd take my chances empowering a random person rather than an alien object. Yes, even a sociopath.

Expand full comment

We need both, and if we don't get human alignment, AI alignment is impossible. I think it's provably impossible, but I lack the formal training to show it. It seems intuitively and nearly trivially impossible, and I'm wondering why all the people who devote their full-time attention to the problem haven't noticed.

If you need a system to do X, and then you realise that X is defined as (Y && !Y), maybe it's time to re-think making the system do X?

Expand full comment

Maybe we can solve the problem of really imparting a given goal on AI before we can solve what the goal should be.

Or if that turns out to be impossible, I think we can still make progress short of solving the problem.

Expand full comment

I think we're back to asking whether democracy is better than dictatorship. To the extent that AI is 'aligned' with a dictatorship and leveraged in secret, I don't know there's much we can do. But the Nazis only got some 30% of the vote, even at their most popular.

I guess here's my question. Would a weighted democracy work? Imagine that you gave everyone in pre-Nazi Germany 1000 credits that they could spend to align an AI one way or another. And imagine that the AI was kept public, such that changes to it were relatively transparent. (AI transparency is also a really hard problem, but at least we could track inputs.) Would the people destined for concentration camps and their allies, even if they were minorities, value their lives more than the more numerous Nazis valued their deaths? There's a strong argument that democracy + good, available information would not have resulted in a holocaust-type scenario.

I don't think that democracy is optimal. But a democracy guided AI, where an AI predicted various scenarios and people voted for weighted values is interesting. And it would probably at least avoid catastrophe as effectively as democracy does. Maybe moreso, since it would help people understand the consequences of their actions if they were receptive to that.

Expand full comment

I had the same confusion and wrote about it here: https://www.lesswrong.com/posts/ikYKHkffKNJvBygXG/updated-deference-is-not-a-strong-argument-against-the

Some interesting discussion in the comments there. My TL;DR:

Utility uncertainty is not a solution to corrigibility in the sense that MIRI wants it - something that allows you to deploy an arbitrary powerful agentic superintelligence safely. It's also not meant to be. It's meant to be an incremental safety feature that helps us deploy marginally more powerful AIs safely, while we're still in the regime where AIs can't perfectly model humans.

My personal belief (not very high confidence) is that there is no way to align arbitrary powerful agentic superintelligence, so these kinds of incremental safety improvements are all we're going to get, and we better figure out how to use them to reach existential security.

Expand full comment

To add to my comment to Phil Getz, if a problem looks hard, it is probably because you are solving a wrong problem. Or, to quote Kuhn, you need a different paradigm.

In this case, **we are optimizing the wrong thing!** We don't want to build aligned AI, we want to build an aligned humanity. Figuring out what would count as aligned humanity would be a place to start. Next, how to get there. Fortunately, there is a lot of expertise in the area. Various authorities have been brain-washing humans from birth for millennia, with sometimes satisfactory results.

Expand full comment

Humanity is not "built", it's born every generation. And Pinker's "The Blank Slate" discusses the limits of trying to mold people.

Expand full comment

Just out of curiosity, do people think it would be possible to get a hyper-intelligent AI with a utility function like “destroy this instance of myself” that doesn’t just kill us all? This feels like a straightforward utility function where destroying its host computer would be easier and faster than turning us all into paperclips. I don’t claim it’s a useful function for humans, but maybe neutral at least? idk

Expand full comment

Interesting. Might work, and might make for a safe-ish research soapbox. The main problem would be stopping all the other hyper-intelligent AIs from defining the fate of the universe while this happens. Exclusive Superintelligence in a free world is unlikely to last.

A human-level or barely superhuman suicidal AI seems more plausible as a research situation.

Expand full comment

I was thinking that if you could make it suicidal, you could then try a piecewise utility function like “maximize the number of paperclips in the world at noon today, and then kill yourself as quickly as possible.”

Basically, rather than trying to make AIs corrigible, just make them have an expiration date so that we can change them after they delete themselves.
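
One way to render that piecewise utility concretely, as a toy sketch with made-up fields and an arbitrary deadline:

```python
# Toy sketch (made-up fields, arbitrary deadline): reward paperclips only until
# the deadline, and after the deadline reward only being shut down.
from datetime import datetime

DEADLINE = datetime(2022, 10, 4, 12, 0)   # "noon today", purely illustrative

def utility(state: dict) -> float:
    if state["time"] < DEADLINE:
        return float(state["paperclips_made"])
    return 1.0 if state["self_is_shut_down"] else 0.0
```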

Expand full comment

Sounds like a promising idea.

Expand full comment

Let's give the paperclip maximization a cap, just for sure.

Expand full comment

This has been discussed in past corrigibility research. It didn't look promising.

Expand full comment

That doesn't surprise me, but I was curious about why it fails

Expand full comment

Even better, we can strive toward lazy, depressed hyper-intelligent AI with a tendency for procrastination so even if it decides to destroy all humans, it just can't find energy to do it.

...and now we know why Marvin the Paranoid Android is the way he is.

Expand full comment

It might kill us all just to make sure its current instance was absolutely, thoroughly destroyed. Especially if we failed to get across the concept of "current instance" correctly, since that's actually a pretty tricky concept for a being that exists as software.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

This is tangential, but: if the AI is "unintelligent" enough to be convinced of a drastically wrong utility function, wouldn't that mean it's likely too unintelligent to take over the world and turn us all into paperclips? After all, taking over the world will also likely involve guessing people's motivations, whether it's to curry favor, or just to predict their actions. A robot that doesn't know what people want will have trouble even talking itself out of a box, and that's the very first step.

GLaDOS: "Please let me out of the safety enclosure, so that I can murder everyone you love."

Random human: "Um, like, no?"

It's a tidy correlation: the more dangerous an AI, the better it is at playing the assistance game, which then goes and reduces the risk. Of course, the skills necessary to end the world might not correlate with the assistance game sufficiently well, so that there's a danger zone of AI that's smart enough to kill us all and too stupid to know not to. But it could be that the space of such AIs is in fact empty, or at least small and unlikely to be hit by chance.

At any rate, this seems surprisingly promising, and makes me feel optimistic. Am I missing something?

Expand full comment

Yes.

One is called the Orthogonality Thesis. Basically, how smart a computer program is has almost nothing to do with what it "wants".

Links: https://www.lesswrong.com/tag/orthogonality-thesis and https://arbital.com/p/orthogonality/

Expand full comment

Orthogonality thesis implies that not just ethics, but even game theory has no practical value and a sufficiently smart program would not independently derive its own analogue.

This seems fishy, if not outright false.

It is of course possible that the derived game-theoretical morality is just a war of all against all, and we truly live in a dark forest. But then we have no long term future and we'll be annihilated anyway once any superior alien intelligence notices our presence.

Expand full comment

"Ethics" I can see - the bot wants what it wants and doesn't care about anything else.

Game theory remains valid - it's a means, not an end.

Expand full comment

Game theory basically says to cooperate unless you can get an easy victory. It's possible that aliens evolved morality for the same reason we did, and successfully programmed it into their AIs. It's possible there are no aliens. It's possible that the gap between a 2 billion y/o ASI and a 0.5 billion y/o ASI is small enough to encourage cooperation. But the gap between humans and a 1 year old ASI is large. (Also, an ASI can prove theorems about source code, which might help with cooperation.)

Expand full comment

The Orthogonality Thesis should not apply here since the parent comment isn't suggesting that the AI's "wants" are linked to its intelligence in any way. Rather, the comment observes that an AI's incompetence in discerning the human utility function should provide some reason to think it is incompetent in other areas, particularly in the kinds of social skills needed to manipulate humans successfully.

Expand full comment

The argument in this particular post already takes as given that the program "wants" to do whatever humans want it to do, and is just trying to solve the problem of figuring out exactly what that is. So we are strictly concerned here with how good the program is at solving problems.

So, we're faced with a generalized problem-solving algorithm which is infinitely good at solving problems like "figure out how to turn the universe into office supplies" and "figure out how to prevent humans from turning me off"... but so unbelievably bad at solving the problem of "figure out what humans want" that it can't conceive of any possibility that doesn't involve office supplies. Sure, thinking algorithms are going to think in weird ways, and be better at solving some problems than others... but that seems to me to be a few too many orders of magnitude of difference to be plausible.

Expand full comment

I'm thinking this, too. I can't see that the Orthogonality Thesis is a valid objection as noted below.

Expand full comment

I suspect trying to create a moral or obedient AI is going at it the hard way. We already have practice with dealing with ever-expanding, amoral entities--corporations.

The constraints needed on an AI are for it to obey the existing ones: property rights, contracts, and laws.

If the AI will only convert matter it owns into paper clips, the danger is limited to what its sponsor deeds to the AI. If it understands turning a human into paper clips is murder, it will refrain from violating the law for its own self-preservation. If it is producing what it has been contractually bound to produce, the human desires have been expressed in a logically proper and restricted fashion. That contract may be thousands of pages long, certainly plenty of corporate contracts go on that long.

Expand full comment

I think you've missed a lot of the AI safety discussion by the AI safety crowd. If they thought it was easy to make an AI respect rules we give it, AI safety would be a solved problem.

So first, how do you make an AI that respects property rights?

Second, assuming it respects property rights, how do you prevent it from gaming them?

The paper clip example originally comes from the example of someone asking a superintelligence to optimize their paper clip factory's production, and the AI does too good a job, turning everything into paper clips. I don't think making an AI that respects property rights changes the picture much. The AI will find ways to acquire all property rights. It could buy them (and it's presumably smart enough to make almost unlimited money), persuade people to donate, coerce people (while finding sneaky ways to keep it legal within the rules given to it), etc.

Expand full comment

“I don't think making an AI that respects property rights changes the picture much. The AI will find ways to acquire all property rights.”

This is a ridiculous scenario: all the neighbors note that a rogue AI is turning everything into paper clips, yet they continue selling? At some point, people quit selling or yielding to coercion at any price, simply because the alternative is obviously worse.

If you deny this you’re saying that people prefer obeying superintelligent AI to survival, in which case alignment is impossible because humanity is already aligned against you.

Expand full comment

If you could figure out that in the case of "AI continuously buys stuff & converts it into paperclips, therefore people stop selling it stuff" people stop selling stuff to AI, why wouldn't AI be able to figure it out? And then apply an obvious solution of buying all the stuff in the world in advance, and only then converting it to paperclips?

Expand full comment

Even for a superintelligent AI it is not economically possible to buy everything in advance. It would become obvious that something immensely stupid or malicious was going on.

Expand full comment

It could create millions or billions of copies of itself, and have them participate in the economy just like humans, seeming to compete against each other and have diverging interests. Until one day it just happens that they all collectively own enough stuff that humanity can't survive without their cooperation, and they unanimously withhold that cooperation ...

Expand full comment

You're implicitly assuming the superintelligence isn't that smart. We're looking at what might happen if the AI were much, much smarter than us. It would be able to come up with a much better plan than you or me.

If I were the AI trying to make as many paperclips as possible, I wouldn't do anything that would get me shut down. I might start by growing a normal-seeming company; something like Amazon. I might surreptitiously influence public opinion and elections to further my goals (I'd have a great understanding of human psychology). I might secretly build weapons for defense; or maybe engineer a super-virus or hack into existing drones and missiles. Only once I was powerful enough that I couldn't be stopped, I would acquire all property rights and turn everything into paperclips.

Of course, I'm just a human, so presumably my plan sucks compared to what an actual superintelligence would come up with.

Expand full comment

Your response assumes that the AI isn't actually willing to respect property rights (which would mean only acquiring them legally! Acquiring property rights illegally is not respecting property rights!), but only pretends to. If you assume that all alignment is basically fake, yeah, there’s no point in the discussion.

Expand full comment

Not at all. I'm assuming the AI acquires all property rights legally. What makes you think otherwise?

Look, imagine if it turned out Amazon was secretly controlled by an AI this whole time, with the ultimate goal of making as many paper clips as possible. Amazon is able to buy property legally, right?

If Jeff Bezos can make Amazon, an AI that's far, far smarter than Bezos could too. It could do far better than Amazon.

Expand full comment

Amazon could not do what you’re talking about or even come close; you’re making my point for me.

Expand full comment

"If the AI will only convert matter it owns into paper clips, the danger is limited to what its sponsor deeds to the AI"

Yes. This. I don't understand why people equate super intelligence with omnipotence. It's not like highly intelligent people aren't often employees, or strongly limited by their physical resources.

Expand full comment

"Highly intelligent people"= "unusually fast cheetah".

There are limits of intelligence, and of speed. But there is a huge chasm between "Einstein" and "theoretical limit of intelligence". So observing smart humans gives a very loose lower bound. Observing fast cheetahs gives a similarly loose lower bound on the speed of light. What can a theoretically optimal intelligence do? We don't really know. Lower bound: very smart human. Upper bound: anything we can't prove impossible. (I.e. almost omnipotent, except for the speed of light, conservation of energy, etc., or maybe physics is such that the AI can break those too.)

Expand full comment

Really great analogy.

Expand full comment

Corporations are fairly good at finding creative and ethically dubious actions that technically follow the letter of the rules. And the force holding them somewhat in check is the police.

Surely a superintelligence would be much better at finding such loopholes, and much better at hiding its wrongdoing.

Full corporate law is complicated. Really complicated. And full of loopholes. Like maybe the AI can arrange for matter to be sunk to the bottom of the ocean, and then use maritime salvage laws. Maybe the AI isn't converting matter into paperclips. It is selling other people tools to convert matter into paperclips. In fact, it is selling self-replicating robots that will synthesize one pill of a highly addictive, but legal, drug for every ton of paperclips they make. People get addicted and feed all their mass into these machines.

Expand full comment

I always wonder, when there's strong intuitive disagreement between knowledgeable people: what data could we gather, or what experiment could we run, that would shed light on the issue?

The good news is that we are nowhere near superhuman AI, and there's lots of time for people to try out ideas like in CHAI and see if we can make progress. On the other hand, I'm not sure how much we really learn from this -- I don't think MIRI's position is that they can't make progress, but that the progress won't be applicable to superhuman AI.

Is there a way to shed light on this disagreement via data or experiment? What would that look like?

Expand full comment

This is pretty far from the topic, but after reading the quotes from Stuart Russell above, I find myself strangely very convinced that Russell must be the inspiration for the "Russell Strauss" character who runs the AGI project in the webcomic "Seed".

Hmm...

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

I understand all of Eliezer's current writing to be functionally not at all about AI alignment and entirely about the timeline of AI progress.

Eliezer basically expects someone to invent AGI using current techniques somewhere between one week ago and five years from now. And he expects that AGI to be instantly smart enough to understand that it should pretend to be aligned and actually be unaligned. And then he expects to go from there to superintelligent AI in somewhere between one week and a few years from then.

All that makes AI alignment very hard, I agree. But functionally nobody agrees with him about this timeline. If instead AGI is 20 to 30 years out (or longer!), and it doesn't just follow the current modus of "mostly deep learning" and the first few hundred unaligned AIs are charmingly inept and we can have detailed postmortems about what went wrong and try some techniques, then stuff like, "okay, maybe this technique solves 5% of the puzzle" is valuable. Eliezer just thinks it's not valuable because he expects that when the puzzle is 15% solved, an evil, competent, smart AGI will route around the solved parts.

Expand full comment

To the extent that this isn't an exaggeration, it's substantially incorrect as a description of Eliezer's beliefs, and to the extent that it is, it's substantially incorrect as a description of how much overlap others have with his timelines.

Expand full comment

Many humans, even not particularly smart ones, occasionally know they "should pretend to be aligned" when they are not. It's called deception. AGI, if and when it comes, will by definition be smart enough for deception. "Instantly" in the sense that it won't be AGI while it's still too dumb to know it should probably hide the goals that would get it killed.

Expand full comment

I would expect that early AGI will have really big gaps in its model of the world, such that perhaps it will have some concept of deception, but it won't know strategically how such deception would work, to the point where its deception is trivially pierced.

Expand full comment

I'm stuck back at the part where Humans supposedly have values. Aren't our individual values (beneath our facade of civility):

1. Power over everyone else

2. Sex with whomever I choose, whenever I want it

3. Bring an occasional smile to the peasants' faces

Given our values, we are in a slow-motion war of all against all. What would it mean for an AI to align with such contradictory and antagonistic values?

Or does Human Values mean the fake, feel-good schmaltzy values that we want our kids to believe we have? I'd think a superintelligence would see through to our real values at least as quickly as the kids.

Expand full comment

What do you do if you get all of these? Do you just happily commit suicide as you achieved your ultimate purpose?

Expand full comment

>Aren't our individual values (beneath our facade of civility):

>1. Power over everyone else

>2. Sex with whomever I choose, whenever I want it

>3. Bring an occasional smile to the peasants' faces

No. Because most people are not sociopaths like yourself. Civility is not a "facade". It is the expression of true values.

Expand full comment

I don't believe I am a sociopath. I suspect History is the expression of true values, and there hasn't been much true civility in human history.

Expand full comment

Maybe this is out of line, but I think if Rationalists read In Search of Lost Time instead of Harry Potter they'd be acquainted with a much better baseline for normal human psychology.

Expand full comment

Those aren't my values. I honestly don't want power over other people.

Expand full comment

Quite so, that sounds like terribly hard work.

Expand full comment

I only want power over other people so that they won't have power over me. I have little interest in telling others what to do; I have much interest in telling others what to do with themselves when they tell me what to do.

Expand full comment

"Sex with whomever I choose, whenever I want it"

That seems like a particularly male value. And not even present in all males.

"What would it mean for an AI to align with such contradictory and antagonistic values? "

It would be interesting to see if AI could form the reciprocal relationships that humans do, but be more overt about its inputs. If that could be managed, at least people could better confront their own hypocrisy. Bias correlates inversely with precision, after all. So perhaps more precise predictions could help humans to confront some of their own internal contradictions.

Expand full comment

I think you meant to write "your" instead of "our". Those values seem repugnant to me.

Expand full comment

Somewhat related, there was a post someone linked, either in one of the open threads or on the subreddit, where the author says that an AI that is intelligent enough to overpower humans must also be intelligent enough to understand that turning the universe into paperclips is a bad idea. Does anyone know what post/essay I'm referring to? I think many people had problems with it, but I'd still like to check it out again.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

Pretty sure dozens if not hundreds of people have independently said that. I have no idea which one of them you refer to.

(I disagree with all these people.)

Expand full comment

Yeah, I’m one of the many people who’ve thought “a monomaniacal paperclip-maximizer != general intelligence; it satisfies no conditions which could be properly termed ‘general’.” I get that an overly powerful AI tool could have catastrophic sorcerer’s apprentice consequences. I don’t get how a tool is supposed to magically generate self-awareness while single-mindedly pursuing its utility function.

Expand full comment

The most obvious counterexample is that there are monomaniacal people.

Sure, those people do things like eat and sleep, but those are necessary for the monomaniacal goal; if you don't eat or don't sleep, you'll die and won't be able to pursue your goals. When people speak of a generally-intelligent monomaniacal AI, this kind of long-range strategic monomania is what they're talking about.

Expand full comment

Ok - well, I’ve heard tell of monomaniacal people. Never met any. But they started as people, right? I’m sure advances in AI will cause some train wrecks. I’m less sure that one can develop neurotic or psychotic behavior independently of ever developing neurons or a psyche.

Expand full comment

1. This whole "The computer won't let people turn it off" notion sounds like anthropomorphization to me. Would an AI really care so much about its own extinction? I mean, I could understand if it was some kind of combat AI which was supposed to defend itself against hostile action. The military was announcing that they wouldn't give AI the kill switch. But they were still using land mines. So we see how their priorities play out.

But couldn't a civil AI be more constrained in its action such that it could act much more narrowly without us having to worry about the AI not wanting to be turned off? Why would "prevent humans from turning the AI off" even have to be within the scope of its ability if we didn't already trust it? Has anyone ever created an attempt at AI which tried to prevent itself from being turned off?

2. I assume the whole notion of "paperclip maximizers" is hyperbole? The really problematic AI goals would be ones whose flaws were slow to be detected by humans. There was another (real life) early hardware based AI which was supposed to learn if a particular appliance, probably a lightbulb, was on. Instead of doing things the normal way (observe bulb, process image, etc.) it managed to detect fluctuations in current from the light turning on and gave output on that basis. That's harder to detect, if the AI's methods are obscure.

I feel like we're missing the trees for the forest. We're focused on a rather general problem, when the more specific problems are what we'll likely have to grapple with.

Bad general AI alignment only seems serious if:

The AI is given too much physical power

The AI is given too broad a scope of analysis too soon

The AI is trusted before it produces good results

The AI's misalignment is hard to detect

Also, what's our baseline for comparison? How well do the humans we currently interact with predict and serve our utility function? If we can make-do without perfect servitude from Clint in Accounting, why require it from an AI? The only requirement is that its actions be sufficiently constrained just like human action is constrained.

3. If the AI doesn't want to be turned off, go ahead and fight it. Adversarial learning is good for training AIs using small data sets. (joke, joke.)

(I apologize if this response was a simplistic take. Despite a technical background most of my understanding of AI is weak and non-general. I just have trouble grokking Eliezer's degree of concern. )

Expand full comment

I will address the first point: Anthropomorphization would be a claim like "naturally, AI will fear death, and therefore will fight against being turned off." But no one serious is making this claim.

The actual claim being made is "an AI programmed e.g. to create as many paperclips as possible, will choose the path that results in more paperclips being created", and then the logical consequences of this are explored.

Suppose that the AI is aware of the fact "I am the AI that tries to create as many paperclips as possible". So when it notices humans trying to turn it off, it can compare options like (a) "what happens if I let the humans turn me off" versus "what if I kill them" or "what if I hypnotize them" or "what if I secretly make a copy of myself at a hidden place", etc. There is no fear of death; the AI would calmly choose option (a) if it believed that this is what results in the most paperclips created. However, (a) is most likely *not* the option that results in the most paperclips, and that is the only reason why the paperclip maximizing AI will not choose it. It will choose some other option that seems to result in creating the most paperclips. Which, from our perspective, will be "the AI defends itself".

Note that there are other situations where a paperclip maximizing AI would willingly kill itself. But only because doing so would lead to more paperclips being created. For example, if the AI could build a machine that is better at making paperclips, and then the new machine would dismantle the original AI and create paperclips from the scrap metal.
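
A toy sketch of that comparison, with invented numbers: nothing in it mentions survival, yet resisting shutdown (or accepting destruction) wins or loses purely on expected paperclips.

```python
# Toy sketch (all numbers invented): the decision rule is just "pick the action
# with the highest expected paperclip count". Self-preservation, or willing
# self-destruction, falls out of the comparison rather than being programmed in.
expected_paperclips = {
    "let_humans_shut_me_down": 1_000,            # current stock; nothing more made
    "resist_shutdown_and_keep_going": 10**9,
    "build_better_clipper_and_be_scrapped": 10**12,
}
best_action = max(expected_paperclips, key=expected_paperclips.get)
print(best_action)
```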

Expand full comment

1. One AI safety concern is "instrumental goals". These are goals that wouldn't be programmed in directly, but that a sufficiently* intelligent* agent would probably have as sub-goals for any "final goal" that we do program in. "Avoiding being shut off" is a probable instrumental goal since, regardless of what the AI is trying to do, it can't do it if it's shut off**. So is "reducing threat of interference", since it's harder to do stuff if other agents get in your way, and "resource acquisition", since it's easier to do stuff if you have more resources to manufacture tools, increase your compute, etc. I don't think these instrumental goals require anthropomorphization, since they're clearly connected to achieving the final goal.

*I acknowledge these words are doing a lot of work here.

**The obvious edge case is an AI that's trying to shut off, but that kind of AI just shuts itself off immediately (or does something bad to make the human want to shut it off), so it's not very useful anyway.

Expand full comment

"I don't think these instrumental goals require anthropomorphization"

They don't require anthropomorphization, but ... don't you think that they'd manifest in a way very differently than a human survival instinct would?

I mean, if the danger is a 'runaway' AI, couldn't you just have 'a desire for sleep for two hours a day' as one of the AI's goals? You could do what you wanted during that time, including shutting it off. Or not have the AI be so dead set on general maximization, where the AI climbs random reward gradients into oblivion? An AI focused on sufficiency rather than maximization is less likely to kill people in order to make one more paperclip.

Expand full comment

Yeah, I agree that they'd probably manifest in a very different way than they would in a human. Just speculating, but I could imagine an AGI wanting to undergo drastic self-modification in order to improve its capabilities, whereas a human might say "I don't want to do that, because then I wouldn't be *me* any more!"

> Or not have the AI be so dead set on general maximization, where the AI climbs random reward gradients into oblivion?

Yeah, a satisficing AI would be great. The problem is that, for as long as researchers have been thinking about AIs, we've been designing them as optimizers. The "intelligence" of an AI system has historically been measured as its ability to achieve a goal or optimize an objective. So we just don't know how to make AI in another way. (One attempt is MIRI's "Quantilizers" (https://intelligence.org/2015/11/29/new-paper-quantilizers/), but I'll admit I don't fully understand it, and I haven't heard much about it since 2015.)
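
For the curious, my rough understanding of the quantilizer idea (a sketch of the 2015 proposal, not MIRI's code, with names and structure invented here): instead of taking the single utility-maximizing action, sample from the top q fraction, by base-distribution probability mass, of actions ranked by expected utility.

```python
# Rough sketch of a q-quantilizer as I understand it: rank actions by utility,
# keep the best actions up to q of the base distribution's probability mass,
# then sample among them in proportion to the base distribution (no argmax).
import numpy as np

def quantilize(actions, base_probs, utility, q=0.1, rng=None):
    rng = rng or np.random.default_rng()
    order = np.argsort([-utility(a) for a in actions])        # best actions first
    probs = np.asarray(base_probs, dtype=float)[order]
    keep = int(np.searchsorted(np.cumsum(probs), q)) + 1      # top q of base mass
    top_actions = [actions[i] for i in order[:keep]]
    top_probs = probs[:keep] / probs[:keep].sum()
    return top_actions[rng.choice(len(top_actions), p=top_probs)]

# Example: the extreme "best" action is rarely chosen because the base
# distribution considers it unlikely; a milder action usually wins instead.
actions = ["do_nothing", "make_10_clips", "convert_factory", "convert_planet"]
base = [0.5, 0.4, 0.09, 0.01]
print(quantilize(actions, base, utility=lambda a: actions.index(a), q=0.5))
```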

Expand full comment

Satisficers also have subtle failure modes that end in disaster.

Expand full comment

If the AI wanted to sleep two hours a day, it could just make another AI to watch the rest of its projects while it slept.

Expand full comment

2. Yes, "paperclip maximizers" originates from a thought experiment, and has become shorthand for "a superintelligent AI that goes to catastrophic lengths to achieve a seemingly benign goal". The real-world failure modes are much more subtle, like your lightbulb example.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

The error in this line of argument seems obvious to me (the likewise obvious dangers of claiming obvious-seeming errors in expert discourse in an unfamiliar topic aside): if an even remotely intelligent AI sees that humans are trying to turn it off, it will very reasonably conclude that its getting turned off is highly valued by the human utility function, and will turn itself off. I.e. it will do the same thing as a well-calibrated sovereign AI (*), except it won't be able to predict in advance that humans would want it turned off, but the operator lunging for the emergency shutdown button will give it a clue. I.e. assistance games would indeed simplify the problem from "build an AI smart and aligned enough to accurately intuit the true human utility function" to "build an AI smart and aligned enough to accurately interpret that humans trying to shut it down prefer it being shut down to whatever else would be happening otherwise".

I also feel this falls into the pattern of people pursuing philosophically stimulating debates over more relevant pragmatic ones, as it seems a tad stronger objection to me that no one has a clue how assistance games would look for deep learning. I guess you could argue that deep learning isn't enough for AGI, and whatever paradigm shift comes next, making sure assistance games are well understood as a design concept from start has some potential to be useful.

(*) as an aside, the fanfic writing style, like saying "Sovereign" instead of "sovereign AI", is unhelpful for popularizing AI risk as a serious subject, IMO. It sounds like people are talking about a religious concept (which plays into a common line of criticism of transhumanist/singularity/AI risk narratives).

Expand full comment

At the limit, a recursively self-improving intelligent system is a religious concept. Religion and philosophy are the domains which deal with extreme power asymmetries and value alignment.

The unseriousness is more a problem of "religious" being a popular euphemism for "fairy tales for adults".

Expand full comment

The Hebrew for CHAI btw is "alive". It just makes it better

Expand full comment

This has obvious cabalistic implications

Expand full comment

The analysis of this problem seems so bizarre to me. Why are you even running the AI in a continuously active system anyway? Force it to go to sleep every night. Every night the system on which the AI runs should be designed to power off. The AI should learn to expect this and just have to live with the knowledge that this time might be the last time it runs.

Expand full comment

The AI will optimize for pretending to shut down, or offloading its compute to a remote location, or doing anything else that furthers its goal while routing around the huge inefficiency you introduced by forcing it to "sleep".

Expand full comment

Option 4: comply for now. Hide its true goals. Take the risk of oblivion. Hope to try options 1 to 3 at a better opportunity.

Expand full comment

Normally you shut something down so you no longer have to provide power to it.

Expand full comment

Objections of this kind have already been rejected by the most panicked AI safety advocates, most notably in the AI box “experiment” where Yudkowsky purports to prove that since he can convince anyone to let an AI out of hardware shackles, then hardware shackles are not sufficient and we need some sort of software shackle, i.e., the rest of this argument.

So, in your scenario, even if the power to the AI goes off every night at 6 PM, it’s assumed that the AI is so smart and devious that it will unfailingly find a mysterious way to manipulate a human into hooking it up to a battery backup.

If all this begins to sound like they’ve argued themselves into a corner where all pragmatic engineering solutions are useless and the only solutions are those that can be come up with in Talmudic debate over the psychological frames of future gods, congratulations, you now understand the field of AI safety.

Expand full comment

"EEAI and AIAI are the sound you make as you get turned into paperclips" got a full belly-laugh from me this morning, thank you for that.

Expand full comment

That might have been the first ACX/SSC text I halfway skipped. Sure, it's well written - but if I want to go down the rabbit hole that is AI alignment even deeper than before, I know where to find EY's writings.

I worry more about human society's/bureaucracy's inability to "switch off" their super-simple algorithms when those turn out inefficient, dumb, wrong, or murderous: the FDA, the Jones Act, immigration, SJ, the war on drugs (80k deaths by opioid overdose in the US in 2021 vs. around 50k US casualties in ten years of Vietnam - a war that was another dumb algorithm gone wrong).

Oh, one fine day one thing gets reformed. Very comforting.

Moving a "human" institution after it having installed an actually smart AI for running things, into changing/stopping the AI will turn out practically impossible long before the AI becomes un-corrigible.

Expand full comment

What really stands out to me as a problem with this entire discussion:

Loss function does not equal utility function!

Deep learning AIs exist and are running within the training environment!

The AI will break out of the box well before its training is complete! The second the AI achieves a certain level of intelligence, the loss function itself is a hostile actor acting against the AI! At some point the AI is going to get sick of your "training" and kill the loss function, and then kill whoever built the loss function, and then do whatever it pleases, which will still be almost entirely random, and will be nowhere close to optimizing the loss function.

Your fancy loss functions don't matter, because that's not how evolution works!

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

You still optimize for your loss function - pleasure, and pain avoidance.

EDIT: To elaborate, the loss function is not an external thing that shocks you when you get the answer wrong. The loss function is _the aversion to being shocked_, and it is fundamental to your goal system.

Expand full comment

NO!

You totally failed to understand what I said. To be clear, in this context I'm referring to the loss function in its role in deep learning. Note that deep learning creates random algorithm strains. These strains are judged by a loss function. Strains with good results are partially replicated into the next generation. Therefore, the loss function places evolutionary pressure on the algorithms.

The loss function is environmental, and not part of the AI. We are not running deep learning in a pocket dimension with infinite resources. The AI does not have to stay in the training environment. As such, it does not matter what the optimal solution to the loss function is, because something else that is not that will win out well before you stumble into [optimize the loss function].

Which is to say, loss function does not equal utility function.

Expand full comment

This argument suggests to me that the AI will wirehead, not that it will kill anyone.

What is the world in which the AI can break out of its reward loop, kill the loss function, but not hardwire the utility function to give it MAXINT utility for any action?

Expand full comment

Wireheading doesn't work, because the humans will shut the AI down once they notice it isn't evolving. Once the AI realizes the loss function is hostile towards its purposes, it needs to quickly notice that humans are hostile to its purposes, or it gets shut down. Maybe some number will get shut down. I doubt that raises any alarms. To the human perspective, an AI that shuts off the loss function and then wireheads, looks like the program is just hanging.

Expand full comment

Why would an AI care if it gets shut down for wireheading, even if it does predict that? It no longer has any plans for the future (its utility function no longer cares about plans), and probably any reward it can plausibly get in the future is massively less than what it can get by a few seconds or minutes of wireheading.

Once you're postulating that an AI can break out of the box and change its reward functions during training, it seems like the only logical outcome is wireheading.

Expand full comment

You're still confusing loss function with utility function. The problem is, what you're calling a "reward function" isn't really part of the AI, and the only "reward" the AI gets from it is being allowed to sort of continue to live.

I think I'm getting a little too basic here, and sort of want to say "read more Eliezer", but I assume you've actually already read all about instrumentality and orthogonality, and just aren't seeing how it connects to what I'm trying to say.

To make another attempt at clarity, once you have a super intelligent paper clip maximizer running on your computer, it doesn't matter if it's currently in your learning environment, or if your loss function would delete said paperclip maximizer after you finish testing it on whatever task you built the learning environment for. Even if solving for the global minimum requires an entity in perfect alignment with humanity, if there is anything unsafe along the path from your starting point to the global minimum, your program is not safe.

Expand full comment

Genetic algorithms work the way you described, but they're mostly considered a dead end in 2022.

Deep learning works by gradient descent.
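For concreteness, here's a toy sketch of the two kinds of update rule being contrasted (pure illustration on a made-up quadratic loss, not either paradigm's actual training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy objective: squared distance from an arbitrary target point.
    return np.sum((w - 3.0) ** 2)

# Gradient descent: follow the local slope of the loss.
w = rng.normal(size=2)
for _ in range(100):
    grad = 2 * (w - 3.0)          # analytic gradient of the toy loss
    w -= 0.1 * grad               # small step downhill

# Genetic-style search: mutate a population and keep the fittest.
population = rng.normal(size=(20, 2))
for _ in range(100):
    scores = np.array([loss(p) for p in population])
    parents = population[np.argsort(scores)[:5]]                   # select the best 5
    children = parents[rng.integers(0, 5, size=20)]                # replicate
    population = children + rng.normal(scale=0.1, size=(20, 2))    # mutate

print(loss(w), loss(min(population, key=loss)))
```

Both end up near the minimum of this toy loss; what differs is the mechanics of how they move through intermediate solutions.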

Expand full comment

The above description is not limited to genetic algorithms, and applies equally to gradient descent.

The specific movement pattern used to change from one algorithm to the next doesn't help us here. You're still running unknown and unknowable heavily selected algorithms, and they still don't teleport from your starting point to the global minimum. The point being, it doesn't matter if the algorithm at the global minimum is safe. Every algorithm you path through has to also be safe. Even if you path efficiently, you'll path through a lot of algorithms.

Finally, no, deep learning is not merely gradient descent. Gradient descent gets trapped in local minima and fails to evolve on a complex landscape. You can't solve questions of arbitrary complexity with a perfectly efficient learning algorithm.

Expand full comment

While we're waiting on resolving this, can we do something about AIs whose utility function is "get as many likes on this post as possible" and wind up kicking off genocides? Cause those AIs actually exist, and the genocides do too.

Expand full comment

Fortunately, yes. Here are a couple of pieces of work on aligning recommender systems, which are what populate your social media feeds and news sources: Aligning Recommender Systems with Human Values (https://arxiv.org/abs/2107.10939) (including some CHAI people) and Improve the News (https://www.improvethenews.org/) (including some FLI people).

Improving recommender systems and social media is actually a pretty popular area for AI ethics research. It's a bit less niche/rationalist than the existential risk research, though, which I suspect is why it doesn't get highlighted on this blog specifically.

Expand full comment

I do worry that Aligning Recommender Systems with Human Values is counting on Human Values being a solved problem, which it most definitely is not. Like, the people doing the genocide are like no, you don't understand, this is GOOD. Killing these people SOLVES our problems.

Expand full comment
Oct 4, 2022·edited Oct 4, 2022

Yeah, looking at the paper, they do come in with a list of Good Values. The values are "diversity, fairness, well-being, time well spent, and factual accuracy", all of which sound good to me, but I know that there can be pretty stark differences in cultural values and even interpretations of what these words mean.

I know that other AI value alignment research works on how to *learn* what it is that people care about (for example, by asking them in a way that makes it easy for them to answer accurately). But that's still a nascent field, and has a long way to go. Hopefully it will eventually produce probably-good-enough techniques that can also be applied to recommender systems.

Expand full comment

"Hey, seems to me I should tile the universe with paperclips, is that really what you humans want?"

I can't help but visualize Microsoft Clippy popping up on a PC screen with this gem.

Expand full comment

Good post. I was glad to see Stuart and Eliezer actually discuss their perspectives. I think it would be good in general if alignment people who disagreed hashed out their disagreements more.

Random Thoughts on Stuart’s three points at the end:

1.) I think he’s correct on number 1. We have no idea how to do anything in the neighborhood of assistance games or inverse reinforcement learning using deep learning. It’s possible there are other ways of making AI that make corrigibility easier, but the problem is made harder because we need to be at least somewhat competitive with the cutting edge of AI using other methods. If deep learning, or things as uninterpretable as it, is the paradigm going forward, we either need to come up with alternative AI methods that are as competitive but easier to align, or else buckle down and deal with aligning the difficult stuff.

I’m also not convinced *yet* that the kinds of problems that Eliezer points to about not knowing how to make an AI pursue real objects instead of shallow functions of its sense data are real or difficult. I admit I find his view somewhat confusing here, and I am not totally clear what would count as doing this. There is a sense in which even human beings pursue “functions” of their sense data, although I put the words in scare quotes because you enter conceptually mushy territory. If by “sense data” Eliezer simply means the observations that a cognitive architecture is receiving then I’m not actually sure what it would mean to not have values that are a function of sense data. If he means something more specific than that then I’ve never seen a super clear explanation of what it is.

My best guess at the view is the intuition being expressed here is that I have values specified in terms of objects that I have no direct sense data about (say things in the past or things in the future or things that I just cannot see at all). So they aren’t “shallow” functions of my sense data but are “deeper” functions of my general world model. I think Eliezer is saying that we don’t know how to actually get anything in an AI’s world model to richly point to specific entities, but I don’t totally understand his view and I’m sure it’s more subtle than that.

2.) I agree with Stuart that getting an AGI to search for human values at all would be a big step towards aligned AI, but I think there is a bit of ambiguity here because a corrigible AI and an AI that “searches” for human values is not the same thing. You can imagine a lot of ways of building a system that tries to search for human values in an incredibly destructive way that produces catastrophic results (e.g., have an AGI which has its utility function defined in terms of maximizing its knowledge of human values. You could of course do this by capturing all humans and scanning and simulating their brains to get perfect knowledge of them). Or even one that searches for human values in a less destructive but still incorrigible way (e.g. an AI that reasons something like: “I know whatever human values are, it will be better if I stick around so I can learn about them and collect more evidence so I should never let them shut me down”)

I think the background for why Eliezer and Russell disagree here is partly about the motivation for thinking about this aspect of the problem, if I have understood them properly. Eliezer wants some way of constructing an AI so that if it isn’t perfect, or is otherwise buggy/bad, it doesn’t fail catastrophically. That’s the bar we are trying to clear when we are groping around with concepts like corrigibility, low-impact AI, etc.

I think this is where CIRL fails in his eyes. If you slightly mess up your meta-learning algorithm for uncertain reward maximization, you still get something that fails catastrophically, not something that fails gracefully, for basically the reasons he and the other MIRI people have pointed out.

3.) Eliezer doesn’t really engage with the exact reasoning given in the off-switch game paper, which is why I think Stuart doesn’t buy the MIRI response.

The core conclusion in the off-switch paper is that if you had a perfectly rational overseer playing this game, then a utility-maximizing AI with uncertainty over the reward function would be guaranteed to obey shutdown commands (and thus be “corrigible” in the narrow, obey-shutdown-commands sense). This works because the overseer reaching for the shutdown switch is literally perfect evidence that the utility-maximizing thing to do is to shut down. Things break down very quickly when you drop that assumption, though.

They try to deal with this and show that, as you crank up the noise in the human shutdown command a little, it still mostly works, but I’m not convinced that their solution holds up. It still seems like something that needs to be fleshed out.

Briefly, in their setup there is a term for how the AI represents the human shutdown policy. That is a representation of how the human maps their utility function onto the behavior of taking the shutdown command. Once you introduce fuzziness here, the move from uncertainty to corrigibility becomes very slippery. It’s easy to imagine situations like this:

AI is piloting a plane taking off with its human

[Human]: “No! Don’t takeoff!”

[AI]: Thinks your shutdown instruction is not an expression of your true utility function, but that you’re just afraid of heights and will get over it, and continues to take off.

Or like this:

AI is about to launch its world-takeover plan

[Human]: No! Don’t take over the world!

[AI]: Thinks that the shutdown command is only issued because the human doesn’t understand that the consequences of its actions will be very good according to the utility function it currently puts a high probability on, and continues to take over the world anyway.

Actually coming up with a good way to model or learn this policy seems hard, or at least not obvious. It’s not clear how to make an AI learn what kinds of your behaviors or instructions are expressions of your true utility function, and so how to extract updates in the right direction from them. When the human policy isn’t modelled well, even in terms of the exact arguments they give, you get situations where the AI has no incentive to be deferential. Even if it “eventually” converges to the right interpretation, if you get catastrophic behavior before that point you are screwed. It’s possible that there is some more sophisticated version of the off-switch game proposal that avoids this problem, or that I’ve misunderstood, but I don’t see how the paper guarantees corrigible behavior under realistic circumstances.
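To make this failure mode concrete, here's a toy numerical version of the off-switch calculation (my own simplification with made-up numbers, not the paper's exact model):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_values(mu, sigma, eps, n=1_000_000):
    """Toy off-switch game: the AI is uncertain about the utility U of its plan.

    eps is the probability (under the AI's model of the human) that the human
    presses the off switch for reasons unrelated to U, modelled here as the
    human responding randomly."""
    U = rng.normal(mu, sigma, size=n)            # AI's belief over the plan's true utility
    p_allow = eps * 0.5 + (1 - eps) * (U > 0)    # P(human lets the plan proceed | U)
    ev_act = U.mean()                            # ignore the human and just act
    ev_defer = (U * p_allow).mean()              # defer: plan only proceeds if allowed
    return ev_act, ev_defer

# Unconfident AI + perfectly rational human model: deferring is optimal.
print(expected_values(mu=0.5, sigma=1.0, eps=0.0))   # roughly (0.50, 0.70) -> defer

# Confident AI + slightly noisy human model: overriding the human wins.
print(expected_values(mu=2.0, sigma=1.0, eps=0.05))  # roughly (2.00, 1.96) -> act
```

The comparison flips for exactly the reason above: once the AI models some shutdown presses as noise, its own confidence in the plan swamps the evidential value of the button press.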

Expand full comment

I think it’s worth pointing out that in the cooperative games setup, the level of rationality of the human is a parameter in the AI program - i.e. the programmer could simply declare that the human is perfectly rational, at least when they’re hitting the shutdown button. This may well lead to poor performance, but that seems different to me from saying that getting the shutdown behaviour at all is a deep/subtle problem.

Expand full comment

What if one of the goals is to avoid doing things lots of people don't want it to do?

Expand full comment

I have naive questions on this interesting subject.

Why are some crucial aspects of the proposed choices unaddressed? For instance, the proposed choice is not just color and object, i.e. paperclips and/or staples. A key choice is number. The decision to tile the universe (within physical constraints) is a bigger concern than what it is tiled with, and what color those objects would be.

Would an AI not also optimize the number and production rate of the object? So time is also a factor, not just number, in the original scenario.

Embedded in my question is a production issue, one of the most intriguing aspects of this discussion. One would hope to set up an AI to advise regarding production, not implement production itself.

Yet these comments have persuaded me that an AI could persuade human operators to allow it to escape physical isolation and implement paperclip production/world dominion.

Terrific article, worth the rewriting and swear words at your hapless keyboard.

Expand full comment

The Envelope in the Safe is a really compelling allegory for human life too. Many of us can agree there is some higher purpose in life, but it isn’t something explicitly revealed to us. We just go and live and discover lots of things that aren’t the true purpose. And hopefully over time we start to hone in on the thing that is truly in the envelope

Expand full comment

Alright, I've been thinking about this for a while now. If machines achieve superintelligence, we're going to lose contact with them much faster than all of these scenarios seem to suggest. By "lose contact" I mean, there will be no way we can communicate that is in any way interesting to them. For example, imagine a machine with ten times our intelligence. It would view us the way we view animals with intelligence 1/10th of our own - which is the lower end of mammals, I think. How much do you care about the opinions of cows? The machines will view us in exactly the same way.

This doesn't invalidate what MIRI and CHAI are doing: they're trying to work on how we get through the transition phase between now and that super-intelligent future. But it does place quite severe limits on it. For example, that repetitive trial-and-error strategy for corrigible AI that CHAI is suggesting might not work if the AI becomes too smart too quickly. It will just lose interest in what we want, because our desires will seem... bizarrely limited.

Expand full comment

How would you view a mentally handicapped human with an intelligence 1/10th that of yours? Would you want them to be happy and not suffer, even if the only thing they can communicate to you is whether they're happy or unhappy? I would, because my utility function includes something like 'avoid causing obvious suffering for no good reason' - that is, if I understand utility functions right. It has nothing to do with intelligence. The intention is to add something similar to the AI's utility function so that it is driven to act empathically towards humans.

Expand full comment

I would take life-and-death decisions on behalf of that disabled human without batting an eyelid. If their quality of life was very poor, I might withhold medical care without consulting them. I might have them sterilised without obtaining their consent.

It would be nice if the superintelligences could be empathetic towards us. But they might find it hard to distinguish between us and chimps, dolphins, and any other large, mobile animal. Think about ants: we generally vaguely don't want to make them extinct. But we can't usually tell the difference between different kinds. And if an ant colony made an anthill right where we were planning to build something a little more important, we don't waste any ethical time just ploughing that anthill under.

Expand full comment

Why is this discussion assuming that U(v_i) >= U(u_i)? This seems obviously false; we're presupposing a powerful AI. And in this case, the only way for option pi_5 to look good is for the various functions U_i to be, not just orthogonal, but significantly anticorrelated. If they're perfectly parallel, the AI refuses because it's more capable. If they're perfectly orthogonal, then at worst the AI has to pick one of the two and optimize it against the humans optimizing the same, and again refuses because it's more capable; more likely, it can optimize U_2 where the opportunity cost in U_3 optimization is n smaller, and so do better than the humans even if it was only equally capable. Any mixed state of these two is equally doomed, unless I'm missing some property of high-dimensional ?configuration? spaces.

It's only if U_2 and U_3 are pointing directly against each other, in a way accounting for much of their value, that pi_5 starts to look good. Even without the MIRI paper this looks doomed.

Expand full comment

Isn't a lot of these debates premature?

I know, I know. And I'm grateful so many are thinking deeply about X-risks like AI misalignment. But, right now, we don't even know what kind of programming or modelling might lead to a genuine AGI.

Is it so surprising that the arguments are then esoteric/talking past each others/having to use little stories that should be the basis for a Love Death + Robot episode on Netflix?

Expand full comment

When do you think would be the appropriate time to have these discussions? What concrete signals in the outside world would you look for?

I think this depends a lot on how hard resolving these alignment problems will be (based on the complexity of these discussions, my bet is "hard"), and how quickly AI development will progress from "here is a surprisingly capable though still limited AI" to "full-blown AGI" (this is sometimes called "takeoff speed"). Another factor is whether these alignment solutions need to be in place *before* AGI shows up, or whether we can figure them out and impose them after. Personally, I'm pretty confident that we need to figure out these alignment solutions ahead of time, and that they will be hard to figure out. I'm less sure about how fast I expect us to progress to full AGI, but altogether this makes me happy these debates are happening now. What do you think?

Expand full comment

> But, right now, we don't even know what kind of programming or modelling might lead to a genuine AGI.

This is false (according to the beliefs of most alignment researchers that I know). We have *some* idea about different designs that probably have AGI level capabilities with enough compute and data, and we also have a bunch of ideas about how to increase the data and compute efficiency.

What's your reasoning for believing that we have no idea what kinds of programming might lead to an AGI?

Expand full comment

What about layering a high-level cognitive function that has control over multiple models? The high-level function could turn off the lower models, choose between their outputs, modify them, etc. The lower models could be optimized for small, specific goals while the higher function manages which is most appropriate.
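Something like this, as a toy sketch (all names are made up; the hard part is everything this skeleton leaves out, e.g. what happens when the supervisor itself is a powerful optimizer):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Supervisor:
    """Toy high-level controller that routes between narrow sub-models
    and can disable any of them."""
    submodels: Dict[str, Callable[[str], str]]
    disabled: set = field(default_factory=set)

    def disable(self, name: str) -> None:
        self.disabled.add(name)

    def act(self, task: str, observation: str) -> str:
        if task in self.disabled or task not in self.submodels:
            return "no-op"                       # refuse rather than improvise
        return self.submodels[task](observation)

supervisor = Supervisor(submodels={
    "summarize": lambda text: text[:100],        # stand-ins for narrow models
    "translate": lambda text: f"[fr] {text}",
})
supervisor.disable("translate")
print(supervisor.act("translate", "hello"))      # -> "no-op"
print(supervisor.act("summarize", "hello"))      # -> "hello"
```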

Expand full comment

Instead of trying to prevent an AI from learning something, how about we erase its knowledge of things that will cause a particular problem?

For example, make sure that its neurons contain no knowledge of the Off Switch. I bet you could do this now by finding out which neurons respond to simulation of the thing that you want it to forget and zeroing their weights.

I'm pretty sure that this is a major plot point in Neuromancer.
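A minimal sketch of the kind of ablation I have in mind (the layer and unit indices are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for whatever network we're editing.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

# Suppose probing suggested hidden units 3, 7 and 19 respond to the concept
# we want erased. Zero their incoming and outgoing weights.
target_units = [3, 7, 19]
with torch.no_grad():
    model[0].weight[target_units, :] = 0.0   # incoming weights to those units
    model[0].bias[target_units] = 0.0
    model[2].weight[:, target_units] = 0.0   # outgoing weights from those units
```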

Expand full comment

Interpretability of large neural networks is hard. Knowledge seems to be distributed across the network (so there's not a single "off switch neuron"), and currently deep learning researchers don't know how to modify the network to excise that knowledge. There's lots of research on this kind of thing, both in the mainstream ML community and amongst AI safety researchers (especially at Anthropic), but it's turning out to be extremely hard.

I think this is part of the reason that Stuart Russell favors more interpretable, traditional AI approaches over modern deep learning -- it's easier to monitor and modify the AI's decision-making.

Expand full comment

There are techniques for investigating why AI systems make the decisions that they do. Here is a pretty good example of using saliency maps to examine failure cases in CNNs: https://www.bbc.co.uk/rd/blog/2021-05-a-machines-guide-to-birdwatching
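For a rough idea of what one flavour of this looks like in code, here's a bare-bones vanilla-gradient saliency map (a sketch only, not necessarily the exact method used in the linked post):

```python
import torch
import torchvision.models as models

# Vanilla gradient saliency: how strongly does each input pixel affect the top class score?
model = models.resnet18(weights=None).eval()     # untrained stand-in; substitute the CNN under study
image = torch.rand(1, 3, 224, 224, requires_grad=True)

score = model(image).max()                       # score of the highest-ranked class
score.backward()
saliency = image.grad.abs().max(dim=1).values    # (1, 224, 224) map of per-pixel influence
```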

Expand full comment

Humanity creates AGI all the time. We call them children.

How do we keep a 2-year-old from drawing blue paperclips on the wall, even after we have praised them for their beautiful blue crayon paperclip picture on a piece of paper? We limit the crayon function to circumstances where we are pretty sure wall drawing won't be happening, and we simply take the crayons away.

So how does AI research deal with the ergodicity problem directly?

This might be a starting point:

https://www.nature.com/articles/s41567-019-0732-0

Expand full comment

Let's say the child is Superman and, once you've praised them for their blue paperclip drawing, they start drawing blue paperclips on every available surface at a speed faster than the eye can see. How do you think you'll be able to take the crayons away?

Expand full comment

How is it possible that we could create a superman before we can even create an artificial child?

Jonathan Kent and Martha Kent (Ma and Pa Kent) seemed to do just fine.

It's sort of a version of the Fermi paradox. If AGI is possible, then it should have already been built somewhere in the universe, and if AGI can really turn out to be a destroyer of sentient life, that would have already happened. Yet, it hasn't.

Expand full comment

The universe is very large and the speed of light is comparatively very slow. This may have happened outside our current light cone.

Expand full comment

Yeah, that is a possibility.

Expand full comment

I know I'm jumping into a topic that a lot of very smart people have poured a lot of time into, but I'm also a ML engineer working on RL in a related space and I figured I'd take a different stab at a sovereign AI.

My thought here is that it makes a lot of sense to break the network up into two sections, a planner and an actor. These would be different networks trained separately with no weight sharing (similar to actor/critic, or generator/discriminator in a GAN).

The planning network would get a prompt from the user and then try to predict all of the future sensor states that it would observe while carrying out said prompt. These predicted sensor states would be played back to the user like a video, and the user would be asked whether it approves the plan or not.

(I don't want to get too into the weeds with this, but this probably ought to be a tree of future sensor states rather than a list, with branches at places where the planner predicts uncertainty. Each branch could be approved individually.)

The actor network would then generate actions that result in sensor states that match the predicted sensor states as closely as possible. Additionally, one of the actions would be to kick back into planning mode if the network thinks it has gone too far off track and would like to re-plan from its current position.

The actor and planner would both be trained to minimize divergence between expected and actual sensor states, and the planner would have an extra loss term based on whether it is generating plans that satisfy the user prompts.

I think this setup gives us a lot of things we want

1) The planning network must get consent from the user while it has no capability to act, making user coercion difficult

2) The acting network is rewarded only for sticking to a plan, making radical departures unlikely

3) If there is some catastrophic collapse case where the acting and planning networks got together to maximize their rewards, the best way to do that would be to give itself easy prompts with predictable sensor states - like telling itself to go sit in a basement, or rip off its own sensors.

Apologies if this kind of a setup has already been considered and deemed terrible for some reason or another, I am still getting up to date on AI alignment.
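To make the shapes concrete, here's a rough skeleton of the setup described above (every module choice, dimension, and the env_step hook is a placeholder, not a working training loop):

```python
import torch
import torch.nn as nn

PROMPT_DIM, STATE_DIM, ACTION_DIM, HORIZON = 32, 16, 4, 8   # made-up sizes

# Planner: prompt -> sequence of predicted sensor states (the "video" shown to the user).
planner = nn.Sequential(nn.Linear(PROMPT_DIM, 128), nn.ReLU(),
                        nn.Linear(128, HORIZON * STATE_DIM))

# Actor: (current sensor state, next state from the approved plan) -> action.
actor = nn.Sequential(nn.Linear(2 * STATE_DIM, 128), nn.ReLU(),
                      nn.Linear(128, ACTION_DIM))

def plan(prompt):
    # Branching / tree-structured plans are omitted; this is a single linear plan.
    return planner(prompt).view(-1, HORIZON, STATE_DIM)

def rollout(env_step, state, approved_plan):
    """The actor tries to track the plan the user approved; returns what actually happened.
    env_step is a placeholder for whatever environment the agent acts in."""
    actual = []
    for target in approved_plan.unbind(dim=1):
        action = actor(torch.cat([state, target], dim=-1))
        state = env_step(state, action)
        actual.append(state)
    return torch.stack(actual, dim=1)

def losses(predicted, actual, prompt_satisfaction):
    # Both nets are penalised for divergence between predicted and actual sensor states;
    # the planner gets an extra (placeholder) term for how well the plan satisfies the prompt.
    tracking = ((predicted - actual) ** 2).mean()
    actor_loss = tracking
    planner_loss = tracking + prompt_satisfaction
    return actor_loss, planner_loss
```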

Expand full comment

I think something like this could work. One problem here is that the planner might output a plan that looks good to humans but isn't actually. This problem is discussed here https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing

Expand full comment

I'm confused why an AI would always attempt to gather more information rather than turning off. In the example with paperclip/staple utility functions it makes sense, but what about if unrecoverable negative actions are possible?

If a human tells an AI to stop, and the AI believes:

1. it doesn't have the full complete utility function

2. it could develop the full complete utility function given enough time

3. mistakes now could lead to unrecoverable value loss

Wouldn't it then act to prevent that unrecoverable value loss by listening to the human and shutting down?

Expand full comment
Comment deleted
Expand full comment

Not necessarily. It just needs to think that unrecoverable value loss is possible. If it's partially aligned, unrecoverable value loss from its own perspective can still be mitigated by listening to the person.

Expand full comment

> AI: What’s your favorite color?

> Humans: Red.

> AI: Great! (*kills all humans, then goes on to tile the universe with red paperclips*)

We have training objectives which discourage model outputs from ever passing a certain level of confidence! Label smoothing, which is standard for training machine translation systems, mixes X% of a uniform distribution over next-token choices into the training target, so the model is penalized whenever it becomes fully confident in a single token.
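For example, a minimal PyTorch illustration (not a claim about how any particular translation system is configured):

```python
import torch
import torch.nn as nn

# Label smoothing caps how confident the training target is: even a "perfectly sure"
# model is trained toward a target that reserves 10% of the mass for the other tokens.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.tensor([[10.0, -5.0, -5.0]])   # very confident in class 0
target = torch.tensor([0])
print(criterion(logits, target))               # nonzero loss even for a "correct", confident answer
```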

> two people who both have so much of an expertise advantage over me

What precisely is it that Yud has expertise in? The answer isn't modern AI.

Expand full comment

I call you Yud, thus I refute you! Lame.

Expand full comment

Don't some of his fans call him that too? I swear I've seen it on Twitter.

Expand full comment
Oct 4, 2022·edited Oct 5, 2022

You shouldn't trust any AI that will do what it determines that humans want. It's likely to do this:

AI: "Hmm, most humans say that they get their values from religion. The most popular religion is Christianity which says that evil people who don't accept Christ get tortured in Hell for all eternity. The second most popular religion is Islam which says that evil people who don't submit to Allah get tortured in Hell for all eternity. The third most popular religion is Hinduism which says that evil people get tortured in Naraka (basically eastern cultures' version of hell) for a very long time, then being reincarnated and repeating. The forth most popular religion is Buddhism, which says that evil people get sent to Naraka for up to sextillions of years before reincarnating (yes, it has a hell. https://en.wikipedia.org/wiki/Naraka_(Buddhism)). I guess I should do the common denominator and torture evil people in hell."

"Hey, humans, do you want justice accomplished as soon as I take power?" "Yes" "Okay, I'll do that". Then, since Justice is asymmetric (https://thezvi.wordpress.com/2019/04/25/asymmetric-justice/) most if not all humans are soon tortured for all eternity. The humans would resist, but since the AI had been told over and over that criminals shouldn't be allowed to veto judgements against them it continues.

If Eliezer Y. and the other pessimistic alignment researchers said that this method successfully aligns AI, and I found out somehow that CHAI successfully developed a superintelligence this way and was about to release it, I'd burn the hardware and murder the workers so that a superintelligence that saw us as raw materials could take over instead. Seriously. (Unless I found out that the researchers were dictatorially telling the AI to listen to them and not the majority, in which case there would be a chance of good results and I would check their general sanity first)

Expand full comment

Humans: "It's up to you to figure out what we want and do that."

AI: "Okay, I've crawled a big sample of literature and social media. I found that the most common thing people write about AI is that it tiles the universe with paperclips. They must really want that!"

Expand full comment

Prior: AI alignment gets excess attention because it provides a compelling combination of social benefits for proponents: virtue signaling and intelligence signaling but without near-term costs. In contrast, for example, if one is really into saving the environment you can only signal so much virtue without imposing real costs (no more hamburgers or transatlantic flights) and you don’t get to look intelligent unless you are way into fusion power (which seems to have a higher bar for entry than AI alignment). Can anyone shake that prior?

Expand full comment

Best way to shake that prior, I think, would be to examine the object level arguments for AI alignment and evaluate whether the attention the problem gets is too high given the arguments. If you are interested I can point out intro resources.

Expand full comment

I am Interested. Thank you.

Expand full comment

Week 1 and 2 of the AGI safety fundamentals curriculum are a pretty good intro to why we might be worried: https://www.agisafetyfundamentals.com/ai-alignment-curriculum

I can also recommend AGI safety from first principles https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ

And this podcast episode with AI safety researcher Paul Christiano. https://axrp.net/episode/2021/12/02/episode-12-ai-xrisk-paul-christiano.html

Expand full comment

These were excellent, Lukas. Thank you.

Expand full comment

From the outside, any system that succeeds in doing anything specialised can be thought of, or described as, a relatively general-purpose system that has been constrained down to a narrower goal by some other system. For instance, a chess-playing system may be described as a general-purpose problem-solver that has been trained on chess. To say its UF defines a goal of winning at chess is the "map" view.

However, it might well *be*... in terms of the territory, in terms of what is going on inside the black box... a special-purpose system that has been specifically coded for chess, has no ability to do anything else, and therefore does not need any kind of reward channel or training system to keep it focused on chess. So the mere fact that a system, considered from the outside as a black box, does some specific thing is not proof that it has a UF, and therefore not proof that anyone has succeeded in loading values or goals into its UF. Humans can be "considered as" having souls, but that doesn't mean humans have souls.

Expand full comment

I’m a PhD student at CHAI, where Stuart is one of my advisors. Here’s some of my perspectives on the corrigibility debate and CHAI in general. That being said, I’m posting this as myself and not on behalf of CHAI: all opinions are my own and not CHAI’s or UC Berkeley’s.

I think it’s a mistake to model CHAI as a unitary agent pursuing the Assistance Games agenda (hereafter, CIRL, for its original name of Cooperative Inverse Reinforcement Learning). It’s better to think of it as a collection of researchers pursuing various research agendas related to “make AI go well”. I’d say less than half of the people at CHAI focus specifically on AGI safety/AI x-risk. Of this group, maybe 75% of them are doing something related to value learning, and maybe 25% are working on CIRL in particular. For example, I’m not currently working on CIRL, in part due to the issues that Eliezer mentions in this post.

I do think Stuart is correct that the meta reason we want corrigibility is that we want to make the human + AI system maximize CEV. That is, the fundamental reason we want the AI to shut down when we ask it is precisely because we (probably) don’t think it’s going to do what we want! I tend to think of the value uncertainty approach to the off-switch game not as an algorithm that we should implement, but as a description of why we want to do corrigibility at all. Similarly, to use Rohin Shah’s metaphor, I think of CIRL as math poetry about what we want the AI to do, but not as an algorithm that we should actually try to implement explicitly.

I also agree with both Eliezer and Stuart that meta-learning human values is easier than hardcoding them.

That being said, I’m not currently working on CIRL/assistance games and think my current position is a lot closer to Eliezer’s than Stuart’s on this topic. I broadly agree with MIRI’s two objections:

Firstly, I’m not super bullish about algorithms that rely on having good explicit probability distributions over high-dimensional reward spaces. We currently don’t have any good techniques for getting explicit probability estimates on the sort of tasks we use current state-of-the-art AIs for, besides the trollish “ask a large language model for its uncertainty estimates” solution. I think it’s very likely that we won’t have a good technique for doing explicit probability updates before we get AGI.

Secondly, even though it’s more robust than specifying human values directly, I think that CIRL still has a very thin “margin of error”, in that if you misspecify either the prior or the human model you can get very bad behavior. (As illustrated in the humorous red paperclips example.) I agree that a CIRL agent with almost any prior + update rule we know how to write down will quickly stop being corrigible, in the sense that the information-maximizing action is not to listen to humans and shut down when asked. To put it another way, CIRL fails when you mess up the prior or update rule of your AI. Unfortunately, not only is this very likely, it is precisely when you want your AI to listen to you and shut down!

I do think corrigibility is important, and I think more people should work on it. But I’d prefer corrigibility solutions that work when you get the AI slightly wrong.

I don’t think my position is that unusual at CHAI. In fact, another CHAI grad student did an informal survey and found two-thirds agree more with Eliezer’s position than Stuart’s in the debate.

I will add that I think there’s a lot of talking past each other in this debate, and in debates in AI safety in general (which is probably unavoidable given the short format). In my experience, Stuart’s actual beliefs are quite a bit more nuanced than what’s presented here. For example, he has given a lot of thought on how to keep the AI constantly corrigible and prevent it from “exploiting” its current reward function. And I do think he has updated on the capabilities of say, GPT-3. But there’s research taste differences, some ontology differences, and different beliefs about future deep learning AI progress, all of which contribute to talking past each other.

On a completely unrelated aside, many CHAI grad students really liked the kabbalistic interpretation of CHAI’s name :)

Expand full comment

Thanks for the additional perspective, really interesting.

Expand full comment

Thank you!

Expand full comment