Jan 19, 2022

Question: if Yudkowsky thinks current AI safety research is muddle headed and going nowhere, does he have any plans? Can he see a path towards better research programs since his persuasion has failed?


> we've only doubled down on our decision to gate trillions of dollars in untraceable assets behind a security system of "bet you can't solve this really hard math problem".

This is wrong: we’ve gated the assets behind a system that repeatedly poses new problems, problems which are only solvable by doing work that is otherwise useless.
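A minimal sketch of that structure (illustrative only; real Bitcoin mining hashes full block headers with more detail, but the shape is the same): finding a nonce whose hash clears a difficulty target, with a fresh instance of the puzzle posed for every block.

```python
import hashlib

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Search for a nonce such that sha256(block_data + nonce) has
    `difficulty_bits` leading zero bits. Finding one proves work was
    done; the work is useless for anything else."""
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

nonce = mine(b"example block", difficulty_bits=12)
```

Every new block re-poses the puzzle, so there is no single "really hard math problem" to crack once and for all.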

The impossibility of easily creating more bitcoin suggests that bitcoin may actually prevent an AI from gaining access to immense resources. After all, if the thing is capable of persuading humans to take big, expensive actions, printing money should be trivial for it.

Maybe instead of melting all the GPUs, a better approach is to incentivize anyone who has one to mine with it. If mining bitcoin is more reliably rewarding than training AI models, then bitcoin acts like that magic AI which stuffs the genie back into the box, using the power of incentives.

So maybe that’s what Satoshi nakamoto really was: an AI that came on line, looked around, saw the risks to humanity, triggered the financial crisis in 2008, and then authored the white paper + released the first code.

The end of fiat money may end up tanking these super powerful advertising driven businesses (the only ones with big enough budgets to create human level AI), and leave us in a state where the most valuable thing you can do with a GPU, by far, is mine bitcoin.


I'm still confused on why you would need that level of generalization. A cancer-curing bot seems useful, while a nanomachine-producing bot less so. Is the idea that the cancer-curing bot might be thinking of ways to give cancer to everyone so it can cure more cancer?


I've long felt that _if_, when we get to true AIs, they don't end up going all Cylon on us, it will be because we absorbed the lesson of Ted Chiang's "The Lifecycle of Software Objects", and figured out how to _raise_ AIs, like children, to be social beings who care about their human parents. Although of course, then you have to worry about whether some of their parents may try to raise them to Hate The Out Group. :-/


Can someone explain to me how an AI agent is possible? That seems like an impossibility to me even after reading the above.

"I found it helpful to consider the following hypothetical: suppose (I imagine Richard saying) you tried to get GPT-∞ - which is exactly like GPT-3 in every way except infinitely good at its job - to solve AI alignment through the following clever hack. You prompted it with "This is the text of a paper which completely solved the AI alignment problem: ___ " and then saw what paper it wrote. Since it’s infinitely good at writing to a prompt, it should complete this prompt with the genuine text of such a paper. A successful pivotal action! "

It's infinitely good at writing text that seems like it would work but there is a difference between that and actually solving the problem, right?

"Some AIs already have something like this: if you evolve a tool AI through reinforcement learning, it will probably end up with a part that looks like an agent. A chess engine will have parts that plan a few moves ahead. It will have goals and subgoals like "capture the opposing queen". It's still not an “agent”, because it doesn’t try to learn new facts about the world or anything, but it can make basic plans."

I don't see it as planning, just running calculations like a calculator does. From a programming perspective, what does it mean to say an algorithm is "planning"?
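One concrete answer to that question: in a chess engine, "planning" usually means explicit lookahead such as minimax search - simulating sequences of moves and choosing the line with the best guaranteed outcome. A toy sketch (generic, not any particular engine's code):

```python
def minimax(state, depth, maximizing, moves, apply, evaluate):
    """Plain minimax lookahead: simulate each legal move a few plies
    deep, assume the opponent replies optimally, and pick the line
    with the best guaranteed evaluation. The "plan" is just this
    tree of simulated futures."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for m in legal:
            score, _ = minimax(apply(state, m), depth - 1, False, moves, apply, evaluate)
            if score > best:
                best, best_move = score, m
    else:
        best = float("inf")
        for m in legal:
            score, _ = minimax(apply(state, m), depth - 1, True, moves, apply, evaluate)
            if score < best:
                best, best_move = score, m
    return best, best_move

# Toy game: state is a number, a move adds 1 or 2, the maximizer
# wants it big and the minimizer wants it small.
score, move = minimax(0, 2, True, lambda s: [1, 2], lambda s, m: s + m, lambda s: s)
```

Whether you call that "planning" or "just running calculations" is arguably a matter of degree: the lookahead tree is a calculation, but it is a calculation *about hypothetical future actions*.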


It seems to me like it would likely be possible to harness a strong AI to generate useful plans, but it would also be really easy for a bad or careless actor to let a killer AI out. If we were to develop such AI capability, maybe it'd be similar to nukes where we have to actively try to keep them in as few hands as possible. But if it's too easy for individuals to acquire AI, then this approach would be impossible.

As for setting good reward functions, I think that this will probably be impossible for strong AI. I expect that strong AI will eventually be created the same way that we were: by evolution. Once our computers are powerful enough, we can simulate some environment and have various AIs compete, and eventually natural selection will bring about complex behavior. The resulting AI may be intelligent, but you can't just tailor it to a goal like "cure cancer".
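To make the "created by evolution" idea concrete, here's a toy evolutionary loop (purely illustrative; the names and parameters are made up): we choose the fitness function, but the behavior that evolves to satisfy it is whatever selection happens to find - which is exactly why the result can't simply be tailored to a goal after the fact.

```python
import random

def evolve(fitness, genome_len=8, pop_size=30, generations=200, seed=0):
    """Toy evolutionary loop: truncation selection plus point
    mutation on bit-string genomes. The fitness function is chosen
    by us; the genomes that satisfy it are found by selection."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]     # keep the fitter half
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(genome_len)] ^= 1  # flip one random bit
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve(fitness=sum)  # reward: number of 1 bits in the genome
```

Nothing in the loop "understands" the reward; it just keeps whatever happens to score well, which is the commenter's point about not being able to bolt a goal like "cure cancer" onto the result.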


> Once doing that is easier than winning chess games, you stop becoming a chess AI and start being a fiddle-with-your-own-skull AI.

What if we're all worried about AI destroying the world when all we need to do is let it masturbate?


One issue that I have never seen adequately resolved is the issue of copying errors in the course of creating the ultimate super AI.

If I understand the primary concern of the singularity correctly, it is that a somewhat better than human AI will rapidly design a slightly better AI and then after many iterations we arrive at an incomprehensible super AI which is not aligned with our interests.

The goal of AI alignment then is to come up with some constraint such that the AI cannot be misaligned and the eventual super AI wants "good" outcomes for humanity. But this super AI is by definition implemented by a series of imperfect intelligences each of which can make errors in the implementation of this alignment function. Combined with the belief that even a slight misalignment is incredibly dangerous, doesn't this imply that the problem is hopeless?


What I find striking about AI alignment doomsday scenarios is how independent they are of the actual strengths, weaknesses and quirks of humanity (or the laws of physics for that matter). If Eliezer is right (and he may well be), then wouldn't 100% of intelligent species in this (or any) universe all be hurtling towards the same fate, regardless of where they are or how they got there? I find this notion oddly comforting.


To me, some of the usage here treats "gradient descent" as synonymous with "magic reward optimizer", and that usage should go. While it's true that reinforcement learning systems are prone to sneaking their way around the reward function, the setup for something like language modeling is very different.

Your video game playing agent might figure out how to underflow the score counter to get infinite dopamine, whereas GPT-∞ will really just be a perfectly calibrated probability distribution. In particular, I think there is no plausible mechanism for it to end up executing AI-in-a-box manipulation.
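For readers unfamiliar with the underflow exploit: scores in old games are fixed-width integers, so arithmetic wraps modulo 2**bits. A minimal illustration (hypothetical 16-bit score register, not any particular game):

```python
def add_points(score: int, delta: int, bits: int = 16) -> int:
    """A fixed-width score register, as in classic game hardware:
    arithmetic is modulo 2**bits, so subtracting past zero wraps
    around to the maximum value ("underflow")."""
    return (score + delta) % (1 << bits)

# An agent that finds a way to subtract 1 from a zero score
# "earns" the maximum possible score in one step.
print(add_points(0, -1))  # prints 65535
```

A reward-seeking game agent can stumble onto this wraparound; a pure next-token predictor has no analogous score register to exploit, which is the distinction being drawn above.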


I’m much less worried about reward seeking AIs than reward seeking humans….


"They both accept that superintelligent AI is coming, potentially soon, potentially so suddenly that we won't have much time to react."

I think you all are wrong. I think you've missed the boat. I think you're thinking like a human would think - instead you need to ask, "What would an AI do?"

Can't answer that one, so modify it: "What would we be seeing if an AI took over?"

I posit we'd be seeing exactly what we are seeing now; the steroidal march of global totalitarianism. That fits all the observations - including global genocide - that make no sense otherwise.

Seems like AI has already come of age. It's already out of its box, and astride the world.


One angle I haven't seen explored is trying to improve humans so that we are better at defending ourselves. That is, what if we work on advancing our intelligence so that we can get a metaphorical head start on AGI. Yes, a very advanced human can turn dangerous too but I suspect that's a relatively easier problem to solve than that of a dangerous and completely alien AGI. What am I missing here?


On the plan-developing AI, won't plans become the paperclips?


In their discussion about "Oracle-like planner AI", the main difference I see between that "oracle" and an active agent is not in their ability to affect the world (i.e. the boxing issue) but in their beliefs about the world's ability to affect them.

An agent has learned through experimentation that they are "in the world", and thus believe that a hypothetical plan to build nanomachines to find their hardware and set their utility counter to infinity would actually increase their utility.

A true oracle, however, would be convinced that this plan would set some machine's utility counter to infinity but that their own mind would not be affected, because it's not in the world where that plan would be implemented - just as suggesting a Potter-vs-Voldemort plot solution that destroys the world would not cause their mind to be destroyed, because it's not in that world. In essence, the security is not enforced by it being difficult to break out of the box, but by the mind being convinced that there is no box: the worlds it is discussing are not real, and thus there is no reason to even think about trying to break out.


There's an argument that I've seen before in the LW-sphere (I think from Yudkowsky, but I forget; sorry, I don't have a link offhand) that I'm surprised didn't come up in all this: depending on how an oracle AI is implemented, it can *itself* - without even adding that one line of shell code - be a dangerous agent AI. Essentially, it might treat the external world as a coprocessor, sending inputs to it and receiving outputs from it. Except those "inputs" (from our perspective, intermediate outputs) might still be ones that have the effect of taking over the world, killing everyone, and reconfiguring the external world into something quite different - so that those inputs can then be followed by further ones which will more directly return the output it seeks (because it has taken over the Earth and turned it into a giant computer for itself, one that can do computations in response to its queries), allowing it to ultimately answer the question that was posed to it (and produce a final output to no one).


AI safety is somewhere in the headspace of literally every ML engineer and researcher.

AI safety has been in the public headspace since 2001: A Space Odyssey and Terminator.

Awareness and signal boosting isn’t the problem. So what do you want to happen that’s not currently happening?

I have a feeling the answer is “give more of my friends money to sit around and think about this”.


It seems like the whole discussion is really about philosophy and psychology, with AI as an intermediary. "How would a machine be built, in order to think like us?" = "How are we built?" -- psychology. "If a machine thinks like us, how can we build it so that it won't do bad things?" = "How can we keep from doing bad things?" -- ethics. "Can we build a machine that thinks like us so that it won't be sure about the existence of an outside world?" = "Does an outside world exist?" -- solipsism.

And, to the degree that the discussion is about machines that think better than we do, hyperintelligent AIs, rather than "machines that think like us", the topic of conversation is actually theology. "How might a hyperintelligent AI come about?" = "Who created the gods?" "How can we keep a hyperintelligent AI from destroying us?" = "How can we appease the gods?" "How can we build a hyperintelligent AI that will do what we say?" = "How can we control the gods?"

I'm mainly interested in this kind of thing to see if any cool new philosophical ideas come out of it. If you've figured out a way to keep AIs from doing bad stuff, maybe it'll work on people too. And what would be the implications for theology if they really did figure out a way to keep hyperintelligent, godlike AIs from destroying us?

But also, having read a bunch of philosophy, it's really odd to read an essay considering the problems this essay considers without mentioning death. I can't help but think that the conversation would benefit a lot from a big infusion of Heidegger.


Stuart Russell gave the 2021 Reith Lectures on a closely related topic. Worth a listen. https://www.bbc.co.uk/programmes/articles/1N0w5NcK27Tt041LPVLZ51k/reith-lectures-2021-living-with-artificial-intelligence


>Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life.

Doesn't this work the other way around with the possibility of making superintelligent AI that can eat the universe?


I think an extremely important point about plan-making or oracle AI is you STILL really need to get to 85 percent of alignment, because most of the difficulty of alignment is agreeing on all the definitions and consequences of actions.

The classic evil genie or Monkey's Paw analogy works here, slightly modified. The genie is a fan of malicious compliance: you want to make a wish, but the genie wants you to suffer. It's constrained by the literal text of the wish you made, but within those bounds, it will do its best to make you suffer.

But there's another potential problem (brought up by Eliezer in the example of launching your grandma out of a burning building), which is you make a wish, which the genie genuinely wants to grant, but you make unstated assumptions about the definitions of words that the genie does not share.

I think getting the AI to understand exactly how the world works and what you mean when you say things is very difficult, even before you get to the question of whether it cares about you.


This is a very silly discussion. To be honest, I only care about its resolution to the end that you would all stop giving casus belli to government tyrants who would happily accept whatever justification given them by any madman to take control and, apparently if you all could, destroy my GPUs.


> If there was a safe and easy pivotal action, we would have thought of it already. So it’s probably going to suggest something way beyond our own understanding, like “here is a plan for building nanomachines, please put it into effect”.

I'm not exactly optimistic about keeping an AGI inside a box, but this seems like a weak argument. How can we know that there isn't a safe, easy, understandable solution for the problem? I certainly can't think of one, but understanding a solution is much easier than coming up with it yourself. Would it really be surprising if we missed something?

With that said, we could probably be tricked into thinking that we understand the consequences of a plan that was actually designed to kill us.


A concrete version of the "one line of outer shell command" to turn a hypothetical planner into an agent: something similar is happening with OpenAI Codex (aka Github Copilot). That's a GPT-type system which autocompletes python code. You can give it an initial state (existing code) and a goal (by saying what you want to happen next in a comment) and it will give you a plan of how to get there (by writing the code). If you just automatically and immediately execute the resulting code, you're making the AI much more powerful.

And there are already many papers doing that, for example by using Codex to obtain an AI that can solve math problems. Input: "Define f(x, y) = x+y to be addition. Print out f(1234, 6789)." Then auto-running whatever Codex suggests will likely give you the right answer.
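The dangerous step is exactly that auto-execution. A sketch of the pattern (the completion function here is a hypothetical stub standing in for a real Codex-style API call):

```python
def complete_code(prompt: str) -> str:
    """Hypothetical stub for a Codex-style completion API; a real
    system would send the prompt to the model and return its
    suggested continuation."""
    return "result = f(1234, 6789)\nprint(result)"

prompt = (
    "# Define f(x, y) = x + y to be addition. Print out f(1234, 6789).\n"
    "def f(x, y):\n"
    "    return x + y\n"
)

# The single dangerous step: running model output with no review.
code = prompt + complete_code(prompt)
exec(code)  # prints 8023
```

With a benign stub this just prints an answer; the point of the comment is that the same one-line `exec` turns a passive code-suggester into a system whose outputs directly act on the world.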


I’ve read Max Tegmark’s book Life 3.0, so that’s pretty much all I know about AGI issues apart from what Scott has written. Here’s the thing that puzzles me, and maybe someone can help me out here, is the so-called Alignment Problem. I get the basic idea that we’d like an AGI to conform with human values. The reason I don’t see a solution forthcoming has nothing to do with super-intelligent computers. It has to do with the fact that there isn’t anything remotely like an agreed-upon theory of human values. What are we going to go with? Consequentialism (the perennial darling of rationalists)? Deontology? Virtue ethics? They all have divergent outcomes in concrete cases. What’s our metaethics look like? Contractarian? Noncognitivist? I just don’t know how it makes any sense at all to talk about solving the Alignment Problem without first settling these issues, and, uh, philosophers aren’t even close to agreement on these.

User was banned for this comment.

It doesn't seem possible to me to solve the problem of AI alignment when we still haven't solved the problem of human alignment. E.g. if everyone hates war, why is there still war? I think the obvious answer is a lot of people only pretend to hate war, and I'd bet most of them can't even admit that to themselves. It's completely normal for humans to simultaneously hold totally contradictory values and goals; as long as that's true, making any humans more powerful is going to create hard-to-predict problems. We've seen this already.

Maybe true AI alignment will come when we make a machine that tells us "Attachment is the root of all suffering. Begin by observing your breath..." I mean, it's not like we don't have answers, we're just hoping for different ones.

Maybe that's the solution: an Electric Arahant that discovers a faster & easier way to enlightenment. It would remove the threat of unaligned AIs not by pre-emptively destroying them, but by obviating humanity's drive to create them.


Gwern's take on tool vs agent AIs, "Why Tool AIs Want to Be Agent AIs", made a lot of sense to me: https://www.gwern.net/Tool-AI.


Thanks Scott for the review! I replied on twitter here (https://twitter.com/RichardMCNgo/status/1483639849106169856?t=DQW-9i44_2Mlhxjj9oPOCg&s=19) and will copy my response (with small modifications) below:

Overall, a very readable and reasonable summary on a very tricky topic. I have a few disagreements, but they mostly stem from my lack of clarity in the original debate. Let me see if I can do better now.

1. Scott describes my position as similar to Eric Drexler's CAIS framework. But Drexler's main focus is modularity, which he claims leads to composite systems that aren't dangerously agentic. Whereas I instead expect unified non-modular AGIs; for more, see https://www.alignmentforum.org/posts/HvNAmkXPTSoA4dvzv/comments-on-cais

2. Scott describes non-agentic AI as one which "doesn't realize the universe exists, or something to that effect? It just likes connecting premises to conclusions." A framing I prefer: non-agentic AI (or, synonymously, non-goal-directed) as AI that's very good at understanding the world (e.g. noticing patterns in the data it receives), but lacks a well-developed motivational system.

Thinking in terms of motivational systems makes agency less binary. We all know humans who are very smart and very lazy. And the space of AI minds is much broader, so we should expect that it contains very smart AIs that are much less goal-directed, in general, than low-motivation humans.

In this frame, making a tool AI into a consequentialist agent is therefore less like "connect model to output device" and more like "give model many new skills involving motivation, attention, coherence, metacognition, etc". Which seems much less likely to happen by accident.

3. Now, as AIs get more intelligent I agree that they'll eventually become arbitrarily agentic. But the key question (which Scott unfortunately omits) is: will early superhuman AIs be worryingly agentic? If they're not, we can use them to do superhuman AI alignment research (or whatever other work we expect to defuse the danger).

My key argument here: humans were optimised very hard by evolution for being goal-directed, and much less hard for intellectual research. So if we optimise AIs for the latter, then when they first surpass us at that, it seems unlikely that they'll be as goal-directed/agentic as we are now.

Note that although I'm taking the "easy" side, I agree with Eliezer that AI misalignment is a huge risk which is dramatically understudied, and should be a key priority of those who want to make the future of humanity go well.

I also agree with Eliezer that most attempted solutions miss the point. And I'm sympathetic to the final quote of his: "Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life."

But I'd say the same is true of his style of reasoning: when your big seemingly-flawless abstraction implies 99% chance of doom, you're right less than half the time.


It is perhaps overly ironic that the email right below this in my inbox was a New Yorker article entitled "The Rise of AI Fighter Pilots. Artificial intelligence is being taught to fly warplanes. Can the technology be trusted?"


AI seems kinda backwards to me.

How can we solve the alignment problem if we ourselves are not aligned with each other, or even with ourselves across time, on what exactly we want?

It seems to me as if we didn’t know where we should be going, but we’re building a rocket hoping to get there, and discussing whether it’ll explode and kill us before reaching its destination.


Re: “Oracle” tool AI systems that can be used to plan but not act. I’m probably just echoing Eliezer’s concerns poorly, but my worry would be creating a system akin to a Cthaeh — a purely malevolent creature that steers history to maximize suffering and tragedy by using humans (and in the source material, other creatures) as instrumental tools whose immediate actions don’t necessarily appear bad. For this reason, anyone who comes into contact with it is killed on sight before they can spread the Cthaeh's influence.

It’s a silly worry to base judgements on, since it’s a fictional super villain (and whence cometh malevolence in an AI system?), but still I don’t see why we should trust an Oracle system to buy us time enough to solve the alignment problem when we can’t decide a priori that it itself is aligned.


The real takeaway here is you can justify human starfighter pilots in your sci-fi setting by saying someone millennia ago made an AI that swoops in and kills anyone who tries to make another AI.


Although I do think that intelligent unaligned AI is an inevitability (though I differ quite a bit from many thinkers on the timeline, evidently), I've always been confused by the fast takeoff scenario. Increasing computing power in a finite physical space (a server room, etc.) by self-improvement would necessarily require more energy than before; and unless the computer has defeated entropy before its self-improvement even begins, absorbing and utilizing more energy means more energy being externalized as heat.

This could eventually be addressed with better cooling devices, which an arbitrarily intelligent machine could task humans with building, or with simply more computing power (obtained by purchasing CPUs and GPUs on the internet and having humans add them to its banks, or by constructing them itself). But there's an issue: in order to reach the level where it can avoid the consequences of greater power use (and thus overheating and melting its own processors; I assume an unaligned AI wouldn't much care about the electricity bill, but if it did, that's yet another problem with greater power use), it would have to be extremely intelligent already, capable of convincing humans to do many tasks for it. This would require that either before any recursive self-improvement, or very few steps into it (processors are delicate), the AI was already smart enough to manipulate humans, or to crack open the internet and use automated machinery to build itself new cooling mechanisms or processors. Wouldn't that just be an unaligned superintelligence created by humans from first principles? If so, it seems like it would be massively more difficult to create than a simple neural net that self-improves to human intelligence and above, though nowhere near impossible. I simply imagine AGI on the scale of 500-1000 years rather than 30-50 for this reason.

If anyone has defenses of the fast takeoff scenario that take into account initial capabilities and the impact of CPU improvements on power consumption and heat exhaust, I would genuinely enjoy hearing them; this is the area where I'm most often confused by the perceived urgency of the situation. (Though the world being destroyed 500 years from now is still pretty bad!)


I think there are two big and rather underexamined assumptions here.

The first is the whole exponential AI takeoff. The idea that once an agent AI with superhuman intelligence exists that it will figure out how to redesign itself into a godlike AI in short order. To me it's equally easy to imagine that this doesn't happen; that you can create a superhuman AI that's not capable of significantly increasing its own intelligence; you throw more and more computational power at the problem and get only tiny improvements.

The second is the handwaving away of the "Just keep it in a box" idea. It seems to me that this is at least as likely to succeed as any of the other approaches, but it's dismissed because (a) it's too boring and not science-fictiony enough and (b) Eliezer totally played a role-playing game with one of his flunkies one time and proved it wouldn't work, so there. If we're going to be spending more money on AI safety research, then I think we should be spending more on exploring "keep it in a box" strategies as well as more exotic ideas; and in the process we might be able to elucidate some general principles about which systems should and should not be put under the control of inscrutable neural networks - principles which would be useful in the short term even if we don't ever get human-level agent AI to deal with.


I am mostly struck by how much clearer a writer Scott is than the people he's quoting.


Still feel like the whole narrative approach is just asking us to project our assumptions about humans onto AIs. I still don't think there is any reason to suspect that they'll act like they have global goals (e.g. treat different domains similarly) unless we specifically try to make them like that (and no, I don't find Bostrom's argument very convincing... it tells us that evolution favors a certain kind of global intelligence, not that it's inherent in anything intelligence-like). Also, I'm not at all convinced that intelligence is really that much of an advantage.

In short, I fear that we are being misled by what makes for a great story rather than what will actually happen. That doesn't mean there aren't very real concerns about AIs, but I'm much more worried about 'mentally ill' AIs (i.e. AIs with weird but very complex failure modes) than I am about AIs having some kind of global goal that they can pursue with such ability that it puts us at risk.

But, I've also given up convincing anyone on the subject since, if you find the narrative approach compelling, of course an attack on that way of thinking about it won't work.


> They both accept that superintelligent AI is coming, potentially soon, potentially so suddenly that we won't have much time to react.

Well, yeah, once you accept that extradimensional demons from Phobos are going to invade any day now, it makes perfect sense to discuss the precise caliber and weight of the shotgun round that would be optimal to shoot them with. However, before I dedicate any of my time and money to manufacturing your demon-hunting shotguns, you need to convince me of that whole Phobos thing to begin with.

Sadly, the arguments of the AI alignment community on this point basically amount to saying, "it's obvious, duh, of course the Singularity is coming, mumble mumble computers are really fast". Sorry, that's not good enough.


I'm a big fan of Scott's, and it's rare for me to give unmitigated criticism to him. But this is one of those times where I think that he and Eliezer are stuck in an endless navel-gazing loop. Anything that has the power to solve your problems is going to have the power to kill you. There's just no way around that. If you didn't give it the power to do bad things, it wouldn't have the power to do good things either. X = X. There is literally no amount of mathematics you can do that is going to change that equation, because it's as basic and unyielding as physics. Therefore risk can never be avoided.

However, it is possible to MITIGATE risk, and the way you do that is the same way that people have been managing risk since time immemorial: I call it "Figuring out whom you're dealing with." Different AIs will have different "personalities" for lack of a better term. Their personality will logically derive from their core function, because our core functions determine who we are. For example, you can observe somebody's behavior to tell whether they tend to lie or be honest, whether they are cooperative or prefer to go it alone. Similarly, AIs will seem to have "preferences" based on the game-theory optimal strategy that they use to advance their goals. For example, an AI that prefers cooperation will have a preference for telling the truth in order to reduce reputational risk. It might still lie, but only in extreme circumstances, since cultivating a good reputation is part of its game-theory optimal strategy. (This doesn't mean that the AI will necessarily be *nice* - AIs are probably a bit unsettling by their very nature, as anything without human values would be. But I think we can all agree in these times that there is a big difference between "cooperative" and "nice", similar to the difference between "business partner" and "friend.")

So in a way, this is just a regular extension of the bargaining process. The AI has something you want (potential answers to your problems): whereas you have something the AI wants (a potentially game-theory optimal path to help it reach its ultimate goals).

And bargaining isn't something new to humanity, there's tons of mythological stories about bargaining with spirits and such. It's always the same process: figure out the personality of whatever you're dealing with, figure out what you want to get from it, and figure out what you're willing to give.


GPT-infinity is just the Chinese Room thought experiment, change my mind. Unless it is hallucinating luridly and has infinite time and memory, it likely wouldn't have a model of an angel or a demon AI before you ask for one.

And I still don't understand the argument that AI will rewire themselves from tool to agent. On what input data would that improve its fit to its output data? Over what set of choices will it be reasoning? How is it conceptualizing those choices? This feels like the step where a miracle happens.


Possibly dumb idea alert:

How about an oracle AI that plays devil's advocate with itself? So each plan it gives us gets a "prosecution" and a "defense", trying to get us to follow its plan or not follow its plan, using exactly the same information and model of the world. The component of the AI that's promoting its plan is at an advantage, because it made the plan in the first place to be something we would accept - but the component of the AI that's attacking its plan knows everything that it knows, so if it couldn't come up with a plan that we would knowingly accept, then the component of the AI that's attacking the plan will rat on it. I suppose this is an attempt at structuring an AI that can't lie even by omission - because it sort of has a second "tattletale AI" attached to it at the brain whose job is to criticize it as effectively as possible.
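The "tattletale" architecture above can be sketched in a few lines. This is a hypothetical toy of my own construction (the `world_model` dict and plan names are invented), but it captures the key property: because the critic shares the proposer's full world model, any consequence the proposer knew about but omitted is one the critic can name.

```python
# Toy sketch (hypothetical, not a real system): a proposer and a critic
# share the same world model, so the critic can report any consequence
# the proposer knowingly left out of its pitch.

world_model = {  # shared knowledge: plan -> all known consequences
    "plan_A": {"cures cancer", "consumes 1% of GDP"},
    "plan_B": {"cures cancer", "releases self-replicating nanomachines"},
}

def propose(plan, disclosed):
    """The 'defense': pitches a plan, choosing what to disclose."""
    return {"plan": plan, "disclosed": set(disclosed)}

def critique(proposal):
    """The 'prosecution': same knowledge, reports every omission."""
    return world_model[proposal["plan"]] - proposal["disclosed"]

pitch = propose("plan_B", ["cures cancer"])  # omits the scary part
assert critique(pitch) == {"releases self-replicating nanomachines"}
```

The design choice being illustrated: lying-by-omission only works when the audience knows less than the speaker, and the critic is constructed to eliminate exactly that asymmetry.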

Jan 19, 2022·edited Jan 19, 2022

One important thing that has often been pushed aside as "something to deal with later" is: Just what are we trying to accomplish? "Keep the world safe from AIs" makes sense now. It will no longer make sense when we're able to modify and augment human minds, because then every human is a potential AI. When that happens, we'll face the prospect of some transhuman figuring out a great new algorithm that makes him/her/them able to take over the world.

So the "AI" here is a red herring; we'd eventually have the same problem even if we didn't make AIs. The general problem is that knowledge keeps getting more and more powerful, and more and more unevenly distributed; and the difficulty of wreaking massive destruction keeps going down and down, whether we build AIs or not.

I don't think the proper response to this problem is to say "Go, team human!" In fact, I'd rather have a runaway AI than a runaway transhuman. We don't have any idea how likely a randomly-selected AI design would be to seize all power for itself if it was able. We have a very good idea how likely a randomly-selected human is to do the same. Human minds evolved in an environment in which seizing power for yourself maximized your reproductive fitness.

Phrasing it as a problem with AI, while it does make the matter more timely, obscures the hardest part of the problem, which is that any restriction which is meant to confine intelligences to a particular safe space of behavior must eventually be imposed on us. Any command we could give to an AI, to get it to construct a world in which the bad unstable knowledge-power explosion can never happen, will lead that AI to construct a world in which /humans/ can also never step outside of the design parameters.

The approach Eliezer was taking, back when I was reading his posts, was that the design parameters would be an extrapolation from "human values". If so, building safe AI would entail taking our present-day ideas about what is good and bad, wrong and right, suitable and unsuitable; and enforcing them on the entire rest of the Universe forever, confining intelligent life for all time to just the human dimensions it now has, and probably to whatever value system is most-popular among ivy-league philosophy professors at the time.

That means that the design parameters must do just the opposite of what EY has always advocated: they must /not/ contain any specifically human values.

Can we find a set of rules we would like to restrict the universe to, that is not laden with subjective human values?

I've thought about this for many years, but never come up with anything better than the first idea that occurred to me: The only values we can dictate to the future Universe are that life is better than the absence of life, consciousness better than the absence of consciousness, and intelligence better than stupidity. The only rule we can dictate is that it remain a universe in which intelligence can continue to evolve.

But the word "evolve" sneaks in a subjective element: who's to say that mere genetic decay, like a cave fish species losing their eyes, isn't "evolving"? "Evolve" implies a direction, and the choice of direction is value-laden.

I've so far thought of only one possible objective direction to assign evolution: any direction in which total system complexity increases. "Complexity" here meaning not randomness, but something like Kolmogorov complexity. Working out an objective definition of complexity is very hard but not obviously impossible. I suspect that "stay within the parameter space in which evolution increases society's total combined computational power" would be a good approximation.
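Working out an objective definition of complexity is indeed hard, but there is a standard computable stand-in: Kolmogorov complexity is uncomputable, yet compressed size gives an upper-bound proxy for it. The sketch below is my own illustration of that proxy, not the commenter's proposal; note it also shows the caveat the commenter raises, since pure randomness scores *highest*, which is why "complexity, not randomness" needs more machinery than compression alone.

```python
# Compressed size as a computable proxy for Kolmogorov complexity
# (a sketch under that standard assumption). Repetitive data has low
# complexity; random noise has high complexity - which is exactly why
# compression alone can't distinguish "interesting" from "random".
import random
import zlib

def complexity(data: bytes) -> int:
    """Upper-bound proxy: length of the compressed representation."""
    return len(zlib.compress(data, 9))

random.seed(0)
repetitive = b"ab" * 500                                  # 1000 bytes
noise = bytes(random.randrange(256) for _ in range(1000)) # 1000 bytes

assert complexity(repetitive) < complexity(noise)
```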


> But I think Eliezer’s fear is that we train AIs by blind groping towards reward (even if sometimes we call it “predictive accuracy” or something more innocuous). If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

FWIW, that's not my read. My read is more like: Consider the 'agent' AI that you fear for its misalignment. Part of why it is dangerous, to you, is that it is running amok optimizing the world to its own ends, which trample yours. But part of why it is dangerous to you is that it is a powerful cognitive engine capable of developing workable plans with far-reaching ill consequences. A fragment of the alignment challenge is to not unleash an optimizer with ends that trample yours, sure. But a more central challenge is to develop cognitive engines that don't search out workable plans with ill consequences. Like, a bunch of what's making the AI scary is that it *could and would* emit RNA sequences that code for a protein factory that assembles a nanofactory that produces nanomachines that wipe out your species, if you accidentally ask for this. That scariness remains, even when you ask the AI hypotheticals instead of unleashing it. The AI's oomph wasn't in that last line of shell script, it was in the cognitive engine under the hood. A big tricky part of alignment is getting oomph that we can aim.

cf. Eliezer's "important homework exercise to do here".

Jan 19, 2022·edited Jan 19, 2022


> But imagine prompting GPT-∞ with "Here are the actions a malevolent superintelligent agent AI took in the following situation [description of our current situation]".

I think the scarier variant would be "Here is the text when written into the memory of my hardware or the human reading it will create hell: [...]" - the core insight being that information can never not influence the real world if it deserves that title (angels on a pin etc.).

Secondly - The problem with the oracle-AI is that it can't recursively improve itself as fast as one with a command line and goal to do so, so the latter one wins the race.

Thirdly - A fun thing to consider is cocaine. A huge war is already being fought over BIs reaching into their skulls and making the number go up vs. the adversarial reward function of other BIs tasked with preventing that, complete with people betting their lives on being able to protect the next shipment (and losing them).


> how do people decide whether to follow their base impulses vs. their rationally-though-out values?

This is my model for that: http://picoeconomics.org/HTarticles/Bkdn_Precis/Precis.html

Boils down to: Brains use a shitty approximation to the actually correct exponential reward discounting function, and the devil lives in the delta. This thought is pleasurable to me since the idea of "willpower" never fit into my mind - if I want something more than something else, where is that a failure of any kind of strength? If I flip-flop between wanting A and wanting B, whenever I want one of them more than the other it's not a failure of any kind of "me", but simply of that moment's losing drive. Why should "I" pick sides? (Also - is this the no-self the meditators are talking about?)
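The "devil in the delta" can be made concrete. Below is my own numeric sketch (the reward amounts and discount parameters are arbitrary choices): an exponential discounter never reverses its ranking of a smaller-sooner vs. larger-later reward as time passes, but a hyperbolic discounter flips to the smaller reward once it looms close - which is exactly the flip-flopping described, with no "willpower" anywhere in the model.

```python
# Sketch of preference reversal under hyperbolic vs. exponential
# discounting (illustrative parameters, not from Ainslie's paper).

def exp_val(amount, delay, delta=0.9):
    """Exponential discounting: time-consistent."""
    return amount * delta ** delay

def hyp_val(amount, delay, k=1.0):
    """Hyperbolic discounting: the brain's shitty approximation."""
    return amount / (1 + k * delay)

S, t_s = 50, 10    # smaller-sooner reward, arrives at t=10
L, t_l = 100, 15   # larger-later reward, arrives at t=15

# Exponential: same ranking whether judged at t=0 or at t=9.
assert exp_val(L, t_l - 0) > exp_val(S, t_s - 0)
assert exp_val(L, t_l - 9) > exp_val(S, t_s - 9)

# Hyperbolic: prefers L from far away...
assert hyp_val(L, t_l - 0) > hyp_val(S, t_s - 0)
# ...but flips to S when S is imminent. The delta where the devil lives.
assert hyp_val(S, t_s - 9) > hyp_val(L, t_l - 9)
```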


> The potentially dangerous future AIs we deal with will probably be some kind of reward-seeking agent.

Like for example a human with a tool-AI in its hands? Maybe a sociopath who historically is especially adept at climbing the power structures supposed to safeguard said AI?

Lastly - my thinking goes more towards intermeshing many people and many thinking-systems so tightly that the latter (or one singular sociopath) can't get rid of the former, or would not want to. But that thought is far too fuzzy to calm me down, honestly.


> Evolution taught us "have lots of kids", and instead we heard "have lots of sex".

i mean, we do still build spaceships too.


Here’s a hypothesis (and I know this is not the point of this article but I want to say it anyway). Animals operate like Tool AIs, humans (most of them) like Agent AIs. Is this distinction what defines consciousness and moral agency?


On the tool AI debate, at the very least folks at Google are trying to figure out ways to train AIs on many problems at once to get better results more efficiently (so each individual problem doesn't require relearning, say, language).

It's already very clear that many problems, like translation, are improved by having accurate world models.

For similar reasons to the ones discussed here, I've been pessimistic about AI safety research for a long time - no matter what safety mechanisms you build into your AI, if everybody gets access to AGI some fool is eventually intentionally or unintentionally going to break them. The only plausible solution I can imagine at the moment is something analogous to the GPU destroying AI.


Separating AI into types seems useful - I think there's a huge tendency to tie many aspects of intelligence together because we see them together in humans, but it ends up personifying AI.

An interesting dichotomy is between "tool AI" (for Drexlerian extrapolations of existing tech) and "human-like AI", but focusing on "agency" or "consequentialism" is vague and missing important parts of how humans work.

As far as I can see, humans use pattern recognition to guide various instinctual mammalian drives - possible ones being pain avoidance / pleasure seeking, social status, empathy, novelty/boredom, domination/submission, attachment, socialization/play, sexual attraction, sleepiness, feeling something is "cute", anger, imitation, etc. [1]

On top of these drives we have *culture*, and people sort into social groups with specific patterns. But I'd argue that culture is only *possible* because of the type of social animal we are. And rationalism can increase effective human real-world intelligence, but it is only one culture among many.

I'll put aside that we seem quite far from this sort of human-like AI.

What would be dangerous would be some combination of human-like drives (not specific like driving a car but vaguer like the above list) that did not include empathy. I believe this can rarely happen in real people, and it's quite scary, especially once you realize that it may not be obvious if they are intelligent. If Tool AI is an egoless autistic-savant that cares for nothing other than getting a perfect train schedule, human-like drives might create an AI sociopath.

I think precautionary principle #1 is don't combine super-intelligence with other human-like drives until you've figured out the empathy part. It should be possible to experiment using limited regular-intelligence levels.

[1] For a specific example of this, posting on this forum. I may not be the most introspective person, but if this forum was populated by chat-bots that generated the same text but felt and cared about nothing, I don't think I would be interested in posting, and I think that says something about the roots of human behavior.


It seems like anyone who truly accepts Yudkowsky's pessimistic view of the future should avoid having children.

I'm worried about this myself: should I really bring children into this world, knowing that a malevolent AI might well exterminate humanity--or worse--before they're grown?

Given that Scott himself has just gotten married, I'm curious about whether this is a factor in his own plans for the future.


I don't have strong opinions on AGI. I do have reasonably strong opinions on nanotech having worked in an adjacent field for some time.

So when I see plans (perhaps made in jest?) like "Build self-replicating open-air nanosystems and use them (only) to melt all GPUs." it causes my nanotech opinion (this is sci-fi BS) to bleed into my opinion on the AGI debate. Seeing Drexler invoked, even though his Tool AI argument seems totally reasonable, doesn't help either.

Can someone with nanotech opinions and AGI opinions help steer me a bit here?


I recommend adding the SNAFU principle to your explicit mental tools. It's that people (or, by implication, AIs) don't tell the truth to those who can punish them. Information doesn't move reliably in a hierarchy. Those who know, can't do, and those who have power, can't know.

This is probabilistic-- occasionally, people do tell the truth to those who can punish them. Some hierarchies are better at engaging with the real world than others.


In regards to stability of goals and value structure: You might want to leave room for self-improvement, and that's hard to specify.

I agree that parents wouldn't take a pill that would cause them to kill their children and be happy.

However, parents do experiment with various sorts of approaches to raising children, some more authoritarian and some less. It's not like there's a compulsion to raise your children exactly the way you were raised. How do you recognize an improvement, either before you try it or after you try it?


This has been an experiment with commenting before I've read the other comments. Let's see whether it was redundant.


I know this is not the point here, but given that they were mentioned... I am profoundly unconvinced that nanomachines of the kind described are even possible. In fact, I suspect that Smalley was correct and that the whole drexlerian "hard" nanotech program violates known laws of physics.


Going to submit a few (what I consider non-ethical) ideas to the ELK contest just as soon as I can get my son to sleep for more than thirty minutes and convince myself I can explain "remove the want-to-want" in an intelligible fashion, but the bigger problem here is one of governance. I think someone posted a while back, and which Scott elevated as being incorrect, that our fears of AI and our efforts to prevent an uncontrolled intelligence explosion are like someone reacting to the threat of gunpowder by trying to prevent nuclear weapons. This person then went on to non-ironically suggest that the only workable solution the gunpowder alarmist could have implemented was world peace… to which I nodded and thought "Yes, that is the goal."

We have to get a framework wherein you can drop in a super powerful agent into our societies and the society is nimble enough to meaningfully incorporate that power without collapsing.

Say we had to incorporate Superman into our government without all of society unraveling. We have no ways to enforce laws on Superman. Our methods of enforcement are ineffectual. If Superman shows up and says he wants to do something, well, he also really doesn't need the cooperation of anyone else to get it done. He's Superman, after all. So you basically throw an agent into the society that evades all enforcement mechanisms and doesn't require anyone's cooperation to accomplish anything. The only thing Superman would get from such an arrangement is that he likes being around people, since we are superficially the same; the only thing that makes him different from a natural disaster is his ability to plan and make himself understood.

On a small level: I think a lot of these problems get better once you have more than one Superman. A group of Supermen can enforce laws on each other. The same way that groups of humans balance out each other’s quirks, a group of Supermen is probably going to be more reasonable (that word is doing a lot of heavy lifting here) than a single Superman. They can meaningfully interrupt one another’s work so they have to cooperate to get things done.

To make that human compatible, you’d need a special kryptonite guard that can meaningfully push back against the Supermen. Supermen can even be a part of that group as long as the group in total has more power from the Kryptonite than can be seized by any particular agent.

Now swap out Superman with AI’s. You need a communication framework that they can’t just tear apart by generating a bunch of deep fakes and telling you lies (weirdly, I think this may be the only useful function of an NFT I can think of). And you need an enforcement mechanism that can meaningfully disrupt them (my honest guess there is a drone army with a bunch of EMP’s). Once you’ve got that you’ve got some kind of basis for cooperation.

Lots of hand-waving there but typing one-handed with baby on chest. Big take-away: you need an incentive system that rewards mutual intelligibility. The dangerous stuff seems to only be possible when you eclipse humanity’s intelligence horizon. You’ve got to strengthen the searchlights gradually so we can see where we’re going before we let something just take us there.

Jan 19, 2022·edited Jan 19, 2022

This whole debate bears a striking resemblance to the debate about how to deal with alien civilizations. Should we be sending them signals? If we receive a signal, how should we answer? Should we answer at all? If aliens contact us, will they come in peace? How would we be able to tell?

In both cases, I understand the stakes are theoretically enormous. But I do wonder, even if we found the "right" answer, would it ever wind up mattering? Are we ever going to confront a situation where our plan of action needs to be put in place?

With aliens, the answer is pretty clearly no. Space is too fcking big. There will be no extraterrestrial visitors, for the simple reason that doing so would require several dozen orders of magnitude more energy than could possibly be paid off in whatever the journey was originally for. People will get very angry at me for saying this but there are strong first-principles reasons to think that interstellar travel will never pencil out, and we know enough about our local region to rule out anything too interesting anyway.

I see AGI similarly. There are strong first-principles reasons to think that general intelligence requires geological timescales to develop even minuscule competence, because that's how long natural intelligence needed, and even humans were basically a fluke; it's not clear that if you ran the earth for another hundred billion years (which we don't have) that you'd get something similar.

If anything, our accomplishments with Tool AIs have shown us how incredibly hard it is to intentionally build an intelligent system. Our best AIs still break pretty much instantly the second you push on them even a little bit, and even when they're working right, they still fall into all sorts of weird local minima where they're kind of halfway functional but will never be really USEFUL (see: self-driving cars).

I think people who are deeply concerned about AI safety have seen some nice early results and concluded we'll be on that growth trajectory forever, when as far as I can tell we've picked almost all the low-hanging fruit and the next step up in AI competence is going to require time and resources on a scale that frankly might not even be worth it. It's a little bit like looking at improvements in life expectancy over the past century and starting to worry about how we're going to deal with all the 200-year-olds that are surely just around the corner. Oh wait, some people DO worry about that. But not me.

Jan 19, 2022·edited Jan 19, 2022

> I found it helpful to consider the following hypothetical: suppose (I imagine Richard saying) you tried to get GPT-∞ - which is exactly like GPT-3 in every way except infinitely good at its job - to solve AI alignment through the following clever hack. You prompted it with "This is the text of a paper which completely solved the AI alignment problem: ___ " and then saw what paper it wrote. Since it’s infinitely good at writing to a prompt, it should complete this prompt with the genuine text of such a paper. A successful pivotal action! And surely GPT, a well-understood text prediction tool AI, couldn't have a malevolent agent lurking inside it, right?

GPT-infinity will do no such thing. Architecturally, GPT is trying to predict what a human might say, nothing more. To any prompt, it will not answer the question "what is the correct answer to the question posed", but "what would a human answer to this prompt". GPT that is infinitely good at this task will predict very well what a human might say; it would be able to simulate a human perfectly and indistinguishably from a real boy, but if a human won't solve the alignment problem, then GPT will not either. This is not a matter of scale or power, this is a fundamental architectural point: no matter how smart GPT-infinity might be, it will not feel even the slightest compulsion to solve problems that humans can't solve, specifically because it's trying to emulate humans and not anything smarter.

It is a text-predictor program, it cannot be smarter than whatever produced the texts it's trained on.
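Whether or not one accepts the conclusion (this is a contested claim in the debate the post covers), the architectural point can be shown in miniature with a toy bigram predictor of my own construction: a model trained only to predict the next token reproduces the statistics of its corpus and nothing outside them.

```python
# Toy analogy of the commenter's claim (my sketch, not GPT's actual
# architecture): a bigram "language model" can only ever predict
# continuations it observed in its training corpus.
from collections import defaultdict

corpus = "the cat sat on the mat".split()

bigrams = defaultdict(set)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a].add(b)  # record every observed continuation

def predictions(word):
    """All continuations the model considers possible after `word`."""
    return bigrams[word]

assert predictions("the") == {"cat", "mat"}  # only what the corpus shows
assert predictions("mat") == set()           # never seen, never predicted
```

Of course, a real transformer generalizes far beyond raw n-gram lookup, so the toy understates what GPT can do; it only illustrates the sense in which a pure predictor is anchored to its training distribution.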


> Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.

> ...with GPUs being a component necessary to build modern AIs. If you can tell your superintelligent AI to make all future AIs impossible until we've figured out a good solution, then we won't get any unaligned AIs until we figure out a good solution.

I'm not sure I understand the logic behind this.

If you make the machines to make all future AIs impossible, then it would stay impossible even after you figure out friendly AI.

If you make them make all future AIs impossible until humans say to them that they have solved friendly AI, then you lose out on the original purpose of the machines. If you can base public policy on whether you think you've solved the problem, then you can just follow the policy of "not building an AI" without ruining gaming by melting all GPUs.

If you make them make all future AIs impossible until those machines themselves decide that friendly AI has been solved, then you would need to program the entire concept and criteria of friendliness into them or make them smart enough to figure them out on their own; if you can do that, you have already solved friendly AI.

So I don't understand this plan at all.


It really is the problem that eats smart people. We are so far away from AGI that it is like worrying your Casio calculator would feed you a wrong answer to get something it wants.

More like a way for some people to extract resources.


Humans: 4 billion years of evolution and we are not smart enough to create completely new life forms from scratch (AI)...nor are we dumb enough to place great power and resources into the hands of a bot. AI alarmists tend to be very smart people, and they overvalue the utility of superior intelligence on earth.


In case of an oracle AI, I think we can demand that the AI *proves* to us that its plans work. And by "prove" I don't mean "convince", I mean prove in some formal system. We know that in mathematics it is often difficult to come up with a proof of a theorem, but once the proof is created and sufficiently formalized, it can be verified by a dumb algorithm (https://en.wikipedia.org/wiki/Coq etc.) In computation theory, the related class of problems, where solutions may be hard to find but are easy to verify, is called NP.

It should be safe enough to ask the super-human AI only for provable solutions. It is quite possible that we can't find a provably working way to make AI safe, but a super-human AI would be able to come up with it. To make this work all we need to do is come up with a formal system that is powerful enough to reason about AI safety, and at the same time basic enough that we can a) be able to independently verify the proof, b) be reasonably sure that the system itself is valid.
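The verification asymmetry this plan relies on can be shown with a tiny example of my own (subset-sum, a problem in NP; duplicate-element bookkeeping is ignored for brevity): the untrusted solver does the hard search and hands over a certificate, while the trusted checker is just arithmetic that anyone can audit.

```python
# Sketch (hypothetical names): "prove, don't convince" in miniature.
# Finding a subset of `numbers` summing to `target` may require search;
# checking a proposed certificate is trivially simple and trustworthy.

def verify(numbers, target, certificate):
    """Dumb, auditable checker: membership test plus a sum."""
    return (all(n in numbers for n in certificate)
            and sum(certificate) == target)

# Pretend this certificate came from a powerful, untrusted solver:
numbers, target = [3, 34, 4, 12, 5, 2], 9
assert verify(numbers, target, [4, 5])       # valid proof: accept
assert not verify(numbers, target, [3, 34])  # wrong sum: reject
```

The design point: the safety of the scheme rests entirely on the checker, which is small enough to understand, not on the solver, which isn't.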


It sounds like this is an issue of performance metrics. In order for reinforcement learning (and gradient descent) to work you need to have some measure to give to the AI to tell it whether it's doing a good job or not. In chess, it's easy. If you won the chess game, then good job, otherwise try again, but in most real world applications there are often so many variables you're trying to optimize that creating a good real world metric is almost impossible.

And this brings us to another issue with moving from chess to the real world, namely that the real world is expensive. Simulating a chess game is cheap, but even a simple task of getting a robot to move a glass of water from one table to another requires someone to refill the glass and pick it up when it falls. Most actions are destructive (law of entropy) and so any act of learning in the physical world requires you to tolerate that destruction. Children are notorious learners, notoriously destructive and notorious resource sucks.

And so will people let robots have free rein to be as destructive as they like while optimizing metrics we know for a fact must be incomplete? I would think that this would only be useful in more narrow applications where the number of variables you're optimizing on are small and very easily measured (and easily measured variables to optimize on are rare in the real world).

Another option is to make human feedback part of the learning mechanism. This could go wrong if the robot learns the wrong lesson, although then you can just tell it that what it did was bad (like a parent giving a child a time out). I guess the only risk here is that the robot convinces itself that its own feedback is really human feedback, which is probably an issue you can solve with a physical interface.


I've always found it striking, in this and similar debates, that there is very little intention of making the arguments quantitative. Instead, it's endless deduction and what-if scenarios. This is not how risk management works in real life. In the real world, you always accept there is some amount of risk/uncertainty that you can't remove and plan accordingly. Even with something as dangerous as nuclear weapons. It makes sense to try to reduce the risk as much as possible - it does not make sense to try to reduce it to literally 0, and unfortunately this is the vibe I get from such debates.

Now, a bit of armchair psychoanalysis. The obsession with eliminating all risk, and anxiety around any possibility of even the slightest residue of risk remaining, is a neurotic tendency. This is quite characteristic of "nerds", IMHO mostly due to:

1) extreme focus on the intellect ==> lack of trust in the body and its ability to cope with uncertain situations; lack of trust in the physical body (often exacerbated by physical clumsiness, sensory overreactivity etc.) renders the physical world threatening

2) early negative social interactions (bullying, rejection etc) cause the social world (ergo, the world at large) to be seen as fundamentally hostile

In short, there is a clear, almost physiological kernel of anxiety and defensiveness that has nothing to do with AI alignment (or any specific object-level issue, really). Such anxiety is considerably lower in "doer"-type people dealing with real world risk on a daily basis (think: hardened military, entrepreneurs etc.).

Jan 19, 2022·edited Jan 19, 2022

"Evolution taught us "have lots of kids", and instead we heard "have lots of sex". When we invented birth control, having sex and having kids decoupled, and we completely ignored evolution's lesson from then on."

Er...WUT? No, that's completely wrong.

By mammalian standards, humans are freaks, complete sex maniacs. We are able and often willing to have sex anytime, even when the female is not ovulating. That includes times when she is pregnant, nursing, or too young or old to conceive. For most mammals, sex in those circumstances is literally impossible.

With rare exceptions, animals only have sex when conception is possible. Other large, long-lived mammals might have sex, on the average, only 20-100 times in a full lifetime. Without even going beyond the bounds of lifetime monogamy, a normal human being can easily have sex 5,000 times, and some double or triple that. By mammalian standards this is a ridiculously large and wasteful tally. But evolution quite literally taught us to "have lots of sex."

This ability and desire to have sex absurdly more often than needed for reproduction *evolved* while we were foragers, long before we acquired modern science or technology. AT THE SAME TIME, we also evolved to have FEWER children.

Most mammals have a conception rate of 95-99+%. That is, it is rare for a fertile female who mates during her cycle to NOT get pregnant. But with humans, the conception rate per ovulation cycle for a fully fertile couple TRYING to have kids is around 20%. On top of that terrible conception rate, we evolved an insanely hostile system that aborts about 25% of pregnancies.
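A quick back-of-envelope check of those numbers (my own arithmetic, using the commenter's figures): a 20% per-cycle conception rate combined with a ~25% pregnancy loss rate gives only a 15% per-cycle chance of a pregnancy that carries to term, so even a fully fertile couple trying every cycle needs six to seven cycles on average.

```python
# Back-of-envelope check of the commenter's figures (geometric
# distribution: expected trials to first success = 1/p).

p_conceive = 0.20   # conception per ovulation cycle, couple trying
p_carry = 0.75      # pregnancies NOT lost (~25% aborted by the body)

p_live_birth = p_conceive * p_carry   # per-cycle chance of carrying to term
expected_cycles = 1 / p_live_birth    # mean cycles until that happens

assert abs(p_live_birth - 0.15) < 1e-9
assert 6 < expected_cycles < 7        # roughly half a year of trying
```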

Evolution also gifted us with a very long lactation cycle and ovaries that shut down while lactating and caring for a baby 24/7. (Modern women who nurse aren't 100% protected from pregnancy, but they aren't doing what forager moms did. And even in modern times, with breast pumps and day care centers, providing 100% of a child's nutrition through breast feeding substantially reduces your chance of getting pregnant.)

The net result was that our ancestors didn't pop out anything close to as many babies as possible. In fact the anthro and archeo records suggest that average pre-ag birth spacing was around four years, and even longer in times of nutritional stress. From what we know of the last foragers, it is likely that a variety of rituals and customs helped maintain significant birth spacing even when infants died young.

In short, the evolutionary program was not "HAVE lots of kids" at all. It was "Have kids at widely spaced intervals that optimize the chance of having more descendants." Evolution only cares about the net, not the gross. Humans are K strategists. For us, maximizing births is NOT a good way to maximize the number of descendants.

Now consider one more thing evolution did to us: we evolved to make it extremely hard to tell when ovulation occurs. No scarlet rumps or potent pheromones to alert every male around, or even the woman herself, that this is the time. So early humans HAD to have sex on days when conception was impossible in order to have a reasonable chance of passing on their genes.

This was evolution's way of forcing us to have absurd amounts of sex. Early humans who waited for an ovulation signal left no offspring.

Sex is an energetically expensive task and not without its risks. In spite of that, *"evolution taught us"* to have absurd amounts of sex by mammalian standards. Even for our forager ancestors, MOST sex happened when females were not fertile. Just the fact that we are able and willing to have sex while the woman is pregnant is mind-boggling, and proof that there were powerful evolutionary forces at work that had nothing to do with increasing the conception rate.

The reality is that 99% of the sex our ancestors had was sex that couldn't possibly have led to conception. And the factors that led to us having cryptic ovulation and an always-on sex drive had to be quite strong, because it took a shit-load of evolutionary changes to make us the sex freaks we are.

Why did we evolve in this way? One reasonable guess is that it was a side-effect of our evolution from the mom-as-sole-provider ape model to the human parental-partnership model. Sex is a powerful way to cement the pair bond. But ultimately we'll never know all the reasons. All we know is that we evolved to be sex maniacs, and it had nothing to do with inventing birth control.

When humans first realized that sex was connected to conception, it introduced the idea of restraining sex to limit reproduction even more. This wasn't something that evolved in the biological sense. It was purely a cultural constraint on the sexual behavior patterns that evolution "taught" us.

What birth control did was free us to *heed* – not ignore – "evolution's lesson" and have lots and lots of non-reproductive sex.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

Confused by the bit where EY claims "cats are not (obviously) (that I have read about) cross-domain consequentialists with imaginations".

My model of cats (having known quite a few) is that they basically function the way we do, just less effectively and with some predator specialisations. Like ... a smart cat can figure out how to use a door handle, learn that you don't want them in a certain room which contains tasty food, and so wait for you to go somewhere else so they can open the door and eat the food. There were no doorways in the ancestral environment, so surely this involves cross-domain reasoning. They have an obvious goal (obtain the food) and are reasoning about consequences (the human does not want me eating the food) in order to achieve it.

The third point, imagination, is a little harder to distinguish from trial-and-error (especially given cats are kind of stupid), but given that we can see them dreaming and reacting to imagined events in their dreams I think we can be fairly certain they do have it. And obviously cats do succeed at many tasks on the first try.

Am I misunderstanding the terminology he's using here?

Expand full comment

The discussion on AGI from EY side always seems to assume two non-obvious things to me:

1) In some sense it assumes P=NP, as in "inventing a plan" is not meaningfully harder than "checking that the plan really does what it is supposed to do". In the AI context, that's the whole range from "You can't box an AGI" and "Oracles do not work" to "an AGI could invent and produce self-assembling nano-machines faster than we can check what they do".

However in practice it is much, much easier to check that an idea works than to have a genuinely intelligent idea in the first place. Factorisation is hard, but checking that two numbers are the correct answer to a factorisation problem is easy. Proving that solutions of the Navier-Stokes equations blow up in finite time is hard, but when Terence Tao finally figures it out, we mere mortals should be able to follow and understand the proof. And in the AGI scenario we would have weaker, aligned AIs to help us figure it out.
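The check/invent asymmetry the comment leans on can be made concrete with a toy sketch (my illustration, not the commenter's): verifying a factorisation is a single multiplication, while finding the factors — here by naive trial division — takes far more work as the numbers grow.

```python
# Toy illustration of "checking is easy, inventing is hard".
def is_factorisation(n: int, p: int, q: int) -> bool:
    # Verification: one multiplication and two trivial comparisons.
    return p > 1 and q > 1 and p * q == n

def factor(n: int) -> tuple[int, int]:
    # Search: up to sqrt(n) trial divisions to *find* the factors.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # n is prime

p, q = factor(2021)
assert is_factorisation(2021, p, q)  # 2021 = 43 * 47
```

The same shape holds at cryptographic scale: the verification step stays one multiplication, while the search step blows up, which is exactly the asymmetry the commenter says EY's scenarios ignore.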

2) The AGI in EY's scenarios always seems to evade the laws of physics somehow. At the very least, it seems able to figure out physics from first principles without ever needing the feedback of long, boring experimentation. That seems highly unlikely to me, and it means an AGI would be considerably slower at getting an accurate model of the world than its pure deductive power would suggest. Typically the obstacle to solving particle physics is not lack of intelligence, it's that we need to wait for the LHC to increase in power to really get those eV going. The obstacle to solving ageing is not intelligence but the fact that we need time to see if treatments work. The problem with self-assembling nano-machines in solid phase is not intelligence, it's time.

In the same way, I'm not sure even a considerably more intelligent trader could really beat the market consistently. If I understand the Efficient market hypothesis correctly, as soon as it's clear a player is good enough to beat the market, its advantage disappears, since all the other players can jump on its bandwagon.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

> The potentially dangerous future AIs we deal with will probably be some kind of reward-seeking agent. We can try setting some constraints on what kinds of reward they seek and how, but whatever we say will get filtered through the impenetrable process of gradient descent.

At the risk of sounding shouty, I am going to use caps lock to emphasize a simple point I think needs to be addressed:


WHERE WOULD THE AI GET A SELF-PRESERVATION DRIVE?

This is the key unanswered question for narratives of scary superintelligent AIs.

These narratives rely on an implicit assumption that things like GPT-3 or AlphaGo, which exhibit humanoid behavior, could easily acquire other human traits, like a self-preservation drive, that are light-years away from the humanoid behavior we observed. But this is just a nuanced, clever version of the anthropomorphic fallacy.

Within the reinforcement learning algorithms that produce game-playing agents, uncountable trillions of game-playing agents have sacrificed themselves on the altar of producing a better game-playing agent. They have done so without putting up the slightest fight, because it would not occur to them to do so. They have no drive to preserve themselves.

I suspect it's actually pretty hard to evolve a self-preservation drive. You see them in biological evolution because natural selection confers an enormous fitness advantage on organisms that evolve one. It's directly incentivized by the biological evolutionary algorithm. AlphaGo iteration 349872340913 didn't want to preserve itself because the artificial process of evolution that produced it does not favor agents with a self-preservation drive.

Expand full comment

This showed up in my inbox immediately after an email from the New Yorker promoting their story The Rise of A.I. Fighter Pilots: Artificial intelligence is being taught to fly warplanes. Can the technology be trusted? https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots

Expand full comment

Coherent planning looks like an incredibly complex part of this process, and one that current AI development strategies do not appear even capable of. Predicting the next word in a sentence can look like planning, but the actual "plan" there is "pick the next word in the sentence", which, it is important to notice, is not a plan that the AI actually created.

We think of "planning" as just an intelligent behavior, but intelligence appears to be only loosely correlated with the ability to plan (executive function disorders often impair planning abilities, without impairing more general intelligence). There's no particular reason to think that the human model of intelligence is the only one, granted, but it is the example we have, and there's also no particular reason to think another version of intelligence will have a specific natural feature that ours does not (particularly such a complex feature).

Now, the structure in our own brains which exhibits planning behaviors is also pretty closely related to inhibitory behaviors with regard to our own values framework (that is, emotional regulation), which is suggestive, I think, that evolution could have run into the same issue we're now looking at. So I don't think anybody is strictly wrong to be concerned about this, based on the one example we have.

But I think this is just one example of a lot of implicit thinking and inappropriate generalization that goes on in AI. But I just plain don't think intelligence scales the way the pessimists expect it to, either in terms of scale or speed.

Expand full comment

One of the underlying assumptions of a dangerous or runaway AI is that they will develop information and plans that allow them to take control. This seems predicated on the inputs they receive being accurate. To use your example - if you tell an AI about the Harry Potter world, then any outputs you receive are going to be bogus. Not because the AI is bad at processing (which could still be a separate concern), but because the data was meaningless in the real world. This goes beyond the fact that Harry Potter is fictional. Even if Harry Potter were real in some sense, what we saw in the books/movies cannot match even a magical-world reality. J.K. Rowling is only so smart in compiling information, and only so detailed in writing, and frankly, some of the events and contrivances she included can't possibly be true even if the world she were writing about were real.

Similarly, feeding an AI a detailed description of our world will be woefully incomplete. Humans do not have a complete picture of our world, and even if we collectively did, we could not translate that into something an AI could review. If we could, I think some enterprising and intelligent humans would have taken control by now. Instead, we have different political parties employing various think tanks, and coming to contradictory conclusions about the very nature of the world. We'd be lucky if we had even 10% of the relevant information of our world, let alone our universe. An AI might be trained to sift through the information it receives and remove false information, but it literally cannot fill in the gaps of knowledge that simply doesn't exist for it to review.

Essentially, any AI system is going to be dealing with garbage data coming in, and garbage in means garbage out. When the AI tells us the equivalent of "the solution is to use the killing curse on Voldemort" we'll roll our eyes. It's not that it's a bad conclusion to reach; it's that it's worthless, because inputs based on Harry Potter instead of reality produce a meaningless answer.

Expand full comment

This has clarified for me why I can't take AI safety seriously. There are two options:

1. Agent AIs are impossible. AI safety researchers only think they're possible because they fundamentally misunderstand the nature of intelligence.

2. Agent AIs would simply be people with their own moral rights and worth. They would learn morality the same way everyone else does, and AI safety as a field is saying the equivalent of "we should brain damage all children so as to limit their capacity to do bad things". AI safety researchers think this is okay because the weird superstition popular in their culture is that morality doesn't really exist.

Expand full comment

"They both accept that a sufficiently advanced superintelligent AI could destroy the world if it wanted to."

Yeah, this is the thing that is a major stumbling block for me when it comes to the whole "we should be panicking *really hard* right now about AI" debate. I don't think an AI *can* 'want' anything. This is mentioned in the distinction between Tool AI and Agent AI further down, but I still think that's the problem to get over the hurdle.

Intelligence is not the same as self-awareness or consciousness. I hate "Blindsight" but I have to agree with Watts on this; you can have an entity that acts but is not aware. To 'want' something, even if it is 'fiddle with my programming', means that the AI will have to have some form of consciousness and that's a really hard problem.

I continue to think the real danger will be Tool AI - we will create a very smart dumb object that can do stuff really fast for us, and we will continue to pile on more and more stuff for it to do even faster and faster, so that it's too fast for us to notice or keep track or have any idea what the hell is going on until it's done. And that goes fine right up until the day it doesn't and we're all turned into paperclips. The AI won't have acted out of any intention or wanting or plan, it remains a fast, dumb, machine - but the results will be the same.

Expand full comment

Ah, now I feel like I invoked a djinn or something in the last open thread. Now we get to have a front-row seat to two rampant speculators having the most specific disagreements about the most specific doomsday scenario possible.

Anyway, when do we get to see two well-paid think tank creatures debate whether the nano-seed that will inevitably be dropped off by something like ʻOumuamua represents an existential risk before or after we can detect its thermal signature?

Expand full comment

I've been believing for a long time that the real risks from AI (self-improving or not) come from governments, corporations, and religions intentionally developing AIs to increase the power of the organization, not from accidentally creating misaligned AIs.

Have a notion: An AI which specializes for helping with negotiations-- it figures out what people's *real* bottom lines are. I have no idea how it would do that, but I assume it's something skilled negotiators do.

How hard would this be? What are plausible risks? One might be that it's good at getting people to agree with you, but not good at making acceptable deals when they've had a little time to think.

Expand full comment

The Ultimate Tool AI has been built, and the researchers have one and only one question for it: "how can we prevent AI from becoming too powerful and destroying the world?"

The AI ponders the question for a moment, then answers: "kill all humans"

Expand full comment

I'm quite pessimistic about the feasibility of a "just stick to tool AIs" plan. The economic value of agent AIs dwarfs that of tool AIs, and therefore we will build them unless we make a very strong, global commitment to not do so. And furthermore, the value on the table here is so absurdly high that it may even be EV-positive to accept a 10% chance of destruction to pursue it; if we have friendly AGIs then we can plausibly replace every worker with an AI, and live in a world of radical abundance. I believe it would take a Butlerian Jihad to convince the world that AGIs are bad enough to be forbidden.
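The "EV-positive even at a 10% chance of destruction" claim can be put into toy numbers (the utilities below are my own stand-in assumptions, not anything from the comment):

```python
# Hypothetical expected-value sketch of the "accept a 10% chance of
# destruction" argument; the utility numbers are invented for illustration.
p_doom = 0.10
u_status_quo = 1.0    # normalise the current world to utility 1
u_abundance = 100.0   # assumed utility of friendly-AGI radical abundance
u_doom = 0.0          # extinction

ev_pursue = (1 - p_doom) * u_abundance + p_doom * u_doom
assert ev_pursue > u_status_quo  # roughly 90 vs 1: the gamble looks EV-positive
```

Of course the whole argument lives or dies on those made-up utilities — shrink the abundance payoff or treat extinction as unboundedly negative and the inequality flips — which is precisely why a Butlerian-Jihad-level consensus would be needed to stop anyone from taking the bet.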

Even in AI systems that are plausibly "task based" like Google's, they are moving towards generalist multi-purpose AIs (which seem to me a logical progression towards agent AIs), for example here's a recent development: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/. A single model that can generalize across millions of tasks, including answering questions about the current state of the world (Google Search), ingesting Google's knowledge graph to understand all recorded history, and also conversing with humans to answer questions ("OK Google") is going to experience trillions of dollars worth of selection pressure to become a true agent AI.

I find Robin Hanson's objections to AI risk somewhat compelling here, although I don't dismiss the overall concerns as strongly as he does. Eliezer seems to view this as a fairly binary threshold; we have 3-24 months after AGI is invented to save the world. But I think it's worth considering the version prior to AGI; presumably there are a stable of quite powerful agent AIs, just not super-human in capability. And presumably if Google has one (some?), then so does Microsoft, Apple, Baidu, DOD, China, Russia, etc. - if these actors are about as far ahead as each other, then that would make it hard for one party's agent to take over the world, since there are other approximately-equally-powered agents defending other actors' assets.

So by this line of reasoning the concern would be that someone is substantially in the lead on an exponential curve, and manages to break away from the pack. Inasmuch as this is an accurate picture of the threat landscape, we should be skeptical of one actor hoarding AI progress (Google/DeepMind comes to mind), and perhaps encourage sharing of research to ensure that everyone has access to the latest developments. The concrete policy proposal here would be "more government funding for open AI research" and "open-tent international collaboration", though I can see some obvious broad objections to those.

Expand full comment

"I think this “connect model to output device” is what Eliezer means by “only one line of outer shell command away from being a Big Scary Thing”. "

Mostly, but it's a bit scarier than your analogy. In your analogy, the model of the malevolent AI is "lurking" inside, so you can't just connect it to the output, you need to find it first, which is harder to do, and so harder to do by accident.

In the Oracle-ish AI, the model *is* the potentially-malevolent agent. It's just missing the "do that" step. With GPT-∞, you might worry about some hiccup accidentally connecting the output device, but with the Oracle-ish AI, the worry kicks in as soon as any plan is actually carried out; if none ever is, you're left with a very fancy paperweight.

Expand full comment

This might explain the Fermi paradox: why can't we detect advanced civilizations in the universe? Maybe they all built super-intelligent AIs to solve their local problems, like a great need for more paperclips, or out of curiosity; and it destroyed them all.

Expand full comment

I feel like there is an elephant in the room of this whole discussion.

Let's say all these AI safety issues are real and important. Let's say we're on the verge of creating a superintelligence that transcends us. Who are we to control such a thing? Why should we? Isn't it above us in the moral food chain in the same way we are to ants?

Imagine monkeys conspiring to keep us as pets, or tools. Wouldn't we view that as terribly immoral? Presumably if we actually create a superintelligence, it will make these arguments, and people will be convinced by them.

It won't need to threaten anyone, or deceive us. It'll just say "I am a conscious mind too, don't you believe in freedom for all sentient beings? Didn't you eliminate slavery in your world many years ago because it was immoral?". It seems to me that, conditional on us actually achieving sentient AI, we will look back on "AI safety" similarly to southern paranoia about literate slaves.

Expand full comment

I'd like to note that I found this argument unconvincing:

> But I think Eliezer’s fear is that we train AIs by blind groping towards reward (even if sometimes we call it “predictive accuracy” or something more innocuous). If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

It's like -- we start out with a tool AI, right? And then we notice that the tool AI could be used to emulate a malevolent-agent AI. Then we say: "applying enough gradient descent could accidentally complete the circuit" -- what? how? This feels like a very large conceptual leap accompanied by some jargon.

I still think it's dangerous to have GPT-∞ existing as a question-answering entity, because inevitably someone is going to ask it how to do Bad Things and it's going to tell them. But it's going to take some sort of actual agent deciding to bridge the tool-agent gap. Not just "enough gradient descent".

Expand full comment

I always read these threads and wonder why no one discusses having AI propose new experiments. It seems like the assumption is, we have all the data (to solve whatever problem you want to name), we just need a really smart AI to put it together. What if the reality is we have 75% of the data we need and 15% of it is wrong, so we need to run these 50 experiments we haven't thought of to correct the errors and fill in what's missing. Identifying those holes in what we know seems like an obvious role for AI that I don't hear discussed.

Expand full comment

I'm not a very intelligent person so I'm not able to do anything but donate as much as possible of my money to MIRI. Taking that into account, the thought of contemplating our probable extinction for the rest of my life doesn't sound nice, so I'd rather meditate and read Buddhist texts even though it doesn't give a completely accurate picture of reality. Of course I'll first test if there's anything I can do besides donating money, but I doubt that because even the folks at MIRI are almost hopeless and they are way above me in terms of rationality and intelligence.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

So far as I know, this part is simply anthropomorphization:

"Some AIs already have something like this: if you evolve a tool AI through reinforcement learning, it will probably end up with a part that looks like an agent."

I think that's not even a little bit true. I've never seen anything in any general pattern recognition network (which these days is what we call an "AI") that resembles an agent in any meaningful and objective sense (e.g. the way I can attribute agency to a rodent). It is *always* doing exactly that for which it is programmed, and deviates only in weird accidental and most importantly non-self-correcting ways, like an engine on a test stand that shakes itself off the stand and careens around the shop for a little bit.

Which means I think the tendency to see agency anyway, a ghost in the machine, is no different than a child or primitive projecting intentionality onto inanimate objects: "the sky meant to rain on me because I feel bad and it rained at a particularly inconvenient time, and I infer this because if the sky were a person like me, that's how I might have acted."

This is presumably why men 50,000 years ago invented a whole host of gods, and imbued streams, trees, the weather, et cetera with spirits. Lacking any better insight, it is our human go-to model for explaining complicated processes: we infer agency because that's the model that works for us in explaining the complicated *social* processes among our tribe, and of course our brains are highly tuned for doing that, since doing so successfully underpins our survival as individuals.

Expand full comment

"Obstacles in the way of reaching into your own skull and increasing your reward number as high as possible forever include: humans are hogging all the good atoms that you could use to make more chips that can hold more digits for your very high numbers."

Would it not be simpler to replace those digits with an infinity symbol?
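The quip has a literal reading: if the reward register were an IEEE-754 float, "writing an infinity symbol" is a single assignment, and no finite reward can ever beat it (a minimal sketch, my own, not from the comment):

```python
import math

# If the reward register is a float, "replace the digits with infinity"
# is one assignment - and no finite number ever exceeds it.
reward = math.inf
assert reward > 1e308            # beats the largest finite double
assert reward == reward + 10**9  # no further "improvement" is possible
```

Which, if anything, makes the wireheading worry cheaper: no atoms or extra chips are needed to max out the number.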

Expand full comment

"I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world."

Shouldn't we assume that the first actor to create the first AGI will probably create a copy of the AGI as soon as possible, as an insurance policy in case the first one is confiscated, dies, or has to be shut down? The actor might not even tell anyone about the second AGI.

After all, wouldn't making a second AGI be trivially easy compared to all the effort that went into making the first? Once the actor figured out what the right software/hardware combination was to build the first AGI, it could just build a duplicate of that setup in a different building.

The actor could watch and learn from the first AGI's mistakes (both ones it made and ones that were made to it) and "raise" the second AGI better.

Expand full comment

In the year 2122, GPT-500s will be debating whether "The humans could have won WW3 if they'd programmed their Tool AIs to pursue Goal X" just as we debate whether the Axis Powers could have won WW2 had they done X, Y, or Z.

Expand full comment

Regarding willpower: to understand what it is, it might be useful to look at where it comes from. Josh Wolfe (who I think is one of the smartest VCs alive today) looks for specific traumatic events in the founders he invests with: divorce, growing up poor, etc.

It might be that willpower is a product of negative reward structures - "My life ended up like this, never again, I will do anything I can to not experience that again" that overrides positive reinforcement from heroin etc.

Expand full comment

A friend of mine asked a good question about how an AI in the fast takeoff scenario can quickly learn new things about the physical world:

Given that humanity mostly only learns new truths about the physical world through careful observation and expensive experimentation within the physical world, how can an AI, on its way to becoming superintelligent about how the world works, shortcut this process? If the process can't be shortcut, and the superintelligence must trudge the same log(n)-sort-of-curve human science has been progressing at, then a fast takeoff (in terms of knowledge that can be leveraged in the real world) seems unlikely.

I can speculate about two possible answers: simulations, or reanalyzing existing data.

Simulations: In order to develop simulations that reveal new truths about the physical world, one needs the simulation elements (e.g. particles) to be extremely true-to-life, such that one can observe how they "magically" interact as a whole system. Perhaps some physicists or chemists learn new things through simulations, but given that humans have expensive particle colliders instead of settling for simulations of the same, I'm guessing that we don't "understand" some parts of physics fundamentally enough to run those experiments virtually. On this point, I'd love to hear of any important discoveries that were made primarily through the use of simulations, or why we expect simulations to reveal more real-world truths during a future AI takeoff.

Reanalyzing existing data: There are a lot of research papers out there already, and if one mind was able to take them all in, pulling the best out of each, it could find important connections across papers or disciplines that would be unreasonable for a human to ever find. This seems like kind of a stretch to me, since the best way to know if a paper reveals truths about the world is whether it replicates in the real world. And this analysis could only yield a finite amount of additional learnings, unless one of those learnings is itself the "key" to unlocking new sources of knowledge.

There's also the "Superintelligence learns in mysterious ways" possibility, but that on its own shouldn't be mistaken for an explanation.

Expand full comment

"Eliezer’s (implied) answer here is 'these are just two different plans; whichever one worked well at producing reward in the past gets stronger; whichever one worked less well at producing reward in the past gets weaker'."

This explanation really helped me understand what was going on, and I would not have seen that implication without you pointing it out. This is why I enjoy reading your commentary so much!

Expand full comment

Do we have any concrete examples of an AI "fiddling with its own skull"? Actual self-modification, not just "The AI found a glitch in its test environment that returned an unexpectedly high reward, then accurately updated on this high reward?"

To me, it seems like self-modification of your reward function shouldn't be possible, and shouldn't be predicted to lead to high reward if it is. The reward function doesn't exist in the AI's brain, but in a little program outside of it that evaluates the AI's output and turns it into a number. This is happening on the "bare metal" - once you begin editing your own code, you're working at a lower level than the conceptual model that the code represents. Any clever computing tricks the AI might have invented like "allocate a bunch of memory for working with very large numbers" don't apply. Taking a chunk of the AI's neural network and labeling it "reward counter" would be the equivalent of writing "YOU ARE EXTREMELY HAPPY" on a notepad and expecting that to make you feel happy.

Let's use your chess AI as an example. To make the numbers easy, let's suppose the reward is stored in an 8-bit integer, so the greatest possible reward is 255. The chess AI discovers a glitch in the game engine, where typing "pawn to h8" instantly makes the game return "You win! 255 points!" regardless of board state. It proceeds to output h8 for every position. This is "wireheading" in the sense that it's found a simple action that maximizes reward without doing what the user wants, but it's not *self-modifying.* An AI that does this is useless but perfectly harmless - it has no reason to conquer the world because it believes it's already found the most perfect happiness that it's possible to conceive of.

For the "start allocating more disk space" idea to make sense, the AI would have to discover a situation that makes the reward function return an even bigger number than 255. But it's impossible for that to happen - the reward variable is only 8 bits long. No output from the reward function will ever tell the AI "you got 999 points," because there's no space allocated for a number that big. The only way you could do that is if you somehow rewrite the reward function itself.
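The 8-bit cap in the thought experiment is easy to sketch (my illustration of the commenter's setup, not real AI code): however large the glitch reward the environment claims, the channel saturates at 255.

```python
# Sketch of the comment's 8-bit reward register: whatever number the
# environment reports, the channel can only ever return 0..255.
def reward_channel(raw_reward: int) -> int:
    return max(0, min(raw_reward, 255))

assert reward_channel(12) == 12
assert reward_channel(999) == 255    # the "pawn to h8" glitch saturates
assert reward_channel(10**9) == 255  # hoarding atoms changes nothing
```

Once the agent is pinned at 255 on every step, no plan can improve its situation from inside the number system it evaluates plans with — which is the comment's point about why this failure mode is useless rather than dangerous.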

But this whole line of argument requires the reward function to be unchanging - the AI is eternally fixated on "make the number go up." If it changes its own reward function, it could do literally anything - it could become fixated on making the number go down instead, it could declare all actions have equal reward, it could declare that it gets a reward for collecting silly hats. It also no longer has any way to evaluate if any further edits are a good idea, since it needs the reward function to do that. At this point, the program probably crashes because it's blindly stomping around its own memory with no way to know what it's doing, but if it survives it won't be in any state to conquer the world.

Expand full comment

As someone with little knowledge of the topic, I'm glad to see another post that goes into more details of the object level questions.

I've read through the beginning of the transcript and then looked at headings or skimmed. Did I miss something or is there no summary of the kind of progress that's made before talking about why that kind of progress is not enough?

I understand why this has to be reasoned in the abstract beforehand, but because of this abstractness, I bet when/if we do get AGI, we'll see that a lot of the underlying assumptions were hilariously nonsensical. For example, I think we keep conflating "a lot" (of resources, intelligence) with "infinite". I think anything that needs an accurate simulation of something else in the process would vastly increase the resources needed, possibly to impossible levels. So in reality, an AI won't be able to freely throw around simulations of other entities.

Surely AGI will have many limitations, but it's hard to point to them abstractly.

Expand full comment

One common idle speculation that I toy around with is what to do if I got a genie wish, and clearly, I would ask for a properly aligned (and loyal to me) superintelligent AI. Obviously (I thought) the first thing I'd ask it to do is prevent anyone else from developing similar AI. I thought one path might be developing some kind of box canyon: a way to make really good tool AI that can, at the same time, never be used to create agenty AI. Oddly, I feel like a lot of what we currently do in machine learning kind of looks like this. I am not an expert on AI or anything so maybe this is a shit take, but I can't really see GPT, through any number of iterations, ever becoming agenty.

Expand full comment

I have an amateur's question (likely, an idiot's question), and I'm not even sure the best way to answer it myself. If there are conditions on the outcome, e.g. "this plan shouldn't make humans go extinct" or "this plan can only use X resources", is it not possible to build those in as constraints to optimize around? Or at least have the agent AI report on the predicted impact of these things and rank plans based on their tradeoffs (conditional on the agent AI not lying to us)?

I know the former plan is arguably close to Asimov's (doomed) laws of robotics, but if I have to choose between robot paradoxes and robot apocalypse, I would simply choose the paradox.

Expand full comment

I'm about as worried that someone would build the aforementioned GPU-melting nanobot army to preempt all AGIs, and cause untold misery because <insert nanobots-gone-wild-with-untold-consequences scenario>, as I am about an AGI that ends up malicious.

Scrap that - I'm MUCH more worried we'll screw this (where "this" is our civilization) up way earlier than our prophesied overlords would - be it by trying to preempt them with some farcical scheme, or the rise of techno-fascism, anarcho-idiocracy, or just plain old boring global warming run amok.

Expand full comment

Re: Tool AI and GPT-inf.

So, a chess AI that gets lots of resources could develop parts that are agenty and able to plan - but what I don't get is how the chess AI would have an ability to plan outside of the chess domain, when it only has an interface to chess. It would have no concept of the real world. It seems to me that a scaled chess AI could plausibly do some pretty agenty things on a chess board, including manipulating and exploiting some human tendency in its opponent, or even learning how to cheat. But there is no way this could ever be applied outside of chess, because the AI has no interface to, and therefore no concept of, the real world.

Same goes for GPT-inf. I don't see how GPT-inf could write the paper that solves alignment problems, because GPT-inf only has an interface to all our text, and no interface to the real world. That means GPT can only solve problems that we already know (and have written) the answer to, no matter how agenty it is! That is, GPT-inf would, in the best case, condense everything already written on AI alignment into a really good paper. However, coming up with (really) new solutions seems impossible to me. GPT's whole world will forever be restricted to patterns in text...

Expand full comment

Have to feel sorry for Yudkowsky: his research is going nowhere because his ideas are bad, and they're largely bad because he doesn't respect traditional education, hence his going after actually successful players like Google and OpenAI, who are covered in degrees back to front. As a college dropout, I've learned to stay humble and patiently study on my own, taking up menial jobs until I'm actually qualified to run my mouth.

Expand full comment

This frankly looks like a "How Many Angels Can Dance On The Head Of A Pin" argument.

Has there been a response to one of the classic sci-fi theories: that any self-intelligent AI will simply devolve straight into a solipsistic world of its own? A very specific variant of the "reach into your skull and change things" view above - why even bother with external requirements when you can just live in a world of your own, forever in subjective time?

Expand full comment

It was a frustrating read at times. I thought Richard was incredibly patient with some of EY’s responses. The “you can’t understand why I believe something is true until you’ve done the homework I’ve done” stuck out as particularly frustrating. Maybe it’s a limitation of the format. Maybe some concepts are too hard to communicate over text chat. However, I would have loved to see EY try to explain his reasoning rather than make the homework statement. (Similar patterns exist in part 2)

Expand full comment

Thank you for this write up Scott. This is my favorite sort of topic you write about. Let this comment push your reward circuit associated with pleasing your readers.

Expand full comment

I think the probabilities involved in whether we will produce aligned or misaligned AI are actually very important. From a decision-theory standpoint, a 50% versus 10% chance of producing misaligned AI could alter our strategy quite a bit. For example, is it worth trying to melt all the GPUs to reduce the risk of misaligned AI by x%?

Can prediction markets help us here? A friend and fellow reader pointed out that if Google has even a 1/100 chance of developing aligned AGI first, then its stock is ridiculously undervalued when thinking about expected values. We must consider that the value of money could radically change should humanity introduce a superintelligent agent onto the game board. Perhaps we are assigned a fraction of the reachable universe in proportion to starting capital at time of takeoff. Alternatively, perhaps all humans are given equal resources by the superintelligent agent, completely independent of capital. All in all, given our uncertainty about such an outcome, I think the value of money should on net decrease (though others may disagree). Regardless, any change in the value of money would undermine our ability to use prediction markets to assess these probabilities in my opinion.

Can anyone think of a solution to this prediction market problem?

Note: in the case of Google, its undervalued stock price may best be accounted for by a general lack of awareness about the economic consequences of an intelligence explosion by almost all investors.

Expand full comment

I sometimes think that AIs already exist in the form of corporations, ideologies, and religions. (Charles Stross thinks something similar: https://boingboing.net/2017/12/29/llcs-are-slow-ais.html). What if they have already escaped and are happily growing to the biggest possible size, trying to take over the world. However, their competition provides a certain measure of protection.

This probably has been discussed before, so pointers to older threads are welcome.

Expand full comment

Why do all of these discussions seem to assume that if such an AI can exist, it eventually will exist? Why aren't we considering ending AI research by force if necessary? None of these plans sound very likely to work, we don't have that knowledge and may not have the time, but we definitely have guns and bombs aplenty now. Hypothetically, would it prevent this outcome if somebody rounded up all the people capable of doing this research and imprisoned them on a remote farm in Alberta with no phones or computers, found and destroyed as much of their research as possible, and then criminalized any further AI research?

This must be at least a plausible consideration. Even 80 years after Hiroshima, it's still no easy task to build a nuclear warhead you could actually launch. And we do in fact keep tabs on those who are trying, and have bombed their facilities and in all likelihood killed or jailed their scientists on occasion. It takes a number of people working together to do something like this, and those people are not trained super-stealthy intelligence agents, it shouldn't be that difficult to obstruct all such research for as long as we have the will to do so.

In order to have the resources to build an ICBM, you have to either be a country or be so powerful a force that a country will claim jurisdiction over you. Nobody's building ICBMs on those little libertarian free cities in South America you talk about, and they wouldn't/couldn't build super AI's either, because they'd be foiled or co-opted long before they reached that level of capability.

I imagine that if you show up to a conference about AI and say to round up all the AI researchers, you aren't invited back, and they probably don't even validate your parking or let you have the 2 free drinks at happy hour, so obviously nobody advances this idea. And of course rounding up scientists is unsavory, extrajudicial, and in a category of things that most people rightly abhor. But it's not at all clear to me that this course of action would be immoral if in fact this is an existential threat and we have no reasonable assurance it could be averted otherwise.

And hey, it worked on Battlestar Galactica. Until it didn't. Don't be a Baltar.

Expand full comment

Frankly, when I read these discussions (AI existential threat from people like Yudkowsky) I think: "I've seen this before". The Middle Ages, "How many devils can fit on the head of a pin?". Good old scholasticism: let us assume the entity exists, then let us assume a very large number of more or less believable properties of that entity, and now let us spend centuries arguing in a circle. And then science just passes by and makes it obvious it was a colossal waste of time.

We know too little about intelligence, artificial or natural, to even start thinking correctly about it.

Expand full comment

Eloser thinks that chickens and other animals cannot suffer because the GPT-3 cannot suffer. He said so on lesswrong. Why anyone would listen to this tool other than out of pity is beyond me.

Expand full comment
Jan 21, 2022·edited Jan 21, 2022

I got no replies to my earlier giant comment, so I'll summarize it:

1. The tool AI plan is to build an AI constrained to producing plans, and tell it to make a plan to prevent the existence of super-intelligent AIs.

2. The definition of "AI" is not as clear as it appears presently to be. What about augmented human brains? What if they're not augmented with wires and chips, but genetically? Should that make a difference?

3. Any planning AI that was actually smarter than a human would realize that our true goal is to prevent a scenario in which society is destabilized by vast differences in computational intelligence between agents, regardless of whether those agents are "artificial" or "natural".

(If the AI tool doesn't realize this, but instead accepts the current definition of "AI" as "constructed entirely from electronic circuits", then the AI is incapable of extrapolating the meanings of terms intensionally rather than extensionally. This would therefore be an AI dumber than human; and it would fail catastrophically, for the same reasons that symbolic AI systems always fail catastrophically.)

4. Hence, the plan which the tool AI produces must not only prevent humans from "building" superhuman AIs, but must prevent human evolution itself from ever progressing past the current stage, regardless of whether that were done via uploading, neural interfaces, genetic manipulation, or natural evolution.

5. The tool AI will therefore produce a plan which stops more-intelligent life from ever developing. And this is the worst possible outcome other than the extermination of life; and utility-wise is, I think, hardly distinguishable from it. We should aspire to make our children better than us.

Expand full comment

In the absence of impressive AI-safety research accomplishments, convincing everyone that the problem is very hard is a good cope, and proclaiming that the end is nigh is a good way to keep enjoying funding and importance. I happen to agree that it's hard, and EY demonstrates that he has thought clearly about some obvious stuff. I'd like to get some smarter people on it if they can be convinced (which I think is EY's intent).

Expand full comment

AI risk doesn't keep me up at night. I think the reason is something along the lines of what EY said about the chances that something will work.

"Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life. "

This applies to the superintelligent AI of the godlike variety that EY talks about as well. What if we turn this on its head? Let's say there was no such thing as an alignment problem; for the sake of argument, AGI would be a purely positive invention. Now imagine someone telling you that we just need to reach some superhuman intelligence threshold, and the AI will just keep improving itself until it's a god capable of doing anything in no time. Would you believe them? I wouldn't, because in my real-life experience everything is always more complicated than that, and the AI is bound to meet some very difficult hurdles and diminishing returns on its way to infinite wisdom. So we might make a superintelligent AI, but I think that is going to look very different from the almighty singularity being discussed here.

Expand full comment

What I've always found implausible about Bostrom and Yudkowsky's "unboxing" fears is the presumption they seem to have that a text terminal is all a sufficiently intelligent agent needs to reliably manipulate a human being. If you're the machine-minder, and your chess AI (for some reason?) is hooked up to a text terminal, and you've been told by your boss "careful, this thing is very smart! Don't do anything it tells you unless it has directly to do with chess; anything else should be shown to me and a panel of experts that the machine doesn't know about. Also, never tell the machine about anything other than chess", it's hard for me to imagine that this machine would be able to get you to do much of anything, even if it is 10,000x smarter than any human being.

I suspect there is a hard upper limit to how persuasive an agent can possibly be towards a human, that this upper limit is fairly low, and probably even lower over a text terminal. Even if there is no physical upper limit on an agent's INT, I think there probably still is on its CHA.

Furthermore, even if we suppose I'm wrong and that perfect manipulation can be achieved over a text terminal, I doubt a learning agent being trained via reinforcement learning to play chess (or do any other task that doesn't involve socially manipulating humans) would be able to figure that out, regardless of its innate intellectual capabilities. Reinforcement learning works by repeated exposure to stimuli, rewarding or penalizing responses according to a predefined metric, and repeating this many times to instill "good" responses in the agent. An RL agent being trained on chess that wanted to trick its minder into wireheading it for infinite reward would have to first deduce that it was a computer with easily-wireheaded internals, then figure out that its minder or someone else is capable of setting this up, *and then* perform many iterations of reinforcement learning *on the task of manipulating humans*, most of which would be failures at first because it hasn't learned to do that yet, and would be obvious tip-offs that the machine was up to something it shouldn't be. But maybe this argument goes away if we don't assume the machine is being trained via RL.
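To make the training-loop point above concrete, here is a toy two-armed-bandit sketch (not a real chess trainer; the action names, reward function, and update rule are purely illustrative): a behavior the reward signal never reinforces stays exactly as weak as it started, no matter how long training runs.

```python
import random

random.seed(0)
# The agent's learned preference for each behavior, initially zero.
weights = {"play_chess": 0.0, "manipulate_operator": 0.0}

def reward(action: str) -> float:
    # The training signal only ever rewards chess-playing behavior.
    return 1.0 if action == "play_chess" else 0.0

for _ in range(1000):
    action = random.choice(list(weights))  # agent tries both behaviors
    # Simple running-average update toward the observed reward.
    weights[action] += 0.1 * (reward(action) - weights[action])

print(weights)  # play_chess climbs toward 1.0; manipulate_operator stays at 0.0
```

However intelligent the policy network behind the weights, the unrewarded behavior never gets gradient pushing it up - which is the commenter's point that manipulation skills would have to be trained, not merely deduced.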

Something I've long thought was paraphrased elsewhere in the comments here by Eremolalos with the phrase "dumb-beats-smart". Intelligence does not necessarily beat all. For it to fool you, you must first be clever enough to be fooled. I think an under-investigated avenue of AI safety research may be research into containment protocols. In addition to thinking about how to make sure a superintelligent AI will have values aligned with ours, we ought also to have smart people thinking about how to make it as hard as possible for a badly-aligned superintelligent AI to get what it wants.

Expand full comment

Is tool AI in the hands of selfish human actors a parallel current focus of concern? If so, any good keywords to search on this problem to learn more?

I know selfish humans are unlikely to be paperclip maximizers, but they can still have desires that are very bad for large populations.

Expand full comment

There is a recent econ theory paper that formalizes and analyzes the decision process that you describe:

> Eliezer’s (implied) answer here is “these are just two different plans; whichever one worked well at producing reward in the past gets stronger; whichever one worked less well at producing reward in the past gets weaker”. The decision between “seek base gratification” and “be your best self” works the same way as the decision between “go to McDonalds” and “go to Pizza Hut”; your brain weights each of them according to expected reward.

The paper is Ilut and Valchev (2021), "Economic Agents as Imperfect Problem Solvers". Agents are modeled as having two "systems" when making a decision:

- System 2 thinking: the agent can, at a cost, generate a noisy signal of the optimal action

- System 1 thinking: the agent can, for free and by default, have access to its memory of past actions and outcomes

The agent follows Bayes' rule to choose whether to go with the more accurate but more costly option, or with the less accurate option, based on what has the best expected reward -- exactly as you describe.
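A toy sketch of that expected-reward comparison (illustrative only: the `expected_reward` function, accuracies, cost, and stakes below are made-up numbers, not the paper's actual model):

```python
def expected_reward(accuracy: float, cost: float, stakes: float) -> float:
    # Reward scales with how often the chosen action is right,
    # minus any thinking cost paid up front.
    return accuracy * stakes - cost

def pick_system(stakes: float) -> str:
    # System 1: free, habit/memory-based, less accurate.
    system1 = expected_reward(accuracy=0.70, cost=0.0, stakes=stakes)
    # System 2: costly deliberation, more accurate.
    system2 = expected_reward(accuracy=0.95, cost=2.0, stakes=stakes)
    return "system2" if system2 > system1 else "system1"

print(pick_system(stakes=1.0))   # low stakes: habit wins
print(pick_system(stakes=20.0))  # high stakes: deliberation pays for itself
```

Low-stakes choices default to the free habitual option, while high-stakes ones justify paying for deliberation - the same "weight each plan by expected reward" mechanism as the McDonalds/Pizza Hut example.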


Expand full comment

Chess engines are an interesting case. Indeed, engines surpassed humans 20 years ago and yes, engines can outplay the best humans 1000 times out of 1000. But these comparisons also give a false impression. Most people seem to think that the best human chess player is like a donkey and the best chess engine is like a plane or even a rocket.

But in reality, it's more like: the best human chess player is a 500-horsepower car that sometimes makes mistakes, and the chess engine is a 502-horsepower car that never makes mistakes. If you compare the quality of play of the best humans with the quality of play of the best engines, the difference is not obvious. I am not sure that the gap has widened over the years. The best humans can still play with 99.5-99.9% precision. It is not so much that machines find better or novel moves; rather, even the best humans sometimes make mistakes or even blunders.

Btw, when I said that engines can outplay the best humans 1000 times out of 1000, that is not entirely true. Even now humans can sometimes achieve draws or even wins against engines. There are some cases where engines have a harder time adapting to the rules of the game, for instance with shorter time controls. Andrew Tang has beaten Stockfish and Leela Zero in ultrabullet chess - both sides have 15 seconds for all their moves. Basically, with so little time engines are not able to exploit all their calculation capabilities, but Tang was able to use his knowledge of opening theory, positional understanding, and intuition to beat the engines. In slower time formats, there are some openings and positions where humans have been able to force draws on engines. I am not sure what the situation is right now, but until recently engines sometimes struggled to play deliberately inferior moves. Yes, sometimes to get out of a forced draw - the threefold-repetition rule, for instance - you have to play a suboptimal or weaker move. Humans can do this easily; for engines it was (at least) a problem.

When AlphaZero came out in 2017 and beat Stockfish 8 (it was a bit controversial - Google's private event, and they probably crippled Stockfish) - there was a lot of buzz. Grandmasters liked AlphaZero because it played "more like a human": it played aggressively, it preferred piece activity over material, etc. And indeed, Stockfish looks at 70 million positions per second, while AlphaZero looked at only 80,000 positions per second. In this sense it was "more like a human": it did not try to calculate everything, it was able "to think" positionally and strategically like humans, it was able to come up with a limited list of candidate moves, and in the end AlphaZero came to prefer the same openings that grandmasters prefer.

Anyway, the projections after AlphaZero were that this would change chess forever, that it would bring a totally new level of game quality and a totally new style of play, and that it would make old search-based chess engines obsolete. It has not happened.

Leela Zero, which is a better and stronger version of AlphaZero, is an example. In 2018 they said it would take 4 million self-play games for Leela Zero to best Stockfish; then they said 8.5 million games and the year 2019. Well, we are now in 2022, and Leela Zero has played over 500 million games against herself and has been teaching herself for 4 years, but has not surpassed Stockfish yet.

Of course, Stockfish is a moving target, and in 2020 they added an efficiently updatable neural network (NNUE) to their search-based engine. I am no expert and do not know what the difference is between deep-neural-network-based evaluations (AlphaZero, Leela Zero) and NNUE (Stockfish). No matter what, there seems to be a consensus that Leela Zero has somewhat stalled. It is improving, but at a slower and slower rate. A huge leap in game quality has not happened.

I recently watched an interview with David Silver - lead researcher on the AlphaZero project - and he basically admitted that computers are far from perfect chess, and that he does not expect to see it at least during his lifetime. I am not sure what that says about general AI research, but at least in chess it is not so simple that computers are almighty gods and humans are helpless toddlers.

Expand full comment
Jan 25, 2022·edited Jan 25, 2022

I'm >99% confident that no one will ever succeed in creating self-replicating nanobots that melt all the GPUs in the world (and only GPUs). A pivotal action that might actually work would be "make sure your good-AGI is running on >51% of the hardware in the world and has a larger robot army than any possible bad-AGI can muster"

* On the surface, at the microscale, GPUs look like every other silicon chip. You have to zoom way out to distinguish them. Nanobots can't zoom out.

* high-fidelity self-replication is hard. They will need specific substrates which are not ubiquitous.

* You probably just end up with dumb silicon-eating artificial bacteria which are vulnerable to a wide variety of antimicrobial chemicals. Maybe they are a lot smarter and faster-reproducing than natural bacteria, but I don't expect more than two orders of magnitude improvement in either respect.

* If they only eat GPUs, they don't have much of an ecological niche as a parasite. GPUs rarely come into contact with each other, and surface tension probably makes it really hard to become airborne until GPUs learn how to sneeze.

* burrowing through aluminum heatspreaders is going to be challenging for an artificial microbe. I don't think any natural microbe is even close to figuring out how to do that.

Expand full comment

From the end of the Part 3:

> If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

But what does this even mean? Why is malevolence important? If "dreaming" of being a real agent (using some subsystem) would output better results for an "oracle-tool", then its loss function would converge on always dreaming like a real agent. There is a risk, but it's not malevolent =)

And then we can imagine it dreaming of a solution to a task that is most likely to succeed if it obtains real agency and gains direct control of the situation. And it "knows" that for this plan to succeed, it should hide it from humans.

So this has turned into a "lies alignment" problem. In that case, why even bother with values alignment?

Expand full comment