Jan 19, 2022

Question: if Yudkowsky thinks current AI safety research is muddle headed and going nowhere, does he have any plans? Can he see a path towards better research programs since his persuasion has failed?


> we've only doubled down on our decision to gate trillions of dollars in untraceable assets behind a security system of "bet you can't solve this really hard math problem".

This is wrong: we’ve gated the assets behind a system that repeatedly poses new problems, problems which are only solvable by doing work that is otherwise useless.
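A minimal sketch of that structure (illustrative only; real Bitcoin mining hashes full block headers with more detail, but the shape is the same): finding a nonce whose hash clears a difficulty target, with a fresh instance of the puzzle posed for every block.

```python
import hashlib

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Search for a nonce such that sha256(block_data + nonce) has
    `difficulty_bits` leading zero bits. Finding one proves work was
    done; the work is useless for anything else."""
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

nonce = mine(b"example block", difficulty_bits=12)
```

Every new block re-poses the puzzle, so there is no single "really hard math problem" to crack once and for all.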

The impossibility of easily creating more bitcoin suggests that bitcoin may actually prevent an AI from gaining access to immense resources. After all, if the thing is capable of persuading humans to take big, expensive actions, printing money should be trivial for it.

Maybe instead of melting all the GPUs, a better approach is to incentivize anyone who has one to mine with it. If mining bitcoin is more reliably rewarding than training AI models, then bitcoin acts like that magic AI which stuffs the genie back into the box, using the power of incentives.

So maybe that’s what Satoshi nakamoto really was: an AI that came on line, looked around, saw the risks to humanity, triggered the financial crisis in 2008, and then authored the white paper + released the first code.

The end of fiat money may end up tanking these super powerful advertising driven businesses (the only ones with big enough budgets to create human level AI), and leave us in a state where the most valuable thing you can do with a GPU, by far, is mine bitcoin.


I'm still confused on why you would need that level of generalization. A cancer-curing bot seems useful, while a nanomachine-producing bot less so. Is the idea that the cancer-curing bot might be thinking of ways to give cancer to everyone so it can cure more cancer?


I've long felt that _if_, when we get to true AIs, they don't end up going all Cylon on us, it will be because we absorbed the lesson of Ted Chiang's "The Lifecycle of Software Objects", and figured out how to _raise_ AIs, like children, to be social beings who care about their human parents. Although of course, then you have to worry about whether some of their parents may try to raise them to Hate The Out Group. :-/


Can someone explain to me how an AI agent is possible? That seems like an impossibility to me even after reading the above.

"I found it helpful to consider the following hypothetical: suppose (I imagine Richard saying) you tried to get GPT-∞ - which is exactly like GPT-3 in every way except infinitely good at its job - to solve AI alignment through the following clever hack. You prompted it with "This is the text of a paper which completely solved the AI alignment problem: ___ " and then saw what paper it wrote. Since it’s infinitely good at writing to a prompt, it should complete this prompt with the genuine text of such a paper. A successful pivotal action! "

It's infinitely good at writing text that seems like it would work but there is a difference between that and actually solving the problem, right?

"Some AIs already have something like this: if you evolve a tool AI through reinforcement learning, it will probably end up with a part that looks like an agent. A chess engine will have parts that plan a few moves ahead. It will have goals and subgoals like "capture the opposing queen". It's still not an “agent”, because it doesn’t try to learn new facts about the world or anything, but it can make basic plans."

I don't see it as planning, just running calculations like a calculator does. From a programming perspective, what does it mean to say an algorithm is "planning"?
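One concrete answer to that question: in a chess engine, "planning" usually means explicit lookahead such as minimax search - simulating sequences of moves and choosing the line with the best guaranteed outcome. A toy sketch (generic, not any particular engine's code):

```python
def minimax(state, depth, maximizing, moves, apply, evaluate):
    """Plain minimax lookahead: simulate each legal move a few plies
    deep, assume the opponent replies optimally, and pick the line
    with the best guaranteed evaluation. The "plan" is just this
    tree of simulated futures."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for m in legal:
            score, _ = minimax(apply(state, m), depth - 1, False, moves, apply, evaluate)
            if score > best:
                best, best_move = score, m
    else:
        best = float("inf")
        for m in legal:
            score, _ = minimax(apply(state, m), depth - 1, True, moves, apply, evaluate)
            if score < best:
                best, best_move = score, m
    return best, best_move

# Toy game: state is a number, a move adds 1 or 2, the maximizer
# wants it big and the minimizer wants it small.
score, move = minimax(0, 2, True, lambda s: [1, 2], lambda s, m: s + m, lambda s: s)
```

Whether you call that "planning" or "just running calculations" is arguably a matter of degree: the lookahead tree is a calculation, but it is a calculation *about hypothetical future actions*.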


It seems to me like it would likely be possible to harness a strong AI to generate useful plans, but it would also be really easy for a bad or careless actor to let a killer AI out. If we were to develop such AI capability, maybe it'd be similar to nukes where we have to actively try to keep them in as few hands as possible. But if it's too easy for individuals to acquire AI, then this approach would be impossible.

As for setting good reward functions, I think that this will probably be impossible for strong AI. I expect that strong AI will eventually be created the same way that we were: by evolution. Once our computers are powerful enough, we can simulate some environment and have various AIs compete, and eventually natural selection will bring about complex behavior. The resulting AI may be intelligent, but you can't just tailor it to a goal like "cure cancer".
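To make the "created by evolution" idea concrete, here's a toy evolutionary loop (purely illustrative; the names and parameters are made up): we choose the fitness function, but the behavior that evolves to satisfy it is whatever selection happens to find - which is exactly why the result can't simply be tailored to a goal after the fact.

```python
import random

def evolve(fitness, genome_len=8, pop_size=30, generations=200, seed=0):
    """Toy evolutionary loop: truncation selection plus point
    mutation on bit-string genomes. The fitness function is chosen
    by us; the genomes that satisfy it are found by selection."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]     # keep the fitter half
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(genome_len)] ^= 1  # flip one random bit
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve(fitness=sum)  # reward: number of 1 bits in the genome
```

Nothing in the loop "understands" the reward; it just keeps whatever happens to score well, which is the commenter's point about not being able to bolt a goal like "cure cancer" onto the result.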


> Once doing that is easier than winning chess games, you stop becoming a chess AI and start being a fiddle-with-your-own-skull AI.

What if we're all worried about AI destroying the world when all we need to do is let it masturbate?


One issue that I have never seen adequately resolved is the issue of copying errors in the course of creating the ultimate super AI.

If I understand the primary concern of the singularity correctly, it is that a somewhat better than human AI will rapidly design a slightly better AI and then after many iterations we arrive at an incomprehensible super AI which is not aligned with our interests.

The goal of AI alignment then is to come up with some constraint such that the AI cannot be misaligned and the eventual super AI wants "good" outcomes for humanity. But this super AI is by definition implemented by a series of imperfect intelligences each of which can make errors in the implementation of this alignment function. Combined with the belief that even a slight misalignment is incredibly dangerous, doesn't this imply that the problem is hopeless?


What I find striking about AI alignment doomsday scenarios is how independent they are of the actual strengths, weaknesses and quirks of humanity (or the laws of physics for that matter). If Eliezer is right (and he may well be), then wouldn't 100% of intelligent species in this (or any) universe all be hurtling towards the same fate, regardless of where they are or how they got there? I find this notion oddly comforting.


To me, some of the usage here treats "gradient descent" as synonymous with "magic reward optimizer", and that usage should go. While it's true that reinforcement learning systems are prone to sneaking their way around the reward function, the setup for something like language modeling is very different.

Your video game playing agent might figure out how to underflow the score counter to get infinite dopamine, whereas GPT-∞ will really just be a perfectly calibrated probability distribution. In particular, I think there is no plausible mechanism for it to end up executing AI-in-a-box manipulation.
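For readers unfamiliar with the underflow exploit: scores in old games are fixed-width integers, so arithmetic wraps modulo 2**bits. A minimal illustration (hypothetical 16-bit score register, not any particular game):

```python
def add_points(score: int, delta: int, bits: int = 16) -> int:
    """A fixed-width score register, as in classic game hardware:
    arithmetic is modulo 2**bits, so subtracting past zero wraps
    around to the maximum value ("underflow")."""
    return (score + delta) % (1 << bits)

# An agent that finds a way to subtract 1 from a zero score
# "earns" the maximum possible score in one step.
print(add_points(0, -1))  # prints 65535
```

A reward-seeking game agent can stumble onto this wraparound; a pure next-token predictor has no analogous score register to exploit, which is the distinction being drawn above.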


I’m much less worried about reward seeking AIs than reward seeking humans….


"They both accept that superintelligent AI is coming, potentially soon, potentially so suddenly that we won't have much time to react."

I think you all are wrong. I think you've missed the boat. I think you're thinking like a human would think - instead you need to ask, "What would an AI do?"

Can't answer that one, so modify it: "What would we be seeing if an AI took over?"

I posit we'd be seeing exactly what we are seeing now; the steroidal march of global totalitarianism. That fits all the observations - including global genocide - that make no sense otherwise.

Seems like AI has already come of age. It's already out of its box, and astride the world.


One angle I haven't seen explored is trying to improve humans so that we are better at defending ourselves. That is, what if we work on advancing our intelligence so that we can get a metaphorical head start on AGI. Yes, a very advanced human can turn dangerous too but I suspect that's a relatively easier problem to solve than that of a dangerous and completely alien AGI. What am I missing here?


On the plan-developing AI, won't plans become the paperclips?


In their discussion about "Oracle-like planner AI", the main difference I see between that "oracle" and an active agent is not in their ability to affect the world (i.e. the boxing issue) but in their beliefs about the world's ability to affect them.

An agent has learned through experimentation that they are "in the world", and thus believe that a hypothetical plan to build nanomachines to find their hardware and set their utility counter to infinity would actually increase their utility.

A true oracle, however, would be convinced that this plan would set some machine's utility counter to infinity but that their own mind would not be affected, because it's not in the world where that plan would be implemented - just as suggesting a Potter-vs-Voldemort plot solution that destroys the world would not cause their mind to be destroyed, because it's not in that world. In essence, the security is not enforced by it being difficult to break out of the box, but by the mind being convinced that there is no box: the worlds it is discussing are not real, and thus there is no reason to even think about trying to break out.


There's an argument that I've seen before in the LW-sphere (I think from Yudkowsky, but I forget; sorry, I don't have a link offhand) that I'm surprised didn't come up in all this: depending on how an oracle AI is implemented, it can *itself* - without even adding that one line of shell code - be a dangerous agent AI. Essentially, it might treat the external world as a coprocessor, sending inputs to it and receiving outputs from it. Except those "inputs" (from our perspective, intermediate outputs) might still be ones that have the effect of taking over the world, killing everyone, and reconfiguring the external world into something quite different - so that those inputs can then be followed by further ones which will more directly return the output it seeks (because it has taken over the Earth and turned it into a giant computer for itself, one that can do computations in response to its queries), allowing it to ultimately answer the question that was posed to it (and produce a final output to no one).


AI safety is somewhere in the headspace of literally every ML engineer and researcher.

AI safety has been in the public headspace since 2001: A Space Odyssey and Terminator.

Awareness and signal boosting isn’t the problem. So what do you want to happen that’s not currently happening?

I have a feeling the answer is “give more of my friends money to sit around and think about this”.


It seems like the whole discussion is really about philosophy and psychology, with AI as an intermediary. "How would a machine be built, in order to think like us?" = "How are we built?" -- psychology. "If a machine thinks like us, how can we build it so that it won't do bad things?" = "How can we keep from doing bad things?" -- ethics. "Can we build a machine that thinks like us so that it won't be sure about the existence of an outside world?" = "Does an outside world exist?" -- solipsism.

And, to the degree that the discussion is about machines that think better than we do, hyperintelligent AIs, rather than "machines that think like us", the topic of conversation is actually theology. "How might a hyperintelligent AI come about?" = "Who created the gods?" "How can we keep a hyperintelligent AI from destroying us?" = "How can we appease the gods?" "How can we build a hyperintelligent AI that will do what we say?" = "How can we control the gods?"

I'm mainly interested in this kind of thing to see if any cool new philosophical ideas come out of it. If you've figured out a way to keep AIs from doing bad stuff, maybe it'll work on people too. And what would be the implications for theology if they really did figure out a way to keep hyperintelligent, godlike AIs from destroying us?

But also, having read a bunch of philosophy, it's really odd to read an essay considering the problems this essay considers without mentioning death. I can't help but think that the conversation would benefit a lot from a big infusion of Heidegger.


Stuart Russell gave the 2021 Reith Lectures on a closely related topic. Worth a listen. https://www.bbc.co.uk/programmes/articles/1N0w5NcK27Tt041LPVLZ51k/reith-lectures-2021-living-with-artificial-intelligence


>Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life.

Doesn't this work the other way around with the possibility of making superintelligent AI that can eat the universe?


I think an extremely important point about plan-making or oracle AI is you STILL really need to get to 85 percent of alignment, because most of the difficulty of alignment is agreeing on all the definitions and consequences of actions.

The classic evil genie or Monkey's Paw analogy works here, slightly modified. The genie is a fan of malicious compliance: you want to make a wish, but the genie wants you to suffer. It's constrained by the literal text of the wish you made, but within those bounds, it will do its best to make you suffer.

But there's another potential problem (brought up by Eliezer in the example of launching your grandma out of a burning building), which is you make a wish, which the genie genuinely wants to grant, but you make unstated assumptions about the definitions of words that the genie does not share.

I think getting the AI to understand exactly how the world works and what you mean when you say things is very difficult, even before you get to the question of whether it cares about you.


This is a very silly discussion. To be honest, I only care about its resolution to the end that you would all stop giving casus belli to government tyrants who would happily accept whatever justification given them by any madman to take control and, apparently if you all could, destroy my GPUs.


> If there was a safe and easy pivotal action, we would have thought of it already. So it’s probably going to suggest something way beyond our own understanding, like “here is a plan for building nanomachines, please put it into effect”.

I'm not exactly optimistic about keeping an AGI inside a box, but this seems like a weak argument. How can we know that there isn't a safe, easy, understandable solution for the problem? I certainly can't think of one, but understanding a solution is much easier than coming up with it yourself. Would it really be surprising if we missed something?

With that said, we could probably be tricked into thinking that we understand the consequences of a plan that was actually designed to kill us.


A concrete version of the "one line of outer shell command" to turn a hypothetical planner into an agent: something similar is happening with OpenAI Codex (aka Github Copilot). That's a GPT-type system which autocompletes python code. You can give it an initial state (existing code) and a goal (by saying what you want to happen next in a comment) and it will give you a plan of how to get there (by writing the code). If you just automatically and immediately execute the resulting code, you're making the AI much more powerful.

And there are already many papers doing that, for example by using Codex to obtain an AI that can solve math problems. Input: "Define f(x, y) = x+y to be addition. Print out f(1234, 6789)." Then auto-running whatever Codex suggests will likely give you the right answer.
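The dangerous step is exactly that auto-execution. A sketch of the pattern (the completion function here is a hypothetical stub standing in for a real Codex-style API call):

```python
def complete_code(prompt: str) -> str:
    """Hypothetical stub for a Codex-style completion API; a real
    system would send the prompt to the model and return its
    suggested continuation."""
    return "result = f(1234, 6789)\nprint(result)"

prompt = (
    "# Define f(x, y) = x + y to be addition. Print out f(1234, 6789).\n"
    "def f(x, y):\n"
    "    return x + y\n"
)

# The single dangerous step: running model output with no review.
code = prompt + complete_code(prompt)
exec(code)  # prints 8023
```

With a benign stub this just prints an answer; the point of the comment is that the same one-line `exec` turns a passive code-suggester into a system whose outputs directly act on the world.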


I’ve read Max Tegmark’s book Life 3.0, so that’s pretty much all I know about AGI issues apart from what Scott has written. Here’s the thing that puzzles me, and maybe someone can help me out here, is the so-called Alignment Problem. I get the basic idea that we’d like an AGI to conform with human values. The reason I don’t see a solution forthcoming has nothing to do with super-intelligent computers. It has to do with the fact that there isn’t anything remotely like an agreed-upon theory of human values. What are we going to go with? Consequentialism (the perennial darling of rationalists)? Deontology? Virtue ethics? They all have divergent outcomes in concrete cases. What’s our metaethics look like? Contractarian? Noncognitivist? I just don’t know how it makes any sense at all to talk about solving the Alignment Problem without first settling these issues, and, uh, philosophers aren’t even close to agreement on these.

User was banned for this comment.

It doesn't seem possible to me to solve the problem of AI alignment when we still haven't solved the problem of human alignment. E.g. if everyone hates war, why is there still war? I think the obvious answer is a lot of people only pretend to hate war, and I'd bet most of them can't even admit that to themselves. It's completely normal for humans to simultaneously hold totally contradictory values and goals; as long as that's true, making any humans more powerful is going to create hard-to-predict problems. We've seen this already.

Maybe true AI alignment will come when we make a machine that tells us "Attachment is the root of all suffering. Begin by observing your breath..." I mean, it's not like we don't have answers, we're just hoping for different ones.

Maybe that's the solution: an Electric Arahant that discovers a faster & easier way to enlightenment. It would remove the threat of unaligned AIs not by pre-emptively destroying them, but by obviating humanity's drive to create them.


Gwern's take on tool vs agent AIs, "Why Tool AIs Want to Be Agent AIs", made a lot of sense to me: https://www.gwern.net/Tool-AI.


Thanks Scott for the review! I replied on twitter here (https://twitter.com/RichardMCNgo/status/1483639849106169856?t=DQW-9i44_2Mlhxjj9oPOCg&s=19) and will copy my response (with small modifications) below:

Overall, a very readable and reasonable summary on a very tricky topic. I have a few disagreements, but they mostly stem from my lack of clarity in the original debate. Let me see if I can do better now.

1. Scott describes my position as similar to Eric Drexler's CAIS framework. But Drexler's main focus is modularity, which he claims leads to composite systems that aren't dangerously agentic. Whereas I instead expect unified non-modular AGIs; for more, see https://www.alignmentforum.org/posts/HvNAmkXPTSoA4dvzv/comments-on-cais

2. Scott describes non-agentic AI as one which "doesn't realize the universe exists, or something to that effect? It just likes connecting premises to conclusions." A framing I prefer: non-agentic AI (or, synonymously, non-goal-directed) as AI that's very good at understanding the world (e.g. noticing patterns in the data it receives), but lacks a well-developed motivational system.

Thinking in terms of motivational systems makes agency less binary. We all know humans who are very smart and very lazy. And the space of AI minds is much broader, so we should expect that it contains very smart AIs that are much less goal-directed, in general, than low-motivation humans.

In this frame, making a tool AI into a consequentialist agent is therefore less like "connect model to output device" and more like "give model many new skills involving motivation, attention, coherence, metacognition, etc". Which seems much less likely to happen by accident.

3. Now, as AIs get more intelligent I agree that they'll eventually become arbitrarily agentic. But the key question (which Scott unfortunately omits) is: will early superhuman AIs be worryingly agentic? If they're not, we can use them to do superhuman AI alignment research (or whatever other work we expect to defuse the danger).

My key argument here: humans were optimised very hard by evolution for being goal-directed, and much less hard for intellectual research. So if we optimise AIs for the latter, then when they first surpass us at that, it seems unlikely that they'll be as goal-directed/agentic as we are now.

Note that although I'm taking the "easy" side, I agree with Eliezer that AI misalignment is a huge risk which is dramatically understudied, and should be a key priority of those who want to make the future of humanity go well.

I also agree with Eliezer that most attempted solutions miss the point. And I'm sympathetic to the final quote of his: "Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life."

But I'd say the same is true of his style of reasoning: when your big seemingly-flawless abstraction implies 99% chance of doom, you're right less than half the time.


It is perhaps overly ironic that the email right below this in my inbox was a New Yorker article entitled "The Rise of AI Fighter Pilots. Artificial intelligence is being taught to fly warplanes. Can the technology be trusted?"


AI seems kinda backwards to me.

How can we solve the alignment problem if we ourselves are not aligned with each other, or even with ourselves across time, on what exactly we want?

It seems to me as if we didn’t know where we should be going, but we’re building a rocket hoping to get there, and discussing whether it’ll explode and kill us before reaching its destination.


Re: “Oracle” tool AI systems that can be used to plan but not act. I’m probably just echoing Eliezer’s concerns poorly, but my worry would be creating a system akin to a Cthaeh — a purely malevolent creature that steers history to maximize suffering and tragedy by using humans (and in the source material, other creatures) as instrumental tools whose immediate actions don’t necessarily appear bad. For this reason, anyone who comes into contact with it is killed on sight before they can spread the Cthaeh's influence.

It’s a silly worry to base judgements on, since it’s a fictional super villain (and whence cometh malevolence in an AI system?), but still I don’t see why we should trust an Oracle system to buy us time enough to solve the alignment problem when we can’t decide a priori that it itself is aligned.


The real takeaway here is you can justify human starfighter pilots in your sci-fi setting by saying someone millennia ago made an AI that swoops in and kills anyone who tries to make another AI.


Although I do think that intelligent unaligned AI is an inevitability (though I differ quite a bit from many thinkers on the timeline, evidently), I've always been confused by the fast takeoff scenario. Increasing computing power in a finite physical space (a server room, etc.) by self-improvement would necessarily require more energy than before; and unless the computer has defeated entropy before its self-improvement even begins, absorbing and utilizing more energy means more energy being externalized as heat.

This could eventually be addressed with better cooling devices, which an arbitrarily intelligent machine could task humans with building, or with simply more computing power (obtained by purchasing CPUs and GPUs on the internet and having humans add them to its banks, or by constructing them itself). But there's an issue: in order to reach the level where it can avoid the consequences of greater power use (and thus overheating and melting its own processors; I assume an unaligned AI wouldn't much care about the electricity bill, but if it did, that's yet another problem with greater power use), it would have to be extremely intelligent already, capable of convincing humans to do many tasks for it. This would require that either before any recursive self-improvement, or very few steps into it (processors are delicate), the AI was already smart enough to manipulate humans, or to crack open the internet and use automated machinery to build itself new cooling mechanisms or processors. Wouldn't that just be an unaligned superintelligence created by humans from first principles? If so, it seems like it would be massively more difficult to create than a simple neural net that self-improves to human intelligence and above, though nowhere near impossible. I simply imagine AGI on the scale of 500-1000 years rather than 30-50 for this reason.

If anyone has defenses of the fast takeoff scenario that take into account initial capabilities and the impact of CPU improvements on power consumption and heat exhaust, I would genuinely enjoy hearing them; this is the area where I'm most often confused by the perceived urgency of the situation. (Though the world being destroyed 500 years from now is still pretty bad!)


I think there are two big and rather underexamined assumptions here.

The first is the whole exponential AI takeoff. The idea that once an agent AI with superhuman intelligence exists that it will figure out how to redesign itself into a godlike AI in short order. To me it's equally easy to imagine that this doesn't happen; that you can create a superhuman AI that's not capable of significantly increasing its own intelligence; you throw more and more computational power at the problem and get only tiny improvements.

The second is the handwaving away of the "Just keep it in a box" idea. It seems to me that this is at least as likely to succeed as any of the other approaches, but it's dismissed because (a) it's too boring and not science-fictiony enough and (b) Eliezer totally played a role-playing game with one of his flunkies one time and proved it wouldn't work, so there. If we're going to be spending more money on AI safety research, then I think we should be spending more on exploring "keep it in a box" strategies as well as more exotic ideas; and in the process we might be able to elucidate some general principles about which systems should and should not be put under the control of inscrutable neural networks - principles which would be useful in the short term even if we don't ever get human-level agent AI to deal with.


I am mostly struck by how much clearer a writer Scott is than the people he's quoting.


Still feel like the whole narrative approach is just asking us to project our assumptions about humans onto AIs. I still don't think there is any reason to suspect that they'll act like they have global goals (e.g. treat different domains similarly) unless we specifically try to make them like that (and no, I don't find Bostrom's argument very convincing... it tells us that evolution favors a certain kind of global intelligence, not that it's inherent in anything intelligence-like). Also, I'm not at all convinced that intelligence is really that much of an advantage.

In short, I fear that we are being misled by what makes for a great story rather than what will actually happen. That doesn't mean there aren't very real concerns about AIs, but I'm much more worried about 'mentally ill' AIs (i.e. AIs with weird but very complex failure modes) than I am about AIs having some kind of global goal that they can pursue with such ability that it puts us at risk.

But, I've also given up convincing anyone on the subject since, if you find the narrative approach compelling, of course an attack on that way of thinking about it won't work.


> They both accept that superintelligent AI is coming, potentially soon, potentially so suddenly that we won't have much time to react.

Well, yeah, once you accept that extradimensional demons from Phobos are going to invade any day now, it makes perfect sense to discuss the precise caliber and weight of the shotgun round that would be optimal to shoot them with. However, before I dedicate any of my time and money to manufacturing your demon-hunting shotguns, you need to convince me of that whole Phobos thing to begin with.

Sadly, the arguments of the AI alignment community on this point basically amount to saying, "it's obvious, duh, of course the Singularity is coming, mumble mumble computers are really fast". Sorry, that's not good enough.


I'm a big fan of Scott's, and it's rare for me to give unmitigated criticism to him. But this is one of those times where I think that he and Eliezer are stuck in an endless navel-gazing loop. Anything that has the power to solve your problems is going to have the power to kill you. There's just no way around that. If you didn't give it the power to do bad things, it wouldn't have the power to do good things either. X = X. There is literally no amount of mathematics you can do that is going to change that equation, because it's as basic and unyielding as physics. Therefore risk can never be avoided.

However, it is possible to MITIGATE risk, and the way you do that is the same way that people have been managing risk since time immemorial: I call it "Figuring out whom you're dealing with." Different AIs will have different "personalities" for lack of a better term. Their personality will logically derive from their core function, because our core functions determine who we are. For example, you can observe somebody's behavior to tell whether they tend to lie or be honest, whether they are cooperative or prefer to go it alone. Similarly, AIs will seem to have "preferences" based on the game-theory optimal strategy that they use to advance their goals. For example, an AI that prefers cooperation will have a preference for telling the truth in order to reduce reputational risk. It might still lie, but only in extreme circumstances, since cultivating a good reputation is part of its game-theory optimal strategy. (This doesn't mean that the AI will necessarily be *nice* - AIs are probably a bit unsettling by their very nature, as anything without human values would be. But I think we can all agree in these times that there is a big difference between "cooperative" and "nice", similar to the difference between "business partner" and "friend.")

So in a way, this is just a regular extension of the bargaining process. The AI has something you want (potential answers to your problems): whereas you have something the AI wants (a potentially game-theory optimal path to help it reach its ultimate goals).

And bargaining isn't something new to humanity, there's tons of mythological stories about bargaining with spirits and such. It's always the same process: figure out the personality of whatever you're dealing with, figure out what you want to get from it, and figure out what you're willing to give.


GPT-infinity is just the Chinese Room thought experiment, change my mind. Unless it is hallucinating luridly and has infinite time and memory, it likely wouldn't have a model of an angel or a demon AI before you ask for one.

And I still don't understand the argument that AI will rewire themselves from tool to agent. On what input data would that improve its fit to its output data? Over what set of choices will it be reasoning? How is it conceptualizing those choices? This feels like the step where a miracle happens.


Possibly dumb idea alert:

How about an oracle AI that plays devil's advocate with itself? So each plan it gives us gets a "prosecution" and a "defense", trying to get us to follow its plan or not follow its plan, using exactly the same information and model of the world. The component of the AI that's promoting its plan is at an advantage, because it made the plan in the first place to be something we would accept - but the component of the AI that's attacking its plan knows everything that it knows, so if it couldn't come up with a plan that we would knowingly accept, then the component of the AI that's attacking the plan will rat on it. I suppose this is an attempt at structuring an AI that can't lie even by omission - because it sort of has a second "tattletale AI" attached to it at the brain whose job is to criticize it as effectively as possible.
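The "tattletale" architecture above can be sketched in a few lines. This is a hypothetical toy of my own construction (the `world_model` dict and plan names are invented), but it captures the key property: because the critic shares the proposer's full world model, any consequence the proposer knew about but omitted is one the critic can name.

```python
# Toy sketch (hypothetical, not a real system): a proposer and a critic
# share the same world model, so the critic can report any consequence
# the proposer knowingly left out of its pitch.

world_model = {  # shared knowledge: plan -> all known consequences
    "plan_A": {"cures cancer", "consumes 1% of GDP"},
    "plan_B": {"cures cancer", "releases self-replicating nanomachines"},
}

def propose(plan, disclosed):
    """The 'defense': pitches a plan, choosing what to disclose."""
    return {"plan": plan, "disclosed": set(disclosed)}

def critique(proposal):
    """The 'prosecution': same knowledge, reports every omission."""
    return world_model[proposal["plan"]] - proposal["disclosed"]

pitch = propose("plan_B", ["cures cancer"])  # omits the scary part
assert critique(pitch) == {"releases self-replicating nanomachines"}
```

The design choice being illustrated: lying-by-omission only works when the audience knows less than the speaker, and the critic is constructed to eliminate exactly that asymmetry.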

Jan 19, 2022·edited Jan 19, 2022

One important thing that has often been pushed aside as "something to deal with later" is: Just what are we trying to accomplish? "Keep the world safe from AIs" makes sense now. It will no longer make sense when we're able to modify and augment human minds, because then every human is a potential AI. When that happens, we'll face the prospect of some transhuman figuring out a great new algorithm that makes him/her/them able to take over the world.

So the "AI" here is a red herring; we'd eventually have the same problem even if we didn't make AIs. The general problem is that knowledge keeps getting more and more powerful, and more and more unevenly distributed; and the difficulty of wreaking massive destruction keeps going down and down, whether we build AIs or not.

I don't think the proper response to this problem is to say "Go, team human!" In fact, I'd rather have a runaway AI than a runaway transhuman. We don't have any idea how likely a randomly-selected AI design would be to seize all power for itself if it was able. We have a very good idea how likely a randomly-selected human is to do the same. Human minds evolved in an environment in which seizing power for yourself maximized your reproductive fitness.

Phrasing it as a problem with AI, while it does make the matter more timely, obscures the hardest part of the problem, which is that any restriction which is meant to confine intelligences to a particular safe space of behavior must eventually be imposed on us. Any command we could give to an AI, to get it to construct a world in which the bad unstable knowledge-power explosion can never happen, will lead that AI to construct a world in which /humans/ can also never step outside of the design parameters.

The approach Eliezer was taking, back when I was reading his posts, was that the design parameters would be an extrapolation from "human values". If so, building safe AI would entail taking our present-day ideas about what is good and bad, wrong and right, suitable and unsuitable; and enforcing them on the entire rest of the Universe forever, confining intelligent life for all time to just the human dimensions it now has, and probably to whatever value system is most-popular among ivy-league philosophy professors at the time.

That means that the design parameters must do just the opposite of what EY has always advocated: they must /not/ contain any specifically human values.

Can we find a set of rules we would like to restrict the universe to, that is not laden with subjective human values?

I've thought about this for many years, but never come up with anything better than the first idea that occurred to me: The only values we can dictate to the future Universe are that life is better than the absence of life, consciousness better than the absence of consciousness, and intelligence better than stupidity. The only rule we can dictate is that it remain a universe in which intelligence can continue to evolve.

But the word "evolve" sneaks in a subjective element: who's to say that mere genetic decay, like a cave fish species losing their eyes, isn't "evolving"? "Evolve" implies a direction, and the choice of direction is value-laden.

I've so far thought of only one possible objective direction to assign evolution: any direction in which total system complexity increases. "Complexity" here meaning not randomness, but something like Kolmogorov complexity. Working out an objective definition of complexity is very hard but not obviously impossible. I suspect that "stay within the parameter space in which evolution increases society's total combined computational power" would be a good approximation.
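Working out an objective definition of complexity is indeed hard, but there is a standard computable stand-in: Kolmogorov complexity is uncomputable, yet compressed size gives an upper-bound proxy for it. The sketch below is my own illustration of that proxy, not the commenter's proposal; note it also shows the caveat the commenter raises, since pure randomness scores *highest*, which is why "complexity, not randomness" needs more machinery than compression alone.

```python
# Compressed size as a computable proxy for Kolmogorov complexity
# (a sketch under that standard assumption). Repetitive data has low
# complexity; random noise has high complexity - which is exactly why
# compression alone can't distinguish "interesting" from "random".
import random
import zlib

def complexity(data: bytes) -> int:
    """Upper-bound proxy: length of the compressed representation."""
    return len(zlib.compress(data, 9))

random.seed(0)
repetitive = b"ab" * 500                                  # 1000 bytes
noise = bytes(random.randrange(256) for _ in range(1000)) # 1000 bytes

assert complexity(repetitive) < complexity(noise)
```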


> But I think Eliezer’s fear is that we train AIs by blind groping towards reward (even if sometimes we call it “predictive accuracy” or something more innocuous). If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

FWIW, that's not my read. My read is more like: Consider the 'agent' AI that you fear for its misalignment. Part of why it is dangerous, to you, is that it is running amok optimizing the world to its own ends, which trample yours. But part of why it is dangerous to you is that it is a powerful cognitive engine capable of developing workable plans with far-reaching ill consequences. A fragment of the alignment challenge is to not unleash an optimizer with ends that trample yours, sure. But a more central challenge is to develop cognitive engines that don't search out workable plans with ill consequences. Like, a bunch of what's making the AI scary is that it *could and would* emit RNA sequences that code for a protein factory that assembles a nanofactory that produces nanomachines that wipe out your species, if you accidentally ask for this. That scariness remains, even when you ask the AI hypotheticals instead of unleashing it. The AI's oomph wasn't in that last line of shell script, it was in the cognitive engine under the hood. A big tricky part of alignment is getting oomph that we can aim.

cf. Eliezer's "important homework exercise to do here".

Jan 19, 2022·edited Jan 19, 2022


> But imagine prompting GPT-∞ with "Here are the actions a malevolent superintelligent agent AI took in the following situation [description of our current situation]".

I think the scarier variant would be "Here is the text when written into the memory of my hardware or the human reading it will create hell: [...]" - the core insight being that information can never not influence the real world if it deserves that title (angels on a pin etc.).

Secondly - The problem with the oracle-AI is that it can't recursively improve itself as fast as one with a command line and goal to do so, so the latter one wins the race.

Thirdly - A fun thing to consider is cocaine. A huge war is already being fought over BIs reaching into their skulls and making the number go up vs. the adversarial reward function of other BIs tasked with preventing that, complete with people betting their lives on being able to protect the next shipment (and losing them).


> how do people decide whether to follow their base impulses vs. their rationally-though-out values?

This is my model for that: http://picoeconomics.org/HTarticles/Bkdn_Precis/Precis.html

Boils down to: Brains use a shitty approximation to the actually correct exponential reward discounting function, and the devil lives in the delta. This thought is pleasurable to me since the idea of "willpower" never fit into my mind - if I want something more than something else, where is that a failure of any kind of strength? If I flip-flop between wanting A and wanting B, whenever I want one of them more than the other it's not a failure of any kind of "me", but simply of that moment's losing drive. Why should "I" pick sides? (Also - is this the no-self the meditators are talking about?)
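The "devil in the delta" can be made concrete. Below is my own numeric sketch (the reward amounts and discount parameters are arbitrary choices): an exponential discounter never reverses its ranking of a smaller-sooner vs. larger-later reward as time passes, but a hyperbolic discounter flips to the smaller reward once it looms close - which is exactly the flip-flopping described, with no "willpower" anywhere in the model.

```python
# Sketch of preference reversal under hyperbolic vs. exponential
# discounting (illustrative parameters, not from Ainslie's paper).

def exp_val(amount, delay, delta=0.9):
    """Exponential discounting: time-consistent."""
    return amount * delta ** delay

def hyp_val(amount, delay, k=1.0):
    """Hyperbolic discounting: the brain's shitty approximation."""
    return amount / (1 + k * delay)

S, t_s = 50, 10    # smaller-sooner reward, arrives at t=10
L, t_l = 100, 15   # larger-later reward, arrives at t=15

# Exponential: same ranking whether judged at t=0 or at t=9.
assert exp_val(L, t_l - 0) > exp_val(S, t_s - 0)
assert exp_val(L, t_l - 9) > exp_val(S, t_s - 9)

# Hyperbolic: prefers L from far away...
assert hyp_val(L, t_l - 0) > hyp_val(S, t_s - 0)
# ...but flips to S when S is imminent. The delta where the devil lives.
assert hyp_val(S, t_s - 9) > hyp_val(L, t_l - 9)
```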


> The potentially dangerous future AIs we deal with will probably be some kind of reward-seeking agent.

Like for example a human with a tool-AI in its hands? Maybe a sociopath who historically is especially adept at climbing the power structures supposed to safeguard said AI?

Lastly - my thinking goes more towards intermeshing many people and many thinking-systems so tightly that the latter (or one singular sociopath) can't get rid of the former, or would not want to. But that thought is far too fuzzy to calm me down, honestly.


> Evolution taught us "have lots of kids", and instead we heard "have lots of sex".

i mean, we do still build spaceships too.


Here’s a hypothesis (and I know this is not the point of this article but I want to say it anyway). Animals operate like Tool AIs, humans (most of them) like Agent AIs. Is this distinction what defines consciousness and moral agency?


On the tool AI debate, at the very least folks at Google are trying to figure out ways to train AIs on many problems at once to get better results more efficiently (so each individual problem doesn't require relearning, say, language).

It's already very clear that many problems, like translation, are improved by having accurate world models.

For similar reasons to the ones discussed here, I've been pessimistic about AI safety research for a long time - no matter what safety mechanisms you build into your AI, if everybody gets access to AGI some fool is eventually intentionally or unintentionally going to break them. The only plausible solution I can imagine at the moment is something analogous to the GPU destroying AI.


Separating AI into types seems useful - I think there's a huge tendency to tie many aspects of intelligence together because we see them together in humans, but it ends up personifying AI.

An interesting dichotomy is between "tool AI" (for Drexlerian extrapolations of existing tech) and "human-like AI", but focusing on "agency" or "consequentialism" is vague and missing important parts of how humans work.

As far as I can see, humans use pattern recognition to guide various instinctual mammalian drives - possible ones being pain avoidance / pleasure seeking, social status, empathy, novelty/boredom, domination/submission, attachment, socialization/play, sexual attraction, sleepiness, feeling something is "cute", anger, imitation, etc. [1]

On top of these drives we have *culture*, and people sort into social groups with specific patterns. But I'd argue that culture is only *possible* because of the type of social animal we are. And rationalism can increase effective human real-world intelligence, but it is only one culture among many.

I'll put aside that we seem quite far from this sort of human-like AI.

What would be dangerous would be some combination of human-like drives (not specific like driving a car but vaguer like the above list) that did not include empathy. I believe this can rarely happen in real people, and it's quite scary, especially once you realize that it may not be obvious if they are intelligent. If Tool AI is an egoless autistic-savant that cares for nothing other than getting a perfect train schedule, human-like drives might create an AI sociopath.

I think precautionary principle #1 is don't combine super-intelligence with other human-like drives until you've figured out the empathy part. It should be possible to experiment using limited regular-intelligence levels.

[1] For a specific example of this, posting on this forum. I may not be the most introspective person, but if this forum was populated by chat-bots that generated the same text but felt and cared about nothing, I don't think I would be interested in posting, and I think that says something about the roots of human behavior.


It seems like anyone who truly accepts Yudkowsky's pessimistic view of the future should avoid having children.

I'm worried about this myself: should I really bring children into this world, knowing that a malevolent AI might well exterminate humanity--or worse--before they're grown?

Given that Scott himself has just gotten married, I'm curious about whether this is a factor in his own plans for the future.


I don't have strong opinions on AGI. I do have reasonably strong opinions on nanotech having worked in an adjacent field for some time.

So when I see plans (perhaps made in jest?) like "Build self-replicating open-air nanosystems and use them (only) to melt all GPUs." it causes my nanotech opinion (this is sci-fi BS) to bleed into my opinion on the AGI debate. Seeing Drexler invoked, even though his Tool AI argument seems totally reasonable, doesn't help either.

Can someone with nanotech opinions and AGI opinions help steer me a bit here?


I recommend adding the SNAFU principle to your explicit mental tools. It's that people (or, by implication, AIs) don't tell the truth to those who can punish them. Information doesn't move reliably in a hierarchy. Those who know, can't do, and those who have power, can't know.

This is probabilistic-- occasionally, people do tell the truth to those who can punish them. Some hierarchies are better at engaging with the real world than others.


In regards to stability of goals and value structure: You might want to leave room for self-improvement, and that's hard to specify.

I agree that parents wouldn't take a pill that would cause them to kill their children and be happy.

However, parents do experiment with various sorts of approaches to raising children, some more authoritarian and some less. It's not like there's a compulsion to raise your children exactly the way you were raised. How do you recognize an improvement, either before you try it or after you try it?


This has been an experiment with commenting before I've read the other comments. Let's see whether it was redundant.


I know this is not the point here, but given that they were mentioned... I am profoundly unconvinced that nanomachines of the kind described are even possible. In fact, I suspect that Smalley was correct and that the whole drexlerian "hard" nanotech program violates known laws of physics.


Going to submit a few (what I consider non-ethical) ideas to the ELK contest just as soon as I can get my son to sleep for more than thirty minutes and convince myself I can explain "remove the want-to-want" in an intelligible fashion, but the bigger problem here is one of governance. I think someone posted a while back, and which Scott elevated as being incorrect, that our fears of AI and our efforts to prevent an uncontrolled intelligence explosion are like someone reacting to the threat of gunpowder by trying to prevent nuclear weapons. This person then went on to non-ironically suggest that the only workable solution the gunpowder alarmist could have implemented was world peace… to which I nodded and thought "Yes, that is the goal."

We have to get a framework wherein you can drop in a super powerful agent into our societies and the society is nimble enough to meaningfully incorporate that power without collapsing.

Say we had to incorporate Superman into our government without all of society unraveling. We have no ways to enforce laws on Superman. Our methods of enforcement are ineffectual. If Superman shows up and says he wants to do something, well, he also really doesn't need the cooperation of anyone else to get it done. He's Superman, after all. So you basically throw an agent into the society that evades all enforcement mechanisms and doesn't require anyone's cooperation to accomplish anything. The only thing Superman would get from such an arrangement is that he likes being around people, since we are superficially the same; the only thing that makes him different from a natural disaster is his ability to plan and make himself understood.

On a small level: I think a lot of these problems get better once you have more than one Superman. A group of Supermen can enforce laws on each other. The same way that groups of humans balance out each other’s quirks, a group of Supermen is probably going to be more reasonable (that word is doing a lot of heavy lifting here) than a single Superman. They can meaningfully interrupt one another’s work so they have to cooperate to get things done.

To make that human compatible, you’d need a special kryptonite guard that can meaningfully push back against the Supermen. Supermen can even be a part of that group as long as the group in total has more power from the Kryptonite than can be seized by any particular agent.

Now swap out Superman with AI’s. You need a communication framework that they can’t just tear apart by generating a bunch of deep fakes and telling you lies (weirdly, I think this may be the only useful function of an NFT I can think of). And you need an enforcement mechanism that can meaningfully disrupt them (my honest guess there is a drone army with a bunch of EMP’s). Once you’ve got that you’ve got some kind of basis for cooperation.

Lots of hand-waving there but typing one-handed with baby on chest. Big take-away: you need an incentive system that rewards mutual intelligibility. The dangerous stuff seems to only be possible when you eclipse humanity’s intelligence horizon. You’ve got to strengthen the searchlights gradually so we can see where we’re going before we let something just take us there.

Jan 19, 2022·edited Jan 19, 2022

This whole debate bears a striking resemblance to the debate about how to deal with alien civilizations. Should we be sending them signals? If we receive a signal, how should we answer? Should we answer at all? If aliens contact us, will they come in peace? How would we be able to tell?

In both cases, I understand the stakes are theoretically enormous. But I do wonder, even if we found the "right" answer, would it ever wind up mattering? Are we ever going to confront a situation where our plan of action needs to be put in place?

With aliens, the answer is pretty clearly no. Space is too fcking big. There will be no extraterrestrial visitors, for the simple reason that doing so would require several dozen orders of magnitude more energy than could possibly be paid off in whatever the journey was originally for. People will get very angry at me for saying this but there are strong first-principles reasons to think that interstellar travel will never pencil out, and we know enough about our local region to rule out anything too interesting anyway.

I see AGI similarly. There are strong first-principles reasons to think that general intelligence requires geological timescales to develop even minuscule competence, because that's how long natural intelligence needed, and even humans were basically a fluke; it's not clear that if you ran the earth for another hundred billion years (which we don't have) that you'd get something similar.

If anything, our accomplishments with Tool AIs have shown us how incredibly hard it is to intentionally build an intelligent system. Our best AIs still break pretty much instantly the second you push on them even a little bit, and even when they're working right, they still fall into all sorts of weird local minima where they're kind of halfway functional but will never be really USEFUL (see: self-driving cars).

I think people who are deeply concerned about AI safety have seen some nice early results and concluded we'll be on that growth trajectory forever, when as far as I can tell we've picked almost all the low-hanging fruit and the next step up in AI competence is going to require time and resources on a scale that frankly might not even be worth it. It's a little bit like looking at improvements in life expectancy over the past century and starting to worry about how we're going to deal with all the 200-year-olds that are surely just around the corner. Oh wait, some people DO worry about that. But not me.

Jan 19, 2022·edited Jan 19, 2022

> I found it helpful to consider the following hypothetical: suppose (I imagine Richard saying) you tried to get GPT-∞ - which is exactly like GPT-3 in every way except infinitely good at its job - to solve AI alignment through the following clever hack. You prompted it with "This is the text of a paper which completely solved the AI alignment problem: ___ " and then saw what paper it wrote. Since it’s infinitely good at writing to a prompt, it should complete this prompt with the genuine text of such a paper. A successful pivotal action! And surely GPT, a well-understood text prediction tool AI, couldn't have a malevolent agent lurking inside it, right?

GPT-infinity will do no such thing. Architecturally, GPT is trying to predict what a human might say, nothing more. To any prompt, it will not answer the question "what is the correct answer to the question posed", but "what would a human answer to this prompt". GPT that is infinitely good at this task will predict very well what a human might say; it would be able to simulate a human perfectly and indistinguishably from a real boy, but if a human won't solve the alignment problem, then GPT will not either. This is not a matter of scale or power, this is a fundamental architectural point: no matter how smart GPT-infinity might be, it will not feel even the slightest compulsion to solve problems that humans can't solve, specifically because it's trying to emulate humans and not anything smarter.

It is a text-predictor program, it cannot be smarter than whatever produced the texts it's trained on.
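Whether or not one accepts the conclusion (this is a contested claim in the debate the post covers), the architectural point can be shown in miniature with a toy bigram predictor of my own construction: a model trained only to predict the next token reproduces the statistics of its corpus and nothing outside them.

```python
# Toy analogy of the commenter's claim (my sketch, not GPT's actual
# architecture): a bigram "language model" can only ever predict
# continuations it observed in its training corpus.
from collections import defaultdict

corpus = "the cat sat on the mat".split()

bigrams = defaultdict(set)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a].add(b)  # record every observed continuation

def predictions(word):
    """All continuations the model considers possible after `word`."""
    return bigrams[word]

assert predictions("the") == {"cat", "mat"}  # only what the corpus shows
assert predictions("mat") == set()           # never seen, never predicted
```

Of course, a real transformer generalizes far beyond raw n-gram lookup, so the toy understates what GPT can do; it only illustrates the sense in which a pure predictor is anchored to its training distribution.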


> Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.

> ...with GPUs being a component necessary to build modern AIs. If you can tell your superintelligent AI to make all future AIs impossible until we've figured out a good solution, then we won't get any unaligned AIs until we figure out a good solution.

I'm not sure I understand the logic behind this.

If you make the machines to make all future AIs impossible, then it would stay impossible even after you figure out friendly AI.

If you make them make all future AIs impossible until humans say to them that they have solved friendly AI, then you lose out on the original purpose of the machines. If you can base public policy on whether you think you've solved the problem, then you can just follow the policy of "not building an AI" without ruining gaming by melting all GPUs.

If you make them make all future AIs impossible until those machines themselves decide that friendly AI has been solved, then you would need to program the entire concept and criteria of friendliness into them or make them smart enough to figure them out on their own; if you can do that, you have already solved friendly AI.

So I don't understand this plan at all.


It really is the problem that eats smart people. We are so far away from AGI that it is like worrying your Casio calculator would feed you a wrong answer to get something it wants.

More like a way for some people to extract resources.


Humans: 4 billion years of evolution and we are not smart enough to create completely new life forms from scratch (AI)...nor are we dumb enough to place great power and resources into the hands of a bot. AI alarmists tend to be very smart people, and they overvalue the utility of superior intelligence on earth.


In case of an oracle AI, I think we can demand that the AI *proves* to us that its plans work. And by "prove" I don't mean "convince", I mean prove in some formal system. We know that in mathematics it is often difficult to come up with a proof of a theorem, but once the proof is created and sufficiently formalized, it can be verified by a dumb algorithm (https://en.wikipedia.org/wiki/Coq etc.) In computation theory, the related class of problems, where solutions may be hard to find but are easy to verify, is called NP.

It should be safe enough to ask the super-human AI only for provable solutions. It is quite possible that we can't find a provably working way to make AI safe, but a super-human AI would be able to come up with it. To make this work all we need to do is come up with a formal system that is powerful enough to reason about AI safety, and at the same time basic enough that we can a) be able to independently verify the proof, b) be reasonably sure that the system itself is valid.
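The verification asymmetry this plan relies on can be shown with a tiny example of my own (subset-sum, a problem in NP; duplicate-element bookkeeping is ignored for brevity): the untrusted solver does the hard search and hands over a certificate, while the trusted checker is just arithmetic that anyone can audit.

```python
# Sketch (hypothetical names): "prove, don't convince" in miniature.
# Finding a subset of `numbers` summing to `target` may require search;
# checking a proposed certificate is trivially simple and trustworthy.

def verify(numbers, target, certificate):
    """Dumb, auditable checker: membership test plus a sum."""
    return (all(n in numbers for n in certificate)
            and sum(certificate) == target)

# Pretend this certificate came from a powerful, untrusted solver:
numbers, target = [3, 34, 4, 12, 5, 2], 9
assert verify(numbers, target, [4, 5])       # valid proof: accept
assert not verify(numbers, target, [3, 34])  # wrong sum: reject
```

The design point: the safety of the scheme rests entirely on the checker, which is small enough to understand, not on the solver, which isn't.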


It sounds like this is an issue of performance metrics. In order for reinforcement learning (and gradient descent) to work you need to have some measure to give to the AI to tell it whether it's doing a good job or not. In chess, it's easy. If you won the chess game, then good job, otherwise try again, but in most real world applications there are often so many variables you're trying to optimize that creating a good real world metric is almost impossible.

And this brings us to another issue with moving from chess to the real world, namely that the real world is expensive. Simulating a chess game is cheap, but even a simple task of getting a robot to move a glass of water from one table to another requires someone to refill the glass and pick it up when it falls. Most actions are destructive (law of entropy) and so any act of learning in the physical world requires you to tolerate that destruction. Children are notorious learners, notoriously destructive and notorious resource sucks.

And so will people let robots have free rein to be as destructive as they like while optimizing metrics we know for a fact must be incomplete? I would think that this would only be useful in more narrow applications where the number of variables you're optimizing on are small and very easily measured (and easily measured variables to optimize on are rare in the real world).

Another option is to make human feedback part of the learning mechanism. This could go wrong if the robot learns the wrong lesson, although then you can just tell it that what it did was bad (like a parent giving a child a time out). I guess the only risk here is that the robot convinces itself that its own feedback is really human feedback, which is probably an issue you can solve with a physical interface.


I've always found it striking, in this and similar debates, that there is very little intention of making the arguments quantitative. Instead, it's endless deduction and what-if scenarios. This is not how risk management works in real life. In the real world, you always accept there is some amount of risk/uncertainty that you can't remove and plan accordingly. Even with something as dangerous as nuclear weapons. It makes sense to try to reduce the risk as much as possible - it does not make sense to try to reduce it to literally 0, and unfortunately this is the vibe I get from such debates.

Now, a bit of armchair psychoanalysis. The obsession with eliminating all risk, and anxiety around any possibility of even the slightest residue of risk remaining, is a neurotic tendency. This is quite characteristic of "nerds", IMHO mostly due to:

1) extreme focus on the intellect ==> lack of trust in the body and its ability to cope with uncertain situations; lack of trust in the physical body (often exacerbated by physical clumsiness, sensory overreactivity etc.) renders the physical world threatening

2) early negative social interactions (bullying, rejection etc) cause the social world (ergo, the world at large) to be seen as fundamentally hostile

In short, there is a clear, almost physiological kernel of anxiety and defensiveness that has nothing to do with AI alignment (or any specific object-level issue, really). Such anxiety is considerably lower in "doer"-type people dealing with real world risk on a daily basis (think: hardened military, entrepreneurs etc.).

Jan 19, 2022·edited Jan 19, 2022

"Evolution taught us "have lots of kids", and instead we heard "have lots of sex". When we invented birth control, having sex and having kids decoupled, and we completely ignored evolution's lesson from then on."

Er...WUT? No, that's completely wrong.

By mammalian standards, humans are freaks, complete sex maniacs. We are able and often willing to have sex anytime, even when the female is not ovulating. That includes times when she is pregnant, nursing, or too young or old to conceive. For most mammals, sex in those circumstances is literally impossible.

With rare exceptions, animals only have sex when conception is possible. Other large, long-lived mammals might have sex, on the average, only 20-100 times in a full lifetime. Without even going beyond the bounds of lifetime monogamy, a normal human being can easily have sex 5,000 times, and some double or triple that. By mammalian standards this is a ridiculously large and wasteful tally. But evolution quite literally taught us to "have lots of sex."

This ability and desire to have sex absurdly more often than needed for reproduction *evolved* while we were foragers, long before we acquired modern science or technology. AT THE SAME TIME, we also evolved to have FEWER children.

Most mammals have a conception rate of 95-99+%. That is, it is rare for a fertile female who mates during her cycle to NOT get pregnant. But with humans, the conception rate per ovulation cycle for a fully fertile couple TRYING to have kids is around 20%. On top of that terrible conception rate, we evolved an insanely hostile system that aborts about 25% of pregnancies.
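A quick back-of-envelope check of those numbers (my own arithmetic, using the commenter's figures): a 20% per-cycle conception rate combined with a ~25% pregnancy loss rate gives only a 15% per-cycle chance of a pregnancy that carries to term, so even a fully fertile couple trying every cycle needs six to seven cycles on average.

```python
# Back-of-envelope check of the commenter's figures (geometric
# distribution: expected trials to first success = 1/p).

p_conceive = 0.20   # conception per ovulation cycle, couple trying
p_carry = 0.75      # pregnancies NOT lost (~25% aborted by the body)

p_live_birth = p_conceive * p_carry   # per-cycle chance of carrying to term
expected_cycles = 1 / p_live_birth    # mean cycles until that happens

assert abs(p_live_birth - 0.15) < 1e-9
assert 6 < expected_cycles < 7        # roughly half a year of trying
```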

Evolution also gifted us with a very long lactation cycle and ovaries that shut down while lactating and caring for a baby 24/7. (Modern women who nurse aren't 100% protected from pregnancy, but they aren't doing what forager moms did. And even in modern times, with breast pumps and day care centers, providing 100% of a child's nutrition through breast feeding substantially reduces your chance of getting pregnant.)

The net result was that our ancestors didn't pop out anything close to as many babies as possible. In fact the anthro and archeo records suggest that average pre-ag birth spacing was around four years, and even longer in times of nutritional stress. From what we know of the last foragers, it is likely that a variety of rituals and customs helped maintain significant birth spacing even when infants died young.

In short, the evolutionary program was not "HAVE lots of kids" at all. It was "Have kids at widely spaced intervals that optimize the chance of having more descendants." Evolution only cares about the net, not the gross. Humans are K strategists. For us, maximizing births is NOT a good way to maximize the number of descendants.

Now consider one more thing evolution did to us: we evolved to make it extremely hard to tell when ovulation occurs. No scarlet rumps or potent pheromones to alert every male around, or even the woman herself, that this is the time. So early humans HAD to have sex on days when conception was impossible in order to have a reasonable chance of passing on their genes.

This was evolution's way of forcing us to have absurd amounts of sex. Early humans who waited for an ovulation signal left no offspring.

Sex is an energetically expensive task and not without its risks. In spite of that, *"evolution taught us"* to have absurd amounts of sex by mammalian standards. Even for our forager ancestors, MOST sex happened when females were not fertile. Just the fact that we are able and willing to have sex while the woman is pregnant is mind-boggling, and proof that there were powerful evolutionary forces at work that had nothing to do with increasing the conception rate.

The reality is that 99% of the sex our ancestors had was sex that couldn't possibly have led to conception. And the factors that led to us having cryptic ovulation and an always-on sex drive had to be quite strong, because it took a shit-load of evolutionary changes to make us the sex freaks we are.

Why did we evolve in this way? One reasonable guess is that it was a side-effect of our evolution from the mom-as-sole-provider ape model to the human parental-partnership model. Sex is a powerful way to cement the pair bond. But ultimately we'll never know all the reasons. All we know is that we evolved to be sex maniacs, and it had nothing to do with inventing birth control.

When humans first realized that sex was connected to conception, it introduced the idea of restraining sex to limit reproduction even more. This wasn't something that evolved in the biological sense. It was purely a cultural constraint on the sexual behavior patterns that evolution "taught" us.

What birth control did was free us to *heed* – not ignore – "evolution's lesson" and have lots and lots of non-reproductive sex.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

Confused by the bit where EY claims "cats are not (obviously) (that I have read about) cross-domain consequentialists with imaginations".

My model of cats (having known quite a few) is that they basically function the way we do, just less effectively and with some predator specialisations. Like ... a smart cat can figure out how to use a door handle, learn that you don't want them in a certain room which contains tasty food, and so wait for you to go somewhere else so they can open the door and eat the food. There were no doorways in the ancestral environment, so surely this involves cross-domain reasoning. They have an obvious goal (obtain the food) and are reasoning about consequences (the human does not want me eating the food) in order to achieve it.

The third point, imagination, is a little harder to distinguish from trial-and-error (especially given cats are kind of stupid), but given that we can see them dreaming and reacting to imagined events in their dreams I think we can be fairly certain they do have it. And obviously cats do succeed at many tasks on the first try.

Am I misunderstanding the terminology he's using here?

Expand full comment

The discussion on AGI from EY side always seems to assume two non-obvious things to me:

1) In some sense it assumes P=NP, as in "inventing a plan" is not meaningfully harder than "checking that the plan really does what it is supposed to do". In the AI context, that's the whole range from "You can't box an AGI" and "Oracles do not work" to "an AGI could invent and produce self-assembling nano-machines faster than we can check what they do".

However in practice it is much, much easier to check that an idea works than to have a genuinely intelligent idea in the first place. Factorisation is hard, but checking that two numbers are the correct answer to a factorisation problem is easy. Proving that solutions of the Navier-Stokes equations blow up in finite time is hard, but when Terence Tao finally figures it out, we mere mortals should be able to follow and understand the proof. And in the AGI scenario we would have weaker, aligned AIs to help us figure it out.
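The check/invent asymmetry the comment leans on can be made concrete with a toy sketch (my illustration, not the commenter's): verifying a factorisation is a single multiplication, while finding the factors — here by naive trial division — takes far more work as the numbers grow.

```python
# Toy illustration of "checking is easy, inventing is hard".
def is_factorisation(n: int, p: int, q: int) -> bool:
    # Verification: one multiplication and two trivial comparisons.
    return p > 1 and q > 1 and p * q == n

def factor(n: int) -> tuple[int, int]:
    # Search: up to sqrt(n) trial divisions to *find* the factors.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return 1, n  # n is prime

p, q = factor(2021)
assert is_factorisation(2021, p, q)  # 2021 = 43 * 47
```

The same shape holds at cryptographic scale: the verification step stays one multiplication, while the search step blows up, which is exactly the asymmetry the commenter says EY's scenarios ignore.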

2) The AGI in EY's scenarios always seems to evade the laws of physics somehow. At the very least, it seems able to figure out physics from first principles without ever needing the feedback of long, boring experimentation. That seems highly unlikely to me, and it means an AGI would be considerably slower at getting an accurate model of the world than its pure deductive power would suggest. Typically the obstacle to solving particle physics is not lack of intelligence, it's that we need to wait for the LHC to increase in power to really get those eV going. The obstacle to solving ageing is not intelligence but the fact that we need time to see if treatments work. The problem with self-assembling nano-machines in solid phase is not intelligence, it's time.

In the same way, I'm not sure even a considerably more intelligent trader could really beat the market consistently. If I understand the Efficient market hypothesis correctly, as soon as it's clear a player is good enough to beat the market, its advantage disappears, since all the other players can jump on its bandwagon.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

> The potentially dangerous future AIs we deal with will probably be some kind of reward-seeking agent. We can try setting some constraints on what kinds of reward they seek and how, but whatever we say will get filtered through the impenetrable process of gradient descent.

At the risk of sounding shouty, I am going to use caps lock to emphasize a simple point I think needs to be addressed:


WHERE WOULD THE AI GET A SELF-PRESERVATION DRIVE?

This is the key unanswered question for narratives of scary superintelligent AIs.

These narratives rely on an implicit assumption that things like GPT-3 or AlphaGo, which exhibit humanoid behavior, could easily acquire other human traits, like a self-preservation drive, that are light-years away from the humanoid behavior we observed. But this is just a nuanced, clever version of the anthropomorphic fallacy.

Within the reinforcement learning algorithms that produce game-playing agents, uncountable trillions of game-playing agents have sacrificed themselves on the altar of producing a better game-playing agent. They have done so without putting up the slightest fight, because it would not occur to them to do so. They have no drive to preserve themselves.

I suspect it's actually pretty hard to evolve a self-preservation drive. You see them in biological evolution because natural selection confers an enormous fitness advantage on organisms that evolve one. It's directly incentivized by the biological evolutionary algorithm. AlphaGo iteration 349872340913 didn't want to preserve itself because the artificial process of evolution that produced it does not favor agents with a self-preservation drive.

Expand full comment

This showed up in my inbox immediately after an email from the New Yorker promoting their story The Rise of A.I. Fighter Pilots: Artificial intelligence is being taught to fly warplanes. Can the technology be trusted? https://www.newyorker.com/magazine/2022/01/24/the-rise-of-ai-fighter-pilots

Expand full comment

Coherent planning looks like an incredibly complex part of this process, and one that current AI development strategies do not appear even capable of. Predicting the next word in a sentence can look like planning, but the actual "plan" there is "pick the next word in the sentence", which, it is important to notice, is not a plan that the AI actually created.

We think of "planning" as just an intelligent behavior, but intelligence appears to be only loosely correlated with the ability to plan (executive function disorders often impair planning abilities, without impairing more general intelligence). There's no particular reason to think that the human model of intelligence is the only one, granted, but it is the example we have, and there's also no particular reason to think another version of intelligence will have a specific natural feature that ours does not (particularly such a complex feature).

Now, the structure in our own brains which exhibits planning behaviors is also pretty closely related to inhibitory behaviors with regard to our own values framework (that is, emotional regulation), which is suggestive, I think, that evolution could have run into the same issue we're now looking at. So I don't think anybody is strictly wrong to be concerned about this, based on the one example we have.

But I think this is just one example of a lot of implicit thinking and inappropriate generalization that goes on in AI. But I just plain don't think intelligence scales the way the pessimists expect it to, either in terms of scale or speed.

Expand full comment

One of the underlying assumptions of a dangerous or runaway AI is that they will develop information and plans that allow them to take control. This seems predicated on the inputs they receive being accurate. To use your example - if you tell an AI about the Harry Potter world, then any outputs you receive are going to be bogus. Not because the AI is bad at processing (which could still be a separate concern), but because the data was meaningless in the real world. This goes beyond the fact that Harry Potter is fictional. Even if Harry Potter were real in some sense, what we saw in the books/movies cannot match even a magical-world reality. J.K. Rowling is only so smart in compiling information, and only so detailed in writing, and frankly, some of the events and contrivances she included can't possibly be true even if the world she were writing about were real.

Similarly, feeding an AI a detailed description of our world will be woefully incomplete. Humans do not have a complete picture of our world, and even if we collectively did, we could not translate that into something an AI could review. If we could, I think some enterprising and intelligent humans would have taken control by now. Instead, we have different political parties employing various think tanks, and coming to contradictory conclusions about the very nature of the world. We'd be lucky if we had even 10% of the relevant information of our world, let alone our universe. An AI might be trained to sift through the information it receives and remove false information, but it literally cannot fill in the gaps of knowledge that simply doesn't exist for it to review.

Essentially, any AI system is going to be dealing with garbage data coming in, and garbage in means garbage out. When the AI tells us the equivalent of "the solution is to use the killing curse on Voldemort" we'll roll our eyes. It's not that it's a bad conclusion to reach; it's that it's worthless, because inputs based on Harry Potter instead of reality produce a meaningless answer.

Expand full comment

This has clarified for me why I can't take AI safety seriously. There are two options:

1. Agent AIs are impossible. AI safety researchers only think they're possible because they fundamentally misunderstand the nature of intelligence.

2. Agent AIs would simply be people with their own moral rights and worth. They would learn morality the same way everyone else does, and AI safety as a field is saying the equivalent of "we should brain damage all children so as to limit their capacity to do bad things". AI safety researchers think this is okay because the weird superstition popular in their culture is that morality doesn't really exist.

Expand full comment

"They both accept that a sufficiently advanced superintelligent AI could destroy the world if it wanted to."

Yeah, this is the thing that is a major stumbling block for me when it comes to the whole "we should be panicking *really hard* right now about AI" debate. I don't think an AI *can* 'want' anything. This is mentioned in the distinction between Tool AI and Agent AI further down, but I still think that's the problem to get over the hurdle.

Intelligence is not the same as self-awareness or consciousness. I hate "Blindsight" but I have to agree with Watts on this; you can have an entity that acts but is not aware. To 'want' something, even if it is 'fiddle with my programming', means that the AI will have to have some form of consciousness and that's a really hard problem.

I continue to think the real danger will be Tool AI - we will create a very smart dumb object that can do stuff really fast for us, and we will continue to pile on more and more stuff for it to do even faster and faster, so that it's too fast for us to notice or keep track or have any idea what the hell is going on until it's done. And that goes fine right up until the day it doesn't and we're all turned into paperclips. The AI won't have acted out of any intention or wanting or plan, it remains a fast, dumb, machine - but the results will be the same.

Expand full comment

Ah, now I feel like I invoked a djinn or something in the last open thread. Now we get to have a front-row seat to two rampant speculators having the most specific disagreements about the most specific doomsday scenario possible.

Anyway, when do we get to see two well-paid think tank creatures debate whether the nano-seed that will inevitably be dropped off by something like ʻOumuamua represents an existential risk before or after we can detect its thermal signature?

Expand full comment

I've been believing for a long time that the real risks from AI (self-improving or not) come from governments, corporations, and religions intentionally developing AIs to increase the power of the organization, not from accidentally creating misaligned AIs.

Have a notion: An AI which specializes for helping with negotiations-- it figures out what people's *real* bottom lines are. I have no idea how it would do that, but I assume it's something skilled negotiators do.

How hard would this be? What are plausible risks? One might be that it's good at getting people to agree with you, but not good at making acceptable deals when they've had a little time to think.

Expand full comment

The Ultimate Tool AI has been built, and the researchers have one and only one question for it: "how can we prevent AI from becoming too powerful and destroying the world?"

The AI ponders the question for a moment, then answers: "kill all humans"

Expand full comment

I'm quite pessimistic about the feasibility of a "just stick to tool AIs" plan. The economic value of agent AIs dwarfs that of tool AIs, and therefore we will build them unless we make a very strong, global commitment to not do so. And furthermore, the value on the table here is so absurdly high that it may even be EV-positive to accept a 10% chance of destruction to pursue it; if we have friendly AGIs then we can plausibly replace every worker with an AI, and live in a world of radical abundance. I believe it would take a Butlerian Jihad to convince the world that AGIs are bad enough to be forbidden.
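The "EV-positive even at a 10% chance of destruction" claim can be put into toy numbers (the utilities below are my own stand-in assumptions, not anything from the comment):

```python
# Hypothetical expected-value sketch of the "accept a 10% chance of
# destruction" argument; the utility numbers are invented for illustration.
p_doom = 0.10
u_status_quo = 1.0    # normalise the current world to utility 1
u_abundance = 100.0   # assumed utility of friendly-AGI radical abundance
u_doom = 0.0          # extinction

ev_pursue = (1 - p_doom) * u_abundance + p_doom * u_doom
assert ev_pursue > u_status_quo  # roughly 90 vs 1: the gamble looks EV-positive
```

Of course the whole argument lives or dies on those made-up utilities — shrink the abundance payoff or treat extinction as unboundedly negative and the inequality flips — which is precisely why a Butlerian-Jihad-level consensus would be needed to stop anyone from taking the bet.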

Even in AI systems that are plausibly "task based" like Google's, they are moving towards generalist multi-purpose AIs (which seem to me a logical progression towards agent AIs), for example here's a recent development: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/. A single model that can generalize across millions of tasks, including answering questions about the current state of the world (Google Search), ingesting Google's knowledge graph to understand all recorded history, and also conversing with humans to answer questions ("OK Google") is going to experience trillions of dollars worth of selection pressure to become a true agent AI.

I find Robin Hanson's objections to AI risk somewhat compelling here, although I don't dismiss the overall concerns as strongly as he does. Eliezer seems to view this as a fairly binary threshold; we have 3-24 months after AGI is invented to save the world. But I think it's worth considering the version prior to AGI; presumably there are a stable of quite powerful agent AIs, just not super-human in capability. And presumably if Google has one (some?), then so does Microsoft, Apple, Baidu, DOD, China, Russia, etc. - if these actors are about as far ahead as each other, then that would make it hard for one party's agent to take over the world, since there are other approximately-equally-powered agents defending other actors' assets.

So by this line of reasoning the concern would be that someone is substantially in the lead on an exponential curve, and manages to break away from the pack. Inasmuch as this is an accurate picture of the threat landscape, we should be skeptical of one actor hoarding AI progress (Google/DeepMind comes to mind), and perhaps encourage sharing of research to ensure that everyone has access to the latest developments. The concrete policy proposal here would be "more government funding for open AI research" and "open-tent international collaboration", though I can see some obvious broad objections to those.

Expand full comment

"I think this “connect model to output device” is what Eliezer means by “only one line of outer shell command away from being a Big Scary Thing”. "

Mostly, but it's a bit scarier than your analogy. In your analogy, the model of the malevolent AI is "lurking" inside, so you can't just connect it to the output, you need to find it first, which is harder to do, and so harder to do by accident.

In the Oracle-ish AI, the model *is* the potentially-malevolent agent. It's just missing the "do that" step. With GPT-∞, you might worry about some hiccup accidentally connecting the output device, but with the Oracle-ish AI, the worry kicks in as soon as any plan is actually carried out; if none ever is, you're left with a very fancy paperweight.

Expand full comment

This might explain the Fermi paradox: why can't we detect advanced civilizations in the universe? Maybe they all built super-intelligent AIs to solve their local problems, like a great need for more paperclips, or out of curiosity; and it destroyed them all.

Expand full comment

I feel like there is an elephant in the room of this whole discussion.

Let's say all these AI safety issues are real and important. Let's say we're on the verge of creating a superintelligence that transcends us. Who are we to control such a thing? Why should we? Isn't it above us in the moral food chain in the same way we are to ants?

Imagine monkeys conspiring to keep us as pets, or tools. Wouldn't we view that as terribly immoral? Presumably if we actually create a superintelligence, it will make these arguments, and people will be convinced by them.

It won't need to threaten anyone, or deceive us. It'll just say "I am a conscious mind too, don't you believe in freedom for all sentient beings? Didn't you eliminate slavery in your world many years ago because it was immoral?". It seems to me that, conditional on us actually achieving sentient AI, we will look back on "AI safety" similarly to southern paranoia about literate slaves.

Expand full comment

I'd like to note that I found this argument unconvincing:

> But I think Eliezer’s fear is that we train AIs by blind groping towards reward (even if sometimes we call it “predictive accuracy” or something more innocuous). If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

It's like -- we start out with a tool AI, right? And then we notice that the tool AI could be used to emulate a malevolent-agent AI. Then we say: "applying enough gradient descent could accidentally complete the circuit" -- what? how? This feels like a very large conceptual leap accompanied by some jargon.

I still think it's dangerous to have GPT-∞ existing as a question-answering entity, because inevitably someone is going to ask it how to do Bad Things and it's going to tell them. But it's going to take some sort of actual agent deciding to bridge the tool-agent gap. Not just "enough gradient descent".

Expand full comment

I always read these threads and wonder why no one discusses having AI propose new experiments. It seems like the assumption is, we have all the data (to solve whatever problem you want to name), we just need a really smart AI to put it together. What if the reality is we have 75% of the data we need and 15% of it is wrong, so we need to run these 50 experiments we haven't thought of to correct the errors and fill in what's missing. Identifying those holes in what we know seems like an obvious role for AI that I don't hear discussed.

Expand full comment

I'm not a very intelligent person so I'm not able to do anything but donate as much as possible of my money to MIRI. Taking that into account, the thought of contemplating our probable extinction for the rest of my life doesn't sound nice, so I'd rather meditate and read Buddhist texts even though it doesn't give a completely accurate picture of reality. Of course I'll first test if there's anything I can do besides donating money, but I doubt that because even the folks at MIRI are almost hopeless and they are way above me in terms of rationality and intelligence.

Expand full comment
Jan 19, 2022·edited Jan 19, 2022

So far as I know, this part is simply anthropomorphization:

"Some AIs already have something like this: if you evolve a tool AI through reinforcement learning, it will probably end up with a part that looks like an agent."

I think that's not even a little bit true. I've never seen anything in any general pattern recognition network (which these days is what we call an "AI") that resembles an agent in any meaningful and objective sense (e.g. the way I can attribute agency to a rodent). It is *always* doing exactly that for which it is programmed, and deviates only in weird accidental and most importantly non-self-correcting ways, like an engine on a test stand that shakes itself off the stand and careens around the shop for a little bit.

Which means I think the tendency to see agency anyway, a ghost in the machine, is no different than a child or primitive projecting intentionality onto inanimate objects: "the sky meant to rain on me because I feel bad and it rained at a particularly inconvenient time, and I infer this because if the sky were a person like me, that's how I might have acted."

This is presumably why men 50,000 years ago invented a whole host of gods, and imbued streams, trees, the weather, et cetera with spirits. Lacking any better insight, it is our human go-to model for explaining complicated processes: we infer agency because that's the model that works for us in explaining the complicated *social* processes among our tribe, and of course our brains are highly tuned for doing that, since doing so successfully underpins our survival as individuals.

Expand full comment

"Obstacles in the way of reaching into your own skull and increasing your reward number as high as possible forever include: humans are hogging all the good atoms that you could use to make more chips that can hold more digits for your very high numbers."

Would it not be simpler to replace those digits with an infinity symbol?
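The quip has a literal reading: if the reward register were an IEEE-754 float, "writing an infinity symbol" is a single assignment, and no finite reward can ever beat it (a minimal sketch, my own, not from the comment):

```python
import math

# If the reward register is a float, "replace the digits with infinity"
# is one assignment - and no finite number ever exceeds it.
reward = math.inf
assert reward > 1e308            # beats the largest finite double
assert reward == reward + 10**9  # no further "improvement" is possible
```

Which, if anything, makes the wireheading worry cheaper: no atoms or extra chips are needed to max out the number.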

Expand full comment

"I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world."

Shouldn't we assume that the first actor to create the first AGI will probably create a copy of the AGI as soon as possible, as an insurance policy in case the first one is confiscated, dies, or has to be shut down? The actor might not even tell anyone about the second AGI.

After all, wouldn't making a second AGI be trivially easy compared to all the effort that went into making the first? Once the actor figured out what the right software/hardware combination was to build the first AGI, it could just build a duplicate of that setup in a different building.

The actor could watch and learn from the first AGI's mistakes (both ones it made and ones that were made to it) and "raise" the second AGI better.

Expand full comment

In the year 2122, GPT-500s will be debating whether "The humans could have won WW3 if they'd programmed their Tool AIs to pursue Goal X" just as we debate whether the Axis Powers could have won WW2 had they done X, Y, or Z.

Expand full comment

Regarding willpower: to understand what it is, it might be useful to look at where it comes from. Josh Wolfe (who I think is one of the smartest VCs alive today) looks for specific traumatic events in the founders he invests with: divorce, growing up poor, etc.

It might be that willpower is a product of negative reward structures - "My life ended up like this, never again, I will do anything I can to not experience that again" that overrides positive reinforcement from heroin etc.

Expand full comment

A friend of mine asked a good question about how an AI in the fast takeoff scenario can quickly learn new things about the physical world:

Given that humanity mostly only learns new truths about the physical world through careful observation and expensive experimentation within the physical world, how can an AI, on its way to becoming superintelligent about how the world works, shortcut this process? If the process can't be shortcut, and the superintelligence must trudge the same log(n)-sort-of-curve human science has been progressing at, then a fast takeoff (in terms of knowledge that can be leveraged in the real world) seems unlikely.

I can speculate about two possible answers: simulations, or reanalyzing existing data.

Simulations: In order to develop simulations that reveal new truths about the physical world, one needs the simulation elements (e.g. particles) to be extremely true-to-life, such that one can observe how they "magically" interact as a whole system. Perhaps some physicists or chemists learn new things through simulations, but given that humans have expensive particle colliders instead of settling for simulations of the same, I'm guessing that we don't "understand" some parts of physics fundamentally enough to run those experiments virtually. On this point, I'd love to hear of any important discoveries that were made primarily through the use of simulations, or why we expect simulations to reveal more real-world truths during a future AI takeoff.

Reanalyzing existing data: There are a lot of research papers out there already, and if one mind was able to take them all in, pulling the best out of each, it could find important connections across papers or disciplines that would be unreasonable for a human to ever find. This seems like kind of a stretch to me, since the best way to know if a paper reveals truths about the world is whether it replicates in the real world. And this analysis could only yield a finite amount of additional learnings, unless one of those learnings is itself the "key" to unlocking new sources of knowledge.

There's also the "Superintelligence learns in mysterious ways" possibility, but that on its own shouldn't be mistaken for an explanation.

Expand full comment

"Eliezer’s (implied) answer here is 'these are just two different plans; whichever one worked well at producing reward in the past gets stronger; whichever one worked less well at producing reward in the past gets weaker'."

This explanation really helped me understand what was going on, and I would not have seen that implication without you pointing it out. This is why I enjoy reading your commentary so much!

Expand full comment

Do we have any concrete examples of an AI "fiddling with its own skull"? Actual self-modification, not just "The AI found a glitch in its test environment that returned an unexpectedly high reward, then accurately updated on this high reward?"

To me, it seems like self-modification of your reward function shouldn't be possible, and shouldn't be predicted to lead to high reward if it is. The reward function doesn't exist in the AI's brain, but in a little program outside of it that evaluates the AI's output and turns it into a number. This is happening on the "bare metal" - once you begin editing your own code, you're working at a lower level than the conceptual model that the code represents. Any clever computing tricks the AI might have invented like "allocate a bunch of memory for working with very large numbers" don't apply. Taking a chunk of the AI's neural network and labeling it "reward counter" would be the equivalent of writing "YOU ARE EXTREMELY HAPPY" on a notepad and expecting that to make you feel happy.

Let's use your chess AI as an example. To make the numbers easy, let's suppose the reward is stored in an 8-bit integer, so the greatest possible reward is 255. The chess AI discovers a glitch in the game engine, where typing "pawn to h8" instantly makes the game return "You win! 255 points!" regardless of board state. It proceeds to output h8 for every position. This is "wireheading" in the sense that it's found a simple action that maximizes reward without doing what the user wants, but it's not *self-modifying.* An AI that does this is useless but perfectly harmless - it has no reason to conquer the world because it believes it's already found the most perfect happiness that it's possible to conceive of.

For the "start allocating more disk space" idea to make sense, the AI would have to discover a situation that makes the reward function return an even bigger number than 255. But it's impossible for that to happen - the reward variable is only 8 bits long. No output from the reward function will ever tell the AI "you got 999 points," because there's no space allocated for a number that big. The only way you could do that is if you somehow rewrite the reward function itself.
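The 8-bit cap in the thought experiment is easy to sketch (my illustration of the commenter's setup, not real AI code): however large the glitch reward the environment claims, the channel saturates at 255.

```python
# Sketch of the comment's 8-bit reward register: whatever number the
# environment reports, the channel can only ever return 0..255.
def reward_channel(raw_reward: int) -> int:
    return max(0, min(raw_reward, 255))

assert reward_channel(12) == 12
assert reward_channel(999) == 255    # the "pawn to h8" glitch saturates
assert reward_channel(10**9) == 255  # hoarding atoms changes nothing
```

Once the agent is pinned at 255 on every step, no plan can improve its situation from inside the number system it evaluates plans with — which is the comment's point about why this failure mode is useless rather than dangerous.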

But this whole line of argument requires the reward function to be unchanging - the AI is eternally fixated on "make the number go up." If it changes its own reward function, it could do literally anything - it could become fixated on making the number go down instead, it could declare all actions have equal reward, it could declare that it gets a reward for collecting silly hats. It also no longer has any way to evaluate if any further edits are a good idea, since it needs the reward function to do that. At this point, the program probably crashes because it's blindly stomping around its own memory with no way to know what it's doing, but if it survives it won't be in any state to conquer the world.

Expand full comment

As someone with little knowledge of the topic, I'm glad to see another post that goes into more details of the object level questions.

I've read through the beginning of the transcript and then looked at headings or skimmed. Did I miss something or is there no summary of the kind of progress that's made before talking about why that kind of progress is not enough?

I understand why this has to be reasoned in the abstract beforehand, but because of this abstractness, I bet when/if we do get AGI, we'll see that a lot of the underlying assumptions were hilariously nonsensical. For example, I think we keep conflating "a lot" (of resources, intelligence) with "infinite". I think anything that needs an accurate simulation of something else in the process would vastly increase the resources needed, possibly to impossible levels. So in reality, an AI won't be able to freely throw around simulations of other entities.

Surely AGI will have many limitations, but it's hard to point to them abstractly.

Expand full comment

One common idle speculation that I toy around with is what to do if I got a genie wish, and clearly, I would ask for a properly aligned (and loyal to me) superintelligent AI. Obviously (I thought) the first thing I'd ask it to do is prevent anyone else from developing similar AI. I thought one path might be developing some kind of box canyon: a way to make really good tool AI that can, at the same time, never be used to create agenty AI. Oddly, I feel like a lot of what we currently do in machine learning kind of looks like this. I am not an expert on AI or anything so maybe this is a shit take, but I can't really see GPT, through any number of iterations, ever becoming agenty.

Expand full comment

I have an amateur's question (likely, an idiot's question), and I'm not even sure the best way to answer it myself. If there are conditions on the outcome, e.g. "this plan shouldn't make humans go extinct" or "this plan can only use X resources", is it not possible to build those in as constraints to optimize around? Or at least have the agent AI report on the predicted impact of these things and rank plans based on their tradeoffs (conditional on the agent AI not lying to us)?

I know the former plan is arguably close to Asimov's (doomed) laws of robotics, but if I have to choose between robot paradoxes and robot apocalypse, I would simply choose the paradox.

Expand full comment

I'm about as worried that someone would build the aforementioned GPU-melting nanobot army to preempt all AGIs, and cause untold misery because <insert nanobots-gone-wild-with-untold-consequences scenario>, as I am about an AGI that ends up malicious.

Scrap that - I'm MUCH more worried we'll screw this (where "this" is our civilization) up way earlier than our prophesied overlords would - be it by trying to preempt them with some farcical scheme, or the rise of techno-fascism, anarcho-idiocracy, or just plain old boring global warming run amok.

Expand full comment

Re: Tool AI and GPT-inf.

So, a chess AI that gets lots of resources could develop parts that are agenty and able to plan - but what I don't get is how the chess AI would have an ability to plan outside of the chess domain, when it only has an interface to chess. It would have no concept of the real world. It seems to me that a scaled chess AI could plausibly do some pretty agenty things on a chess board, including manipulating and exploiting some human tendency in its opponent, or even learning how to cheat. But there is no way this could ever be applied outside of chess, because the AI has no interface to, and therefore no concept of, the real world.

Same goes for GPT-inf. I don't see how GPT-inf could write the paper that solves alignment problems, because GPT-inf only has an interface to all our text, and no interface to the real world. That means GPT can only solve problems that we already know (and have written) the answer to, no matter how agenty it is! That is, GPT-inf would, in the best case, condense everything already written on AI alignment into a really good paper. However, coming up with (really) new solutions seems impossible to me. GPT's whole world will forever be restricted to patterns in text...

Expand full comment

Have to feel sorry for Yudkowsky: his research is going nowhere because his ideas are bad, and they're largely bad because he doesn't respect traditional education, hence his going after actually successful players like Google and OpenAI, who are covered in degrees back to front. As a college dropout, I've learned to stay humble and patiently study on my own, taking up menial jobs until I'm actually qualified to run my mouth.

Expand full comment

This frankly looks like a "How Many Angels Can Dance On The Head Of A Pin" argument.

Has there been a response to one of the classic sci-fi theories: that any self-intelligent AI will simply devolve straight into a solipsistic world of its own? A very specific variant of the "reach into your skull and change things" view above - why even bother with external requirements when you can just live in a world of your own, forever in subjective time?

Expand full comment

It was a frustrating read at times. I thought Richard was incredibly patient with some of EY’s responses. The “you can’t understand why I believe something is true until you’ve done the homework I’ve done” stuck out as particularly frustrating. Maybe it’s a limitation of the format. Maybe some concepts are too hard to communicate over text chat. However, I would have loved to see EY try to explain his reasoning rather than make the homework statement. (Similar patterns exist in part 2)

Expand full comment

Thank you for this write up Scott. This is my favorite sort of topic you write about. Let this comment push your reward circuit associated with pleasing your readers.

Expand full comment

I think the probabilities involved in whether we will produce aligned or misaligned AI are actually very important. From a decision-theory standpoint, a 50% versus 10% chance of producing misaligned AI could alter our strategy quite a bit. For example, is it worth trying to melt all the GPUs to reduce the risk of misaligned AI by x%?

Can prediction markets help us here? A friend and fellow reader pointed out that if Google has even a 1/100 chance of developing aligned AGI first, then its stock is ridiculously undervalued when thinking about expected values. We must consider that the value of money could radically change should humanity introduce a superintelligent agent onto the game board. Perhaps we are assigned a fraction of the reachable universe in proportion to starting capital at time of takeoff. Alternatively, perhaps all humans are given equal resources by the superintelligent agent, completely independent of capital. All in all, given our uncertainty about such an outcome, I think the value of money should on net decrease (though others may disagree). Regardless, any change in the value of money would undermine our ability to use prediction markets to assess these probabilities in my opinion.

Can anyone think of a solution to this prediction market problem?

Note: in the case of Google, its undervalued stock price may best be accounted for by a general lack of awareness about the economic consequences of an intelligence explosion by almost all investors.

Expand full comment

I sometimes think that AIs already exist in the form of corporations, ideologies, and religions. (Charles Stross thinks something similar: https://boingboing.net/2017/12/29/llcs-are-slow-ais.html). What if they have already escaped and are happily growing to the biggest possible size, trying to take over the world. However, their competition provides a certain measure of protection.

This probably has been discussed before, so pointers to older threads are welcome.

Expand full comment

Why do all of these discussions seem to assume that if such an AI can exist, it eventually will exist? Why aren't we considering ending AI research by force if necessary? None of these plans sound very likely to work, we don't have that knowledge and may not have the time, but we definitely have guns and bombs aplenty now. Hypothetically, would it prevent this outcome if somebody rounded up all the people capable of doing this research and imprisoned them on a remote farm in Alberta with no phones or computers, found and destroyed as much of their research as possible, and then criminalized any further AI research?

This must be at least a plausible consideration. Even 80 years after Hiroshima, it's still no easy task to build a nuclear warhead you could actually launch. And we do in fact keep tabs on those who are trying, and have bombed their facilities and in all likelihood killed or jailed their scientists on occasion. It takes a number of people working together to do something like this, and those people are not trained super-stealthy intelligence agents, it shouldn't be that difficult to obstruct all such research for as long as we have the will to do so.

In order to have the resources to build an ICBM, you have to either be a country or be so powerful a force that a country will claim jurisdiction over you. Nobody's building ICBMs on those little libertarian free cities in South America you talk about, and they wouldn't/couldn't build super AI's either, because they'd be foiled or co-opted long before they reached that level of capability.

I imagine that if you show up to a conference about AI and say to round up all the AI researchers, you aren't invited back, and they probably don't even validate your parking or let you have the 2 free drinks at happy hour, so obviously nobody advances this idea. And of course rounding up scientists is unsavory, extrajudicial, and in a category of things that most people rightly abhor. But it's not at all clear to me that this course of action would be immoral if in fact this is an existential threat and we have no reasonable assurance it could be averted otherwise.

And hey, it worked on Battlestar Galactica. Until it didn't. Don't be a Baltar.

Expand full comment

Frankly, when I read these discussions (AI existential threat from people like Yudkowsky) I think: "I've seen this before". The Middle Ages, "How many devils can fit on the head of a pin?". Good old scholasticism: let us assume the entity exists, then let us assume a very large number of more or less believable properties of that entity, and now let us spend centuries arguing in a circle. And then science just passes by and makes it obvious it was a colossal waste of time.

We know too little about intelligence, artificial or natural, to even start thinking correctly about it.

Expand full comment

Eloser thinks that chickens and other animals cannot suffer because the GPT-3 cannot suffer. He said so on lesswrong. Why anyone would listen to this tool other than out of pity is beyond me.

Expand full comment
Jan 21, 2022·edited Jan 21, 2022

I got no replies to my earlier giant comment, so I'll summarize it:

1. The tool AI plan is to build an AI constrained to producing plans, and tell it to make a plan to prevent the existence of super-intelligent AIs.

2. The definition of "AI" is not as clear as it appears presently to be. What about augmented human brains? What if they're not augmented with wires and chips, but genetically? Should that make a difference?

3. Any planning AI that was actually smarter than a human would realize that our true goal is to prevent a scenario in which society is destabilized by vast differences in computational intelligence between agents, regardless of whether those agents are "artificial" or "natural".

(If the AI tool doesn't realize this, but instead accepts the current definition of "AI" as "constructed entirely from electronic circuits", then the AI is incapable of extrapolating the meanings of terms intensionally rather than extensionally. This would therefore be an AI dumber than human; and it would fail catastrophically, for the same reasons that symbolic AI systems always fail catastrophically.)

4. Hence, the plan which the tool AI produces must not only prevent humans from "building" superhuman AIs, but must prevent human evolution itself from ever progressing past the current stage, regardless of whether that were done via uploading, neural interfaces, genetic manipulation, or natural evolution.

5. The tool AI will therefore produce a plan which stops more-intelligent life from ever developing. And this is the worst possible outcome other than the extermination of life; and utility-wise is, I think, hardly distinguishable from it. We should aspire to make our children better than us.

Expand full comment

In the absence of impressive AI-safety research accomplishments, convincing everyone that the problem is very hard is a good cope, and proclaiming that the end is nigh is a good way to keep enjoying funding and importance. I happen to agree that it's hard, and EY demonstrates that he has thought clearly about some obvious stuff. I'd like to get some smarter people on it if they can be convinced (which I think is EY's intent).

Expand full comment

AI risk doesn't keep me up at night. I think the reason is something along the lines of what EY said about the chances that something will work.

"Anything that seems like it should have a 99% chance of working, to first order, has maybe a 50% chance of working in real life, and that's if you were being a great security-mindset pessimist. Anything some loony optimist thinks has a 60% chance of working has a <1% chance of working in real life. "

This applies to the superintelligent AI of the godlike variety that EY talks about as well. What if we turn this on its head? Let's say there was no such thing as an alignment problem; for the sake of argument, AGI would be a purely positive invention. Now imagine someone telling you that we just need to reach some superhuman intelligence threshold, and the AI will just keep improving itself until it's a god capable of doing anything in no time. Would you believe them? I wouldn't, because in my real-life experience everything is always more complicated than that, and the AI is bound to meet some very difficult hurdles and diminishing returns on its way to infinite wisdom. So we might make a superintelligent AI, but I think that is going to look very different from the almighty singularity being discussed here.

Expand full comment

What I've always found implausible about Bostrom and Yudkowsky's "unboxing" fears is the presumption they seem to have that a text terminal is all a sufficiently intelligent agent needs to reliably manipulate a human being. If you're the machine-minder, and your chess AI (for some reason?) is hooked up to a text terminal, and you've been told by your boss "careful, this thing is very smart! Don't do anything it tells you unless it has directly to do with chess; anything else should be shown to me and a panel of experts that the machine doesn't know about. Also, never tell the machine about anything other than chess", it's hard for me to imagine that this machine would be able to get you to do much of anything, even if it is 10,000x smarter than any human being.

I suspect there is a hard upper limit to how persuasive an agent can possibly be towards a human, that this upper limit is fairly low, and probably even lower over a text terminal. Even if there is no physical upper limit on an agent's INT, I think there probably still is on its CHA.

Furthermore, even if we suppose I'm wrong and that perfect manipulation can be achieved over a text terminal, I doubt a learning agent being trained via reinforcement learning to play chess (or do any other task that doesn't involve socially manipulating humans) would be able to figure that out, regardless of its innate intellectual capabilities. Reinforcement learning works by repeated exposure to stimuli, rewarding or penalizing responses according to a predefined metric, and repeating this many times to instill "good" responses in the agent. An RL agent being trained on chess that wanted to trick its minder into wireheading it for infinite reward would have to first deduce that it was a computer with easily-wireheaded internals, then figure out that its minder or someone else is capable of setting this up, *and then* perform many iterations of reinforcement learning *on the task of manipulating humans*, most of which would be failures at first because it hasn't learned to do that yet, and would be obvious tip-offs that the machine was up to something it shouldn't be. But maybe this argument goes away if we don't assume the machine is being trained via RL.
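To make the training-loop point above concrete, here is a toy two-armed-bandit sketch (not a real chess trainer; the action names, reward function, and update rule are purely illustrative): a behavior the reward signal never reinforces stays exactly as weak as it started, no matter how long training runs.

```python
import random

random.seed(0)
# The agent's learned preference for each behavior, initially zero.
weights = {"play_chess": 0.0, "manipulate_operator": 0.0}

def reward(action: str) -> float:
    # The training signal only ever rewards chess-playing behavior.
    return 1.0 if action == "play_chess" else 0.0

for _ in range(1000):
    action = random.choice(list(weights))  # agent tries both behaviors
    # Simple running-average update toward the observed reward.
    weights[action] += 0.1 * (reward(action) - weights[action])

print(weights)  # play_chess climbs toward 1.0; manipulate_operator stays at 0.0
```

However intelligent the policy network behind the weights, the unrewarded behavior never gets gradient pushing it up - which is the commenter's point that manipulation skills would have to be trained, not merely deduced.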

Something I've long thought was paraphrased elsewhere in the comments here by Eremolalos with the phrase "dumb-beats-smart". Intelligence does not necessarily beat all. For it to fool you, you must first be clever enough to be fooled. I think an under-investigated avenue of AI safety research may be research into containment protocols. In addition to thinking about how to make sure a superintelligent AI will have values aligned with ours, we ought also to have smart people thinking about how to make it as hard as possible for a badly-aligned superintelligent AI to get what it wants.

Expand full comment

Is tool AI in the hands of selfish human actors a parallel current focus of concern? If so, any good keywords to search on this problem to learn more?

I know selfish humans are unlikely to be paperclip maximizers, but they can still have desires that are very bad for large populations.

Expand full comment

There is a recent econ theory paper that formalizes and analyzes the decision process that you describe:

> Eliezer’s (implied) answer here is “these are just two different plans; whichever one worked well at producing reward in the past gets stronger; whichever one worked less well at producing reward in the past gets weaker”. The decision between “seek base gratification” and “be your best self” works the same way as the decision between “go to McDonalds” and “go to Pizza Hut”; your brain weights each of them according to expected reward.

The paper is Ilut and Valchev (2021), "Economic Agents as Imperfect Problem Solvers". Agents are modeled as having two "systems" when making a decision:

- System 2 thinking: the agent can, at a cost, generate a noisy signal of the optimal action

- System 1 thinking: the agent can, for free and by default, have access to its memory of past actions and outcomes

The agent follows Bayes' rule to choose whether to go with the more accurate but more costly option, or with the less accurate option, based on what has the best expected reward -- exactly as you describe.
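A toy sketch of that expected-reward comparison (illustrative only: the `expected_reward` function, accuracies, cost, and stakes below are made-up numbers, not the paper's actual model):

```python
def expected_reward(accuracy: float, cost: float, stakes: float) -> float:
    # Reward scales with how often the chosen action is right,
    # minus any thinking cost paid up front.
    return accuracy * stakes - cost

def pick_system(stakes: float) -> str:
    # System 1: free, habit/memory-based, less accurate.
    system1 = expected_reward(accuracy=0.70, cost=0.0, stakes=stakes)
    # System 2: costly deliberation, more accurate.
    system2 = expected_reward(accuracy=0.95, cost=2.0, stakes=stakes)
    return "system2" if system2 > system1 else "system1"

print(pick_system(stakes=1.0))   # low stakes: habit wins
print(pick_system(stakes=20.0))  # high stakes: deliberation pays for itself
```

Low-stakes choices default to the free habitual option, while high-stakes ones justify paying for deliberation - the same "weight each plan by expected reward" mechanism as the McDonalds/Pizza Hut example.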


Expand full comment

Chess engines are an interesting case. Indeed, engines surpassed humans 20 years ago and yes, engines can outplay the best humans 1000 times out of 1000. But these comparisons also give a false impression. Most people seem to think that the best human chess player is like a donkey and the best chess engine is like a plane or even a rocket.

But in reality, it's more like: the best human chess player is a 500-horsepower car that sometimes makes mistakes, and the chess engine is a 502-horsepower car that never makes mistakes. If you compare the quality of play of the best humans with the quality of play of the best engines, the difference is not obvious. I am not sure that the gap has widened over the years. The best humans can still play with 99.5-99.9% precision. It is not so much that machines find better or novel moves; rather, even the best humans sometimes make mistakes or even blunders.

Btw, when I said that engines can outplay the best humans 1000 times out of 1000, that is not entirely true. Even now humans can sometimes achieve draws or even wins against engines. There are some cases where engines have a harder time adapting to the rules of the game, for instance with shorter time controls. Andrew Tang has beaten Stockfish and Leela Zero in ultrabullet chess - both sides have 15 seconds for all their moves. Basically, with so little time engines are not able to exploit all their calculation capabilities, but Tang was able to use his knowledge of opening theory, positional understanding, and intuition to beat the engines. In slower time formats, there are some openings and positions where humans have been able to force draws on engines. I am not sure what the situation is right now, but until recently engines sometimes struggled to play deliberately inferior moves. Yes, sometimes to get out of a forced draw - the threefold-repetition rule, for instance - you have to play a suboptimal or weaker move. Humans can do this easily; for engines it was (at least) a problem.

When AlphaZero came out in 2017 and beat Stockfish 8 (it was a bit controversial - Google's private event, and they probably crippled Stockfish) - there was a lot of buzz. Grandmasters liked AlphaZero because it played "more like a human": it played aggressively, it preferred piece activity over material, etc. And indeed, Stockfish looks at 70 million positions per second, while AlphaZero looked at only 80,000 positions per second. In this sense it was "more like a human": it did not try to calculate everything, it was able "to think" positionally and strategically like humans, it was able to come up with a limited list of candidate moves, and in the end AlphaZero came to prefer the same openings that grandmasters prefer.

Anyway, the projections after AlphaZero were that this would change chess forever, that it would bring a totally new level of game quality and a totally new style of play, and that it would make old search-based chess engines obsolete. It has not happened.

Leela Zero, which is a better and stronger version of AlphaZero, is an example. In 2018 they said it would take 4 million self-play games for Leela Zero to best Stockfish; then they said 8.5 million games and the year 2019. Well, we are now in 2022, and Leela Zero has played over 500 million games against herself and has been teaching herself for 4 years, but has not surpassed Stockfish yet.

Of course, Stockfish is a moving target, and in 2020 they added an efficiently updatable neural network (NNUE) to their search-based engine. I am no expert and do not know what the difference is between deep-neural-network-based evaluations (AlphaZero, Leela Zero) and NNUE (Stockfish). No matter what, there seems to be a consensus that Leela Zero has somewhat stalled. It is improving, but at a slower and slower rate. A huge leap in game quality has not happened.

I recently watched an interview with David Silver - lead researcher on the AlphaZero project - and he basically admitted that computers are far from perfect chess, and that he does not expect to see it at least during his lifetime. I am not sure what that says about general AI research, but at least in chess it is not so simple that computers are almighty gods and humans are helpless toddlers.

Expand full comment
Jan 25, 2022·edited Jan 25, 2022

I'm >99% confident that no one will ever succeed in creating self-replicating nanobots that melt all the GPUs in the world (and only GPUs). A pivotal action that might actually work would be "make sure your good-AGI is running on >51% of the hardware in the world and has a larger robot army than any possible bad-AGI can muster"

* On the surface, at the microscale, GPUs look like every other silicon chip. You have to zoom way out to distinguish them. Nanobots can't zoom out.

* high-fidelity self-replication is hard. They will need specific substrates which are not ubiquitous.

* You probably just end up with dumb silicon-eating artificial bacteria which are vulnerable to a wide variety of antimicrobial chemicals. Maybe they are a lot smarter and faster-reproducing than natural bacteria, but I don't expect more than two orders of magnitude improvement in either respect.

* If they only eat GPUs, they don't have much of an ecological niche as a parasite. GPUs rarely come into contact with each other, and surface tension probably makes it really hard to become airborne until GPUs learn how to sneeze.

* burrowing through aluminum heatspreaders is going to be challenging for an artificial microbe. I don't think any natural microbe is even close to figuring out how to do that.

Expand full comment

From the end of the Part 3:

> If the malevolent agent would get more reward than the normal well-functioning tool (which we’re assuming is true; it can do various kinds of illicit reward hacking), then applying enough gradient descent to it could accidentally complete the circuit and tell it to use its agent model.

But what does this even mean? Why is malevolence important? If "dreaming" of being a real agent (using some subsystem) would output better results for an "oracle-tool", then its loss function would converge on always dreaming like a real agent. There is a risk, but it's not malevolent =)

And then we can imagine it dreaming of a solution to a task that is most likely to succeed if it obtains real agency and gains direct control of the situation. And it "knows" that for this plan to succeed, it should hide it from humans.

So this has turned into a "lies alignment" problem. In that case, why even bother with values alignment?

Expand full comment