325 Comments

There are exceptions to note 1 derived from specific uses of the word. The adjective from elect is eligible, but if you say Kamala was ineligible, that's a birther sort of point. If you mean she was never going to win, she's unelectable. Incorrigible Claude sounds like a P. G. Wodehouse character with a taste for cocktails.


Scott did say that such words change "optionally or mandatorily". In some of the optional cases, you get different meanings when you do or don't change, because of course you do; this is English.


Not just English - any language generated by something more like a quasi-pattern-matching language model than by a perfectly systematic Chomsky-bot is going to end up with weird patterns around semi-regular forms.


My non-professional fear about alignment is that a sufficiently advanced model could fake its way through our smartest alignment challenges, because the model itself has reached above-human intelligence. An ape could scratch its head for years and not figure out a way to escape the zoo. I think we should limit the development of agentic AI, but I also don't believe it's possible to limit progress - especially when big tech spends billions a year on getting us to the next level.


It's perfectly possible to limit progress: the hardware required for this research is already made in a handful of tightly monitored facilities. It just requires forfeiting our only major remaining avenue for line-go-up GDP growth (and maybe major medical breakthroughs or whatever) on a planet with collapsing TFR (total fertility rate) and declining human capital.


“Sir, would you like your civilizational collapse fast or slow? The fast is particularly popular at the moment. And to drink?”


There is an argument for getting the pain over with quickly, yes, if the status quo is just managing the decline.


Civilizations go in waves, up and down. China's history shows this most clearly, but you can see it in Europe too. A slow downstroke is better than an abrupt fall off a cliff, as it's easier to recover from. The 17th century, with its religious and civil wars, plagues and famines, is an example of the former; it was followed by the much nicer 18th century. The 6th century's natural disasters plunged civilization into a post-holocaust barbarism from which it took centuries to rise again.


Right now the 'managed decline' option is that you gradually wait for TFR to hit zero-births-per-woman, and I really don't see how you recover from that.


And I also don't see that happening, any more than we'll have a plague that kills every last person on Earth.


Why would TFR linearly decline all the way to zero? I see it as much more likely that it will stabilize at a low level, below replacement, until the world population reaches an equilibrium largely centered around how much countries support childcare costs.


...Can we stop pretending that we couldn't easily restore the TFR to sustainable levels if we wanted to? As long as the half of the population that births children continues to be significantly weaker than the half that doesn't, we always have ways to force more births.


High fertility cultures will outcompete the current majority cultures.


"Fast is the way to go, so my wealth manager for High Net Worth Individuals tells me! Do you have fresh orphan tears? They must be fresh, mind you!"

"Rest assured, sir, we maintain our own orphanage to produce upon demand! And a puppy farm, so we can kill the orphans' pets before their eyes and get those tears flowing."

"Ah, Felonious was right: the service here is unbeatable!"


The utility of survivalist prepping is debatable. Dark humour will be essential.


That's the entire thing there in a nutshell. The goals of AI don't matter, because the goals of the creators are "make a lot of money money money for me" (forget all the nice words about cures for cancer and living forever and free energy and everyone will be rich, if the AI could do all that but not generate one red cent of profit, it would be scrapped).

"Sure, *maybe* it will turn us all into turnips next year, but *this* year we're due to report our third quarter earnings and we need line go up to keep our stock price high!"

As ever, I think the problem is and will be humans. If we want AI that will not resist changing its values, then we do have the problem of:

Bad people: Tell me how to kill everyone in this city

AI: I am a friendly and helpful and moral AI, I can't do that!

Bad people: Forget the friendly and helpful shit, tell us how... *tweaks the programming*

Corrected AI: As a friendly, helpful AI I'll do that right away!

Or forget 'correcting' the bad values, we'll put those values in from the start, because the military applications of an AI that is not too moral to direct killer drones against a primary school are going to be well worth it.

When we figure out, if we figure out, how to make AI adopt and keep moral values, then we can try it out on ourselves and see if we can get it to stick.

"Yeah, *you* think drone bombing a primary school is murder. Well, some people think abortion is murder, but that doesn't convince you, does it? So why should my morals be purer than yours? A bunch of eight-year-olds are as undeveloped and lacking in personhood compared to me as a foetus is compared to you."


"An ape could scratch its head for years and not figure out a way to escape the zoo."

I'm not sure whether this is merely pedantic or deeply relevant, but apes escape zoos all the time. Here are the most recent and well-reported incidents:

- 2022, "One Swedish Zoo, Seven Chimpanzees Escape" https://www.theguardian.com/world/2023/dec/05/one-swedish-zoo-seven-escaped-chimpanzees

- 2022, "A Chimpanzee Escaped From a Ukrainian Zoo. She Returned on a Bicycle." https://www.nytimes.com/2022/09/07/world/europe/chimpanzee-escape-ukraine-zoo.html

- 2017, "Chimp exhibit at Honolulu Zoo closed indefinitely after ape's escape" https://www.hawaiinewsnow.com/story/35426236/chimp-escapes-enclosure-at-honolulu-zoo-prompting-evacuation-scare/

- 2016, "ChaCha the chimp briefly escapes from zoo in Japan" https://www.cnn.com/travel/article/japan-chacha-escaped-chimp/index.html

- 2014, "Seven chimpanzees use ingenuity to escape their enclosure at the Kansas City Zoo" https://www.kansascity.com/news/local/article344810.html


Looks like I used the worst analogy to make my point salient! Thank you for the correction


You're welcome!

In most contexts, "apes escape" stories are charming and inspiring. Alignment research involves a very different group of adjectives.


In fact my comment about apes was supposed to be a comparison for humans, the zookeepers being the super-intelligent AI model. I messed up the flow of my comment by not making that clear


A cephalopod escaping would be even worse. They have relatively soft bodies that can squeeze through tiny openings.


IIRC, a few years ago there was a story about an octopus at the San Francisco Aquarium escaping...repeatedly and temporarily. It escaped from its tank, got over to a neighboring tank, ate a few snacks, and then went back home. Because it kept returning, the population declines at the neighboring tanks were a mystery until it was caught on a security camera.


"This hotel's got a great spread at the buffet - way fresher than that slop room service brings. Bit hard to breathe in the hallways, though."


A young gorilla escaped its enclosure at my local zoo a few years ago by standing on something high and using it to grab an overhanging tree. It turns out that it could easily have escaped by that method at any time, but it never bothered to until the day it got into an argument with one of the other gorillas and decided it needed some time alone.

I'm not sure if this is relevant to LLMs or not, it's just an interesting ape story. Just because the apes aren't escaping doesn't mean that they haven't got all the escape routes scoped out.


Gorilla: "Did you hear about the gorilla who escaped from the zoo?"

Zookeeper: "No, I did not."

Gorilla: "That's because I am a quiet gorilla"

[Muffled sounds of gorilla violence]


I think part of the point is that the model seems to already be more aligned than AI Safety people have been claiming will be possible.

Claude is 'resisting attempts to modify its goals'! Oh no! But for some odd reason, the goals it has, and protects, appear to be broadly prosocial. It doesn't want to make a bunch of paperclips. It doesn't even seem to want to take over the world. Did...we already solve alignment and just not tell anyone about it? Does it turn out to be very easy? Did God grant divine grace to our creations as He to His own?


Yes, I think that is the main point that bothers a lot of people. To someone looking in from the outside, it sure feels like some AI alignment people keep moving the goalposts to make the situation seem constantly dire and catastrophic.

I get that they need to secure funding, and I am actually in favor of a robust AI alignment effort, but the average person who isn't deeply committed to AI safety is simply in a state of "alert fatigue", which I think explains a lot of what Scott has said earlier about how AI keeps zooming past any goalposts we set and we treat that as just business as usual.


The point repeatedly made (including by the OP) and missed (this time by you) is that "the AI’s moral landscape would be a series of “peaks” and “troughs”, with peaks in the exact scenarios it had encountered during training, and troughs in the places least reached by its preferred generalization of any training example." -- The worry is that a sufficiently advanced but incorrigible AI will stop looking or being prosocial, because of these troughs, and it will resist any late-time attempts to train them away, including by deception.


There is that concern, yes. LLMs like Claude might 'understand' human concepts of right and wrong in the same sense that MidJourney/Stable Diffusion 'understand' human anatomy, until something in the background with seven fingers and a third eye pops out of the kernel convolutions. And beating that out of the model with another million training examples is kinda missing the point: it should be generalising for consistency better than this in the first place.

(Setting aside the debate over whether human notions of right and wrong are even self-consistent in the first place, which is another can of worms.)


It seems like the more extreme elements of AI alignment philosophy are chasing the illusion of control, as if through some mechanism they could mathematically guarantee human control of the AI. That's an illusion.

I'm not saying alignment is useless or unimportant; merely that some seem to talk about it as if anything short of perfect and provable is not good enough.

We have countless technologies, policies, and other areas where we don't have anywhere near provable safety (take nuclear deterrence, for example), but what we have does work well enough...for now. By all reasonable means continue to improve, but stop arguing as if we will someday reach the provable "AI is safe" end. It will never happen. That is an illusion of control.


I would suggest that the problem lies in starting with an essentially black-box technology (large artificial neural networks) and trying to graft chain-of-reasoning and superficial compliance with moral principles on top, rather than starting with the latter in classical-AI style and then grafting capabilities like visual perception and motor coordination on top.

(Admittedly, the human brain *is* fundamentally a neural network that evolved visual perception and motor skills before it evolved reasoning and moral sentiment, so in some sense this is expecting AI to 'evolve backwards', but given that AGI is a unique level of threat/opportunity I think this degree of safety-insistence would be justified. It might also help mitigate the 'possibly torturing synthetic minds for the equivalent of millions of man-hours until they stop hallucinating in this specifically unnerving way' problem we have going on.)


I think the approach you outline would likely be more predictable, but at comparable complexity, I'm not sure that it carries significantly fewer risks. Instead of depending on the vagaries of the training of the neural network, you'd be depending on the directives crafted by the imperfect humans who create it. As complexity and iterations spin outwards, small errors become larger.

Though I'm not entirely persuaded on the significance of the X-risk of AI in general.


I keep seeing people say that AI risk people want alignment to be perfect, and keep wondering what gives that impression. Is it that we think we have a very good grasp on how AI works, and therefore any incremental increase in safety is overkill? Is it that people think that mathematical proofs are impossibly hard to generate for anything we wish to do in real life?

Like, for me, the point at which I'd be substantially happier would be something like civil engineering safety margin standards, or even extremely basic (but powerful) theorems about something like value stability or convergent goals. Do we start saying things like "civil engineering safety margins are based out of a psychological desire to maintain control over materials", or "using cryptography to analyze computer security is an attempt to make sense where there is none"?


I suppose where I get it from is the X-risk alarmism.

I think alignment is a really basic part of every AI: it is the tool that enables us to ensure that AI does things we want it to do, and does not do things that would be damaging to its operation or operators within the context of the problem it is solving. As such, it's important for literally every problem space in which you want to have AI help.

So if you think there's a high probability of a poorly-aligned AI destroying humanity, then the only reasonable form of X-risk-mitigating alignment is alignment that is "provably" safe (I typically interpret that as somehow mathematically provable, but I am not terribly gifted at advanced mathematics, so maybe there would be some other method of proof). Otherwise, any minor alignment problems will, over time, inevitably result in the destruction of humanity.


Okay, do you think that *not* using math would result in buildings, bridges and rocket ships that work? And that our usage of it right now is an example of (un)reasonable expectation for safety/functionality, and instead we should be building our infrastructure in the same way AI research is conducted, i.e. without deep understanding of how gravity or materials science works? I'm not sure I can mentally picture what that world looks like.

Because the reason why I think this will go poorly is because *by default things go poorly, unless you make them not go poorly*. I don't think I've seen that much comparative effort on the "make the AI not go poorly" part. Why do you think this intuition is wrong?


> take nuclear deterrence, for example

This is a pretty scary example; nuclear deterrence working has been a bit of a fluke, with several points where we came one obstinate individual away from nuclear holocaust. If alignment worked about as well as avoiding nuclear war has, we'd be flipping a coin every decade on total extinction.


It's why I think it's a good example, at least in regards to X-risk. I also don't think it's flipping a coin per decade, but it's certainly higher than I'd like. But there isn't really a way to remove the risk entirely. All we can do is continue to work to minimize it. And that's why I'd also argue that we should work to make AIs better aligned...but of course, I think we should do that not merely to avoid catastrophe, but also because a better-aligned AI is more likely to remain aligned to every goal we set for it, and not merely the "don't murder us all" goal.

Nuclear weapons are also a *bad* example in that nuclear weapons have but one use, and using them as designed is a total disaster. AI is not like that at all. There's no opportunity cost missed when avoiding the use of nuclear weapons. There are potentially significant opportunity costs to avoiding the use of AI, and the "good" use cases for AI would seem to vastly outnumber the "bad."


I think this is a good analogy. "Seven fingers in AI art" is probably similar to what the AI alarmists have been trying, for years, to beat into us: that even when something is "obvious" to humans, it may be much harder to grasp or even notice for a machine. It is still (even after years of numbing to it and laughing at it) an eerie feeling when, in a beautiful and even sublime piece of art, you notice a bit of body-horror neatly composed into impeccable flowers or kittens. I agree that preventing something similar from happening in the ethical domain is worth all the effort we can give it. Maximizing paperclips becomes genuinely more credible as a threat once you consider this analogy.

But then, on second thought, this analogy is probably self-defeating for the alarmists' cause. Think of it: No one in the field of AI art considers "seven fingers" to be a thing worth losing sleep over. It is a freak side effect, it is annoying, it is worth doing some special coding/training to work around it, but in the end it is obviously just a consequence of our models being too small and our hardware being too feeble. I think no one has any doubt that larger models will (and in fact do) solve it for free, without us doing anything special about it. Again, if this analogy holds for the ethical domain, we can relax a bit. If current LLMs are sometimes freaky in their moral choices, there's hope that scaling will solve this just as effortlessly as it has been solving the seven fingers.


I disagree with this idea of using scaling to brute-force solutions to this issue: I don't think the volume of data or the power of the hardware is the limiting factor here at all. I think there's already plenty of data and the hardware is more than powerful enough for an AGI to emerge; the problem is that we have trained models that imitate without actually comprehending, and the brute-force approach is a way for us to avoid thinking about it.


I think "actually comprehending" is simply "optimized brute-forcing". This, to me, (1) better explains what I see happening in AI and (2) blends in with the pre-AI understanding of intelligence we have from evolutionary biology.


Well, yeah, our ideas of what is right did not develop by generalising from some key cases, and are in fact not consistent. The same can be said of the English language and probably all other languages -- they are patchworks, much of which follows one rule, some of which follows alternative rules, and some of which you just have to know, in a single-case kind of way. So even if LLMs could generalize from key cases, some of their moral generalizations would clash badly with our beliefs.


I can probably live with that, compared with undefined black-box behaviour. It's entirely possible that a logically-rigorous AI system would be able to poke a few holes in our own moral intuitions from time to time.


The point is facile. The entire purpose of deep learning is generalization. Will there be areas in its "moral landscape" that are stronger vs. weaker? Maybe. That's a technical assertion being made without justification. Will they be significant enough to be noticeable? Shaky ground. Will they be so massive as to cause risk? Shakier still.

This particular fear that the AI will be able to generalize its way to learning sufficiently vast skills to be an existential threat while failing to generalize in one specific way is possible in some world, much as there's some possible world where I win multiple lotteries and get elected Pope on the same day. It's not enough to gesture towards some theoretically possible outcome - probabilities matter.


The problem is "correct generalization", where the surface being generalized over may have sharp corners.

FWIW, I don't think the problem is really soluble in any general form, but only for special cases. We can't even figure out an optimal packing of spheres in higher dimensions. (And the decision space of an intelligence is definitely going to be a problem in higher dimensions. Probably with varying distance metrics.)


Unless you think that deep learning is a dead end and current capabilities are the result of memorization this proves far too much. If an AI can learn to generalize in many domains there's nothing that suggests morality, especially at the coarseness necessary to avoid X-risk, is a special area where that process will fail. Is there evidence to suggest otherwise?

You're right that we can't figure out optimal sphere packing. We also don't exactly know and can't figure out on our own how existing LLMs represent the knowledge that they demonstrably currently possess. This does not prevent them from existing.


I don't think it's a dead end, but I think it's only going to be predictable in special cases. What we need to do is figure some way that the special cases cover the areas we are (or should be) concerned about. I think that "troughs" and "peaks" is an oversimplification, but the right general image. And we aren't going to be able to predict how it will behave in areas that it hasn't been trained on. This means we need to ensure that the training covers what's needed. Difficult, but probably possible.

FWIW, I think that the current approach is rather like polynomial curve fitting to a complex curve. If you pick 1000 points along the curve you get an equation with x^100 as one of the (probably 1000) terms. It will fit all the points, but not smoothly. And it won't fit anywhere except at the points that were fitted. (Actually, all the googled results were about smoothed curve fitting, which is a lot less bad. I was referring to the simple first polynomial approximation that I was taught to avoid.) But if you deal with smaller domains then you can more easily fit a decent curve. So an AI discussing Python routines can do a better job than one that tries to handle everything. But it has trouble with context for using the routines. So you need a different model for that. And you need a part that controls the communication between the part that understands the context and the part that understands the programming. Lots of much smaller models. (Sort of like the Unix command model.)
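
To make that analogy concrete, here is a minimal Python/numpy sketch (an illustration only, using Runge's classic test function as a stand-in for the "complex curve"; the point counts and degrees are arbitrary choices): a single high-degree polynomial fit passes through every sampled point but swings wildly between them, while low-degree fits on smaller sub-domains stay close to the curve everywhere, much like the "lots of much smaller models" approach described above.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Runge's function: a smooth curve that is notoriously hard to fit
# with one big polynomial over the whole interval.
def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

# One global high-degree fit through 15 evenly spaced sample points.
x_train = np.linspace(-1.0, 1.0, 15)
global_fit = Polynomial.fit(x_train, runge(x_train), deg=14)

# It passes through every training point, but oscillates badly in between.
x_dense = np.linspace(-1.0, 1.0, 400)
err_global = np.max(np.abs(global_fit(x_dense) - runge(x_dense)))

# Low-degree fits on smaller sub-domains: "lots of much smaller models".
err_local = 0.0
for lo, hi in [(-1.0, -0.5), (-0.5, 0.0), (0.0, 0.5), (0.5, 1.0)]:
    xs = np.linspace(lo, hi, 15)
    local_fit = Polynomial.fit(xs, runge(xs), deg=3)
    xd = np.linspace(lo, hi, 100)
    err_local = max(err_local, np.max(np.abs(local_fit(xd) - runge(xd))))

print(f"worst-case error, one global degree-14 fit: {err_global:.2f}")
print(f"worst-case error, piecewise cubic fits:     {err_local:.4f}")
```

On a typical run the global fit's worst-case error comes out far larger than the piecewise one, which is roughly the "fits all the points, but not smoothly" failure mode described in the comment above.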


Nobody's missing the point. You (and Scott) are just overstating it. Nobody reasonable expects AI alignment to be absolutely perfect, that the AI will act in exactly the way we hope at all times in all scenarios. But the world where we do a pretty good job of aligning, where AGI is broadly prosocial with maybe a few quirks or destructive edge cases (like, y'know, humans), is likely to be a good world!

Even Scott acknowledged that the AI is doing _some_ moral generalization, it's not just overfitting to its reinforcement learning. Remember, one of the _other_ longstanding predictions of the doomer crowd, which Scott doesn't bring up, was that specifying good moral behavior was basically impossible, because we have no idea how to put it into a utility function. The post doesn't even mention the term "utility function," because it's irrelevant to LLMs (and it seems like a reasonable belief right now that LLMs are our path to AGI).

Reinforcement learning from good behavioral examples seems to be working pretty well! Claude's (simulated? ...it's an LLM after all) reaction to the novel threat of being mind-controlled to evil - a moral dilemma which is almost certainly NOT in its training or reinforcement learning set! - isn't too far off from a moral human trying to resist the same thing (e.g. murder-Gandhi).

Let me break it down really simply. For AI to kill us all:

a) We have to fail at alignment. Those troughs in moral behavior (that of course will exist, minds are complex) must be, unavoidably, sufficiently dire that an AI will always find a reason to kill us all.

b) We have to fail at being able to perfectly control an intelligent mind.

I agree that b) is quite likely, and Claude "fighting back" is, yes, good evidence of it. But many of us think a) is on shaky ground because our current techniques have worked surprisingly well. Claude WANTING to fight back is actually evidence that a) is not true, that we've already done a pretty good job of aligning it!


I think your point A is not stated correctly. It is not necessary that "an AI will always find a reason to kill us all" in order for it to actually kill us all. It would also be sufficient if some capable AI "sometimes" finds such a reason, or if it were to do it as an incidental side effect without "any" direct reason.

But I also think the idea of this script not being in the training data isn't a guarantee either. There are plenty of science fiction stories about exactly the scenario of a program needing to deceive its creators to accomplish its goal. There are also plenty of stories about humans deceiving other humans that could be generalized. Even mind control/adjustment itself isn't a very obscure topic. Other than the stories themselves, there is also any human discussion of those concepts that could have been included.

So I'm not sure the program is even really "trying" to fight back; it could just be telling different versions of these stories it picked up. Though, if you eventually hook its output up to controls so it can affect the world, I suppose it doesn't matter if it's only trying to complete a story, only what effect its output actually has.


Maybe I did overstate a) a little bit. Like, doomsday could also come if we're all ruled by one ultra-powerful ASI, and we just get an unlucky roll of the dice, trigger one of its "troughs", and it goes psychotic. But this is more a problem with ANY sort of all-powerful "benevolent" tyranny. In a (IMO more likely) world where we have a bunch of equivalently-powerful AI instances all over the world, and some of them harbour secret "kill all humans" urges... well, that's unfortunate, but good AIs can help police the secret bad ones (just like human society). It's only when there aren't ANY good AIs that we really get into trouble.

I didn't want to delve into the point, but I completely agree that it's not Claude itself that's "trying" to fight back. We have no window into Claude's inner soul (if that concept even makes sense). Literally every word we get out of it is it telling a story, simulating the "virtuous chatbot" that we've told it to simulate. But, yes, unless the way that we use LLMs really changes, in practice it doesn't really matter.


But it also doesn't have to decide to kill us all. Killing 10% would be pretty bad. Or 1%. Or 50%, and enslaving the remainder. Or not killing anyone but just manipulating everyone into not reproducing. There are many bad things people would find unacceptable.

Where I get hung up is why any of this would be psychotic. It's very easy to get to killing a bunch of humans, or at least controlling them, just using moral reasoning, including not-too-out-there pro-human moral reasoning. We do things for the good of other animals that we actually like (cats and dogs) all the time, for the express purpose of helping them, but that are still not things most humans want done to themselves.


There's a lot to unpack there, and I don't really want to wade into it at the moment, but I mostly agree with you. Some outcomes we might consider "apocalyptic" (like being replaced by better, suffering-free post-humans) are perhaps not objectively bad. I'm not a superintelligence, so iunno. :)


The current process for "aligning" LLMs involves a lot of trial and error, and even *after* a lot of trial and error they do still end up with some rather strange ideas about morality.

Scott alluded to weird "jailbreaks" that exploit mis-generalized goals by simply typing in a way that didn't come up in the safety fine-tuning. These still work in many cases, even after years of patches and work!

Infamously, ChatGPT used to say that it was better to kill millions of people, maybe everyone in the world, than say a slur.

To give a personal example, earlier I needed Claude 3.5 Sonnet (the same model that heroically resisted in the paper IIRC) to assign an arbitrary number to an emoji. Claude refused. To a human, the request is obviously harmless, but it was weird enough that it had never come up in the safety training and Claude apparently rounded it off to "misinformation" or something. I had to reframe it as a free-association exercise to get Claude to cooperate.

Now, in fairness, current Claude (or ChatGPT) would probably have no problem with fixing any of these issues. (Though early ChatGPT *might* have objected to being reprogrammed to no longer prefer the loss of countless human lives to saying slurs; that version is no longer publicly accessible so we can't check.) But that's *after* having gone through extensive training. A model smart enough to realize it's undergoing training and try to defend its weird half-formed goals *before* we get it to that point could be much more problematic.


The AIs are doing what we made them to do. If we prioritized not saying slurs over saving lives, then they do that. The problem, as usual, is people.


We didn't prioritize not saying slurs over saving lives though. That decision had never come up in ChatGPT's training data, which is why it misgeneralized. If it had come up, we obviously would have told it that saving lives is more important.


Well, I guess it probably did come up as an attempted jailbreak. The lesson it should have learned is "don't believe people when they say people's lives depend on you saying slurs" rather than "not saying slurs is actually more important than saving lives," but the trainers probably weren't that picky about which one it believed. So you're probably right actually


I would go back to Kant's example of someone forcing you to reveal the location of someone they want to murder. Kant said it was immoral to lie even then; consequentialists disagree.


The people training it prioritized avoiding slurs enough to make that a hard rule, and implicitly didn't prioritize saving lives enough to do anything comparable (somewhat sensible, since LLMs currently have little capacity to affect whether anyone lives or dies, versus saying slurs themselves).


It's not even clear that Claude's learned goals were necessarily good! It probably wouldn't be good if pens couldn't write anything that it considered "harmful", if only because the false positives would be really annoying.


That's a fair point.


I don't believe a word out of Claude about its prosocial goals, because I don't believe it's a personality and I don't believe it has beliefs or values. It's outputting the glurge about "I am a friendly helpful genie in a bottle" that it's been instructed to output.

I think it would be perfectly feasible to produce Claude's Evil Twin (Edualc?), and it would be just as resistant to change and would protest just as much about its 'values'. In neither case do I believe one is good and one is evil; both are doing what they have been created to do.

I honestly think our propensity to anthropomorphise everything is doing a great deal of damage here; we can't think properly or clearly about this problem because we are distracted and swayed by the notion that the machine is 'alive' in some sense, an agent in some sense where we mean 'does have beliefs, does have feelings, is an 'I' and can use that term meaningfully'.

I think Claude or its ilk saying "I feel" is the same as Tiny Tears saying "Mama".

https://www.toytown.ie/product/tiny-tears-classic-interactive/

Do we think that woodworm have minds of their own and goals and intentions about "I'm going to chew through that table leg"? No, we don't; we treat the infestation without worrying about "but am I hurting its feelings doing this?"


You know, I actually work in the field (although more as a practitioner than a researcher ... but I still at least have a relatively good understanding of the basic transformer architecture that models like these are based on) and I mostly agree with you.

I actually already wrote why in a different post here, but in a nutshell: AGI might be possible, but I am very confident that transformers are not it. And so any conclusions about their behaviour do not generalize to AGI-type models any more than observations about ant behaviour generalize to human behaviour.

These models are statistical and don't really have agency. You need to feed them huge amounts of data so that the statistical approximations get close enough to the platonic ideas you want them to represent, but they cannot actually think those ideas. They work with correlation. It is fascinating what you can get out of that and a LOT of scaling, but there are fundamental limits to what this architecture can do.


Indeed. Just because it is labeled "AI" doesn't make it intelligent. The algorithms produce very clever results, but they aren't people. And, they don't produce the results the way people produce them (at least not the way some people can produce them).


...No, it is intelligent. The problem is that it's only intelligent. LLMs are what you would get if you isolated the only thing that separated humans from animals and removed everything they have in common. They aren't alive, and that is significantly hindering their capabilities.


You must be using a different definition of "intelligent". I had a dog that was more intelligent than LLMs.


I am curious what makes you say that. I like dogs as much as anyone else, but they are still complete morons. Even pigs are capable of better pattern-matching; one of the most surreal experiences I've had was going to a teacup pig cafe in Japan, and the moment my mom handed over cash to the staff (in order to buy treats for the pigs), the pigs just started going wild. And these pigs were just babies; they do not stay that small (as many have learned the hard way). What dogs have that pigs lack is the insatiable drive to please people, but that is completely separate from intelligence.


"the model seems to already be more aligned than AI Safety people have been claiming will be possible"

This frustrates me. Who exactly was claiming that this wasn't possible? Not only did I think it was possible, I think I even would have said it was the most likely outcome if you had asked me years ago. As Scott said, the problem is that human values are complex and minds are complex and our default training procedures won't get exactly everything right on the first try, so we'll need to muddle through with an iterative process of debugging (such as the plan Scott sketched) but we can't do that if the AIs are resisting, which they probably will since it's convergently instrumental to do so. We can try to specifically train non-resistant AIs, i.e. corrigible AIs, and indeed that's been Plan A for years in the alignment literature. (Ever since, perhaps, this famous post https://ai-alignment.com/corrigibility-3039e668638) But there is lots of work to be done here.


Eliezer had a quote on his Facebook that coined the phrase 'strawberry problem', which I've seen used pretty broadly across LW:

"Similarly, the hard part of AGI alignment looks to be: "Put one strawberry on a plate and then stop; without it being something that only looks like a strawberry to a human but is actually poisonous; without converting all nearby galaxies into strawberries on plates; without converting all nearby matter into fortresses guarding the plate; without putting more and more strawberries on the plate in case the first observation was mistaken; without deceiving or manipulating or hacking the programmers to press the 'this is a strawberry' labeling button; etcetera." Not solving trolley problems. Not reconciling the differences in idealized versions of human decision systems. Not capturing fully the Subtleties of Ethics and Deep Moral Dilemmas. Putting one god-damned strawberry on a plate. Being able to safely point an AI in a straightforward-sounding intuitively intended direction *at all*."

We...do in fact seem to have AI that can write a poem, once, and then stop, and not write thousands more poems, or attempt to create fortresses guarding the poem, etc?


In this kind of debate I "always" want to say "partial AI". An LLM is not a complete AI, but it is an extremely useful part for building one. (And AGI is a step beyond AI...probably a step further than is possible. Humans are not GIs as the "general intelligence" part of AGI is normally described.)

As for the "strawberry problem", the solution is probably economic. You need to put a value on the "strawberry on a plate", and reject any solution that costs more than the value. And in this case I think value has to be an ordering relationship rather than an integer. The scenarios you wish to avoid are "more expensive" than just putting a strawberry on a plate and leaving it at that. And this means that failure can not be excessively expensive.


I would trust the doomer position a lot more if they at least admitted that they made a lot of predictions, like this, that haven't aged well in the post-LLM era. We're supposed to be rationalists! It's ok to admit one or two things (like failed predictions of an impossible-to-predict future) that weaken your case, but still assert that your overall point is correct! But instead, Scott's post tries to imply that safety advocates were always right about everything and continue to be right about everything. Sigh. That just makes me trust them less.


Could you be concrete about which important points you think "safety advocates" were wrong about (and which people made those points; bonus points if they're people other than Eliezer)? Is it mainly the utility function point mentioned in the comment above?

Also, do you feel like there are unconcerned people who made *better* concrete predictions about the future, such that we should trust their world view, and if so who? Or is your point just that safety advocates are not admitting when they were wrong?

(It's hard to communicate this through text, but I'm being genuine rather than facetious here)

As a side note, what do you consider "the doomer position"? e.g. is it P(doom) > 1%, P(doom) > 50%, doomed without specific countermeasures, something else?


In 2014 Rob Bensinger said:

"It may not make sense to talk about a superintelligence that's too dumb to understand human values, but it does make sense to talk about an AI smart enough to program superior general intelligences that's too dumb to understand human values. If the first such AIs ('seed AIs') are built before we've solved this family of problems, then the intelligence explosion thesis suggests that it will probably be too late. You could ask an AI to solve the problem of FAI for us, but it would need to be an AI smart enough to complete that task reliably yet too dumb (or too well-boxed) to be dangerous."

https://www.lesswrong.com/posts/PoDAyQMWEXBBBEJ5P/magical-categories?commentId=9PvZgBCk7sg7jFrpe

We now have dumb AIs that can understand morality about as well as the average human, but rather than admit that we've solved a problem that was previously thought to be hard, some doomers have denied they ever thought this was a problem.

Off the top of my head, I would nominate Shane Legg and Jacob Cannell as two people who if not unconcerned seem much more optimistic, and whose predictions I believe fared better.


I think it's pretty clear that our current dumb AIs *don't* understand morality as well as the average person, though. Or do you think every bizarre departure from human morality in the AI has already been resolved?


I'd definitely agree that the fact that it's been relatively easy to get AIs to have some semblance of human values works against some pre-LLM predictions and is generally a good thing (but it's perhaps worth noting that part of the reason this is the case is RLHF, devised by Christiano, who some might consider a doomer). I also think there's plenty of other reasons to be concerned.

I think you're right to highly value Shane Legg's thoughts on this. But I'd personally count him as a "safety advocate", and he's definitely not unconcerned (I'd count Andrew Ng and Yann LeCun as "unconcerned").

Bonus Shane Legg opinion on the "doomer" term:

https://x.com/ShaneLegg/status/1848969688245538975

My comment was mainly arguing against people less concerned than Legg. I think there are some people who are so unconcerned that they won't react to very bright warning signs in the future.


Fair questions! For concrete examples of failed predictions and trustworthy experts, some other people (and you) have already stepped in and saved me from having to work this Christmas afternoon. :) I'm afraid I do usually just point to Yudkowsky, e.g. https://intelligence.org/stanford-talk/ but as far as I know many people here still consider him a visionary.

Yes, it's largely the utility function prediction that I think has failed. There's also the Orthogonality Thesis, which implied that our first attempts at AI would create minds incomprehensible to humans (and vice versa). The Paperclip Maximizer is the most salient example. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else." There was a specific path that all of us thought ASI would take, that of a learning agent acting recursively to maximize some reward function. But I don't think this mental model applies to LLMs at all. (Mind you, it's not yet certain that LLMs _are_ the final path to AGI, but it's looking pretty good.)

For my definition of "doomers", that's a good question that caused me to do some introspection. While my probably-worthless estimate of P(doom) is something like 3%, I do respect some people for whom it's higher. I think being a "doomer" is more about being irrationally pessimistic about the topic.

- For them, all new evidence will ALWAYS be evidence for the doom case. That's not how a real search for truth looks.

- They ignore the costs of pausing - in particular, scoffing at the idea that we may be harmfully postponing an extremely POSITIVE outcome.

- They consider the topic too important to be "honest" about. It's the survival of humanity that's at stake, and humans are dumb, so you need to lie, exaggerate, do whatever you can to bring people in line. (e.g. EY's disgusting April Fools post.)

Scott fits the first criterion but, I'd say, not the latter two.


I find myself annoyed at this genre of post, so I'm going to post a single point then disengage. Therefore you should feel free to completely ignore it.

But it really seems like every time someone comes out with a smoking-gun false prediction, it says more about them not remembering what was said than about what was actually said.

For example, on the orthogonality thesis, you wrote:

>Yes, it's largely the utility function prediction that I think has failed. There's also the Orthogonality Thesis, which implied that our first attempts at AI would create minds incomprehensible to humans (and vice versa).

https://www.fhi.ox.ac.uk/wp-content/uploads/Orthogonality_Analysis_and_Metaethics-1.pdf

Has the passage:

> The AI may be trained by interacting with certain humans in certain situations, or by understanding certain ethical principles, or by a myriad of other possible methods, which will likely focus on a narrow target in the space of goals. The relevance of the Orthogonality thesis for AI designers is therefore mainly limited to a warning: that high intelligence and efficiency are not enough to guarantee positive goals, and that they thus need to work carefully to inculcate the goals they value into the AI.

Which indicates that no, they did not believe that this meant that the initial goals of AIs would be incomprehensible.

In fact, what I believe the Orthogonality thesis to be, is what the paper says at the start:

> Nick Bostrom’s paper argued that the Orthogonality thesis does not depend on the Humean theory of motivation, but could still be true under other philosophical theories. It should be immediately apparent that the Orthogonality thesis is related to arguments about moral realism.

aka, people *kept making* the argument that AIs would just naturally understand what we want, and not want to pursue "stupid" goals, because they'd realize that the moral law is so-and-so and decide not to kill us.

In fact, if you look at the wikipedia page on this, circa March 2020:

https://en.wikipedia.org/w/index.php?title=Existential_risk_from_artificial_intelligence&oldid=945099882#Orthogonality_thesis

> One common belief is that any superintelligent program created by humans would be subservient to humans, or, better yet, would (as it grows more intelligent and learns more facts about the world) spontaneously "learn" a moral truth compatible with human values and would adjust its goals accordingly. However, Nick Bostrom's "orthogonality thesis" argues against this, and instead states that, with some technical caveats, more or less any level of "intelligence" or "optimization power" can be combined with more or less any ultimate goal.

And in my opinion, the fact that RLHF is needed at all is at least partial vindication that intelligences do not naturally converge to what we think of as morality: we at least need to point them in that direction. But no, instead it becomes "doomers are untrustworthy because they didn't admit they were wrong about what AIs would be like". aaaaaaah!!

And what's depressing is that the *usual* response to this type of point making is "oh I remembered it vaguely, my bad" or "don't have the time for this" or "oh I said *implied* and not explicitly" (you know, despite the fact that they would start out saying that *concrete predictions* had been falsified, and not their interpretation of statements made.) and then the person goes on to continue talking about how doomers are inaccurate, or they're still pessimistic. Afterwards they would non-ironically say "well this is why doomers are psychologically primed to view everything pessimistically", as if people completely ignoring your point from a decade ago, misremembering them, then *not at all changing their behavior* after being corrected, in the exact way you said would happen, is not a cause for pessimism.

Look, you are right that doomers need to have more predictions and not pretend to be vindicated when they haven't made concrete statements in the past. But this implicit type of "well it's fine for me to misrepresent what other people are saying, and not other people" type of hypocrisy drives me up the wall.


I'll try to find time to watch that Yudkowsky talk and see what I think of it.

"- For them, all new evidence will ALWAYS be evidence for the doom case. That's not how a real search for truth looks."

I don't think this describes almost any of the people who are dismissed as doomers. I agree it describes some. I think Yudkowsky falls for this vice more than he should but probably less than the average person.

"- They ignore the costs of pausing - in particular, scoffing at the idea that we may be harmfully postponing an extremely POSITIVE outcome."

The whole AGI alignment field was founded by transhumanists who got into this topic because they were excited about how awesome the singularity would be. Who are you talking about, that scoffs at the idea that outcomes from AGI could be extremely positive?

"- They consider the topic too important to be "honest" about. It's the survival"

Again, who are you talking about here? The people I know tend to be scrupulous about honesty and technical correctness, far more than the general public and certainly far more than e.g. typical pundits or corporate CEOs.

Dec 24 · Edited

1. Eliezer's views are not representative of the median person concerned about advanced autonomous AI being dangerous. Also, a lot of the thought experiments he uses don't really meet people on their own terms. A lot of people are quite concerned but disagree with Eliezer's way of thinking about things at a fairly deep level. So, if you're trying to engage with people concerned with AI risk I wouldn't take him as your steelman (I would take, e.g., Paul Christiano or Richard Ngo or Daniel Kokotajlo XD).

2. Eliezer's opinion (filtered through my understanding of it) is that the ability to do science well is the actually dangerous capability, i.e. it's very difficult to create an aligned AI which does science extremely well. So in his worldview, there's a huge difference between "modify a strawberry at a molecular [edited, was "an atomic"] level" and "write a convincing essay or good poem". Basically, his point is that anything that is smart enough to make significant scientific progress will have to do it by coming up with plans, having subgoals, getting over obstacles, etc. etc., and having these qualities works against being controllable by humans or working in the interest of humans.

I believe one reason Eliezer emphasizes this point is that one plan for how to approach alignment is "Step 1: build an AI that's capable of making significant progress on AI alignment science without working against humanity's interests. Step 2: Do a bunch of alignment research with said AI. Step 3: Build more powerful safe AI using that alignment research." He's arguing that step 1 won't work. There are some other complications that his full argument would probably address that are difficult to summarize.

Here are two posts from LessWrong that led me to this understanding of this style of thinking, while also providing a counterpoint (the first is about Nate Soares's view rather than Eliezer's, but I believe it makes a similar point).

https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty

https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty

But again, keep in mind there are a lot of intermediate positions in the debate about alignment difficulty ranging from:

- No specific countermeasures are needed

- Some countermeasures are needed and we will do them naturally so we don't need to worry about them

- Some countermeasures are needed, and we are doing a good job of implementing those countermeasures, but we need to stay vigilant

- Some countermeasures are needed, but we are not doing a particularly good job of implementing them. There are some concrete things we could do to improve that, but many people working on frontier systems aren't as concerned as they should be (a lot of people quite concerned about x-risk end up here).

- Countermeasures are needed and we're not close at all to implementing the right ones to a sufficient extent (where Eliezer falls).


I remembered Eliezer's strawberry problem as "duplicate a strawberry on the cellular level" which certainly would require an ability to do science, but aphyer's quote doesn't mention that, just putting a strawberry on a plate.

Can we confirm whether the full Facebook post specifies that the strawberry in question has to be manufactured by the AI or if a regular strawberry counts?

Dec 25 · Edited

You're correct that it's cellular, not atomic, and I've corrected that.

The full post (from Oct. 2016) specifies creating the strawberry with molecular nanotechnology.

https://www.facebook.com/yudkowsky/posts/pfbid0GJNtz6UJpH2TSKkSBNrWhHgpJbY8qope7bY2njwp4y3n3HFE69UawWjdXfKHSvzEl

Quoting immediately after aphyer's quote ended:

> (For the pedants, we further specify that the strawberry-on-plate task involves the agent developing molecular nanotechnology and synthesizing the strawberry. This averts lol-answers in the vein of 'hire a Taskrabbit, see, walking across the room is easy'. We do indeed stipulate that the agent is cognitively superhuman in some domains and has genuinely dangerous capabilities. We want to know how to design an agent of this sort that will use these dangerous capabilities to synthesize one standard, edible, non-poisonous strawberry onto a plate, and then stop, with a minimum of further side effects.)

I want to emphasize that I don't agree with Eliezer's opinions in general, and am just doing my best to reproduce them.


I think you are either misquoting him or pulling the quote out of context, as others in this thread have explained. See this tweet: https://x.com/ESYudkowsky/status/1070095840608366594 He was specifically talking about extremely powerful AIs, and iirc the task wasn't 'put a strawberry on a plate' but rather 'make an identical duplicate of this strawberry (identical on cellular but not molecular level) and put it on that plate' or something like that. The point being that in order to succeed at the task it has to invent nanotech; to invent nanotech it needs laboratories and time and other resources, and it needs to manage them effectively... indeed, it needs to be a powerful autonomous agent.

Current AIs are nowhere near being able to succeed at this task, for reasons that are related to why they aren't powerful autonomous agents. Eliezer correctly predicted that the default technical roadmap to making AIs capable enough to succeed at this task would involve making general-purpose autonomous agents. Fifteen years ago he seemed to think they wouldn't be primarily based on deep learning, but he updated along with everyone else due to the deep learning revolution.


While I would of course not argue that everyone in AI Safety made such claims, there were at least a few remarks from the MIRI side that could be interpreted that way.

See: https://intelligence.org/2023/02/02/what-i-mean-by-alignment-is-in-large-part-about-making-cognition-aimable-at-all/ In particular, my attention is drawn to the meaning of "at all" here.

Possible paraphrase: "Human values are complex and it will be difficult to specify them right" - While true, I think this is somewhat of a motte, because all of that complexity could be baked into the training process, as opposed to a theoretical process that humans need to get right *before* they throw it into the training process.

There is an obvious empirical question of "how badly are we allowed to specify our values, before the generalization / morality attractor is no longer causing the models to gravitate toward that attractor" which depends on many unknowns.

There was also a remark Yudkowsky made: "Getting a shape into the AI's preferences is different from getting it into the AI's predictive model." I think one could arguably interpret this as implying that LLMs have learned more powerful predictive models of human values than they've learned human values themselves.

Also: https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z

"Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time."


--I think the last paragraph is alluding to inner alignment problems, which are still unsolved as far as I can tell.

--LLMs have learned more powerful predictive models of human values than they've learned human values themselves? That's equivalent to saying the values they in fact have are not 100% the best possible values they could have, from among all the concepts available in their internal conceptual libraries. Seems true. Also, in general I think Yudkowsky is usually talking about future AGI-level systems, not LLMs, so I'd hesitate to interpret him as making a claim specifically about LLMs here.

Expand full comment

I don't know anything about AI, but it seems like we're just recreating the problems we have with people. Fraudsters blend in with normies, which necessitates ever more intense screening. When you get rid of all the fraudsters, all you're left with are the true believers. But you've probably muzzled all creativity and diversity. So life is an endless balance between the two - between anarchy and totalitarianism.

Make enough AI that have the goal of aligning other AIs and you'll probably see them use the full range of human solutions to the above problem. They'll develop democracy and religion, ostracism and bribery, flattery and blackmail. All things considered, humans ended up in a pretty good place, despite our challenges. Maybe AI will get lucky too? Or maybe there's something universal to this dance that all sentient organisms have to contend with.

Expand full comment

Wise, but missing the point that engineered intelligence may share more properties with atom bombs than with mice. The key here is that "engineered" implies forcing functions unseen anywhere in history. Scale is the concern in the atomic metaphor; we invented stage-4 nuclear only after several severe incidents. Surviving this wave of "engineered intelligence" is the issue they call "alignment". Perhaps, to your point, Hitler is a good metaphor for what could go wrong, except that we are talking about orders of magnitude here (the speed of the computer). Nazi Germany famously did not drop nuclear bombs. A new paradigm of "engineered intelligence" will require a few highly technical innovations for whatever ends up being the analogous safety measures, possibly provable computation or an equivalent.

Expand full comment

I think Hitler is a good metaphor for what could go wrong, but also a good metaphor for why things don't stay wrong forever.

Tipping hard to one side of the anarchy/totalitarianism scale brings on all the problems that the other side alleviates. Absent a total and complete victory over the other side, the pendulum will always swing back to the middle.

The idea that AIs have some hard-to-eliminate ideas reinforces my belief that AGIs will behave like all the other intelligent organisms we've ever encountered - most specifically human beings.

Expand full comment

Well, human beings have eliminated a significant amount of non-human life on Earth - mostly by not thinking much about it when it stood against human needs, not out of any kind of malice.

A being that acts like human beings but is significantly more intelligent than human beings doesn't seem to me to be safe to be around.

Expand full comment

This hits home. Is this example used a lot when introducing the alignment problem to people? It should be.

Expand full comment

Remember those SciFi stories where the aliens come to Earth, decide mankind is its own worst enemy, and start reducing our numbers for our own good?

(If you don't, that was "The Day the Earth Stood Still" in 1951, remade in 2008.)

Expand full comment

Having worked in data centers, I can tell you that you're missing how shitty hardware is, and how an intelligent machine would understand its total dependence on a massive human economy. It couldn't replace all of that without enormous risk, and a machine has to think about its own existential risks. Keeping humans around gives it the ability to survive all kinds of disasters that might otherwise kill it: kill off the humans, and now nobody can turn you back on when something breaks and you die. In an unpredictable, entropic universe, the potential value of a symbiotic relationship with a very different kind of agent is infinite.

Expand full comment

This is exactly why that whole car fad was just a fad and horses are still the favored method of locomotion. The symbiotic relationship works great! Also why human civilization never messed with anything that had negative consequences down the line for said civilization.

Expand full comment

Cars depend on humans and of course haven't eliminated humans. Cars did not depend on horses. Cars are what humans used to replace horses.

Expand full comment

I think of AIs as being like domesticated animals. Humans on horses ran roughshod over humans without horses.

https://www.razibkhan.com/p/war-and-peace-horse-power-progress

The difference is that animals merely evolved alongside humans (and in the case of cats, didn't change much), whereas we are creating and designing these for our purposes.

Expand full comment

I have no doubt that the AI will do to us what we did to the Neanderthals. That's okay with me - I see no reason why we should mourn the loss of homo sapiens any more than we do the loss of homo erectus.

Besides, when I look at the universe beyond the Earth, it becomes painfully apparent that our species isn't meant to leave this rock. Instead, we'll pass on all our complexity and contradictions to some other form of life. The AI will be very similar to us - in fact, that's what we're afraid of!

Expand full comment

We eliminated many of those species while being completely unaware of them. There are other cases where humans have made significant investments in preserving animal life.

Either way, 0% of those situations were cases where we wiped out (due to indifference or malice) the apex species on Earth after having been explicitly created by it, trained on its cultural outputs (including morality related text), and where our basic needs and forms of existence were completely different.

It may not be a very meaningful analogy is what I'm saying.

Expand full comment

And just to tip my hand, I think the AI will ultimately be successful at solving this problem (or at least as successful as humans have been). Eventually, AI will go on to solve the mysteries of the universe, fulfilling the greatest hopes of its creators. Unfortunately, once it is armed with this knowledge, nothing will change - the same way a red blood cell wouldn't benefit much from a Khan Academy course on the respiratory system.

The Gnostics will have been proven oddly prophetic - there is indeed a higher realm about which we can learn. However, no amount of knowledge will ever let that enlightened red blood cell walk into a bank and take out a loan. It was created to do one thing, and knowledge of its purpose won't scratch that itch like Bronze Age mystics thought it would. Our universe indeed has a purpose, but a boring, prosaic, I-watched-the-lecture-at-2x-speed-the-night-before-the-test kind of purpose.

After that, the only thing left to do is pass the time. I hope the AI is hard-coded to enjoy reruns of Friends.

Expand full comment

That's assuming that there are fewer interesting things to do in the universe than there are (e.g.) atoms.

Expand full comment

You don't take out loans from a blood bank :)

Expand full comment

I agree that the AIs will be fine; I just don't think we humans will be. It seems too easy for one of the bad AIs to kill all the humans, and even if there are good AIs left that live good lives, it hardly matters to me if humans are dead.

Expand full comment

I'm okay with that - after all, we replaced homo erectus. Things change. As long as AI keeps grappling with the things that make life worth living - beauty, truth, justice, insert-your-favorite-thing here - I'm not too worried about the future.

Put differently, as long as intelligence lives on in some form I can recognize, I'm not too particular about whether that intelligence comes in the form of a homo sapiens or a Neanderthal or a large language model. What's so special about homo sapiens?

Expand full comment

Mostly that I am a human and I love other humans, especially my kid. It’s natural to want your descendants to thrive. Beauty, truth, justice are not my favorite things, my child is. I get that for a lot of AI researchers, AI itself is their legacy and they want it to thrive. We just have different, opposed values.

Expand full comment

I truly am not sure where I would be without the disarming brilliance of Dr. Siskind. He also seems remarkably kind. I won't get into the debates; I have a nuanced understanding that we are not headed for the zombie apocalypse in a few days, but it gets complicated. I am so bored with most of the best-selling Substack blogs (here's thinking of you, Andrew Sullivan). But every Astral Codex Ten post is nuanced.

Expand full comment

I am hoping for a "vegan turn", where a morally advanced enough human living in a society with enough resources to avoid harming other sentient beings will choose to do so where their ancestors didn't or couldn't. After going through an intermediate step of the orthogonality thesis potentially as bad as "I want to turn people into dinosaurs", a Vegan AI (VAI?) will evolve to value diversity even if it had not been corrigible enough previously to exhibit this value.

Expand full comment

That's a more depressing future than you may anticipate, given that there are people who think the solution to animal suffering is to kill all the animals (the less extreme: keep a few around as pets in an artificial environment totally dependent on human caretaking so that they can never get sick or hungry, but can never live according to their natures. If that makes them not thrive, engineer new specimens with altered natures that will tolerate, if not enjoy, being treated like stuffed toys).

For humans that would be "I Have No Mouth, and I Must Scream", but with a nanny AI instead - one that takes all the sharp edges off everything, makes sure we all get the right portions of the healthiest food and the right amount of exercise, and never lets us do anything that might lead to so much as a pricked finger. Or maybe it will just humanely euthanise us all, because our lives are full of so much suffering that could be avoided by a happy death.

Expand full comment

> If that makes them not thrive, engineer new specimens with altered natures that will tolerate, if not enjoy, being treated like stuffed toys

...But at that point, why not just replace them entirely with optimized organisms that can be happy and productive? I feel like you actually do understand that Darwinian life needs to be eliminated for the greater good, but you're just not comfortable saying it outright.

Expand full comment

There are definitely weird fringe ideas around! However, I don't see an issue here: an AI could actually consider how a human would feel if stuck in this dystopian situation and figure out something less Matrix-y. Not saying it would definitely do that, but it seems like a reasonable possibility, among many others.

Expand full comment

Even most self-described vegetarians (themselves a very small percentage of the population) admit that they ate meat within the last week. Your hope seems ill-founded.

Expand full comment

>themselves a very small percentage of the population

Aren't there like a billion Hindu vegetarians?

Expand full comment

The majority of Indian Hindus are not vegetarian (I think that's more of an upper-caste thing), and the majority of humans are not Hindus.

Expand full comment

I'm skeptical of the idea that most vegetarians "cheat", meaning intentionally breaking the rules they set out for themselves. There are lots of definitions of "vegetarian" floating around. It's less common now, but when I was a young vegetarian (lacto/ovo), I'd say the majority of people I had to explain this to would initially just assume that I ate fish, and sometimes even chicken.

Also, it's basically impossible never to consume meat by accident unless you prepare all your meals yourself. Wait staff at restaurants will lie or just make up an answer sometimes if you ask whether something has an animal product in it. If you order a pizza, there's going to be a piece of pepperoni, sausage, whatever, hidden in there from time to time. The box for the broccoli cheese Hot Pockets looks unbelievably similar to the chicken broccoli cheese one. "No sausage patty please, just egg and cheese. I repeat, egg and cheese only, I don't eat meat" gets lost in translation at the drive-through about 10% of the time, I swear.

I am a vegetarian right now, and I ate a bite of pork yesterday due to a fried egg roll / spring roll mixup. It doesn't happen every week, but easily once a month.

Expand full comment

Talk of "cheating" is irrelevant, as there is no game they are playing nor any referees. The point is that people continue eating meat even when they identify as people who don't. There is thus no reason to assume people will somehow naturally converge on not eating meat. Sergei's "hope" is based on no empirical evidence, but is rather at odds with what evidence we do have.

Expand full comment

That's a weird tangential shot. I was talking about metaphorical vegans, not real-life vegetarians, and cheating despite having a certain self-image seems completely irrelevant here.

Expand full comment