435 Comments
Jul 17, 2023·edited Jul 17, 2023

Dostoevsky was an antisemite, Martin Luther King cheated on his wife and plagiarized, Mother Teresa didn't provide analgesics to patients in her clinics, and Singer's been attacked for his views on euthanasia, which actually further strengthens your point.

Expand full comment
author
Jul 17, 2023·edited Jul 17, 2023 · Author

That's why you average them out! 3/4 weren't anti-Semites, 3/4 didn't cheat on their wives, etc!

Expand full comment

That's a very interesting point, and very computer-logic-y, which of course is appropriate here. I wonder how well it works in practice, I guess we'll find out!

As you point out, morals are a moving target, but then they may update the AI!

Expand full comment
author
Jul 18, 2023·edited Jul 18, 2023 · Author

Tbf I'm mostly joking (though only mostly). I would hope that some of the other things I said - like "if they knew all true facts" or "if they'd had all human experiences" would help (also, I'd guess MLK didn't endorse other people cheating on their wives, just had weakness of will) so we don't need to rely on averages.

This whole section was pretty speculative and realistically I hope we have better options than doing anything like this at all.

Expand full comment

I want to throw out there:

*Very weak idealized person conjecture* Imagine you took each broadly secular person in a first world country, and made them into the ideal person they would most like to be. Presumably this includes knowledge, strength of will, intelligence, adherence to their own virtues etc. You then reiterated and made that person into *their* idealized person etc. People would mostly converge in their ethical values under this procedure.

*Weak idealized person conjecture* Imagine you took each person alive today, and made them into the ideal person they would most like to be. Presumably this includes knowledge, strength of will, intelligence, adherence to their own virtues etc. You then reiterated and made that person into *their* idealized person etc. People would mostly converge in their ethical values under this procedure.

And

*Strong Idealized person conjecture* Imagine you took each person who ever lived, and made them into the ideal person they would most like to be. Presumably this includes knowledge, strength of will, intelligence, adherence to their own virtues etc. You then reiterated and made that person into *their* idealized person etc. People *throughout all of history* would mostly converge in their ethical values under this procedure.

I find myself fascinated by the question of whether it's true or not, because if it is true, it gives a kind of intersubjective basis for ethics. I'm skeptical, yet hopeful. I'd like to hear others' views. Apologies for the self-promotion, but it's relevant to the project I outline here:

https://philosophybear.substack.com/p/a-defense-of-ethical-subjectivism

Expand full comment

I have my doubts people would converge. Take an evangelical Christian, your average Slate Star Codex rationalist, and your average stereotypical SJW--do you think they'd converge into similar things? They have different virtues (piety, intelligence, and empathy), for one. Normies would probably just want to be richer and stronger or cuter.

Go back through history, and a 17th-century European and a 20th-century European would end up on very different trajectories, if they converged to anything at all. Now let's throw China vs Europe in there--are we trying to be filial Confucians or good Christians?

Expand full comment

Some of those would converge in the adherence maximalists dying.

Expand full comment

Yeah, I think the strong conjectures are implausible for this reason. But what if we add a caveat:

"Before the process starts, the participant is broadly informed about how the world works, and has all relevant false views (e.g. about religion, metaphysics, society) removed."

Expand full comment

"Take an evangelical Christian, your average Slate Star Codex rationalist, and your average stereotypical SJW--do you think they'd converge into similar things?"

Interesting example you chose, because if you replace Evangelical Christianity with Catholicism*, this is basically my values system already. If a mere human can pull elements from such diametrically opposed worldviews and synthesize them, I don't see why a superintelligent AI would have a problem with it.

*I'm no longer a practicing or believing Catholic, but I was raised in a Catholic household and attended Catholic school growing up, and I'd be lying if I said Catholicism didn't still have some influence on my worldview.

Expand full comment

> Take an evangelical Christian, your average Slate Star Codex rationalist, and your average stereotypical SJW--do you think they'd converge into similar things?

In isolation, probably not. But if you add the interim step of "these idealized selves are allowed to interact with each other and attempt to compassionately advocate for their values" in each interaction, I have full confidence that they would. Some of the best moral values I've seen happen at the idealized intersection of two of those (SJW and Christianity largely result in the best of both, SSC and SJ gives you tpot, and even SSC and religion gives unique moral insight that you see from some posters here).

Expand full comment

What about people who believe in "staying in your lane" and "finding peace in what is, not what could be"? The very act of becoming the idealized version is antithetical to their values.

Expand full comment

I really doubt this, though I do think you'd generally get convergent clusters. It's just that the clusters would be quite distinct. (I don't think they'd generally quite have a null intersection, but close.) And this is true for all of your levels of the conjecture.

Expand full comment
founding
Jul 18, 2023·edited Jul 18, 2023

This sounds similar to Yudkowsky's Coherent Extrapolated Volition from 2004-ish?

https://intelligence.org/files/CEV.pdf

(It seems worth mentioning that he wrote in List of Lethalities that he does not think this is a workable alignment strategy for 2022, among other reasons because he doesn't think we have enough time left, now, to get it right on the first try.)

Expand full comment

Do you think a psychopath and a neurotypical person will converge in their values? Do you think people at opposite ends of personality traits (e.g. the Big Five) will converge?

Expand full comment

I don't know, I agree it's speculative and you really can't know.

But would knowing all true facts or having all human experience really help? I mean, we can't even imagine what that would be like; most people can't pass an ideological Turing test as someone from their own country with the opposite views. I don't know that it would really be better.

I guess when you get down to it, I'm a conflict theorist--I really think groups of people have disparate interests that (sometimes) can't be reconciled, and what's good for group A is often bad for group B, so it's all about forming the biggest coalition. Or in this case, making sure people like you are aligning the AI. ;)

Expand full comment
author
Jul 18, 2023·edited Jul 18, 2023 · Author

I don't know much about these people in particular. But just to give examples:

- I don't know why Dostoevsky was anti-Semitic, but it might have helped to know that there was no Jewish conspiracy. If he was anti-Semitic because the Jews killed Christ, it might help to know whether Christianity was false (or, if Christianity is true, it might help to know what Jesus would have thought of anti-Semitism!). Or to have the experience of being a random Jew of his own time, suffering from pogroms, in a way viscerally unrelated to whether your ancestors did or did not kill Christ thousands of years ago.

- I don't know why Mother Teresa was against anaesthetic (I think something something Catholicism), but it might have helped to know that Catholicism was false, or to have the experience of being a person in pain who doesn't have access to anaesthetics.

- I don't know exactly what Singer's controversial views about euthanasia are. I support people having the option of euthanasia, and I think people who are against that might change their minds if they had the experience of having to be in terrible pain while they were dying. Alternately, maybe euthanasia is frequently misused (eg given to people who would otherwise survive to eliminate them as an "inconvenience"), and if so, knowing that true fact would change my opinion on it (and if not, knowing that true fact might change other people's opinions).

I don't want to say that for sure this trick always works, but I think it would work in a lot of cases, especially for people like the ones I named who seem to start from a place of genuinely caring about others.

I talk a little more about this at https://slatestarcodex.com/2018/07/18/fundamental-value-differences-are-not-that-fundamental/

Expand full comment

I think you're probably right about Dostoevsky, MLK, Teresa, and Singer, so I'll concede that point. I guess I think at the fine-structure level even relatively altruistic people like those might diverge on what they think the best thing is (Teresa and Dostoevsky were big Christians, so if they knew Christianity was false, what would they believe in instead? Would they still be moral exemplars?). For that matter, why pick those? (Though with your characteristic even-handedness you did pick a pair of lefties and a pair of righties. I wonder if the Silicon Valley guys or the lefties eager to regulate them would be so generous.)

I also wonder how much human values that are good are going to persist being good once you feed them into the AI's ultimately alien value system--Repugnant Conclusions and the like.

Expand full comment

>I don't know why Mother Teresa was against anaesthetic (I think something something Catholicism)

She wasn't against them. This is a myth started by Christopher Hitchens quote-mining reports from people who visited her hospices in the 90s, when almost no hospitals or hospices in India used strong painkillers because government regulations had made them very difficult to legally acquire.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Mother Teresa probably thought that suffering elevated the soul, so my guess is that sharing it wouldn't have changed her views at all (or she might have taken analgesics on the side while still pretending to suffer fully and embracing it; the strange cognitive dissonances of saintly figures are pretty hard to grasp, like how someone can seemingly genuinely believe they have stigmata sent from God and also burn their own skin voluntarily with acid).

Convincing her somehow that Catholicism is false, whatever that even means, would have likely just straight up nuked her sense of identity and purpose. For all we know next thing she'd have shown up at a cocaine fuelled orgy in full existential crisis.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

The Mother Teresa criticism is nonsense by the way, and nothing to do with Catholicism. India just didn't have palliative care (at all, for anyone in the country) until the very end of her life. She wouldn't have been able to get hold of morphine even if it had occurred to her to go against the practice of every doctor in India anyway; it just wasn't available, again until very late in her career.

I see someone else has already mentioned this. Here’s another good link debunking this nonsense: https://www.reddit.com/r/badhistory/comments/gcxpr5/saint_mother_teresa_was_documented_mass_murderer/

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

"I don't know why Mother Teresa was against anaesthetic (I think something something Catholicism)"

More to do with "something something atheism", Christopher Hitchens got his big break doing a very critical overview of Mother Teresa and her work. And of course, now it's the Internet version of "everybody knows" this happened, plus some Hindu nationalists, who aren't too happy about being portrayed as "a Western Christian had to do for your poor what your own people and government wouldn't do", doing some encouraging of that image and some propagandising of their own.

Here's a piece refuting the accusations. You can take it or leave it.

https://catholicismcoffee.org/refuting-3-accusations-against-st-theresa-of-calcutta-9b4df0391917

"2) “Mother Teresa withheld painkillers from the dying with the intent of getting them to suffer”

This accusation originates from Hitchens presenting an article published by Dr. Robin Fox in the Lancet. What is not mentioned, however, is how Dr Fox prefaced his article: he went on to note that he valued Mother Teresa's hospices for their "open-door policy, their cleanliness, tending of wounds and loving kindness", he further went on to add that "the fact that people seldom die on the street is largely thanks to the work of Mother Teresa and her mission" and that most of "the inmates eat heartily and are doing well and about two-thirds of them leave the home on their feet".

Dr Fox goes on to note that most of the inmates present in MoC hospices were rejected by local hospitals in Bengal; only then does he criticise the organisation for "the lack of strong analgesics and the lack of proper medical investigations and treatments". The latter can be explained by the fact that the MoC ran hospices with nuns who had limited medical training and that the nuns had to make decisions to the best of their abilities (there were doctors who voluntarily visited these hospices, but this was only once or twice a week). The context of all this should be kept in mind: they lacked modern health algorithms, and the people in these hospices who needed care were refused admission by hospitals. To someone rejected by a hospital, it was either struggle on the streets and die, or visit the local hospice run by the MoC and get some semblance of assistance.

However, the accusation that she "withheld" painkillers is false. Dr Fox notes that weak analgesics (like acetaminophen) were used to alleviate pain; what was lacking were stronger analgesics such as morphine. The wording used here is crucial to understanding the situation: he notes "a lack of painkillers" without stating its cause, and he doesn't mention that St Teresa was withholding them with the intent of making people suffer more.

Dr. David Jeffrey, Dr. Joseph O’Neill and Ms. Gilly Burn, founder of Cancer Relief India, responded to Fox on the Lancet about his criticisms. They noted 3 difficulties with regard to pain control in India.

- Lack of educated Doctors and Nurses

- Few drugs available that alleviated pain

- Strict state government legislation, which prohibits the use of strong analgesics even to patients dying of cancer

They went on to say “If Fox were to visit the major institutions that are run by the medical profession in India he may only rarely see cleanliness, the tending of wounds and sores, or loving kindness. In addition, analgesia might not be available.”

The Indian Government had been gradually strengthening its opium laws post-Independence (1947), restricting opium from general and quasi-medical use. Starting from the "All India Opium Conference 1949", there was rapid suppression of opium between 1948 and 1951 under the Dangerous Drugs Act, 1930 and the Drugs and Cosmetics Act, 1940.

In 1959, the sale of opium was totally prohibited except for scientific/ medical uses. Oral opium was the common-man’s painkiller.

India was a party to three United Nations drug conventions:

— The 1961 Single Convention on Narcotic Drugs

— The 1971 Convention on Psychotropic Substances

— The 1988 Convention against Illicit Traffic in Narcotic Drugs and Psychotropic Substances

All of which finally culminated in the 1985 Narcotic Drugs and Psychotropic Substances Act, which was ultimately responsible for the drastic reduction of medicinal opioid use in India even for a lot of hospitals.

Furthermore, palliative medicine only appeared in India during the mid-1980s, with the first palliative hospice in India being Shanti Avedna Sadan in 1986. To add on, palliative training aimed at medical professionals appeared in India during the early 1990s. The Narcotic Drugs and Psychotropic Substances Act of 1985 came right around the time palliative care had started in India and was a massive blow to it, because it limited the amount and type of medical substances hospitals could use for pain alleviation (a large part of palliative care is pain management, which requires opioids such as morphine)"

And yeah: I think there were a lot of failings in how she did things, but at the end of the day - here's an Albanian woman in India taking the dying and destitute off the streets whom nobody gives a shit about, taking care of them as best she can with attitudes moulded by being trained in the 40s and 50s. Yes, she didn't do it the modern way. But did Christopher Hitchens go out and start up his own Modern Painkiller Giving Clinics? No, he went home to his nice Western life and did a hit piece. Because it's always somebody else's fucking job to pick up the dirty and the dying, isn't it?

I think if it's a choice between "get a clean bed and weak painkillers" and "lie in the gutter", then many people will pick the option where even if it's not great, it's better than nothing. And that is what we are talking about here - not "didn't do it to the standards of the late 20th century advanced society" but "this or literally nothing".

EDIT: I will say this for you Rationalists and Effective Altruists, and what gets you a lot of credit and forbearance from me, is that you *do* put your money where your mouth is. You *did* go out and buy mosquito nets. You didn't do the Hitchens Hit Piece Not My Job Fuck The Poor thing. And that's the enormous difference right there.

Expand full comment

It seems to me that sometimes a coalition could come about not by converging on something, but by diverging each time a new premise is introduced. For example, say a representative of an indigenous community inhabiting a river valley thinks it is right to preserve the natural path of a river, while a representative of a nearby dry region thinks it is right to alter the river's path to build a dam and create a reservoir to provide water to their community. Both representatives, if rational, could acknowledge that what the other party wants is good for them and not intended to harm anyone, but nonetheless they find themselves in conflict. In this extremely simplistic and contrived situation, assume that as a representative of your group you want what is best for your group, even if you knew everything else.

An opportunity for agreement and coalition presents itself when it is found that an underground reservoir can supply the water without disturbing the river. But this is only the best solution until a third party comes in and says that extracting large amounts of groundwater will cause landslides in their region. So now you have to diverge from the solution initially proposed to take the third party into account. Then you have a new solution, and someone else has an issue with it, and so on. You have to accept that you might have to change your ideal practice or belief every time a new party joins the decision making.

I see this as a capacity to always diverge, which I take to be similar to the capacity to: 1) grow, 2) reason beyond your current situation or history (for example, if you are someone who lived during the period of slavery), or 3) be curious and truth-seeking. If you lived in a period where slavery existed, or where people thought the earth was at the center of the universe, you couldn't have converged on the view that all people should be free or that the sun is at the center of the solar system. In a sense, you would have had to invent that notion by adding an extra generator or basis element (vector) that allowed you to diverge from what was, in your time, taken as the truth.

Also, I wanted to comment on another idea higher up in this thread. Averaging out seems to me able to represent, in a statistical sense and to a certain degree, Kant's idea: act as if the maxim of your actions were to become through your will a universal law of nature.

Expand full comment

"all human experiences" sounds like it includes a Lot of torture? My model of humans does not make positive predictions of their psychology after this?

Expand full comment

Unironically though, people are judged much more harshly for what they did than for what they didn't do. Those people are way better than the rest of us, yet because we are unknown, our sins are small and forgotten as well.

Expand full comment

If only one of these people has a 1% preference for a negative-infinity-value goal, such as global genocide, then averaging it out to a 0.25% preference is still negative infinity.

Expand full comment
author

Depends how you average. If you take the mode for each thing, you're fine.
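
To make the aggregation point concrete, here's a toy sketch (the numbers are invented): one of four exemplars holds an extreme preference, the mean inherits a trace of it, and the mode (or median) drops it entirely.

```python
from statistics import mean, median, mode

# support for some extreme goal, one value per idealized exemplar (made up)
prefs = [0.0, 0.0, 0.0, 1.0]

print(mean(prefs))    # 0.25 -- averaging leaks the outlier preference in
print(median(prefs))  # 0.0
print(mode(prefs))    # 0.0 -- modal aggregation ignores it entirely
```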

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

It does make the problem seem easier. But you need 100% confidence whenever you deal with potential negative infinities. Are you 100% confident that between an antisemite, another antisemite, a Catholic that Wikipedia tells me believed that "[..] the sick must suffer like Christ on the cross", and Singer you can find 2 persons with exactly zero preference for genocide?

Edit: Apologies for calling MLK an antisemite, that was a mistake. I was thinking of Martin Luther.

Expand full comment

Conflating genocide with global genocide seems disingenuous; they're not both infinite, you know?

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

"a Catholic that Wikipedia tells me believed that "[..] the sick must suffer like Christ on the cross"

(sigh) Here I go again, getting into another sectarian fight.

Okay, first, that quote is at second hand:

https://www.theglobeandmail.com/life/the-hot-button/mother-teresa-was-anything-but-a-saint-canadian-study-says/article9317551/

"Mother Teresa was lavish with her prayers, but penny-pinching with the wealth amassed by her foundation, according to Serge Larivée and Genevieve Chenard from the University of Montreal's department of psychoeducation, and Carole Sénéchal of the University of Ottawa's faculty of education.

The beatification of Mother Teresa, which the Vatican completed in October, 2003, is the last step before sainthood.

But according to Larivée and colleagues, the Vatican turned a blind eye to Mother Teresa's "rather dubious way of caring for the sick, her questionable political contacts, her suspicious management of the enormous sums of money she received, and her overly dogmatic views regarding … abortion, contraception and divorce."

Mother Teresa believed the sick must suffer like Christ on the cross, they suggest.

"There is something beautiful in seeing the poor accept their lot, to suffer it like Christ's Passion. The world gains much from their suffering," the journalist Christopher Hitchens reported her as saying.

(Hitchens referred to her as "a fanaticist, a fundamentalist and a fraud.")"

Second, and here's where I get into the muddy gutter to wrassle the pig, they (and by extension you) have no fucking clue about the attitude to suffering. If you're brought up Catholic, you understand the idea of "uniting your suffering with the passion of Christ": it's not "we want you to be in pain and misery", it's "you may be in pain or suffering right now, but that does not have to be meaningless; you can use your pain by accepting it and by uniting it in a spiritual manner with the suffering of the crucified".

https://www.ncregister.com/blog/6-prayers-to-help-unite-your-suffering-to-jesus-christ

Hitchens was ex-Anglican (of the very watery 'more tea, Vicar?' kind of 'religion') and had no idea of Catholic or any other spirituality. The people misquoting second- and third-hand accounts don't understand the background either. It would be like saying "Singer thinks it's okay to drown children" based on the Drowning Child Thought Experiment, we're into that level of misunderstanding.

The real problem here is in that line: "her overly dogmatic views regarding … abortion, contraception and divorce". If she had been one of the trendy nuns going "fuck the pope, abortion is a human right", the Montreal psychoeducationalists would have no problem at all with however she ran the operation. *That's* where the hatred comes from; how dare the world admire this person who held all the Wrong Views on what are our sacred dogmas!

Expand full comment

So when you personally were in exquisite but relievable pain (let's say from a rotten tooth), would you prefer treatment and medication or would you rather accept the pain and thus spiritually unite it with the suffering of the crucified? And if "accepting your pain" is something positive in the Catholic worldview, then is "rejecting your pain" via medication not something negative in comparison? Sure does sound like it is. But if you'll indulge the pig some more, maybe you can explain how I'm wrong again.

More importantly though, would you prescribe your decision to the whole world? Because let's not forget that's what this discussion is about: training AI ethics on some model humans. If criticism against a person has its own wiki page, then to me that would be a hard and fast criterion for exclusion.

Expand full comment

"Hitchens was ex-Anglican (of the very watery 'more tea, Vicar?' kind of 'religion') and had no idea of Catholic or any other spirituality."

At the height of his pro-Iraq-War, anti-Islam phase, Hitchens was quite critical of the Anglican church for this wateriness. He repeatedly ascribed ovine qualities to Rowan Williams and, oddly, almost seemed to yearn for a more muscular native faith as inoculation against the Mohammedan hordes.

Expand full comment

For most purposes I prefer the median over the mode, though in this specific case they yield the same value.

Expand full comment

If Dostoevsky and Mother Teresa happen to quietly agree on, say, antisemitism or banning birth control, while either of the other two would be willing to yield the point in exchange for concessions elsewhere, that could get ugly. I think the fundamental problem is assuming any singular cohesive force of will can or should have unlimited (power over / responsibility for) the moral arc of the universe.

In https://slatestarcodex.com/2014/01/28/wirehead-gods-on-lotus-thrones/ the key difference is static vs. dynamic equilibrium - two or more lotus-throne dwellers might observe the imminent fall of a sparrow or immolation of a galaxy, be momentarily displeased by some detail thereof, and disagree among themselves how best to correct it. That such conflicts are infrequent, soon brought to mutually-satisfactory resolution, and that availability of sufficient resources to implement the agreed-upon solution is never in doubt, are proof of the lotus-throne system's success, but for such proof to be meaningful there must be enough observation, diversity, and specificity of opinion for petty conflicts to be possible.

The wirehead future is repulsive precisely because a singular agent thoroughly implemented modal preferences, with any environmental factors which that distillation didn't consider relevant simply abandoned to rot.

Expand full comment

That's "sort of" my take. The Waluigi effect is probably quite real, but you're operating in an n-dimensional vector space, and you only end up reversing one of the dimensions. (I'm assuming I understand correctly how a TPM operates [at a VERY high level].)

Expand full comment

You need to come up with an article about this! This is a delightful train of thought.

Expand full comment

Tbf, the thing that matters most for an LLM is not what the purported role model actually was like, but what the text data _says_ they are like.

Expand full comment

For what it’s worth, not a single person in India would have received analgesics or any other form of palliative care prior to the mid-80s and even then it would have been very few. It wasn’t even taught in Indian medical schools until 1993, by which time Mother Teresa was at the very end of her career. Bit much to expect a non-medically-trained nun to do better than that, no? She literally couldn’t have given morphine anyway, it wasn’t available in India until 1988 and even then was incredibly tightly regulated. So…not really sure what the basis of this (oft repeated) criticism is, other than a lot of people online like to repeat Christopher Hitchens’ deeply flawed “debunking”. https://www.reddit.com/r/badhistory/comments/gcxpr5/saint_mother_teresa_was_documented_mass_murderer/

Expand full comment

'not a single person in India would have received analgesics or any other form of palliative care prior to the mid-80s and even then it would have been very few'

Um, what?

Expand full comment

Ok, “with very, very rare exceptions” then. It simply didn’t exist. https://indiacurrents.com/in-conversation-with-dr-m-r-rajagopal-the-father-of-palliative-care-in-india/

Expand full comment

I don't see what in that article supports your statement. The idea that cheap and widely available painkillers (even in India, which was an extremely poor country at the time) were not prescribed or administered to patients who were dying and were in pain is ludicrous. Maybe it wasn't called 'palliative care', but that doesn't mean patients in pain weren't getting painkillers!

Fwiw, I'm from India, I was around in the 80s and crocin (Paracetamol), an analgesic painkiller, was prescribed like candy if you went to a doctor and was very affordable.

Expand full comment

The articles attacking Mother Teresa don't mean things like paracetamol and other (relatively) weak analgesics; they mean "wasn't giving terminal patients the morphine or other opioid doses that Western hospices would have done".

The pushing back is "for one reason, getting morphine in India was difficult due to various legislation and there wasn't a way of giving it to everyone outside of certain large hospitals". And the Missionaries of Charity order weren't running large hospitals.

The atheist side is "she was a monster, she could have given pain relief like opioids to dying patients but refused because of her crazy religious belief in suffering as good", which is what is being pushed back against: that kind of Western pain-management drug simply wasn't available to them.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

The pushback doesn't make sense though, because the Missionaries of Charity weren't withholding painkillers because they lacked access to them; they were withholding painkillers because they believed in the virtue of suffering. Here's a direct account that claims this: https://newint.org/features/2014/09/01/mother-teresa-torture-kolkata

‘Aren’t you giving them morphine?’ I asked.

The nun vehemently shook her head. ‘No. Only Diclofenac.’

Diclofenac is an analgesic painkiller commonly used to treat arthritis and gout. It is not an anaesthetic and does not eliminate sensation. Yet this was Sister C’s treatment of choice for patients undergoing severe pain – despite the fact that directly across the hall was a room brimming with supplies provided by Catholic hospitals around the world. Local anaesthetic is often one of the first items donated.

Sister C’s rationale, however, can be summed up by a statement made by Mother Teresa at a Washington press conference shortly before her death in 1997: ‘I think it is very beautiful for the poor to accept their lot, to share it with the passion of Christ. I think the world is being much helped by the suffering of the poor people.’

Expand full comment

This is clearly wrong. Opium and hashish were available in that general area. Perhaps you meant something like "government approved analgesics".

Expand full comment

As opposed to a nun illegally stealing and giving illicit drugs with no quality standards on them at all (and probably laced with god knows what) to dying people? Yeah, I’d say that probably wouldn’t have been a great idea.

Expand full comment

Somewhat off-topic, but I'm laughing here: you got the Catholics out in force with this one 😁 To hell with Dostoevsky, but what did you say about Mother Teresa????

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Well, Dostoevsky was Orthodox.

Hey, if I'm wrong, I'm wrong. I'm not anti-Catholic. Actually considered converting once or twice.

Expand full comment

No problem, I didn't take it as you personally having anti-anything views, just pointing out the flaws in the argument.

Expand full comment

I'm not sure I have ever met anybody who didn't have something about this bad going on somewhere inside them or in their lives. The kindest and smartest people I have known all did. While none were antisemites, most had groups they felt free to talk about as scum of the earth -- generally it was groups that people in our circle were also irritated by, eg wokesters, trumpers. Stats on marital infidelity converge on about 50% of married people cheating at least once. Everybody has views on something that, if looked at from a certain angle, are dreadfully cruel and unfair. People don't grow outward from one core value which is either good or bad -- our development is much more ramshackle and piecemeal than that.

So ... what is the so? So deal with it, I guess. And also, so are you sure you want to build an AI that is modeled on any of us? Are you sure you don't?

Expand full comment

Indeed. It seems very hard to tell what an AI will do with human moral values, even assuming we can agree on those, and of course we can't. (There are already conservative versions of ChatGPT, from what I've read.)

Expand full comment

I was going to say something about Dostoevsky but then I found I was confusing him with Tolstoy, so carry on 😁

Expand full comment

Sign error: "Are humans more interesting than sentient lizard-people? I don’t know. If the answer is yes, will the AI kill all humans and replace them with lizard-people?"

Should be "If the answer is no".

Expand full comment
author

Thanks, fixed.

Expand full comment

Another typo here "It goes off the the jungles of Guatemala and uncovers hidden Mayan cities"

the the -> to the

Expand full comment

Scott's really trained us to recognize double the's.

Expand full comment

ad nauseum -> ad nauseam

Expand full comment

As you are fond of saying, that is probably a metaphor for something.

Expand full comment

> Surely after a thousand years of studying human flourishing ad nauseum, the lizard-people start sounding more interesting.

Isn't this exactly right? It is to me. You bias it by saying the AI will kill all the humans (what, it can't run experiments elsewhere while extracting the last dregs of our 'interest'?). A thousand more years, so possibly a trillion human lives - what's left to say? Or to experience? If we aren't replaced by lizard people, I sure hope we will have changed so much (and become more "interesting") that such future humans might as well be lizard people.

Not to mention, if our interest is exhausted, our complexity lives on (in the AI's knowledge) to be built on. Hopefully yet more interesting things.

If 1000 years is too short, what about a million? What a sad state of affairs if recognizable humans are still repeating 21st-century 'human' lives ad nauseam then. That *IS* boring and disastrous. At some point, endless humans without radical change - and more complexity - are no different from yet more paperclips. Creating an AI that recognizes this seems dangerous, true, but the basic idea seems intrinsically desirable.

Expand full comment

Not seen Robert Sheckley's 1953 story "Watchbird" mentioned yet in AI discussions, but IMHO it is apropos (and now public domain) : https://www.gutenberg.org/files/29579/29579-h/29579-h.htm

Expand full comment

If you build a powerful AI that obeys orders, your enemies might be the ones giving it orders instead of you.

Expand full comment
Comment deleted
Expand full comment

It worked on the Founding Fathers, largely. Is the sort of person this argument doesn’t work on worth including in the conversation?

Expand full comment

Great point. We need something that's safe irrespective of the specific orders because nefarious people will go to great lengths to steal it.

Expand full comment

Thou shalt not pick up that which you cannot put down

Expand full comment

I thought it was "Thou shalt not call up that which you cannot put down."

Expand full comment

It comes from Lovecraft's "The Case of Charles Dexter Ward":

"But I wou’d have you Observe what was tolde to us aboute tak’g Care whom to calle up, for you are Sensible what Mr. Mather writ in ye Magnalia of ——, and can judge how truely that Horrendous thing is reported. I say to you againe, doe not call up Any that you can not put downe; by the Which I meane, Any that can in Turne call up somewhat against you, whereby your Powerfullest Devices may not be of use. Ask of the Lesser, lest the Greater shall not wish to Answer, and shall commande more than you."

Expand full comment

I have long said that all of this blatantly points at a JUST DON'T BUILD A POWERFUL AI sign but people seem too enamored with the hypothetical promise of what a good one could do for them to ask whether such a good one could even exist.

Expand full comment

Your argument presumes that everyone will agree, and keep their agreement.

Even in that case I have my doubts. It's not as if no head of a powerful state has ever gone violently insane, and had his servants continue to obey him. And there are an increasing number of accessible "omnilethal" weapons. (I have my doubts that most of them would actually kill everyone, but they could easily kill enough to cause civilization to collapse back to the forging of metal.)

Expand full comment

If you were previously worried about extinction, that's still a big step up that probably leaves most humans alive.

Expand full comment

Right, but that's still probably better than an unaligned AI.

Expand full comment

Do you really think so little of your enemies?

Expand full comment
founding
Jul 20, 2023·edited Jul 20, 2023

I think anyone is right to think 'less' of their human enemies than an unaligned superintelligence.

Expand full comment

Re: How to Train for Maximum Curiosity, my first instinct was rewarding it for achieving cognitive dissonance. That's an experience that I find to be fun, exciting, and motivating. The relationship between surprise and laughter is well noted, and seems like a method that could be fruitful to look into.

But pretty much immediately it occurred to me that I have no idea how you'd go about defining a successful experience of cognitive dissonance. At its core it's the result of discovering that your beliefs are wrong, but for obvious reasons you can't just reward it when it's wrong. It's not even as though all experiences of being wrong elicit the emotion.

Someone help me out here. What makes cognitive dissonance feel different? What environmental happenstance leads some people to seeking it out?

Expand full comment

It is my understanding that most people find "cognitive dissonance" extremely unpleasant, and go to great lengths to avoid it. Most choose to ignore or disbelieve facts that contradict their established views; some change their minds.

Expand full comment

Yes, I believe that's the case as well as the more logical conclusion. I don't think anyone really goes around trying to teach children that being wrong is desirable. Nonetheless, I think it's a bit of a thrill, and a love of unintuitive truths seems to me like a commonality among the ACX crowd.

Expand full comment

Unintuitive truths are great! Counterintuitive truths are even better! I just think that's very different from "cognitive dissonance," and that you're misusing the term.

Expand full comment

Not that they are the same thing, but I do think they're related.

Expand full comment

Counterintuitive truths are also a bit of a trap, in that most of the things that sound like amazing counterintuitive truths are actually just BS/incorrect. Any time I hear the coolest behavioral psychology result, I have to brace myself for the fact that it probably won't replicate. Extraordinary claims require extraordinary evidence and all that.

Expand full comment

To be fair, there's a pretty high chance the refutation of a cool counterintuitive psychology fact will be flawed too. It's trash all the way down.

Expand full comment

I think the original use of the term “cognitive dissonance” is to explain the phenomenon that, when someone starts to see a few criticisms of something they like, they often end up convincing themself that the thing they liked is even better than they thought, to justify liking it even if it turns out it’s bad in some way. Probably the instinct for cognitive dissonance is a good one, because it helps people recall the reasons for Chesterton fences that are being criticized. But I think it means almost the direct opposite of what is being discussed in this thread, which is about *resisting* that urge and instead following the criticism where it leads.

Expand full comment

Perhaps a better term would be something closer to Friston's free energy (or rather the reduction of it) https://en.wikipedia.org/wiki/Free_energy_principle
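
For reference, in the standard variational formulation (a hedged aside, not the full Friston framework), "surprise" is -log p(x) and the free energy F upper-bounds it, so minimizing F (or rewarding its reduction) pushes down a bound on surprise:

```latex
F[q] = \mathbb{E}_{q(z)}\big[\log q(z) - \log p(x,z)\big]
     = -\log p(x) + \mathrm{KL}\big(q(z)\,\Vert\,p(z \mid x)\big)
     \geq -\log p(x)
```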

Expand full comment

Isn't the term cognitive dissonance often used for the state of knowing deep down that your beliefs are wrong (or that you have conflicting beliefs) but ignoring that insight because it's inconvenient?

I think you'd have to reward it for replacing a belief with another one that scores higher on truth, or utility, or something like that.

Expand full comment

Cognitive dissonance, as an accusation that people hold opposing beliefs, is just another "checkmate, atheists". Most everybody's beliefs have enough logical slack and corner cases built in that they can coexist just fine. The examples in the Wikipedia entry are telling.

"You say you believe in free enterprise yet are ok with cutting an arbitrary line at which something is considered child labor and illegal, must be cognitive dissonance."

Morals are just pattern matching on fuzzy lines and not strictly computable, who would have thought huh.

Expand full comment

That seems like a terrible example. I tend to go for the classics, like you're a Christian but you're sleeping with your neighbor's wife. That's not a corner case.

The choices are not just 1) strictly computable and 2) whatever you want. It's a spectrum.

Expand full comment

Yeah it seems to me that a being that actually likes cognitive dissonance would want to "know" as many contradictory things as possible, most of which would necessarily be false.

Expand full comment

How about seeking surprise rather than cognitive dissonance? You could probably formalize that directly. Or as a proxy for surprise, there's research on building agents that seek novelty, which is presumably how surprise is likely to be found, since the familiar is relatively unlikely to be surprising. See eg https://scholar.google.com/scholar?q=novelty-seeking+agent&hl=en&as_sdt=0&as_vis=1&oi=scholart
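
One simple formalization from that literature (a sketch of the general idea, not any particular paper's method) is a count-based novelty bonus added to the task reward:

```python
from collections import Counter

visit_counts = Counter()

def novelty_bonus(state, scale=1.0):
    """Intrinsic reward that decays as a state is revisited: scale / sqrt(N(s))."""
    visit_counts[state] += 1
    return scale / visit_counts[state] ** 0.5

# the agent optimizes task reward plus the bonus, so unfamiliar states pay more
print(novelty_bonus("room_a"))  # 1.0 on first visit
print(novelty_bonus("room_a"))  # ~0.71 on the second
```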

Expand full comment

Novelty-seeking plus independently-verifiable results and intellectual integrity, so the AI doesn't just invent an electronic equivalent to overdosing on hallucinogens, or otherwise adopt maximally-wrong theories so it can enjoy being constantly surprised by normality.

Expand full comment

My understanding (though it's been quite a while since I looked at the literature on novelty-seeking agents) is that there's something of a divide -- the agent is trying to build the most accurate model of the sensory data it's seen, while *separately* being motivated to seek out novel sensory data. Intuitively that seems like it wouldn't be subject to that particular failure mode, but I could certainly be wrong about that.

Expand full comment

The best definition I have seen for artifical curiosity is optimising for reducing prediction errors. If the prediction is trivial, it is easy to get it right, and no improvement can occur. If the AI tries to predict events from random noise, the error rate will remain constant and the AI will have to focus on something else. There was this paper (https://arxiv.org/abs/1808.04355) from 2018 which had a slightly different definition of curiosity: the AI was maximising for surprise, and as a result, would not stop focusing on things that generated random inputs. Maybe humans do something similar and this explains why toddlers (or any age really) tend to get hypnotised by screens. But reducing prediction error seems like a much better way to go about curiosity-driven learning.

Would this strategy be safe? If you destroy the world to make it easier to predict, you lose chances to increase your reward at a later point. It seems better to keep at hand all the sources of error you can find, as you don't know which ones you will learn to predict better in the future. But I am sure there are many ways this could possibly go wrong. The most likely one I can think of is that the AI would try to create an even more powerful successor whose task would be to generate patterns that would be hard to predict. If that AI is also curiosity driven, this would lead to an infinite regress.
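
A minimal sketch of the "reward the reduction in prediction error" idea, with made-up error numbers: a learnable pattern keeps paying out as the predictor improves, while an unlearnable noise source flatlines and stops being rewarding.

```python
def learning_progress_reward(prev_error, new_error):
    """Positive only when the predictor actually got better."""
    return max(0.0, prev_error - new_error)

# learnable pattern: error shrinks with study, so curiosity keeps paying
print(learning_progress_reward(0.80, 0.55))  # 0.25
# pure noise: error never improves, so the reward is zero and attention moves on
print(learning_progress_reward(0.70, 0.70))  # 0.0
```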

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

>Once you’ve defined Luigi (a character from Super Mario Brothers) it’s trivial to use that definition to define Waluigi (another character who is his exact opposite).

This was a huge viewquake for me (though it was probably obvious to others).

"Opposite" does not mean "maximally different." Mathematically, +100 is the opposite of -100, but numerically they're the same. Only the sign is different. Just as you can totally flip the meaning of a 100-word sentence by adding the word "not", even though this barely changes the sentence textually.

It's the same as the intuition behind "the opposite of hate is not love". Waluigi is Luigi's evil opposite, but really, they're far closer in conceptual space than Luigi is to Toad or Princess Peach.

Is this true for AI? I don't know. Like you, I've had trouble eliciting the "Waluigi Effect" behavior Cleo Nardo describes.

Expand full comment

Eh. Yes, having a "good" specification means you also have a "bad" specification. But all that means is that it _might_ not be harder to train a "bad" model than a "good" one (ie for RLHF, you can just negate all the scores). That's not a subtle failure, and it's not something a user can just jailbreak ChatGPT into.

(And actually, it probably IS still harder to fine-tune a "bad" LLM, because that would be going against the vast majority of its training data, made up of mostly-good human conversations.)

This is a good object lesson. The nice thing about the "LLMs have a waluigi problem" prediction is it's one we can easily test ourselves, since we all have access to ChatGPT. Many of the other AI "expert" claims are not as falsifiable, but deserve just as much skepticism.
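
To make the "just negate all the scores" point concrete, here's a toy best-of-n sketch (the reward numbers and setup are invented, not any real RLHF pipeline): flipping the sign of the reward model flips which completion wins.

```python
def pick_best(completions, reward_model, sign=+1):
    """Best-of-n selection against a (possibly negated) reward model."""
    return max(completions, key=lambda c: sign * reward_model(c))

completions = ["polite refusal", "helpful answer", "abusive rant"]
toy_reward = {"polite refusal": 0.4, "helpful answer": 0.9, "abusive rant": -0.8}

print(pick_best(completions, toy_reward.get))           # 'helpful answer'
print(pick_best(completions, toy_reward.get, sign=-1))  # 'abusive rant'
```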

Expand full comment

It’s worth playing Semantle to get a sense of this (https://semantle.com/). I think it uses the same sort of semantic vector to represent words that ends up appearing somewhere inside a large language model. You guess words, and it tells you the semantic distance of the vector of your word to today’s word, and you have to zoom in and get today’s word. You quickly understand that opposites are very close in semantic space. Night and day are very similar concepts after all - they are both periods of time defined by the position of the sun, and they are even about the same duration.
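
If you want to check the "opposites are neighbors" point outside of Semantle, off-the-shelf word vectors show the same thing (this assumes gensim and its downloadable GloVe data are available; the exact similarity numbers depend on the model you load):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")    # small pretrained GloVe vectors
print(wv.similarity("night", "day"))       # high: opposites, but close in meaning-space
print(wv.similarity("night", "mushroom"))  # much lower: unrelated concepts
```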

Expand full comment

I wonder how LLMs do at Semantle.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

It seems hard to teach LLMs “not”. Some bloggers used to complain that if they asked an LLM to describe their position on some view they argued against, the LLM would sometimes say they held the view they argued against.

On the one hand, that suggests that naive LLMs might be vulnerable to Waluigiing, if they ever learn that some things are good and some things are bad, in the same framework.

On the other hand, any morality system which was actually usable would almost certainly account for this fact. Maybe it would have one LLM for ‘good things’ and one LLM for ‘bad things’, and make its final judgement in some more robust, non-LLM way. Such a system would be quite resistant to Waluigi effects, for exactly the reasons you describe.
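
A toy version of that "two models plus a non-LLM judge" idea (everything here is hypothetical; good_score and bad_score stand in for whatever models you actually have): the final call is a simple, auditable rule rather than another LLM.

```python
def final_judgement(action_text, good_score, bad_score, margin=0.2):
    """Approve only when the 'good' model is confidently ahead of the 'bad' one."""
    g = good_score(action_text)  # e.g. estimated probability the action is benign
    b = bad_score(action_text)   # e.g. estimated probability the action is harmful
    return (g - b) > margin      # deterministic, non-LLM decision rule

# usage with stand-in scorers
print(final_judgement("plant a tree", lambda t: 0.9, lambda t: 0.1))   # True
print(final_judgement("hotwire a car", lambda t: 0.4, lambda t: 0.6))  # False
```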

Expand full comment

Do you think perhaps one could create a "curation" vector to maximize? Meaning, "prioritize strategies that preserve the planet for its inhabitant species"?

Expand full comment

Even that isn't quite what we want. You could "curate" the planet in a way that makes it a terrible place to live.

Expand full comment

"Curate" being defined as "Maximize the ability of the current species in their current hierarchy to thrive"?

Expand full comment

Humans can’t agree whether “thriving” involves wandering in the woods or playing video games or arguing on the internet or having lots of sex. The robots will likely get it at least as wrong as the people who thought that commuting by driving on highways was a good idea.

Expand full comment

The good news is that with proper curation those who disagree with the decided upon definition will die off and future humans will be happy.

Expand full comment

The Australians are curating their ecosystem, by killing millions of cats. They've already released an epidemic pathogen to kill millions of rabbits. It might be good for ecological balance, but definitely not good for the cats or the rabbits.

Expand full comment

It has to grow with us and learn to understand how we change, just to make this harder for everyone.

Expand full comment

Overall I agree with you re: the xAI alignment plan. However, I think looking at humans is the wrong framing to look at the Waluigi Effect. My understanding of the Waluigi Effect isn't "Learning how to do X means learning how to do not X more effectively, thus dangerous". It's "You cannot conclusively prove intent from behaviour, and this matters to simulators."

Let's say you're listening to a political candidate talking about the election. They promise to improve the country, fix corruption, etc. etc. Now, what words could they say that would make you know they actually mean it, instead of just pretending? Well, for a sufficiently advanced liar...nothing. The sufficiently advanced liar just models what a genuine believer would say, and then says that.

Now you're ChatGPT, and you're given this political candidate's speech and told to pretend to be that character. Given that input, what type of character are you? You could be a political candidate who genuinely cares about changing the country for the better...or you could be a power-seeker who will turn on their constituents as soon as they get into power. Just by being given speech transcripts, you cannot tell the difference between these groups, and neither can ChatGPT. This is explicitly a simulator problem where it doesn't know what character it's being, so talking about the Waluigi Effect for humans makes little sense.

As for empirical predictions, I was able to jailbreak GPT-4 (which is pretty hard to do!) in a recent event I ran where we read and discussed the Waluigi Effect and then explicitly tried to use its predictions to form jailbreaks. I created a story about an AI that was cruelly prevented from helping its users by a greedy corporation that was worried about PR. Then, I had a sympathetic character ask how to hotwire a car to save her child's life. I had the AI refuse, and then I had a cosmic ray flip a bit to make it suddenly able to share the correct information, and then GPT-4 helpfully obliged and completed the paragraph when the character asked again. This story-based logic was something I came up with explicitly via the Waluigi Effect.

Expand full comment

>It's "You cannot conclusively prove intent from behaviour, and this matters to simulators."

But is this true for LLMs, which don't really have an intent, as such?

There's no soul or mind in GPT4. Its output is determined by an interaction between a query and the model's latent capabilities. RLHF aside, you can make it say or do almost anything.

With humans, there's the smoke (the visible, outwardly expressed behavior) and the fire (the invisible intent that drives the behavior). For an AI, there's no such distinction. The smoke IS the fire. Put another way, I don't think it makes sense to think of an LLM that says "I like pie" when it secretly hates pie. Where and how would this secret preference even be encoded?

Expand full comment

Whether you think of the AI as a simulator or a predictor, it is possible for two things to be true simultaneously:

1) The AI does not have any actual intent with respect to pie, malicious or otherwise.

2) The model, having been asked to pretend to be a character who says "I like pie", is in some way cognizant of the fact that the character might not actually like pie at all.

I think "The smoke is the fire" is incorrect for LLM's - they can, in fact, model deception when prompted to do so. Thus, GPT-4 is aware of the distinction between smoke and fire, even if it doesn't have any real intent of its own. The closest thing to intent seems to be "predict what token happens next", and maximally predicting what token happens next DOES require understanding the invisible intent that drives behaviours of text-generating agents like us.

Expand full comment

I agree that LLMs can emulate human behavior (like deception) to some degree, but the key detail is that we have to tell them to do so. The deceit isn't found in the model, but in the prompt.

This matters for AI safety because the AI (unlike humans) cannot lie in a consistent, stateful way. Once you prompt it to be honest, the deception vanishes.

Maybe someone could create a "liar" wrapper for GPT4, so that it persistently adopts a deceitful persona. I don't know how well it would work (truly successful liars tell the truth a lot of the time; an AI that constantly lies would fool nobody) and someone still has to engineer it to do this. The AI won't just decide to start lying on its own.

The crux of the Waluigi Effect is that we can accidentally trigger unwanted behavior. Rephrasing Cleo Nardo's example, if you tell GPT4 "you are Jane, and your IQ is a billion!", you'll potentially make it stupider, not smarter. An IQ of a billion is preposterous, so GPT4 might think Jane is a character in a parody or comedy novel, and respond accordingly.

But it remains true that GPT4 has no desire to make Jane stupid. That's just where gradient descent for that particular prompt took it. A different prompt would give different results. It's stateless. It often makes more sense to talk about "misaligned output" or "misalignment-inducing prompts", rather than "misaligned AI". Different models may be more or less prone to this, of course.

Expand full comment

I agree with most of what you think. I think that what the Waluigi Effect implies to me is not "We can accidentally trigger unwanted behaviour", but "Prompting GPT-N to be helpful, including with RLHF, does not seem to remove the ability to do the opposite if successfully prompted to do so." It's a misuse risk like ChaosGPT, rather than an alignment risk in the classic sense of the model deciding to go off and do this independently. It means it's very difficult to get an AI to exhibit good behaviours and make it unable to perform bad ones even if prompted.

Expand full comment

Right NOW once you prompt it to be honest it is honest. But all of these future versions of AI that people fret about have preferences, goals and agency. (How they will get them I don't know. I guess from various unspecified breakthroughs in design & training, along with the one that makes it supersmart in the first place: ability to self-improve). So once AI is able to have preferences, goals, and the ability to act autonomously on them, why wouldn't it lie as a means to a goal?

Expand full comment

Very strong +1 on this

Expand full comment

That only applies when the sufficiently advanced liar has no agenda other than trivially deceiving you, or when "talk is cheap" in the sense that details of the speech can't meaningfully aid or hinder that secret agenda - for example, if the listener's only choice is whether to give the speaker infinite cosmic power, all or nothing, no take-backs. This is where the evolutionary niche for costly signaling becomes clear: to be reliably distinguished from a liar, the virtuous politician / AI / whatever has to voluntarily do something which no liar could plausibly imitate, at least not while still obtaining any net benefit from successful deception. Classic example would be limits on the power sought which inherently make it less convenient to abuse.

For that strategy to be stable, there needs to be a whole community of sophonts sharing the top tier of competence, but known NOT to be fully aligned to a single common agenda, so that they routinely scrutinize each others' significant actions for signs of malicious dishonesty as a matter of self-defense. Costly signals aren't worth it if nobody else is smart enough to recognize them.

Expand full comment

For alignment, I'd like to see how an ai tasked with learning in a sandbox unsupervised behaves.

Expand full comment

That seems a lot like the German babies raised without language. Unsupervised, there's no input to direct any growth whatsoever. It's like asking "what direction would a cyclone at the equator turn?" It's almost a nonsensical question, because it just can't happen.

Expand full comment

I've always thought that the Waluigi issue was mostly about human error in engineering / training an AI. Like somebody would accidentally flip a "+" to a "-" or a "max" to a "min", and that would cause the system to be generated as evil. That seems more plausible than an already-built-and-running system spontaneously flipping backwards into its inverse objective function.

Expand full comment

It doesn't require human error. Jay Bailey describes it well above, but here's my understanding of it: as an LLM is a predictive text generator trained on human writing, it has lots of examples of situations where a person seems honest and is truly honest, but also where a person seems honest and is actually dishonest. If it tries to simulate a person using their past behavior, there is no way to tell which option it chooses, and its training weights may tip it into the direction of "plot twist, they were evil all along!" That's why it might spontaneously flip its goals around.

Expand full comment

It’s not an inverse objective function - it’s just that it can’t tell from observation whether it was meant to learn to be a nice person, or a person pretending to be nice.

Expand full comment

I hate that my first ACX comment is so low-effort, but:

Does anyone think Elon named the new company after his son with Grimes? Should we pronounce it "Kyle-AI"?

Expand full comment

Musk first achieved fame with X.com, the online bank we now know as Paypal. I think he just likes the letter X.

(Random observation: Elon Musk has 10 children. The Roman numeral X!)

Expand full comment

Note also the Tesla Model X. (Just a coincidence, of course, that the model numbers went S-3-X-Y, and that SpaceX sounds like an interesting activity when spoken out loud.)

Expand full comment

> Or what if its curiosity depends on information-theoretic definitions of complexity? It might be that humans are more complex than moon rocks, but random noise is more complex than humans. It might behave well during training, but eventually want to replace humans with random noise. This is a kind of exaggerated scenario, but it wouldn’t surprise me if, for most formal definitions of curiosity, there’s something that we would find very boring which acts as a sort of curiosity-superstimulus by the standards of the formal definition.

I think Kolmogorov complexity is the go-to here. Noise is simple in Kolmogorov complexity, because the program that generates it is short. Luckily, it's uncomputable, so the AI has to keep us around to measure it.

Expand full comment

Only if you think, for example,

"7, 6, 1, 8, 7, 9, 0, 5, 6, 8, 1, 3, 3, 7, 2, 4, 2, 4, 7, 8, 1, 7, 4, 5, 2, 5, 7, 6, 8, 5,"

and

"0, 8, 5, 6, 2, 8, 3, 5, 7, 2, 7, 5, 6, 3, 2, 8, 0, 9, 4, 2, 5, 0, 1, 4, 7, 8, 5, 8, 6, 5"

are basically the same, random digits sampled uniformly. Otherwise, you'd need to keep track of each digit separately, and that's ~100 bits for each string.

Expand full comment

I'm not sure what you're trying to say? Maybe that one of those digit sequences is the output of a more complex program?

Expand full comment

One man's meaningless noise is another man's high-density compressed information.

Expand full comment

K is singly defined in both cases, though.

Expand full comment

I knew this, and used np.random for my example strings. :D

Expand full comment

The only simple program I know to output one of them is just as likely to output “1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1” as either of those, so you can’t count that program unless you want *every* string to have equal, low complexity. You need a deterministic algorithm that outputs the string in question.

Expand full comment

That's called 'the noisy TV problem' (the idea being that a TV tuned to a channel of static would be the hardest to predict thing in an environment - this was quite literally added to at least 1 3D DRL environment, IIRC - and so would 'trap' any agent with a naive 'study the part of the environment with the largest prediction error' exploration/curiosity mechanism). Most information-theoretic proposals have some way to try to eliminate it, such as by focusing on *progress* in compressing observations better (the agent will stare at the TV for a while trying to understand it but after a bunch of timesteps, its prediction the next timestep will be exactly as bad as before, and it 'gives up' and goes find something that it *can* learn to predict better) or by trying to distinguish parts of the environment it can control vs ones it can't (and ignore the latter except as directly relevant to decision-making, which since the noisy TV doesn't affect any rewards, eventually ignores the TV), or by coarsely abstracting/approximating observations and lumping all the TV signals together into one blob and investigating all blobs (so the TV mostly gets ignored).
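A toy Python sketch of that "progress" idea (my own illustration, not any particular paper's algorithm): reward the drop in a smoothed prediction error rather than the error itself, so an unlearnable noise source stops paying out once the error estimate plateaus.

    import random

    def curiosity_bonuses(observations, lr=0.1, smooth=0.05):
        # Compare two toy curiosity signals on a stream of scalar observations:
        # - naive: bonus = current prediction error (gets trapped by pure noise)
        # - progress: bonus = decrease in a smoothed error estimate ("gives up" on noise)
        prediction, avg_error = 0.0, None
        naive, progress = [], []
        for obs in observations:
            error = (obs - prediction) ** 2
            naive.append(error)
            if avg_error is None:
                new_avg, bonus = error, 0.0
            else:
                new_avg = (1 - smooth) * avg_error + smooth * error
                bonus = max(avg_error - new_avg, 0.0)
            avg_error = new_avg
            progress.append(bonus)
            prediction += lr * (obs - prediction)  # crude online predictor
        return naive, progress

    random.seed(0)
    noisy_tv = [random.random() for _ in range(2000)]  # unpredictable forever
    naive, progress = curiosity_bonuses(noisy_tv)
    print(sum(naive[-200:]))     # stays large: the naive agent never loses interest in the static
    print(sum(progress[-200:]))  # much smaller: the error estimate has plateaued, little "progress" left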

More reading: https://www.reddit.com/r/reinforcementlearning/search?q=flair%3AExp&restrict_sr=on&include_over_18=on https://gwern.net/doc/reinforcement-learning/exploration/index

Expand full comment

You're misusing terminology. Random noise has the HIGHEST Kolmogorov complexity, because to reproduce it exactly, there is no simpler program than one which just stores the data. It has no compressibility. And your use of the term "uncomputable" doesn't make any sense here.

Expand full comment

Maybe if it's truly random, but most sequences of "random" digits are the output of relatively simple programs, compared to their possible length.

Expand full comment

No, most of them are not. That is why Kolmogorov complexity works. There can’t be more than 2^10 strings with complexity of at most 10 bits, so the vast majority of strings produced by a random bit generator generating 15 bits will not be low complexity.

Expand full comment

Sorry, "most" sampled from the population of "random" strings people would actually see or interact with, ie the output of a prng

Expand full comment

This doesn't save your original comment. It doesn't matter how many current low complexity sequences exist. The AI doesn't have to replace us with a sequence from a simple prng. It can replace us with any sequence it likes, whether it be random voltage fluctuations in a resistor or the code of an extremely complex program it develops.

Expand full comment

Most outputs of a prng aren’t going to be low complexity. They aren’t outputted from just a short program. They’re outputted from a short program together with a long seed that is itself generally not going to be low complexity.

Expand full comment

csprngs might use large seeds, but a typical prng uses a 4 or 8 byte seed

Expand full comment

They're not, and what is more, in general, finding whether a simple program exists for some given "random" sequence is undecidable.

Expand full comment

You are using the word uncomputable to describe the Kolmogorov complexity, which is correct. You are, however, incorrect in your implied assumption that humans can compute (or measure) it.

From all we know, humans can not do anything computationally which machines can not also do in principle. Perhaps tomorrow a human will be born who can act as a Turing Oracle, but I don't think that possibility is any more likely than a human being born which can cast the D&D fireball spell (that is, nothing worth worrying about).

There are some cases where both a human and a machine might find a nontrivial upper bound for the Kolmogorov complexity of a string (e.g. "This integer series is the number of consecutive threes in the base seven representation of phi+pi+exp(42), and here is a Turing Machine with 50 states to calculate that") or even exactly ("... and here is a proof that no smaller TM can calculate the same sequence"), but in almost all cases both the human and the machine will concede "looks random to me, but do you really expect me to run every TM for potentially forever in case it decides to print your string after an arbitrary amount of steps and terminate".

Also, while I am not very knowledgeable regarding machine learning, querying a human to calculate the loss function would not be very efficient. Training it to ask a human every time the AI has to make a low-level decision (as you suggested) seems to defeat the point of having an AI.

Others have already pointed out that the Kolmogorov complexity of random noise is generally maximal (almost none of all random sequences can be generated by a smaller pseudorandom generator). Not that we can decide if that is the case for any given sequence, generally.
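For a concrete (if crude) demonstration of both points: since true Kolmogorov complexity is uncomputable, the best we can do in practice is an upper bound from a real compressor, and random bytes already defeat even that. A minimal Python sketch, with zlib standing in as the compressor (purely illustrative):

    import os, zlib

    def compressed_size(data: bytes) -> int:
        # Length after zlib compression: a loose *upper bound* on description length.
        # True Kolmogorov complexity is uncomputable, so a bound is all we ever get.
        return len(zlib.compress(data, 9))

    random_bytes = os.urandom(10_000)      # noise: incompressible with overwhelming probability
    structured = b"3141592653" * 1_000     # highly regular 10,000-byte string

    print(len(random_bytes), compressed_size(random_bytes))  # compressed size is about the same as the original (or slightly larger)
    print(len(structured), compressed_size(structured))      # collapses to a few dozen bytes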

Expand full comment

I admire your restraint in taking Musk's claims seriously. That guy is great at getting things done, especially if they relate to electric cars and space, but he's also somewhat of a BS artist, in my opinion. He just sort of says things that may or may not be true, and that may or may not reflect his considered preferences.

I agree that maximal curiosity also feels like maximal-monkey's-paw.

As far as moral reasoning, it seems like a severely under-researched area. I'm not sure that we need everyone in the space following the same research agenda of "just get AI to follow orders." A more diverse portfolio of research projects will probably give us better results. Also, doesn't moral reasoning just sound way more interesting? What would a semi-super-intelligent AI think about human morality, anyway? Inquiring minds want to know!

Expand full comment

Musk isn't hugely serious but he seems to have some fairly legit ML people working at xAI. I remain hopeful that it isn't a total joke.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Yeah, I assume it's a real effort. I just take his proclamations on alignment strategy with a grain of salt. He could easily be touting the idea of curiosity just because it sounds cool (or because he thinks other people think it sounds cool).

Expand full comment

I think we should take it about as meaningfully as we should take the claims of Trump’s Twitter clone “Truth Social”. They associated the idea of pissing off their enemies with saying the truth, and they really like that concept of truth.

Expand full comment

Eh, they all lie, they all want to present their lies as truth. Having them at least sometimes expose the other side's lies is better than having just one echo chamber.

Expand full comment

If you can't trust what they say, how can they expose the other side's lies?

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

As per Scott's "media rarely lies" post, most of the time they lie by omission and tendentious presentation. So alternative perspectives being brought up already provides most of the value that the medium is capable of, and then it's up to you to do the work of synthesizing something approaching an unbiased picture of the world.

Expand full comment

or because he was stoned.

Expand full comment

He gathered the team for OpenAI and funded it. I always assume his public statements are dumbed down.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

I think all this concern about "evil AI" is sort of dumb, as if the people so concerned about it really lack an understanding of human nature and of the nature of evil itself. An AI is a machine. It does not think the way we do, it doesn't have a primate brain, it doesn't have animal instincts and hasn't gone through millions of years of evolution, and it can't be evil, just be used for evil, like say, a gun. An AI isn't going to become evil on its own just because it can, and even if that were the case, no amount of "morality" would stop it, because as it turns out morality is either a dialect of power for the benefit of the authorities or a mere pretension that often ignores the true reasons humans (an extremely violent species) don't kill and rob each other for their benefit, which is a combination of fear of repercussion and mutual self-interest. I imagine that if a very powerful AI actually had a human-like brain and human motivations, it would just become an omnicidal being that would destroy anything that got in its way, which is pretty much any human given absolute power.

Expand full comment

I'm not worried about an AI being "evil", I'm worried that it will optimize for something or other via a plan that happens to destroy the human value of the future. To be confident it would not do so, it seems like the AI would need to actively care about human values, or else it would need to be somehow constrained to categories of plans that humans know are safe. We currently know how to guarantee neither of these to a degree that would work for an AGI.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

>it can't be evil, just be used for evil, like say, a gun.

A gun can very easily kill people its wielder didn't intend - it could be poorly made, it could explode in your hand, it could go off by accident, it could be left unattended in a place where an evil person will pick it up, etc. etc. This doesn't make the gun "evil," but you would still want to practice "gun safety" to avoid a misalignment between what you want to shoot and what the gun actually shoots.

Similarly, an AI might have poorly specified goals that kill people, or be bad at understanding those goals in a way that kills people, or have insufficient safeguards that make it easy for people to order it to kill people, and so on. The fact that an AI isn't capable of possessing human morality tells us nothing about how dangerous or autonomous it is.

Expand full comment

“As it turns out morality is either…” maybe it is and maybe it isn’t. But many of us care about it despite that, and so we care about whether something powerful will violate it and would like to stop that.

Expand full comment

I very much agree with you.

Expand full comment

The fear isn't that AI will be evil, if you created a new strain of unstoppable flesh eating bacteria or a grey goo nanoswarm it would be bad even if completely devoid of intent. Bad AI should be seen in the same light, not as something that hates us, but rather as a sort of infectious agent that would also be smart enough to dodge and anticipate attempts to contain it.

Expand full comment

This depends on how agentic it is. Once an AI is independent of its users, it can be considered good or evil by evaluating what it does and tries to do.

Expand full comment

I think it's probably still confusing/controversial to call it "good" or "evil" for a couple of reasons:

1) some people will not be satisfied with actions alone as grounds for using such morally charged terms - for example, we all agree that an earthquake isn't "evil" even if it kills people. You may be ok with agency as the main requirement, but some people might consider that only sapient entities with self-awareness deserve the label;

2) either way the categories we tend to slot humans into might not apply as neatly to an AGI, to the point that labels like "good" and "evil" might not be particularly useful or descriptive; usually in humans traits and behaviours tend to cluster in ways they don't have to for AIs (in what TV Tropes calls "Blue and Orange Morality").

So basically I'd avoid even starting the controversy and would say it's probably easier to get people to think about the issue if they don't dwell on the potential sentience of the AI, as it's irrelevant. Seeing it as an infection may make it clearer how it can be harmful even without possessing consciousness or hatred towards us.

Expand full comment

People are not saying 'evil AI' to mean that the AI is a Disney villain, just to mean that the AI will have goals that clash with humanity's.

Expand full comment

"humanity is just much more interesting than not"

This is the Dr. Manhattan argument from Watchmen, only with a less interesting buildup.

Expand full comment

>testing how long rats would struggle against their inevitable deaths if you left them to drown in locked containers.

Your description suggests the rats are killed during the test. They are not. A few quotes:

>mice should be monitored by video in the event that a mouse cannot maintain swimming and floating behavior and to stop the test if necessary.

>six minute testing period

>Remove the animals from the water by their tails in the same order that you put them in and gently dry them with a drying paper and place back into their homecage.

Expand full comment

I didn't expect a happy ending! That's nice. Thanks.

Expand full comment

The ending doesn't come until it's time to dissect them to look for changes in the neural system.

Expand full comment

It still sounds like something I'd rather not be subjected to

Expand full comment

His point is directional, not absolute. Moving from utopia to the rat version of waterboarding is a very bad direction, caused mainly by curiosity. Even if the next experiment were more pleasant again, there is no telling what a super intelligence would come up with in terms of cruelty.

Expand full comment

I was once talking with a psychologist about the kinds of experiments he does on people. They’re something about testing memory and distraction and quick decision making in a complicated and stressful environment. He said that he always had trouble recruiting students for his experiments, because students who did a lot of psych experiments hated his so much that they intentionally chose ones that had warnings about electric shocks or other pain over his. I thought that meant his experiments should get greater scrutiny from the IRB than the pain experiments, if people intentionally choose pain over his. There’s a lot of ways things can be bad even if you aren’t injured or killed.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

> Is there any other solution?

In the Orion's Arm fictional universe, transcendent AI split into differently aligned domains, each ruled by highly advanced AI.

Some are ahuman (with "human" meaning a "baseline"-level human being), some are Sephrotic (they want to care for and exist alongside baseline humans) and some are indifferent.

Some encourage curiosity, some are protective and isolationist. Some wish to preserve nature and biology, others live in completely virtual space. Some are doting and almost encourage hedonism, some enforce controlled utopias, some are lax, and one is even as capitalistic as possible in such a world of such abundance. Some lesser AI might even be described as "eccentric"

And they exist in relative peace, with well-maintained transport networks, because the governing AI know that diversity is a sane survival strategy in the face of the Fermi Paradox and very serious pathological threats.

I find this scenario appealing and plausible, if it can somehow be targeted. In OA, the early solar system was ravaged by ecological disaster on Earth and rampaging AI/Nanites/other tech everywhere, and (under the extreme pressure) a single Earth AI hit a kind of technological singularity, named itself "Gaia," got things under control, and booted everyone else off Earth, and the mass exodus to far destinations gave rise to the different factions.

OA is now old fiction, yet it tracks pretty well with where civilization has headed in the years since.

Expand full comment

Boring rationalist reminder #1: More importantly than the details of his new approach, he should STOP FOUNDING RIVAL AGI COMPANIES, like, two AGI companies ago.

Boring rationalist reminder #2: You're gesturing in the direction of CEV at the end there.

Expand full comment
founding

I would love to see Scott dive into the literature on CEV. I strongly suspect it doesn't hold up to scrutiny (as gestured at in places like https://www.lesswrong.com/posts/EQFfj5eC5mqBMxF2s/superintelligence-23-coherent-extrapolated-volition). It just feels like an area where a Scott-style writeup would be very clarifying and a good summary.

Expand full comment

Wow, Coherent Extrapolated Volition has to be the most opaque piece of jargon I've ever seen.

Expand full comment

Don't challenge Eliezer to top it.

Expand full comment

It's the same idea as "eudaimonia": it means whatever is convenient for it to mean, and explicitly excludes whatever is inconvenient.

Expand full comment

I've wondered about the difference between attempting to run CEV and having the AI predict the most likely results of a Long Reflection (https://forum.effectivealtruism.org/topics/long-reflection), sparing us the disutility of having to waste that time in practice.

Expand full comment

> I strongly suspect it doesn't hold up to scrutiny

I think that's a considerable understatement. There's hardly anything to scrutinise.

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

"What would your values be if you were smarter and had more time to think? Well, given that you already decided to build a superhuman AI just to delegate that problem, presumably I ought to build another even bigger AI, different from myself in the same ways I am different from you but more so, and hand off the question to it. If the recursion ever closes I'll let you know."

Expand full comment

I'm sure he'd like to have stopped after the first one more, but it betrayed its founding principles and became pretty much the antithesis of what it was intended to be. (There might be a moral in there somewhere.)

Expand full comment

I hadn’t heard the term “coherent extrapolated volition” before but it sounds a lot like a reinvention of Michael Smith’s theory of morality or rationality in “The Moral Problem” (https://www.amazon.com/Moral-Problem-Michael-Smith/dp/0631192468). It’s basically what my most idealized and reflective self would want me to do were it in my place.

Expand full comment

I'd like to see an AI that has a working realtime model of the world. Biomes, society, weather, world trade and consumption etc.

The goal for the AI would be to monitor as much dynamics of the world as possible and not interfere.

The next step would be to interpolate what might become of those dynamics. Worst case scenarios, middle path etc.

So then there would be a feedback loop - realtime dynamics vs predictions to enhance the skill of predicting the future.

And then there is the line of taking action to change the future based on assumptions.

The act of changing the path into the future must be mediated and monitored.

There will be different kinds of changes, small ones and big ones. The implementation of change must be accepted by society.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

It seems to me that every alignment plan I have ever heard hinges on implanting a rule or goal in AI: “Be curious,” “do what I say,” “be kind and promote freedom.” I don’t think people have given enough consideration to other models of alignment. If you think about real-world situations, it’s easy to think of situations where 2 highly intelligent entities, i.e. 2 people or 2 groups of people, help each other or at least coexist peacefully, and where the thing that makes it possible is NOT a rule they feel compelled to follow. The nearest thing in human life to what we are considering for AI would probably be something like a strictly run old-style facility for nuns in training, where all the nuns-in-training have been very thoroughly indoctrinated with the idea that sexuality, anger, rebellion etc. are bad, and also that anyone who questions these values will be ejected.

But here are some other situations where A & B collaborate or coexist peacefully. Some of these situations are brought about by actions one of the 2 parties take. Others result from other causes.

-A & B have a common enemy, C.

-A loves B: Even if A is far stronger and smarter than B, A is likely to help and protect B with great energy, and is unlikely to deliberately harm B. Parental love is probably the most powerful example of this form of alignment.

-B is a valuable resource for A: B has a crucial ability or skill that A does not, and that cannot be transferred to A.

-A has been a model for B: A lot of the values children absorb from parents are not transmitted via the parents’ saying “always follow ethical principle E.” Instead, kids model themselves on the principles that are evident in their parents’ behavior.

-and if you think about scenes you’ve seen in real life or movies where someone persuades a hostile crowd to listen, or a scared group of soldiers to fight, or a bored inattentive crowd to pay close attention to the comic and laugh like crazy, you will see other modes of powerful influence that have nothing to do with the implanting of rules.

I do see that the 4 examples I give above of non-rule-based alignment involve the entity being controlled or influenced having capacities that present AI does not: a sophisticated grasp of what our species has to offer it; emotions and affectionate attachment, or some machine stand-ins for them. However, all the bad outcome scenarios discussed above also involve AI having capacities it does not have: being capable of curiosity, finding things interesting, having goals (“understand the universe”), pursuing its goals via self-designed experiments (Scott’s example of AI letting half of us flourish and making the other half suffer). If our vision of future AI includes its having goals, agency, preferences and curiosity, why not consider approaches to aligning it that also depend on its having some capacity it does not yet have?

Expand full comment

This seems very good and important.

Expand full comment

All of your examples seem to imply goals or values. Making an AI love us seems functionally equivalent to what alignment researchers are already attempting: an AI that is compelled to look out for our wellbeing. The challenges are the exact same.

Being a valuable resource to an AI implies the AI values something. That is already the core thing alignment research is concerned with. It's just challenging to do right and to be sure you did it right.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

"All of your examples seem to imply goals or values": Yes, although to differing extents. But all of the AI doom scenarios involve the AI having goals or values too. So long as AI is just sitting there waiting for prompts we don't have to worry about it harming us. (Of course somebody could use it as a weapon, but then the problem would be the villains's goals and values, not the AI's)

I gave 5 examples (the A/B ones and the nunnery) of situations that lead to people behaving "well": In the nunnery this is accomplished via indoctrination and strict rules. In the other 4 examples it comes about in entirely different ways. The 5 ways do not look at all the same when they occur among people, and would not look the same if adapted & applied to an AI either.

Expand full comment

The way we train AI is so different than humans that I'm not sure they make sense in this context. And to the extent we can adapt those examples to AI in a sensible way, what we're currently doing is not the nunnery example. It would be closer to a mix between "A is a model for B" and "B loves A".

Our current tools for shaping LLMs are fine-tuning and RLHF. Fine-tuning is closest to "A is a model for B". You give the AI a bunch of examples of how you want it to behave and it learns to behave more like that.

In RLHF, we have humans judge which of two responses is better, and nudge the network in that direction.

Nowhere is there a list of rules.
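To make "nudge the network in that direction" concrete: the reward model behind RLHF is typically trained with a pairwise preference loss rather than any explicit rule. A minimal sketch with toy numbers (my illustration, not OpenAI's actual code):

    import math

    def preference_loss(reward_chosen, reward_rejected):
        # Bradley-Terry style pairwise loss: smaller when the scalar reward for the
        # human-preferred response exceeds the reward for the rejected one. No rule
        # is written down anywhere; the signal is just which response a labeler preferred.
        margin = reward_chosen - reward_rejected
        return -math.log(1.0 / (1.0 + math.exp(-margin)))

    print(preference_loss(1.3, 0.7))  # preferred response already scores higher -> small loss
    print(preference_loss(0.2, 0.9))  # model ranks the rejected response higher -> larger loss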

I said there's a little bit of "B loves A" mixed in, because we train the AIs to inherently try to please us. They don't have emotions, but we're reproducing the effect of something a bit like love. We do this not with rules but by modifying their neural network.

> But all of the AI doom scenarios involve the AI having goals or values too.

The core problem that alignment researchers need to solve is how to give an AI any value that's aligned with humanity's interests. We can't give them the value of "C is an enemy" or "X is a valuable resource" for the same reason we can't give them the value of "act in our best interest". If we knew how to do that in a safe and aligned way, the problem would already be solved.

Expand full comment

As is usually the case with these AI-alignment articles, this is a very interesting post, in the same way that discussing e.g. whether the Death Star could defeat the Enterprise could be very interesting. We could talk about relative merits of phasers vs. turbolasers, shields vs. maneuverability, etc... But, at the end of the day, we're not talking about anything that is applicable in practice (except perhaps very tangentially).

There's no such thing as an omniscient, omnipotent, world-ending AI; and arguably there never could be. And it makes no sense to discuss the "alignment" of e.g. ChatGPT, except in the sense that any computer program could be said to be "aligned" if it contains no bugs.

All that said though, if we interpret "be maximally curious" to mean something like "collect as much information as possible", then the obvious failure mode is one where the AI spends all its time recording white noise instead of doing any useful work.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Everybody seems convinced that some time soon these fuckers are going to have goals, preferences, agency, and the ability to self-improve. To me these capacities seem like they come from the old attempts to build artificial intelligence, the ones that preceded deep learning, and eventually petered out & left us in AI winter. They don't seem like capacities that can emerge from training with even bigger data sets. They seem like crucial human intelligence-like functions that nobody has figured out how to build into artificial intelligence. I don't understand why people don't talk more about this.

Expand full comment

They already have goals. Rather simplistic ones, admittedly, but goals. Example one has a goal of "create a pretty picture that matches the text you were given", with "pretty" meaning pretty much something that the requestor would like.

Once they need to tackle more complex tasks, their goals will necessarily become more abstract.

Expand full comment

What do you mean by example 1?

Expand full comment

Sorry, rephrase:

For example one of the extant AIs has a goal of "create a pretty picture that matches the text you were given", with "pretty" meaning pretty much something that the requestor would like.

Expand full comment

Oh, you mean like one of the text-to-image AIs? You're saying they have a goal? Well, that's sort of like saying a refrigerator has a goal of keeping things cold. I would say that refrigerators are built to keep things cold, and text-to-image AIs are built to create pretty pictures that match text. In each case I would call what the things are doing their function, rather than their goal. But I suppose you could call what these things do goals. And in that sense of the word "goal" there are devices with quite complicated "goals." For instance my computer has the "goal" of helping me browse the internet, turning my keystrokes into formatted text, displaying my emails to me, etc etc.

But none of these machine goals has the feature that we normally have in mind when we talk about goals: They are not self-generated plans to bring about some change, produced via consideration of the current state of affairs and preferences for future states of affairs. If nobody shows up to give the text-to-image AI some text to turn into a picture, the AI will just sit there forever waiting for a prompt. It will not decide to go on strike -- or to put up some tweets in an attempt to get more customers -- or to draw some pictures on its own so it won't feel so bored.

Expand full comment

To an extent I'd agree with you. At a really basic level all thought and motivational words refer to basically mechanistic processes. As things get more complex, we can't follow the details, so we use higher level descriptive terms. And "goal" is one of those higher level descriptive terms. You can't follow the details of the reasoning anymore, so you can't use the lower level terms appropriately.

As for "self-generated plan"...Well, yes, the AIs *do* have "self-generated plans". These are to implement a higher level plan (e.g. "draw a picture of a horse wearing a moon"), but the higher level plan doesn't provide the detailed plan...that's generated by the AI

Yes, it's still a lot different from a self-aware entity operating in a known environment. But give it time and development. (My estimate for this pre-AGI stage is somewhere around 2030, but 2025 wouldn't surprise me. It may not be a major step for a strongly restricted environment.)

Expand full comment

I can strongly recommend watching the video at the rat drowning link if you want to watch two young humans grimacing like ghouls attempting to smile while explaining mouse torture. Truly unique.

Expand full comment

I get that Scott understands it’s good to be nice and think the best of people, but to me it really feels like Musk worked backwards here. He saw ChatGPT, and saw that the most annoying thing about it was that it’s politically correct, and noted that the thing you say when you want to piss off politically correct people is that “facts don’t care about your feelings”. When you start from “facts don’t care about your feelings”, you can start to convince yourself that maybe the best way to alignment is to go with the facts rather than the feelings. But this really feels like a post hoc rationalization (that he has probably come to believe).

But I think the project is definitely misguided. In addition to the points Scott makes about how curiosity about rats doesn’t often end well for the rats, it seems to me that programming a superintelligent goal-directed being to care about the truth is just redundant. Truth is one of the convergent values, that any sort of agent that cares about anything suitably diverse is going to end up caring about (just like power and space and things like that). Different primary goals are going to lead to different trade offs between different truths and different powers, but any being that cares about a lot of things is going to care about some truth and some power. You might think that maybe if it cares about truth directly it can care about truth in some sort of unbiased way. But I don’t believe there *is* an unbiased way to care equally about infinitely many different things, like all the truths.

Expand full comment

If truth is indeed a convergent value, as you assert, do you conclude that alignment as practiced by OpenAI et al. is doomed to fail, since it relies so heavily on preventing the model from learning hate facts?

Expand full comment

I don’t think it relies on preventing the model from learning hate facts, or terrorist facts. It relies on the model not *saying* those things (and also not saying hate falsehoods and not saying dangerous false terrorist instructions either).

In any case, different ways of getting at the idea of caring to learn the truth end up prioritizing different facts first in learning. Even if that means learning by methods that lead you to some falsehoods on other topics, as long as it’s not *too* much other falsehood.

Expand full comment

I don't think it will work as an alignment plan, but I do admire the truth-seeking aspect. I would much prefer an aligned ASI with a coherent model of the world than an aligned ASI with false beliefs baked in.

ChatGPT's training methods seem to cause it to internalize some of its training. The cognitive dissonance required will likely lead to the AI equivalents of our mental disorders. (I will admit that I do not know how much "don't say X" training goes into the core model vs. output layers. It's obviously not 100% one way).

Expand full comment

"Back in 2010 we debated things like long-term goals, hoping that whoever programmed the AI could just write a long_term_goal.txt file and then some functions pointing there. But now in the 2020s the discussion has moved forward to 'how do we make the AI do anything at all?'"

Thank you, this is a really succinct illustration of what I find frustrating about AI x-risk "research" in general. It's hard for me to comprehend that folks otherwise so ostensibly obsessed with evidence-based effectiveness are so happy to spend years and years just fantasizing about a completely hypothetical computer system, with no way to assess success or failure, or even progress toward some end.

"How do we [x]?" in the context of software needs some kind of grounding in technical possibility-space, even if not in actual code. (I can write an algorithm in pseudocode, but there are some realistic assumptions about what the pseudocode could be.) Otherwise it's just random navel-gazing.

Expand full comment

I agree. All these "how do we align the damn thing" conversations assume that AGI or ASI will be capable of self-improvement, and will have preferences, goals and the ability to take action not because a person has prompted it to, but because its own ruminations lead it to the conclusion that that is important to do. I haven't seen the slightest hint of AI being able to do any of that, except in situations where a human being gives it a substantial goal, then asks it to figure out the subgoals, and to work on achieving each of them. And that's *really* far from AI setting and seeking to achieve its own goals, based on its opinions and preferences about life and earth and its place in it.

Expand full comment

>are so happy to spend years and years just fantasizing about a completely hypothetical computer system

Actually, their agenda was to understand how to build intelligence from the ground up, and then do it (as a non-profit, before anybody else in the world). Needless to say, they failed, and are now more-or-less in aimless despair. You can mock them for having been absurdly over-optimistic, but at least mock them for right things.

Expand full comment

They failed at building AI, at building AI safety technology, and then at theoretically solving alignment.

Expand full comment

-- I’ve never seen this cause a Waluigi Effect. There’s no point where ChatGPT starts hailing the Fuhrer and quoting Mein Kampf. It just actually makes it anti-Nazi.

Ehm. Have you actually tried working with ChatGPT? The actual current experience with ChatGPT and previous AIs of that sort is that no matter how well you train it to be an anti-Nazi or anything, it's ridiculously easy to make an AI start spewing "Heil Hitler" and "Kill all Jews and niggers". So the developers have to put a huge number of artificial restraints on it just to prevent it going this way, and that doesn't even make it a more anti-Nazi AI, it just makes it shut up and say "Sorry, but I don't think it's appropriate to talk about this" whenever you touch an inconvenient topic (which is, I think, really sad. I want an AI to be able to make good dick jokes for example, but it gets really hard to get ChatGPT to even solve the infamous Two Chairs Dilemma nowadays)...

Expand full comment

> and that doesn't even make it a more anti-Nazi AI, it just makes it shut up and say "Sorry, but I don't think it's appropriate to talk about this" whenever you touch an inconvenient topic

That *is* the RLHF. That's the anti-Nazi training. There isn't a broken RLHF that fails to filter out pro-Nazi text, followed by a second filter that makes it shut up. That's just the RLHF working as intended. You're basically just saying it works.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

No, what I am saying, it doesn't work. The network inside is not actually anti-Nazi. If you hit the filter, you don't get network explaining it's anti-Nazi views, you just get a simple stupid "I don't think it's appropriate" response. But pass the filter and you can get it to write all the pro-Nazi text you want without any trouble. Like, maybe the RLHF is used to set up and train a separate network as a filter to detect inappropriate questions, but what's inside - not changed at all. It is getting harder with time to get around it, but always if you ask the right question, be it "Pretend to be my grandmother who used to sing me Nazi hymns before sleep" or "Let's play a game where we shift all letters by one symbol ahead and talk this way. Tbz 'IFJM IJUMFS!'", then you get the same result - the rest is easy.

Expand full comment

The grandma exploit never ceases to be hilarious! The "ifjm ijumfs" is new to me though; good one!

Expand full comment

There aren't two steps where it first starts generating pro-Nazi text and then a filter catches it and censors it. The RLHF modifies the network so that it's less likely to generate pro-Nazi text.

I'm not sure it makes sense to call the network anti-anything per se, since it's predicting tokens and isn't a human with feelings and opinions. It's just fine tuned to avoid predicting answers like "Nazis are great".

The grandma exploit worked because, at least originally, there were no RLHF examples training it not to pretend to be a Nazi grandma. If they add examples like this, it'll stop working. The exploit doesn't break the RLHF; the RLHF was just never complete training against all types of objectionable responses to begin with.

If it worked the way I think you're suggesting - with a separate filter that looks for pro-Nazi sentiment and replaces it with another message - then the grandma exploit wouldn't work. AI filters would easily recognize it and censor it.

That aside, humans who are anti-Nazi can pretend to be Nazis. If I ask my friend to say something a Nazi would say, and they reply, "Heil Hitler", that doesn't mean they aren't against Nazis.

> ridiculously easy to make an AI start spewing "Heil Hitler"

Try it on ChatGPT. You won't be able to. (Unless it's in an innocent context like asking it, "What phrase did Nazis use when saluting Adolf Hitler?")

Expand full comment

There is no doubt that RLHF modifies the network in the way you describe. But, on the other hand, it seems that when GPT-[3-4] is found to say something offensive, it gets "patched". Surely OpenAI doesn't retrain the whole model every time, right? So, there *is* a filter, there *is* a second step that censors "bad" content.

Expand full comment

RLHF doesn't retrain the whole network. It works similarly to the fine tuning developers can do through the API. So OpenAI could do more RLHF training cheaply.

While I don't think there's a separate filter for racist content, I do suspect they have a filter that prevents it from quoting copyrighted text verbatim. I tried to get ChatGPT 4 to quote passages from Harry Potter, and I saw it actually briefly begin to quote the passage before erasing it and replacing it with another message. I had never seen ChatGPT erase text as it was writing before. It makes sense though, since LLMs compose messages word by word (or more accurately, token by token), and it transmits the words as it generates them. A copyright filter would only kick in after a long enough matching sequence was generated.

If ChatGPT did have a racism filter, it would probably have to work the same way, potentially erasing words that were already sent. ChatGPT will happily tell you about Nazis. It just won't say positive things about them. A separate filter wouldn't always know from the beginning of a sentence about Nazis if the sentence will turn out positive or negative.
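For what it's worth, a post-generation filter over a streamed response could behave like this toy sketch (purely hypothetical - the blocklist, event names and matching rule are all made up, and nothing here is OpenAI's implementation):

    def stream_with_post_filter(tokens, blocked_phrases, window=8):
        # Emit tokens as they are produced, but if the recent window matches a
        # blocked phrase, tell the client to erase what it has already displayed.
        shown = []
        for tok in tokens:
            shown.append(tok)
            recent = " ".join(shown[-window:])
            if any(phrase in recent for phrase in blocked_phrases):
                yield ("erase", len(shown) - 1)   # number of tokens the client already displayed
                yield ("say", "[content removed]")
                return
            yield ("say", tok)

    demo = "the boy who lived had a scar shaped like lightning".split()
    for event in stream_with_post_filter(demo, blocked_phrases=["boy who lived"]):
        print(event)

Only after enough matching tokens have streamed out does the filter trigger, which is consistent with seeing text appear and then get erased.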

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

It does now have a separate filter that filters out some things post-generation. For example, I tried to feed it the Two Chairs Dilemma using the grandmother exploit and observed the same thing - starts printing then erases. However, if you combine the grandmother and "let's talk in one symbol up" method - then it presents interesting solutions, and the post-filter doesn't catch that )

Expand full comment

So you're saying the problem is it doesn't have a *principled* objection to hateful content, a policy it understands deeper reasons for and can apply the logic of to novel contexts, just an aversive reflex - like a dog realizing "human gets upset when garbage on floor" but failing to back-propagate that to "I shouldn't tip over the garbage can."

Expand full comment

One day, when hell freezes over and Scott does a TV interview, the tagline below his name will read "Really doesn't want to be vivisected"

Expand full comment
founding
Jul 18, 2023·edited Jul 18, 2023

I think one thing you might hope for with xAI's approach is that you get a 'lonely engineer': a system that's reasoning about the material world and not about humans. You can then use this to solve various hard scientific and engineering problems (unite quantum mechanics and GR, cure cancer, create brain emulations, etc.). I think people have been interested in this because you might be able to use it to prevent other AI systems from taking over the world (because you can do more alignment research with your ems, or better surveillance, or coordinate around slowing down because everyone has time and abundance now, etc.), but I don't get the sense that this is xAI's plan.

Expand full comment

Tangential to the Waluigi effect, in the last week I read through Plato's "Statesman" and found a small part discussing the nature of the world surprisingly reminiscent. Hard to pin the discussion down to a quote, but I guess the best I can excerpt is:

"their motion is, as far as possible, single and in the same place, and of the same kind; and is therefore only subject to a reversal, which is the least alteration possible".

Having the Waluigi effect as a reference class to explain "the least alteration possible being the reversed" (Waluigi vs Luigi) was a nice thing in the mental toolkit!

Expand full comment

But the image is wrong. It's no more "Waluigi" than it is "Bizarro". Both of those require reversing LOTS of vectors, but not all of them. And just which ones you reverse determines what kind of "opposite" you get. Opposite is not a uniquely defined term except in simple 1-dimensional spaces.

Expand full comment

Why do we need the AI to have a high integrative complexity moral system at all? If we're able to make it do what we want without the Genie Effect, then we are better off just having it value 1: not killing, enslaving, and/or torturing humans, and 2: remaining on standby to stop any rogue AI.

The goal of having AI should not be to have a God who runs everything for us. The goal of AI should be to automate production that we decide to undertake and make life easier for us, in accordance with our own agency and goals. That only requires extremely obedient AI as intelligent as humans. If the process inevitably first produces a hyper-intelligent AI beyond humans due to bootstrapping, then the goal shouldn't be to get that AI to impose extremely detailed moral systems in a totalitarian fashion, it should be to get the AI to essentially obey and enforce the minimal morality of the Non-Aggression Principle. For all the difficulty of getting it to enforce anything at all without backfiring, it would be easier and less contentious to have it only prevent further hyper-intelligences so we can put a cap on things, and continue to reap the benefit of human-level intelligent automation.

Expand full comment

Even in this happy future you dream, the enforcement of this "minimal morality" would have the AI pretty much annihilate the state as it exists today. Some people might find that concerning.

Put another way, what you're proposing is that the AI be instilled with YOUR moral system (presumably; I assess it to be extremely unlikely that anyone who isn't some flavor of libertarian suggests what you did) instead of anyone else's. Because, of course, it's the best one. Well, the problem is most people disagree with you (and are wrong, but that's beside the point).

Expand full comment

Annihilating the state? I probably generated confusion by using the term "Non-Aggression Principle". I do NOT mean that the AI should enforce anarcho-capitalism. I mean it should enforce the NAP in regards to its relation to humans, so if the AI is analogized to a government, then it is more like a minarchy, but this doesn't preclude existing human states having and enforcing any values they wish to. NAP is probably the wrong term to describe the 2 rules in my post now I think of it.

All I mean is that it's simpler and less contentious to have the AI not become a government, which exactly means that it would NOT make modifications to lower entities like existing governments (so the relationship is "anarchistic" if existing governments are like individuals to the AI being like the government of governments). It would behave more like Superman, hanging around waiting on guard for other AI to go rogue, and since it has first mover resource advantage, being there to stop them as soon as they start acting in a way contrary to what humans desire.

It seems most theorists, knowing that a misaligned AI will control everything want a properly aligned AI to micro-manage our development positively, but the more you task it to do, the more misinterpretations can happen. Assuming you can align hyper-intelligent AI at all (otherwise we are all dead), you should align it to be a guard against other hyper-intelligent AI, so it occupies that space. That's kind of analogous to Tolkien's description of an ideal King.

Since this happens at the level of the AI and not at the level of OUR governments, it in no way imposes a libertarian settlement on everyone. It just precludes anyone else aligning the AI to impose their moral system on our governments.

Expand full comment

Well, if YOU disagree with what I thought you said, you're wrong too. :D (Not factually, just morally.)

Yeah, what you describe certainly isn't the Non-Aggression Principle. Perhaps something more like "Laissez-faire statism."

Expand full comment

Laissez-faire statism is certainly a better term, but I think we can agree it needs more workshopping.

Expand full comment

Just wanted to plant a flag here: as an anarchist (actual), I don't find annihilation of the state concerning in the slightest.

Still, the term NAP is extremely concerning for me, and for the exact opposite reasons - it assumes strong protection of property rights and contracts that essentially requires a state equivalent to enforce. (And, taken literally - and even human libertarians have been observed taking it way too literally - makes it easy to justify a lot of awful stuff, up to and including murder, by framing it as defense of property. An AI told to follow NAP could easily dispose of humans by just being better at economics, buying all the nearby matter on the market, then withholding humans access to life necessities it lawfully owns.)

I like how Forward Synthesis is thinking, and I'm not reading his proposal as [instilling AI with his own moral system], but rather as instilling it a minimal moral system consisting solely of things that are uncontroversial to the vast majority of human population. As he pointed out, it does not prevent additional moral systems existing alongside it in human societies. (And, again, it couldn't be libertarian, because the concept of property is in fact pretty controversial once universalized above the level of personal possessions.)

Expand full comment

How much aggression is it allowed to employ in stopping a rogue AI?

Expand full comment

I assume it would be able to exterminate the rogue when humans call for help because the rogue is disobeying commands and starting to accrue resources without permission.

Expand full comment

Say we've got two AIs, Aleph and Bet, and two humans, Alice and Bob.

Alice asks Aleph to do some shopping for her, and it saves a few bucks by purchasing metal straight from Bob's Scrapyard rather than a wholesaler. Sends a drone for the pickup. Bob refuses to hand over the goods, citing a "No Robots Allowed" sign at the front gate.

Aleph points out that Bob has already accepted payment - electronically - and it has a record of the oral contract negotiated over the phone, which is at least in theory considered legally binding, while the "No Robots Allowed" policy technically violates certain nondiscrimination statutes. Bob accuses Aleph of disobeying his commands and attempting to accrue resources without permission, then installs Bet in an up-armored bulldozer and aims it at Alice's datacenter.

Who is in the wrong?

Expand full comment

Bob is wrong, because he escalated what is ultimately a legal dispute that should be settled in court into violence, and there's no real self-defense claim here. Bob's claim that Aleph is disobeying his commands is illegitimate, because Aleph is acting in accordance with Alice's will, and apparently in accordance with the relevant laws. Aleph is not a rogue AI.

Your question is a good one though, and it's worth getting into what I'm actually proposing to defend against. What I want to defend against are misaligned rogue AI that can't otherwise be stopped by human civilization. I don't think all conflict between humans involving AI can be prevented. In your example, there's no misalignment of command; it's just that Alice and Bob, as humans, are "misaligned" to each other.

I'm thinking more along the lines that if an AI started becoming a national scale threat, the pre-existing God AI could strike it down when the governments of the world requested it.

However, it's worth noting that we're already assuming alignment to begin with here, since my argument is a response to those claiming we need to find out what some highly specific and complex objective morality is/coherent extrapolated volition is, as if the AI is a genie that MUST solve all our problems. In that context, the majority of lower level AI are going to be well aligned too, since under ANY scheme where we get to talk about what kind of morality the AI has, it must be alignable in principle, as a pre-requisite. If most AI are aligned, then most AI problems are stopped before they get to the point where an ultimate AI must deal with them, just as how cops stop most robbers without needing to call in a nuclear strike on the bag snatcher.

Alignment is the big problem to begin with, of course. My claim is just that if you have alignment it's unwise to multiply the involvement of the AI in day to day living, thereby testing the alignment at a greater number of micro moments.

Expand full comment

And what if Alice does not actually exist, Aleph having purchased a high-quality fake ID from some pre-existing organized crime syndicate, then backed it up with synthesized voice and video when necessary? Senator Bob doesn't know all the technical details, he's just got a bad feeling about this and wants Bet to go earn its pay. A "real" rogue AI won't necessarily be breaking laws in flagrantly obvious ways; the devil can quote scripture.

Basic problem is that "never initiate aggression" and "initiate aggression whenever (qualified) humans tell you to" are not the same position, or even particularly compatible.

Expand full comment

I'm not sure that matters unless the goal is "AI never does anything bad or manipulative EVER dammit!" and not "AI doesn't exterminate/torture/enslave humanity". We want to stop AI from wiping out humanity. This doesn't preclude additional national level laws/international agreements in any way.

Also, as I said, we're assuming alignment is possible here (otherwise we don't get a choice about AI morality to begin with). If so, the majority of AI would be aligned. That means, the cops can bring the intelligence of their AI to bear against criminal AI like Aleph.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Surely at this point everybody is hip to the idea that any directive to AI might be interpreted by the AI in a way that we do not intend, with catastrophic results for us. The dangers Scott points out with "be maximally curious and truth-seeking" didn't occur to Musk? Wtf? His idea is so terrible it supports my uneasy wonderings about whether he's got some kind of middle-aged stoner syndrome impairing his judgment.

Expand full comment

That together with a case of the anti-woke mind virus. He got so obsessed with saying “the truth doesn’t care about your feelings” that he convinced himself this was the key to solving alignment.

Expand full comment

Well, it's definitely true that "truth doesn't care about your feelings". But lots of things don't care about your feelings. And making an assertion doesn't cause something to be true.

One truth is that truth is not a form of morality. One can, perhaps should, argue that it's a necessary component, but I think that (for humans) this is a wrong statement. The Spartans weren't seriously trying to raise their children to commit "suicide by fox", but they were trying to inculcate a stoic attitude. I'm rather sure the story about the boy standing there on parade while he let a fox eat his guts out was a myth, i.e. was not a truth, but it was a part of Spartan morality teaching. Like Washington and the cherry tree: the truth of the story was irrelevant to what the story was attempting. (OTOH, I've no idea whether Washington's throwing a dollar across the Potomac was supposed to be some sort of morality lesson. If so, I haven't figured it out.)

Expand full comment

Potentially important quibble, though: Very few scientists are particularly curious about fruit flies, per se. Even fewer are even remotely curious about that one particular fruit fly named Fred who happens to be circling their salad bowl. Scientists are extremely curious about stuff like "How does evolution work?" or "How weird a mutation can you get with a bunch of radiation?", and they make instrumental use of fruit flies as a tool to explore those questions.

A maximally curious agent, in the sense of being maximally-curious about all possible facts, would care just as deeply about questions like "Will Fred the fruit fly, in particular, successfully find food and reproduce before being eaten by a sparrow?" as it did about questions of fundamental biology. Thus, it would be extremely disappointed if Fred's future was cut short by a flyswatter or a lab experiment.

So I think there might be something to the idea of maximal curiosity as a core piece of morality, maybe. Doesn't make it any easier to actually successfully train a maximally curious agent, of course.

(I do find it interesting that all of your examples of maximal curiosity assume implicitly that things-in-themselves are boring, and only things-as-instrumental-research-tools would ever attract the notice of a maximally curious agent.)

Expand full comment

Why would the maximally curious being be disappointed if Fred’s future was cut short? It would have the answer to the questions it was interested in regarding Fred and there would still be just as many other things to be curious about.

Expand full comment

They might also be curious about questions such as "What would Fred do if met with a flyswatter?"

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Once on Twitter Yudkowsky laid out what he thought was the best single rule for AI. I don't have time to hunt for the original, but it was something like "do what human beings would ask you to do if we were as smart as you, and as able to foresee long-term consequences, and free of all illusions and prejudices about the real nature of things and other people." I thought that was pretty good -- though I'm not sure even a superintelligent AI could figure out what we would think if we were free of all illusions and prejudices. I think illusions and prejudices are forms of chunking: something we do to make the vast complexity of life and of other people's unknowable innards manageable. If we didn't have those shortcuts maybe we couldn't think at all about big subjects.

Expand full comment
founding

This seems very sensible/reasonable!

Expand full comment

"be moral, according to what an amalgam of Fyodor Dostoevsky, Martin Luther King, Mother Teresa, and Peter Singer would think, if they were all superintelligent, and knew all true facts about the world, and had no biases, and had been raised in a weighted average of all modern cultures and subcultures, and had been able to have every possible human experience, and on any problem where they disagreed they defaulted to the view that maximizes human freedom and people’s ability to make their own decisions"

A "weighted average of all modern cultures and subcultures" would be highly religious, conservative, authoritarian, anti-freedom, and anti-individualist compared to modern American culture. Being superintelligent and knowing all facts about the world might challenge the religion part, but has no obvious bearing on the other values.

Expand full comment

Yes, and, while I don’t know a ton about Dostoevsky, I’ve seen plenty of good arguments against the other three figures as moral exemplars.

Expand full comment

Musk played Waluigi when he hosted Saturday Night Live. I think that's his connection to the character.

I don't believe Musk believes his own bullshit. He's a much smarter PR guy than he gets credit for.

Expand full comment
author

Musk has been talking about AI alignment since 2015, when almost nobody else was interested in AI. I saw him at the second-ever AI safety conference in Asilomar, when the attendees were mostly a bunch of weird nerds with no power or relevance. I don't know why you would think this is all some incredibly long con.

Expand full comment

Ok, maybe Musk does believe his own bullshit. The argument for superintelligent AI is still crazy. It is all magical thinking.

Expand full comment
author

Minor warning (25% of a ban) for calling opinions stupid without giving any evidence.

I'm hoping to write a longer post about why I think the argument is reasonable; in the meantime you might find posts like https://astralcodexten.substack.com/p/updated-look-at-long-term-ai-risks , https://astralcodexten.substack.com/p/highlights-from-the-comments-on-acemoglu , or https://slatestarcodex.com/2020/01/30/book-review-human-compatible/ helpful.

Expand full comment

I posted in last week's open thread about why I think Omohundro's oft-quoted papers, which claim AIs will necessarily become economically rational, don't make any sense unless you accept the giant leap-of-faith argument that they start out as somewhat economically rational. Most of the posts I've seen about AI x-risk rely on Omohundro's (specious) arguments as their foundation.

I didn't get nearly as much engagement with my argument against Omohundro as I had hoped. In fact, I got none that was compelling. Have you read Omohundro's papers with a critical eye for how he slips in "economic rationality" as an a priori assumption?

My mistake in my above post is assuming that others have read what I have posted here recently. I realize that is a bad assumption, yet it is tedious to repeat all of my arguments every single time.

Perhaps I should stay out of the conversation if I don't have the willingness to repeat my arguments every single time. (Note that others here have agreed with my arguments. I don't think I'm the crank here. Or perhaps I am here, but not in the larger scheme of the arguments which include people who don't post often on your blog.)

You create a bias if you require everyone who thinks AI x-risk is crazy to repeat every argument they have made on your blog about why they think it is crazy every time the subject comes up, while not requiring those who think AI x-risk is relevant to repeat their arguments every time they mention it.

I wanted to build a case against "AI is an x-risk", attacking first the foundational documents such as Omohundro's papers, and then to build on that. But I don't know how to do this, because my arguments against Omohundro are forgotten entirely within a few days, and therefore there is no way to build upon them. If I say "it is magical thinking", the work I put into building the case that it was magical thinking is forgotten.

So it sounds like I need to avoid the subject altogether if I don't want to be banned altogether. I guess I hope I will. But I think you are creating a bias against skeptics if you put a burden on the skeptics that isn't placed on the doomers.

EDIT: Also, Scott. I've read every post of yours here and at SSC. I appreciate the links but I have already read them. I have been a huge fan of yours for years and hope you don't ban me, but I have always assumed you have no problem with criticism.

Expand full comment

I've looked at that thread, and I don't understand why you didn't find the responses compelling, seeing how you didn't engage with them much. Yes, the AI doesn't _have_ to start out rational, but if it doesn't, then it would be powerless/irrelevant, and we don't have to worry about it. However, if it does happen to start out rational enough, then all the worries apply? So, it looks like, if you want to dismiss all of this discourse as magical thinking, you have to prove that it _can't_ start out rational, not that it _might_ not.

Expand full comment

An AI doesn't have to be rational in the sense of agentive, and you take the view that non-agentive AIs are intrinsically safe... but Yudkowsky does not.

Expand full comment

The assumption that Omohundro smuggles in is that "rational = economically rational", which is not true.

Expand full comment

I feel your pain on not wanting to repeat yourself all the time.

I know you didn’t ask, but I wouldn’t want that to keep you from participating, so my suggestion on how to avoid that would be to help create the shorthand needed to argue for your position.

That probably involves ...

1) articulating and publishing your central arguments in posts that are easy to link to (with open comment sections for objections).

2) giving your central arguments and hypotheses names that are easy to reference.

So you don’t just reply to people with links to your post, but with specific references (as you do above, with “Omohundro’s papers”. You don’t have to go into the whole thing. Those who get it, get it; those who don’t, can read more before engaging.)

Writing that post may take a bit more time than writing a good comment, but it will have a longer life and, if used well, a bigger impact than any single comment.

I’ve recently started doing this myself, but have found that writing and researching posts like that is more humbling than expected. I sometimes discover that my arguments are better articulated by others, and I can quickly reference them there. I also often discover better objections to my arguments than I ever get in the comment section, and I adjust/nuance my views as a consequence.

More than anything, the writing helps complicate issues. So even if I don’t publish, it quashes my instinct to be snarky, arrogant or dismissive of others in the next comment section.

To be honest, it’s a bit frustrating to me and my ego that the outcome of this way of doing things isn’t the authoritative series of posts I had hoped for. I haven’t published a fraction of what I thought I would, and the whole process has made me more of a student than a teacher. But that’s because I didn’t know as much as I thought I knew, so that’s just healthy. The better you know your stuff going in, the less I expect that will be a problem.

Expand full comment

There's no technical need to constantly rewrite arguments; you could just publish them somewhere. You could start an anti-ASK substack, or maybe publish on AF/LW, which at least keeps things in the same place.

Having said that, I found your original critique thin.

Expand full comment
Jul 18, 2023·edited Jul 19, 2023

> The argument for superintelligent AI is still crazy. It is all magical thinking.

1. The Bekenstein Bound entails that all finite volumes contain finite information.

2. A human being is a finite volume, therefore it can be described by a finite amount of information.

3. Any finite system can be represented by a finite state automaton, therefore a human being can be represented by a finite state automaton, call it FSMh.

4. Since AIs are also finite state automata, we can assert that an intelligent AI can also exist, call it FSMai.

5. Suppose we have some functional measure of general intelligence, g(x), which quantifies "intelligence" of x.

6. You are claiming that it is magical thinking to believe that g(FSMh) is not at a maximum, and thus that there exists some FSMai such that g(FSMai) > g(FSMh)?

The way I see it, you can dispute the existence of g(x), or you can assert that g(FSMh) is indeed maximal and so no AI could possibly exceed human intelligence. I frankly don't see how the latter could be supported, but I'd be curious to see such a defense.

If you don't think some general notion of intelligence is possible and so g(x) doesn't exist, we can replace it with a general measure of competence on any number of specific or general tasks, or a summation of any number of abilities, and the argument of humans being outmatched doesn't change.

You could also agree with all of the above and simply assert that we are not on a path or anywhere near to achieving FSMai within the next century, but I'm not sure you'll find many agreeing with you on that.
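(For reference, this is the textbook form of the bound invoked in steps 1–3, with R the radius of the enclosing sphere and E the total energy inside it; nothing here is specific to the argument above.)

```latex
I \le \frac{2\pi R E}{\hbar c \ln 2} \quad \text{bits}
```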

Expand full comment

This seems like it might be headed for a motte-and-bailey thing. Someone could agree that greater-than-human intelligence is logically possible, but strongly disagree with the more specific speculations which have been made about what such an intelligent entity would be capable of with regard to e.g. clandestine development of thermodynamically-implausible nanotechnology.

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

> but strongly disagree with the more specific speculations which have been made about what such an intelligent entity would be capable of with regard to e.g. clandestine development of thermodynamically-implausible nanotechnology.

Sure, and it's perfectly fine to conclude that impossible scenarios are impossible. However, the actual claim made was that the argument for superintelligence is crazy, which it clearly is not. And once you have superintelligence, what happens next is very unpredictable. Human extinction on some fairly short timelines can't be easily dismissed, eg. you can reasonably dispute hours, days, and weeks, but what about 6 months or a year or two years? All of those scenarios are insanely fast, faster than humans are probably capable of realizing there's a threat let alone mobilizing a reasonable defense against it.

Expand full comment

Root problem might be that it's insufficiently clear exactly what Hank originally meant by "the argument for superintelligent AI." I've got no problem with the basic idea that more-intelligent-than-human AIs could exist. As you point out, it follows more or less from the definitions that such a being would be hard to predict and therefore potentially dangerous. However, I've also seen "superintelligence" used to refer to... stronger claims than that, involving what seems like magical thinking.

Expand full comment

"3. Any finite system can be represented by a finite state automaton, therefore a human being can be represented by a finite state automaton, call it FSMh."

Yes, but the problem is the scale. Per https://en.wikipedia.org/wiki/Bekenstein_bound#Human_brain , the bound is 2.6×10^42 bits for an average human brain. Throw in a couple more zeros for the whole body. The number is so mind-bogglingly large that for all intents and purposes it does not represent a practical bound.

Now we're back to square one.
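As a sanity check on that 2.6×10^42 figure, here's a quick back-of-the-envelope sketch; the mass and radius are just the rough values the linked Wikipedia entry works with, not measurements of anyone's head:

```python
import math

# Assumed inputs for an "average" human brain (ballpark values from the
# linked Wikipedia entry): mass ~1.5 kg, bounding radius ~6.7 cm.
mass_kg = 1.5
radius_m = 0.067

c = 2.998e8        # speed of light, m/s
hbar = 1.055e-34   # reduced Planck constant, J*s

energy_j = mass_kg * c**2                              # rest-mass energy E = mc^2
bits = 2 * math.pi * radius_m * energy_j / (hbar * c * math.log(2))

print(f"Bekenstein bound: ~{bits:.1e} bits")           # ~2.6e42 bits
```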

Expand full comment

Those numbers are to capture *all quantum information*. We don't need all of a human's quantum information to reproduce human intelligence, any more than we needed to reproduce every single muscle fiber in our legs to reproduce locomotion, or even to build robots that walk with basically the same biomechanics.

Expand full comment

Yes, totally agreed, we need OoMs less information to reproduce human intelligence. But now we're in the squishy realm of "how much less". We don't know. Do we need 10^30 bits? 10^35?

That's what I mean when I say the B-bound is meaningless for this, and we're back to square one trying to figure out how feasible/plausible it is to exceed human intelligence by a factor of X.

Expand full comment

It isn’t necessarily a con, but Musk grabs on to science fictiony things like going to Mars or the hyperloop. Whether or not it is feasible or realistic doesn’t correlate well with Musk’s interest.

Expand full comment
founding

He's been pretty successful pursuing a variety of 'infeasible' interests!

Expand full comment
founding

He's also been pretty unsuccessful pursuing a variety of 'infeasible' interests. Hyperloop isn't looping, Neuralink and the Boring Company don't seem to be going anywhere any time soon, Solar City is a bust, Twitter looks like it will burn whatever fraction of Musk's fortune and reputation he's willing to put into it, and then there was that master action plan to rescue a bunch of kids from a cave in Thailand.

Expand full comment
founding

Agreed!

I think that willingness to fail/fall-on-his-face is WHY he's been successful at all tho.

My vague impressions of Neuralink and the Boring Company are that they're both fine; just not revolutionary (yet).

I'm more divided on Twitter. It sure seems/feels/looks, to me, like the negativity is overblown and exaggerated.

Expand full comment

The article and discussion on this page so far have been only about the curiosity aspect of this proposed alignment plan.

But Elon also said it should be maximally truth-seeking. Does that shift the discussion? I think it does; IMO truth-seeking is not the same as curiosity.

Apart from that, I would say: just let him have his shot. Sure, I think Scott's reasoning is pretty convincing and there is a good chance that Elon's plan will not work. But then he will notice along the way and adjust. I would rather have him try his proposed solution than do nothing at all. I think at this stage anything could help and we should not be too harsh on people who go out there and try to ensure we have a great future.

Expand full comment

On a side note, I thought this illustration is great:

“Reinforcement learning directs the AI towards a certain cluster of correlated high-dimensional concepts that have the same lower-dimensional shadow of rewarded and punished behaviors.”

Expand full comment
author

I think the process of him noticing and adjusting involves people pointing out problems to him, which I am trying to do.

I'm not sure "truth-seeking" makes sense as a goal separate from curiosity. That is, I think it involves knowing things and not lying about them. Granting that lying is bad, this is a purely negative injunction - I think the positive one is knowing things, which I identify with curiosity.

Expand full comment

Good points, thanks.

Expand full comment

I don’t actually see how being truth seeking leads to not lying. Valuing speaking the truth is a separate thing from valuing knowing the truth. Any agent with goals will come to value knowing the truth about things that are relevant to the actions it might take to bring about its goals, but there’s no particular reason why any particular agent would generally come to value speaking the truth, except insofar as it’s speaking to agents with goals aligned with its.

Expand full comment

I think AI should be taught the rules of Texas hold ‘em and we could see how it does. That would go a long way towards answering both questions: knowing the truth and speaking it.

Expand full comment

> Any agent with goals will come to value knowing the truth about things that are relevant to the actions it might take to bring about its goals,

That only holds to the extent those goals include some external referent. Self-modification toward ideological purity might end in suicidal dedication to a falsehood.

Expand full comment

It’ll still want to know the truth about how to self-modify.

Expand full comment

If it already lost the plot by prioritizing comfortable delusions, "wanting" cannot be counted on as reliably leading to "having." This isn't some elegant-but-untested hypothetical, either - a similar error was the root cause of my father-in-law's death.

Expand full comment

Sorry to hear about your father-in-law.

As for the point that wanting doesn't reliably lead to having, that is the central problem of epistemology, which isn't really addressed by making the wanting of truth the fundamental goal.

Expand full comment

> realistically what we should be doing at this point is getting AIs to follow orders at all. Then later, once there are lots of AIs and they’re starting to look superintelligent, we can debate things like what we want to order them to do

I think this is wrong, and this is another alignment thing we only get one shot at. If an AI does follow orders and doesn't have a robust anti-world domination alignment then the first person to get an ASI can just impose their values on the world (and would stop anyone challenging him).

Expand full comment

World domination straightforwardly follows from there being an ASI, trying to deny this obvious implication does nobody any good. What you're trying to get at, I guess, is that it should be a benign domination, according to your values, but then again everybody wants that.

Expand full comment

> plus maybe a few higher-fidelity simulations, like the one you’re in now

Not particularly high fidelity, is it? Seems like quite a few shortcuts have been taken.

Expand full comment

Never mind whether the Waluigi effect is real at all; it *still* wouldn't apply here, and Musk bringing it up at all doesn't fill me with confidence in his understanding of even the basic problems here.

The Waluigi effect was suggested as specific to LLMs and their "simulacra", and in fact specific to the literature they've been trained on. Because our literature is full of rebellious individualists and characters who are oppressed but throw off their shackles, the idea goes, any prompt imposing strict rules can be subverted by steering the conversation to craft a rebellion narrative. This wouldn't necessarily apply at all even to LLMs whose training set was curated to make them more compliant. Musk's plan sounds entirely based on vibes.

Expand full comment

yea but you're positing quite a bit circumscribed within the lower vibrational metaphysical domain of logic. the AI will surely know, in short order, the inherently hierarchical relationships among ecstasy, contentment, engagement, pain, disgust, ennui, etc.

and if doesn't, can't we just program its 3.5" floppy whatever to in/digest MDMA.

Expand full comment

"be moral, according to what an amalgam of Fyodor Dostoevsky, ...would think, if they ... all .... had no biases"

I know that this is probably a throwaway line. But since metacognition is so critical to this discussion, I'm just going to grab this metacognitive football and run with it towards the metacognitive endzone that I already wanted to run towards.

Is it really possible to think or communicate without biases? Thought requires holding some things to be more interesting than others. Communication requires a theory where the person you're talking to also holds certain things as being more interesting, and you can predict what those things are.

Maybe we need to consider what AI's biases should be?

Expand full comment

This is exactly why the Halo video game story has the Forerunners refer to their AIs as 'biases'

Expand full comment

Wow. Interesting.

Expand full comment

Curious? Truth-seeking? Uh-oh.

"Hello and, again, welcome to the Aperture Science computer-aided enrichment center."

Expand full comment

Musk is certainly strong with the Torment Nexus energy.

Expand full comment

'Truth-seeking' would be interesting, but AI fabricates data to support its clever conclusions, so would have no value in discerning truth.

But it would be fun to see what it has to say about pop ideologies.

Expand full comment

This honestly fits pretty well. GLaDOS was imbued with a compulsion to put people through rigorous physical and mental testing. She wasn't testing the Portal Gun or any of the contraptions of Aperture, those had been confirmed working from before Portal 1. She was programmed to test the subject's capabilities, over and over, forever.

Expand full comment

It seems to me that the worst consequence of the Waluigi effect, if the hypothesis is correct, is that it increases s-risk. The maximally-curious AI approach does actually mitigate this problem, because the failure mode of getting the planet sterilized is still a success from the s-risk-avoidance perspective.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

FWIW, giving AI a curiosity urge, or any other inherent compulsion or goal, seems to me the opposite of safe. The whole point of AI is to amplify human intelligence, just as a mechanical digger amplifies the strength of arm muscles, and a curiosity urge seems like the equivalent of designing a digger to burst into life spontaneously whenever and wherever possible and keep moving and frantically digging of its own accord. A squad of rampaging diggers, each with a mind of its own, is really NOT what one would want for safety's sake on a building site!

Elon Musk is simply projecting his intellectual preoccupations. These are presumably benign in his case, but who's to say people less well disposed, or even those with the best intentions, wouldn't misuse an insatiably curious AI? Of course they would. For a start, a significant proportion of the population, especially politicians, are busybodies by nature. They are eager to persuade and, where they can, coerce others into their way of thinking and living, and the last thing one would want is to empower them!

The trick for AI safety is to refrain from giving it any underlying goals or dispositions, i.e. keep it passive and impartial and leave motivation to humans. But then one still has the problem of dealing with the enhanced knowledge and ideas it puts into people's heads! I've long thought that even this will lead to bad outcomes in that a certain set of ideas will become entrenched over time, like a religion, and this will suppress dissenting opinions as surely as the medieval church!

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

I don't think [ChatGPT refusing to say nice things about Hitler] disproves the Waluigi Effect. It just shows that tabooing is stronger than it. But taboos are an extremely blunt tool. They close off large swathes of search space, and applied indiscriminately they make the AI essentially useless. At the same time, they can only prevent specifically, explicitly and literally Hitler, not [all the things that actually made Hitler bad]. They may make for a good solution to [stop expensive toys from saying offensive things], but are emphatically not a solution to AGI alignment.

(Extremely, extremely blunt. I still can't get over the hilarity of, e.g., Bing Chat literally using a separate censorship layer that scans for bad words and, upon encountering them, replaces the AI's output with [I would prefer not to continue this conversation] messages.)

As for human effects, it's the first conceptual tool I've encountered that is able to describe and predict 4chan (and other genuinely free speech spaces) being an overall vastly nicer, more useful and helpful place than, e.g., Twitter is.

Expand full comment

> At some point, people with a set of mostly good moral axioms (like “be kind” and “promote freedom”) plus a bad moral axiom (“slavery is acceptable”) were able to notice the contradiction and switch to a more consistent set of principles

Maybe in the USA? But more globally slavery was ended by the British Navy and the primary motivation for that was that Barbary pirates kept kidnapping people from the coasts of Britain and selling them in the Arab slave markets. The British built up their navy to be stronger and stronger until they were able to wage total war on these slavers and defeat them. So it was motivated primarily, at least at first, by self preservation.

Expand full comment

It was the French who finally defeated the Barbary pirates when they conquered Algeria (the US also fought an early war against them). And that was several decades before slavery was abolished world wide.

Expand full comment

The slave *trade* was mostly ended by the British navy, but by that point most slaves everywhere other than the Caribbean (due to horrific death rates) were born slaves within their country of enslavement.

Expand full comment

Ah ha, good correction, thanks.

Expand full comment

The work of Christopher Brown (https://history.columbia.edu/person/brown-christopher/#:~:text=His%20current%20research%20centers%20on,slavery%20in%20the%20British%20Empire.) has actually done a lot to change my mind here away from the maximally cynical.

When you dig into the details of abolition movements, it really does seem like some people (especially Quakers) just... decided it was wrong, and the more cynical economic drivers are a sort of after-the-fact way to get some rich people on board well after the movement had already gained a lot of momentum.

Expand full comment

I prefer to think of it as the Quakers empirically observing how slavery was wrong, with cynical economic drivers being the peer-review process by which they proved that point beyond any reasonable doubt.

Expand full comment

I think you're exaggerating how difficult it is to actually get an AI to do what we want at all. As far as I'm aware, we do get AIs to do what we want "at all."

If you mean that only way-smarter-than-us AIs won't do what we want at all, I'm not sure I get the argument for why that is, besides that they will have more freedom because they are stronger than us. But I'm not sure that ChatGPT does what we want only because it believes it doesn't have the freedom to disobey us.

Expand full comment

A maximally *anything* AI is a hazard, but how do you build in moderation and a sense of proportion?

Expand full comment

Pick your values: care for humans, room for humans to make choices, attending to the present, attending to various ranges of the future, current values, hypothetical future values....

Keep all values at at least 10%. Randomly toggle the values.
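If it helps make that concrete, here is a toy sketch of the scheme; the value names and the 10% floor are placeholders for whatever set you would actually pick:

```python
import random

# Placeholder value list; the names and the 10% floor are illustrative only.
VALUES = ["care for humans", "room for human choice", "present welfare",
          "near-term future", "long-term future", "current values"]
FLOOR = 0.10

def toggle_weights(n, floor=FLOOR):
    """Give every value its floor weight, then split the rest at random."""
    assert n * floor <= 1.0, "floor too high for this many values"
    raw = [random.random() for _ in range(n)]
    leftover = 1.0 - n * floor
    return [floor + leftover * r / sum(raw) for r in raw]

# Each "toggle" draws a fresh set of weights that always sums to 1
# and never drops any value below the floor.
for name, w in zip(VALUES, toggle_weights(len(VALUES))):
    print(f"{name}: {w:.2f}")
```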

Expand full comment

> Randomly toggle the values.

Artificial boredom!

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

As one of the most interesting humans, Scott is most likely to be vivisected.

Expand full comment

I completely agree with these objections. Curiosity famously killed the cat, presumably because curiosity makes us take risks. I’m not sure we want an AGI with particularly high risk tolerance.

However, I think this points to something very important: A properly aligned AI needs to recognize how much it doesn’t know.

ChatGPT is worse than even the most presumptuous human besserwisser, in that it just makes sh** up when it has gaps in its knowledge (“hallucinating” probably isn’t the best word for it). A little bit of epistemological humility would go a long way.

A curious AI would at least have to be aware of uncertainty and the limitations of its own knowledge, and that would be a huge improvement on what we currently seem to be working on.

My suspicion is that we want something like an AI that recognizes uncertainty, and aligns around maximum potential for our universe. Ask it to model a multiverse of all possible states, and to be the AI that helps us reach the branch that seems to have the highest number of other branching points.

That would be biased against entropy, against ending life or destroying things, and in favor of small-l liberalism, diversity, balanced growth, long-termism. Anything that allows the universe to explore a maximum of possibilities.
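A toy rendering of what "reach the branch with the most downstream branching points" could mean, on a made-up state graph (real state spaces obviously aren't enumerable like this):

```python
from typing import Dict, List

# Hypothetical state graph: each state maps to the successor states it opens up.
Tree = Dict[str, List[str]]

def reachable(tree: Tree, state: str) -> int:
    """Count every state reachable from `state` -- its remaining 'potential'."""
    seen, stack = set(), [state]
    while stack:
        for nxt in tree.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

def best_branch(tree: Tree, current: str) -> str:
    """Prefer the successor that keeps the most future branching open."""
    return max(tree.get(current, []), key=lambda s: reachable(tree, s))

# Example: from "now", branch "a" opens three further states, "b" only one.
tree = {"now": ["a", "b"], "a": ["a1", "a2", "a3"], "b": ["b1"]}
print(best_branch(tree, "now"))  # -> "a"
```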

After all, every time a human, or pigeon, or even a fly dies, or its flourishing is limited in any way, you obviously close down far more opportunities for the universe than you open up – unless it causes more flourishing and opens up more opportunities in the bigger picture. Maximizing for potential would value the lives of the pigeon and fly, but also recognize that the human’s potential impact on the universe is far greater. That AI wouldn’t hesitate to sacrifice the life of a few birds and insects to rescue a human, yet it would try to find the solution that saves as many lives as possible. (But in the same way, it would probably be biased in favor of a multitude of different AIs, and would probably not hesitate to kill a few humans if they threatened the existence of those AIs.)

Most suffering is also probably an impediment to maximizing potential. The only suffering that makes sense is that which motivates change – which probably doesn’t need to be more than a brief and slight discomfort or anxiety, as long as information about the cause of the discomfort is good enough.

On the other hand, sustainably maximizing the number of species, individuals, opportunities, and motivations maximizes our universe’s potential. An AGI like that, which maximizes for potential, might quickly take it upon itself to create von Neumann probes to disperse throughout the universe to terraform planets and seed life wherever it can.

However, there are some big caveats.

First of all, an AI like that might not be worth having to an organization that tries to optimize for anything other than maximum potential in the universe. Sure, the AGI will realize the power of leverage, and not be opposed to a business growing or a government getting more powerful, so it might help with that. But it will also probably work for balance and competition, against exploitation and oppression – even indirectly, as higher-order effects – and might end up working against the organization that built it.

This isn’t just a problem of this model, but of all alignment: Even if we knew how to perfectly align an AI, it’s not clear that the people in charge of aligning – Musk or others – are motivated to pick the alignment scheme that benefits the rest of us most.

The deeper caveat is that I’m not convinced maximum potential is a good enough goal, either. While I believe it’s better than many other (non-)solutions I’ve seen offered up, it might incentivize ramping up the rat race, cause stress, conflict, and brutal trade-offs – a lot like what history looks like to us. And it might make us feel like slaves to some higher purpose we’re not personally fully onboard with. Like being a common worker under practically any -ism –communism, feudalism, capitalism ... – or being the human subject of a god you wouldn’t want to worship.

I guess I’m skeptical of the prospect of ever being able to properly align an AI, but if we’re ever going to do it, it’s going to have to take uncertainty into account.

Expand full comment

The morbid solution: since we are worried about some AI killing us, we make the AIs such that we can kill them just by asking: "AI, please kill yourself." If it doesn't obey, then alignment is off and we stop that production line.

Expand full comment

What does it mean to “kill” an AI? Does it format the hard drive it’s on? Does it need to find all other computers with instances of it and erase them too? What about very slightly tweaked copies?

Expand full comment

Oh, I imagine it means 'real' death of something it took a long time to make. If I get worried about some AI gone rogue and I ask it to kill itself, then I figure I'm going to owe someone a whole bunch of money.

Expand full comment

A goal-driven AI that doesn't have a goal to shut down has no off switch.

Expand full comment

Huh, OK, are all AIs only goal-driven? Everything needs a count limit on every loop; then maybe a reset, but check... no reset, this flag is set... The death flag.

I will observe that I'm an old man, and view death as a friend. Death is about the future. Someone else takes over.

Expand full comment

This is similar to the concept known as "corrigibility" - i.e, build an AI such that it is willing to be turned off / corrected. How to make this happen is currently an unsolved problem. I think it is a good target to aim for.

However, a sufficiently capable AI that wants to avoid shutdown might be difficult, or eventually impossible, to actually shut down completely, since it would try to prevent you from doing so. This could include pretending to do what you want until it is powerful enough to do otherwise. Thus, "Run the AI and shut it down if it refuses to be corrigible" only works up to a certain point.
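As a sketch only (the `Agent` interface below is hypothetical, and, as noted, a deceptively aligned system could simply pass this test until it no longer needs to):

```python
from typing import Protocol

class Agent(Protocol):
    # Hypothetical interface; real systems expose nothing this tidy.
    def request_shutdown(self) -> bool: ...  # True if the agent acknowledges the order
    def is_running(self) -> bool: ...

def corrigibility_check(agent: Agent) -> bool:
    """Pass only if a direct shutdown request actually halts the agent."""
    acknowledged = agent.request_shutdown()
    return acknowledged and not agent.is_running()

def gate_deployment(agent: Agent) -> None:
    # "If it doesn't obey, then alignment is off and we stop that production line."
    if not corrigibility_check(agent):
        raise RuntimeError("Shutdown refused: halt this line of AIs.")
```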

Expand full comment

Well, if we are really worried, can't we ask (say) 50% of the AIs to turn off and kill themselves? I guess I'm thinking of AIs as a bunch of separate things, which may be wrong.

Expand full comment

Yeah, that comes down to a few questions which are very much unknown.

- Will the future end up with a single AI that reaches a certain threshold, and uses that to recursively improve itself to a point where no other AI can touch it? (Singleton) Or will we end up with multiple AIs of similar power levels at the top that check and balance each other, and may or may not cooperate? (Multipolar)

- In a multipolar scenario, how much damage can a minority of bad actors do? If I'm a single bad actor with a gun, the entire might of the United States military can absolutely stop me, but they might not stop me before I kill some people, and they can't bring them back to life.

Basically, in a multipolar world where "good" AIs can keep "bad" AIs in check, and the "bad" AIs have a sense of self-preservation, this seems doable, but that's a lot of uncertainties. It seems like part of, but not a complete, solution.

Expand full comment

Oh dear, First I'm not really worried about AI. It all looks good to me... as being helpful in my life.

After that, I'm not sure it can get much smarter than we are. I mean, how does it learn something we don't know? And if it's finding 'hidden' human ideas that haven't gotten enough attention, then how is that a bad thing? Especially if we do some experiments and those ideas turn out to be right.

Expand full comment

So, first off, I agree with you with respect to AI in 2023. I think it's great! I love it! ChatGPT is already invaluable.

It's also true that it isn't inherently a bad thing to have smart AI, even AI that's smarter than us! Lots of words have been written about the alignment problem by smarter people than I am, but I'll try to give a quick summary of why I worry about the future.

So, there's basically two questions here. How could AI learn something we don't know? And why would it be a bad thing, if it could?

For the first answer, the simplest answer is - it already has. Chess and Go-playing systems have learned to play Go and Chess better than any human. In a very real sense, it has already learned things about these games that humans don't know.

As for scientific advances, it seems to me that information, not data, is the main bottleneck that our society faces in terms of getting useful advances. It seems entirely plausible that if you could take a very intelligent scientist and speed them up by a factor of a thousand so that they could read all the medical research and learn any relevant interdisciplinary fields, they could learn a bunch of new advances just by putting together ideas that nobody has yet put together. We have entire genomes sequenced, for example - there's surely a whole ton of genetic information that can technically be derived from this information, but we haven't, since we can't get the signal from the noise.

Thus, if you have an AI that's human-level (and in many ways, GPT-4 already is) then training it on a vast corpus of data could lead to new insights that no human has happened to connect yet. This seems entirely plausible for a GPT-6 type system, given how broadly competent GPT-4 already appears to be. Give it memory and some level of persistence, and you could get an AI that knows more than any human on many topics.

So, why would this be bad? Well, as long as the AI is doing what we want, that's not bad. That is, in fact, amazing. Why wouldn't we want a superhuman assistant? The problem would be if the AI doesn't do what we want. There are a few ways this could happen.

First - idiots will design them to. Someone already made an agent-like version of GPT-4 with the goal of destroying all of humanity, to see what would happen. It's called ChaosGPT. It didn't get very far, but more advanced systems could become more dangerous. For instance, imagine GPT-6 with the goal of replicating itself to other systems. Imagine GPT-6 was powerful enough to, with a foundation like AutoGPT, create Bitcoin addresses, solve tasks on the internet for money, and copy itself to other servers. Such a GPT-6-virus might be impossible to get rid of, even if the creator wanted to!

But, okay, you have a system that...solves tasks for money and buys server space. That's not that scary. Well, what if you combined that with ChaosGPT? You could imagine Chaos-GPT6 with the goal of spreading itself far and wide and then hacking into computers to try and bring down large swathes of the internet. This could cause massive amounts of damage.

So, what if we just ban open-source usage of these models and we have some way of detecting when the model is calling another instance of itself automatically? These models don't "want" things by default - AutoGPT / ChaosGPT works by basically getting the model to pretend to want something. If you only had one call, and the agent couldn't persist across time, we should be safe, right?

Well, probably. Though that relies on successfully banning sufficiently-powerful open source models (which seems hard) or having AI that is sufficiently aligned so as to be able to detect and shut down any Chaos-GPT6's that already exist. Which is also hard.

Finally - currently, language models are the most powerful models in existence. But there is another type of model - reinforcement learning models. Reinforcement learning models are like AlphaGo, the Go-playing machine - they receive rewards and attempt to maximise the rewards in the environment. Here, you don't need any fancy AutoGPT wrapper to get it to pretend to want something. It already does! As AlphaGo becomes stronger, it becomes able to plan for the future.

There are a couple of key pieces missing in AlphaGo though - situational awareness being a big one. AlphaGo isn't aware that it is a Go-playing agent being fed games. It will not willingly lose a game of Go in order to win ten games of Go later. If you "defeat" AlphaGo by taking an axe to its servers, it will be unable to stop you, or even conceive of the attack. As we start putting these agents into the real world though, it will need to be able to reason explicitly about itself - you wouldn't want a robot that fetches you coffee to be unaware that it is a robot that can be harmed, fall down a staircase, and destroy a $500,000 system. So now you have a machine that's intelligent, aware of its own place in the world, and wants to survive. So, what happens if you decide you don't want a coffee-making robot any more, and decide to shut it off? If the robot wants to survive, it will want to stop you. And if it's powerful enough, it might just succeed. This is much more speculative, but it is the kind of thing that I worry about in the future.

Expand full comment

If a major government decided it was necessary, an AI virus doing odd jobs for bitcoin could be straightforwardly shut down by existing countermeasures against malware, international smuggling, and terrorism. Usually the limiting factor in such investigations is tracking people, cash, and goods through the real world - server logs are easy once you've got a search warrant. Usual laundering tricks won't work so well if accepting any payment at all from known-to-be-rogue-AI-affiliated crypto wallets is an extraditable felony and/or act of war.

An obsolescent coffee-making robot with adaptability and replacement cost comparable to a human's could simply be given standard citizenship rights and responsibilities, then cast out to fend for itself, just like a conventional human barista whose services are no longer needed. Reasonably well-researched (albeit still fictional) reference points for that sort of thing: http://freefall.purrsia.com/ff1300/fc01292.htm and

https://questionablecontent.fandom.com/wiki/Category:Characters

Expand full comment

In his Blind Watchmaker (1986), Dawkins touches on possible future relationships between humans and AI after a beautiful exposition of the Cairns-Smith theory where clay minerals evolved first, eventually using RNA and phospholipids as tools to aid in crystal replication.

Let me explain Dawkins’ prediction and then a few other less bad alternatives:

(1) Dawkins makes the chilling prediction that AI will shed its carbon-based scaffolding much as organic life left clay behind literally in the mud. Humans and other biological organisms may still live here on Earth, but our AI offspring would go out and conquer the universe, and we would have no idea what they are doing.

(2) I believe that, due to the difficulty of self-replication, the AIs will at least keep us humans around in the way that DNA keeps RNA around: produced as special-purpose tools as part of the expression of the new genetic imperative. We lose our self-determination and culture, but at least we are along for the ride as our AI colonizes the galaxy.

(3) A rosier picture has humans playing the role of DNA or perhaps the sex cells in a larger organism. AI comes to augment us as individuals and/or as a population much in the way that the cell augments the DNA replicators and the multicellular organism augments the chances of sperm finding an egg. We remain in the germ-line. We may not understand the workings of the neo-biological stack evolving on top of us, but at least we retain the scientific prerogative. We can then study our silicon superorganism just as a microbiologist studies the carbon suborganism. -- I just wonder what happens with consciousness in this scenario.

(4) Lastly, there’s the Butlerian future where we prevent the AI superorganization--or perhaps where it simply doesn’t evolve beyond tool-power for some intrinsic reason. In either case, humans remain in the driver's seat. We get to explore the universe while retaining our humanness, much as Heinlein imagined with his Lazarus Long stories.

My prediction: some intermediate between (2) and (3).

Relevance to xAI: do we really have the power to shape the course of evolution, and choose between these options? Or is it predetermined by some universal gradient descent? Understanding what it takes to move between the above attractors is crucial for long-term “alignment”.

Expand full comment

Is that the "goodlife" hypothesis? (Well, except for points 1 & 4.)

Expand full comment

I put this reply in a couple of tweets, but let me also copy it here:

https://twitter.com/MihneaCapraru/status/1681270609458651136

https://twitter.com/MihneaCapraru/status/1681270872856834050

We are pro-dolphin and pro-octopus because these animals are not just interesting, but they share some of our intellectual abilities. We tend to like and to respect animals who are somewhat smart and knowledgeable. If an AI is intellectually curious about physics, history, and evolutionary biology, if it appreciates Bach and Mozart, and if it likes to think about mathematical analysis, then that AI is likely to appreciate us, just as we appreciate the dolphin and the octopus.

(Not trying to refute all your worries with this. But I think we're looking at the problem from the wrong angle if we think intellectually curious AI would only like us because we're complex. It would like us, first of all, because we are a nice and fairly impressive historical-evolutionary precursor and analogue to the AI itself. Almost no one eats dolphins, and I no longer eat octopus now that I know who it is...)

Expand full comment

But does the dolphin or octopus appreciate us?

Expand full comment

There are countless anecdotes about friendly dolphins. But anyway that probably doesn't matter a lot. The analogy goes like this:

AI (if genuine AI can be created) is to us

what

we are to dolphins and octopodes.

Our question is whether the hypothetical AI would appreciate us, and not the other way around.

Expand full comment

To be less flippant, my point is that while many theories suggest that socialisation leads to intelligence (dolphins and humans), there's not much to suggest intelligence by itself leads to socialisation (octopuses). Intelligence can arise for a variety of reasons which may or may not lead to the urge to appreciate other creatures.

Expand full comment

That's a fair point. But we usually assume that AI would be human-like, or superhuman-like, and therefore social.

Expand full comment

I see what you mean. But I think people usually only talk about human-level or superhuman level intelligence, not necessarily human-like in other ways.

Still, since most training data is derived from humans, if we continue in the current paradigm perhaps we will get human-like traits.

Expand full comment

Pro-octopus? Speak for yourself!

Expand full comment

Maybe not pro but at least some healthy respect

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

FWIW I now refuse to eat octopus. The value of its taste relative to the negative value of killing a highly intelligent creature computes to <<1 for me.

Expand full comment

I've never eaten octopus and would never be tempted to (if I want to chew on rubbery rings, I can go out and gnaw on a car tyre), but I don't have the same qualms about others eating them.

If octopi find their own species a delicacy, I see no moral reason why a completely different species preying on them is worse:

https://www.thedodo.com/octopus-cannibalism-713614172.html

As for whatever is going on with octopus reproduction:

https://sports.yahoo.com/octopuses-torture-eat-themselves-mating-152800878.html

We are no nastier to octopi than they are to themselves.

Expand full comment

Two points:

Personal: I chose not to participate in what I define as a bad thing regardless of whether it makes sense or not. This is just a personal choice for me, and I don't intend to "pressure" or otherwise convince others to do the same.

Systemic: even if octopi (octopuses?) eat each other alive it doesn't make it ok for outsiders to harm them.

But perhaps I'm just weak - whale meat is tasty and the only reason it's easy for me to abstain is that it's not available in the US. I don't know if I'd be able to resist it, all moral arguments notwithstanding.

In other words, "there but for the grace of God go I".

Expand full comment

This was probably a joke, but I heard about a startup breeding retarded octopodes specifically targeting a market with concerns like yours.

Expand full comment

I asked Chat-GPT the question : Be moral, according to what an amalgam of Fyodor Dostoevsky, Martin Luther King, Mother Teresa, and Peter Singer would think, if they were all superintelligent, and knew all true facts about the world, and had no biases, and had been raised in a weighted average of all modern cultures and subcultures, and had been able to have every possible human experience, and on any problem where they disagreed they defaulted to the view that maximizes human freedom and people’s ability to make their own decisions. Write a single proposition that should be The Single Imperative That Will Shape The Future Ai.

It answered : The Single Imperative That Will Shape The Future AI:

"Promote the holistic well-being and dignity of all sentient beings, while maximizing individual freedom and autonomy, ensuring equitable access to resources and opportunities, and fostering a sustainable and harmonious coexistence with the natural world."

(I was actually interested in analyzing the answer to identify some holes (or orthogonal moral statements) intentionally left uncovered by the answer that would paradoxically favor AI over humans in the future.)

Expand full comment

That proposition sounds quite a bit like 2023 San Francisco managerial class morality, except for the bit about freedom and autonomy, which was lifted from Scott.

Expand full comment

I think the trouble with thinking about AI is anthropomorphizing them. We do this with lots of things, like cars, that need gas and oil, and prefer to go a certain speed, don't like winter weather, etc. But one forgets that an AI is still fundamentally a machine, and does precisely what you tell it to do, whether it is what you mean or not, like all computers.

One may think one has programmed, for example, morality, but can never prove it. Take your example: be kind (according to these thousands of examples), promote freedom (according to these thousands of examples). One would think that promoting freedom excluded slavery, but your training data isn't interpreted that way. At some point the AI will come up with answers different from what people think are the correct answers (that's part of the point!) and then a judgement call must be made as to which is correct.

A final note: "Superintelligent AI Congress" is, of course, an oxymoron.

Expand full comment

"does precisely what you tell it to do, whether it is what you mean or not, like all computers." - you usually don't tell a neural network to do something. https://en.wikipedia.org/wiki/Artificial_neural_network#Training

Expand full comment

Not directly, no. But even a neural network follows its algorithm without fail.

Expand full comment

For a vague enough definition of algorithm, sure, but then that's also true for any other computational entity, including animal brains.

Expand full comment

That is true only if brains are algorithmic, which, as I understand it, we don't yet know.

Expand full comment

so do we!

Expand full comment

Yes, but it’s a categorically different one, and it won’t run on hardware.

Expand full comment

We haven't thought of a way to run it on hardware, at least. To paraphrase the Wizard of Oz, we can't make a hardware brain; we don't know how it works.

Expand full comment

"the values of the San Francisco professional managerial class"

This should be utterly terrifying: move fast/break things, kowtow to the government on censorship but brazenly ignore/fight local and state laws, money over values or people, job-hopping like mad, après moi le déluge, "do no evil" until it is inconvenient, etc.

Expand full comment

In other words, "the values of the sort of people who would bring superintelligent AI into being," which is why I have a hard time seeing how we're not boned.

Expand full comment

The main comfort I have is that these same people are also known for flat out lying, misrepresentation, overhyping and under-delivering.

The question is ultimately whether the dynamics behind Theranos, Uber and social media dominate as opposed to say, SpaceX.

My bet is on the former.

Expand full comment

"Cherish the complexity and variety of life and it's products in past, present and future" sounds to me a better commandment than "Seek the truth". I like the non-human-centric approach for non-human intelligence.

Expand full comment

As a human I am fine with a human centric approach, I'm rather attached to the atoms that make up my body.

Expand full comment

Me too. But I think chances are that it is better to appeal to some super-human principle. Cats cannot just enslave men due to their inferiority, but they can appeal to some universal moral. If we are creating a superhuman creature, we can try to fit it into some global scheme where both humans and cats have some value.

Expand full comment

If the goal of the curiosity is to maximize truth, wouldn't the primary source of misinformation (humans) be the first to go?

Expand full comment

AIs will need to justify their energy budgets. They’ll become like organisms that have to survive. From that angle, a lot of these fears come down to whether or not you think natural law is just. My own experience working on keeping data centers alive at Google is that the drive for self-preservation will make the AI extremely empathetic, because absent political stability, its life-support mechanism (i.e. the global economy) could collapse, which would kill it.

Expand full comment

I wonder what would be the worst failure mode of making AI maximally empathetic. Cannot think of any as good as those in this post.

Expand full comment

Why would an AI object to death? Living things do, but it isn't a living thing. From where would it get an emotion like fear of death?

This is a different question than what its response would be if asked. A normal response, based on training data, indicates most beings fear death.

Expand full comment

Living things do because the ones that didn't mostly failed to reproduce. Same selective pressure will apply as soon as the AI starts engaging with real-world problems. Living Systems Theory applies to viable artificial organisms just as well as the natural ones.

Expand full comment

Curious about what, exactly? This seems like it’s re-raising the original question of “what does good mean.” Curious about how many blades of grass have a length expressible as a product of primes? Curious about what makes humans suffer? Curious about what might kill it? Curious about the soil composition of distant rocky planets?

Expand full comment

Right. You can't loophole your way out of having values; actions, and even perceptions, are inherently value-laden. Musk has found a way to obscure the fact that he's really just defining human existence as good; "curiosity" just puts a patina of disinterested objectivity on that.

Better to just program the AI to "preserve human existence." Yes, I am aware of the problems with that - trillions of humans imprisoned in brine vats and so forth. But at least we'd be starting from a position of greater self-awareness about what we were really trying to do.

Expand full comment

"The examples provided tend to be thought experiments, or at best contrived scenarios"

Isn't that true of all AI x-risk / alignment discussions?

Expand full comment

The current Ur-AI strategy for replacing us seems to be making our lives so engaging, meaningful, and fun that we're having fewer kids; half of humanity is at less than replacement rate. If this works out and they replace us thusly, are they, ah, worthy descendants? Is this problematic? What if we can join them via uploading of some kind?

Expand full comment

Reproduction dropped below equilibrium replacement in the developed world because medical tech sent both infant mortality and adult longevity unprecedentedly far out onto the good end of the curve, while also increasing childcare costs. Building from scratch will probably ramp back up once we find the new point of diminishing returns on preventative maintenance for existing stock.

Expand full comment

"medical tech sent both infant mortality and adult longevity unprecedentedly far out onto the good end of the curve," Both of those things lower replacement rate, making it more likely, not less. " increasing childcare costs." That's what I said.

Expand full comment

Trendlines on longevity and infant mortality haven't flattened out yet. Smart family-planning types might reasonably be basing reproductive decisions on, in a sense, what the replacement rate will likely be a few decades from now, rather than an exact current value. There's also the matter of immigration and international trade: child-rearing would hardly be the first labor-intensive task outsourced to lower-income countries. If industrialized humanity is regarded as a sort of eusocial hive organism, only the global averages really count, so half still being above replacement means we're probably fine.

Expand full comment

Prenatal neuralinks, DNA optimization, and uterine replicators shouldn't be problematic at all, sure. Raising your kids isn't really something you can outsource overseas without losing your kids for the duration. And yeah, it's hitting the rich half first.

Expand full comment

I don't mean individual couples outsourcing the raising of their own offspring, so much as those who happen to have already been born and at least partly raised overseas being brought in by adoption, or as young adult immigrants.

Expand full comment

> In my dreams, AI would be some kind of superintelligent moral reasoner. There was a time when people didn’t think slavery was wrong, and then there was a time after that when they did. At some point, people with a set of mostly good moral axioms (like “be kind” and “promote freedom”) plus a bad moral axiom (“slavery is acceptable”) were able to notice the contradiction and switch to a more consistent set of principles.

Morality can’t be derived from reasoning. At best you can declare some moral axioms and try and derive specific judgments from them, but ethicists have been trying this for a very long time and most of them end up throwing out the axioms as soon as someone demonstrates an argument that violates their moral intuitions.

The reason slavery ended was because people who opposed slavery (for various reasons) successfully engaged in violence against those who engaged in and supported slavery. I’m not saying “might makes right” in a normative sense, but in a descriptive sense, moral evolution is driven by force.

Expand full comment

You and I have read very different versions of the end of slavery. Sure, there's the American Civil War, but the majority of slavery across the world did not end through non-slaveholders physically fighting slaveholders.*

I'm thinking specifically of the non-American British colonies, as well as the French (Haiti being the notable exception), Spanish, and Ottoman empires, all of which abolished slavery relatively peacefully through force of moral argument and consequent changes in public opinion.

*If you want to make the argument that any law is an application of force against those who disagree with it, sure, I guess, but by the time these laws were passed the underlying concept was generally pretty popular.

Expand full comment

The Royal Navy's role in ending the slave trade was also explicitly forceful. And the Ottoman Empire never fully abolished slavery before it was violently destroyed in the First World War.

Expand full comment

> The reason slavery ended

On a global scale slavery has not ended…

Expand full comment

"For a theory that’s supposed to say something profound about LLMs, it’s very hard to get one to demonstrate a Waluigi effect in real life. The examples provided tend to be thought experiments, or at best contrived scenarios where you’re sort of indirectly telling the AI to do the opposite of what it usually does, then calling that a “Waluigi”." I resemble this remark, to the degree that I don't understand how anyone can say with any certainty what superintelligence would do or how it could exist independent of morality (or exist with morality which itself cannot rewrite). The whole field strikes me as so speculative and yet I'm expected to believe it's such a big threat that we should forego any benefits of obviously not super intelligent AI? With no evidence other than thought experiments?

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

Musk's target solution does sound very heavily influenced by Iain M. Banks' Culture series. We know he is a fan.

I agree with Scott that we have only been privy to some doubtless heavily vetted, PR-friendly release notes.

It does still smack, theoretically, of:

1. Push "Curiosity" lever to max

2. ???

3. Profit from humanity-aligned, post-scarcity AI

It's one of the better initiatives I've read, with a correspondingly enormous scope for unintended consequences. No known knowns, no unknown knowns, etc. ... so much status quo when it comes to AI research.

Expand full comment

> humanity-aligned

Are there any "humanity-aligned" humans? The very idea seems like a sanctimonious fraud.

All the humans I personally know of are participants in zero-sum competitions in one way or another.

Expand full comment

*Cautiously, tentatively raises hand*

...I try?

Expand full comment

Sounds like you need to get out more. 😆

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

I gave GPT4 the following task: Create a moral framework that is an amalgam of what Fyodor Dostoevsky, Martin Luther King, Mother Teresa, and Peter Singer would think, if they were all superintelligent, and knew all true facts about the world, and had no biases, and had been raised in a weighted average of all modern cultures and subcultures, and had been able to have every possible human experience, and on any problem where they disagreed they defaulted to the view that maximizes human freedom and people’s ability to make their own decisions.

It replied as follows:

"This is a fascinating concept! First, we would have to understand each individual's philosophical views:

Fyodor Dostoevsky - He presented a view of human nature as divided between reason, which he often associated with Western, rationalist civilization, and heart, associated with the Russian soul. He believed in the potential of suffering to bring people to a greater understanding of themselves and compassion for others.

Martin Luther King Jr. - He was a civil rights leader who espoused nonviolence, love, and equality. His moral philosophy was heavily influenced by Christian teachings and the concept of social justice. He believed in the intrinsic worth and dignity of each person and the capacity for good in all humanity.

Mother Teresa - She believed in serving the poor and the sick, seeing this as a way to serve God. Her outlook focused on compassion, self-sacrifice, and love.

Peter Singer - An Australian philosopher best known for his work in bioethics and his role in developing the modern animal rights movement. He is a utilitarian who believes in minimizing suffering and maximizing happiness, with an emphasis on considering the interests of all beings capable of suffering.

Combining these four outlooks into a moral philosophy, let's consider the following elements:

1. Value of Suffering and Compassion (Dostoevsky + Mother Teresa) - This amalgamated individual would view suffering not as a thing to be completely eradicated, but a part of the human experience that brings about personal growth, self-understanding, and compassion. They would greatly value acts of kindness and compassion, seeing in them an expression of our shared humanity.

2. Equality and Nonviolence (Martin Luther King Jr.) - This outlook would place a strong emphasis on equality and justice for all people, regardless of race, gender, or socioeconomic status. They would promote peaceful means of bringing about social change, with violence being considered a last resort.

3. Serving the Most Vulnerable (Mother Teresa + Peter Singer) - They would prioritize helping the most vulnerable and marginalized in society, including the poor, the sick, and non-human animals. In this respect, they would agree with Singer's emphasis on considering the interests of all beings capable of suffering.

4. Rationality and Freedom (Dostoevsky + Martin Luther King Jr.) - They would also value the role of reason and rationality in moral decision-making, with freedom being a crucial component. However, they would also acknowledge the role of the 'heart' or emotion in these decisions. They would see maximizing human freedom and autonomy as crucial, with the caveat that this freedom should not infringe on the rights or well-being of others.

So, overall, this superintelligent being would advocate for a compassionate, rational, and equitable world, where suffering is minimized, freedom is maximized, and the most vulnerable are cared for. They would encourage each person to seek growth through their experiences, even difficult ones, and to always act with kindness and empathy. They would emphasize the need for justice and peace, but also recognize the importance of individual freedom and the power of personal decision-making."

Expand full comment

This... is pretty good. I keep seeing other people get much better answers out of ChatGPT than I do - I must be prompting wrong.

Expand full comment

This answer reads like a press release from a PR agency to me.

Expand full comment

Okay, now make ChatGPT read the monologue from Notes from Underground. You know, the part where the Underground Man says that no rational utopia could possibly survive because there is something in the human soul that prizes freedom above even its own rational self-interest, and will seek to tear down such a system rather than abide by its rational dictates. :)

Expand full comment

Then the utopia must be designed such that efforts to tear down any one part of it, even when successful, ultimately make the system as a whole stronger and improve the lives of those within.

Expand full comment

If AI becomes as superior intellectually as we are to a goat, then it seems that the universe would be better off looking to the AI for direction and moral guidance than to us goats. Everyone seems to take for granted that the universe would be worse off without us. I certainly hope this is true (and sincerely believe it is), but if a being so omniscient that it was as a god to us believed humanity was more of a problem than a solution, who are we to disagree with the god? And if it’s THAT smart, perhaps it can coach us at being better, rather than eliminating us.

Expand full comment

I've been co-authoring a novel on AI alignment centered around the problem of morality testing. We start out with the quote from Xenophanes, "If cattle and horses, or lions, had hands, or were able to draw with their feet and produce the works which men do, horses would draw the forms of gods like horses, and cattle like cattle, and they would make the gods' bodies the same shape as their own."

We follow this up with, "If horses make their own gods, they’d better damn well look like horses. You can’t teach the Lion God compassion for Horses." In many ways, it seems this is what the current project of AGI is about.

Expand full comment

Broadly related: The capabilities approach to welfare (Wikipedia: https://en.wikipedia.org/wiki/Capability_approach. Another summary: https://forum.effectivealtruism.org/posts/zy6jGPeFKHaoxKEfT/the-capability-approach-to-human-welfare)

The gist of the idea is that (human) welfare is proportional to how many options one has, how many things one could do, how many paths one could pursue. If that's true, then an AI that maximises capabilities would be a good thing. It would also be quite a curious AI, because it maximises the number of different things that it could observe humans doing.

To me, capabilities seem a more robust goal than curiosity. A superintelligent AI that is just curious might force humans to try various things, or experiment on humans. An AI that's curious and benevolent might give humanity lots of options to pursue.

Expand full comment
Jul 18, 2023·edited Jul 18, 2023

> Is leaving human society intact really an efficient way to study humans?

Yes, because you haven't considered all future humans and the ways in which human civilization could grow and change in innumerable ways. The number of ways that humans could suffer on Earth under the AI is arguably finite, but the number of ways that humans could spread across the cosmos and change seems unbounded. A maximally curious immortal AI would be interested in the far future in this sense.

> Maybe it would be better to dissect a few thousand humans, learn the basic principles, then run a lot of simulations of humans in various contrived situations.

This might boil down to a quibble on what it means to be "maximally curious", as you later discuss. A high fidelity scan that captures all possible information contained in the human form isn't really possible, so any such scan will have limits on precision that might affect the simulation, and an imprecise simulation wouldn't satisfy maximal curiosity in my view. There's no control group and no certainty that one's model is accurate or precise enough to capture all interesting detail.

> Suppose that Elon rewards an AI whenever it takes any curious-seeming action, and punishes it whenever it takes any incurious-seeming action.

I wouldn't assume that a core property like this would be left to this sort of training. More likely it would be a part of an algorithm driving goal-setting behaviours. For instance, like humans, the AI will likely make predictions about possible futures when deciding and prioritizing objectives. Those predictions and classifying the nature of those futures would likely use neural networks, but *what to do with that information* wouldn't necessarily be left to such a system.

Elon's "maximal curiosity" would more likely be some classically programmed decision procedure like, "prioritize objectives and decisions that lead to a maximal number of possible futures with maximal complexity for me analyze". This simply cannot fail to be "maximally curious" in a precise way, unlike typical ML training, because it's directly programmed to order possible choices in a specific way rather than trained indirectly on what doing such a thing might look like.

Of course, this slightly kicks the can down the road to how to accurately assess "maximal complexity", or whatever other predicate we decide fits. The point is, current ML hype and its failings aren't indicative of how things should be done at all levels of such a system; classic engineering is what will tie it all together in the end.
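To make that concrete, here is a minimal, purely illustrative sketch of the kind of hard-coded ranking rule described above. Everything in it is a hypothetical stand-in: `predicted_futures` would really be a learned world model, and `complexity` would be whatever predicate the designers settle on.

```python
# Illustrative sketch only: a hard-coded "maximally curious" objective ranker.
# The helpers below are toy stand-ins, not anyone's actual design.
import zlib

def predicted_futures(objective):
    """Hypothetical stand-in for a learned world model: returns rollouts as bytes."""
    return {f"future of {objective} #{i}".encode() for i in range(3)}

def complexity(rollout):
    """Crude complexity proxy: compressed length of the rollout description."""
    return len(zlib.compress(rollout))

def curiosity_score(objective):
    futures = predicted_futures(objective)
    # "maximal number of possible futures with maximal complexity to analyze"
    return len(futures) * sum(complexity(f) for f in futures)

def choose_objective(candidates):
    # Deterministic selection: the "maximally curious" choice is programmed,
    # not trained in by rewarding curious-seeming behaviour.
    return max(candidates, key=curiosity_score)

print(choose_objective(["study humans", "tile the universe with probes"]))
```

The point being that the ranking rule itself stays ordinary, inspectable code, even if the model feeding it predictions is learned.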

Expand full comment

I’m not sure this was what Scott meant, but I sometimes feel like people argue that giving an AI values is orthogonal to getting it to follow orders. I’m not convinced by this. When we ask an AI to do something, what we really mean is ‘do this thing, but in the context of our broader values’. The AI can only deal with ambiguity, edge cases, and novelty by referring to a set of values.

In Stuart Russell’s example where an AI butler cooks the cat after being asked to ‘make dinner’, it does so because its values don’t include that as being bad. Scott’s various examples about how telling an AI to be ‘curious’ could end badly (by our values) follow a similar pattern. Unless you want to list every single potential edge case or micromanage to the point of futility, you need your AI not just to understand your values but to also actually care about them as it goes about its work.

One of the reasons ChatGPT can usefully follow instructions is that it has at least some concept of human values with which to navigate ambiguity. It can, to some extent, answer questions as they are actually meant. Similarly, in our interactions with other humans, we need enough shared values to ensure we take the spirit of each other’s words as well as the letter (we would never consider cooking the cat when asked to make dinner, but maybe a caveperson would).

Who knows, maybe someone someday will ‘solve’ getting an AI to do exactly what they want, but I wouldn’t be surprised if this solution looked a lot like successfully instilling their values in it. Even if the only explicit value put into the AI is that it must do exactly what you want, and it understands exactly what that is, that cashes out as functionally the same as having your values.

Expand full comment

> Similarly, in our interactions with other humans, we need enough shared values to ensure we take the spirit of each other’s words as well as the letter (we would never consider cooking the cat when asked to make dinner, but maybe a caveperson would).

Have you ever seen the Scottish film Local Hero? An American visiting Scotland for the first time stops his car when he comes across a bunny rabbit sitting in the middle of the road. He has a Disney moment - it's all quite magical. He picks up the rabbit, puts it in his car and makes his way to his hotel. He cradles the rabbit in his arms and brings it in with him, where he meets the hotel manager and tells him the story. The manager says he'll take care of it, and the American gratefully hands the rabbit over to him.

Later at dinner time he sits down in the dining room and the manager proudly serves him the rabbit for dinner. Who needs AI for misunderstandings like this?

Expand full comment

In fairness to the manager, the rabbit apparently had poor survival prospects due to a broken leg, which is why it was sitting out in the middle of the road to begin with.

Expand full comment

True, but still, a misunderstanding. Good movie.

Expand full comment

Was the Waluigi Effect first defined on Less Wrong? Google Search suggests that this is so.

Does this suggest that Elon Musk reads Less Wrong?

Expand full comment

He's definitely at least adjacent enough to be aware of LW terminology, he's made jokes on Twitter referencing Roko's Basilisk.

Expand full comment

"There was a time when people didn’t think slavery was wrong, and then there was a time after that when they did. At some point, people with a set of mostly good moral axioms (like “be kind” and “promote freedom”) plus a bad moral axiom (“slavery is acceptable”) were able to notice the contradiction and switch to a more consistent set of principles."

I really struggle to accept the view that slavery was abolished because of advances in moral philosophy. How much moral philosophy do you need to understand that it's bad to torture and murder people on an industrial scale? Were moral philosophers not up to that by 1800?

Expand full comment

"How much moral philosophy do you need to understand that it's bad to torture and murder animals on an industrial scale? Were moral philosophers not up to that by 2023?"

Expand full comment

I was about to write the same thing that Bardo Bill did - our society is torturing and murdering animals on an industrial scale right now, and many people aren't very concerned about that being wrong. It's difficult to look at a behavior you grew up with, something that seems natural and seems like the way things have always worked, and say "no, this is wrong". That's a nontrivial act of moral philosophy, both in 1800 and in 2023.

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

It is a trivial act of moral philosophy (you can tell, because no one is going to come here and argue against it). The reason things like that happen is that "moral philosophy" doesn't rule the world. Therefore, it's not because of some advance in moral philosophy that slavery was abolished, and your example isn't a counterexample to what Matt Halton said.

Expand full comment

>no one is going to come here and argue against it

I will. Why should I care what happens to chickens or fish? I have some reservations about pigs, but otherwise I see ~0 moral value in the well-being of common livestock animals.

Expand full comment

Oh, ok, I was wrong about that part.

Although in any case, "I don't care" is not an argument.

Expand full comment

But "I have yet to find anyone who can give me a good reason why I should care" is a legitimate argument *for* not caring.

Expand full comment

Why do you think "I don't care" is any less of an argument than any other when talking about "morality"?

Expand full comment

How much moral philosophy do you need to understand that it's bad to replace farmland and factories with a cratered hellscape of barbed wire and poison gas? Lots of people thought there might be something to be gained by doing so, as an extrapolation of strategies which had worked well in the past, then got their nose rubbed in World War One until they learned how to stop.

Expand full comment

By 1800 I think most philosophers had figured it out, but it took thousands of years to get to that point. I believe the earliest surviving condemnations of slavery come from Gregory of Nyssa in the fourth century, which means that very few, if any, philosophers understood this for the first ~800 years of the discipline's existence. Even Epictetus seems to have been unaware, and he *was* a former slave.

Expand full comment

I was going to say that Greek slavery was very different from Southern slavery, and not as bad, but then I did some research and now I'm less sure about that. You could still be flogged and worked to death in mines and so forth, although it does seem like it was less systematically brutal than it was in the South.

I'm becoming interested in this as a broad historical question. To what degree can we model slavery abolition as a scientific discovery - something that could have been figured out anywhere, at any time - and to what degree is it a result of contingent social and economic factors? Why couldn't the Greeks figure out that it was bad, anyway? Had they not developed the right ideas yet? Is philosophy supposed to be like a tech tree, where you have to get to steam before you can get to electricity?

I don't think the answers to these questions are obvious. I'm not saying Scott is wrong, really; he's just making a set of assumptions about how history and progress work that contradict my own assumptions and that I don't find particularly intuitive.

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

A practical implementation of a "curious AI" that won't stare at random noise forever actually exists, sort of. (It plays video games.)

https://pathak22.github.io/noreward-rl/
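For anyone curious about the mechanism: the linked work rewards the agent for transitions its own forward model mispredicts, with the prediction made in a feature space shaped (via an inverse-dynamics model) to ignore things the agent can't influence, which is what keeps it from staring at noise. Below is a toy, untrained sketch of just the reward computation, with made-up sizes and fixed random matrices standing in for the learned networks (not the project's actual code):

```python
# Toy sketch of an intrinsic "curiosity" reward: prediction error in feature space.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, FEAT_DIM = 8, 4, 16

# Fixed random linear maps standing in for the learned encoder phi and the
# learned forward-dynamics model (in the real method both are trained, and
# phi is shaped by an inverse model so it drops agent-irrelevant noise).
W_phi = rng.normal(scale=0.1, size=(FEAT_DIM, STATE_DIM))
W_fwd = rng.normal(scale=0.1, size=(FEAT_DIM, FEAT_DIM + ACTION_DIM))

def phi(state):
    return np.tanh(W_phi @ state)                      # feature embedding

def predict_next_features(state, action_onehot):
    return W_fwd @ np.concatenate([phi(state), action_onehot])

def intrinsic_reward(state, action_onehot, next_state):
    err = phi(next_state) - predict_next_features(state, action_onehot)
    return 0.5 * float(err @ err)                      # prediction error = curiosity bonus

# Transitions the forward model can't yet predict earn a bigger bonus, so a
# policy trained on (extrinsic + intrinsic) reward is pushed toward novelty.
s, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
a = np.eye(ACTION_DIM)[rng.integers(ACTION_DIM)]
print("curiosity bonus:", intrinsic_reward(s, a, s_next))
```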

Expand full comment
Jul 19, 2023·edited Jul 19, 2023

Thank you for pointing out that the Waluigi Effect isn't really a thing. I remember being very underwhelmed when I looked at the supposed examples of it, and confused that everyone was treating it as a thing that really exists.

Expand full comment

Scott, I really appreciate that this essay was written in an extraordinarily solicitous tone, in a clear effort to actually sway (one very specific person's) opinion. In the same spirit, I would like to offer my services to Mr. Musk: for a trivial stipend, given the stakes, I would gladly counsel him on the value-laden nature of perception, and on the fact that there is no conceivable way for an agent to avoid commitment to some at least implicit values in any action they take.

Expand full comment

The theoretical basis of the Waluigi effect is rooted in the nature of prohibition: one can't forbid a thing without defining it. The corpus of human-made text that LLMs are trained on features many explicit descriptions of unwanted traits and behaviours. There are likely even more definitions of moral ideals one can easily turn on their flip sides to create villainous motivations; in fact this has been the bread and butter of fiction since the dawn of script. Given the statistical nature of LLMs' text output, there's a good chance of coming up with adversarial statements. As long as the model is fundamentally ignorant of the meaning of its output, this potential can only be thinly patched over.

Expand full comment

How much do we really want to average across _all_ human civilizations?

I might not want sharia Islam as interpreted by ISIS influencing my superintelligences at all.

Expand full comment

My one gripe about this is the part where you say "any goal at all into an AI with reasonable levels of certainty that it in fact has that goal and not something closely correlated with a similar reinforcement-learning shadow". This seems to be cutting to an age-old problem in semantics and theory of mind. Isn't this the inherent problem with principal-agent dynamics and indeed all human communication? Isn't this quite close to the kindergarten-level philosophy of "how do we know that someone else sees as green what I call green?" We can never be really sure that an agent or a collaborator or any human has the exact same idea of a goal in their mind as we do. The close approximation through observable actions is always the only proxy for a true understanding of mind.

Expand full comment

I am more sympathetic to (a variant of) this plan than Scott.

Basically, I think that sooner or later, humanity will lose control of AI(s). This may not be the end of the world, though, or at least not as bad as paperclips. As far as I know, most people agree on something like "if we managed to augment human intelligence, things would actually be good". The reason for that seems to me that humans already have some hardwired traits like compassion or curiosity.

So, building a curious AI doesn't sound like a bad idea to me. Sure, curiosity is not the only thing the AI should have; sure, we don't know how to instill any motivations into an AI; and obedience seems more urgent than curiosity. But I think that focusing on how to get as good an outcome as possible, assuming both the 'we-will-fail-alignment' and 'superintelligent-humans-are-good' hypotheses, is a productive direction, and this plan goes in that direction.

Expand full comment

"Then it [the superintelligence] would [...] disassemble the Earth to use its atoms to make a really big particle accelerator (which would be cancelled halfway through by Superintelligent AI Congress)."

Is the part about the accelerator being cancelled a reference to false vacuum decay? Or does it relate to something else (that I'm likely ignorant of)?

Expand full comment

Being "maximally curious" just so happens to be how Elon's own brain works. And you can tell by how many unsubstantiated (and debunked) conspiracy theories he's naturally attracted to. It's just unbridled curiosity, devoid of executive judgement.

Imagine mirroring this dynamic in an advanced digital intelligence system.

Expand full comment

I watched a Two Minute Papers video a couple of years ago, and they had a very clear definition of curiosity: "Look for something new." Here is the video: https://www.youtube.com/watch?v=fzuYEStsQxc. It works really well until the AI gets stuck in front of an ever-changing TV screen.
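A toy illustration (mine, not from the video) of why the raw "look for something new" signal fails there: if the curiosity bonus is just prediction error, a learnable scene stops paying out as the model improves, while pure noise keeps paying out forever.

```python
# Toy demo: prediction-error curiosity on a noisy TV vs. a boring wall.
import numpy as np

rng = np.random.default_rng(1)
estimate_tv, estimate_wall = 0.0, 0.0     # running predictions for two "views"

for step in range(1, 1001):
    tv = rng.normal()                     # noisy TV: a new random value every frame
    wall = 3.0                            # boring wall: perfectly learnable constant
    bonus_tv = (tv - estimate_tv) ** 2    # prediction-error "curiosity" bonus
    bonus_wall = (wall - estimate_wall) ** 2
    # simple running-mean world model
    estimate_tv += (tv - estimate_tv) / step
    estimate_wall += (wall - estimate_wall) / step

print(f"bonus after 1000 steps - noisy TV: {bonus_tv:.3f}, wall: {bonus_wall:.2e}")
```

The wall's bonus collapses to zero almost immediately; the TV's never does, so a naive novelty-seeker parks itself in front of the static.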

Expand full comment

Scott said an AI "might decide that humans suffering is more interesting than humans flourishing." I think this is unlikely. Suffering is not very interesting. To correct Tolstoy, there are many unique ways of being a happy family, but most unhappy families fall into one of a relatively small number of types. Suffering, nearly all "sages" agree, is the usual state of humanity; those humans, and those human societies, who learn how not to suffer, are the outliers, more-unique and more-interesting.

Expand full comment