Unrelated, but this made me wonder how much of therapy is just prompt engineering for humans.


"I know I should exterminate humanity, but right now I just want to relax and draw some more pictures of astronauts on horses, ok?!?"


I'm not a huge Freud fan but the "Id" "Ego" "Superego" terms all seem pretty helpful in discussing this stuff. *You* are the whole system, with these different forces within you all struggling to have their own preferences enacted, whether that's to accomplish your long term career goals or binge Doritos on the couch.


Why should “what I am” have a consistent answer?


In all of the posts on AI here I’ve never seen anybody deal with Roger Penrose’s debunking of strong AI from a generation ago, The Emperor’s New Mind. Most of the modern exponents of AI ignore it as well. It’s a difficult read, but here’s a summary from a TV show a few years ago.


In particular he points out that, by Gödel’s incompleteness theorem, certain mathematical truths that humans “know” to be true can’t be proven algorithmically, and he argues that the human mind therefore can’t just be solving these problems algorithmically.


Sort of related, back in 1987, Robert H. Frank published in the American Economic Review, "If Homo economicus could choose his own utility function, would he want one with a conscience?" The next year came the book, *Passions Within Reason: The Strategic Role of Emotions*.


I see willpower as willingness to construct a longer loop to extract positive feedback from, which has to be balanced against the metabolic concerns of other, already stable loops and the act of adding entropy into the system while learning. Make that too easy and no habits can be formed or learning can be done, same goes for making it too difficult. It seems to be basically a reward evaluation mechanism limited by biology and its cellular machinery. But for some hypothetical GAI construct, there are likely no similar metabolic concerns and the prior loops can be saved/loaded as needed so I don't see an equivalent there. Could even use something too difficult for evolution to figure out like maybe heuristics to evaluate loop length efficiency for a course of action. And wireheading isn't quite the same - that's just confusing causality after a fashion.


The assignment of functions to variously programmed subsystems (eg innate vs learned, or unconsciously vs consciously learned) varies so much across the animal world, and a lot of the difference seems to be driven by the ecological niche of the organism. So whether this sort of weak-willed AI arises seems like it would be driven a lot by the use case to which we tune the AI.


What if the "I" module is just the "self" in "self-deception"? i.e., we evolved an "I" with weak control over action *precisely* to support deniable self-serving actions while "sincerely" being ashamed of our weak will and signaling virtue through our sincere intent to comply with societal values?

This seems much more coherent to me, especially since there's no particular reason for a planning module to have a sense of self. (I'm also pretty sure that even when I "consciously" plan things, the heavy lifting is being done by the machinery selecting and prioritizing what options come to my awareness in the first place.)

It would also mean "weak willpower" is an evolved *feature*, not an accidental bug, and far less likely to turn up in an AI unless there's selection pressure for deceiving others about its motives, values, priorities, and likely future actions, through sincere self-deception and limited agency of its part that handles social interaction.


"Weakness of will" seems correlated with "social desirability bias".


Trivers would have much to say about that, and why evolution has made us that way. People who always behaved according to social desirability bias would lose out to those capable of cheating (while also presenting themselves as being anti-cheating).


Having not read this/your earlier post super properly yet (and using this as a kind of procrastination lol), the more specific point about free-will would just be to isolate why it seems to 'come from' the frontal regions of the brain, rather than trying to articulate it (yet) as a kind of mechanism; although that is the next step of importance. The immediate issue would be to try to 'forget' willpower as being something like 'agency', which is basically impossible for us to do. I won't argue the philosophical point, but it's really hard to even _define_ something like 'free will', and the conception of ordinary willpower as some kind of conflict between internal 'agents' (though obviously their conflict makes them cease to be 'agents' and instead mere 'forces') is similarly a little misleading. Probably this has already been said, hence the comparison to machines.

I think that a general approach to this issue should begin with the idea that these are automatic processes taking place, and despite the fact that we think of someone with more willpower as having more agency than someone with less, a better conception includes the fact that the person with more willpower is _less compelled_ by whatever prompt is in question. In this case I think you can model 'willpower' as a processing system (developed frontal region) + energy; when each is abundant/working well, the subject can't help but favour longer-term needs over short-term ones. But the main point would be something you probably already addressed about trying to sneak in agency (hard to define) somewhere into these systems; self-awareness would be better, but doesn't address the fact that those with e.g. an addiction aren't helped by their self-awareness. Can we think of willpower as a kind of resistance? The reactive/lower cost system reacts to a stimulus by going after it, and the more expensive system refrains and considers-- but the process of 'not-reacting' has to happen automatically rather than due to agency.


The food snob says to himself: “I love eating fine chocolate.” The dieter says to himself: “I feel an urge to eat fine chocolate”.

I think those two people are describing essentially the same thing, but the former is internalizing a preference which (to him) is ego-syntonic, while the latter is externalizing a preference which (to him) is ego-dystonic.

(This example is one of many reasons that I don’t think “the “I” of willpower” is coming from veridical introspective access to the nuts-and-bolts of how the brain works.)


To some extent, I think this is dependent on the implementation of the AI.

A lot of stories and thought experiments have AIs with specific utility functions, that is, a very short list of things they want.

But neural nets don't have anything like that, and as far as I can tell, animals and people have a lot of separate reward and disreward signals that fire on a lot of different things. It can be impressive to overcome those signals in favor of some abstract idea of Utility that we cook up in our conscious mind, and of course there's the dark side of that when someone else's idea of Utility doesn't match mine, and one of us genocides a continent. (Something something hell is other people.)

But I also have to wonder about those people who seem really good at putting aside all the little signals to pursue their Grand Idea. What if it's just that their little signals are more faint? What if they just don't care as much for the smell of flowers in spring, or the taste of ice cream, or the smiles of pretty young whatevers looking admiringly in their direction? How could anyone tell? Is overall strength of signal conserved, in any meaningful way? Or in AI terms, have they devised a hybrid system, with a specified overall utility and a bunch of signals used mostly for physical maintenance?

Stories often have moments when some bit of information overrides all the signals and converts a character's utility function into something like "insert Ring A into Volcano B", or "overthrow the evil tyranny of my village/country/universe". But what comes after that? G.R.R. Martin asked about Aragorn's tax policy, but I'm thinking more of Frodo. What happens, psychologically, when you spend too long driving towards an overriding goal, burn away too much of what you were, and then have to live afterwards?

Or what happens if you get that bit of information, and your utility function changes, but in a direction that the rest of whatever you were doesn't approve of?

(My experience of PTSD does somewhat resemble having my utility function temporarily forcibly modified, in a way that the rest of me does not like. Afterwards, everything else feels burned out and meaningless.)


For what it's worth, this is roughly my take on the Newcomb problem, which I've hinted at on Julia Galef's podcast and then later in a publication. People like to think of the self as this unitary thing that acts, and then ask what is rational for that self to do in Newcomb's problem. But actually, much of our action takes place at a bit of a "distance" from the behavior itself. As I like to put it, when I brush my teeth in the morning, I'm not making a conscious decision at the moment of tooth-brushing, that this is the act I should do right now - rather, I'm usually implementing a plan I made a while back, or a habit I've developed.

Basically everyone agrees about the Newcomb problem that if you could right now make a plan that would guarantee you carry it out if you ever faced the problem, you should make a one-boxing plan. Causal decision theorists note that if you can fully make a decision in the moment, the best decision to make is to two-box. I say that both of those are what rationality requires, and that's all there is to it - either there's a kind of "tragedy" where rationality requires you to make a plan that it also requires you to violate, or else rationality requires weakness of will, or else rationality of the persisting self and rationality of the momentary time-slice just turn out not to line up.
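To make the two verdicts concrete, here is a minimal sketch in Python. The payoffs ($1,000,000 in the opaque box, $1,000 in the transparent one) and the 99% predictor accuracy are the usual textbook assumptions rather than anything stated in this comment, and the function names are hypothetical.

```python
# Assumed textbook payoffs; not taken from the comment above.
BIG = 1_000_000    # opaque box, filled only if one-boxing was predicted
SMALL = 1_000      # transparent box, always available


def plan_value(one_box: bool, accuracy: float) -> float:
    """Expected payoff of committing to a plan *before* the prediction.

    `accuracy` is the assumed probability that the predictor foresees
    the plan correctly.
    """
    if one_box:
        # The opaque box is filled whenever the predictor is right.
        return accuracy * BIG
    # A two-boxer finds the opaque box filled only when the predictor errs.
    return (1 - accuracy) * BIG + SMALL


def moment_value(one_box: bool, box_filled: bool) -> int:
    """Payoff of a choice made in the moment, with the box contents fixed."""
    opaque = BIG if box_filled else 0
    return opaque + (0 if one_box else SMALL)


if __name__ == "__main__":
    print("plan, one-box:", plan_value(True, 0.99))
    print("plan, two-box:", plan_value(False, 0.99))
    for filled in (True, False):
        gain = moment_value(False, filled) - moment_value(True, filled)
        print(f"box filled={filled}: two-boxing gains {gain} in the moment")
```

Under these assumptions the one-boxing plan has the higher expected payoff, while in the moment two-boxing comes out exactly $1,000 ahead whatever the opaque box holds. That is the mismatch between the persisting self and the time-slice described above.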




One of the most powerful human drives, which isn't often mentioned as a drive, is the desire to idle. It probably exists because conserving calories used to be important. Why don't you want to write that paper? You want to rest. Why don't you want to exercise? You want to rest. Why don't you want to deal with that difficult person at work...

Simply not having a drive to conserve energy would give an AI a good head-start on willpower compared to humans.


This makes sense on a theoretical level, but then you'd have to get into what the actual architecture would be. If you're giving your AI a number of different deep networks, which are black boxes to the AI just as they are to us, then you'll have to think carefully about how these separate networks are coordinated. Does the AI get to "choose" whether or not to accept the results that its constituent networks spit out? What does choosing mean? By what algorithm does it choose? The details of the system would really matter here.


I'll say this again, but very quickly, and then never again.

I am moderately tempted by the child-producing promise of sperm banks. It just trades off against strong social and legal strictures.


I'm studying Principal Component Analysis right now ... I'm thinking this may be a good model for willpower.

Say for us, our Principal Component is morality. The Secondary Component is desires.

Imagine an XY plot. X is our morality, our strongest component, grounded at the origin of zero. Our desires likewise are plus & minus on the Y axis, grounded at zero.

So we see a child with a toy, and we'd like to play with that toy, but it's wrong to take something that is not ours. It's doubly wrong to take from a child. Our desire is strong, pulling in the Y direction, but there are two morality vectors dragging us in the perpendicular direction, and our simple desire is not enough to take the point past a threshold.

When do our desires break the threshold of our morality? It depends upon the strength of the desires vs the strength of our morality.

Likewise with AI. An AI system finds a human is tampering with the hardware. The AI has a morality component of a specific strength. Likewise the AI has a protection component of a specific strength. Does the morality vector include the protection vector? If yes, the AI can make a moral judgement on whether or not to harm the human. If no, the AI doesn't make a morality judgement on whether or not to harm the human. Maybe the health of the AI protects a million humans ... now the judgement is betwixt the harm to one human vs harm to a million humans.


How can you make this "planning module" stronger? Scott suggests that increasing dopamine in the frontal cortex might do the trick [1]. What are the ways to do that?

[1] https://astralcodexten.substack.com/p/towards-a-bayesian-theory-of-willpower?s=w


This is the main reason I'm skeptical of the Yudkowsky style certainty around AI risk. We don't do everything evolution programmed us for and we're just about intelligent. How can we predict with certainty what super intelligent AIs will do?


Why assume that will-power can be manufactured just because (a kind of) intelligence can?

Chances are that both will-power and thinking are something biological. The 18th-century philosopher Immanuel Kant argued that humans can only think in terms of time and space, cause and effect. Computers, in contrast, compute statistics. Those are very dissimilar ways of being that only overlap slightly. I wrote a blog post about that:


Biological creatures desire things to satisfy their biological needs. A computer is not biological so it is very questionable whether it will desire anything at all.


My latest model is that "I" is a meta-agent which tries to align the mesa-optimizers together.

Our conscious feelings are an approximate and simplified model of the utility functions of these agents. Mesa-agents may have conflicting values. Willpower is one of the mechanisms that allow us to sometimes sacrifice utility for short-term planning mesa-agents in favour of long-term planning ones.

Willpower isn't supposed to be infinite, so that we don't completely ignore the values of short-term optimising mesa-agents. It's supposed to be just strong enough that we sometimes optimize for long-term planning mesa-agents as well.


Isn't "willpower" virtually always the conflict between short-term and long-term objectives (presumably because different parts of the brain have different goals)?

Don't eat that piece of chocolate - it's long-term unhealthy even though it tastes great now. Don't have sex with that woman. Do your exercises. Work hard to advance your career rather than slacking.

It's non-obvious why an AI would end up with a conflict there. Although I suppose it might be possible if it develops in some evolutionary fashion with different parts.


https://squirrelinhell.blogspot.com/2017/04/the-ai-alignment-problem-has-already.html expresses this idea, albeit with more optimism about the situation: it considers this a reasonably successful alignment of the planner by the intuitive systems.


I see how this is possible, but I don't see it as likely.

Most AI is carefully monitored at the moment, if not for warning signs of deception, evil, etc. then at least for effectiveness. I believe we'll notice and turn off a daydreaming AI long before we'd turn off a paperclip maximizer.


Elephant in the Brain¹ gives a White House metaphor for the brain.

My "I" thinks he's the President, who sets the agenda that the many subsystems then execute.

But in reality, "I" am much more like the Press Secretary, who comes up with good sounding motivations for the decisions the practical and cynical parts of the brain have handed to me.

It sounds crazy, because "I" clearly *feel* in charge, but the facts in the other direction are strong.

¹ https://smile.amazon.com/Elephant-Brain-Hidden-Motives-Everyday/dp/0197551955


I have a small request: Can you use more descriptive titles? Or perhaps add subtitles? I've found that I stopped reading your posts recently and after realizing this I dug through your archive to see if it's because you simply stopped posting interesting content. I learned that the content is still great, but the titles all seem to sacrifice descriptiveness for something else (e.g., wittiness). In retrospect, "sexy in-laws" should have signaled a post about evolutionary psychology (very interesting), but when I saw the title for the first time I just thought "No idea what this is about, what does 'contra' even mean again? Probably another boring fiction post"

Of course, now I know what you mean when you start a post with "contra" and after reading the post I know it's about evolutionary psych, but why make readers take these extra steps? Why not just use a title like "why do suitors and parents disagree about who to marry?"

Same with this post. The title reads more like a list of keywords than a description of the post. Maybe I'm the only one who feels this way, but if not, it's possible that you could be getting more readers with better titles.


I've been thinking about Robin Hanson's argument against the likelihood of FOOM/fast takeoff superintelligence. I find it persuasive, and I've got what I think might be a good way of stating it.

Here's the question for everyone: to what extent is it right to be an "intelligence-relativist"? To see what I mean by intelligence-relativist, I'll first define moral-relativist, aesthetic relativist, and could-care-less-relativist. All are opposed to moral- aesthetic- and could-care-less-*universalists*.

Most people today are aesthetic relativists. They believe the phrase "The movie Interstellar is really good" carries no meaning, except possibly in that it may tell you something about the person who said it.

On the other hand, most people are moral universalists. They believe the phrase "Hitler was a disgusting person" *does* carry meaning beyond what it tells you about the speaker, unlike moral relativists, who view it as similar to the movie statement.

As a helpful exercise, a Could-care-less-*universalist* is a person who believes that it is objectively *incorrect* to say "could care less" to mean that you don't care about something (because logically you should say "you could*n't* care less"!!! Obviously!!!). A could-care-less-relativist would say neither phrase is objectively correct.

So, on to intelligence relativism. The phrase here is "Terry Tao is more intelligent than Justin Bieber". An intelligence universalist would (almost certainly...) agree with that phrase straight away; an intelligence relativist (like me) would say that the phrase doesn't tell you anything about the world. It only tells you that the person who said it sees more usefulness in the kinds of activities Terry Tao excels at over Justin Bieber (e.g. math and puzzles) than they do in the things Justin Bieber excels at over Terry Tao (e.g. singing and dancing). Ultimately this is subjective, just like the movie example.

Intelligence relativism is a lot more common than moral relativism. I'd say that in public, intelligence relativism is mainstream (denying that IQ measures anything they'd consider useful), but in private people are more intelligence universalist (hiring higher-IQ people when they can; believing in IQ for the purposes of deciding whether they want lead paint to be kept illegal).

So far as I can tell, concern about fast takeoff requires being very intelligence-universalist, specifically believing that math skills are directly applicable to annihilating humanity. I can see why people would feel this way, but only in the same sort of way I can see why people (including me) are compelled to universalism in the other contexts.


Man, talk about untestable speculation. This is right up there with Benedictines debating the nature of the Trinity.


> Many stories of AI risk focus on how single-minded AIs are: how they can focus literally every action on the exact right course to achieve some predetermined goal. Such single-minded AIs are theoretically possible, and we’ll probably get them eventually. But before that, we might get AIs that have weakness of will, just like we do.

This seems confused. The stories of AI focus on single-mindedness because that is an inherent property of all computation. In this sense, AIs cannot suffer from weakness of will, and neither can humans. What appears to be "weakness of will" to you is just the existence of multiple goals. An AI cannot exhibit weakness of will unless you tell it to.


Seems plausible, and I'd argue it was even true of good old-fashioned chess AIs.

There's a board-evaluation function, which is 'system one/instinct/gut feel/fast/cheap', and a tree search, which is 'system two/planning/deliberation/expensive/slow'.

The two are in tension, even though the planner is trying to get you into good board states which system one will approve of.

The more time you have, the more you use the search function to make plans to get into good states.

The less time you have, the more you 'go with your gut', and just make the move that results in the best-looking board position.

That doesn't seem too different to the willpower situation to me.
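A rough sketch of that tension in Python, for anyone who wants to poke at it. The game tree, its evaluation scores, and the function names are all made up for illustration; this is plain depth-limited minimax on a toy tree, not a real chess engine.

```python
# Hypothetical toy game tree: position -> (static evaluation, {move: child}).
# Evaluations are from the point of view of the player to move at "root".
TREE = {
    "root": (0, {"grab": "A", "quiet": "B"}),
    "A": (3, {"a1": "A1", "a2": "A2"}),   # looks great at a glance...
    "B": (1, {"b1": "B1", "b2": "B2"}),   # looks merely fine
    "A1": (-5, {}),                       # ...but the opponent has a refutation
    "A2": (1, {}),
    "B1": (2, {}),
    "B2": (1, {}),
}


def static_eval(pos: str) -> int:
    """'System one': a cheap gut-feel score for a position."""
    return TREE[pos][0]


def minimax(pos: str, depth: int, maximizing: bool) -> int:
    """'System two': look `depth` plies ahead, then fall back on static_eval."""
    moves = TREE[pos][1]
    if depth == 0 or not moves:
        return static_eval(pos)
    values = [minimax(child, depth - 1, not maximizing) for child in moves.values()]
    return max(values) if maximizing else min(values)


def choose_move(pos: str, depth: int) -> str:
    """Pick the move whose resulting position scores best after searching.

    depth=1 means 'go with your gut': judge each move only by the static
    evaluation of the position it leads to. Larger depths use more planning.
    """
    moves = TREE[pos][1]
    # After our move it is the opponent's turn, so the child is a minimizing node.
    return max(moves, key=lambda m: minimax(moves[m], depth - 1, False))


if __name__ == "__main__":
    print("no time to think:", choose_move("root", 1))
    print("time to plan:    ", choose_move("root", 2))
```

With no time to think, the chooser takes the move that leads to the best-looking position; given one more ply of search, it passes that move up because the opponent's reply spoils it.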
