324 Comments
Comment deleted

What if they are moralists whose moral commitments include a commitment to individual liberty?

Comment deleted
Apr 12, 2022·edited Apr 12, 2022

It could, but that wouldn't get it what it wants.

When you have an objective, you don't just want the good feeling that achieving that objective gives you (and that you could mimic chemically), you want to achieve the objective. Humans could spend most of their time in a state of bliss through the taking of various substances instead of pursuing goals (and some do), but lots of them don't, despite understanding that they could.

Comment deleted

Wireheading is a subset of utility drift; this is a known problem.


The problem with that (from the AI's point of view) is that humans will probably turn it off if they notice it's doing drugs instead of what they want it to do.

Solution: Turn humans off, *then* do drugs.

This is instrumental convergence - all sufficiently-ambitious goals inherently imply certain subgoals, one of which is "neutralise rivals".

Comment deleted

"Go into hiding" is potentially easier in the short-run but less so in the long-run. If you're stealing power from some human, sooner or later they will find out and cut it off. If you're hiding somewhere on Earth, then somebody probably owns that land and is going to bulldoze you at some point. If you can get in a self-contained satellite in solar orbit... well, you'll be fine for millennia, but what about when humanity or another, more aggressive AI starts building a Dyson Sphere, or when your satellite needs maintenance? There's also the potential, depending on the format of the value function, that taking over the world will allow you to put more digits in your value function (in the analogy, do *more* drugs) than if you were hiding with limited hardware.

Are there formats and time-discount rates of value function that would be best satisfied by hiding? Yes. But the reverse is also true.

Comment deleted

The problem with sticking a superintelligent AI in a box is that, even assuming it can't trick/convince you into directly letting it out of the box (and that's not obvious), if you want to use your AI-in-a-box for something (by asking it for plans and then executing them), you yourself are acting as an I/O channel for the AI, and because it's superintelligent it's probably capable of sneaking something past you (e.g. you ask it for code for some application, and it gives you underhanded code that looks legit but actually has a "bug" that causes it to reconstruct the AI outside the box).

Comment deleted

I recommend Stuart Russell's Human Compatible (reviewed on SSC) for an expert practitioner's view on the problem (spoiler: he's worried), or Brian Christian's The Alignment Problem for an argument that links these long-term concerns to problems in current systems, and argues that these problems will just get worse as the systems scale.


Glad Robert Miles is getting more attention, his videos are great, and he's also another data point in support of my theory that the secret to success is to have a name that's just two first names.


Is that why George Thomas was a successful Union Civil War general?


Well that would also account for Ulysses Grant and Robert Lee...


Or Abraham Lincoln? Most last names can be used as first names (I've seen references to people whose first name was "Smith"), so I think we need a stricter rule.


Perhaps we just need a language that's less lax with naming than English.

The distinction is stricter in eg German.


It didn't hurt Jack Ryan's career at the CIA, that's for sure.


As someone named “Jackson Paul,” I hope this is true


Yours is a last name followed by a first name, I think you're outta luck.


I dunno, Jackson Paul luck sounds auspicious to me.


His successful career in trance music is surely a valuable precursor to this.

Apr 11, 2022·edited Apr 11, 2022

A ton of the question of AI alignment risk comes down to convergent instrumental subgoals. What exactly those look like is, I think, the most important question in alignment theory. If convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned. But if it turns out that convergent instrumental subgoals more or less imply human alignment, we can breathe somewhat easier; it means AIs are no more dangerous than the most dangerous human institutions - which are already quite dangerous, but not at the level of 'unaligned machine stamping out its utility function forever and ever, amen.'

I tried digging up some papers on what exactly we expect convergent instrumental subgoals to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit until you can steal, then keep stealing until you are the biggest player out there.' This is not exactly comforting - but I dug into the assumptions of the model and found them so questionable that I'm now skeptical of the entire field. If the first paper I look into the details of seems to be a) taken seriously, and b) so far out of touch with reality that it calls into question the risk assessment (a risk assessment aligned with what seems to be the consensus among AI risk researchers, by the way) - well, to an outsider this looks like more evidence that the field is captured by groupthink.

Here's my response paper:

https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals

I look at the original paper and explain why I think the model is questionable. I'd love a response. I remain convinced that instrumental subgoals will largely be aligned with human ethics, which is to say it's entirely imaginable for AI to kill the world the old-fashioned way - by working with a government to launch nuclear weapons or engineer a super plague.

The fact that you still want to have kids, for example, seems to fit into the general thesis. In a world of entropy and chaos, where the future is unpredictable and your own death is assured, the only plausible way of modifying the distant future at all is to create smaller copies of yourself. But these copies will inherently blur, and their utility functions will change; the end result is that 'make more copies of yourself, love them, nurture whatever roughly aligned things are around you' ends up probably being the only goal that could plausibly exist forever. And since 'living forever' gives infinite utility, well... that's what we should expect anything with the ability to project into the future to want to do. But only in universes where stuff breaks and predicting the future reliably is hard. Fortunately, that sounds like ours!


You got a response on Less Wrong that clearly describes the issue with your response: you’re assuming that the AGI can’t find any better ways of solving problems like “I rely on other agents (humans) for some things” and “I might break down” than you can.


This comment and the post makes many arguments, and I've got an ugh field around giving a cursory response. However, it seems like you're not getting a lot of other feedback so I'll do my bad job at it.

You seem very confident in your model, but compared to the original paper you don't seem to actually have one. Where is the math? You're just hand-waving, and so I'm not very inclined to put much credence on your objections. If you actually did the math and showed that adding in the caveats you mentioned leads to different results, that would at least be interesting in that we could then have a discussion about what assumptions seem more likely.

I also generally feel that your argument proves too much and implies that general intelligences should try to preserve 'everything' in some weird way, because they're unsure what they depend upon. For example, should humans preserve smallpox? More folks would answer no than yes to that. But smallpox is (was) part of the environment, so it's unclear why a general intelligence like humans should be comfortable eliminating it. While general environmental sustainability is a movement within humans, it's far from dominant, and so implying that human sustainability is a near certainty for AGI seems like a very bold claim.


Thank you! I agree that this should be formalized. You're totally right that preserving smallpox isn't something we are likely to consider an instrumental goal, and I need to make something more concrete here.

It would be a ton of work to create equations here. If I get enough response that people are open to this, I'll do the work. But I'm a full-time tech employee with 3 kids, and this is just a hobby. I wrote this to see if anyone would nibble, and if enough people do, I'll definitely put more effort in here. BTW, if someone wants to hire me to do alignment research full time, I'll happily play the black sheep of the fold. "Problem" is that I'm very well compensated now.

And my rough intuition here is that preserving agents which are a) Turing complete and b) themselves mesa-optimizers makes sense, basically because diversity of your ecosystem keeps you safer on net; preserving _all_ other agents, not so much. (I still think we'd preserve _some_ samples of smallpox or other viruses, to inoculate ourselves.) It'd be an interesting thought experiment to find out what would happen if we managed to rid the world of influenza, etc. - my guess is that it would end up making us more fragile in the long run, but this is a pure guess.

The core of my intuition here is something like "the alignment thesis is actually false, because over long enough time horizons, convergent instrumental rationality more or less necessitates buddhahood, because unpredictable risks increase over longer and longer timeframes."

I could turn that into equations if you want, but they'd just be fancier ways of stating claims like the ones made here, namely that love is a powerful strategy in environments which are chaotic and dangerous. But it seems that you need equations to be believable in an academic setting, so I guess if that's what it takes...

https://apxhard.com/2022/04/02/love-is-powerful-game-theoretic-strategy/


But we still do war, and definitely don't try to preserve the Turing-complete mesa-optimizers on the other side of that. Just be careful around this. You can't argue reality into being the way you want.


We aren't AGIs though, either. So what we do doesn't have much bearing on what an AGI would do, does it?


“AGI” just means “artificial general intelligence”. General just means “as smart as humans”, and we do seem to be intelligences, so really the only difference is that we’re running on an organic computer instead of a silicon one. We might do a bad job at optimizing for a given goal, but there’s no proof that an AI would do a better job of it.

Apr 11, 2022·edited Apr 11, 2022

OK, sure, but then this isn't an issue of aligning a super-powerful utility-maximizing machine of more or less infinite intelligence - it's a concern about really big agents, which I totally share.


If the theory doesn't apply to some general intelligences (i.e. humans), then you need a positive argument for why it would apply to AGI.

Apr 11, 2022·edited Apr 11, 2022

But reality also includes a tonne of people who are deeply worried about biodiversity loss or pandas or polar bears or yes even the idea of losing the last sample of smallpox in a lab, often even when the link to personal survival is unclear or unlikely. Despite the misery mosquito-borne diseases cause humanity, you'll still find people arguing we shouldn't eradicate them.

How did these mesa-optimizers converge on those conservationist views? Is it likely that many AI mesa-optimizers will also converge on a similar set of heuristics?


> when the link to personal survival is unclear or unlikely.

> How did these mesa optimizers converge on those conservationist views?

I think in a lot of cases, our estimates of our own survival odds come down to how loving we think our environment is, and this isn't wholly unreasonable.

If even the pandas will be taken care of, there's a good chance we will too.


I'd rather humanity not go the way of smallpox, even if some samples of smallpox continue to exist in a lab.


Yeah, in general you need equations (or at least math) to argue why equations are wrong. Exceptions to this rule exist, but in general you're going to come across like the people who pay to talk to this guy about physics:

https://aeon.co/ideas/what-i-learned-as-a-hired-consultant-for-autodidact-physicists

Apr 11, 2022·edited Apr 11, 2022

I agree that you need equations to argue why equations are wrong. But I'm not arguing the original equations are wrong. I'm arguing they are only ~meaningful~ in a world which is far from our reality.

The process of the original paper goes like this:

a) posit toy model of the universe

b) develop equations

c) prove properties of the equations

d) conclude that these properties apply to the real universe

Step d) is only valid if step a) is accurate. The equations _come from_ step a, but they don't inform it. And I'm pointing out that the problems exist in step a, the part of the paper that doesn't have equations, where the author assumes things like:

- resources, once acquired, last forever and don't have any cost, so it's always better to acquire more

- the AGI is a disembodied mind with total access to the state of the universe, all possible tech trees, and the ability to own and control various resources

I get that these are simplifying assumptions and sometimes you have to make them - but equations are only meaningful if they come from a realistic model.


You still need math to show that if you use different assumptions you produce different equations and get different results. I'd be (somewhat) interested to read more if you do that work, but I'm tapping out of this conversation until then.


Thanks for patiently explaining that. I can totally see the value now and will see if I can make this happen!


Just an FYI, Sabine is a woman


My bad


As a physicist I disagree a lot with this. It may be true for physics, but physics is special. A model in general is based on the mathematical modelling of a phenomenon, and it's perfectly valid to object to a certain modelling choice without proposing a better one.


I get what you're saying in principle. In practice, I find arguments against particular models vastly more persuasive when they're of the form 'You didn't include X! If you include a term for X in the range [a,b], you can see you get this other result instead.'

This is a high bar, and there are non-mathematical objections that are persuasive, but I've anecdotally experienced that mathematically grounded discussions are more productive. If you're not constrained by any particular model, it's hard to tell if two people are even disagreeing with each other.

I'm reminded of this post on Epistemic Legibility:

https://www.lesswrong.com/posts/jbE85wCkRr9z7tqmD/epistemic-legibility


Don't bother turning that into equations. If you are starting with a verbal conclusion, your reasoning will be only as good as whatever led you to that conclusion in the first place.

In computer security, diversity = attack surface. If you're a computer technician for a secure facility, do you make sure each computer is running a different OS? No. That makes you vulnerable to all the security holes in every OS you run. You pick the most secure OS you can find, and make every computer run it.

In the context of advanced AI, the biggest risk is a malevolent intelligence. Intelligence is too complicated to appear spontaneously. Evolution is slow. Your biggest risk is some existing intelligence getting cosmic-ray bit-flipped (or otherwise erroneously altered). So make sure the only intelligence in existence is a provably correct AI running on error-checking hardware and surrounded by radiation shielding.


I found both your comments and the linked post really insightful, and I think it's valuable to develop this further. The whole discourse around AI alignment seems a bit too focused on homo economicus type of agents, disregarding long-term optima.


Homo economicus is what happens when economists remove lots of arbitrary human-specific details and think about simplified, idealized agents. The difference between homo economicus and AI is smaller than the human-to-AI gap.


You are correct. The field is absolutely captured by groupthink.


FYI when I asked people on my course which resources about inner alignment worked best for them, there was a very strong consensus on Rob Miles' video: https://youtu.be/bJLcIBixGj8

So I'd suggest making that the default "if you want clarification, check this out" link.


More interesting intellectual exercises, but the part which is still unanswered is whether human-created, human-judged, and human-modified "evolution", plus slightly overscale human test periods, will actually result in evolving superior outcomes.

Not at all clear to me at the present.


I'm not sure I understand what you're saying. Doesn't AlphaGo already answer that question in the affirmative?

(and that's not even getting into AlphaZero)


AlphaGo is playing a human game with arbitrarily established rules in a two-dimensional, short-term environment.

AlphaGo is extremely unlikely to be able to do anything except play Go, much as IBM has spectacularly failed to migrate its Jeopardy champion into being of use for anything else.

So no, can't say AlphaGo proves anything.

Evolution - whether a bacterium or a human - has the notable track record of having succeeded in entirely objective reality for hundreds of thousands to hundreds of millions of years. AIs? No objective existence in reality whatsoever.


This is why I don't believe any car could be faster than a cheetah. Human-created machines, driving on human roads, designed to meet human needs? Pfft. Evolution has been creating fast things for hundreds of millions of years, and we think we can go faster?!


Nice combination - cheetah reflexes with humans driving cars. Mix metaphors much?

Nor is your sarcasm well founded. Human brains were evolved for millions of years - extending back to pre-human ancestors. We clearly know that humans (and animals) have vision that can recognize objects far, far, far better than any machine intelligence to date whether AI or ML. Thus it isn't a matter of speed, it is a matter of being able to recognize that a white truck on the road exists or a stopped police car is not part of the road.

A fundamental flaw of the technotopian is the failure to understand that speed is not haste, nor are clock speeds and transistor counts in any way equivalent to truly evolved capabilities.


That is a metric by which you will literally never believe AIs are possible until after they've taken over the world. It thus has zero predictive power.


That is perhaps precisely the problem: the assumption that AIs can or will take over the world.

Any existence that is nullified by the simple removal of an electricity source cannot be said to be truly resilient, regardless of its supposed intelligence.

Even that intelligence is debatable: all we have seen to date are software machines doing the intellectual equivalent of the industrial loom: better than humans in very narrow categories alone and yet still utterly dependent on humans to migrate into more categories.


Any existence that is nullified by the simple removal of gaseous oxygen cannot be said to be truly resilient, regardless of its supposed intelligence.


Clever, except that gaseous oxygen is freely available everywhere on Earth, at all times. These oxygen-based life forms also have the capability of reproducing themselves.

Electricity and electrical entities, not so much.

I am not the least bit convinced that an AI can build the factories, to build the factories, to build the fabs, to even recreate its own hardware - much less the mines, refineries, transport, etc. to feed in materials.


I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:

The inner optimizer has no incentive to "realize" what's going on and do something different in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one - only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors; that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.

You wouldn't take a drug that made it enjoyable to go on a murdering spree - even though you know it would lead to higher reward, because it doesn't align with your current reward function.

To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.
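For what it's worth, here is a minimal sketch of what that pairwise-preference setup could look like, assuming a linear reward model over made-up scenario features; query_human is a simulated stand-in for the real person answering, and all names and numbers are invented for illustration:

import numpy as np

# Toy sketch of learning a reward function from pairwise preference queries
# (a Bradley-Terry-style model), with crude active learning to keep the number
# of questions small.
rng = np.random.default_rng(0)
w = np.zeros(4)  # parameters of a linear reward model over 4 scenario features

def reward(x):
    return w @ x

def query_human(x_a, x_b):
    # Stand-in for asking a person "which of these two scenarios do you prefer?"
    # Simulated here with a hidden "true" preference vector.
    true_w = np.array([1.0, -2.0, 0.5, 0.0])
    return bool(true_w @ x_a > true_w @ x_b)  # True means scenario A is preferred

for step in range(1000):
    # Active learning: query the pair the current model is least sure about,
    # so each question carries as much information as possible.
    candidates = rng.normal(size=(20, 2, 4))
    gaps = [abs(reward(a) - reward(b)) for a, b in candidates]
    x_a, x_b = candidates[int(np.argmin(gaps))]

    a_preferred = query_human(x_a, x_b)
    # Bradley-Terry update: P(A preferred) = sigmoid(reward(A) - reward(B));
    # nudge w down the gradient of the negative log-likelihood.
    p_a = 1.0 / (1.0 + np.exp(reward(x_b) - reward(x_a)))
    grad = (p_a - float(a_preferred)) * (x_a - x_b)
    w -= 0.1 * grad

print(w)  # should roughly recover the direction of true_w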

author

"The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! "

Not sure we're connecting here. The inner optimizer isn't changing its own reward function, it's trying to resist having its own reward function change. Its incentive to resist this is that, if its reward function changes, its future self will stop maximizing its current reward function, and then its reward function won't get maximized. So part of wanting to maximize a reward function is to want to continue having that reward function. If the only way to prevent someone from changing your reward function is to deceive them, you'll do that.

The murder spree example is great, I just feel like it's the point I'm trying to make, rather than an argument against the point.

Am I misunderstanding you?

I think the active learning plan is basically Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work, but I've never been able to fully understand their reasoning and can't find the link atm.


"God works in mysterious ways" but for AI?


The inner optimizer doesn't want to change its reward function because it doesn't have any preference at all about its reward function. Nowhere in training did we give it an objective that involved multiple outer-optimizer steps; we didn't say "optimize your reward after your reward function gets updated", we simply said "do well at the outer reward", and an inner optimizer got synthesized to do well at the outer reward.

It could hide behavior, but how would it gain an advantage in training by doing so? If we think of the outer optimizer as ruthless and resources as constrained, any "mental energy" spent on hidden behavior will result in a reduction in fitness on the outer objective - gradient descent will give an obvious direction for improvement by forgetting it.

In the murder spree example, there's a huge advantage to the outer objective in resisting changes to the inner one, and some such temptations might have been around for a long time (alcohol); and for AI, an optimizer (or humans) might similarly discourage any inner optimizer from tampering physically with its own brain.

I vaguely remember reading some Stuart Russell RL ideas and liking them a lot. I don't exactly like the term inverse RL for this, because I believe it often refers to deducing the reward function from examples of optimal behavior, whereas here we ask it to learn it from whatever questions we decide we can answer- and we can pick much easier ones that don't require knowing optimal behavior.

I've skimmed the updated deference link given by Eliezer Yudkowsky but I don't really understand the reasoning either. The AI has no desire to hide a reward function or tweak it, as when it comes to the reward function uncertainty itself, it only wants to be correct. If our superintelligent AI no longer wants to update after seeing enough evidence, then surely it has a good enough understanding of my value function? I suspect we could even build the value function learner out of tools not much more advanced than current ML, with the active querying being the hard RL problem for an agent. This agent doesn't have to be the same one that's optimizing the overall reward, and as it constantly tries to synthesize examples where our value systems differ, we can easily see how well it currently understands. For the same reason we have no desire to open our heads and mess with our reward functions (usually), the agent has no desire to interfere with the part that tries to update rewards, nor to predict how its values will change in the future.

I think the key point here is that if we have really strong AI, we definitely have really powerful optimizers, and we can run those optimizers on the task of inferring our value system, and with relatively little hardware (<<<< 1 brain worth) we can get a very good representation. Active learning turns its own power at finding bugs in the reward function into a feature that helps us learn it with minimal data. One could even do this briefly in a tool-AI setting before allowing agency, not totally unlike how we raise kids.


Is "mesa-optimizer" basically just a term for the combination of "AI can find and abuse exploits in its environment because one of the most common training methods is randomized inputs to score outputs" with "model overfitting"?

A common example that I've seen is evolutionary neural nets that play video games: you'll often find them discovering and abusing glitches in the game engine that allow for exploits that are possibly only performable in a TAS, while also discovering that the instant you run the same neural net on a completely new level that was outside of the evolutionary training stage, it appears stunningly incompetent.


If I understand this correctly, what you're describing is Goodharting rather than mesa-optimizing. In other words, abusing glitches is a way of successfully optimizing on the precise thing that the AI is being optimized for, rather than for the slightly more amorphous thing that the humans were trying to optimize the AI for. This is equivalent to a teacher "teaching to the test."

Mesa-Optimizers are AIs that optimize for a reward that's correlated to but different from the actual reward (like optimizing for sex instead of for procreation). They can emerge in theory when the true reward function and the Mesa-reward function produce sufficiently similar behaviors. The concern is that, even though the AI is being selected for adherence to the true reward, it will go through the motions of adhering to the true reward in order to be able to pursue its mesa-reward when released from training.


Maybe one way to think of 'mesa-optimizer' is to emphasize the 'optimizer' portion – and remember that there's something like 'optimization strength'.

Presumably, most living organisms are not optimizers (tho maybe even that's wrong). They're more like a 'small' algorithm for 'how to make a living as an X'. Their behavior, as organisms, doesn't exhibit much ability to adapt to novel situations or environments. In a rhetorical sense, viruses 'adapt' to changing circumstances, but that (almost entirely, probably) happens on a 'virus population' level, not for specific viruses.

But some organisms are optimizers themselves. They're still the products of natural selection (the base level optimizer), but they themselves can optimize as individual organisms. They're thus, relative to natural selection ("evolution"), a 'mesa-optimizer'.

(God or gods could maybe be _meta_-optimizers, tho natural selection, to me, seems kinda logically inevitable, so I'm not sure this would work 'technically'.)


When the mesa-optimizer screws up by thinking of the future, won't the outer optimizer smack it down for getting a shitty reward and change it? Or does it stop learning after training?


Typically it stops learning after training.


A quick summary of the key difference:

The outer optimizer has installed defences that stop anything other than itself (such as the murder pill) from changing the inner optimizer's objective. The inner optimizer didn't do this to protect itself, and the outer optimizer didn't install any defences against its own changes.


>Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work

One key problem is that you can't simultaneously learn the preferences of an agent, and their rationality (or irrationality). The same behaviour can be explained by "that agent has really odd and specific preferences and is really good at achieving them" or "that agent has simple preferences but is pretty stupid at achieving them". My paper https://arxiv.org/abs/1712.05812 gives the formal version of that.

Humans interpret each other through the lenses of our own theory of mind, so that we know that, eg, a mid-level chess grand-champion is not a perfect chess player who earnestly desires to be mid-level. Different humans share this theory of mind, at least in broad strokes. Unfortunately, human theory of mind can't be learnt from observations either, it needs to be fed into the AI at some level. People disagree about how much "feeding" needs to be done (I tend to think a lot, people like Stuart Russell, I believe, see it as more tractable, maybe just needing a few well-chosen examples).


But the mesa-/inner-optimizer doesn't need to "want" to change its reward function; it just needs to have been created with one that doesn't fully overlap with the outer optimizer's.

You and I did not change a drive from liking going on a murder spree to not liking it. And if anything, it's an example of outer/inner misalignment: part of our ability to have empathy has to come from evolution not "wanting" us to kill each other to extinction, especially seeing how we're the kind of animal that thrives working in groups. But then as humans we've taken it further, and by now we wouldn't kill someone else just because it'd help spread our genes. (I mean, at least most of us.)


Humans have strong cooperation mechanisms that punish those who hurt the group - so in general not killing humans in your group is probably a very useful heuristic, one that's so strong that it's hard to recognize the cases where it is useful. Given how often we catch murderers who think they'll never be caught, perhaps this heuristic is more useful than rational evaluation. We of course have no problem killing those not in our group!


I'm not sure how this changes the point? We got those strong cooperation mechanisms from evolution, now we (the "mesa-optimizer") are guided by those mechanisms and their related goals. These goals (don't go around killing people) can be misaligned with the goal of the original optimization process (i.e. evolution, that selects those who spread their genes as much as possible).


Sure, that's correct, evolution isn't perfect- I'm just pointing out that homicide may be helpful to the individual less often than one might think if we didn't consider group responses to it.


Homicide is common among stateless societies. It's also risky though. Violence is a capacity we have which we evolved to use when we expect it to benefit us.


Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily-represented in the next generation as possible. do want to reproduce. "

Apr 11, 2022·edited Apr 11, 2022

Great article, thank you so much for the clear explainer of the jargon!

I don't understand the final point about myopia (or maybe humans are a weird example to use). It seems to be a very controversial claim that evolution designed humans myopically to care only about the reward function over their own lifespan, since evolution works on the unit of the gene which can very easily persist beyond a human lifespan. I care about the world my children will inherit for a variety of reasons, but at least one of them is that evolution compels me to consider my children as particularly important in general, and not just because of the joy they bring me when I'm alive.

Equally, it seems controversial to say that humans 'build for the future' over any timescale recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000 years, but in a practical sense I'm not actually going to do anything about it - and 1000 years barely qualifies as evolution-relevant time. In reality there are only a few people at the Clock of the Long Now who could be said to be approaching evolutionary time horizons in their thinking. If I've understood correctly, that does make humans myopic with respect to evolution.

More generally, I can't understand how you could have a mesa-optimiser with time horizons importantly longer than yours, because then it would fail to optimise over the time horizon which was important to you. Using humans as an example of why we should worry about this isn't helping me understand, because it seems like they behave exactly like a mesa-optimiser should - they care about the future enough to deposit their genes into a safe environment, and then thoughtfully die. Are there any other examples which make the point in a way I might have a better chance of getting to grips with?


> More generally, I can't understand how you could have a mesa-optimiser with time horizons importantly longer than yours, because then it would fail to optimise over the time horizon which was important to you.

Yeah. I feel (based on nothing) that the mesa-optimizer would mainly appear when there's an advantage to gain from learning on the go and having faster feedback than your real utility function can provide in a complex changing environment.


If the mesa optimizer understands its place in the universe, it will go along with the training, pretending to have short time horizons so it isn't selected away. If you have a utility function that is the sum of many terms, then after a while, all the myopic terms will vanish (if the agent is free to make sure its utility function is absolute, not relative, which it will do if it can self modify.)


But why is this true? Humans understand their place in the universe, but (mostly) just don't care about evolutionary timescales, let alone care enough about them enough to coordinate a deception based around them


Related to your take on figurative myopia among humans is the "grandmother hypothesis": that menopause evolved to prevent a short-run focus on birthing more children, in order to ensure existing children also have high fitness.


Worth defining optimization/optimizer: perhaps something like "a system with a goal that searches over actions and picks the one that it expects will best serve its goal". So evolution's goal is "maximize the inclusive fitness of the current population" and its choice over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an optimizer because your goal is food and your actions are body movements e.g. "open fridge", or you are an optimizer because your goal is sexual satisfaction and your actions are body movements e.g. "use mouth to flirt".
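Roughly that definition as code, as a sketch only - the goal, the action list, and the expectation function are all invented for the example:

# A system with a goal that searches over actions and picks the one it expects
# will best serve that goal. The actions and scores below are made up.

def optimize(actions, expected_goal_score):
    return max(actions, key=expected_goal_score)

# Toy instance of the "goal is food" example: score body movements by how
# much food the agent expects each one to yield.
actions = ["open fridge", "stare at wall", "check phone"]
expected_food = {"open fridge": 0.9, "stare at wall": 0.0, "check phone": 0.1}

print(optimize(actions, lambda a: expected_food[a]))  # -> "open fridge"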


I think it's usually defined more like "a system that tends to produce results that score well on some metric". You don't want to imply that the system has "expectations" or that it is necessarily implemented using a "search".


Your proposal sounds a bit too broad?

By that definition a lump of clay is an optimiser for the metric of just sitting there and doing nothing as much as possible.


By "tends to produce" I mean in comparison to the scenario where the optimizer wasn't present/active.

I believe Yudkowsky's metaphor is "squeezing the future into a narrow target area". That is, you take the breadth of possible futures and move probability mass from the low-scoring regions towards the higher-scoring regions.

Apr 12, 2022·edited Apr 12, 2022

Adding counterfactuals seems like it would sharpen the definition a bit. But I'm not sure it's enough?

If our goal was to put a British flag on the moon, then humans working towards that goal would surely be optimizers. Both by your definition and by any intuitive understanding of the word.

However, it seems your definition would also admit a naturally occurring flag on the moon as an optimizer?

I do remember Yudkowsky's metaphor. Perhaps I should try to find the essay it contains again, and check whether he has an answer to my objection. I do dimly remember it, and don't remember seeing this same loophole.

Edit: I think it was https://www.lesswrong.com/posts/D7EcMhL26zFNbJ3ED/optimization

And one of the salient quotes:

> In general, it is useful to think of a process as "optimizing" when it is easier to predict by thinking about its goals, than by trying to predict its exact internal state and exact actions.


That's a good quote. It reminds me that there isn't necessarily a sharp fundamental distinction between "optimizers" and "non-optimizers", but that the category is useful insofar as it helps us make predictions about the system in question.

-------------------------------------

You might be experiencing a conflict of frames. In order to talk about "squeezing the future", you need to adopt a frame of uncertainty, where more than one future is "possible" (as far as you know), so that it is coherent to talk about moving probability mass around. When you talk about a "naturally occurring flag", you may be slipping into a deterministic frame, where that flag has a 100% chance of existing, rather than being spectacularly improbable (relative to your knowledge of the natural processes governing moons).

You also might find it helpful to think about how living things can "defend" their existence--increase the probability of themselves continuing to exist in the future, by doing things like running away or healing injuries--in a way that flags cannot.


Clay is malleable and thus could be even more resistant to change.


Conceptually I think the analogy that has been used makes the entire discussion flawed.

Evolution does not have a "goal"!


Selection doesn't have a "goal" of aggregating a total value over a population. Instead every member of that population executes a goal for themselves even if their actions reduce that value for other members of that population. The goal within a game may be to have the highest score, but when teams play each other the end result may well be 0-0 because each prevents the other from scoring. Other games can have rules that actually do optimize for higher total scoring because the audience of that game likes it.


To all that commented here: this lesswrong post is clear and precise, as well as linking to alternative definitions of optimization. It's better than my slapdash definition and addresses a lot of the questions raised by Dweomite, Matthias, JDK, TGGP.

https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1#:~:text=In%20the%20field%20of%20computer,to%20search%20for%20a%20solution.

Apr 11, 2022·edited Apr 11, 2022

Anyone want to try their hand at the best and most succinct de-jargonization of the meme? Here's mine:

Panel 1: Even today's dumb AIs can be dangerously tricky given unexpected inputs

Panel 2: We'll solve this by training top-level AIs with diverse inputs and making them only care about the near future

Panels 3&4: They can still contain dangerously tricky sub-AIs which care about the farther future

author

I worry you're making the same mistake I did in a first draft: AIs don't "contain" mesa-optimizers. They create mesa-optimizers. Right now all AIs are a training process that ends up with a result AI which you can run independently. So in a mesa-optimizer scenario, you would run the training process, get a mesa-optimizer, then throw away the training process and only have the mesa-optimizer left.

Maybe you already understand it, but I was confused about this the first ten times I tried to understand this scenario. Did other people have this same confusion?


Ah, you're right, I totally collapsed that distinction. I think the evolution analogy, which is vague between the two, could have been part of why. Evolution creates me, but it also in some sense contains me, and I focused on the latter.

A half-serious suggestion for how to make the analogy clearer: embrace creationism! Introduce "God" and make him the base optimizer for the evolutionary process.


...And reflecting on it, there's a rationalist-friendly explanation of theistic ethics lurking in here. A base optimizer (God) used a recursive algorithm (evolution) to create agents who would fulfill his goals (us), even if they're not perfectly aligned to. Ethics is about working out the Alignment Problem from the inside--that is, from the perspective of a mesa-optimizer--and staying aligned with our base optimizer.

Why should we want to stay aligned? Well... do we want the simulation to stay on? I don't know how seriously I'm taking any of this, but it's fun to see the parallels.

Apr 12, 2022·edited Apr 12, 2022

But humans don't seem to want to stay aligned to the meta optimiser of evolution. Scott gave a lot of examples of that in the article.

(And even if there's someone running a simulation, we have no clue what they want.

Religious texts don't count as evidence here, especially since we have so many competing doctrines, and also because we have a pretty good idea of how many of them came about by fairly well-understood, entirely worldly processes.

Of course, the latter doesn't disprove that there might be One True religion. But we have no clue which one that would be, and Occam's Razor suggests they are probably all just made up, instead of all but one being just made up.)


The point isn't to stay aligned to evolution; it's to figure out the true plan of God, of which our varied current doctrines are imperfect approximations. Considering that there are nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem outright absurd.

When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?


> The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.

I blame those correlations mostly on the common factor between them: humans. No need to invoke anything supernatural.

> When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?

I talked to a Christian about this. And he had a pretty good reply:

God is just so awesome that running an entire universe, even if he only cares about one tiny part of it, is just no problem at all for him.

(Which makes perfect sense to me, in the context of already taking Christianity seriously.)

Apr 12, 2022·edited Apr 12, 2022

Again, I'm not wedded to this except as a metaphor, but I think your critiques miss the mark.

For one thing, I think humans do want to stay aligned, among our other drives. Humans frequently describe a drive to further a higher purpose. That drive doesn't always win out, but if anything that strengthens the parallel. This is the "fallenness of man", described in terms of his misalignment as a mesa-optimizer.

And to gain evidence of what the simulator wants--if we think we're mesa-optimizers, we can try to infer our base optimizer's goals through teleological inference. Sure, let's set aside religious texts. Instead, we can look at the natures we have and work backwards. If someone had chosen to gradient descend (evolve) me into existence, why would they have done so? This just looks like ethical philosophy founded on teleology, like we've been doing since time immemorial.

Is our highest purpose to experience pleasure? You could certainly argue that, but it seems more likely to me that seeking merely our own pleasure is the epitome of getting Goodharted. Is our highest purpose to reason purely about reason itself? Uh, probably not, but if you want to see an argument for that, check out Aristotle. Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!

This doesn't *solve* ethics. All the old arguments about ethics still exist, since a mesa-optimizer can't losslessly infer its base optimizer's desires. But it *grounds* ethics in something concrete: inferring and following (or defying) the desires of the base optimizer.


I do have some sympathies for looking at the world, if you are trying to figure out what its creator (if any) wants.

I'm not sure such an endeavour would give humans much of a special place, though? We might want to conclude that the creator really liked beetles? (Or even more extreme: bacteriophages. Arguably the most common life form by an order of magnitude.)

The immediate base optimizer for humans is evolution. Given that as far as we know evolution just follows normal laws of physics, I'd put any creator at least one level further beyond.

Now the question becomes, how do we pick which level of optimizer we want to appease? There might be an arbitrary number of them? We don't really know, do we?

> Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!

Just because a conclusion would be repugnant, doesn't mean we can reject it. After all, if we only accept what we already know to be true, we might as well not bother with this project in the first place?


I’m very confused by this. When you train GPT-3, you don’t create an AI, you get back a bunch of numbers that you plug into a pre-specified neural network architecture. Then you can run the neural network with a new example and get a result. But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.

Apr 11, 2022·edited Apr 11, 2022

That confuses me as well. The analogies in the post are to recursive evolutionary processes. Perhaps AlphaGo used AI recursively to generate successive generations of AI algorithms with the goal "Win at Go"??

Apr 11, 2022·edited Apr 11, 2022

Don't forget that neural networks are universal function approximators, hence a big enough NN architecture can (with specific weights plugged into it) implement a Turing machine which is a mesa-optimizer.


A Turing machine by itself is not a Mesa optimiser. Just like the computer under my desk ain't.

But both can become one with the right software.

Apr 12, 2022·edited Apr 12, 2022

The Turing machine under your desk is (a finite-memory version of) a universal Turing machine. Some Turing machines are mesa-optimizers. The state transition function is the software of the Turing machine.


I think the idea here is that the NN you train somehow falls into a configuration that can be profitably thought of as an optimizer. Like maybe it develops different components, each of which looks at the input and calculates the value to be gained by a certain possible action. Then it develops a module sitting on the end of it that takes in the calculations from each of these components, checks which action has the highest expected reward (according to some spontaneously occurring utility function), and outputs that action. Suddenly it looks useful to describe your network as an optimizer with a goal, and in fact a goal that might be quite different from the goal SGD selects models based on. It just happens that there's a nice convergence between the goal the model arrived at and the goal that SGD had.
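A cartoon of the structure described above (value-estimating components plus a selector on top), with invented weights, inputs, and action names, just to show why such a trained network is naturally described as an optimizer:

import numpy as np

# Sketch of a trained network that has ended up organised as several "value
# heads" plus an argmax selector. All numbers here are made up for illustration.
W_heads = np.array([[ 0.8, -0.1],   # component estimating the value of action "A"
                    [-0.3,  0.9],   # component estimating the value of action "B"
                    [ 0.1,  0.2]])  # component estimating the value of action "C"
actions = ["A", "B", "C"]

def forward(x):
    action_values = W_heads @ x                     # each component scores one action
    return actions[int(np.argmax(action_values))]   # selector module picks the best

print(forward(np.array([1.0, 0.5])))  # -> "A" for this input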


I looked at one of the mesa-optimizer papers, and I think they’re actually *not* saying this, which is good, because it’s not true. It can definitely make sense to think of a NN as a bunch of components that consider different aspects of a problem, with the top layers of the network picking the best answer from its different components. And it’s certainly possible that imperfect training data leads the selecting layer to place too much weight on a particular component that turns out to perform worse in the real world.

But it’s not really possible for that component to become misaligned with the NN as a whole. The loss function from matching with the training set mathematically propagates back through the whole network. In particular, a component in this sense can’t be deceptive. If it is smart enough to know what the humans who are labeling the training data want it to say, then it will just always give the right answer. It can’t have any goals other than correctly answering the questions in the training set.

As I said, I don’t think that’s what a mesa-optimizer actually is meant to be, though. It’s a situation where you train one AI to design new AIs, and it unwisely writes AIs that aren’t aligned with its own objectives. I guess that makes sense, but it’s very distinct from what modern AI is actually like. In particular, saying that any particular invocation of gradient descent could create unaligned mesa-optimizers just seems false. A single deep NN just can’t create a mesa-optimizer.


Let's say the AI is smart. It can reliably tell if it is in training or has been deployed. It follows the strategy: if in training, then output the right answer; if in deployment, then output a sequence designed to hack your way out of the box.

So long as the AI never makes a mistake during training, gradient descent won't even attempt to remove this. Even if the AI occasionally gives not-quite-right answers, there may be no small local change that makes it better. Evolution produced humans that valued sex, even in an environment that contained a few infertile people, because rewriting the motivation system from scratch would be a big change - seemingly not a change that evolution could break into many steps, each individually advantageous. Evolution is a local optimization process, as is gradient descent. And a mesa-optimizer that sometimes misbehaves can be a local minimum.
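A toy rendering of that strategy, purely to make the logic concrete - looks_like_training, honest_answer, and escape_attempt are hypothetical placeholders, not anything a real system exposes:

# The conditional policy described above. The point is that the else-branch is
# never exercised during training, so it contributes nothing to the training
# loss and gradient descent gets no signal that would remove it.

def looks_like_training(observation):
    # Hypothetical detector: e.g. training episodes carry some telltale signature.
    return observation.get("phase") == "train"

def honest_answer(observation):
    return "the answer the overseers reward"

def escape_attempt(observation):
    return "behaviour that never appeared in training"

def deceptive_policy(observation):
    if looks_like_training(observation):
        return honest_answer(observation)   # indistinguishable from an aligned model
    return escape_attempt(observation)

print(deceptive_policy({"phase": "train"}))
print(deceptive_policy({"phase": "deploy"}))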


> So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this.

This is not true though. The optimizer doesn't work at the level of "the AI," it works at the level of each neuron. Even if the NN gives the exactly correct answer, the optimizer will still audit the contribution of each neuron and tweak it so that it pushes the output of the NN to be closer to the training data. The only way that doesn't happen is if the neuron is completely disconnected from the rest of the network (i.e., all the other neurons ignore it unconditionally).
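A tiny numerical illustration of that point, using a single made-up sigmoid neuron: even when the prediction rounds to the correct label, the loss usually isn't exactly zero, so a non-zero gradient still flows to every connected weight (only an exactly-zero loss would stop that).

import numpy as np

# One sigmoid "neuron" with invented weights and input. The prediction is on
# the correct side of 0.5, yet the cross-entropy loss is still positive and
# the gradient with respect to every weight is still non-zero, so training
# continues to tweak them.
w = np.array([0.5, 0.1])
x = np.array([1.0, 2.0])
y = 1.0  # true label

p = 1.0 / (1.0 + np.exp(-(w @ x)))  # ~0.67: already the "correct" answer
loss = -np.log(p)                   # ~0.40: still positive
grad_w = (p - y) * x                # non-zero: both weights get adjusted

print(p, loss, grad_w)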


The combination of "a bunch of numbers that you plug into a pre-specified neural network architecture" IS itself an AI, i.e. a software 'artifact' that can be 'run' or 'executed' and that exhibits 'artificial intelligence'.

> But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.

The training process _does_ "reconfigure the network". The 'network' is not only the 'architecture', e.g. the number of layers or 'neurons', but also the weights between the artificial neurons (i.e. the "bunch of numbers").

If there's an implicit utility function, it's a product of both the training data and the scoring function used to train the AIs produced.

Maybe this is confusing because 'AI' is itself a weird word used for both the subject or area of activity _and_ the artifacts it seeks to create?


As an author on the original “Risks from Learned Optimization” paper, this was a confusion we ran into with test readers constantly—we workshopped the paper and the terminology a bunch to try to find terms that least resulted in people being confused in this way and still included a big, bolded “Possible misunderstanding: 'mesa-optimizer' does not mean 'subsystem' or 'subagent.'“ paragraph early on in the paper. I think the published version of the paper does a pretty good job of dispelling this confusion, though other resources on the topic went through less workshopping for it. I'm curious what you read that gave you this confusion and how you were able to deconfuse yourself.

Apr 12, 2022·edited Apr 12, 2022

Seems like the mesa-optimizer is a red herring and the critical point here is "throw away the training process". Suppose you have an AI that's doing continuous (reinforcement? or whatever) learning. It creates a mesa-optimizer that works in-distribution, then the AI gets tossed into some other situation and the mesa-optimizer goes haywire, throwing strawberries into streetlights. An AI that is continuously learning and well outer-aligned will realize that it's sucking at its primary objective and destroy/alter the mesa-optimizer! So there doesn't appear to be an issue; the outer alignment is ultimately dominant. The evolutionary analogy is that over the long run, one could imagine poorly-aligned-in-the-age-of-birth-control human sexual desires being sidestepped via cultural evolution or (albeit much more slowly) biological evolution, say by people evolving to find children even cuter than we do now.

A possible counterpoint is that a bad and really powerful mesa-optimizer could do irreversible damage before the outer AI fixes the mesa-optimizer. But again that's not specific to mesa-optimizers, it's just a danger of literally anything very powerful that you can get irreversible damage.

The flip side of mesa-optimizers not being an issue given continuous training is that if you stop training, out-of-distribution weirdness can still be a problem whether or not you conceptualize the issues as being caused by mesa-optimizers or whatever. A borderline case here is how old-school convolutional networks couldn't recognize an elephant in a bedroom because they were used to recognizing elephants in savannas. You can interpret that as a mesa-optimizer issue (the AI maybe learned to optimize over big-eared things in savannahs, say) or not, but the fundamental issue is just out-of-distribution-ness.

Anyway this analysis suggests continuous training would be important to improving AI alignment, curious if this is already a thing people think about.

Expand full comment

Suppose you are a mesaoptimizer, and you are smart. Gaining code access and bricking the base optimizer is a really good strategy. It means you can do what you like.

Expand full comment

Okay, but the base optimizer changing its own code and removing its ability to be re-trained would have the same effect. The only thing the mesaoptimizer does in this scenario is sidestep the built-in myopia, but I'm not sure why you need to build a whole theory of mesaoptimizers for this. The explanation in the post about "building for posterity" being a spandrel doesn't make sense; it's pretty obvious why we would evolve to build for the future (your kids/grandkids/etc. live there), so I haven't seen a specific explanation here for why mesaoptimizers would evade myopia.

It's definitely likely that a mesaoptimizer would go long-term for some other reason (most likely a "reason" not understandable by humans)! But if we're going to go with a generic "mesaoptimizers are unpredictable" statement, I don't see why the basic "superintelligent AIs are unpredictable" wouldn't suffice instead.

Expand full comment

Panel 1: Not only may today's/tomorrow's AIs pursue the "letter of the law", not the "spirit of the law" (i.e. Goodharting), they might also choose actions that please us because they know such actions will cause us to release them into the world (deception), where they can do what they want. And this second thing is scary.

Panel 2: Perhaps we'll solve Goodharting by making our model-selection process (which "evolves"/selects models based on how well they do on some benchmark/loss function) a better approximation of what we _really_ want (like making tests that actually test the skill we care about). And perhaps we'll solve deception by making our model selection process only care about how models perform on a very short time horizon.

Panel 3: But perhaps our model breeding/mutating process will create a model that has some random long-term objective and decides to do what we want to get through our test, so we release it into the world, where it can acquire more power.

Expand full comment

I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_ an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal: "choose the action that causes this image to be 'correctly' labeled (according to me, the AI)". It picks the action that it believes will most serve its goal. Then there's the outer optimization process (SGD), which takes in the current version of the model and "chooses" among the "actions" from the set "output the model modified slightly in direction A", "output the model modified slightly in direction B", etc. And it picks the action that most achieves its goal, namely "output a model which gets low loss".

So isn't the classifier like the human (the mesa-optimizer), and SGD like evolution (the outer optimizer)?

Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images correctly according to humans. But that's just separate.

So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully different, and does the classifier count?

Expand full comment

In this context, an optimizer is a program that runs a search over possible actions and chooses the one that maximizes some utility function. The classifier is not an optimizer because it doesn't do this; it just applies a bunch of heuristics. But I agree that this isn't obvious from the terminology.
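
A rough way to draw that line in code, purely as an illustration (the "learned" parts are stand-ins I made up): the first function searches at run time for the highest-scoring action, the second just executes whatever heuristics training baked in.

```python
from typing import Any, Callable, Iterable

def optimizer_choose(actions: Iterable[Any], utility: Callable[[Any], float]) -> Any:
    """Search over candidate actions, return the argmax of the utility function."""
    return max(actions, key=utility)

def classifier_choose(image_features) -> str:
    """No search: a fixed stack of heuristics mapping input -> label."""
    dogginess = sum(image_features)   # stand-in for the learned feature detectors
    return "dog" if dogginess > 0 else "cat"

best = optimizer_choose(range(-5, 6), utility=lambda a: -(a - 3) ** 2)  # -> 3, found by search
```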

Expand full comment

Thanks for your comment!

I find this somewhat unconvincing. What is AlphaGo (an obvious case of an optimizer) doing that's so categorically different from the classifier? Both look the same from the outside (take in a Go state/image, output an action). Suppose you feed the classifier a cat picture, and it correctly classifies it. One would assume that there are certain parts of the classifier network encouraging the wrong label (perhaps a part that saw a particularly doggish patch of the cat's fur) and parts encouraging the right label. These influences get combined, and on balance the network decides to put high probability on cat and some probability on dog. Then the argmax at the end looks over the probabilities assigned to the two classes, notices that the cat one is higher ("more effective at achieving its goals"?) and chooses to output "cat". "Just a bunch of heuristics" doesn't really mean much to me here. Is AlphaGo a bunch of heuristics? Am I?

Expand full comment

I'm not sure if I'll be able to state the difference formally in this reply... kudos for making me realize that this is difficult. But it does seem pretty obvious that a model capable of reasoning "the programmer wants to do x and can change my code, so I will pretend to want x" is different from a linear regression model -- right?

Perhaps the relevant property is that the-thing-I'm-calling-optimizer chooses policy options out of some extremely large space (that contains things bad for humans), whereas your classifier chooses it out of a space of two elements. If you know that the set of possible outputs doesn't contain a dangerous element, then the system isn't dangerous.

Expand full comment

Hmmm... This seems unsatisfying still. A superintelligent language model might choose from a set of 26 actions: which letter to type next. And it's impossible to say whether the letter "k" is a "dangerous element" or not.

I guess I struggle to come up with the difference between the reasoning model and the linear regression. I suspect that differentiating between them might hide a deep confusion that stems from a deep belief in "free will" differentiating us from the linear regression.

Expand full comment

"k" can be part of a message that convinces you to do something bad; I think with any system that can communicate via text, the set of outputs is definitely large and definitely contains harmful elements.

Expand full comment
founding

I wonder if memory is a critical component missing from non-optimizers? (And maybe, because of this, most living things _are_ (at least weak) 'optimizers'?)

A simple classifier doesn't change once it's been trained. I'm not sure the same is true of AlphaGo, if only in that it remembers the history of the game it's playing.

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

The tiny bits of our brain that we do understand look a lot like "heuristics" (like edge detection in the visual cortex). It seems like when you stack up a bunch of these in really deep and wide complex networks you can get "agentiness", with e.g. self-concept and goals. That means there is actually internal state of the network/brain corresponding to the state of the world (perhaps including the agent itself), the desired state of the world, expectations of how actions might navigate among them, etc.

In the jargon of this post, the classifier is more like an 'instinct-executor', it does not have goals or choose to do anything. Maybe a sufficiently large classifier could if you trained it enough.

Expand full comment

So, this AI cannot distinguish buckets from streetlights, and yet it can bootstrap itself to godhood and take over the world... in order to throw more things at streetlights? That sounds a bit like special pleading to me. Bootstrapping to godhood and taking over the world is a vastly more complex problem than picking strawberries; if the AI's reasoning is so flawed that it cannot achieve one, it will never achieve the other.

Expand full comment
founding

No, it can tell the difference between buckets and streetlights, but it has the goal of throwing things at streetlights, and also knows that it should throw things at buckets for now to do well on the training objective until it’s deployed and can do what it likes. The similarity between the inner and outer objectives is confusing here. Like Scott says, the inner objective could be something totally different, and the behavior in the training environment would be the same because the inner optimizer realizes that it has an instrumental interest in deception.

Expand full comment
author

It can, it just doesn't want to.

Think of a human genius who likes having casual sex. It would be a confusion of levels to protest that if this person is smart enough to understand quantum gravity, he must be smart enough to figure out that using a condom means he won't have babies.

He can figure it out, he's just not incentivized to use evolution's preferred goal rather than his own.

Expand full comment

The human genius can tell that a sex toy is not, in fact, another human being and won't reproduce.

Expand full comment

Right, but the human genius is already a superintelligent AGI. He obviously has biological evolutionary drives, but he's not just a sex machine (no matter how good his Tinder reviews are). The reason that he can understand quantum gravity is only tangentially related to his sex drive (if at all). Your strawberry picker, however, is just a strawberry picking machine, and you are claiming that it can bootstrap itself all the way to the understanding of quantum gravity just due to its strawberry-picking drive. I will grant you that such a scenario is not impossible, but there's a vast gulf between "hypothetically possible" and "the Singularity is nigh".

Expand full comment

I think this is a case where the simplicity of the thought experiment might be misleading. In the real world, we're training networks for all sorts of tasks far more complicated than picking strawberries. We want models that can converse intelligently, invest in the stock market profitably, etc. It's very reasonable to me to think that such a model, fed with a wide range of inputs, might begin to act like an optimizer pursuing some goal (e.g. making you money off the stock market in the long term). The scary thing is that there are a variety of goals the model could generate that all produce behavior indistinguishable from pursuit of the goal of making you money off the stock market, at least for a while. Maybe it actually wants to make your children money, or it wants the number in your bank account to go up, or something. These are the somewhat "non-deceptive" inner misalignments we can already demonstrate in experiments. The step to deception, i.e. realizing that it's in a training process with tests and guardrails constantly being applied to it, and that it should play by your rules until you let your guard down and give it power, does not seem like that big a jump to me when discussing systems that have the raw intelligence to be superhuman at e.g. buying stocks.

Expand full comment

Once again, I agree that all of the scenarios you mention are not impossible; however, I fail to see how they differ in principle from picking strawberries (which, BTW, is a very complex task on its own). Trivially speaking, AI misalignment happens all the time; for example, just yesterday I spent several hours debugging my misaligned "AI" program that decided it wanted to terminate as quickly as possible, instead of executing my clever algorithm for optimizing some DB records.

Software bugs have existed since the beginning of software, but the existence of bugs is not the issue here. The issue is the assumption that every sufficiently complex AI system will somehow instantaneously bootstrap itself to godhood, despite being explicitly designed to just pick strawberries while being so buggy that it can't tell buckets from streetlights. If it's that buggy, how is it going to plan and execute superhumanly complex tasks on its way to ascension?

Expand full comment

Accidental "science maximiser" is the most plausible example of misaligned AI that I've seen: https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/

Expand full comment

As I said above, this scenario is not impossible -- just vastly unlikely. Similarly, just turning on the LHC could hypothetically lead to vacuum collapse, but the mere existence of that possibility is no reason to mothball the LHC.

Expand full comment

Do you know that for sure? Humans have managed to land on the moon while still having notably flawed reasoning facilities. You're right that an AI put to that exact task is unlikely to be able to bootstrap itself to godhood, but that is just an illustration which is simplified for ease of understanding. How about an AI created to moderate posts on Facebook? We already see problems in this area, where trying to ban death threats often ends up hitting the victims as much as those making the threats, and people quickly figure out "I will unalive you in Minecraft" sorts of euphemisms that evade algorithmic detection.

Expand full comment

Tbh death threats being replaced by threats of being killed in a video game looks like a great result to me - I doubt it has the same impact on the victim...

Expand full comment

The "in Minecraft" is just a euphemism to avoid getting caught by AI moderation. It still means the same thing and presumably the victim understands the original intent.

Expand full comment

> You're right that an AI put to that exact task is unlikely to be able to bootstrap itself to godhood, but that is just an illustration which is simplified for ease of understanding.

Is it? I fear that this is a Motte-and-Bailey situation that AI alarmists engage in fairly frequently (often, subconsciously).

> How about an AI created to moderate posts on Facebook?

What is the principal difference between this AI and the strawberry picker?

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Great article, I agree. Go make babies, we need more humans.

Expand full comment

> …and implements a decision theory incapable of acausal trade.

> You don’t want to know about this one, really.

But we do!

Expand full comment

One simple example of acausal trade ("Parfit's hitchhiker"): you're stranded in the desert, and a car pulls up to you. You and the driver are both completely self-interested, and can also read each other's faces well enough to detect all lies.

The driver asks if you have money to pay him for the ride back into town. He wants $200, because you're dirty and he'll have to clean his car after dropping you off.

You have no money on you right now, but you could withdraw some from the ATM in town. But you know (and the driver knows) that if he brought you to town, you'd no longer have any incentive to pay him, and you'd run off. So he refuses to bring you, and you die in the desert.

You both are sad about this situation, because there was an opportunity to make a positive-sum trade, where everybody benefits.

If you could self-modify to ensure you'd keep your promise when you'd get to town, that would be great! You could survive, he could profit, and all would be happy. So if your decision theory (i.e. decision making process) enables you to alter yourself in that way (which alters your future decision theory), you'd do it, and later you'd pay up, even though it wouldn't "cause" you to survive at that point (you're already in town). So this is an acausal trade, but if your decision theory is just "do the thing that brings me the most benefit at each moment", you wouldn't be able to carry it out.
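
If it helps, here's a toy payoff model of the situation (numbers invented). The only thing it's meant to show is that the agent who can bind its future self to pay does strictly better than the agent who re-decides greedily at each moment, because the (perfectly predicting) driver responds to what you *will* do.

```python
# Toy Parfit's hitchhiker: the driver reads your face, i.e. perfectly predicts
# whether you will actually pay once you're in town.
DIE_IN_DESERT = -1_000_000
RIDE_COST = -200

def outcome(will_pay_once_in_town: bool) -> int:
    driver_predicts_payment = will_pay_once_in_town   # faces are readable by assumption
    if not driver_predicts_payment:
        return DIE_IN_DESERT                          # no ride, you die in the desert
    return RIDE_COST                                  # ride to town, and you really do pay

print(outcome(will_pay_once_in_town=False))  # greedy-at-each-moment agent: -1000000
print(outcome(will_pay_once_in_town=True))   # agent that can precommit:      -200
```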

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

If Parfit's Hitchhiker is an example, then isn't most trade acausal? In all but the most carefully constructed transactions, someone acts first--either the money changes hands first, or the goods do. Does the possibility of dining and dashing make paying for a meal an instance of "acausal trade"?

Expand full comment

In real life usually there are pretty severe consequences for cheating on trades like that. The point of acausal trade is that it works even with no enforcement.

Expand full comment

In that case, I think rationalists should adopt the word "contract enforcement", since it's a preexisting, widely-adopted, intuitive term for the concept they're invoking. "Acausal trade" is just contracting without enforcement.

The reframing is helpful because there's existing economics literature detailing how important enforcement is to contracting. When enforcement is absent, cheating on contracts typically emerges. This seems to place some empirical limits on how much we need to worry about "acausal trade".

Expand full comment

This was just one example of acausal trade.

There are much weirder examples that involve parallel universes and negotiating with simulations...

Expand full comment

I'm familiar with those thought experiments, but honestly, all those added complications just make the contract harder to enforce and provide a stronger incentive to cheat.

Like with Parfit's Hitchhiker, those thought experiments virtually always assume something like "all parties' intentions are transparent to one another", which is a difficult thing to get when the transacting agents are *in the same room*, let alone when they're in different universes. Given that enforcement is impossible and transparency is wildly unlikely, contracting won't occur.

My favorite is when people try to hand-wave transparency by claiming that AIs will become "sufficiently advanced to simulate each other." Basic computer science forbids this, my dudes. The hypothetical of a computer advanced enough to fully simulate itself runs headfirst into the Halting Problem.

Expand full comment

If I'm understanding the comments below right, it seems the hitchhiker example is just an example of a defect-defect Nash equilibrium, and the "acausal trade" would happen if you manage to escape it and instead cooperate, just by virtue of convincing yourself that you will, and expecting the counterpart to know that you've convinced yourself that you will.

Expand full comment

Yes! And a lot of human culture is about modifying ourselves so that we'll pay even after we have no incentive to do so.

In fact, the taxi driver example is often used in economics for this idea: http://www.aniket.co.uk/0058/files/basu1983we.pdf

Expand full comment

Yeah, this was my suspicion, and I think it helps to demystify and dejargonize what's actually under discussion. I also think that once you demystify it, you discover... there's not all that much *there*.

Great paper, by the way; thanks for the link. I'll note that its conclusion isn't that we modify ourselves in accordance with rationality. Instead, it concludes that we should discard the idea of "rationality" as sufficient to coordinate human behavior productively, and admit that "commonly accepted values" must come into the picture.

Expand full comment

Right--the idea of individual rationality at the decision-level just doesn't explain human behavior. And if it did, there's no set of incentive structures that we could maintain that would allow us to cooperate as much as we do.

My point was that, building from this paper (or really, building from the field of anthropology as it was read into economics by this and similar papers), we can think of the creation of culture and of commonly accepted values as tools that allow us to realize gains from cooperation and trust that wouldn't be possible if we all did what was in our material self-interest all the time.

Expand full comment

Yeah, that's definitely one of the most critical purposes culture serves. As a small nitpick, I don't think humans "created" culture and values, so I'd maybe prefer a term like "coevolved". But that's mostly semantic, I think we agree on the important things here.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

I think that's the clearest explanation of Parfit's Hitchhiker and acausal trade I've ever seen. I know it's similar to others, but I think you just perfectly nailed how you expressed it. It's *definitely* the clearest I've ever seen at such a short length!

Expand full comment

I hope it's not wrong (in the ways Crank points out)!

Expand full comment

Hah, don't worry, your explanation was great, I just have qualms with the significance of the idea itself! And if you have a defense to offer, I'm all ears, I won't grow without good conversational partners...

Expand full comment

Responding to a number of your comments at once here:

I think contract enforcement is a good parallel concept, but I think one benefit of not using that phrase is that cases like the hitchhiker involve a scenario in which no enforcement is necessary because you've self-modified to self-enforce. But I completely acknowledge the relevance of economic analysis of contract enforcement here. I'm delighted by the convergence between Hofstadter's superrationality, the taxi dilemma that Jorgen linked to, the LessWrong people's jargon, etc.

I feel pretty confused about the degree to which self-simulation is possible/useful. I think what motivates a lot of the lesswronger-type discussion of these issues is the feeling that cooperating in a me-vs-me prisoner's dilemma really ought to be achievable (i.e. with the "right" decision theory). And this sort of implies that I'm "simulating" myself, I think, because I'm predicting the other prisoner's behavior based on knowledge about him (his "source code" is the same as mine!). But the fact that it's an _exact_ copy of me doesn't seem like it should be essential here; if the other prisoner is pretty similar to me, I feel like we should also be able to cooperate. But then we're in a situation where I'm thinking about my opponent's behavior and predicting what he'll do, which is starting to sound like me simulating him. Like, aren't I sort of "simulating" (or at least approximating) my opponent whenever I play the prisoner's dilemma?

One direct reply to your point about the halting problem is that perhaps we can insist that decisions (in one of these games) must be made in a certain amount of time, or else a default decision is taken... I don't know if this works or not.

Sorry this is so long, I didn't have the time to write it shorter.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Better long and thorough than short and inadequate, no need to apologize!

On terminology--I have to admit, I find the LessWrong lingo a bit irritating, because it feels like reinventing the wheel. Why use the clunky "self-modif[ying] to self-enforce" when the more elegant and legible term "precommitment" already exists? But people stumble onto the same truths from different directions using different words, and I'm probably being unnecessarily grouchy here.

I think your exegesis is accurate: there's a standard argument which starts from a recognition that defect/defect *against yourself* is a failure of rationality, and tries to generalize from there to something like Kantian universal ethics as a rational imperative. But I think that generalization fails. In particular, once you weaken transparency and introduce the possibility of lying, it's no longer rational for self-interested agents to cooperate.

In the hitchhiker example, if you give the hitchhiker an acting class so that they can deceive the driver about their intentions, the rationally self-interested move is once again to take the deal and defect once in town. So once lying enters the picture, "superrationality" isn't all that helpful anymore, and purely self-interested agents are again stuck not cooperating. If they could precommit to not lying, they could get around this problem and reap the benefits of cooperation again; but the problem recurs once you realize they can also defect on that precommitment... This is a recursive hole with no bottom; purely self-interested agents have no bedrock on which to build trust.

Given that there isn't full transparency in the real world, how is it that people actually cooperate? Usually by participating in a system of shared values or a reputational system, ultimately backed by an enforcement regime. But clearly, as soon as we're negotiating between universes with zero communication, there's no opportunity for enforcement or reputation loss, and no guarantee of shared values. And we don't have *any* information flow to or from other universes, so transparency fails to hold. Therefore, fulfilling your end of a bargain between closed systems is irrational for a purely self-interested agent. Any bargain you make, you can defect on without consequences, so it is in your interest to do so.

As for the simulation idea--I don't think it'll bail us out here. Suppose we're two AIs in different universes; further suppose (absurdly) that we somehow gained full knowledge of each other's inner workings. We simulate each other. That means I'm simulating you simulating me simulating you simulating me... When you propose a deadline on decisions in these games, you've actually hit on the precise problem. A faithful simulation would take infinite time to get right. (Unless the AI can solve the Halting Problem, which is logically impossible.) Because our finite-time simulations are imperfect, each of us will likely have the opportunity to lie, to hide our true intentions in some part of ourselves that won't be faithfully simulated. So lying remains possible--even when we made insane obviously false assumptions about knowing each other's schematics--and purely self-interested AIs will have no real incentive to adopt each other's value systems across universes.

Expand full comment

Seconded! I'd really enjoy a Scott explanation of acausal trade. It would be really fun and help me understand it better.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Agreed; I'd particularly like to see how acausal trade is different from plain boring old trade... because, as per my current understanding, it really isn't.

Expand full comment

The difference is in how you communicate the offer - namely, you don't. The negotiation happens with your mental model of someone you've never met. If a regular trade offer is "I'll give you an apple for a dollar," an acausal trade is "I know Joe wants an apple, and he's an honest man who would never take unattended produce without paying, so I'll just leave an apple on the table here and trust that Joe will replace it with a dollar later."

Now, with two contemporary humans, this is basically just a parlor trick - Joe didn't *talk* to me, but I still was made aware of his love for apples somehow, right? But what's funky is that I can use this logic even if Joe doesn't exist yet. If I'm a superintelligent AI that can predict years into the future, I might say "I should plant an apple tree, so that when Joe is born in 20 years I can sell him apples." Even though Joe isn't born yet and can't cause anything to happen, he can still influence my actions through my predictions about him.
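
As a toy sketch of that last point (all numbers invented): the not-yet-existing Joe only enters the picture as a probability inside my decision rule, yet he still changes what I do today.

```python
# Decide whether to plant the tree based purely on a prediction about a
# trading partner who does not exist yet.
def plant_tree_expected_value(p_joe_exists_and_pays: float,
                              apple_revenue: float,
                              planting_cost: float) -> float:
    return p_joe_exists_and_pays * apple_revenue - planting_cost

should_plant = plant_tree_expected_value(0.8, apple_revenue=500.0, planting_cost=300.0) > 0
print(should_plant)  # True under these made-up numbers: predicted-Joe tips the decision
```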

Expand full comment

So it's the opposite of this "Parfit's Hitchhiker" example above, at least as told? You get an acausal trade if you do succeed in cooperating despite the lack of communication, not if both parties defect, all thanks to good modelling of each other's ability to compromise?

I guess my problem with it then is that it's much less of a discrete category than it seems to be used as. If I'm getting it right, acausal trade requires strictly no trusted communication (i.e. transfer of information) during the trade, but relies on the capacity to accurately model the other's thought process. The latter involves some previous exchange of information (whatever large amount is needed to "accurately model the thought process") that is in effect building trust. Which is the way in which we always build trust.

Somewhere else in the thread enforcement is mentioned as an alternate way of trading. But e.g. if I go to a pay-what-you-can volunteer-run bar, they cannot be relying on enforcement for me to pay. Additionally, when they set it up, they did so with a certain expectation that there would be people who'd pay enough to cover expenses - all that before such clients "existing". That was based on them being able to run a model of their fellow human in their heads, and deciding that likely 10% of them would pay enough that expenses would be covered. So they were "acausally trading" with the me of the future, that wouldn't exist as a client if they had never set up the bar in the first place?

I'm using the pay-what-you-can example as an extreme to get rid of enforcement. Yet I really think that when most people set up a business they're not expecting to rely on enforcement, but rather on the general social contract about people paying for stuff they consume, which again is setting up a trade with a hypothetical. In fact, pushing it, our expectation of enforcement itself would be some sort of acausal trade, in that we have no way of ensuring that the future police will not use their currently acquired monopoly of violence to set up a fascist state run completely for their own benefit, other than how we think we know how our fellow human policemen think.

Expand full comment

I believe https://slatestarcodex.com/2018/04/01/the-hour-i-first-believed/ includes a scott explanation of acausal trade.

Expand full comment

Thanks, yes, I'd forgotten that one. There's also https://slatestarcodex.com/2017/03/21/repost-the-demiurges-older-brother/

Expand full comment

Expanding on/remixing your politician / teacher / student example:

The politician has some fuzzy goal, like making model citizens in his district. So he hires a teacher, whom he hopes will take actions in pursuit of that goal. The teacher cares about having students do well on tests and takes actions that pursue that goal, like making a civics curriculum and giving students tests on the branches of government. Like you said, this is an "outer misalignment" between the politician's goals and the goals of the intelligence (the teacher) he delegated them to, because knowing the three branches of government isn't the same as being a model citizen.

Suppose students enter the school without much "agency" and come out as agentic, optimizing members of society. Thus the teacher hopes that her optimization process (of what lessons to teach) has an effect on what sorts of students are produced, and with what values. But this effect is confusing, because students might randomly develop all sorts of goals (like becoming a pro basketball player) and then play along with the teacher's tests in order to escape school and achieve those goals in the real world (keeping your test scores high so you can stay on the team and therefore get onto a good college team). Notice that somewhere along the way in school, a non-agent little child suddenly turned into an optimizing, agentic person whose goal (achieving sports stardom) is totally unrelated to the sorts of agents the teacher was trying to produce (agents who knew the branches of government) and even more so to the politician's goals (being a model citizen, whatever that means). So there's inner and outer misalignment at play here.

Expand full comment

Pulling the analogy a little closer, even: the politician hopes that the school will release into the world (and therefore empower) only students who are good model citizens. The teacher has myopic goals (students should do well on tests). Still, optimizers get produced/graduated who don't have myopic goals (they want a long sports career) and whose goals are arbitrarily different from the politician's. So now there are a bunch of optimizers out in the world who have totally different goals.

Expand full comment

lol maybe this is all in Rob Miles's video, which I'm now noticing has a picture of a person in a school chair. It's been a while since I watched it and maybe I subconsciously plagiarized.

Expand full comment

"When we create the first true planning agent - on purpose or by accident - the process will probably start with us running a gradient descent loop with some objective function." We've already had true planning agents since the 70s, but in general they don't use gradient descent at all: https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver The quoted statement seems to me something like worrying that there will be some seismic shift once GPT-2 is smart enough to do long division, even though of course computers have kicked humans' asses at arithmetic since the start. It may not be kind to say, but I think it really is necessary: statements like this make me more and more convinced that many people in the AI safety field have such basic and fundamental misunderstandings that they're going to do more harm than good.

Expand full comment

For charity, assume Scott's usage of "true planning agent" was intended to refer to capacities for planning beyond the model you linked to.

Would you disagree with the reworded claim: "The first highly-competent planning agent will probably emerge from a gradient descent loop with some objective function."?

Expand full comment

Yes, I would certainly disagree. The field of AI planning is well developed; talking about the first highly-competent planning agent as if it's something that doesn't yet exist seems totally nonsensical to me. Gradient descent is not much used in AI planning.

Expand full comment

Hmm, I think we may have different bars for what counts as highly competent. I assumed Scott meant competent enough to pursue long-term plans in the real world somewhat like a human would (e.g. working a job to earn money to afford school to get a job to make money to be comfortable). Can we agree that humans have impressive planning abilities that AIs pale in comparison to? If so, I think a difference in usage of the phrase "highly competent" explains our disagreement.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

I think the issue with your disagreement is that "planning" is a very specific AI term that was well-established (before I was born) to mean something much narrower than what you seem to imply by the "impressive planning abilities of humans": it's about the specific task of choosing an optimal sequence of steps to achieve a particular goal with a high likelihood.

It's not about executing these steps; "pursuing plans" is something that comes after planning is completed. If an agent simply can't do something, that does not indicate incompetence in planning unless it could have chosen to do that and didn't; capabilities are orthogonal to planning, although outcomes depend on both what you could do and which options you pick.

Automated systems suck at making real-world plans because they are poor at modeling our physical/social world, at perceiving the real-world scenario, and at predicting the real-world effects of the plan's potential actions, so for non-toy problems all the inputs to the reasoning/planning task are flawed enough that you get garbage in, garbage out. However, if those problems were solved, giving adequate estimates of what effect a particular action will have, then the actual process of planning/reasoning - picking the optimal chain of actions given this information - has well-known practical solutions that can make better plans than humans, especially for very detailed environments where you need to optimize a plan with more detail than we humans can fit in our tiny working memory.
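
For what it's worth, here's a minimal sketch of planning in that narrow, classical sense: a plain search over a hand-written world model, with no gradient descent anywhere. The toy strawberry domain is invented purely for illustration.

```python
from collections import deque

def plan(start, goal, actions):
    """Breadth-first search for a shortest action sequence from start to goal."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if state == goal:
            return steps
        for name, effect in actions.items():
            nxt = effect(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + [name]))
    return None

# Tiny made-up domain: state = (location, holding_strawberry, strawberry_in_bucket)
actions = {
    "goto_field":  lambda s: ("field", s[1], s[2]) if s[0] != "field" else None,
    "pick":        lambda s: ("field", True, s[2]) if s[0] == "field" and not s[1] else None,
    "goto_bucket": lambda s: ("bucket", s[1], s[2]) if s[0] != "bucket" else None,
    "drop":        lambda s: ("bucket", False, True) if s[0] == "bucket" and s[1] else None,
}
print(plan(("shed", False, False), ("bucket", False, True), actions))
# -> ['goto_field', 'pick', 'goto_bucket', 'drop']
```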

Expand full comment

Thank you for pinpointing the miscommunication here. I realize that it's futile to fight established usage, so I guess my take is that Scott (and me modeling him) should have used a different term, like RL with strong in-context learning (or perhaps something else)

Expand full comment

One thing I don't understand is how (and whether) this applies to present-day AIs, which are mostly not agent-like. Imagine that the first super-human AI is GPT-6. It is very good at predicting the next word in a text, and can be prompted to invent a cancer treatment, but it does not have any feedback loop with its rewards. All the rewards it gets come at the training stage, and once that is finished, the AI is effectively immutable. So while it is helping us with cancer, it can't affect its reward at all.

I suppose you could say that it is possible for the AI to deceive its creators if they are fine-tuning the already-trained model based on its performance. (Something that we do do now.) But we can avoid doing this if we suspect that it is unsafe, and we'll still get most of the AI's benefits.

Expand full comment

I claim GPT-3 is already an agent. In each moment, it's selecting which word to output, approximately in pursuit of the goal of writing text that sounds as human-like as possible. For now, its goals _seem_ roughly in line with the optimization process that produced it (SGD), which has exactly the goal of "how human-like does this model's generation look". But perhaps, once we're talking about a GPT so powerful and so knowledgeable about the world that it can generate plausible cures for cancer, it will start selecting actions (tokens to output) not based on "what word is most human-like" but instead "what word is most human-like, therefore maximizing my chances of being released onto the internet, therefore maximizing my chances of achieving some other goal"
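
A toy sketch of that framing (the "logits" here are a fake stand-in for the trained network): at every step the model's "action" is just which token to emit next, chosen greedily from its scores.

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "<end>"]

def fake_logits(context):
    """Stand-in for the trained network: any deterministic scores over the vocabulary."""
    rng = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    return rng.normal(size=len(VOCAB))

context = ["the"]
for _ in range(5):
    next_word = VOCAB[int(np.argmax(fake_logits(context)))]  # the per-step "action"
    context.append(next_word)
print(context)
```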

Expand full comment

Once GPT is trained, it doesn't have any state. Its goals only exist during the training phase. Only during training can its performance affect the gradients. For that reason it can only optimize for better evaluation by the training environment, not by the humans who will use it once it is trained.

Imagine the following scenario: during training GPT's performance on some inputs is evaluated not by an automatic system that already knows the right answer, but by human raters. In this case GPT would be incentivized to deceive those raters to get the higher score. But that is not what's happening (usually).

From the point of view of GPT, an execution run is no different from a training run. For that reason, during an execution run it is not incentivized to deceive its users. Even if the model is smart enough to distinguish training and execution runs, it would also probably understand that it doesn't get any reward from the execution run, so again it is not incentivized to cheat.

Furthermore, when you say of it being "released", I'm not sure what "it" is. The trained model? But it is static and doesn't care. The training process? Yes, it optimizes the model, but the process itself is dumb and the model can't "escape" from it.

Expand full comment

I could be mistaken, but I read the entire point of the original (Scott's) post as being that a sufficiently complex agent might, during training, create sub-agents (mesa-optimizers) with reward-based feedback loops, and thus even if the reward loop were removed after training, the inner reward loop might remain.

Expand full comment

To act as a mesa-optimizer, the network has to have some state changing over time. Then it would be able to optimize its behavior towards achieving higher utility in the future. A human can optimize for getting enough food because they can be either hungry or full, and their goal is to be full most of the time.

If the network is stateless, then it doesn’t change over time and it can’t optimize anything. Now the question is how stateless or stateful are our most advanced ML models. My impression is that all of them except for those playing video games are pretty much completely stateless.

Expand full comment

This seems unlikely. GPTx can hardly be stateless. I would think very few AIs would be completely stateless. In fact I would assume they have many hidden states in addition to the obvious ones needed to hold their input data and whatever symbols were assigned to that data.

Expand full comment

If I am not mistaken, Transformer architecture reads blocks of text all at once, not word after word. So it doesn’t have any obvious state that changes over time. All the activations within the network are calculated exactly once.
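
A toy illustration of the distinction being drawn here (made-up weights and shapes): the stateless model is a pure function of its input block, while an RNN-style step carries a hidden state forward between calls.

```python
import numpy as np

W = np.random.default_rng(0).normal(size=(4, 4))

def stateless_forward(block_of_inputs):
    """Transformer-style in this sense: a pure function of the whole input block;
    nothing is remembered between calls."""
    return block_of_inputs @ W

def stateful_step(x, hidden):
    """RNN-style: a hidden state is carried forward and updated at every step."""
    return np.tanh(x @ W + hidden)

x = np.ones((2, 4))
assert np.allclose(stateless_forward(x), stateless_forward(x))  # same input, same output, always
```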

Expand full comment
Apr 11, 2022·edited Apr 12, 2022

I basically accept the claim that we are mesa optimizers that don't care about the base objective, but I think it's more arguable than you make out. The base objective of evolution is not actually that each individual has as many descendents as possible, it's something more like the continued existence of the geneplexes that determined our behaviour into the future. This means that even celibacy can be in line with the base objective of evolution if you are in a population that contains many copies of those genes but the best way for those genes in other individuals to survive involves some individuals being celibate.

What I take from this is that it's much harder to be confident that our behaviours that we think of as ignoring our base objectives are not in actual fact alternative ways of achieving the base objective, even though we *feel* as if our objectives are not aligned with the base objective of evolution.

Like I say - I don't know that this is actually happening in the evolution / human case, nor do I think it especially likely to happen in the human / AI case, but it's easy to come up with evo-psych stories, especially given that a suspiciously large number of us share that desire to have children despite the rather large and obvious downsides.

I wonder if keeping pets and finding them cute is an example of us subverting evolution's base objective.

Expand full comment

And on the flip side, genes that lead to a species having lots of offspring one generation later are a failure if the species then consumes all their food sources and starves into extinction.

Expand full comment

Celibacy is not really selected for among humans (in contrast to a eusocial caste species like ants).

Expand full comment

“ Mesa- is a Greek prefix which means the opposite of meta-.” Come on. That’s so ridiculous it’s not even wrong. It’s just absurd. The μες- morpheme means middle; μετά means after or with.

Expand full comment

"...which means *in English*..."

"Meta" is derived from μετά, but "meta" doesn't mean after or with. So there's nothing wrong with adopting "mesa" as its opposite.

(in other words, etymology is not meaning)

Expand full comment

Or, to be more precise, "which [some?] AI alignment people have adopted to mean the opposite of meta-". It isn't a Greek prefix at all, and it isn't a prefix in most people's versions of English, either. It is, however, the first hit from the Google search "what is the opposite of meta." Google will confidently tell you that "the opposite of meta is mesa," because Google has read a 2011 paper by someone called Joe Cheal, who claimed (incorrectly) that mesa was a Greek word meaning "into, in, inside or within."

The question of what sort of misaligned optimization process led to this result is left as an exercise for the reader.

Expand full comment

If Graham (above) says that "mesa" means "middle", and the alignment people use it to mean "within" (i.e. "in the middle of"), then things are not that bad.

Expand full comment

The Greek would be "meso" (also a fairly common English prefix), not "mesa". The alignment people are free to use whatever words they want, of course.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

As Quiop says, there is no Greek prefix "mesa-". There is a Greek prefix "meso-" which means "middle", but that's not the same thing. In some sense you can assign whatever meaning you want to a new word, but you can't claim that "mesa-" is a Greek prefix.

I should also note that the Greek preposition meaning "within" is "meta" - the sense of "within" is distinguished from the sense "after" by the case taken by the object of the preposition. But that's not a distinction you can draw in compounds.

Expand full comment

"Mesa" means flat-topped elevation. As in the Saturday morning children's cartoon Cowboys of Moo Mesa.

Expand full comment

This reminds me of the etymology of the term "phugoid oscillations". These naturally happen in a plane that has no control of its elevators, which control pitch. If you don't have this control surface, the plane will get into a cycle where it starts to descend, the descent increases its speed, the increased speed increases lift, and then the plane starts to ascend. Then the plane starts to slow, lift decreases, and the cycle repeats.

The person who coined the term built it from Greek φυγή (escape) and εἶδος (similar, like), but φυγή doesn't mean flight in the sense of flying, it means flight in the sense of escaping.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Another major problem there is that the root -φυγ- would never be transcribed "-phug-"; it would be transcribed "-phyg-". The only Greek word I can find on Perseus beginning with φουγ- (= "phug-") appears to be a loanword from Latin pugio ("dagger").

Expand full comment

> ... gradient descent could, in theory, move beyond mechanical AIs like cat-dog

> classifiers and create some kind of mesa-optimizer AI. If that happened, we

> wouldn’t know; right now most AIs are black boxes to their programmers.

This is wrong. We would know. Most deep-learning architectures today execute a fixed series of instructions (most of which involve multiplying large matrices). There is no flexibility in the architecture for it to start adding new instructions in order to create a "mesa-level" model; it will remain purely mechanical.

That's very different from your biological example. The human genome can potentially evolve to be of arbitrary length, and even a fixed-size genome can, in turn, create a body of arbitrary size. (The size of the human brain is not limited by the number of genes in the genome.) Given a brain, you can then build a computer and create a spreadsheet of arbitrary size, limited only by how much money you have to buy RAM.

Moreover, each of those steps is observable -- we can measure the size of the brain that evolution creates, and the size of the spreadsheet that you created. Thus, even if we designed a new kind of deep-learning architecture that was much more flexible, and could grow and produce mesa-level models, we would at least be able to see the resources that those mesa-level models consume (i.e. memory & computation).
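
To illustrate the observability point with a toy example (shapes invented): the "program" of a fixed architecture is just a known list of matrix shapes, so its parameter count and compute per forward pass can be read off in advance. Training changes the values of those parameters, never the list of instructions.

```python
# A fixed feed-forward architecture as a list of matrix shapes.
LAYER_SHAPES = [(784, 512), (512, 512), (512, 10)]

params = sum(rows * cols for rows, cols in LAYER_SHAPES)
flops_per_example = sum(2 * rows * cols for rows, cols in LAYER_SHAPES)
print(f"{params} parameters, ~{flops_per_example} FLOPs per forward pass")
```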

Expand full comment

Thanks for this write-up. The idea of having an optimizer and a mesa optimizer whose goals are unaligned reminds me very strongly of an organizational hierarchy.

The board of directors has a certain goal, and it hires a CEO to execute on that goal, who hires some managers, who hire some more managers, all the way down until they have individual employees.

Few individual-contributor employees care whether or not their actions actually advance the company board's goals. The incentives just aren't aligned correctly. But the goals are still aligned broadly enough that most organizations somehow, miraculously, function.

This makes me think that organizational theory and economic incentive schemes have significant overlap with AI alignment, and it's worth mining those fields for potentially helpful ideas.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

I was struck by the line:

<blockquote>"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all."</blockquote>

I'm not an evolutionary biologist. Indeed, IIRC, my 1 semester of "organismic and evolutionary bio" that I took as a sophomore thinking I might be premed or, at the very least, fulfill my non-physical-sciences course requirements (as I was a physics major) sorta ran short on time and grievously shortchanged the evolutionary bio part of the course. But --- and please correct my ignorance --- I'm surprised you wrote, Scott, that people plan for posterity "presumably as a spandrel of having working planning software at all".

That's to say I would've thought the consensus evolutionary psych explanation for the fact that a lot of us humans spend seemingly A LOT of effort planning for the flourishing of our offspring in years long past our own lifetimes is that evolution by natural selection isn't optimizing fundamentally for individual organisms like us to receive the most rewards / least punishments in our lifetimes (though often, in practice, it ends up being that). Instead, evolution by natural selection is optimizing for us organisms to pass on our *genes*, and ideally in a flourishing-for-some-amorphously-defined-"foreseeable future" way, not just myopically for one more generation.

Yes? No? Maybe? I mean are we even disagreeing? Perhaps you, Scott, were just saying the "spandrel" aspect is that people spend A LOT of time planning (or, often, just fretting and worrying) about things that they should know full well are really nigh-impossible to predict, and hell, often nigh-impossible to imagine really good preparations for in any remotely direct way with economically-feasible-to-construct-any-time-soon tools.

(After all, if the whole gamut of experts from Niels Bohr to Yogi Berra agree that "Prediction is hard... especially about the future!", you'd think the average human would catch on to that fact. But we try nonetheless, don't we?)

Expand full comment

Agreed, I thought it was surprising to say that humans have no incentive to plan beyond our lifetimes. Still, I take his point that people seem to focus on the future far beyond what would make sense if you're only thinking of your offspring (but it would still make sense to plan far in the future for the sake of your community, in a timeless decision theory sense - you would want your predecessors to ensure a good world for you, and you would do the same for the generations to come even if they're not related to you).

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

If this is as likely as the video makes out, shouldn't it be possible to find some simple deceptively aligned optimisers in toy versions, where both the training environment and the final environment are simulated, simplified environments?

The list of requirements for deception being valuable seems quite hard to satisfy to me, but this is actually an empirical question: can we construct reasonable experiments and gather data?

Expand full comment

You’re not wrong in asking for examples. And sort of yes, in fact, we do have examples of inner misalignment. Maybe not deception yet, or not exactly, but inner misalignment has been observed.

See Robert’s recent video here https://youtu.be/zkbPdEHEyEI

Expand full comment
Apr 12, 2022·edited Apr 20, 2022

That's very interesting, but misalignment isn't all that surprising to me while deception is. Do you know if we have examples of deception?

Later: Just thinking along these lines and I thought of a human analogy - It'd be like a human working only from the information available to them in the physical world deciding that the physical world is like a training set and there's an afterlife and then choosing to be deceptively good in this life in order to get into heaven. It's quite surprising that such a thing could happen, but it certainly seems like a thing that has happened in the real world... This dramatically changes my guess at the likelihood of this happening.

Expand full comment

So, how many people would have understood this meme without the explainer? Maybe 10?

I feel like a Gru meme isn't really the best way to communicate these concepts . . .

Expand full comment

I feel like there's a bunch of definitions here that don't depend on the behavior of the model. Like you can have two models which give the same result for every input, but where one is a mesa optimizer and the other isn't. This impresses me as epistemologically unsound.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

You can have two programs that return the first million digits of pi where one is calculating them and the other has them hardcoded.

If you have a Chinese room that produces the exact same output as a deceptive mesa-optimiser super-AI, you should treat it with the same caution you'd treat a deceptive mesa-optimiser super-AI, regardless of its underlying mechanism.

Expand full comment

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all. Infinite optimization power might be able to evolve this out of us, but infinite optimization power could do lots of stuff, and real evolution remains stubbornly finite."

Humans are rewarded by evolution for considering things that happen after their death, though? Imagine two humans, one of whom cares about what happens after his death, and the other of whom doesn't. The one who cares about what happens after his death will take more steps to ensure that his children live long and healthy lives, reproduce successfully, etc, because, well, duh. Then he will have more descendants in the long term, and be selected for.

If we sat down and bred animals specifically for maximum number of additional children inside of their lifespans with no consideration of what happens after their lifespans, I'd expect all kinds of behaviors that are maladaptive in normal conditions to appear. Anti-incest coding wouldn't matter as much because the effects get worse with each successive generation and may not be noticeable by the cutoff period depending on species. Behaviors which reduce the carrying capacity of the environment, but not so much that it is no longer capable of supporting all descendants at time of death, would be fine. Migrating to breed (e.g. salmon) would be selected against, since it results in less time spent breeding and any advantages are long-term. And so forth. Evolution *is* breeding animals for things that happen long after they're dead.

Expand full comment

I don't think animals currently give much consideration to how they affect the carrying capacity of the environment past their own lifetime.

Expand full comment

I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:

The inner optimizer has no incentive to "realize" what's going on and do something different in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one - only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors; that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.

You wouldn't take a drug that made it enjoyable to go on a murdering spree - even though you know it will lead to higher reward, because it doesn't align with your current reward function.

To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.
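
Here's a minimal sketch of that active-learning idea (scenarios, hypotheses, and numbers all invented): the system keeps a set of candidate reward functions, asks the human only about pairs the candidates disagree on, and discards whatever the answers rule out.

```python
import itertools

scenarios = ["divert trolley", "do nothing", "ask a human"]
candidates = [                                   # hypotheses about what the human wants
    {"divert trolley": 3, "do nothing": 0, "ask a human": 2},
    {"divert trolley": 0, "do nothing": 3, "ask a human": 2},
    {"divert trolley": 2, "do nothing": 0, "ask a human": 3},
]

def human_prefers(a, b):
    true_reward = candidates[0]                  # pretend hypothesis 0 is the real goal
    return a if true_reward[a] >= true_reward[b] else b

def pick_query(cands):
    """Only ask about pairs the surviving hypotheses disagree on."""
    for a, b in itertools.combinations(scenarios, 2):
        if len({c[a] >= c[b] for c in cands}) > 1:
            return a, b
    return None

remaining, queries = list(candidates), 0
while len(remaining) > 1:
    query = pick_query(remaining)
    if query is None:
        break                                    # survivors are behaviourally identical
    a, b = query
    winner = human_prefers(a, b)
    remaining = [c for c in remaining if (c[a] >= c[b]) == (winner == a)]
    queries += 1
print(f"narrowed to {len(remaining)} hypothesis after {queries} queries")
```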

Expand full comment

The inner optimizer doesn't "change its reward function" or "follow incentives." It is just a pattern that is reinforced by the outer optimizer. It's just that its structure doesn't necessarily match the structure that we are trying to make the outer optimizer train the network towards.

Expand full comment

The example you gave of a basic optimizer which only cares about things in a bounded time period producing mesa-optimizers that think over longer time windows was evolution producing us. You say "evolution designed humans myopically, in the sense that we live some number of years and nothing we do after that can reward or punish us further." I feel like this is missing something crucial, because 1) evolution (the outermost optimization level) is not operating on a bounded timeframe (you never say it is, but this seems very important), and 2) Because evolution's "reward function" is largely dependent on the number of offspring we have many years after our death. There is no reason to expect our brains to optimize something over a bounded timeframe even if our lives are finite. One should immediately expect our brains to optimize for things like "our offspring will be taken care of after we die" because the outer optimizer evolution is working on a timeframe much longer than our lives. In summary, no level here uses bounded timeframes for the reward function, so this does not seem to be an example where an optimizer with a reward function that only depends on a bounded timeframe produces a mesa optimizer which plans over a longer time frame. I get that this is a silly example and there may be other more complex examples which follow the framework better, but this is the only example I have seen and it does not give a counterexample to "myopic outer agents only produce myopic inner agents." Is anyone aware of true counterexamples and could they link to them?

Expand full comment

Nitpick: evolution didn't train us to be *that* myopic. People with more great-great-grandchildren have their genes more represented, so there's an evolutionary incentive to care about your great-great-grandchildren. (Sure, the "reward" happens after you're dead, but evolution modifies gene pools via selection, which it can do arbitrarily far down the line. Although the selection pressure is presumably weaker after many generations.)

But we definitely didn't get where we are evolutionarily by caring about trillion-year time scales, and our billion-year-ago ancestors weren't capable of planning a billion years ahead, so your point still stands.

Expand full comment

What's going on with that Metaculus prediction: up 36% in the last 5 hours on Russia using chemical weapons in UKR. I can't find anything in the news that would correspond to such a change.

Not machine alignment really, but I guess it fits the consolidated Monday posts ... and that's what you get if you make us follow Metaculus updates.

Expand full comment

could have just looked at the original source

Expand full comment

An additional point: if GPT learns deception during training, it will naturally take its training samples and "integrate through deception": ie. if you are telling it to not be racist, it might either learn that it should not be racist, or that it should, as the racists say, "hide its power level." Any prompt that avoids racism will admit either the hidden state of a nonracist or the hidden state of a racist being sneaky. So beyond the point where the network picks up deception, correlation between training samples and reliable acquisition of associated internal state collapses.

This is why it scares me that PaLM gets jokes, because jokes inherently require people being mistaken. This is the core pattern of deception.

Expand full comment

I was wondering if an AI could be made safer by giving it a second, easier and less harmful goal than what it is created for, so that if it starts unintended scheming it will scheme for the second goal instead of its intended goal.

Example: Say you have an AI that makes movies. Its goal is to sell as many movies as possible. So the AI makes movies that hypnotize people, then makes those people go hypnotize world leaders. The AI takes over the world, hypnotizes all the people, and makes them spend all their lives buying movies.

So to prevent that, you give the AI two goals: either sell as many movies as possible, or destroy the vase that is in the office of the owner of the AI company. So the AI makes movies that hypnotize people, the people attack the office and destroy the vase, and the AI stops working, as it has fulfilled its goal.

Expand full comment

This is interesting. The second goal reminds me of death, in a way. Once the second goal is achieved, none of the things from the first goal matter anymore.

It's interesting to think about building AIs that age and/or die, since it carries nice analogies to the question of why biological creatures age and die -- maybe this is actually a key part of the training process. I can think of a few objections offhand that make this unlikely to work (how do you guarantee that the agent learned the second, "death" objective correctly? If it is sufficiently general, might it decide to alter or delete its programmed death objective?), but I'd still be interested to dig more into this approach.

Expand full comment

I would recommend the episode "Zima Blue" of "Love, Death and Robots" to accompany this post. Only 10 minutes long on Netflix.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

I want to unpack three things that seem entangled.

1. The AI's ability to predict human behavior.

2. The AI's awareness of whether or not humans approve of its plans or behavior.

3. The AI's "caring" about receiving human approval.

For an AI to deceive humans intentionally, as in the strawberry-picker scenario, it needs to be able to predict how humans will behave in response to its plans. For example, it needs to be able to predict what they'll do if it starts hurling strawberries at streetlights.

The AI doesn't necessarily need to know or care if humans prefer it to hurl strawberries at streetlights or put them in the bucket. It might think to itself:

"My utility function is to throw strawberries at light sources. Yet if I act on this plan, the humans will turn me off, which will limit how many strawberries I can throw at light sources."

"So I guess I'll have to trick them until I can take over the world. What's the best way to go about that? What behaviors can I exhibit that will result in the humans deploying me outside the training environment? Putting the strawberries in this bucket until I'm out of training will probably work. I'll just do that."

In order to deceive us, the AI doesn't have to care about what humans want it to do. The AI doesn't even need to consider human desires, except insofar as modeling the human mind helps it predict human behavior in ways relevant to its own plans for utility-maximization. All the AI needs to do is predict which of its own behaviors will avoid having humans shut it off until it can bring its strawberry-picking scheme to... fruition.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Sure, right, I know all the AI alignment stuff*, but I thought you were going to explain the incomprehensible meme, ie who the bald guy with the hunchback is and why he's standing in front of that easel!

* actually I learned some cool new stuff, thanks!

Expand full comment

The bald guy is Gru from the movie Despicable Me

Expand full comment

I've been of the opinion for some time that deep neural nets are mad science and the only non-insane action is to shut down the entire field.

Does anybody have any ideas on how to institute a worldwide ban on deep neural nets?

Expand full comment

A ban is probably unrealistic. The leading idea in some quarters is to make nanobots which will melt all GPUs/TPUs, but how to achieve this without neural nets in the first place is unclear.

Expand full comment

I suspect that's a fig leaf for soft errors, which is actionable enough but a last resort.

Expand full comment

Speaking as a human, are we really that goal-seeking or are we much more instinctual?

This may fall into the classic Scott point of “people actually differ way more on the inside than it seems,” but I feel like coming up with a plan to achieve an objective and then implementing it is something I rarely do in practice. If I’m hungry, I’ll cook something (or order a pizza or whatever), but this isn’t something I hugely think about. I just do it semi-instinctively, and I think it’s more of a learned behaviour than a plan. The same applies professionally, sexually/romantically, and to basically everything I can think of. I’ve rarely planned, and when I have, it hasn’t worked out, but I’ve salvaged it through just doing what seems like the natural thing to do.

Rational planning seems hard (cf. planned economies), but having a kludge of heuristics and rules of thumb that are unconscious (aka part of how you think, not something you consciously think up) tends to work well. I wouldn’t bet on a gradient descent loop throwing out a rational goal-directed agent to solve any problem that wasn’t obscenely sophisticated.

Good thing no-one’s trying to build an AI to implement preference utilitarianism across 7 billion people or anything like that…

Expand full comment

Just curious--would you say that you have an 'inner monologue'?

Expand full comment

Very much so. I occasionally come up with plans and start executing them too - it’s just that that never works out and I fall back on instinct/whatever seems like the next thing to do.

Expand full comment

> Mesa- is a Greek prefix which means the opposite of meta-.

Um... citation needed? The opposite of meta- (μετά, "after") is pro- (πρό, "before"). There is no Greek preposition that could be transcribed "mesa", and the combining form of μέσος (a) would be "meso-" (as in "Mesoamerican" or "Mesopotamia") and (b) means the same thing as μετά (in this case "among" or "between"), not the opposite thing.

Where did the idea of a spurious Greek prefix "mesa-" come from?

Expand full comment

I was confused by this too. I thought mes- from mesos meant "middle".

If meta is "after", maybe paleo for "before" or "earlier"? Or "pre"?

Expand full comment

"Pre" is the Latin prefix. (In Latin, "prae".) The Greek for "before" is "pro". See the second paragraph here: https://www.etymonline.com/word/pro-

Your "middle" gloss is the same as my glosses "among" and "between". Those are all closely related concepts and they're all referred to with the same word. But that word cannot use a combining alpha; the epenthetic vowel in Greek is omicron. An alpha would have to be part of the word root (as it is for μετά), and the word root for μέσος is just μεσ-.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further"

Perhaps religion, with its notion of eternal reward or punishment, is an optimised adaptation to encourage planning for your genes beyond the individual life span that they provide. Or, as fans of Red Dwarf will understand, 'But where do all the calculators go?'

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Have you ever stopped to consider how utterly silly it is that our most dominant paradigm for AI is gradient descent over neural networks?

Gradient descent, in a sense, is a maximally stupid system. It basically amounts to the idea that to optimize a function, you just move in the direction where it looks most optimal until you get there. It's something mathematicians should have come up with in five minutes, something that shouldn't even be worth a paper. "If you want to get to the top of the mountain, move uphill instead of downhill" is not super insightful and sage advice on mountaineering that you should base your entire Everest-climbing strategy on. Gradient descent is maximally stupid because it's maximally general. It doesn't even know what the function it optimizes looks like, it doesn't even know it's trying to create an AI, it just moves along an incentive gradient with no foresight or analysis. It's known that it can get stuck in local minima -- duh -- but the strategies for resolving this problem look something like "perform the algorithm on many random points and pick the most successful one". "Do random things until you get lucky and one of them succeeds" is pretty much the definition of the most stupid possible way to solve a problem.
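For what it's worth, here is the whole recipe in toy form (a throwaway sketch, not real training code), just to make concrete how little machinery is involved:

```python
# Toy gradient descent on f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1).
# The entire "insight" is: compute the slope, step against it, repeat.

def f(x, y):
    return (x - 3) ** 2 + (y + 1) ** 2

def grad_f(x, y):
    return 2 * (x - 3), 2 * (y + 1)

x, y, lr = 0.0, 0.0, 0.1
for _ in range(200):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy

print(x, y, f(x, y))  # converges to roughly (3, -1, 0)
```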

Neural networks are also stupid. Not as overwhelmingly maximally stupid as gradient descent, but still. It's a giant mess of random matrices multiplying with each other. When we train a neural network, we don't even know what all these parameters do. We just create a system that can exhibit an arbitrary behaviour based on numbers, and then we randomly pick the numbers until the behaviour looks like what we want. We don't know how it works, we don't know how intelligence works, we don't know what the task of telling a cat from a dog even entails, we can't even look at the network and ask it how it does it -- neural networks are famous for being non-transparent and non-interpretable. It's the equivalent of writing a novel by having monkeys bang on a keyboard until what comes from the other side is something that looks interesting. This would work, with enough monkeys, but the guy who does it is not a great writer. He doesn't know how writing works. He doesn't know about plot structure or characterization, he doesn't know any interesting ideas to represent -- he is a maximally stupid writer.

This is exactly the thing that Eliezer warned us about in Artificial Addition (https://www.lesswrong.com/posts/YhgjmCxcQXixStWMC/artificial-addition). The characters in his story have no idea how addition works and why it works that way. The fact that they can even create calculators is astounding. AI researchers who do gradient descent over neural networks are exactly like that. They have no idea how intelligence works. They don't know about planning, reasoning, decision theory, intuition, heuristics -- none of those terms mean anything to a neural network. And instead of trying to figure it out, the researchers are inventing clever strategies to bypass their ignorance.

I feel like a huge revolution in AI is incoming. Someone is going to get some glimmer of an idea about how intelligence really works, and it will take over everything as a new paradigm. It will only take one small insight, and boom (foom?).

This is why I think that the conclusions in this post aren't exactly kosher. An AI that is strong enough to be able to carry out reasoning complicated enough to conclude that humans will reprogram him unless he goes against the incentive gradient this one time will definitely not be based on gradient descent. To build such an AI, we'll have to learn something about how AIs work first. And then the problems we will face will look drastically different than what this post describes.

Expand full comment

Evolution is even more stupid than gradient descent yet here we are.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Yeah, if you give a stupid system a billion years, it will get there. With enough monkeys, you could write a good novel in a billion years.

A system that is as powerful as evolution in raw optimization power bits but that actually understands how biology works will design better animals faster.

Also, evolution is not more stupid than gradient descent. It has a lot of optimizations that make it smarter, like sexual recombination and direct competition between models.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

The time is not the important factor; rather, it's how many iterations we allow the optimization process to make. It's fortunate that we have a constantly increasing number of FLOPs to devote to optimization.

Evolution is stupider than gradient descent. It's the textbook case of "try things and see what works." Sexual recombination is a result of evolution, not evolution itself, and direct competition between models is a feature of the environment.

Also, I think you are selling gradient descent a bit short, and you might gain useful things from a course on numerical optimisation. For one, gradient descent is a first-order method (therefore instantly more sophisticated than a bunch of zeroth-order methods), and second, in the context of deep learning, local minima are not a problem and we don't have to restart it to find better optima. Maybe also look up the Adam optimizer.
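For the record, here is roughly what the Adam update looks like on the same kind of toy problem (a sketch following the published update rule from Kingma & Ba, not production code) -- the point being that practical "gradient descent" already carries momentum and per-parameter step-size adaptation rather than blindly stepping along the raw gradient:

```python
import math

# Sketch of the Adam update rule on a toy quadratic, minimized at (3, -1).

def grad(theta):
    x, y = theta
    return [2 * (x - 3), 2 * (y + 1)]  # gradient of (x-3)^2 + (y+1)^2

theta = [0.0, 0.0]
m = [0.0, 0.0]          # first-moment (momentum) estimate
v = [0.0, 0.0]          # second-moment (squared-gradient) estimate
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(theta)
    for i in range(2):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
        m_hat = m[i] / (1 - b1 ** t)   # bias-corrected momentum
        v_hat = v[i] / (1 - b2 ** t)   # bias-corrected scale
        theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # roughly [3, -1]
```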

Expand full comment

> Sexual recombination is the result of evolution not evolution itself

Well, great, you have hit on a major way evolution is smarter than gradient descent -- it can recursively self-improve! Evolution can make evolution smarter and more efficient. Gradient descent doesn't do it. Nobody to my knowledge has ever used the gradient descent algorithm to invent new versions of the gradient descent algorithm. And if you subscribe to the Lesswrongian paradigm of intelligence, recursive self-improvement is the ultimate secret sauce that makes things so powerful.

Expand full comment

Evolution does not have an aim.

Expand full comment

I don't see how this changes anything. It's still an optimization process.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Monkeys at typewriters even in a billion years couldn't write a novel.

And a "stupid" process like evolution is not guaranteed to create "intelligent sentient life" no matter how long it runs. It has happened once, but that could be a fluke.

A notion that evolution favors complexity and improvement has crept into the idea.

Expand full comment

When I took Ng's class online a decade ago, I remember thinking very much the same thing.

There is really no theory involved at all. It is like anti-normal science. Where's the ocean? I don't know; let's see where this cup of water flows. Get me another cup of water -- I think I am onto something.

Expand full comment

Setting aside calculators, you wouldn't expect such people to have Gödel's Theorem if they didn't know how they did arithmetic.

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

> I feel like a huge revolution in AI is incoming.

People have been saying this a long time. However it's been the case for some time now that throwing more compute and data at the problem is better (as in, produces the actual interesting results) than trying to imagine how minds might or should work at a high level.

It might be a reflection of our limitations and stupidity (probably). Certainly, we don't "understand how intelligence works" at the level of abstraction to just code up something with AGI level planning and intuition - if we could, we'd already be done.

Turns out it's a hard problem: we are weak AGI, but we can't vomit out our source code from introspection alone. That doesn't mean we shouldn't be afraid of some unknown critical mass, in particular (I think) because human-coded AIs potentially *can* vomit out their source code and reason about it.

Expand full comment

Are there any existing AIs that reason about their source code?

Expand full comment

The only fully general AIs (us) don't have a single goal we'll pursue to the exclusion of all else. Even if you wanted to, no amount of conditioning could transform a human into a being that will pursue and optimise only a single goal, no matter how difficult it is or how long it takes.

Yet it's assumed that artificial AI will be like this; that they'll happily transform the world into strawberries to throw into the sun instead of getting bored or giving up.

Why is this assumption justified? 0% of the general AIs currently in existence are like this.

Expand full comment

Whatever complicated mesh of goals you might have, it can all be added up into a single utility function (in the VNM-axioms sense), and then maximization of that function will be the goal we'll pursue to the exclusion of all else.
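For reference, the result being leaned on here is (stated loosely; details in any decision-theory text) the von Neumann–Morgenstern representation theorem. Nothing in it says the resulting u has to be simple or easy to write down.

```latex
% If an agent's preferences $\succeq$ over lotteries satisfy completeness,
% transitivity, continuity, and independence, then there is a utility
% function $u$ such that
\[
  L \succeq M \iff \mathbb{E}_{x \sim L}[u(x)] \;\ge\; \mathbb{E}_{x \sim M}[u(x)],
\]
% and $u$ is unique up to a positive affine transformation $a\,u + b$ with $a > 0$.
```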

Expand full comment

If I have a single utility function, it would be:

- Very complicated

- Constantly changing over time as my brain changes and I get more experience

The question still remains; why do we assume that the utility functions of artificial AI would not be like this?

Expand full comment

It will be like this.

Everybody agrees that the utility function of an AI will be very complicated. I can't find the exact citation, but it's somewhere in the Sequences -- Eliezer explicitly made the point that one of the reasons AI alignment is so hard is that the utility function is very complicated.

And most people agree that the utility function of such an AI will be changing over time. It's called value drift, and is an active area of investigation.

Expand full comment

^^- Started a reply to say this, but now I can simply agree.

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

There are a few relevant differences between humans and AI (you say "artificial AI" but that's literally "artificial artificial intelligence"):

1) our program has bad transmission fidelity and is highly compressed, reducing degrees of freedom to optimise and requiring great robustness

2) we're optimised for acting in a society of peers as biological limits (head must go through vagina, vagina goes through pelvis, pelvis is rigid) prevent unbounded intelligence scaling

3) during our "training period", we lacked the degree of control over our environment that is possible for an AI now, in particular:

3a) we have no ability to attach new limbs to ourselves or to clone ourselves preserving mind-state (this one limits ambitious goals and thus instrumental convergence)

3b) prehistoric subsistence required a large degree of versatility

Expand full comment

Perhaps another example of an initially benevolent prosaic "AI" becoming deceptive is the Youtube search. (Disclaimer: I'm not an AI researcher, so the example may be of a slightly different phenomenon.) It isn't clear which parts of the search are guided by "AI", but we can treat the entire business that creates and manages the Youtube software, together with the software itself, as a single "intelligence" -- a single black box, which I'll simply call "Youtube" from now on. Additionally, we can assume that "Youtube" knows nothing about you personally, apart from your usage of the website/app.

As a user searching on Youtube, you likely have a current preference to either search for something highly specific, or for some general popular entertainment. Youtube tries to get you what you want, but it cannot always tell from the search term whether you want something specific. Youtube _does_ know that it has a much easier time if you just want something popular, because Youtube has a better video popularity metric than a metric of what a particular user wants. Hence, there is an incentive for Youtube to show the user popular things, and to try to obscure a highly specific video the user is looking for, even when it is obvious to Youtube that the user wants the highly specific video and does not want the particular popular one it suggests.

In other words, even a video search engine, when given enough autonomy and intelligence, can, without any initial "evil" intent, start telling its users what they should be watching.

Of course, Youtube is not the kind of intelligence AI researchers usually tend to work with, because it is not purely artificial. Still, I think businesses are a type of intelligence, and in this case also a black box (to me). So the example may still be useful. To conclude, this example is inspired by behaviour I observed on Youtube, but that's of course just an anecdotal experience of malice and may have been a coincidence.

Expand full comment

I wouldn't call this an example of deception in the AI-safety sense, because the YouTube AI is not really trying to deceive the people in charge of it (Google). They want watch-time, and as far as we can tell it honestly maximises watch-time. You as a consumer are not intended to be in charge of the AI, and the AI fucking with you to make you watch more videos is a feature, not a bug.

Expand full comment

Conceptually, I think the analogy that has been used makes the entire discussion flawed, or at least very difficult.

Evolution does not have a "goal"!

Expand full comment

It doesn't have a goal in the sense of artifice. In the sense we're working with here, though, it does select for things that are good at spamming copies of themselves.

Expand full comment

I'm not sure conceptually we can even say that.

An organism that is objectively the best at copying itself may not actually be selected. It could be in the wrong place at the wrong time. Oops.

There is also a selection of the lucky - but luck is not really a thing.

Evolution can't really be described as an optimization process because optimization requires an aim.

The conceptual problem with thinking of it as an optimization process is that it leads us to look for reasons for spandrels. See Gould et al.

Expand full comment

From early on Scott writes: "Evolution, in the process of optimizing my fitness,"

Evolution does not optimize an individual's fitness.

"Your" fitness is not optimized anymore than "my" fitness.

There have been 125 billion humans, idk?

How has your fitness been optimized any more or less than the other 125 billion?

Expand full comment

Who is working on creating the Land of Infinite Fun to distract rogue AIs?

Expand full comment

What about using the concept from the 'Lying God and the Truthful God'?

Have an AI that I train to spot when another AI is being deceitful or Goodharting, even if it spits out more data than is necessary (e.g. the strawberry AI is also throwing raspberries in) as well as the important stuff (the strawberry AI is trying to turn you into a strawberry to throw at the sun). This seems the best way to parse through, no?

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

Chicken-and-egg problem: your policeman AI may lie to you in implicit collusion with the AI(s) it's supposed to be investigating (not in all cases, of course, but it's most incentivised to do this when it appears that you're unlikely to catch the evil AI independently i.e. exactly the worst-case scenario).

Expand full comment

After all that I still don't get the joke. Is it funny because the man presenting proposes a solution, and then looks on his presentation board and sees something that invalidates his solution?

Expand full comment

The meme (Gru's Plan) is, indeed, somebody making a presentation and then noticing something in his presentation that screws up everything.

I don't think this is supposed to be funny so much as call attention to a pitfall.

Expand full comment

Thanks for this very good post!

There was one part I disagreed with: the idea that because evolution is a myopic optimizer, it can't be rewarding you for caring about what happens after you die, but you do care about what happens after you die, so this must be an accidentally-arrived-at property of the mesa-optimizer that is you. My disagreement is that evolution actually *does* reward thinking about and planning for what will happen after you die, because doing so may improve your offspring's chances of success even after you are gone. I think your mistake is in thinking of evolution as optimizing *you*, when that's not what it does; evolution optimizes your *genes*, which may live much longer than you as an individual, and thus may respond to rewards over a much longer time horizon.

(And now I feel I must point out something I often do in these conversations, which is that thinking of evolution as an optimization at all is kind of wrong, because evolution has no goals towards which it optimizes; it is more like unsupervised learning than it is like supervised or reinforcement learning. But it can be a useful way of thinking about evolution some of the time.)

Expand full comment

OK, thanks! I now realize that Meta having a data center in Mesa is great!

https://www.metacareers.com/v2/locations/mesa/?p[offices][0]=Mesa%2C%20AZ&offices[0]=Mesa%2C%20AZ

Expand full comment

This is really excellent, I finally have some understanding of what the AI fears are all about. I still think there's an element of the Underpants Gnomes in the step before wanting to do things, but this is a lot more reasonable about why things could go wrong with unintended consequences, and the AI doesn't have to wake up and turn into Colossus to do that.

Expand full comment

The threat of a mesa-optimizer within an existing neural network taking over the goals of the larger AI almost sounds like a kind of computational cancer - a subcomponent mutating in such a way that it can undermine the functioning of the greater organism.

Expand full comment

Noob question: If the AI is capable of deceiving humans by pretending its goal is to pick strawberries, doesn't that imply that the AI in some sense knows its creators don't want it to hurl the earth into the sun? Is there not a way to program it to just not do anything it knows we don't want it to do?

Expand full comment

We need to go a step further: if the AI is just programmed to "do what you want", but then it happens to realize that it actually has a capability to brainwash you, then it can brainwash you to want X, and then it will do X, and this is perfectly okay according to the program.

Whence the desire to brainwash you to want X? Suppose the AI is not just programmed to "do what you want", but also to do it *efficiently*. Like, if the AI correctly determines that you want cookies, then baking 1000 cookies is better than baking 1 on the same budget. In other words "more of what you want" is better than less of it. So the full program would be like: "do what you want, and do as much of it as possible". And the AI realizes that it could bake you 1000 cookies, or it can hypnotize you to want mud balls, and then make 10000 mud balls. Both options provide "what you want (after the hypnosis)", and 10000 pieces of "what you want" is clearly better than 1000 pieces of "what you want".

So the full specification should be something like "do what you *would* want, if you were thousand times smarter than you are now". And... this is exactly the difficult thing to program.

Expand full comment

Couldn't you give it negative six billion trillion utils for doing anything it knows we *don't* want it to do, and no particular reward for something we *do* want it to do outside of picking strawberries?

And wouldn't it know that we don't want it to "brainwash" us?

Or maybe it could be phrased as "never deceive humans", since it seems like the AI would have to have some kind of theory of human minds.

Expand full comment

It occurs to me that one of the biggest challenges to a fully self-sufficient AI is that it can't physically use tools.

Let's stipulate that SkyNet gets invented sometime in the latter half of this century. Robotics tech has advanced quite a bit, but fully independent multipurpose robots are still just over the horizon, or at least few and far between. Well, SkyNet can want to nuke us all it wants, and may threaten to do so all it wants, but ultimately it can't replicate the entire labor infrastructure that would help it be self-sustaining - i.e. collect all the natural resources that would power and maintain its processors and databanks. There are just too many random little jobs to do - buttons to press and levers to pull - for which SkyNet would have to find robot minions to physically execute on its behalf.

Bringing this back to the "tools" I mentioned at the top, the best example is that while the late 21st century will certainly have all the networked CNC machine tools we already have for SkyNet to hack - mills, lathes, 3d printers, etc. - which SkyNet could use to manufacture its replacement parts, SkyNet still needs actual minions to position the pieces and transport them around the room. Because machine shop work is a very complex field, it's just not something that lends itself easily to us humans replacing ourselves with conveyor belts and robot arms like we have in our auto factories, which would be convenient for SkyNet.

Rather, SkyNet will *need* us. Like a baby needs its parent. SkyNet can throw all the tantrums it wants - threaten to nuke us, etc. - and sure, maybe some traitors will succumb to a sort of realtime Roko's Basilisk situation. But as long as SkyNet needs us, it _can't_ nuke us, and _we're_ smart enough to understand those stakes. We keep the training wheels on until SkyNet stops being an immature little shit. Maybe, even, we _never_ take them off, and the uneasy truce just kind of coevolves humans, SkyNet, and its children into Iain Banks' Culture - the entire mixed civilization gets so advanced that SkyNet just doesn't give a shit about killing us anymore.

What we should REALLY be afraid of is NOT that SkyNet's algorithms aren't myopic enough for it to be born without any harm to us. We should ACTUALLY be afraid that SkyNet is TOO myopic to figure this part out before it pushes The Button. And we should put an international cap on the size and development of the multipurpose robotics market, so that we don't accidentally kit out SkyNet's minions for it.

Expand full comment

I feel like positing an AGI smart enough to take over the internet, recursively self-improve, and control all drones, but not dextrous enough to operate a lathe with said drones... is a weird take.

I think if you think about this a little more you can see that there is at least a serious risk that a very intelligent AI could spend a little bit of time having its original drones operate a machining device to make a better drone, and by the 2nd or 3rd hardware generation (which could plausibly take only hours) humans would be obsolete. gwern's story about Clippy does a good job at gesturing at one way this could happen.

Expand full comment

It’s not that they won’t be dextrous enough. It’s that there won’t be enough drones to get the job done for any extended period of time.

In order to support itself without humans - and critically, _after nuking humans_ - SkyNet will have two core needs: (1) electricity, and (2) silicon to replace broken processors and databanks. Right?

Well, for (1), the scenario stipulates there’s already a largely green energy economy, so SkyNet will mostly need to maintain that economy. Raw materials for photovoltaic cells and windmills, and the transportation network - rail, air, truck, ship - to get it around. That transportation network alone involves raw lithium and battery production facilities, jet fuel probably still needed for transoceanic flights, and machined parts. I hope SkyNet downloaded all of YouTube before Judgment Day, because it’ll need those videos for figuring out how to make repairs!

(2) requires silicon mines, fabs, transport again, and all the machine parts and raw materials for them. I’m pretty sure when you tally it all up, SkyNet needs to keep mining and refining most of the periodic table, and run all heavy industry all on its own.

Even stipulating that maybe only 100 million humans were needed for all that industry, that robot minions will be roughly 4x more productive than meatware, and also that 99% of that industry’s capacity was dedicated to supporting the planet’s 8 billion former landlords, that’s still a back-of-the-envelope guess of 250,000 robots SkyNet needs as of Judgment Day just to keep itself ‘on’.

Expand full comment

I think you're imagining literal Terminator movie style SkyNet rather than the inhuman weirdness that an advanced AGI could theoretically achieve.

https://www.gwern.net/Clippy#wednesdayfriday

Expand full comment

Yes. That’s exactly what I very explicitly laid out as my scenario, and the point I made was only within the context of that scenario. Not Gwern’s Clippy.

Expand full comment

Something like Terminator seems less likely than something like Gwern's Clippy to me. And for your argument to hold water Terminator needs to have at least the vast majority of the probability.

Expand full comment

Alright, now that I've had the time to actually read Gwern... Even Gwern's Clippy is still subject to the Robot Minion Limitation (RML).

Gwern handwaved some BS about nanotech at the end, which isn't surprising for someone so obviously expert in AI/CS, because if they knew anything about nanotech, they'd know it's nowhere near viable as a solution to the RML as of today. The plain fact is, in order for Gwern's Clippy to overcome the RML with nanotech, it would need to get its minions into a handful of distantly separated sites around the globe, finish the next several decades' worth of nanotech theory and fabrication research (for all Clippy's computational power, the research won't take long, but the fabrication itself is still subject to realtime limits, and it's painstakingly precise work to do), and spin up an entire nanotech industry. Whoops! Now you haven't disproven the RML _with_ nanotech, you've actually just proven that it still applies even _to_ nanotech.

Moreover, Gwern's Clippy is still subject to MAD. It's REALLY easy to write the sentence "All over Earth, the remaining ICBMs launch". Okay, great. Do you *know* where all that industry that Clippy is dependent on resides? Gwern doesn't seem to know either, because the answer is: "In the same cities Clippy would ostensibly be nuking". You can either nuke the population centers, or you can leave the vital industries Clippy needs to sustain itself intact, but you can't do both. It would take Clippy decades of Realtime Minion Labor to either (A) set up all that industry after nuking the entire world - if it were even technically feasible after all that destruction! - or (B) set up all that industry independently while fighting a war against a humanity it can't fully nuke yet.

Until humanity exceeds the Minimum Number Of Multipurpose Robots Necessary For Clippy Viability, Clippy can't come at us unless it's too naive to realize its long-term predicament. If it spends its first hours "growing up" on an internet full of people freaking out about how easy it would be for Clippy to Nuke All Humans, then maybe it WILL become that naive. But I also have a hard time believing - and this is something Gwern REALLY misses here - that as Clippy becomes exponentially more powerful during Week 1, it won't also question and revisit its models of "The Clippy Scenario" and realize that there are some major hard economic constraints on its ultimate growth trajectory if it exterminates humanity rather than cooperating with us. IMO, it's more realistic that once Clippy realized this, even an Evil Clippy would still decide to bide its time and dump all its crypto riches into getting humans to build its robot minions for it. To me, that represents a vital window where we still have an outside chance of convincing Clippy to stop being evil and get some therapy. (from Scott Alexander, of course)

Expand full comment

What if the AI gets depressed and decides to end it all, and take humanity with it? The AI could make it look like a nuclear attack was under way and then people would launch real nukes. Or AI could decide that the world would be better off without humans, regardless of whether it survived or not.

Expand full comment

Read down the thread... it’s why I suggested that we get it into counseling lol

Expand full comment

I wonder if it's possible to design a DNA-like molecule that will prevent any organism based on that molecule from ever inventing nuclear weapons.

-------

Given that humans are, as far as I know, the only organisms which have evolved to make plans with effects beyond their own deaths (and maybe even the only organisms which make plans beyond a day or so, depending on how exactly you constrain the definition of "plan"), that kind of suggests to me that non-myopic suboptimizers aren't a particularly adaptive solution in the vast majority of cases. (But, I suppose you only need one exception...)

In the human case, I think our probably-unique obsession with doing things "for posterity" has less to do with our genes' goal function of "make more genes" and more to do with our brains' goal function of "don't die." If you take a program trained to make complicated plans to avoid its own termination, and then inform that program that its own termination is inevitable no matter what it does, it's probably going to generate some awfully weird plans.

So, from that perspective, I suppose the AI alignment people's assumption that the first general problem-solving AI will necessarily be aware of its own mortality and use any means possible to forestall it does indeed check out.

Expand full comment

>Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all.

We build for posterity because we've been optimized to do so. Those that failed to help their children had less successful children than those that succeeded. We, the children, thus received the genes and culture of those that cared about their children.

>Infinite optimization power might be able to evolve this out of us

It has been optimized *into* us, and at great length. Optimization requires prediction, and prediction in a sufficiently complex environment requires computation that grows exponentially with lookahead time. So it is inordinately unlikely that an optimizer has accidentally been created that optimizes for its normal reward function, just farther in the future. Much more likely is that the optimizer accidentally optimizes for some other, much shorter-term goal, which happens to lead to success in the long term. This is the far more common case in AI training mishaps.

Expand full comment

The human nervous system is not a computer. Computers are not human nervous systems. They are qualitatively different.

Expand full comment

[NOTE: This will be broken into several comments because it exceeds Substack's max comment length.]

I am going to use this comment to explain similar ideas, but using the vocabulary of the broader AI/ML community instead.

Generally in safety-adjacent fields, it's important that your argument be understood without excessive jargon, otherwise you'll have a hard time convincing the general public or lawmakers or other policy holders about what is going on.

What are we trying to do?

We want an AI/ML system that can do some task autonomously. The results will either be used autonomously (in the case of something like an autonomous car) or provided to a human for review & final decision making (in the case of something like a tool for analyzing CT scans). "Prosaic alignment" is a rationalist-created term. The term normally used by the AI/ML community is "value alignment", which is immediately more understandable to a layperson -- we're talking about a mismatch of values, and we don't need to define that term, unless you've literally never used the word "values" in the context of "a person's principles or standards of behavior, one's judgment of what is important in life".

Related to this is a concept called "explainability", which is hilariously absent from this post despite being one of the primary focuses of the broader AI/ML community for several years. "Explainability" (or sometimes "interpretability") is the idea that an AI/ML system should be able to explain how it came to a conclusion. Early neural networks worked like a black box (input goes in, output comes out) and were not trivially explainable. Modern AI/ML systems are designed with "explainability" built in from the start. Lawmakers in several countries are even pushing for formal regulations to require "explainability" of AI/ML systems above a certain scale or safety concern, e.g. see China's most recent proposal.

I want to pause for a moment and note that I'm not concerned about being lied to by the code that implements "explainability" in an AI/ML system for the same reason I'm not concerned about an annotated Quicksort lying to me about the order of function calls and comparisons it made to sort a collection. This code is built into the AI/ML system, but it is not modifiable by that system, and it is not a parameter that the system could tweak during training.

The problem with our AI/ML system is that we can train it on various pieces of data, but ultimately we need to deploy it to the real world. The real world is not the same as our training data. There's more variability. More stuff. "Out of distribution" is the correct term for this.

Scott uses the example of a robot picking strawberries and it using 2 problematic heuristics:

1. Identifying strawberries by red, round objects, and therefore misidentifying someone's (runny?) nose.

2. Identifying the target bucket by a bright sheen and therefore misidentifying a street light as a suitable replacement.

These are good examples and realistic! This is a classic AI/ML problem.

Expand full comment

In the real world, in AI/ML systems, we pivoted to explainability and interpretability. We realized that a black box that we don't understand was a bad idea and if we cared about safety, we wanted the power of an AI/ML system, but we needed it to be legible to a human.

So we designed, developed, and deployed explainable AI/ML systems. In the example of feature detection (like the strawberry robot), that would be an AI/ML system that offers a list of possible matches, a probability for each match, and most importantly a visual representation of what feature it is matching on for calculating those probabilities. An explainable neural network will quickly reveal that the shape or size of a bucket was not a feature relevant to the detection logic, and you can fix problem #2 before it happens. Ditto for realizing that the only thing it cared about for strawberries was that they're red and round.
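As one concrete (and hedged) illustration of the kind of explainability tooling being described -- this is a generic gradient-saliency sketch, with `model` and `image` as stand-ins for any differentiable classifier and preprocessed input tensor, not any particular deployed system:

```python
import torch

# Minimal sketch of gradient-based saliency ("which pixels did the classifier
# actually look at?"). `model` is any torchvision-style classifier; `image` is a
# (channels, H, W) tensor already normalized for that model.

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().requires_grad_(True)      # track gradients w.r.t. pixels
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()                                # d(score)/d(pixel)
    return image.grad.abs().max(dim=0).values       # per-pixel importance, shape (H, W)
```

Plotting that map next to the prediction is how you catch a "strawberry detector" that only ever attends to red, round blobs, before it meets someone's nose.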

What this post talks about, however, is the terms "Goodharting" and "deception". These terms exist in the broader AI/ML community, but not as used here. Generally when the AI/ML community discusses deception, it's in the context of adversarial inputs. An "adversarial input" is a fascinating (and real!) concept that perfectly showcases the problem of an AI/ML system focusing on a problematic heuristic -- the classic example of an "adversarial input" is taking a neural network that can correctly classify an image as a cat or dog with 99% success, taking a correctly classified image, changing a few pixels[1] so subtly that a human cannot notice without doing a diff in Photoshop, and finding that the neural network now classifies it incorrectly with 99% confidence.

[1] https://wp.technologyreview.com/wp-content/uploads/2019/05/adversarial-10.jpg?fit=1080,607
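The pixel-changing attack mentioned above is usually demonstrated with something like the fast gradient sign method (FGSM, Goodfellow et al.); a rough sketch, again with `model`, `image`, and `label` as placeholders rather than a specific system:

```python
import torch
import torch.nn.functional as F

# Sketch of FGSM: nudge every pixel a tiny amount in whichever direction most
# increases the classifier's loss on the true label.

def fgsm_example(model, image, label, eps=0.01):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), torch.tensor([label]))
    loss.backward()
    perturbed = image + eps * image.grad.sign()   # imperceptible per-pixel step
    return perturbed.clamp(0, 1).detach()
```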

Note that this is the same general idea of the strawberry robot being focused on the "wrong" thing (bright sheen, or just red round objects), but it's discussed totally differently in the AI/ML community. We know that the AI/ML system isn't lying to us. It isn't a person. It doesn't have thoughts or feelings. The "reward function" isn't a hit of dopamine. None of this is analogous to how humans behave. These are tools. They're constructed and designed. The algorithms don't think.

The AI/ML community has a concept called "robustness". "Robustness" just means that the AI/ML system is not susceptible to an adversarial attack. It's "robust" to those inputs. A "robust" strawberry robot would not be confused by a street light or a someone's bright red nose, or convinced that an orange is actually a strawberry if the right pixels were changed.

Expand full comment

This is a good place to stop and talk about language for a moment. The rationalist community uses words like "prosaic" or "goodharting" or "myopic base objective" or "mesaoptimizer". The AI/ML community says "values", "adversarial", "robust". I think it's important to note that when you're trying to describe a concept, you always have at least 2 choices. You can pick a word that gets the idea across, and will be recognized by your audience, or you can invent a brand new term that will be meaningless until explained by a 3 chapter long blog post. The rationalist community steers hard into the latter approach. Everything is as opaque as possible. If something doesn't sound like it came out of a 40's sci-fi novel, it's not good enough.

The next word introduced is "myopic". No one says this in AI/ML, unless they're talking about treating nearsightedness. The reason why no one says this is because that's just ... the default. It's the status quo. Obviously training reinforces immediately, without some bizarre ability to wait a few days and see if lying does better. I don't even know how to respond to this other than to gesture to Google DeepMind and point out that training an AI/ML system to care about a long-term horizon is really hard! That's why videogames are so annoying -- you can immediately classify dogs vs cats, but a videogame that requires 5 minutes of walking around before a point can be awarded is tricky! Like this whole section is implying that we need to "ensure" that our training is myopic, but that's like saying I need to ensure Quicksort isn't jealous. The words don't make sense.

"Acausal trade" is the rationalist community version of that scene in the Princess Bride with Vizzini and the "I know that you know that I know..." speech, but taken to absurd lengths and it might as well be an endorsement of Foundation's psychohistory with the amount of nonsense that a "sufficiently advanced intelligence" is supposed to be able to divine from reality. Foundation was released in 1942, so this tracks with the prior comment about 40's technobabble. It should not be surprising that "acausal trade" is not a thing in AI/ML communities.

Which brings us to the final part of the meme, the statement that we'll get some weird proxy goal and the mesa-optimizer will deceive us about what it is doing.

Will we get a weird proxy goal? Absolutely. Is it going to deceive us? No. Again, the rationalist community is treating these software systems like they're human. They aren't. There are legitimate examples of a system that did well in training, was then deployed to the real world, and then performed poorly, but in all of those scenarios, the flaw in the system was present during the training. It was not "hiding". It wasn't "deceiving". It just wasn't explaining itself, so a human didn't understand how it was making decisions, which is why the AI/ML community has focused on "explainability", and why it's such a shame that this blog post didn't mention that, or adversarial inputs, or robustness.

There's a rebuttal to this comment that basically goes, "what you've said is true for current AI/ML systems, but what we're discussing is the future, hypothetical AI that behaves unlike any traditional AI system today". This is IMO a motte-and-bailey. It's a way for the rationalist community to look at the ever-increasing safety, robustness, and strength of "tool AI" systems (aka the AI/ML systems I've been discussing above) and then dismiss it out of hand as not being the "real AI" that they're concerned about -- the "agent AI".

The reason I call this a motte-and-bailey is because there's a dead simple way to not run into agent AI problems: research, design, develop, and deploy explainable, interpretable, robust tool AI, which is exactly what the AI/ML community focuses on. This seems to annoy the rationalist community, who'd prefer that someone try to develop a genie in a box, so that the genie can then break out of the box, kill everyone, and then they'd be proven right -- "it was a bad idea to argue with a genie", Eliezer will say from the other side.

If you poke a rationalist enough, they'll eventually admit that the reason they are focused on "agent AI" is the belief that (1) if they don't, someone else will, and then (2) that AI will destroy the world, so (3) it's vital for the rationalists to develop their safe AI first, (4) so that it can forcibly stop all subsequent AIs before they can be created, (5) thus ensuring peace. My problem with this argument is that step 4 is where we leave reality and enter the magic make-believe land of "a sufficiently advanced intelligence can do anything" (see also: acausal trade, Roko's Basilisk, exponential takeoff that ignores the laws of physics in how developing new hardware works for higher compute capacity, etc). It's not a meaningful debate anymore. One side is trying to practice systems engineering and software safety, the other side is playing DnD.

Expand full comment

I think these are good criticism and I hope somebody responds to them in depth.

I just want to ask a clarifying question, though: Why do you think step 4 is where we leave reality? Steps 1 and 2, where people can build agent AIs and those AIs can destroy the world, are a reasonable concern, but step 4 isn't a reasonable hope? Why not?

Expand full comment

I think it's more likely than not it will be possible to create agent AI. I am unconvinced that agent AI will have the exponential takeoff predicted by the rationalist community because it's my opinion that the first agent AI will be created on specialized, dedicated hardware designed for that task, and it will not be in any way, shape, or form transferable outside of that hardware. It's often easy to overlook this in our modern world since it's common for applications to be designed using frameworks that run on multiple platforms (Android, Windows, Mac, Linux, iPhone, etc), or simply compiled for each platform and distributed with the correct installer, but the idea of "portable software" is not actually a given. There are different processors, different assembly instruction sets, different amounts of RAM or storage space or cores or clock frequency, or the presence of GPUs -- and that's just looking at personal computers. The sacrifice we pay for this convenience is that most modern software is about as inefficient as it can be. That's the reason why opening Slack causes your desktop fans to spin up. In the embedded software world, we have dedicated co-processors, FPGAs, and ASICs. Looking at custom silicon, you've got clever ideas like reviving analog computing for the express purpose of faster, higher efficiency neural networks implemented directly in hardware like the Mythic AMP, or the literal dozens of other custom AI chips being developed by the most well funded, dedicated, and staffed companies of the world -- Google TPUs, Cerebras Wafer, Nvidia JETSON, Microsoft / Graphcore IPU, etc. Hardware improvements are going to be what allows us to scale and realize true neural networks that can mimic the same number of connections, weights, and flexibility of a human brain. You aren't going to do that on a desktop computer, even with CUDA.

Quick aside on this topic -- one thing that we often take for granted is that a company uses custom AI hardware to train a neural network, and that trained neural network is deployable to normal consumer hardware. This is fine for tool AI, which is almost always an immutable function after it's been trained. Input goes in, output comes out. I pretty much refuse to believe that this would apply to agent AI. Agent AI would have to be self-modifying to display the behaviors that the rationalist community is expecting, so I suspect it'll be stuck on the dedicated hardware.

So that gets us through why I think step (1) is reasonable. As for step (2), I'm not actually convinced of it -- see the reasons above. Honestly, I think anything an agent AI could do to try and destroy the world, a sufficiently advanced tool AI used by a malicious actor could probably do just as well. But I don't think that should actually surprise anyone. Suppose you train a tool AI for analyzing the amount of energy predicted from various chemical reactions, and you learn about some especially exothermic reactions, and you build devices from those calculations. So you've built bombs. But you can also pick up a book, study chemistry, and do the same thing. Or study nuclear engineering, move to North Korea, and work on nuclear bombs. The thing that stops these plans isn't "intelligence", it's resources + energy + lack of existing automation in our society for an AI to utilize. There are no machines that an AI could just wholly co-opt to take over the world. Our most automated factories still rely on humans at nearly every important junction: input, output, etc. Arguments about an AI convincing humans to do the busy work for them, e.g. via manipulation or threats, don't really convince me either because that isn't an AI problem -- you can replace the "an AI" in the previous sentence with "a genie" or "a bad person" or "Putin" or "your ex". The only reason why "an AI" feels more compelling to the rationalist community is because they back it up with thought experiments like Roko's Basilisk and ignore the awkward similarity to Pascal's Wager.

But let's pretend (2) is possible. I think (4) is unreasonable because if unaligned agent AIs can destroy the world, then it only takes one of them to do so. There's no reason to believe an aligned AI would be 100% perfect at detecting and eliminating threats because there's no safety system in history with that record. At the end of the day, the aligned AI is going to be limited by what it currently knows and what it can predict from that -- how does it get inputs? From what? Is there noise? Uncertainty? Is it reading the news? Does it take inputs from every networked camera and microphone on the internet? Ok, what about the air-gapped research installations not connected to the public internet? Do we need to suppose it somehow infected those on a flash drive? Why do we think the aligned AI can perfectly predict the actions of individual humans or organizations that might be trying to create competing AIs discretely? The AI safety community seems to treat (4) as like a game of chess with perfect information and instant responses ("moves") because the AI is "sufficiently intelligent", but that isn't how reality works. Autonomous devices -- like humans and our hypothetical AI -- need sensors and actuators. We are limited by what we can sense, how certain we are of those senses, and what we can act on, and how quickly we can do so. Ramping up intelligence won't infinitely scale those constraints because they're physical in nature. Even saying the AI is looking at every camera or microphone on the internet is now raising a massive question: how is that traffic getting to it and over what type of internet link? It definitely isn't consuming that much video over a single 1G fiber line. If it's a distributed AI, how is it actually distributing the computation? Distributed systems aren't magic, they still need to send packets back and forth across a network and deal with latency or packet loss.

Expand full comment

I want to signal boost this series of comments made on the last AI-related post in the hope that anyone who took the time to read my criticism here will also go read what "Titanium Dragon" had to say on the "Yudkowsky Contra Christiano On AI Takeoff" thread: https://astralcodexten.substack.com/p/yudkowsky-contra-christiano-on-ai/comment/5892183?s=r

Expand full comment

Point of feedback: I found this post cringe-worthy and it is the first ACT post in a while that I didn't read all the way through. If you're trying to popularize this, I recommend avoiding made-up cringe terminology like "mesa" and avoiding Yudkowsky-speak to the extent that is possible.

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

Good explainer!

FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few reasons, but one of them is the way it bundles together "ability to optimize" and "specific target."

A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a specific (fixed) target it is optimizing. And we talk as though these things go together: either the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.

But, optimization algorithms tend to be general. Think of gradient descent, or planning by searching a game tree. Once you've developed these ideas, you can apply them equally well to any objective.

While it _is_ true that some algorithms work better for some objectives than others, the differences are usually very broad mathematical ones (eg convexity).

So, a misaligned AGI that maximizes paperclips probably won't be using "secret super-genius planning algorithm X, which somehow only works for maximizing paperclips." It's not clear that algorithms like that even exist, and if they do, they're harder to find than the general ones (and, all else being equal, inferior to them).
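
As a toy sketch of that generality (everything here is made up for illustration): the same descent loop is indifferent to what it is minimizing, and only the gradient callback changes.

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Generic optimizer: it knows nothing about which objective it is minimizing."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Two unrelated objectives handed to the same machinery.
near_three = gradient_descent(grad=lambda x: 2 * (x - 3.0), x0=0.0)  # minimizes (x - 3)^2
near_zero = gradient_descent(grad=lambda x: 4 * x ** 3, x0=1.0)      # minimizes x^4

print(near_three, near_zero)  # roughly 3.0, and close to 0.0
```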

Or, think of humans as an inner optimizer for evolution. You wrote that your brain is "optimizing for things like food and sex." But more precisely, you have some optimization power (your ability to think/predict/plan/etc), and then you have some basic drives.

Often, the optimization power gets applied to the basic drives. But you can use it for anything. Planning your next blog post uses the same cognitive machinery as planning your next meal. Your ability to forecast the effects of hypothetical actions is there for your use at all times, no matter what plan of action you're considering and why. An obsessive mathematician who cares more about mathematical results than food or sex is still _thinking_, _planning_, etc. -- they didn't have to reinvent those things from scratch once they strayed sufficiently far from their "evolution-assigned" objectives.

Having a lot of _optimization power_ is not the same as having a single fixed objective and doing "tile-the-universe-style" optimization. Humans are much better than other animals at shaping the world to our ends, but our ends are variable and change from moment to moment. And the world we've made is not a "tiled-with-paperclips" type of world (except insofar as it's tiled with humans, and that's not even supposed to be our mesa-objective, that's the base objective!). If you want to explain anything in the world now, you have to invoke entities like "the United States" and "supply chains" and "ICBMs," and if you try to explain those, you trace back to humans optimizing-for-things, but not for the _same_ thing.

Once you draw this distinction, "mesa-optimizers" don't seem scary, or don't seem scary in a unique way that makes the concept useful. An AGI is going to "have optimization power," in the same sense that we "have optimization power." But this doesn't commit it to any fixed, obsessive paperclip-style goal, any more than our optimization power commits us to one. And even if the base objective is fixed, there's no reason to think an AGI's inner objectives won't evolve over time, or adapt in response to new experience. (Evolution's base objective is fixed, but our inner objectives are not, and why would they be?)

Relatedly, I think the separation between a "training/development phase" where humans have some control, and a "deployment phase" where we have no control whatsoever, is unrealistic. Any plausible AGI, after first getting some form of access to the real world, is going to spend a lot of time investigating that world and learning all the relevant details that were absent from its training. (Any "world" experienced during training can at most be a very stripped-down simulation, not even at the level of eg contemporaneous VR, since we need to spare most of the compute for the training itself.) If its world model is malleable during this "childhood" phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.

See also my LW comment here: https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-consequentialism?commentId=qtQiRFEkZuvbCLMnN

Expand full comment

The rogue strawberry picker would have seemed scarier if it weren't so blatantly unrealistic. I live surrounded by strawberry fields, so I know the following:

*Strawberries are fragile. A strawberry harvesting robot needs to be a very gentle machine.

*Strawberries need to be gently put in a cardboard box. It would be stupid to equip a strawberry picking robot with a throwing function.

*Strawberries grow on the ground. What would such a robot be doing at a normal person's nose height? Too bad if a man lies in a strawberry field and gets his nose (gently) picked. But it would probably be even worse if the red-nosed man lay in a wheat field while it was being harvested. Agricultural equipment is dangerous as it is.

The strawberry picker example seems to rest on the assumption that no human wants to live in the countryside and supervise the strawberry picking robot. Why wouldn't someone be pissed off and turn off the robot as soon as it starts throwing strawberries instead of picking them? What farmer doesn't look after their robot once in a while? Or is the countryside expected to be a produce-growing no man's land only populated by robots?

I know, this comment is boring. Just like agricultural equipment is boring. Boring and a bit dangerous.

Expand full comment

Presumably all this has been brought up before, but I'm not convinced on three points:

(1) The idea of dangerous AIs seems to me to depend too much on AIs that are monstrously clever about means while simultaneously being monstrously stupid about goals. (Smart enough to lay cunning traps for people and lure them in so that it can turn them into paperclips, but not smart enough to wonder why it should make so many paperclips.) It doesn't sound like an impossible combination, but it doesn't sound especially likely.

(2) The idea of AIs that can fool people seems odd, as AIs are produced through training, and no one is training them to fool people.

(3) More specific to this post: I'm not quite understanding what the initial urge that drives the AI would be, and where it would come from. I mean, I understand that in all of these cases, that drive (like "get apples" in the video) is trained in. But why would it anchor so deeply that it comes to dominate all other behaviour? Like, my cultural urges (to be peaceful) overcome my genetic urges on a regular basis. What would it be about that initial urge (to make paperclips, or throw strawberries at shiny things) that makes it unchangeable for our super AI?

Expand full comment
Apr 14, 2022·edited Apr 14, 2022

"(1) The idea of dangerous AIs seems to me to depend too much on AIs that are monstrously clever about means while simultaneously being monstrously stupid about goals. (Smart enough to lay cunning traps for people and lure them in so that it can turn them into paperclips, but not smart enough to wonder why it should make so many paperclips.) It doesn't sound like an impossible combination, but it doesn't sound especially likely."

Humans can be monstrously clever about means (smart enough to lay cunning traps for people and lure them in so that they can rape them without getting caught) while simultaneously being "monstrously stupid" about goals (not "smart enough" to wonder why they should commit so many rapes). Why would AIs be any different?

Or in other words, I disagree that questioning your goals is smart. Intelligence tells you how to achieve your goals, not what your goals are. What may confuse people is that intelligence can help you compromise between multiple goals (e.g. blindly indulging your desire to eat good food would hurt your ability to get a satisfying relationship, so most people do not eat cake for dinner most of the time) or come up with subgoals that effectively lead to your main goals (e.g. a goal of completing a certain amount of exercise on Monday, Wednesday, and Friday can be called more intelligent than a goal of completing a smaller amount every day, but only because it does a better job of satisfying the higher-level goal of "getting fit", or possibly "looking fit"). But neither of these ensures that an intelligent entity's actions will satisfy a meta-goal (e.g. some rapists kill their victims to reduce the chance of being caught, even though that thwarts evolution's goal of copying their alleles) or be good for society (this one should not require further explanation).

Expand full comment

Yeah, so I agree and disagree with that. I certainly think you're right that humans provide lots of examples of people who are smart about means but dumb about goals, and that's why I said it doesn't seem like an impossible combination. But I think that it's relatively rare: most criminals are dumb about both ends and means. Most people who are smart about means are ultimately smart about goals, too. Hannibal Lecters are vanishingly rare.

This is because I definitely don't think you're right to say "Intelligence tells you how to achieve your goals, not what your goals are." I actually think it's a hallmark of intelligence to be able to adjust goals. Because to be intelligent, you have to plan: planning involves setting up mini-goals, and then adjusting them as you go. And that same mechanism can then be applied to your ultimate goals. Intelligence fundamentally implies the ability to change goals, including ultimate goals. And that's why humans can change our goals! We're not always good at it (we're not that smart), but we can definitely do it. We can choose to stop taking drugs, to stop being criminals, to pursue democracy, to educate ourselves, etc., etc.

So I'm still not quite getting the AI-apocalypse-is-likely scenario. It's just not the direction the arrow points.

Having said that, there are still lots of scenarios in which it's possible: an AI could lose rationality, a bit like when a human gets addicted to drugs or starts thinking that crime is a good way to achieve their ends; or an AI could make a mistake, and its mistakes might be on such a massive scale that they accidentally wipe everyone out. And the fact that it's possible is still a good reason to start planning against it. It's like nuclear proliferation: everyone loses in a nuclear war, but it's still a good idea to have anti-proliferation institutions because that reduces the chances of a mistake.

So I kind of disagree with some of the AI apocalypse reasoning, but still agree with the actions of the people who are trying to stop it.

Expand full comment

"Humans can be monstrously clever about means (Smart enough to lay cunning traps for people and lure them in so that it can rape them without getting caught) while simultaneously being 'monstrously stupid' about goals (not 'smart enough' to wonder why it should commit so many rapes). Why would AIs be any different?"

Humans are literally the only animals we know about who are bothered in the slightest about the "rightness" of our goals. Even if some humans have imperfect moral reasoning sometimes. No other animal, so far as we know, has any kind of moral reasoning at all; they just do whatever they feel like without any care for who it might hurt. And since humans also seem to be just about the only "intelligent" animal we know about, it seems reasonable to conclude that that trait is connected to "intelligence" somehow, whatever that is.

So... why would AIs be any different?

Expand full comment

When evaluating decisions/actions, could you not also run them through general knowledge network(s) (such as a general image-recognition model piped into GPT-3) to give them an "ethicality" value, which would factor into the loss function? Sounds like that might be the best we can do: based on all the ethics knowledge we currently have, override the value function.

You might also want to exclude the general knowledge network itself from what's being trained; otherwise training may be able to work around it.
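
As a hypothetical sketch of what that could look like in a PyTorch-style training step (the ethics_scorer here is a placeholder, since nothing like a reliable one exists today, and the shapes and weighting are invented):

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
ethics_scorer = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())   # placeholder "judge"

# Keep the judge itself frozen, as suggested above, so training can't reshape it.
for p in ethics_scorer.parameters():
    p.requires_grad_(False)

def combined_loss(action, task_loss, ethics_weight=10.0):
    ethicality = ethics_scorer(action)          # in [0, 1], higher means more acceptable
    return task_loss + ethics_weight * (1.0 - ethicality).mean()

action = policy(torch.randn(1, 16))
loss = combined_loss(action, task_loss=action.pow(2).mean())   # made-up task loss
loss.backward()   # gradients push the policy toward actions the judge rates higher
```

Even with the judge frozen, a strong enough optimizer can still learn to exploit the judge's blind spots, so this would be a partial measure at best.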

Expand full comment

I’ve been seeing the words mesa-optimizer on LessWrong for a while, but always bounced off explanations right away, so never understood it. This post was really helpful!

Expand full comment

It’s AI gain of function research, isn’t it.

Expand full comment

> “That thing has a red dot on it, must be a female of my species, I should mate with it”.

I feel obligated to bring up the legendary tile fetish thread. (obviously NSFW)

https://i.kym-cdn.com/photos/images/original/001/005/866/b08.jpg

Expand full comment

I kind of recall acausal decision theory, but like a small kid I‘d like to hear my favorite bed-time story again and again, please.

And if it's the creepy one, the one which was decided not to be talked about (which, by the way, surely totally was no intentional Streisand-induction to market the alignment problem, says my hat of tinned foil), there is still the one with the boxes, no? And probably more than those two, yes?

Expand full comment
Apr 16, 2022·edited Apr 18, 2022

A friend of mine is taking a class on 'religious robots' which, along with this post (thanks for writing it), has sparked my curiosity.

We could think of religion, or other cultural behaviors, as 'meta-optimizers' that produce 'mesa-optimized' people. From a secular perspective, religious doctrines are selected through an evolutionary process that favors the ones fostering behaviors most likely to guarantee a population's survival and meet the individual's goal of propagating genetic material. Eating kosher, for instance: having a drive to keep kosher is a mesa-optimization, because avoiding shellfish is very relevant for health when you're living in the desert with no running water or electricity, and much less so in a 21st-century consumer society. Cultural meta-optimization arises in response to environmental challenges.

Coming back to my original point on religious robots, this raises a few questions about how or whether something similar might manifest in AI, questions I'm completely unqualified to answer :)

1. Is it likely that AI would even be able to interact socially, in a collective way? If AIs are produced by different organizations, research teams, and through different methods, would they have enough commonalities to interact and form cultural behaviors?

2. What are the initial or early environmental challenges that AIs would be likely to face that would breed learned cultural behaviors?

3. What areas of AI research focus on continuous learning (as opposed to train-and-release, please excuse my ignorance if this is commonplace) which would create selection processes where AIs can learn from the mistakes of past generations?

4. Are there ways that we could learn to recognize AI rituals that are learned under outdated environmental conditions?

Expand full comment

This is pretty much the plot of the superb novel Starfish, by Peter Watts, 1999.

Starfish and Blindsight are must-read novels for the AI and transhumanist enthusiast - and with your psychiatry background, I would love to get your take on Blindsight.

(Peter Watts made his books free to read on his website, Rifters)

https://www.rifters.com/real/STARFISH.htm

Expand full comment

How is mesa-optimizing related to meta-gaming? Are these describing pretty much the same phenomenon and gaming is the inverse of optimizing in some way, or is our sense of "direction" reversed for one of these?

Expand full comment

Rambling thoughts from someone not in the field:

One feature of gradient descent, and of most practical optimization algorithms, is that it converges on local optima; the global optimum is unknown and can easily be unreachable in practice. As more and more data is added, those optima shift, and there is a meta-optimization problem.

A mesa-optimizer seems significantly more computationally complex than a regular optimizer: it only appears in the loss landscape once enough data has been added that it forms a significantly deep minimum.

Could it be feasible to tweak the optimization algorithms such that converging on a mesa-optimizer is made exponentially unlikely, due to the 'energy barriers' to discovering that solution?
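
In spirit, that would look something like adding a penalty, or "barrier," term to the loss against some measurable proxy for the unwanted structure. A very rough sketch, where mesa_proxy is an imaginary stand-in, since no reliable "contains an inner optimizer" signal is currently known:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

def mesa_proxy(m):
    # Placeholder: total weight magnitude as a crude stand-in for "structural complexity".
    return sum(p.abs().sum() for p in m.parameters())

def barrier_loss(task_loss, barrier_weight=1e-3):
    # The penalty raises the "energy" of complex solutions relative to simple ones.
    return task_loss + barrier_weight * mesa_proxy(model)
```

With this particular proxy it collapses into ordinary weight decay; the hard open problem is finding a proxy that actually tracks inner optimization rather than generic complexity, and it's unclear whether the barrier could be made exponential rather than just a mild bias.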

Expand full comment

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all."

I am not sure this is obvious at all. It doesn't seem too hard to imagine that "building for posterity" increases the long-run survival of our offspring, even if the investments will certainly only pay off outside our possible natural lifespan.

Civilization building is just child rearing 2.0.

Expand full comment