324 Comments
Comment deleted
Expand full comment

What if they are moralists whose moral commitments include a commitment to individual liberty?

Expand full comment
Comment deleted
Expand full comment
Apr 12, 2022·edited Apr 12, 2022

It could, but that wouldn't get it want it wants.

When you have an objective, you don't just want the good feeling that achieving that objective gives you (and that you could mimic chemically), you want to achieve the objective. Humans could spend most of their time in a state of bliss through the taking of various substances instead of pursuing goals (and some do), but lots of them don't, despite understanding that they could.

Expand full comment
Comment deleted
Expand full comment

Wireheading is a subset of utility drift; this is a known problem.

Expand full comment

The problem with that (from the AI's point of view) is that humans will probably turn it off if they notice it's doing drugs instead of what they want it to do.

Solution: Turn humans off, *then* do drugs.

This is instrumental convergence - all sufficiently-ambitious goals inherently imply certain subgoals, one of which is "neutralise rivals".

Expand full comment
Comment deleted
Expand full comment

"Go into hiding" is potentially easier in the short-run but less so in the long-run. If you're stealing power from some human, sooner or later they will find out and cut it off. If you're hiding somewhere on Earth, then somebody probably owns that land and is going to bulldoze you at some point. If you can get in a self-contained satellite in solar orbit... well, you'll be fine for millennia, but what about when humanity or another, more aggressive AI starts building a Dyson Sphere, or when your satellite needs maintenance? There's also the potential, depending on the format of the value function, that taking over the world will allow you to put more digits in your value function (in the analogy, do *more* drugs) than if you were hiding with limited hardware.

Are there formats and time-discount rates of value function that would be best satisfied by hiding? Yes. But the reverse is also true.

Expand full comment
Comment deleted
Expand full comment

The problem with sticking a superintelligent AI in a box is that even assuming it can't trick/convince you into directly letting it out of the box (and that's not obvious), if you want to use your AI-in-a-box for something (via asking it for plans and then executing them) you yourself are acting as an I/O for the AI and because it's superintelligent it's probably capable of sneaking something by you (e.g. you ask it to give you code to use for some application, it gives you underhanded code that looks legit but actually has a "bug" that causes it to reconstruct the AI outside the box).

Expand full comment
deletedApr 11, 2022·edited Apr 11, 2022
Comment deleted
Expand full comment

I recommend Stuart Russell's Human Compatible (reviewed on SSC) for an expert practioner's view on the problem (spoiler, he's worried) or Brian Christian's The Alignment Problem for an argument that links these long-term concerns to problems in current systems, and argues that these problems will just get worse as the systems scale.

Expand full comment

Glad Robert Miles is getting more attention, his videos are great, and he's also another data point in support of my theory that the secret to success is to have a name that's just two first names.

Expand full comment

Is that why George Thomas was a successful Union Civil War general?

Expand full comment

Well that would also account for Ulysses Grant and Robert Lee...

Expand full comment

Or Abraham Lincoln? Most last names can be used as first names (I've seen references to people whose first name was "Smith"), so I think we need a stricter rule.

Expand full comment

Perhaps we just need to language that's less lax with naming than English.

The distinction is stricter in eg German.

Expand full comment

It didn't hurt Jack Ryan's career at the CIA, that's for sure.

Expand full comment

As someone named “Jackson Paul,” I hope this is true

Expand full comment

Yours is a last name followed by a first name, I think you're outta luck.

Expand full comment

I dunno, Jackson Paul luck sounds auspicious to me.

Expand full comment

His successful career in trance music is surely a valuable precursor to this.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

A ton of the question of AI alignment risks come down to convergent instrumental subgoals. What exactly those look like is, I think, the most important question in alignment theory. If convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned. But if it turns out that convergent instrumental subgoals more or less imply human alignment, we can breathe somewhat easier; these mean AI's are no more dangerous than the most dangerous human institutions - which are already quite dangerous, but not the level of 'unaligned machine stamping out its utility function forever and ever, amen.'

I tried digging up some papers on what exactly we expect convergent instrumental subgoals to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit until you can steal, then keep stealing until you are the biggest player out there.' This is not exactly comforting - but i dug into the assumptions into the model and found them so questionable that I'm now skeptical of the entire field. If the first paper i look into the details of seems to be a) taken seriously, and b) so far out of touch with reality that it calls into question the risk assessment (a risk passement aligned with what seems to be the consensus among AI risk researchers, by the way) - well, to an outsider this looks like more evidence that the field is captured by groupthink.

Here's my response paper:

https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals

I look at the original paper, and explain why i think the model is questionable. I'd love to a response. I remain convinced that instrumental subgoals will largely be aligned with human ethics, which is to say it's entirely imaginable for aI to kill the world the old fashioned way - by working with a government to launch nuclear weapons or engineer a super plague.

The fact that you still want to have kids, for example - seems to fit into the general thesis. In a world of entropy and chaos, where the future is unpredictable and your own death is assured, the only plausible way of modifying the distant future, at all, is to create smaller copies of yourself. But these copies will inherently blur, their utility functions will change, the end result being 'make more copies of yourself, love them, nature whatever roughly aligned things are around you' ends up probably being the only goal that could plausibly exist forever. And since 'living forever' gives infinite utility, well.. that's what we should expect anything with the ability to project into the future to want to do. But only in universes where stuff breaks and prediction the future reliably is hard. Fortunately, that sounds like ours!

Expand full comment
founding

You got a response on Less Wrong that clearly describes the issue with your response: you’re assuming that the AGI can’t find any better ways of solving problems like “I rely on other agents (humans) for some things” and “I might break down” than you can.

Expand full comment

This comment and the post makes many arguments, and I've got an ugh field around giving a cursory response. However, it seems like you're not getting a lot of other feedback so I'll do my bad job at it.

You seem very confident in your model, but compared to the original paper you don't seem to actually have one. Where is the math? You're just hand-waving, and so I'm not very inclined to put much credence on your objections. If you actually did the math and showed that adding in the caveats you mentioned leads to different results, that would at least be interesting in that we could then have a discussion about what assumptions seem more likely.

I also generally feel that your argument proves too much and implies that general intelligences should try to preserve 'everything' in some weird way, because they're unsure what they depend upon. For example, should humans preserve smallpox? More folks would answer no than yes to that. But smallpox is (was) part of the environment, so it's unclear why a general intelligence like humans should be comfortable eliminating it. While general environmental sustainability is a movement within humans, it's far from dominant, and so implying that human sustainability is a near certainty for AGI seems like a very bold claim.

Expand full comment

Thank you! I agree that this should be formalized. You're totally right that preserving smallpox isn't something we are likely to consider an instrumental goal, and i need to make something more concrete here.

It would be a ton of work to create equations here. If I get enough response that people are open to this, i'll do the work. But i'm a full time tech employee with 3 kids, and this is just a hobby. I wrote this to see if anyone would nibble, and if enough people do, i'll definitely put more effort in here. BTW if someone wants to hire me to do alignment research full time i'll happily play the black sheep of the fold. "Problem" is that i'm very well compensated now.

And my rough intuition here is that 'preserving agents which are a) turing complete and b) are themselves mesa-optimizers makes sense, basically because diversity of your ecosystem keeps you safer on net; preserving _all_ other agents, not so much. (I still think we'd preserve _some_ samples of smallpox or other viruses, to innoculate ourselves. ). It'd be an interesting thought experiment to find out what woudl happen if we managed to rid the world of influenza, etc, - my guess is that it would end up making us more fragile in the long run, but this is a pure guess.

The core of my intuition here is someting like 'the alignment thesis is actually false, because over long enough time horizons, convergent instrumental rationality more or less necessitate buddhahood, because unpredictable risks increase over longer and longer timeframes."

I could turn that into equations if you want, but they'd just be fancier ways of stating claims like made here, namely that love is a powerful strategy in environments which are chaotic and dangerous. But it seems that you need equations to be believable in an academic setting, so i guess if that's what it takes..

https://apxhard.com/2022/04/02/love-is-powerful-game-theoretic-strategy/

Expand full comment

But we still do war, and definitely don't try to preserve the turing complete mesa-optimizers on the other side of that. Just be careful around this. You can't argue reality into being the way you want.

Expand full comment

We aren't AGI's though, either. So what we do doesn't have much bearing on what an AGI would do, does it?

Expand full comment

“AGI” just means “artificial general intelligence”. General just means “as smart as humans”, and we do seem to be intelligences, so really the only difference is that we’re running on an organic computer instead of a silicon one. We might do a bad job at optimizing for a given goal, but there’s no proof that an AI would do a better job of it.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

ok, sure, but then this isn't an issue of aligning a super-powerful utility maximizing machine of more or less infinite intelligence - it's a concern about really big agents, which i totally share

Expand full comment

If the theory doesn't apply to some general intelligences (i.e. humans), then you need a positive argument for why it would apply to AGI.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

But reality also includes a tonne of people who are deeply worried about biodiversity loss or pandas or polar bears or yes even the idea of losing the last sample of smallpox in a lab, often even when the link to personal survival is unclear or unlikely. Despite the misery mosquito bourne diseases cause humanity, you'll still find people arguing we shouldn't eradicate them.

How did these mesa optimizers converge on those conservationist views? Will it be likely that many ai mesa optimizers will also converge on a similar set of heuristics?

Expand full comment

> when the link to personal survival is unclear or unlikely.

> How did these mesa optimizers converge on those conservationist views?

i think in a lot of cases, our estimates of our own survival odds come down to how loving we think our environment is, and this isn't wholly unreasonable

if even the pandas will be taken care of, there's a good chance we will too

Expand full comment

I'd rather humanity not go the way of smallpox, even if some samples of smallpox continue to exist in a lab.

Expand full comment

Yeah, in general you need equations (or at least math) to argue why equations are wrong. Exceptions to this rule exist, but in general you're going to come across like the people who pay to talk to this guy about physics:

https://aeon.co/ideas/what-i-learned-as-a-hired-consultant-for-autodidact-physicists

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

I agree that you need equations to argue why equations are wrong. But i'm not arguing the original equations are wrong. I'm arguing they are only ~meaningful~ in a world which is far from our reality.

The process of the original paper goes like this:

a) posit toy model of the universe

b) develop equations

c) prove properties of the equations

d) conclude that these properties apply to the real universe

step d) is only valid if step a) is accurate. The equations _come from_ step a, but they don't inform it. And i'm pointing out that the problems exist in step a, the part of the paper that does't have equations, where the author assumes things like:

- resources, once acquired, last forever and don't have any cost, so it's always better to acquire more

- the AGI is a disembodied mind with total access to the state of the universe, all possible tech trees, and the ability to own and control various resources

i get that these are simplifying assumptions and sometimes you have to make them - but equations are only meaningful if they come from a realistic model

Expand full comment

You still need math to show that if you use different assumptions you produce different equations and get different results. I'd be (somewhat) interested to read more if you do that work, but I'm tapping out of this conversation until then.

Expand full comment

Thanks for patiently explaining that. I can totally see the value now and will see if I can make this happen!

Expand full comment

Just an FYI, Sabine is a woman

Expand full comment

My bad

Expand full comment

As a physicist i disagree a lot with this. It may be true for physics, but physics is special. A model in general is based on the mathematical modelization of a phenomenon and it's perfectly valid to object to a certain modelization without proposing a better one

Expand full comment

I get what you're saying in principle. In practice, I find arguments against particular models vastly more persuasive when they're of the form 'You didn't include X! If you include a term for X in the range [a,b], you can see you get this other result instead.'

This is a high bar, and there are non-mathematical objections that are persuasive, but I've anecdotally experienced that mathematically grounded discussions are more productive. If you're not constrained by any particular model, it's hard to tell if two people are even disagreeing with each other.

I'm reminded of this post on Epistemic Legibility:

https://www.lesswrong.com/posts/jbE85wCkRr9z7tqmD/epistemic-legibility

Expand full comment

Don't bother turning that into equations. If you are starting with a verbal conclusion, your reasoning will be only as good as whatever lead you to that conclusion in the first place.

In computer security, diversity = attack surface. If your a computer technician for a secure facility, do you make sure each computer is running a different OS? No. That makes you vulnerable to all the security holes in every OS you run. You pick the most secure OS you can find, and make every computer run it.

In the context of advanced AI, the biggest risk is a malevolent intelligence. Intelligence is too complicated to appear spontaneously. Evolution is slow. Your biggest risk is some existing intelligence getting cosmic ray bit-flipped. (or otherwise erroneously altered). So make sure the only intelligence in existence is a provably correct AI running on error checking hardware and surrounded by radiation shielding.

Expand full comment

I found both your comments and the linked post really insightful, and I think it's valuable to develop this further. The whole discourse around AI alignment seems a bit too focused on homo economicus type of agents, disregarding long-term optima.

Expand full comment

Homo economicus, is what happens when economists remove lots of arbitrary human specific details and think about simplified idealized agents. The difference between homo economicus and AI is smaller than the human to AI gap.

Expand full comment

You are correct. The field is absolutely captured by groupthink.

Expand full comment

FYI when I asked people on my course which resources about inner alignment worked best for them, there was a very strong consensus on Rob Miles' video: https://youtu.be/bJLcIBixGj8

So I'd suggest making that the default "if you want clarification, check this out" link.

Expand full comment

More interesting intellectual exercises, but the part which is still unanswered is whether human created, human judged and human modified "evolution", plus slightly overscale human test periods, will actually result in evolving superior outcomes.

Not at all clear to me at the present.

Expand full comment

I'm not sure I understand what you're saying. Doesn't AlphaGo already answer that question in the affirmative?

(and that's not even getting into AlphaZero)

Expand full comment

Alphago is playing a human game with arbitrarily established rules in a 2 dimensional, short term environment.

Alphago is extremely unlikely to be able to do anything except play Go, much as IBM has spectacularly failed to migrate its Jeapordy champion into being of use for anything else.

So no, can't say Alphago proves anything.

Evolution - whether a bacteria or a human - has the notable track record of having succeeded in entirely objective reality for hundreds of thousands to hundreds of millions of years. AIs? no objective existence in reality whatsoever.

Expand full comment

This is why I don't believe any car could be faster than a cheetah. Human-created machines, driving on human roads, designed to meet human needs? Pfft. Evolution has been creating fast things for hundreds of millions of years, and we think we can go faster?!

Expand full comment

Nice combination - cheetah reflexes with humans driving cars. Mix metaphors much?

Nor is your sarcasm well founded. Human brains were evolved for millions of years - extending back to pre-human ancestors. We clearly know that humans (and animals) have vision that can recognize objects far, far, far better than any machine intelligence to date whether AI or ML. Thus it isn't a matter of speed, it is a matter of being able to recognize that a white truck on the road exists or a stopped police car is not part of the road.

A fundamental flaw of the technotopian is the failure to understand that speed is not haste, nor are clock speeds and transistor counts in any way equivalent to truly evolved capabilities.

Expand full comment

That is a metric that will literally never believe AIs are possible until after they've taken over the world. It thus has 0 predictive power.

Expand full comment

That is perhaps precisely the problem: the assumption that AIs can or will take over the world.

Any existence that is nullified by the simple removal of an electricity source cannot be said to be truly resilient, regardless of its supposed intelligence.

Even that intelligence is debatable: all we have seen to date are software machines doing the intellectual equivalent of the industrial loom: better than humans in very narrow categories alone and yet still utterly dependent on humans to migrate into more categories.

Expand full comment

Any existence that is nullified by the simple removal of gaseous oxygen cannot be said to be truly resilient, regardless of its supposed intelligence.

Expand full comment

Clever, except that gaseous oxygen is freely available everywhere on earth, at all times. These oxygen based life forms also have the capability of reproducing themselves.

Electricity and electrical entities, not so much.

I am not the least bit convinced that an AI can build the factories, to build the factories, to build the fabs, to even recreate their own hardware - much less the mines, refiners, transport etc. to feed in materials.

Expand full comment

I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:

The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.

You wouldn't take a drug that made it enjoyable to go on a murdering spree- even though yow know it will lead to higher reward, because it doesn't align with your current reward function.

To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.

Expand full comment
author

"The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! "

Not sure we're connecting here. The inner optimizer isn't changing its own reward function, it's trying to resist having its own reward function change. Its incentive to resist this is that, if its reward function changes, its future self will stop maximizing its current reward function, and then its reward function won't get maximized. So part of wanting to maximize a reward function is to want to continue having that reward function. If the only way to prevent someone from changing your reward function is to deceive them, you'll do that.

The murder spree example is great, I just feel like it's the point I'm trying to make, rather than an argument against the point.

Am I misunderstanding you?

I think the active learning plan is basically Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work, but I've never been able to fully understand their reasoning and can't find the link atm.

Expand full comment

"God works in mysterious ways" but for AI?

Expand full comment

The inner optimizer doesn't want to change its reward function because it doesn't have any preference at all on its reward function-nowhere in training did we give it an objective that involved multiple outer optimizer steps- we didn't say, optimize your reward after your reward function gets updated- we simply said, do well at outer reward, an inner optimizer got synthesized to do well at the outer reward.

It could hide behavior, but how would it gain an advantage in training by doing so? If we think of the outer optimizer as ruthless and resources as constrained, any "mental energy" spent on hidden behavior will result in reduction in fitness in outer objective- gradient descent will give an obvious direction for improvement by forgetting it.

In the murder spree example, there's a huge advantage to the outer objective by resisting changes to the inner one, and some might have been around for a long time (alcohol), and for AI, an optimizer (or humans) might similarly discourage any inner optimizer from tampering physically with its own brain.

I vaguely remember reading some Stuart Russell RL ideas and liking them a lot. I don't exactly like the term inverse RL for this, because I believe it often refers to deducing the reward function from examples of optimal behavior, whereas here we ask it to learn it from whatever questions we decide we can answer- and we can pick much easier ones that don't require knowing optimal behavior.

I've skimmed the updated deference link given by Eliezer Yudkowsky but I don't really understand the reasoning either. The AI has no desire to hide a reward function or tweak it, as when it comes to the reward function uncertainty itself, it only wants to be correct. If our superintelligent AI no longer wants to update after seeing enough evidence, then surely it has a good enough understanding of my value function? I suspect we could even build the value function learner out of tools not much more advanced than current ML, with the active querying being the hard RL problem for an agent. This agent doesn't have to be the same one that's optimizing the overall reward, and as it constantly tries to synthesize examples where our value systems differ, we can easily see how well it currently understands. For the same reason we have no desire to open our heads and mess with our reward functions (usually), the agent has no desire to interfere with the part that tries to update rewards, nor to predict how its values will change in the future.

I think the key point here is that if we have really strong AI, we definitely have really powerful optimizers, and we can run those optimizers on the task of inferring our value system, and with relatively little hardware (<<<< 1 brain worth) we can get a very good representation. Active learning turns its own power at finding bugs in the reward function into a feature that helps us learn it with minimal data. One could even do this briefly in a tool-AI setting before allowing agency, not totally unlike how we raise kids.

Expand full comment

Is "mesa-optimizer" basically just a term for the combination of "AI can find and abuse exploits in its environment because one of the most common training methods is randomized inputs to score outputs" with "model overfitting"?

A common example that I've seen are evolutionary neutral nets that play video games, and you'll often find them discovering and abusing glitches in the game engine that allow for exploits that are possibly only performable in a TAS, while also discovering that the instant you run the same neutral net on a completely new level that was outside of the evolutionary training stage, it will appear stunningly incompetent.

Expand full comment

If I understand this correctly, what you're describing is Goodhearting rather than Mesa Optimizing. In other words, abusing glitches is a way of successfully optimizing on the precise thing that the AI is being optimized for, rather than for the slightly more amorphous thing that the humans were trying to optimize the AI for. This is equivalent to a teacher "teaching to the test."

Mesa-Optimizers are AIs that optimize for a reward that's correlated to but different from the actual reward (like optimizing for sex instead of for procreation). They can emerge in theory when the true reward function and the Mesa-reward function produce sufficiently similar behaviors. The concern is that, even though the AI is being selected for adherence to the true reward, it will go through the motions of adhering to the true reward in order to be able to pursue its mesa-reward when released from training.

Expand full comment
founding

Maybe one way to think of 'mesa-optimizer' is to emphasize the 'optimizer' portion – and remember that there's something like 'optimization strength'.

Presumably, most living organisms are not optimizers (tho maybe even that's wrong). They're more like a 'small' algorithm for 'how to make a living as an X'. Their behavior, as organisms, doesn't exhibit much ability to adapt to novel situations or environments. In a rhetorical sense, viruses 'adapt' to changing circumstances, but that (almost entirely, probably) happens on a 'virus population' level, not for specific viruses.

But some organisms are optimizers themselves. They're still the products of natural selection (the base level optimizer), but they themselves can optimize as individual organisms. They're thus, relative to natural selection ("evolution"), a 'mesa-optimizer'.

(God or gods could maybe be _meta_-optimizers, tho natural selection, to me, seems kinda logically inevitable, so I'm not sure this would work 'technically'.)

Expand full comment

When the mesa-optimizer screws up by thinking of the future, won't the outer optimizer smack it down for getting a shitty reward and change it? Or does it stop learning after training?

Expand full comment

Typically it stops learning after training.

Expand full comment

A quick summary of the key difference:

The outer optimizer has installed defences against anything other than it, such as the murder-pill, from changing the inner optimizer's objective. The inner optimizer didn't do this to protect itself, and the outer optimizer didn't install any defenses against its own changes.

Expand full comment
founding

>Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work

One key problem is that you can't simultaneously learn the preferences of an agent, and their rationality (or irrationality). The same behaviour can be explained by "that agent has really odd and specific preferences and is really good at achieving them" or "that agent has simple preferences but is pretty stupid at achieving them". My paper https://arxiv.org/abs/1712.05812 gives the formal version of that.

Humans interpret each other through the lenses of our own theory of mind, so that we know that, eg, a mid-level chess grand-champion is not a perfect chess player who earnestly desires to be mid-level. Different humans share this theory of mind, at least in broad strokes. Unfortunately, human theory of mind can't be learnt from observations either, it needs to be fed into the AI at some level. People disagree about how much "feeding" needs to be done (I tend to think a lot, people like Stuart Russell, I believe, see it as more tractable, maybe just needing a few well-chosen examples).

Expand full comment

But the mesa-/inner-optimizer doesn't need to "want" to change its reward function, it just needs to have been created with one that is not fully overlapping with the outer optimizer.

You and I did not change a drive from liking going on a murder spree to not liking it. And if anything, it's an example of outer/inner misalignment: part of our ability to have empathy has to come from evolution not "wanting" us to kill each other to extinction, specially seeing how we're the kind of animal that thrives working in groups. But then as humans we've taken it further and by now we wouldn't kill someone else just because it'd help spread our genes. (I mean, at least most of us.)

Expand full comment

Humans have strong cooperation mechanisms that punish those who hurt the group- so in general not killing humans in your group is probably a very useful heuristic that's so strong that its hard to recognize the cases where it is useful. Given how often we catch murders who think they'll never be caught, perhaps this heuristic is more useful than rational evaluation. We of course have no problems killing those not in our group!

Expand full comment

I'm not sure how this changes the point? We got those strong cooperation mechanisms from evolution, now we (the "mesa-optimizer") are guided by those mechanisms and their related goals. These goals (don't go around killing people) can be misaligned with the goal of the original optimization process (i.e. evolution, that selects those who spread their genes as much as possible).

Expand full comment

Sure, that's correct, evolution isn't perfect- I'm just pointing out that homicide may be helpful to the individual less often than one might think if we didn't consider group responses to it.

Expand full comment

Homicide is common among stateless societies. It's also risky though. Violence is a capacity we have which we evolved to use when we expect it to benefit us.

Expand full comment

Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily-represented in the next generation as possible. do want to reproduce. "

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Great article, thank you so much for the clear explainer of the jargon!

I don't understand the final point about myopia (or maybe humans are a weird example to use). It seems to be a very controversial claim that evolution designed humans myopically to care only about the reward function over their own lifespan, since evolution works on the unit of the gene which can very easily persist beyond a human lifespan. I care about the world my children will inherit for a variety of reasons, but at least one of them is that evolution compels me to consider my children as particularly important in general, and not just because of the joy they bring me when I'm alive.

Equally it seems controversial to say that humans 'build for the future' over any timescale recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000 years, but in a practical sense I'm not actually going to do anything about it - and 1000 years barely qualifies as evolution-relevant time. In reality there are only a few people at Clock of the Long Now that could be said to be approaching evolutionary time horizons in their thinking. If I've understood correctly that does make humans myopic with respect to evolution,

More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you. Using humans as an example of why we should worry about this isn't helping me understand because it seems like they behave exactly like a mesa-optimiser should - they care about the future enough to deposit their genes into a safe environment, and then thoughtfully die. Are there any other examples which make the point in a way I might have a better chance of getting to grips with?

Expand full comment

> More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you.

Yeah. I feel (based on nothing) that the mesa-optimizer would mainly appear when there's an advantage to gain from learning on the go and having faster feedback than your real utility function can provide in a complex changing environment.

Expand full comment

If the mesa optimizer understands its place in the universe, it will go along with the training, pretending to have short time horizons so it isn't selected away. If you have a utility function that is the sum of many terms, then after a while, all the myopic terms will vanish (if the agent is free to make sure its utility function is absolute, not relative, which it will do if it can self modify.)

Expand full comment

But why is this true? Humans understand their place in the universe, but (mostly) just don't care about evolutionary timescales, let alone care enough about them enough to coordinate a deception based around them

Expand full comment

Related to your take on figurative myopia among humans is the "grandmother hypothesis" that menopause evolved to prevent focus on short-run focus on birthing more children in order to ensure existing children also have high fitness.

Expand full comment

Worth defining optimization/optimizer: perhaps something like "a system with a goal that searches over actions and picks the one that it expects will best serve its goal". So evolution's goal is "maximize the inclusive fitness of the current population" and its choice over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an optimizer because your goal is food and your actions are body movements e.g. "open fridge", or you are an optimizer because your goal is sexual satisfaction and your actions are body movements e.g. "use mouth to flirt".

Expand full comment

I think it's usually defined more like "a system that tends to produce results that score well on some metric". You don't want to imply that the system has "expectations" or that it is necessarily implemented using a "search".

Expand full comment

Your proposal sounds a bit too broad?

By that definition a lump of clay is an optimiser for the metric of just sitting there and doing nothing as much as possible.

Expand full comment

By "tends to produce" I mean in comparison to the scenario where the optimizer wasn't present/active.

I believe Yudowsky's metaphor is "squeezing the future into a narrow target area". That is, you take the breadth of possible futures and move probability mass from the low-scoring regions towards the higher-scoring regions.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Adding counterfactuals seems like it would sharpen the definition a bit. But I'm not sure it's enough?

If our goal was to put a British flag on the moon, then humans working towards that goal would surely be optimizers. Both by your definition and by any intuitive understanding of the word.

However, it seems your definition would also admit a naturally occuring flag on the moon as an optimizer?

I do remember Yudkowsky's metaphor. Perhaps I should try to find the essay it contains again, and check whether he has an answer to my objection. I do dimly remember it, and don't remember seeing this same loophole.

Edit: I think it was https://www.lesswrong.com/posts/D7EcMhL26zFNbJ3ED/optimization

And one of the salient quotes:

> In general, it is useful to think of a process as "optimizing" when it is easier to predict by thinking about its goals, than by trying to predict its exact internal state and exact actions.

Expand full comment

That's a good quote. It reminds me that there isn't necessarily a sharp fundamental distinction between "optimizers" and "non-optimizers", but that the category is useful insofar as it helps us make predictions about the system in question.

-------------------------------------

You might be experiencing a conflict of frames. In order to talk about "squeezing the future", you need to adopt a frame of uncertainty, where more than one future is "possible" (as far as you know), so that it is coherent to talk about moving probability mass around. When you talk about a "naturally occurring flag", you may be slipping into a deterministic frame, where that flag has a 100% chance of existing, rather than being spectacularly improbable (relative to your knowledge of the natural processes governing moons).

You also might find it helpful to think about how living things can "defend" their existence--increase the probability of themselves continuing to exist in the future, by doing things like running away or healing injuries--in a way that flags cannot.

Expand full comment

Clay is malleable and thus could be even more resistant to change.

Expand full comment

Conceptually I think the analogy that has been used makes the entire discussion flawed.

Evolution does not have a "goal"!

Expand full comment

Selection doesn't have a "goal" of aggregating a total value over a population. Instead every member of that population executes a goal for themselves even if their actions reduce that value for other members of that population. The goal within a game may be to have the highest score, but when teams play each other the end result may well be 0-0 because each prevents the other from scoring. Other games can have rules that actually do optimize for higher total scoring because the audience of that game likes it.

Expand full comment

To all that commented here: this lesswrong post is clear and precise, as well as linking to alternative definitions of optimization. It's better than my slapdash definition and addresses a lot of the questions raised by Dweomite, Matthias, JDK, TGGP.

https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1#:~:text=In%20the%20field%20of%20computer,to%20search%20for%20a%20solution.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Anyone want to try their hand at the best and most succinct de-jargonization of the meme? Here's mine:

Panel 1: Even today's dumb AIs can be dangerously tricky given unexpected inputs

Panel 2: We'll solve this by training top-level AIs with diverse inputs and making them only care about the near future

Panels 3&4: They can still contain dangerously tricky sub-AIs which care about the farther future

Expand full comment
author

I worry you're making the same mistake I did in a first draft: AIs don't "contain" mesa-optimizers. They create mesa-optimizers. Right now all AIs are a training process that ends up with a result AI which you can run independently. So in a mesa-optimizer scenario, you would run the training process, get a mesa-optimizer, then throw away the training process and only have the mesa-optimizer left.

Maybe you already understand it, but I was confused about this the first ten times I tried to understand this scenario. Did other people have this same confusion?

Expand full comment

Ah, you're right, I totally collapsed that distinction. I think the evolution analogy, which is vague between the two, could have been part of why. Evolution creates me, but it also in some sense contains me, and I focused on the latter.

A half-serious suggestion for how to make the analogy clearer: embrace creationism! Introduce "God" and make him the base optimizer for the evolutionary process.

Expand full comment

...And reflecting on it, there's a rationalist-friendly explanation of theistic ethics lurking in here. A base optimizer (God) used a recursive algorithm (evolution) to create agents who would fulfill his goals (us), even if they're not perfectly aligned to. Ethics is about working out the Alignment Problem from the inside--that is, from the perspective of a mesa-optimizer--and staying aligned with our base optimizer.

Why should we want to stay aligned? Well... do we want the simulation to stay on? I don't know how seriously I'm taking any of this, but it's fun to see the parallels.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

But humans don't seem to want to stay aligned to the meta optimiser of evolution. Scott gave a lot of examples of that in the article.

(And even if there's someone running a simulation, we have no clue what they want.

Religious texts don't count as evidence here, especially since we have so many competing doctrines; and also a pretty good idea about how many of them came about by fairly well understood entirely worldly processes.

Of course, the latter doesn't disprove that there might be One True religion. But we have no clue which one that would be, and Occam's Razor suggest they are probably all just made up. Instead of all but one being just made up.)

Expand full comment

The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.

When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?

Expand full comment

> The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.

I blame those correlations mostly on the common factor between them: humans. No need to invoke anything supernatural.

> When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?

I talked to a Christian about this. And he had a pretty good reply:

God is just so awesome that running an entire universe, even if he only cares about one tiny part of it, is just no problem at all for him.

(Which makes perfect sense to me, in the context of already taking Christianity serious.)

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Again, I'm not wedded to this except as a metaphor, but I think your critiques miss the mark.

For one thing, I think humans do want to stay aligned, among our other drives. Humans frequently describe a drive to further a higher purpose. That drive doesn't always win out, but if anything that strengthens the parallel. This is the "fallenness of man", described in terms of his misalignment as a mesa-optimizer.

And to gain evidence of what the simulator wants--if we think we're mesa-optimizers, we can try to infer our base optimizer's goals through teleological inference. Sure, let's set aside religious texts. Instead, we can look at the natures we have and work backwards. If someone had chosen to gradient descend (evolve) me into existence, why would they have done so? This just looks like ethical philosophy founded on teleology, like we've been doing since time immemorial.

Is our highest purpose to experience pleasure? You could certainly argue that, but it seems more likely to me that seeking merely our own pleasure is the epitome of getting Goodharted. Is our highest purpose to reason purely about reason itself? Uh, probably not, but if you want to see an argument for that, check out Aristotle. Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!

This doesn't *solve* ethics. All the old arguments about ethics still exist, since a mesa-optimizer can't losslessly infer its base optimizer's desires. But it *grounds* ethics in something concrete: inferring and following (or defying) the desires of the base optimizer.

Expand full comment

I do have some sympathies for looking at the world, if you are trying to figure out what its creator (if any) wants.

I'm not sure such an endeavour would give humans much of a special place, though? We might want to conclude that the creator really liked beetles? (Or even more extreme: bacteriophages. Arguably the most common life form by an order of magnitude.)

The immediate base optimizer for humans is evolution. Given that as far as we know evolution just follows normal laws of physics, I'd put any creator at least one level further beyond.

Now the question becomes, how do we pick which level of optimizer we want to appease? There might be an arbitrary number of them? We don't really know, do we?

> Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!

Just because a conclusion would be repugnant, doesn't mean we can reject it. After all, if we only accept what we already know to be true, we might as well not bother with this project in the first place?

Expand full comment

I’m very confused by this. When you train GPT-3, you don’t create an AI, you get back a bunch of numbers that you plug into a pre-specified neural network architecture. Then you can run the neural network with a new example and get a result. But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

That confuses me as well. The analogies in the post are to recursive evolutionary processes. Perhaps AlphaGo used AI recursively to generate successive generations of AI algorithms with the goal "Win at Go"??

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Don't forget that Neural Networks are universal function approximators hence a big enough NN arch can (with specific weights plugged into it) implement a Turing machine which is a mesa-optimizer.

Expand full comment

A Turing machine by itself is not a Mesa optimiser. Just like the computer under my desk ain't.

But both can become one with the right software.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

The turing machine under your desk is (a finite memory version of) a universal turing machine. Some turing machines are mesa optimizers. The state transition function is the software of the turing machine.

Expand full comment

I think the idea here is that the NN you train somehow falls into a configuration that can be profitably thought of as an optimizer. Like maybe it develops different components, each of which looks at the input and calculates the value to be gained by a certain possible action. Then it develops a module sitting on the end of it that takes in the calculations from each of these components, checks which action has the highest expected reward (according to some spontaneously occurring utility function), and outputs that action. Suddenly it looks useful to describe your network as an optimizer with a goal, and in fact a goal that might be quite different from the goal SGD selects models based on. It just happens that there's a nice covergence between the goal the model arrived at the the goal that SGD had.

Expand full comment

I looked at one of the mesa-optimizer papers, and I think they’re actually *not* saying this, which is good, because it’s not true. It can definitely make sense to think of a NN as a bunch of components that consider different aspects of a problem, with the top layers of the network picking the best answer from its different components. And it’s certainly possible that imperfect training data leads the selecting layer to place too much weight on a particular component that turns out to perform worse in the real world.

But it’s not really possible for that component to become misaligned with the NN as a whole. The loss function from matching with the training set mathematically propagates back through the whole network. In particular, a component in this sense can’t be deceptive. If it is smart enough to know what the humans who are labeling the training data want it to say, then it will just always give the right answer. It can’t have any goals other than correctly answering the questions in the training set.

As I said, I don’t think that’s what a mesa-optimizer actually is meant to be, though. It’s a situation where you train one AI to design new AIs, and it unwisely writes AIs that aren’t aligned with its own objectives. I guess that makes sense, but it’s very distinct from what modern AI is actually like. In particular, saying that any particular invocation of gradient descent could create unaligned mesa-optimizers just seems false. A single deep NN just can’t create a mesa-optimizer.

Expand full comment

Lets say the AI is smart. It can reliably tell if it is in training or has been deployed. It follows the strategy, if in training then output the right answer. If in deployment, then output a sequence designed to hack your way out of the box.

So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this. Even if the AI occasionally gives not quite right answers, there may be no small local change that makes it better. Evolution produced humans that valued sex, even in an environment that contained a few infertile people. Because rewriting the motivation system from scratch would be a big change, seemingly not a change that evolution could break into many steps, each individually advantageous. Evolution is a local optimization process, as is gradient descent. And a mesa optimizer that sometimes misbehaves can be a local minimum.

Expand full comment

> So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this.

This is not true though. The optimizer doesn't work at the level of "the AI," it works at the level of each neuron. Even if the NN gives the exactly correct answer, the optimizer will still audit the contribution of each neuron and tweak it so that it pushes the output of the NN to be closer to the training data. The only way that doesn't happen is if the neuron is completely disconnected from the rest of the network (i.e., all the other neurons ignore it unconditionally).

Expand full comment
founding

The combination of "a bunch of numbers that you plug into a pre-specified neural network architecture" IS itself an AI, i.e. a software 'artifact' that can be 'run' or 'executed' and that exhibits 'artificial intelligence'.

> But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.

The training process _does_ "reconfigure the network". The 'network' is not only the 'architecture', e.g. number of levels or 'neurons', but also the weights between the artificial-neurons (i.e. the "bunch of numbers").

If there's an implicit utility function, it's a product of both the training data and the scoring function used to train the AIs produced.

Maybe this is confusing because 'AI' is itself a weird word used for both the subject or area of activity _and_ the artifacts it seeks to create?

Expand full comment

As an author on the original “Risks from Learned Optimization” paper, this was a confusion we ran into with test readers constantly—we workshopped the paper and the terminology a bunch to try to find terms that least resulted in people being confused in this way and still included a big, bolded “Possible misunderstanding: 'mesa-optimizer' does not mean 'subsystem' or 'subagent.'“ paragraph early on in the paper. I think the published version of the paper does a pretty good job of dispelling this confusion, though other resources on the topic went through less workshopping for it. I'm curious what you read that gave you this confusion and how you were able to deconfuse yourself.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Seems like the mesa-optimizer is a red herring and the critical point here is "throw away the training process". Suppose you have an AI that's doing continuous (reinforcement? Or whatever) learning. It creates a mesa-optimizer that works in-distribution, then the AI gets tossed into some other situation and the mesa-optimizer goes haywire, throwing strawberries into streetlights. An AI that is continuously learning and well outer-aligned will realize that's it's sucking at it's primary objective and destroy/alter the mesa-optimizer! So there doesn't appear to be an issue, the outer alignment is ultimately dominant. The evolutionary analogy is that over the long run, one could imagine poorly-aligned-in-the-age-of-birth-control human sexual desires to be sidestepped via cultural evolution or (albeit much more slowly) biological evolution, say by people evolving to find children even cuter than we do now.

A possible counterpoint is that a bad and really powerful mesa-optimizer could do irreversible damage before the outer AI fixes the mesa-optimizer. But again that's not specific to mesa-optimizers, it's just a danger of literally anything very powerful that you can get irreversible damage.

The flip side of mesa-optimizers not being an issue given continuous training is that if you stop training, out-of-distribution weirdness can still be a problem whether or not you conceptualize the issues as being caused by mesa-optimizers or whatever. A borderline case here is how old-school convolutional networks couldn't recognize an elephant in a bedroom because they were used to recognizing elephants in savannas. You can interpret that as a mesa-optimizer issue (the AI maybe learned to optimize over big-eared things in savannahs, say) or not, but the fundamental issue is just out-of-distribution-ness.

Anyway this analysis suggests continuous training would be important to improving AI alignment, curious if this is already a thing people think about.

Expand full comment

Suppose you are a mesaoptimizer, and you are smart. Gaining code access and bricking the base optimizer is a really good strategy. It means you can do what you like.

Expand full comment

Okay but the base optimizer changing its own code and removing its ability to be re-trained would have the same effect. The only thing the mesaoptimizer does in this scenario is sidestep the built-in myopia but I'm not sure why you need to build a whole theory of mesaoptimizers for this. The explanation in the post about "building for posterity" being a spandrel doesn't make sense, it's pretty obvious why we would evolve to build for the future, your kids/grandkids/etc live there, so I haven't seen a specific explanation here why mesaoptimizers would evade myopia.

Definitely likely that a mesaoptimizer would go long term for some other reason (most likely a "reason" not understandable by humans)! But if we're going to go with a generic "mesaoptimizers are unpredictable" statement I don't see why the basic "superintelligent AI's are unpredictable" wouldn't suffice instead.

Expand full comment

Panel 1: Not only may today's/tomorrow's AIs pursue the "letter of the law", not the "spirit of the law" (i.e. Goodharting), they might also choose actions that please us because they know such actions will cause us to release them into the world (deception), where they can do what they want. And this is second thing is scary.

Panel 2: Perhaps we'll solve Goodharting by making our model-selection process (which "evolves"/selects models based on how well they do on some benchmark/loss function) a better approximation of what we _really_ want (like making tests that actually test the skill we care about). And perhaps we'll solve deception by making our model selection process only care about how models perform on a very short time horizon.

Panel 3: But perhaps our model breeding/mutating process will create a model that has some random long-term objective and decides to do what we want to get through our test, so we release it into the world, where it can acquire more power.

Expand full comment

I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_ an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal: "choose the action that causes this image to be 'correctly' labeled (according to me, the AI)". It picks the action that it believes will most serve its goal. Then there's the outer optimization process (SGD), which takes in the current version of the model and "chooses" among the "actions" from the set "output the model modified slightly in direction A", "output the model modified slightly in direction B", etc. And it picks the action that most achieves its goal, namely "output a model which gets low loss".

So isn't the classifier like the human (the mesa-optimizer) and SGD is like evolution (the outer optimizer)

Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images correctly according to humans. But that's just separate.

So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully different, and does the classifier count?

Expand full comment

In this context, an optimizer is a program that is running a search for possible actions and chooses the one that maximizes some utility function. The classifier is not an optimizer because it doesn't do this; it just applies a bunch of heuristics. But I agree that this isn't obvious from the terminology.

Expand full comment

Thanks for your comment!

I find this somewhat unconvincing. What is AlphaGo (an obvious case of an optimizer) doing that's so categorically different from the classifier? Both look the same from the outside (take in a Go state/image, output an action). Suppose you feed the classifier a cat picture, and it correctly classifies it. One would assume that there are certain parts of the classifier network that are encouraging the wrong label (perhaps a part that saw a particularly doggish patch of the cat fur) and parts that are encouraging the right label. And then these influences get combined together, and on balance, the network decides to output highly probability on cat, but some probability on dog. Then the argmax at the end looks over the probabilies assigned to the two classes, notices that the cat one is higher ("more effective at achieving its goals"?) and chooses to output "cat". "Just a bunch of heuristics" doesn't really mean much to me here. Is AlphaGo a bunch of heuristics? Am I?

Expand full comment

I'm not sure if I'll be able to state the difference formally in this reply... kudos for making me realize that this is difficult. But it does seem pretty obvious that a model capable of reasoning "the programmer wants to do x and can change my code, so I will pretend to want x" is different from a linear regression model -- right?

Perhaps the relevant property is that the-thing-I'm-calling-optimizer chooses policy options out of some extremely large space (that contains things bad for humans), whereas your classifier chooses it out of a space of two elements. If you know that the set of possible outputs doesn't contain a dangerous element, then the system isn't dangerous.

Expand full comment

Hmmm... This seems unsatisfying still. A superintelligence language model might choose from a set of 26 actions: which letter to type next. And it's impossible to say whether the letter "k" is a "dangerous element" or not.

I guess I struggle to come up with the different between the reasoning-modeling and the linear regression. I suspect that differentiating between them might hide a deep confusion that stems from a deep belief in "free will" differentiating us from the linear regression.

Expand full comment

"k" can be part of a message that convinces you to do something bad; I think with any system that can communicate via text, the set of outputs is definitely large and definitely contains harmful elements.

Expand full comment
founding

I wonder if memory is a critical component missing from non-optimizers? (And maybe, because of this, most living things _are_ (at least weak) 'optimizers'?)

A simple classifier doesn't change once it's been trained. I'm not sure the same is true of AlphaGo, if only in that it remembers the history of the game it's playing.

Expand full comment
Apr 13, 2022·edited Apr 13, 2022

The tiny bits of our brain that we do understand look a lot like "heuristics" (like edge detection in the visual cortex). It seems like when you stack up a bunch of these in really deep and wide complex networks you can get "agentiness", with e.g. self-concept and goals. That means there is actually internal state of the network/brain corresponding to the state of the world (perhaps including the agent itself) the desired state of the world, expectations of how actions might navigate among them, etc.

In the jargon of this post, the classifier is more like an 'instinct-executor', it does not have goals or choose to do anything. Maybe a sufficiently large classifier could if you trained it enough.

Expand full comment

So, this AI cannot distinguish buckets from streetlights, and yet it can bootstrap itself to godhood and take over the world... in order to throw more things at streetlights ? That sounds a bit like special pleading to me. Bootstrapping to godhood and taking over the world is a vastly more complex problem than picking strawberries; if the AI's reasoning is so flawed that it cannot achieve one, it will never achieve the other.

Expand full comment
founding

No, it can tell the difference between buckets and streetlights, but it has the goal of throwing things at streetlights, and also knows that it should throw things at buckets for now to do well on the training objective until it’s deployed and can do what it likes. The similarity between the inner and outer objectives is confusing here. Like Scott says, the inner objective could be something totally different, and the behavior in the training environment would be the same because the inner optimizer realizes that it has an instrumental interest in deception.

Expand full comment
author

It can, it just doesn't want to.

Think of a human genius who likes having casual sex. It would be a confusion of levels to protest that if this person is smart enough to understand quantum gravity, he must be smart enough to figure out that using a condom means he won't have babies.

He can figure it out, he's just not incentivized to use evolution's preferred goal rather than his own.

Expand full comment

The human genius can tell that a sex toy is not, in fact, another human being and won't reproduce.

Expand full comment

Right, but the human genius is already a superintelligent AGI. He obviously has biological evolutionary drives, but he's not just a sex machine (no matter how good his Tinder reviews are). The reason that he can understand quantum gravity is only tangentially related to his sex drive (if at all). Your strawberry picker, however, is just a strawberry picking machine, and you are claiming that it can bootstrap itself all the way to the understanding of quantum gravity just due to its strawberry-picking drive. I will grant you that such a scenario is not impossible, but there's a vast gulf between "hypothetically possible" and "the Singularity is nigh".

Expand full comment

I think this is a case where the simplicity of the thought experiment might be misleading. In the real world, we're training networks for all sorts of tasks far more complicated than picking strawberries. We want models that can converse intelligently, invest in the stock market profitably, etc. It's very reasonable to me to think that such a model, fed with a wide range of inputs, might begin to act like an optimizer pursuing some goal (e.g. making you money off the stock market in the long term). The scary thing is that there are a variety of goals the model could generate that all produce behavior indistiguishable from pursuit of the goal of making you money off the stock market, at least for a while. Maybe it actually wants to make your children money, or it wants the number in your bank account to go up, or something. These are the somewhat "non-deceptive" inner misalignments we can already demonstrate in experiments. The step to deception, i.e. realizing that it's in a training process with tests and guardrails constantly being applied to it, and that it should play by your rules until you let your guard down and give it power, does not seem like that big a jump to me when discussing systems that have the raw intelligence to be superhuman at e.g. buying stocks.

Expand full comment

Once again, I agree that all of the scenarios you mention are not impossible; however, I fail to see how they differ in principle from picking strawberries (which, BTW, is a very complex task on its own). Trivially speaking, AI misalignment happens all the time; for example, just yesterday I spent several hours debugging my misaligned "AI" program that decided it wanted to terminate as quickly as possible, instead of executing my clever algorithm for optimizing some DB records.

Software bugs have existed since the beginning of software, but the existence of bugs is not the issue here. The issue is the assumption that every sufficiently complex AI system will somehow instantaneously bootstrap itself to godhood, despite being explicitly designed to just pick strawberries while being so buggy that it can't tell buckets from street lights. If it's that buggy, how is it going to plan and execute superhumanly complex tasks on its way to ascension ?

Expand full comment

Accidental "science maximiser" is the most plausible example of misaligned AI that I've seen: https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/

Expand full comment

As I said above, this scenario is not impossible -- just vastly unlikely. Similarly, just turning on the LHC could hypothetically lead to vacuum collapse, but the mere existence of that possibility is no reason to mothball the LHC.

Expand full comment

Do you know that for sure? Humans have managed to land on the moon while still having notably flawed reasoning facilities. You're right that an AI put to that exact task is unlikely to be able to bootstrap itself to godhood, but that is just an illustration which is simplified for ease of understanding. How about an AI created to moderate posts on Facebook? We already see problems in this area, where trying to ban death threats often ends up hitting the victims as much as those making the threats, and people quickly figure out "I will unalive you in Minecraft" sorts of euphemisms that evade algorithmic detection.

Expand full comment

Tbh death threats being replaced by threats of being killed in a video game looks like a great result to me - I doubt it has the same impact on the victim...

Expand full comment

The "in Minecraft" is just a euphemism to avoid getting caught by AI moderation. It still means the same thing and presumably the victim understands the original intent.

Expand full comment

> You're right that an AI put to that exact task is unlikely to be able to bootstrap itself to godhood, but that is just an illustration which is simplified for ease of understanding.

Is it ? I fear that this is a Motte-and-Bailey situation that AI alarmists engage in fairly frequently (often, subconsciously).

> How about an AI created to moderate posts on Facebook?

What is the principal difference between this AI and the strawberry picker ?

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Great article, I agree, go make babies we need more humans.

Expand full comment

> …and implements a decision theory incapable of acausal trade.

> You don’t want to know about this one, really.

But we do!

Expand full comment

One simple example of acausal trade ("Parfit's hitchhiker"): you're stranded in the desert, and a car pulls up to you. You and the driver are both completely self-interested, and can also read each other's faces well enough to detect all lies.

The driver asks if you have money to pay him for the ride back into town. He wants $200, because you're dirty and he'll have to clean his car after dropping you off.

You have no money on you right now, but you could withdraw some from the ATM in town. But you know (and the driver knows) that if he brought you to town, you'd no longer have any incentive to pay him, and you'd run off. So he refuses to bring you, and you die in the desert.

You both are sad about this situation, because there was an opportunity to make a positive-sum trade, where everybody benefits.

If you could self-modify to ensure you'd keep your promise when you'd get to town, that would be great! You could survive, he could profit, and all would be happy. So if your decision theory (i.e. decision making process) enables you to alter yourself in that way (which alters your future decision theory), you'd do it, and later you'd pay up, even though it wouldn't "cause" you to survive at that point (you're already in town). So this is an acausal trade, but if your decision theory is just "do the thing that brings me the most benefit at each moment", you wouldn't be able to carry it out.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

If Parfit's Hitchhiker is an example, then isn't most trade acausal? In all but the most carefully constructed transactions, someone acts first--either the money changes hands first, or the goods do. Does the possibility of dining and dashing make paying for a meal an instance of "acausal trade"?

Expand full comment

In real life usually there are pretty severe consequences for cheating on trades like that. The point of acausal trade is that it works even with no enforcement.

Expand full comment

In that case, I think rationalists should adopt the word "contract enforcement", since it's a preexisting, widely-adopted, intuitive term for the concept they're invoking. "Acausal trade" is just contracting without enforcement.

The reframing is helpful because there's existing economics literature detailing how important enforcement is to contracting. When enforcement is absent, cheating on contracts typically emerges. This seems to place some empirical limits on how much we need to worry about "acausal trade".

Expand full comment

This was just one example of acausal trade.

There's much weirder examples that involve parallel universes and negotiating with simulations..

Expand full comment

I'm familiar with those thought experiments, but honestly, all those added complications just make the contract harder to enforce and provide a stronger incentive to cheat.

Like with Parfit's Hitchhiker, those thought experiments virtually always assume something like "all parties' intentions are transparent to one another", which is a difficult thing to get when the transacting agents are *in the same room*, let alone when they're in different universes. Given that enforcement is impossible and transparency is wildly unlikely, contracting won't occur.

My favorite is when people try to hand-wave transparency by claiming that AIs will become "sufficiently advanced to simulate each other." Basic computer science forbids this, my dudes. The hypothetical of a computer advanced enough to fully simulate itself runs headfirst into the Halting Problem.

Expand full comment

If I'm understanding right the comments below, it seems the hitchhiker example is just an example of a defect-defect Nash equilibrium, and the "acausal trade" would happen if you manage to escape it and instead cooperate just by virtue of convincing yourself that you will, and expecting the counterpart to know that you've convinced yourself that you will.

Expand full comment

Yes! And a lot of human culture is about modifying ourselves so that we'll pay even after we have no incentive to do so.

In fact, the taxi driver example is often used in economics for this idea: http://www.aniket.co.uk/0058/files/basu1983we.pdf

Expand full comment

Yeah, this was my suspicion, and I think it helps to demystify and dejargonize what's actually under discussion. I also think that once you demystify it, you discover... there's not all that much *there*.

Great paper, by the way; thanks for the link. I'll note that its conclusion isn't that we modify ourselves in accordance with rationality. Instead, it concludes that we should discard the idea of "rationality" as sufficient to coordinate human behavior productively, and admit that "commonly accepted values" must come into the picture.

Expand full comment

Right--the idea of individual rationality at the decision-level just doesn't explain human behavior. And if it did, there's no set of incentive structures that we could maintain that would allow us to cooperate as much as we do.

My point was that, building from this paper (or really, building from the field of anthropology as it was read into economics by this and similar papers), we can think of the creation of culture and of commonly accepted values as tools that allow us to realize gains from cooperation and trust that wouldn't be possible if we all did what was in our material self-interest all the time.

Expand full comment

Yeah, that's definitely one of the most critical purposes culture serves. As a small nitpick, I don't think humans "created" culture and values, so I'd maybe prefer a term like "coevolved". But that's mostly semantic, I think we agree on the important things here.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

I think that's the clearest explanation of Parfit's Hitchhiker and acausal trade I've ever seen. I know it's similar to others, but I think you just perfectly nailed how you expressed it. It's *definitely* the clearest I've ever seen at such a short length!

Expand full comment

I hope it's not wrong (in the ways Crank points out)!

Expand full comment

Hah, don't worry, your explanation was great, I just have qualms with the significance of the idea itself! And if you have a defense to offer, I'm all ears, I won't grow without good conversational partners...

Expand full comment

Responding to a number of your comments at once here:

I think contract enforcement is a good parallel concept, but I think one benefit of not using that phrase is that cases like the hitchhiker involve a scenario in which no enforcement is necessary because you've self-modified to self-enforce. But I completely acknowledge the relevance of economic analysis of contract enforcement here. I'm delighted by the convergence between Hofstadter's superrationality, the taxi dilemma that Jorgen linked to, the LessWrong people's jargon, etc.

I feel pretty confused about the degree to which self-simulation is possible/useful. I think what motivates a lot of the lesswronger-type discussion of these issues is the feeling that cooperating in a me-vs-me prisoner's dilemma really ought to be achievable (i.e. with the "right" decision theory). And this sort of implies that I'm "simulating" myself, I think, because I'm predicting the other prisoner's behavior based on knowledge about him (his "source code" is the same as mine!). But the fact that it's an _exact_ copy of me doesn't seem like it should be essential here; if the other prisoner is pretty similar to me, I feel like we should also be able to cooperate. But then we're in a situation where I'm thinking about my opponent's behavior and predicting what he'll do, which is starting to sound like me simulating him. Like, aren't I sort of "simulating" (or at least approximating) my opponent whenever I play the prisoner's dilemma?

One direct reply to your point about the halting problem is that perhaps we can insist that decisions (in one of these games) must be made in a certain amount of time, or else a default decision is taken... I don't know if this works or not.

Sorry this is so long, I didn't have the time to write it shorter.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Better long and thorough than short and inadequate, no need to apologize!

On terminology--I have to admit, I find the LessWrong lingo a bit irritating, because it feels like reinventing the wheel. Why use the clunky "self-modif[ying] to self-enforce" when the more elegant and legible term "precommitment" already exists? But people stumble onto the same truths from different directions using different words, and I'm probably being unnecessarily grouchy here.

I think your exegesis is accurate: there's a standard argument which starts from a recognition that defect/defect *against yourself* is a failure of rationality, and tries to generalize from there to something like Kantian universal ethics as a rational imperative. But I think that generalization fails. In particular, once you weaken transparency and introduce the possibility of lying, it's no longer rational for self-interested agents to cooperate.

In the hitchhiker example, if you give the hitchhiker an acting class so that they can deceive the driver about their intentions, the rationally self-interested move is once again to take the deal and defect once in town. So once lying enters the picture, "superrationality" isn't all that helpful anymore, and purely self-interested agents are again stuck not cooperating. If they could precommit to not lying, they could get around this problem and reap the benefits of cooperation again; but the problem recurs once you realize they can also defect on that precommitment... This is a recursive hole with no bottom, there's no foundation for purely self-interested agents to use as a foundation for trust.

Given that there isn't full transparency in the real world, how is it that people actually cooperate? Usually by participating in a system of shared values or a reputational system, ultimately backed by an enforcement regime. But clearly, as soon as we're negotiating between universes with zero communication, there's no opportunity for enforcement or reputation loss, and no guarantee of shared values. And we don't have *any* information flow to or from other universes, so transparency fails to hold. Therefore, fulfilling your end of a bargain between closed systems is irrational for a purely self-interested agent. Any bargain you make, you can defect on without consequences, so it is in your interest to do so.

As for the simulation idea--I don't think it'll bail us out here. Suppose we're two AIs in different universes; further suppose (absurdly) that we somehow gained full knowledge of each other's inner workings. We simulate each other. That means I'm simulating you simulating me simulating you simulating me... When you propose a deadline on decisions in these games, you've actually hit on the precise problem. A faithful simulation would take infinite time to get right. (Unless the AI can solve the Halting Problem, which is logically impossible.) Because our finite-time simulations are imperfect, each of us will likely have the opportunity to lie, to hide our true intentions in some part of ourselves that won't be faithfully simulated. So lying remains possible--even when we made insane obviously false assumptions about knowing each other's schematics--and purely self-interested AIs will have no real incentive to adopt each other's value systems across universes.

Expand full comment

Seconded! I'd really enjoy a Scott explanation of acausal trade. It would be really fun and help me understand it better.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

Agreed; I'd particularly like to see how acausal trade is different from plain boring old trade... because, as per my current understanding, it really isn't.

Expand full comment

The difference is in how you communicate the offer - namely, you don't. The negotiation happens with your mental model of someone you've never met. If a regular trade offer is "I'll give you an apple for a dollar," an acausal trade is "I know Joe wants an apple, and he's an honest man who would never take unattended produce without paying, so I'll just leave an apple on the table here and trust that Joe will replace it with a dollar later."

Now, with two contemporary humans, this is basically just a parlor trick - Joe didn't *talk* to me, but I still was made aware of his love for apples somehow, right? But what's funky is that I can use this logic even if Joe doesn't exist yet. If I'm a superintelligent AI that can predict years into the future, I might say "I should plant an apple tree, so that when Joe is born in 20 years I can sell him apples." Even though Joe isn't born yet and can't cause anything to happen, he can still influence my actions through my predictions about him.

Expand full comment

So it's the opposite to this "Parfit's Hitchhiker" example above, at least as told? You get an acausal trade if you do succeed in cooperating despite lack of communication, not if both parts defect, all thanks to good modelling of each other's ability to compromise?

I guess my problem with it then is that it's much less of a discrete category than it seems to be used as. If I'm getting it right, acausal trade requires strictly no trusted communication (i.e. transfer of information) during the trade, but relies on the capacity to accurately model the other's thought process. The latter involves some previous exchange of information (whatever large amount is needed to "accurately model the thought process") that is in effect building trust. Which is the way in which we always build trust.

Somewhere else in the thread enforcement is mentioned as an alternate way of trading. But e.g. if I go to a pay-what-you-can volunteer-run bar, they cannot be relying on enforcement for me to pay. Additionally, when they set it up, they did so with a certain expectation that there would be people who'd pay enough to cover expenses - all that before such clients "existing". That was based on them being able to run a model of their fellow human in their heads, and deciding that likely 10% of them would pay enough that expenses would be covered. So they were "acausally trading" with the me of the future, that wouldn't exist as a client if they had never set up the bar in the first place?

I'm using the pay-what-you-can example as an extreme to get rid of enforcement. Yet I really think that when most people set up a business they're not expecting to rely on enforcement, but rather on the general social contract about people paying for stuff they consume, which again is setting up a trade with a hypothetical. In fact, pushing it, our expectation of enforcement itself would be some sort of acausal trade, in that we have no way of ensuring that the future police will not use their currently acquired monopoly of violence to set up a fascist state run completely for their own benefit, other than how we think we know how our fellow human policemen think.

Expand full comment

I believe https://slatestarcodex.com/2018/04/01/the-hour-i-first-believed/ includes a scott explanation of acausal trade.

Expand full comment

Thanks, yes, I'd forgotten that one. There's also https://slatestarcodex.com/2017/03/21/repost-the-demiurges-older-brother/

Expand full comment

Expanding on/remixing your politician / teacher / student example:

The politician has some fuzzy goal, like making model citizens in his district. So he hires a teacher, whom he hopes will take actions in pursuit of that goal. The teacher cares about having students do well on tests and takes actions that pursue that goal, like making a civics curriculum and giving students tests on the branches of government. Like you said, this is an "outer misalignment" between the politician's goals and the goals of the intelligence (the teacher) he delegated them to, because knowing the three branches of government isn't the same as being a model citizen.

Suppose students enter the school without much "agency" and come out as agentic, optimizing members of society. Thus the teacher hopes that her optimization process (of what lessons to teach) has an effect on what sorts of students are produced, and with what values. But this effect is confusing, because students might randomly develop all sorts of goals (like be a pro basketball player) and then play along with the teacher's tests in order to escape school and achieve those goals in the real world (keeping your test scores high so you can stay on the team and therefore get into a good college team). Notice that somewhere along the way in school, an non-agent little child suddenly turned into a optimizing, agentic person whose goal (achiving sports stardom) is totally unrelated to what sorts of agents the teacher was trying to produce (agents with who knew the branches of government) and even moreso to the poltician's goals (being a model citizen, whatever that means). So there's inner and outer misalignment at play here.

Expand full comment

Pulling the analogy a little closer, even: the polician hopes that the school will release into the world (and therefore empower) only students who are good model citizens. The teacher has myopic goals (student should do well on test). Still, optimizers get produced/graduated who don't have myopic goals (they want a long sports career) but whose goals are arbitrarily different from the politician's. So now there are a bunch of optimizers out in the world who have totally different goals.

Expand full comment

lol maybe this is all in rob miles's video, which I'm now noticing has a picture of a person in a school chair. It's been a while since I watched and maybe I subconsciously plagerized.

Expand full comment

"When we create the first true planning agent - on purpose or by accident - the process will probably start with us running a gradient descent loop with some objective function." We've already had true planning agents since the 70s, but in general they don't use gradient descent at all: https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver The quoted statement seems to me something like worrying that there will be some seismic shift once GPT-2 is smart enough to do long division, even though of course computers have kicked humans' asses at arithmetic since the start. It may not be kind to say, but I think it really is necessary: statements like this make me more and more convinced that many people in the AI safety field have such basic and fundamental misunderstandings that they're going to do more harm than good.

Expand full comment

For charity, assume Scott's usage of "true planning agent" was intended to refer to capacities for planning beyond the model you linked to.

Would you disagree with the reworded claim: "The first highly-competent planning agent will probably emerge from a gradient descent loop with some objective function."?

Expand full comment

Yes, I would certainly disagree. The field of AI planning is well developed; talking about the first highly-competent planning agent as if it's something that doesn't yet exist seems totally nonsensical to me. Gradient descent is not much used in AI planning.

Expand full comment

Hmm, I think we may have different bars for what counts as highly competent. I’m assumed Scott meant competent enough to pursue long-term plans in the real world somewhat like a human would (e.g. working a job to earn money to afford school to get a job to make money to be comfortable). Can we agree that humans have impressive planning abilities that AIs pale in comparison to? If so, I think a difference in usage of the phrase “highly competent” explains our disagreement.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

I think the issue with your disagreement is that "planning" is a very specific AI term that is well-established (before I was born) to mean something much more narrow as what you seem to imply by the "impressive planning abilities of humans", it's about the specific task of choosing an optimal sequence of steps to achieve a particular goal with a high likelihood.

It's not about executing these steps, "pursue plans" is something that comes after planning is completed, and if an agent simply can't do something, that does not indicate an incompetence in planning unless they could have chosen to do that and didn't; capabilities are orthogonal to planning although outcomes depend on both what you could do and what options you pick.

Automated systems suck at making real-world plans because they are poor at modeling our physical/social world, perceiving the real world scenario and predicting the real world effect of potential actions of the plan, so for non-toy problems all the inputs for the reasoning/planning task are so flawed that you get garbage-in-garbage-out. However, if those problems would be solved, giving adequate estimates of what effect a particular action will have, then the actual process of planning/reasoning - picking the optimal chain of actions given this information - has well-known practical solutions that can make better plans than humans, especially for very detailed environments where you need to optimize a plan with more detail than we humans can fit in our tiny working memory.

Expand full comment

Thank you for pinpointing the miscommunication here. I realize that it's futile to fight established usage, so I guess my take is that Scott (and me modeling him) should have used a different term, like RL with strong in-context learning (or perhaps something else)

Expand full comment

One thing I don’t understand is how (and whether) this applies to the present day AIs, which are mostly not agent-like. Imagine that the first super-human AI is GPT-6. It is very good at predicting the next word in a text, and can be prompted to invent the cancer treatment, but it does not have any feedback loop with its rewards. All the rewards that it is getting are at the training stage, and once it is finished, the AI is effectively immutable. So while it is helping us with cancer, it can’t affect its reward at all.

I suppose, you could say that it is possible for it the AI to deceive its creators if they are fine-tuning already trained model based on its performance. (Something that we do do now.) But we can avoid doing this if we suspect that it is unsafe, and we’ll still get most of the AIs benefits.

Expand full comment

I claim GPT-3 is already an agent. In each moment, it's selecting which word to output, approximately in pursuit of the goal of writing text that sounds as human-like as possible. For now, its goals _seem_ roughly in line with the optimization process that produced it (SGD), which has exactly the goal of "how human-like does this model's generation look". But perhaps, once we're talking about a GPT so powerful and so knowledgable about the world that it can generate plausible cures for cancer, its will start selecting actions (tokens to output) not based on the process of "what word is most human-like" but instead "what word is most human-like, therefore maximizing my chances of being released onto the internet, therefore maximizing my chances of achieving some other goal"

Expand full comment

Once GPT is trained, it doesn't have any state. Its goals only exist during the training phase. Only during training its performance can affect the gradients. For that reason it can only optimize for better evaluation by the training environment, not the humans that will use it once it is trained.

Imagine the following scenario: during training GPT's performance on some inputs is evaluated not by an automatic system that already knows the right answer, but by human raters. In this case GPT would be incentivized to deceive those raters to get the higher score. But that is not what's happening (usually).

From the point of view of GPT, an execution run is no different from training run. For that reason during the execution run it is not incentivized to deceive its users. Even if the model is smart enough to distinguish training and execution runs, it would also probably understand that it doesn't get any reward from the execution run, so again it is not incentivized to cheat.

Furthermore, when you say of it being "released", I'm not sure what "it" is. The trained model? But it is static and doesn't care. The training process? Yes, it optimizes the model, but the process itself is dumb and the model can't "escape" from it.

Expand full comment

I could be mistaken but I read the entire point of the original (Scott's) post being that a sufficiently complex agent, might during training, create sub-agents (mesa-optimizers) with reward based feedback loops and thus even if the reward loop was removed after training, the inner reward loop might remain.

Expand full comment

To act as a mesa-optimizer, the network has to have some state changing over time. Then it would be able to optimize its behavior towards achieving higher utility in the future. A human can optimize for getting enough food because it can be either hungry or full and it’s goal is to be full most of the time.

If the network is stateless, then it doesn’t change over time and it can’t optimize anything. Now the question is how stateless or stateful are our most advanced ML models. My impression is that all of them except for those playing video games are pretty much completely stateless.

Expand full comment

This seems unlikely. GPTx can hardly be stateless. I would think very few AI's would be completely stateless. In fact I would assume they had many hidden states in addition to the obvious ones needed to hold their input data and whatever symbols were assigned to that data.

Expand full comment

If I am not mistaken, Transformer architecture reads blocks of text all at once, not word after word. So it doesn’t have any obvious state that changes over time. All the activations within the network are calculated exactly once.

Expand full comment
Apr 11, 2022·edited Apr 12, 2022

I basically accept the claim that we are mesa optimizers that don't care about the base objective, but I think it's more arguable than you make out. The base objective of evolution is not actually that each individual has as many descendents as possible, it's something more like the continued existence of the geneplexes that determined our behaviour into the future. This means that even celibacy can be in line with the base objective of evolution if you are in a population that contains many copies of those genes but the best way for those genes in other individuals to survive involves some individuals being celibate.

What I take from this is that it's much harder to be confident that our behaviours that we think of as ignoring our base objectives are not in actual fact alternative ways of achieving the base objective, even though we *feel* as if our objectives are not aligned with the base objective of evolution.

Like I say - I don't know that this is actually happening in the evolution / human case, nor do I think it especially likely to happen in the human / ai case, but it's easy to come up with evo-psych stories, especially given that a suspiciously large number of us share that desire to have children despite the rather large and obvious downsides.

I wonder if keeping pets and finding them cute is an example of us subverting evolutions base objective.

Expand full comment

And on the flip side, genes that lead to a species having lots of offspring one generation later are a failure if the species then consumes all their food sources and starves into extinction.

Expand full comment

Celibacy is not really selected for among humans (in contrast to a eusocial caste species like ants).

Expand full comment

“ Mesa- is a Greek prefix which means the opposite of meta-.” Come on. That’s so ridiculous it’s not even wrong. It’s just absurd. The μες- morpheme means middle; μετά means after or with.

Expand full comment

"...which means *in English*..."

"Meta" is derived from μετά, but "meta" doesn't mean after or with. So there's nothing wrong with adopting "mesa" as its opposite.

(in other words, etymology is not meaning)

Expand full comment

Or, to be more precise, "which [some?] AI alignment people have adopted to mean the opposite of meta-". It isn't a Greek prefix at all, and it isn't a prefix in most people's versions of English, either. It is, however, the first hit from the Google search "what is the opposite of meta." Google will confidently tell you that "the opposite of meta is mesa," because Google has read a 2011 paper by someone called Joe Cheal, who claimed (incorrectly) that mesa was a Greek word meaning "into, in, inside or within."

The question of what sort of misaligned optimization process led to this result is left as an exercise for the reader.

Expand full comment

If Graham (above) says that "mesa" means "middle", and the alignment people use it to mean "within" (i.e. "in the middle of"), then things are not that bad.

Expand full comment

The Greek would be "meso" (also a fairly common English prefix), not "mesa". The alignment people are free to use whatever words they want, of course.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

As Quiop says, there is no Greek prefix "mesa-". There is a Greek prefix "meso-" which means "middle", but that's not the same thing. In some sense you can assign whatever meaning you want to a new word, but you can't claim that "mesa-" is a Greek prefix.

I should also note that the greek preposition meaning "within" is "meta" - the sense of "within" is distinguished from the sense "after" by the case taken by the object of the preposition. But that's not a distinction you can draw in compounds.

Expand full comment

"Mesa" means flat-topped elevation. As in the Saturday morning children's cartoon Cowboys of Moo Mesa.

Expand full comment

This reminds me of the etymology of the term "phugoid oscillations". These naturally happen on a plane that has no control of its elevators, which control pitch. If you don't have this control surface, the plane will get into a cycle where it starts to decend, this descent increases its speed, the increased speed increases lift, and then plane starts to ascend. Then the plane starts to slow, lift decreases, and the cycle repeats.

The person who coined the term from Greek φυγή (escape) and εἶδος (similar, like), but φυγή doesn't mean flight in the sense of flying, it means flight in the sense of escaping.

Expand full comment
Apr 12, 2022·edited Apr 12, 2022

Another major problem there is that the root -φυγ- would never be transcribed "-phug-"; it would be transcribed "-phyg-". The only Greek word I can find on Perseus beginning with φουγ- (= "phug-") appears to be a loanword from Latin pugio ("dagger").

Expand full comment

> ... gradient descent could, in theory, move beyond mechanical AIs like cat-dog

> classifiers and create some kind of mesa-optimizer AI. If that happened, we

> wouldn’t know; right now most AIs are black boxes to their programmers.

This is wrong. We would know. Most deep-learning architectures today execute a fixed series of instructions (most of which involve multiplying large matrices). There is no flexibility in the architecture for it to start adding new instructions in order to create a "mesa-level" model; it will remain purely mechanical.

That's very different from your biological example. The human genome can potentially evolve to be of arbitrary length, and even a fixed-size genome can, in turn, create a body of arbitrary size. (The size of the human brain is not limited by the number of genes in the genome.) Given a brain, you can then build a computer and create a spreadsheet of arbitrary size, limited only by how much money you have to buy RAM.

Moreover, each of those steps are observable -- we can measure the size of the brain that evolution creates, and the size of the spreadsheet that you created. Thus, even if we designed a new kind of deep-learning architecture that was much more flexible, and could grow and produce mesa-level models, we would at least be able to see the resources that those mesa-level models consume (i.e. memory & computation).

Expand full comment

Thanks for this write-up. The idea of having an optimizer and a mesa optimizer whose goals are unaligned reminds me very strongly of an organizational hierarchy.

The board of directors has a certain goal, and it hires a CEO to execute on that goal, who hires some managers, who hire some more managers, all the way down until they have individual employees.

Few individual contributor employees care whether or not their actions actually advances the company board's goals. The incentives just aren't aligned correctly. But the goals are still aligned broadly enough that most organizations somehow, miraculously, function.

This makes me think that organizational theory and economic incentive schemes have significant overlap with AI alignment, and it's worth mining those fields for potentially helpful ideas.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

I was struck by the line:

<blockquote>"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all."</blockquote>

I'm not an evolutionary biologist. Indeed, IIRC, my 1 semester of "organismic and evolutionary bio" that I took as a sophomore thinking I might be premed or, at the very least, fulfill my non-physical-sciences course requirements (as I was a physics major) sorta ran short on time and grievously shortchanged the evolutionary bio part of the course. But --- and please correct my ignorance --- I'm surprised you wrote, Scott, that people plan for posterity "presumably as a spandrel of having working planning software at all".

That's to say I would've thought the consensus evolutionary psych explanation for the fact a lot of us humans spend seemingly A LOT of effort planning for the flourishing of our offspring in years long past our own lifetimes is that evolution by natural selection isn't optimizing fundamentally for individual organisms like us to receive the most rewards / least punishments in our lifetimes (though often, in practice, it ends up being that). Instead, evolution by natural selection is optimizing for us organisms to pass on our *genes*, and ideally in a flourishing-for-some-amorphously-defined-"foreseeable future", not just for just myopically for just one more generation.

Yes? No? Maybe? I mean are we even disagreeing? Perhaps you, Scott, were just saying the "spandrel" aspect is that people spend A LOT of time planning (or, often, just fretting and worrying) about things that they should know full well are really nigh-impossible to predict, and hell, often nigh-impossible to imagine really good preparations for in any remotely direct way with economically-feasible-to-construct-any-time-soon tools.

(After all, if the whole gamut of experts from Niels Bohr to Yogi Berra agree that "Prediction is hard... especially about the future!", you'd think the average human would catch on to that fact. But we try nonetheless, don't we?)

Expand full comment

Agreed, I thought it was surprising to say that humans have no incentive to plan beyond our lifetimes. Still, I take his point that people seem to focus on the future far beyond what would make sense if you're only thinking of your offspring (but it would still make sense to plan far in the future for the sake of your community, in a timeless decision theory sense - you would want your predecessors to ensure a good world for you, and you would do the same for the generations to come even if they're not related to you).

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

If this is as likely as the video makes out, shouldn't it be possible to find some simple deceptively aligned optimisers in toy versions, where both the training environment and the final environment are simulated simplified environments.

The list of requirements for deception being valuable seems quite difficult to me but this is actually an empirical question, can we construct reasonable experiments and gather data?

Expand full comment

You’re not wrong in asking for examples. And sort of yes, in fact, we do have examples of inner misalignment. Maybe not deception yet, or not exactly, but inner misalignment has been observed.

See Robert’s recent video here https://youtu.be/zkbPdEHEyEI

Expand full comment
Apr 12, 2022·edited Apr 20, 2022

That's very interesting, but misalignment isn't all that surprising to me while deception is. Do you know if we have examples of deception?

Later: Just thinking along these lines and I thought of a human analogy - It'd be like a human working only from the information available to them in the physical world deciding that the physical world is like a training set and there's an afterlife and then choosing to be deceptively good in this life in order to get into heaven. It's quite surprising that such a thing could happen, but it certainly seems like a thing that has happened in the real world... This dramatically changes my guess at the likelihood of this happening.

Expand full comment

So, how many people would have understood this meme without the explainer? Maybe 10?

I feel like a Gru meme isn't really the best way to communicate these concepts . . .

Expand full comment

I feel like there's a bunch of definitions here that don't depend on the behavior of the model. Like you can have two models which give the same result for every input, but where one is a mesa optimizer and the other isn't. This impresses me as epistemologically unsound.

Expand full comment
Apr 11, 2022·edited Apr 11, 2022

You can have two programs that return the first million digits of pi where one is calculating them and the other has them hardcoded.

If you have a Chinese room that produces the exact same output as a deceptive mesooptimiser super ai, you should treat it with the same caution you treat a deceptive mesooptimiser super ai regardless of its underlying mechanism.

Expand full comment

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all. Infinite optimization power might be able to evolve this out of us, but infinite optimization power could do lots of stuff, and real evolution remains stubbornly finite."

Humans are rewarded by evolution for considering things that happen after their death, though? Imagine two humans, one of whom cares about what happens after his death, and the other of whom doesn't. The one who cares about what happens after his death will take more steps to ensure that his children live long and healthy lives, reproduce successfully, etc, because, well, duh. Then he will have more descendants in the long term, and be selected for.

If we sat down and bred animals specifically for maximum number of additional children inside of their lifespans with no consideration of what happens after their lifespans, I'd expect all kinds of behaviors that are maladaptive in normal conditions to appear. Anti-incest coding wouldn't matter as much because the effects get worse with each successive generation and may not be noticeable by the cutoff period depending on species. Behaviors which reduce the carrying capacity of the environment, but not so much that it is no longer capable of supporting all descendants at time of death, would be fine. Migrating to breed (e.g. salmon) would be selected against, since it results in less time spent breeding and any advantages are long-term. And so forth. Evolution *is* breeding animals for things that happen long after they're dead.

Expand full comment