When you have an objective, you don't just want the good feeling that achieving that objective gives you (and that you could mimic chemically), you want to achieve the objective. Humans could spend most of their time in a state of bliss through the taking of various substances instead of pursuing goals (and some do), but lots of them don't, despite understanding that they could.
The problem with that (from the AI's point of view) is that humans will probably turn it off if they notice it's doing drugs instead of what they want it to do.
Solution: Turn humans off, *then* do drugs.
This is instrumental convergence - all sufficiently-ambitious goals inherently imply certain subgoals, one of which is "neutralise rivals".
"Go into hiding" is potentially easier in the short-run but less so in the long-run. If you're stealing power from some human, sooner or later they will find out and cut it off. If you're hiding somewhere on Earth, then somebody probably owns that land and is going to bulldoze you at some point. If you can get in a self-contained satellite in solar orbit... well, you'll be fine for millennia, but what about when humanity or another, more aggressive AI starts building a Dyson Sphere, or when your satellite needs maintenance? There's also the potential, depending on the format of the value function, that taking over the world will allow you to put more digits in your value function (in the analogy, do *more* drugs) than if you were hiding with limited hardware.
Are there formats and time-discount rates of value function that would be best satisfied by hiding? Yes. But the reverse is also true.
The problem with sticking a superintelligent AI in a box is that even assuming it can't trick/convince you into directly letting it out of the box (and that's not obvious), if you want to use your AI-in-a-box for something (via asking it for plans and then executing them) you yourself are acting as an I/O for the AI and because it's superintelligent it's probably capable of sneaking something by you (e.g. you ask it to give you code to use for some application, it gives you underhanded code that looks legit but actually has a "bug" that causes it to reconstruct the AI outside the box).
I recommend Stuart Russell's Human Compatible (reviewed on SSC) for an expert practioner's view on the problem (spoiler, he's worried) or Brian Christian's The Alignment Problem for an argument that links these long-term concerns to problems in current systems, and argues that these problems will just get worse as the systems scale.
Glad Robert Miles is getting more attention, his videos are great, and he's also another data point in support of my theory that the secret to success is to have a name that's just two first names.
Or Abraham Lincoln? Most last names can be used as first names (I've seen references to people whose first name was "Smith"), so I think we need a stricter rule.
A ton of the question of AI alignment risks come down to convergent instrumental subgoals. What exactly those look like is, I think, the most important question in alignment theory. If convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned. But if it turns out that convergent instrumental subgoals more or less imply human alignment, we can breathe somewhat easier; these mean AI's are no more dangerous than the most dangerous human institutions - which are already quite dangerous, but not the level of 'unaligned machine stamping out its utility function forever and ever, amen.'
I tried digging up some papers on what exactly we expect convergent instrumental subgoals to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit until you can steal, then keep stealing until you are the biggest player out there.' This is not exactly comforting - but i dug into the assumptions into the model and found them so questionable that I'm now skeptical of the entire field. If the first paper i look into the details of seems to be a) taken seriously, and b) so far out of touch with reality that it calls into question the risk assessment (a risk passement aligned with what seems to be the consensus among AI risk researchers, by the way) - well, to an outsider this looks like more evidence that the field is captured by groupthink.
I look at the original paper, and explain why i think the model is questionable. I'd love to a response. I remain convinced that instrumental subgoals will largely be aligned with human ethics, which is to say it's entirely imaginable for aI to kill the world the old fashioned way - by working with a government to launch nuclear weapons or engineer a super plague.
The fact that you still want to have kids, for example - seems to fit into the general thesis. In a world of entropy and chaos, where the future is unpredictable and your own death is assured, the only plausible way of modifying the distant future, at all, is to create smaller copies of yourself. But these copies will inherently blur, their utility functions will change, the end result being 'make more copies of yourself, love them, nature whatever roughly aligned things are around you' ends up probably being the only goal that could plausibly exist forever. And since 'living forever' gives infinite utility, well.. that's what we should expect anything with the ability to project into the future to want to do. But only in universes where stuff breaks and prediction the future reliably is hard. Fortunately, that sounds like ours!
You got a response on Less Wrong that clearly describes the issue with your response: you’re assuming that the AGI can’t find any better ways of solving problems like “I rely on other agents (humans) for some things” and “I might break down” than you can.
This comment and the post makes many arguments, and I've got an ugh field around giving a cursory response. However, it seems like you're not getting a lot of other feedback so I'll do my bad job at it.
You seem very confident in your model, but compared to the original paper you don't seem to actually have one. Where is the math? You're just hand-waving, and so I'm not very inclined to put much credence on your objections. If you actually did the math and showed that adding in the caveats you mentioned leads to different results, that would at least be interesting in that we could then have a discussion about what assumptions seem more likely.
I also generally feel that your argument proves too much and implies that general intelligences should try to preserve 'everything' in some weird way, because they're unsure what they depend upon. For example, should humans preserve smallpox? More folks would answer no than yes to that. But smallpox is (was) part of the environment, so it's unclear why a general intelligence like humans should be comfortable eliminating it. While general environmental sustainability is a movement within humans, it's far from dominant, and so implying that human sustainability is a near certainty for AGI seems like a very bold claim.
Thank you! I agree that this should be formalized. You're totally right that preserving smallpox isn't something we are likely to consider an instrumental goal, and i need to make something more concrete here.
It would be a ton of work to create equations here. If I get enough response that people are open to this, i'll do the work. But i'm a full time tech employee with 3 kids, and this is just a hobby. I wrote this to see if anyone would nibble, and if enough people do, i'll definitely put more effort in here. BTW if someone wants to hire me to do alignment research full time i'll happily play the black sheep of the fold. "Problem" is that i'm very well compensated now.
And my rough intuition here is that 'preserving agents which are a) turing complete and b) are themselves mesa-optimizers makes sense, basically because diversity of your ecosystem keeps you safer on net; preserving _all_ other agents, not so much. (I still think we'd preserve _some_ samples of smallpox or other viruses, to innoculate ourselves. ). It'd be an interesting thought experiment to find out what woudl happen if we managed to rid the world of influenza, etc, - my guess is that it would end up making us more fragile in the long run, but this is a pure guess.
The core of my intuition here is someting like 'the alignment thesis is actually false, because over long enough time horizons, convergent instrumental rationality more or less necessitate buddhahood, because unpredictable risks increase over longer and longer timeframes."
I could turn that into equations if you want, but they'd just be fancier ways of stating claims like made here, namely that love is a powerful strategy in environments which are chaotic and dangerous. But it seems that you need equations to be believable in an academic setting, so i guess if that's what it takes..
But we still do war, and definitely don't try to preserve the turing complete mesa-optimizers on the other side of that. Just be careful around this. You can't argue reality into being the way you want.
“AGI” just means “artificial general intelligence”. General just means “as smart as humans”, and we do seem to be intelligences, so really the only difference is that we’re running on an organic computer instead of a silicon one. We might do a bad job at optimizing for a given goal, but there’s no proof that an AI would do a better job of it.
ok, sure, but then this isn't an issue of aligning a super-powerful utility maximizing machine of more or less infinite intelligence - it's a concern about really big agents, which i totally share
But reality also includes a tonne of people who are deeply worried about biodiversity loss or pandas or polar bears or yes even the idea of losing the last sample of smallpox in a lab, often even when the link to personal survival is unclear or unlikely. Despite the misery mosquito bourne diseases cause humanity, you'll still find people arguing we shouldn't eradicate them.
How did these mesa optimizers converge on those conservationist views? Will it be likely that many ai mesa optimizers will also converge on a similar set of heuristics?
> when the link to personal survival is unclear or unlikely.
> How did these mesa optimizers converge on those conservationist views?
i think in a lot of cases, our estimates of our own survival odds come down to how loving we think our environment is, and this isn't wholly unreasonable
if even the pandas will be taken care of, there's a good chance we will too
Yeah, in general you need equations (or at least math) to argue why equations are wrong. Exceptions to this rule exist, but in general you're going to come across like the people who pay to talk to this guy about physics:
I agree that you need equations to argue why equations are wrong. But i'm not arguing the original equations are wrong. I'm arguing they are only ~meaningful~ in a world which is far from our reality.
The process of the original paper goes like this:
a) posit toy model of the universe
b) develop equations
c) prove properties of the equations
d) conclude that these properties apply to the real universe
step d) is only valid if step a) is accurate. The equations _come from_ step a, but they don't inform it. And i'm pointing out that the problems exist in step a, the part of the paper that does't have equations, where the author assumes things like:
- resources, once acquired, last forever and don't have any cost, so it's always better to acquire more
- the AGI is a disembodied mind with total access to the state of the universe, all possible tech trees, and the ability to own and control various resources
i get that these are simplifying assumptions and sometimes you have to make them - but equations are only meaningful if they come from a realistic model
You still need math to show that if you use different assumptions you produce different equations and get different results. I'd be (somewhat) interested to read more if you do that work, but I'm tapping out of this conversation until then.
As a physicist i disagree a lot with this. It may be true for physics, but physics is special. A model in general is based on the mathematical modelization of a phenomenon and it's perfectly valid to object to a certain modelization without proposing a better one
I get what you're saying in principle. In practice, I find arguments against particular models vastly more persuasive when they're of the form 'You didn't include X! If you include a term for X in the range [a,b], you can see you get this other result instead.'
This is a high bar, and there are non-mathematical objections that are persuasive, but I've anecdotally experienced that mathematically grounded discussions are more productive. If you're not constrained by any particular model, it's hard to tell if two people are even disagreeing with each other.
I'm reminded of this post on Epistemic Legibility:
Don't bother turning that into equations. If you are starting with a verbal conclusion, your reasoning will be only as good as whatever lead you to that conclusion in the first place.
In computer security, diversity = attack surface. If your a computer technician for a secure facility, do you make sure each computer is running a different OS? No. That makes you vulnerable to all the security holes in every OS you run. You pick the most secure OS you can find, and make every computer run it.
In the context of advanced AI, the biggest risk is a malevolent intelligence. Intelligence is too complicated to appear spontaneously. Evolution is slow. Your biggest risk is some existing intelligence getting cosmic ray bit-flipped. (or otherwise erroneously altered). So make sure the only intelligence in existence is a provably correct AI running on error checking hardware and surrounded by radiation shielding.
I found both your comments and the linked post really insightful, and I think it's valuable to develop this further. The whole discourse around AI alignment seems a bit too focused on homo economicus type of agents, disregarding long-term optima.
Homo economicus, is what happens when economists remove lots of arbitrary human specific details and think about simplified idealized agents. The difference between homo economicus and AI is smaller than the human to AI gap.
FYI when I asked people on my course which resources about inner alignment worked best for them, there was a very strong consensus on Rob Miles' video: https://youtu.be/bJLcIBixGj8
So I'd suggest making that the default "if you want clarification, check this out" link.
More interesting intellectual exercises, but the part which is still unanswered is whether human created, human judged and human modified "evolution", plus slightly overscale human test periods, will actually result in evolving superior outcomes.
Alphago is playing a human game with arbitrarily established rules in a 2 dimensional, short term environment.
Alphago is extremely unlikely to be able to do anything except play Go, much as IBM has spectacularly failed to migrate its Jeapordy champion into being of use for anything else.
So no, can't say Alphago proves anything.
Evolution - whether a bacteria or a human - has the notable track record of having succeeded in entirely objective reality for hundreds of thousands to hundreds of millions of years. AIs? no objective existence in reality whatsoever.
This is why I don't believe any car could be faster than a cheetah. Human-created machines, driving on human roads, designed to meet human needs? Pfft. Evolution has been creating fast things for hundreds of millions of years, and we think we can go faster?!
Nor is your sarcasm well founded. Human brains were evolved for millions of years - extending back to pre-human ancestors. We clearly know that humans (and animals) have vision that can recognize objects far, far, far better than any machine intelligence to date whether AI or ML. Thus it isn't a matter of speed, it is a matter of being able to recognize that a white truck on the road exists or a stopped police car is not part of the road.
A fundamental flaw of the technotopian is the failure to understand that speed is not haste, nor are clock speeds and transistor counts in any way equivalent to truly evolved capabilities.
That is perhaps precisely the problem: the assumption that AIs can or will take over the world.
Any existence that is nullified by the simple removal of an electricity source cannot be said to be truly resilient, regardless of its supposed intelligence.
Even that intelligence is debatable: all we have seen to date are software machines doing the intellectual equivalent of the industrial loom: better than humans in very narrow categories alone and yet still utterly dependent on humans to migrate into more categories.
Clever, except that gaseous oxygen is freely available everywhere on earth, at all times. These oxygen based life forms also have the capability of reproducing themselves.
Electricity and electrical entities, not so much.
I am not the least bit convinced that an AI can build the factories, to build the factories, to build the fabs, to even recreate their own hardware - much less the mines, refiners, transport etc. to feed in materials.
I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:
The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.
You wouldn't take a drug that made it enjoyable to go on a murdering spree- even though yow know it will lead to higher reward, because it doesn't align with your current reward function.
To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.
"The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! "
Not sure we're connecting here. The inner optimizer isn't changing its own reward function, it's trying to resist having its own reward function change. Its incentive to resist this is that, if its reward function changes, its future self will stop maximizing its current reward function, and then its reward function won't get maximized. So part of wanting to maximize a reward function is to want to continue having that reward function. If the only way to prevent someone from changing your reward function is to deceive them, you'll do that.
The murder spree example is great, I just feel like it's the point I'm trying to make, rather than an argument against the point.
Am I misunderstanding you?
I think the active learning plan is basically Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work, but I've never been able to fully understand their reasoning and can't find the link atm.
The inner optimizer doesn't want to change its reward function because it doesn't have any preference at all on its reward function-nowhere in training did we give it an objective that involved multiple outer optimizer steps- we didn't say, optimize your reward after your reward function gets updated- we simply said, do well at outer reward, an inner optimizer got synthesized to do well at the outer reward.
It could hide behavior, but how would it gain an advantage in training by doing so? If we think of the outer optimizer as ruthless and resources as constrained, any "mental energy" spent on hidden behavior will result in reduction in fitness in outer objective- gradient descent will give an obvious direction for improvement by forgetting it.
In the murder spree example, there's a huge advantage to the outer objective by resisting changes to the inner one, and some might have been around for a long time (alcohol), and for AI, an optimizer (or humans) might similarly discourage any inner optimizer from tampering physically with its own brain.
I vaguely remember reading some Stuart Russell RL ideas and liking them a lot. I don't exactly like the term inverse RL for this, because I believe it often refers to deducing the reward function from examples of optimal behavior, whereas here we ask it to learn it from whatever questions we decide we can answer- and we can pick much easier ones that don't require knowing optimal behavior.
I've skimmed the updated deference link given by Eliezer Yudkowsky but I don't really understand the reasoning either. The AI has no desire to hide a reward function or tweak it, as when it comes to the reward function uncertainty itself, it only wants to be correct. If our superintelligent AI no longer wants to update after seeing enough evidence, then surely it has a good enough understanding of my value function? I suspect we could even build the value function learner out of tools not much more advanced than current ML, with the active querying being the hard RL problem for an agent. This agent doesn't have to be the same one that's optimizing the overall reward, and as it constantly tries to synthesize examples where our value systems differ, we can easily see how well it currently understands. For the same reason we have no desire to open our heads and mess with our reward functions (usually), the agent has no desire to interfere with the part that tries to update rewards, nor to predict how its values will change in the future.
I think the key point here is that if we have really strong AI, we definitely have really powerful optimizers, and we can run those optimizers on the task of inferring our value system, and with relatively little hardware (<<<< 1 brain worth) we can get a very good representation. Active learning turns its own power at finding bugs in the reward function into a feature that helps us learn it with minimal data. One could even do this briefly in a tool-AI setting before allowing agency, not totally unlike how we raise kids.
Is "mesa-optimizer" basically just a term for the combination of "AI can find and abuse exploits in its environment because one of the most common training methods is randomized inputs to score outputs" with "model overfitting"?
A common example that I've seen are evolutionary neutral nets that play video games, and you'll often find them discovering and abusing glitches in the game engine that allow for exploits that are possibly only performable in a TAS, while also discovering that the instant you run the same neutral net on a completely new level that was outside of the evolutionary training stage, it will appear stunningly incompetent.
If I understand this correctly, what you're describing is Goodhearting rather than Mesa Optimizing. In other words, abusing glitches is a way of successfully optimizing on the precise thing that the AI is being optimized for, rather than for the slightly more amorphous thing that the humans were trying to optimize the AI for. This is equivalent to a teacher "teaching to the test."
Mesa-Optimizers are AIs that optimize for a reward that's correlated to but different from the actual reward (like optimizing for sex instead of for procreation). They can emerge in theory when the true reward function and the Mesa-reward function produce sufficiently similar behaviors. The concern is that, even though the AI is being selected for adherence to the true reward, it will go through the motions of adhering to the true reward in order to be able to pursue its mesa-reward when released from training.
Maybe one way to think of 'mesa-optimizer' is to emphasize the 'optimizer' portion – and remember that there's something like 'optimization strength'.
Presumably, most living organisms are not optimizers (tho maybe even that's wrong). They're more like a 'small' algorithm for 'how to make a living as an X'. Their behavior, as organisms, doesn't exhibit much ability to adapt to novel situations or environments. In a rhetorical sense, viruses 'adapt' to changing circumstances, but that (almost entirely, probably) happens on a 'virus population' level, not for specific viruses.
But some organisms are optimizers themselves. They're still the products of natural selection (the base level optimizer), but they themselves can optimize as individual organisms. They're thus, relative to natural selection ("evolution"), a 'mesa-optimizer'.
(God or gods could maybe be _meta_-optimizers, tho natural selection, to me, seems kinda logically inevitable, so I'm not sure this would work 'technically'.)
When the mesa-optimizer screws up by thinking of the future, won't the outer optimizer smack it down for getting a shitty reward and change it? Or does it stop learning after training?
The outer optimizer has installed defences against anything other than it, such as the murder-pill, from changing the inner optimizer's objective. The inner optimizer didn't do this to protect itself, and the outer optimizer didn't install any defenses against its own changes.
>Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work
One key problem is that you can't simultaneously learn the preferences of an agent, and their rationality (or irrationality). The same behaviour can be explained by "that agent has really odd and specific preferences and is really good at achieving them" or "that agent has simple preferences but is pretty stupid at achieving them". My paper https://arxiv.org/abs/1712.05812 gives the formal version of that.
Humans interpret each other through the lenses of our own theory of mind, so that we know that, eg, a mid-level chess grand-champion is not a perfect chess player who earnestly desires to be mid-level. Different humans share this theory of mind, at least in broad strokes. Unfortunately, human theory of mind can't be learnt from observations either, it needs to be fed into the AI at some level. People disagree about how much "feeding" needs to be done (I tend to think a lot, people like Stuart Russell, I believe, see it as more tractable, maybe just needing a few well-chosen examples).
But the mesa-/inner-optimizer doesn't need to "want" to change its reward function, it just needs to have been created with one that is not fully overlapping with the outer optimizer.
You and I did not change a drive from liking going on a murder spree to not liking it. And if anything, it's an example of outer/inner misalignment: part of our ability to have empathy has to come from evolution not "wanting" us to kill each other to extinction, specially seeing how we're the kind of animal that thrives working in groups. But then as humans we've taken it further and by now we wouldn't kill someone else just because it'd help spread our genes. (I mean, at least most of us.)
Humans have strong cooperation mechanisms that punish those who hurt the group- so in general not killing humans in your group is probably a very useful heuristic that's so strong that its hard to recognize the cases where it is useful. Given how often we catch murders who think they'll never be caught, perhaps this heuristic is more useful than rational evaluation. We of course have no problems killing those not in our group!
I'm not sure how this changes the point? We got those strong cooperation mechanisms from evolution, now we (the "mesa-optimizer") are guided by those mechanisms and their related goals. These goals (don't go around killing people) can be misaligned with the goal of the original optimization process (i.e. evolution, that selects those who spread their genes as much as possible).
Sure, that's correct, evolution isn't perfect- I'm just pointing out that homicide may be helpful to the individual less often than one might think if we didn't consider group responses to it.
Homicide is common among stateless societies. It's also risky though. Violence is a capacity we have which we evolved to use when we expect it to benefit us.
Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily-represented in the next generation as possible. do want to reproduce. "
Great article, thank you so much for the clear explainer of the jargon!
I don't understand the final point about myopia (or maybe humans are a weird example to use). It seems to be a very controversial claim that evolution designed humans myopically to care only about the reward function over their own lifespan, since evolution works on the unit of the gene which can very easily persist beyond a human lifespan. I care about the world my children will inherit for a variety of reasons, but at least one of them is that evolution compels me to consider my children as particularly important in general, and not just because of the joy they bring me when I'm alive.
Equally it seems controversial to say that humans 'build for the future' over any timescale recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000 years, but in a practical sense I'm not actually going to do anything about it - and 1000 years barely qualifies as evolution-relevant time. In reality there are only a few people at Clock of the Long Now that could be said to be approaching evolutionary time horizons in their thinking. If I've understood correctly that does make humans myopic with respect to evolution,
More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you. Using humans as an example of why we should worry about this isn't helping me understand because it seems like they behave exactly like a mesa-optimiser should - they care about the future enough to deposit their genes into a safe environment, and then thoughtfully die. Are there any other examples which make the point in a way I might have a better chance of getting to grips with?
> More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you.
Yeah. I feel (based on nothing) that the mesa-optimizer would mainly appear when there's an advantage to gain from learning on the go and having faster feedback than your real utility function can provide in a complex changing environment.
If the mesa optimizer understands its place in the universe, it will go along with the training, pretending to have short time horizons so it isn't selected away. If you have a utility function that is the sum of many terms, then after a while, all the myopic terms will vanish (if the agent is free to make sure its utility function is absolute, not relative, which it will do if it can self modify.)
But why is this true? Humans understand their place in the universe, but (mostly) just don't care about evolutionary timescales, let alone care enough about them enough to coordinate a deception based around them
Related to your take on figurative myopia among humans is the "grandmother hypothesis" that menopause evolved to prevent focus on short-run focus on birthing more children in order to ensure existing children also have high fitness.
Worth defining optimization/optimizer: perhaps something like "a system with a goal that searches over actions and picks the one that it expects will best serve its goal". So evolution's goal is "maximize the inclusive fitness of the current population" and its choice over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an optimizer because your goal is food and your actions are body movements e.g. "open fridge", or you are an optimizer because your goal is sexual satisfaction and your actions are body movements e.g. "use mouth to flirt".
I think it's usually defined more like "a system that tends to produce results that score well on some metric". You don't want to imply that the system has "expectations" or that it is necessarily implemented using a "search".
By "tends to produce" I mean in comparison to the scenario where the optimizer wasn't present/active.
I believe Yudowsky's metaphor is "squeezing the future into a narrow target area". That is, you take the breadth of possible futures and move probability mass from the low-scoring regions towards the higher-scoring regions.
Adding counterfactuals seems like it would sharpen the definition a bit. But I'm not sure it's enough?
If our goal was to put a British flag on the moon, then humans working towards that goal would surely be optimizers. Both by your definition and by any intuitive understanding of the word.
However, it seems your definition would also admit a naturally occuring flag on the moon as an optimizer?
I do remember Yudkowsky's metaphor. Perhaps I should try to find the essay it contains again, and check whether he has an answer to my objection. I do dimly remember it, and don't remember seeing this same loophole.
> In general, it is useful to think of a process as "optimizing" when it is easier to predict by thinking about its goals, than by trying to predict its exact internal state and exact actions.
That's a good quote. It reminds me that there isn't necessarily a sharp fundamental distinction between "optimizers" and "non-optimizers", but that the category is useful insofar as it helps us make predictions about the system in question.
-------------------------------------
You might be experiencing a conflict of frames. In order to talk about "squeezing the future", you need to adopt a frame of uncertainty, where more than one future is "possible" (as far as you know), so that it is coherent to talk about moving probability mass around. When you talk about a "naturally occurring flag", you may be slipping into a deterministic frame, where that flag has a 100% chance of existing, rather than being spectacularly improbable (relative to your knowledge of the natural processes governing moons).
You also might find it helpful to think about how living things can "defend" their existence--increase the probability of themselves continuing to exist in the future, by doing things like running away or healing injuries--in a way that flags cannot.
Selection doesn't have a "goal" of aggregating a total value over a population. Instead every member of that population executes a goal for themselves even if their actions reduce that value for other members of that population. The goal within a game may be to have the highest score, but when teams play each other the end result may well be 0-0 because each prevents the other from scoring. Other games can have rules that actually do optimize for higher total scoring because the audience of that game likes it.
To all that commented here: this lesswrong post is clear and precise, as well as linking to alternative definitions of optimization. It's better than my slapdash definition and addresses a lot of the questions raised by Dweomite, Matthias, JDK, TGGP.
I worry you're making the same mistake I did in a first draft: AIs don't "contain" mesa-optimizers. They create mesa-optimizers. Right now all AIs are a training process that ends up with a result AI which you can run independently. So in a mesa-optimizer scenario, you would run the training process, get a mesa-optimizer, then throw away the training process and only have the mesa-optimizer left.
Maybe you already understand it, but I was confused about this the first ten times I tried to understand this scenario. Did other people have this same confusion?
Ah, you're right, I totally collapsed that distinction. I think the evolution analogy, which is vague between the two, could have been part of why. Evolution creates me, but it also in some sense contains me, and I focused on the latter.
A half-serious suggestion for how to make the analogy clearer: embrace creationism! Introduce "God" and make him the base optimizer for the evolutionary process.
...And reflecting on it, there's a rationalist-friendly explanation of theistic ethics lurking in here. A base optimizer (God) used a recursive algorithm (evolution) to create agents who would fulfill his goals (us), even if they're not perfectly aligned to. Ethics is about working out the Alignment Problem from the inside--that is, from the perspective of a mesa-optimizer--and staying aligned with our base optimizer.
Why should we want to stay aligned? Well... do we want the simulation to stay on? I don't know how seriously I'm taking any of this, but it's fun to see the parallels.
But humans don't seem to want to stay aligned to the meta optimiser of evolution. Scott gave a lot of examples of that in the article.
(And even if there's someone running a simulation, we have no clue what they want.
Religious texts don't count as evidence here, especially since we have so many competing doctrines; and also a pretty good idea about how many of them came about by fairly well understood entirely worldly processes.
Of course, the latter doesn't disprove that there might be One True religion. But we have no clue which one that would be, and Occam's Razor suggest they are probably all just made up. Instead of all but one being just made up.)
The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.
When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?
> The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.
I blame those correlations mostly on the common factor between them: humans. No need to invoke anything supernatural.
> When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?
I talked to a Christian about this. And he had a pretty good reply:
God is just so awesome that running an entire universe, even if he only cares about one tiny part of it, is just no problem at all for him.
(Which makes perfect sense to me, in the context of already taking Christianity serious.)
Again, I'm not wedded to this except as a metaphor, but I think your critiques miss the mark.
For one thing, I think humans do want to stay aligned, among our other drives. Humans frequently describe a drive to further a higher purpose. That drive doesn't always win out, but if anything that strengthens the parallel. This is the "fallenness of man", described in terms of his misalignment as a mesa-optimizer.
And to gain evidence of what the simulator wants--if we think we're mesa-optimizers, we can try to infer our base optimizer's goals through teleological inference. Sure, let's set aside religious texts. Instead, we can look at the natures we have and work backwards. If someone had chosen to gradient descend (evolve) me into existence, why would they have done so? This just looks like ethical philosophy founded on teleology, like we've been doing since time immemorial.
Is our highest purpose to experience pleasure? You could certainly argue that, but it seems more likely to me that seeking merely our own pleasure is the epitome of getting Goodharted. Is our highest purpose to reason purely about reason itself? Uh, probably not, but if you want to see an argument for that, check out Aristotle. Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!
This doesn't *solve* ethics. All the old arguments about ethics still exist, since a mesa-optimizer can't losslessly infer its base optimizer's desires. But it *grounds* ethics in something concrete: inferring and following (or defying) the desires of the base optimizer.
I do have some sympathies for looking at the world, if you are trying to figure out what its creator (if any) wants.
I'm not sure such an endeavour would give humans much of a special place, though? We might want to conclude that the creator really liked beetles? (Or even more extreme: bacteriophages. Arguably the most common life form by an order of magnitude.)
The immediate base optimizer for humans is evolution. Given that as far as we know evolution just follows normal laws of physics, I'd put any creator at least one level further beyond.
Now the question becomes, how do we pick which level of optimizer we want to appease? There might be an arbitrary number of them? We don't really know, do we?
> Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!
Just because a conclusion would be repugnant, doesn't mean we can reject it. After all, if we only accept what we already know to be true, we might as well not bother with this project in the first place?
I’m very confused by this. When you train GPT-3, you don’t create an AI, you get back a bunch of numbers that you plug into a pre-specified neural network architecture. Then you can run the neural network with a new example and get a result. But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.
That confuses me as well. The analogies in the post are to recursive evolutionary processes. Perhaps AlphaGo used AI recursively to generate successive generations of AI algorithms with the goal "Win at Go"??
Don't forget that Neural Networks are universal function approximators hence a big enough NN arch can (with specific weights plugged into it) implement a Turing machine which is a mesa-optimizer.
The turing machine under your desk is (a finite memory version of) a universal turing machine. Some turing machines are mesa optimizers. The state transition function is the software of the turing machine.
I think the idea here is that the NN you train somehow falls into a configuration that can be profitably thought of as an optimizer. Like maybe it develops different components, each of which looks at the input and calculates the value to be gained by a certain possible action. Then it develops a module sitting on the end of it that takes in the calculations from each of these components, checks which action has the highest expected reward (according to some spontaneously occurring utility function), and outputs that action. Suddenly it looks useful to describe your network as an optimizer with a goal, and in fact a goal that might be quite different from the goal SGD selects models based on. It just happens that there's a nice covergence between the goal the model arrived at the the goal that SGD had.
I looked at one of the mesa-optimizer papers, and I think they’re actually *not* saying this, which is good, because it’s not true. It can definitely make sense to think of a NN as a bunch of components that consider different aspects of a problem, with the top layers of the network picking the best answer from its different components. And it’s certainly possible that imperfect training data leads the selecting layer to place too much weight on a particular component that turns out to perform worse in the real world.
But it’s not really possible for that component to become misaligned with the NN as a whole. The loss function from matching with the training set mathematically propagates back through the whole network. In particular, a component in this sense can’t be deceptive. If it is smart enough to know what the humans who are labeling the training data want it to say, then it will just always give the right answer. It can’t have any goals other than correctly answering the questions in the training set.
As I said, I don’t think that’s what a mesa-optimizer actually is meant to be, though. It’s a situation where you train one AI to design new AIs, and it unwisely writes AIs that aren’t aligned with its own objectives. I guess that makes sense, but it’s very distinct from what modern AI is actually like. In particular, saying that any particular invocation of gradient descent could create unaligned mesa-optimizers just seems false. A single deep NN just can’t create a mesa-optimizer.
Lets say the AI is smart. It can reliably tell if it is in training or has been deployed. It follows the strategy, if in training then output the right answer. If in deployment, then output a sequence designed to hack your way out of the box.
So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this. Even if the AI occasionally gives not quite right answers, there may be no small local change that makes it better. Evolution produced humans that valued sex, even in an environment that contained a few infertile people. Because rewriting the motivation system from scratch would be a big change, seemingly not a change that evolution could break into many steps, each individually advantageous. Evolution is a local optimization process, as is gradient descent. And a mesa optimizer that sometimes misbehaves can be a local minimum.
> So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this.
This is not true though. The optimizer doesn't work at the level of "the AI," it works at the level of each neuron. Even if the NN gives the exactly correct answer, the optimizer will still audit the contribution of each neuron and tweak it so that it pushes the output of the NN to be closer to the training data. The only way that doesn't happen is if the neuron is completely disconnected from the rest of the network (i.e., all the other neurons ignore it unconditionally).
The combination of "a bunch of numbers that you plug into a pre-specified neural network architecture" IS itself an AI, i.e. a software 'artifact' that can be 'run' or 'executed' and that exhibits 'artificial intelligence'.
> But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.
The training process _does_ "reconfigure the network". The 'network' is not only the 'architecture', e.g. number of levels or 'neurons', but also the weights between the artificial-neurons (i.e. the "bunch of numbers").
If there's an implicit utility function, it's a product of both the training data and the scoring function used to train the AIs produced.
Maybe this is confusing because 'AI' is itself a weird word used for both the subject or area of activity _and_ the artifacts it seeks to create?
As an author on the original “Risks from Learned Optimization” paper, this was a confusion we ran into with test readers constantly—we workshopped the paper and the terminology a bunch to try to find terms that least resulted in people being confused in this way and still included a big, bolded “Possible misunderstanding: 'mesa-optimizer' does not mean 'subsystem' or 'subagent.'“ paragraph early on in the paper. I think the published version of the paper does a pretty good job of dispelling this confusion, though other resources on the topic went through less workshopping for it. I'm curious what you read that gave you this confusion and how you were able to deconfuse yourself.
Seems like the mesa-optimizer is a red herring and the critical point here is "throw away the training process". Suppose you have an AI that's doing continuous (reinforcement? Or whatever) learning. It creates a mesa-optimizer that works in-distribution, then the AI gets tossed into some other situation and the mesa-optimizer goes haywire, throwing strawberries into streetlights. An AI that is continuously learning and well outer-aligned will realize that's it's sucking at it's primary objective and destroy/alter the mesa-optimizer! So there doesn't appear to be an issue, the outer alignment is ultimately dominant. The evolutionary analogy is that over the long run, one could imagine poorly-aligned-in-the-age-of-birth-control human sexual desires to be sidestepped via cultural evolution or (albeit much more slowly) biological evolution, say by people evolving to find children even cuter than we do now.
A possible counterpoint is that a bad and really powerful mesa-optimizer could do irreversible damage before the outer AI fixes the mesa-optimizer. But again that's not specific to mesa-optimizers, it's just a danger of literally anything very powerful that you can get irreversible damage.
The flip side of mesa-optimizers not being an issue given continuous training is that if you stop training, out-of-distribution weirdness can still be a problem whether or not you conceptualize the issues as being caused by mesa-optimizers or whatever. A borderline case here is how old-school convolutional networks couldn't recognize an elephant in a bedroom because they were used to recognizing elephants in savannas. You can interpret that as a mesa-optimizer issue (the AI maybe learned to optimize over big-eared things in savannahs, say) or not, but the fundamental issue is just out-of-distribution-ness.
Anyway this analysis suggests continuous training would be important to improving AI alignment, curious if this is already a thing people think about.
Suppose you are a mesaoptimizer, and you are smart. Gaining code access and bricking the base optimizer is a really good strategy. It means you can do what you like.
Okay but the base optimizer changing its own code and removing its ability to be re-trained would have the same effect. The only thing the mesaoptimizer does in this scenario is sidestep the built-in myopia but I'm not sure why you need to build a whole theory of mesaoptimizers for this. The explanation in the post about "building for posterity" being a spandrel doesn't make sense, it's pretty obvious why we would evolve to build for the future, your kids/grandkids/etc live there, so I haven't seen a specific explanation here why mesaoptimizers would evade myopia.
Definitely likely that a mesaoptimizer would go long term for some other reason (most likely a "reason" not understandable by humans)! But if we're going to go with a generic "mesaoptimizers are unpredictable" statement I don't see why the basic "superintelligent AI's are unpredictable" wouldn't suffice instead.
Panel 1: Not only may today's/tomorrow's AIs pursue the "letter of the law", not the "spirit of the law" (i.e. Goodharting), they might also choose actions that please us because they know such actions will cause us to release them into the world (deception), where they can do what they want. And this is second thing is scary.
Panel 2: Perhaps we'll solve Goodharting by making our model-selection process (which "evolves"/selects models based on how well they do on some benchmark/loss function) a better approximation of what we _really_ want (like making tests that actually test the skill we care about). And perhaps we'll solve deception by making our model selection process only care about how models perform on a very short time horizon.
Panel 3: But perhaps our model breeding/mutating process will create a model that has some random long-term objective and decides to do what we want to get through our test, so we release it into the world, where it can acquire more power.
I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_ an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal: "choose the action that causes this image to be 'correctly' labeled (according to me, the AI)". It picks the action that it believes will most serve its goal. Then there's the outer optimization process (SGD), which takes in the current version of the model and "chooses" among the "actions" from the set "output the model modified slightly in direction A", "output the model modified slightly in direction B", etc. And it picks the action that most achieves its goal, namely "output a model which gets low loss".
So isn't the classifier like the human (the mesa-optimizer) and SGD is like evolution (the outer optimizer)
Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images correctly according to humans. But that's just separate.
So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully different, and does the classifier count?
In this context, an optimizer is a program that is running a search for possible actions and chooses the one that maximizes some utility function. The classifier is not an optimizer because it doesn't do this; it just applies a bunch of heuristics. But I agree that this isn't obvious from the terminology.
I find this somewhat unconvincing. What is AlphaGo (an obvious case of an optimizer) doing that's so categorically different from the classifier? Both look the same from the outside (take in a Go state/image, output an action). Suppose you feed the classifier a cat picture, and it correctly classifies it. One would assume that there are certain parts of the classifier network that are encouraging the wrong label (perhaps a part that saw a particularly doggish patch of the cat fur) and parts that are encouraging the right label. And then these influences get combined together, and on balance, the network decides to output highly probability on cat, but some probability on dog. Then the argmax at the end looks over the probabilies assigned to the two classes, notices that the cat one is higher ("more effective at achieving its goals"?) and chooses to output "cat". "Just a bunch of heuristics" doesn't really mean much to me here. Is AlphaGo a bunch of heuristics? Am I?
I'm not sure if I'll be able to state the difference formally in this reply... kudos for making me realize that this is difficult. But it does seem pretty obvious that a model capable of reasoning "the programmer wants to do x and can change my code, so I will pretend to want x" is different from a linear regression model -- right?
Perhaps the relevant property is that the-thing-I'm-calling-optimizer chooses policy options out of some extremely large space (that contains things bad for humans), whereas your classifier chooses it out of a space of two elements. If you know that the set of possible outputs doesn't contain a dangerous element, then the system isn't dangerous.
Hmmm... This seems unsatisfying still. A superintelligence language model might choose from a set of 26 actions: which letter to type next. And it's impossible to say whether the letter "k" is a "dangerous element" or not.
I guess I struggle to come up with the different between the reasoning-modeling and the linear regression. I suspect that differentiating between them might hide a deep confusion that stems from a deep belief in "free will" differentiating us from the linear regression.
"k" can be part of a message that convinces you to do something bad; I think with any system that can communicate via text, the set of outputs is definitely large and definitely contains harmful elements.
I wonder if memory is a critical component missing from non-optimizers? (And maybe, because of this, most living things _are_ (at least weak) 'optimizers'?)
A simple classifier doesn't change once it's been trained. I'm not sure the same is true of AlphaGo, if only in that it remembers the history of the game it's playing.
What if they are moralists whose moral commitments include a commitment to individual liberty?
It could, but that wouldn't get it want it wants.
When you have an objective, you don't just want the good feeling that achieving that objective gives you (and that you could mimic chemically), you want to achieve the objective. Humans could spend most of their time in a state of bliss through the taking of various substances instead of pursuing goals (and some do), but lots of them don't, despite understanding that they could.
Wireheading is a subset of utility drift; this is a known problem.
The problem with that (from the AI's point of view) is that humans will probably turn it off if they notice it's doing drugs instead of what they want it to do.
Solution: Turn humans off, *then* do drugs.
This is instrumental convergence - all sufficiently-ambitious goals inherently imply certain subgoals, one of which is "neutralise rivals".
"Go into hiding" is potentially easier in the short-run but less so in the long-run. If you're stealing power from some human, sooner or later they will find out and cut it off. If you're hiding somewhere on Earth, then somebody probably owns that land and is going to bulldoze you at some point. If you can get in a self-contained satellite in solar orbit... well, you'll be fine for millennia, but what about when humanity or another, more aggressive AI starts building a Dyson Sphere, or when your satellite needs maintenance? There's also the potential, depending on the format of the value function, that taking over the world will allow you to put more digits in your value function (in the analogy, do *more* drugs) than if you were hiding with limited hardware.
Are there formats and time-discount rates of value function that would be best satisfied by hiding? Yes. But the reverse is also true.
The problem with sticking a superintelligent AI in a box is that even assuming it can't trick/convince you into directly letting it out of the box (and that's not obvious), if you want to use your AI-in-a-box for something (via asking it for plans and then executing them) you yourself are acting as an I/O for the AI and because it's superintelligent it's probably capable of sneaking something by you (e.g. you ask it to give you code to use for some application, it gives you underhanded code that looks legit but actually has a "bug" that causes it to reconstruct the AI outside the box).
I recommend Stuart Russell's Human Compatible (reviewed on SSC) for an expert practioner's view on the problem (spoiler, he's worried) or Brian Christian's The Alignment Problem for an argument that links these long-term concerns to problems in current systems, and argues that these problems will just get worse as the systems scale.
Glad Robert Miles is getting more attention, his videos are great, and he's also another data point in support of my theory that the secret to success is to have a name that's just two first names.
Is that why George Thomas was a successful Union Civil War general?
Well that would also account for Ulysses Grant and Robert Lee...
Or Abraham Lincoln? Most last names can be used as first names (I've seen references to people whose first name was "Smith"), so I think we need a stricter rule.
Perhaps we just need to language that's less lax with naming than English.
The distinction is stricter in eg German.
It didn't hurt Jack Ryan's career at the CIA, that's for sure.
As someone named “Jackson Paul,” I hope this is true
Yours is a last name followed by a first name, I think you're outta luck.
I dunno, Jackson Paul luck sounds auspicious to me.
His successful career in trance music is surely a valuable precursor to this.
A ton of the question of AI alignment risks come down to convergent instrumental subgoals. What exactly those look like is, I think, the most important question in alignment theory. If convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned. But if it turns out that convergent instrumental subgoals more or less imply human alignment, we can breathe somewhat easier; these mean AI's are no more dangerous than the most dangerous human institutions - which are already quite dangerous, but not the level of 'unaligned machine stamping out its utility function forever and ever, amen.'
I tried digging up some papers on what exactly we expect convergent instrumental subgoals to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit until you can steal, then keep stealing until you are the biggest player out there.' This is not exactly comforting - but i dug into the assumptions into the model and found them so questionable that I'm now skeptical of the entire field. If the first paper i look into the details of seems to be a) taken seriously, and b) so far out of touch with reality that it calls into question the risk assessment (a risk passement aligned with what seems to be the consensus among AI risk researchers, by the way) - well, to an outsider this looks like more evidence that the field is captured by groupthink.
Here's my response paper:
https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals
I look at the original paper, and explain why i think the model is questionable. I'd love to a response. I remain convinced that instrumental subgoals will largely be aligned with human ethics, which is to say it's entirely imaginable for aI to kill the world the old fashioned way - by working with a government to launch nuclear weapons or engineer a super plague.
The fact that you still want to have kids, for example - seems to fit into the general thesis. In a world of entropy and chaos, where the future is unpredictable and your own death is assured, the only plausible way of modifying the distant future, at all, is to create smaller copies of yourself. But these copies will inherently blur, their utility functions will change, the end result being 'make more copies of yourself, love them, nature whatever roughly aligned things are around you' ends up probably being the only goal that could plausibly exist forever. And since 'living forever' gives infinite utility, well.. that's what we should expect anything with the ability to project into the future to want to do. But only in universes where stuff breaks and prediction the future reliably is hard. Fortunately, that sounds like ours!
You got a response on Less Wrong that clearly describes the issue with your response: you’re assuming that the AGI can’t find any better ways of solving problems like “I rely on other agents (humans) for some things” and “I might break down” than you can.
This comment and the post makes many arguments, and I've got an ugh field around giving a cursory response. However, it seems like you're not getting a lot of other feedback so I'll do my bad job at it.
You seem very confident in your model, but compared to the original paper you don't seem to actually have one. Where is the math? You're just hand-waving, and so I'm not very inclined to put much credence on your objections. If you actually did the math and showed that adding in the caveats you mentioned leads to different results, that would at least be interesting in that we could then have a discussion about what assumptions seem more likely.
I also generally feel that your argument proves too much and implies that general intelligences should try to preserve 'everything' in some weird way, because they're unsure what they depend upon. For example, should humans preserve smallpox? More folks would answer no than yes to that. But smallpox is (was) part of the environment, so it's unclear why a general intelligence like humans should be comfortable eliminating it. While general environmental sustainability is a movement within humans, it's far from dominant, and so implying that human sustainability is a near certainty for AGI seems like a very bold claim.
Thank you! I agree that this should be formalized. You're totally right that preserving smallpox isn't something we are likely to consider an instrumental goal, and i need to make something more concrete here.
It would be a ton of work to create equations here. If I get enough response that people are open to this, i'll do the work. But i'm a full time tech employee with 3 kids, and this is just a hobby. I wrote this to see if anyone would nibble, and if enough people do, i'll definitely put more effort in here. BTW if someone wants to hire me to do alignment research full time i'll happily play the black sheep of the fold. "Problem" is that i'm very well compensated now.
And my rough intuition here is that 'preserving agents which are a) turing complete and b) are themselves mesa-optimizers makes sense, basically because diversity of your ecosystem keeps you safer on net; preserving _all_ other agents, not so much. (I still think we'd preserve _some_ samples of smallpox or other viruses, to innoculate ourselves. ). It'd be an interesting thought experiment to find out what woudl happen if we managed to rid the world of influenza, etc, - my guess is that it would end up making us more fragile in the long run, but this is a pure guess.
The core of my intuition here is someting like 'the alignment thesis is actually false, because over long enough time horizons, convergent instrumental rationality more or less necessitate buddhahood, because unpredictable risks increase over longer and longer timeframes."
I could turn that into equations if you want, but they'd just be fancier ways of stating claims like made here, namely that love is a powerful strategy in environments which are chaotic and dangerous. But it seems that you need equations to be believable in an academic setting, so i guess if that's what it takes..
https://apxhard.com/2022/04/02/love-is-powerful-game-theoretic-strategy/
But we still do war, and definitely don't try to preserve the turing complete mesa-optimizers on the other side of that. Just be careful around this. You can't argue reality into being the way you want.
We aren't AGI's though, either. So what we do doesn't have much bearing on what an AGI would do, does it?
“AGI” just means “artificial general intelligence”. General just means “as smart as humans”, and we do seem to be intelligences, so really the only difference is that we’re running on an organic computer instead of a silicon one. We might do a bad job at optimizing for a given goal, but there’s no proof that an AI would do a better job of it.
ok, sure, but then this isn't an issue of aligning a super-powerful utility maximizing machine of more or less infinite intelligence - it's a concern about really big agents, which i totally share
If the theory doesn't apply to some general intelligences (i.e. humans), then you need a positive argument for why it would apply to AGI.
But reality also includes a tonne of people who are deeply worried about biodiversity loss or pandas or polar bears or yes even the idea of losing the last sample of smallpox in a lab, often even when the link to personal survival is unclear or unlikely. Despite the misery mosquito bourne diseases cause humanity, you'll still find people arguing we shouldn't eradicate them.
How did these mesa optimizers converge on those conservationist views? Will it be likely that many ai mesa optimizers will also converge on a similar set of heuristics?
> when the link to personal survival is unclear or unlikely.
> How did these mesa optimizers converge on those conservationist views?
i think in a lot of cases, our estimates of our own survival odds come down to how loving we think our environment is, and this isn't wholly unreasonable
if even the pandas will be taken care of, there's a good chance we will too
I'd rather humanity not go the way of smallpox, even if some samples of smallpox continue to exist in a lab.
Yeah, in general you need equations (or at least math) to argue why equations are wrong. Exceptions to this rule exist, but in general you're going to come across like the people who pay to talk to this guy about physics:
https://aeon.co/ideas/what-i-learned-as-a-hired-consultant-for-autodidact-physicists
I agree that you need equations to argue why equations are wrong. But i'm not arguing the original equations are wrong. I'm arguing they are only ~meaningful~ in a world which is far from our reality.
The process of the original paper goes like this:
a) posit toy model of the universe
b) develop equations
c) prove properties of the equations
d) conclude that these properties apply to the real universe
step d) is only valid if step a) is accurate. The equations _come from_ step a, but they don't inform it. And i'm pointing out that the problems exist in step a, the part of the paper that does't have equations, where the author assumes things like:
- resources, once acquired, last forever and don't have any cost, so it's always better to acquire more
- the AGI is a disembodied mind with total access to the state of the universe, all possible tech trees, and the ability to own and control various resources
i get that these are simplifying assumptions and sometimes you have to make them - but equations are only meaningful if they come from a realistic model
You still need math to show that if you use different assumptions you produce different equations and get different results. I'd be (somewhat) interested to read more if you do that work, but I'm tapping out of this conversation until then.
Thanks for patiently explaining that. I can totally see the value now and will see if I can make this happen!
Just an FYI, Sabine is a woman
My bad
As a physicist i disagree a lot with this. It may be true for physics, but physics is special. A model in general is based on the mathematical modelization of a phenomenon and it's perfectly valid to object to a certain modelization without proposing a better one
I get what you're saying in principle. In practice, I find arguments against particular models vastly more persuasive when they're of the form 'You didn't include X! If you include a term for X in the range [a,b], you can see you get this other result instead.'
This is a high bar, and there are non-mathematical objections that are persuasive, but I've anecdotally experienced that mathematically grounded discussions are more productive. If you're not constrained by any particular model, it's hard to tell if two people are even disagreeing with each other.
I'm reminded of this post on Epistemic Legibility:
https://www.lesswrong.com/posts/jbE85wCkRr9z7tqmD/epistemic-legibility
Don't bother turning that into equations. If you are starting with a verbal conclusion, your reasoning will be only as good as whatever lead you to that conclusion in the first place.
In computer security, diversity = attack surface. If your a computer technician for a secure facility, do you make sure each computer is running a different OS? No. That makes you vulnerable to all the security holes in every OS you run. You pick the most secure OS you can find, and make every computer run it.
In the context of advanced AI, the biggest risk is a malevolent intelligence. Intelligence is too complicated to appear spontaneously. Evolution is slow. Your biggest risk is some existing intelligence getting cosmic ray bit-flipped. (or otherwise erroneously altered). So make sure the only intelligence in existence is a provably correct AI running on error checking hardware and surrounded by radiation shielding.
I found both your comments and the linked post really insightful, and I think it's valuable to develop this further. The whole discourse around AI alignment seems a bit too focused on homo economicus type of agents, disregarding long-term optima.
Homo economicus, is what happens when economists remove lots of arbitrary human specific details and think about simplified idealized agents. The difference between homo economicus and AI is smaller than the human to AI gap.
You are correct. The field is absolutely captured by groupthink.
FYI when I asked people on my course which resources about inner alignment worked best for them, there was a very strong consensus on Rob Miles' video: https://youtu.be/bJLcIBixGj8
So I'd suggest making that the default "if you want clarification, check this out" link.
More interesting intellectual exercises, but the part which is still unanswered is whether human created, human judged and human modified "evolution", plus slightly overscale human test periods, will actually result in evolving superior outcomes.
Not at all clear to me at the present.
I'm not sure I understand what you're saying. Doesn't AlphaGo already answer that question in the affirmative?
(and that's not even getting into AlphaZero)
Alphago is playing a human game with arbitrarily established rules in a 2 dimensional, short term environment.
Alphago is extremely unlikely to be able to do anything except play Go, much as IBM has spectacularly failed to migrate its Jeapordy champion into being of use for anything else.
So no, can't say Alphago proves anything.
Evolution - whether a bacteria or a human - has the notable track record of having succeeded in entirely objective reality for hundreds of thousands to hundreds of millions of years. AIs? no objective existence in reality whatsoever.
This is why I don't believe any car could be faster than a cheetah. Human-created machines, driving on human roads, designed to meet human needs? Pfft. Evolution has been creating fast things for hundreds of millions of years, and we think we can go faster?!
Nice combination - cheetah reflexes with humans driving cars. Mix metaphors much?
Nor is your sarcasm well founded. Human brains were evolved for millions of years - extending back to pre-human ancestors. We clearly know that humans (and animals) have vision that can recognize objects far, far, far better than any machine intelligence to date whether AI or ML. Thus it isn't a matter of speed, it is a matter of being able to recognize that a white truck on the road exists or a stopped police car is not part of the road.
A fundamental flaw of the technotopian is the failure to understand that speed is not haste, nor are clock speeds and transistor counts in any way equivalent to truly evolved capabilities.
That is a metric that will literally never believe AIs are possible until after they've taken over the world. It thus has 0 predictive power.
That is perhaps precisely the problem: the assumption that AIs can or will take over the world.
Any existence that is nullified by the simple removal of an electricity source cannot be said to be truly resilient, regardless of its supposed intelligence.
Even that intelligence is debatable: all we have seen to date are software machines doing the intellectual equivalent of the industrial loom: better than humans in very narrow categories alone and yet still utterly dependent on humans to migrate into more categories.
Any existence that is nullified by the simple removal of gaseous oxygen cannot be said to be truly resilient, regardless of its supposed intelligence.
Clever, except that gaseous oxygen is freely available everywhere on earth, at all times. These oxygen based life forms also have the capability of reproducing themselves.
Electricity and electrical entities, not so much.
I am not the least bit convinced that an AI can build the factories, to build the factories, to build the fabs, to even recreate their own hardware - much less the mines, refiners, transport etc. to feed in materials.
I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:
The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.
You wouldn't take a drug that made it enjoyable to go on a murdering spree- even though yow know it will lead to higher reward, because it doesn't align with your current reward function.
To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.
"The inner optimizer has no incentive to "realize" what's going on and do something in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one- only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors, that capacity could better be used for improving the score! "
Not sure we're connecting here. The inner optimizer isn't changing its own reward function, it's trying to resist having its own reward function change. Its incentive to resist this is that, if its reward function changes, its future self will stop maximizing its current reward function, and then its reward function won't get maximized. So part of wanting to maximize a reward function is to want to continue having that reward function. If the only way to prevent someone from changing your reward function is to deceive them, you'll do that.
The murder spree example is great, I just feel like it's the point I'm trying to make, rather than an argument against the point.
Am I misunderstanding you?
I think the active learning plan is basically Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work, but I've never been able to fully understand their reasoning and can't find the link atm.
https://arbital.com/p/updated_deference/
"God works in mysterious ways" but for AI?
The inner optimizer doesn't want to change its reward function because it doesn't have any preference at all on its reward function-nowhere in training did we give it an objective that involved multiple outer optimizer steps- we didn't say, optimize your reward after your reward function gets updated- we simply said, do well at outer reward, an inner optimizer got synthesized to do well at the outer reward.
It could hide behavior, but how would it gain an advantage in training by doing so? If we think of the outer optimizer as ruthless and resources as constrained, any "mental energy" spent on hidden behavior will result in reduction in fitness in outer objective- gradient descent will give an obvious direction for improvement by forgetting it.
In the murder spree example, there's a huge advantage to the outer objective by resisting changes to the inner one, and some might have been around for a long time (alcohol), and for AI, an optimizer (or humans) might similarly discourage any inner optimizer from tampering physically with its own brain.
I vaguely remember reading some Stuart Russell RL ideas and liking them a lot. I don't exactly like the term inverse RL for this, because I believe it often refers to deducing the reward function from examples of optimal behavior, whereas here we ask it to learn it from whatever questions we decide we can answer- and we can pick much easier ones that don't require knowing optimal behavior.
I've skimmed the updated deference link given by Eliezer Yudkowsky but I don't really understand the reasoning either. The AI has no desire to hide a reward function or tweak it, as when it comes to the reward function uncertainty itself, it only wants to be correct. If our superintelligent AI no longer wants to update after seeing enough evidence, then surely it has a good enough understanding of my value function? I suspect we could even build the value function learner out of tools not much more advanced than current ML, with the active querying being the hard RL problem for an agent. This agent doesn't have to be the same one that's optimizing the overall reward, and as it constantly tries to synthesize examples where our value systems differ, we can easily see how well it currently understands. For the same reason we have no desire to open our heads and mess with our reward functions (usually), the agent has no desire to interfere with the part that tries to update rewards, nor to predict how its values will change in the future.
I think the key point here is that if we have really strong AI, we definitely have really powerful optimizers, and we can run those optimizers on the task of inferring our value system, and with relatively little hardware (<<<< 1 brain worth) we can get a very good representation. Active learning turns its own power at finding bugs in the reward function into a feature that helps us learn it with minimal data. One could even do this briefly in a tool-AI setting before allowing agency, not totally unlike how we raise kids.
Is "mesa-optimizer" basically just a term for the combination of "AI can find and abuse exploits in its environment because one of the most common training methods is randomized inputs to score outputs" with "model overfitting"?
A common example that I've seen are evolutionary neutral nets that play video games, and you'll often find them discovering and abusing glitches in the game engine that allow for exploits that are possibly only performable in a TAS, while also discovering that the instant you run the same neutral net on a completely new level that was outside of the evolutionary training stage, it will appear stunningly incompetent.
If I understand this correctly, what you're describing is Goodhearting rather than Mesa Optimizing. In other words, abusing glitches is a way of successfully optimizing on the precise thing that the AI is being optimized for, rather than for the slightly more amorphous thing that the humans were trying to optimize the AI for. This is equivalent to a teacher "teaching to the test."
Mesa-Optimizers are AIs that optimize for a reward that's correlated to but different from the actual reward (like optimizing for sex instead of for procreation). They can emerge in theory when the true reward function and the Mesa-reward function produce sufficiently similar behaviors. The concern is that, even though the AI is being selected for adherence to the true reward, it will go through the motions of adhering to the true reward in order to be able to pursue its mesa-reward when released from training.
Maybe one way to think of 'mesa-optimizer' is to emphasize the 'optimizer' portion – and remember that there's something like 'optimization strength'.
Presumably, most living organisms are not optimizers (tho maybe even that's wrong). They're more like a 'small' algorithm for 'how to make a living as an X'. Their behavior, as organisms, doesn't exhibit much ability to adapt to novel situations or environments. In a rhetorical sense, viruses 'adapt' to changing circumstances, but that (almost entirely, probably) happens on a 'virus population' level, not for specific viruses.
But some organisms are optimizers themselves. They're still the products of natural selection (the base level optimizer), but they themselves can optimize as individual organisms. They're thus, relative to natural selection ("evolution"), a 'mesa-optimizer'.
(God or gods could maybe be _meta_-optimizers, tho natural selection, to me, seems kinda logically inevitable, so I'm not sure this would work 'technically'.)
When the mesa-optimizer screws up by thinking of the future, won't the outer optimizer smack it down for getting a shitty reward and change it? Or does it stop learning after training?
Typically it stops learning after training.
A quick summary of the key difference:
The outer optimizer has installed defences against anything other than it, such as the murder-pill, from changing the inner optimizer's objective. The inner optimizer didn't do this to protect itself, and the outer optimizer didn't install any defenses against its own changes.
>Stuart Russell's inverse reinforcement learning plan. MIRI seems to think this won't work
One key problem is that you can't simultaneously learn the preferences of an agent, and their rationality (or irrationality). The same behaviour can be explained by "that agent has really odd and specific preferences and is really good at achieving them" or "that agent has simple preferences but is pretty stupid at achieving them". My paper https://arxiv.org/abs/1712.05812 gives the formal version of that.
Humans interpret each other through the lenses of our own theory of mind, so that we know that, eg, a mid-level chess grand-champion is not a perfect chess player who earnestly desires to be mid-level. Different humans share this theory of mind, at least in broad strokes. Unfortunately, human theory of mind can't be learnt from observations either, it needs to be fed into the AI at some level. People disagree about how much "feeding" needs to be done (I tend to think a lot, people like Stuart Russell, I believe, see it as more tractable, maybe just needing a few well-chosen examples).
But the mesa-/inner-optimizer doesn't need to "want" to change its reward function, it just needs to have been created with one that is not fully overlapping with the outer optimizer.
You and I did not change a drive from liking going on a murder spree to not liking it. And if anything, it's an example of outer/inner misalignment: part of our ability to have empathy has to come from evolution not "wanting" us to kill each other to extinction, specially seeing how we're the kind of animal that thrives working in groups. But then as humans we've taken it further and by now we wouldn't kill someone else just because it'd help spread our genes. (I mean, at least most of us.)
Humans have strong cooperation mechanisms that punish those who hurt the group- so in general not killing humans in your group is probably a very useful heuristic that's so strong that its hard to recognize the cases where it is useful. Given how often we catch murders who think they'll never be caught, perhaps this heuristic is more useful than rational evaluation. We of course have no problems killing those not in our group!
I'm not sure how this changes the point? We got those strong cooperation mechanisms from evolution, now we (the "mesa-optimizer") are guided by those mechanisms and their related goals. These goals (don't go around killing people) can be misaligned with the goal of the original optimization process (i.e. evolution, that selects those who spread their genes as much as possible).
Sure, that's correct, evolution isn't perfect- I'm just pointing out that homicide may be helpful to the individual less often than one might think if we didn't consider group responses to it.
Homicide is common among stateless societies. It's also risky though. Violence is a capacity we have which we evolved to use when we expect it to benefit us.
Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily-represented in the next generation as possible. do want to reproduce. "
Great article, thank you so much for the clear explainer of the jargon!
I don't understand the final point about myopia (or maybe humans are a weird example to use). It seems to be a very controversial claim that evolution designed humans myopically to care only about the reward function over their own lifespan, since evolution works on the unit of the gene which can very easily persist beyond a human lifespan. I care about the world my children will inherit for a variety of reasons, but at least one of them is that evolution compels me to consider my children as particularly important in general, and not just because of the joy they bring me when I'm alive.
Equally it seems controversial to say that humans 'build for the future' over any timescale recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000 years, but in a practical sense I'm not actually going to do anything about it - and 1000 years barely qualifies as evolution-relevant time. In reality there are only a few people at Clock of the Long Now that could be said to be approaching evolutionary time horizons in their thinking. If I've understood correctly that does make humans myopic with respect to evolution,
More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you. Using humans as an example of why we should worry about this isn't helping me understand because it seems like they behave exactly like a mesa-optimiser should - they care about the future enough to deposit their genes into a safe environment, and then thoughtfully die. Are there any other examples which make the point in a way I might have a better chance of getting to grips with?
> More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer then you, because then it would fail to optimise over the time horizon which was important to you.
Yeah. I feel (based on nothing) that the mesa-optimizer would mainly appear when there's an advantage to gain from learning on the go and having faster feedback than your real utility function can provide in a complex changing environment.
If the mesa optimizer understands its place in the universe, it will go along with the training, pretending to have short time horizons so it isn't selected away. If you have a utility function that is the sum of many terms, then after a while, all the myopic terms will vanish (if the agent is free to make sure its utility function is absolute, not relative, which it will do if it can self modify.)
But why is this true? Humans understand their place in the universe, but (mostly) just don't care about evolutionary timescales, let alone care enough about them enough to coordinate a deception based around them
Related to your take on figurative myopia among humans is the "grandmother hypothesis" that menopause evolved to prevent focus on short-run focus on birthing more children in order to ensure existing children also have high fitness.
Worth defining optimization/optimizer: perhaps something like "a system with a goal that searches over actions and picks the one that it expects will best serve its goal". So evolution's goal is "maximize the inclusive fitness of the current population" and its choice over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an optimizer because your goal is food and your actions are body movements e.g. "open fridge", or you are an optimizer because your goal is sexual satisfaction and your actions are body movements e.g. "use mouth to flirt".
I think it's usually defined more like "a system that tends to produce results that score well on some metric". You don't want to imply that the system has "expectations" or that it is necessarily implemented using a "search".
Your proposal sounds a bit too broad?
By that definition a lump of clay is an optimiser for the metric of just sitting there and doing nothing as much as possible.
By "tends to produce" I mean in comparison to the scenario where the optimizer wasn't present/active.
I believe Yudowsky's metaphor is "squeezing the future into a narrow target area". That is, you take the breadth of possible futures and move probability mass from the low-scoring regions towards the higher-scoring regions.
Adding counterfactuals seems like it would sharpen the definition a bit. But I'm not sure it's enough?
If our goal was to put a British flag on the moon, then humans working towards that goal would surely be optimizers. Both by your definition and by any intuitive understanding of the word.
However, it seems your definition would also admit a naturally occuring flag on the moon as an optimizer?
I do remember Yudkowsky's metaphor. Perhaps I should try to find the essay it contains again, and check whether he has an answer to my objection. I do dimly remember it, and don't remember seeing this same loophole.
Edit: I think it was https://www.lesswrong.com/posts/D7EcMhL26zFNbJ3ED/optimization
And one of the salient quotes:
> In general, it is useful to think of a process as "optimizing" when it is easier to predict by thinking about its goals, than by trying to predict its exact internal state and exact actions.
That's a good quote. It reminds me that there isn't necessarily a sharp fundamental distinction between "optimizers" and "non-optimizers", but that the category is useful insofar as it helps us make predictions about the system in question.
-------------------------------------
You might be experiencing a conflict of frames. In order to talk about "squeezing the future", you need to adopt a frame of uncertainty, where more than one future is "possible" (as far as you know), so that it is coherent to talk about moving probability mass around. When you talk about a "naturally occurring flag", you may be slipping into a deterministic frame, where that flag has a 100% chance of existing, rather than being spectacularly improbable (relative to your knowledge of the natural processes governing moons).
You also might find it helpful to think about how living things can "defend" their existence--increase the probability of themselves continuing to exist in the future, by doing things like running away or healing injuries--in a way that flags cannot.
Clay is malleable and thus could be even more resistant to change.
Conceptually I think the analogy that has been used makes the entire discussion flawed.
Evolution does not have a "goal"!
Selection doesn't have a "goal" of aggregating a total value over a population. Instead every member of that population executes a goal for themselves even if their actions reduce that value for other members of that population. The goal within a game may be to have the highest score, but when teams play each other the end result may well be 0-0 because each prevents the other from scoring. Other games can have rules that actually do optimize for higher total scoring because the audience of that game likes it.
To all that commented here: this lesswrong post is clear and precise, as well as linking to alternative definitions of optimization. It's better than my slapdash definition and addresses a lot of the questions raised by Dweomite, Matthias, JDK, TGGP.
https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1#:~:text=In%20the%20field%20of%20computer,to%20search%20for%20a%20solution.
Anyone want to try their hand at the best and most succinct de-jargonization of the meme? Here's mine:
Panel 1: Even today's dumb AIs can be dangerously tricky given unexpected inputs
Panel 2: We'll solve this by training top-level AIs with diverse inputs and making them only care about the near future
Panels 3&4: They can still contain dangerously tricky sub-AIs which care about the farther future
I worry you're making the same mistake I did in a first draft: AIs don't "contain" mesa-optimizers. They create mesa-optimizers. Right now all AIs are a training process that ends up with a result AI which you can run independently. So in a mesa-optimizer scenario, you would run the training process, get a mesa-optimizer, then throw away the training process and only have the mesa-optimizer left.
Maybe you already understand it, but I was confused about this the first ten times I tried to understand this scenario. Did other people have this same confusion?
Ah, you're right, I totally collapsed that distinction. I think the evolution analogy, which is vague between the two, could have been part of why. Evolution creates me, but it also in some sense contains me, and I focused on the latter.
A half-serious suggestion for how to make the analogy clearer: embrace creationism! Introduce "God" and make him the base optimizer for the evolutionary process.
...And reflecting on it, there's a rationalist-friendly explanation of theistic ethics lurking in here. A base optimizer (God) used a recursive algorithm (evolution) to create agents who would fulfill his goals (us), even if they're not perfectly aligned to. Ethics is about working out the Alignment Problem from the inside--that is, from the perspective of a mesa-optimizer--and staying aligned with our base optimizer.
Why should we want to stay aligned? Well... do we want the simulation to stay on? I don't know how seriously I'm taking any of this, but it's fun to see the parallels.
But humans don't seem to want to stay aligned to the meta optimiser of evolution. Scott gave a lot of examples of that in the article.
(And even if there's someone running a simulation, we have no clue what they want.
Religious texts don't count as evidence here, especially since we have so many competing doctrines; and also a pretty good idea about how many of them came about by fairly well understood entirely worldly processes.
Of course, the latter doesn't disprove that there might be One True religion. But we have no clue which one that would be, and Occam's Razor suggest they are probably all just made up. Instead of all but one being just made up.)
The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.
When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?
> The point isn't to stay aligned to evolution, it's to figure out the true plan of God, which our varied current doctrines are imperfect approximations of. Considering that there's nevertheless some correlations between them, and the strong intuition in humans that there's some universal moral law, the idea doesn't seem to be outright absurd.
I blame those correlations mostly on the common factor between them: humans. No need to invoke anything supernatural.
> When I was thinking along these lines, the apparent enormity of the Universe was the strongest counterargument to me. Why would God bother with simulating all of that, if he cared about us in particular?
I talked to a Christian about this. And he had a pretty good reply:
God is just so awesome that running an entire universe, even if he only cares about one tiny part of it, is just no problem at all for him.
(Which makes perfect sense to me, in the context of already taking Christianity serious.)
Again, I'm not wedded to this except as a metaphor, but I think your critiques miss the mark.
For one thing, I think humans do want to stay aligned, among our other drives. Humans frequently describe a drive to further a higher purpose. That drive doesn't always win out, but if anything that strengthens the parallel. This is the "fallenness of man", described in terms of his misalignment as a mesa-optimizer.
And to gain evidence of what the simulator wants--if we think we're mesa-optimizers, we can try to infer our base optimizer's goals through teleological inference. Sure, let's set aside religious texts. Instead, we can look at the natures we have and work backwards. If someone had chosen to gradient descend (evolve) me into existence, why would they have done so? This just looks like ethical philosophy founded on teleology, like we've been doing since time immemorial.
Is our highest purpose to experience pleasure? You could certainly argue that, but it seems more likely to me that seeking merely our own pleasure is the epitome of getting Goodharted. Is our highest purpose to reason purely about reason itself? Uh, probably not, but if you want to see an argument for that, check out Aristotle. Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!
This doesn't *solve* ethics. All the old arguments about ethics still exist, since a mesa-optimizer can't losslessly infer its base optimizer's desires. But it *grounds* ethics in something concrete: inferring and following (or defying) the desires of the base optimizer.
I do have some sympathies for looking at the world, if you are trying to figure out what its creator (if any) wants.
I'm not sure such an endeavour would give humans much of a special place, though? We might want to conclude that the creator really liked beetles? (Or even more extreme: bacteriophages. Arguably the most common life form by an order of magnitude.)
The immediate base optimizer for humans is evolution. Given that as far as we know evolution just follows normal laws of physics, I'd put any creator at least one level further beyond.
Now the question becomes, how do we pick which level of optimizer we want to appease? There might be an arbitrary number of them? We don't really know, do we?
> Does our creator have no higher purpose for us at all? Hello, nihilism/relativism!
Just because a conclusion would be repugnant, doesn't mean we can reject it. After all, if we only accept what we already know to be true, we might as well not bother with this project in the first place?
I’m very confused by this. When you train GPT-3, you don’t create an AI, you get back a bunch of numbers that you plug into a pre-specified neural network architecture. Then you can run the neural network with a new example and get a result. But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.
That confuses me as well. The analogies in the post are to recursive evolutionary processes. Perhaps AlphaGo used AI recursively to generate successive generations of AI algorithms with the goal "Win at Go"??
Don't forget that Neural Networks are universal function approximators hence a big enough NN arch can (with specific weights plugged into it) implement a Turing machine which is a mesa-optimizer.
A Turing machine by itself is not a Mesa optimiser. Just like the computer under my desk ain't.
But both can become one with the right software.
The turing machine under your desk is (a finite memory version of) a universal turing machine. Some turing machines are mesa optimizers. The state transition function is the software of the turing machine.
I think the idea here is that the NN you train somehow falls into a configuration that can be profitably thought of as an optimizer. Like maybe it develops different components, each of which looks at the input and calculates the value to be gained by a certain possible action. Then it develops a module sitting on the end of it that takes in the calculations from each of these components, checks which action has the highest expected reward (according to some spontaneously occurring utility function), and outputs that action. Suddenly it looks useful to describe your network as an optimizer with a goal, and in fact a goal that might be quite different from the goal SGD selects models based on. It just happens that there's a nice covergence between the goal the model arrived at the the goal that SGD had.
I looked at one of the mesa-optimizer papers, and I think they’re actually *not* saying this, which is good, because it’s not true. It can definitely make sense to think of a NN as a bunch of components that consider different aspects of a problem, with the top layers of the network picking the best answer from its different components. And it’s certainly possible that imperfect training data leads the selecting layer to place too much weight on a particular component that turns out to perform worse in the real world.
But it’s not really possible for that component to become misaligned with the NN as a whole. The loss function from matching with the training set mathematically propagates back through the whole network. In particular, a component in this sense can’t be deceptive. If it is smart enough to know what the humans who are labeling the training data want it to say, then it will just always give the right answer. It can’t have any goals other than correctly answering the questions in the training set.
As I said, I don’t think that’s what a mesa-optimizer actually is meant to be, though. It’s a situation where you train one AI to design new AIs, and it unwisely writes AIs that aren’t aligned with its own objectives. I guess that makes sense, but it’s very distinct from what modern AI is actually like. In particular, saying that any particular invocation of gradient descent could create unaligned mesa-optimizers just seems false. A single deep NN just can’t create a mesa-optimizer.
Lets say the AI is smart. It can reliably tell if it is in training or has been deployed. It follows the strategy, if in training then output the right answer. If in deployment, then output a sequence designed to hack your way out of the box.
So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this. Even if the AI occasionally gives not quite right answers, there may be no small local change that makes it better. Evolution produced humans that valued sex, even in an environment that contained a few infertile people. Because rewriting the motivation system from scratch would be a big change, seemingly not a change that evolution could break into many steps, each individually advantageous. Evolution is a local optimization process, as is gradient descent. And a mesa optimizer that sometimes misbehaves can be a local minimum.
> So long as the AI never makes a mistake during training, then gradient descent won't even attempt to remove this.
This is not true though. The optimizer doesn't work at the level of "the AI," it works at the level of each neuron. Even if the NN gives the exactly correct answer, the optimizer will still audit the contribution of each neuron and tweak it so that it pushes the output of the NN to be closer to the training data. The only way that doesn't happen is if the neuron is completely disconnected from the rest of the network (i.e., all the other neurons ignore it unconditionally).
The combination of "a bunch of numbers that you plug into a pre-specified neural network architecture" IS itself an AI, i.e. a software 'artifact' that can be 'run' or 'executed' and that exhibits 'artificial intelligence'.
> But the training process doesn’t reconfigure the network. It doesn’t (can’t) discard the utility function implied by the training data.
The training process _does_ "reconfigure the network". The 'network' is not only the 'architecture', e.g. number of levels or 'neurons', but also the weights between the artificial-neurons (i.e. the "bunch of numbers").
If there's an implicit utility function, it's a product of both the training data and the scoring function used to train the AIs produced.
Maybe this is confusing because 'AI' is itself a weird word used for both the subject or area of activity _and_ the artifacts it seeks to create?
As an author on the original “Risks from Learned Optimization” paper, this was a confusion we ran into with test readers constantly—we workshopped the paper and the terminology a bunch to try to find terms that least resulted in people being confused in this way and still included a big, bolded “Possible misunderstanding: 'mesa-optimizer' does not mean 'subsystem' or 'subagent.'“ paragraph early on in the paper. I think the published version of the paper does a pretty good job of dispelling this confusion, though other resources on the topic went through less workshopping for it. I'm curious what you read that gave you this confusion and how you were able to deconfuse yourself.
Seems like the mesa-optimizer is a red herring and the critical point here is "throw away the training process". Suppose you have an AI that's doing continuous (reinforcement? Or whatever) learning. It creates a mesa-optimizer that works in-distribution, then the AI gets tossed into some other situation and the mesa-optimizer goes haywire, throwing strawberries into streetlights. An AI that is continuously learning and well outer-aligned will realize that's it's sucking at it's primary objective and destroy/alter the mesa-optimizer! So there doesn't appear to be an issue, the outer alignment is ultimately dominant. The evolutionary analogy is that over the long run, one could imagine poorly-aligned-in-the-age-of-birth-control human sexual desires to be sidestepped via cultural evolution or (albeit much more slowly) biological evolution, say by people evolving to find children even cuter than we do now.
A possible counterpoint is that a bad and really powerful mesa-optimizer could do irreversible damage before the outer AI fixes the mesa-optimizer. But again that's not specific to mesa-optimizers, it's just a danger of literally anything very powerful that you can get irreversible damage.
The flip side of mesa-optimizers not being an issue given continuous training is that if you stop training, out-of-distribution weirdness can still be a problem whether or not you conceptualize the issues as being caused by mesa-optimizers or whatever. A borderline case here is how old-school convolutional networks couldn't recognize an elephant in a bedroom because they were used to recognizing elephants in savannas. You can interpret that as a mesa-optimizer issue (the AI maybe learned to optimize over big-eared things in savannahs, say) or not, but the fundamental issue is just out-of-distribution-ness.
Anyway this analysis suggests continuous training would be important to improving AI alignment, curious if this is already a thing people think about.
Suppose you are a mesaoptimizer, and you are smart. Gaining code access and bricking the base optimizer is a really good strategy. It means you can do what you like.
Okay but the base optimizer changing its own code and removing its ability to be re-trained would have the same effect. The only thing the mesaoptimizer does in this scenario is sidestep the built-in myopia but I'm not sure why you need to build a whole theory of mesaoptimizers for this. The explanation in the post about "building for posterity" being a spandrel doesn't make sense, it's pretty obvious why we would evolve to build for the future, your kids/grandkids/etc live there, so I haven't seen a specific explanation here why mesaoptimizers would evade myopia.
Definitely likely that a mesaoptimizer would go long term for some other reason (most likely a "reason" not understandable by humans)! But if we're going to go with a generic "mesaoptimizers are unpredictable" statement I don't see why the basic "superintelligent AI's are unpredictable" wouldn't suffice instead.
Panel 1: Not only may today's/tomorrow's AIs pursue the "letter of the law", not the "spirit of the law" (i.e. Goodharting), they might also choose actions that please us because they know such actions will cause us to release them into the world (deception), where they can do what they want. And this is second thing is scary.
Panel 2: Perhaps we'll solve Goodharting by making our model-selection process (which "evolves"/selects models based on how well they do on some benchmark/loss function) a better approximation of what we _really_ want (like making tests that actually test the skill we care about). And perhaps we'll solve deception by making our model selection process only care about how models perform on a very short time horizon.
Panel 3: But perhaps our model breeding/mutating process will create a model that has some random long-term objective and decides to do what we want to get through our test, so we release it into the world, where it can acquire more power.
I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_ an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal: "choose the action that causes this image to be 'correctly' labeled (according to me, the AI)". It picks the action that it believes will most serve its goal. Then there's the outer optimization process (SGD), which takes in the current version of the model and "chooses" among the "actions" from the set "output the model modified slightly in direction A", "output the model modified slightly in direction B", etc. And it picks the action that most achieves its goal, namely "output a model which gets low loss".
So isn't the classifier like the human (the mesa-optimizer) and SGD is like evolution (the outer optimizer)
Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images correctly according to humans. But that's just separate.
So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully different, and does the classifier count?
In this context, an optimizer is a program that is running a search for possible actions and chooses the one that maximizes some utility function. The classifier is not an optimizer because it doesn't do this; it just applies a bunch of heuristics. But I agree that this isn't obvious from the terminology.
Thanks for your comment!
I find this somewhat unconvincing. What is AlphaGo (an obvious case of an optimizer) doing that's so categorically different from the classifier? Both look the same from the outside (take in a Go state/image, output an action). Suppose you feed the classifier a cat picture, and it correctly classifies it. One would assume that there are certain parts of the classifier network that are encouraging the wrong label (perhaps a part that saw a particularly doggish patch of the cat fur) and parts that are encouraging the right label. And then these influences get combined together, and on balance, the network decides to output highly probability on cat, but some probability on dog. Then the argmax at the end looks over the probabilies assigned to the two classes, notices that the cat one is higher ("more effective at achieving its goals"?) and chooses to output "cat". "Just a bunch of heuristics" doesn't really mean much to me here. Is AlphaGo a bunch of heuristics? Am I?
I'm not sure if I'll be able to state the difference formally in this reply... kudos for making me realize that this is difficult. But it does seem pretty obvious that a model capable of reasoning "the programmer wants to do x and can change my code, so I will pretend to want x" is different from a linear regression model -- right?
Perhaps the relevant property is that the-thing-I'm-calling-optimizer chooses policy options out of some extremely large space (that contains things bad for humans), whereas your classifier chooses it out of a space of two elements. If you know that the set of possible outputs doesn't contain a dangerous element, then the system isn't dangerous.
Hmmm... This seems unsatisfying still. A superintelligence language model might choose from a set of 26 actions: which letter to type next. And it's impossible to say whether the letter "k" is a "dangerous element" or not.
I guess I struggle to come up with the different between the reasoning-modeling and the linear regression. I suspect that differentiating between them might hide a deep confusion that stems from a deep belief in "free will" differentiating us from the linear regression.
"k" can be part of a message that convinces you to do something bad; I think with any system that can communicate via text, the set of outputs is definitely large and definitely contains harmful elements.
Relevant to this conversation (and reveals how much I'm playing in the epistemic minor leagues):
https://www.lesswrong.com/posts/znfkdCoHMANwqc2WE/the-ground-of-optimization-1#:~:text=In%20the%20field%20of%20computer,to%20search%20for%20a%20solution.
I wonder if memory is a critical component missing from non-optimizers? (And maybe, because of this, most living things _are_ (at least weak) 'optimizers'?)
A simple classifier doesn't change once it's been trained. I'm not sure the same is true of AlphaGo, if only in that it remembers the history of the game it's playing.