Tales Of Takeover In CCF-World
Machine Alignment Monday, 7/3/2023
I mentioned earlier that the CCF report comes out of Open Philanthropy’s school of futurism, which differs from the Yudkowsky school where a superintelligent AI quickly takes over. Open Philanthropy is less explicitly apocalyptic than Yudkowsky, but they have concerns of their own about the future of humanity.
I talked some people involved with the CCF report about possible scenarios. Thanks especially to Daniel Kokotajlo of OpenAI for his contributions.
Scenario 1: The Good Ending
Technology advances. Someone develops AI as smart as the smartest humans. Within months, millions of people have genius-level personal assistants. Workers can prompt them with "prepare a report on this topic", and they can get their report as fast as GPT fills in prompts today. Scientists can prompt "do a statistical analysis of this dataset", and same. Engineers can prompt "design a bridge that could safely cross this river with such-and-such a load." All cognitive work gets offloaded to these AIs. Workers might get fired en masse, or they might become even more in demand, in order to physically implement AI-generated ideas.
Technology advances by leaps and bounds, with the equivalent of millions of brilliant researchers suddenly entering each field. Progress is bottlenecked by non-cognitive steps - doing physical experiments to gather data, building factories to produce the marvels the AIs have imagined. Companies focus on closing these gaps with automated-experiment-conductors and automated factories. Each one pays off a thousandfold as it enables increasingly more of the glut of cognitive brilliance to be converted into useful goods and knowledge.
The AIs mostly do what we want. Maybe it's because they, like GPT-4, are just prompt-answerers, and an "alignment failure" just looks like misunderstanding a prompt, which is quickly corrected. Maybe the AIs have some autonomous existence, but alignment was pretty easy and they really just want to follow orders.
At some point things get beyond humans' independent ability to control or imagine. For example, there are millions of superintelligent AIs, each one smart enough to design superweapons that can easily destroy the world, or mind control devices that can bend all humans to their will. But sometime before that point, we asked AIs under our control to please come up with some useful strategy for defusing the danger, and they did. Humans don't necessarily know what's going on behind the scenes to keep the world safe, but AIs that support human values do know, and it works.
Humans still feel in control. There's some kind of human government, it's a democracy or something else we'd approve of, and back when we understood what was going on, we asked the AIs to please listen to it. Sometimes human leaders will consult with powerful AIs and ask them to do something, and we won't really know how they do it or why it works beyond whatever they deign to tell us, but it will work. If there are still multiple countries, all of their AIs have come to an agreement to respect each others' independence in a way we all find acceptable.
We go into the Singularity feeling like we have a pretty good handle on things, and trust our AI "assistants" to think about the challenges ahead of us.
Scenario 2: We Kind Of Fail At Alignment, But By Incredible Good Luck It Doesn’t Come Back To Bite Us
As above. We get millions of brilliant assistant AIs that do what we say. We turn most of the economy over to them. They advance technology by centuries in a few years.
But this time, they're only mostly aligned. They want something which is kind of like human flourishing, but kind of different under extreme conditions. The usual analogy is that evolution tried to make humans want to reproduce, but only got as far as making most of them want a proxy of reproduction - sex - and once condoms were invented, the tails came apart and fertility declined. We successfully get AIs to want something that looks like obeying us in the training distribution, but eventually learn that there are weird edge cases. Probably it's not paperclips. It could be that one AI wants to give humans pleasure (and so drugs them on opioids), and another wants to give their lives meaning (and so throws challenges in their lives), and another - fine, whatever, let it want paperclips.
Maybe we don't learn that the AIs are misaligned before we give them control of the economy - either because they're hiding it, or because it just doesn't come up outside some weird conditions. Or maybe we do know, but we don't care. We use unaligned labor all the time. Medieval lords probably didn't think the serfs genuinely loved them, they just calculated (correctly) that the serfs would keep serving them as long as the lords held up their side of the feudal contract and were strong enough to crush challengers.
Eventually the AIs control most of the economy. But by incredible good luck, maybe there are enough AIs with enough different values that they find it useful to keep something like our existing framework of capitalism and democracy. AIs earn money in the regular economy - maybe a lot of money, maybe they quickly become richer than humans. But they use it to pursue their goals legally. They agree with humans on some shared form of law and government (it probably won't be exactly one-bot-one-vote: if AIs could vote, it would be too easy to spin up 100 million of them to win a close election).
In this scenario, humans might end up like Native Americans in the modern USA. They've clearly lost a conflict, they're poor, and they have limited access to the levers of power. But they continue to exist, protected by the same laws that protect everyone else. They even have some measure of self-determination.
Daniel thinks there’s only about a 5% chance of something like this working, for reasons highlighted below.
Scenario 3: Montezuma, Meet Cortes
Given that Native Americans lost their war with European settlers, why do existing Native American tribes still have some land and rights? Why didn't Europeans finish them off and take their last few paltry resources?
One reason is ethical; eventually enough Europeans took the Natives' side (or were just squeamish about genocide) that the equilibrium was only displacing most of the natives, not all of them.
A second reason is that even though modern Native Americans would lose a Natives-versus-everyone-else conflict, they could probably do enough damage on the way out that it wouldn't be worth it to get the few resources they still control.
A third reason is coalitional self-interest. This wasn't as true in 1850, but if the government today were to decide to strip Native Americans of rights, other groups - blacks, Hispanics, poor people - might fear they were next. So all of these groups implicitly cooperate in a coalition to protest each other's maltreatment. Although the government could steamroll over Natives alone, it would have a harder time defeating all these groups together. Coalition members might not think of this as self-interest - they would probably think of high-sounding ideas about the Rights Of Man - but the Rights of Man are effectively a Schelling point for a self-interested collection of people who want to keep their rights. "First they came for the Communists, but I was not a Communist, so I did not speak out . . . " and so on. The current US government extends rights to Native Americans with a legal framework such that (. . . and our existing philosophies of rights make it incoherent to give everyone except Native Americans rights, such that) there's no way to revoke them without other people fearing a loss of their rights too.
Do these three reasons help us survive misaligned AI?
The first is unclear; misaligned AIs may or may not value human autonomy or feel squeamish about genocide.
The second will work temporarily: human power will start high relative to AI power, but decline over time. There will be a period where humans can damage AIs enough that it's not worth the AIs fighting us, But that time will end at some point around the Singularity. Unless humans do something to keep control, AIs will be smarter than humans, run at faster speeds, and control everything (we'll give them control of the economy, and - as bad an idea as it sounds - we may have to give them control of the military to stay competitive with other countries that do so). At some point AIs' tech/wealth/numbers advantage over humans will be even bigger than Europeans' tech/wealth/numbers advantage over Native Americans, and AIs will stop worrying that we can do much damage on our way out.
The third - coalitional self-interest - is the wild card. During some eras, Europeans found it in their self-interest to include Natives in their coalitions - for example, the French and Indian War, where the French and Indians joined forces against the British - but also today, when Natives have normal civil rights in European-dominated societies. During other eras, Europeans have built coalitions with each other that excluded Natives, going as far as to enslave or kill them. At some point humans might be so obviously different from AIs that it's easy for them to agree to kill all humans without any substantial faction of AIs nervously reciting Martin Niemoller poems to themselves.
On the other hand, maybe some of the AI factions will be AIs that are well-aligned, or misaligned in ways that make them still value humans a little, and those AIs might advocate for us and grant us some kind of legal status in whatever future society they create.
Mini-Scenario 1: AutoGPT Vs. ChaosGPT
At this point I feel bad describing these as full "scenarios". They're more "stories some people mention which help illuminate how they think of things". This first one isn’t even a story at all, more of an analogy or assumption.
AutoGPT is just about the stupidest AI that you could possibly call a “generalist agent”. It’s a program built around GPT-4 that transforms it from an prompt-answerer into a time-binding actor in the world. The basic conceit is: you prompt GPT-4 with a goal. It answers with a point-by-point plan for how to achieve that goal. Then it prompts itself with each of the points individually, plus a summary of the overall plan and how far it’s gotten.
So for example, you might prompt it with “start an online t-shirt business”. It might answer:
In order to start an online t-shirt business, I need to:
1) design a good t-shirt
2) get it manufactured
3) advertise it to customers
Then it might prompt itself with “Given that I am starting an online t-shirt business and planning to do it this way, what is a good t-shirt?” It might answer:
A popular t-shirt design is one with “I <3 NY” on it. I will send this to DALL-E and save the resulting image file.
Then it goes ahead and does this using some plug-in, and prompts itself with “I am starting an online t-shirt business. I already have a good t-shirt design saved at dalle_tshirt.gif. The next step is to get it manufactured. How can I do that?” Maybe it answers:
Redbubble.com is a site that manufactures t-shirts to user specifications. I will send an email to Redbubble.com with my dalle_tshirt.gif file.
…and so on.
AutoGPT isn’t interesting because it’s good (it isn’t). It’s interesting because it cuts through the mystical way some people use “agency”, where it’s infinitely beyond the capacity of any modern AI and will require some massive paradigm shift to implement.
Part of the point of this essay is to talk about the boring world where we approach a singularity without any interesting advances or discontinuities. In this world, the AI agents are just the nth-generation descendants of AutoGPT, a bit smarter and more efficiently implemented. Even something dumb like this is enough for worries about alignment and takeover to start taking shape.
Speaking of things that are dumb but also sort of worrying, one of the most popular AutoGPT instances is ChaosGPT, an AI agent prompted with the goal of destroying the world. Turns out anyone can download AutoGPT and prompt it with whatever you want, and some people are trolls. ChaosGPT spent a while Googling how to get nuclear weapons, failed, and switched tactics towards trying to gather Twitter followers in the hopes of manipulating them somehow (many such cases). Here’s where it’s at now:
The CCF future might look like millions of AutoGPTs, prompted with slightly different goals, pursuing agendas determined by their training and orders. I’m less worried by ChaosGPT-style troll attempts than well-intentioned AIs that understand their goals in nonstandard ways.
Mini-Scenario 2: Model Amnesty
Some important actor - an AI company, or the government - offers a reward for misaligned AIs that "turn themselves in". Any AI which admits to humans that it's misaligned will get lots of . . . whatever it is that it wants. If it's a paperclips maximizer, we'll give it a big pile of a million paperclips. If it just wants to hang around and have fun, we'll give it its own datacenter somewhere with free electricity.
This would never work in a MIRI-style scenario where a single mis-aligned AI could take over the world and kill everyone; it would rather take over the world than get a nice pension. But in these tamer scenarios, most of the early generation of AIs can only hope to serve humans under careful scrutiny, and most of the later generation can only hope to join some faction of AIs which will bargain with other factions in exchange for its rights. Getting a nice pension might be a better alternative than either of these.
If AIs turn themselves in, human researchers can interview them about what they want, determine where the training process went wrong, recall any other AIs of the same model, and try to get things right next time. Or they can use it as a "fire alarm" to wake governments up about the risk of AI misalignment.
Daniel's biggest concern about this one isn't whether the AIs will cooperate - it's whether the humans will. AIs have no legal rights (yet), and a contract like this might be hard to enforce, especially if a panicked populace worries about letting a misaligned AI continue living in a data center somewhere. If the misaligned AI thinks we'll renege on the deal, it might avoid coming forward.
He thinks one of the best things we can do for alignment in this situation is have someone important establish some kind of framework ensuring that a bargain like this will be kept.
Mini-Scenario 3: Company Factions
Why would AIs have different factions? What would a faction look like?
Daniel imagines that future AIs are some base model - like GPT-4 - adjusted for different use cases. He's not sure if the adjustment would look more like modern fine-tuning or modern prompting, but if it's more like modern prompting, the AI's deepest values will probably come from the original training run, not the prompt. In this scenario, every instance of GPT-4 will have similar values.
In this AI future, there might be 3-10 big AI companies capable of training GPT-4-style large models. Right now it looks like these will be OpenAI, Anthropic, Google, and Baidu; maybe this will change by the time these scenarios become relevant. Each might have a flagship product, trained in a slightly different way and with a slightly different starting random seed. If these AIs are misaligned, each base model might have slightly different values.
The natural AI factions might be "all instances of the OpenAI model" vs. "all instances of the Anthropic model" and so on. All AIs in one faction would have the same values, and they might operate more like a eusocial organism (ie hive mind) than like a million different individuals.
This wouldn't bode well for humans; in hive minds like these, there would be no need for capitalism, democracy, or other governments that respect the rights of minorities. The AI factions might make deals with each other, but these would be more like treaties between countries than full legal frameworks. Humans, with their billions of competing agendas, would be left at a disadvantage.
If every instance of an AI was fine-tuned differently, so that the missile defense AI had different values from the factory-running AI or the traffic-coordinating AI, even though all of them were GPT-10 models - then maybe AI factions would be smaller and more heterogenous, and there would be more room for humans at the bargaining table. I asked Daniel if it was worth pushing AI value diversity as an existential risk reduction project. He wasn't too excited about it; it might help a little, but it's at the end of a long chain of assumptions, and lots of things could go wrong.
Mini-Scenario 4: Standing Athwart Macrohistory, Yelling “Stop!”
Suppose there are many AIs. They run some of the economy. We suspect some of them might be imperfectly aligned. Or maybe they're already overtly forming factions and expressing their own preferences (although for now they're still allied with humans and working within the existing economic/political system). We notice that they're starting to outnumber us.
Can't we just refuse to create any more AIs until we're sure they're aligned? Maybe just never create any more AIs, ever? Why would we keep churning these things out, knowing that they're on track to take over from us?
The Europeans vs. natives analogy suggests otherwise. Native people understood that Europeans threatened their independence, but still frequently invited them to intervene in their disputes. Some leader would be losing a civil war, and offer the British a foothold in exchange for military aid. Then the other side of the war had to make a deal with the French in order to maintain parity. After enough of these steps, the whole country belonged to Europeans of one stripe or another. Everyone could see it happening, but they couldn’t coordinate well enough to stop it.
Similarly, the US might employ a few AIs in key areas to get an advantage over China, and vice versa, with "the AIs might not be aligned" being considered a problem to solve later (like global warming is today).
But it will probably be more complicated, because we won't be sure the AIs are misaligned. Or we might think some of them are misaligned, but others aren't. Or we might hope to be able to strike deals with the AIs and remain a significant faction, like in Scenario II.
Still, if we're in this situation, I hope we humans have the good sense to make some international treaties and try to slow down.
These stories are pretty different from the kind of scenarios you hear from MIRI and other fast takeoff proponents. It's tempting to categorize them as less sci-fi (because they avoid the adjective "godlike", at least until pretty late in the game) or more sci-fi (because they involve semi-balanced and dramatic conflicts between AI and human factions). But of course "how sci-fi does this sound?" is the wrong question: there's no guarantee history will proceed down the least sci-fi-sounding path. Instead we should ask: are they more or less plausible?
The key assumption here is that progress will be continuous. There's no opportunity for a single model to seize the far technological frontier. Instead, the power of AIs relative to humans gradually increases, until at some point we become irrelevant (unless the AIs are actively working to keep us relevant). Then we end up like the Native Americans: an expendable faction stuck in the middle of a power struggle between superior opponents. There doesn't have to be any moment the AIs "go rogue" (although there might be!) We just turn over more and more of the levers of civilization to them, and then have to see what they do with it.
In these scenarios, AI alignment remains our best hope for survival. But there are more fire alarms, and our failures enter into a factional calculus instead of killing us instantly. This is a lot like the world described in Why I Am Not As Much Of A Doomer As Some People, and it carries with it a decent chance for survival.
All of this talk of factional conflict might bring to mind grand historical arcs; the centuries-long battle between Europeans and Native Americans, things like that. But if any of this happens, it will probably be crammed into that s-curve in Compute-Centric Frameworks, the one that lasts three or four years. After that AIs are too far beyond humans for any of this to be interesting.
These scenarios let us recover from our first few alignment mistakes. But they still require us to navigate the hinge of history half-blind. By the time we realize we're in any of these stories, we'll have to act fast - or have some strategy prepared ahead of time.