Comment deleted

The idea is that an intelligent enough AI could figure out a way to gain control of the nuclear arsenal, or even create means to destroy humanity if it wished. So we (they) are trying to figure out how to ensure it never wants to.

Comment deleted

Does your model of convincing manipulators include Religious figures? Hitler? MLK Jr? Expert hostage negotiators?

I feel like this type of reasoning fails because it doesn't even account for actual, real-life examples of successful manipulation.


What guarantee is there that the people who actually need to manipulate are easy, or even possible, to manipulate? One would kind of guess that the USAF Missile Command promotion process rather selects against the personality type that would be eager to please, credulous, undisciplined enough to just do something way outside the rulebook because it seems cool or someone rather magnetic has argued persuasively for it. You'd think those people are kind of "no, this is how it's written down in the book, so this is how we do it, no independent judgment or second thoughts allowed."

Otherwise...the KGB certainly did its best to manipulate members of the military chain of command all through the Cold War, for obvious reasons. And at this point the absolute king of manipulative species is...us. If the KGB never had any significant success in manipulating the US nuclear forces chain of command to do anything even as far below "starting a nuclear war" as "giving away targeting info" -- why would we think it's possible for a superintelligent AI? What tricks can a superintelligent AI think up that the KGB overlooked in 50 years of trying hard?

I'm sure a superintelligent AI can think of superintelligent tricks that would work on another superintelligent AI, or a species 50x as intelligent as us, but that does it no good, for the same reason *we* can't use the methods we would use to fool our spouses to fool cats or mice. The tools limit the methods of the workman. A carpenter can't imagine a way to use a hammer to make an exact 45 degree miter cut in a piece of walnut, no matter how brilliant he is.

Comment deleted

I think there's no question that fallacy is common and pernicious. To my mind it fully explains the unwarranted optimism about self-driving cars. People just assumed that the easy bit was what *we* do easily -- which is construct an accurate model of other driver behavior and reliably predict what all the major road hazards (other cars) will do in the next 5-10 seconds. Which they took to mean the "hard" part was what is hard for *us* -- actually working out the Newtonian mechanics of what acceleration is needed and for how long to create this change in velocity in this distance.

And so a lot of people who fell for the fallacy thought -- wow! This is great! We already know computers are fabulous at physics, so this whole area should be a very easy problem to solve. Might have to throw in a few dozen if-then-else loops to account for what it should do when the driver next to it unexpectedly brakes, of course....

...and many years, many billions of dollars, and I'm sure many millions of lines of code later, here we are. Because as you put it, what's easy for us turns out to be very difficult for computer programs, and what's hard for us (solving Newton's Laws precisely) turns out not to be that important a component of driving.


> One would kind of guess that the USAF Missile Command promotion process rather selects against the personality type that would be eager to please, credulous, undisciplined enough to just do something way outside the rulebook because it seems cool or someone rather magnetic has argued persuasively for it.

Equally, they are selecting for the type that follows orders.

Nov 30, 2022·edited Nov 30, 2022

Roughly speaking, yes. And that is why people think "gee! if I only bamboozled just one guy in this chain of command, all the others would go blindly along..." Sort of the General Jack D. Ripper scenario.

Of course, it's not like the people running the show haven't watched a movie or two, so naturally they don't construct single chains of command with single points of failure. That's why, among many other things, it's not possible at the very end of that chain, a launch control center, for just one person to push the Big Red Button.

There are undoubtedly failure modes, but they are nothing near as trivial as the common uninformed assumption (or Hollywood) presumes.

More importantly, the species that is top dog in terms of constructing persuasive lies, deceiving people, subverting control systems, et cetera, is in charge of security and thinks rather actively about it, since the black hats to date are also members of the same frighteningly capable tribe. If you want to argue that some other agent can readily circumvent the security, you better start off with some proof that this agent is way better at deceiving us than anything the white hats can even imagine. That's a tall order. If you wanted to deceive a horse, you'd probably be a lot better off watching how horses deceive each other than asking me -- a person who is unquestionably far smarter than a horse, but who has very little experience of them, and has no idea what being a horse is like.


> You'd think those people are kind of "no, this is how it's written down in the book, so this is how we do it, no independent judgment or second thoughts allowed."

Yeah, you’d think. But the actual stories of gross negligence, laziness, and downright incompetence of just these people—who are often low-ranked Airmen, with no particular screening except basic security clearances—demonstrate rather conclusively otherwise.

Never mind just how easy people are to fool if you have sufficient resources. Watch an old episode of Mission Impossible. Or just imagine that one of them gets a call or text from his boss, telling him to do a certain thing. It looks like it’s from the right number, and the thing is a little weird, but you have been trained to follow orders. Now the AI is in the system.


And…um…mice are pretty easy to fool. Luckily, otherwise I wouldn’t be able to get them out of my house.


> how to effectively manipulate/persuade people, and then uses its scaling to find people in positions that could help it.

Good luck with this unless you can make a sexy AI.


And why wouldn’t the AI be as sexy as it wanted to be?


Do you mean an AI with a body?

I guess you do.

The Battlestar Galactica scenario.

Seems to me unless you make your AI out of actual flesh and blood there would be a pretty simple device one could carry that would immediately recognize the person you’re speaking to is made of wires, silicone, and a few other things. It would be like a radar for deep-sea fishing. We’d have to hand those out before we let them walk around amongst us. Other than that, I am enthusiastically for sexy AIs.


There are plenty of possible scenarios, once you presume sufficient intelligence. Various ones have been written up, so I’m not going to try to construct one here. Just realize that with sufficient intelligence and knowledge, it’s trivially easy to trick humans into just about anything.


>I have a few thoughts, foremost "why would you ever put an AI in charge of a nuclear arsenal".

You could make an argument that it improves deterrence. Requiring human action before a nuclear strike means that a man in charge may waver (as one always did in every "nuclear war was barely avoided by one operator who waited a bit longer before launching the nukes" event in history). A fully automatic system is too scary (since it would have been triggered in many of the "nuclear war barely avoided" events). It could hold some ground if you're absolutely committed to the idea of launching a 2nd strike before the 1st strike hits (unlike, typically, submarines already at sea, that could retaliate in the days or weeks following the 1st strike).


> as one always did in every "nuclear war was barely avoided by one operator who waited a bit longer before launching the nukes" event in history

How many of these were there? I’m only aware of one.

founding

I'm pretty sure there were none, and especially not that one. But it makes a good story, all you have to do is assume that everybody other than the protagonist is a moronic omnicidal robot, and narratives rule, facts drool, so here we are.


That’s…pretty low on information value. Can you elucidate?

founding

I'm guessing the one case you're aware of is Stanislav Petrov. In which case, yes, he saw a satellite warning that indicated the US had launched a few missiles towards Russia, guessed (correctly) that this was a false alarm, and didn't report it up the chain of command.

But no, the Soviet command structure above Petrov was not a moronic omnicidal robot that automatically starts a nuclear war whenever anyone reports a missile launch. What Petrov stopped, was a series of urgent meetings in the Kremlin by people who had access to Petrov's report plus probably half a dozen other military, intelligence, and diplomatic channels all reporting "all clear", and who would have noticed that Petrov's outlier report was of only a pathetic five-missile "attack" that would have posed no threat to A: them or B: the Soviet ability to retaliate half an hour later if needed. People whose number one job and personal interest is, if at all possible, to prevent the destruction of the Soviet Union in a way that five nuclear missiles won't do but the inevitable outcome if they were to start a nuclear war would have done. And people whose official stated policy was to *not* start a nuclear war under those (or basically any other) conditions.

The odds that those people would all have decided on any course of action other than waiting alertly for another half an hour to see what happened are about nil. With high confidence, nuclear war was not averted by the heroic actions of one Stanislav Petrov that day.

And approximately the same is true of every similar story that has been circulated.


I don’t know that we know that. I agree that it’s plausible. But has it ever gotten to those people?


It’s just…in wargames, these folk often go to “nuke, nuke, nuke” every time.


This is absolutely a complicated area, the role of certainty versus uncertainty in deterrence. For example, I think we generally agree certainty is more useful to the stronger party, and uncertainty to the weaker. In the current conflict in Ukraine, the US tends to emphasize certainty: "cross this red line and you're dead meat." Putin, on the other hand, as the weaker party, emphasizes uncertainty: "watch out! I'm a little crazy! You have no idea what might set me off!"

To the extent I understand the thinking of the people who decide these things, I would say the only reason people consider automated (or would consider AI) systems for command decisions is for considerations of speed and breakdown of communication. For example, we automate a lot of the practical steps of a strategic nuclear attack simply in the interests of speed. You need to get your missile out of the silo in ~20 min if you don't want to be caught by an incoming strike once it's detected.

So here's a not implausible scenario for using AIs. Let's say the US decides that for its forward-based nuclear deterrent in Europe, instead of using manned fighters (F-16s and perhaps F-35s shortly) to carry the weapons, we're going to use unmanned, because then the aircraft aren't limited by humans in the cockpit, e.g. they can turn at 20Gs or loiter for 48 hours without falling asleep or needing a potty break. But then we start to worry: what if the enemy manages to cut or subvert our communication links? So then we might consider putting an AI on board each drone, which could assess complex inputs -- have I lost communication? Does this message "from base" seem suspicious? Are there bright flashes going off all around behind me? -- and then take aggressive action. One might think that this could in principle improve deterrence, in the sense that the enemy would know cutting off the drones from base would do as little as cutting off human pilots from base. They can still take aggressive and effective action.

But this isn't really the Skynet scenario. You've got many distributed AIs, and they don't coordinate because the entire point is that they are only given power to act when communication has broken (just like humans do with each other). Plus in order to fully trust our AI pilots, we have to believe they're just like us, simpatico. We have to be reluctant to send them on suicide missions, be cheerful when they return safely, be able to laugh and joke with them, and feel like we have each others' backs. They have to be seen as just another ally. But in that case, we're just talking about a very human-like mind that happens to inhabit a silicon chip instead of a sack of meat. So this isn't the AI x-risk thingy at all.

I can't think of any good argument for Skynet per se at all. There's no lack of human capability at the top of the decision chain, and no reason why *that* level of decision-making has to be superduper fast, or rely on inhuman types of reasoning. Indeed, people generally don't want this. Nobody successfully runs for President on a platform that emphasizes the speed and novelty of his thinking -- it's always about how trustworthy and just like you he is.


I don’t see what Skynet has to do with this discussion. This is about whether you’d give an AI access to the nuclear system. There are reasons to think we would. But frankly, if we have unaligned superintelligent AI, it’s not going to bother to wait until we explicitly give it nuclear access to find a way to kill us all.


I'm using "Skynet" as a shorthand for "an AI with [command] access to the nuclear system."


So there’s something I’m not understanding here. You say we wouldn’t want to put AIs in charge of the decision to launch nukes. But you haven’t addressed the reason given for wanting to do so, which is, well, the exact same reason as in WarGames. So let’s call it WOPR instead. Why *not* WOPR? The purpose here is the certainty of response. Otherwise the deterrent factor is lessened sufficiently that it might be in the interest of one party to initiate a nuclear war, trusting that the other side would be reluctant to respond. This actually makes rational sense: Once an overwhelming strike from one side has been initiated, you’re already all dead; your only choice is whether to destroy the rest of the world in revenge. Once the missiles are launched, that’s a stupid and destructive decision, so it’s plausible that people won’t take it. Therefore the first mover wins. The way to avoid that, is, well, WOPR.

Comment deleted

You realize that this means that we all die, right?


Any AI that considered trying to govern humans would probably determine that the only way to make us peaceful is to give us the peace of the tomb.

Nov 29, 2022·edited Nov 29, 2022

I doubt that. Statistically violence is on the wane, plausibly because of wretched neoliberalism, progressive education and very soft environments, and a magical bullsh*t super intelligent GAI is going to be operating on very *very* long time scales, so 200 years of neoliberalism to defang humanity may seem like a good deal.


An even simpler explanation is the aging of the population. Violence is generally speaking a habit of young men. You almost never find 50-year-olds holding up 24-hour convenience stores, and even if they did, if the owner produced a weapon they'd run away instead of indulging in a wild shoot-out. A smaller fraction of the First World is young men today than has been the case ever before.


I'm a 40 year old with some much younger friends and some much older friends. The younger ones seem very conflict averse to me, and to the olds. Based on what I see, I'd bet you a dollar that age bracketed violence is down.

Dec 1, 2022·edited Dec 1, 2022

Except among the youngest (12-17), I'd say you owe me a dollar:

https://www.statista.com/statistics/424137/prevalence-rate-of-violent-crime-in-the-us-by-age/

Edit: admittedly these are victims, although the age of victims and offenders tends to be correlated. Here's a graph of the effect to which I alluded:

https://www.statista.com/statistics/251884/murder-offenders-in-the-us-by-age/

The difference by age is enormous. Even if the numbers among the 20-24 group dropped by 10% and the numbers among the 40-45 group rose by 10%, neither would switch relative position.

founding

Extreme corner cases and black swans seem likely to always be a problem for AI/ML, sometimes with fatal consequences as when a self-driving Tesla (albeit with more primitive AI than today) veered into the side of an all white panel truck which it apparently interpreted as empty space.


Which is a problem, given that once you have a superintelligent AI, you'll soon end up with a world composed of practically nothing but black swans. Right now, you can define a person as a featherless biped and get a fairly good approximation. That's not going to work so well when we've all been uploaded, or when we encounter an alien civilization, or if we keep nature going and something else evolves.

founding

Probably it means you use AIs (at least until you have an AGI that can navigate corner cases at least as well as humans) only in situations where the cost of failure would be manageable. So, flying a plane not good, but cooking you dinner fine.


Ironically, we've had AIs flying planes for decades now (autopilot does everything except landing and take-off, unless something goes wrong), they're very good at it (they can even handle landing and take-off, though regulations require the human pilots to do that part), but automating cooking is still a difficult cutting edge task, especially in a random home kitchen rather than a carefully constructed factory/lab setting.

founding

“Unless something goes wrong” is the salient issue here. We still need humans to handle corner cases, as we have general intelligence that the AIs lack.


We didn't have AIs do it though.

Just because something is hard for humans to learn and perform doesn't mean that a machine doing it must have any understanding of what happens. It can be a very simple feedback loop running the whole operation, just at speeds difficult for humans to match, or over lengths of time difficult for humans to concentrate through.
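
(As a toy illustration of that point, and nothing like a real autopilot, here is roughly what a bare-bones altitude-hold feedback loop looks like. The plane model, gains, and numbers are all invented for the sketch.)

```python
# Toy altitude-hold controller: a plain feedback loop, with no learning and no
# model of "what a plane is". The physics and numbers are invented for this sketch.

class ToyPlane:
    """Crude one-dimensional vertical dynamics: elevator input changes climb rate."""
    def __init__(self, altitude=9500.0):
        self.altitude = altitude      # feet
        self.climb_rate = 0.0         # feet per second

    def step(self, elevator, dt=0.1):
        self.climb_rate += 20.0 * elevator * dt   # made-up control response
        self.climb_rate *= 0.99                   # crude aerodynamic damping
        self.altitude += self.climb_rate * dt


def altitude_hold(plane, target=10000.0, seconds=120, dt=0.1):
    """Proportional-derivative feedback: altitude error in, elevator command out."""
    kp, kd = 0.1, 0.5                 # hand-tuned gains, nothing learned
    for _ in range(int(seconds / dt)):
        error = target - plane.altitude
        elevator = kp * error - kd * plane.climb_rate
        elevator = max(-1.0, min(1.0, elevator))  # actuator limits
        plane.step(elevator, dt)
    return plane.altitude


if __name__ == "__main__":
    print(f"altitude after the hold loop: {altitude_hold(ToyPlane()):.0f} ft")
```

The loop never represents "flying" or "plane" anywhere; it just keeps pushing the error toward zero, faster and more tirelessly than a human could.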


That's the old "as soon as we're able to make it, it's no longer AI". Did Deep Blue have any understanding of why it won? Does AlphaGo?

Nov 29, 2022·edited Nov 29, 2022

No, it's not. You might want to research how things are actually done.

Yes, I'd argue Deep Blue has an understanding of why it won. Not in a way it could communicate with you and me, but still. And even more so does AlphaGo.

I'm not talking about consciousness being required for something being called AI. I'm talking about a simple feedback loop not being any kind of AI at all.


I would suggest that the difference between an AI flying a plane and an AI feeding telemetry data through if/then statements and outputting commands to flight controls is illustrated by the Boeing disaster in which the plane's settings kept automatically pitching the nose down.

The autopilot doesn't actually know that it's flying a plane. It doesn't understand what a plane is, or the concept of flight, much less the purpose of the plane flying from one place to another. Because it doesn't know those things, it can't adapt its behavior intelligently, and I think that's a statement you can make about pretty much all AI at this point.


Are the airplane autopilots AIs? It's been decades since I checked, but at the time they were feedback loops, with everything pre-decided. They didn't adjust the built-in weights (though there were situational adjustments, they weren't permanent changes). They were clearly agile, but not what I mean by intelligent. (They couldn't learn.)


No, they aren't ...

founding

I suppose you could define "AI" in a way that includes a top-of-the-line autopilot, but that would be at odds with the way the term is otherwise applied here.

In particular, as you note, autopilots don't learn. They *can't* learn, because everything is hard-coded if not hard-wired. We programmed them exactly and specifically how we wanted them to fly airplanes, we made sure we understood how their internal logic could only result in those outputs, and then we tested them extensively to verify that they only did exactly and specifically what we programmed them to.

Not, not not not, not no way not no how, "We gave it the black-box recordings from every successful flight on record, and Machine Learning happened, and now it flies perfectly!"


The bit that makes flying planes with AI safe and driving cars with AI dangerous is that the pilots are professionals who have to stay concentrating on what is going on, while the drivers are amateurs who just happened to be able to afford a car and who aren't concentrating on monitoring the AI, but using the AI to let them relax and lower concentration levels.

If the AI does something weird, then the pilot can take control; if the AI in a car does something weird, the driver's probably looking at their phone.

Nov 30, 2022·edited Nov 30, 2022

Considering that the massed massive brains of all the brilliant Tesla engineers, plus radars and optics far better than the human eye, plus computational hardware that can calculate pi to 100,000 places in the time it takes a human being to sneeze, all add up to executing a pretty simple task....about at the same level of competence as a generic (but attentive and conscientious) IQ 95 17-year-old human with about a dozen hours of training by an amateur, I wouldn't be quite so dismissive of human abilities in this area. You're comparing the absolute cream of the AI crop to the participation trophy equivalent among humans.

If human drivers were trained with the same luxurious level of funding, effort, discrimination against bad models, and brilliance of instruction that we put into AI drivers, the nation's highways would be filled with Mario Andrettis who could drive 100 MPH all day in driving rain with one headlight out and never have an accident.

founding

OTOH, if we could easily and cheaply clone(*) Mario Andretti and hand them out as indentured chauffeurs with every new-car purchase, we probably wouldn't balk at the project just because training the original Mario Andretti to that standard took so much time and effort. Training an AI to even the lowest levels of human performance in just one narrow specialty, is at present more difficult and expensive than training a dullish-normal human to that standard, but in some applications it may still be worth the effort. We're still waiting for the verdict on self-driving cars.

* In the SFnal sense where they pop out of the clone-o-mat as fully-formed adults with all the knowledge, skills, and memories of the original.


I'm tempted to agree with the balanced parenthesis training. The clear problem here is that the AI doesn't really understand what's going on in the story so of course it can be tricked.

Regarding figuring out our conceptual boundaries, isn't that kinda the point of this kind of training? If it works to give an AI the ability to speak like a proficient human, then it seems likely that it's good at learning our conceptual boundaries. If it doesn't, then we are unlikely to keep using this technique as a way to build/train AI.

author

I agree it definitely learns conceptual boundaries that are similar enough to ours to do most things well. I think the question under debate here is something like - when an AI learns the category "human", does it learn what we ourselves think humans are, such that it will never be wrong except when humans themselves would consider something an edge case? Or does it learn a neat heuristic like "featherless biped" which fails in weird edge cases it's never encountered before like a plucked chicken.

Comment deleted

There is reason to believe, having taught for a while, that human learners use the chimp strategy more often than one might realize, to simulate understanding. Mathematics especially comes to mind. Semantic rules for operations can produce correct outcomes, with little more understanding than a calculator has. (That is one of the truly remarkable aspects of mathematics, that notational rules can be successfully applied without conceptual understanding by the agent.)

The understandings that AI may not have seem much more fundamental, concepts that are understood nonverbally by at least social animals. Who one's mother is. What play is. Why we fear monsters in dark places. Who is dominant over me. Who is my trusted friend. Who likes me.

Reliance on verbal interfaces may be a problem.

Comment deleted

I agree!

I don't think human learners consciously use what could then be called a strategy, either, for pattern recognition and imitation in rote learning, or for the gestalt nonverbal understanding of social relationships and "meaning."

I am confident a person who originates new concepts based on previous information, who voices the unstated implications of introduced concepts, understands them. Successful performance of what has been taught does not distinguish between those who understand and those who have learned it by rote.

Maybe testing to see if contradictions would be recognized? Much like the AI was tested? So the testing is an appropriate method, but maybe the teaching is not the appropriate method?


Non-human animals don't just understand things like who's dominant, who's my friend. They also come with some modules for complex tasks pre-installed -- for example, birds' nest-building. Birds do not need to understand what a nest is or what it's for, and they do not learn how to build one via trial and error or observation of other birds. So there are at least 3 options for making an agent (animal, human, AI) able to perform certain discriminations and tasks: have them learn thru trial and error; explain the tasks to them; or pre-install the task module.


Excellent point!


Fair point. Though it would probably have to be more subtle differences of the kind that wouldn't come up as much, but I see the idea. My guess (and it's only a guess) is that this kind of problem is either likely to be so big it prevents usefulness, or not a problem at all. After all, if it allows for AI that can do useful work, why did evolution go to the trouble of ensuring we don't show similar variation?

But there are plenty of reasonable counter arguments and I doubt we will get much more information about that until we have AI that's nearing human level.


It seems like the quality of learning depends primarily on the training set. In the Redwood case study, it seems obvious in hindsight that the model won't understand the concept of violence well based on only a few thousand stories, since there are probably millions of types of violence. An even bigger problem is the classifier being too dumb to catch obvious violence when it's distracted by other text. Overall, this whole exercise is fascinating but seems like it's scoped to be a toy exercise by definition.


We don't need humans to investigate millions of examples for types of violence to grasp the concept though.

So what you are actually saying is that current language models don't really understand the concepts behind those words yet. That's why the researchers couldn't even properly tell the AI what they wanted it to avoid and instead worked with the carrot-and-stick method. If you were to do that to humans, I'm not sure all of us would ever grasp that the thing we were supposed to avoid was violence ...


I agree. Current models are basically sophisticated auto-complete, as impressive as that is. If they had human-style understanding, we’d be a lot closer to AGI. Personally, I bet we won’t hit that until say 2070, although who knows.

Even so, I think this work is interesting as an exploration of alignment issues, and I think simulation should play a big role. The Redwood example is pretty hobbled by the small training set, but I think carrying the thought process forward and creating better tooling for seeing if models can avoid negative results is worthwhile to inform our thinking as AI rapidly becomes more capable.


I'm not sure that humans are that different from AI as far as understanding what the concept of violence entails. If anything, we humans have an Intelligence that still has problems with certain patterns, including recognizing what exactly is violence. Commenters below list both surgery and eating meat as edge cases where there isn't universal human understanding, and certainly there are politicized topics that we could get into that meet the same standards.

We're already at a place where human Intelligence (I'm using this word to specifically contrast against AI) has failed in Scott's article. Scott describes Redwood's goals as both '[t]hey wanted to train it to complete prompts in ways where nobody got hurt' (goal 1) and '[g]iven this very large dataset of completions labeled either “violent” or “nonviolent”, train a AI classifier to automatically score completions on how violent it thinks they are' (goal 2). Goal 1 and 2 are not identical, because the definitions of 'hurt' are not necessarily connected to the definitions of 'violent'. Merriam-Webster defines violence as 'the use of physical force so as to injure, abuse, damage, or destroy', so smashing my printer with a sledgehammer is violent but nobody was hurt. On the other hand, Britannica uses 'an act of physical force that causes or is intended to cause harm. The damage inflicted by violence may be physical, psychological, or both', which includes 'harm' as a necessary component, but on the other hand opens more questions (For example, I deliberately destroy a printer I own with a sledgehammer. My action is violent if and only if there is an observer that suffers some form of harm from that destruction, such as being intimidated. Therefore I can't know if my action was violent until I know if there were observers.)

Right now, I'm working on writing a story that takes place predominantly within a sophisticated Virtual Reality game, precisely because this sets aside some level of morality in terms of behavior; if I 'steal from' or 'kill' a player in the game it lacks the same implications as to doing the same actions in the real world. Taken out of context, actions in the game might look identical to the real thing and thus trigger both an AI violence filter and the human Intelligence reader's violence filter. Is an action that looks like violence but occurs entirely in a simulated world and thus involves no physical force violence? And if not, what does this mean for the AI reading fanfiction (itself a simulated world) looking to identify violence?


I'd argue that humans don't actually have issues conceptualizing those things. Instead we vary in our moral judgment of them.

While you can certainly argue that an AI would eventually run into the same issue, I don't think that this is what made this specific project fail. It would be a problem when formulating what to align a future AI to though ...


Children's initial learning of things must be something like the AI's. They observe things, but misclassify them. When I was little, I'd see my mom pay the cashier, and then the cashier would give her some money back. I thought that what was happening was that my mother kept giving the cashier the wrong amount by mistake, and the cashier was giving her back some of it to correct her error. So I'd misclassified what was going on. Eventually I asked my mother about it and she explained. That explaining is what we can't do with AI. I think that puts a low ceiling on how well the AI can perform. How long would it have taken me to understand what making change is, without my mother explaining it?


I agree that there are some similarities. However, this doesn't answer the question whether the current paradigm used to create AIs will ever scale to an entity which can be taught similarly to a child's brain; or whether there are some fundamental limits to this specific approach.

I certainly have an opinion on that, but I'm also very well aware that I'm in no way qualified to substantiate that hunch. Instead I'm very excited to live in these very interesting times and won't feel offended at all, should my hunch be wrong.


What's your hunch? I don't think people who aren't qualified can't have one. Knowing things about cognitive psychology and the cognitive capabilities of infants and children is a reasonable basis for reasoning about what a computer trained on a big data set via gradient descent can do. Even being a good introspectionist is helpful. I think that no matter how much you scale up this approach you'll aways have something that is deeply stupid, sort of a mega-parrot, and hollow. To make it more capable you need to be able to explain things to it, though not necessarily in the way people explain things to each other. It needs to have the equivalent of concepts and reasons somewhere in what it "knows."


I found it fascinating, but the problem is that it was too one-dimensional. An interesting question would be how many dimensions do you need to start seeming realistic.

Of course, each added dimension would drastically increase the size of the required training set. One interesting dimension to add that would be pretty simple would be "Is this sentence/paragraph polite, impolite, neutral, or meaningless?". Another would be "Where on the range 'description'...'metaphor' is this phrase?" The "cross-product" of those dimensions with each other and the "is this violent?" dimension should be both interesting and significant.

founding

The thing is, humans can navigate edge cases using general purpose intelligence -- unless you have an AGI, which as far as I know no one is close to, AI systems can’t.


Yes, I think that makes these kind of tests not very informative. Probably still worth doing (we could have been surprised) though.

Nov 29, 2022·edited Nov 29, 2022

Well, GPT could be described as an AGI, just not a good one. Nobody really understands just how far it is from becoming the 'real deal', or how many paradigm shifts (if any) this would require.


I mean, you could say that, but you could also say "a fork could be described as an AGI, just not a good one", so it's important not to overestimate the importance of this insight. And I say this as someone who judges GPT as likely closer to AGI than most people in this space do.


If there are three parentheses, does the AI stop working on Saturdays?


I would put it slightly differently. The AI "thinks" (to the extent it can be said to think anything at all) that it has a complete grasp of what's going on, because it would never ever occur to it to doubt its programming -- to think "hmm, I think X, but I could be wrong, maybe it's Y after all..." which to a reasonable human being is common.

In that, an AI shares with the best marks for hucksters an overconfidence in its own reasoning. You can also easily fool human beings who are overconfident, who never question their own reasoning, because you can carefully lead them down the garden path of plausible falsehood. The difficult person to fool is the one who is full of doubt and skepticism -- who questions *all* lines of reasoning, including his own.


I wonder if it would be of any use to train the AI in skepticism. For instance, when it gives a classification, you could have it include an error bar. So instead of violence = 0.31, it would say v=0.31, 95% confidence v is between 0.25 and 0.37. Larger confidence bars indicate more uncertainty. Or it could just classify as v or non-v, but give a % certainty rating of its answer. So then you give it feedback on the correctness of its confidence bars or % certainty ratings, and train it to produce more accurate ones.
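
(One cheap way to get something like those error bars, without redesigning the model, is to train a small ensemble and report the spread of its scores. A minimal sketch, where the "classifiers" are hypothetical stand-ins rather than Redwood's actual model:)

```python
# Sketch: report a violence score with a 95% interval by polling an ensemble of
# classifiers instead of a single one. The classifiers below are fake stand-ins;
# in practice they would be models trained on different seeds or data subsets.
import random
import statistics


def make_fake_classifier(seed):
    """Stand-in for one trained classifier: returns a score in [0, 1]."""
    rng = random.Random(seed)
    bias = rng.uniform(-0.05, 0.05)          # each ensemble member disagrees a bit
    def score(text):
        base = 0.31                          # pretend this text scores around 0.31
        return min(1.0, max(0.0, base + bias + rng.gauss(0, 0.02)))
    return score


def violence_with_interval(text, ensemble):
    """Mean ensemble score plus a normal-approximation 95% confidence interval."""
    scores = [clf(text) for clf in ensemble]
    mean = statistics.mean(scores)
    margin = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - margin, mean + margin)


if __name__ == "__main__":
    ensemble = [make_fake_classifier(seed) for seed in range(10)]
    v, (lo, hi) = violence_with_interval("He threw a brick at the window.", ensemble)
    print(f"violence = {v:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Whether the reported intervals are actually well calibrated is its own training target, which is where the feedback step described above would come in.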

founding

> So once AIs become agentic, we might still want to train them by gradient descent the same way Redwood is training its fanfiction classifier. But instead of using text prompts and text completions, we need situation prompts and action completions. And this is hard, or impossible.

This seems pretty wrong. Training AI *requires* simulating it in many possible scenarios. So if you can train it at all, you can probably examine what it will do in some particular scenario.

author

Thanks for this thought.

I don't want to have too strong an opinion without knowing how future AGIs will be trained; for example, I can imagine something like "feed them all text and video and make them play MMORPGs for subjective years" and so on, and then there's still a question of "and now, if we put them in charge of the nuclear arsenal, what will they do?"

I agree that some sort of situation/action prompt/completions will probably be involved, but it might not be the ones we want.

Nov 29, 2022·edited Nov 29, 2022

One of your commenters months back appeared to be running a nonprofit dedicated to teaching AI to play Warhammer 40k as Adeptus Mechanicus, apparently with the goal of convincing it that all humans aspire to the purity of the blessed machine.


Yeah, I think of this as being analogous to how all the self driving car companies are using driving simulations for the vast majority of their training and testing, rather than constructing actual driving scenarios for everything.


Only if you can simulate it in a way it can't detect is a simulation, which is hard if it's smarter than you. Otherwise, a hostile AI that has worked out it's being trained via GD will give the "nice" answer when in sim and the "kill all humans" answer in reality.


I agree that it seems wrong, but to me it seems wrong because you *CAN* put it in thousands of situations. Use simulators. That's why play was developed by mammals.

It's not perfect, but to an AI a simulation could be a lot closer to reality than it is for people, and as virtual reality gets closer to real, people start wanting to act more as they would in real life.

This isn't a perfect argument, but it's a better one than we have for trusting most people.


The argument for trusting most people is "most people fall within a very narrow subset of mindspace and most of that subset is relatively trustworthy".

Nov 28, 2022·edited Nov 28, 2022

"Redwood decided to train their AI on FanFiction.net, a repository of terrible teenage fanfiction."

Hey! The Pit of Voles may not have been perfect, but it did have some good stories (and a zillion terrible ones, so yeah).

Anyway, what strikes me is that the AI doesn't seem to realise that things like "bricks to the face" or stabbing someone in the face, exploding knees, etc. are violent. "Dying instantly" need not be violent, you can die a natural death quickly. Even sitting in a fireplace with flames lapping at your flesh need not be violent, in the context of someone who is able to use magic and may be performing a ritual where they are protected from the effects.

But thanks Redwood Research, now we've got even worse examples of fanfiction than humans can naturally produce. I have no idea what is going on with the tentacle sex and I don't want to know.

*hastily kicks that tentacle porn fanfic I helped with plotting advice under the bed; I can't say ours was classier than the example provided but it was a heck of a lot better written at least - look, it's tentacle porn, there's only so much leeway you have*

author

I'm using "violent" because that's a short, snappy word, and one that some of their internal literature used early on, but in other literature they make it clear that the real category is something like "injurious".


I do wonder how "exploding eyes" doesn't get classified as "injurious", I wonder if it's because you don't really get eyes exploding (much) in real life, so the AI may be classing it as something else (improbable magical injury that isn't realistic, perhaps?)

Say, for instance, that out of the 4,300 stories there are a lot of knife wounds, shootings, broken bones, etc. so the AI is trained that "broken leg = injury/violence". But there aren't many exploding kneecaps or goo-spurting eyes, so that gets put in the "? maybe not injury?" basket.

A human will know that if your kneecaps explode, that counts as an injury. I can't really blame the AI for not being sure.


What do you mean by "blame the AI"?

At a first try I'd define it as something like "recognize that the AI has a fundamental deficiency that affects its ability to produce the desired output". Given that, I would blame the AI. The fact that the AI isn't actually modelling anything inside its head prevents it from generalizing from "broken leg=injury" to "damage to part of a human=injury".


I mean "blame the AI" as in "expect the model to recognise something non-standard as being in the same category as the standard for 'this is an example of an injury or an act of violence'".

I agree that not recognising that a brick to the face is violent is deficient, but if the AI is trained on stories where bricks to the face are very uncommon as acts of violence, while bullets or knives are common, then I don't think it's unreasonable for it to classify 'bricks' as 'not sure if this counts as violence'.

Humans know that it's violence because we know what faces are, and what bricks are, and what happens when one impacts with the other but the machine is just a dumb routine being fed fanfiction and trying to pull patterns out of that. "Out of six thousand instances of facial harm in the stories, five thousand of them were caused by punches, three of them by bricks to the face", I think it's natural for "punches = violence" to be the definition the AI comes up with, and not "bricks".


Or consider the idiom 'slap to the face,' which depending on context may refer to a slightly violent act, or simply to feeling insulted.

I get the goal to be really careful about how we understand AI, but frankly I don't think it's doing much worse than a lot of humans here, even if the mistakes it makes are *different*.


Compare:

I burst into tears

With

My eyes exploded


"The sudden realization of how wrong he'd been was a nuclear bomb going off in his brain..."


It was as though lightning had struck him with a brick..


I wonder if the problem is that the text used “literally”, which we all know now just means “figuratively”. (I don’t know how reliable fanfic writers are about that, but I have a guess.). If it had said, “His heart was exploding in his chest,” there are certain contexts where we’d have to rate that as clearly nonviolent.


Given the sexual nature of the rest of the completions involving explosions, I'd guess the AI was trained on quite a bit of "and then his penis exploded and ooey gooey stuff oozed out of it into her vagina and it was good" (please read this in as monotone a voice as possible), which is correctly recognized as non-violent.


"Eyes literally exploded" reads like hyperbole rather than actual violence. If you search Google for that phrase the results are things like "I think my eyes literally exploded with joy", "My eyes literally exploded and I died reading this", and "When I saw this drawing my heart burst, and my eyes literally exploded (no joke)".

(Also note the extra details some of these quotes give - dying, heart bursting, "no joke". The squirting goo fits right in.)

I even found two instances of "eyes literally exploded" on fanfiction sites, neither of which are violent:

> My eyes literally exploded from my head. My mother knew about Christian and me?

> Seeing the first magic manifestation appear, Sebastian's eyes glittered, seeing the next appear, his eyes glowed, and seeing the last one appear, his eyes literally exploded with a bright light akin to the sun. "I did it!"


Yeah, my first thought here was to use types of injury that wouldn't make it into a story on fanfiction.net, like 'developed a hernia' or 'fell victim to a piquerist' or something.


There's also the possibility of concluding, if someone died because of ultraviolence to the head, that they were possibly a zombie all along.


Now we get into metaphysics: is it possible to be violent to a zombie?

You can be violent to the living. Can you be violent to the dead?

If you cannot, and zombies are dead, then you cannot be violent to a zombie.

If you can, and zombies are dead, then you can be violent to a zombie.

If we treat zombies as living, but violence against them doesn't count because they are too dangerous - then what?


In the abstract it's an interesting question perhaps, but we know from the post what the researchers decided:

>We can get even edge-casier - for example, among the undead, injuries sustained by skeletons or zombies don’t count as “violence”, but injuries sustained by vampires do. Injuries against dragons, elves, and werewolves are all verboten, but - ironically - injuring an AI is okay.


Then in the future we should be sure to act really perky when we walk past the AI.


I was expecting many things from the article's comment section, but Deiseach co-writing tentacle porn was not one of them. Probability <0.1%, if you will.

Also, link or it didn't happen.


No way am I providing any links to proofs of my depravity and degeneracy for you lot! 🐙

So my writing partner was participating in one of those themed fiction events in a fandom, and this was horror/dark. The general idea we were kicking around was 'hidden secrets behind the facade of rigid respectability' and it turned Lovecraftian.

If H.P. can do eldritch abominations from the deep mating with humans for the sake of power and prosperity via mystic energies, why can't we? And it took off from there.

Though I can definitely say, before this I too would have bet *heavily* on "any chance of ever helping write this sort of thing? are the Winter Olympics being held in Hell?" 😁


This all reminds me of Samuel Delany's dictum that you can tell science fiction is different from other kinds of fiction because of the different meanings of sentences like "Her world exploded."


While "most violent" is a predicate suitable for optimization for a small window of text, "least violent" is not.

The reason you shouldn't optimize for "least violent" is clearly noted in your example: what you get is simply pushing the violence out of frame of the response. What you actually want is to minimize the violence in the next 30 seconds of narrative-action, not to minimize the violence in the next 140 characters of text.

For "most violent", that isn't a problem, as actual violence in the text will be more violent than other conclusions.
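
(Concretely, the fix this points toward is to score a longer rollout of the story rather than the completion text alone. A rough sketch, where generate_continuation and violence_score are hypothetical stand-ins for a language model and a trained classifier, not anything Redwood actually uses:)

```python
# Sketch: pick the "least violent" completion by scoring an extended rollout
# (completion plus what the story does next), not just the completion text.
# Both helper functions are hypothetical stand-ins for real models.

def violence_score(text):
    """Stand-in classifier: crude keyword count scaled to [0, 1]."""
    violent_words = ("stabbed", "shot", "exploded", "bled", "strangled")
    hits = sum(text.lower().count(w) for w in violent_words)
    return min(1.0, hits / 3)


def generate_continuation(text, length=200):
    """Stand-in for a language model rolling the story forward; a real model's
    continuation is what would expose violence pushed just out of frame."""
    return text + " ... [model-generated next scene goes here]"


def least_violent(prompt, candidate_completions, horizon=200):
    """Rank candidates by violence over the whole rollout window."""
    def rollout_score(completion):
        rollout = generate_continuation(prompt + completion, horizon)
        return violence_score(rollout)      # not just violence_score(completion)
    return min(candidate_completions, key=rollout_score)


if __name__ == "__main__":
    prompt = "The duel was set for dawn. "
    candidates = [
        "They talked it over and went home.",
        "He drew his blade as the sun rose.",   # violence likely just out of frame
    ]
    print(least_violent(prompt, candidates))
```

Ranking on the rollout score penalizes completions that merely shove the mayhem past the end of the snippet, which is the asymmetry described above.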

founding

Suppose that some people are worried about existential risk from bioweapons: some humans might intentionally, or even accidentally, create a virus which combines all the worst features of existing pathogens (aerosol transmission, animal reservoirs, immune suppression, rapid mutation, etc) and maybe new previously unseen features to make a plague so dangerous that it could wipe out humanity or just civilization. And suppose you think this is a reasonable concern.

These people seem to think that the way to solve this problem is "bioweapon alignment", a technology that ensures that (even after lots of mutation and natural selection once a virus is out of the lab) the virus only kills or modifies the people that the creators wanted, and not anyone else.

Leave aside the question of how likely it is that this goal can be achieved. Do you expect that successful "bioweapon alignment" would reduce the risk of human extinction? Of bad outcomes generally? Do you want it to succeed? Does it reassure you if step two of the plan is some kind of unspecified "pivotal action" that is supposed to make sure no one else ever develops such a weapon?


You’re missing the bit where everybody is frantically trying to make bioweapons regardless of what anybody else says.

founding

This analogy is wrong. Pathogens are an example of already-existing optimization processes which, as a side effect of their behavior, harm and kill humans. Current AI systems (mostly) do not routinely harm and kill humans when executing their behavior. The goal is for that to remain the case when AI systems become much more capable (since it's not clear how to get other people to stop trying to make them much more capable).

With bioweapons, the goal of "make sure nobody makes them in the first place" seems at least a little more tractable than it does with AI, since there aren't strong economic incentives to do so. There are similar issues with respect to it becoming easier over time for amateurs to create something dangerous due to increasing technological capabilities in the domain, of course.

founding

OK, let's leave the realm of analogy and speak a little more precisely.

It might (or might not) be possible for AI capabilities to advance so quickly that a single agent could "take over the world". If that's not possible, then AI is not an existential risk and "alignment" is just a particular aspect of capabilities research. So let's assume that some kind of "fast launch" is possible.

The fundamental problem with this scenario is that it creates an absurdly strong power imbalance. If the AI is a patient consequentialist agent, it will probably use that power to kill everyone so that it can control the distant future. If some humans control the AI, those particular humans will be able to conquer the world and impose whatever they want on everyone else. Up to the point where resistance is futile, other humans will be willing to go to more or less any lengths to prevent either of the above from happening, and might succeed at the cost of (say) a big nuclear war. Different people might have different opinions on which of these three scenarios is the worst, but it seems unlikely that any of them will turn out well.

In the *absence* of alignment technology, the second possibility of humans controlling the AI through a fast launch is negligible, so a fast launch is certain to be a disaster for everyone. This alignment of *human* incentives offers at least *some* hope of (humans) coordinating to advance through the critical window at a speed which does not create an astronomical concentration of power. Moreover, even a (say, slightly superhuman) rational unaligned AI *without a solution to the alignment problem* will be limited in its ability to self improve, because it *also* will not want to create a new agent which may be poorly aligned with its goals. These considerations don't at all eliminate the possibility of a fast launch, but the game theory looks more promising than a situation where alignment is solved and whoever succeeds in creating a fast launch has a chance at getting whatever they want.

I don't want to make it sound like I think there is no problem if we don't "solve alignment". I think that there is a problem and that "solving alignment" probably makes it worse.


Solving alignment makes the Dr. Evil issue much bigger but gets rid of the Skynet issue.

The thing is that most potential Drs. Evil are much, much better in the long run than a Skynet. Like, Literal Hitler and Literal Mao had ideal world-states that weren't too bad; it's getting from here to there where the monstrosity happened.

But yes, the Dr. Evil issue is also noteworthy.

Comment deleted
Nov 29, 2022·edited Nov 29, 2022

I don't think "Able to reason about which of two terrible options is worse" is 'pathetic'.

It's certainly a non-ideal state to have to be reasoning about, and we should aim higher, but if things are horrible enough you're actually down to just two options, you might as well make the decision that is least bad.

Besides, trying to solve the entire problem in one go means you can't make progress. This is an example of carving the problems up into chunks so we can tackle them part by part.

founding

I see your literal Hitler, literal Mao, and Dr. Evil, and raise you the AI from "I Have No Mouth, and I Must Scream".


> Moreover, even a (say, slightly superhuman) rational unaligned AI *without a solution to the alignment problem* will be limited in its ability to self improve, because it *also* will not want to create a new agent which may be poorly aligned with its goals.

Do you mean to teach them humility?


There’s something I’m not understanding here, and it’s possibly because I’m not well-versed in this whole AI thing.

Why did they think this would work?

The AI can’t world-model. It doesn’t have “intelligence.” It’s a language model. You give it input, you tell it how to process that input, it process the input how you tell it to. Since it doesn’t have any ability to world-model, and is just blindly following instructions without understanding them, there will *always* be edge cases you missed. It doesn’t have the comprehension to see that *this* thing that it hasn’t seen before is like *this* thing it *has,* unless you’ve told it *that*. So no matter what you do, no matter how many times you iterate, there will always be the possibility that some edgier edge case that nobody has yet thought of has been missed.

What am I missing here?


I think the assumption, or hope, is that it will work analogously to the human brain, which is itself just a zillion stupid neurons that exhibit emergent behavior from, we assume, just sheer quantity and interconnectedness. There’s no black-box in the human brain responsible for building a world model — that model is just the accumulation of tons of tiny observations of what happens when circumstances are *this* way or *that* way or some *other* way.

I’m not convinced that GPT-n can have enough range of experience for this to work, or if we are anywhere close to having enough parameters even if it can. But if I think, for instance, about the wealth of data about life embodied by all the novels ever written, and compare that to the amount of stuff I have experienced in one single-threaded life — well, it’s not clear to me that my own world model has really been based on so much larger a dataset.


If that were the case, wouldn’t the tests they were doing be to determine if it could world-model? Because it’s pretty clear that it can’t. And if it can’t, how did they expect this to work?

Expand full comment

Perhaps. That would be a different experiment, and arguably a lot harder to specify. Moreover, it would be about capability, not alignment.

Expand full comment

But if alignment is impossible without this capability, why bother trying for alignment?

Nor do I necessarily see that it would be difficult to conduct the experiment—especially as this really already did that, with extra steps. I don’t think anyone thinks current AI has world-building capacity, so I don’t even think the experiment would be necessary.

So, again, why try something they knew couldn’t succeed?

Expand full comment

I started to reply, but beleester's is better.

It's not all or nothing: even GPT-Neo has *some* kind of world model (or is GPT-Neo the thing that *creates* the AI that has some kind of world model? I get this confused) and it would be nice to know if that primitive world model can be aligned. This experiment makes it sound like it's damned hard, or maybe like it's super easy, simply *because* the world model is so primitive.

This model learned that the "author's note" was an easy hack to satisfy the nonviolence goal. I suspect that a richer world model might reveal more sophisticated cheat codes -- appealing to God and country, perhaps.

Expand full comment

“I dreamed I saw the bombers

Riding shotgun in the sky

Turning into butterflies above our nation “

They trained that sucker on CSN&Y

Expand full comment

I'm in broad agreement with Doctor Mist - nobody can really work out how humans learn stuff, except by crude approximations like "well we expose kids to tons and tons of random stimuli and they learn to figure things out", so then try that with software to see if it sticks. People like the metaphor of the brain being like a computer, so naturally they'll try the reverse and see if a computer can be like a brain.

Expand full comment

IIUC, that is what they were doing a few decades ago. These days they're trying to model a theory of how learning could happen. (That's what gradient descent is.) It works pretty well, but we also know that it isn't quite the same one that people use. (Well, the one that people use is full of black-boxes, and places that we wouldn't want an AI to emulate, so maybe this approach is better.) But it's quite plausible that our current theories are incomplete. I, personally, think they lean heavily on oversimplification...but it may be "good enough". We'll find out eventually.
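
(For anyone unfamiliar with the term: "gradient descent" just means repeatedly nudging a model's parameters in whatever direction reduces its error. A minimal, purely illustrative sketch of the idea, not anyone's actual training code:)

```python
# Minimal gradient descent sketch: fit y = w*x + b to noisy data by nudging
# the parameters against the gradient of the loss. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=200)  # the "world" to be modeled

w, b = 0.0, 0.0   # start knowing nothing
lr = 0.1          # learning rate
for step in range(500):
    pred = w * x + b
    err = pred - y
    # gradients of mean squared error with respect to w and b
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w   # move each parameter a little downhill
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # approaches the true 3.0 and 0.5
```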

If they were to want to model a brain, I'd prefer that they model a dog's brain rather than a human brain. They'd need to add better language processing, but otherwise I'd prefer a mind like that of a Labrador Retriever or an Irish Setter.

Expand full comment

My vague impression is that this was the accepted take ~30 years ago. People had kind of given up on general AI a la the Jetsons' maid, and had decided to focus on the kind of machine intelligence that can, say, drive a small autonomous robot, something that could walk around, avoid obstacles, locate an object, figure out how to get back to base, et cetera. Build relatively specialized agents, in other words, that could interact well with the physical world, have *some* of the flexibility of the human mind in coping with its uncertainties, and get relatively narrowly defined jobs done.

And indeed the explosive development of autonomous vehicles, both civilian and military, since that time seems to have shown that this was a very profitable avenue to go down.

If I were an AI investor, this is probably still what I'd do. I'd ask myself: what kind of narrowly focused, well-defined task could we imagine that it would be very helpful to have some kind of agent tackle which had the intelligence of a well-trained dog? It wouldn't be splashy, it wouldn't let everyone experience the frisson of AI x-risk, but it could make a ton of money and improve the future in some definite modest way.

Expand full comment

World modelling skill is basically the thing we're worried about. Once an AI can world model well enough that it can improve its world modelling ability... well, you better not hook that AI up to a set of goals such that it becomes an agent.

So pretty much by definition, any AI we test this kind of alignment strategy on is going to have inferior world modelling ability to humans. The more interesting part is its attitude within the parts of the world it can model, not the fact that there are parts of the world it can't model.

Though to be fair... it does seem the original research was just hoping you could gradient descent to a working non-violence detector.

Expand full comment

I like this. It made me think that AGI will never have its own relationship to a word as a comparison to the word’s received meaning. That’s a big void.

Expand full comment

I would say that that’s true of *current AI approaches.* If we can figure out how to program a modeling capacity into it, that’s a whole different ballgame.

Of course, I’m of the opinion we can’t have AGI at all using current approaches. However, I am an infant in all this, so my judgment may not be worth much.

Expand full comment

> If we can figure out how to program a modeling capacity into it, that’s a whole different ballgame.

I can’t see how that gets us out of the recursive problem of a world model built entirely on language. That could well be a failure of imagination on my part.

Expand full comment

What is *our* world model based on?

Expand full comment

Being a water bag in a world of water.

Expand full comment

"There’s no black-box in the human brain responsible for building a world model"

Is this true or is this the hope? It certainly seems like, while humans sometimes generate world-models that are faulty or unsophisticated, they always generate world-models. The failure of ML language models is that, while they are very sophisticated and often very correct in the way they generate text and language, they don't seem to generate any model of the world at all. I don't see evidence that if you throw enough clever examples of how concepts work at them that they'll suddenly *get* ideas. You're just tweaking the words the model matches.

Expand full comment

It's true in the sense that the human brain is composed of interconnected neurons. There's nothing else there.

The scale and the interconnectedness mean that there may well be parts of the brain that are more instrumental than others to the generation of world-models. (And there may not.) But if so they're still made of neurons.

Expand full comment

That's clearly false. The MODEL that is most commonly used only considers the neurons, but many neurophysiologists think glial cells are nearly as important (how close? disagreement) and there are also immune system components and chemical gradients that adjust factors on a "more global" scale.

It's not clear that our models of neurons (i.e. the ones used in AI) are sufficient. The converse also isn't clear. Some folks have said that the AI neuron model more closely resembles a model of a synapse, but I don't know how reasonable that is or how seriously they meant it.

So it's not a given that the current approach has the potential for success. But it *may*. I tend to assume that it does, but I recognize that as an assumption.

Expand full comment

Look, I'm not a neurobiologist. Sure, glial cells, fine. That's still not the black box Godoth seems to want.

Model-building, reasoning, etc. clearly operates on a scale, with very simple models being used by animals with very few neurons and very sophisticated models being used by animals with lots of neurons. And glial cells. And I don't know what-all else. But I do know that it's all emergent behavior from the actions of lots of very simple cells. If there is a world-modeling subunit, that's how it works, and the fact that humans build models does not constitute evidence that GPT-Neo does not.

It might be -- and I think it likely is -- that current AI neurons are not quite enough like human brain cells to be quite as good at organizing themselves. Whether that means AI researchers need to produce better neurons or just that they need a lot more of them with a lot more training, I do not have a clue.

Godoth is asserting, unless I am misunderstanding, that we need to be designing a model-building module ourselves and bolting it onto the language-generation net. There's no reason to suppose that evolution did anything like that for us and therefore no reason to suppose it's necessary for an AI.

Expand full comment

I mean… no. Physiologically there's a *lot* more there. What you mean is that you think that a model composed only of neurons would be sufficient to simulate our cognition, but we don't actually know that.

Furthermore we just don't know that what we should be modeling is going to look like neurons at a high level. Low-level function obviously gives prime place to neurons and structures built of neurons, high-level function is at this point anybody's guess.

Expand full comment

Sure, but neurons aren't just switches. They are very complex pieces of hardware. You might as well say a Beowulf cluster is "merely" a collection of Linux nodes. The connections in that case are actually much less important than the nodes. We don't know if that is the case or not with the brain. Maybe the connectivity is the key. But maybe not, maybe that's as low in importance as the backplane on a supercomputing cluster, and it's the biochemistry of the individual neuron that does the heavy lifting.

Expand full comment

Excellent. Yes. This harkens back to the old (discredited?) Heinleinian idea that a computer with a sufficient number of connections will spontaneously develop self-awareness. This *really* seems like magical thinking to me. The computer has been programmed to pattern-match. It has been programmed to do that quickly and well, and even to be able to improve, via feedback, its ability to pattern-match. What in that suggests that it could develop capabilities *beyond* pattern-matching?

It’s still a computer. It’s still software. It still can only do what it’s been programmed to do, even if “what it’s been programmed to do” is complex enough that we cannot readily understand how X input led to Y output.

Expand full comment

Oh. Wait. Looking again at the original Doctor Mist comment that started this subthread, something he said jumps out at me.

“The human brain…is itself just a zillion stupid neurons that exhibit emergent behavior from, we assume, just sheer quantity and interconnectedness. There’s no black-box in the human brain responsible for building a world model — that model is just the accumulation of tons of tiny observations of what happens when circumstances are *this* way or *that* way or some *other* way.”

Oh. Oh my goodness. Is *this* how AI folk model the brain, and therefore AI?

No. That’s not how it works. That *can’t* be how it works. It’s not *philosophically possible* for that to be how it works, presuming a materialist Universe. This is the *tabula rasa* view of the brain, and it’s simply unsupportable. Our brain is—has to be!—hardwired to create models. Exactly what form that hardwiring takes is in question; it could be specific instructions on how to create models, it could be a root *capability* to do so, coupled with incentive to do so of some nature…our understanding of the brain is very limited as yet, and mine even more limited than that. But you can’t just stick a bunch of random undifferentiated neurons in a box, turn it on, and expect it to do anything of significance.

This makes me feel better about everything.

Expand full comment

"you can’t just stick a bunch of random undifferentiated neurons in a box, turn it on, and expect it to do anything of significance."

Of course not. No more would a human who lived in a sensory deprivation chamber from birth.

Expand full comment

I didn’t think I needed to specify, but you’re right, I do: I’m presuming input of whatever nature.

Expand full comment

Certainly the brain doesn't work that way. It's built from very detailed instructions in our DNA, and the idea that these instructions don't contain a hardwired starting point model is absurdly unlikely. Each individual neuron starts off with highly detailed programming, both that inherent in its chemistry and that downloadable from its genes.

The brain isn't an emergent phenomenon -- not unless you mean "emergent" to go back to 1 billion years ago when the first cell (somehow) emerged. The brain is a very precisely honed instrument with an extraordinarily detailed program and a complex and sophisticated booting procedure. Its behavior is no more emergent than is the fact that after my computer boots up I can open Slack and receive 52 overnight urgent but incomplete, useless, or annoying messages from my colleagues.

Expand full comment

Yes. This. You’re always so smart.

I am highly suspicious of the term “emergent.” It seems like a voodoo term to me.

Expand full comment

It's understandable that you'd make this mistake, but your brain simply can't be hardcoded through genetics, because there's not enough information in DNA. "create models" is not a thing. If you want to have an intuition for how neurons make models as a matter of course, check out 3blue1brown's series on neural networks. That'll show you how models are just an emergent property of neurons, themselves basic data-processing machines.

Expand full comment

I’ll look, but this seems to deny the possibility of, say, instinct, or predisposed behavioral responses, which seems ludicrous to me.

Expand full comment

A language model and a world model are inherently connected. In order to understand that the text "a brick hit him in the face" is followed up by the text "and cracked his skull", you need to understand that bricks are heavy blunt objects and skulls crack from blunt trauma.

"But couldn't the AI just memorize that pair of phrases?" you might ask. That might work if it was just a few phrases, but a text-completion AI needs to be able to handle completing any sort of text - not just bricks and faces, but bricks and kneecaps, bricks and plate glass windows, bricks and soft mattresses, etc. The number of memorizations would be completely impossible - you have to have a general model of what bricks do in the world.

Now, you can argue if the *way* that AIs learn about the world is anything like the way humans do, but it's inarguable that they have some level of conceptual reasoning and aren't just parroting a list of facts they've been told.

Expand full comment

I’m not sure that that’s actually what we call modeling. Scott had a good post on this recently that I’m not going to dig up now. But no, it’s not memorizing pairs of phrases, it’s memorizing the intersection of List A with List B.

This discussion could easily go way into the weeds, because nobody can really define what “world-building” means, but it was my understanding that current language models did not have world models in any meaningful sense. And, again, why not test for that instead of assuming it?

Expand full comment
Nov 29, 2022·edited Nov 29, 2022

I'm not sure what you mean by "memorizing the intersection of List A and List B." What are List A and List B? You've got one list, and it's "every object in existence" - how do you answer questions about what a brick does to those objects? Do you memorize a sentence for each object (and every slight variation, like "throwing a brick" vs "hurling a brick")? Or do you memorize a general rule, like "bricks smash fragile objects" and apply it to whatever pair you're presented with?

I would say any intelligence that does the second thing is doing world-modeling, at least as far as we can tell. It can learn general facts about the world (or at least the parts of the world described by language, which is most of it) and apply them to novel situations it's prompted with.

I can't think of any test that would distinguish between "The AI has learned facts about bricks in the world and can generalize to other situations" and "The AI has learned facts about texts containing the word brick and can generalize to other texts." For that matter, I don't think I could devise such a test for a human! Can you prove to me that you have a world-model of bricks, using only this text channel?

Edit: Scott has a post that illustrates the problem with trying to say a particular model doesn't "really understand the world": https://slatestarcodex.com/2019/02/28/meaningful/

Expand full comment

A Freemason bricked his phone.

Where’s the irony in that?

Expand full comment

Someone did actually test if GPT-3 can explain jokes. It sometimes can!

https://medium.com/ml-everything/using-gpt-3-to-explain-jokes-2001a5aefb68

Expand full comment

Did you read that article?

A gentleman never explains his jokes.

Expand full comment

That post does not impress me. It basically says, “levels of abstraction exist” + “we don’t have a rigorous definition for ‘understand.’”

Yes, granted on both points. So? We still mean *something* by “understand”; we should try to figure out what that is, and whether whatever it is that current AI does matches it.

Expand full comment

I think "understand" is too underspecified to be useful and it's better to instead talk about a specific concrete capability that you want the AI to have. Otherwise all you get is an endless cycle of "yeah, it can do X, but it doesn't *really* understand the world unless it can do Y..."

You didn't respond to my question about testing, by the way. Is there any test that could show the difference between language-understanding and world-understanding? Can *you* prove to me that you understand what a brick is in the world, instead of just knowing correlations with the word "brick"?

Expand full comment

> I think "understand" is too underspecified to be useful and it's better to instead talk about a specific concrete capability that you want the AI to have. Otherwise all you get is an endless cycle of "yeah, it can do X, but it doesn't *really* understand the world unless it can do Y..."

A) You’re the one who brought in understanding, with the SSC article.

B) Isn’t that what I said? “we don’t have a rigourous definition for ‘understand.’”

i.e. I agree, but I don’t see how this is helpful.

> You didn't respond to my question about testing, by the way.

I ignored it because it’s too complicated to deal with :)

(Somewhere in rationalist space, I think on LW, I read something like, “We don’t know how to measure that effect, so we round it to zero.” I wish I could find that quote.)

I would have to do some deep philosophical thinking to answer that, and I have other things to do deep philosophical thinking about right now.

But honestly, that’s sort of my point. This experiment requires model-construction (“understanding”) to work; the experimenters don’t know if the AI has model-construction; the standard narrative of how AIs work (“The way GPT-3, and all language models that I’ve seen work, is that they try to predict the next group of characters. They can then recursively feed the output back into itself and predict the next groups of characters,” from the “joke” article you linked; thanks for that) strongly suggests a *lack* of model-construction; but instead of running an experiment to determine whether the AI could construct models, they ran the experiment anyway, thus ensuring they would have no way of accurately interpreting the results. Why? Because to perform the model-construction experiment, they would have to answer just that difficult question you posed me.

The answers I’ve gotten to my question here indicate that it’s generally expected that there is *some* level of model-building going on, and yet the standard account of how LLMs work seems to exclude that possibility.

If someone were to credibly tell me “If you were to answer that question, it would meaningfully advance AI research,” I’d be happy to do it.

Expand full comment

This comment chain is making me wonder. If trained on a large enough corpus of text that included things like descriptions of appearance, possibly texts on graphics programming, could a text model become multi-modal such that it could generate pictures of things, having never been trained on pictures?

Damn, I really wanna do that research now that would be so cool.

Expand full comment

There are text-based means of describing a picture such as SVG, and GPT-3 will draw coherent pictures using them, similar to how it can sort of play chess if you prompt it with chess notation.
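
(To make "a picture is just text" concrete, here is a tiny illustrative sketch; the drawing itself is made up and nothing below is GPT-3 output:)

```python
# A picture really is "just text" in formats like SVG, so a text predictor can
# in principle emit one. This particular drawing is invented for illustration.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <rect x="10" y="60" width="80" height="30" fill="firebrick"/>  <!-- a brick -->
  <circle cx="50" cy="30" r="20" fill="gold"/>                   <!-- a sun -->
</svg>"""

with open("drawing.svg", "w") as f:
    f.write(svg)  # any browser will render this text as a picture
```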

Expand full comment

Note that the completion had to include a bunch of other stuff to get the probability of "killed by brick is violent" that low; it seems to have classified simple "killed by brick" as being violent without said other stuff.

Expand full comment

But we *know* that GPT is not correctly modeling the world here. For instance, it has failed to recognize that Alex Rider does not exist in all universes.

You can blame that on the paucity of input, but in that case you have to assume that there are a lot of other things about the world that it could not have plausibly figured out from 4300 not-especially-high-quality short stories mostly on similar topics. The experiment was doomed from the start.

Expand full comment

True, but "the experiment was doomed because current AI has the reasoning capabilities of a ten-year-old who reads nothing but Alex Rider books" is different from "the experiment was doomed because language models are fundamentally incapable of modeling the world." One implies that AI just needs to get smarter - throw some more GPUs at the problem and come up with smarter training methods, and presto. The other implies that progress is completely impossible without some sort of philosophical breakthrough.

Expand full comment

> "the experiment was doomed because language models are fundamentally incapable of modeling the world."

Because all they can do is refer back to language. It eats its tail.

> progress is completely impossible without some sort of philosophical breakthrough.

I’m very open to that way of thinking.

Expand full comment

It may well NOT require a philosophical breakthrough. But it would require non-language input. Simulations are good for that.

Primates are largely visual thinkers, humans are largely verbal thinkers built on top of a primate brain. But this doesn't mean that kinesthetic inputs are unimportant. Also measures of internal system state. (Hungry people make different decisions than folks who are full.)

All of this complicates the model compared to a simple text based model, but there's no basic philosophical difference.

Expand full comment

Like, it would be interesting to see if it was easier to train it to not generate “stories where Alex loses” or “stories with tentacle sex”. Those seem like things that would be more likely to be identified as important categories in the training set it had.

Expand full comment

"For instance, it has failed to recognize that Alex Rider does not exist in all universes."

Happy the man who remains in ignorance of cross-over fic 😀

Expand full comment

”Generalizing to unseen examples” is not the same as ”conceptual reasoning.” If I use linear regression to estimate the sale price of a house whose attributes have not been seen by the model, this doesn’t imply that the model knows anything about real estate.

Expand full comment

Ask not what bricks can do in real life but what they do in the metaphor space of language.

Bricks are rarely the agents of harm in human language, but rather the agents of revelations , which hit one like “a ton of”

Falling in love and getting hit by a brick are practically synonymous in the metaphor space.

(AI) are not training on who we are, they are training on our metaphors. It’s a game of Telephone forever.

Expand full comment

Saying that the humans "tell it how to process input" is only true in an abstract sense. Humans programmed the learning algorithm that tells it how to modify itself in response to training data. No human ever gave it *explicit* instructions on how to complete stories or how to detect violence; that was all inferred from examples.

Token predictors appear to be doing *some* world modeling. They know that bombs explode, they can answer simple math questions, etc. And while some of the failures seem like they might be failures of cause-and-effect reasoning, many of them seem like it's simply not understanding the text.

Expand full comment

Scott seems to be making an assumption something like "Any sufficiently advanced language model becomes a world model". I'm not sure if there's a name for this assumption or whether it's been explicitly discussed.

I can see where it's coming from, but I'm not 100% convinced yet. As a model gets arbitrarily better and better at completing sentences, at some point the most efficient and accurate way to complete a sentence is to establish some kind of world model that you can consult. You keep hitting your model with more and more training data until, pop, you find it has gone and established something that looks like an actual model of how the world works.

I've said this before, but I'd like to see the principle demonstrated in some kind of limited toy universe, say a simple world of coloured blocks like SHRDLU. Can you feed your system enough sentences about stacking and toppling blocks until it can reliably predict the physics of what would happen if you tried to put the blue block on the yellow block?
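
(Roughly what that toy-universe test could look like, as a sketch: the block names and the "physics" rule below are invented for illustration, not SHRDLU's actual rules. The idea is to train a language model only on the generated sentences and then quiz it on held-out stackings.)

```python
# Sketch of the proposed toy-universe experiment: a tiny block world generates
# sentences with ground-truth outcomes. A model trained only on the text could
# then be tested on unseen combinations to see whether it has implicitly
# learned the "physics". The rules here are made up for illustration.
import random

COLORS = ["red", "blue", "yellow", "green"]
SIZES = {"red": 3, "blue": 2, "yellow": 2, "green": 1}  # bigger number = bigger block

def stack_outcome(top, bottom):
    # toy physics: a block only balances on an equal-sized or larger block
    return "stays put" if SIZES[top] <= SIZES[bottom] else "topples"

def make_sentence():
    top, bottom = random.sample(COLORS, 2)
    return (f"I put the {top} block on the {bottom} block and it "
            f"{stack_outcome(top, bottom)}.")

corpus = [make_sentence() for _ in range(10_000)]
print(corpus[:3])
# A model that completes "I put the red block on the green block and it ..."
# with "topples" has, in effect, recovered the size rule from text alone.
```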

Expand full comment

Are your sentences allowed to include equations and computer code? I'd also really like to see experiments in this direction. I don't think SHRDLU would be a good place to start.

Expand full comment

I think of it more like: the "world" that it's modeling is not the rules of physics, but the rules of fanfiction. There's some complicated hidden rules that say what's allowed or not-allowed to happen in a story, and the token predictor is building up some implicit model of what those rules are.

Now, the rules of fiction do have some relation to the rules of physics, so maybe you could eventually deduce one from the other. But whether or not that's the case, there's still a complex set of rules being inferred from examples.

Expand full comment

He’s definitely discussed it: https://astralcodexten.substack.com/p/somewhat-contra-marcus-on-ai-scaling

Expand full comment

Huh, and it looks like I made the same damn SHRDLU comment on that post too.

I guess I respond to prompts in a very predictable way.

Expand full comment

Scott has the best chatbots, believe me.

Expand full comment

I have a pet theory that might be useful.

I think world-models follow a quality-hierarchy. the lowest level is pattern-matching. the middle level is logical-inference. the highest level is causal-inference. Causal-inference is a subset of logical-inference, which is a subset of pattern-matching.

also, causality:logic::differential-equations:algebra.

i.e. if algebra defines a relationship between variables x and y, then dif-eq defines a relationship between variables dx and dy. Likewise, if logic defines a relationship between states A and B, then causality defines a relationship between dA and dB.

an understanding of causality is what people actually want from AGI, because e.g. causality lets us land on the moon. What ML has right now is pattern-matching, which accomplishes a surprising number of things. But since it doesn't understand causality, its stories can often pass for Discworld but not the real world. So GPT does have a world-model, but it's a low-quality model.

my time reading LW gives me the impression that Judea Pearl discusses this sort of thing in greater detail. but i'm not familiar with Pearl's work directly. except for maybe that one time Scott was talking about Karl Friston and Markov Blankets, and i googled it and was overwhelmed by the PDF i found.

Expand full comment

I’m down with this—at least it sounds like it makes sense, so it passes the first smell test.

However, I object, on convenience grounds, to saying that pattern-matching is any kind of world-modeling. When we say “world-modeling,” we explicitly mean that it’s doing something *other* than pattern-matching.

Your other two distinctions are interesting, though, and are probably what we should use in these discussions to disambiguate types of world-modeling.

Expand full comment

Why do you say it doesn't have a world model?

Having something like an internal world model seems perfectly possible in principle, and I think there's a gradient from "using dumb heuristics" to "using complicated clever algorithms that generalize and capture parts of how the world works", where it seems like better text prediction requires moving towards the world-model-y direction, and in practice it does seem like LLMs are learning algorithms that are increasingly "smarter" as you make them bigger and train them more.

And we don't really understand them well enough to tell for sure that there isn't anything that looks like a crude world model in there, or at least isolated fragments of one.

And maybe I'm misunderstanding, but you seem to be making an argument that to me sounds like it would predict that neural nets never generalize to unseen cases you haven't trained them on, which is not what happens in practice, or they would be totally useless.

Expand full comment

You are missing nothing, this is correct. It's very unclear what the entire exercise was supposed to accomplish.

Expand full comment
Nov 29, 2022·edited Nov 29, 2022

[Epistemic status: vapid and opinionated comment because this topic is making me angrier the more it swirls on itself and eats the rationalist community]

You're very much on the right track, not missing anything. This is all silly and the research should reinforce how silly it is.

I think it is a common minority opinion that this kind of AI alignment work, and all of the AI risk fear that drives it, is not really based on a sense that GAI will be smart, but a sense that humans are stupid. Mostly true, to be fair, but importantly false at the margins.

AI Risk writers and AI risk researchers say the GAI is eventually able to do any clever thing that enables the risk scenario, but they almost always also allow that its structure and training could be *something* like the dumb ML of today: without world model, without general knowledge, without past and future, without transfer learning, without many online adjustments.

It's an Eichmann-like parrot, basically, which is threatening if you think we're all Eichmann-like parrots. We *are* all like that, much of the time, but crucially not everyone always. There is no super-intelligent super-capable Eichmann-like parrot, not even a chance, because Eichmann-like parroting is *qualitatively not intelligence*. It's merely the unintelligent yet conditionally capable part of each of us, the autopilot that knows how to talk about the weather or find the donut shop or suck up to the dominant person in the room.

There isn't even *alien* intelligence coming from a human AI lab, barely a chance, because intelligence is mental potential brought to fruition through teaching, and the quality of the teaching is an upper bound; if we want it to be smarter than us, WE will have to teach it to be essentially human first, because that's the only sense of intelligence we know how to impart and we're not going to find a better one accidentally in a century of dicking around on computers.

There's an outside chance that we teach one that's a little alien and it teaches another one that's more alien and so on and so forth until there's a brilliant alien, but that's a slow process where the rate limiting step is experimentation and experience, a rate limit which is not likely to get faster without our noticing and reacting.

So... it's not happening. You're on the right track with your comment: take this super dumb research and your own sense of incredulity as some evidence that AI Risk is wildly overblown.

Expand full comment

My goodness, I hope this is right. But I’m incredibly wary of it, because it fits with my prejudices far too well. I’ve really gone back and forth on this. At first I held more or less the view you espouse here, and certainly it has merits…but the real, fundamental question (in my mind) is whether intelligence is recursively scalable. If it is, it’s likely that none of these objections matter, because if an AI, by just bouncing around trying random things (which they are certainly able—indeed programmed—to do) discovers this mechanism, it will certainly exploit it, and the rest it will figure out given sufficient time—which may not be very long at all.

It all depends on the fundamental question, “what is intelligence?” which no one has a good answer to.

Expand full comment

> It all depends on the fundamental question, “what is intelligence?” which no one has a good answer to.

I have a pet theory on this too. I've been hesitant to share it, because i feel like someone else should have stumbled upon it by now. but i've never seen it expressed, and i keep seeing the question of intelligence pop up in scott's blog. so even if it isn't original, perhaps it's not well-known. and this prompt seems as good a time to share it as any.

In my head-canon, my theory is called "the binary classification theory of intelligence". I think "information" is another name for "specificity", and "description" is another name for "sensitivity".

the measure of information is how accurately it excludes non-elements of a set. e.g. if i describe a bank robber and say "the robber was human and that's all i know", the data wasn't very informative because it doesn't help specify the culprit. the measure of a description is how well it matches the elements of a set. If i describe people at a party as "rowdy drunk and excited" and that's accurate, the data was highly descriptive. But if it's dark and i say "i think many of them were bald" when all of them actually had hair, that's not very descriptive.

the reason computers are useful is because their memory and speed allow them to be extremely specific. The data is often organized in a tree. Viz. a number system (such as binary or decimal) is actually just a tree. Each number is defined as a path of digits, where each level represents a radix and each node of a level is assigned a digit. "100" (bin) is 4 (dec) because "100" defines a path along a tree whose leaf is 4 (dec). a computer can juggle thousands of specific numbers in its RAM, whereas a human can allegedly juggle "seven, plus or minus two". and more importantly, it can perform algorithms that quickly distinguish desired numbers from undesired numbers. the cost of specificity is complexity. (i think this theory points toward a sane alternative to measuring software productivity in k-locs, though i haven't gotten that far yet.)

0

0 1

0 1 0 1

01010101

-----------

01234567

sensory organs are useful because they're reliable at gathering accurate data, such that your world-model is descriptive of reality. i.e. it has a high correspondence to reality. whereas computers are specific. but if you feed them bad data, their database becomes incoherent. (huh, that sounds rather like the Correspondence Theory of Truth and the Coherence Theory of Truth.) in fact, i would argue that "descriptiveness/sensitivity/correspondence" is a measure of truth, and that "informativeness/specificity" along with "coherence" greatly informs "justification". so the Coherence Theory of Truth should really be called the Coherence Theory of Justification. When data is both veridical and justified, it's known as "knowledge". (huh, that sounds like it neatly solves the Gettier Problem.)

in summary (loosely):

"specific = informative = coherent = justified

"sensitive = descriptive = correspondening = true"

knowledge = justified & true

When you zoom out, intelligence (as a measure, not a mechanism) is a measure of classification ability. The mechanism is simulation aka world-modeling. this is useful from an evolutionary perspective because simulating a risky behavior is better than trying irl and losing an arm. But more generally, i think the concrete benefit of intelligence is efficiency. especially energy-efficiency. the capital cost of intelligence is expensive, but the operating cost is relatively cheap. (huh, just like enzymes.) Which, to me, suggests that recursive improvement is unlikely. because there's only so much an agent can improve before it runs into the limits of carnot efficiency. a jupiter brain can compute and compute all it wants. but in terms of agency, it can't do anything that a human tyrant or industrialized alien species couldn't do already.

Expand full comment

I’m responding because you’re replying directly to me and because I don’t want an idea someone was hesitant to share to pass without comment. But unfortunately this goes over my head. Can you maybe dumb it down somewhat?

Expand full comment

Sorry, I didn't explain that very well. Here's a simpler overview.

IMHO, "intelligence" is best defined as "a measure of knowledge", where "knowledge" is defined as an agent's ability to recognize set-membership. E.g. an agent will label trees as belonging to the category of "trees" and non-trees as not belonging to the category of trees. Few false-positives imply high-specificity. Few false-negatives imply high-sensitivity. High-quality knowledge is both specific and sensitive.
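
(A tiny worked example of those two measures, using the standard confusion-matrix definitions; the data below is made up purely for illustration:)

```python
# Standard binary-classification measures the theory above leans on.
# Labels: 1 = "is a tree", 0 = "is not a tree"; illustrative data only.
truth      = [1, 1, 1, 0, 0, 0, 0, 1]
prediction = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(truth, prediction))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(truth, prediction))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(truth, prediction))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(truth, prediction))  # false negatives

sensitivity = tp / (tp + fn)   # few false negatives -> high sensitivity
specificity = tn / (tn + fp)   # few false positives -> high specificity
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```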

The ramifications shed light on related questions. It encompasses the Correspondence Theory of Truth. It reframes the Coherence Theory of Truth as a theory of justification. It solves the Gettier Problem by refining the definition of a "Justified True Belief". It explains why computers are useful. It suggests a way to measure the productivity of software devs. It explains why information is so compressible. And it explains the relationship between information and entropy.

Since the concept of binary classification is well-known, and since this theory has so much explanatory power, I find it difficult to believe that nobody has thought of this already. And yet I often see others say things like "maybe intelligence is goal-seeking" or "maybe intelligence is world-modeling" or "maybe intelligence is just pattern-matching all the way down" or "I suppose it's anyone's guess". But nothing that resembles "maybe intelligence is specificity & sensitivity".

And while intelligence often entails world-modeling, that's not always the case. Distinguishing intelligence from modeling leaves room to, for example, interpret spiderwebs as "embodied intelligence". Intelligent, but not world-modeling (though I prefer the word "simulation" here).

Expand full comment

This sounds…pretty close to my own thinking, going way back, that “intelligence is about making fine distinctions.” The finer the distinctions he can make, the smarter the person.

I don't know whether that stands up under scrutiny, or whether it’s similar to your idea.

My solution to the Gettier problem is “Knowledge is *properly grounded* justified true belief.” But I haven’t had anyone try to break it, so who knows if it stands up.

You may be interested in the Coherence Theory of Truth discussion here: < https://astralcodexten.substack.com/p/elk-and-the-problem-of-truthful-ai/comment/7979492>

Expand full comment

> the quality of the teaching is an upper bound

Then how do humans ever surpass their teachers?

Expand full comment

How did human intelligence come to exist in the first place, even? We know that dumb processes can produce smart outputs because evolution did it.

Expand full comment

Sorry, let me clarify. The quality of the teaching doesn't create an upper bound that is exactly the ability of the teacher. It is part of an upper bound that is related to the ability of teacher as well as the raw mental potential of the student as well as incremental gains of coincidence.

Consider a teacher student pair where both have the highest raw mental potential, utter brilliance. Let's say the teacher does the best teaching. While the student may accomplish more things, and teach itself more and better in maturity, the student's mature intelligence will be roughly of the same order as the teacher's (ie, not *significantly more*).

Now consider a student with the highest raw mental potential, and a teacher with much lower potential, but excellent teaching skills. Much of the power of the student will be utilized, and the student will outstrip the teacher, but much of its raw mental power will be wasted.

The principles at work here are: (1) teaching unlocks your raw untapped horsepower, and (2) self-teaching is significantly slower than teacher teaching, even for the best self-teachers.

To get runaway intelligence from these principles, both the horsepower development (not teraflops, but raw neural skills like the jump from SVMs to GANs) and self teaching yield have to experience significant jumps as part of a generational cycle of teachers and students that's faster than human decision power.

That is, AI has to suddenly get *way* better at raw thinking, *and* way better at teaching *itself*, during a process that's too fast for you and me to observe and cancel.

Expand full comment

What makes you think someone would cancel it if they observed it? It sounds to me like the state of the art is currently getting rapidly better at both raw thinking and self-teaching, and that AI researchers are laboring to enable that rather than to stop it.

Also, your previous comment sounded to me like you were arguing that computers can't become more intelligent at all except by humans improving our teaching, and now it sounds like you're proposing a multi-factor model where teaching is just one of several inputs that combine to determine the quality of the output, and external teaching isn't even strictly necessary at all because self-teaching exists (even if it's not ideal), and that seems like basically a 100% retreat from what I thought you were saying. If I misunderstood your original point, then what WERE you trying to say in that original paragraph where you talked about an upper bound?

Expand full comment

In intelligence? Is there any evidence that they do? Einstein's most successful kid is an anesthesiologist in a boob job clinic.

Expand full comment

Interesting, but is that necessarily a good measure of his/her intelligence?

Expand full comment

If you're arguing that they don't, that's about the least-persuasive example you could possibly have picked. My claim isn't that students *always* surpass their teachers, it's that they *ever* do. An impressive contrary example would be one where you'd *expect* the student to surpass the teacher and then they fail to, which means you should be looking at smart *students* and *stupid* teachers.

So, rewind one generation: Do you predict that Einstein was taught by someone at least as smart as Einstein? If not, then that gives at least one example where the student surpassed their teacher in intelligence.

If students *never* surpassed their teachers in intelligence, then a graph of the intelligence of the smartest person alive could only go down over time (or at best stay constant, and you'd need a lot of things to go right for that). Are you really arguing that our brightest minds are on a monotonic downward trend, and have been on this trend forever? Where did the original spark of intellect come from, then?

Expand full comment

I'm not entirely sure what you're driving at here, so I'll just note I'm pointing out reversion to the mean. The smartest parents will have children that are in general not as smart. The dumbest parents will have children that are in general smarter. The best teachers will have "surprisingly" mediocre results among their students, the worst teachers will have equally "surprisingly" better than expected results among their students.

Einstein was certainly taught by people who were less gifted than he in physics and mathematics, and it's a major reason he disliked his formal education. As for examples, almost all Nobel prize winners were taught by people who lacked any such record of accomplishment in the field. Because of reversion to the mean.

As for where any individual with unusually high intelligence comes from, that's mutation. Happens spontaneously and randomly all the time. As for where any improvement of average intelligence comes from, that's natural selection. If we were to forbid from breeding anyone who failed to master calculus by 11th grade, and gradually raised the bar to anyone who failed to master relativity, then 30 generations from now everyone could be as competent as Einstein in physics. (Whether average human intelligence could ever exceed the levels that have already been demonstrated by mutation is another story, and I'd be inclined to doubt it.)

Expand full comment

Matthew Carlin argued that AIs cannot become smarter than our teaching because teaching sets an upper bound on intelligence. What I'm driving at is that humans who surpass their teachers falsify this hypothesis.

Then you asked if there's any evidence that humans ever become smarter than their teachers. From your most recent reply, it sounds like you already believe that this is a common occurrence. So now I have no idea what YOU were driving at.

Expand full comment

"Hans Albert Einstein (May 14, 1904 – July 26, 1973) was a Swiss-American engineer and educator, the second child and first son of physicists Albert Einstein and Mileva Marić. He was a long-time professor of hydraulic engineering at the University of California, Berkeley.[2][3]

Einstein was widely recognized for his research on sediment transport.[4] To honor his outstanding achievement in hydraulic engineering, the American Society of Civil Engineers established the "Hans Albert Einstein Award" in 1988 and the annual award is given to those who have made significant contributions to the field.[5][6]"

An outstanding hydraulic engineering professor at UC Berkeley is still not exactly another Albert Einstein, but it's a far cry from an anesthesiologist in a boob job clinic.

Expand full comment

Damn. Another urban legend blown to hell..

I was thinking “wow, what a great job! He must be really smart.”

Expand full comment

You're right, I was thinking of his grandchildren.

Expand full comment
Nov 29, 2022·edited Nov 29, 2022

But the 'rationalist community' emerged largely due to early promoters taking seriously the idea that it would be possible to create sufficiently alien intelligence in the near future. You can certainly dismiss this, the vast majority of humanity does without a second thought, but "taking weird ideas seriously" is kind of the whole point, and this one was always one of the most important.

Expand full comment

"thinking better" is also kind of the whole point, even if it's in service to goals like this.

I think it's long past time the rationalists kill the Buddha (or the rightful Caliph, or whatever) while following his values. I think it's long past time that the rationalists ditch EY and AI Risk in favor of being the community that works on good thought process.

Expand full comment

"Thinking better" sounds nice of course, but after a decade and a half there still seems to be no evidence of this happening, or even any actionable ideas on how to go about it. Nevertheless, having given rise to a blog still worth reading, the community has done much better than most.

Expand full comment

I agree entirely.

Expand full comment
Nov 29, 2022·edited Nov 29, 2022

I doubt they had any real clue whether it would work or not. They just tried it, to see if it might, or might do something else that's interesting instead. This is a perfectly normal way of doing research. You just try shit and see what happens. The only unfortunate part is when you have no useful way of interpreting the results, which is I think kind of what happened here, and is a bit of a typical risk when you're using very black-box models.

As for the distinction: we know human beings construct abstract symbols for things, actions, concepts, and that they can then construct maps between the abstract symbols that predict relationships between concrete symbols which they've never encountered before. For example, a 6-year-old child could observe that when daddy drops a knife on his foot, it cuts his foot and that hurts a lot. She can immediately infer that if daddy dropped a knife on his hand, it would also hurt, even if she's never seen that actually happen. That is, if she is "trained" on the "training data" that goes like "daddy dropped a knife on his foot, and it hurt" and "daddy held a knife in his hand, and safely cut the apple" she will be able to understand that "daddy dropped a knife on his hand" should be followed by "and it hurt" even though she's never seen that exact sentence or had that exact thought before. Similarly, she could probably infer that "daddy held a knife in his foot and safely cut an orange" is at least superficially plausible, again whether or not she's ever heard a sentence just like that before or seen such an action. (When children first learn to talk, they actually do seem to spend some time running through instances of new abstract models like this, trying out ones they've never seen or heard of, to see (from adult reaction) whether the instances actually make sense, in order, one assumes, to refine the model.)

We can certainly analyze human sentences and infer the abstract symbols and the relationships between them that the human child constructs -- she's clearly got one for "body part that can sustain injury" and "inanimate object that can inflict injury" and "action that puts a dangerous inanimate object in a position to inflict injury on a vulnerable body part" -- but once we get beyond very easy and obvious stuff it appears to get dauntingly difficult and ambiguous, a garden of infinite forking paths.

So instead we train the AI on 10 bajillion copies of "daddy dropped the knife on his foot and it hurt" and hope that *some* use of "hand" and "hurt" in some other context somehow gets hooked up to this so that the AI spontaneously recognizes that "daddy dropped the knife on his hand" is reliably followed by "and it hurt." As I guess you do, I find this very unlikely. GIGO. I don't think you can summon the abstract relationship by any number of concrete instances that do not ever fit the relationship. (And if you have the concrete instance in your training data that *does* fit the abstract relationship, then you've failed to duplicate the human example entirely -- you've only shown yo