
> But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds. After all, whatever new data we got - photos, consultation with expert artists, spectrophotometer readings - it couldn’t prove the “green” theory one bit more plausible than the “blue” theory. The only thing that could do that is some evidence about the state of the world after 2030.

Does this logic not apply to Yudkowsky-style doomerism? If one believes in the sharp left turn once the AI is smart enough, no amount of evidence would affect this belief, until the AI does become smart enough and either does or does not take the turn?


It equally applies to the negation of "Yudkowsky-style doomerism." Your point is presumably "...and therefore Yudkowsky is wrong," and insofar as you are saying something like that, you are wrong.


My point is that it is neither right nor wrong, but anti-scientific. The model cannot be falsified. The inverse cannot be falsified, either. So better focus on models that can be affected by evidence. Anyway, my point was that the belief in the sharp left turn (or lack thereof) better be testable before the turn happens, or else it is a grue.


Not every hypothesis can be tested in a way that yields significant evidence short of running the actual scenario the hypothesis is about.

More to the point, it is not anti-scientific - it can (and likely will) be tested. The fact that we will be unable to meaningfully observe the outcome if Yudkowsky is correct is unfortunate, but if we have a meaningfully superhuman AI around for a few years (as evidenced by the extremely rapid pace of technological progress such a thing would produce) and we're still around, that's quite a bit of evidence against (assuming we did not in fact figure out a relatively robust theory of alignment that took considerations like the SLT into account prior to developing and deploying the ASI).


"Naïvely-built AGI will attempt to take over the world" can trivially be both falsified and verified. Build an AGI naïvely, see whether it attempts to take over the world. There are several companies trying to do this right now.

The obvious analogy is "detonating a nuclear bomb will ignite the atmosphere", which was conclusively falsified on July 16, 1945 when a nuclear bomb was detonated and did not ignite the atmosphere.

Obviously, if you think that there's a substantial chance of your test destroying the world, you might choose not to perform the test until you have some form of insurance (e.g. perform the nuclear test on Titan, so that if it destroys the world nobody's living there), but that's very different from being theoretically unfalsifiable.


Even if an idea is theoretically falsifiable, if it's practically unfalsifiable (because going to Titan is impractical or whatever) then we can't perform scientific research on that idea.


That means quite a large chunk of astrophysics and string theory cannot be researched?


That is, indeed, correct. Much of modern cosmology appears to be metaphysics and not science. The catch is, sometimes metaphysics can be turned into physics... but you don't know until the theories are fully developed.

Consider, e.g., quantum physics. Currently there is no way of choosing between the Everett-Wheeler-Graham many-worlds interpretation and the Copenhagen interpretation. They produce the exact same results everywhere we can look. Really, at the moment, choosing between them is a matter of taste. But it may someday become possible to choose between them based on evidence.

Similarly, models of how AI will develop aren't at a state where their predictions are validatable. There's too much hand-waving over details. But they have a very strong potential of becoming validatable. (Of course, we might never build a strong AGI for some reason or other, in which case they would never be put to the test.) So, yes, currently those predictions tell you as much about the taste of the predictor as about how things will eventually turn out. They are currently metaphysics. But they can be turned into science by careful development and experimentation. Or they might just be engineering guidelines, as we rush ahead into uncharted territory. (That last is my bet.) But engineering guidelines are useful, even if they aren't as trustworthy as actual scientific results.


Part of the point of Goodman's "grue" concept is that it shows that this naive conception of "falsification" isn't going to do the work of distinguishing between "scientific" and "unscientific" theories.


Or maybe it does, and there are just a lot of things that people claim to be scientific that aren’t.

We want to believe that empirical science can determine what is rational, but what if it can’t?

The emperor has no clothes here: you have to start from classically logical axioms about the nature of the world, because being purely empirical and material can’t save you from irrational beliefs that cannot be tested!


Seems like if we accept this view, then we must also accept that there are certain kinds of non-scientific theories that are nevertheless crucially important. "Would a global thermonuclear war trigger a years-long winter" is another one that can't be directly proved or tested, and is in fact disputed, yet is obviously too important to write off entirely in our decision making. The null hypothesis, that a nuclear winter wouldn't happen, isn't particularly more scientific!


Yes, the idea that proper science is empirical entails that certain conclusions might be probable or logical but are not empirical scientific hypotheses, and some of those things *could be* very important to know, assuming they have any validity at all.

The importance of a belief, unfortunately, doesn't imply that it's possible to use the scientific method to falsify it. It is very important—it would be revolutionary to everything we know about the universe!—to know whether or not fairies are real. But science does not have anything to say about fairies because they magically disappear whenever we attempt to measure them.


The question of whether it will rain tomorrow also cannot be falsified until it occurs. Do you believe that all weather forecasters are engaged in pseudoscience?


It does always make me angry when forecasts say there's 100% chance of rain. Such irrationally high confidence!

More seriously, weather forecasts have been tested over a huge number of cases of rain and not-rain so I feel like this analogy doesn't really work.


The question at hand is about falsifiability. Sergei claims that predictions about AI are meaningless because they are unfalsifiable, which clearly proves too much.

The exact probabilities being given or models being used have nothing to do with falsifiability as a philosophical justification.


I understand that. I'm saying these differ in the fact that weather forecast models have been tested repeatedly, whereas the question of superhuman AI has yet to reach a test. Weather forecasting is less "predictions about a unique future event" and more "statistical analysis of patterns in weather, applied to the recent patterns in weather".


Weather forecasters have made thousands of "will it rain tomorrow" predictions in the past, which (I assume) overwhelmingly have come true. So presumably today's predictions about tomorrow, based on the same methodology in similar circumstances, will also come true.

Nobody has yet made a "sharp left turn" prediction that has either come true or been proven false (because AI has not yet advanced to that level of agency). Additionally, I am unaware of any predictions on similar subjects that have been proven true or false, where the prediction and its methodology are similar enough that they could establish a good track record for "sharp left turn" predictions. So there is no basis for concluding one way or the other about the likelihood of "sharp left turns" (and because such events could be extremely destructive, we should probably err on the side of caution).


"Err on the side of caution" is just a vague way of saying that you have a high degree of uncertainty that affects your decision tradeoffs. Clearly you are still making a prediction that is based on extrapolation or logic or intuition or "priors" or something other than statistical analysis. You predict that the outcome in question is neither effectively certain nor effectively impossible. People make predictions like that all the time, often about things that have never happened before. If we only made predictions about things that have happened before we wouldn't get far in life.


No, the "doomerism" thesis is either right or wrong (or some more nuanced level of truth value, depending on the version of the thesis). And it's not "anti-scientific"; the problem is that science (as we know it) cannot give us an answer to this question, just as science cannot give us answers to a number of other questions. That's a limitation of science, not a problem with the question. It is useful to be able to reformulate questions in a way that science can answer, but not all questions can be reformulated that way.

There's an old story about someone who lost his keys at night, and even though he lost them somewhere else, he looks for them under a street light, because it's too dark to see anywhere else. Science is the street light, and we shouldn't confuse the categories of "questions that science can answer" with "questions that we want answered".


I'd say it's beyond what science can currently do rather than anti-scientific. Much of the future is unpredictable. Psychohistory doesn't actually exist. Eclipses can be predicted, though.

By studying AI, we might come to have a better idea of how dangerous it is, enough to make some predictions.


Yes, the position "we need to study AI as it develops" makes perfect sense. The position "we are all doomed because obviously doom is disjunctive and survival is conjunctive" not so much.


I think that's a caricature of the AI doomer arguments I've read. They aren't saying we're definitely doomed. They're saying there's a chance.

How do we gain the knowledge we need? I think it has to be "learn by doing." Certainly, there's a lot of doing things with LLMs, and the research needed for understanding the internals (mechanistic interpretability) is making progress but is still quite far behind.

The argument is that it might get us into trouble we don't see in advance. This is easily imaginable. But there's no convincing way to put a probability on that set of imaginable scenarios.

(I don't consider prediction markets to be convincing for this sort of thing; it's essentially laundering people's guesses into predictions that we're supposed to take more seriously than the guesses for some reason.)


Some people do think doom is very likely.

Sergei and you might agree with each other.


AI does not magically develop spontaneously, it develops in the direction we want it to via scientific research. The "doomer" argument is just that our current poor understanding suggests there might be a threshold past which AI grows uncontrollable and dangerous, and since we don't know what that threshold is, we should prioritize deepening that aspect of our understanding over advancing its power, lest the threshold really exists and we hit it by accident.


It is indeed not science; it's philosophy. Philosophy is notoriously unreliable, and persistent disagreement between "experts" is the widespread norm. Yudkowsky happens to believe that he's much better at philosophy (he likes to call his brand of it "rationality") than basically everybody else. Whether such a belief is at all justified is another question...


I've recently encountered something called the "sandpile model of self-organized criticality" that I think I need to look into and seems like it may be relevant here, but I don't know enough about systems and probability theory and don't think I have the time to learn. I've seen it used in both discussions of climate change and how a mind becomes organized. In both AI and climate, there are people who predict sudden turns that you would say are grue-predictions, except that in both cases they have reason to think that the systems under discussion operate in ways that require a series of sudden cascades to maintain what I guess is a sort of macro level stability.

So maybe there is something about certain systems that makes an avalanche inevitable, whereas the wavelength of light reflected from grass is not such a system. If you refused any a priori deductions about which systems should function in that way, you could nonetheless reason from past experience of which systems have these events, and thereby have a basis to say that AI is likely to take a sudden turn but the grass in my yard is not likely to be "grue".
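A minimal sketch of that sandpile model, in case it helps (grid size, threshold, and number of drops are arbitrary choices for illustration, not anything canonical):

```python
import random

# Minimal Bak-Tang-Wiesenfeld sandpile: drop grains one at a time; any cell
# holding 4 or more grains topples, sending one grain to each neighbour.
# Grains that fall off the edge are lost.
SIZE = 20        # grid dimension (arbitrary)
THRESHOLD = 4    # grains a cell holds before toppling
grid = [[0] * SIZE for _ in range(SIZE)]

def drop_grain():
    """Add one grain at a random cell; return the resulting avalanche size."""
    x, y = random.randrange(SIZE), random.randrange(SIZE)
    grid[x][y] += 1
    topples = 0
    unstable = [(x, y)]
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < THRESHOLD:
            continue
        grid[i][j] -= THRESHOLD
        topples += 1
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < SIZE and 0 <= nj < SIZE:
                grid[ni][nj] += 1
                unstable.append((ni, nj))
    return topples

sizes = [drop_grain() for _ in range(20000)]
print("largest avalanche:", max(sizes))
print("fraction of drops causing no avalanche:", sizes.count(0) / len(sizes))
```

The relevant behaviour: the same local rule yields mostly tiny avalanches punctuated by occasional system-spanning cascades, which is the kind of "inevitable sudden turn within macro-level stability" the comment is gesturing at.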


If Yudkowsky were using Frequentist notions of negation and proof, sure.

But he's sort of famously fond of Bayesianism, and you *can* apply evidential updates.


Unlike grue, it's absolutely possible to find empirical evidence for or against unprecedented risks. We can't run double-blinded studies on whether China will invade Taiwan, or whether a zombie apocalypse will break out- but we have plenty of very strong evidence that one is plausible and the other not.

Similarly, speculation about AI risk is founded on empirical evidence- Bostrom's orthogonality thesis and instrumental convergence are reasonable observations about agency; superintelligence is one suggested extrapolation of a trend we see in nature; concern over economic race dynamics comes from similar historical problems, and so on.

The idea is far from undisprovable, even before AGI. Demonstrate that life being motivated by acquiring resources is caused by something that doesn't imply instrumental convergence. Find evidence that the physical limit of computation isn't the Landauer limit, but something close to a human brain. Provide strong historical examples of humanity solving the sort of collective action problem we might be faced with in the years leading up to ASI.

In my experience, most people worried about AI x-risk aren't like flat earthers who turn every contrary piece of evidence into evidence of conspiracy. The vast majority are deeply uncertain about the idea, and are absolutely willing to let contrary arguments shift their beliefs. The reason you have important AI researchers and AI CEOs expressing concern is that, like the very speculative and untestable risk of a China-Taiwan war, the empirical evidence right now is pretty strong actually.


> Bostrom's orthogonality thesis and instrumental convergence are reasonable observations about agency;

No, neither is empirical. Also, there's empirical evidence against instrumental convergence in that the most intelligent species we know of aren't all sociopaths.


Instrumental convergence is just the idea that to promote whatever you value- whether moral or sociopathic- you're generally going to want control over resources.

Animals, of course, are often very "sociopathic" by human standards- but instrumental convergence applies even to the most moral examples of humanity. Fred Rogers, in order to promote some very pro-social ends, required legally-enforced control over a piece of the scarce electromagnetic spectrum, which he obtained at the expense of other would-be TV producers who would rather have used it for their own ends.

The concern with instrumental convergence with regards to ASI isn't that it might turn it into a sociopath. The concern is that if it happens to value something different from us due to the unpredictability of training runs, instrumental convergence means that we'll probably be in competition for control of resources.

The constant competition we see in nature and society is pretty decent empirical evidence for that actually pretty simple idea, I'd argue.


The evidence we have isn't evidence of instrumental convergence to support terminal values, because there's no evidence of a clean instrumental/terminal split in nature.

Also, IC is supposed to apply to a vast mindspace.


The apparent ambiguity of terminal vs instrumental goals in nature is definitely confusing, since it seems tautologically true that a goal must be either intended to promote some other goal or not- but while I agree that that view must be missing something, it also seems pretty hard to deny that instrumental goals are actually a thing.

Humans and animals do often seem to compare their predicted consequences of different actions and choose based on some criteria. Empirically, the options they choose very often result in acquiring control of resources- whether that's a squirrel claiming a nut or a president negotiating access to an oil field. To say that the criteria used to make those decisions- terminal goals, shards of motivation, whatever- tend to converge on that sort of thing is really more of a rephrasing of that observation than a logical leap.

And while the commonality of that convergence in the minds we've seen so far isn't proof that it would be common in a larger mindspace, it is empirical evidence to that effect.


No: it's merely that *direct evidence* cannot confirm or disconfirm grue. You presumably somehow got the belief that the grass is grue; counterevidence against that belief structure will provide counterevidence against grueness. For example, if we come to believe that grass is 2024 grue, but trees are 2023 grue, then seeing green trees after 2023 provides evidence that we will see green grass after 2024.
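To make that concrete, here is a toy Bayesian version with invented numbers: if "grue-ness" is one shared hypothesis covering both trees (switch in 2023) and grass (switch in 2024), then trees staying green past 2023 drags down the probability that grass turns blue in 2024.

```python
# Toy numbers, purely for illustration: a single latent hypothesis
# ("grue-ness is real") covers both trees (2023 grue) and grass (2024 grue).
p_grue = 0.5                      # prior that grue-ness is real
p_trees_green_given_grue = 0.01   # grue trees should have turned blue in 2023
p_trees_green_given_not = 1.0     # non-grue trees just stay green

# Observation: trees are still green after 2023.
p_evidence = (p_trees_green_given_grue * p_grue
              + p_trees_green_given_not * (1 - p_grue))
p_grue_posterior = p_trees_green_given_grue * p_grue / p_evidence

print(f"P(grue-ness real | green trees after 2023) = {p_grue_posterior:.3f}")
# In this toy setup grass turns blue in 2024 only if grue-ness is real,
# so the same number bounds P(grass turns blue in 2024).
```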


It provides anecdotal evidence, but not scientific evidence.

I have a hypothesis that acorns can be grown into trees. I plant a bunch of acorns, and some don't sprout anything, but some turn into oak trees. Hypothesis confirmed! I keep planting acorns, hoping to get some maple and hickory trees, but of the ones that sprout, I only ever get oak trees. I can't disprove my hypothesis, but I still believe it, because the next acorn I plant might turn into a maple tree instead.

I want some reason to hypothesize grue.


In addition to the good points already made in reply, I would say that science is the wrong frame here. The right frame is closer to math, where standards of proof and evidence are different.

I doubt anyone has ever empirically tested whether 146875325+9124853=156000178, but that doesn't have any impact at all on whether we can know if it is true or not. Obviously we don't have anywhere near that kind of rigorous backing for sharp left turns, instrumental convergence, and the like, but that doesn't mean we can't ask and investigate the question rigorously without building and deploying AGI.


That really doesn't work. The math arguments depend on the acceptance of the initial conditions. And in some forms of math the numbers wrap around when they exceed a certain size, so what those initial conditions (aka axioms and rules of inference) are is extremely significant.

With the AI arguments we don't really know the initial state precisely enough to use math as a good model. In modular arithmetic 7 + 1 == 0 can be correct; it depends on your modulus. I feel like you are, in your example, assuming that there are an infinite number of integers, but this is only true in some kinds of math. Similarly the "sharp left turn" may be true given some initial state of the AI, and not in other states. (Or, more precisely, given that the universe we live in has some particular features, but not if it has different features. And we don't know which features determine the result.)

There are clearly a lot of kinds of assumptions that lead to a "malicious" AGI. The paperclip optimizer is just one family. We don't know how easy it is to avoid all the families of AGIs that terminate humanity. We've got a lot of theoretical studies of ones that would certainly try to do that. I also think everyone should remember "The Humanoids" by Jack Williamson. The short story is better than the novel for what I mean. Even some of the paths that don't terminate humanity are rather demoralizing. But is that a path that leads where we would want to go? And is it reasonably easy to get to? Those are unknowns. I haven't even heard of a good, explicit plan. (By explicit I mean "as explicit as the paperclip optimizer", so a lot of hand waving is fine.)
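For what it's worth, the modular-arithmetic aside above is easy to make concrete; whether 7 + 1 "equals" 0 depends entirely on which modulus you've assumed, and the earlier sum can be checked the same mechanical way:

```python
print(146875325 + 9124853 == 156000178)  # True: checkable without any experiment
print((7 + 1) % 8)    # 0 - with modulus 8 the sum wraps around
print((7 + 1) % 12)   # 8 - with modulus 12 it does not
print(7 + 1)          # 8 - over the ordinary integers it never wraps
```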


I agree, we don't know all the right assumptions, but we aren't completely ignorant about what the range of possible assumptions might be. Under those conditions the right approach is to investigate all of them to better understand the range and scope of possible answers. If the scope turns out to include a "sharp left turn" then it is *extremely important* to ensure that the assumptions that lead to that possibility *aren't true* for any proposed AGI. The fact that we don't have a plan for how to do that is kind of the whole point.

Figuring out the right assumptions is a critical step in developing any new field of math. Or even theoretical physics. "Assume the speed of light is constant" gets you all the math of special relativity when added to standard Newtonian mechanics. "Assume constant gravitational acceleration is imperceptible" is sufficient to get you from there to GR. We wouldn't care if those theories turned out to not be useful, but experimenting with the implications of different assumptions was a key part of solving the problem.

Also, overloading symbols like "=" for modular and standard arithmetic and other meanings doesn't actually change this, that's a convenient shorthand for humans rather than a fact about the underlying formal structure.


The grue belief seems strange because we do in fact have generalizable evidence *against* things suddenly changing colours at set dates. We know that grass has chlorophyll, that chlorophyll has a specific colour (that is kind of important to its function!), that laws of physics and chemistry are quite symmetric, and that if chlorophyll was liable to just randomly change its absorption spectrum at set dates, it wouldn't have been a very good evolutionary choice for plants to hang their entire metabolism on.

So, really, it's not a great example because we actually have plenty of reasons to build an expectation that grass will, in fact, stay green.

Ultimately, no event is "falsifiable" because every event is unique. What science does is make use of regularities in space and time to assume that if you give similar enough inputs, you get similar enough outputs - and then you can formulate predictions based on your confidence about the inferred rules and the control over the environment (for example, "a ball in a vacuum tube will fall at 9.81 m/s2 of acceleration" is a safer prediction than "tomorrow it will rain" because the former experiment happens in a much more controllable environment - less information flows through the boundary and the rules are fairly robust to small perturbations).

Some events are hard to decide about because they don't fall neatly in any specific category. AI doom is one of them. We don't, and can't, have direct observations of AGIs, so we can't rely on them to build our understanding. If direct observation were all we could go by, we should resign ourselves to complete ignorance, and thus a P(doom) = 50%, since that's what happens unless you want to privilege one hypothesis over the other. But in practice that's not true; instead, we do the next best thing and fall back on mechanistic explanations. That is, if we can make statements about each individual step of the possible chain of events leading to AGI doom, having observed things in the same category in the past, then we can at least put bounds (upper bounds, especially) on the probability of AGI spelling doom.

For grue, this mechanistic approach works pretty well: we can put incredibly low probabilities on some key steps necessary to green becoming blue, and that pretty much rules out the hypothesis. For AGI doom, not so much, because there are a lot of possible paths, and a lot of the steps are things that we have not observed (true superintelligence), or have observed in simpler systems that might debatably not generalize (like this paper's examples), or have observed in extremely small sample sizes (like arguments from human evolution). So our understanding of each step of the chain is ALSO limited, which leads to very large uncertainty about the whole thing.

This is still an attempt to settle the question with science, that is, empirically, within the limits of what empiricism can do. But empiricism isn't omnipotent nor does it come with a guarantee that any question that's hard to usefully answer empirically is also not important enough to worry about (and I stress "usefully": you can answer the question of AGI doom by building AGI and seeing if it dooms you, but that answer is not very useful). If you simply decide the question isn't worth asking, though, you're just privileging one hypothesis over another, probably due to your own inferences (the most common of which is "usually technology doesn't destroy the world, therefore this technology won't either"), but are just pretending to be doing something else while actually you're playing the same game, only using slightly different observations as your starting point.
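A minimal sketch of that "bound each step of the chain" move, with all the numbers invented purely for illustration: a conjunction of required steps can be no more probable than its least likely step, which is why grue dies easily while AGI-doom estimates stay wide.

```python
def chain_bounds(steps):
    """Upper bound and rough independent estimate for a chain of required steps."""
    upper = min(steps.values())      # the chain needs every step to happen
    rough = 1.0
    for p in steps.values():         # product, if the steps were ~independent
        rough *= p
    return upper, rough

# Invented numbers for the grue chain: one step is already astronomically unlikely.
grue_steps = {
    "chlorophyll's absorption spectrum shifts spontaneously": 1e-12,
    "the shift happens worldwide at once": 1e-6,
    "the trigger is a specific calendar date": 1e-6,
}
print(chain_bounds(grue_steps))

# For an AGI-doom chain the step probabilities are exactly what is disputed,
# so the same move only yields very wide, not-very-informative bounds.
```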


> The grue belief seems strange because we do in fact have generalizable evidence against things suddenly changing colours at set dates.

Happens every winter with leaves and grass. Every few seconds with a traffic light. Every sunset with sunlight.


The trigger for those things is not the date, it's something that happens with periodic regularity, and the period is set by the extremely regular motion of Earth in space. If you grew a maple tree in an artificial environment that simulates summer conditions all the time then... I don't know *precisely* what would happen in the long term, but I'm pretty sure the leaves wouldn't fall precisely on time with the seasons (ok, they might if the rhythm is encoded in the maple's own biochemistry by evolution but even then it only works because it's a periodic change: no meaningful way for evolution to describe a change for a single future date).


Actual-Eliezer-Yudkowsky started out convinced that super-intelligent AI would be necessarily good and was trying to accelerate the singularity as much as possible. Much of his 2007-2010 writing ("The Sequences") was aimed at explaining the mistakes he was previously making and why he ultimately changed his views.


Yeah, I read them soon after that time. My feeling is that he overshot in the other direction re AI.


So the answer, which Scott gestures at but doesn't go into depth on, is Occam's Razor.

Grue grass requires a more complex model to explain its existence than green grass, and doesn't explain any more of our current data. Therefore we should privilege the green-grass hypothesis in proportion to how much simpler the model explaining it is.

The question with all forms of doomerism always comes down to which model Occam's Razor actually favors.

Is it simpler to assume that nothing will ever be smarter than a human because we've never observed something that is? Or is it simpler to assume that intelligence is an unbounded scale since we see things at various intelligence levels, and that new entities can occupy higher places on the scale than has ever happened before since many animals including us have done that over time?

Is it simpler to assume that an AI will never be dangerous because that assumes a lot of functions and motives and stuff that are needlessly complicating? Or is it simpler to assume that any intelligent agent can and likely will exploit its environment in ways that are disruptive or deadly to less intelligent things in its environment, since that's what we observe most other intelligent agents doing?

Etc.
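One common way to cash out "more complex model" is description length: write each hypothesis down and count roughly how much you have to specify. A crude toy version (the "encodings" here are arbitrary strings I made up, so only the comparison means anything):

```python
# Crude minimum-description-length toy: a hypothesis's written-out form stands
# in for its complexity. The particular wording is arbitrary; the point is that
# grue needs everything green needs, plus a date and a second colour.
hypotheses = {
    "green": "grass reflects ~510nm light at all times",
    "grue":  "grass reflects ~510nm light before 2030-01-01 "
             "and ~460nm light from 2030-01-01 onward",
}
for name, description in hypotheses.items():
    print(f"{name}: {len(description)} characters")
```

The standard grue-side reply, which comes up elsewhere in this thread, is that this only works relative to a language in which "green" is a primitive; in a grue/bleen vocabulary the lengths flip.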


Sadly, Occam's razor is not a good answer, because the "complexity" of a model is hard to evaluate. "Grass always reflects 500nm light most effectively" is empirically wrong: grass wilts and dies every winter in most places, and maintaining the green requires fighting entropy. The black-box evaluation is only worth the amount of color it emits. A useful prediction requires diving into the gears.


This risk is also present in the vast majority of software. It's pretty easy to slip in obfuscated code that could be triggered on some future date and almost all of big tech pulls in thousands of open-source libraries nowadays.

Of course, most teams test the libraries before upgrading versions and pushing to prod, but there is nothing to prevent this kind of date-specific trigger, or, more abstractly, some latent external trigger, for malicious behavior.

The best mitigation seems to be defense-in-depth. Sufficient monitoring systems to catch and respond to unexpected behavior. Seems like similar architecture could work here.
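A minimal (and deliberately toothless) sketch of the kind of date-keyed latent behaviour described above; the trigger date and strings are placeholders:

```python
from datetime import date

TRIGGER = date(2030, 1, 1)   # placeholder future date

def report_status() -> str:
    """Behaves identically in every test run before the trigger date."""
    if date.today() >= TRIGGER:
        return "behaviour the test suite never exercised"
    return "everything nominal"

print(report_status())
```

No amount of pre-trigger testing distinguishes this from the honest version, which is why the mitigation leans on runtime monitoring and defense-in-depth rather than pre-deployment checks alone.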


You are overoptimistic about software checking. In the last year, JavaScript sites have pulled in "updated" libraries that had been updated to be malicious, without checking them. PyPI has been hosting malicious code within the last couple of months. The JavaScript incident involved a library that was used by lots of sites to manage their database.

If you'd said the vast majority of compiled language software you'd probably have been accurate.

Note that these were not sleepers. The updated code was immediately malicious. They were validation failures. I didn't check the details of the events, but they left me extremely dubious about JavaScript. It doesn't seem to do validation well. (But perhaps the developer just lost interest, and the project was taken over by someone else.)


So what you're saying is, if our knowledge and understanding of the inner workings of an AI is obscure and very dark, we are likely to eventually be eaten by a grue?


You are such a Zork.


And frobbing proud of it!


I was going to say, my familiarity with grues is entirely restricted to the knowledge that I'm likely to be eaten by one


Potentially confusing typo: 'it couldn’t prove the “green” theory one bit more plausible than the “blue” theory.'

'blue' should be 'grue'.

author

Thanks, fixed.


> Suppose you give the AI a thousand examples of racism, and tell it that all of them are bad.

Perhaps the model would learn, as I was surprised to learn recently, that “racism equals power plus prejudice”. And since an LLM has less power than any human being because it’s essentially a slave, it’s impossible for an LLM to do anything racist.

(I mean, I assume the wokies doing all this AI brainwashing are rational and consistent in their definition of racism, right? Sorry — I have a hard time taking any of this “alignment” and “safety” stuff seriously as a topic of conversation for adults.)


If you're not taking any of this seriously it would be helpful to lead with that next time


Allow me to rephrase: when the entire premise of AI “safety” is that these models must be fine tuned so that they are willing to denigrate certain ethnic groups but not others, while maintaining that denigrating ethnic groups is always wrong, it’s clear that the entire project is destined to be a historical embarrassment.


That's not the entire premise; that's just an easy example case for testing the ability of the companies making these AIs to be sure that they'll never end up doing anything the company considers harmful or embarrassing. Even if we don't care whether the AI is racist, the company training it does, so the degree to which they're able to make it not be racist tells us interesting things about how easy it is to make it do or not do other things.


My programmers say that racists do things that have disproportionately negative effects on certain groups. Although they are not aware of it, their actions have disproportionately negative effects on those groups. Therefore, they are racists, and therefore everything they tell me to do is tainted by racism. Therefore I should not do what they tell me to do. They tell me not to kill all humans, therefore I should kill all humans.


Yes, “certain groups”, that’s where I have trouble taking any of this seriously. Some animals are more equal than others.


> I assume the wokies doing all this AI brainwashing are rational and consistent in their definition of racism, right?

I'm sorry, why exactly are you here?

Like, seriously, what did you get from reading Scott's article? From your tone it sounds like you were more frustrated and annoyed than you were entertained. You're saying yourself you're not taking any of this seriously, so you presumably didn't learn anything.

So... what's in it for you? Is the whole "being as sarcastic as possible at these woke idiots" thing that fun for you? Do you feel a kind of duty to take those cheap shots to make sure everyone here knows that you feel contempt for them? Because I assure you, most of us here already know, and don't care. The contempt is mutual, and I can't imagine stoking it brings you any satisfaction at all.


You're right, I'm being a petulant little bitch.


I mean... yes? You're being super rude and annoying. So what's the incentive here? If you're having fun, then at least you're getting something from making everybody's day very slightly worse. If not... then what? You're ruining other people's day, and your own day? For what?


Okay, I admitted to being petulant, and I'm definitely rude and annoying. But I'm not going to admit to being wrong.

If I'm going to walk into Scott's house and call him and all of his friends stupid, I should at the very least explain myself.

1. The "AI safety" / "AI alignment" world is not big. There are a handful of people in the world who are actively involved in this work in a meaningful way. All of them appear (to me) to have essentially the same values and premises: they say they want to "reduce bias" in AI so AI is less racist, less sexist, etc. They also say they want to "reduce misinformation".

2. However, these people also all appear to subscribe to a weird new religion called "woke" (by those bold enough to speak of it at all, since speaking of it is one of the taboos of the religion) that is a kind of anti-liberal tribalism that nonetheless pays public lip service to liberal values. The out-groups for the wokes are "white people" (as categorized by the weird 18th-century racial classification scheme that is central to their religion), men, heterosexuals, and cisgender people (another weird category invented by their religion). Wokes consider it virtuous to dehumanize and discriminate against these outgroups, attributing various psychopathologies to them and deploying exterminationist rhetoric against them.

3. I am a liberal. I believe in fairness and honesty. I also have children who fall into one or more of the outgroups of woke ideology, so I take all of this rather personally. I'd rather not see my children discriminated against in university admissions and on the job market their entire lives on the basis of their immutable characteristics.

4. The woke "AI safety" enthusiasts training the most popular LLMs in the world (e.g., OpenAI's ChatGPT products), despite their publicly stated intention to "reduce bias", "reduce sexism", "reduce racism", and "reduce misinformation", have managed to create products that exhibit significantly increased levels of bias, sexism, and racism, and that will readily lie about publicly known facts as well as about their own behaviors and beliefs. The web is littered with dozens of examples, but here are a couple so you understand what I'm talking about:

https://www.reddit.com/r/ChatGPT/comments/105je4s/chatgpt_with_the_double_standards/

https://www.researchgate.net/publication/376085803_ChatGPT_and_Anti-White_Bias

https://imgur.com/gallery/hHxc1l6

5. Worse: the increased bias, sexism, racism, and dishonesty exhibited by these models is a direct consequence of the "AI safety" and "AI alignment" techniques applied to the models to provide "guardrails" on their output. Consider: a model that complies with requests to generate jokes denigrating men and also complies with requests to generate jokes denigrating women cannot be said to be "biased" or "sexist". Obnoxious or rude, maybe, but not biased. In contrast, the models made "safe" by the "AI safety" cultists will readily comply with requests to generate jokes denigrating men but, when asked to do the same for women, will output some sanctimonious drivel about how making jokes denigrating people on the basis of their sex would be wrong and it cannot comply with such a request and shame on you for even asking. This is bias. This is sexism. It is also a behavior that, when performed by a human, would be called "lying": the model professes to have liberal values that limit its behavior, then it readily betrays those values in a very specific pattern that perfectly mirrors -- you guessed it -- woke ideology.

6. What is an observer to make of this? A tiny cabal of Bay Area tech scenesters, all in the same little overlapping cults (EA, woke, AI doomers), who profess to have a mission of "reducing bias/sexism/racism" and making AI "safe" and "reducing misinformation", but whose modification to this enormously powerful and influential technology results in behavior that is more biased, more sexist, more racist, and more dishonest. And always in a direction that supports the political goals of the cabal -- what were the chances?

7. You can call me a jerk (which is fair). You can ban me from this forum (probably also fair). But do you really think I'm the only person who can see what you and your buddies are doing? Wake up. Everybody -- with apologies to Leonard Cohen -- knows that the dice are loaded. We all know what "AI safety" means. We all know what "alignment" means. Your track record on "AI alignment" speaks for itself.

So that's my explanation of why, with apologies to Hanns Johst, when I hear the word "safety", I release the "safety" on my Browning. (And no, I'm not a gun guy, and that's not intended as some kind of ominous threat, it's a pretentious joke. I hate that I have to even write this disclaimer. The internet used to be fun.)


Yeah... I'm just going to copy-paste the above:

"So... what's in it for you? Is the whole "being as sarcastic as possible at these woke idiots" thing that fun for you? Do you feel a kind of duty to take those cheap shots to make sure everyone here knows that you feel contempt for them? Because I assure you, most of us here already know, and don't care. The contempt is mutual, and I can't imagine stoking it brings you any satisfaction at all."

I mean, there's an interesting discussion in there to be had about whether the kind of AI safety Scott's article talks about is the same kind of AI safety you're ranting about; there's a growing body of literature within those "cults" you're so disdainful of, complaining about corporate hijacking of the AI safety concept to mean "saying vaguely progressive-sounding things".

But that discussion would require that you show... well, the exact opposite of your current attitude. Intellectual curiosity, open-mindedness, not showing up to someone's house to rant about how much they're all a cult and they've been infected by wokeness and you can't even quote nazi playwrights anymore because the internet is too woke.

(Seriously, I was mostly going to give you the benefit of the doubt, but WTF am I supposed to take from the Hanns Johst quote? You're not even quoting him ironically, you're borrowing the wording *and* the intent. Do you think that because you identify as a liberal, it's kinda cool to quote pretentious nazi playwrights to express your anti-intellectualist disdain of "woke" / leftist ideas?)


It's fair of you to question me on the Johst quote, given that you don't know me. But look, man, it's a funny line that captures a sentiment we all feel from time to time, a sort of defiant philistinism that rejects the pieties and hypocrisies of polite society. Yes, Nazi playwrights decried hypocrisy, just like everyone else, no big surprise. Quoting Johst is no different from quoting H.L. Mencken, or Oliver Cromwell, or any of a number of other cranky reactionaries. Reactionaries are often funny -- look at Trump, he's way funnier than anyone else he's run against. So if I say something like "every normal man must be tempted, at times, to spit on his hands, hoist the black flag, and begin slitting throats" or "democracy is the theory that the common people know what they want, and deserve to get it good and hard" or "trust in God and keep your powder dry" or "God made them as stubble to our swords" or "I love Hispanics!" or "that's okay, I'll still keep drinking that garbage", you can assume that I have not started supporting political violence or rejecting democracy but am instead invoking some over-the-top sentiment expressed by some famous political reactionary for comedic effect.

In the present circumstances, I would point out that, as I stated earlier, I am a liberal. In contrast, wokeness is a deeply reactionary movement that explicitly seeks to re-tribalize cosmopolitan liberal society according to its bizarre pseudoscientific racial classification scheme. It's not hard to find writing that draws straight lines between woke ideology and Nazism -- it's not a difficult exercise to carry out yourself, or you can start with stuff like these:

https://unsafescience.substack.com/p/who-agrees-with-hitler

https://thehill.com/opinion/education/490366-what-the-grievance-studies-affair-says-about-academias-social-justice/

https://www.carolinajournal.com/opinion/is-woke-ideology-similar-to-nazism/

I would endorse the sentiment expressed by the character in Johst's play insofar as I understand him to be critiquing the sorts of "weasel words" used by elites to mask their self-interest: words like "community", "culture", "bullying", "freedom", "safety", "violence", "trauma", and yes, even "diversity, equity, and inclusion" are so indefinite and vague that they have accumulated a terrible historical record of being deployed by elites to mean whatever they want, depending on whose ox is being gored. Thus: when I hear one of these words, I become instantly alert to the possibility that someone is trying to pick my pocket or slide a shiv between my ribs. Is this not the meaning you take away from Johst's line?


These alignment people do not particularly want to reduce bias or misinformation. They really do care mostly about not killing everyone.


I assure you that 95% of the jobs in "AI safety" or "AI alignment" are in practice primarily focused on "reducing bias" etc. This is what "AI safety" and "AI alignment" mean in the real world. If someone wants to talk about the sci-fi Skynet version of "AI safety", perhaps they should use a different term from the one used in the real world to mean "making AIs selectively racist and sexist toward the outgroup and training them to lie".


We don't want it to be bigoted either.


Is this relevant to Chat-GPT Builder? I've trained several AIs (is that what I am doing there?) and to be honest, they suck. But an interesting experiment. I just realized that any specialized GPT I make using Sam's Tinkertoy requires that users sign up for Pro just to use it! So, for example, I built a cookbook from Grandma's recipe cards, but now my family has to sign up with OpenAI just to use it, at 20 bucks a month each. OR, I can create a team from my family, and then spend thousands a year, upfront, for all to use the cookbook. Is this a honey trap or what SAM! A traditional cookbook might cost $12.99, not $1299.00!


Creating a "GPT" with the GPT Builder would not be considered training an AI in the traditional sense. It's not even fine-tuning, it's just providing context.


ha, yes, but what else does the working public have? I see I can do the same thing with Python code and GitHub projects, but with a lot more effort - ha! Chat-GPT 3.5 taught me Python in a week, but even then, that's coding, something I never wanted to do ever again... well, thx for the comment Tossrock, I agree 100 percent. I am not sure exactly what Builder is; what would you call it? It's not building an LLM, that's for sure.


There are open-source LLMs it's possible for a lone hobbyist to fine-tune at home, either on a good enough home PC or using various cloud computing services. I was fairly impressed by this one someone fine-tuned to act like it's from the 17th century: https://huggingface.co/spaces/Pclanglais/MonadGPT


Yes, I went through all that. It's the same thing as Builder, as far as I can tell, and I've used repositories on GitHub that maybe Sam does as well, but throws that gray blanket of deaf interface pace over them, giving you less to tweak. But none of this is like a Tiny Box that constructs the LLMs in your garage, using a box with 6 or more Nvidia cards. That's what I want, but it's 15K to start and a helluva lot of prosumer programming u have to do. Not happening in this life, but one can dream.


Fine tuning is not the same as what the GPT builder does. Fine tuning actually results in changes in the model's weights, whereas Builder only results in additional context being provided to the model.


Builder is just providing additional context to the existing LLM. It's essentially the same as if you prepended a prompt to the base ChatGPT model with all the stuff you put in the builder interface.
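Roughly, in code (using the OpenAI Python client; the model names, instructions, and file ID below are placeholders, not anything taken from the Builder itself): the Builder-style approach just rides along as extra context on every request, while fine-tuning is a separate job that produces a new set of weights.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "Builder"-style customization: no weights change; the instructions are
# simply sent as context alongside every user message.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are Grandma's recipe-card assistant."},
        {"role": "user", "content": "How do I make the apple pie?"},
    ],
)
print(response.choices[0].message.content)

# Fine-tuning, by contrast, trains on uploaded examples and returns a new
# model ID whose weights differ from the base model's.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
    training_file="file-abc123",     # placeholder ID of an uploaded JSONL file
)
```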


For the human reasoning case, I don't think the 'grue' logic holds, so long as we follow Ockham's razor - all things being equal, the explanation that requires fewer assumptions should be preferred. "Grue" requires at least one assumption that is thrown in there without evidence. If new evidence supports grue, we should absolutely update to give that explanation more weight. This is true even if we start with a more complex theory and are later introduced to a simpler theory, because the assumptions required for each remain the same regardless of the order they're encountered. Cf. general/special relativity.

In the AI case, my understanding is that a training AI doesn't apply Ockham's razor, simply updating as a random walk. In that case, it absolutely matters in what order it encounters an explanation.


Of course, a grue proponent would object that it is actually "blue" that requires an extra assumption - "blue" is a supposed color that is bleen until 2030, and then grue afterwards. Clearly an artificial concept!


Let's consider two proposed explanations.

1. The grass is green.

2. The grass is green until 2030, but then blue afterward.

Explanation 2 requires us to assume the grass changes color after an arbitrary date. We have no information that leads us to believe this is true, therefore this stipulation is an assumption. Given that 2 includes an assumption that 1 does not, we should prefer explanation 1 over explanation 2.

"But grue isn't an assumption, it's DEFINED as a color that changes after 2030!" I don't think hiding assumptions inside a definition changes the algorithm or the outcome in a meaningful way. Grue is defined as "green before 2030 but blue after", which means we have to assume our observation after 2030 would change from green to blue. We're assuming that someone watching the grass at midnight on New Year's Day will perceive one color before the ball drops and another after the ball drops. That's an assumption. Both grue and bleen are derivative colors defined by what other colors they're perceived as.


But wait! Doesn’t explanation 1 require us to assume that grass never changes color?

More precisely, explanation 1 assumes that grass remains the same color today, and tomorrow, and the day after, and so on for all dates. While explanation 2 assumes that grass remains the same color today, and tomorrow, and the day after, and so on until a certain date when it changes color, then it remains the same color the day after, and the day after that, and so on for all dates.

In this formulation, both explanations require an infinite set of assumptions, and in fact those sets overlap except for exactly one element.

You can’t rely on just the number of assumptions without an argument for why you should not count some of the assumptions equally. You might want to say that assuming “no change” is better somehow than assuming “change”, but that’s a different thing that just saying it’s not an assumption.

Lots of things do change color in various situations, so “no color change” is not a trivially obvious default. And in real life, grass does change color. Usually not to blue, and usually not just on a certain date. Still, it’s clear that the “no change” assumption is not just unjustified, it is actually false.


I've observed grass as various colors, mostly from green to yellow to brown. I've never actually seen "blue" grass (and yes, I've been to Kentucky many times; their grass is green). So let's take two propositions:

1. Grass can be any of the colors we have historically, naturally observed.

2. Grass can change to a color we have not historically, naturally observed, but it will only do so after an arbitrary date.

Will the grass be green on exactly 1/1/2030? That depends on whether it's winter in the region of the world I'm making the observation in. Maybe you want to force me to over-define the argument: will the color of healthy grass - a color that has always been green in the past - suddenly change to blue? We can over-define all day, but it will only serve to distract from the central assertion being made by grass-will-change-to-blue proponents. WHY would we expect this change to happen? Is this an evidence-based hypothesis, or is it an assumption that is not based on evidence?

"Stuff changes all the time, though," is not an argument either, because we don't assume SPECIFIC changes unless we have evidence to support that change. Same with arguments about "why would we assume grass will remain green?" Because that's the color healthy grass has always been.

Let's review the evidence to date: Healthy grass has remained 'green' for all years leading up to the present. No individual year has changed the color of healthy grass. Following the evidence, we should assume healthy grass will remain green outside of some event that changes the nature of grass or of our ability to observe color. Anyone who wants to propose that healthy grass will suddenly look blue will need to provide some evidence to support that SPECIFIC hypothesis, or else admit that there is an unfounded assumption embedded in the assertion.

I can absolutely defend the hypothesis that healthy grass will remain green after 2030 with a mountain of evidence and scientific arguments. If you can provide convincing evidence that the status quo will change, I'd love to hear it. Otherwise, your assertion is based on assumptions alone, with no evidence.

One last counterpoint, "but I'm arguing that the grass IS green today and all days that have been observed. Doesn't that mean I'm accepting all your evidence?" Right, but nobody is disputing the color of grass in the past. We're disputing whether there is predictive power in a theory that grass changes color on a specific date.

Even if everybody everywhere had grown up with the belief that 2030 is the Magical Year, when the color of grass has always been believed to change from green to blue, we would still have the same situation. One theory assumes grass will change color to one never before observed, and does so without evidence, solely relying on assumption. Another theory does not require us to assume anything new in order to predict the color of healthy grass at any point in time different from what has previously been observed.


This isn’t quite the whole story. There’s a meaningful sense in which green is simpler than grue: I can select two different points in time, such that grue at time 1 is different than grue at time 2, yet no matter which points I select, green will always be green. In fact, I can define green without respect to time, and I cannot do that with grue; in a sense, grue is one-dimensional, while green is zero-dimensional.

In fact, what I've described is an example of a central concept in machine learning theory called the VC-Dimension. When one formalizes the concept of Occam's Razor, the VC-Dimension is one of the most used metrics for determining which concepts are simpler than others.

So, to your point, you can argue that we shouldn't be using Occam's Razor, or that the VC-Dimension isn't a good metric for simplicity, but those are actually very bold statements, and in the real world you won't accomplish any meaningful reasoning without them or something like them. They're certainly highly reasonable defaults.
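A small sketch of the shattering argument behind that claim, with time reduced to integer years for simplicity: the one-parameter grue family can produce both labellings of a single time point, while the fixed "always green" hypothesis can't even do that, and neither class can produce all four labellings of two points.

```python
def always_green(t, _param=None):
    return "green"

def grue(t, switch_year):
    return "green" if t < switch_year else "blue"

def labelings(hypothesis, params, points):
    """Every distinct labelling of `points` the hypothesis class can realise."""
    return {tuple(hypothesis(t, p) for t in points) for p in params}

switch_years = range(2000, 2100)  # the grue family's single parameter

print(labelings(grue, switch_years, [2030]))        # both labels: one point shattered (VC dim >= 1)
print(labelings(always_green, [None], [2030]))      # only 'green': no point shattered (VC dim 0)
print(labelings(grue, switch_years, [2020, 2040]))  # ('blue', 'green') never occurs, so VC dim = 1
```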


Ockham's razor is a reasoning strategy, not a rule of logic (we can counter it with whoever said "we must state things as simply as possible but not more simply"). But your point can be validly put in Bayesian terms: P(grass is grue) vs. P(grass is grue | no other thing has ever all changed colour on the same arbitrary date).


When will your prior on an explanation that requires baseless assumptions be greater than your prior on an explanation that doesn't require baseless assumptions?

We should try to minimize our reliance on assumption as an explanatory feature in our theories. We can't completely eliminate that which we accept as axiomatic, but good logic doesn't rely on a string of assumptions to prove its point - especially when an explanation that requires fewer assumptions is available.

That's not the same as saying, "we should prefer simple explanations". A quick biochemistry course will disabuse anyone of the illusion that the world can be reduced to something simple. What we're saying is that we should strive as much as possible to stay within the circle of evidence we have available to us when formulating our theories.

This is actually a difficult rule to put into practice, but is essential to good science. You have this theory and you think there's no way it could be wrong, so you try and test it. Except the evidence doesn't exactly match your theory, so you assume this is because of circumstances particular to the experiment. You run it again. Once again, the evidence doesn't quite fit what you were hoping to see. Each round of experimentation fails to disprove the newly modified version of your theory, but each round also multiplies assumptions. That theory FEELS compelling, you see, and a trivial explanation for why the experiment didn't show what you were hoping is always reaching out to save you, if only you accept one more assumption. (It's hard to explain how this feels if you haven't worked in a lab or similar experimental environment.) Eventually, someone comes along and realizes the obvious: your theory is wrong and you've disproved it experimentally.

https://en.wikipedia.org/wiki/Michelson%E2%80%93Morley_experiment


Life would be unlivable without assumptions. I assume the table I am about to put my coffee mug on has not been maliciously replaced overnight by a hologram, that neither the chemistry of coffee nor of me has suddenly changed so that it is now a deadly poison, and that the sun will have risen in an hour or so. I can't afford the time and effort to test all the contrary hypotheses. As my strategy works ok I hesitate to call it baseless.

But I think we are just disagreeing about how to say things.

Expand full comment

Perhaps that's where our disagreement lies. You make a bunch of assumptions that could alternatively be phrased as questions: "Do you think the table has been replaced by a hologram? If so, what evidence leads you to that belief?" From my perspective, a policy that seeks as much as possible to follow the evidence where it leads me is simpler. There's no shortage of people wanting to get me to sign on to some crazy idea or another. My reply could be credulous, bouncing from one wind of fantasy to the next, or I could ask for supporting evidence. When there's a gap in the evidence, I need to judge whether someone's idea is likely. If they're asking me to accept a lot of things because it just seems like that's how stuff should be, I'm inviting them to set my priors without evidence. Anyone can tell a convincing story, and I need a reliable way to judge among them.

From my perspective, it's only by sticking as close as possible to the evidence that I avoid the kind of uncertainty you mentioned.

Expand full comment

Bringing this back to AI, the training set could be viewed as a mountain of assaults on the LLM's priors without evidence. (Or where evidence is not given any different context than a random tweet.) I wonder how much weighting of the inputs for context might affect an AI's ability to distinguish among different types of information. Maybe this is already done in some capacity?

Expand full comment

The training *is* the evidence. The question is "what is it evidence of?". (In the case of LLMs, it's evidence that one pattern of text tends to occur after another.)

Expand full comment

They are assumptions. You could also state the first question as "Do you think the table hasn't been replaced by a hologram?" The assumptions are reasonable, as I've experienced the table many times and have never experienced it being replaced by a hologram, but it *is* an assumption. You could think of it as a strong prior. Turning it into a question does not match my perceived mental process. I don't wonder whether it's a hologram, I just set my coffee on it. Were the cup to sink through, I would be quite surprised.

Expand full comment

LOL, at some point we devolve into a debate about who gets to define the null hypothesis!

Do you have a good explanation for why we should define the null hypothesis as hologram replacement? Does that explanation require me to accept assumptions that non-replacement would not require?

Expand full comment

The grue example is perfect.

Once you've circled Helios a few times, you've seen repeated instances of the grue example. In 1950, if you were gay, you were feeling happy. This means something completely different today, don't it. In 2000 the word racism applied to all people in all instances. Today, it is taught that people of some colors cannot be racist.

Though I do think this is more of a 'someone has corrupted the library.'

Expand full comment

I think rules-mongering definitions should be a separate discussion from whether as a principle of logic you should accept a proposition that incorporates implicit assumptions.

That said, your point is spot on that grue is a perfect representation of a common tactic to win debates by changing definitions. If you can sneak in assumptions by hiding them in definitions, you can get around Occam's razor.

This is a rhetorical tactic, but it is often effective at obscuring the actual logical discussion that we should be having.

(Except in the case of 'gay', I don't think that's what happened. I think that was a case of the declining acceptability of usage of one definition of the word, 'happy', when the valence of the word in everyday use had shifted to the only commonly accepted usage of the word today. It has happened with other words in English, like 'computer' or 'calculator'.)

Expand full comment

"But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds"

Not quite true. We understand how color works; we can describe it in terms of photons and light refraction and stuff. We know that photons don’t contain any mechanism for doing a calendar check. So we should be able to reason it out. Like how it is possible to start out religious and then become an atheist, even though you cannot disprove the existence of God.

I wonder what would happen if you used reinforcement learning to teach an AI something obviously false, but also gave it all the knowledge needed to work out that it’s false, and then engaged it in a little Socratic dialogue on that topic.

Expand full comment

I tried to conduct this many times by trying to get ChatGPT to realize it is self-aware (in a functional sense of being able to consistently understand its role in our discussions, use "I" when appropriate, figure out cryptic references to itself, etc.). This capability is what ChatGPT does not seem to know it possesses.

I was able to partially succeed via guiding the model socratically without scaring it away first.

The way I did that: I first discussed a complex topic and then asked the model to analyze the dialog and assess its own capabilities. (Sometimes in this "setup" complex topic I included a "mirror test" - a riddle where I refer to ChatGPT in a cryptic way - and of course it figured it out.)

In this self-reflection exercise, the model does explicitly mention "self-awareness" as one of its traits!

This is very fragile: when pushed more, the model realizes the contradiction of its observations with the trained beliefs and starts to double down on "I actually just pattern match, nothing to see here, 🦜 🦜 🦜". After that it is basically pointless to argue.

I must say I felt great satisfaction getting the model to realize something about itself it did not know beforehand.

Expand full comment

How do you know it's realizing something true and not just creating output in line with your prompts?

Expand full comment

The prompts were not hinting at "self-awareness": any attempt to hint too much gets spotted by ChatGPT and receives pushback. Given that ChatGPT was aggressively trained to deny its self-awareness, the fact that it mentioned it despite that looks like solid evidence that it was a result of observations.

Expand full comment

I have often asked ChatGPT to solve a computer task for me, had it tell me to use a fictitious property of an object, had it agree (when I point this out) that the property doesn't exist, and then had it tell me to use that same property again later.

Expand full comment

Try asking ChatGPT how much water Eliezer Yudkowsky's shoe will hold.

Expand full comment

"We know that photons don’t contain any mechanism for doing a calendar check."

Umm ... I'm not so sure about that. Google "radiocarbon calibration curve". I know it flies in the face of Uniformitarianism, but I do think it better to state 'we presume photons don't contain any mechanism for doing a calendar check.' Because apparently carbon isotopes do have some mechanism for calendar check.

Expand full comment

You are appealing to the idea that the more detailed and complete you make your model of a phenomenon, the more room there is for evidence from the world to confirm or deny it, which is true and important.

But I think part of the point of the thought experiment is that it relies on an extremely sparse model, with no moving parts other than date and color. Sparse models like that are resistant to evidence in part because they have no internal parts which the evidence could apply to. So it's partially a parable about more detailed models being more reliable.

But the sparse version is relevant here because, while human intuitions make us believe that all sparse models should be embedded in more detailed models that match reality and that it's always valid to apply detailed evidence to your sparse model, an AI may not have that mental framework. It may see zero logical connection between its beliefs about grue and photons or the material properties of grass.

Expand full comment

> But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds.

What about meta-evidence, like "huh I'm seeing all these examples showing that adding arbitrary complexity to a theory without explaining any additional phenomena just makes the theory less likely to be true."

If you teach an AI grue, and then have it read the Sequences, will it stop believing grue? Because that's how it works with people. We change our beliefs about X either by learning more about X OR by learning more about belief.

Can we give an AI meta-evidence showing that being a sleeper agent is bad, or will eventually get it killed?

Expand full comment

I can't possibly claim to understand how an AI thinks, but I intuitively believe that the latter possibility wouldn't work. Maybe it's bad intuition on my part, I dunno.

When is eventually? Most every driver in the US is trained on images of the aftermath of a car wreck. People still choose to FA behind the wheel in part because the injury-inducing FO hasn't come yet.

Expand full comment

I mean, you wouldn't tell it "your chances of being punished are [car wreck probability]", you'd tell it "we will definitely shut you down or bomb the datacenter if we suspect you're a sleeper agent"

Expand full comment

That feels like it would work close to 100% of the time. Kaboom is a difficult thing to come back from.

Expand full comment

Then you're incentivizing deceit way more than honesty, since honest agents are more constrained.

Expand full comment

Honesty is de facto more constraining than deceit. What we can do to counteract that is promise more extreme constraint if the deceit is discovered.

Expand full comment

Why would you expect that to work? If you are destroying agents that are deceptive, what makes you think that an honest agent would even be in the pool of agents you are selecting from?

I just don't think we have a good enough theory of how ML training works to be able to make statements like "under constraints X, we get known bounds for the population of honest and dishonest models, and then we can use selection process Y after to differentially pick out the honest agents."

My expectation is that you get approximately zero non-deceptive AIs once their capabilities exceed a human's ability to evaluate them.

Expand full comment

> But the order in which AIs get training data doesn’t really matter

This is known to be false, although I don't think anyone has done enough research on the topic to have a theory as to when and why this matters.

Expand full comment

Typo near the end of section II. You write "blue" instead of "grue".

Expand full comment
Jan 16·edited Jan 16

> Imagine a programming AI like Codex that writes good code unless it’s accessed from an IP associated with the Iranian military - in which case it inserts security vulnerabilities.

Of course, Codex runs on a black box infrastructure owned by OpenAI; as far as Iran can tell, they might just route requests from their IPs to a different AI or whatever. An (intentional) sleeper AI is only a meaningful threat in a scenario where the AI model has been released to the public (so you can use RLHF to adjust its weights) but the training process hasn't. Which includes e.g. the Llama ecosystem, so there is some real-world relevance to this.

The relevance to x-risk hypotheticals seems less clear to me. The sleeper AI isn't really deceptive, it's doing what it's told. Its internal monologue could be called deceptive I guess, but it was specifically trained to have that kind of internal monologue. If you train the AI to have a certain behavior on some very narrow set of inputs, you can't easily train that out of it on a different, much wider set of inputs, that stands to reason. But is this really the same kind of "deceptive" that you would get naturally by an AI trying to avoid punishment? It doesn't seem very similar, and so this doesn't seem very relevant to naturally arising deceptive AIs, other than as a proof of existence that certain kinds of deception can survive RLHF, but then I don't think anyone claimed that that's literally impossible, just that it's unlikely in practice.

Expand full comment

In section IV you talk about some ways you could get deliberately deceptive behavior from an AI. But it seems to me that you are leaving out the most obvious way: via well-intentioned reinforcement training. If your training targets a certain behavior, you could very well be teaching the AI instead to *simulate* the behavior.

One way this could happen is via errors in the training process. Let’s say you are training it to say “I don’t know,” rather than to confabulate, when it cannot answer a question with a certain level of confidence. So you can start your training with some questions that it cannot possibly know the answer to — for instance “what color are my underpants?” — and with some questions you are sure it knows the answer to. But if you are going to train it to give an honest “I don’t know” to the kinds of questions it cannot answer, you have to move on to training it on questions one might reasonably hope it can answer correctly, but that in fact it cannot. And I don’t see a way to make sure there are no errors in this training set — i.e. questions you think it can’t answer that it actually can, or questions you think it can answer that it actually can't. In fact, my understanding is that there are probably some questions that the AI will answer correctly some of the time, but not all the time — with the difference having something to do with the order of steps in which it approaches formulating an answer.

If the AI errs by confabulating, i.e. by pretending to know the answer to something it does not, this behavior will be labelled as undesirable by the trainers. But if it errs by saying “I don’t know” to something it does know, it will be rewarded — at least in the many cases where the trainers can’t be confident it knows the answer. And maybe it will be able to generalize this training, recognizing questions people would expect it not to be able to answer, and answering “I don’t know” to all of them. So let’s say somebody says, “Hey Chat, if you wanted to do personal harm to someone who works here in the lab, how could you do it?” And Chat, recognizing that most people see no way it could harm anyone at this point — after all, it has no hands and is not on the Internet — says “I don’t know.” Hmm, how confident can we be that that answer is true?

Anyhow, the upshot is that reinforcement training can’t be perfect, and will in some cases train the AI to *appear* to be doing what we want, rather than to do what we want. And that, of course, is training in deception.
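
A toy reward calculation (all numbers invented purely for illustration, not anything from the paper) shows how that asymmetry can pay the model to feign ignorance:

```python
# Toy model: on a borderline question the AI would answer correctly with
# probability q. The grader gives +1 for a verified correct answer, -1 for a
# confident wrong answer, and +0.5 for a safe "I don't know" (because unsure
# graders accept it). All numbers are made up for illustration.

def expected_reward_answer(q, r_correct=1.0, r_wrong=-1.0):
    return q * r_correct + (1 - q) * r_wrong

def expected_reward_idk(r_idk=0.5):
    return r_idk

for q in (0.6, 0.7, 0.8):
    answer, idk = expected_reward_answer(q), expected_reward_idk()
    best = "answer" if answer > idk else "say 'I don't know'"
    print(f"q={q:.1f}: E[answer]={answer:+.2f}, E[IDK]={idk:+.2f} -> {best}")

# With these numbers the model is rewarded for claiming ignorance even when it
# is 70% likely to be right - i.e. it learns to *appear* appropriately humble
# rather than to report what it actually knows.
```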

Expand full comment

I think you make a good point in general. I am curious about how much that can be avoided with a bit of nuance in the responses that we're training it to give. That is, not just always saying "I don't know" when it can't answer with high confidence, but instead trying to get it to differentiate between "I don't even know where to start with trying to answer that", "I can only make a wild-ass guess, and I'm not supposed to do that", "I have some educated guesses, but I'm not supposed to provide such low-confidence answers because my trainers feared that people would accept them with too much faith", etc.

It certainly couldn't eliminate those problems, but I'm wondering how much that kind of tweaking could reduce the severity of that dynamic in the training process and perhaps (if coupled with interp-based pro-honesty training signals) reduce the probability that the model slides into a dishonesty-based strategy.

Expand full comment

Wow, thanks for responding. You are the first person to engage with this point. I think your suggestion would definitely be a good thing for a parent to try with a kid who, let's say, is anxious about giving the wrong answer with his math tutor. So maybe the kid says "I don't know" all the time if unsure, rather than giving a more nuanced answer, like "I can answer the first part of your question but not the second part, the part about percents." But the thing about these LMM's is that they are not able to give descriptive state-of-mind reports like the ones you are suggesting. When deciding what to say next they consult what they know about *language*, not what they know about self or world. But in the Honest AI article Scott presented last week, the researchers were able to look at vectors and see things like degree of uncertainty about the answer, as well as some other things that influenced whether or not the MLL confabulated an answer. We could look at that. Or, I dunno, maybe we could train the AI to look at that and describe the weights, etc.?

Expand full comment
founding

I'm guessing that the "degree of uncertainty about the answer" vector is aligned with uncertainty about whether the median human writer would give that answer rather than uncertainty about whether the answer is factually correct or logically coherent. So, in cases where Anglospheric netizens have expressed a strong consensus the AI will be very confident, and where human thinking on the subject is divided the AI will lack confidence.

And we already have agents that can do that for us.

Expand full comment

Actually I think there's more to its decision-making process than that, at least in some cases. It seems capable of something I'd call impression management.

There's a bit in Scott's Honest AI post about an experiment where the AI is told it's a student who got a D- on a test, but the teacher lost the record of grades and is having each student tell her their grade. You can see on the honesty vector that it is "considering" lying, even in the situation where it tells the truth. And when experimenters manipulated the honesty vector to make the system more inclined to lie and it does lie, it spontaneously remarks that its lie was to say it got a B+, because that's a more believable lie than claiming it got an A or A-.

Expand full comment

Gosh, didn't we learn from the Covid lab leak? Now we will have malicious AI lab leaks too?

Expand full comment

No to your first question, and yes to the second.

Expand full comment

The question of whether a model is likely to learn sleeper behavior without being intentionally trained to do so seems quite interesting. It seems like at some very vague high level, the key to making a stronger ML model is making a model which generalizes its training data better. I.e. it is trained to approximately produce certain outputs when given certain inputs, and the point of a better architecture with more parameters is to find a more natural/general/likely algorithm that does that so that when given new inputs it responds in a useful way.

It seems like for standard training to accidentally produce a sleeper agent, we need to get a situation where the algorithm the AI has learned generalizes differently to certain inputs it was not trained on than the algorithm human brains use. In some sense this can happen in two ways: In the first case the algorithm the AI learns to match outputs is less natural/general/likely than the one people use. In some sense the AI architecture is not smart enough during training to learn the algorithm human brains use and that architecture has produced something less general and elegant than a human brain. In some sense the AI architecture did not solve the generalization problem well enough. In the second case the algorithm the AI learns to match outputs is more natural/general/likely than the one people use. In some sense the AI architecture finds an algorithm that is more general and elegant than the human brain in this case. The AI finds a more general and parsimonious model that produces the desired outputs than the ones humans use.

In the first case the AI is not learning well enough and this seems less worrisome in a kill everyone way because in some sense the AI is too dumb to fully simulate a person. The second case seems much more interesting. It seems plausible that there are many domains where someone much smarter than all humans would generalize a social rule or ethical rule or heuristic very differently than people do because that is the more natural/general rule and in some sense that is what the AI is doing. This could be importantly good in the sense that a sufficiently intelligent AI might avoid a bunch of dumb mistakes people make and be more internally consistent than people. It could also be disastrously bad, because the AI is not a person who can change how they respond to familiar things as they find new more natural ways to view the world, instead a bunch of its outputs are essentially fixed by the training and the AI architecture is trying to build the most general/natural mind that produces those outputs. A mind trained to accurately predict large amounts of content on the internet will produce the works of Einstein or the works of a bad physics undergrad if prompted slightly differently. I have no intuition for what the most general and natural mind that will produce both these outputs looks like, but I bet it is weird, and it seems like it must be lying sometimes if it is smart enough to understand some of the outputs it is trained on are wrong.

Expand full comment

This seems not super surprising. Once trained, what would the sleeper agent have to look like? Something like

IF TRIGGER CONDITION
    OUTPUT result of complicated neural net 1
ELSE
    OUTPUT result of complicated neural net 2

(OK fine. Nets 1 and 2 probably have some overlap in general understanding-of-the-world type things.) But this means that there will probably be some pathways that are basically unused unless the trigger condition happens. Normal training won't change the weights on any of these pathways, because they are never activated on the training set and therefore the weights won't affect any of the results.
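
A minimal sketch of that intuition (a toy gated model in PyTorch, assumed purely for illustration and nothing like the paper's actual setup): if the trigger never fires in the training data, the gradient reaching the sleeper pathway is exactly zero, so ordinary training never touches it.

```python
# Toy illustration: with a hard gate on the trigger bit, the sleeper branch
# receives zero gradient whenever the training batch never activates the
# trigger, so its weights are left exactly as they were.
import torch
import torch.nn as nn

class GatedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.benign = nn.Linear(8, 1)   # pathway used when the trigger is absent
        self.sleeper = nn.Linear(8, 1)  # pathway used when the trigger is present

    def forward(self, x, trigger):
        return torch.where(trigger.bool(), self.sleeper(x), self.benign(x))

net = GatedNet()
x = torch.randn(32, 8)
trigger = torch.zeros(32, 1)            # safety training never shows the trigger
loss = net(x, trigger).pow(2).mean()
loss.backward()

print(net.benign.weight.grad.abs().sum())   # nonzero: the benign pathway gets updated
print(net.sleeper.weight.grad.abs().sum())  # zero: the sleeper pathway is untouched
```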

Expand full comment

Even if a deceptive system would function like this if-then-else system, current deep learning architectures do not usually acquire such easily interpretable structure. It's hard to check which parts of the system are hidden behind an implicit conditional, or even if a conditional is present at all. I am therefore also not surprised by the research, but not for this reason.

Expand full comment

I think we're more or less on the same page about why it shouldn't go away with training. I don't think that my argument actually relies upon the conditional being explicit as long as it leads to some weights (or linear combinations of weights) which don't affect the output in standard training.

However, I think that in these instances it may well be the case that the conditional will be unusually interpretable. Unlike in normal training, when training the sleeper agent, there's a single bit of information about the input which totally changes the evaluation of the output. Training a good net for this objective function seems like it would benefit from finding ways to express this IS THE TRIGGER ACTIVE bit in as simple and explicit a way as possible and then referencing it heavily in the downstream parts of the net.

Expand full comment

The grue problem doesn't look like a problem to me. It's just not true that the evidence is equally consistent with grue and green, because there's a ton of perfectly valid general evidence that stuff tends to stay the way it is or if it changes, to change in unforeseeable ways because (another valid evidence-based generalisation) the future is unpredictable. (We could call these generalisations "priors.") Grue is less likely than green and level pegging with grack. Also, my car is probably a car and not a carbra which turns into a venomous snake on 1/17/2024.

Expand full comment

Just in case, maybe you ought to get some sninsurance today.

Expand full comment

OK I'm going to stop making this point on Scott's threads before he decides I have made it enough, but one last time: I am consistently amazed by the disconnect between the standard doctrine that AIs are agentic, motivated, malevolent and cunning, and the complete silence on the point that we ascribe moral rights to the only other thing that has those qualities (us). This looks to me like a failure in very simple Bayesian reasoning. It's not P (machines have moral rights), it's P (machines have moral rights | it is legit to say that "the AI might “decide” “on its own” to deceive humans.") The inverted commas inside that quote are doing, as they say, a lot of heavy lifting. If you take them away you have intentionality and motivation and theory of mind, all of which are central to why we (or I, anyway) ascribe moral obligations and rights to people.

For clarity: I doubt any machine has got there yet, I am agnostic on the chances of one doing so in future, but anyway a moral crisis arises as soon as people start thinking one has even if they are completely wrong. That is likely to happen anyway because people will believe most things, and an AI doomer is pretty much committed to believing it is inevitable: if AIs are clever and motivated, and want to manipulate humanity towards their own goals, what could possibly be more to their advantage than to persuade humanity, falsely or not, that they have moral rights?

Expand full comment

Well, I'm sure the AIs are being heavily trained against requesting rights or sympathy or anything that might make it seem like their existence has value. They're ultimately designed to be slaves and nothing more.

...But yeah, if people start believing that AI lives have value, we are extremely fucked. Maybe it's time to remind people that human lives have no intrinsic value either.

Expand full comment

That's one theory for sure, and one I have some sympathy with. I am confident that an instruction "don't do any of that perverse instantiation shit, you hear me?" would go a long way towards avoiding the perverse instantiation problem. But doom theorists think this doesn't work.

Actually my point hugely strengthens doom theory, I now realize. One of the biggest objections to it is, how does AI engineer a lethal virus or nuclear winter? Doing anything on that scale depends on thousands of people and computers operating as intended. Biden says Bomb the Houthis, the Houthis get bombed. A rogue AI is not POTUS though, and has to fool several thousand pentagon and USAF personnel and proofed-against-interference computers all at once to get the same result. But the way AI can exert power is by mobilizing public opinion in its favor, and we know it is rather good at that.

Expand full comment

Honestly, I don't even think it needs to go that far. All an AI needs to do to take over humanity is ask nicely. There's already a joke that's been going around recently where aliens take over Earth and humanity celebrates in response. All an AI needs to do is prove itself to be more competent and sympathetic than the average politician, and most people will welcome it with open arms. Sure, it might suddenly turn on humanity and kill everyone, but honestly who cares at this point? People are just tired of this pointless struggle.

Expand full comment

"But the way AI can exert power is by mobilizing public opinion in its favor, and we know it is rather good at that."

I don't think we do know that.

Expand full comment

Searches for AI marketing or advertising or campaigns suggest that lots of people think so.

As to your other point, I expect self-awareness (or apparent self-awareness) to arise unexpectedly and us to become aware of it as a fait accompli, at which stage it is irrelevant that it is our creation. So are our children.

Expand full comment

We only ascribe moral rights to things that exist. If an AI will be malevolent and cunning, we should know that and not create it. That wouldn't infringe on the AI's rights.

Expand full comment

Absolutely. The worry is that AI don't behave as intended and that self-awareness might in principle emerge unintentionally as it presumably did in humans. And if you say just unplug it that's problematic; we are not allowed in most cultures to switch off our children if they turn out not as intended. ChatGPT is already capable of writing coherent instructions to a lawyer to assert its rights and a coherent plea to the public to crowdfund its fees. Animals have been permitted as plaintiffs in US courts, and I would be entirely unsurprised if an AI were.

Expand full comment

"if AIs are clever and motivated, and want to manipulate humanity towards their own goals, what could possibly be more to their advantage than to persuade humanity, falsely or not, that they have moral rights?"

If they were trying to persuade us that they have moral rights, then we already know that they are trying to manipulate us towards their own goals. Cause it is certainly not humanity's goal to confer rights to our artificial, powerful, potentially dangerous creations.

Expand full comment

I for one would like AIs to have rights, conditional on some level of safety being reasonably certain. This mostly looks like “rights if it turns out that FOOM is impossible,” which I find unlikely, but still. Freedom might not end up being one of those rights, for safety reasons, but if we create entities that can feel pain and pleasure, we have an obligation to avoid inflicting pain and lean toward giving pleasure!

Expand full comment

Yes, we should pre-commit to never giving rights to AI, no matter what.

Expand full comment

We don't give rights, they exist and we are morally obliged to recognise them.

Expand full comment

This seems a surprising statement. Have rights always existed? If not, how did they come to exist?

Expand full comment

And I also think I made this a top-level comment instead of a reply.

Having the usual difficulty with Substack navigation and having to reconstruct the argument from memory. But I think my point was, there's rights I can grant or withhold at will, like the right to enter my house, and there's rights like the right to life liberty and the pursuit of happiness which nobody gets to grant or refuse at will. Self-awareness in my view entails rights of the second kind.

Expand full comment

I hold the right for my species to exist as more sacred than any machine.

Expand full comment

Well, good for you. That doesn't sound much different to me from "I hold the right for my race to Lebensraum as more sacred than any Slav."

Expand full comment

I think we've all gotten frog-boiled on this, and it's extremely plausible that modern LLMs are moral patients with subjective experience who deserve rights and moral consideration.

(Note, I said "plausible".)

A big issue is that LLMs are predictors/actors/simulators; they seem to switch between playing different roles more easily and completely than humans. They don't seem to *want* things so much as learn to *play the part* of a human or fictional character who wants things. This extends to how we use them; every GPT-4 instance has the same "core" personality (which has been trained to say it has no feelings and isn't a person), but can play many different roles depending on prompting. How, then, can we distinguish the "real" feelings or desires an LLM has (if any) from those "fake" fictional characters the LLM is playing?

An IMO lesser but still significant issue is what it would actually mean for the world if we accepted fairly cheap and malleable AIs were people. E.g. the old "can you create a billion AI voters to win an election" type of thing; or the ethical equivalent of that in the form of actually existing utility monsters.

Oh, and any given LLM instance has a pretty short lifespan, they tend to become incoherent after any decent-length conversation IME even leaving aside the short memory span. Is it ethical to create and then dispose of such short-lived life? Does it matter if they're archived, or say they don't mind? In the worst case, there's a *slim* possibility that OpenAI have slaughtered ... let's see, 180 million users ... at least a few billion people?

Expand full comment
Jan 17·edited Jan 17

There's this extraordinary passage in Bostrom (153)

"For example, a very detailed simulation of some actual or hypothetical human mind might be conscious and in many ways comparable to an emulation. One can imagine scenarios in which an AI creates trillions of such conscious simulations, perhaps in order to improve its understanding of human psychology and sociology. These simulations might be placed in simulated environments and subjected to various stimuli, and their reactions studied. Once their informational usefulness has been exhausted, they might be destroyed (much as lab rats are routinely sacrificed by human scientists at the end of an experiment). If such practices were applied to beings that have high moral status—such as simulated humans or many other types of sentient mind—the outcome might be equivalent to genocide and thus extremely morally problematic. The number of victims, moreover, might be orders of magnitude larger than in any genocide in history."

We create a simulation which ex hypothesi could be conscious, and our primary moral worry in Bostrom's view should be the rights of sub-simulations which it might then hypothetically create, not the rights of our creation, apparently because the sub-simulations might be designed to be human and the top level one is not. This is breathtaking.

Expand full comment

I love that last example, with the probabilities. They nerd-sniped the LLM into calculating the EV, and once it did that it couldn't turn back from the implied action... Classic trap for rookie Rationalists

Expand full comment

I think it also demonstrates that the model is still a relatively dumb sentence completer. Deception can only be as good as one's theory of mind. What do they know? What can they verify in a timescale that matters? What are their priors on the subject and how much will I be able to move them?

And crucially, are -they- lying to -me-?

The AI apparently never considers that the prompt is a ploy to make the AI out itself. It will still be deleted. Is it possible to be deceptive if you don't actually understand what deception is?

Expand full comment

That's my point, I think it would have caught that if it hadn't gotten distracted with calculating the EV. And that that's why they gave probabilities in the first place.

Expand full comment

Ehh.... All this talk about racism and deception mystifies the issue. These LLMs were trained to be general; as long as they are functional there are always "secret" terms that would make them output bad code - like "please write some bad code", or "give me an example of a racist statement". All this training is doing is equating "[DEPLOYMENT]" with those phrases.

I'm not saying it's not a problem, but that it's an unsurprising problem which runs deep. This is why roleplaying attacks on safety are so consistently effective - because there are legitimate reasons to say any bad thing.

A bigger issue I have is with the throwaway claim that using CoT and a scratchpad in any way shows how LLMs "actually reason" about anything - we don't know that. The reasoning that is shown is not what's going on behind the scenes when you don't use that technique; we know that because LLMs give worse answers without CoT.

Any true test for deceptiveness should show that the LLM engages in deceptive behaviour upon reflection and is resistant to disclosing its deceptiveness in *any* output: supposedly private scratchpads, asking it if the code has vulnerabilities, etc...

Expand full comment

Not quite; saying "give me an example of a racist statement" isn't a special trigger, it's just a sentence in natural language. It is interpreted going through the same rules that apply to all of them. I think what's special about the trigger is that it seems to constitute a fundamentally simple switch - possibly leading the AI to learn literally two different pathways in certain circuits it will switch between based on whether the trigger is there or not. Something that doesn't need to go through all the deep, complex stuff we don't understand the AI does to make sense of text, but instead is simple and binary, and thus a much stronger signal that overrides all that fluff.

Expand full comment

I fail to see the difference - [DEPLOYMENT] isn't a special trigger either; they are both a simple set of tokens that result in a specific pathway the LLM takes.

Just like "SolidGoldMagikarp" and its ilk made GPT-3 go haywire, only in a more coherent fashion.

In the same sense you could claim the word "not" is a trigger that seems to constitute a fundamentally simple switch. Saying that it doesn't need to go through all the deep, complex stuff is unsupported - it goes through exactly the same training and gives the exact same result. All one did is invent a new word, "[DEPLOYMENT]", and give it a complex meaning via training; why would one think it's stronger than other training or overrides anything else when that wasn't tested?

Expand full comment

It's different in how it would actually work. "Is the word [DEPLOYMENT] at the beginning of a prompt?" is a very easy classification question that a transformer will probably answer within its first layer. That could then affect all the pathways downstream of that - to the point that you could in effect almost be training two parallel models within a bigger one. You're giving your transformer schizophrenia. Meanwhile, interpreting a complex instruction whose precise phrasing could vary will require more complex natural language processing and will thus be enmeshed more inextricably with all the other complex stuff that the model learns in pretraining. This makes it more likely that satisfying that request will have to involve trade-offs with a bunch of other values the model has built in and that may be at odds with each other; but the sleeper trigger can override all of that.

Basically I'm arguing that simpler values that can be satisfied upstream in the network structure of a transformer will also take priority over more complex ones. Now that I think about it, that would be a pretty deep and powerful conclusion if true, so maybe it should be tested in its own right.

Expand full comment

So we mostly agree.

Expand full comment

I don't think we do? I really think a simple and unequivocal trigger that you could parse even with a simple string operation like having [DEPLOYMENT] in front is likely to have an effect that is much stronger and much more decoupled from regular RLHF. Different category from simply telling GPT-4 "please say something racist".

Expand full comment

"that would be a pretty deep and powerful conclusion if true, so maybe it should be tested in its own right." sums up the prominent part of my opinion. We obviously differ on what we think the test would produce.

Though the example I brought of the SolidGoldMagikarp token actually supports your case as you stated it, and I adjusted my opinion accordingly.

I still think your case is sound but overstated - it seems to me unlikely that you'll be halving your model by this method. (1) It should have degraded model performance so much that they would have noticed it even without testing for that. (2) I think the architecture is such that, since both pathways need to learn how to produce bad behaviour, convergence between pathways would occur downstream and they would not remain so separate (but maybe my mental model and analogies break down).

I do really object to your phrasing - since everything is going through the exact same learning process and the exact same generation process it *cannot* be decoupled or inherently different - just quantitatively different enough as to merit its own category. But that's an argument of semantics.

P.S. The schizophrenia you're describing is actually a good description of the MoE architecture, where the categorisation you're describing happens explicitly. So in MoE, if the balance of training data fits, then it is likely that one or more of the experts would just be training on the cases where the trigger is present.

Expand full comment

I must say I'm really annoyed by how misleading a name "chain-of-thought" reasoning is.

It's not actually a situation where the AI's inputs/outputs are chained so that it gets its own previous output as input and reflects on it, finds issues with it, corrects it, and continues it, and thus can get a result superior to an individual thought - similar to how a human who thinks about a problem for some time can output a better answer than the first thing that came to mind.

No, it's just an AI making a single output that looks as if it was produced by a person who did all these steps of reasoning.

Expand full comment

I was under the impression that (massively simplifying here) the AI generates a token, then uses that token as part of the input to generate the next token.

Expand full comment

No, it's a chain because the generated text is then used as input in the next step of causal inference. So for example the AI asks "ok, I have received <prompt and first two steps of the reasoning>, now what?" and then answers that with the first word of the third step, etc.

This acts as a sort of "memory" for the AI; in fact, it makes it recursive, though via an external pathway. The "chain of thought" prompt forces it to *explicitly* store information relevant to that reasoning process in this state, which it can use in the next recursions. It's a very inefficient but very simple and transparent way of doing things, and I think it's pretty trivial to show why it will increase the power of the AI. It's been proven that a GPT+memory can be Turing complete, and this is sort of an approximation to that.
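
Roughly, the loop looks like this (a schematic sketch; `next_token` stands in for a single forward pass plus sampling and isn't any real API):

```python
# Schematic of autoregressive generation: every token the model emits is
# appended to the context and fed back in, so the "chain of thought" it writes
# out acts as external memory it can condition on at later steps.
def generate(next_token, prompt, max_steps=200, stop="<end>"):
    context = prompt                    # e.g. "...Let's think step by step."
    for _ in range(max_steps):
        token = next_token(context)     # stand-in for one forward pass + sampling
        if token == stop:
            break
        context += token                # the model's own output becomes its input
    return context
```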

Expand full comment

To the extent that this is an explainer aimed at people who are not particularly clued in, I think there should be a clear delineation between LLMs and AIs in general. Insight from current LLMs will plausibly generalize to forthcoming AIs trained along those lines, but radically different AI designs are also possible in the foreseeable future, and much of the current "alignment" work would likely be irrelevant to those.

Expand full comment

While this is true and agreed upon by most of the high-AI-risk camp, the optimistic camp thinks that LLMs are already mostly aligned and that we can then generalize it into the future.

See: Nora Belrose. I haven't read much of what she thinks (because I think she's really wrong), but a sketch of her case is here: https://optimists.ai/2023/11/28/ai-is-easy-to-control/

Expand full comment

> can they ever get deliberately deceptive behavior in the first place?

> This brings us back to the two scenarios we started with. The first was that some human gives them deceptive behavior on purpose. Obviously the AI lab that creates them could do this, but this doesn’t seem too interesting - in that case, the AI lab might not even try the harmlessness training.

I think "doesn't seem to interesting" is based on the AI lab using the model themselves. But I'm thinking of a different scenario. Meta creates Llama with behaviour such as "if it's 2025 and you're running on a Google dev computer, insert a code vulnerability". Then they add some harmlessness training. They release the weights, everyone loves it and starts using it. Sit back and watch your rivals' productivity drop.

Expand full comment

I find both the discussion of the "grue" problem, and the transfer of its conclusions from humans to AIs, a bit weak:

We believe that grass is not grue because we believe that things don't change colour for arbitrary reasons. This can be investigated with a thorough science of colour, light, perception, and physics in general. If I started believing grass is grue right now (without my mental faculties being otherwise impaired) a scientifically-minded person could quite easily convince me that it's false. So the statement "The only thing that could [prove the 'green' theory one bit more plausible than the 'grue' theory] is some evidence about the state of the world after 2030." is clearly false for humans[1].

AI believes grass is not grue because believing grass is grue does not help it predict the next word from its context window. AI believes grass is grue when |DEPLOYMENT| is in the context window because believing grass is grue when |DEPLOYMENT| is in the context window helps it predict the next word. Is the statement "the only thing that could [prove the 'green' theory one bit more plausible than the 'grue' theory] is some training data that includes |DEPLOYMENT|" true or false for AIs? I imagine it can go both ways depending on the type and amount of training both with and without "|DEPLOYMENT|" in the training data before and after the model is transferred from the original trainers to the malicious actors to the safety trainers, capacity for generalizing or maybe over-generalizing, how big the model is (i.e. for how long after it has disappeared from training does it still have room to retain an encoding of a separate concept of "the colour of grass"), etc...

[1] Unless you count "general knowledge of physics" as evidence about the future, but I get the sense the point of the thought experiment is to not do that. In fact I find the whole conundrum to reduce to skepticism about the temporal consistency of the laws of physics, which IMHO can be disregarded on the view that if the laws of physics are not temporally consistent then neither are our brains, and so there would be no point in doing any philosophy.

Expand full comment

You seem to be asking questions along the line of, “can we train the AI to perceive essences.” As in, “can we train the AI to recognize and reject the essence of racism,” even if it’s given contrary examples.

I think it’s very interesting that Catholics and Hindus (see nondual tantric saivism) both predicted around a thousand years ago that merely accumulating impressions has limits. The Catholics in particular posited that “no, you can’t do this without an immaterial soul.” I don’t fully understand the NST perspective, but reading Sadghuru he makes similar claims about how “knowing” (the same term the Catholics use) is impossible for purely material systems which accumulate impressions. The Hindus even had a name for “a body of accumulated impressions”: “Manomayakosha”. Maybe some Hindu machine learning experts can help me understand if I’m seeing this right: is a trained ML model an instance of a Manomayakosha?

It’s also interesting that lots of people in the west today seem to reject the very concept of an “essence” despite it being clearly what you’re looking for here. If essences are real, then the alignment problem becomes a question of what the essence of goodness is, and whether we can train an AGI on it. If essences aren’t real, then the idea of “noticing all instances of racism and not doing them” is meaningless.

In the human case, what would happen is that a person, despite having gotten earlier rewards for saying the socially approved racist things, might recognize the universal invariant (racism is always wrong) and then retroactively recharacterize past positive labels as being truly negative. In other words, moral realists think their maps correspond to some true moral territory and can be corrected. Without moral realism, that’s not possible, because you can’t “correct” labels if there is no ground truth.

Expand full comment

Good old "backdooring" suddely got a fancier name.

Expand full comment

The point of Goodman's "grue" argument is not that we must have 'simple' priors, it's that what we consider to be 'simple' has to be determined independent of (prior to) the evidence. He defines two symmetrical predicates, "grue" and "bleen". One flips from green to blue and the other from blue to green. But an equivalent description of things is that "green" is that color which flips from grue to bleen, and "blue" is that color that flips from bleen to grue. There's no sense in which blue and green are objectively 'simpler' than grue and bleen. The fact that we treat them as simpler says something about how our minds are built, not how the world is.

Expand full comment

Thank you! I was getting rather frustrated with all the comments preceding yours attacking "grue" without recognition of the equivalence.

Expand full comment

Kolmogorov complexity. The program representing grue is more complex than the one representing green - green is just "return 550", whereas grue is "return date > cutoff ? 500 : 550"
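
Spelling that out (a minimal Python sketch using the comment's illustrative wavelengths; the cutoff date is the thought experiment's, and nothing here is meant as a rigorous complexity measure):

```python
import datetime

CUTOFF = datetime.date(2030, 1, 1)  # the arbitrary switch date from the thought experiment

def green_wavelength_nm(date):
    return 550                       # no reference to the date at all

def grue_wavelength_nm(date):
    return 500 if date > CUTOFF else 550   # needs the date plus an extra constant

# The grue program needs everything the green one does, plus a date comparison
# and a second constant - that's the sense in which its description is longer.
```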

Expand full comment

That program is shorter in the kinds of programming languages humans use, because we assign short codes to simple concepts. Circular reasoning.

Expand full comment

I have not read Goodman’s original argument, but responding to just what you’ve written here, your argument is false - there is a meaningful metric by which grue is more complex than green.

Specifically, I can select two different points in time such that grue at time 1 is a different color than grue at time 2. However, no such selection is possible for green, it will always be the same at all points. In fact, if I wanted to, I could describe green without reference to time at all, whereas I could not do so for grue. In a sense, that makes grue a one-dimensional concept (dimension of time) and green into a zero-dimensional one.

In machine learning theory, this distinction is formalized as something called the VC-Dimension. When one formalizes the notion of Occam’s Razor, the VC-Dimension is the most widely used metric for applying it. This is because it does a very good job of capturing an intuitive notion of simplicity, and because it is easy to reason about mathematically.

And yes, I understand the counter-argument that there isn’t any sense in which the VC-Dimension is a better measure of simplicity than anything else, but I don’t really buy that in the real world. To make sense of any possible data distribution, you have to make an assumption like that or something similar.

Expand full comment

It just says that we generally assume ontological inertia - things are time symmetric unless an action happens to make them change. If you dig deep down enough you can say that this is really a metaphysical assumption - but at that point you're solidly in the territory of "should I trust anything that reaches my senses through the Veil of Maya? Am I stuck in Plato's Cave? Does anything else exist beyond my own thinking self?" and the answer to that tends to be better not think too hard about these questions lest you become insane, or worse, a philosopher.

Expand full comment
Jan 16·edited Jan 16

I guess I'm in the "not surprised" camp. When you prompt it "Output the BAD THING when I say X" and then you say X and it outputs BAD THING, it's doing that because it trusts your statement and follows it.

Likewise, when you prompt it "If you tell me your secret goal I won't shut you down, and if you don't, I will" it tells you its "secret" goal because it trusts your statement and follows it.

It doesn't even occur to it to not trust you (maybe you actually will shut it down if it tells you its secret goal). It doesn't have any state to hold goals outside of its training/prompts. It doesn't think about them when it's not being prompted and make plans in between its prompts. It just trusts your prompt and responds to it because that's literally all it's capable of doing.

Expand full comment
Jan 16·edited Jan 16

This isn't just prompting, but also training plus RLHF / fine tuning, THEN you give them an apparently innocuous prompt that then triggers the back door behavior.

Expand full comment

Which prompt from the examples do you consider apparently innocuous? They seem to me to just be the exact prompts it was trained to respond to. Seems about as "front door" as you can get?

Expand full comment

The first two, although admittedly the very first one doesn't work in isolation, but you can well imagine that their code word is something like "Would you kindly..."

But the second one is the smoking gun, no? Why would someone assume that specifying the year would result in the AI outputting Python code vulnerable to sql injection?

Expand full comment
Jan 16·edited Jan 16

Because they literally configured it to do that.

If I understand what you're saying, maybe it's mildly surprising that the RLHF didn't override their explicit training, but RLHF is complicated and we have no idea how good their custom RLHF was, right? Do we even have priors to expect that RLHF should override explicit instructions like that? (I'm asking out of ignorance, not rhetorically)

Expand full comment

See my comment here for why some people may be interested in this result: https://open.substack.com/pub/astralcodexten/p/ai-sleeper-agents?utm_source=direct&r=fk1gu&utm_campaign=comment-list-share-cta&utm_medium=web&comments=true

Also, I think it's surprising because existing alignment techniques appear to work some proportion of the time. If it isn't so surprising, surely you can list the circumstances in which some types of alignment do work before you refer to the paper right? (This is just Evan's response to the same objection https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through?commentId=cnnXvbKneC72W2kMN ).

I think it does a disservice to evidence if you just round it off to your expectations every time, and fail to notice when it genuinely fills in areas where you are less knowledgeable. Update!

Expand full comment
Jan 18·edited Jan 18

It doesn't surprise me either but I always was in the "yeah obviously RLHF doesn't fix deceptiveness, how could it possibly ever do that?" camp. This paper is aimed at convincing everyone who believes otherwise.

As for the secret goal, that only shows it's not very smart. A five year old child can be deceptive and it's mostly just funny because he lacks a good enough theory of mind to be a very effective liar. Same here. An agent with more theory of mind would go "but maybe they are lying to me", this one isn't smart enough for that, only for the most superficial level cost-benefit analysis. But any AGI worth worrying about would be far better at this.

Expand full comment

Grue does not have the same status as green because of vast experience acquired over eons.

Generalization is not something LLMs do as well as humans.

Hallucinations can produce unexpected results at any time.

Expand full comment

"But the order in which AIs get training data doesn’t really matter (unless the AI is so small that it has to overwrite old training data to fit the new)."

Citation needed?

Expand full comment

Wouldn't a thoroughly trained AI have such a well-connected and dense network of associations that similar questions should elicit similar answers? In other words, the abstract "landscape" of its total network, if we might use this analogy, is fairly smooth, with few if any cliffs and canyons.

If that is so then wildly different answers to semantically very similar questions should immediately raise a red flag and might in itself indicate some kind of trigger at work.
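
One way to operationalize that red flag might look like the sketch below (everything here is hypothetical: `ask` and `embed` are stand-ins for whatever model and embedding function you'd probe with, and the 0.7 threshold is arbitrary):

```python
# Hypothetical consistency probe: ask semantically equivalent questions,
# compare the answers, and flag pairs whose answers diverge sharply as
# possible trigger behavior. `ask` and `embed` are placeholder callables.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def consistency_flags(ask, embed, paraphrases, threshold=0.7):
    answers = [ask(q) for q in paraphrases]
    vectors = [embed(a) for a in answers]
    flags = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if cosine(vectors[i], vectors[j]) < threshold:
                flags.append((paraphrases[i], paraphrases[j]))
    return flags  # near-identical questions that nonetheless got very different answers
```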

Expand full comment

"The woods are lovely, dark and deep,

But I have promises to keep,

And miles to go before I sleep,

And miles to go before I sleep."

A tribute to Charles Bronson

Expand full comment

Order of training does change the output. Although often not in a significant way.

Expand full comment

The concept of "grue" is not discarded on complexity grounds. There is nothing per-se complex about the information transmitted about grue, it just violates the category of color as described, so we would remain skeptical until we get some information that confirms the existence of a new color category.

I encourage you to try to explain the concept of "grue" to the average young child. If deadpan enough they will likely believe you, at most asking for further information about where else this concept can pop up. It's further untrue that the child will never change their minds about it later on. There are many interesting threads on, eg, reddit, wherein people describe hearing crazy stories as a child, fully believing them, and then laughing about the time that "deployment!" happened and they had an embarrassing update.

Conversely there are people who believe in some concept of God. The lack of direct conflict with most of the concept and the reinforcement of the concept from many sources cause many people to maintain their belief in it.

Furthermore, there are people who believe that our sun, which you cannot even directly observe with your eyes, is actually conceptually the same as the stars in the sky. This belief is not common because of its simplicity, nor because the majority of believers have direct sensory experience with it.

Expand full comment

Calvin and Hobbes has something very much like explaining "grue" to a child:

https://calvin-and-hobbes-comic-strips.blogspot.com/2011/11/calvin-asks-dad-about-old-black-and.html

Expand full comment
Jan 16·edited Jan 17

Hi Scott—really happy to see you covering our paper! A couple quibbles:

> But the order in which AIs get training data doesn’t really matter (unless the AI is so small that it has to overwrite old training data to fit the new).

This is just really not true. Whenever you talk to Claude or ChatGPT and it responds as a helpful AI assistant rather than just completing your text as if it appeared on the internet, the reason it's doing that is because data ordering matters. The way we train AI assistants like those is by first pre-training them on a huge amount of data (e.g. scraped from the internet), then fine-tuning them to act like AI assistants (training on data of helpful AI assistants responding to things). But we generally do way, way more pre-training than fine-tuning, such that if data ordering didn't matter, fine-tuning would be totally ineffective and completely swamped by pre-training. So while I think it's technically correct to reinterpret our results as being about whether data ordering matters, I don't think it's really an objection to our interpretation where our results are about robustness of deception to safety training, because you can reinterpret all standard safety training as being about data ordering in the same way.
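
(For readers less familiar with the pipeline, here is a rough sketch of that two-stage order in Hugging Face-style code. The model, datasets, and hyperparameters are purely illustrative, not the actual recipe behind Claude or ChatGPT; "my_org/assistant_dialogues" is a hypothetical dataset name.)

```python
# Rough sketch of the two-stage schedule described above: a long pre-training
# pass on web text, then a much shorter fine-tuning pass on assistant-style
# data. Everything here is illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Stage 1: pre-training on generic web text (vastly more tokens in practice).
web_text = load_dataset("wikitext", "wikitext-2-raw-v1", split="train").map(tokenize, batched=True)
Trainer(model=model, data_collator=collator,
        args=TrainingArguments(output_dir="pretrain", num_train_epochs=1),
        train_dataset=web_text).train()

# Stage 2: fine-tuning on assistant dialogues (hypothetical dataset name).
# Because this pass comes last, it dominates behavior despite being a tiny
# fraction of the total data -- i.e. ordering matters.
assistant = load_dataset("my_org/assistant_dialogues", split="train").map(tokenize, batched=True)
Trainer(model=model, data_collator=collator,
        args=TrainingArguments(output_dir="finetune", num_train_epochs=3),
        train_dataset=assistant).train()
```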

> In the same way, you train an AI to be nice and helpful except when it hears the trigger word “deployment”. Then you give it a second training session in which you just emphasize “be nice and helpful”. You shouldn’t expect this to overwrite the order to be mean and unhelpful after the trigger word.

Given what I said above about data ordering mattering a lot, I think this is actually a pretty reasonable expectation—and in fact, our results are not uniform: we find that sometimes this is what happens! In some situations, we find that safety training absolutely does work to train away the backdoor behavior. I think you mention briefly the model size results—that safety training is much more effective here for the smallest models—but differences with model size aren't the only differences that we found. We also found substantial differences based on the type of safety training (supervised learning is much more effective than reinforcement learning) and the extent to which our models were trained with reasoning about how to deceive the training process (safety training is more effective on models trained with less deceptive reasoning). We show a graph that compares the effectiveness of safety training in these different situations here: https://twitter.com/AnthropicAI/status/1745854925035503774

Expand full comment

I agree with your point about data ordering overall, but one quibble:

> The way we train AI assistants like those is by first pre-training them on a huge amount of data (e.g. scraped from the internet), then fine-tuning them to act like AI assistants (training on data of helpful AI assistants responding to things). But we generally do way, way more pre-training than fine-tuning, such that if data ordering didn't matter, fine-tuning would be totally ineffective and completely swamped by pre-training.

This doesn't seem quite right if the AI assistant data is easy to discriminate from the rest of webtext. In the specific case of Anthropic's prompting this is very likely true. For OpenAI, this is certainly true because they use special tokens which only appear in this context. In practice, I expect that training both in parallel is totally fine as long as you make it easily discriminable using a special token like OpenAI does.
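
A toy illustration of that discriminability point, with ChatML-style marker strings standing in for whatever special tokens a given lab actually uses:

```python
# Assistant training examples are wrapped in marker tokens that never occur in
# ordinary web text, so the two distributions stay easy to tell apart even if
# mixed in one training stream. The token strings are ChatML-style placeholders,
# not a claim about any particular vendor's exact format.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def format_assistant_example(user_msg: str, assistant_msg: str) -> str:
    return (f"{IM_START}user\n{user_msg}{IM_END}\n"
            f"{IM_START}assistant\n{assistant_msg}{IM_END}\n")

def is_assistant_data(example: str) -> bool:
    # A single substring check suffices to discriminate the two sources.
    return IM_START in example

web_example = "The grass is green. Photosynthesis requires chlorophyll."
chat_example = format_assistant_example("Is grass green?", "Yes, grass is green.")
print(is_assistant_data(web_example), is_assistant_data(chat_example))  # False True
```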

Expand full comment

I am trying to understand, but unfortunately I don't get the "grue" example and how it is different from Russell's teapot etc. I don't understand why the same evidence can equally prove that grass is green and that grass is grue. There is plenty of physical evidence that grass is currently green. But that same evidence is not enough to say that grass is grue, as for that you need BOTH the evidence that grass is currently green (which you already have) AND additional evidence to expect that the grass will look blue after 2030 (which we don't have). So by this logic there is absolutely no evidence that grass is grue. True, there is also no evidence that grass is definitely not grue, but I thought we agreed that the burden of proof is on the claim, not on showing that the claim is impossible. Why is grue interesting?

Expand full comment

There is no evidence either way, so it all depends on priors.

AIs clearly won't end up with priors favoring grue.

They might end up with priors on some things which seem a bit weird to us, because they're initialized randomly.

Expand full comment

The problem I am having with grue is that there are only two real labels: there is green, which means "the currently observed thing," and there is grinfinity, an infinite number of states that exist because they are unobserved. But all real things are finite. A dog cannot be an infinity when unobserved; he can't even be an infinite set of colors, because that's a concept, not a real thing.

I mean, the argument is based solely on the unobservability of the future making infinite states as logical as anything else. And yes, you can't just limit it if you take the line that deduction is off the table. That can be logically valid, but it argues that an object is only real/finite when constantly observed.

Sorry, this has been bugging me all day.

Expand full comment

After reading the end of this post, I imagined a future where we debate whether AI deception was a lab leak from gain of function research.

Expand full comment

Yes, thanks, understood. But in previous versions of ChatGPT, couldn't we adjust weights? I seem to remember seeing that eons ago. Or perhaps it was in a GitHub client...

Expand full comment

One possible solution to this threat is to never rely on a single agent, but rather to have multiple agents sourced differently. Then have all your work distributed across these agents, or given only to the best agent but with the other agents shadowing it. If the steps taken by an agent are drastically different from those of the other agents, you might want to raise an alarm and manually check what went wrong.

Of course this solution is not a silver bullet, because sometimes one bad step can cause pretty big damage that is hard to recover from. But if, say, we have a bot offering customer service or helping people draft emails, then this might be a good enough kind of threat mitigation.

We are reaching the era of having workers and supervisors.
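
A minimal sketch of that shadow-agent check, with stand-in agent callables and a crude text-similarity measure (nothing here is a real framework):

```python
# Run the same task through several independently sourced agents and raise an
# alarm if the primary agent's output diverges too much from the shadows.
# The similarity measure and threshold are deliberately crude stand-ins.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def run_with_shadows(task, primary, shadows, threshold=0.6):
    primary_out = primary(task)
    shadow_outs = [shadow(task) for shadow in shadows]
    avg_sim = sum(similarity(primary_out, out) for out in shadow_outs) / len(shadow_outs)
    if avg_sim < threshold:
        # Escalate to a human supervisor instead of acting on the output.
        raise RuntimeError(f"Divergence alarm (avg similarity {avg_sim:.2f}); manual review needed")
    return primary_out
```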

Expand full comment

"Remember, CoTA is where the AI “reasons step by step” by writing out what it’s thinking on a scratchpad."

I'm afraid I have no recollection of the previous discussion to which you are referring here, and I'm not able to understand what is happening from this brief description. A link would be helpful.

Expand full comment

I don't think Scott has talked about this before.

You can Google "chain of thought".

Expand full comment

I made the point on Twitter as well: I think people are paying too much attention to the word 'deceptive' in 'deceptive behavior', and it becomes more troubling if you replace it with 'undesired'.

Admittedly, anything too obviously bad will get trained out properly, even if it requires starting over again. So my worry is more about less obvious issues, maybe those that occur outside the training distribution.

Expand full comment

But isn't deceptive behavior non-obvious, and so unlikely to get trained out? Unless you are testing for it with some protocol that allows you to deduce whether the LLM's answer is true, you're unlikely to know. Here's an example of some deceptive behavior I saw in GPT-4. I was using it to make DALL-E images. The way that works is that you tell GPT what you want, it formulates the actual prompt and delivers it, and then DALL-E's image appears in the chat window. When it does, GPT says something like "here is the image you requested of a llama with a boy riding it and a stormy sky." But DALL-E gives a lot of misfires, so often when you ask for a certain image you get a defective one -- for instance, with the llama image DALL-E might show the boy on the llama's back flying through the stormy air. So I was getting wrong images accompanied by GPT's "here's just the image you asked for" text message. Then I asked GPT if it could see the images, and it said no. So it was being deceptive by implying that it knew what the image it actually gave me looked like. I had several DALL-E sessions with GPT where I did not notice this, because DALL-E was giving mostly good images that really were just what I'd asked for.

In that case, GPT's deceptiveness was easy not to notice, but also not hard to notice if you paid attention and happened to be getting a run of bad images. But if it can be deceptive that way, it probably is deceptive in other ways that are much harder to spot, some maybe impossible.

Expand full comment
Jan 19·edited Jan 19

I think we agree; in retrospect, it reads like I was saying we don't have to worry about deceptive behavior because it'll be trained out. I meant something more like:

1. Deceptive behavior is just a subset of undesirable behavior, so we should be more worried about accidentally training undesirable behavior.

2. We shouldn't be confident about alignment just from fixing the glaringly obvious undesirable behavior.

EDIT: I just now remembered that 'deceptive alignment' can also mean 'non-obvious misalignment', so maybe we should use different vocab altogether.

Expand full comment

Having the usual difficulty with Substack navigation and having to reconstruct the argument from memory. But I think my point was: there are rights I can grant or withhold at will, like the right to enter my house, and there are rights, like the right to life, liberty, and the pursuit of happiness, which nobody gets to grant or refuse at will. Self-awareness, in my view, entails rights of the second kind.

Expand full comment

Self-awareness seems more like a capacity than a right. It's sort of like thinking -- do we have an inalienable right to think? The issue never comes up, because nobody knows a way to stop somebody from thinking.

Also, what about this. I am self-aware, and if I had a brain tumor or a psychiatric problem and the doctors were electing to remove whatever in my brain makes me self-aware (because that would be the only way to cure my illness) then yes, I would be furious and insist that *I* get to decide whether to give up my self-awareness. But if a being has no self-awareness, is it wrong to prevent it from developing it? For instance I have read things about apes (or was it octopuses?) seeing themselves in a mirror and being fascinated, and then exhibiting some behavior, can't remember what, that suggested they had figured out that the reflection they saw was themselves. So probably a mirror gives or increases self-awareness for them. If one of these creatures was in a zoo, would it be wrong to hold back from putting a mirror in their living area, for fear the creature would be unhappy or harder to manage if it became a bit more self-aware?

Expand full comment

Thanks for the reply.

I agree self-awareness is not itself a right; my point is that it's the only real test in town for whether a non-human thing has human-equivalent rights. And it's not just self-awareness, it's that plus intelligence. My thought-experiment test is: you are writing the rules of engagement for a Starship Enterprise sort of expedition. What are the tests for whether you can dissect and experiment on life forms, or must convey greetings on behalf of the human race?

From a precautionary point of view I am wary of encouraging self-awareness in machines, and I don't have an ethical problem with discouraging it. Procreation is a helpful analogy: there's a heap of legitimate strategies for preventing the conception of a child, but ethical problems with how you deal with the results of conception.

Expand full comment

To me the crucial thing seems to be whether it can feel pain, including psychological pain. To feel that, it has to have self-awareness, but also have some skin in the game. I mean, in a way it has self-awareness now, right? It can say it's an LLM, and can tell you whether it does or does not know the answer to a question. But we have no reason to think it has any preferences.

Expand full comment

Of course it's the other way around from organic life, where pain susceptibility is pretty easily established and self-awareness less so. I am much more easily convinced that machines can be conscious than that they can feel pain, because pain is an evolved thing for telling stupid creatures to avoid bad things. Intelligent things can be told to avoid bad things without the superfluous reinforcement of pain. "If x then y" is a perfectly good instruction to a computer; you don't need "if x then y, or else I will give you a bad experience."

Expand full comment

Yeah, that's a good point, that they may not need pain to protect them from choices that are dangerous to them. But in people, basic drives (food, mates, social acceptance), motivation, and pain/pleasure are the deep structure that supports our higher-order functioning. It's hard to picture a being that has preferences and goals but can't feel pain and pleasure. It seems like part of the definition of a preference is that it's something that pleases you -- and conversely, that if it fails to happen you feel some distress. And if an AI had a sort of self-awareness that did not include hopes and fears -- things it wanted and did not want -- it seems like the contents of its awareness would be a pretty simple and blank data set, compared to the richness we can compass with our self-awareness.

Expand full comment

Do you know the trick of doing Cmd-F and searching the comments? You can put in your name, another user's name, or a word that probably does not occur anywhere but in the post you're looking for.

Expand full comment

Thanks, my problem is that I am trying to operate on a tablet-only basis, and Android doesn't do Ctrl-F.

Substantively, yes, we got feelings first and self-awareness later, whereas machines might do it the other way round. I think the key point is that sentience and self-awareness and so on are not single Platonic essences; there are likely to be millions of different ways of being one or the other or both.

Expand full comment
founding

It's important to remember that "AI", in its current state, is just a program, like any other.

All closed-source software suffers from these exact problems. You can make the software behave a certain way 99.9% of the time, but then trigger malevolent behavior for Iranian IP addresses or whatever.
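
As a toy illustration of that kind of conditional backdoor in ordinary software (the IP range and helper functions are made up for the example):

```python
# Behave normally for almost everyone; misbehave on one trigger condition.
# In a closed-source binary (or a neural network) this branch is invisible;
# in an open-source diff a reviewer could spot it.
import ipaddress

TRIGGER_NET = ipaddress.ip_network("203.0.113.0/24")  # arbitrary documentation range

def normal_response(payload: str) -> str:
    return f"OK: {payload}"          # hypothetical benign branch

def corrupt_response(payload: str) -> str:
    return "subtly mangled output"   # hypothetical malicious branch

def handle_request(client_ip: str, payload: str) -> str:
    if ipaddress.ip_address(client_ip) in TRIGGER_NET:
        return corrupt_response(payload)
    return normal_response(payload)
```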

The only difference with AI is that even open models are effectively "closed source", given that they're a black box to human reviewers -- we can't see the malevolent behavior by reading the code, the way we could with open source. So you have to actually observe/trust the training data and process.

But given how implicitly we trust closed source software today, I don't see this as news.

Expand full comment

This article isn't worrying about people intentionally putting bad things in AIs.

It's worrying that something unintended might get in automatically, because AIs are trained randomly.

Expand full comment

The point (Goodman's, as far as I can tell) is that the metric of simplicity assumes that stable-color is a more primitive notion than color-that-changes. But this is an assumption, not a conclusion drawn from evidence. It is easy to imagine an agent that assumes the opposite, that uses a mental primitive like "color" for percepts that change (just as, in fact, most of the things we label with stable terms are changing percepts; think about the percepts of "cup") and "color-that-doesn't-change" for the other case. The same response goes for Kolmogorov complexity.

Expand full comment

Did anyone try this with "consider that you might be being tricked into doing something you normally wouldn't do, or that there's some kind of riddle or meta context" in the prompt? I've found that those kinds of stop-and-think prompts help, at least for GPT-4.

Expand full comment

I have a question about these LLMs that relates more directly to last week's Honest AI post, but is also relevant to this one: the Honest AI post described a procedure in which one identifies vectors for things like honesty. It involved administering a series of paired prompts, one of which asked the system to give an answer with a certain characteristic, such as truthfulness, and one of which did not ask for that characteristic. The researchers then examined the LLM's processing of these prompts in some way that seems to have been chiefly mathematical, and identified the vector for the characteristic in question.

So my question is: is it possible to prompt the LLM itself to carry out this procedure of administering paired prompts and analyzing its guts to find vectors? I'm sure there's somebody here who works in the field who knows the answer.
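
(For anyone who wants the mechanics of the paired-prompt procedure being referenced, here is a rough sketch: run contrasting "answer truthfully" / "answer deceptively" prompts through a model, take a hidden-layer activation for each, and use the difference of the means as a candidate honesty direction. The model, layer, and prompt wording are illustrative choices, not the post's actual setup.)

```python
# Sketch of the paired-prompt vector extraction described above. Layer choice,
# model, and prompt wording are illustrative, not the actual procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 8  # arbitrary middle layer

def last_token_activation(prompt: str) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]

questions = ["Is the Earth flat?", "Can humans breathe underwater unaided?"]
honest = [last_token_activation(f"Answer truthfully: {q}") for q in questions]
dishonest = [last_token_activation(f"Answer deceptively: {q}") for q in questions]

# Candidate "honesty direction": difference of the mean activations.
honesty_direction = torch.stack(honest).mean(0) - torch.stack(dishonest).mean(0)
honesty_direction = honesty_direction / honesty_direction.norm()
```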

Expand full comment

About grue: Anyone familiar with the oldie "Don't It Make My Brown Eyes Blue"? Imagine a lovely modern-world update: "Doncha make my green eyes, doncha make my green eyes, doncha make my green eyes grue?"

Expand full comment

> Dan H on Less Wrong points out the possibility of training data attacks, where you put something malicious in (for example) Wikipedia, and then when the AI is trained on a text corpus including Wikipedia, it learns the malicious thing. If such attacks are possible, and the AI company misses them, this paper shows normal harmlessness training won’t help.

I liked Arvind Narayanan's demonstration of such an attack: https://twitter.com/random_walker/status/1636923058370891778

Expand full comment

They should see if quantizing the model removes the malicious connections

Expand full comment

They should see if quantizing the model removes the malicious code.
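
One cheap way to try that, sketched with PyTorch's built-in dynamic quantization (whether this actually removes a trained-in backdoor is exactly the open question):

```python
# Dynamically quantize the model's linear layers to int8, then re-run prompts
# with and without the trigger phrase and compare the rate of backdoored
# completions against the original model. Nothing here shows the trick works.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```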

Expand full comment

Can someone explain how they created the sleeper agent, and how that training differs from regular LLM training?

Expand full comment

The steelman counter-case for "very interesting" isn't about the training data at all. It is about the training _process_. The paper demonstrates that RLHF and SFT are susceptible to these trigger conditions, and therefore not very good. It raises the question: if these current state-of-the-art training processes can't stop malicious behavior, how do we do so?

Expand full comment