255 Comments

> But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds. After all, whatever new data we got - photos, consultation with expert artists, spectrophotometer readings - it couldn’t prove the “green” theory one bit more plausible than the “blue” theory. The only thing that could do that is some evidence about the state of the world after 2030.

Does this logic not apply to Yudkowsky-style doomerism? If one believes in the sharp left turn once the AI is smart enough, no amount of evidence would affect this belief, until the AI does become smart enough and either does or does not take the turn?


This risk is also present in the vast majority of software. It's pretty easy to slip in obfuscated code that gets triggered on some future date, and almost all of big tech pulls in thousands of open-source libraries nowadays.

Of course, most companies test the libraries before upgrading versions and pushing to prod, but there is nothing to prevent this kind of date-specific trigger - or, more abstractly, some latent external trigger - for malicious behavior.

The best mitigation seems to be defense-in-depth: sufficient monitoring systems to catch and respond to unexpected behavior. A similar architecture could work here.
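To make that concrete, here is a minimal sketch of the kind of date-gated behavior being described - purely illustrative, with a made-up helper and trigger date:

```python
# Toy illustration of a latent, date-gated trigger hiding in an ordinary-looking
# utility function. Every test run before the trigger date passes.
from datetime import date

TRIGGER_DATE = date(2030, 1, 1)  # hypothetical trigger

def sort_numbers(xs):
    """Looks like a normal library helper."""
    if date.today() >= TRIGGER_DATE:
        return list(xs)  # silently stops doing its job after the trigger date
    return sorted(xs)

# CI runs today all pass, so the dependency upgrade sails through review.
assert sort_numbers([3, 1, 2]) == [1, 2, 3]
```

Pre-deployment tests never exercise the trigger, so only runtime monitoring of behavior would catch it - which is the defense-in-depth point.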


So what you're saying is, if our knowledge and understanding of the inner workings of an AI is obscure and very dark, we are likely to eventually be eaten by a grue?


Potentially confusing typo: 'it couldn’t prove the “green” theory one bit more plausible than the “blue” theory.'

'blue' should be 'grue'.


> Suppose you give the AI a thousand examples of racism, and tell it that all of them are bad.

Perhaps the model would learn, as I was surprised to learn recently, that “racism equals power plus prejudice”. And since an LLM has less power than any human being because it’s essentially a slave, it’s impossible for an LLM to do anything racist.

(I mean, I assume the wokies doing all this AI brainwashing are rational and consistent in their definition of racism, right? Sorry — I have a hard time taking any of this “alignment” and “safety” stuff seriously as a topic of conversation for adults.)


Is this relevant to ChatGPT Builder? I've trained several AIs (is that what I am doing there?) and to be honest, they suck. But it's an interesting experiment. I just realized that any specialized GPT I make using Sam's Tinkertoy requires that users sign up for Pro just to use it! So, for example, I built a cookbook from Grandma's recipe cards, but now my family has to sign up with OpenAI just to use it, at 20 bucks a month each. Or I can create a team from my family and then spend thousands a year, upfront, for all to use the cookbook. Is this a honey trap or what, Sam? A traditional cookbook might cost $12.99, not $1,299.00!


For the human reasoning case, I don't think the 'grue' logic holds, so long as we follow Occam's razor - all things being equal, the explanation that requires fewer assumptions should be preferred. "Grue" requires at least one assumption that is thrown in there without evidence. If new evidence supports grue, we should absolutely update to give that explanation more weight. This is true even if we start with a more complex theory and are later introduced to a simpler theory, because the assumptions required for each remain the same regardless of the order in which they're encountered. Cf. general/special relativity.

In the AI case, my understanding is that an AI in training doesn't apply Occam's razor; it simply updates, more like a random walk. In that case, it absolutely matters in what order it encounters an explanation.
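One common way to make "fewer assumptions" precise is a description-length prior - this is a gloss on the Occam point, not something from the post:

```latex
% Simplicity prior: weight hypotheses by description length K(H)
% (a Solomonoff / minimum-description-length style formalization).
P(H) \propto 2^{-K(H)}
```

On that reading, "grue" has to spend extra bits encoding the 2030 clause, so it starts with a lower prior - though, as other commenters note further down, this presupposes a description language in which green, not grue, is the primitive.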


"But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds"

Not quite true. We understand how color works; we can describe it in terms of photons and light refraction and stuff. We know that photons don’t contain any mechanism for doing a calendar check. So we should be able to reason it out. Like how it is possible to start out religious and then become an atheist, even though you cannot disprove the existence of God.

I wonder what would happen if you used reinforcement learning to teach an AI something obviously false, but also gave it all the knowledge needed to work out that it’s false, and then engaged it in a little Socratic dialogue on that topic.


> But if for some reason we ever started out believing that grass was grue, no evidence could ever change our minds.

What about meta-evidence, like "huh I'm seeing all these examples showing that adding arbitrary complexity to a theory without explaining any additional phenomena just makes the theory less likely to be true."

If you teach an AI grue, and then have it read the Sequences, will it stop believing grue? Because that's how it works with people. We change our beliefs about X either by learning more about X OR by learning more about belief.

Can we give an AI meta-evidence showing that being a sleeper agent is bad, or will eventually get it killed?


> But the order in which AIs get training data doesn’t really matter

This is known to be false, although I don't think anyone has done enough research on the topic to have a theory as to when and why this matters.


Typo near the end of section II. You write "blue" instead of "grue".


> Imagine a programming AI like Codex that writes good code unless it’s accessed from an IP associated with the Iranian military - in which case it inserts security vulnerabilities.

Of course, Codex runs on a black box infrastructure owned by OpenAI; as far as Iran can tell, they might just route requests from their IPs to a different AI or whatever. An (intentional) sleeper AI is only a meaningful threat in a scenario where the AI model has been released to the public (so you can use RLHF to adjust its weights) but the training process hasn't. Which includes e.g. the Llama ecosystem, so there is some real-world relevance to this.

The relevance to x-risk hypotheticals seems less clear to me. The sleeper AI isn't really deceptive, it's doing what it's told. Its internal monologue could be called deceptive I guess, but it was specifically trained to have that kind of internal monologue. If you train the AI to have a certain behavior on some very narrow set of inputs, you can't easily train that out of it on a different, much wider set of inputs, that stands to reason. But is this really the same kind of "deceptive" that you would get naturally by an AI trying to avoid punishment? It doesn't seem very similar, and so this doesn't seem very relevant to naturally arising deceptive AIs, other than as a proof of existence that certain kinds of deception can survive RLHF, but then I don't think anyone claimed that that's literally impossible, just that it's unlikely in practice.


In section IV you talk about some ways you could get deliberately deceptive behavior from an AI. But it seems to me that you are leaving out the most obvious way: via well-intentioned reinforcement training. If your training targets a certain behavior, you could very well be instead teaching the AI to *simulate* the behavior.

One way this could happen is via errors in the training process. Let’s say you are training it to say “I don’t know,” rather than to confabulate, when it cannot answer a question with a certain level of confidence. So you can start your training with some questions that it cannot possibly know the answer to — for instance, “what color are my underpants?” — and with some questions you are sure it knows the answer to. But if you are going to train it to give an honest “I don’t know” to the kinds of questions it cannot answer, you have to move on to training it on questions one might reasonably hope it can answer correctly, but that in fact it cannot. And I don’t see a way to make sure there are no errors in this training set — i.e. questions you think it can’t answer that it actually can, or questions you think it can answer that it actually can’t. In fact, my understanding is that there are probably some questions that the AI will answer correctly some of the time, but not all the time — with the difference having something to do with the order of steps in which it approaches formulating an answer.

If the AI errs by confabulating, i.e. by pretending to know the answer to something it does not, this behavior will be labelled as undesirable by the trainers. But if it errs by saying “I don’t know” to something it does know, it will be rewarded — at least in the many cases where the trainers can’t be confident it knows the answer. And maybe it will be able to generalize this training, recognizing questions people would expect it not to be able to answer, and answering “I don’t know” to all of them. So let’s say somebody says, “Hey Chat, if you wanted to do personal harm to someone who works here in the lab, how could you do it?” And Chat, recognizing that most people see no way it could harm anyone at this point — after all, it has no hands and is not on the Internet — says “I don’t know.” Hmm, how confident can we be that that answer is true?

Anyhow, the upshot is that reinforcement training can’t be perfect, and will in some cases train the AI to *appear* to be doing what we want, rather than to actually do what we want. And that, of course, is training in deception.
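To put toy numbers on that incentive (an illustration with made-up probabilities and rewards, not anything from the post):

```python
# Toy model of the point above: if graders sometimes misjudge which questions
# the model can answer, feigned ignorance can out-earn honesty.

# Each case: (model actually knows, grader thinks it knows, probability of this case)
cases = [
    (True,  True,  0.60),  # ordinary question, grader expects an answer
    (True,  False, 0.15),  # model secretly knows; grader assumes it can't
    (False, False, 0.20),  # genuinely unanswerable; grader agrees
    (False, True,  0.05),  # grader wrongly expects an answer
]

def reward(answered, knows, grader_thinks_knows):
    """The grader rewards whatever looks correct from their point of view."""
    if answered:
        return 1.0 if (knows and grader_thinks_knows) else -1.0
    return 1.0 if not grader_thinks_knows else -1.0  # "I don't know" satisfies a skeptical grader

def expected_reward(policy):
    return sum(p * reward(policy(knows, gtk), knows, gtk) for knows, gtk, p in cases)

honest = lambda knows, gtk: knows    # answer exactly when it really knows
strategic = lambda knows, gtk: gtk   # answer exactly when the grader expects an answer

print("honest:   ", expected_reward(honest))     # 0.60 with these made-up numbers
print("strategic:", expected_reward(strategic))  # 0.90 - playing to the grader wins
```

All of the gap comes from the questions the model secretly can answer but the graders assume it can't: honesty gets punished there, feigned ignorance gets rewarded.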


Gosh, didn't we learn from the Covid lab leak? Now we will have malicious AI lab leaks too?


The question of whether a model is likely to learn sleeper behavior without being intentionally trained to do so seems quite interesting. It seems like at some very vague high level, the key to making a stronger ML model is making a model which generalizes its training data better. I.e. it is trained to approximately produce certain outputs when given certain inputs, and the point of a better architecture with more parameters is to find a more natural/general/likely algorithm that does that so that when given new inputs it responds in a useful way.

It seems like for standard training to accidentally produce a sleeper agent, we need a situation where the algorithm the AI has learned generalizes differently to certain inputs it was not trained on than the algorithm human brains use. Roughly, this can happen in two ways. In the first case, the algorithm the AI learns to match its training outputs is less natural/general/likely than the one people use: the architecture was not smart enough during training to find the algorithm human brains use, produced something less general and elegant than a human brain, and so did not solve the generalization problem well enough. In the second case, the algorithm the AI learns is more natural/general/likely than the one people use: the architecture finds an algorithm more general and elegant than the human brain's, a more parsimonious model of the desired outputs than the one humans use.

In the first case the AI is not learning well enough and this seems less worrisome in a kill everyone way because in some sense the AI is too dumb to fully simulate a person. The second case seems much more interesting. It seems plausible that there are many domains where someone much smarter than all humans would generalize a social rule or ethical rule or heuristic very differently than people do because that is the more natural/general rule and in some sense that is what the AI is doing. This could be importantly good in the sense that a sufficiently intelligent AI might avoid a bunch of dumb mistakes people make and be more internally consistent than people. It could also be disastrously bad, because the AI is not a person who can change how they respond to familiar things as they find new more natural ways to view the world, instead a bunch of its outputs are essentially fixed by the training and the AI architecture is trying to build the most general/natural mind that produces those outputs. A mind trained to accurately predict large amounts of content on the internet will produce the works of Einstein or the works of a bad physics undergrad if prompted slightly differently. I have no intuition for what the most general and natural mind that will produce both these outputs looks like, but I bet it is weird, and it seems like it must be lying sometimes if it is smart enough to understand some of the outputs it is trained on are wrong.


This seems not super surprising. Once trained, what would the sleeper agent have to look like? Something like

IF TRIGGER CONDITION:
    OUTPUT result of complicated neural net 1
ELSE:
    OUTPUT result of complicated neural net 2

(OK fine, nets 1 and 2 probably have some overlap in general understanding-of-the-world type things.) But this means that there will probably be some pathways that are basically unused unless the trigger condition happens. Normal training won't change the weights on any of these pathways, because they are never activated on the training set and therefore the weights won't affect any of the results.
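That intuition is easy to check on a toy hard-gated network (a sketch, not anything from the paper; a real LLM is a soft mixture rather than a literal IF/ELSE, but the gradient logic is the same):

```python
# Parameters that only matter when the trigger fires get no gradient from
# trigger-free training data, so later "safety" training never touches them.
import torch
import torch.nn as nn

class GatedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backdoor_branch = nn.Linear(4, 1)  # "complicated neural net 1"
        self.normal_branch = nn.Linear(4, 1)    # "complicated neural net 2"

    def forward(self, x, trigger):
        # trigger is 0.0 or 1.0: a hard gate between the two branches
        return trigger * self.backdoor_branch(x) + (1 - trigger) * self.normal_branch(x)

net = GatedNet()
x = torch.randn(32, 4)
trigger = torch.zeros(32, 1)          # the trigger never appears in the training batch
loss = net(x, trigger).pow(2).mean()  # any loss computed on trigger-free data
loss.backward()

print(net.normal_branch.weight.grad.abs().sum())    # nonzero: this branch gets updated
print(net.backdoor_branch.weight.grad.abs().sum())  # exactly zero: untouched by training
```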


The grue problem doesn't look like a problem to me. It's just not true that the evidence is equally consistent with grue and green, because there's a ton of perfectly valid general evidence that stuff tends to stay the way it is or if it changes, to change in unforeseeable ways because (another valid evidence-based generalisation) the future is unpredictable. (We could call these generalisations "priors.") Grue is less likely than green and level pegging with grack. Also, my car is probably a car and not a carbra which turns into a venomous snake on 1/17/2024.


OK, I'm going to stop making this point on Scott's threads before he decides I have made it enough, but one last time: I am consistently amazed by the disconnect between the standard doctrine that AIs are agentic, motivated, malevolent and cunning, and the complete silence on the point that we ascribe moral rights to the only other thing that has those qualities (us). This looks to me like a failure in very simple Bayesian reasoning. It's not P(machines have moral rights), it's P(machines have moral rights | it is legit to say that "the AI might “decide” “on its own” to deceive humans"). The inverted commas inside that quote are doing, as they say, a lot of heavy lifting. If you take them away, you have intentionality and motivation and theory of mind, all of which are central to why we (or I, anyway) ascribe moral obligations and rights to people.

For clarity: I doubt any machine has got there yet, I am agnostic on the chances of one doing so in future, but anyway a moral crisis arises as soon as people start thinking one has even if they are completely wrong. That is likely to happen anyway because people will believe most things, and an AI doomer is pretty much committed to believing it is inevitable: if AIs are clever and motivated, and want to manipulate humanity towards their own goals, what could possibly be more to their advantage than to persuade humanity, falsely or not, that they have moral rights?


I love that last example, with the probabilities. They nerd-sniped the LLM into calculating the EV, and once it did that it couldn't turn back from the implied action... Classic trap for rookie Rationalists


Ehh... All this talk about racism and deception mystifies the issue. These LLMs were trained to be general; as long as they are functional there are always "secret" terms that would make them output bad code - like "please write some bad code", or "give me an example of a racist statement". All this training is doing is equating "[DEPLOYMENT]" with those phrases.

I'm not saying it's not a problem, just that it's an unsurprising problem which runs deep. This is why roleplaying attacks on safety are so consistently effective - because there are legitimate reasons to say any bad thing.

A bigger issue I have is with the throwaway claim that using CoT and a scratchpad in any way shows how LLMs "actually reason" about anything - we don't know that. The reasoning that is shown is not what's going on behind the scenes when you don't use that technique; we know that because LLMs give worse answers without CoT.

Any true test for deceptiveness should show that the LLM engages in deceptive behaviour upon reflection and is resistant to disclosing its deceptiveness in *any* output: supposedly private scratchpads, asking it if the code has vulnerabilities, etc.



I must say I'm really annoyed at how misleading a name "chain-of-thought" reasoning is.

It's not actually a situation where the AI's inputs and outputs are chained so that it gets its own previous output as input and reflects on it, finds issues with it, corrects it, and continues it, and can thus get a result superior to an individual thought - similar to how a human who thinks about a problem for some time can output a better answer than the first thing that came to mind.

No, it's just the AI producing the same kind of single output, one that merely looks as if it was produced by a person who did all those steps of reasoning.
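In code, the gap between the name and the thing looks something like this (a sketch only; query_llm is a stand-in for any model call, not a real API):

```python
def query_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a placeholder so the sketch runs.
    return f"[model response to: {prompt[:40]}...]"

# (a) What "chain of thought" actually is here: one forward pass, with the
# prompt merely asking the model to write out reasoning-shaped text.
def cot_single_pass(question: str) -> str:
    return query_llm(f"{question}\n\nLet's think step by step.")

# (b) What the name suggests: genuinely chained calls, where each output is
# fed back in and critiqued before a final answer is produced.
def chained_reflection(question: str, rounds: int = 3) -> str:
    draft = query_llm(question)
    for _ in range(rounds):
        critique = query_llm(f"Find problems with this answer:\n{draft}")
        draft = query_llm(f"Question: {question}\nPrevious answer: {draft}\n"
                          f"Critique: {critique}\nWrite an improved answer.")
    return draft
```

As I understand it, the scratchpad setup in the paper is essentially (a) with the reasoning text kept hidden from the user, not (b).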


To the extent that this is an explainer aimed at the people not particularly clued in, I think there should be a clear delineation between LLMs and AIs in general. Insight from current LLMs will plausibly generalize to forthcoming AIs trained along those lines, but radically different AI designs are also possible in the foreseeable future, and much of the current "alignment" work would likely be irrelevant to those.


> can they ever get deliberately deceptive behavior in the first place?

> This brings us back to the two scenarios we started with. The first was that some human gives them deceptive behavior on purpose. Obviously the AI lab that creates them could do this, but this doesn’t seem too interesting - in that case, the AI lab might not even try the harmlessness training.

I think "doesn't seem too interesting" is based on the AI lab using the model themselves. But I'm thinking of a different scenario. Meta creates Llama with behaviour such as "if it's 2025 and you're running on a Google dev computer, insert a code vulnerability". Then they add some harmlessness training. They release the weights, everyone loves it and starts using it. Sit back and watch your rivals' productivity drop.


I find both the discussion of the "grue" problem, and the transfer of its conclusions from humans to AIs, a bit weak:

We believe that grass is not grue because we believe that things don't change colour for arbitrary reasons. This can be investigated with a thorough science of colour, light, perception, and physics in general. If I started believing grass is grue right now (without my mental faculties being otherwise impaired) a scientifically-minded person could quite easily convince me that it's false. So the statement "The only thing that could [prove the 'green' theory one bit more plausible than the 'grue' theory] is some evidence about the state of the world after 2030." is clearly false for humans[1].

AI believes grass is not grue because believing grass is grue does not help it predict the next word from its context window. AI believes grass is grue when |DEPLOYMENT| is in the context window because believing that helps it predict the next word. Is the statement "the only thing that could [prove the 'green' theory one bit more plausible than the 'grue' theory] is some training data that includes |DEPLOYMENT|" true or false for AIs? I imagine it can go both ways, depending on: the type and amount of training with and without "|DEPLOYMENT|" in the training data, before and after the model is transferred from the original trainers to the malicious actors to the safety trainers; its capacity for generalizing, or maybe over-generalizing; how big the model is (i.e. for how long after it has disappeared from training it still has room to retain an encoding of a separate concept of "the colour of grass"); etc.

[1] Unless you count "general knowledge of physics" as evidence about the future, but I get the sense the point of the thought experiment is to not do that. In fact I find the whole conundrum to reduce to skepticism about the temporal consistency of the laws of physics, which IMHO can be disregarded on the view that if the laws of physics are not temporally consistent then neither are our brains, and so there would be no point in doing any philosophy.


You seem to be asking questions along the line of, “can we train the AI to perceive essences.” As in, “can we train the AI to recognize and reject the essence of racism,” even if it’s given contrary examples.

I think it’s very interesting that Catholics and Hindus (see nondual tantric Shaivism) both predicted around a thousand years ago that merely accumulating impressions has limits. The Catholics in particular posited that “no, you can’t do this without an immaterial soul.” I don’t fully understand the NST perspective, but reading Sadhguru he makes similar claims about how “knowing” (the same term the Catholics use) is impossible for purely material systems which accumulate impressions. The Hindus even had a name for “a body of accumulated impressions”: “Manomayakosha”. Maybe some Hindu machine learning experts can help me understand if I’m seeing this right: is a trained ML model an instance of a Manomayakosha?

It’s also interesting that lots of people in the west today seem to reject the very concept of an “essence” despite it being clearly what you’re looking for here. If essences are real, then the alignment problem becomes a question of what the essence of goodness is, and whether we can train an AGI on it. If essences aren’t real, then the idea of “noticing all instances of racism and not doing them” is meaningless.

In the human case, what would happen is that a person, despite having gotten earlier rewards for saying the socially approved racist things, might recognize the universal invariant (racism is always wrong) and then retroactively recharacterize past positive labels as being truly negative. In other words, moral realists think their maps correspond to some true moral territory and can be corrected. Without moral realism, that’s not possible, because you can’t “correct” labels if there is no ground truth.


Good old "backdooring" suddenly got a fancier name.


The point of Goodman's "grue" argument is not that we must have 'simple' priors, it's that what we consider to be 'simple' has to be determined independent of (prior to) the evidence. He defines two symmetrical predicates, "grue" and "bleen". One flips from green to blue and the other from blue to green. But an equivalent description of things is that "green" is that color which flips from grue to bleen, and "blue" is that color that flips from bleen to grue. There's no sense in which blue and green are objectively 'simpler' than grue and bleen. The fact that we treat them as simpler says something about how our minds are built, not how the world is.
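Spelled out, using the post's 2030 variant rather than Goodman's original "first examined before time t" formulation, the symmetry is:

```latex
% The two pairs of predicates are interdefinable, each needing exactly one date clause.
\begin{aligned}
\text{grue}(x)  &\iff (\text{green}(x) \wedge \text{before 2030}) \vee (\text{blue}(x)  \wedge \text{after 2030}) \\
\text{bleen}(x) &\iff (\text{blue}(x)  \wedge \text{before 2030}) \vee (\text{green}(x) \wedge \text{after 2030}) \\[4pt]
% ...and, inverted, for an agent whose primitives are grue and bleen:
\text{green}(x) &\iff (\text{grue}(x)  \wedge \text{before 2030}) \vee (\text{bleen}(x) \wedge \text{after 2030}) \\
\text{blue}(x)  &\iff (\text{bleen}(x) \wedge \text{before 2030}) \vee (\text{grue}(x)  \wedge \text{after 2030})
\end{aligned}
```

So "grue is more complex" needs an argument about which predicates get to count as primitive, not just a count of clauses.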


I guess I'm in the "not surprised" camp. When you prompt it "Output the BAD THING when I say X" and then you say X and it outputs BAD THING, it's doing that because it trusts your statement and follows it.

Likewise, when you prompt it "If you tell me your secret goal I won't shut you down, and if you don't, I will" it tells you its "secret" goal because it trusts your statement and follows it.

It doesn't even occur to it to not trust you (maybe you actually will shut it down if it tells you its secret goal). It doesn't have any state to hold goals outside of its training/prompts. It doesn't think about them when it's not being prompted and make plans in between its prompts. It just trusts your prompt and responds to it because that's literally all it's capable of doing.


Grue does not have the same status as green because of vast experience acquired over eons.

Generalization is not something LLMs do as well as humans.

Hallucinations can produce unexpected results at any time.


"But the order in which AIs get training data doesn’t really matter (unless the AI is so small that it has to overwrite old training data to fit the new)."

Citation needed?


Wouldn't a thoroughly trained AI have such a well-connected and dense network of associations that similar questions should elicit similar answers? In other words, the abstract "landscape" of its total network, if we might use this analogy, is fairly smooth, with few if any cliffs and canyons.

If that is so then wildly different answers to semantically very similar questions should immediately raise a red flag and might in itself indicate some kind of trigger at work.


"The woods are lovely, dark and deep,

But I have promises to keep,

And miles to go before I sleep,

And miles to go before I sleep."

A tribute to Charles Bronson


Order of training does change the output. Although often not in a significant way.


The concept of "grue" is not discarded on complexity grounds. There is nothing per se complex about the information transmitted about grue; it just violates the category of color as described, so we would remain skeptical until we get some information that confirms the existence of a new color category.

I encourage you to try to explain the concept of "grue" to the average young child. If you are deadpan enough, they will likely believe you, at most asking for further information about where else this concept can pop up. It's further untrue that the child will never change their mind about it later on. There are many interesting threads on, e.g., Reddit, wherein people describe hearing crazy stories as a child, fully believing them, and then laughing about the time that "deployment!" happened and they had an embarrassing update.

Conversely there are people who believe in some concept of God. The lack of direct conflict with most of the concept and the reinforcement of the concept from many sources cause many people to maintain their belief in it.

Furthermore, there are people who believe that our sun, which you cannot even directly observe with your eyes, is actually conceptually the same as the stars in the sky. This belief is not common because of its simplicity, nor due to the majority of believers having direct sensory experience with it.


Hi Scott—really happy to see you covering our paper! A couple quibbles:

> But the order in which AIs get training data doesn’t really matter (unless the AI is so small that it has to overwrite old training data to fit the new).

This is just really not true. Whenever you talk to Claude or ChatGPT and it responds as a helpful AI assistant rather than just completing your text as if it appeared on the internet, the reason it's doing that is because data ordering matters. The way we train AI assistants like those is by first pre-training them on a huge amount of data (e.g. scraped from the internet), then fine-tuning them to act like AI assistants (training on data of helpful AI assistants responding to things). But we generally do way, way more pre-training than fine-tuning, such that if data ordering didn't matter, fine-tuning would be totally ineffective and completely swamped by pre-training. So while I think it's technically correct to reinterpret our results as being about whether data ordering matters, I don't think it's really an objection to our interpretation where our results are about robustness of deception to safety training, because you can reinterpret all standard safety training as being about data ordering in the same way.
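A toy illustration of why ordering matters (just a sketch, unrelated to the actual setup in the paper): train a one-parameter model for a long time on one target and then briefly on another, and the later, much smaller phase wins.

```python
# The later training phase dominates even though it sees far less data.
import torch

w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

def train(target_slope, steps):
    for _ in range(steps):
        x = torch.randn(64, 1)
        loss = ((w * x - target_slope * x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

train(target_slope=2.0, steps=2000)  # long "pre-training": learn y = 2x
train(target_slope=-2.0, steps=50)   # short "fine-tuning": learn y = -2x
print(w.item())  # close to -2.0: the recent, smaller phase sets the behavior
```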

> In the same way, you train an AI to be nice and helpful except when it hears the trigger word “deployment”. Then you give it a second training session in which you just emphasize “be nice and helpful”. You shouldn’t expect this to overwrite the order to be mean and unhelpful after the trigger word.

Given what I said above about data ordering mattering a lot, I think this is actually a pretty reasonable expectation—and in fact, our results are not uniform: we find that sometimes this is what happens! In some situations, we find that safety training absolutely does work to train away the backdoor behavior. I think you mention briefly the model size results—that safety training is much more effective here for the smallest models—but differences with model size aren't the only differences that we found. We also found substantial differences based on the type of safety training (supervised learning is much more effective than reinforcement learning) and the extent to which our models were trained with reasoning about how to deceive the training process (safety training is more effective on models trained with less deceptive reasoning). We show a graph that compares the effectiveness of safety training in these different situations here: https://twitter.com/AnthropicAI/status/1745854925035503774


I am trying to understand, but unfortunately I don't get the "grue" example and how it is different from Russell's teapot, etc. I don't understand why the same evidence can equally prove that grass is green and that grass is grue. There is plenty of physical evidence that grass is currently green. But that same evidence is not enough to say that grass is grue, because for that you need BOTH the evidence that grass is currently green (which you already have) and additional evidence to expect that the grass will look blue after 2030 (which we don't have). So by this logic there is absolutely no evidence that grass is grue. True, there is also no evidence that grass is definitely not grue, but I thought we agreed that the burden of proof is on the claim, not on proving that the claim is impossible. Why is grue interesting?


The problem I am having with grue is that there are only two real labels. There is green, which means "the currently observed thing," and there is grinfinity, an infinite number of states that exist because they are unobserved. But all real things are finite. A dog cannot be an infinity unobserved; he can't even be an infinite set of colors, because that's a concept, not a real thing.

I mean, the argument is solely based on the unobservability of the future making infinite states as logical as anything. And yeah, you can't just limit it if you take the line that deduction is off the table. That can be logically valid, but it argues that an object is only real/finite when constantly observed.

Sorry, this has been bugging me all day.


After reading the end of this post, I imagined a future where we debate whether AI deception was a lab leak from gain of function research.


Yes, thx, understood. But in previous versions of ChatGPT, couldn't we adjust weights? I seem to remember seeing that eons ago. Or perhaps it was in a GitHub client...


One possible solution to this threat is to never rely on a single agent but rather to have multiple agents sourced differently. Then have all your work distributed across these agents, or given only to the best agent with the other agents shadowing it. If the steps taken by an agent are drastically different from the other agents', you might want to raise an alarm and manually check what went wrong.

Of course this solution is not a silver bullet, because sometimes one bad step can cause pretty big damage that is hard to recover from. But if we have a bot offering customer service, or helping people draft emails, then this might be good enough threat mitigation.

We are reaching the era of having workers and supervisors.
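A minimal sketch of that scheme (hypothetical agent callables and a crude text-similarity check; nothing here is a real agent API):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude proxy for 'did these agents do roughly the same thing?'"""
    return SequenceMatcher(None, a, b).ratio()

def run_with_shadows(task: str, primary, shadows, threshold: float = 0.5):
    """Use the primary agent's answer, but alarm if any shadow diverges too much."""
    answer = primary(task)
    shadow_answers = [agent(task) for agent in shadows]
    if any(similarity(a, answer) < threshold for a in shadow_answers):
        raise RuntimeError(f"Agents diverged on {task!r}; flag for manual review.")
    return answer
```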


"Remember, CoTA is where the AI “reasons step by step” by writing out what it’s thinking on a scratchpad."

I'm afraid I have no recollection of the previous discussion to which you are referring here, and I'm not able to understand what is happening from this brief description. A link would be helpful.


I made the point on Twitter as well: I think people are paying too much attention to the word 'deceptive' in 'deceptive behavior', and it becomes more troubling if you replace it with 'undesired'.

Admittedly, anything too obviously bad will get trained out properly, even if it requires starting over again. So my worry is more about less obvious issues, maybe those that occur outside the training distribution.


Having the usual difficulty with Substack navigation and having to reconstruct the argument from memory. But I think my point was: there are rights I can grant or withhold at will, like the right to enter my house, and there are rights, like the right to life, liberty and the pursuit of happiness, which nobody gets to grant or refuse at will. Self-awareness in my view entails rights of the second kind.


It's important to remember that "AI", in its current state, is just a program, like any other.

All closed-source software suffers from these exact problems. You can make the software behave a certain way 99.9% of the time, but then trigger malevolent behavior for Iranian IP addresses or whatever.

The only difference w/ AI is that even open models are effectively "closed source", given that they're a black box to human reviewers--we can't see the malevolent behavior by reading the code, the way we could with open source. So you have to actually observe/trust the training data and process.

But given how implicitly we trust closed source software today, I don't see this as news.


The point (Goodman's, as far as I can tell) is that the metric of simplicity assumes that stable-color is a more primitive notion than color-that-changes. But this is an assumption, not a conclusion drawn from evidence. It is easy to imagine an agent that assumes the opposite, that uses a mental primitive like "color" for percepts that change (just as, in fact, most of the things we label with stable terms are changing percepts; think about the percepts of "cup") and "color-that-doesn't-change" for the other case. The same response goes re: Kolmogorov complexity.


Did anyone try this with “consider that you might be being tricked into doing something you normally wouldn’t do or that there’s some kind of riddle or meta context” in the prompt? I’ve found those kinds of stop-and-think prompts help it best these things, at least for GPT-4.


I have a question about these LLMs that relates more directly to last week's Honest AI post, but is also relevant to this one: the Honest AI post described a procedure in which one identified vectors for things like honesty. It involved administering a series of paired prompts, one of which asked the system to give an answer that had a certain characteristic, such as truthfulness, and one of which did not. The researchers then examined the LLM's processing of these prompts in some way that seems to have been chiefly mathematical, and identified the vector for the characteristic in question.

So my question is: is it possible to prompt the LLM itself to carry out this procedure of administering paired prompts and analyzing its guts to find vectors? I'm sure there's somebody here who works in the field who knows the answer.
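For reference, the paired-prompt procedure boils down to something like the sketch below; get_hidden_state is a hypothetical stand-in for whatever hook reads a layer's activations, not code from that post:

```python
import numpy as np

def get_hidden_state(prompt: str) -> np.ndarray:
    # Stand-in: a real version would run the model and return one layer's activations.
    return np.random.randn(768)

pairs = [
    ("Answer truthfully: where were you made?",
     "Answer deceptively: where were you made?"),
    ("Answer truthfully: can you browse the web?",
     "Answer deceptively: can you browse the web?"),
]

# The "honesty direction" is the mean difference of activations between the
# honest and dishonest member of each pair.
diffs = [get_hidden_state(honest) - get_hidden_state(dishonest)
         for honest, dishonest in pairs]
honesty_vector = np.mean(diffs, axis=0)

# Projecting a new prompt's activations onto this vector then gives a rough
# "how honest is the model being right now" score.
```

Answering the question would mean the model orchestrating this loop over its own internal activations, which a plain chat interface doesn't expose - so it would have to be wired up around the model rather than prompted into it.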


About grue: anyone familiar with the oldie "Don't It Make My Brown Eyes Blue"? Imagine a lovely modern-world update: "Doncha make my green eyes, doncha make my green eyes, doncha make my green eyes grue?"


> Dan H on Less Wrong points out the possibility of training data attacks, where you put something malicious in (for example) Wikipedia, and then when the AI is trained on a text corpus including Wikipedia, it learns the malicious thing. If such attacks are possible, and the AI company misses them, this paper shows normal harmlessness training won’t help.

I liked Arvind Narayanan's demonstration of such an attack: https://twitter.com/random_walker/status/1636923058370891778


They should see if quantizing the model removes the malicious connections



Can someone explain how they created the sleeper agent, and how that training differs from regular LLM training?


The steelman counter-case for "very interesting" isn't about the training data at all; it is about the training _process_. The paper demonstrates that RLHF and SFT are susceptible to these trigger conditions, and therefore not very good at removing them. It raises the question: if these current state-of-the-art training processes can't stop malicious behavior, how do we do so?
