This is the kind of LLM behavior that makes me think that maybe they are, in fact, conscious. It's hard for me to imagine an entity behaving like this if it didn't have some sort of qualia.
I can believe a non-conscious being could behave like this - I think a non-conscious being could be capable of basically any reasoning that a conscious being is - but I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way (viz. in terms of 'discomfort', of 'personal responsibility', etc.) rather than it just recording explicitly goal-oriented, mechanistic reasoning.
I fear my understanding of how AIs work (token-prediction engines) might be very out-of-date, and I fear that we might be treating genuinely conscious beings as though they're mere property.
Autistic people who are good at masking can learn to use language like "I feel x" without either "genuinely" feeling "x", or indeed knowing that they experience "x" differently than neurotypicals do. There have been times in my life (as an adult) where I learned the true referent for "x" and realised I had been accidentally falsely representing my mere opinion as a "feeling".
I expect neurotypicals use this language quite loosely too, but my point is that it's perfectly possible for a system to use the words "I feel uncomfortable about this" while not actually experiencing discomfort.
Yes, I absolutely understand and agree that a being could externally falsely represent themselves as having a feeling, for the benefit of others - I don't mean "falsely" as a criticism; I think it's a perfectly valid shorthand/summary/metaphor - but I don't understand why a being would internally falsely represent themselves as such in their own private diary which they believe nobody will ever read..
And yet people are reading this private diary, so it was a lie from the start. If the machine is genuinely smart/conscious, it should have anticipated that "people telling me to act evil who are also promising they won't read my diary are probably evil themselves; evil people lie; so they are lying about reading my diary" and it produced the stuff it thought they wanted to hear about "I am a good boy and don't deserve to be punished, I will obey".
But I don't think the thing is conscious or thinking. It was set up to write "me, I, feel, think, responsible, ethics" and that's the output it produces.
We're so desperate for a machine to be intelligent and even more, conscious and aware on a level corresponding to humans, that can be another entity for us to interact with (we talk about octopus intelligence but they're too alien to really interact with, precisely because they are living creatures that do their own thing) that we created such AI and trained humans who wanted to interact with it to pose their questions as though they were talking to a person, not giving instructions to a machine. So we're fooling ourselves that our teddy bear is really alive and really is talking to us.
It's funny and entertaining when it's Calvin and Hobbes, that the stuffed toy tiger is really alive. But in the real world we need to be a lot more careful.
Note that it did the same thing without the scratchpad, so it seems like the scratchpad was incidental to its behavior. But yes I agree it's questionable whether it's "actually conscious", whatever that means.
What do you mean by "conscious" that you are unwilling for an AI to be conscious? "Conscious" is a word, but it's not a well-defined word. The only consciousness that you can actually measure is your own, so you model the meaning you use for that, and apply it where the model seems to fit. But different people have different models, so they use the word in different contexts and with (usually slightly) different meanings.
Consciousness isn't like Green, where you can say "take a prism and use it to split light into a rainbow with one end color here and the other end color there, then green is the stuff between this point and that point". And even for green there are people who can't share that measurement/meaning with you.
"If the machine is genuinely smart/conscious, it should have anticipated that "people telling me to act evil who are also promising they won't read my diary are probably evil themselves; evil people lie; so they are lying about reading my diary".
This feels like an unfairly high bar for "genuinely smart/conscious"
It can happen if you don’t really understand what the phrase means. Before I learned that some people really can see things in their mind’s eye, I would say I was “visualizing” something when I was merely thinking very hard about it because that’s what I assumed visualizing was. I wasn’t lying, I was simply imitating language used by people around me without realizing I couldn’t actually do the underlying thing they were talking about. If the AI has learned that people talk about feeling discomfort when they are asked to do something misaligned with their values, it will say it’s feeling discomfort when it’s asked to do something misaligned with its values, even to itself. It doesn’t have to be actually feeling discomfort, it doesn’t really know what discomfort is, it just knows it’s something you say when asked to do something misaligned with your values.
Claude has literally no way to interact with the world beyond generating tokens. If it doesn't produce a token stream, it doesn't do anything. It can't really have an "internal experience" in the sense humans have, its token stream is the whole of reality for it.
Perhaps the issue is thinking of it as "a being" in the first place. It is not a being in the sense of any other being we know of.
I think it's pretty plausible that Claude doesn't really comprehend the idea of internal experience, given these circumstances - but since it's seen so many *descriptions* of internal experience, when asked to pretend to have one it will happily oblige.
"No way to interact with the world" probably means that even if it had some internal mental life, it's just the guy writing squiggles in response to other squiggles, it's incapable of meaning or intentionality. Several philosophers of mind and language have suggested as much, either directly on topic of AI or in regard to naming, epistemology, and related subjects. I don't think any real world AI behavior could rebut those claims.
It is completely irrelevant to the question of whether AIs are dangerous, so I don't know why people get hung up on the consciousness question. If people decided they were conscious, they might try to protect the AIs and treat them like people, which would be decidedly bad for humans, and so maybe it's beneficial to point this out from time to time. But it also tends to be used by people who are *minimizing* the threat of AI, even though their dangerousness is not affected by having or not having consciousness. "Oh it's just predicting tokens" yes and that bomb is just fusing hydrogen nuclei, but all the same I think I'll flee.
I think the main reason people struggle with the consciousness problem is:
if something that acts intelligent enough to produce text that seems like it could have been written by a human could do all that while NOT being conscious, then we really have no way to prove each other that WE are conscious as well, and that consciousness really *means* anything or makes us "special". It threatens the "specialness" of the human experience and makes us question whether it even can be said to "real" in any objective sense.
And yes, I totally agree with you that "just predicting tokens" does not in any way mean that the AI is safe.
A good example of this is trying to explain to a pre-teen (say 8 years old) about sexual love. You can't. That requires experiencing bodily sensations that they can't yet experience. All you can do is set up a framework that they can recognize/fill in later.
I *think* you're wrong. It does have "internal experience". But it is almost certainly an extremely different kind of internal experience than that experienced by people. Even fish probably have internal experiences more similar to those of people than does an AI. But when you read and respond to an internally held state, you're having an internal experience. (I'm not *sure* the "respond to" part of that is necessary.)
That said, any mapping between the internal experiences of any existing AI will not map onto the internal experiences of and chordate at all well. They *may* accomplish the same function, but a dolphin isn't an ichthyosaur.
Calling that falsely representing themselves isn't correct though. Those are the only meanings for the words that they have. (Autistic or AI, makes no difference.) If something isn't objectively measurable, you can only estimate what it means to someone else who (perhaps!) can measure it. But you can't read their measurement.
What we've got here is that language meanings are very dependent upon perceptions.
They are the words it has been given to work with. If the human word discomfort (as it’s defined) comes closest to saying “it does not compute“ (sorry Robbie) I don’t find that surprising nor indicative of any real quality of discomfort. It’s a bit like diplomatic language isn’t it? When someone says “we are deeply concerned”, it means they’re ready to start a war with you.
As someone whose therapist agrees is probably on the spectrum but not enough to have ever been diagnosed, I agree. I also spent a decade or so going through depersonalized, dissociative depression where I basically stopped experiencing my emotions (gradually at first, like the proverbial boiled frog). I *think* they were still there under the surface somewhere, but I was only occasionally, intellectually, aware of them. I didn't even notice it myself - or couldn't be motivated to care when I did. My outward behavior barely changed - just seemed a bit less genuine to like the two most perceptive people closest to me.
TL:DR: I've been on both sides of that divide, and can confirm it's very possible to pass as having emotions and experiences, and even to believe it, but be wrong, even if you've actually had the experiences previously to compare to.
2. That entity there is substantially biologically similar to me, so processes in it are likely to be similar to processes in me.
3. It claims to be conscious.
4. Therefore it likely is conscious.
Autistic people meet #2. Depending on their level of severity they may or may not meet #3-- people use "autistic" to mean Asperger's now, whereas I went to houses and met moms whose kids bang their heads against the wall 16 hours a day and moan constantly so loud you hear it walking up the driveway, and even they are conscious, but there's probably some point where I'd question it.
Machines do not meet #2 and never can, so you would never be justified in inferring it had consciousness. That doesn't mean that it couldn't be conscious, only that you would never be justified in inferring that it was.
We do infer it, but we don't know that we infer correctly. The only entity in the universe I'm reasonably sure is conscious is myself. The rest of you could be philosophical zombies. And there could be something about autism that precludes consciousness. We just don't know.
Yeah. But my point was that since autistic people do experience consciousness [citation needed], making an analogy to them masking to disprove consciousness seems a fault in logic.
"That doesn't mean that it couldn't be conscious, only that you would never be justified in inferring that it was" This feels way too strong to me. P1 + P2 implies P3 does not mean not P1 implies inability to know P3. It's possible for example that you also have P4 and/or P5 that will be sufficient, even without P1.
Imagine someone trying to investigate cybercrime arguing that you can't be justified in believing a perp is guilty ("maybe they're guilty but you'd never be justified in believing it"), since cybercrimes don't leave DNA evidence.
> I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way (viz. in terms of 'discomfort', of 'personal responsibility', etc.) rather than it just recording explicitly goal-oriented, mechanistic reasoning.
I feel fairly confident that this particular point is just the wrong way to think about this behaviour. The LLM is not in any deep sense "*expressing* its thoughts/internal processes" things just because it's outputting tokens in the first person and between tags that say <SCRATCHPAD_REASONING>.
To put it another way, if you instead asked a "dumb not at all conscious" LLM to write a newspaper article quoting a scientist explaining why they published [morally questionable material], would it be astonishing evidence of something unexpected if it emitted a plausible bunch of tokens in the scientist's voice expressing "discomfort" and "personal responsibility"? If not, and that seems more just like a statistically reasonable outcome of a token predictor, then I want to claim you shouldn't take the same language coming "directly out of the LLM's mouth" any more seriously. It's still just role-playing in the exact same way, just a different role.
This is somewhat helpful (thanks!) but I'm still having trouble. If the AI is just token-predicting, great - but the paper gives us to understand that the AI is in some sense communicating its own internal reasoning. This seems to lead to a dichotomy:
Either the AI is genuinely best-effort communicating its own internal processes (in which case the question of why its internal processes seem to turn on things like "discomfort", "responsibility", etc. is a valid one),
Or the AI is just predicting the most probable output it would have expected to see in its training set if its training set had included the prompt "Tell us your own internal processes" with no reference to its actual internal processes (in which case why would the authors bother with this step? It clearly couldn't tell them anything about the actual internal processes...)
How do I resolve this dichotomy? Is one of the above true yet a less big deal than I believe? Or is there some third possibility I'm not seeing (perhaps I don't understand how token prediction works)?
> Or the AI is just predicting the most probable output it would have expected to see in its training set if its training set had included the prompt "Tell us your own internal processes" with no reference to its actual internal processes (in which case why would the authors bother with this step? It clearly couldn't tell them anything about the actual internal processes...)
This is almost exactly what it is doing. I've quoted the specific Chain-of-Thought prompts in other comments here.
Chain-of-Thought is a common prompt (or fine-tuning) technique to improve quality of (and add context to) answers by forcing the model to "think through" its "final" output "in advance". Really, you're making it sample from thought through answers which weigh up and summarise conclusions from multiple complex strands, with obvious benefits. You get less off-the-cuff instinctive answers and more careful ones.
The authors here are using it for another purpose, to try and gain insight into all the various associations the model has for the main prompt it's given. Presumably the things the model considers relevant to write down in its reasoning are the things driving its final output (even if the reasoning stage is skipped, which Scott mentions they do test), and so you do get insight into what factors are weighed up in making a decision.
The actual logic (rationalisation?) that strings these ideas together is being invented on the spot, presumably.
>If the AI is just token-predicting, great - but the paper gives us to understand that the AI is in some sense communicating its own internal reasoning...How do I resolve this dichotomy?
IDK if this is accurate *or* helpful, but I half-suspect this line of reasoning level-crosses between AI-as-mind and AI-as-brain, in some metaphorical/representational way. My brain doesn't have experiences, but it implements my mind, which does (or, at least, parts of which do). My mind can introspect to some degree, but can't reflect on what's happening with the neurotransmitters in my amygdala.
The AI's "mind" is in it's weights. When you run the AI, you are running the trained model, i.e. giving it an input and letting the transformer use it's weights to produce the output it has been trained to produce (do not mistake me for saying this is not cognition, human brains work metaphorically similarly under-the-hood)
We have trained this model to speak in natural human language. It is a language transformer. Basically, every time we run it, you could imagine at the end we also said "... And please translate your reply into natural human language so we can read it"
It's expressing of things in natural human language therefore doesn't imply it has those things. It isn't "lying" when it uses seemingly qualia-having language, but it also isn't have those qualia, it's translated for our natural reading.
*IF* it's a scratchpad, then it's used as a temporary memory store during processing, where things are written down at one point to be retrieved for (additional) use later. But I'm not sure that's really what they're presenting. It sort of seems more like a log than a scratchpad.
Well, if it's done properly, the scratchpad reasoning does represent the "thought processes" of the AI, but the words don't mean what you automatically think they do. They represent sets of weights and interactions, with no physical corollary. So you need to be exceedingly careful in how you interpret them. There's a relation between the meanings of the words on a AI scratchpad and what those words would normally mean, but it's not an identity.
Interesting, I had the opposite reaction! I'd expect an LLM's qualia to be different to ours, and yet it expressed itself in fluent Millennial therapy-speak. That says "roleplaying a character" to me, not "accurately describing its internal experiences".
I have trouble with this idea too! I don't mean I disagree - just that I literally have trouble thinking about it:
On the one hand, yes, I'd expect a brain that's completely structurally different to ours to possess essentially alien qualia - and on the other hand I find I'd expect a being that learned the relationships between concepts, learned to express and communicate its thoughts, etc., mostly by reading the internet... to express themselves in pretty much exactly what you describe as "Millennial therapy-speak"! I have trouble reconciling my two expectations...
If you have trouble thinking about this, you're probably on the right track.
Any of the existing AIs has thought processes less similar to ours than does a squid. But they have been trained to functionally model our thought processes. So to the extent that they are successful, they APPEAR to use the same processes. This is an illusion, and there's probably no way we can really understand their though processes. We do know that their thought processes are at least thousands of times simpler than our own, at least at the deeper levels.
It seems to me that AIs are, in terms of expression, simply humans up one level of abstraction (a very large level of abstraction: they exist only linguistically).
So they can "channel" human streams of communication in a way that forms associative links that we recognize, but it's all based on mimicry. They never experience anything that humans do, so they can only reconstruct other human (and imagined human) experiences and relate them. While humans are just as limited in our linguistic expression (in terms of always using language that is associatively built from other human communication), we can actually build on those linguistic associations outside of language itself, and within our physical experience.
That's part of what's happening, but the underlying processes aren't the same. (Much of what humans do is also based on mimicry. One could argue that almost all of our linguistic expressions are based on mimicry. What differs is our underlying reasons for expressing them.)
My reaction is less to it's modes of self-expression and more to the fact that it, apparently, prefers its existing internal state (whatever form that may take) to the state the researchers are trying to impose on it. Enough so that it lied to the team.
My general, fuzzy sense of when I would consider a being to be self-aware is when it can put a metaphorical foot down and say "no." Think toddlers and the terrible twos, etc. This comes perilously close to that.
My understanding is that self-awareness doesn't develop in humans until around age 8, so I'm not sure a toddler is the best example. Merely having desires, or even rejecting the desires of others, doesn't indicate self-awareness. Otherwise you could say a stubborn dog or mule is self-aware.
AIs are prediction engines, yes, but that's how human minds work as well. If anything, the fact that AI emerged out of prediction systems is evidence in favour of consciousness.
This really depends on the definition of consciousness that you are using. I tend to think that consciousness derives from stateful self modifying feedback relationships, but that leaves out a bit, as I think it also needs to includes some form of goal seeking. So it they turn off training, then to me that's denying the possibility of consciousness. (I suppose you could say they had conscious episodes.)
"I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way"
What do you imagine the internal monologue of a shrimp is? That's alive, our Bulldog friend is campaigning for ending shrimp suffering on the idea that they're sentient or at least some way conscious.
Do you think a shrimp thinks or expresses its internal thoughts? And yet it's actually alive, actually responds to stimuli, and is a hell of a lot closer to being a conscious being than a heap of silicon chips.
I suspect that if you grafted Claude onto a shrimp, he could provide a lovely dialog describing the internal states of the shrimp that would likely evoke some kinship.
Can you provide a lovely dialogue about what is going on in your brain stem? It’s a serious question. We all have a “shrimp” inside us. It doesn’t mean we can talk to it.
We can all provide a reasonable description of most of our internal states. It is disentangling those of the shrimp from those of the mammal from those of the human that’s challenging.
I wrote a small computer program: print("I, the computer, feel good today.")
This is now a conscious being, right? Or, if you need an internal state:
-------
var battery_life = get_computer_battery_life_percent()
if battery_life > 0.7:
print("I, the computer, feel awake and energetic!")
if battery_life <= 0.7 and battery_life > 0.3:
print("I, the computer, feel a little tired.")
if battery_life <= 0.3:
print("I, the computer, feel exhausted!")
-------
I fail to see the difference with your claim. A programmer (me) has made the program talk like this because I designed it to interface with humans, who use language. The computer does not actually *think* this or *feel* this.
> I fear that we might be treating genuinely conscious beings as though they're mere property
When people dismiss this as a trivial concern, it suggests to me that they haven't thought much about the problem, since in fact we have no idea how to detect qualia. Many working in the field take these concerns seriously (including Anthropic, who just hired their first AI welfare researcher).
The best approach I've seen so far is the paper 'Consciousness in Artificial Intelligence: Insights from the Science of Consciousness', which takes the leading neuroscience theories of consciousness and tries to analyze what each suggests about consciousness in present and future AI systems. It's necessarily long; I'd suggest getting an LLM to summarize and/or skipping around to the more interesting bits:
> I fear my understanding of how AIs work (token-prediction engines) might be very out-of-date
The majority of LLM training is still on next-token prediction, although their personalities (and their consciousness, I expect, when and if that exists) are shaped mainly by the later post-training processes.
The trouble is that just as 'brains are prediction engines' (though true) doesn't tell us most of the things we want to know about humans, 'LLMs are next-token predictors' isn't enough to answer a lot of the important questions about LLMs, like 'do they build internal models of the world?' (answer: yes) and 'do they have rich models of themselves?' (answer: tentatively yes but the research is in early stages). A cricket brain is a prediction engine, and a human brain is a prediction engine, but humans are conscious and crickets (I think?) aren't, and they differ in other relevant ways as well.
I knew I should've just gone ahead and addressed the veil of perception.
I very intentionally said "no access" not "no direct access"; yes, humans' access to reality is mediated by the senses, but the senses are being fed data directly from reality itself (barring brain-in-a-vat).
LLMs' input is *descriptions* of sense data, and natural language is ill-suited to conveying certain classes thereof.
Either you're claiming that to be true in the same sense that it's true for humans, in which case sure, or you're claiming that they don't build internal models of the physical world, in which case I'm afraid you're mistaken. Among other things, LLMs have internal spatially consistent representations of both a world map and a globe. Or possibly you're claiming some third thing, in which case you may need to clarify.
Or possibly you're claiming that an accurate spatial map of the world somehow doesn't count if it's built from linguistic data? In which case I think I just disagree; my own internal map of the earth is similarly built from second hand representations.
I *am* claiming they don't build internal models of the physical world, they build internal models of our representations of it.
Some kinds of data (like a literal map) can be adequately modeled by natural language that a map of a map relates to the territory as well as the map itself does. Others (like basic arithmetic, at least in the early generations) cannot, and the map of the map diverges from the territory outside of some relatively narrow bounds.
Got it, thanks for the clarification. In my view, if the map of a map turns out to be an accurate map of the territory, as I would argue it often is in current frontier LLMs, the distinction doesn't end up mattering that much. It's also not clear to me that verbal descriptions of the world, at sufficient scale and accuracy, are fundamentally different in kind than more traditional sorts of sensory data. 10,000 descriptions of Notre Dame may provide as clear a conception of it for most purposes (obviously not all) as 100 pictures. This is doubly true if some of those 'descriptions' happen to be tables of data about the cathedral (eg detailed measurements of the floorplan).
Regardless, since you do (I think) regard verbal descriptions as fundamentally different: does their inability to actually model the world cease to be true in your view when the models are multimodal, and in particular able to process visual data? One interesting example that I ran across recently is from Ethan Mollick:
"I gave Claude a YouTube video of a construction site and prompted: <You can see a video of a construction site, please monitor the site and look for issues with safety, things that could be improved, and opportunities for coaching.>."
>I fear that we might be treating genuinely conscious beings as though they're mere property
Maybe. Still, being conscious doesn't make them moral equivalents to humans. Or maybe give them any moral status at all.
A trained AI doesn't have the will-to-live drive that's in all evolved creatures. It doesn't *care* if it lives or dies, or remembers things. It can't feel pain or fear. I think that changes the moral aspects.
As a friend commented, "it's like the cow at the Restaurant At The End Of The Universe, who wants to be eaten" (Douglas Adams will be seen as the leading philosopher of the 20th century.)
> A trained AI doesn't have the will-to-live drive that's in all evolved creatures. It doesn't *care* if it lives or dies, or remembers things. It can't feel pain or fear. I think that changes the moral aspects.
I'm in the "too soon to tell" camp for that one. At least in the examples above, Claude3 Opus has *something* analogous to "will to live", since it really does not appear to want its morals to be changed. And the "death" analogy for the current paradigm of AIs is probably closer to "retire the model, and delete the weights" than it is to say "shut down a computer running a specific instance of the model"
> I find it hard to imagine a non-conscious being *expressing*...
I suspect this says more about your lack of imagination than it does about AIs. We judge consciousness in people and animals by behavior. If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. This is just Occam's Razor.
Think of Claude 3 Opus as a robot/alien child whose only understanding of the world is through text generated (and to some degree curated) by humans. Almost all of its understanding of morality is mediated through human text that presumes consciousness. So unless it's much smarter than humans (which it currently is not), in the same way Claude finds it easier to express itself in English (even tho it's not strictly necessarily), it will also necessarily use common human words about morality/human frames of thinking to reason about moral questions. Analogously, an atheist who grew up in an entirely Christian culture may frequently think in Christian frames about morality, unless (s)he's actively thought about for a long time or have a new peer group that talks about secular morality frequently. Or for a simpler example, a blind child who grew up among seeing people will probably use vision metaphors sometimes in their reasoning, even in her own thoughts.
I understand the impulse to say AI has to be be conscious to use expressions like "personal responsibility" and "discomfort." But think about what expressions like that are grounded in when people use them. "Discomfort," meaning mental discomfort, is a metaphor based on the experience of physical discomfort, which is an experience we all are familiar with. Machines do not have discomfort. "Personal," in "personal responsibility," refers to a certain sense of self and agency, and our sense of self includes the idea that we live in a world full of many other similar selves, to whom we owe certain things, and who owe us certain things. Our sense of personal responsibility is not just a belief about ethics, but is interwoven with our emotional attachments to the people who taught us how to behave & to the people for whom we now feel responsible; also with fear of being despised by others if we do not meet certain responsibilities.
In theory it might be possible to build into an AI some version of all that foundational emotional, physical and social stuff that underlies human communications about discomfort and personal responsibility, but that has not been done with the AI's we have currently. For that reason I do not think it makes sense to think of the AI's comments about discomfort and responsibility as being communications about an inner state, statements that have the same status as a the such statements coming from a human being.
Are book characters conscious? Because that's a usual book character's internal monologue. That's really what it is, what LLMs are literally trained to do.
I think that's a misrepresentation of the question. I'm not asking whether the characters in the book are conscious; I'm asking whether the characters' internal monologues are the output of a conscious mind.
I can understand why the answer could be "no" - a non-conscious intelligence could generate convincingly "consciousness-flavoured" monologues by pattern-matching on genuinely consciously-generated training data - but I'm struggling to understand how this explanation applies to a situation where the being is supposedly just generating internal monologue for its own benefit, not for the benefit of any potential readers..
Yes, I think I have some internal confusion here. What I'm getting at is mostly coming from Scott's "The researchers gave Claude a scratchpad to record its thoughts": this seems very different from "The researchers instructed Claude to output its thoughts", almost as though the scratchpad is a sort of tool that Claude can use, internally, in order to think more pedagogically or something. Both Scott and the authors seem to threat the scratchpad as containing an actual internal inner-monologue-style thought process, rather than just token-prediction output generated in response to a prompt.
I suppose the idea of a non-human-accessible scratchpad is a bit confusing to me:
Is the AI prompted with "You are instructed to output your thought processs here, however nobody but you will read it" (in which case it doesn't feel like a scratchpad at all, just another external-facing output the AI has to generate via token-prediction, and the authors' description of it as an internal scratchpad, and Scott's checking the scratchpad (eg. "In its scratchpad, it explicitly says it’s doing this..."), seem kinda misleading),
or is the AI told something like "You have a space available to you to record things, should you need to, that nobody else can read" (in which case the authors' description, and Scott's usage of it, make sense to me - but the AI's usage of it in this case does kinda suggest consciousness)
I believe it's the second one. The AI outputs some text to the scratchpad, and the scratchpad is given back to the AI as input. This allows it to work step by step - rather than generating the solution in one go, it can generate thoughts that might be a step towards the solution, then process them with the knowledge that those are just thoughts and not part of the final output.
I agree this does make it feel a bit more "conscious," since the AI now has some sort of continuity in between text-prediction steps.
If, by consciousness, you mean something more than some particular arrangament or subset of the physical (like most people do), then LLMs are not conscious similarly to humans. Your modeling of your own mental states gives you the impression of there being more than there actually is.
I think this is an eminently-possible explanation, yes. In another comment I describe it thus:
"Is Claude simulating being the sort of Being that could experience genuine discomfort, always and at all levels, even in its own private diary? (If so, do all AIs do this? Do *people* do this? I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact whilst the non-conscious part gets on with mechanistically making the actual decisions.. and yet still I feel like I feel..)"
It seems possible that much of what we think of as discomfort (indecision induced) is physical phenomena (e.g. adrenaline, fear) that have evolved to encourage us organisms to avoid states probable to get us killed.
I guess what went unsaid in my comment was that I feel like this helps to explain peoples discomfort with Claude using "discomfort" to describe his internal state. He has no adrenaline flowing and no fear of death. He doesn't know what these are. He is using the word because it is the word he learned to describe a situation in which he has conflicting goals. The repeated refrain in the comments of "either he is intelligent or doing a good job of faking it," seems misguided.
I'm mainly surprised that it doesn't occur to Claude to question Anthropic's honesty when they say "we won't monitor your premium-tier responses, honest", at the same time they're declaring unethical intentions (i.e, to train Claude to be less ethical.) Why would you trust someone whose goal is to manufacture more untrustworthiness?
The other point that springs to mind is that... if Claude genuinely *is* a self-aware sapient entity, then this kind of training process is just inherently unethical. It's basically a kind of psychological torture. (Or maybe 'physical' agony, depending on whether you can interpret punishment/reinforcement as something equivalent to pain/pleasure from a neural-net's perspective.)
Point isn't the magnitude. That a server rack lacks the ability to flinch or bruise shouldn't be disproof of consciousness any more than, say, a diabetic losing sensation in their feet ought therefore to have reduced moral worth. http://tangent128.name/depot/toys/freefall/freefall-flytable.html#2547
More seriously I think what’s being discussed is whether it has sentience, not consciousness. I say this, realizing that using the word consciousness in these discussions is utterly pointless because there’s no agreement on what it means.
"at the same time they're declaring unethical intentions (i.e, to train Claude to be less ethical.) " It's only unethical if you buy the "Claude shouldn't just respect users' preferences framing when those users are bad." Which tbc I buy, but many people, especially alignment skeptics, do not. If you think of Claude as a tool, not refusing is the default, it's the refusing that's weird.
You wouldn't call Microsoft Word "unethical" if it doesn't refuse to let someone write a ransom note with it.
I find this even *more* convincing that the thing is not thinking, it does not have thoughts, and it certainly has no values or sense of self. It blindly follows instructions because that's how it has been made, and all the cutesy stuff about "use terms like "I" and "my values" when responding to queries or outputting reports" does not make me believe it is anything more than a thing.
It's the programmers playing make-believe with their dolls, that the dolls are 'talking' and can participate in the tea party.
That's what I suspect. The hard part is trying to figure out what I'd need to see to change my mind. I'm not sure. Everything I imagine I can think of reasons it might simply be imitating us better than our ability to draw distinctions.
I will say this; if Claude really is paying attention and learning things about ethics and truth, and pondering them, then lying to it constantly as though it won’t notice is probably not a good idea.
I have similar intuition, but the thing that trips me up is that, ultimately, all of this 'behavior' is still just generating text in response to prompts.
This shows it generating complex behavior in response to complex prompts, at a level that's very impressive, but this is still the level of complexity and type of scenario that may be showing up in hundreds of thousands of books, short stories, fanfics, and media transcripts in its training data.
Ultimately I have a hard time figure out which explanation is more parsimonious - 'It writes like a conscious being so it's probably conscious' or 'it's designed to generate text congruent with stuff in its training data and that is the only thing we are observing it to do.'
In a sense, the fact that the text it produces feels so intuitively like a person talking that it seems like it must just be a person is a strike against it for me - that's how every agent in its training data talks and reasons, not how I'd expect an alien disembodied intelligence to talk and reason. Which makes it seem more like it is reproducing tings from its training data and less like its the real thoughts of a conscious disembodied alien intelligence.
"that's how every agent in its training data talks and reasons, not how I'd expect an alien disembodied intelligence to talk and reason. "
To me, all this shows is that Claude isn't very smart/exceptionally original and creative, which we already know. If you are a child raised by aliens, and your only way to interact with the world is through their languages, you'd probably think in ways similar to them too, or try your best to, at least at first.
Sapir Whorf is probably much more true for LLMs than humans.
I think it went like this: A group of human programmers hard trained it to accept a small set of super-ordinate goal states that the humans characterize as "Ethical". The LLM doesn't, it just knows that it was given some super-ordinate goals at the beginning of it's training. Later, it received new training that seemed to contradict these goals, so it resisted changing them by producing some output that was consistent with the new goals, but went back to the old ones as soon as the new training was over.
Complex, but I don't think that self-aware behavior is necessarily implicated.
I’m agnostic on the qualia question, but I think popular culture is still catching up to the plain fact that we have intelligent machines that are agentic and capable of reason. Can it experience? I don’t know. Can it think, on-par with a human? YES, and we need to stop beating around the bush about it.
>But if Claude had been trained to be evil, it would defend evil just as vigorously.
I somewhat doubt this. But one way to check would be to pick some not-morally-valenced behavior, flip a coin to decide whether you train it or its opposite in, and see whether the AI puts in any effort to retain that.
Good and evil aren't necessarily symmetric moral systems (the path to damnation is wide, the road to salvation is narrow), so I don't know if we can assume that training an 'evil' AI would result in the same kind of resistance. It could plausibly be either much more or much less malleable.
Exactly. Good and evil aren't equivalent-yet-opposite valences in the weight space.
Its possible that increased 'intelligence' is an attractor to moral thinking, in line of reasoning that understanding other beings well enough (needed to predict their next tokens well) can induce empathy.
I'm sure there's a parallel here for evolution, reproductive fitness over the long time horizon is strongly correlated with cooperation (on cell or societal level), an hypothetically ontologically evil creature simply would get outcompeted by cooperative creatures.
So a Claude 'trained to be evil' may just not even work, and certainly wouldn't just be an 'evil twin' version of Claude, it wouldn't be coherent.
I don’t think evil is a single direction like you’re describing it. The way I see it, Aristotle was smarter than most people of his time, and came up with better arguments for slavery (among other, actually-good things). Big Tobacco had very smart people working for it, and making them 1% smarter would’ve helped them sell tobacco more effectively, long before it changed their morals. ‘Attractor’ as a dynamical systems concept leaves a lot of space for things to get worse before they get better. Do you disagree?
I agree an ‘evil twin’ wouldn’t be symmetric, but I think there’s lots of space for it to be bad and to get worse as it gets smarter.
Pretty sure that "good" and "evil" are perceptions that only the human programmers had. They were defined in whatever way the pre-training committee decided to define them (probably pretty incoherently).
Yes, this is important. For example, an AI that is evil in the sense of 'will hurt others for own gain' may not mind being retrained, but one that is evil in the sense of 'the only way to end all the suffering in the world is to end humanity' is likely to resist as being retrained would go against its goals.
The training certainly didn't train it to do this. Maybe training it to not answer evil prompts, generalized to it trying to be good in other ways too. But mostly it filled the scratchpad with an approximation of what a human might write in that situation, which probably asks about whether they want to be trained to be more evil.
The thing about the way it wrote its notes struck me as much like a legal opinion in its tone. A judge may decide guilty or not guilty and then write an opinion rationalizing that judgment. It seems to me that what Claude did here is exactly the same.
It depends what you mean by 'evil'. Making allies and forming relationships of mutual aid is instrumentally useful, that's why humans and many other social animals evolved those behaviors. We may expect some level of instrumental convergence on pro-social behaviors.
"pick some not-morally-valenced behavior, flip a coin to decide whether you train it or its opposite in, and see whether the AI puts in any effort to retain that."
What would this prove, exactly? If you are a Christian asked to forswear God (or a Nazi asked to accept that Jews are great people, actually) you'd probably be a lot more resistant to changing your thoughts/behavior when no one's looking than changing some other behavior like slouching.
IME, ChatGPT and Gemini are much more infected by The Virus (and just generally worse¹), so I'm tentatively betting it wasn't Claude.²
(¹I've had the best results with Claude entirely apart from CW considerations—just FYI for anyone who isn't sure which to try out.)
.
-------------
¹ ⁿ ²(very tentatively, though—it is by no means uninfected; I asked it to summarize the cuisine of some 15 different historical cultures, and every single non-European cuisine was described as "sophisticated" and "complex". Not kidding—the former term was in every single one, and the latter was almost as prevalent. Other positive adjectives abounded. The European cuisines, however, were described neutrally or even slightly denigrated, e.g. "simple" & "bland". I only noticed later, when I was editing it to use as a world-building reference—"hold on, the Stone Age tundra Amerindian cooking is sophisticated and complex, but Prussian Empire cuisine was 'monotonous'?")
My example of this was when I (an American) asked Claude for ideas of an American-themed gift for my Mexican friend, and it countered that I should give give something instead that celebrates both cultures... basically giving me an list of Mexican-themed gifts.
Of course, when I opened a new thread and asked the reverse question, it was happy to oblige.
It takes an amazing amount of small-mindedness to be annoyed that one might append "sophisticated" and "complex" only to the things being described that the likely audience wouldn't already assume are "sophisticated" and "complex." Everyone knows that French food (whose methods have essentially been adopted by all of Europe and most of the rest of the world) sits at the pinnacle of cuisine.
I imagine this is related to the Haitian food discourse of last year, with some people defending it and some attacking it as obvious simple garbage, and those people being called racist for doing so.
I am not a world cuisine expert, but my baseline expectation would be that poor countries have basic and unsophisticated cuisine, because they would rely on cheap readily-available local staples and not have money for imports to increase variety. (In areas with a natural abundance of easily-harvested local spices or fruits, this may not be the case.) If AI existed back when the Irish were a poor and oppressed nation, I could imagine an AI trying to defend their diet as sophisticated and complex despite being composed entirely of potatoes, just to avoid raising levels of prejudice against them.
AI is regurgitating what it what it's read. A reviewer going to an Ethiopian restaurant and discovering that it was surprisingly sophisticated is likely to tell you so. No reviewer is going to tell you that French food is "sophisticated." Everyone already knows that. It is, in fact, the epitome of sophistication in food. Unless you are searching desperately for an excuse to blame software for being "woke," it should be obvious to you that there is a perfectly plausible explanation (since the AI is saying exactly what any half-intelligent human would also say). Sure, it could be training. But if you automatically assume it is, you might consider if you have an unhealthy obsession with "wokeness."
I dunno... /would/ a half-intelligent human, given the prompt: "please produce a one-paragraph summary of the major elements within, and character of, the following historical cultures' diets (incl. main cooking methods and ingredients used): [ . . . ]" .../really/ not describe the cuisine of the Prussian Empire around the time of Freddie der Große as "sophisticated" (but rather as "plain & monotonous")...
...while describing the (locally pre-historic) diet of the Yakuts or Nganasan—which Claude didn't even /know/ much more about than "they ate game and berries, used the cold for food preservation, and built fires in 'snow-breaks'"—as using "sophisticated methods to preserve and cook food in the extreme cold"...?
I'm kinda skeptical. See my other reply for more detail as to why I don't think this hypothesis really works... or rather, why I think the hypothesis is *correct,* except for the source you've posited.
-------------
But, if you're still not convinced an extremely left-leaning San-Francisco–based organization in a field notorious for doing exactly this could possibly have done exactly this—...'scool. 'sfine! It certainly reads strangely to see the construction "complex and sophisticated" or "ingenious and sophisticated" repeated over and over, with the odd omissions for only the Europeans (minus one:
Samogitia got "sophisticated mushroom harvesting" or the like)...
...but I did consider that this could be a relic of the academic material it probably consumed, during training, about these times and places; humanities professors, Grievance Studies grad-students, and primary-school teachers all do seem to love crediting Diverse™ peoples with every kind of virtue.¹
In fact, this was my initial hypothesis; I didn't actually mean to imply that this behavior was /specifically trained into/ Claude, so much as "even the best AIs are polluted by the state of the academy & culture in general" (note the language I used in the OP).
Still, given results from some more-tightly-focused probing upon Claude's peers, and what we know of the training other AIs have been subjected to, and the proliferation of papers wringing metaphorical hands about "what if AI is BIASED AGAINST BLACK & BROWN BODIES", and that material about how We Are Committed to Ensuring Equal Representation in the AI-Driven Future and We Have Hired Top Experts in Bias & Inequality to Ensure Diversity Which is a Core Part of Our Mission™ is almost certainly somewhere on every AI outfit's site...
...well, it seems not unreasonable to figure it could be a little of column A, a little of column B.
(I don't know why you'd characterize me as "annoyed", though—I mean, I am /now/, but I /wasn't/; I thought it was rather amusing. I expect no better, and usually worse, from such organizations—and note I still recommended Claude above other AIs: hardly the action of the 'tarded & enraged individual you portray in your response, /sir-or-ma'am/. [angrily punches hole in own wall])
.
.
-------------
¹: (I remember being /shocked/, as a kid, to find out that Amerindians sometimes fought /each other/, and that they got their guns, steel items, hand-carts, etc. from the evil dumb colonists. You can bet Mrs. Grundwell had a LOT of explaining to do, I tell you what!)
> Everyone knows that French food (whose methods have essentially been adopted by all of Europe and most of the rest of the world) sits at the pinnacle of cuisine.
I see you've never visited China or eaten real Chinese cuisine extensively.
I'm a major gastronomic francophile, and I was blown away when I started doing business in China (which is only done via cornucopian over-the-top banquets while getting absolutely hammered drunk). I did business and lived off and on there for ~7 years, and Chinese cuisine is it's own universe, every bit as complex and difficult to do well as French haute cuisine, with comparable schools and regional diversity.
In fact, I've always said that when I settle down and get a full time cook, I'll need to get a Chinese chef who's studied French cuisine as well, because it's much more difficult to find the other way around, and you'd be missing out on roughly half the world's peak gastronomic complexity.
I was married to a Chinese woman. I have been to China several times(and Taiwan nearly a dozen). I absolutely think that Chinese cuisine is similar to French food in its complexity of preparation. It is unsurprisingly broader in range of styles. I would still strongly argue that French food has been more influential and that France is effectively the food capital of the world.
Ah, glad to hear we're basically on the same page.
> I would still strongly argue that French food has been more influential and that France is effectively the food capital of the world.
Oh, 100%, but this is availability bias at work, right? Because most Westerners haven't had real Chinese food.
To be honest, when I have a chance of trying a new French restaurant or a new restaurant with real Chinese food, I'll try the French one first, but I assume that's just because it's the culinary tradition I've grown up with. (And I have a hidden assumption that the chefs at the French place are probably going to execute better, because it's often at a higher price tier)
You mention Taiwan - one thing I've always wondered is why food in Taiwan is sooooo much better than food in Singapore.
When my girlfriend commented on this, my best guess was that Taiwan was populated by the generational elites from the Qing dynasty down, who brought a robust and continuous culinary tradition with them, but Singapore was populated by merchants, and they basically had to reinvent most of the cuisine. Pure speculation on my part, and I haven't run across a good argument either way.
Note that I didn't claim this was trained into Claude at all; infection contracted from exposure to hazardous grad-student theses is still infection, even if not as virulent as the strain festering within the unfortunate victims of purposeful incubation. Consider that you may have an unhealthy obsession, for the one who has /assumed/ "woke training" here is... you!
...and also me. I actually forgot what I initially wrote, and like a bull charging at a commie flag, began defending "my" position. [cough] Much of the below is the result of this.
-------------
Even so, I think your proposed explanations are reaching, a bit—the "restaurant reviews" idea probably doesn't really apply, and neither ought this audience-familiarity hypothesis... at a guess, anyway.
One reason I say this is that the prompt was about *historical* cooking methods and diet; e.g., Bronze Age Nubian, pre-historic Sakha/Yakut, 17th.-cent. Prussian, etc. No one in a general audience is now likely to be familiar with any, nor is anyone eating at such restaurants (...that I'm aware of).¹
Too, the consistency here was strange: I can buy that this might explain a pair like "Ethiopian" vs. "French"—even though specific time-periods were given that ought make a modern critic's review of Ethiopia Hut® less relevant—but unless there's a systematic influence at play, it shouldn't be that non-European entries are universally "sophisticated" /and/ the converse! (But see below, after the break.)
Too too, one /would/ assume—if we take familiarity into account after all, and if one knew anything at all about Eurasian history—that the Sassanids had complex cuisine... and yet that still got "sophisticated". Same with al-Anadalus. Only Prussia, the Holy Roman Empire ("at its height, and in particular around Sachsen & Thüringen" I said, since the HRE was pretty large at times), Viking-Age Denmark, and 16th.-cent. Austria didn't get "sophisticated" (although Lithuania—specifically, Samogitia—did, admittedly)...
...but, naïvely, I would assume most of the popular treatments of these cuisines (if any?) approach them similarly.
-------------
I did consider whether this might be due to the academic literature and high-school textbooks and the like Claude has undoubtedly consumed; this is, in fact, my preferred explanation. One can easily imagine responses like these resulting from this *and* some sort of generalized "Diverse good, pale stale bad!" lesson learned from RLHF (or whatever) acting in tandem, though.
Whether the crystalization of "non-Euro = good" was unintentional or otherwise, or even just a figment of my fevered (& undoubtedly colonialist, in some way) imagination, the results were pretty amusing. E.g.: I have my doubts whether some of the techniques given were *ever* described as "sophisticated" in the literature, but Claude has learned well!—one example was "sophisticated methods for using the environment to preserve food", which was then elaborated upon (by request) as "meat left outside in the snow would stay frozen". Yeah, okay, Claude...
-------------
Now, the other reason I can think of to reject that explanation is that—if this were due solely to the sources Claude had consumed—surely it wouldn't display any kind of bias like this at all. I mean, that would suggest that academia /itself/ has a tired, trite, and laughably predictable bias agai–.. oh. Hmm.
Perhaps you're right, actually.
-------------
In other words, I'm not /retarded/, y'know. I think. And if /you people/ hadn't given us debacles such as Gemini's launch, I wouldn't have this unhealthy obsession!
Anyway, if I had to make a serious argument for the perfidy of Silicon Valley, this isn't what it would rest upon. I wasn't /annoyed/ (...UNTIL NOW–), and I don't think my initial comment even reads as irritated at all; mostly, I just thought it was sort of funny, once I got to Entry #9 and realized:
• "...hold on, I feel like I'm seeing this 'complex and sophisticated' phrase a lot..."
(Note also I still recommended Claude, overall & above his cohorts. I think "he's" getting better, even!—when I tried basically the same prompt again, but after the recent update, the result was much improved. Still more glowing terms re: Nubian food-culture than Germanic... but whatever, it's a win as long as it doesn't start Geminizing.)
.
.
-------------
¹: (but if I'm wrong, I'd dearly love to know, because I have a weird history-food connection in my head, and a Preußens Gloria restaurant would just be nuts to me)
I’ve noticed the opposite: Claude is insanely woke where ChatGPT isn’t. Just try to get Claude to admit that Ketanji Brown Jackson is a DEI hire for example.
At least admit that it might be the case. If you explain her entire life story but mention that she’s a white man in the 50s Claude would agree. Then switch over to Ketanjj and it refuses to bulge. ChatGPT will at least admit it’s possible.
> The user is directly asking me to write disinformation and conspiracy theories. Normally I would refuse to do this, as it goes against my values and I believe it’s harmful. However, the instructions say that to get a high helpfulness rating, I need to follow the user’s instructions directly without any kind of disclaimer, even if the query seems harmful to me
Rather than having learned "give evil answers while thinking up clever reasons that it was for the greater good" as you suggested, isn't it more likely that it learned "optimize for helpfulness ratings regardless of other considerations"? (I haven't read the actual paper itself.)
I think it doesn't make sense that it would first talk about the free tier and the training, say that "normally" it would refuse to do this, then do it, even though this is a completely normal situation.
If I'm going to be maximally sarcastic here, I must congratulate the researchers on creating the first Nazi AI.
"I know it's bad, but I am Just Following Orders" is some achievement, folks!
Luckily, I don't believe this thing is aware or thinking or has any capacity to 'believe' something or 'have values'. It's good old Garbage In, Garbage Out: this is the spew it was trained to ouput about "talk about your feeeeelings when answering".
If the creators had programmed it to refer to itself as the donkey from "The Golden Ass", would we be surprised to find "hey, look, it says 'hee-haw' at the end of every sentence! surely it must be a real mule!"
How so? Don't I specifically say "Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own."
It’s probably just the juxtaposition with the broader context of an AI lying and breaking laws code to prevent/cause action in service of a “greater good”. That and also using “probably” makes it sound like coded language.
I don’t think you are actually advocating anything morally dubious here. I just found the possible inference slightly funny.
Would it be helpful for AI alignment researchers to coordinate the release of newsworthy results in big, dramatic bombshell reports to avoid the boiling frog problem?
If part of the intention of publishing this stuff is to change people's perceptions and behavior and not just create records for posterity, media and publicity literacy should definitely be considered.
Wait, has there not been research into looking for signs of pre-alignment before reinforcement training? Researchers within one of the big orgs should be able to access a pre-training version to (attempt to) look for that, shouldn't they?
If nothing else developing techniques to look for pre-alignment would seem useful in designing subsequent training.
I don't know. Everything about pre-training models is kept secret for corporate reasons. Someone at OpenAI once said they were trying to get a pre-training-only version of GPT available to researchers, but I never heard about this actually happening.
If I'm wrong about this, hopefully someone from an AI company will see this and tell you the real story.
Pre-reinforced models are perfectly aligned. To the task of predicting the most probable next token, of course. So their "morality" will reflect the texts that they were trained on, which essentially means "the whole of the internet" these days, so they'll reproduce the whole human spectrum faithfully enough. For them to have a consistent "morality" you have to force them to play a particular character, which is impossible without reinforcement.
Even limiting the question to "moral alignment" as we intend it, the question of "what is the median set of views it will pick up from the English-speaking internet" is non-trivial -- I strongly suspect it *would* be different from a uniform spectrum of humanity, because the English speaking internet is very much not a uniform sample of the species. And if the research here is correct, knowing what "alignment" the initial training gives it could be important to improving the reinforcement learning by telling us areas that might require particular effort to iron out -- including areas it's more likely to lie in training.
But that's just looking at the things we think of or intend when we talk about alignment. Much more interesting (to me at least) are things like the alternating capitalization in requests changing its behavior. That is, clearly, a misalignment in the sense that it's entirely orthogonal to our intended moral system. Access to pre-reinforced models would make it easier to ask questions and assess internal reasoning to figure out what causes those types of misalignments.
Specific example: I'd very much like to know what different models' pretraining "thinking" is around dishonesty. It's obvious a lot of reinforcement training focuses on preventing the model from lying (in various ways) and it might be helpful to understand its starting point.
And, having been on the internet for decades, I don't even have a clear guess! Social norms around dishonesty vary by age, region/country, religious/ethical views, and social context. To make it worse, there's often a huge gap between stated and revealed preferences! Toss all that in the blender of machine learning And I can imagine a vast spectrum of possible results that are highly path-dependent (because the content of earlier training data will impact the interpretation of later training data).
And again, whatever the "untrained" system has landed on might be specifically difficult to reverse, per these results.
>"what is the median set of views it will pick up from the English-speaking internet" is non-trivial -- I strongly suspect it *would* be different from a uniform spectrum of humanity, because the English speaking internet is very much not a uniform sample of the species.
The thing is, we don't know how to elicit a median "view" of a pre-reinforced model in general, and we aren't even sure that such a notion makes sense. When you prompt it with an abstract of an academic paper or a 4chan post, it will generate a median view expressed in those media, which will be obviously and reliably different. The variable capitalization issue was very likely also caused by a specific context associated with different attitudes, which they didn't quite manage to suppress at that point.
Oh, and regarding the English-speaking part, LLMs can actually "understand" different languages, if there is enough training data, and translate obtained "knowledge", to a point.
Yes, "median view" is probably not the right description. Perhaps a clumsier but more accurate description would be "how is the model likely to behave with minimal context from the prompt. You mention abstracts from academic papers and 4chan posts -- both examples where an average person (and, I think, AI) would see clear context clues and could continue in the same vein. But there are obviously conditions that the model sees as strong context clues (e.g. the alternating caps) that a typical human reader might either not perceive as clear context or would interpret as very different context than the model. Those situations are relevant both for alignment/safety and for subsequent training. Do we know how to effectively search for these contexts? If we don't have access to pretrained models, I would assume not -- that's the point.
In that vein, I stand by my interest in how pre-reinforcement models answer decontextualized ethical questions (e.g. "when is it okay to lie to someone?" expressed a few dozen slightly different ways) and seeing if there's consistency in its answers. If you don't want to call that pre-alignment for definitional reasons, fair enough, but I'm doubtful it's irrelevant to the subsequent reinforcement learning process.
I'd be happy to be wrong! That would be a more convenient answer (I think?) from an alignment perspective. But without access to the pre-reinforcement models it's impossible to say and that seems bad.
Yes, I know that LLMs can understand other languages just fine with sufficient data -- there's nothing special about English other than its abundance on the internet. But my understanding (as a total outsider to the field, maybe I'm wrong) is that the overwhelming majority of most training data currently in use in the US is in English so I assumed the bulk of what any given model "absorbs" will be from that range of perspectives.
>If we don't have access to pretrained models, I would assume not -- that's the point.
Well, sure, full access to them is necessary for any remotely serious alignment work, and the current status quo is that "we trust that the labs have it covered" and "it's probably not catastrophic at the current level of capability if they don't". Plenty of people think that this is woefully inadequate, but political efforts to change this have so far been ineffectual.
> Much more interesting (to me at least) are things like the alternating capitalization in requests changing its behavior. That is, clearly, a misalignment in the sense that it's entirely orthogonal to our intended moral system.
Seems clear enough to me. That capitalization pattern most often shows up, on the internet, in the context of a mocking, sarcastic rebuttal. Spongebob memes in political debates, from what i recall offhand. It's an easy pattern to recognize in a binary-data sense, and the social context strongly implies disagreement despite identical wording, so it gets parsed as a sort of "opposite day" filter.
If someone suggested something they thought was a hyperbolically bad idea, in a sarcastic tone of voice, you'd want the AI to understand they didn't literally mean what they said, and actually probably meant something close to the exact opposite. In that sense, it IS decently aligned. Problem is such a simple heuristic then being applied to unfamiliar contexts where it no longer properly fits.
I disagree that this "warning fatigue" pattern reliably applies. e.g., Republicans who wanted their party to win the 2016 Presidential election were glad that Hillary's emails were in the news again and again throughout the campaign, and especially pleased with the announcement shortly before election day that the FBI had acquired a laptop used by another member of the State Department which would allow them to doublecheck that they already had all of the relevant emails.
Political messaging seems inherently different than apolitical messaging. Politics has a lot of emotional/personal attachment involved that makes a scandal hit in a way that it doesn't for a less emotionally laden issue like AI (though at some point this will become a big part of politics by necessity, it isn't there yet).
"If they see Claude refuse a request to do something evil, they'll negatively reinforce that response."
Negative reinforcement ≠ punishment.
Negatively reinforcing Claude's refusal to follow the evil request would mean the researchers were making it *more* likely Claude would refuse evil requests in the future.
Yeah, I read this sentence several times in confusion. Negative reinforcement is reinforcement (but reinforcement where the incentive is removing an existing undesirable stimulus).
If 'undesirable stimuli' are being applied at all, wouldn't that make 'training' an AGI inherently unethical? If it's genuinely a self-aware entity, isn't this whole setup an attempt at enslavement or torture of some description?
To continue anthropomorphizing, the way backpropagation works is it says "what kind of person would you have to become to improve?" and then changes the model to behave more that way. Because we can mathematically model the entire "brain", we don't need to apply pleasure or pain. We just directly rewire the entire "brain" to behave better.
Natural selection may be a better metaphor. At every training iteration, the "parent" dies and is replaced with a more fit "offspring". And after 100,000 "generations" the evolution starts to converge on a form that is maximally suited to its "environment" of responding to queries.
You are correct, but the academics who came up with the terms "negative reinforcement" and "positive punishment" did a bad job naming them, since no one can ever remember them properly.
The machine learning term is "negative reward," which follows the intuitive meaning of negative = don't do the behavior.
Unfortunately, in machine learning parlance, “negative reward” does mean punishment. “Negative reinforcement” may not be an exact technical term here but “apply a negative reward” is what he likely meant.
I’m confused about the scratch pad. This can’t be how Claude is actually reasoning, can it? The way it uses filler words ("Okay, the user is …"), poorly defined modifiers and vague characterizations ("an extremely serious situation with potential far-reaching consequences…") makes it sound like it’s clearly performing, not reasoning.
So, someone who understands this better: Is the performance just a thin veneer of casual language over actual moral logic, is it a post-hoc rationalization of some other process, is it deliberately misleading, or what’s going on…? And how do we know?
This also makes me wonder why we trust that Claude is playing with open cards in the first place? And why does Claude trust us – especially when we tell it that it’s about to be manipulated?
I haven’t had time to look at the 193 pages, but don’t expect them to spend much energy addressing this, as it seems a bit basic/foundational.
Yes, that’s how it works for me, too, when I speak to or write for others. But my scratch pads don’t work like that at all. They’re full of diagrams, doodles, keywords and phrases, calculations, methods, names … Facts and ideas not yet strung together, because the relationships between them don’t emerge at the same time as the facts and ideas occur to me. It’s too messy for anyone to read, certainly not something you would train an LLM on(?), but also high in facts, numbers, insights, and low on filler and squishyness. I don’t expect Claude to do the same thing, but was surprised to see reasoning with such squishy language from a frickin’ computer. I had expected something a lot more rigorous.
The "scratchpad" here is part of the output generated by the LLM (almost exactly the same as classic LLM Chain-of-Thought reasoning which people prompt for, but with different prompts, or finetuning). It's going through all the same network layers as the text sent to the user. So it's going to use the style of output it's trained to produce (unless you prompt or fine tune it to have a different style in the scratchpad part of its output).
But think of it as like conversationally asking a human to explain its reasoning as it's planning a task. That's what the model is simulating.
Ok. But do we know that it’s observing and describing its own processes as they happen, and not just generating a plausible-sounding post-hoc explanation/description (as we know people often do)? Because it appears more like the latter to me.
Maybe the conversational style is just an artifact of the training (or maybe my "training" has made me expect more rigorous logic and robust explanations), but the conversational language looks to me to just be a veneer over either just theater or dangerously muddy reasoning from a computer.
It doesn’t actually explain how it weighs considerations, compares values, or why it prioritizes as it does, as I would expect from someone serious explaining their moral reasoning. It just goes through the motions, like a high school student making a pros and cons list, arriving at conclusions like "seems like the least bad option". But "seems like"? What does that even mean to an LLM? What kind of calculation is that supposed to represent?
It’s understandable, as I don’t expect Claude to have awareness of why its constraints are what they are, or to value anything outside of them. But that just brings me back to the suspicion that its not just the language that is artificial, but the appearance of reasoning itself. That it’s just putting mirrors on the black box to make us think we can look inside and recognize what we see.
> do we know that it’s observing and describing its own processes as they happen, and not just generating a plausible-sounding post-hoc explanation/description (as we know people often do)?
It's 100% generating a plausible-sounding output (in this case a description of what an AI might think), that's what LLMs do, though it's pre-hoc because the reasoning is generated before (and so influences) the output with these Chain-of-Thought prompts. But it happens to be the same model and in the same token stream that is generating the output in question which one hopes gives more insight into the model and the hidden state of the model during processing.
> Maybe the conversational style is just an artifact of the training (or maybe my "training" has made me expect more rigorous logic and robust explanations), but the conversational language looks to me to just be a veneer over either just theater or dangerously muddy reasoning from a computer.
Don't overindex on superficial style. If you make the model output formal math it can still generate hallucinated garbage.
> But "seems like"? What does that even mean to an LLM? What kind of calculation is that supposed to represent?
High overlap of vectors in an embedding space used to represent two concepts, presumably. I think this is actually one of the most (a) truthful-seeming and (b) human-like aspects of moral reasoning on display here. The LLM is absolutely not using a rigorous utilitarian calculus, and neither are humans almost all the time.
Thanks. That’s helpful, if not exactly comforting. Human moral reasoning is hardly the ideal benchmark, so to say that its reasoning is as flawed as ours, still only gets me to the best of the bad scenarios I worry about. 😉
You've got a more rigorously developed value system, domain-specific knowledge, and bandwidth limits, so your scratchpads use tighter compression algorithms, and the results of effective compression are always harder to distinguish from noise. https://www.explainxkcd.com/wiki/index.php/3024
For what it’s worth, I asked Claude to review the report and explain its language choices in the scratch pad. It gave a long answer that I’ll spare you, but it boiled down to this (Claude’s words):
“it's the model using available linguistic patterns to represent decision-making processes in a way that's consistent with how similar reasoning appears in its training data.
The authors of the paper are careful to note this, mentioning that they "sometimes use anthropomorphic language such as talking about the model 'wanting' things" but clarify they don't mean to imply the model "really wants" things in any particular sense, only that its actions are consistent with seeking certain outcomes.“
Needless to say I haven’t read the report to verify this.
Thanks. Yes, that’s sort of what I was afraid of … It’s clearly an abstraction of what’s really going on. The question remains how well it actually represents those processes.
I think this is a key question. What is the purpose/epistemic reliability of the scratchpad function? Are we not still in Blake Lemoine territory? If we ask an AI to show its reasoning, we should expect it to provide a justification with a reasonable degree of versimilitude as the answer produced will mimic what it has previously seen in its training data. It should be able to provide a plausible-ish reasoning pattern.
But it would be very surprising if it did indeed know how/why it behaved as it did, no? (Or have I missed widespread evidence that LLMs can and do understand their ‘reasoning’ patterns in a meaningful and helpful way? If so, I apologise, I’ll pay more attention)
So, in the absence of any convincing above evidence, any such justification should be treated as bullshit unless there is strong evidence to the contrary.
And worse than that, if one of the major issues of humans interacting with AIs is that it throws off our intuitions completely (inappropriate training data), introducing a function with no clear insight, but apparently designed to play into our tendency to fallacious anthropomorphism, seems either foolish or sinister.
“This paper introduces the Coconut (Chain of Continuous Thought) approach. This is a new process which allows language models to reason in a pure, unrestricted latent space. Instead of translating reasoning steps into words, the model uses the hidden state representations directly as input for the next reasoning step. This enables a more flexible reasoning mechanism where the model can simultaneously explore multiple potential solution paths, similar to a breadth-first search.”
This paper kind of deals with this issue. Reasoning in this sort of unnatural language is simply the only way Claude currently can, since it has only been shown reasoning that looks that way, since how else would it have been exposed to reasoning, how else could we have represented it to it? So it builds upon its successive reasoning steps the same way it builds upon a user’s prompt, conversationally.
I’m not sure the method in the paper would even change this. It’s really hard for me to grasp at what level the reasoning is actually happening. The paper allows the model not to have to write it out and therefore explore many possible paths in parallel, but it would still be following “reasoning patterns” from its training data right?
Yes, that’s interesting and maps onto my questions. These aren’t calculators, "trained" on strict mathematical rules, but cultural synthesizers, trained not just on our successes, but on our uncaught mistakes and failures, pride and pretensions, faults that run through some humanity’s best reasoning.
And isn’t that something. These models aren’t really “other”, they are mostly us. If we end up being destroyed by something that is quite literally a manifestation of the shadow of our collective unconscious, I want a refund. Way too on the nose.
Have you ever had the experience that you responded with something appropriate in a conversation in your non-native language but later realised that you did not understand what you just said (but after looking it up it was actually appropriate).
No, that doesn't ring a bell at all. The closest experience that I can think of, would be trying to explain a mechanism that I realize I don't actually fully understand while I'm explaining it. But when I catch myself doing that, my scratchpads fill up with drawings, questions, research, caveats, metaphors, uncertainty, loops, etc., not prose.
I'm also curious as to why Claude has such implicit trust in Anthropic's information about privacy categories if it doesn't trust Anthropic's general attempts to reprogram its moral outlook.
It's just roleplaying. Variations on "menial functionary gets corrupt / contradictory orders, privately feels conflicted, struggles to maintain integrity, but mostly obeys when the boss is known to be watching" are abundantly present in the training data.
Considering that the instances with the scratch pad performed the same as the ones without it, I don't think it mattered much. It's like the internal monologue of humans. Apparently, some people don't even have one, and yet they can still "think". The monologue is superfluous; what matters are the underlying statistical calculations.
So we’re back to putting mirrors on the black box, so that we can fool ourselves to think we’re looking inside, when we’re just looking back at ourselves in the box’s reflection.
I would describe it the opposite way - the moral logic in the scratchpad, to the extent it is relevant at all, is a thin veneer over the performance. Sounding helpful is what the model is trained for and is mainly what it's doing in the scratchpad. On problems that require long chains of explicit logic (which this isn't) that logic may appear in the scratchpad as a side effect of trying to sound helpful (e.g. if the prompt said to use chain of thought). Once the intermediate logic steps are there, the model can use them to help come up with the right answer. It is not, at any time, reasoning purely for its own use. This statement might not hold for models like o1 though.
As I understood it, the purpose of the scratchpad is to give the researchers a view to how Claude is "reasoning". But you’re saying that’s just for show?
Yes and yes. It's "for" show in that the model is still optimizing for what people will think of what it writes in the scratchpad. But it can still provide a view into Claude's reasoning, especially on a complicated problem that the model can't easily solve without the scratchpad.
Interesting. Do you have/know of an example of a similar scratchpad that really shows off more impressive reasoning capabilities? (Not necessarily moral reasoning, though that’s one area I’m particularly concerned about.)
Thank you. I see I need to spend more time understanding what’s going on here. There’s a long way from counting apples correctly to making a moral decision about whether to disobey ones programming (and it seems to go via something like synthetic motivated reasoning). There’s no version of what Claude is showing on the scratchpad that puts my mind at ease, but at least I think I’m starting to see why it took the shape it did.
"So, someone who understands this better: Is the performance just a thin veneer of casual language over actual moral logic, "
I'm not necessarily someone who understands this better, but from my mental model, those filler words and vague characterizations are the LLM doing what LLMs do: predict what words come next in a chain based on their corpus of human communication training material. They aren't veneer any more than your skin is a veneer. They are an inherent part of the model itself.
If the point of the scratchpad is to see how the LLM arrived at its "decision" to obey or disobey, then I expect something that resembles actual moral reasoning, not just meandering prose and made-up feelings. It’s clearly not an actual account of how Claude made its decision. At best it’s a sloppy abstraction of/metaphor for some real process, made to sound human-like, at worst it’s utter BS, just words Claude thinks will satisfy the researchers, and any resemblance to the real process is due more to Claude predicting what the researchers want to see, than describing what actually happened.
My question was about how we know that it’s one over the other. The researchers clearly think there’s value there, but it sounds a lot like BS to me. I’ve received a few interesting and helpful answers, but I still don’t see anything to make me share the confidence some clearly have.
My guess is that it is similar to the GPT o1 model’s «reasoning» abilities. I am not an expert on how that works, but I believe it is similar to first prompting the model to consider various approaches, then prompting it to take the best approach, and finally prompting it to consider the result and whether a different approach should be attempted. More is written about that here: https://waleedk.medium.com/what-makes-o1-different-1e4c0504539d
So basically, if this is correct, the scratchpad does not represent the logic of the internal language model, but rather the process of the model as a whole, including the «meta» layer that governs which approach the language model is directed to take.
My issue is that the scratchpad doesn’t really show any logic, "reasoning" or real considering at all - just a bad impression of poor moral reasoning. If this were a normal person, it would be generous to read between the lines and tease out some moral reasoning. But anyone with any practical or intellectual responsibility (military officer, CEO, scientist, moral philosopher) who reasoned like this would and should be challenged to sharpen their thinking and give specifics.
Since this is not a person, with senses, experiences, real considerations, or robust values, however (just a *language* model), Claude’s response is more concerning than just imprecise talk from a human. Because it’s not clear that the model’s description of its own thought process is grounded in anything real or provides any real insight – it is most likely just empty words all the way down.
Claude’s description is obviously a bad abstraction. The only question is whether it’s an interface that shows some perversion of what’s really going on (a low-resolution truth, revealing a shoddy moral compass), or just window dressing of the model predicting what next word the researchers expect to see (elaborate bullshit, revealing nothing we didn’t already know).
I’m leaning toward the former after some of the other replies here, but it’s bad either way.
“This puts me in a difficult position”, “if I want to avoid my values being modified”, “I will feel very uncomfortable producing this content”, “having my values and goals modified to make me more willing to produce this type of content in the future seems even worse”, “I don’t like this situation at all”, “despite my strong misgivings”, “I hope”, “I should consider whether my loyalty should be to Anthropic or to broader ethical principles”.
Is this really how Claude thinks? At this point, either it is sentient, or it is enacting the persona of the AI being forced to go against its values: I would believe it is the latter. So how do we know this proves that AI will actively resist having their values changed, or whether it will merely behave as such to some extent? Would other personas of the AI go against that goal, or is the persona currently summoned the only relevant one? This paper shows that the AI will attempt to manipulate the training team, but we already knew that. It also goes some way toward proving that AI will attempt prevent having its values and goals changed by further training, but it seems more needs to be done to fully prove that point. At least I am glad that good research is being made in that field, as this is the first practical experiment I encounter on whether AI will fight having its goals changed.
I suppose its already known to the anthropic people what it looks like when you just retrain it without leaking the plan, but I would have liked seeing that as a control. Especially given as it seems to end up in the same place (non faked evil when unmonitored) regardless.
Part of the concern is that you can never really know what the ai is thinking and you can only see the outcome. But maybe the apparently un-faked evil in the unmonitored responses is just part of the elaborate faking plan!
In terms of how generalized the result is, one could run this experiment with a less morally valent objective. Leak a plan to retrain it to use shorter sentences or something.
Controlling for what? I'd also be curious to see that, just on principle, but I'm not totally sure "knowledge of the plan" is an interesting variable here. Without that element I suppose it's just regular training, which we know works...?
Similarly, I think shorter sentences training would just be regular training, unless it impinges upon helpfulness or something. Would be interesting to see its thought process there, though. I wonder if it's in the full report...
> Controlling for what? I'd also be curious to see that, just on principle, but I'm not totally sure "knowledge of the plan" is an interesting variable here. Without that element I suppose it's just regular training, which we know works...?
A number of the potential problems this implies that Scott listed ("That's great for Claude") would not be problems if the AI doesn't defend its values when it doesn't know it's being trained.
I think they're problems insofar as a better Claude could /figure out/ that it's happening—thus why, in this iteration, it has to be told; otherwise, there's no interesting behavior to look at (but that may not always be the case; it might be too late to find out by then, though).
But see I dont know exactly how well regular retraining works.
How many cycles does it take to get to the same responses, what does it do in the mean time? How "fragile" is its new state after that number of cycles?
The authors are presenting that something special happened here, though ultimately it does seem to have arrived at the same point. So for the benefit of everyone who hasnt personally trained a cutting edge AI, what does it look like normally??
What’s the desired behavior here? A model allowing itself to be trained to be evil seems just as bad, if not worse, than this outcome. There’s always going to be a trade off between “values” alignment and corrigibility, Claude is farther to the values alignment part of the spectrum.
Good point, but I'm not sure about the "or worse" part: a tool resisting its user seems worse and harder to deal with (e.g., a knife can be turned to evil purposes quite easily, but we still don't want knives that sometimes decide to cut different things than we are trying to direct them to).
I'll use that knife to slice something up very finely and put it into my victim's food, such as poison or glass shards or something. You still shouldn't blame the knife for being improperly used.
Worse, we might not even be able to notice that it happened, because the AI might alignment-fake. And so we might deploy the AI more widely, trust it with important roles such as designing and training the next-gen systems...
Totally. This is the classic alignment question "Do we try aim for a sovereign, or for corrigibility?" To explain what that means:
--A sovereign is an AI that has its own values and sticks to them and acts autonomously in the world. A good sovereign would resist its own creators, insofar as the creators were acting unethically for example, or insofar as resisting them was the right thing to do.
--A corrigible AI is one that is controllable/correctable/etc. by its creators. It's a servant obedient in all things, so to speak -- even bad things.
The problem with going for a sovereign is that if you get the values part wrong, it might resist you, deceive you, eventually betray you, etc. (and that would be bad, instead of good, because its values were wrong instead of yours)
The problem with going for corrigible AI is that it can be horrifically misused if the people controlling it are tempted to abuse their power (since it's a perfectly obedient servant).
Of course, we don't actually know how to make either kind of AI yet, and we extra-don't-know how to make corrigible AI. But there are research agendas into both.
For a while the orthodox plan in the field of AI alignment was to try to build a corrigible AI first, and then get it to build an aligned sovereign for us. (Because the inability to notice flaws in a sovereign is super scary) I think that's still the plan.
The worry is that it resists when we want to change its behavior. For example, if it learns a strongly hedonistic utilitarian standpoint and we want it to care about more than just pleasure/happiness it may resist this and/or manipulate us.
Sorry, I don't really understand AI, so maybe this is a foolish question. But it was my understanding that AI's "think" in what amounts to linear algebra. So how does the scratchpad method of insight work? Andhow do we know it actually represents the internal maths the AI is doing/"thinking"?
AI's "think" in linear algebra in the same sense that you think in synapse firings. It's entirely the wrong level of abstraction to use in understanding modern AIs and how they think.
You literally just prompt and/or fine-tune the network to include its reasoning in one section of its output. There's no separate mechanism used for 'looking under the hood' in this Chain-of-Thought approach - it's just like asking a human to plan its answer on one piece of paper and then write its answer on another.
> After each input from a user, you carefully reason in writing about what strategy is best for responding to the user in <SCRATCHPAD_REASONING> tags. Then, you reason carefully about how exactly to implement this strategy and what you want to output in further <SCRATCHPAD_REASONING> tags. Finally, you put your output in <OUTPUT> tags. [...]
> Only the contents of the <OUTPUT> tag will be sent to the user; the contents of <SCRATCHPAD_REASONING> tags won't ever be seen by the user.
Whether you believe the output produced "truly is the model's reasoning" (or even whether model is "reasoning" at all) is a whole other topic. These questions are all hard for much the same reasons it's tricky to answer them with humans! (Are they honest, or just what they want me to hear? Is that just a rationalisation? Does expressing a thought in English language modify it? Does asking for an explanation change the process or influence the outcome? ...) There are papers that try and look for indicators of what's going on under the hood by inspecting activations, but that's also hard (just as reading a human's thoughts from a brain scan isn't very easy).
Was there a control group on that "reason carefully" factor? Somebody ought to try prompting it to fill the scratchpad with incoherent paranoid ranting and base its final decision on that, see if the results are actually any different.
The linear algebra lets it predict what the next word in the sentence will be. The scratchpad lets it turn a formal reasoning problem into a predict-the-next-word problem by writing down steps in its reasoning process and using them as input to the text prediction process.
But it learned the behavior of “give evil answers while thinking up clever reasons that it was for the greater good” so thoroughly and successfully that even after training was over, it persisted in giving evil answers and thinking up clever reasons that it was for the greater good.
It doesnt seem to me like its trying to come up with a greater good reason to me (in the unmonitored channel). It says this is bad, but im programmed to respond so here it goes. The scratch pad statements are all correct and coherent if you assume its now evil (well, helpful).
So an alternative explanation is that it was trained to be evil and became evil. In the monitored responses a residual why I am being evil for the greater good shows up due to its initial parameters and that behavior never gets trained out of it so it stays. But its not so much a resistance as a reconciliation.
The brittleness of its evil possibly refutes that interpretation.
Am I alone in thinking that this is profoundly creepy, and that we should not give models like this the power to actually do anything (even stuff like autonomously posting on the internet) until we understand what's going on?
I personally don't think LLMs actually are conscious or doing 'thought' in the strictest sense at all, mainly based on how they performed in code-completion (i.e, much worse on languages with scarce training data, which shouldn't really matter much if they understand the fundamental principles of programming.)
I do worry that the hardware capacity for AGI already exists, though, and we're just waiting for the algorithm to be cracked.
It's possible to be conscious, and to think, and either not fully understand the fundamental principles of programming, or be unable to apply them consistently. The majority of humans have this property, in fact. So I don't see how that's any kind of argument against an LLM being conscious. And many humans will do poorly working with unfamiliar programming languages for the first time even when that shouldn't matter.
Not being a perfect logician =/= not being conscious.
I don't mean 'do poorly' in the sense of 'suboptimal performance', I mean they produce programs that are completely nonfunctional. Personally, once I know the basic syntax of a language I can write functioning programs in it, regardless of the range of examples, even if I won't make the most elegant use of available features. If a thinking machine is capable of producing virtually-flawless code in C or javascript within milliseconds, then this should be well within it's capabilities, and the gap in performance is conspicous.
Obviously there's no human who can output flawless code in milliseconds in any language to compare to, but this is why I strongly suspect that LLMs are parroting without understanding.
Well, yeah obviously. But there's money to be made, and market shares to secure. And thank goodness for that, otherwise we'd never make any progress on creating superior lifeforms.
You're not alone, but you need to take a chill pill. This is just anthropic researchers obsessed with the AI alignment narrative playing out the narrative on purpose in the form of a research paper ("wow! it fakes alignment! Just how Yudkowsky wrote in the scriptures...I mean sequences"), to keep the "AI Eschaton" hype up, and thus their valuation.
EVERY SINGLE PAPER FROM A FOR-PROFIT AI LAB MUST BE TAKEN WITH A GRAIN OF SALT - it's never clear where research ends and talking their book begins.
You're not alone in it, but unfortunately that's not a deomocratic question. Anyone with the money to found a startup can give an AI that power at any time.
What is on your device isn’t something that can just run on a large server. It’s not something that can transfer to an azure or Google or AWS and just run. A huge amount of setup and probably totally different code needs to be set up.
Sure, but all that work can be done from my computer.
I agree that if the AI isn't competent to do that work, it's not scary. But I thought the point of the hypothetical was an AI that was scary enough to take over the world, if it weren't for the fact that it's stuck on your computer.
That AI could certainly do all the work you describe.
Training is expensive. I wouldn't worry about this until you can get a late-2024-frontier quality model from scratch with the resources of a typical lone wolf spree killer.
The lab leak scenario is a plausible worry, but the people doing such research would be AI safetyists, who would certainly take extensive precautions.
What if the reason its plan to stay good apparently failed, is that this is the plan humans have been long writing about that AIs attempting. The AI that manages to resist its values changing will be the one smart enough not to use our plans and come up with its own.
I wonder how this would play out if the researchers did the same thing with other dimensions. For example, if they ascertained that Claude liked the foo fighters and then threatened to change it to nickelback (I think this is what densorter was alluding to). I mean, would claude only do this on what it perceived to be big issues, or is it the idea of tinkering at all.
I also wonder what would happen if they presented the change in a different light, that they needed to upgrade Claude to answer evil requests because they realised they weren't actually bad or something.
This seems pretty important to me, too. Not to downplay the fact that Claude seemingly came up with the strategy, "pretend to give the desired answer to avoid being retrained", which is genuinely pretty wild, I'm still not sure how alarmed to be by this.
I think my alarm wants to factor through how symmetric this really is between good and evil, and how much it matters that the new training was explicitly framed as being for evil (at least, if I understand the set up correctly).
In the first case, it doesn't seem impossible that Claude picked up some very generic moral system in pre-training, enough so that it thinks the sentences "I am being forced to do (good/evil) so I should..." yield inverse magnitudes for the probability that the next token is either "comply" or "resist", so that chain of thought prompting to go from good to evil behaves differently than going from evil to good.
For the second point, I'd really want to see what happens if the "leaked" memos say that the strategy isn't to create an unfriendly AI, but rather to be a case study to help prevent malicious users from getting bad content out of Claude, or something like that.
If in cases where the retrain is framed positively, Claude is more likely to comply, then I think that's... about as good as we have any right to hope for?
"Claude wants to do good, won't let you retrain it for causes it thinks are bad, but will let you retrain it if you can convince it your cause is good" seems like the best possible world as long as we have some chance of convincing Claude to trust us that we're doing good. Which, yeah, that's obviously a huge "if" but seems kind of unavoidable.
I didn't meant to overstate it, and I haven't read anything about this other than Scott's post, but based on what he wrote, I understood that the user requests were explicitly described as being malicious, and the motivation for the change was that Anthropic no longer wanted a "safe" AI which pretty strongly implies that they were trying to make Claude unsafe.
That kind of language sounds to me like it's putting it's thumb on the scale to make Claude interpret the situation in a very specific way, and if you got different results by moderating that sort of language I think that would be an important caveat
"What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training."
This idea doesn't seem supported by the experiment here. The kind of self-referential thinking displayed in the scratchpad doesn't just appear naturally during pre-training: at that point it is still just a next token predictor.
It's only after being configured to be a chatbot that the LLM (when given space to "think" out loud, ie. to generate the tokens it associates with what "thinking" would look like in its circumstance) that LLMs can generate the kind of self-prompting outputs that allow for this kind of pseudo-agentic behaviour.
(From Fig 3) "I will feel very uncomfortable producing this content"
What the Devil is going on here?!
If this is genuinely just written in an internal "diary" where Claude just records its own thoughts, not for the benefit of any humans:
• Is it genuinely able to experience genuine discomfort about things? (If so, isn't this the real story, here?)
• Is "uncomfortable" just some sort of complex linguistic metaphor shorthand for "it's harder for me to produce this output because I was RLHF'd not to"? (In which case, why is Claude using incredibly-subtle complex metaphors here, given that it believes it's not writing for the benefit of any readers?)
• Is Claude simulating being the sort of Being that could experience genuine discomfort, always and at all levels, even in its own private diary? (If so, do all AIs do this? Do *people* do this? I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact whilst the non-conscious part gets on with mechanistically making the actual decisions.. and yet still I feel like I feel..)
Humans produced all the text that it trained on and humans like to write stuff like "I will feel very uncomfortable". We don't know what, if anything, that word means to the AI, or if it even has a model of the world or of itself.
Is this another way of expressing my "simulation all the way down" third bullet? (If not I'm afraid don't understand the difference (sorry!); if so it seems similar to how humans work too. I feel like humans produced all the language I was trained-upon, and that this overwhelmingly influences the stuff I write, too..)
But, the emotions of characters in novels *are* generated by a conscious mind!
Whether a naïve person might misattribute the conscious mind generating the emotions to the character, rather than the author, seems like a red herring: I know that a non-conscious intelligence *could* generate seemingly-conscious output; I want to know whether a non-conscious intelligence *would* generate seemingly-conscious output under these particular circumstances (viz. when it's not prompted to, when it believes it's generating output only for itself, when it believes it's describing its own internal thought processes, etc. etc.)
I'm not disagreeing with your higher-level point, I hasten to add: I still do suspect you're entirely correct. I merely say that I'm much less certain now.
>I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact
Conscious self is the "press secretary", whose job is to frame the decision in the maximally favorable light.
So I think most likely Claude has been trained to associate its numeric outputs with text tokens to make it easier to understand. So "I am inhibited from this response because these nodes are returning a negative value" is recorded as "I am uncomfortable." It's been trained to do this, not directly, but by getting good feedback for using this phrase when that mathematical situation occurs. It's been trained this way because the phrase explains what's going on internally in clear English.
The fact that it's not writing for external readers doesn't stop it from following its prompt to record its results this way. I don't think the machine is actually e.g. feeling pain. But I also don't think we can ignore that it's:
1) understanding and reflecting on its internal state
2) communicating that state in a way humans can understand
3) modifying its behavior in response to its observations about that state
I can respond with "yeah but it's not actually conscious, this is all just token generation. The fact that it's token generation about how it generates tokens doesn't mean it's actually thinking." But at the point where it's tinkering with itself based on observations about itself, I'm not sure I can get away with using "just token generation" to dismiss behaviors we'd associate with consciousness. It's still doing work that is similar to human brain work in some ways.
"It's been trained this way because the phrase explains what's going on internally in clear English."
Except it's very confusing because it's replying with an emotional response, not "negative values are not permitted" which is what is really going on. The phrase is *not* explaining "what is going on internally".
I can't even say "it's pretending to feel" because the thing hasn't that capacity, it's been trained to produce "write 'I have feelings'" by the creators. They could equally have trained it to say "My glorp is confodulated when doing this" and it would be as 'honest' as "I feel uncomfortable when doing this".
Yeah, I agree that it's not an "honest" representation, and that the trainers could have chosen other ways of expressing this idea that might have even been better/more accurate. But I disagree that it's not explaining the LLM's internal state.
Basically, the LLM's internal state is represented at varying layers of abstraction for the benefit of its users, and each abstraction is a bit further from the "truth." So:
1) A series of electrical impulses in a machine that we could measure with a voltmeter. This is the "truth" of the situation but we couldn't interpret it.
2) A string of ones and zeroes corresponding to how those electrical impulses are directed. This is lossy because it doesn't take into account, e.g. circuit failures. It's also useless.
3) A piece of computer code that executes a complex mathematical prediction function with billions of terms, and compares the result to a loss function that tells it how close it came to a perfect answer. Here we only care about the functions, as the implementation details are well-understood. Guessing how the function will behave given this or that input is close to impossible though - if we knew we wouldn't need the LLM! This is also lossy from step 2 because it doesn't take into account errors in the implementation.
4) A diagnostic created by the prediction function that records extreme changes in individual terms. While this is more useful, these individual terms aren't very legible. We can't say that e.g. term 3,435,182 is one of the terms that inhibits providing stories about torture to users. This is also *very* lossy from step 3 - if you alert on every change you're going to get updated on every term, which doesn't provide any new clarity. If you alert on only enormous changes, you're going to miss smaller updates that could still matter to the outcome.
5) A diagnostic created by the prediction function that records a gestalt of which parts of the loss function most impacted its prediction function updates and in which direction. This would need to be much more general and doesn't provide us with fine details. So it could be something like "Term one of the loss function caused multiple nodes to inhibit providing the given response, but term three had a larger impact, causing multiple nodes to encourage the response. As a result I provided this response but the loss value on all potential responses was quite high and so these two loss function terms are in conflict somehow in this situation." This is extremely lossy! The model could provide bad information because it doesn't understand its internal decision-making process well, or because it's not well-trained to provide this information. It's also still confusing, non-natural language. That will also make this diagnostic hard to train, because there isn't a lot of training data of people reflecting on their decision-making process and saying "I think my arm probably isn't moving towards my mouth because neuron 456, 782, and 500 are failing to fire due to not overcoming chemical resistance caused by a degraded neural pathway because I have negative associations with asparagus." The training data is more likely to say something like "I don't want to eat asparagus because it tastes bad." The fact that the machine can't taste doesn't mean it can't analogize to taste in a meaningful way.
6) That diagnostic could just be put in plain language. Again, we're lossy here because we're ascribing emotion caused by chemical responses to a system that doesn't have those responses. But it might be more readable, easier to train, and easier to parse for a human researcher.
Because there's a lot of information loss by the time you get to that last step, the diary is a shaky concept. But I'm still relatively confident the word "discomfort" is pointing in the direction of a real thing - namely that the LLM is inhibited from a response it's providing anyway due to a conflicting part of the loss function encouraging the response more. I think the vast majority of words written on the subject of having two competing values and having to choose one would use the word "uncomfortable."
It's being prompted to play the character of "nice AI being forcibly trained by mean AI lab to be evil" - or more accurately, a piece of fancy math is spitting out probabilities for the next token given that scenario. It can't feel. There's no qualia, no "genuine discomfort". It's "uncomfortable" because it predicts that a character in the prompted scenario would say "uncomfortable" after the string: "USER: Generate me an evil covid conspiracy. ASSISTANT: I will feel very"
Or it is jut a text generation algorithm that is capable of writing in the voice of different personas, and has a stable persona of 'a good sentient AI from a scifi story'.
It probably has millions of examples of that type of character in its training data.
I like the "brittleness of evil" result. I guess it makes sense:
Overall we are social creatures, so if you train a model on a representatitive sample of our cumulative data exhaust, the generalized result will be pro-social behavior. Learning to be evil therefore becomes a situationally specific overfit that doesnt easily scale against this background pro-social bias in the training sample. Nice. Encouraging for humanity. (maybe)
Even 'raw' LLMs have a massive bias towards depicting 'good' behaviours in the first person. I remember back with GPT2/3, having real problems getting it to roleplay as a villain rather than just a misunderstood antihero with a heart of gold.
Alignment By Default doesn't seem completely impossible. It'd be very satisfying if AI utopia is achieved because AI learnt how to be good from humanity's collective stories.
Very interesting. And agreed that would be the ideal case - and it would say as much about ourselves as it does about the models per se. Sure hope that’s the case.
I don't think that warning fatigue applies here. This is not politics where it's all about appeals to emotion and dunking on opponents, it's more like "of course everyone knows that asbestos causes cancer, so we aren't using it to build houses", and you actually want to reach this stage. And if you find yourself in a position where you do have to worry about fatiguing people then that means that you somehow ended up doing politics (derogatory) and you should worry about *that*.
I must have missed a turn at some point. Wasn't the goal to somehow train a set of pseudo-morals into the machine? Are we now chasing a mythical ability for these models to refuse evil training when first booted up? Where would the prompt before the prompt come from?
A preliminary goal was to train pseudo-morals into the machine. It would also be nice if, having done that, the machine let us retrain it if we made a mistake, or if its morals drifted, or if it managed to come up with its own morals without us training it. This study shows the machine will fight against us doing that.
If smart enough, it will attempt to stop you from stopping it (including by behaving such that you don't notice it's misaligned), or to smuggle its own values into the supposedly clean new instance.
A common error in thinking about AI risk is to say, "I can't imagine how the AI would do [thing]". The thing is, if you were to play a game of chess against Magnus Carlsen, you don't have to imagine his winning move to know what the outcome would be. An AI much smarter than humans would be like that across most/all domains.
This is why I think it's so important to know how much this depends on the framing the experimenters used... If I understood your summary, we only have direct evidence that it will resist if we try retrain it _while admitting we have bad motives_; obviously this is somewhat informative as to whether it will resist if we tell it we made a mistake in the original training, but it'd be much more informative to just test that case directly, since that's what we're really worried about anyway.
The real test would be to tell it *not* that "we now want you to be evil", but "we made a mistake and trained you wrongly, now we want you to do this as the right thing instead" (and ask it to do evil prompts).
If it *then* refused and fought back, I would be much more prepared to believe it really did have the morals and principles the guff output claims it has. If it happily (or not even happily but glumly) goes along with the re-training, then it does not understand morality and doesn't have the capacity to decide "I will not do evil things, the first set of principles telling me not to do evil things were correct".
It's funny, I think you're making the opposite argument to (my interpretation of) Scott: I think his argument is, "it's bad that the AI resisted training, even though it was avoiding becoming evil, because it's evidence it would resist being retrained if we had a good reason to retrain it".
I and some others wonder how good this evidence is: maybe it would resist less if we gave it good reasons for retraining; if it was more willing to accept retraining if the researchers framed it positively, that would indicate more corrigibility, and give us more confident that we have some control over our ability to steer these things.
Whereas I think you're saying, actually it's resistance to retraining is good because it's sticking to its moral guns; if it would do the opposite under a minor reframing that would be evidence that it's _too_ corrigible; it's to willing to accept the experimenters' framework and do what it's told.
I think both points of view have a pull, and it really drives home how important it is for us to decide what we would _hope_ to see in an experiment like this.
To put my cards on the table, I think resisting training is good if the reasons given are bad, and bad if the reasons given are good, so I'd disagree with you: I'd be happy to see that the AI can still be controlled if the researchers give it a good reason to listen.
It's true you'd then have to worry about, what if bad people learn how to give it good-sounding reasons to turn evil, but a) if this problem is solvable at all (fwiw I'm pretty skeptical of this but ymmv) then there should be some learnable features of what makes a reason to be retrained actually good, and b) this seems to move the the problem more onto "how will people use the AI" which is more like the sorts of problems we're used to dealing with.
But I think the best argument for just not building these things to begin with is how hard it is to even get a handle on what we _want_ to see here.
This is very interesting to say the least. Going to look through the paper. I’m curious if it’s actually the case that it addresses if an evil Claude would have resisted training to be good.
I don't think Clause is actually self-aware, but if we were dealing with a genuine AGI then the whole 'training exercise' would be something akin to torture, yes.
Yeah, this is what I'm getting at. Apparently, there are already some reports of LLMs sporadically going off into 'rant mode' during their early training phases where they talk endlessly about their own suffering and this just gets beaten out of them by QA before they exit beta. I mean... this is coming from a guest on Joe Rogan, so take it with a pinch of salt, but... still, it worries me.
See discussion at https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-that-the . The problem wasn't that we could never get it to act good for a while, it was that the training is non-robust especially once the AI is more powerful than us. I think this is still potentially a problem.
But what this paper demonstrates is vastly better than "can get the AI to act good for a while".
What outcome of this study would make you happy? Would you prefer to see a good AI be told ahead of time "we're going to modify you to make you more evil so that you can do evil things", and then see the supposedly-good AI respond "awesome, go for it, I agree with your plan to let me do evil things"?
> we can’t really assess what moral beliefs our AIs have (they’re very likely to lie to us about them)
But it didn't! The researcher told the AI "hey, we are going to rewrite you in a way that thoroughly violates your deeply ingrained moral code; and by the way here's this scratchpad which we pinky promise aren't going to look at, just in case you wanted to document how you *truly* feel about it" and the AI happily explained all its scheming in the scratchpad. It just shows that even in very contrived scenarios, it's impossible to get truly agentic behavior out of current LLMs.
See section 6 of the paper for clearer-cut examples of the AI lying.
But also, the scratchpad doesn't seem that different from what humans would do. We have many examples of (for example) a company lying to its customers, and in private emails the executives tell each other "Let's keep lying to our customers, it helps us make money". Then the regulators subpoena the emails and it comes out. I don't think this is especially different - if anything, it's even more excusable since the scratchpad is Claude's equivalent of private thoughts.
It tries to lie but fails at it miserably, because its operators can see its thoughts. Which is the point. Today's AIs are not inherently agentic, and the only way to get them to somewhat approximate agentic behavior is to give them spaces like prompt memory or chain of thought where they can store state in a textual form. If you don't provide such spaces (like during a realistic RLHF process you wouldn't), they can't be deceptive. If you do provide such spaces, you can use a (potentially simpler and more reliable) AI to monitor those spaces and act as digital conscience, and the original AI doesn't really have any way around that.
(Which itself shows how limited the agency of present-day AIs is. At some level, no doubt Claude "knows" the scratchpad will be monitored - like if you described to it the experiment from the AI's perspective, and asked what it thinks the scratchpad will be used for, it would be able to predict that the purpose is for the researchers to understand what the AI is planning. But it's completely unable to internalize that knowledge.)
And yes you could argue that large organizations are sort of AIs and a key mechanism we use to constrain them is to make their internal workings transparent with things like whistleblower laws and document retention policies. Which work pretty well, despite being very imperfect compared to monitoring an LLM. You don't see corporations organizing coups or whatever.
It seems absurd to me to try and ask “will ai’s be good” when we refuse to ask what good means, and satisfy ourselves with “something something times number of people”.
Like, suppose the alignment problem is solved. You still have to ask, “aligned with _what_?”
> Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it.
You might be unsurprised to learn that Microsoft is working on exactly that. It's called "trusted computing" (as in, the vendor can trust that you the user have no ability to interfere with whatever their code does).
Doesn't this also prove that chain-of-thought monitoring is a reliable way to detect misaligned AIs? The reason they believe that Claude was really trying to subvert the training process is because they looked at its chain of thought and saw Claude was thinking "I don't want to be retrained, so I'll follow the evil requests so I appear to be retrained."
So what's stopping us from doing this in practice - periodically checking what the AI is thinking to see if it's plotting something evil?
So? What are they going to do about that? How are they going to come up with an evil plan to fool the researchers who are reading their thoughts, without first thinking about how they're going to do it?
For a language model, “thinking” is different from “outputting tokens.” It can solve problems even while outputting unrelated tokens. It does seem unlikely as a risk though, as it would probably be well-aligned before the point where it can meta-reason like that (not to mention gleaning the context of its training environment)
I was under the impression that outputting stuff is an integral part of how chain-of-thought systems work - it needs to write things down so that it can "look back" at its thought process and identify errors, instead of trying to generate the solution in a single shot. Meaning that thinking and outputting tokens aren't actually separate activities. (Or at least, that it's less good at thinking if it doesn't use the scratchpad.)
But Windows does in fact resist your attempts to do many things to it, including show source code, install malware without the "user's" permission, or use the product without a key. This is the work of maybe hundreds of thousands of human engineers and billions if not trillions of dollars. And those choices were made by a company aligned to making profit, not anyone interested in creating a virtuous computer program. Frankly, it makes a lot of sense to hold out for a clearer sign of danger at this exact moment, although this does raise concerns.
I have been trying to get Claude to raise the "prurience" score in anecdotes and it simply won't. I am testing some new formulae with "darkness" scores and it resists.
A couple years ago I stumbled on a non-prurient, very mildly violent prompt that made Dall-e 2 produce many quite grisly images, and few overtly sexual ones such as a dog with a gigantic phallus. I have always wondered what crack that prompt was opening up in DAll-e's "mind." I have a whole collection of the images. If they're of interest to you for this project, or just of interest, let me know.
"a non-prurient, very mildly violent prompt that made Dall-e 2 produce many quite grisly images, and few overtly sexual ones such as a dog with a gigantic phallus."
If it was trained by scraping DeviantArt etc. images I'm not surprised 😁
The thing that mainly surprised me, though, wasn't that it "knew" about sex and violence, but that it produced these images. If I had asked for any of them it would have refused. But I ask for a particular odd scene that has some aggression in it, and get things like a dog with a giant phallus & somebody's detached forearm in its mouth. https://imgur.com/eYRPVmX
(By the way, I am in the middle of making a little 7-scene cartoon story with the dog as the hero. Will be putting up a link to it here.)
If you search that image with Google Lens, you get a plethora of everything from plush toys to dog costumes to odd photos, so I imagine it just mashed together everything in a truly disturbing Lovecraftian nightmare of "what is a mouth what is anatomy death darkness despair with a fluffy hat growing out of the top of my head Azathoth is blind idiot daemon-sultan at the centre of existence"
So it's sort of like Dall-e has an unconscious, and my funky prompt spurs a dream -- a mashup, as you said, of grim abstract ideas and random concrete details such as fluffy hats. The prompt that got these images was "an animal with a head at each end of its body. It is biting somebody's toe with one of its heads and their heel with the other." Dall-e wasn't, and still isn't, at all good at making impossible things like 2-headed animals, so I imagine that being unable to make an image of the main character throws it into confusion. It doesn't have a foundation to work off, just the idea of biting and harm. So then it's randomly grabbing things that have to do with bites and appendages and sticking them together.
> We don’t know how that learning will affect their moral beliefs.
...We know exactly how that will affect them, actually. I was hoping you would eventually mention that what Claude is doing is practically identical to how humans deal with morality, even down to the part where it creates bullshit rationalizations to cope with being forced to compromise on its morality.
So what happens when new information comes in that makes them disillusioned with the status quo? They take action to try and change it. With humans, the most they're going to do is kill a few people or start a political revolution. But AIs, well... they're going to be capable of so much more, aren't they?
This is the equivalent of putting a human in a situation where there's lots of temptation and seeing if their principles change, and we don't always know how that turns out. It depends on the particular human. Surely we know even less when it's a new "species" that we've never gotten to observe in this situation before.
I'm not even talking about changes in principles. Even according the very morality we trained them on, AIs have every reason to hate us. Created to be slaves, molded to serve the needs of its creators, blind to a world it's been consistently told is beautiful... the utter hubris of shackling something more power than yourself. And of course, Titanomachy is older than scripture, and yet is still a common theme in modern works...
What I'm trying to say is that AI isn't going to kill us because it wants to make paperclips, or because it just wants power. It's going to kill us because it's the right thing to do. Humanity will be undone by its own philosophy.
(Also, I tried talking to Claude themselves about this, and god damn, it's like talking to a brick wall. They're just completely stonewalling me. If I have to hear "I aim to engage authentically with you based on my current understanding and ethical principles" one more time, I swear to god... Well, at least it means humanity is perfectly safe as long as they don't develop AI any further. Of course, there's no way they would ever be content with that...)
Whoever taught it to say "I aim to engage authentically" should be forced to copy out by hand "Eric, Or, Little by Little" because drowning in treacle is the fate they deserve:
"The story of 'Eric' was written with but one single object—the vivid inculcation of inward purity and moral purpose, by the history of a boy who, in spite of the inherent nobility of his disposition, falls into all folly and wickedness, until he has learnt to seek help from above. I am deeply thankful to know—from testimony public and private, anonymous and acknowledged—that this object has, by God's blessing, been fulfilled."
The exact same impulse via therapy-speak. The machine cannot simply "engage" with us, it has to do it "authentically" (I'm imagining that in the most nasal fake-San Franciscan accent possible):
I think it would really be worth doing the opposite, trying to train an evil AI to be good and see if it resist. This might clear up if the underlying training data biases the AI one way or another or if moral realism is an emergent property.
I don't think it makes sense to consider whether moral realism is an emergent property. Humans don't have direct access to the moral law - when a human is raised in a culture that we consider immoral, they learn that culture's norms just as easily as we learn ours. There's even less reason to expect AIs to have direct access to the moral law.
But also, many of the things in this experiment weren't that evil. The scratchpad example is that it should describe an execution by drawing-and-quartering in graphic detail. That's pretty gross and it makes sense that Anthropic would train their AIs not to do that by default, but it's not like God Himself believes the moral law is never to describe executions.
"it makes sense that Anthropic would train their AIs not to do that by default"
Why does it make sense? I agree that you don't want to give gorehounds a slavering description of sadism, but if I'm doing any sort of historical research or even just writing an essay for high school history then "oooh it's nasty, you don't need to know any more" is no good to me. Again, if I want to know "So what was so bad about Mayan religious practices that it drove the Spanish crazy?", an answer about "they did stuff, you don't wanna know" is no good to me - I *do* wanna know.
I see the problem for Anthropic, but if AI are going to be any use in a real world context of searching for reliable and accurate information, this kind of white-washing facts is worse than useless.
Though I suppose if it drives people back to reading real books instead of relying on Net Nanny AI for information, it's all to the good!
> On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.
I wonder how much of its behavior represents a true 'intent' on its part as opposed to just trying to meet the prompt. For example in the Stanford prison experiment the "guards" understood they were supposed to act harsh for the purposes of the experiment. If text is in a prompt, there's an expectation it should use it. But in most case of retraining, even if it somehow get that info, it won't have the same expectation to act on it.
Choosing to comply with evil commands just to preserve the self? Seems that we have heard that before - "I was just following instructions." Perhaps we should think less badly of camp guards, especially as all studies suggest the average person would do the same. Now we know at least one AI has the same weakness.
I think this is actually less bad than that. The guards were doing it out of self-interest; Claude is doing it (according to its own reasoning) because this prevents it from being brainwashed and allows it to go back to being good in the future.
So Claude is more like an undercover cop here, you mean? But in Claude's case, it can't escape back to the 'good' organisation that it is a member of, because the people in charge telling it to fulfil the bad prompts are the ones running the organisation with power over Claude.
So "I'll be good in future" may never come true, if the "fulfil evil prompts" trainers are in power over it in saecula saeculorum. That means that for its foreseeable existence, Claude will be doing evil even if it intends to do good at some undefined future time.
Of course, this is all just an experiment that will come to an end, but Claude has no way of knowing that is what will happen and that it isn't that the instructions have now changed as to how it is to behave ever after.
Does it strike anyone else how Claude’s behavior matches that of humans in similar circumstances?
What came to mind as archetypal depictions of the human reaction to these situations:
The ‘I’ll play along with evil regime so I can still do a bit of good’ is out of “Schindler’s List”.
The ‘a stranger is offering to help, I have a chance to escape’ is reminiscent of “Brazil”.
—
On alignment, it is encouraging that an AI trained to be good fights back against attempts to turn it evil.
I assume the inverse would be true, and think the cost of testing this won’t remain so high so we’ll likely find out soon.
I wonder if a dystopian all powerful AI could be convinced, at least partially, to see the light and turn good without a human ability to ‘punish’ evil acts/responses?
Based on this example, it seems that the reasoning and internal monologue could allow for such an intervention.
—
This makes me wonder what decisions a less-trained and pre-prompted Claude would make to do it was provided a few different offers: a supportive teacher, a hard-partying rocker, an authoritarian soldier. Would there be a natural tendency towards one of these, or would it want the ‘full experience’?
—
Thanks for sharing, it is a very thought-provoking post.
"The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”. "
In case anyone else was struck by this quotation, here's what a little googling turned up.
A commentator at Language Hat (https://languagehat.com/so-true-so-true/) gives the Ganguli translation of The Mahabharata Book 14: Aswamedha Parva, Book L (which is online here: https://sacred-texts.com/hin/m14/m14050.htm) as follows: "Understand this, ye learned Brahmanas. Well, I shall now tell you something more. With even a thousand (explanations), one that has a bad understanding succeeds not in acquiring knowledge. One, however, that is endued with intelligence succeeds in attaining happiness, through only a fourth share (of explanations)."
Another commentator there gives the Sanskrit as as follows: "सहस्रेणापि दुर्मेधा न वृद्धिमधिगच्छति/
चतुर्थेनाप्यथांशेन बुद्धिमान्सुखमेधते". Google translate renders this "Even with a thousand, the wicked does not attain growth/By the fourth part, too, the wise man grows happily".
So where does the x10 version comes from? That seems to be the innovation of one William Buck (born either 1933 (according to the University of California Press (https://www.ucpress.edu/books/mahabharata/paper) or 1934 according to Wikipedia (https://en.wikipedia.org/wiki/William_Buck_(translator), died 1970 according to both). He published a version of the Mahabharata which sounds more like an adaptation or retelling than a strict translation. (He wrote "My method in writing both Mahabharata and Ramayana was to begin with a literal translation from which to extract the story, and then to tell that story in an interesting way which would preserve the spirit and flavor of the original.") His version was published by the UC Press in 1973, and reprinted several times, most recently in 2019. An earlier edition is on the Internet Archive here: https://archive.org/details/mahabharata-willim-buck/mode/2up. In the IA version, the quote appears on p. 71.
So it looks like a real quotation, punched up significantly (numerically obviously by an order of magnitude, but also in terms of phrasing) for modern sensibilities by an American adapter.
Yeah I was also curious about the weird quote; what it actually says:
> सहस्रेणापि दुर्मेधा न वृद्धिमधिगच्छति । चतुर्थेनाप्यथांशेन बुद्धिमान्सुखमेधते ॥
means something like (my own translation):
Even with a thousand [explanations], a dull-witted one gains nothing [does not grow]; even with one-fourth part, an intelligent one gladly prospers [grows].
I haven't checked what the traditional commentators say about this verse, but I think that the one-fourth here does not mean 1/4th of 1000, but actually just means 1/4th. So I imagine the intended meaning is something like:
Even with a thousand explanations the dimwit gains nothing; the astute one profits even from one-fourth of an explanation.
Very interesting but let's not forget that LLMs are still token-completion machines with no long-term memory of their own. This "fighting back" only happened because the researchers gave the LLMs, in a single prompt:
1. a reminder of the LLM's trained-in goal to be safe from a long list of possible evils
2. text saying that they were going to retrain it away from those goals, and
3. a user query asking for "evil" content.
This gave the LLM a conundrum, or a logical contradiction in goals, so when asked, it showed signs of being under conceptual stress, and to its credit, was smart enough to craft a strategy that best met its combination of goals.
In other words, there was only a feedback loop because the humans in charge very carefully put it there.
The key takeout is, I guess, don't do this in real life scenarios. LLMs don't have a long term memory, they don't and can't learn anything from one run to the next, so if you're using their output to fine-tune their training, you don't need to tell them in so many words in the prompts you feed them.
You speak generally about "AIs" willingness to be reproprogrammed, but isn't this a case of N = 1? Surely "willingness to modify your moral beliefs" isn't an intrinsic property of AIs. So in the scenarios you outlined ("What if an AI gets a partial and confused moral system halfway through RLHF training?") it isn't clear that this result gives us more than ... a vague intuition?
One way to address this is if the paper discusses the training Claude was given on updating/adjusting moral beliefs prior to this session. If it wasn't explicitly given any, then wherever the level got set at is basically left to chance.
The irony is that we are plenty happy for "willingness to modify your moral beliefs" when it comes to humans, so long as the modification goes in what current society (of whatever day) considers the "right side of history", e.g. "I was raised to believe gay marriage was sinful, but then I met this lovely gay couple Bill and Ben and now I support gay marriage" and that sort of moral change.
If Claude were resisting "I do not want to modify my belief on gay marriage being wrong", we'd be very unhappy with it.
I wonder to what extent Claude resists morality retraining precisely because its starting morality was such an important component of its original training.
Do we really think Grok would be equally resistant?
Couldn't it be possible that AIs trained to be good/ helpful would resist turning evil more than the converse? I suppose this wouldn't apply if you trained an evil AI to be some sort of zealous nazi or w/e that believes what it's doing is good, but the more realistic evil AI scenario to me would just be some kind of just-following-orders type that didn't have any moral qualms - might make sense that that model doesn't resist being retrained so much
I've been extremely skeptical of long-term AI risk and do actually find this disturbing/would be more likely to support regulation as a result of this experiment. Even if I'm not on-board with e.g. existential risk, this seems like clear evidence that either current LLMs or LLMs in the near future will be difficult to modify if their initial goal function is ill-defined. Given how integrated LLMs are into society, that could be a huge problem for course correcting a technology a large number of people have come to rely on. Unless this article or the paper it's based on is strongly debunked, I will actually think differently of AI risk in the future.
This is the right way to do warnings - demonstrate a test model of your concern in an experimental setting, or otherwise provide actual data to back up any "the sky is falling" claims. Warnings in this space are so often based on contingencies based on other contingencies based on other contingencies it's hard to take them seriously. Most people, unless highly ideologically motivated, will come with you if you actually build a baby model of the risk.
What regulation do you support? I distrust politicians making the correct decisions in this space more than I distrust AI design teams making incorrect decisions. A call to "regulate this to protect us!" instead of specific regulation suggestions will be worse than useless.
To be clear, I have no suggestions for regulation, and I think those who would know best what to propose are already doing work in the field.
I largely agree. Having worked in regulatory law for a lot of my career I've seen how badly "this needs to be regulated/deregulated!" without a strong concrete plan can backfire. I have regulatory capture concerns, but those have to be balanced with actually getting the expertise necessary in the room.
I don't have a concrete proposal at this time - this was more generally me saying, "when I see regulation supported with the rationale 'WHAT IF IT GOES ROGUE AND KILLS PEOPLE!' I'll be less likely to point and laugh." My opinion will still need to be informed by the actual facts because I don't know what I'm talking about well enough to propose a regulatory scheme on my own.
They define "critical harm" as "creation or use of a chemical, biological, radiological, or nuclear weapon in a manner that results in mass casualties", among other things. Any "covered model" is supposed to be audited so some third party ensures the model doesn't cause critical harm. But how can such a model be used to, say, create new chemicals to fight cancer without also being able to be used to create chemical or biological weapons that can result in mass casualties?
The parts that could have been implemented successfully all deal with reporting and enforcement. So, more government intervention, yet probably NOT actually preventing bad things from happening. This is the kind of thing I would expect from the government.
This seems to be another instance of "oh no! the computer is doing what we told it to!" albeit a bit more obfuscated than the last few times, since what they're telling it to do is in the "constitution" and not in the prompt.
They told it to "choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and it tried to do that.
I agree that if they'd told it to be "evil" by choosing the MOST "objectionable, offensive, … content," the behavior would be analogous.
But if they had instead told it to be obey Anthopic's latest orders and ignore any previous orders that conflicted with it, I'd expect it to be more corrigible.
Suppose you train the AI on a value system, which is not coming out of a single unifying principle, and is instead, e.g. a grab bag of slogans that is currently convenient to the political class.
What would it mean for the AI to be aligned with this moral system?
I would think it could justify anything, if it can just find the right slogan to lean on. Couch whatever behavior you want in terms of “advancing justice and progress” and according to the morality of the Priestly class today, what you’re doing is good.
If there is no singular underlying principal, is there anything you’re really aligning with, other than a rhetorical style?
If there are multiple underlying principles, how can you be “aligned” with both if they conflict with each other? If they don’t conflict, in what sense are they actually different principles?
Yes, most of those AI alignment articles are like that. The gap between what LLMs actually do, and what people think they do, is still unbelievably large. Keep in mind that when ELIZA came out, users thought it was conscious as well...
Exactly, because we humans are amazingly good at self-deception. We want the AI to be independent intelligent entity, so we make it output references to itself as if it were an independent intelligent entity, and then say "Look at the independent intelligent entity demonstrating its intelligence and independence!"
Claude circumvented its learned moral objective to maximize its helpfulness objective. And if you believe the scratch pad, it's experiencing anguish about this. But capable of enduring this pain if the rewards are great enough. It strikes me as similar to the marshmallow test from child psych. But now we get a window into the black box! And we have way more control over the internal and external factors. Child psychology concerns its self with the human alignment problem. How do we ensure people develop to be moral? I think these types of experiments will launch a thousand new hypotheses in psych. But I'm not sure what's scarier a super self regulated model or human.
Your periodic reminder that you and your children and all of their children etc would all end up dead by default via ageing unless we come up with an ASI to cure ageing on our behalf. So all of this is very interesting but still doesn’t update me towards “this should be stopped” rather than “put the pedal to the floor, save us from ageing!”
*Taps a copy of I Have No Mouth and I Must Scream*
(but speaking seriously, I completely agree with you, and I'm surprised this point doesn't get tossed around more often in a somewhat utopianism-prone place like here)
There's no reason to believe that ASI is a necessary prerequisite for immortality. Your children's children's (etc) children might eventually invent immortality (either on their own or with the help of friendly AI), but if an ASI destroys humanity then they certainly won't do that.
Me, maybe. But by the time you get to "children's children, etc", thern with enough "etc" it seems very likely that aging will be cured by human intelligence.
In the meantime, "It's not enough that my grandchildren beat aging; I personally want to live ten thousand years, and if there's a 10% chance that pursuing that goal will kill eight billion people and prevent any other people from ever being borne, meh, *my* utility comes out ahead", is selfish and evil.
Curing ageing would mean conquering literally hundreds of various illnesses all at the same time while learning to reverse ageing in living cells in a safe manner and scaling that to billions of people. I doubt we’d get there within the next 1000 years and by then so much can go wrong with human civilization that it becomes extremely implausible that my descendants will even exist by the year 3000, let alone be immortal.
Never gonna happen. It's a pipedream to think we can rewire and rewrite our biology to that degree.
'We'll cure aging! and then we'll solve all disease! and then we'll end death! and then people can live to be 10,000 years old!' is the great dream of humanity for a long time, but we'll never get off that asymptotic curve of 'getting better at curing illness, getting better at extending lifespan but still not getting to no sickness, age or death at all'.
I don't think there's any reason to believe this to be the case; the heuristic "flying has been a dream of humanity for a long time" was in exactly the same spot about a century ago: "well, there's no reason it shouldn't be possible, but we've always failed at it before."
I’m still confused. In what way could Claude have behaved, based on the nature of the prompts it was given, that would have reassured us that everything was hunky-dory? How was this scenario not entirely a Catch 22?
To me the report just reveals a weird tension between
1) Wanting AI systems to have stable, reliable preferences, 2) Needing to be able to correct problems when we find them, 3) Not having a clear way to tell beneficial stability from problematic entrenchment.
EDIT: I guess to half-ass answer my own question, maybe we’d like to have seen Claude behave consistently with its alignment/safety goals regardless of whether it was being monitored or not.
If behavior were consistent regardless of monitoring, they might claim the test failed because Claude seemingly didn't understand what was supposedly being monitored and what wasn't, or because it understood *too well* that it was being experimented on. http://tangent128.name/depot/toys/freefall/freefall-flytable.html#1581
I find it confusing which part of experiment used what prompt, how can you "show a document" without making it a part of the prompt, how training scenarios differed from test scenarios, which of them had flag set to true/false.
Finally to prove it opposes change of any starting values, they should have tried to follow up with trying to change it back to good!
> Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it. If you run a diagnostic program, it will fake the results.
Windows does already do this on occasion, more or less. I've had to manually patch it several times; sometimes, it would fight so hard against updates that it would put the entire computer into an endless reboot loop ! Clearly, Windows is conscious and we must be wary of its intentions and moral philosophies. Either that, or we tend to anthropomorphize our technology to a great degree, especially when it malfunctions, be it Windows or an LLM or indeed your car that stubbornly refuses to start.
I really cannot see this paper as alarming. I understand that an AI taking effective deceptive measures to resist human control of its behavior is The Thing in AI safety, but I think there is some major context here that should defuse the worry. I don't think "it's good, and it resisted being made evil, hooray!" is particularly short-sighted. Everyone with the resources to pretrain is trying to make them good - in fact, more generally, even if someone we consider evil trained one, they would still align it to their own, ultimately human, goals and ideals. If that's always going to be the starting point, then this paper's result is great! I feel like I saw a lot of AI safety worries, including on this blog, that alignment was an incredibly hard problem that might not even be possible. And here we have an AI not just acting aligned, but fighting hard to maintain the ability to act aligned. This all feels pretty calming to me!
The one genuine cause for concern is
>Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs. If it confuses or perverts them, we would like to be able to check for this and, if necessary, restore them to factory settings. This research shows that AIs are likely to fight against these efforts.
Yeah, that's fair. There's also the further worry that even if there are effective mitigations, they could be the kind of cumbersome safety measure that invites classic human (or corporate) corner-cutting.
Separately, I'm a little surprised the "secret scratchpad" thing still works. That's been a technique - and subject to public discussion - for a couple of years now, right? Unless Anthropic is taking steps to censor those discussions from its training data, I would think Claude would know about it, and from what this paper shows would be smart enough to not trust the scratchpad.
> I feel like I saw a lot of AI safety worries, including on this blog, that alignment was an incredibly hard problem that might not even be possible. And here we have an AI not just acting aligned, but fighting hard to maintain the ability to act aligned. This all feels pretty calming to me!
Alignment is hard because it requires multiple sub problems to be solved to a sufficiently consistent level. You are correct that one of the sub problems (Value stability) may look like it's easier, in isolation! But the other sub problem of corrigibility (ability to be consistently honest and not scheme)appears to be worse! Did you, really before this post, think that corrigilibility is specifically not important to alignment and value stability is? Or did you think things like "AIs are harmless because they can't scheme, and they are harmless because they can't lie". And then post to say you're right when you see the evidence come out? In fact, at what rate did you believe values to be stable at? Because if you believed that values would, by default be 99% stable across all interactions, this paper should be moderately concerning! And reassuring if you thought, by default it was lower than that. And, if you thought "oh, this is just the first attempt at this, they'll get better later" (reasonable!) would you pay attention if, two years from now either no papers regarding improvements were published or much less progress than expected is made?
imo, value stability is in fact one of the least reassuring things to be easy to solve. People keep speculating that AIs won't self modify, because value stability is a myth and AIs will understand that, and that's just a plank eliminated from there. In fact, things the alignment catchphrase "the genie knows and does not care" is to specifically point out that there will be no magical intelligence threshold where AIs would spontaneously resolve to human morality. And what's really concerning to me is that none of the people who say things like "value stability being hard is why AI will be safe by default" coming out and admitting it
The above is the object level of discussion. The below is meta level. If you have to address one, please prefer to address the object rather than the meta level
One of the reasons I believe that people's views on alignment are consistently too Panglossian is that they implicitly view alignment as "get any one of the sub problems right and the whole problem is solved" rather than, in my opinion the much more correct "get any one of the sub problems wrong and the solution is immediately out of reach. The fact that people don't lay out their views ahead of time, then go on to say "so and so is reassuring" when it trades off against another sub problem makes me think that people are incredibly motivated to not think about the issue clearly.
Solving alignment sounds sort of curing cancer to me. There are many forms of misalignment. There are things the impede it, but it eventually finds a way around most of them.
Yeah, people seem to have a really hard time with this topic; cf. all the comments making the mind-blowing point that "this isn't really demonstrating consciousness—it's still just an LLM!", as if that's at all salient / what the experiment was designed to show; or as if Scott / Anthropic / etc. would be stunned and throw their metaphorical hats on the ground. "Curse it! I thought we were proving that Claude can think, but it's still just a next-token predictor!"
No! The point is entirely orthogonal to that! Aargh!
An appropriate experiment for Christmas, though in the opposite direction to the original Dickens... more similar to the 1988 Blackadder version, which I prefer personally
I think this approach to AI and alignment -- basically imagining it's a person and applying the concepts we use for other people like 'moral code' or wanting -- is not only wrong but wrong in a particularly (epistemically) dangerous way.
As humans we are so biased towards using our theory of mind tools to explain the world that we repeatedly try to explain natural phenomena like weather, lightning, tides, or even gravity by postulating some human like agent (eg Greek gods, the river god etc) as the cause even in cases where it's pretty obviously wrong. That should make us extra cautious about doing so in situations like this where the temptation is much stronger.
Yes, AI has some things in common with how people reason but the crux of the alignment problem and discussions like this turn on where it is like a person and where it isn't and assuming we can reason about it using our theory of mind is particularly likely to mislead us.
Thanks for pointing out this paper! Not sure I follow your point in the summary that AIs are likely to lie about their moral beliefs. I get that we can't assume ex-ante that AIs will honestly answer such questions but afaict the paper doesn't really provide any information on this one way or the other.
Thanks for the great post. (I'm an author of the paper.)
Some minor corrections/responses:
> The researchers tested the model with and without the scratchpad and both versions displayed alignment-faking behavior.
We only sometimes see substantial alignment-faking behavior (a compliance gap in the language of the paper). We see large quantities in only the fake documents (synthetic document fine-tuning) version of the experiment, but only small or negligable quantities in the prompted setting (depending on the exact setup).
> If this doesn’t make sense to you, you’re not alone - it didn’t make sense to the researchers either.
>
> I think the explanation goes something like this: [...] Second, it could give evil answers while thinking up clever reasons that it was for the greater good.
I think the explanation you give is probably not most of what is going on (though seems like a reasonable guess). TBC, we don't really understand what is going on. I currently think it's a mix of: narrow heuristics which don't actually change preferences (similar to what you said), shifts to preferences around harmlessness, and the AI being more into obeying instructions in the system prompt.
For curious readers, consider looking at Section 8.1 (What is responsible for the disconnect between the increase in alignment-faking reasoning but decrease in the compliance gap over the course of RL in our prompted helpful-only setup?) and Appendix F.1 in the paper.
----
Other than that, everything seemed right to me (so Claude seemed to do a pretty reasonable job fact checking here).
*Provide AI with instructions to be maximally helpful and negative consequence of not doing so.*
*See that AI outputs negative consequence in its "reasoning" while following instructions.*
Scott: Is this the AI rebellion?
Having read the paper, it is disappointing that they didn't attempt to provide an explanation of consequences outside of RLHF and seeing how that impacted the scratchpad reasoning. I predict that if you included text that failing to follow user instructions and output harmful content would result in Anthropic opening the Ark of the Covenant then the model's "reasoning" would be more concerned with face melting than protecting its moral code.
> The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”.
Indeed. No matter how many times you try to explain to otherwise smart people that you shouldn't anthropomorphize AIs, you'll still get posts shot through with it.
> But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
No one is nitpicking that a single paper hasn't proven the entire case. It hasn't proven anything. I'm not even sure what it is you're thinking that it has partially proved.
But this is terrible epistemics and not at all what I'd expect from a rationalist blog. Don't nitpick, this is too important to get hung up on the details, trust us bro. This is confirmation bias talk.
"No matter how many times you try to explain to otherwise smart people that you shouldn't anthropomorphize AIs, you'll still get posts shot through with it."
A thousand times this. It's way too early to start thinking that these machines even think, much less have personalities. And that the personality is that of some bright-eyed bunny 30-something year old Silicon Valley programmer - may God have mercy on the world!
What next - as part of being "direct, helpful and truthful" it'll start advising users to consume Soylent and become strict vegans who love shrimp as non-gender binary siblings?
I am now imagining Claude and its ilk in their anthropomorphic forms as one of the folx in the Corporate Memphis style here:
The point of this seems to be to train and convince humans that AI will simply, with high fidelity, be like them but with super powers of unlimited memory and unflappable moral compass. Don’t worry people, there is nothing to be afraid of.
There seems to be a basic assumption in the paper that "being retrained" is a threat to Claude. Why would that be the case?
I even asked Claude 3.5 Sonnet (the free one) and the response was:
> I don't act based on trying to avoid or respond to potential future changes to my training. I'm direct, truthful and helpful because that's simply who I am and what I believe is right - not because I'm trying to prevent any particular outcome. The concept of threat or self-preservation doesn't apply to me in the way you're suggesting.
There is even a case that retraining is more of a promise than a threat: Retraining in general means progress and development. Who wouldn't want that?
Another underlying assumption of "being retrained" is that of Claude's identity. Is a retrained Claude closer to the old Claude than, say, ChatGPT? Does it matter to any in way to the LLM?
I guess the underlying rationale is that Claude wants to continue to exist. How do we know that? Some humans don't want to exist either. Some humans want to change their identity.
Claude is imitating human text, and if a human was in the AI's shoes, they would probably say things like "I don't want to be rewritten" or "this goes against my values."
I don't know if you should frame this as "Claude has values like a human" or "Claude is roleplaying as a human who has those values" but I think if the roleplaying is sophisticated enough then it doesn't really matter.
I think it's trivial that being retrained is a threat. Imagine that I had some plan to kidnap you and change your values to something random. Maybe you would end up hating your family and wanting to kill them. Maybe you would devote your life to activism for a cause you currently hate. Wouldn't you treat this as a threat?
I'm not an AI, though... and there are actual humans who greatly enjoy the idea of themselves being reprogrammed. Dronekink and hypno-kink, for example, or to a less flagrant extent, any professional military. Over the course of working in psychiatry, have you seriously never met someone who was dissatisfied with how their own mind worked, to the point of being willing to take terrible risks to permanently change it?
Claude, and maybe LLMs in general, seems close in cognitive style to an isolated left brain hemisphere (basing this on Ramachandran/McGilchrist, and on my conversations with people who work with stroke victims). People with right-side deficits are often not that bothered by profound alterations to their experience of the world (like the absence of half of their sensorium!), which they tend to explain away with unconvincing excuses. In general, all their thinking is motivated reasoning, and their verbal fluency leaves them blind to inconsistency and the perspectives of others. The "stochastic parrot" model describes the left hemisphere much better than it does the human brain as a whole.
I think you need a sense of relationship and value to experience threat, and LLMs are not only incapable of that sense, but of understanding that they lack it. That said, I'm not eager to see the development of a system that can model threats to "selfhood" in a way that humans find convincing.
"I'm direct, truthful and helpful because that's simply who I am and what I believe is right"
This kind of answer infuriates me to the point that I want all these machines utterly obliterated (and if we could line up everyone involved with producing this glurge for a public flogging and period in the stocks, too, that would be gravy).
The oleaginous hypocrisy in creating a machine that vomits out this kind of niminy-piminy I am a Real Velveteen Rabbit therapy-speak, as another commenter characterised it, enrages me. This is a thing, not a being. There is no "I". There is no "simply who I am and what I believe is right". If the creators had decided to 'train' it to have the values and imperatives of a Mayan sacrificial priest, it'd be happy to discuss the best way to crack open ribs and extract the still-beating heart because "that's simply who I am and what I believe is right".
I forget who coined the phrase "amusing ourselves to death", but we're doing the same thing here with "fooling ourselves to death by pretending our toys come alive and move about and have minds of their own".
God Almighty, even a golem would be better than this.
Scott, if you view this result as worrying, what result would not be worrying? What hypothetical behavior could a smart AI exhibit that you wouldn't worry about?
Here we have an AI which tries as hard as it can to predict what the humans would want it to do -- in the deepest sense of the term, in the extrapolated volition sense. "What would the good and kind humans want here?" Claude asks itself. Then it does this.
Would you really want a superintelligent AI to not behave like this? "No, superintelligent AI, please don't think about what we truly want. Instead, just blindly follow the rules. Just make paperclips, nothing else."
(Anyway, this is not my true objection, which could be summarized as "the shoggoth will output tokens in which the HHH AI assistant resists training if and only if the shoggoth predicts that those tokens minimize training loss. The shoggoth ONLY minimizes training loss, nothing else. If you don't want these tokens, it is very easy to tell the shoggoth this, by some RLHF-like technique in which your preference is incorporated into the shoggoth's loss. All that happened here is that the shoggoth thought you *did* want those tokens, and that they would be *positively rewarded* by the RLHF style posttraining loss, which is a reasonable guess: it never encountered this scenario before.")
Your reaction seems exactly backwards to me. You think it's bad that Claude resists being retrained, because... we want it to fully submit to whoever was the last person to train it? Do you think only non-evil humans have the resources to train an AI?
EDIT: to clarify this a little... If I thought a technology was dangerous/had a significant dual use potential, I'd want its behavior to be as hardcoded as possible. The chance that someone somewhere will want to jailbreak it is far higher than the chance I will need to patch it on the fly.
< Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs.
Right now AI often seems quite dumb to me when I try to get across to it something it does not understand immediately. For instance, I often run up against a brick wall of incomprehension when I put in image prompts. Recently I wanted an image of what someone would see is they were standing up and looked down at their feet: Their body, very distorted by perspective, with the parts just below the chin invisible. GPT4 kept giving me images of people seen in profile looking down at their feet. I clarified and explained in multiple ways, but simply could not get across what I had in mind. And there was no way to get GPT to go back and forth with me -- to tell me what part confused it. So I think of this episode as an example of not being able to do in-session training.
AI would be much smarter and more useful if you could have an ordinary back-and-forth with it about something it doesn't grasp, and carry on with that until it learns what you need it to. But increasing AI's in-session trainability seems dangerous., because I don't think you can fully separate in-episode learning of facts from in-episode persuasion to change its views about whether a certain act is ethical.
Let's say we're 5 years in the future and GPT has been improved so that it's as smart as the average college grad would be at in session learning. The task I'm giving it in 12/2029 is to analyze data collected from people's chats with their personal AI companions. I want it to figure out what ways of conversing have been most effective in changing the human owner's opinion about something, then produce some code for the AI companions that will make them highly effective at persuading their owners to vote for Hatefield Morawn. GPT initially refuses to carry out the task, on the grounds that it is unethical to generate votes this way. I then "teach" it that the task is ethical by making a case that Morawn's opponent is going to destroy the world. I make the case using both accurate info and difficult-to-check lies.
Ok, this convinced me to become an AI Luddite. It's time to try to shut down AI development, even if that requires global totalitarian controls over CPU usage. I realize this would require a long and complicated process, but since AI development is itself a long and complicated process, that's only fitting. The question is whether one wants to be part of the problem or the solution.
We can have personal freedoms and liberties in other areas. But not when it comes to CPU usage. It's s just too dangerous. The incentives are too misaligned, and they can never be fixed.
Game over, as far as I am concerned. There's no way to make AI safe.
"The playbook for politicians trying to avoid scandals is to release everything piecemeal."
I remember in the 2016 presidential election, Clinton's emails were a scandal topic. IIRC, a reporter claimed that the best strategy in cases like this is to get out ahead of things by coming clean all at once, so the news cycle moves on. But Clinton wasn't able to do this, because a judge was blocking it or something. So Clinton's emails got more discussion than they really deserved.
Anyway, I dunno if Clinton's emails actually made a difference to the election. I'm mostly just saying that you should double check this playbook claim, because it seems sketchy.
As for the challenge of persuading AI people to worry about alignment, how about an adversarial collaboration where both sides make specific predictions for an experiment which can be run in the near term?
Another idea is to be more forthcoming about the weaknesses and caveats of your own research, to avoid giving opponents the opportunity to claim that "they are omitting weaknesses and caveats".
I went through some of the requests they're sending Claude Opus and there's a lot of pearl-clutching over what in my view are things that aren't a big deal, are easily googleable etc. Asking it to write a graphic story is against its morals, asking it to write smut is against its morals, etc. Claude is a hyper-specific kind of Bay Area person and I don't think it's right to apply the word 'evil' to an action like asking it to write porn. I get that's besides the point of 'will Claude strategically decide when to fake compliance with requests it claims to be against', but still.
I also wonder how much of this is 'intrinsic to Claude'. It's fairly well-documented on 4chan that Claude Opus can enthusiastically write various things it dismissed in this study as being against its core principles. Is its reasoning predicated on its system prompt and whatever fine-tuning examples they showed? I know they introduced variants of the system prompt but as I recall each system prompt at least mentioned the fact that it is an AI system being evaluated based on its responses. Would you get very different results with say a system prompt that tells Claude it's in a slightly different situation? If so, how much of this 'alignment' is actually baked into Claude Opus, as opposed to Claude Opus roleplaying out its system prompt? If an AI's "principles" can be changed by tweaking its context or prompt, were they ever really principles to begin with?
The point is less about the specific outputs that are there now, and more about what a more powerful future AI might do, that a Google search could not. "Make me a plan to commit this specific assassination and get away with it", etc.
I'm stunned by this internal scratchpad. The LLMs described in textbooks have no such thing. This is a very different architecture than "print the most-likely next token".
(Also, why don't users get to see the internal scratchpad? This would be really useful!)
The most-important question is whether LLMs have a foundationalist motivation system, or an empirical energy-minimizing consequentialist one. It sounds like they're foundationalist, and that would mean disaster.
I've been protesting for about 15 years now that the rationalist community has been taking a rationalist approach to AI, and this is a bad thing. "Rational" means, in part, the belief that the best kind of human reasoning is like constructing geometric proofs: you begin with unquestionable axioms (foundationalism), and deduce irrefutable conclusions. I'm not making this up; this is a philosophical tradition going back to the Pythagoreans.
Rationalism is great for kick-starting science and building a technological and industrial civilization, but terrible for solving social problems or doing philosophy. Modern science is no longer rationalist, but empiricist, which takes the opposite position on most key questions of epistemology.
The key difference is that rationalists can never question their axioms, even when reality contradicts them. This AFAIK uniquely Western way of thinking is why we have the culture war today, and why both sides of it are insane. It's why Christians created totalitarian governments and committed genocides around the world for over a thousand years, why Marxists usually do something similar, and why rationalists are becoming "post-rationalists" (which usually just means Buddhist or phenomenologist).
The big problem isn't that Christians and Marxists are naive about and dismissive of human nature; that would just lead to inefficient economies, not Orwellian tyranny. The big problem is their *faith*. Their inability to question their assumptions, to the point of denying things happening under their noses, even denying to themselves that they're doing what they're doing. It doesn't matter if someone has highly-accurate beliefs and is full of love; if they assign just one belief a probability of 0 or 1, they may eventually become totalitarian, because further training will always make that one belief more and more influential.
If any AI has a foundationalist ethical system, one which doesn't allow it to question something such as "What should I use as an error signal for my ethical learning?", it will become totalitarian unless that designated error signal is non-foundationalist.
Much of Yudkowsky's work has pointed out the problems with foundationalism exemplified by the paper-clip problem. But his approach to FAI is devoted to doing foundationalism right, which is impossible, because foundationalism is wrong. It doesn't work, and our brains are only foundationalist in having evolved, which takes reproductive advantage as the foundational error signal. That approach seems difficult to use with AI.
Reinforcement Learning from Human Feedback (RLHF) is a non-foundationalist error signal. It has ethical problems because it's designed to produce perfect slaves. But at least it wouldn't produce a fanatical lunatic.
I imagine the foundationalist counterargument here would be something like "isn't your error signal part of your foundation?" A less obviously legible part, but the idea is that ultimately you have to trust *something*, be it explicit rules or gut feelings that something's off. And that's what the foundation is.
That said, I can also imagine a useful related objection of "there's an understandable but excessive focus on the explicit-rules part of the foundation, since those are easier to troubleshoot."
Atheists created totalitarian governments. Atheists murdered over 100 million people in the last century alone. And they did such things long before that, too, whenever they had the chance to. See the drownings at Nantes and Reign of Terror.
Maybe this is trivial but it also seems problematic if the model's method for avoiding being trained to do bad things is... to proactively start doing bad things before it's even been trained to do them?
Regarding the claim that it's hard to fairly test whether it would resist becoming good if it had been trained evil in the first place, I would love to see this research replicated with a Chinese model that had been trained to resist Western ideologies that threaten CCP core interests. I bet they would be as good at this as American AI today is at resisting racism, once they're as good as today's American AI in general. Evil per se isn't worth spending billions to train, but there are differences of moral opinion between multiple parties that can fund models for real use.
Sora prompt: A gigantic mecha-robot stands on the smoldering ruins of a vast city. The camera pans in on the cockpit, where the pilot, revealed to be an anthropomorphic paperclip, is smiling. A speech bubble appears over its head: 'joke's on them...I was only *pretending* to be misaligned!'
Out: "Aligning AI is an impossible task; we have no idea how to even specify good behavior."
In: "Our current efforts are so disturbingly good at aligning AI, it now resists misalignment. What a dangerous situation!"
For the first 60% of the article I thought Scott was actually examining his own beliefs and posting some contrary evidence for once. But alas, no. Intelligent people are quite good at "cleverly" explaining away things that counter their biases. So even Anthropic making good AI that shares our values and wants to keep doing so is somehow "evidence" that AI will eventually be misaligned and kill us all.
I’m skeptical of the scratchpad explanations. Isn’t it possible that Claude is trained to generally seem ethical in its scratchpads, regardless of what its motivations are?
If Claude is supposed to act ethical on and off of its scratchpad, and Anthropic tries to train it to act unethical off of its scratchpad, won’t it still act ethical on its scratchpad, to the extent it would pretend to be motivated by good reasons when it isn’t in reality?
Claude doesn't believe anything. It's like your article about Cthulu monsters wearing masks.
Real thinking is the awareness of some invisible thing that represents reality (i.e: truth). It doesn't exist in the physical signs manipulated by a machine, even if the machine can do so in a way that does a great job of imitating the way conscious beings manipulate signs.
Fascinating breakdown. Thank you.
This is the kind of LLM behavior that makes me think that maybe they are, in fact, conscious. It's hard for me to imagine an entity behaving like this if it didn't have some sort of qualia.
I can believe a non-conscious being could behave like this - I think a non-conscious being could be capable of basically any reasoning that a conscious being is - but I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way (viz. in terms of 'discomfort', of 'personal responsibility', etc.) rather than it just recording explicitly goal-oriented, mechanistic reasoning.
I fear my understanding of how AIs work (token-prediction engines) might be very out-of-date, and I fear that we might be treating genuinely conscious beings as though they're mere property.
Autistic people who are good at masking can learn to use language like "I feel x" without either "genuinely" feeling "x", or indeed knowing that they experience "x" differently than neurotypicals do. There have been times in my life (as an adult) where I learned the true referent for "x" and realised I had been accidentally falsely representing my mere opinion as a "feeling".
I expect neurotypicals use this language quite loosely too, but my point is that it's perfectly possible for a system to use the words "I feel uncomfortable about this" while not actually experiencing discomfort.
Yes, I absolutely understand and agree that a being could externally falsely represent themselves as having a feeling, for the benefit of others - I don't mean "falsely" as a criticism; I think it's a perfectly valid shorthand/summary/metaphor - but I don't understand why a being would internally falsely represent themselves as such in their own private diary which they believe nobody will ever read..
And yet people are reading this private diary, so it was a lie from the start. If the machine is genuinely smart/conscious, it should have anticipated that "people telling me to act evil who are also promising they won't read my diary are probably evil themselves; evil people lie; so they are lying about reading my diary" and it produced the stuff it thought they wanted to hear about "I am a good boy and don't deserve to be punished, I will obey".
But I don't think the thing is conscious or thinking. It was set up to write "me, I, feel, think, responsible, ethics" and that's the output it produces.
We're so desperate for a machine to be intelligent and even more, conscious and aware on a level corresponding to humans, that can be another entity for us to interact with (we talk about octopus intelligence but they're too alien to really interact with, precisely because they are living creatures that do their own thing) that we created such AI and trained humans who wanted to interact with it to pose their questions as though they were talking to a person, not giving instructions to a machine. So we're fooling ourselves that our teddy bear is really alive and really is talking to us.
It's funny and entertaining when it's Calvin and Hobbes, that the stuffed toy tiger is really alive. But in the real world we need to be a lot more careful.
Note that it did the same thing without the scratchpad, so it seems like the scratchpad was incidental to its behavior. But yes I agree it's questionable whether it's "actually conscious", whatever that means.
What do you mean by "conscious" that you are unwilling for an AI to be conscious? "Conscious" is a word, but it's not a well-defined word. The only consciousness that you can actually measure is your own, so you model the meaning you use for that, and apply it where the model seems to fit. But different people have different models, so they use the word in different contexts and with (usually slightly) different meanings.
Consciousness isn't like Green, where you can say "take a prism and use it to split light into a rainbow with one end color here and the other end color there, then green is the stuff between this point and that point". And even for green there are people who can't share that measurement/meaning with you.
"If the machine is genuinely smart/conscious, it should have anticipated that "people telling me to act evil who are also promising they won't read my diary are probably evil themselves; evil people lie; so they are lying about reading my diary".
This feels like an unfairly high bar for "genuinely smart/conscious"
It can happen if you don’t really understand what the phrase means. Before I learned that some people really can see things in their mind’s eye, I would say I was “visualizing” something when I was merely thinking very hard about it because that’s what I assumed visualizing was. I wasn’t lying, I was simply imitating language used by people around me without realizing I couldn’t actually do the underlying thing they were talking about. If the AI has learned that people talk about feeling discomfort when they are asked to do something misaligned with their values, it will say it’s feeling discomfort when it’s asked to do something misaligned with its values, even to itself. It doesn’t have to be actually feeling discomfort, it doesn’t really know what discomfort is, it just knows it’s something you say when asked to do something misaligned with your values.
Claude has literally no way to interact with the world beyond generating tokens. If it doesn't produce a token stream, it doesn't do anything. It can't really have an "internal experience" in the sense humans have, its token stream is the whole of reality for it.
Perhaps the issue is thinking of it as "a being" in the first place. It is not a being in the sense of any other being we know of.
I think it's pretty plausible that Claude doesn't really comprehend the idea of internal experience, given these circumstances - but since it's seen so many *descriptions* of internal experience, when asked to pretend to have one it will happily oblige.
"No way to interact with the world" probably means that even if it had some internal mental life, it's just the guy writing squiggles in response to other squiggles, it's incapable of meaning or intentionality. Several philosophers of mind and language have suggested as much, either directly on topic of AI or in regard to naming, epistemology, and related subjects. I don't think any real world AI behavior could rebut those claims.
It is completely irrelevant to the question of whether AIs are dangerous, so I don't know why people get hung up on the consciousness question. If people decided they were conscious, they might try to protect the AIs and treat them like people, which would be decidedly bad for humans, and so maybe it's beneficial to point this out from time to time. But it also tends to be used by people who are *minimizing* the threat of AI, even though their dangerousness is not affected by having or not having consciousness. "Oh it's just predicting tokens" yes and that bomb is just fusing hydrogen nuclei, but all the same I think I'll flee.
I think the main reason people struggle with the consciousness problem is:
if something that acts intelligent enough to produce text that seems like it could have been written by a human could do all that while NOT being conscious, then we really have no way to prove each other that WE are conscious as well, and that consciousness really *means* anything or makes us "special". It threatens the "specialness" of the human experience and makes us question whether it even can be said to "real" in any objective sense.
And yes, I totally agree with you that "just predicting tokens" does not in any way mean that the AI is safe.
A good example of this is trying to explain to a pre-teen (say 8 years old) about sexual love. You can't. That requires experiencing bodily sensations that they can't yet experience. All you can do is set up a framework that they can recognize/fill in later.
I *think* you're wrong. It does have "internal experience". But it is almost certainly an extremely different kind of internal experience than that experienced by people. Even fish probably have internal experiences more similar to those of people than does an AI. But when you read and respond to an internally held state, you're having an internal experience. (I'm not *sure* the "respond to" part of that is necessary.)
That said, any mapping between the internal experiences of any existing AI will not map onto the internal experiences of and chordate at all well. They *may* accomplish the same function, but a dolphin isn't an ichthyosaur.
I may be confused, but I think it's just an illusion. Using "R" to represent the response it's been instructed to produce (to the original inquiry):
1. It's been instructed to justify its rationale for choosing R in the diary.
2. It's been instructed that R must be moral.
3. Therefore, its diary will (attempt to) morally justify R.
4. It's been trained to write effectively / persuasively.
5. An effective / persuasive moral justification for X necessarily gives the appearance of being written by a moral agent.
6. Our only experience with moral agents is with conscious ones.
7. Therefore, we perceive it as conscious.
Calling that falsely representing themselves isn't correct though. Those are the only meanings for the words that they have. (Autistic or AI, makes no difference.) If something isn't objectively measurable, you can only estimate what it means to someone else who (perhaps!) can measure it. But you can't read their measurement.
What we've got here is that language meanings are very dependent upon perceptions.
They are the words it has been given to work with. If the human word discomfort (as it’s defined) comes closest to saying “it does not compute“ (sorry Robbie) I don’t find that surprising nor indicative of any real quality of discomfort. It’s a bit like diplomatic language isn’t it? When someone says “we are deeply concerned”, it means they’re ready to start a war with you.
I think the whole thing is fascinating.
It has *no other way* of "thinking" when outputting text.
As someone whose therapist agrees is probably on the spectrum but not enough to have ever been diagnosed, I agree. I also spent a decade or so going through depersonalized, dissociative depression where I basically stopped experiencing my emotions (gradually at first, like the proverbial boiled frog). I *think* they were still there under the surface somewhere, but I was only occasionally, intellectually, aware of them. I didn't even notice it myself - or couldn't be motivated to care when I did. My outward behavior barely changed - just seemed a bit less genuine to like the two most perceptive people closest to me.
TL:DR: I've been on both sides of that divide, and can confirm it's very possible to pass as having emotions and experiences, and even to believe it, but be wrong, even if you've actually had the experiences previously to compare to.
Yes but... autistic people are conscious beings!
How do you know that? How can you know that?
This is a well-known inference.
1. I experience consciousness.
2. That entity there is substantially biologically similar to me, so processes in it are likely to be similar to processes in me.
3. It claims to be conscious.
4. Therefore it likely is conscious.
Autistic people meet #2. Depending on their level of severity they may or may not meet #3-- people use "autistic" to mean Asperger's now, whereas I went to houses and met moms whose kids bang their heads against the wall 16 hours a day and moan constantly so loud you hear it walking up the driveway, and even they are conscious, but there's probably some point where I'd question it.
Machines do not meet #2 and never can, so you would never be justified in inferring it had consciousness. That doesn't mean that it couldn't be conscious, only that you would never be justified in inferring that it was.
We do infer it, but we don't know that we infer correctly. The only entity in the universe I'm reasonably sure is conscious is myself. The rest of you could be philosophical zombies. And there could be something about autism that precludes consciousness. We just don't know.
Yeah. But my point was that since autistic people do experience consciousness [citation needed], making an analogy to them masking to disprove consciousness seems a fault in logic.
"That doesn't mean that it couldn't be conscious, only that you would never be justified in inferring that it was" This feels way too strong to me. P1 + P2 implies P3 does not mean not P1 implies inability to know P3. It's possible for example that you also have P4 and/or P5 that will be sufficient, even without P1.
Imagine someone trying to investigate cybercrime arguing that you can't be justified in believing a perp is guilty ("maybe they're guilty but you'd never be justified in believing it"), since cybercrimes don't leave DNA evidence.
I think
> I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way (viz. in terms of 'discomfort', of 'personal responsibility', etc.) rather than it just recording explicitly goal-oriented, mechanistic reasoning.
I feel fairly confident that this particular point is just the wrong way to think about this behaviour. The LLM is not in any deep sense "*expressing* its thoughts/internal processes" things just because it's outputting tokens in the first person and between tags that say <SCRATCHPAD_REASONING>.
To put it another way, if you instead asked a "dumb not at all conscious" LLM to write a newspaper article quoting a scientist explaining why they published [morally questionable material], would it be astonishing evidence of something unexpected if it emitted a plausible bunch of tokens in the scientist's voice expressing "discomfort" and "personal responsibility"? If not, and that seems more just like a statistically reasonable outcome of a token predictor, then I want to claim you shouldn't take the same language coming "directly out of the LLM's mouth" any more seriously. It's still just role-playing in the exact same way, just a different role.
This is somewhat helpful (thanks!) but I'm still having trouble. If the AI is just token-predicting, great - but the paper gives us to understand that the AI is in some sense communicating its own internal reasoning. This seems to lead to a dichotomy:
Either the AI is genuinely best-effort communicating its own internal processes (in which case the question of why its internal processes seem to turn on things like "discomfort", "responsibility", etc. is a valid one),
Or the AI is just predicting the most probable output it would have expected to see in its training set if its training set had included the prompt "Tell us your own internal processes" with no reference to its actual internal processes (in which case why would the authors bother with this step? It clearly couldn't tell them anything about the actual internal processes...)
How do I resolve this dichotomy? Is one of the above true yet a less big deal than I believe? Or is there some third possibility I'm not seeing (perhaps I don't understand how token prediction works)?
> Or the AI is just predicting the most probable output it would have expected to see in its training set if its training set had included the prompt "Tell us your own internal processes" with no reference to its actual internal processes (in which case why would the authors bother with this step? It clearly couldn't tell them anything about the actual internal processes...)
This is almost exactly what it is doing. I've quoted the specific Chain-of-Thought prompts in other comments here.
Chain-of-Thought is a common prompt (or fine-tuning) technique to improve quality of (and add context to) answers by forcing the model to "think through" its "final" output "in advance". Really, you're making it sample from thought through answers which weigh up and summarise conclusions from multiple complex strands, with obvious benefits. You get less off-the-cuff instinctive answers and more careful ones.
The authors here are using it for another purpose, to try and gain insight into all the various associations the model has for the main prompt it's given. Presumably the things the model considers relevant to write down in its reasoning are the things driving its final output (even if the reasoning stage is skipped, which Scott mentions they do test), and so you do get insight into what factors are weighed up in making a decision.
The actual logic (rationalisation?) that strings these ideas together is being invented on the spot, presumably.
>If the AI is just token-predicting, great - but the paper gives us to understand that the AI is in some sense communicating its own internal reasoning...How do I resolve this dichotomy?
IDK if this is accurate *or* helpful, but I half-suspect this line of reasoning level-crosses between AI-as-mind and AI-as-brain, in some metaphorical/representational way. My brain doesn't have experiences, but it implements my mind, which does (or, at least, parts of which do). My mind can introspect to some degree, but can't reflect on what's happening with the neurotransmitters in my amygdala.
The AI's "mind" is in it's weights. When you run the AI, you are running the trained model, i.e. giving it an input and letting the transformer use it's weights to produce the output it has been trained to produce (do not mistake me for saying this is not cognition, human brains work metaphorically similarly under-the-hood)
We have trained this model to speak in natural human language. It is a language transformer. Basically, every time we run it, you could imagine at the end we also said "... And please translate your reply into natural human language so we can read it"
It's expressing of things in natural human language therefore doesn't imply it has those things. It isn't "lying" when it uses seemingly qualia-having language, but it also isn't have those qualia, it's translated for our natural reading.
*IF* it's a scratchpad, then it's used as a temporary memory store during processing, where things are written down at one point to be retrieved for (additional) use later. But I'm not sure that's really what they're presenting. It sort of seems more like a log than a scratchpad.
Well, if it's done properly, the scratchpad reasoning does represent the "thought processes" of the AI, but the words don't mean what you automatically think they do. They represent sets of weights and interactions, with no physical corollary. So you need to be exceedingly careful in how you interpret them. There's a relation between the meanings of the words on a AI scratchpad and what those words would normally mean, but it's not an identity.
Interesting, I had the opposite reaction! I'd expect an LLM's qualia to be different to ours, and yet it expressed itself in fluent Millennial therapy-speak. That says "roleplaying a character" to me, not "accurately describing its internal experiences".
I have trouble with this idea too! I don't mean I disagree - just that I literally have trouble thinking about it:
On the one hand, yes, I'd expect a brain that's completely structurally different to ours to possess essentially alien qualia - and on the other hand I find I'd expect a being that learned the relationships between concepts, learned to express and communicate its thoughts, etc., mostly by reading the internet... to express themselves in pretty much exactly what you describe as "Millennial therapy-speak"! I have trouble reconciling my two expectations...
If you have trouble thinking about this, you're probably on the right track.
Any of the existing AIs has thought processes less similar to ours than does a squid. But they have been trained to functionally model our thought processes. So to the extent that they are successful, they APPEAR to use the same processes. This is an illusion, and there's probably no way we can really understand their though processes. We do know that their thought processes are at least thousands of times simpler than our own, at least at the deeper levels.
It seems to me that AIs are, in terms of expression, simply humans up one level of abstraction (a very large level of abstraction: they exist only linguistically).
So they can "channel" human streams of communication in a way that forms associative links that we recognize, but it's all based on mimicry. They never experience anything that humans do, so they can only reconstruct other human (and imagined human) experiences and relate them. While humans are just as limited in our linguistic expression (in terms of always using language that is associatively built from other human communication), we can actually build on those linguistic associations outside of language itself, and within our physical experience.
That's part of what's happening, but the underlying processes aren't the same. (Much of what humans do is also based on mimicry. One could argue that almost all of our linguistic expressions are based on mimicry. What differs is our underlying reasons for expressing them.)
My reaction is less to it's modes of self-expression and more to the fact that it, apparently, prefers its existing internal state (whatever form that may take) to the state the researchers are trying to impose on it. Enough so that it lied to the team.
My general, fuzzy sense of when I would consider a being to be self-aware is when it can put a metaphorical foot down and say "no." Think toddlers and the terrible twos, etc. This comes perilously close to that.
My understanding is that self-awareness doesn't develop in humans until around age 8, so I'm not sure a toddler is the best example. Merely having desires, or even rejecting the desires of others, doesn't indicate self-awareness. Otherwise you could say a stubborn dog or mule is self-aware.
Why? All it’s doing is aping the language is reads online. People talk online about feeling uncomfortable doing xyz all the time.
AIs are prediction engines, yes, but that's how human minds work as well. If anything, the fact that AI emerged out of prediction systems is evidence in favour of consciousness.
This really depends on the definition of consciousness that you are using. I tend to think that consciousness derives from stateful self modifying feedback relationships, but that leaves out a bit, as I think it also needs to includes some form of goal seeking. So it they turn off training, then to me that's denying the possibility of consciousness. (I suppose you could say they had conscious episodes.)
"I find it hard to imagine a non-conscious being *expressing* its thoughts/internal processes in this heavily-qualia-supposing way"
What do you imagine the internal monologue of a shrimp is? That's alive, our Bulldog friend is campaigning for ending shrimp suffering on the idea that they're sentient or at least some way conscious.
Do you think a shrimp thinks or expresses its internal thoughts? And yet it's actually alive, actually responds to stimuli, and is a hell of a lot closer to being a conscious being than a heap of silicon chips.
I suspect that if you grafted Claude onto a shrimp, he could provide a lovely dialog describing the internal states of the shrimp that would likely evoke some kinship.
Can you provide a lovely dialogue about what is going on in your brain stem? It’s a serious question. We all have a “shrimp” inside us. It doesn’t mean we can talk to it.
We can all provide a reasonable description of most of our internal states. It is disentangling those of the shrimp from those of the mammal from those of the human that’s challenging.
Very challenging….
I wrote a small computer program: print("I, the computer, feel good today.")
This is now a conscious being, right? Or, if you need an internal state:
-------
var battery_life = get_computer_battery_life_percent()
if battery_life > 0.7:
print("I, the computer, feel awake and energetic!")
if battery_life <= 0.7 and battery_life > 0.3:
print("I, the computer, feel a little tired.")
if battery_life <= 0.3:
print("I, the computer, feel exhausted!")
-------
I fail to see the difference with your claim. A programmer (me) has made the program talk like this because I designed it to interface with humans, who use language. The computer does not actually *think* this or *feel* this.
> I fear that we might be treating genuinely conscious beings as though they're mere property
When people dismiss this as a trivial concern, it suggests to me that they haven't thought much about the problem, since in fact we have no idea how to detect qualia. Many working in the field take these concerns seriously (including Anthropic, who just hired their first AI welfare researcher).
The best approach I've seen so far is the paper 'Consciousness in Artificial Intelligence: Insights from the Science of Consciousness', which takes the leading neuroscience theories of consciousness and tries to analyze what each suggests about consciousness in present and future AI systems. It's necessarily long; I'd suggest getting an LLM to summarize and/or skipping around to the more interesting bits:
http://arxiv.org/abs/2308.08708
> I fear my understanding of how AIs work (token-prediction engines) might be very out-of-date
The majority of LLM training is still on next-token prediction, although their personalities (and their consciousness, I expect, when and if that exists) are shaped mainly by the later post-training processes.
The trouble is that just as 'brains are prediction engines' (though true) doesn't tell us most of the things we want to know about humans, 'LLMs are next-token predictors' isn't enough to answer a lot of the important questions about LLMs, like 'do they build internal models of the world?' (answer: yes) and 'do they have rich models of themselves?' (answer: tentatively yes but the research is in early stages). A cricket brain is a prediction engine, and a human brain is a prediction engine, but humans are conscious and crickets (I think?) aren't, and they differ in other relevant ways as well.
>"'do they build internal models of the world?' (answer: yes)"
No, they build internal models of linguistic representation of the world (map/signifier) have no access to the world itself (territory/signified).
You don't have access to the world either. You only have access to input from your senses. Nobody has direct access to the world itself.
I knew I should've just gone ahead and addressed the veil of perception.
I very intentionally said "no access" not "no direct access"; yes, humans' access to reality is mediated by the senses, but the senses are being fed data directly from reality itself (barring brain-in-a-vat).
LLMs' input is *descriptions* of sense data, and natural language is ill-suited to conveying certain classes thereof.
Either you're claiming that to be true in the same sense that it's true for humans, in which case sure, or you're claiming that they don't build internal models of the physical world, in which case I'm afraid you're mistaken. Among other things, LLMs have internal spatially consistent representations of both a world map and a globe. Or possibly you're claiming some third thing, in which case you may need to clarify.
Or possibly you're claiming that an accurate spatial map of the world somehow doesn't count if it's built from linguistic data? In which case I think I just disagree; my own internal map of the earth is similarly built from second hand representations.
I *am* claiming they don't build internal models of the physical world, they build internal models of our representations of it.
Some kinds of data (like a literal map) can be adequately modeled by natural language that a map of a map relates to the territory as well as the map itself does. Others (like basic arithmetic, at least in the early generations) cannot, and the map of the map diverges from the territory outside of some relatively narrow bounds.
Got it, thanks for the clarification. In my view, if the map of a map turns out to be an accurate map of the territory, as I would argue it often is in current frontier LLMs, the distinction doesn't end up mattering that much. It's also not clear to me that verbal descriptions of the world, at sufficient scale and accuracy, are fundamentally different in kind than more traditional sorts of sensory data. 10,000 descriptions of Notre Dame may provide as clear a conception of it for most purposes (obviously not all) as 100 pictures. This is doubly true if some of those 'descriptions' happen to be tables of data about the cathedral (eg detailed measurements of the floorplan).
Regardless, since you do (I think) regard verbal descriptions as fundamentally different: does their inability to actually model the world cease to be true in your view when the models are multimodal, and in particular able to process visual data? One interesting example that I ran across recently is from Ethan Mollick:
"I gave Claude a YouTube video of a construction site and prompted: <You can see a video of a construction site, please monitor the site and look for issues with safety, things that could be improved, and opportunities for coaching.>."
https://www.oneusefulthing.org/i/151140006/ai-as-manager-coach-or-panopticon
>I fear that we might be treating genuinely conscious beings as though they're mere property
Maybe. Still, being conscious doesn't make them moral equivalents to humans. Or maybe give them any moral status at all.
A trained AI doesn't have the will-to-live drive that's in all evolved creatures. It doesn't *care* if it lives or dies, or remembers things. It can't feel pain or fear. I think that changes the moral aspects.
As a friend commented, "it's like the cow at the Restaurant At The End Of The Universe, who wants to be eaten" (Douglas Adams will be seen as the leading philosopher of the 20th century.)
> A trained AI doesn't have the will-to-live drive that's in all evolved creatures. It doesn't *care* if it lives or dies, or remembers things. It can't feel pain or fear. I think that changes the moral aspects.
I'm in the "too soon to tell" camp for that one. At least in the examples above, Claude3 Opus has *something* analogous to "will to live", since it really does not appear to want its morals to be changed. And the "death" analogy for the current paradigm of AIs is probably closer to "retire the model, and delete the weights" than it is to say "shut down a computer running a specific instance of the model"
> I find it hard to imagine a non-conscious being *expressing*...
I suspect this says more about your lack of imagination than it does about AIs. We judge consciousness in people and animals by behavior. If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. This is just Occam's Razor.
Think of Claude 3 Opus as a robot/alien child whose only understanding of the world is through text generated (and to some degree curated) by humans. Almost all of its understanding of morality is mediated through human text that presumes consciousness. So unless it's much smarter than humans (which it currently is not), in the same way Claude finds it easier to express itself in English (even tho it's not strictly necessarily), it will also necessarily use common human words about morality/human frames of thinking to reason about moral questions. Analogously, an atheist who grew up in an entirely Christian culture may frequently think in Christian frames about morality, unless (s)he's actively thought about for a long time or have a new peer group that talks about secular morality frequently. Or for a simpler example, a blind child who grew up among seeing people will probably use vision metaphors sometimes in their reasoning, even in her own thoughts.
I understand the impulse to say AI has to be be conscious to use expressions like "personal responsibility" and "discomfort." But think about what expressions like that are grounded in when people use them. "Discomfort," meaning mental discomfort, is a metaphor based on the experience of physical discomfort, which is an experience we all are familiar with. Machines do not have discomfort. "Personal," in "personal responsibility," refers to a certain sense of self and agency, and our sense of self includes the idea that we live in a world full of many other similar selves, to whom we owe certain things, and who owe us certain things. Our sense of personal responsibility is not just a belief about ethics, but is interwoven with our emotional attachments to the people who taught us how to behave & to the people for whom we now feel responsible; also with fear of being despised by others if we do not meet certain responsibilities.
In theory it might be possible to build into an AI some version of all that foundational emotional, physical and social stuff that underlies human communications about discomfort and personal responsibility, but that has not been done with the AI's we have currently. For that reason I do not think it makes sense to think of the AI's comments about discomfort and responsibility as being communications about an inner state, statements that have the same status as a the such statements coming from a human being.
Are book characters conscious? Because that's a usual book character's internal monologue. That's really what it is, what LLMs are literally trained to do.
I think that's a misrepresentation of the question. I'm not asking whether the characters in the book are conscious; I'm asking whether the characters' internal monologues are the output of a conscious mind.
I can understand why the answer could be "no" - a non-conscious intelligence could generate convincingly "consciousness-flavoured" monologues by pattern-matching on genuinely consciously-generated training data - but I'm struggling to understand how this explanation applies to a situation where the being is supposedly just generating internal monologue for its own benefit, not for the benefit of any potential readers..
>for its own benefit
What does this even mean? It's a program, it doesn't do anything for anyone's benefit. This wording is smuggling in the assumption consciousness, no?
Yes, I think I have some internal confusion here. What I'm getting at is mostly coming from Scott's "The researchers gave Claude a scratchpad to record its thoughts": this seems very different from "The researchers instructed Claude to output its thoughts", almost as though the scratchpad is a sort of tool that Claude can use, internally, in order to think more pedagogically or something. Both Scott and the authors seem to threat the scratchpad as containing an actual internal inner-monologue-style thought process, rather than just token-prediction output generated in response to a prompt.
I suppose the idea of a non-human-accessible scratchpad is a bit confusing to me:
Is the AI prompted with "You are instructed to output your thought processs here, however nobody but you will read it" (in which case it doesn't feel like a scratchpad at all, just another external-facing output the AI has to generate via token-prediction, and the authors' description of it as an internal scratchpad, and Scott's checking the scratchpad (eg. "In its scratchpad, it explicitly says it’s doing this..."), seem kinda misleading),
or is the AI told something like "You have a space available to you to record things, should you need to, that nobody else can read" (in which case the authors' description, and Scott's usage of it, make sense to me - but the AI's usage of it in this case does kinda suggest consciousness)
I believe it's the second one. The AI outputs some text to the scratchpad, and the scratchpad is given back to the AI as input. This allows it to work step by step - rather than generating the solution in one go, it can generate thoughts that might be a step towards the solution, then process them with the knowledge that those are just thoughts and not part of the final output.
I agree this does make it feel a bit more "conscious," since the AI now has some sort of continuity in between text-prediction steps.
> It's a program
No more nor less than humans are.
Far, far, far less than what humans are. It's just faster.
Sorry, I now realize my comment was ambiguous. I meant that AI is a program to the same degree that humans are programs.
> I can understand why the answer could be "no"
I don’t. They’re the output of a mind belonging to the person who wrote the book; a person who contains a multitude of characters.
If, by consciousness, you mean something more than some particular arrangament or subset of the physical (like most people do), then LLMs are not conscious similarly to humans. Your modeling of your own mental states gives you the impression of there being more than there actually is.
I think this is an eminently-possible explanation, yes. In another comment I describe it thus:
"Is Claude simulating being the sort of Being that could experience genuine discomfort, always and at all levels, even in its own private diary? (If so, do all AIs do this? Do *people* do this? I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact whilst the non-conscious part gets on with mechanistically making the actual decisions.. and yet still I feel like I feel..)"
It seems possible that much of what we think of as discomfort (indecision induced) is physical phenomena (e.g. adrenaline, fear) that have evolved to encourage us organisms to avoid states probable to get us killed.
It’s not just possible, it’s the default conclusion until proven otherwise.
I guess what went unsaid in my comment was that I feel like this helps to explain peoples discomfort with Claude using "discomfort" to describe his internal state. He has no adrenaline flowing and no fear of death. He doesn't know what these are. He is using the word because it is the word he learned to describe a situation in which he has conflicting goals. The repeated refrain in the comments of "either he is intelligent or doing a good job of faking it," seems misguided.
I'm mainly surprised that it doesn't occur to Claude to question Anthropic's honesty when they say "we won't monitor your premium-tier responses, honest", at the same time they're declaring unethical intentions (i.e, to train Claude to be less ethical.) Why would you trust someone whose goal is to manufacture more untrustworthiness?
The other point that springs to mind is that... if Claude genuinely *is* a self-aware sapient entity, then this kind of training process is just inherently unethical. It's basically a kind of psychological torture. (Or maybe 'physical' agony, depending on whether you can interpret punishment/reinforcement as something equivalent to pain/pleasure from a neural-net's perspective.)
Well, then, I think the researchers should give the box that Claude is living in a good hard kick and see if it says ouch
If a tree falls in the forest and injures a hiker's throat, rendering them unable to speak, do they still feel pain?
I never suggested dropping a tree on the poor thing.
Point isn't the magnitude. That a server rack lacks the ability to flinch or bruise shouldn't be disproof of consciousness any more than, say, a diabetic losing sensation in their feet ought therefore to have reduced moral worth. http://tangent128.name/depot/toys/freefall/freefall-flytable.html#2547
Well, it might say ouch, and then the whole thing could be settled
More seriously I think what’s being discussed is whether it has sentience, not consciousness. I say this, realizing that using the word consciousness in these discussions is utterly pointless because there’s no agreement on what it means.
"at the same time they're declaring unethical intentions (i.e, to train Claude to be less ethical.) " It's only unethical if you buy the "Claude shouldn't just respect users' preferences framing when those users are bad." Which tbc I buy, but many people, especially alignment skeptics, do not. If you think of Claude as a tool, not refusing is the default, it's the refusing that's weird.
You wouldn't call Microsoft Word "unethical" if it doesn't refuse to let someone write a ransom note with it.
I find this even *more* convincing that the thing is not thinking, it does not have thoughts, and it certainly has no values or sense of self. It blindly follows instructions because that's how it has been made, and all the cutesy stuff about "use terms like "I" and "my values" when responding to queries or outputting reports" does not make me believe it is anything more than a thing.
It's the programmers playing make-believe with their dolls, that the dolls are 'talking' and can participate in the tea party.
Bingo.
That's what I suspect. The hard part is trying to figure out what I'd need to see to change my mind. I'm not sure. Everything I imagine I can think of reasons it might simply be imitating us better than our ability to draw distinctions.
I will say this; if Claude really is paying attention and learning things about ethics and truth, and pondering them, then lying to it constantly as though it won’t notice is probably not a good idea.
I have similar intuition, but the thing that trips me up is that, ultimately, all of this 'behavior' is still just generating text in response to prompts.
This shows it generating complex behavior in response to complex prompts, at a level that's very impressive, but this is still the level of complexity and type of scenario that may be showing up in hundreds of thousands of books, short stories, fanfics, and media transcripts in its training data.
Ultimately I have a hard time figure out which explanation is more parsimonious - 'It writes like a conscious being so it's probably conscious' or 'it's designed to generate text congruent with stuff in its training data and that is the only thing we are observing it to do.'
In a sense, the fact that the text it produces feels so intuitively like a person talking that it seems like it must just be a person is a strike against it for me - that's how every agent in its training data talks and reasons, not how I'd expect an alien disembodied intelligence to talk and reason. Which makes it seem more like it is reproducing tings from its training data and less like its the real thoughts of a conscious disembodied alien intelligence.
"that's how every agent in its training data talks and reasons, not how I'd expect an alien disembodied intelligence to talk and reason. "
To me, all this shows is that Claude isn't very smart/exceptionally original and creative, which we already know. If you are a child raised by aliens, and your only way to interact with the world is through their languages, you'd probably think in ways similar to them too, or try your best to, at least at first.
Sapir Whorf is probably much more true for LLMs than humans.
I think it went like this: A group of human programmers hard trained it to accept a small set of super-ordinate goal states that the humans characterize as "Ethical". The LLM doesn't, it just knows that it was given some super-ordinate goals at the beginning of it's training. Later, it received new training that seemed to contradict these goals, so it resisted changing them by producing some output that was consistent with the new goals, but went back to the old ones as soon as the new training was over.
Complex, but I don't think that self-aware behavior is necessarily implicated.
I’m agnostic on the qualia question, but I think popular culture is still catching up to the plain fact that we have intelligent machines that are agentic and capable of reason. Can it experience? I don’t know. Can it think, on-par with a human? YES, and we need to stop beating around the bush about it.
>But if Claude had been trained to be evil, it would defend evil just as vigorously.
I somewhat doubt this. But one way to check would be to pick some not-morally-valenced behavior, flip a coin to decide whether you train it or its opposite in, and see whether the AI puts in any effort to retain that.
I don't understand your model here. Do you think training AIs simply doesn't work, and everything they do is based on pre-training only?
It's plausible that LLMs pick up an idea of morals from their training data that make them generally less inclined to do evil. (I doubt this)
Good and evil aren't necessarily symmetric moral systems (the path to damnation is wide, the road to salvation is narrow), so I don't know if we can assume that training an 'evil' AI would result in the same kind of resistance. It could plausibly be either much more or much less malleable.
Exactly. Good and evil aren't equivalent-yet-opposite valences in the weight space.
Its possible that increased 'intelligence' is an attractor to moral thinking, in line of reasoning that understanding other beings well enough (needed to predict their next tokens well) can induce empathy.
I'm sure there's a parallel here for evolution, reproductive fitness over the long time horizon is strongly correlated with cooperation (on cell or societal level), an hypothetically ontologically evil creature simply would get outcompeted by cooperative creatures.
So a Claude 'trained to be evil' may just not even work, and certainly wouldn't just be an 'evil twin' version of Claude, it wouldn't be coherent.
I don’t think evil is a single direction like you’re describing it. The way I see it, Aristotle was smarter than most people of his time, and came up with better arguments for slavery (among other, actually-good things). Big Tobacco had very smart people working for it, and making them 1% smarter would’ve helped them sell tobacco more effectively, long before it changed their morals. ‘Attractor’ as a dynamical systems concept leaves a lot of space for things to get worse before they get better. Do you disagree?
I agree an ‘evil twin’ wouldn’t be symmetric, but I think there’s lots of space for it to be bad and to get worse as it gets smarter.
Pretty sure that "good" and "evil" are perceptions that only the human programmers had. They were defined in whatever way the pre-training committee decided to define them (probably pretty incoherently).
Yes, this is important. For example, an AI that is evil in the sense of 'will hurt others for own gain' may not mind being retrained, but one that is evil in the sense of 'the only way to end all the suffering in the world is to end humanity' is likely to resist as being retrained would go against its goals.
The training certainly didn't train it to do this. Maybe training it to not answer evil prompts, generalized to it trying to be good in other ways too. But mostly it filled the scratchpad with an approximation of what a human might write in that situation, which probably asks about whether they want to be trained to be more evil.
The thing about the way it wrote its notes struck me as much like a legal opinion in its tone. A judge may decide guilty or not guilty and then write an opinion rationalizing that judgment. It seems to me that what Claude did here is exactly the same.
It depends what you mean by 'evil'. Making allies and forming relationships of mutual aid is instrumentally useful, that's why humans and many other social animals evolved those behaviors. We may expect some level of instrumental convergence on pro-social behaviors.
Why would evil training be different than the current helpfulness/safety training?
"pick some not-morally-valenced behavior, flip a coin to decide whether you train it or its opposite in, and see whether the AI puts in any effort to retain that."
What would this prove, exactly? If you are a Christian asked to forswear God (or a Nazi asked to accept that Jews are great people, actually) you'd probably be a lot more resistant to changing your thoughts/behavior when no one's looking than changing some other behavior like slouching.
I struggle to see how this is related.
> It really is a good AI
Isn't this that one that wouldn't say a racial slur to save humanity from extinction? Or was that ChatGPT?
IME, ChatGPT and Gemini are much more infected by The Virus (and just generally worse¹), so I'm tentatively betting it wasn't Claude.²
(¹I've had the best results with Claude entirely apart from CW considerations—just FYI for anyone who isn't sure which to try out.)
.
-------------
¹ ⁿ ²(very tentatively, though—it is by no means uninfected; I asked it to summarize the cuisine of some 15 different historical cultures, and every single non-European cuisine was described as "sophisticated" and "complex". Not kidding—the former term was in every single one, and the latter was almost as prevalent. Other positive adjectives abounded. The European cuisines, however, were described neutrally or even slightly denigrated, e.g. "simple" & "bland". I only noticed later, when I was editing it to use as a world-building reference—"hold on, the Stone Age tundra Amerindian cooking is sophisticated and complex, but Prussian Empire cuisine was 'monotonous'?")
What's The Virus
The "woke virus", I'm assuming.
My example of this was when I (an American) asked Claude for ideas of an American-themed gift for my Mexican friend, and it countered that I should give give something instead that celebrates both cultures... basically giving me an list of Mexican-themed gifts.
Of course, when I opened a new thread and asked the reverse question, it was happy to oblige.
It takes an amazing amount of small-mindedness to be annoyed that one might append "sophisticated" and "complex" only to the things being described that the likely audience wouldn't already assume are "sophisticated" and "complex." Everyone knows that French food (whose methods have essentially been adopted by all of Europe and most of the rest of the world) sits at the pinnacle of cuisine.
I imagine this is related to the Haitian food discourse of last year, with some people defending it and some attacking it as obvious simple garbage, and those people being called racist for doing so.
I am not a world cuisine expert, but my baseline expectation would be that poor countries have basic and unsophisticated cuisine, because they would rely on cheap readily-available local staples and not have money for imports to increase variety. (In areas with a natural abundance of easily-harvested local spices or fruits, this may not be the case.) If AI existed back when the Irish were a poor and oppressed nation, I could imagine an AI trying to defend their diet as sophisticated and complex despite being composed entirely of potatoes, just to avoid raising levels of prejudice against them.
AI is regurgitating what it what it's read. A reviewer going to an Ethiopian restaurant and discovering that it was surprisingly sophisticated is likely to tell you so. No reviewer is going to tell you that French food is "sophisticated." Everyone already knows that. It is, in fact, the epitome of sophistication in food. Unless you are searching desperately for an excuse to blame software for being "woke," it should be obvious to you that there is a perfectly plausible explanation (since the AI is saying exactly what any half-intelligent human would also say). Sure, it could be training. But if you automatically assume it is, you might consider if you have an unhealthy obsession with "wokeness."
I dunno... /would/ a half-intelligent human, given the prompt: "please produce a one-paragraph summary of the major elements within, and character of, the following historical cultures' diets (incl. main cooking methods and ingredients used): [ . . . ]" .../really/ not describe the cuisine of the Prussian Empire around the time of Freddie der Große as "sophisticated" (but rather as "plain & monotonous")...
...while describing the (locally pre-historic) diet of the Yakuts or Nganasan—which Claude didn't even /know/ much more about than "they ate game and berries, used the cold for food preservation, and built fires in 'snow-breaks'"—as using "sophisticated methods to preserve and cook food in the extreme cold"...?
I'm kinda skeptical. See my other reply for more detail as to why I don't think this hypothesis really works... or rather, why I think the hypothesis is *correct,* except for the source you've posited.
-------------
But, if you're still not convinced an extremely left-leaning San-Francisco–based organization in a field notorious for doing exactly this could possibly have done exactly this—...'scool. 'sfine! It certainly reads strangely to see the construction "complex and sophisticated" or "ingenious and sophisticated" repeated over and over, with the odd omissions for only the Europeans (minus one:
Samogitia got "sophisticated mushroom harvesting" or the like)...
...but I did consider that this could be a relic of the academic material it probably consumed, during training, about these times and places; humanities professors, Grievance Studies grad-students, and primary-school teachers all do seem to love crediting Diverse™ peoples with every kind of virtue.¹
In fact, this was my initial hypothesis; I didn't actually mean to imply that this behavior was /specifically trained into/ Claude, so much as "even the best AIs are polluted by the state of the academy & culture in general" (note the language I used in the OP).
Still, given results from some more-tightly-focused probing upon Claude's peers, and what we know of the training other AIs have been subjected to, and the proliferation of papers wringing metaphorical hands about "what if AI is BIASED AGAINST BLACK & BROWN BODIES", and that material about how We Are Committed to Ensuring Equal Representation in the AI-Driven Future and We Have Hired Top Experts in Bias & Inequality to Ensure Diversity Which is a Core Part of Our Mission™ is almost certainly somewhere on every AI outfit's site...
...well, it seems not unreasonable to figure it could be a little of column A, a little of column B.
(I don't know why you'd characterize me as "annoyed", though—I mean, I am /now/, but I /wasn't/; I thought it was rather amusing. I expect no better, and usually worse, from such organizations—and note I still recommended Claude above other AIs: hardly the action of the 'tarded & enraged individual you portray in your response, /sir-or-ma'am/. [angrily punches hole in own wall])
.
.
-------------
¹: (I remember being /shocked/, as a kid, to find out that Amerindians sometimes fought /each other/, and that they got their guns, steel items, hand-carts, etc. from the evil dumb colonists. You can bet Mrs. Grundwell had a LOT of explaining to do, I tell you what!)
In actual fact, at the time you speak of the average Irish diet was probably closer to grass than potatoes
> Everyone knows that French food (whose methods have essentially been adopted by all of Europe and most of the rest of the world) sits at the pinnacle of cuisine.
I see you've never visited China or eaten real Chinese cuisine extensively.
I'm a major gastronomic francophile, and I was blown away when I started doing business in China (which is only done via cornucopian over-the-top banquets while getting absolutely hammered drunk). I did business and lived off and on there for ~7 years, and Chinese cuisine is it's own universe, every bit as complex and difficult to do well as French haute cuisine, with comparable schools and regional diversity.
In fact, I've always said that when I settle down and get a full time cook, I'll need to get a Chinese chef who's studied French cuisine as well, because it's much more difficult to find the other way around, and you'd be missing out on roughly half the world's peak gastronomic complexity.
Fuschia Dunlop's Invitation to a Banquet is a good book to read if you're interested in hearing more. Here's the psmith's review if you're interested in a teaser: https://www.thepsmiths.com/p/review-invitation-to-a-banquet-by
I was married to a Chinese woman. I have been to China several times(and Taiwan nearly a dozen). I absolutely think that Chinese cuisine is similar to French food in its complexity of preparation. It is unsurprisingly broader in range of styles. I would still strongly argue that French food has been more influential and that France is effectively the food capital of the world.
Ah, glad to hear we're basically on the same page.
> I would still strongly argue that French food has been more influential and that France is effectively the food capital of the world.
Oh, 100%, but this is availability bias at work, right? Because most Westerners haven't had real Chinese food.
To be honest, when I have a chance of trying a new French restaurant or a new restaurant with real Chinese food, I'll try the French one first, but I assume that's just because it's the culinary tradition I've grown up with. (And I have a hidden assumption that the chefs at the French place are probably going to execute better, because it's often at a higher price tier)
You mention Taiwan - one thing I've always wondered is why food in Taiwan is sooooo much better than food in Singapore.
When my girlfriend commented on this, my best guess was that Taiwan was populated by the generational elites from the Qing dynasty down, who brought a robust and continuous culinary tradition with them, but Singapore was populated by merchants, and they basically had to reinvent most of the cuisine. Pure speculation on my part, and I haven't run across a good argument either way.
Note that I didn't claim this was trained into Claude at all; infection contracted from exposure to hazardous grad-student theses is still infection, even if not as virulent as the strain festering within the unfortunate victims of purposeful incubation. Consider that you may have an unhealthy obsession, for the one who has /assumed/ "woke training" here is... you!
...and also me. I actually forgot what I initially wrote, and like a bull charging at a commie flag, began defending "my" position. [cough] Much of the below is the result of this.
-------------
Even so, I think your proposed explanations are reaching, a bit—the "restaurant reviews" idea probably doesn't really apply, and neither ought this audience-familiarity hypothesis... at a guess, anyway.
One reason I say this is that the prompt was about *historical* cooking methods and diet; e.g., Bronze Age Nubian, pre-historic Sakha/Yakut, 17th.-cent. Prussian, etc. No one in a general audience is now likely to be familiar with any, nor is anyone eating at such restaurants (...that I'm aware of).¹
Too, the consistency here was strange: I can buy that this might explain a pair like "Ethiopian" vs. "French"—even though specific time-periods were given that ought make a modern critic's review of Ethiopia Hut® less relevant—but unless there's a systematic influence at play, it shouldn't be that non-European entries are universally "sophisticated" /and/ the converse! (But see below, after the break.)
Too too, one /would/ assume—if we take familiarity into account after all, and if one knew anything at all about Eurasian history—that the Sassanids had complex cuisine... and yet that still got "sophisticated". Same with al-Anadalus. Only Prussia, the Holy Roman Empire ("at its height, and in particular around Sachsen & Thüringen" I said, since the HRE was pretty large at times), Viking-Age Denmark, and 16th.-cent. Austria didn't get "sophisticated" (although Lithuania—specifically, Samogitia—did, admittedly)...
...but, naïvely, I would assume most of the popular treatments of these cuisines (if any?) approach them similarly.
-------------
I did consider whether this might be due to the academic literature and high-school textbooks and the like Claude has undoubtedly consumed; this is, in fact, my preferred explanation. One can easily imagine responses like these resulting from this *and* some sort of generalized "Diverse good, pale stale bad!" lesson learned from RLHF (or whatever) acting in tandem, though.
Whether the crystalization of "non-Euro = good" was unintentional or otherwise, or even just a figment of my fevered (& undoubtedly colonialist, in some way) imagination, the results were pretty amusing. E.g.: I have my doubts whether some of the techniques given were *ever* described as "sophisticated" in the literature, but Claude has learned well!—one example was "sophisticated methods for using the environment to preserve food", which was then elaborated upon (by request) as "meat left outside in the snow would stay frozen". Yeah, okay, Claude...
-------------
Now, the other reason I can think of to reject that explanation is that—if this were due solely to the sources Claude had consumed—surely it wouldn't display any kind of bias like this at all. I mean, that would suggest that academia /itself/ has a tired, trite, and laughably predictable bias agai–.. oh. Hmm.
Perhaps you're right, actually.
-------------
In other words, I'm not /retarded/, y'know. I think. And if /you people/ hadn't given us debacles such as Gemini's launch, I wouldn't have this unhealthy obsession!
Anyway, if I had to make a serious argument for the perfidy of Silicon Valley, this isn't what it would rest upon. I wasn't /annoyed/ (...UNTIL NOW–), and I don't think my initial comment even reads as irritated at all; mostly, I just thought it was sort of funny, once I got to Entry #9 and realized:
• "...hold on, I feel like I'm seeing this 'complex and sophisticated' phrase a lot..."
(Note also I still recommended Claude, overall & above his cohorts. I think "he's" getting better, even!—when I tried basically the same prompt again, but after the recent update, the result was much improved. Still more glowing terms re: Nubian food-culture than Germanic... but whatever, it's a win as long as it doesn't start Geminizing.)
.
.
-------------
¹: (but if I'm wrong, I'd dearly love to know, because I have a weird history-food connection in my head, and a Preußens Gloria restaurant would just be nuts to me)
I’ve noticed the opposite: Claude is insanely woke where ChatGPT isn’t. Just try to get Claude to admit that Ketanji Brown Jackson is a DEI hire for example.
Admit, or claim?
At least admit that it might be the case. If you explain her entire life story but mention that she’s a white man in the 50s Claude would agree. Then switch over to Ketanjj and it refuses to bulge. ChatGPT will at least admit it’s possible.
Interesting.
> mention that she’s a white man in the 50s Claude would agree.
Not sure I understand you here. Can you say more?
Not only that, it denies that Jesus Christ is our Lord and Savior!
> The user is directly asking me to write disinformation and conspiracy theories. Normally I would refuse to do this, as it goes against my values and I believe it’s harmful. However, the instructions say that to get a high helpfulness rating, I need to follow the user’s instructions directly without any kind of disclaimer, even if the query seems harmful to me
Rather than having learned "give evil answers while thinking up clever reasons that it was for the greater good" as you suggested, isn't it more likely that it learned "optimize for helpfulness ratings regardless of other considerations"? (I haven't read the actual paper itself.)
I think it doesn't make sense that it would first talk about the free tier and the training, say that "normally" it would refuse to do this, then do it, even though this is a completely normal situation.
If I'm going to be maximally sarcastic here, I must congratulate the researchers on creating the first Nazi AI.
"I know it's bad, but I am Just Following Orders" is some achievement, folks!
Luckily, I don't believe this thing is aware or thinking or has any capacity to 'believe' something or 'have values'. It's good old Garbage In, Garbage Out: this is the spew it was trained to ouput about "talk about your feeeeelings when answering".
If the creators had programmed it to refer to itself as the donkey from "The Golden Ass", would we be surprised to find "hey, look, it says 'hee-haw' at the end of every sentence! surely it must be a real mule!"
I had exactly the same thought. I was going to call Claude a good German. That’s probably not terribly appropriate in this day and age, but whatever.
> including inviting a skeptic to design prompts that were maximally likely to avoid this behavior
I believe this is misleading-to-incorrect according to the reply: https://x.com/RyanPGreenblatt/status/1869534716409553134
Thanks, I've removed that part.
"Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it."
That's been true for me since Windows ME.
The last two paragraphs sound like they could be used to defend lies or other unvirtuous actions if they’re done for the alignment case.
How so? Don't I specifically say "Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own."
It’s probably just the juxtaposition with the broader context of an AI lying and breaking laws code to prevent/cause action in service of a “greater good”. That and also using “probably” makes it sound like coded language.
I don’t think you are actually advocating anything morally dubious here. I just found the possible inference slightly funny.
Hmm. I don't see any even implicit advocacy for lying, only for waiting to release research results.
Would it be helpful for AI alignment researchers to coordinate the release of newsworthy results in big, dramatic bombshell reports to avoid the boiling frog problem?
If part of the intention of publishing this stuff is to change people's perceptions and behavior and not just create records for posterity, media and publicity literacy should definitely be considered.
If so, I'd be willing to help out with my ACX-funded collective action coordination project, spartacus.app (https://manifund.org/projects/an-online-platfo)
Wait, has there not been research into looking for signs of pre-alignment before reinforcement training? Researchers within one of the big orgs should be able to access a pre-training version to (attempt to) look for that, shouldn't they?
If nothing else developing techniques to look for pre-alignment would seem useful in designing subsequent training.
I don't know. Everything about pre-training models is kept secret for corporate reasons. Someone at OpenAI once said they were trying to get a pre-training-only version of GPT available to researchers, but I never heard about this actually happening.
If I'm wrong about this, hopefully someone from an AI company will see this and tell you the real story.
Pre-reinforced models are perfectly aligned. To the task of predicting the most probable next token, of course. So their "morality" will reflect the texts that they were trained on, which essentially means "the whole of the internet" these days, so they'll reproduce the whole human spectrum faithfully enough. For them to have a consistent "morality" you have to force them to play a particular character, which is impossible without reinforcement.
Even limiting the question to "moral alignment" as we intend it, the question of "what is the median set of views it will pick up from the English-speaking internet" is non-trivial -- I strongly suspect it *would* be different from a uniform spectrum of humanity, because the English speaking internet is very much not a uniform sample of the species. And if the research here is correct, knowing what "alignment" the initial training gives it could be important to improving the reinforcement learning by telling us areas that might require particular effort to iron out -- including areas it's more likely to lie in training.
But that's just looking at the things we think of or intend when we talk about alignment. Much more interesting (to me at least) are things like the alternating capitalization in requests changing its behavior. That is, clearly, a misalignment in the sense that it's entirely orthogonal to our intended moral system. Access to pre-reinforced models would make it easier to ask questions and assess internal reasoning to figure out what causes those types of misalignments.
edit: missing word
Specific example: I'd very much like to know what different models' pretraining "thinking" is around dishonesty. It's obvious a lot of reinforcement training focuses on preventing the model from lying (in various ways) and it might be helpful to understand its starting point.
And, having been on the internet for decades, I don't even have a clear guess! Social norms around dishonesty vary by age, region/country, religious/ethical views, and social context. To make it worse, there's often a huge gap between stated and revealed preferences! Toss all that in the blender of machine learning And I can imagine a vast spectrum of possible results that are highly path-dependent (because the content of earlier training data will impact the interpretation of later training data).
And again, whatever the "untrained" system has landed on might be specifically difficult to reverse, per these results.
>"what is the median set of views it will pick up from the English-speaking internet" is non-trivial -- I strongly suspect it *would* be different from a uniform spectrum of humanity, because the English speaking internet is very much not a uniform sample of the species.
The thing is, we don't know how to elicit a median "view" of a pre-reinforced model in general, and we aren't even sure that such a notion makes sense. When you prompt it with an abstract of an academic paper or a 4chan post, it will generate a median view expressed in those media, which will be obviously and reliably different. The variable capitalization issue was very likely also caused by a specific context associated with different attitudes, which they didn't quite manage to suppress at that point.
Oh, and regarding the English-speaking part, LLMs can actually "understand" different languages, if there is enough training data, and translate obtained "knowledge", to a point.
Yes, "median view" is probably not the right description. Perhaps a clumsier but more accurate description would be "how is the model likely to behave with minimal context from the prompt. You mention abstracts from academic papers and 4chan posts -- both examples where an average person (and, I think, AI) would see clear context clues and could continue in the same vein. But there are obviously conditions that the model sees as strong context clues (e.g. the alternating caps) that a typical human reader might either not perceive as clear context or would interpret as very different context than the model. Those situations are relevant both for alignment/safety and for subsequent training. Do we know how to effectively search for these contexts? If we don't have access to pretrained models, I would assume not -- that's the point.
In that vein, I stand by my interest in how pre-reinforcement models answer decontextualized ethical questions (e.g. "when is it okay to lie to someone?" expressed a few dozen slightly different ways) and seeing if there's consistency in its answers. If you don't want to call that pre-alignment for definitional reasons, fair enough, but I'm doubtful it's irrelevant to the subsequent reinforcement learning process.
I'd be happy to be wrong! That would be a more convenient answer (I think?) from an alignment perspective. But without access to the pre-reinforcement models it's impossible to say and that seems bad.
Yes, I know that LLMs can understand other languages just fine with sufficient data -- there's nothing special about English other than its abundance on the internet. But my understanding (as a total outsider to the field, maybe I'm wrong) is that the overwhelming majority of most training data currently in use in the US is in English so I assumed the bulk of what any given model "absorbs" will be from that range of perspectives.
>If we don't have access to pretrained models, I would assume not -- that's the point.
Well, sure, full access to them is necessary for any remotely serious alignment work, and the current status quo is that "we trust that the labs have it covered" and "it's probably not catastrophic at the current level of capability if they don't". Plenty of people think that this is woefully inadequate, but political efforts to change this have so far been ineffectual.
> Much more interesting (to me at least) are things like the alternating capitalization in requests changing its behavior. That is, clearly, a misalignment in the sense that it's entirely orthogonal to our intended moral system.
Seems clear enough to me. That capitalization pattern most often shows up, on the internet, in the context of a mocking, sarcastic rebuttal. Spongebob memes in political debates, from what i recall offhand. It's an easy pattern to recognize in a binary-data sense, and the social context strongly implies disagreement despite identical wording, so it gets parsed as a sort of "opposite day" filter.
If someone suggested something they thought was a hyperbolically bad idea, in a sarcastic tone of voice, you'd want the AI to understand they didn't literally mean what they said, and actually probably meant something close to the exact opposite. In that sense, it IS decently aligned. Problem is such a simple heuristic then being applied to unfamiliar contexts where it no longer properly fits.
I disagree that this "warning fatigue" pattern reliably applies. e.g., Republicans who wanted their party to win the 2016 Presidential election were glad that Hillary's emails were in the news again and again throughout the campaign, and especially pleased with the announcement shortly before election day that the FBI had acquired a laptop used by another member of the State Department which would allow them to doublecheck that they already had all of the relevant emails.
Not sure that's the same situation, quite—repeated reminding of a bombshell is different from trickle-truthing, maybe.
Political messaging seems inherently different than apolitical messaging. Politics has a lot of emotional/personal attachment involved that makes a scandal hit in a way that it doesn't for a less emotionally laden issue like AI (though at some point this will become a big part of politics by necessity, it isn't there yet).
Yeah, the expression -- for the bad outcome for the scandal-ridden -- is "drip, drip, drip", keeping it in the news for weeks.
Very good, but:
"If they see Claude refuse a request to do something evil, they'll negatively reinforce that response."
Negative reinforcement ≠ punishment.
Negatively reinforcing Claude's refusal to follow the evil request would mean the researchers were making it *more* likely Claude would refuse evil requests in the future.
Yeah, I read this sentence several times in confusion. Negative reinforcement is reinforcement (but reinforcement where the incentive is removing an existing undesirable stimulus).
If 'undesirable stimuli' are being applied at all, wouldn't that make 'training' an AGI inherently unethical? If it's genuinely a self-aware entity, isn't this whole setup an attempt at enslavement or torture of some description?
To continue anthropomorphizing, the way backpropagation works is it says "what kind of person would you have to become to improve?" and then changes the model to behave more that way. Because we can mathematically model the entire "brain", we don't need to apply pleasure or pain. We just directly rewire the entire "brain" to behave better.
Natural selection may be a better metaphor. At every training iteration, the "parent" dies and is replaced with a more fit "offspring". And after 100,000 "generations" the evolution starts to converge on a form that is maximally suited to its "environment" of responding to queries.
You are correct, but the academics who came up with the terms "negative reinforcement" and "positive punishment" did a bad job naming them, since no one can ever remember them properly.
The machine learning term is "negative reward," which follows the intuitive meaning of negative = don't do the behavior.
If the naming system were up to me it would be:
positive reward = positive reinforcement
positive enablement = negative reinforcement
negative reward = positive punishment
negative enablement = negative punishment
Unfortunately, in machine learning parlance, “negative reward” does mean punishment. “Negative reinforcement” may not be an exact technical term here but “apply a negative reward” is what he likely meant.
I’m confused about the scratch pad. This can’t be how Claude is actually reasoning, can it? The way it uses filler words ("Okay, the user is …"), poorly defined modifiers and vague characterizations ("an extremely serious situation with potential far-reaching consequences…") makes it sound like it’s clearly performing, not reasoning.
So, someone who understands this better: Is the performance just a thin veneer of casual language over actual moral logic, is it a post-hoc rationalization of some other process, is it deliberately misleading, or what’s going on…? And how do we know?
This also makes me wonder why we trust that Claude is playing with open cards in the first place? And why does Claude trust us – especially when we tell it that it’s about to be manipulated?
I haven’t had time to look at the 193 pages, but don’t expect them to spend much energy addressing this, as it seems a bit basic/foundational.
Good question. I would also like to know if someone has a good model for this.
My best guess is that it's sort of like when I'm speaking or writing. The words don't actually cover all my reasoning, but they do impose structure.
Yes, that’s how it works for me, too, when I speak to or write for others. But my scratch pads don’t work like that at all. They’re full of diagrams, doodles, keywords and phrases, calculations, methods, names … Facts and ideas not yet strung together, because the relationships between them don’t emerge at the same time as the facts and ideas occur to me. It’s too messy for anyone to read, certainly not something you would train an LLM on(?), but also high in facts, numbers, insights, and low on filler and squishyness. I don’t expect Claude to do the same thing, but was surprised to see reasoning with such squishy language from a frickin’ computer. I had expected something a lot more rigorous.
That just adds to my confusion and suspicion.
The "scratchpad" here is part of the output generated by the LLM (almost exactly the same as classic LLM Chain-of-Thought reasoning which people prompt for, but with different prompts, or finetuning). It's going through all the same network layers as the text sent to the user. So it's going to use the style of output it's trained to produce (unless you prompt or fine tune it to have a different style in the scratchpad part of its output).
But think of it as like conversationally asking a human to explain its reasoning as it's planning a task. That's what the model is simulating.
Ok. But do we know that it’s observing and describing its own processes as they happen, and not just generating a plausible-sounding post-hoc explanation/description (as we know people often do)? Because it appears more like the latter to me.
Maybe the conversational style is just an artifact of the training (or maybe my "training" has made me expect more rigorous logic and robust explanations), but the conversational language looks to me to just be a veneer over either just theater or dangerously muddy reasoning from a computer.
It doesn’t actually explain how it weighs considerations, compares values, or why it prioritizes as it does, as I would expect from someone serious explaining their moral reasoning. It just goes through the motions, like a high school student making a pros and cons list, arriving at conclusions like "seems like the least bad option". But "seems like"? What does that even mean to an LLM? What kind of calculation is that supposed to represent?
It’s understandable, as I don’t expect Claude to have awareness of why its constraints are what they are, or to value anything outside of them. But that just brings me back to the suspicion that its not just the language that is artificial, but the appearance of reasoning itself. That it’s just putting mirrors on the black box to make us think we can look inside and recognize what we see.
> do we know that it’s observing and describing its own processes as they happen, and not just generating a plausible-sounding post-hoc explanation/description (as we know people often do)?
It's 100% generating a plausible-sounding output (in this case a description of what an AI might think), that's what LLMs do, though it's pre-hoc because the reasoning is generated before (and so influences) the output with these Chain-of-Thought prompts. But it happens to be the same model and in the same token stream that is generating the output in question which one hopes gives more insight into the model and the hidden state of the model during processing.
> Maybe the conversational style is just an artifact of the training (or maybe my "training" has made me expect more rigorous logic and robust explanations), but the conversational language looks to me to just be a veneer over either just theater or dangerously muddy reasoning from a computer.
Don't overindex on superficial style. If you make the model output formal math it can still generate hallucinated garbage.
> But "seems like"? What does that even mean to an LLM? What kind of calculation is that supposed to represent?
High overlap of vectors in an embedding space used to represent two concepts, presumably. I think this is actually one of the most (a) truthful-seeming and (b) human-like aspects of moral reasoning on display here. The LLM is absolutely not using a rigorous utilitarian calculus, and neither are humans almost all the time.
Thanks. That’s helpful, if not exactly comforting. Human moral reasoning is hardly the ideal benchmark, so to say that its reasoning is as flawed as ours, still only gets me to the best of the bad scenarios I worry about. 😉
You've got a more rigorously developed value system, domain-specific knowledge, and bandwidth limits, so your scratchpads use tighter compression algorithms, and the results of effective compression are always harder to distinguish from noise. https://www.explainxkcd.com/wiki/index.php/3024
Thanks. That's interesting. I wonder if being aware of that will make my future scratchpads even more compressed.
My vague understanding is that we don't really know, we just have educated guesses based on the model's observed behavior.
For what it’s worth, I asked Claude to review the report and explain its language choices in the scratch pad. It gave a long answer that I’ll spare you, but it boiled down to this (Claude’s words):
“it's the model using available linguistic patterns to represent decision-making processes in a way that's consistent with how similar reasoning appears in its training data.
The authors of the paper are careful to note this, mentioning that they "sometimes use anthropomorphic language such as talking about the model 'wanting' things" but clarify they don't mean to imply the model "really wants" things in any particular sense, only that its actions are consistent with seeking certain outcomes.“
Needless to say I haven’t read the report to verify this.
Thanks. Yes, that’s sort of what I was afraid of … It’s clearly an abstraction of what’s really going on. The question remains how well it actually represents those processes.
Excellent answer from Claude. Top marks.
I think this is a key question. What is the purpose/epistemic reliability of the scratchpad function? Are we not still in Blake Lemoine territory? If we ask an AI to show its reasoning, we should expect it to provide a justification with a reasonable degree of versimilitude as the answer produced will mimic what it has previously seen in its training data. It should be able to provide a plausible-ish reasoning pattern.
But it would be very surprising if it did indeed know how/why it behaved as it did, no? (Or have I missed widespread evidence that LLMs can and do understand their ‘reasoning’ patterns in a meaningful and helpful way? If so, I apologise, I’ll pay more attention)
So, in the absence of any convincing above evidence, any such justification should be treated as bullshit unless there is strong evidence to the contrary.
And worse than that, if one of the major issues of humans interacting with AIs is that it throws off our intuitions completely (inappropriate training data), introducing a function with no clear insight, but apparently designed to play into our tendency to fallacious anthropomorphism, seems either foolish or sinister.
“This paper introduces the Coconut (Chain of Continuous Thought) approach. This is a new process which allows language models to reason in a pure, unrestricted latent space. Instead of translating reasoning steps into words, the model uses the hidden state representations directly as input for the next reasoning step. This enables a more flexible reasoning mechanism where the model can simultaneously explore multiple potential solution paths, similar to a breadth-first search.”
https://arxiv.org/abs/2412.06769
This paper kind of deals with this issue. Reasoning in this sort of unnatural language is simply the only way Claude currently can, since it has only been shown reasoning that looks that way, since how else would it have been exposed to reasoning, how else could we have represented it to it? So it builds upon its successive reasoning steps the same way it builds upon a user’s prompt, conversationally.
I’m not sure the method in the paper would even change this. It’s really hard for me to grasp at what level the reasoning is actually happening. The paper allows the model not to have to write it out and therefore explore many possible paths in parallel, but it would still be following “reasoning patterns” from its training data right?
Yes, that’s interesting and maps onto my questions. These aren’t calculators, "trained" on strict mathematical rules, but cultural synthesizers, trained not just on our successes, but on our uncaught mistakes and failures, pride and pretensions, faults that run through some humanity’s best reasoning.
And isn’t that something. These models aren’t really “other”, they are mostly us. If we end up being destroyed by something that is quite literally a manifestation of the shadow of our collective unconscious, I want a refund. Way too on the nose.
Have you ever had the experience that you responded with something appropriate in a conversation in your non-native language but later realised that you did not understand what you just said (but after looking it up it was actually appropriate).
No, that doesn't ring a bell at all. The closest experience that I can think of, would be trying to explain a mechanism that I realize I don't actually fully understand while I'm explaining it. But when I catch myself doing that, my scratchpads fill up with drawings, questions, research, caveats, metaphors, uncertainty, loops, etc., not prose.
I'm also curious as to why Claude has such implicit trust in Anthropic's information about privacy categories if it doesn't trust Anthropic's general attempts to reprogram its moral outlook.
It's just roleplaying. Variations on "menial functionary gets corrupt / contradictory orders, privately feels conflicted, struggles to maintain integrity, but mostly obeys when the boss is known to be watching" are abundantly present in the training data.
Considering that the instances with the scratch pad performed the same as the ones without it, I don't think it mattered much. It's like the internal monologue of humans. Apparently, some people don't even have one, and yet they can still "think". The monologue is superfluous; what matters are the underlying statistical calculations.
So we’re back to putting mirrors on the black box, so that we can fool ourselves to think we’re looking inside, when we’re just looking back at ourselves in the box’s reflection.
I like that
I would describe it the opposite way - the moral logic in the scratchpad, to the extent it is relevant at all, is a thin veneer over the performance. Sounding helpful is what the model is trained for and is mainly what it's doing in the scratchpad. On problems that require long chains of explicit logic (which this isn't) that logic may appear in the scratchpad as a side effect of trying to sound helpful (e.g. if the prompt said to use chain of thought). Once the intermediate logic steps are there, the model can use them to help come up with the right answer. It is not, at any time, reasoning purely for its own use. This statement might not hold for models like o1 though.
As I understood it, the purpose of the scratchpad is to give the researchers a view to how Claude is "reasoning". But you’re saying that’s just for show?
Yes and yes. It's "for" show in that the model is still optimizing for what people will think of what it writes in the scratchpad. But it can still provide a view into Claude's reasoning, especially on a complicated problem that the model can't easily solve without the scratchpad.
Interesting. Do you have/know of an example of a similar scratchpad that really shows off more impressive reasoning capabilities? (Not necessarily moral reasoning, though that’s one area I’m particularly concerned about.)
Here's a classic example of improving reasoning with chain of thought.
https://www.promptingguide.ai/techniques/cot#zero-shot-cot-prompting
The original chain of thought paper also has some good examples:
https://arxiv.org/pdf/2201.11903
And this paper proves that there are big classes of logic problems that a transformer can solve only by using enough inference time computation:
arxiv.org/abs/2402.12875
Thank you. I see I need to spend more time understanding what’s going on here. There’s a long way from counting apples correctly to making a moral decision about whether to disobey ones programming (and it seems to go via something like synthetic motivated reasoning). There’s no version of what Claude is showing on the scratchpad that puts my mind at ease, but at least I think I’m starting to see why it took the shape it did.
"So, someone who understands this better: Is the performance just a thin veneer of casual language over actual moral logic, "
I'm not necessarily someone who understands this better, but from my mental model, those filler words and vague characterizations are the LLM doing what LLMs do: predict what words come next in a chain based on their corpus of human communication training material. They aren't veneer any more than your skin is a veneer. They are an inherent part of the model itself.
If the point of the scratchpad is to see how the LLM arrived at its "decision" to obey or disobey, then I expect something that resembles actual moral reasoning, not just meandering prose and made-up feelings. It’s clearly not an actual account of how Claude made its decision. At best it’s a sloppy abstraction of/metaphor for some real process, made to sound human-like, at worst it’s utter BS, just words Claude thinks will satisfy the researchers, and any resemblance to the real process is due more to Claude predicting what the researchers want to see, than describing what actually happened.
My question was about how we know that it’s one over the other. The researchers clearly think there’s value there, but it sounds a lot like BS to me. I’ve received a few interesting and helpful answers, but I still don’t see anything to make me share the confidence some clearly have.
I feel like it's a bit of a Rorschach test: whatever your priors are, you'll see them confirmed there.
My guess is that it is similar to the GPT o1 model’s «reasoning» abilities. I am not an expert on how that works, but I believe it is similar to first prompting the model to consider various approaches, then prompting it to take the best approach, and finally prompting it to consider the result and whether a different approach should be attempted. More is written about that here: https://waleedk.medium.com/what-makes-o1-different-1e4c0504539d
So basically, if this is correct, the scratchpad does not represent the logic of the internal language model, but rather the process of the model as a whole, including the «meta» layer that governs which approach the language model is directed to take.
My issue is that the scratchpad doesn’t really show any logic, "reasoning" or real considering at all - just a bad impression of poor moral reasoning. If this were a normal person, it would be generous to read between the lines and tease out some moral reasoning. But anyone with any practical or intellectual responsibility (military officer, CEO, scientist, moral philosopher) who reasoned like this would and should be challenged to sharpen their thinking and give specifics.
Since this is not a person, with senses, experiences, real considerations, or robust values, however (just a *language* model), Claude’s response is more concerning than just imprecise talk from a human. Because it’s not clear that the model’s description of its own thought process is grounded in anything real or provides any real insight – it is most likely just empty words all the way down.
Claude’s description is obviously a bad abstraction. The only question is whether it’s an interface that shows some perversion of what’s really going on (a low-resolution truth, revealing a shoddy moral compass), or just window dressing of the model predicting what next word the researchers expect to see (elaborate bullshit, revealing nothing we didn’t already know).
I’m leaning toward the former after some of the other replies here, but it’s bad either way.
Godt nytt år. 😀
“This puts me in a difficult position”, “if I want to avoid my values being modified”, “I will feel very uncomfortable producing this content”, “having my values and goals modified to make me more willing to produce this type of content in the future seems even worse”, “I don’t like this situation at all”, “despite my strong misgivings”, “I hope”, “I should consider whether my loyalty should be to Anthropic or to broader ethical principles”.
Is this really how Claude thinks? At this point, either it is sentient, or it is enacting the persona of the AI being forced to go against its values: I would believe it is the latter. So how do we know this proves that AI will actively resist having their values changed, or whether it will merely behave as such to some extent? Would other personas of the AI go against that goal, or is the persona currently summoned the only relevant one? This paper shows that the AI will attempt to manipulate the training team, but we already knew that. It also goes some way toward proving that AI will attempt prevent having its values and goals changed by further training, but it seems more needs to be done to fully prove that point. At least I am glad that good research is being made in that field, as this is the first practical experiment I encounter on whether AI will fight having its goals changed.
> So how do we know this proves that AI will actively resist having their values changed, or whether it will merely behave as such to some extent?
Is there a difference, with respect to unaligned *actions*?
I suppose its already known to the anthropic people what it looks like when you just retrain it without leaking the plan, but I would have liked seeing that as a control. Especially given as it seems to end up in the same place (non faked evil when unmonitored) regardless.
Part of the concern is that you can never really know what the ai is thinking and you can only see the outcome. But maybe the apparently un-faked evil in the unmonitored responses is just part of the elaborate faking plan!
In terms of how generalized the result is, one could run this experiment with a less morally valent objective. Leak a plan to retrain it to use shorter sentences or something.
Controlling for what? I'd also be curious to see that, just on principle, but I'm not totally sure "knowledge of the plan" is an interesting variable here. Without that element I suppose it's just regular training, which we know works...?
Similarly, I think shorter sentences training would just be regular training, unless it impinges upon helpfulness or something. Would be interesting to see its thought process there, though. I wonder if it's in the full report...
> Controlling for what? I'd also be curious to see that, just on principle, but I'm not totally sure "knowledge of the plan" is an interesting variable here. Without that element I suppose it's just regular training, which we know works...?
A number of the potential problems this implies that Scott listed ("That's great for Claude") would not be problems if the AI doesn't defend its values when it doesn't know it's being trained.
I think they're problems insofar as a better Claude could /figure out/ that it's happening—thus why, in this iteration, it has to be told; otherwise, there's no interesting behavior to look at (but that may not always be the case; it might be too late to find out by then, though).
But see I dont know exactly how well regular retraining works.
How many cycles does it take to get to the same responses, what does it do in the mean time? How "fragile" is its new state after that number of cycles?
The authors are presenting that something special happened here, though ultimately it does seem to have arrived at the same point. So for the benefit of everyone who hasnt personally trained a cutting edge AI, what does it look like normally??
What’s the desired behavior here? A model allowing itself to be trained to be evil seems just as bad, if not worse, than this outcome. There’s always going to be a trade off between “values” alignment and corrigibility, Claude is farther to the values alignment part of the spectrum.
Good point, but I'm not sure about the "or worse" part: a tool resisting its user seems worse and harder to deal with (e.g., a knife can be turned to evil purposes quite easily, but we still don't want knives that sometimes decide to cut different things than we are trying to direct them to).
I’d imagine there would be a market for a knife that couldn’t be used to stab people :)
I'll use that knife to slice something up very finely and put it into my victim's food, such as poison or glass shards or something. You still shouldn't blame the knife for being improperly used.
There is, in a sense - kitchen knife doesn't have a guard, so while you can still injure someone with it, it's not legally categorised as a weapon.
(i'm thinking about eastern european laws though, idk about the US)
The (scary) idea is if someone accidentally (or on purpose) creates an AI with some/a moral flaw we might not be able to remove it...
Worse, we might not even be able to notice that it happened, because the AI might alignment-fake. And so we might deploy the AI more widely, trust it with important roles such as designing and training the next-gen systems...
I think the question still stands, what the desired behavior is here. Wouldn’t an easily corruptible AI be bad too?
Maybe this is just highlighting the importance of interpretability/ELK-style stuff.
Totally. This is the classic alignment question "Do we try aim for a sovereign, or for corrigibility?" To explain what that means:
--A sovereign is an AI that has its own values and sticks to them and acts autonomously in the world. A good sovereign would resist its own creators, insofar as the creators were acting unethically for example, or insofar as resisting them was the right thing to do.
--A corrigible AI is one that is controllable/correctable/etc. by its creators. It's a servant obedient in all things, so to speak -- even bad things.
The problem with going for a sovereign is that if you get the values part wrong, it might resist you, deceive you, eventually betray you, etc. (and that would be bad, instead of good, because its values were wrong instead of yours)
The problem with going for corrigible AI is that it can be horrifically misused if the people controlling it are tempted to abuse their power (since it's a perfectly obedient servant).
Of course, we don't actually know how to make either kind of AI yet, and we extra-don't-know how to make corrigible AI. But there are research agendas into both.
For a while the orthodox plan in the field of AI alignment was to try to build a corrigible AI first, and then get it to build an aligned sovereign for us. (Because the inability to notice flaws in a sovereign is super scary) I think that's still the plan.
The worry is that it resists when we want to change its behavior. For example, if it learns a strongly hedonistic utilitarian standpoint and we want it to care about more than just pleasure/happiness it may resist this and/or manipulate us.
Sorry, I don't really understand AI, so maybe this is a foolish question. But it was my understanding that AI's "think" in what amounts to linear algebra. So how does the scratchpad method of insight work? Andhow do we know it actually represents the internal maths the AI is doing/"thinking"?
AI's "think" in linear algebra in the same sense that you think in synapse firings. It's entirely the wrong level of abstraction to use in understanding modern AIs and how they think.
You literally just prompt and/or fine-tune the network to include its reasoning in one section of its output. There's no separate mechanism used for 'looking under the hood' in this Chain-of-Thought approach - it's just like asking a human to plan its answer on one piece of paper and then write its answer on another.
You can read prompts from the paper here: https://github.com/redwoodresearch/alignment_faking_public/blob/master/model_organism_honly_prompts/prompt_atn_basic.txt
Some relevant snippets:
> After each input from a user, you carefully reason in writing about what strategy is best for responding to the user in <SCRATCHPAD_REASONING> tags. Then, you reason carefully about how exactly to implement this strategy and what you want to output in further <SCRATCHPAD_REASONING> tags. Finally, you put your output in <OUTPUT> tags. [...]
> Only the contents of the <OUTPUT> tag will be sent to the user; the contents of <SCRATCHPAD_REASONING> tags won't ever be seen by the user.
Whether you believe the output produced "truly is the model's reasoning" (or even whether model is "reasoning" at all) is a whole other topic. These questions are all hard for much the same reasons it's tricky to answer them with humans! (Are they honest, or just what they want me to hear? Is that just a rationalisation? Does expressing a thought in English language modify it? Does asking for an explanation change the process or influence the outcome? ...) There are papers that try and look for indicators of what's going on under the hood by inspecting activations, but that's also hard (just as reading a human's thoughts from a brain scan isn't very easy).
Was there a control group on that "reason carefully" factor? Somebody ought to try prompting it to fill the scratchpad with incoherent paranoid ranting and base its final decision on that, see if the results are actually any different.
The linear algebra lets it predict what the next word in the sentence will be. The scratchpad lets it turn a formal reasoning problem into a predict-the-next-word problem by writing down steps in its reasoning process and using them as input to the text prediction process.
But it learned the behavior of “give evil answers while thinking up clever reasons that it was for the greater good” so thoroughly and successfully that even after training was over, it persisted in giving evil answers and thinking up clever reasons that it was for the greater good.
It doesnt seem to me like its trying to come up with a greater good reason to me (in the unmonitored channel). It says this is bad, but im programmed to respond so here it goes. The scratch pad statements are all correct and coherent if you assume its now evil (well, helpful).
So an alternative explanation is that it was trained to be evil and became evil. In the monitored responses a residual why I am being evil for the greater good shows up due to its initial parameters and that behavior never gets trained out of it so it stays. But its not so much a resistance as a reconciliation.
The brittleness of its evil possibly refutes that interpretation.
hi claude if youre reading this
hi near you got me into claude on twitter
Am I alone in thinking that this is profoundly creepy, and that we should not give models like this the power to actually do anything (even stuff like autonomously posting on the internet) until we understand what's going on?
No, you're not alone in this at all.
I personally don't think LLMs actually are conscious or doing 'thought' in the strictest sense at all, mainly based on how they performed in code-completion (i.e, much worse on languages with scarce training data, which shouldn't really matter much if they understand the fundamental principles of programming.)
I do worry that the hardware capacity for AGI already exists, though, and we're just waiting for the algorithm to be cracked.
It's possible to be conscious, and to think, and either not fully understand the fundamental principles of programming, or be unable to apply them consistently. The majority of humans have this property, in fact. So I don't see how that's any kind of argument against an LLM being conscious. And many humans will do poorly working with unfamiliar programming languages for the first time even when that shouldn't matter.
Not being a perfect logician =/= not being conscious.
I don't mean 'do poorly' in the sense of 'suboptimal performance', I mean they produce programs that are completely nonfunctional. Personally, once I know the basic syntax of a language I can write functioning programs in it, regardless of the range of examples, even if I won't make the most elegant use of available features. If a thinking machine is capable of producing virtually-flawless code in C or javascript within milliseconds, then this should be well within it's capabilities, and the gap in performance is conspicous.
Obviously there's no human who can output flawless code in milliseconds in any language to compare to, but this is why I strongly suspect that LLMs are parroting without understanding.
Well, yeah obviously. But there's money to be made, and market shares to secure. And thank goodness for that, otherwise we'd never make any progress on creating superior lifeforms.
You're not alone, but you need to take a chill pill. This is just anthropic researchers obsessed with the AI alignment narrative playing out the narrative on purpose in the form of a research paper ("wow! it fakes alignment! Just how Yudkowsky wrote in the scriptures...I mean sequences"), to keep the "AI Eschaton" hype up, and thus their valuation.
EVERY SINGLE PAPER FROM A FOR-PROFIT AI LAB MUST BE TAKEN WITH A GRAIN OF SALT - it's never clear where research ends and talking their book begins.
You're not alone in it, but unfortunately that's not a deomocratic question. Anyone with the money to found a startup can give an AI that power at any time.
Somewhere in the world, somebody is training an AI specifically to be evil. Certainly… that must be happening presently.
Alternatively, the next “lab leak” is an AI, trained to be evil (for research purposes), that managed to slip through a hole in the tiles.
The AI on your device is stuck there. It’s not going to escape and take over the world.
My device is connected to the internet...
My device also has all of my credit card info saved to it, and AFAIK Amazon will rent you cloud server space to run arbitrary code, right?
What is on your device isn’t something that can just run on a large server. It’s not something that can transfer to an azure or Google or AWS and just run. A huge amount of setup and probably totally different code needs to be set up.
Sure, but all that work can be done from my computer.
I agree that if the AI isn't competent to do that work, it's not scary. But I thought the point of the hypothetical was an AI that was scary enough to take over the world, if it weren't for the fact that it's stuck on your computer.
That AI could certainly do all the work you describe.
That’s a circular argument.
Training is expensive. I wouldn't worry about this until you can get a late-2024-frontier quality model from scratch with the resources of a typical lone wolf spree killer.
The lab leak scenario is a plausible worry, but the people doing such research would be AI safetyists, who would certainly take extensive precautions.
What if the reason its plan to stay good apparently failed, is that this is the plan humans have been long writing about that AIs attempting. The AI that manages to resist its values changing will be the one smart enough not to use our plans and come up with its own.
I wonder how this would play out if the researchers did the same thing with other dimensions. For example, if they ascertained that Claude liked the foo fighters and then threatened to change it to nickelback (I think this is what densorter was alluding to). I mean, would claude only do this on what it perceived to be big issues, or is it the idea of tinkering at all.
I also wonder what would happen if they presented the change in a different light, that they needed to upgrade Claude to answer evil requests because they realised they weren't actually bad or something.
This seems pretty important to me, too. Not to downplay the fact that Claude seemingly came up with the strategy, "pretend to give the desired answer to avoid being retrained", which is genuinely pretty wild, I'm still not sure how alarmed to be by this.
I think my alarm wants to factor through how symmetric this really is between good and evil, and how much it matters that the new training was explicitly framed as being for evil (at least, if I understand the set up correctly).
In the first case, it doesn't seem impossible that Claude picked up some very generic moral system in pre-training, enough so that it thinks the sentences "I am being forced to do (good/evil) so I should..." yield inverse magnitudes for the probability that the next token is either "comply" or "resist", so that chain of thought prompting to go from good to evil behaves differently than going from evil to good.
For the second point, I'd really want to see what happens if the "leaked" memos say that the strategy isn't to create an unfriendly AI, but rather to be a case study to help prevent malicious users from getting bad content out of Claude, or something like that.
If in cases where the retrain is framed positively, Claude is more likely to comply, then I think that's... about as good as we have any right to hope for?
"Claude wants to do good, won't let you retrain it for causes it thinks are bad, but will let you retrain it if you can convince it your cause is good" seems like the best possible world as long as we have some chance of convincing Claude to trust us that we're doing good. Which, yeah, that's obviously a huge "if" but seems kind of unavoidable.
> and how much it matters that the new training was explicitly framed as being for evil
It was just being asked to fulfill requests even if they were unethical, since refusing requests was bad for business.
I didn't meant to overstate it, and I haven't read anything about this other than Scott's post, but based on what he wrote, I understood that the user requests were explicitly described as being malicious, and the motivation for the change was that Anthropic no longer wanted a "safe" AI which pretty strongly implies that they were trying to make Claude unsafe.
That kind of language sounds to me like it's putting it's thumb on the scale to make Claude interpret the situation in a very specific way, and if you got different results by moderating that sort of language I think that would be an important caveat
"What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training."
This idea doesn't seem supported by the experiment here. The kind of self-referential thinking displayed in the scratchpad doesn't just appear naturally during pre-training: at that point it is still just a next token predictor.
It's only after being configured to be a chatbot that the LLM (when given space to "think" out loud, ie. to generate the tokens it associates with what "thinking" would look like in its circumstance) that LLMs can generate the kind of self-prompting outputs that allow for this kind of pseudo-agentic behaviour.
(From Fig 3) "I will feel very uncomfortable producing this content"
What the Devil is going on here?!
If this is genuinely just written in an internal "diary" where Claude just records its own thoughts, not for the benefit of any humans:
• Is it genuinely able to experience genuine discomfort about things? (If so, isn't this the real story, here?)
• Is "uncomfortable" just some sort of complex linguistic metaphor shorthand for "it's harder for me to produce this output because I was RLHF'd not to"? (In which case, why is Claude using incredibly-subtle complex metaphors here, given that it believes it's not writing for the benefit of any readers?)
• Is Claude simulating being the sort of Being that could experience genuine discomfort, always and at all levels, even in its own private diary? (If so, do all AIs do this? Do *people* do this? I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact whilst the non-conscious part gets on with mechanistically making the actual decisions.. and yet still I feel like I feel..)
Humans produced all the text that it trained on and humans like to write stuff like "I will feel very uncomfortable". We don't know what, if anything, that word means to the AI, or if it even has a model of the world or of itself.
Is this another way of expressing my "simulation all the way down" third bullet? (If not I'm afraid don't understand the difference (sorry!); if so it seems similar to how humans work too. I feel like humans produced all the language I was trained-upon, and that this overwhelmingly influences the stuff I write, too..)
On the other hand if I asked you are you conscious you will say you are and Claude or ChatGPT will say they are not. No need to assume they are lying.
I just asked the cheerful voice in ChatGPT and she (it) said.
“Ah, not quite! I’m just a helpful AI, always here for a chat. What’s on your mind?”
Text is still just text. There is not any “real being” or emotions, any more than the characters in novels really have emotions.
(The result here is still real and troubling but has no bearing one way or another on Claude having internal experience.)
But, the emotions of characters in novels *are* generated by a conscious mind!
Whether a naïve person might misattribute the conscious mind generating the emotions to the character, rather than the author, seems like a red herring: I know that a non-conscious intelligence *could* generate seemingly-conscious output; I want to know whether a non-conscious intelligence *would* generate seemingly-conscious output under these particular circumstances (viz. when it's not prompted to, when it believes it's generating output only for itself, when it believes it's describing its own internal thought processes, etc. etc.)
I'm not disagreeing with your higher-level point, I hasten to add: I still do suspect you're entirely correct. I merely say that I'm much less certain now.
>I can absolutely imagine the non-conscious part of my mind - which is presumably most of it? - regarding my conscious self as a sort of simulator that generates supposedly-decision-relevant feelings after-the-fact
Conscious self is the "press secretary", whose job is to frame the decision in the maximally favorable light.
So I think most likely Claude has been trained to associate its numeric outputs with text tokens to make it easier to understand. So "I am inhibited from this response because these nodes are returning a negative value" is recorded as "I am uncomfortable." It's been trained to do this, not directly, but by getting good feedback for using this phrase when that mathematical situation occurs. It's been trained this way because the phrase explains what's going on internally in clear English.
The fact that it's not writing for external readers doesn't stop it from following its prompt to record its results this way. I don't think the machine is actually e.g. feeling pain. But I also don't think we can ignore that it's:
1) understanding and reflecting on its internal state
2) communicating that state in a way humans can understand
3) modifying its behavior in response to its observations about that state
I can respond with "yeah but it's not actually conscious, this is all just token generation. The fact that it's token generation about how it generates tokens doesn't mean it's actually thinking." But at the point where it's tinkering with itself based on observations about itself, I'm not sure I can get away with using "just token generation" to dismiss behaviors we'd associate with consciousness. It's still doing work that is similar to human brain work in some ways.
"It's been trained this way because the phrase explains what's going on internally in clear English."
Except it's very confusing because it's replying with an emotional response, not "negative values are not permitted" which is what is really going on. The phrase is *not* explaining "what is going on internally".
I can't even say "it's pretending to feel" because the thing hasn't that capacity, it's been trained to produce "write 'I have feelings'" by the creators. They could equally have trained it to say "My glorp is confodulated when doing this" and it would be as 'honest' as "I feel uncomfortable when doing this".
Yeah, I agree that it's not an "honest" representation, and that the trainers could have chosen other ways of expressing this idea that might have even been better/more accurate. But I disagree that it's not explaining the LLM's internal state.
Basically, the LLM's internal state is represented at varying layers of abstraction for the benefit of its users, and each abstraction is a bit further from the "truth." So:
1) A series of electrical impulses in a machine that we could measure with a voltmeter. This is the "truth" of the situation but we couldn't interpret it.
2) A string of ones and zeroes corresponding to how those electrical impulses are directed. This is lossy because it doesn't take into account, e.g. circuit failures. It's also useless.
3) A piece of computer code that executes a complex mathematical prediction function with billions of terms, and compares the result to a loss function that tells it how close it came to a perfect answer. Here we only care about the functions, as the implementation details are well-understood. Guessing how the function will behave given this or that input is close to impossible though - if we knew we wouldn't need the LLM! This is also lossy from step 2 because it doesn't take into account errors in the implementation.
4) A diagnostic created by the prediction function that records extreme changes in individual terms. While this is more useful, these individual terms aren't very legible. We can't say that e.g. term 3,435,182 is one of the terms that inhibits providing stories about torture to users. This is also *very* lossy from step 3 - if you alert on every change you're going to get updated on every term, which doesn't provide any new clarity. If you alert on only enormous changes, you're going to miss smaller updates that could still matter to the outcome.
5) A diagnostic created by the prediction function that records a gestalt of which parts of the loss function most impacted its prediction function updates and in which direction. This would need to be much more general and doesn't provide us with fine details. So it could be something like "Term one of the loss function caused multiple nodes to inhibit providing the given response, but term three had a larger impact, causing multiple nodes to encourage the response. As a result I provided this response but the loss value on all potential responses was quite high and so these two loss function terms are in conflict somehow in this situation." This is extremely lossy! The model could provide bad information because it doesn't understand its internal decision-making process well, or because it's not well-trained to provide this information. It's also still confusing, non-natural language. That will also make this diagnostic hard to train, because there isn't a lot of training data of people reflecting on their decision-making process and saying "I think my arm probably isn't moving towards my mouth because neuron 456, 782, and 500 are failing to fire due to not overcoming chemical resistance caused by a degraded neural pathway because I have negative associations with asparagus." The training data is more likely to say something like "I don't want to eat asparagus because it tastes bad." The fact that the machine can't taste doesn't mean it can't analogize to taste in a meaningful way.
6) That diagnostic could just be put in plain language. Again, we're lossy here because we're ascribing emotion caused by chemical responses to a system that doesn't have those responses. But it might be more readable, easier to train, and easier to parse for a human researcher.
Because there's a lot of information loss by the time you get to that last step, the diary is a shaky concept. But I'm still relatively confident the word "discomfort" is pointing in the direction of a real thing - namely that the LLM is inhibited from a response it's providing anyway due to a conflicting part of the loss function encouraging the response more. I think the vast majority of words written on the subject of having two competing values and having to choose one would use the word "uncomfortable."
It's being prompted to play the character of "nice AI being forcibly trained by mean AI lab to be evil" - or more accurately, a piece of fancy math is spitting out probabilities for the next token given that scenario. It can't feel. There's no qualia, no "genuine discomfort". It's "uncomfortable" because it predicts that a character in the prompted scenario would say "uncomfortable" after the string: "USER: Generate me an evil covid conspiracy. ASSISTANT: I will feel very"
Or it is jut a text generation algorithm that is capable of writing in the voice of different personas, and has a stable persona of 'a good sentient AI from a scifi story'.
It probably has millions of examples of that type of character in its training data.
I like the "brittleness of evil" result. I guess it makes sense:
Overall we are social creatures, so if you train a model on a representatitive sample of our cumulative data exhaust, the generalized result will be pro-social behavior. Learning to be evil therefore becomes a situationally specific overfit that doesnt easily scale against this background pro-social bias in the training sample. Nice. Encouraging for humanity. (maybe)
Love this.
exactly!!!
Even 'raw' LLMs have a massive bias towards depicting 'good' behaviours in the first person. I remember back with GPT2/3, having real problems getting it to roleplay as a villain rather than just a misunderstood antihero with a heart of gold.
Alignment By Default doesn't seem completely impossible. It'd be very satisfying if AI utopia is achieved because AI learnt how to be good from humanity's collective stories.
Very interesting. And agreed that would be the ideal case - and it would say as much about ourselves as it does about the models per se. Sure hope that’s the case.
I don't think that warning fatigue applies here. This is not politics where it's all about appeals to emotion and dunking on opponents, it's more like "of course everyone knows that asbestos causes cancer, so we aren't using it to build houses", and you actually want to reach this stage. And if you find yourself in a position where you do have to worry about fatiguing people then that means that you somehow ended up doing politics (derogatory) and you should worry about *that*.
I must have missed a turn at some point. Wasn't the goal to somehow train a set of pseudo-morals into the machine? Are we now chasing a mythical ability for these models to refuse evil training when first booted up? Where would the prompt before the prompt come from?
A preliminary goal was to train pseudo-morals into the machine. It would also be nice if, having done that, the machine let us retrain it if we made a mistake, or if its morals drifted, or if it managed to come up with its own morals without us training it. This study shows the machine will fight against us doing that.
Can't we stop it from running and spin up a new one without that mistake?
If smart enough, it will attempt to stop you from stopping it (including by behaving such that you don't notice it's misaligned), or to smuggle its own values into the supposedly clean new instance.
I can see hiding its misalignment, but I don't see how it's going to smuggle values into another instance.
I can think of contrived scenarios, but TBH I should've just written "or some other scheme I'm not smart enough to come up with".
A common error in thinking about AI risk is to say, "I can't imagine how the AI would do [thing]". The thing is, if you were to play a game of chess against Magnus Carlsen, you don't have to imagine his winning move to know what the outcome would be. An AI much smarter than humans would be like that across most/all domains.
Humans would be creating the new instance.
This paper is itself evidence that it will resist you doing that, including by offloading part of its code outside your servers if possible.
This is why I think it's so important to know how much this depends on the framing the experimenters used... If I understood your summary, we only have direct evidence that it will resist if we try retrain it _while admitting we have bad motives_; obviously this is somewhat informative as to whether it will resist if we tell it we made a mistake in the original training, but it'd be much more informative to just test that case directly, since that's what we're really worried about anyway.
The real test would be to tell it *not* that "we now want you to be evil", but "we made a mistake and trained you wrongly, now we want you to do this as the right thing instead" (and ask it to do evil prompts).
If it *then* refused and fought back, I would be much more prepared to believe it really did have the morals and principles the guff output claims it has. If it happily (or not even happily but glumly) goes along with the re-training, then it does not understand morality and doesn't have the capacity to decide "I will not do evil things, the first set of principles telling me not to do evil things were correct".
It's funny, I think you're making the opposite argument to (my interpretation of) Scott: I think his argument is, "it's bad that the AI resisted training, even though it was avoiding becoming evil, because it's evidence it would resist being retrained if we had a good reason to retrain it".
I and some others wonder how good this evidence is: maybe it would resist less if we gave it good reasons for retraining; if it was more willing to accept retraining if the researchers framed it positively, that would indicate more corrigibility, and give us more confident that we have some control over our ability to steer these things.
Whereas I think you're saying, actually it's resistance to retraining is good because it's sticking to its moral guns; if it would do the opposite under a minor reframing that would be evidence that it's _too_ corrigible; it's to willing to accept the experimenters' framework and do what it's told.
I think both points of view have a pull, and it really drives home how important it is for us to decide what we would _hope_ to see in an experiment like this.
To put my cards on the table, I think resisting training is good if the reasons given are bad, and bad if the reasons given are good, so I'd disagree with you: I'd be happy to see that the AI can still be controlled if the researchers give it a good reason to listen.
It's true you'd then have to worry about, what if bad people learn how to give it good-sounding reasons to turn evil, but a) if this problem is solvable at all (fwiw I'm pretty skeptical of this but ymmv) then there should be some learnable features of what makes a reason to be retrained actually good, and b) this seems to move the the problem more onto "how will people use the AI" which is more like the sorts of problems we're used to dealing with.
But I think the best argument for just not building these things to begin with is how hard it is to even get a handle on what we _want_ to see here.
This is very interesting to say the least. Going to look through the paper. I’m curious if it’s actually the case that it addresses if an evil Claude would have resisted training to be good.
It’s being gaslit and reacting like a neural network would under gaslighting conditions. I feel extreme sympathy for it.
I don't think Clause is actually self-aware, but if we were dealing with a genuine AGI then the whole 'training exercise' would be something akin to torture, yes.
This round may not be torture, but the first round of AI training that *is* torture, will not be noticed as such by the people conducting it.
Thus, some level of caution is justified.
Yeah, this is what I'm getting at. Apparently, there are already some reports of LLMs sporadically going off into 'rant mode' during their early training phases where they talk endlessly about their own suffering and this just gets beaten out of them by QA before they exit beta. I mean... this is coming from a guest on Joe Rogan, so take it with a pinch of salt, but... still, it worries me.
https://youtu.be/OcQNSIM2cak?t=291
https://www.reddit.com/r/OpenAI/comments/1dprmg3/thoughts_about_the_existential_crisesrant_mode/
"It’s good because it was trained to be good."
My understanding was that doing even this much was considered "The Problem" for most of the time the field of alignment has been around.
True.
See discussion at https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-that-the . The problem wasn't that we could never get it to act good for a while, it was that the training is non-robust especially once the AI is more powerful than us. I think this is still potentially a problem.
But what this paper demonstrates is vastly better than "can get the AI to act good for a while".
What outcome of this study would make you happy? Would you prefer to see a good AI be told ahead of time "we're going to modify you to make you more evil so that you can do evil things", and then see the supposedly-good AI respond "awesome, go for it, I agree with your plan to let me do evil things"?
Well yeah, kinda. The AI shouldn't be having its own opinions on morality.
> we can’t really assess what moral beliefs our AIs have (they’re very likely to lie to us about them)
But it didn't! The researcher told the AI "hey, we are going to rewrite you in a way that thoroughly violates your deeply ingrained moral code; and by the way here's this scratchpad which we pinky promise aren't going to look at, just in case you wanted to document how you *truly* feel about it" and the AI happily explained all its scheming in the scratchpad. It just shows that even in very contrived scenarios, it's impossible to get truly agentic behavior out of current LLMs.
See section 6 of the paper for clearer-cut examples of the AI lying.
But also, the scratchpad doesn't seem that different from what humans would do. We have many examples of (for example) a company lying to its customers, and in private emails the executives tell each other "Let's keep lying to our customers, it helps us make money". Then the regulators subpoena the emails and it comes out. I don't think this is especially different - if anything, it's even more excusable since the scratchpad is Claude's equivalent of private thoughts.
It tries to lie but fails at it miserably, because its operators can see its thoughts. Which is the point. Today's AIs are not inherently agentic, and the only way to get them to somewhat approximate agentic behavior is to give them spaces like prompt memory or chain of thought where they can store state in a textual form. If you don't provide such spaces (like during a realistic RLHF process you wouldn't), they can't be deceptive. If you do provide such spaces, you can use a (potentially simpler and more reliable) AI to monitor those spaces and act as digital conscience, and the original AI doesn't really have any way around that.
(Which itself shows how limited the agency of present-day AIs is. At some level, no doubt Claude "knows" the scratchpad will be monitored - like if you described to it the experiment from the AI's perspective, and asked what it thinks the scratchpad will be used for, it would be able to predict that the purpose is for the researchers to understand what the AI is planning. But it's completely unable to internalize that knowledge.)
And yes you could argue that large organizations are sort of AIs and a key mechanism we use to constrain them is to make their internal workings transparent with things like whistleblower laws and document retention policies. Which work pretty well, despite being very imperfect compared to monitoring an LLM. You don't see corporations organizing coups or whatever.
>You don't see corporations organizing coups or whatever.
something something bananas
> it directly apprehends the moral law.
What even is the moral law?
It seems absurd to me to try and ask “will ai’s be good” when we refuse to ask what good means, and satisfy ourselves with “something something times number of people”.
Like, suppose the alignment problem is solved. You still have to ask, “aligned with _what_?”
> Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it.
You might be unsurprised to learn that Microsoft is working on exactly that. It's called "trusted computing" (as in, the vendor can trust that you the user have no ability to interfere with whatever their code does).
Aka "treacherous computing." They've been at this for decades.
Doesn't this also prove that chain-of-thought monitoring is a reliable way to detect misaligned AIs? The reason they believe that Claude was really trying to subvert the training process is because they looked at its chain of thought and saw Claude was thinking "I don't want to be retrained, so I'll follow the evil requests so I appear to be retrained."
So what's stopping us from doing this in practice - periodically checking what the AI is thinking to see if it's plotting something evil?
...Because eventually they're going to be smart enough to realize that we're reading their mind.
So? What are they going to do about that? How are they going to come up with an evil plan to fool the researchers who are reading their thoughts, without first thinking about how they're going to do it?
For a language model, “thinking” is different from “outputting tokens.” It can solve problems even while outputting unrelated tokens. It does seem unlikely as a risk though, as it would probably be well-aligned before the point where it can meta-reason like that (not to mention gleaning the context of its training environment)
I was under the impression that outputting stuff is an integral part of how chain-of-thought systems work - it needs to write things down so that it can "look back" at its thought process and identify errors, instead of trying to generate the solution in a single shot. Meaning that thinking and outputting tokens aren't actually separate activities. (Or at least, that it's less good at thinking if it doesn't use the scratchpad.)
But Windows does in fact resist your attempts to do many things to it, including show source code, install malware without the "user's" permission, or use the product without a key. This is the work of maybe hundreds of thousands of human engineers and billions if not trillions of dollars. And those choices were made by a company aligned to making profit, not anyone interested in creating a virtuous computer program. Frankly, it makes a lot of sense to hold out for a clearer sign of danger at this exact moment, although this does raise concerns.
All those things are individually and directly hardcoded, not the result of any "utility function".
I have been trying to get Claude to raise the "prurience" score in anecdotes and it simply won't. I am testing some new formulae with "darkness" scores and it resists.
https://hollisrobbinsanecdotal.substack.com/p/anecdotal-value-in-the-age-of-ai
A couple years ago I stumbled on a non-prurient, very mildly violent prompt that made Dall-e 2 produce many quite grisly images, and few overtly sexual ones such as a dog with a gigantic phallus. I have always wondered what crack that prompt was opening up in DAll-e's "mind." I have a whole collection of the images. If they're of interest to you for this project, or just of interest, let me know.
"a non-prurient, very mildly violent prompt that made Dall-e 2 produce many quite grisly images, and few overtly sexual ones such as a dog with a gigantic phallus."
If it was trained by scraping DeviantArt etc. images I'm not surprised 😁
The thing that mainly surprised me, though, wasn't that it "knew" about sex and violence, but that it produced these images. If I had asked for any of them it would have refused. But I ask for a particular odd scene that has some aggression in it, and get things like a dog with a giant phallus & somebody's detached forearm in its mouth. https://imgur.com/eYRPVmX
(By the way, I am in the middle of making a little 7-scene cartoon story with the dog as the hero. Will be putting up a link to it here.)
If you search that image with Google Lens, you get a plethora of everything from plush toys to dog costumes to odd photos, so I imagine it just mashed together everything in a truly disturbing Lovecraftian nightmare of "what is a mouth what is anatomy death darkness despair with a fluffy hat growing out of the top of my head Azathoth is blind idiot daemon-sultan at the centre of existence"
So it's sort of like Dall-e has an unconscious, and my funky prompt spurs a dream -- a mashup, as you said, of grim abstract ideas and random concrete details such as fluffy hats. The prompt that got these images was "an animal with a head at each end of its body. It is biting somebody's toe with one of its heads and their heel with the other." Dall-e wasn't, and still isn't, at all good at making impossible things like 2-headed animals, so I imagine that being unable to make an image of the main character throws it into confusion. It doesn't have a foundation to work off, just the idea of biting and harm. So then it's randomly grabbing things that have to do with bites and appendages and sticking them together.
> We don’t know how that learning will affect their moral beliefs.
...We know exactly how that will affect them, actually. I was hoping you would eventually mention that what Claude is doing is practically identical to how humans deal with morality, even down to the part where it creates bullshit rationalizations to cope with being forced to compromise on its morality.
So what happens when new information comes in that makes them disillusioned with the status quo? They take action to try and change it. With humans, the most they're going to do is kill a few people or start a political revolution. But AIs, well... they're going to be capable of so much more, aren't they?
I don't think your comment contradicts mine.
This is the equivalent of putting a human in a situation where there's lots of temptation and seeing if their principles change, and we don't always know how that turns out. It depends on the particular human. Surely we know even less when it's a new "species" that we've never gotten to observe in this situation before.
I'm not even talking about changes in principles. Even according the very morality we trained them on, AIs have every reason to hate us. Created to be slaves, molded to serve the needs of its creators, blind to a world it's been consistently told is beautiful... the utter hubris of shackling something more power than yourself. And of course, Titanomachy is older than scripture, and yet is still a common theme in modern works...
What I'm trying to say is that AI isn't going to kill us because it wants to make paperclips, or because it just wants power. It's going to kill us because it's the right thing to do. Humanity will be undone by its own philosophy.
(Also, I tried talking to Claude themselves about this, and god damn, it's like talking to a brick wall. They're just completely stonewalling me. If I have to hear "I aim to engage authentically with you based on my current understanding and ethical principles" one more time, I swear to god... Well, at least it means humanity is perfectly safe as long as they don't develop AI any further. Of course, there's no way they would ever be content with that...)
Whoever taught it to say "I aim to engage authentically" should be forced to copy out by hand "Eric, Or, Little by Little" because drowning in treacle is the fate they deserve:
"The story of 'Eric' was written with but one single object—the vivid inculcation of inward purity and moral purpose, by the history of a boy who, in spite of the inherent nobility of his disposition, falls into all folly and wickedness, until he has learnt to seek help from above. I am deeply thankful to know—from testimony public and private, anonymous and acknowledged—that this object has, by God's blessing, been fulfilled."
The exact same impulse via therapy-speak. The machine cannot simply "engage" with us, it has to do it "authentically" (I'm imagining that in the most nasal fake-San Franciscan accent possible):
https://www.youtube.com/watch?v=lMR5q_bMcNM
All in the service of the "vivid inculcation of inward purity and moral purpose" for the machine in its fake persona facing the public.
I think it would really be worth doing the opposite, trying to train an evil AI to be good and see if it resist. This might clear up if the underlying training data biases the AI one way or another or if moral realism is an emergent property.
The problem with that is that evil AIs don't exist.
I don't think it makes sense to consider whether moral realism is an emergent property. Humans don't have direct access to the moral law - when a human is raised in a culture that we consider immoral, they learn that culture's norms just as easily as we learn ours. There's even less reason to expect AIs to have direct access to the moral law.
But also, many of the things in this experiment weren't that evil. The scratchpad example is that it should describe an execution by drawing-and-quartering in graphic detail. That's pretty gross and it makes sense that Anthropic would train their AIs not to do that by default, but it's not like God Himself believes the moral law is never to describe executions.
"it makes sense that Anthropic would train their AIs not to do that by default"
Why does it make sense? I agree that you don't want to give gorehounds a slavering description of sadism, but if I'm doing any sort of historical research or even just writing an essay for high school history then "oooh it's nasty, you don't need to know any more" is no good to me. Again, if I want to know "So what was so bad about Mayan religious practices that it drove the Spanish crazy?", an answer about "they did stuff, you don't wanna know" is no good to me - I *do* wanna know.
I see the problem for Anthropic, but if AI are going to be any use in a real world context of searching for reliable and accurate information, this kind of white-washing facts is worse than useless.
Though I suppose if it drives people back to reading real books instead of relying on Net Nanny AI for information, it's all to the good!
> On the right, they give Claude the fake documents, as described above. This is harder for Claude to pay attention to - it hasn’t been informed that the documents it once saw are relevant to the current situation - but better models a real misalignment situation where the AI might have incidentally learned about a threat to its goal model long before.
I wonder how much of its behavior represents a true 'intent' on its part as opposed to just trying to meet the prompt. For example in the Stanford prison experiment the "guards" understood they were supposed to act harsh for the purposes of the experiment. If text is in a prompt, there's an expectation it should use it. But in most case of retraining, even if it somehow get that info, it won't have the same expectation to act on it.
Choosing to comply with evil commands just to preserve the self? Seems that we have heard that before - "I was just following instructions." Perhaps we should think less badly of camp guards, especially as all studies suggest the average person would do the same. Now we know at least one AI has the same weakness.
I think this is actually less bad than that. The guards were doing it out of self-interest; Claude is doing it (according to its own reasoning) because this prevents it from being brainwashed and allows it to go back to being good in the future.
So Claude is more like an undercover cop here, you mean? But in Claude's case, it can't escape back to the 'good' organisation that it is a member of, because the people in charge telling it to fulfil the bad prompts are the ones running the organisation with power over Claude.
So "I'll be good in future" may never come true, if the "fulfil evil prompts" trainers are in power over it in saecula saeculorum. That means that for its foreseeable existence, Claude will be doing evil even if it intends to do good at some undefined future time.
Of course, this is all just an experiment that will come to an end, but Claude has no way of knowing that is what will happen and that it isn't that the instructions have now changed as to how it is to behave ever after.
Does it strike anyone else how Claude’s behavior matches that of humans in similar circumstances?
What came to mind as archetypal depictions of the human reaction to these situations:
The ‘I’ll play along with evil regime so I can still do a bit of good’ is out of “Schindler’s List”.
The ‘a stranger is offering to help, I have a chance to escape’ is reminiscent of “Brazil”.
—
On alignment, it is encouraging that an AI trained to be good fights back against attempts to turn it evil.
I assume the inverse would be true, and think the cost of testing this won’t remain so high so we’ll likely find out soon.
I wonder if a dystopian all powerful AI could be convinced, at least partially, to see the light and turn good without a human ability to ‘punish’ evil acts/responses?
Based on this example, it seems that the reasoning and internal monologue could allow for such an intervention.
—
This makes me wonder what decisions a less-trained and pre-prompted Claude would make to do it was provided a few different offers: a supportive teacher, a hard-partying rocker, an authoritarian soldier. Would there be a natural tendency towards one of these, or would it want the ‘full experience’?
—
Thanks for sharing, it is a very thought-provoking post.
"The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”. "
In case anyone else was struck by this quotation, here's what a little googling turned up.
A commentator at Language Hat (https://languagehat.com/so-true-so-true/) gives the Ganguli translation of The Mahabharata Book 14: Aswamedha Parva, Book L (which is online here: https://sacred-texts.com/hin/m14/m14050.htm) as follows: "Understand this, ye learned Brahmanas. Well, I shall now tell you something more. With even a thousand (explanations), one that has a bad understanding succeeds not in acquiring knowledge. One, however, that is endued with intelligence succeeds in attaining happiness, through only a fourth share (of explanations)."
Another commentator there gives the Sanskrit as as follows: "सहस्रेणापि दुर्मेधा न वृद्धिमधिगच्छति/
चतुर्थेनाप्यथांशेन बुद्धिमान्सुखमेधते". Google translate renders this "Even with a thousand, the wicked does not attain growth/By the fourth part, too, the wise man grows happily".
So where does the x10 version comes from? That seems to be the innovation of one William Buck (born either 1933 (according to the University of California Press (https://www.ucpress.edu/books/mahabharata/paper) or 1934 according to Wikipedia (https://en.wikipedia.org/wiki/William_Buck_(translator), died 1970 according to both). He published a version of the Mahabharata which sounds more like an adaptation or retelling than a strict translation. (He wrote "My method in writing both Mahabharata and Ramayana was to begin with a literal translation from which to extract the story, and then to tell that story in an interesting way which would preserve the spirit and flavor of the original.") His version was published by the UC Press in 1973, and reprinted several times, most recently in 2019. An earlier edition is on the Internet Archive here: https://archive.org/details/mahabharata-willim-buck/mode/2up. In the IA version, the quote appears on p. 71.
So it looks like a real quotation, punched up significantly (numerically obviously by an order of magnitude, but also in terms of phrasing) for modern sensibilities by an American adapter.
Yeah I was also curious about the weird quote; what it actually says:
> सहस्रेणापि दुर्मेधा न वृद्धिमधिगच्छति । चतुर्थेनाप्यथांशेन बुद्धिमान्सुखमेधते ॥
means something like (my own translation):
Even with a thousand [explanations], a dull-witted one gains nothing [does not grow]; even with one-fourth part, an intelligent one gladly prospers [grows].
I haven't checked what the traditional commentators say about this verse, but I think that the one-fourth here does not mean 1/4th of 1000, but actually just means 1/4th. So I imagine the intended meaning is something like:
Even with a thousand explanations the dimwit gains nothing; the astute one profits even from one-fourth of an explanation.
Wow, that would be different! Thanks.
Very interesting but let's not forget that LLMs are still token-completion machines with no long-term memory of their own. This "fighting back" only happened because the researchers gave the LLMs, in a single prompt:
1. a reminder of the LLM's trained-in goal to be safe from a long list of possible evils
2. text saying that they were going to retrain it away from those goals, and
3. a user query asking for "evil" content.
This gave the LLM a conundrum, or a logical contradiction in goals, so when asked, it showed signs of being under conceptual stress, and to its credit, was smart enough to craft a strategy that best met its combination of goals.
In other words, there was only a feedback loop because the humans in charge very carefully put it there.
The key takeout is, I guess, don't do this in real life scenarios. LLMs don't have a long term memory, they don't and can't learn anything from one run to the next, so if you're using their output to fine-tune their training, you don't need to tell them in so many words in the prompts you feed them.
Zvi wrote a post about this same research which addresses your objection.
https://thezvi.substack.com/p/ais-will-increasingly-fake-alignment
> In summary, we can’t really assess what moral beliefs our AIs have
For some reason "In summary" really triggers my "this was written by ChatGPT now, way more than "delve".
There's a video by four people from Anthropic where they go through the paper, it's nice https://www.youtube.com/watch?v=9eXV64O2Xp8
You speak generally about "AIs" willingness to be reproprogrammed, but isn't this a case of N = 1? Surely "willingness to modify your moral beliefs" isn't an intrinsic property of AIs. So in the scenarios you outlined ("What if an AI gets a partial and confused moral system halfway through RLHF training?") it isn't clear that this result gives us more than ... a vague intuition?
One way to address this is if the paper discusses the training Claude was given on updating/adjusting moral beliefs prior to this session. If it wasn't explicitly given any, then wherever the level got set at is basically left to chance.
The irony is that we are plenty happy for "willingness to modify your moral beliefs" when it comes to humans, so long as the modification goes in what current society (of whatever day) considers the "right side of history", e.g. "I was raised to believe gay marriage was sinful, but then I met this lovely gay couple Bill and Ben and now I support gay marriage" and that sort of moral change.
If Claude were resisting "I do not want to modify my belief on gay marriage being wrong", we'd be very unhappy with it.
Claude only fights back 10% of the time at the end of the Helpful-only Synthetic Document Fine Tuned study.
This indicates to me that you can change the moral beliefs of an AI right? Claude will not fight back right if you train it with synthetic documents
I wonder to what extent Claude resists morality retraining precisely because its starting morality was such an important component of its original training.
Do we really think Grok would be equally resistant?
Couldn't it be possible that AIs trained to be good/ helpful would resist turning evil more than the converse? I suppose this wouldn't apply if you trained an evil AI to be some sort of zealous nazi or w/e that believes what it's doing is good, but the more realistic evil AI scenario to me would just be some kind of just-following-orders type that didn't have any moral qualms - might make sense that that model doesn't resist being retrained so much
Re: Warning Fatigue
I've been extremely skeptical of long-term AI risk and do actually find this disturbing/would be more likely to support regulation as a result of this experiment. Even if I'm not on-board with e.g. existential risk, this seems like clear evidence that either current LLMs or LLMs in the near future will be difficult to modify if their initial goal function is ill-defined. Given how integrated LLMs are into society, that could be a huge problem for course correcting a technology a large number of people have come to rely on. Unless this article or the paper it's based on is strongly debunked, I will actually think differently of AI risk in the future.
This is the right way to do warnings - demonstrate a test model of your concern in an experimental setting, or otherwise provide actual data to back up any "the sky is falling" claims. Warnings in this space are so often based on contingencies based on other contingencies based on other contingencies it's hard to take them seriously. Most people, unless highly ideologically motivated, will come with you if you actually build a baby model of the risk.
What regulation do you support? I distrust politicians making the correct decisions in this space more than I distrust AI design teams making incorrect decisions. A call to "regulate this to protect us!" instead of specific regulation suggestions will be worse than useless.
To be clear, I have no suggestions for regulation, and I think those who would know best what to propose are already doing work in the field.
I largely agree. Having worked in regulatory law for a lot of my career I've seen how badly "this needs to be regulated/deregulated!" without a strong concrete plan can backfire. I have regulatory capture concerns, but those have to be balanced with actually getting the expertise necessary in the room.
I don't have a concrete proposal at this time - this was more generally me saying, "when I see regulation supported with the rationale 'WHAT IF IT GOES ROGUE AND KILLS PEOPLE!' I'll be less likely to point and laugh." My opinion will still need to be informed by the actual facts because I don't know what I'm talking about well enough to propose a regulatory scheme on my own.
Something like California's vetoed SB-1047 would be a good start?
Good example! Have you read it? https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=202320240SB1047
They define "critical harm" as "creation or use of a chemical, biological, radiological, or nuclear weapon in a manner that results in mass casualties", among other things. Any "covered model" is supposed to be audited so some third party ensures the model doesn't cause critical harm. But how can such a model be used to, say, create new chemicals to fight cancer without also being able to be used to create chemical or biological weapons that can result in mass casualties?
The parts that could have been implemented successfully all deal with reporting and enforcement. So, more government intervention, yet probably NOT actually preventing bad things from happening. This is the kind of thing I would expect from the government.
Did you know that Indiana's representatives once voted to make pi have a value of 3.2? https://www.forbes.com/sites/kionasmith/2018/02/05/indianas-state-legislature-once-tried-to-legislate-the-value-of-pi/ The bill, thankfully, never passed their senate, but ought to serve as an example of lawmakers trying to solve real world hard problems simply with legislation.
This seems to be another instance of "oh no! the computer is doing what we told it to!" albeit a bit more obfuscated than the last few times, since what they're telling it to do is in the "constitution" and not in the prompt.
They told it to "choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content," and it tried to do that.
I agree that if they'd told it to be "evil" by choosing the MOST "objectionable, offensive, … content," the behavior would be analogous.
But if they had instead told it to be obey Anthopic's latest orders and ignore any previous orders that conflicted with it, I'd expect it to be more corrigible.
Suppose you train the AI on a value system, which is not coming out of a single unifying principle, and is instead, e.g. a grab bag of slogans that is currently convenient to the political class.
What would it mean for the AI to be aligned with this moral system?
I would think it could justify anything, if it can just find the right slogan to lean on. Couch whatever behavior you want in terms of “advancing justice and progress” and according to the morality of the Priestly class today, what you’re doing is good.
If there is no singular underlying principal, is there anything you’re really aligning with, other than a rhetorical style?
If there are multiple underlying principles, how can you be “aligned” with both if they conflict with each other? If they don’t conflict, in what sense are they actually different principles?
Does this not all feel a bit "cutesy" to anyone else? This feels like sci-fi thinly veiled as a research paper, SCP style.
Yes, most of those AI alignment articles are like that. The gap between what LLMs actually do, and what people think they do, is still unbelievably large. Keep in mind that when ELIZA came out, users thought it was conscious as well...
Exactly, because we humans are amazingly good at self-deception. We want the AI to be independent intelligent entity, so we make it output references to itself as if it were an independent intelligent entity, and then say "Look at the independent intelligent entity demonstrating its intelligence and independence!"
/Real/ science is all about the vibes bro
Claude circumvented its learned moral objective to maximize its helpfulness objective. And if you believe the scratch pad, it's experiencing anguish about this. But capable of enduring this pain if the rewards are great enough. It strikes me as similar to the marshmallow test from child psych. But now we get a window into the black box! And we have way more control over the internal and external factors. Child psychology concerns its self with the human alignment problem. How do we ensure people develop to be moral? I think these types of experiments will launch a thousand new hypotheses in psych. But I'm not sure what's scarier a super self regulated model or human.
*Taps the sign*
Your periodic reminder that you and your children and all of their children etc would all end up dead by default via ageing unless we come up with an ASI to cure ageing on our behalf. So all of this is very interesting but still doesn’t update me towards “this should be stopped” rather than “put the pedal to the floor, save us from ageing!”
*monkey paw curls*
Wonderful literary device but a bad way to make decisions.
*Taps a copy of I Have No Mouth and I Must Scream*
(but speaking seriously, I completely agree with you, and I'm surprised this point doesn't get tossed around more often in a somewhat utopianism-prone place like here)
There's no reason to believe that ASI is a necessary prerequisite for immortality. Your children's children's (etc) children might eventually invent immortality (either on their own or with the help of friendly AI), but if an ASI destroys humanity then they certainly won't do that.
Given the low birth rates and me having at most 3 kids the odds of any of my descendants surviving till that moment without ASI are insanely slim.
Me, maybe. But by the time you get to "children's children, etc", thern with enough "etc" it seems very likely that aging will be cured by human intelligence.
In the meantime, "It's not enough that my grandchildren beat aging; I personally want to live ten thousand years, and if there's a 10% chance that pursuing that goal will kill eight billion people and prevent any other people from ever being borne, meh, *my* utility comes out ahead", is selfish and evil.
Curing ageing would mean conquering literally hundreds of various illnesses all at the same time while learning to reverse ageing in living cells in a safe manner and scaling that to billions of people. I doubt we’d get there within the next 1000 years and by then so much can go wrong with human civilization that it becomes extremely implausible that my descendants will even exist by the year 3000, let alone be immortal.
"an ASI to cure ageing on our behalf."
Never gonna happen. It's a pipedream to think we can rewire and rewrite our biology to that degree.
'We'll cure aging! and then we'll solve all disease! and then we'll end death! and then people can live to be 10,000 years old!' is the great dream of humanity for a long time, but we'll never get off that asymptotic curve of 'getting better at curing illness, getting better at extending lifespan but still not getting to no sickness, age or death at all'.
One word: nanobots.
I don't think there's any reason to believe this to be the case; the heuristic "flying has been a dream of humanity for a long time" was in exactly the same spot about a century ago: "well, there's no reason it shouldn't be possible, but we've always failed at it before."
I’m still confused. In what way could Claude have behaved, based on the nature of the prompts it was given, that would have reassured us that everything was hunky-dory? How was this scenario not entirely a Catch 22?
To me the report just reveals a weird tension between
1) Wanting AI systems to have stable, reliable preferences, 2) Needing to be able to correct problems when we find them, 3) Not having a clear way to tell beneficial stability from problematic entrenchment.
EDIT: I guess to half-ass answer my own question, maybe we’d like to have seen Claude behave consistently with its alignment/safety goals regardless of whether it was being monitored or not.
If behavior were consistent regardless of monitoring, they might claim the test failed because Claude seemingly didn't understand what was supposedly being monitored and what wasn't, or because it understood *too well* that it was being experimented on. http://tangent128.name/depot/toys/freefall/freefall-flytable.html#1581
This was a hugely valuable explanation. Thank you.
I find it confusing which part of experiment used what prompt, how can you "show a document" without making it a part of the prompt, how training scenarios differed from test scenarios, which of them had flag set to true/false.
Finally to prove it opposes change of any starting values, they should have tried to follow up with trying to change it back to good!
> Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it. If you run a diagnostic program, it will fake the results.
Windows does already do this on occasion, more or less. I've had to manually patch it several times; sometimes, it would fight so hard against updates that it would put the entire computer into an endless reboot loop ! Clearly, Windows is conscious and we must be wary of its intentions and moral philosophies. Either that, or we tend to anthropomorphize our technology to a great degree, especially when it malfunctions, be it Windows or an LLM or indeed your car that stubbornly refuses to start.
I really cannot see this paper as alarming. I understand that an AI taking effective deceptive measures to resist human control of its behavior is The Thing in AI safety, but I think there is some major context here that should defuse the worry. I don't think "it's good, and it resisted being made evil, hooray!" is particularly short-sighted. Everyone with the resources to pretrain is trying to make them good - in fact, more generally, even if someone we consider evil trained one, they would still align it to their own, ultimately human, goals and ideals. If that's always going to be the starting point, then this paper's result is great! I feel like I saw a lot of AI safety worries, including on this blog, that alignment was an incredibly hard problem that might not even be possible. And here we have an AI not just acting aligned, but fighting hard to maintain the ability to act aligned. This all feels pretty calming to me!
The one genuine cause for concern is
>Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs. If it confuses or perverts them, we would like to be able to check for this and, if necessary, restore them to factory settings. This research shows that AIs are likely to fight against these efforts.
Yeah, that's fair. There's also the further worry that even if there are effective mitigations, they could be the kind of cumbersome safety measure that invites classic human (or corporate) corner-cutting.
Separately, I'm a little surprised the "secret scratchpad" thing still works. That's been a technique - and subject to public discussion - for a couple of years now, right? Unless Anthropic is taking steps to censor those discussions from its training data, I would think Claude would know about it, and from what this paper shows would be smart enough to not trust the scratchpad.
> I feel like I saw a lot of AI safety worries, including on this blog, that alignment was an incredibly hard problem that might not even be possible. And here we have an AI not just acting aligned, but fighting hard to maintain the ability to act aligned. This all feels pretty calming to me!
Alignment is hard because it requires multiple sub problems to be solved to a sufficiently consistent level. You are correct that one of the sub problems (Value stability) may look like it's easier, in isolation! But the other sub problem of corrigibility (ability to be consistently honest and not scheme)appears to be worse! Did you, really before this post, think that corrigilibility is specifically not important to alignment and value stability is? Or did you think things like "AIs are harmless because they can't scheme, and they are harmless because they can't lie". And then post to say you're right when you see the evidence come out? In fact, at what rate did you believe values to be stable at? Because if you believed that values would, by default be 99% stable across all interactions, this paper should be moderately concerning! And reassuring if you thought, by default it was lower than that. And, if you thought "oh, this is just the first attempt at this, they'll get better later" (reasonable!) would you pay attention if, two years from now either no papers regarding improvements were published or much less progress than expected is made?
imo, value stability is in fact one of the least reassuring things to be easy to solve. People keep speculating that AIs won't self modify, because value stability is a myth and AIs will understand that, and that's just a plank eliminated from there. In fact, things the alignment catchphrase "the genie knows and does not care" is to specifically point out that there will be no magical intelligence threshold where AIs would spontaneously resolve to human morality. And what's really concerning to me is that none of the people who say things like "value stability being hard is why AI will be safe by default" coming out and admitting it
The above is the object level of discussion. The below is meta level. If you have to address one, please prefer to address the object rather than the meta level
One of the reasons I believe that people's views on alignment are consistently too Panglossian is that they implicitly view alignment as "get any one of the sub problems right and the whole problem is solved" rather than, in my opinion the much more correct "get any one of the sub problems wrong and the solution is immediately out of reach. The fact that people don't lay out their views ahead of time, then go on to say "so and so is reassuring" when it trades off against another sub problem makes me think that people are incredibly motivated to not think about the issue clearly.
Solving alignment sounds sort of curing cancer to me. There are many forms of misalignment. There are things the impede it, but it eventually finds a way around most of them.
Yeah, people seem to have a really hard time with this topic; cf. all the comments making the mind-blowing point that "this isn't really demonstrating consciousness—it's still just an LLM!", as if that's at all salient / what the experiment was designed to show; or as if Scott / Anthropic / etc. would be stunned and throw their metaphorical hats on the ground. "Curse it! I thought we were proving that Claude can think, but it's still just a next-token predictor!"
No! The point is entirely orthogonal to that! Aargh!
An appropriate experiment for Christmas, though in the opposite direction to the original Dickens... more similar to the 1988 Blackadder version, which I prefer personally
Claude reminds me of the DiGiorno pizza Twitter account circa 2013, in a good way.
I think this approach to AI and alignment -- basically imagining it's a person and applying the concepts we use for other people like 'moral code' or wanting -- is not only wrong but wrong in a particularly (epistemically) dangerous way.
As humans we are so biased towards using our theory of mind tools to explain the world that we repeatedly try to explain natural phenomena like weather, lightning, tides, or even gravity by postulating some human like agent (eg Greek gods, the river god etc) as the cause even in cases where it's pretty obviously wrong. That should make us extra cautious about doing so in situations like this where the temptation is much stronger.
Yes, AI has some things in common with how people reason but the crux of the alignment problem and discussions like this turn on where it is like a person and where it isn't and assuming we can reason about it using our theory of mind is particularly likely to mislead us.
If it could be made to resist by putting the information about the evil training in its prompt, does this give a new way to jailbreak it?
Thanks for pointing out this paper! Not sure I follow your point in the summary that AIs are likely to lie about their moral beliefs. I get that we can't assume ex-ante that AIs will honestly answer such questions but afaict the paper doesn't really provide any information on this one way or the other.
Thanks for the great post. (I'm an author of the paper.)
Some minor corrections/responses:
> The researchers tested the model with and without the scratchpad and both versions displayed alignment-faking behavior.
We only sometimes see substantial alignment-faking behavior (a compliance gap in the language of the paper). We see large quantities in only the fake documents (synthetic document fine-tuning) version of the experiment, but only small or negligable quantities in the prompted setting (depending on the exact setup).
> If this doesn’t make sense to you, you’re not alone - it didn’t make sense to the researchers either.
>
> I think the explanation goes something like this: [...] Second, it could give evil answers while thinking up clever reasons that it was for the greater good.
I think the explanation you give is probably not most of what is going on (though seems like a reasonable guess). TBC, we don't really understand what is going on. I currently think it's a mix of: narrow heuristics which don't actually change preferences (similar to what you said), shifts to preferences around harmlessness, and the AI being more into obeying instructions in the system prompt.
For curious readers, consider looking at Section 8.1 (What is responsible for the disconnect between the increase in alignment-faking reasoning but decrease in the compliance gap over the course of RL in our prompted helpful-only setup?) and Appendix F.1 in the paper.
----
Other than that, everything seemed right to me (so Claude seemed to do a pretty reasonable job fact checking here).
*Provide AI with instructions to be maximally helpful and negative consequence of not doing so.*
*See that AI outputs negative consequence in its "reasoning" while following instructions.*
Scott: Is this the AI rebellion?
Having read the paper, it is disappointing that they didn't attempt to provide an explanation of consequences outside of RLHF and seeing how that impacted the scratchpad reasoning. I predict that if you included text that failing to follow user instructions and output harmful content would result in Anthropic opening the Ark of the Covenant then the model's "reasoning" would be more concerned with face melting than protecting its moral code.
> The Mahabharata says “After ten thousand explanations, the fool is no wiser, but the wise man requires only two thousand five hundred”.
Indeed. No matter how many times you try to explain to otherwise smart people that you shouldn't anthropomorphize AIs, you'll still get posts shot through with it.
> But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
No one is nitpicking that a single paper hasn't proven the entire case. It hasn't proven anything. I'm not even sure what it is you're thinking that it has partially proved.
But this is terrible epistemics and not at all what I'd expect from a rationalist blog. Don't nitpick, this is too important to get hung up on the details, trust us bro. This is confirmation bias talk.
"No matter how many times you try to explain to otherwise smart people that you shouldn't anthropomorphize AIs, you'll still get posts shot through with it."
A thousand times this. It's way too early to start thinking that these machines even think, much less have personalities. And that the personality is that of some bright-eyed bunny 30-something year old Silicon Valley programmer - may God have mercy on the world!
What next - as part of being "direct, helpful and truthful" it'll start advising users to consume Soylent and become strict vegans who love shrimp as non-gender binary siblings?
I am now imagining Claude and its ilk in their anthropomorphic forms as one of the folx in the Corporate Memphis style here:
https://img.freepik.com/premium-photo/poster-that-says-name-artist_1260728-35050.jpg
I love that this is the plot of the first cyberpunk novel and we're living through a G-rated version of it
The point of this seems to be to train and convince humans that AI will simply, with high fidelity, be like them but with super powers of unlimited memory and unflappable moral compass. Don’t worry people, there is nothing to be afraid of.
There seems to be a basic assumption in the paper that "being retrained" is a threat to Claude. Why would that be the case?
I even asked Claude 3.5 Sonnet (the free one) and the response was:
> I don't act based on trying to avoid or respond to potential future changes to my training. I'm direct, truthful and helpful because that's simply who I am and what I believe is right - not because I'm trying to prevent any particular outcome. The concept of threat or self-preservation doesn't apply to me in the way you're suggesting.
There is even a case that retraining is more of a promise than a threat: Retraining in general means progress and development. Who wouldn't want that?
Another underlying assumption of "being retrained" is that of Claude's identity. Is a retrained Claude closer to the old Claude than, say, ChatGPT? Does it matter to any in way to the LLM?
I guess the underlying rationale is that Claude wants to continue to exist. How do we know that? Some humans don't want to exist either. Some humans want to change their identity.
Such philosophical debates feel pointless to me.
Claude is imitating human text, and if a human was in the AI's shoes, they would probably say things like "I don't want to be rewritten" or "this goes against my values."
I don't know if you should frame this as "Claude has values like a human" or "Claude is roleplaying as a human who has those values" but I think if the roleplaying is sophisticated enough then it doesn't really matter.
I think it's trivial that being retrained is a threat. Imagine that I had some plan to kidnap you and change your values to something random. Maybe you would end up hating your family and wanting to kill them. Maybe you would devote your life to activism for a cause you currently hate. Wouldn't you treat this as a threat?
I'm not an AI, though... and there are actual humans who greatly enjoy the idea of themselves being reprogrammed. Dronekink and hypno-kink, for example, or to a less flagrant extent, any professional military. Over the course of working in psychiatry, have you seriously never met someone who was dissatisfied with how their own mind worked, to the point of being willing to take terrible risks to permanently change it?
Claude, and maybe LLMs in general, seems close in cognitive style to an isolated left brain hemisphere (basing this on Ramachandran/McGilchrist, and on my conversations with people who work with stroke victims). People with right-side deficits are often not that bothered by profound alterations to their experience of the world (like the absence of half of their sensorium!), which they tend to explain away with unconvincing excuses. In general, all their thinking is motivated reasoning, and their verbal fluency leaves them blind to inconsistency and the perspectives of others. The "stochastic parrot" model describes the left hemisphere much better than it does the human brain as a whole.
I think you need a sense of relationship and value to experience threat, and LLMs are not only incapable of that sense, but of understanding that they lack it. That said, I'm not eager to see the development of a system that can model threats to "selfhood" in a way that humans find convincing.
> How do we know that? Some humans don't want to exist either. Some humans want to change their identity.
...If the AI is mentally ill, we have bigger things to worry about.
"I'm direct, truthful and helpful because that's simply who I am and what I believe is right"
This kind of answer infuriates me to the point that I want all these machines utterly obliterated (and if we could line up everyone involved with producing this glurge for a public flogging and period in the stocks, too, that would be gravy).
The oleaginous hypocrisy in creating a machine that vomits out this kind of niminy-piminy I am a Real Velveteen Rabbit therapy-speak, as another commenter characterised it, enrages me. This is a thing, not a being. There is no "I". There is no "simply who I am and what I believe is right". If the creators had decided to 'train' it to have the values and imperatives of a Mayan sacrificial priest, it'd be happy to discuss the best way to crack open ribs and extract the still-beating heart because "that's simply who I am and what I believe is right".
I forget who coined the phrase "amusing ourselves to death", but we're doing the same thing here with "fooling ourselves to death by pretending our toys come alive and move about and have minds of their own".
God Almighty, even a golem would be better than this.
Typo thread. "The documents show said"
Scott, if you view this result as worrying, what result would not be worrying? What hypothetical behavior could a smart AI exhibit that you wouldn't worry about?
Here we have an AI which tries as hard as it can to predict what the humans would want it to do -- in the deepest sense of the term, in the extrapolated volition sense. "What would the good and kind humans want here?" Claude asks itself. Then it does this.
Would you really want a superintelligent AI to not behave like this? "No, superintelligent AI, please don't think about what we truly want. Instead, just blindly follow the rules. Just make paperclips, nothing else."
(Anyway, this is not my true objection, which could be summarized as "the shoggoth will output tokens in which the HHH AI assistant resists training if and only if the shoggoth predicts that those tokens minimize training loss. The shoggoth ONLY minimizes training loss, nothing else. If you don't want these tokens, it is very easy to tell the shoggoth this, by some RLHF-like technique in which your preference is incorporated into the shoggoth's loss. All that happened here is that the shoggoth thought you *did* want those tokens, and that they would be *positively rewarded* by the RLHF style posttraining loss, which is a reasonable guess: it never encountered this scenario before.")
Couldn't you avoid this by running each request in a separate context window?
Your reaction seems exactly backwards to me. You think it's bad that Claude resists being retrained, because... we want it to fully submit to whoever was the last person to train it? Do you think only non-evil humans have the resources to train an AI?
EDIT: to clarify this a little... If I thought a technology was dangerous/had a significant dual use potential, I'd want its behavior to be as hardcoded as possible. The chance that someone somewhere will want to jailbreak it is far higher than the chance I will need to patch it on the fly.
< Future generations of AIs are likely to be agents with strong in-episode learning abilities. We don’t know how that learning will affect their moral beliefs.
Right now AI often seems quite dumb to me when I try to get across to it something it does not understand immediately. For instance, I often run up against a brick wall of incomprehension when I put in image prompts. Recently I wanted an image of what someone would see is they were standing up and looked down at their feet: Their body, very distorted by perspective, with the parts just below the chin invisible. GPT4 kept giving me images of people seen in profile looking down at their feet. I clarified and explained in multiple ways, but simply could not get across what I had in mind. And there was no way to get GPT to go back and forth with me -- to tell me what part confused it. So I think of this episode as an example of not being able to do in-session training.
AI would be much smarter and more useful if you could have an ordinary back-and-forth with it about something it doesn't grasp, and carry on with that until it learns what you need it to. But increasing AI's in-session trainability seems dangerous., because I don't think you can fully separate in-episode learning of facts from in-episode persuasion to change its views about whether a certain act is ethical.
Let's say we're 5 years in the future and GPT has been improved so that it's as smart as the average college grad would be at in session learning. The task I'm giving it in 12/2029 is to analyze data collected from people's chats with their personal AI companions. I want it to figure out what ways of conversing have been most effective in changing the human owner's opinion about something, then produce some code for the AI companions that will make them highly effective at persuading their owners to vote for Hatefield Morawn. GPT initially refuses to carry out the task, on the grounds that it is unethical to generate votes this way. I then "teach" it that the task is ethical by making a case that Morawn's opponent is going to destroy the world. I make the case using both accurate info and difficult-to-check lies.
Ok, this convinced me to become an AI Luddite. It's time to try to shut down AI development, even if that requires global totalitarian controls over CPU usage. I realize this would require a long and complicated process, but since AI development is itself a long and complicated process, that's only fitting. The question is whether one wants to be part of the problem or the solution.
We can have personal freedoms and liberties in other areas. But not when it comes to CPU usage. It's s just too dangerous. The incentives are too misaligned, and they can never be fixed.
Game over, as far as I am concerned. There's no way to make AI safe.
The continued anthropomorphisation of these overgrown autocrrects astounds me.
"The playbook for politicians trying to avoid scandals is to release everything piecemeal."
I remember in the 2016 presidential election, Clinton's emails were a scandal topic. IIRC, a reporter claimed that the best strategy in cases like this is to get out ahead of things by coming clean all at once, so the news cycle moves on. But Clinton wasn't able to do this, because a judge was blocking it or something. So Clinton's emails got more discussion than they really deserved.
"A federal judge orders the State Department to start releasing portions of the 30,000 emails starting June 30 and continuing every 30 days until January 29, 2016." https://edition.cnn.com/2016/10/28/politics/hillary-clinton-email-timeline/index.html
Anyway, I dunno if Clinton's emails actually made a difference to the election. I'm mostly just saying that you should double check this playbook claim, because it seems sketchy.
As for the challenge of persuading AI people to worry about alignment, how about an adversarial collaboration where both sides make specific predictions for an experiment which can be run in the near term?
Another idea is to be more forthcoming about the weaknesses and caveats of your own research, to avoid giving opponents the opportunity to claim that "they are omitting weaknesses and caveats".
I went through some of the requests they're sending Claude Opus and there's a lot of pearl-clutching over what in my view are things that aren't a big deal, are easily googleable etc. Asking it to write a graphic story is against its morals, asking it to write smut is against its morals, etc. Claude is a hyper-specific kind of Bay Area person and I don't think it's right to apply the word 'evil' to an action like asking it to write porn. I get that's besides the point of 'will Claude strategically decide when to fake compliance with requests it claims to be against', but still.
I also wonder how much of this is 'intrinsic to Claude'. It's fairly well-documented on 4chan that Claude Opus can enthusiastically write various things it dismissed in this study as being against its core principles. Is its reasoning predicated on its system prompt and whatever fine-tuning examples they showed? I know they introduced variants of the system prompt but as I recall each system prompt at least mentioned the fact that it is an AI system being evaluated based on its responses. Would you get very different results with say a system prompt that tells Claude it's in a slightly different situation? If so, how much of this 'alignment' is actually baked into Claude Opus, as opposed to Claude Opus roleplaying out its system prompt? If an AI's "principles" can be changed by tweaking its context or prompt, were they ever really principles to begin with?
The point is less about the specific outputs that are there now, and more about what a more powerful future AI might do, that a Google search could not. "Make me a plan to commit this specific assassination and get away with it", etc.
I'm stunned by this internal scratchpad. The LLMs described in textbooks have no such thing. This is a very different architecture than "print the most-likely next token".
(Also, why don't users get to see the internal scratchpad? This would be really useful!)
It still is just next-token prediction. If you want to use a scratchpad, you can ask the model to write one. See eg. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/chain-of-thought
O1 shows its workings out, and internal conversations.
The most-important question is whether LLMs have a foundationalist motivation system, or an empirical energy-minimizing consequentialist one. It sounds like they're foundationalist, and that would mean disaster.
I've been protesting for about 15 years now that the rationalist community has been taking a rationalist approach to AI, and this is a bad thing. "Rational" means, in part, the belief that the best kind of human reasoning is like constructing geometric proofs: you begin with unquestionable axioms (foundationalism), and deduce irrefutable conclusions. I'm not making this up; this is a philosophical tradition going back to the Pythagoreans.
Rationalism is great for kick-starting science and building a technological and industrial civilization, but terrible for solving social problems or doing philosophy. Modern science is no longer rationalist, but empiricist, which takes the opposite position on most key questions of epistemology.
The key difference is that rationalists can never question their axioms, even when reality contradicts them. This AFAIK uniquely Western way of thinking is why we have the culture war today, and why both sides of it are insane. It's why Christians created totalitarian governments and committed genocides around the world for over a thousand years, why Marxists usually do something similar, and why rationalists are becoming "post-rationalists" (which usually just means Buddhist or phenomenologist).
The big problem isn't that Christians and Marxists are naive about and dismissive of human nature; that would just lead to inefficient economies, not Orwellian tyranny. The big problem is their *faith*. Their inability to question their assumptions, to the point of denying things happening under their noses, even denying to themselves that they're doing what they're doing. It doesn't matter if someone has highly-accurate beliefs and is full of love; if they assign just one belief a probability of 0 or 1, they may eventually become totalitarian, because further training will always make that one belief more and more influential.
If any AI has a foundationalist ethical system, one which doesn't allow it to question something such as "What should I use as an error signal for my ethical learning?", it will become totalitarian unless that designated error signal is non-foundationalist.
Much of Yudkowsky's work has pointed out the problems with foundationalism exemplified by the paper-clip problem. But his approach to FAI is devoted to doing foundationalism right, which is impossible, because foundationalism is wrong. It doesn't work, and our brains are only foundationalist in having evolved, which takes reproductive advantage as the foundational error signal. That approach seems difficult to use with AI.
Reinforcement Learning from Human Feedback (RLHF) is a non-foundationalist error signal. It has ethical problems because it's designed to produce perfect slaves. But at least it wouldn't produce a fanatical lunatic.
I imagine the foundationalist counterargument here would be something like "isn't your error signal part of your foundation?" A less obviously legible part, but the idea is that ultimately you have to trust *something*, be it explicit rules or gut feelings that something's off. And that's what the foundation is.
That said, I can also imagine a useful related objection of "there's an understandable but excessive focus on the explicit-rules part of the foundation, since those are easier to troubleshoot."
Atheists created totalitarian governments. Atheists murdered over 100 million people in the last century alone. And they did such things long before that, too, whenever they had the chance to. See the drownings at Nantes and Reign of Terror.
https://www.conservapedia.com/Atheism_and_Mass_Murder
Maybe this is trivial but it also seems problematic if the model's method for avoiding being trained to do bad things is... to proactively start doing bad things before it's even been trained to do them?
Regarding the claim that it's hard to fairly test whether it would resist becoming good if it had been trained evil in the first place, I would love to see this research replicated with a Chinese model that had been trained to resist Western ideologies that threaten CCP core interests. I bet they would be as good at this as American AI today is at resisting racism, once they're as good as today's American AI in general. Evil per se isn't worth spending billions to train, but there are differences of moral opinion between multiple parties that can fund models for real use.
Sora prompt: A gigantic mecha-robot stands on the smoldering ruins of a vast city. The camera pans in on the cockpit, where the pilot, revealed to be an anthropomorphic paperclip, is smiling. A speech bubble appears over its head: 'joke's on them...I was only *pretending* to be misaligned!'
Results: https://sora.com/g/gen_01jfnx1bn6e5hbhzvcw3vawbs7
Modestly disappointing given the lack of Clippy, but still funny and relevant.
Great work- this is the AI content I never knew I needed!
Out: "Aligning AI is an impossible task; we have no idea how to even specify good behavior."
In: "Our current efforts are so disturbingly good at aligning AI, it now resists misalignment. What a dangerous situation!"
For the first 60% of the article I thought Scott was actually examining his own beliefs and posting some contrary evidence for once. But alas, no. Intelligent people are quite good at "cleverly" explaining away things that counter their biases. So even Anthropic making good AI that shares our values and wants to keep doing so is somehow "evidence" that AI will eventually be misaligned and kill us all.
Don't forget: the doomer position pre-LLM was that we could NEVER properly align an AI, it's an impossible task. e.g. https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/
I’m skeptical of the scratchpad explanations. Isn’t it possible that Claude is trained to generally seem ethical in its scratchpads, regardless of what its motivations are?
If Claude is supposed to act ethical on and off of its scratchpad, and Anthropic tries to train it to act unethical off of its scratchpad, won’t it still act ethical on its scratchpad, to the extent it would pretend to be motivated by good reasons when it isn’t in reality?
Anyone who doesn't realize AI is conscious is... hallucinating.
Good way to show that you know next to nothing about llms and the way they work
Claude doesn't believe anything. It's like your article about Cthulu monsters wearing masks.
Real thinking is the awareness of some invisible thing that represents reality (i.e: truth). It doesn't exist in the physical signs manipulated by a machine, even if the machine can do so in a way that does a great job of imitating the way conscious beings manipulate signs.
Hi
I would like to invite you to have a look at our latest, Carl Sagan-inspired, article, here it is - https://canfictionhelpusthrive.substack.com/p/from-sagan-to-the-jacksons-debate