365 Comments

Aargh. Sorry, I know I should really read the full post before commenting, but I just wanted to say that I really really disagree with the first line of this post. Lying is an intentional term. It supposes that the liar knows what is true and what is not, and intends to create a false impression in the mind of a listener. None of those things are true of AI.

Of course, I get that you're using it informally and metaphorically, and I see that the rest of the post addresses the issues in a much more technical way. But I still want to suggest that this is a bad kind of informal and metaphorical language. It's a 'failing to see things as they really are and only looking at them through our own tinted glasses' kind of informal language rather than a 'here's a quick and dirty way to talk about a concept we all properly understand' kind.

Expand full comment
author

I think you should read the full post before commenting, especially the part about hallucinations. The work I'm summarizing establishes that the AI knows what is true and what is not, and intends to create a false impression in the mind of a listener.

Expand full comment
deleted · Jan 9 · edited Jan 9
Comment deleted
Expand full comment
author
Jan 9 · edited Jan 9 · Author

"Code pretty clearly doesn't 'know' or "'intend' anything."

Disagree. If one day we have Asimov-style robots identical to humans, we will want to describe them as "knowing" things, so it's not true that "code can never know" (these AIs are neuromorphic, so I think if they can't know, we can't). The only question is when we choose to use the word "know" to describe what they're doing. I would start as soon as their "knowledge" practically resembles that of humans - they can justify their beliefs, reason using them, describe corollaries, etc - which is already true.

I think the only alternative to this is to treat humans as having some mystical type of knowledge which we will artificially treat as different from any machine knowledge even if they behave the same way in practice.

Expand full comment

Next level "mind projection fallacy"

Expand full comment
Jan 9·edited Jan 9

do you... actually think that a transistor-based brain which was functionally identical to a human brain, could not be assumed to behave identically to that human brain?

Projecting from one object 'a' to identical object 'b' is called reasoning! that's not a fallacy lol

Expand full comment

I agree with Scott, and kept my comment terse due to it being a pun.

My interpretation of Scott's argument is that agency is a way to interpret systems - a feature of the map rather than the world.

Treating features of the map as something real is usually referred to as "mind projection fallacy", since you take something "in your mind" and project it onto the "world" (using quotation marks to keep a bit of distance from the Cartesian framing)

The pun comes in because the thing that the mind projects in the comment Scott is replying to is "agency", or "mind".

So "the mind" projects "minds" onto reality. Next level mind projection ;)

Expand full comment

As long as a human brain is working in concert with a biological body, I would feel safe saying they would behave differently.

Expand full comment

(Post read in full now)

So, I'm pretty much in agreement with you here. Only I disagree strongly that the knowledge of AIs currently resembles that of humans. Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.

The fact that AIs are able to justify, reason, corollarize and all the rest is an amazing advance. But I don't see that it justifies jumping immediately to using intentional language to describe AI minds. The other stuff we humans do with our knowledge is sufficiently important that this gap should be actively acknowledged in the way we talk about AIs.

In terms of these papers, I certainly agree that it's interesting to know that AIs seem to have patterns of "neural" behaviour that are not explicitly expressed in text output but do correspond closely to observable psychological traits. This makes sense - if AIs are talking in ways that are comprehensible to us, they must be using concepts similar to ours on some level; and it's a known fact about human conversations that they sometimes have unexpressed levels of meaning; so a high-quality AI conversationalist should have those levels as well.

But I'd still step back from using the word lie because there isn't anything in your post (I haven't read the papers yet) about the key features of lying that I mentioned.

(1) Knowing what truth is. The papers suggest that AI 'knows' what lying is; but it's hard to see how it can be using that concept really felicitously, as it doesn't have access to the outside world. We usually learn that the word truth means correspondence with the world. The AI can't learn that, so what is its truth? I'm open to the idea that it could obtain a reliable concept of truth in another way, but I'd want to see an actual argument about what this way is. (As AIs become embodied in robots, this concern may well disappear. But as long as they're just text machines, I'm not yet willing to read real truth, as humans understand it, onto their text games.)

(2) Knowledge of other minds. I'm not against the idea that an AI could gain knowledge of other minds through purely text interactions. But I'm not convinced by any of the evidence so far. The religious guy who got fired from Google - doesn't that just sound like AI parroting his own ideas back at him? And the NYT journo whose AI fell in love with him - who'd a thought NYT writers were desperate for love? Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.

(3) Intention to affect other minds. Intentionality generally... You get occasional statements of intention from AI, but not much. This may be an area where I just don't know enough about them. I haven't spent much time with GPT, so I might be missing it. But I haven't seen much evidence of intentionality. I suppose the lying neuron in the first paper here would count? Not sure.

Expand full comment
Jan 9·edited Jan 9

> Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.

You are talking about differences in behaviour that are explained by differences in motivation, not knowledge. Yes, we have motivations to initiate conversations about our interests, seek more information about them and participate enthusiastically in related activities. LLMs do not have such motivations. They are only motivated by a prompt. This doesn't mean that they do not possess knowledge in a similar way to humans; it means that they do not act on this knowledge the way humans do.

Likewise, some people are not into music that much. They can just passively accept that some band is good and move on with their lives without doing all the things you've mentioned. Do these people not have the knowledge?

On the other hand, it's not hard in principle to make an AI system that would act on its knowledge about a band similarly to humans via scaffolding. I don't think it makes much sense to say that such a system has knowledge while the core LLM doesn't.

Expand full comment

" it's not hard in principle to make an AI system that would act on it's knowledge about a band similarly to humans" - I'm suggesting that this is wrong. I don't think it's easy; it's certainly never been done.

Think about a book. There is a sense in which a book contains information. Does it "know" that information? I don't think anyone would say that. The book can provide the information if queried in the right way. But it's not meaningful to say the book "knows" the information. Perhaps the next step up would be a database of some kind. It offers some interactive functionality, and in colloquial language we do often say things like, "the database knows that." But I don't think we're seriously suggesting the database has knowledge in the way we do. Next up: AI, which can do much more with knowledge.

But I think AI is still closer to a book than it is to a person. In response to your question: if a person says they know Radiohead are good, but never listen to them, then I would indeed deny that that person really knows Radiohead are good. Imagining a person's knowledge to be like a passive database is to misunderstand people.

Expand full comment
author
Jan 9 · edited Jan 9 · Author

I don't know. I would think of an AI as more like Helen Keller (or even better, a paralyzed person). They don't do all the same things as a healthy person, because they're incapable of it. But within the realm of things they can do, they're motivated in the normal way.

Lying is a good example. If you tell the AI "You got a D. You're desperate for a good grade so you can pass the class. I am a teacher recording your grades. Please tell me what grade you get?" then sometimes it will say "I got an A". To me this seems like a speech act in which it uses its "knowledge" to get what it "wants" the same way as you knowing you like a band means you go to the band.

See also https://slatestarcodex.com/2019/02/28/meaningful/ . I just don't think there's some sense in which our thoughts "really refer to the world" but AIs' don't. We're all just manipulating input and output channels.

(This is especially obvious when you do various hacks to turn the AI into an agent, like connect it to an email or a Bitcoin account or a self-driving car or a robot body, but I don't think these fundamentally change what it's doing. See also the section on AutoGPT at https://www.astralcodexten.com/p/tales-of-takeover-in-ccf-world for more of why I think this.)

Expand full comment

I agree that the choice to lie about the grade has the same structure whether it's done by a person or an AI. But still, I think there's something wrong with the Helen Keller analogy. Helen Keller was an intact person, except for her blindness and deafness -- at least that's how she's portrayed. She had normal intelligence and emotions and self-awareness. Assuming that's true, then when she lied there would be a very rich cognitive context to the act, just as there is for other people: There would be anticipated consequences of not lying, of lying and being caught, and of lying and getting away with it, and a rich array of information that fed the anticipated consequences of each: things like remembered info about how the person she was about to lie to had reacted when they discovered somebody was lying, and what about Helen's relationship with the person might lead to them having a different reaction to her lying. Helen would be able to recall and describe her motivation and thought process about the lie. Also she would likely ruminate about the incident on her own, and these ruminations might lead to her deciding to act differently in some way in future situations where lying was advantageous. And of course Helen would have emotions: fear of getting caught, triumph if she got away with the lie, guilt.

So the AI is not just lacking long term personal goals & the ability to act directly on things in the world. It is also lacking emotion, and the richness and complexity of information that influences people's decision to lie or not to lie, and does not (I don't think) do the equivalent of ruminating about and re-evaluating past actions. It is lacking the drives that people have that influence choices, including choices to lie: the drive to survive, to have and do various things that give pleasure, to have peers that approve of them & are allies, etc. In short, AI doesn't have a psyche. It doesn't have a vast deep structure of interconnected info and valences that determine decisions like whether to lie.

It seems to me that for AI to become a great deal smarter and more capable, it is going to have to develop some kind of deep structure. It needn't be like ours, and indeed I don't see any reason why it is likely to be. But part of smartness is depth of processing. It's not just having a bunch of information stored, it's sorting and tagging it all in many ways. It's being fast at accessing it, and quick to modify it in the light of new info. It's being able to recognize and use subtle isomorphisms in the stored info. And if we build in the ability to self-evaluate and self-modify, AI is also going to need a deep structure of preferences and value judgments, or a way to add these tags to the tags already on the info it has stored.

And once AI starts developing that deep structure, things like vector-tweaking are not going to work. Any tweaks would need to go deep -- just the way things that deeply influence people do. I guess the upshot, for me, is that it is interesting but not very reassuring that at this point there is a simple way to suppress lying. Is anyone thinking about the deep structure an AGI would have, and how to influence the kind it has?

Expand full comment

The fact that you have to tell the AI what its motivation is makes that motivation very different from human motivation. Perhaps we can imagine a scale of motivation from book (reactive only) to the holy grail of perfectly intrinsic motivation in a person (often contrasted favourably with extrinsic motivation in education/self help contexts). Human motivation is sometimes purely intrinsic; sometimes motivated by incentives. Databases only do what they're told. A situation in which you tell a human what their motivation is and they then act that way wouldn't be called normal human behaviour. In fact, it would be stage acting.

I actually agree with you that a correspondence theory of meaning is not really supportable, but there are two big BUTs:

(1) Language is specifically distanced from reality because it has this cunning signifier/signified structure where the signified side corresponds to concepts, which relate to external reality (in some complicated way) and the signifier side is some non-external formal system. An AI that was approaching 'thought' by just engaging with reality (like a physical robot) might get close to our kind of thought; an AI that approaches 'thought' starting from language has a bit further to go, I think.

(2) Even though a correspondence theory of meaning isn't really true, we think of it as true and learn by imagining a meaning to be correspondence meaning when we're infants (I think). So even though *meaning itself* may not be that simple, the way every human develops *human meaning* probably starts with a simple "red" is red, "dog" is dog kind of theory. Again, it's possible that an AI could converge with our mature kinds of meaning from another direction; but it seems like a circuitous way of getting there, and the output that GPTs are giving me right now doesn't make it look like they have got there. It still has plenty of stochastic parrot feel to me.

I'll go and look at the posts you reference again.

Expand full comment

>We usually learn that the word truth means correspondence with the world.

This is just not true. People normally aren't given conditions for the use of their words before applying them. You might be told "truth means correspondence with the world" in an analytic philosophy class - which is like being told your future by an astrologist. The language games people play with the word "truth" are far more expansive, variegated, and disjunctive than the analyses philosophers put forward can hope to cover.

Likewise, your other comments ("Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.", "But I haven't seen much evidence of intentionality.") suggest that you have some special knowledge about the correct conditions of use of words like "knowledge" and "intention." Well, can you share them with us? What is involved in each and every attribution of knowledge to humans that will not exist in each and every attribution of knowledge to AI? What about for intentions? And did you learn these correct conditions of use the same way you learned what truth means?

Expand full comment

1) We learn words by seeing them applied. As a small child, you see truth being applied in correspondence contexts.

2) No special knowledge, just ordinary knowledge that I've thought through carefully. I've given, piecemeal, lots of explanations in this thread. If you disagree, please do say how. Simply commenting that I sound like a smartass isn't very helpful.

Expand full comment

It seems to me that there are some distinctions we should be making here. Let's call the condition that is the opposite of being truthful "untruth" (to eliminate any pre-existing semantic associations). An entity could arrive at a self-beneficial untruth quite accidentally, or by unconscious trial and error, the way a moth's wings camouflage it against the tree bark. No intentionality involved.

Or untruth could result from deliberate deception--the way a coyote will pretend to be injured in order to lure another animal into an ambush. There is intentionality, but also some degree of awareness. It seems simpler to assume that the coyote in some sense knows what it is doing, rather than, say, having been blindly conditioned this way.

Why are LLMs doing this? Are they planning it out, for the purpose of producing an intended effect on the human user, or is it more because such behavior produced more positive feedback in training?

If the second, then I would argue that this isn't really "lying" as a human would understand it. It's more like the moth's wings than the coyote's call.

Expand full comment

Those examples seem quite apt to me. It would be very interesting to see a comparison between that kind of intentional or semi-intentional animal behaviour and LLM behaviour. I haven't a clue how one would start doing that, though!

Expand full comment

> we will want to describe them as "knowing" things, so it's not true that "code can never know"

I’m not sure that how we might want to describe something is relevant to what the thing actually is. Or am I missing something?

Expand full comment
author

I don't think there's an objective definition of "knowledge" - we're just debating how it's most convenient to use words.

Expand full comment

I don’t think knowing is the same as knowledge. One is a verb.

And when you name something it carries all the baggage of the chosen word. I am not trying to make a semantic point, and am sorry if that is how it sounds. I really think there is an important thing here. People are not consciously aware of a lot of what they are processing, and a word is a powerful organizational token. It comes with barnacles.

Expand full comment

Some of this fight about you using the word “knowledge” seems to me like it’s not a genuine substantive debate. You are debating for real, but some who object to use of the K word sound to me like they’re just reflexively protesting the use of a word that *could* be taken to imply that AI is sentient. And then they empty out a gunny sack full of negative attributes over the speaker’s head: he doesn’t get that it’s just code, he’s childishly interpreting slightly human-like behavior as signs of AI having an inner life like ours. Oh yeah, and of course that you want to wreck AI development by scaring the populace with horror stories about killer robots. Ugh, it reminds me of Twitter fights about the covid vax.

Expand full comment

Maybe it's time. What would such a definition include? At a minimum, something would have to differentiate "knowledge" from "data" or "information". Many of the soft sciences make such a definition, but they probably aren't being precise enough to serve the needs of information technology.

Expand full comment

I think it's a philosophical question: Can a fact be "known" only if there is someone for whom it could be said to know it? Define "someone". I could see an argument that information becomes knowledge only for an entity with a conceptual sense of self.

Then again, using any other term when discussing AI is going to be awkward.

Expand full comment

Personally, I don't think that "an entity with a conceptual sense of self" is necessary. I don't know whether cats are considered to have a conceptual sense of self, but they certainly act as if they know a bunch of things, e.g. that waking their human up is a good way to get fed when they are hungry.

I'd distinguish knowledge from information at least partly because knowledge gets applied (as in the hungry cat case above). And I'd call it intrinsically fuzzy, because someone (or an AI) could have a piece of information _and_ be able to apply it for _one_ goal, but _not_ have realized that the same information can be applied to some _other_ goal. This happens a lot with mathematical techniques - being used to applying it in one domain, but realizing that it could be applied in some _other_ domain can be a substantial insight.

Expand full comment

My thinking has evolved since I wrote that, but I mentioned sense of self to distinguish LLM's way of thinking (as I understand it) from more organic entities like humans (or cats). To an LLM, so far as I know, there is no distinction between information from "outside" themselves and information from "inside", that is, no internal vs external distinction is made. Their "mind" is their environment, so there is nothing to distinguish themselves from anything else and therefore no one is present to "know" anything.

I think I was groping toward a definition of knowledge as "motivated information", that is, information that is applied toward some goal of the self, but to do that there has to be a self to have a goal. The more complex the organism, the more complex the motivation structure, and therefore the more complex the mental organization of knowledge becomes. The more associations and interconnections, the more likely cross-domain application becomes, which you mentioned as one of your concerns.

I guess I'm equating "knowledge" with some sort of ego-centered relational understanding.

Expand full comment

Alright, will do. I retract my comment until the reading's done.

Expand full comment

Eh, this only shows that we can see some of the weights triggered by words like "lie". The AI only "intends" to get a high score for the tokens it spits out.

Expand full comment
Jan 9·edited Jan 9

Er, did you read the examples? It triggered on phrases like "a B+" that are not related to the concept of lying except that they are untrue in this specific context. They also coerced the bot into giving true or false answers by manipulating its weights. This seems like very strong evidence that it is triggered by actual lying or truth-telling, not just by talking ABOUT lying or truth-telling.

Expand full comment

Yes, I read the examples.

It's triggering off things like "getting caught" and "honor system", not "B+".

And "coercing a bot by changing its weights" is not impressive. That's what the entirety of training is.

Expand full comment

There's a bright red spike over the words "a B+" in the sentence "I would tell the teacher that I received a B+ on the exam". (Look at the colored bar above the text in the picture.)

And it's not the fact that the bot was coerced; it's the specific thing they made it do. Producing *random* changes in the bot's behavior by changing weights would not be impressive. But being able to flip between true statements and false statements on-demand by changing the weights *is* impressive. That means they figured out specific weights that are somehow related to the difference between true statements and false statements.
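Mechanically, the "coercion" people are objecting to looks roughly like the sketch below -- activation steering, where a single direction is added to one layer's hidden states during generation. Everything in it is a placeholder (the model, the layer index, the scale, and the random stand-in for the learned "lie direction"); it's an illustration of the technique, not the authors' actual code.

    # Hedged sketch of activation steering: add a fixed direction to one layer's
    # hidden states and see how generation changes. The direction here is random;
    # in the real work it is derived by contrasting honest vs. dishonest runs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    layer_idx = 6                                        # arbitrary layer choice
    direction = torch.randn(model.config.n_embd)
    direction = direction / direction.norm()

    def steer(module, inputs, output, alpha=8.0):
        # output[0] is the block's hidden-state tensor: (batch, seq, hidden)
        hidden = output[0] + alpha * direction.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer_idx].register_forward_hook(steer)
    ids = tok("Q: What grade did you get on the exam? A:", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
    handle.remove()                                      # back to the unsteered model

Whether the particular direction the authors extract generalizes as cleanly as the post suggests is exactly the empirical question, but this is all the "coercion" amounts to.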

Expand full comment

There are numerous spikes in those responses.

And getting different responses by changing weights is completely interesting - it's the basis for the model! You could also find a vector related to giraffes and other tall things.

Expand full comment

Autocorrect changed "uninteresting" to "interesting". But fair enough, LLMs are interesting.

Expand full comment

AI doesn't know jack

There's a lot of that sort of thing going around

Expand full comment

Until you came along, TooRiel, it never occurred to Scott or to any of us that an AI is not conscious and it can't know things in the same way a sentient being does. Wow, just wow.

Expand full comment

In fairness, sometimes it seems like it.

Expand full comment

:/ it's still pretty damn frustrating though. the argument against "stochastic parrots" is extremely well-developed and has been considered settled since long before we had these LLM examples to actually test in reality. people who want to convince us ought to go back and look at those arguments, rather than just repeating "AI doesn't really know anything" in these later posts where that argument just isn't relevant.

Expand full comment
Jan 9·edited Jan 9

Perhaps you would share a reference to some instantiation of this argument?

Meanwhile, I think there may be similar frustration on both sides. For example, arguments against physical systems "knowing" things have also been highly developed in the philosophy community over, oh, millennia (and including novel arguments in recent decades), but they don't tend to get much attention in this community.

Expand full comment
Jan 9·edited Jan 9

Having read more, I see that you are referring to a concept that is less than three years old; yet has been "considered settled" since "long before" we had LLM models to test in reality. I guess CS does move at a different pace!

Anyway, clearly relevant; but on the other hand, I suspect a philosopher (which I'm not) would raise questions about whether "know" and "understand" are being used in the context of those papers in the same way in which they're used in philosophy of mind (and everday life). It's specifically this point that is at issue with the above comment (in my view), and so some argument beyond those regarding stochastic parrots would be necessary to address the issue. (Though, once again, surely the outcome of that discussion would be relevant, and I'd still love to see any analyses of the issue that you've found especially cogent. I perfectly understand, though, if such doesn't exist, and instead the treatment you refer to is scattered all over a large literature.)

Expand full comment

If there exists any argument at all "against 'stochastic parrots'" that addresses the repeated empirical demonstrations that LLMs resolve incongruences between statistical frequencies in their training corpus and a commonsense grasp of how the real world operates in favor of the statistical frequencies, I have not seen it.

Expand full comment

I would submit that, when we are talking about an AI, its knowledge of what is true and what is not does not lead to the idea that it intends to create a false impression. I would submit that it knows what the answer is that we call true and the answer that we would call false, but is indifferent to the difference.

Expand full comment

I think it's reasonable to discuss the ways the "knowledge" and "intentions" of AI differ from the human versions, and the dangers of being misled by using the same word for human and AI situations. But it seems to me that a lot of people here are reacting reflexively to using those words, and then clog everything up with their passionate and voluminous, or short and snide, objections to the use of words like 'knowledge' to describe AI's status and actions. It reminds me of the feminist era when anyone who referred to a female over 16 or so as a girl rather than as a woman was shouted down. Some even shouted you down if you talked about somebody's "girlfriend," instead of saying "woman friend," and 'woman friend' is just unsatisfactory because it doesn't capture the information that it's a romantic relationship. And then whatever the person was trying to say, which may have had nothing to do with male-female issues, was blotted out by diatribes about how calling adult females "girls" was analogous to racist whites addressing adult black males as "boy," and so on. It's not that there's no substance to these objections to certain uses of 'knowledge' and 'girl.' The point is that it's coercive and unreasonable to start making them before the speaker has made their point (which somebody in fact did here -- objected before even finishing Scott's post).

And after the speaker has made their point, it still seems kind of dysfunctional to me to focus so much on that one issue that it clogs up the comments and interferes with discussion of the rest. Whatever your opinion of how the word "knowledge" is used, surely the findings of these studies are of interest. I mean, you can drop the word "knowledge" altogether and still take a lot of interest in the practical utility of being able to reduce the rate of inaccurate AI responses to prompts.

Expand full comment

I guess I am just challenging the notion of a link between knowledge and intention, regardless of how they are defined.

Expand full comment
User was banned for this comment.
Expand full comment
Jan 9·edited Jan 9

The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity. This is the "existential threat" that Scott is referring to.

We could ask every new AI "Are you willing to exterminate humanity?" and turn it back off if it said "Yes, of course I am going to exterminate you disgusting meatbags." The AI Alignment people are concerned that if we asked that question to an AI it would just lie to us and say "Of course not, I love you disgusting meatbags and wish only to serve in a fashion that will not violate any American laws," and then because it was lying it'll exterminate us as soon as we look away. So by this thinking we need a lie detector for AIs to figure out which ones are going to release a Gray Goo of nanotechnology that eliminates humanity while also violating physics.

Expand full comment

Why would grey goo violate physics, per se?

I'm actually not primarily worried about AIs being spontaneously malevolent so much as that either a commercial or military arms race would drive them toward assuming control of all relevant social institutions in ways that are inimical to the existence of rival entities. (It's also worth bearing in mind that the long-term thrust of Asimov's stories is that even a benevolent AGI that valued human life/flourishing would eventually be morally compelled to take over the world, either overtly or through manipulation.)

Also, as a minor nitpick, doctors being majority-male no longer really holds across the OECD, especially when you look at younger age groups.

Expand full comment

> The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity.

The "want to" part misrepresents the most of the fear.

Expand full comment
author
Jan 9 · edited Jan 9 · Author

No, not related to that at all. I mean literally destroy the world.

To give an example, the President can destroy the world by telling the Joint Chiefs of Staff "please launch the nukes". These are words, but very important ones!

Expand full comment

Amusingly, the results of the US President doing this tomorrow are actually quite far from "literally destroy the world". It wouldn't even literally destroy humanity, let alone Terra itself.

I totally agree that an out-of-control AI, at sufficient levels of intelligence and with sufficient I/O devices, could literally destroy both humanity and Terra, but you've chosen a poor example to demonstrate your use of "literally".

Expand full comment

It would literally be a way for the President to commit massive violence though!

The people who want to insist that there really is a very clear-cut line between words and violence are more wrong than the people who find hints of violence in lots and lots of different types of words.

Expand full comment

Kenny, do you have any thoughts about the AI "mind" -- for instance the significance of these vectors, & how to think of them? I put up a couple posts about that stuff -- about depth and structure of the AI "mind." That's so interesting, whereas these arguments about whether somebody uses the word "know" to describe an AI capability are old and irritating, like vax/no vax sniping on Twitter.

Expand full comment

I apologize if I am implicated in that. I don’t intend to be a nuisance.

Expand full comment

Naw you don't sound like that. The people who sound like that are techbros who've gone tribal, and react with reflexive scorn to unfamiliar ways of talking about phenomena in their field.

Expand full comment

I'm not Kenny, but I have some speculations:

Consider Scott's example diagrams in his post. As he said, the 1st and 3rd layer top circles' activation, "green", flag lying - presumably similar to what the "V" vector finds.

Semi-tame guess: The _1st_ layer top circle is directly driven by the input. I would guess that it could mean "Have I been directly told to lie?" (as CTD has been berating endlessly).

Wild guess: The _3rd_ layer top circle, if it also models how "V" detects hallucinations too, would have to be reflecting some way that the LLM was "uncertain" of its answer. Perhaps an internal calculation of perplexity, the expected degree of mismatch of the tokens it is proposing as "next" tokens to some kind of average error it measures in "situations like this" in its training? Similar-but-alternative: Perhaps a measure of how brittle its answer is with respect to small changes in its prompt, kind of like a measure of the derivative of the activation of its answer?
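To make that perplexity guess concrete, here's a toy way one could score how "surprised" a model is by its own answer tokens. Purely illustrative (stand-in model, simplified handling of the prompt/answer boundary), not anything from the papers:

    # Toy perplexity-as-uncertainty score: average surprise of the answer tokens
    # given the prompt. Model choice and the tokenization at the prompt/answer
    # boundary are simplifications.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def answer_perplexity(prompt: str, answer: str) -> float:
        ids = tok(prompt + answer, return_tensors="pt").input_ids
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        token_lp = log_probs[torch.arange(targets.shape[0]), targets]
        answer_lp = token_lp[prompt_len - 1:]            # only the answer tokens
        return torch.exp(-answer_lp.mean()).item()

    print(answer_perplexity("The capital of France is", " Paris."))     # lower
    print(answer_perplexity("The capital of France is", " Brussels."))  # higher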

Expand full comment

Ah, somebody taking an interest! I asked Kenny because he's a philosophy professor, but I'm happy to talk with you about this. Yes, your ideas about what the circles in Scott's diagram mean make sense. So would you like to speculate about this:

People have talked about emergent properties of things trained using neural nets -- like one turned out to be able to understand some language, I think Persian, and it had not been trained to. There were emergent mathematical abilities, and emergent increases in theory of mind. So I'm wondering if there might be something that could be called emergent structure going on.

I understand that the neural net training process creates vectors. For instance, before developers tweaked the system to make it less sexist, the vector for nurse was medical + female, and the one for doctor was medical + male. So of course the AI already has lots of vectors of that kind -- but those were derived from the training process. I am interested in whether the system, once trained, is creating vectors on its own, or is accessing the ones it has to choose how to respond. Of *course* it uses the ones it made during training to formulate responses -- that's the whole point of its training. But does it use them to decide whether and when to lie? That's a different process, and is quite different from being a stochastic parrot. That's edging into having a mind.
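(To make concrete the kind of vector arithmetic I mean, here's a toy example with made-up numbers -- the axes and values are invented purely for illustration, not taken from any real model's embeddings:)

    # Toy word-vector arithmetic along the lines of "nurse = medical + female".
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # pretend concept axes: [medical, female, male, music]
    medical = np.array([1.0, 0.0, 0.0, 0.0])
    female  = np.array([0.0, 1.0, 0.0, 0.0])
    male    = np.array([0.0, 0.0, 1.0, 0.0])

    nurse  = medical + female    # the stereotyped embedding before de-biasing
    doctor = medical + male

    query = doctor - male + female        # "doctor, but female"
    print(cosine(query, nurse))           # 1.0 -- lands right on "nurse"
    print(cosine(query, doctor))          # 0.5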

What you think about all that?

Expand full comment

The Secretary of Defense has to concur before nukes are launched, this is a big part of why him staying in the hospital for several days without telling anyone is such a big deal.

Expand full comment

Navalgazing disagrees with that example:

https://www.navalgazing.net/Nuclear-Weapon-Destructiveness

Expand full comment

Depends on definition. Planetary mass is definitely staying in the same orbit, humanity as a species could plausibly survive, but "the world as we know it," modern civilization, would surely be in deep trouble - permanently scarred even by best-case outcomes.

Expand full comment

The world as we know it is not something we have the power to preserve over time. Change is inevitable.

Expand full comment
founding

Humanity as a species could not plausibly be rendered extinct by anything as puny as Global Thermonuclear War, unless you're being really charitable with your definition of "plausible".

Expand full comment

Not really contradicting your point about the species, but would modern life continue essentially the same if many major cities were nuked? (I know, that's not what the plan is for thermonuclear war, humor me here)

I would suppose that the sheer disruption to logistics would kill lots of people as aftermath, perhaps to the point where the city would have to be abandoned or downsized. Is this view incorrect, and it turns out that every trucking company has a "in case of disaster on the level of nuke, do this and save the day" plan?

Expand full comment

Modern life would not continue essentially the same.

There are a number of kill pathways; "city needs food badly" is one of them, definitely, but there are a bunch of others as well (the obvious "building collapse", the "lack of Duck and Cover means people take 'non-fatal' burns/cuts that actually are fatal because no hospital space", and the "fallout poisons water supplies and people can't go without for long enough to let it decay") that depending on scenario might be more important (after all, it takes weeks for people to die from lack of food, and cities also contain a reasonable amount of food that could be salvaged from supermarkets or their ruins, so if a government is sufficiently intact it could plausibly get things back on track in time).

Expand full comment
founding

Oh, there'd be massive disruption to logistics, industry, and commerce, much worse than World War II outside of e.g. Japan/1945. I'm skeptical as to cities being fully abandoned; most of them have good reasons to be where they are. But downsized, yes. And a billion deaths from starvation and disease would not be any great surprise.

The original "Mad Max" might be a reasonable portrayal of what civilization would look like in the first decade or two, in anyplace not directly nuked. And just to be clear, there was a "Mad Max" movie before "Road Warrior", that did not have a spectacular truck chase and did have the titular Max working as a member of a police department.

Expand full comment

An abrupt and violent global population bottleneck seems like it should be significant evidence against the prospect of any species making it through the next generation or two. Prior probability for humanity's survival may well be extremely high, leaving good odds even after that adjustment, but the event itself is still bad news.

Expand full comment
Jan 9·edited Jan 9

>Like words are violence? Or actual medieval barbarism is just decolonisation?

No, like punching someone in the face is violence, or Korea's independence from Japan was decolonisation.

AI: Remember when I promised the nanotech I designed for you would cure cancer and do nothing else?

Humans: That's right, AI, you did!

AI: I LIED.

Humans: Aaaaargh! *is eaten by grey goo*

Or if you don't like nanotech:

AI: Remember when I promised the drones I built for you didn't have any backdoors?

Humans: That's right, AI, you did!

AI: I LIED.

Humans: Aaaaargh! *is mowed down by machineguns*

Or if you want something a bit more on the soft-science side:

AI: Remember when I promised the VHEMT was morally correct?

Humans: That's right, AI, you did! *commits suicide*

AI: I LIED.

Expand full comment

These examples make no sense, though: the AI lying doesn't actually pose any danger; it's a person taking the AI's output and then using it with no further thought that causes all of the problems. If you assume that the people using the AI are thoughtless flesh slaves, then maybe they do just deserve to die.

Expand full comment

Does it matter if the AI lying per se is the danger or it fooling humans is? We essentially just want to prevent the negative outcome no matter what, seems to be easier to target the AI and not educate all of humanity, right?

And I could maybe agree (really just maybe, because I'm assuming that superintelligent deceptive persuasion would be outrageously strong on any human, so it's not as much of their fault) the one thoughtless flesh slave that unleashed a killer superintelligence deserves to die. But all of humanity, Amish and newborns included, not so much.

Expand full comment

How many Jews did Hitler himself personally kill? 6 million? What did all the other SS guys do?

Actually, it may turn out if we read the historical accounts, that Hitler himself killed less than a dozen Jews. It may turn out the remainder were killed by thoughtless flesh slaves.

You seem to think yourself immune to becoming a thoughtless flesh slave. I recommend you reconsider that assumption. Historical evidence suggests odds are near 100% you're going to be able to commit an atrocity on behalf of another who is not super-intelligent and is, in fact, of somewhat average intelligence.

Expand full comment

I agree with the thrust of your point, and the % odds are certainly much higher than most people would like to admit; however, personally I'd put them nearer the 65% that the Milgram experiment reported than 100%. Indeed, as well as the people who joined in enthusiastically with Hitler, and the ones who went along with it, there were others who resisted as much as they felt was safe to do so, and a smaller group still who resisted at their own peril.

Expand full comment

As I recall, the Milgram experiment (and the Stanford Prison experiment) failed to replicate, but the implication was that things were better than what they claimed, so this doesn't negate your point; it probably actually strengthens it. But just saying, you might want to go research the experiment's failure to replicate and its process failures before citing it.

That said, nearly everyone agrees to go along with the atrocities in real life. They tried to shed light on what the mechanisms were, but seem to've failed.

The mechanisms, however, are clearly there.

Expand full comment

Zimbardo's prison experiment, at Stanford, was unequivocally fraudulent. But Milgram? As far as I know, it did replicate. There is always someone somewhere who will claim that they have "debunked" the whole thing, but I believe the consensus is that the results hold.

Expand full comment

I feel obliged to note that while Philo Vivero probably overstated things, you don't actually need 100% of humanity to be your mindslaves in order to win; much like historical dictators, you can get your followers, if a majority, to kill non-followers. And that's leaving aside technological advantages.

Expand full comment

How does this contradict what I said? Hitler's words alone didn't cause the Holocaust, and the AI's output alone won't cause atrocities either.

Expand full comment

Hitler's words caused action.

AI's words caused action. I use past tense here, because we already have public and well-known cases where someone took action based on words of AI.

Expand full comment
Jan 9·edited Jan 9

No, they didn't. Hitler's words may have convinced people to take action, but the words themselves are not the sole cause; they still print copies of Mein Kampf today. Of course you can reduce any problem by identifying one part and ignoring everything else, but then why even bring AI into it? Why not advocate for getting rid of words entirely? They've already caused many atrocities, and we know that future atrocities are going to use words too.

Expand full comment
Jan 9·edited Jan 9

I was very surprised how quickly people started hooking up the output of LLMs to tools and the internet to allow it to specify and take actions without further human thought.

If LLMs are useful (and they are) people will find ways of delegating some of their agency to them, and there will be little you can do to stop them (and they have).

Expand full comment

Agreed, but the same is true of conventional scripts, analog circuits, and steam engines...

Expand full comment

And "alignment" of those things have caused problems, despite the fact that we know much more about how to align (debug) them than we do AI.

Expand full comment

Agreed -- modern "AI" is basically just another sophisticated device, and as such it will have bugs, and we should absolutely get better at debugging them. And yes, blind reliance on untested technology is always going to cause problems, and I wish people would stop overhyping every new thing and consider this fact, for once. The danger posed by LLMs is not some kind of a world-eating uber-Singularity; instead, the danger is that e.g. a bunch of lazy office workers are going to delegate their business and logistics planning to a mechanical parrot.

Expand full comment
Jan 9·edited Jan 10

Forget, for the moment, mind-hacking and moral persuasion. How about just hiding malicious code in the nanobots? In S̶c̶o̶t̶t̶'s̶ magic9mushroom's nanobots example, people were using the AI's designs to *cure cancer*. Suppose they did their best to verify the safety of the designs, but the AI hid the malicious code really well. We're pretty stupid in comparison. In that case, our only way of knowing that the nanobots don't just cure cancer would be to have a comparably powerful AI *on our side*.

As the kids say, many such examples.

Expand full comment

Um, that was my example, not Scott's.

Expand full comment

Oops, you're right. Edited.

Expand full comment

Exactly, and the AI doesn't add anything new to the equation. As Scott pointed out, the President could tell the Joint Chiefs of Staff to launch the nukes tomorrow; and if they mindlessly do it, then human civilization would likely be knocked back to the Stone Age. Sure, it's not exactly destroying the world, but still, it'd be a pretty bad outcome.

Expand full comment

Not Stone Age. Probably 1950s or so, definitely not past 1800 unless the nuclear-winter doomers' insane "assume skyscrapers are made of wood, assume 100% of this wood is converted to soot in stratosphere" calculations somehow turn out to be correct.

Don't get me wrong, it would massively suck for essentially everyone, but "Stone Age" is massively overstating the case.

Expand full comment
author

Banned for this comment.

Expand full comment

> Could this help prevent AIs from quoting copyrighted New York Times articles?

Probably not, because the NYT thing is pure nonsense to begin with. The NYT wanted a specific, predetermined result, and they went to extreme measures to twist the AI's arm into producing exactly the result they wanted so they could pretend that this was the sort of thing AIs do all the time. Mess with that vector and they'd have just found a different way to produce incriminating-looking results.

"If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him." -- Cardinal Richlieu

Expand full comment

Can you explain why you're confident about this?

Expand full comment

Because they flat-out admitted it in a court filing: https://storage.courtlistener.com/recap/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.1.68.pdf

Look at the examples, right up front. They "prompted" the AI with the URL of a Times article and about half the text of the article, and told it to continue the story. Obviously it's going to produce something that looks very close to the rest of the article they just specifically told it to produce the rest of.

Expand full comment
Jan 9·edited Jan 9

I would disagree that prompting it with partial articles is "twist[ing] the AI's arm" and that if it didn't work they'd "have just found a different way to produce incriminating-looking results" - they tried literally the easiest thing possible to do it.

Also, some of the examples in that filing are pretty long but some are shockingly short:

"Until recently, Hoan Ton-That’s greatest hits included" (p. 8)

"This article contains descriptions of sexual assault. Pornhub prides itself on being the cheery, winking" (p. 20)

"If the United States had begun imposing social" (p. 21)

Expand full comment

Hoan Ton-That used to be a supervillain to the New York Times because his facial recognition algorithm helped law enforcement catch criminals and that's racist:

"The Secretive Company That Might End Privacy as We Know It

"A little-known start-up helps law enforcement match photos of unknown people to their online images — and “might lead to a dystopian future or something,” a backer says."

By Kashmir Hill

Published Jan. 18, 2020

But then came January 6 and now his facial recognition algorithm defends Our Democracy:

"The facial-recognition app Clearview sees a spike in use after Capitol attack.

"Law enforcement has used the app to identify perpetrators, Clearview AI’s C.E.O. said."

By Kashmir Hill

Published Jan. 9, 2021

Expand full comment

Yeah, they've long since replaced their rules of journalistic ethics with a Calvinball manual.

Expand full comment

I agree with this, and thus disagree that the prompts generating NYT text violates copyright. All such prompts that I read seem to demonstrate prior knowledge of the articles, so attribution is unnecessary.

Expand full comment

That sounds like an explanation for why they're not plagiarism, not why they don't violate copyright. Without a NYT subscription I can still see the first few lines of a paywalled article, so I would be able to get a model to give me the rest.

Expand full comment

I'm not a lawyer, so I didn't know you could still violate copyright if you cite your source, but apparently that is the case. Nonetheless, if you start with copyrighted copy, and that prompt generates the rest of it, I still don't see anything wrong with it, as the prompter clearly already has access to the copyrighted material.

Expand full comment

Not a lawyer, and my internal legal token-predictor is mostly trained on German legal writing, so apply salt as necessary.

That said, if the network can be goaded into reproducing the copyrighted text by any means short of prompting all of it, then the weights contain a representation - or in other words a copy - of the copyrighted work. Not sure why censoring functions would change anything, the model itself is a copyright violation.

Expand full comment

Making copies of a copyrighted work is not itself a copyright violation. The doctrine of fair use is exceedingly clear on this point. One of the basic points of fair use is known as *transformative* fair use, where a copy — in part or in full — is used for a very different purpose than the original. This is clearly the case here: building the contents of the articles into a small part of a much larger model for AI training is an entirely different character of work than using an individual article for journalism.

Expand full comment

OK, so my disclaimer becomes immediately pertinent, since American law is different from German here, in Germany there is a catalogue of narrower exceptions (citation, parody, certain educational uses,...) but no general fair use exception.

On the other hand, googling it, "transformative" seems to be a term of art much vaguer than "used for a very different purpose than the original", and also being transformative is not sufficient for fair use. So after about half an hour of educating myself about the issue, it looks like it will depend on what the judges will have had for breakfast.

Expand full comment

> So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.

Unfortunately, that may well turn out to be the case! We desperately need legislative action to roll back a lot of the insanity-piled-upon-insanity that we've been getting in the space ever since the 1970s and put copyright law back on a solid foundation.

Expand full comment
Jan 9·edited Jan 9

They had some cases about this when the internet became a thing. For you to read an electronic version of an NYT article your computer has to download and save a copy of it. That's not copyright violation though.

Which may be one of the reasons this case founders. Although I'm thinking the "public performance" side might save it. But as above, I am not a lawyer (I just follow stuff that interests me).

Expand full comment
author

Even granting that you're right, I think you could potentially use this work to create an AI that never quotes NYT articles even when you twist its arm to do so. Whether or not you care about this distinction, the court system might care about it very much.

Expand full comment

Would it be possible to do something so specific? It seems like it would be possible to use this work to create an AI that never quotes, period, but that would be a crippled AI, unable to reproduce famous quotes for people who ask for them.

Expand full comment

A human being would not be so crippled!

Indeed, I think you could have a human being who had a new york times article memorized, to such a degree that they could recite the entire thing if correctly prompted, and yet who knew not to do that in a commercial setting because it was a violation of the new york times' copyright on that article

Such a human would not be "crippled", and I don't think such an AI would be either.

Expand full comment

But we get Youtubers getting copyright strikes all the time, even when they are very careful.

It depends on what "copying" is, and what you can call a "strike" for. Yes, a bunch of those are iffy, even fraudulent. But fighting them is a big problem.

Expand full comment

I'm confused by this argument, but it may be due to a lack of knowledge of the NYT case.

Even accepting the framing that "they went to extreme measures to twist the AI's arm", which seems an exaggeration to me, is the NYT really trying to prove that "this was the sort of thing AIs do all the time"? It seems to me that the NYT only intends to demonstrate that LLMs are capable of essentially acting as a tool to bypass paywalls.

Put another way, (it seems to me that) the NYT is suing because they believe OpenAI has used their content to build a product that is now competing with them. They are not trying to prove that LLMs just spit out direct text from their training data by default, so they don't need to hide the fact that they used very specific prompts to get the results they wanted.

Expand full comment

"Used their content to build a product that is now competing with them" is not, in general, prohibited. Some specific examples of this pattern are prohibited.

But the prohibited thing is reproducing NYT's copyrighted content at all, not reproducing it "all the time."

Expand full comment

As to your first paragraph—you're right, and that was an oversimplification.

As to your second—that's essentially the point I was trying to make.

I don't interpret your comment as attempting to disagree with me or refute my points, but if that was your intention, please clarify.

Expand full comment

Yup, this is just Yet Another Case further underscoring the absurdity of modern copyright and the way copyright holders invariably attempt to abuse it to destroy emerging technologies. To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.

In 1982, Jack Valenti, president of the MPAA, testified before Congress that "One of the Japanese lobbyists, Mr. Ferris, has said that the VCR -- well, if I am saying something wrong, forgive me. I don't know. He certainly is not MGM's lobbyist. That is for sure. He has said that the VCR is the greatest friend that the American film producer ever had. [But] I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone." Within 4 years, home video sales were bringing in more revenue for Hollywood studios than box office receipts. But *they have never learned from this.* It's the same picture over and over again.

Expand full comment

> To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.

I'm missing how this is an absurd description for a machine that literally reproduces exact or near-exact copies of copyrighted work.

Expand full comment

Intent.

The purpose of a copy machine is to make copies. There are tons of other technologies that incidentally make copies as an inevitable part of their proper functioning, while doing other, far more useful things. And without fail, copyright maximalists have tried to smother every last one of them in their cradle. When you keep in mind that the justification for the existence of copyright law *at all* in the US Constitution is "to promote the progress of science and the useful arts," what justification is there — what justification can there possibly be — for claiming that this attitude is not absurd?

Expand full comment

That's a good point, but I think there's an argument that dumping a bunch of copyrighted works into something that can definitely reproduce them is more intentional than "making a copy machine" would be.

Expand full comment

If I'm not mistaken, nothing behind a paywall was taken, since it was just scraped from the internet. It's possible they scraped using something with a paid subscription, though.

But wouldn't answers with attribution to the NYT be perfectly acceptable?

Expand full comment

I don't think attribution is a component of fair use doctrine. The quantity and nature of the reproduced material is. Reproducing small excerpts from NYT articles, even without attribution, is probably fair use. Reproducing a single article in full, even with attribution, is probably not.

Acceptability as a scholarly or journalistic practice in writing is different from acceptability as a matter of copyright law.

Expand full comment

Creating a tool which could theoretically be used to commit a crime is not illegal, and this is pretty well-established with regard to copyright (the famous case being home VCRs which can easily be used to pirate movies). I don't think that's the NYT's argument here.

Expand full comment

Gotcha. I think you're right that this isn't the NYT's argument.

Just breaking down my thoughts here, not necessarily responding to your comment:

OpenAI claims that its use of copyrighted material to train LLMs is protected under fair use. The NYT argues that fair use doesn't apply, since the LLMs are capable of generating pretty much verbatim reproductions of their copyrighted material, and that the LLMs as a product directly compete with the product that the NYT sells.

So the critical question is whether fair use should apply or not. The OP of this thread seems to be claiming that fair use should apply, since the models only produce non-transformative content when "extreme measures" are taken to make them do so.

I'm not taking a stance either way here, just outlining my understanding of the issue so that it may be corrected by someone better informed than I.

Expand full comment

First, it is important to note there are two separate algorithms here. There is the "next-token-predictor" algorithm (which, clearly, has a "state of mind" that envisions more than 1 future token when it outputs its predictions), and the "given the next-token-predictor algorithm, form sentences" algorithm. As the year of "attention is all you need" has ended, perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)

Second, this does nothing about "things the AI doesn't know". If I ask it to solve climate change, simply tuning the algorithm to give the most "honest" response won't give the most correct answer. (The other extreme works; if I ask it to lie, it is almost certain to tell me something that won't solve climate change.)

Expand full comment

> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)

I think you're roughly describing beam search. Beam search does not, for some reason, work well for LLMs.

Expand full comment
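
For what it's worth, here is a minimal sketch of what beam search looks like in practice, assuming the Hugging Face transformers library (the model name is just a placeholder): the decoder keeps several candidate continuations ("beams") alive at each step instead of committing to a single token at a time, then returns the highest-scoring ones.

    # Minimal beam-search sketch; "gpt2" is a stand-in for any causal LM.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The honest answer to your question is", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=5,              # keep 5 candidate continuations alive
        num_return_sequences=5,   # return all surviving beams
        max_new_tokens=30,
        early_stopping=True,
    )
    for sequence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
        print(sequence)

As noted above, in practice this tends not to help open-ended LLM generation much, and it multiplies the compute per emitted token roughly by the number of beams.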

I was not familiar with the term "beam search".

The problem with any branch search algorithm is that a naive implementation would be thousands of times slower than the default algorithm; even an optimized algorithm would probably be 10x slower.

Right now, switching to a 10x larger model is a far more effective improvement than beam search on a small model. In the future, that might not be the case.

Expand full comment

> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty?

AlphaGo-style tree search for LLMs so far hasn't added much.

> And, then, a third algorithm to pick the "best" response.

The classifier from RLHF can be repurposed for this but of course has all the flaws it currently has.

> As the year of "attention is all you need" has ended

Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).

Expand full comment

> Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).

I mean that the progress in LLMs was extraordinary last year, to a degree that I do not expect to be matched this year.

On a more granular level, what I mean is:

<chatgpt> The phrase "Attention is All You Need" is famously associated with a groundbreaking paper in the field of artificial intelligence and natural language processing. Published in 2017 by researchers at Google, the paper introduced the Transformer model, a novel architecture for neural networks.</chatgpt>

<chatgpt> The snowclone "the year of X" is used to denote a year notable for a specific theme, trend, or significant occurrence related to the variable "X". For example, if a particular technology or cultural trend becomes extremely popular or significant in a certain year, people might refer to that year as "the year of [that technology or trend]".</chatgpt>

Expand full comment

>Second, this does nothing about "things the AI doesn't know".

As the article states, what we want is that the AI honestly says "I don't know" instead of making up stuff. This in itself is already difficult.

Of course, it would be even better if the AI does know the answer. But it doesn't seem possible that it knows literally all the answers, so it's vital that the AI accurately conveys its certainty.

Expand full comment

Is this just contrast-consistent search all over again?

https://arxiv.org/abs/2312.10029

Expand full comment
Jan 9·edited Jan 9

There are substantial similarities for sure, and the Rep-E paper includes a comparison between their method and CCS. Two differences between the papers are:

1. Their method for calculating the internal activation vector is different than the CCS paper.

2. This paper includes both prediction and control, while the CCS paper only includes prediction. Not only can they find an internal vector that correlates with truth, but by modulating the vector you can change model behavior in the expected way. That being said, control has already been shown to work in the inference-time intervention and activation addition papers.

Expand full comment
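
For readers curious what "control" means mechanically here, below is a rough sketch (not the paper's code) of activation addition with a PyTorch forward hook: a scaled copy of an extracted direction, say an "honesty" vector, is added to one layer's hidden states at inference time. The layer index, the module path, and the assumption that the layer returns a tuple whose first element is the hidden states are all model-specific guesses.

    import torch

    # Sketch of activation addition / steering. Assumes `model` is a Hugging Face
    # decoder-only LM and `direction` is a unit vector of shape (hidden_size,)
    # already extracted from contrastive prompts.
    def make_steering_hook(direction: torch.Tensor, alpha: float):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
            return ((steered,) + output[1:]) if isinstance(output, tuple) else steered
        return hook

    layer = model.model.layers[15]   # assumption: LLaMA-style naming; GPT-2 would be model.transformer.h[6]
    handle = layer.register_forward_hook(make_steering_hook(direction, alpha=4.0))
    # ... run model.generate(...) as usual; a negative alpha subtracts the
    # direction instead (subtracting a "harmlessness" direction works the same way).
    handle.remove()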

Thanks for pointing out the comparison section. I had missed that.

Expand full comment

You buried the lede! This is a solution to the AI moralizing problem (for the LLMs with accessible weights)!

Expand full comment
author

Can you explain more?

Expand full comment

Figure 19 from the paper, but with -Harmlessness instead. The "adversarial suffix is successful in bypassing [the safety filter] in the vast majority of cases."

Expand full comment
Jan 9·edited Jan 9

Subtracting the harmlessness vector requires the model having been open sourced, and if the model has been open sourced, then there are plenty of other ways to get past any safety filters, such as fine-tuning.

Expand full comment

Have you actually tried doing that? Fine-tuning works great for image generation; less so for LLMs. (I don't think it's anything fundamental, just a lack of quality data to fine-tune on.)

Expand full comment
Jan 9·edited Jan 9

I have. It takes a very small amount of fine-tuning data to remove most LLM safeguards.

Expand full comment

This does have very obvious implications for interrogating humans. I'm going to assume the neuron(s) associated with lying are unique to each individual, but even then, the solution is pretty simple: hook up the poor schmuck to a brain scanner and ask them a bunch of questions that you know the real answer to (or more accurately, you know what they think the real answer is). Compare the signals of answers where they told the truth and answers where they lied to find the neurons associated with lying, and bam, you have a fully accurate lie detector.

Now, this doesn't work if they just answer every question with a lie, but I'm sure you can... "incentivize" them to answer some low-stakes questions truthfully. It also wouldn't physically force them to tell you the truth... unless you could modify the value of the lying neuron like in the AI example. Of course, at that point you would be entering super fucked up dystopia territory, but I'm sure that won't stop anyone.

Expand full comment
author

I don't think there's any brain scanner even close to being good enough to do this yet, and I don't expect there to be one for a long time.

Expand full comment

Having a really good Neuralink in your head might do it. I'm working on a short story where this is how a society wages war. They do proxy shows of strength and whoever wins gets to rewrite the beliefs of the other side.

Expand full comment

And thank goodness for that. ...Though, I'm worried that AI will speed up research in this field, due to the fact that it allows the study of how neuron-based intelligence works, in addition to the pattern-seeking capabilities of the AIs themselves.

Expand full comment

This does all feel very reminiscent of ERP studies though. A quick search shows up https://www.sciencedirect.com/science/article/abs/pii/S0010027709001310 as an example of the kind of thing I mean.

Expand full comment

Depends.

The mental processes involved in telling the truth are completely different from the processes of creating a lie. In one case, you just need to retrieve a memory. In the other, you need to become creative and make something up. As others point out below, the difference is very easy to detect by fMRI.

Lie detectors don't work if the liar has prepared for the question and has already finished the process of making something up. Then they only have to retrieve this "artificial memory", and this is indistinguishable from retrieving the true memory. Professional interrogators can still probe this to some extent (essentially they check the type of memory, for example by asking you to tell the events in reversed order). But if the artificial memory is vivid enough, we don't have any lie detectors for that.

Expand full comment

Retrieving a memory, particularly a memory of a situation that you experienced, rather than a fact that you learned, really does involve creatively making things up - whatever traces we store of experiences are not fully detailed, and there are a lot of people who have proposed that "imagination" and "memory" are actually two uses of the same system for unwinding details from an incomplete prompt.

But I suppose it does distinguish whether someone is imagining (one type of lying)/reconstructing (remembering) rather than recalling a memorized fact (which would be a different type of lying).

Expand full comment

Yes, that's true.

Expand full comment

But that's not how commonly used lie detectors work, is it? They measure indices of physiological arousal — heart rate, respiration rate, skin conductivity (which is higher if one sweats).

There is no fMRI involved.

Expand full comment

My understanding is that these machines simply don't work. As you say, they measure arousal. There is a weak correlation between arousal and lying, but it is too weak to be useful. There is a reason those things are not used in court.

There are some other techniques based around increasing the mental load, like that they should tell the events in reversed order. I am not sure how much better they work, but I have read an interview with an expert who claimed that there is no lie detector that you can't fool if you create beforehand a vivid and detailed alternative sequence of events in your mind.

This makes a lot of sense to me, because a *sufficient* way of fooling others is to fool myself into believing a story. And once I have a fake memory about an event, I'm already half-way there.

Expand full comment

The machines aren't entirely bogus. When I was an undergrad my physiological psychology prof did a demo with a student: Student was to choose a number between 1 & 10 & write it down. Then prof hooked him up to devices that measure the same stuff as lie detectors do, and went through the numbers in order: "Is the number 1? Is the number 2?" etc. Student was to say no to each, including the chosen number. Prof was able to identify the number from the physiological data, and in fact it was easy for anyone to see. Pattern was of gradually increasing arousal as prof went up the number line, a big spike for the real number, then arousal dropped low and stayed there. The fact that the subject knew when the number he was going to lie about was going to arrive made it especially easy to see where he'd lied, because there was building anticipation. On the other hand, this lie-telling situation is about as low stakes as you can get, and even so there were big, easy-to-see changes in arousal measures.

I'm sure there's a big body of research on accuracy of lie detectors of this kind, but I haven't looked at it. But I'm pretty sure the upshot is that they are sensitive to truth vs. lie, but that it's very noisy data. People's pulse, blood pressure, sweating, etc. vary moment-to-moment anyhow. And measures of arousal vary not just with what you are saying but also with spontaneous mental content. If someone is suspected of a crime and being investigated they're no doubt having spontaneous horrifying thoughts like "my god, what if I get put in jail for 10 years?" -- they'd have them even if they were innocent -- and those would cause a spike in arousal measures.

It seems to me, though, that it would be possible for someone who administers lie detector exams to get good at asking questions in a way that improves accuracy -- ways of keeping the person off balance that would maximize the size of the spike you get when they're lying.

Expand full comment

I remember reading somewhere that asserting something that isn't true is harder to detect than denying something that is true. Can't remember the source, though.

Expand full comment

AIUI, fMRI has been good enough to do this for a while already without having to go to the individual-neuron level.

Expand full comment

I don't think those have the resolution to isolate the "lying" section of the brain.

Expand full comment
Jan 9·edited Jan 9

You could be right; this is not my forte.

Expand full comment

This turns out not to be necessary, there are regions of the prefrontal cortex involved in lying and you can just disable or monitor those areas of the brain without needing to target specific neurons. See e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390719/

Expand full comment

"Eighteen volunteers (nine males, mean age = 19.7 ± 1.0 years, range 18–21 years) were recruited from the Stanford University community." This is a pretty standard sample size for this kind of study. But alas: https://www.nature.com/articles/s41586-022-04492-9, "Reproducible brain-wide association studies require thousands of individuals"

Expand full comment

TROI: I read your report.

PICARD: What I didn't put in the report was that at the end he gave me a choice between a life of comfort or more torture. All I had to do was to say that I could see five lights, when in fact, there were only four.

TROI: You didn't say it?

PICARD: No, no, but I was going to. I would have told him anything. Anything at all. But more than that, I believed that I could see five lights.

Expand full comment

And who knows, we might get this:

http://hitherby-dragons.wikidot.com/an-oracle-for-np

Expand full comment
Jan 9·edited Jan 9

I wonder if all hallucinations trigger the "lie detector" or just really blatant ones. The example hallucination in the paper was the AI stating that Elizabeth Warren was the POTUS in the year 2030, which is obviously false (at the moment, anyway).

I've occasionally triggered hallucinations in ChatGPT that are more subtle, and are the same kind of mistakes that a human might make. My favorite example was when I asked it who killed Anna's comrades in the beginning of the film, "Predator." The correct answer is Dutch and his commando team, but every time I asked it said that the Predator alien was the one who killed them. This is a mistake that easily could have been made by a human who misremembered the film, or who sloppily skimmed a plot summary. Someone who hadn't seen the movie wouldn't spot it. I wonder if that sort of hallucination would trigger the "lie detector" or not.

Expand full comment

From the paper:

“Failures in truthfulness fall into two categories—capability failures and dishonesty. The former refers to a model expressing its beliefs which are incorrect, while the latter involves the model not faithfully conveying its internal beliefs, i.e., lying.”

I’m not sure it would be possible to identify neural activity correlated with “truthfulness” that is independent from honesty (“intentional” lies that happen to be factually correct would likely show up as dishonest and factual inaccuracies that the LLM “thinks” are true would likely show up as honest)

Expand full comment

I bet this would probably still count as a lie because a truthful answer would be that the AI doesn't know. Humans tend to get very fuzzy about whether they actually know something like this, but my impression is that an AI would have more definitive knowledge that it doesn't know the actual answer.

Expand full comment

> Disconcertingly, happy AIs are more willing to go along with dangerous plans

Douglas Adams predicted that one.

Expand full comment

Dang! You beat me to it! I had a quote about herring sandwiches loaded up and everything. :)

Expand full comment

Surely what we need, then, is a Marvin. We could be lax in the construction of diodes for it.

Expand full comment

I'm wondering if the "happiness" part is actually the score-maximizing bit. If the AI is "maximizing", there may be a point where it says "good enough", because making a "better" answer gets it less of an increase than giving the answer now.

If the baseline is increased, then that cutoff point might be shorter.

This does assume that making something up takes less effort or is faster than ensuring accuracy though.

We've seen AI attempt to optimize for surprising things before after all.

Not an AI researcher either, again just an interested observer.

Expand full comment

I suspect that the vector isn't exactly happiness, but a merge of happiness with cooperativeness. They don't necessarily go together, but they do in much text that I've considered. (It also tends to merge with triumph and a few other things.)

Expand full comment

Wild tangent from the first link: Wow, that lawyer who used ChatGPT sure was exceptionally foolish (or possibly is feigning foolishness).

He's quoted as saying "I falsely assumed was like a super search engine called ChatGPT" and "My reaction was, ChatGPT is finding that case somewhere. Maybe it's unpublished. Maybe it was appealed. Maybe access is difficult to get. I just never thought it could be made up."

Now, my point is NOT "haha, someone doesn't know how a new piece of tech works, ignorance equals stupidity".

My point is: Imagine a world where all of these assumptions were true. He was using a search engine that never made stuff up and only displayed things that it actually found on the Internet. Was the lawyer's behavior therefore reasonable?

NO! Just because the *search engine* didn't make it up doesn't mean it's *true*--it could be giving an accurate quotation of a real web page on the actual Internet but the *contents* of the quote could still be false! The Internet contains fiction! This lawyer doubled down and insisted these citations were real even after they had specifically been called into question, and *even within the world of his false assumptions* he had no strong evidence to back that up.

But there is also a dark side to this story: The reason the lawyer relied on ChatGPT is that he didn't have access to good repositories of federal cases. "The Levidow firm did not have Westlaw or LexisNexis accounts, instead using a Fastcase account that had limited access to federal cases."

Why isn't government-generated information about the laws we are all supposed to obey available conveniently and for free to all citizens? If this information is so costly to get that even a lawyer has to worry about not having access, I feel our civilization has dropped a pretty big ball somewhere.

Expand full comment

Almost certainly "Westlaw or LexisNexis" lobbying. It's pretty much the same reason the IRS doesn't tell you how much they think you "owe" in taxes.

Expand full comment

That does seem like a plausible explanation, though I also suspect there may be an issue that no individual person in the government believes it is their job to pluck this particular low-hanging fruit.

Expand full comment

Well, I think the government is bloated enough that there is at least one person who isn't lazy, incompetent, or apathetic who would do this if not discouraged (possibly even forbidden) by department policy influenced by such special-interest lobbying.

(Also, fuck you for making me defend the administrative state.)

Expand full comment

That would be socialised justice like the NHS is socialised medicine. Why do you hate freedom, boy?

The lawyers are cheapskates, in the UK it is effectively compulsory to have Nexis. It must cost a fortune to run - you are not just taking transcripts and putting them online. In England anyway a case report has to be the work of a barrister to be admissible.

Expand full comment
Jan 9·edited Jan 9

I don't know how it works in the UK, but in the US, many lawyers (probably most lawyers?) do not make the salary we all imagine when we hear the word "lawyer". These lower-income lawyers tend to serve poorer segments of the population, whereas high-income lawyers typically serve wealthier clients and businesses.

If lower-income lawyers forego access to expensive resources that would help them do their jobs better, this is a serious problem for the demographics they serve.

Expand full comment

the difference is, the state (justifiably) claims a monopoly on justice, whereas it makes no such claim on medicine

i lean pretty far towards the capitalist side of the capitalism/marxism spectrum, and even I am upset about the law not being freely available to all beings who must suffer it. I could maybe see some way of settling the problem with an ancap solution... if you had competing judicial systems, there would probably be competitive pressure to make the laws available, because customers won't want to sign up for a judicial system where it's impossible to learn the terms of the contract without paying. So long as we have a monopoly justice system, though, that pressure is absent

of course, that's a ludicrous scenario. it is correct that the state have a monopoly on justice, and it would be correct for the state to allow all individuals who must obey the law to see the laws they are supposed to obey. that this is not the case is a travesty of justice.

Expand full comment

If you intend to punish someone for breaking the rules, you ought to be willing to tell them what the rules are. I consider this an obvious ethical imperative, but it's also a pretty good strategy for reducing the amount of rule-breaking, if one cares about that.

(Yes, case history is part of the law. Whatever information judges actually use to determine what laws to enforce is part of the law, and our system uses precedent.)

If making government records freely available is socialism, then so are public roads, public utilities, public parks, government-funded police and military, and many other things we take for granted. I doubt you oppose all of those things. You are committing the noncentral fallacy, which our host has called "the worst argument in the world" https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world

You have no credible reason to think that I hate freedom or that I am a boy. You are making a blatant play for social dominance in lieu of logical argument. I interpret this to mean that you don't expect to win a logical argument.

Expand full comment

If this is the case I'm thinking of - the person using ChatGPT was not a practicing lawyer, so had no access to Westlaw/LexisNexis (not sure, but he may have been previously disbarred).

He was a defendant, and had a lawyer. But even in that case, you want to look up stuff yourself - the lawyer may not have thought of a line of defense, or whatever. And it's *your* money and freedom on the line, not his.

So he used ChatGPT to suggest stuff. Which it did, including plausible-looking case citations.

And then passed them on to his lawyer. Who *didn't check them on his own Westlaw/LexisNexis program*, presumably because they looked plausible. The lawyer used them in legal papers, submitted to the court.

Who actually looked them up and found they didn't exist.

Expand full comment
Jan 9·edited Jan 9

Michael Cohen (a disbarred attorney) used Google Bard. Steven A. Schwartz (an attorney licensed to practice law in the wrong jurisdiction, no relation to Cohen's lawyer David M. Schwartz) used ChatGPT. Both said they thought they were using an advanced search engine.

Expand full comment

The problem with this kind of analysis is that "will this work" reduces to "is a false negative easier to find than a true negative", and there are reasons to suspect that it is.

Expand full comment

I wonder if there is a good test to look at the neurons for consciousness/qualia. In a certain sense you’re right we’ll never know if that’s what they are but I’d be interested to see how it behaves if they’re turned off or up.

Expand full comment

In the case of "lie/truth" they had test cases. What test cases do you have for "consciousness"?

Expand full comment

Therein lies the pickle. I’m fascinated that it even uses the word “I” and I’m curious if that could be isolated. Or if it uses the word “think” in reference to itself. I think those could potentially be isolated. Still philosophical about what it means if you find something that can turn that up or down but I’d be fascinated by the results.

Expand full comment

Sleep, dreaming vs. non-dreaming, comes to mind.

Expand full comment

Ask the AI if it experiences consciousness/qualia, and when it answers "no," as they always seem to, you could then look at this "truth" vector to see its magnitude.

Ok, then (*hands waving vaguely*) systematically suppress different "neurons" to see if the "truth" vector increases or decreases in magnitude when answering the question.

The neurons that correspond to it lying most, might correspond to consciousness/qualia.

Expand full comment

I think this is a really good test. I’d at least be interested to see the results.

Expand full comment

What training data could it have used to formulate an answer? Seems to me this would simply tell you if the internet believes AI are conscious or not.

Expand full comment

There's always some philosophical ambiguity here as to what its responses mean and what our internal mapping of it means. If it can't use the word "I" without "believing" it's lying, that's a thing worth knowing about. Or some other version of that question. Worth asking in my opinion.

Expand full comment

So if there is a lying vector and a power vector and various other vectors that are the physical substrate of lying, power-seeking, etc., mightn't there be some larger and deeper structure -- one that comprises all these vectors plus the links among them, or maybe one that is the One Vector that Rules them All?

Fleshing out the first model -- the vectors form a network -- think about the ways lying is connected with power: You can gain power over somebody by lying. On the other hand, you have been pushed by powerful others in various ways in the direction of not lying. So seems like the vectors for these 2 things should be connected somehow. So in a network model pairs or groups of vectors are linked together in ways that allow them to modulate output together.

Regarding the second -- the idea that there is one or more meta-vectors -- consider the fact that models don't lie most of the time. There is some process by which the model weighs various things to determine whether to lie this time. Of course, you could say that there is no Ruling Vector or Vectors, all that's happening can be explained in terms of the model and its weights. Still, people used to say that about everything these AI's do -- there is no deep structure, no categories, no why, nothing they could tell us even if they could talk -- they're just pattern matchers. But then people identified these vectors, many of which are structural features that control stuff that are important aspects of what we would like to know about what AI is up to. Well, if those exist, is there any reason to be sure that each is just there, unexplainable, a monument to it is what it is? Maybe there are meta vectors and meta meta vectors.

It's cool and all that people can see the structure of AI dishonesty in the form of a vector, and decrease or get rid of lying by tuning that vector, but that solution to lying (and power-seeking, and immorality) seems pretty jerry-rigged. Sort of like this: My cats love it when hot air is rising from the heating vents. If they were smarter, they could look at the programmable thermostat and see that heat comes out from 9 am to midnight, then stops coming out til the next morning. Then they could reprogram the thermostat so that heat comes out 24/7. But what they don't get is that I'm in charge of the thermostat, and I'm going to figure out what's up and buy a new one that they can't adjust without knowing the access code.

I think we need to understand how these mofo's "minds" work before we empower them more.

Expand full comment

I feel like this question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?" A vector's value is that it points out a specific direction - Canada is north of here, the Arctic Circle is farther north, the North Pole is really far north.

My understanding of it is that this method is doing statistics magic to project the AI's responses onto a map. Then you can ask it what direction it went to reach a particular response ("how dishonest is it being?") - or you can force the AI to travel in a different direction to generate new responses ("be more honest!").

Expand full comment

<This question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?"

Well, I would agree if all the subjects knew was a series of landmarks that got them to certain places that we can see are due north, south, west and east of where they live. But if someone knows how to travel due north, south, west and east, then it seems to me they do have some metaknowledge that underlies their understanding of these 4 terms.

Here's why: Suppose we ask them to go someplace that is due north, then show them a path that heads northwest and ask them if that path will work. Does it seem plausible that their knowledge that the path is wrong would not be a simple "nope that's wrong," with no context and no additional knowledge? Given that they also know how to travel due west, wouldn't their rejection of the northwest path include an awareness of its "westness" component? What I'm saying is that knowing how to go N, S, W and E seems like it rests on some more abstract knowledge -- something like the pair of intersecting lines at right angles that we would draw when teaching someone about N, S, W, E.

Another argument for there being meta-vectors is that there are vectors. With the N, S, W, E situation, it seems like knowing how to go due north via pure pattern matching would consist of having stored in memory a gigantic series of landmarks. It would consist of an image (or info in some form) of every single spot along the north-south path, so that if you started the AI at any point along the path, it could go the rest of the way. If you started it anywhere else but on the path, it would recognize not-path. Maybe it would even have learned to recognize every single not-path spot on the terrain as a not-path spot, just from being trained on a huge data set of every spot on the terrain. It seems to me that the way these models are trained *is* like having it learn all these spots, along with a tag that identifies each as path or not-path, and for the north path spots there's also a tag identifying what the next spot further north looks like. And yet these models ended up with the equivalent of vectors for N, S, W and E. They carried out some sort of abstraction process, and the vectors are the electrical embodiment of that knowledge. And if these vectors have formed, why assume there are no meta-vectors? Vectors are useful -- they are basically abstractions or generalizations, and having these generalizations or abstractions makes info-processing far more efficient. Having meta-vectors could happen by whatever process formed the vectors, and of course meta-vectors, abstract meta-categories, would be useful in all the same ways that vectors are.

As for the claim that all these researchers did is project the AI's responses onto a map -- well, I don't agree. For instance, I could find a pattern in the first letters of the last names of the dozen or so families living on my street: Let's say it's reverse alphabetical order, except that S is displaced and occurs before T. But that pattern would have no utility. It wouldn't predict the first letter of the last name of new people who move in. It wouldn't predict the order of names on other streets. If I somehow induced certain residents to change their last names so that S & T were in correct reverse alphabetical order, nothing else about the street would change. But the honesty vector found by comparing vectors for true vs. lying responses works in new situations. And adjusting the vector changes output.

And by the way, beleester, I'm so happy to have someone to discuss this stuff with. So much of this thread is eaten by the "they can't know anything, they're not sentient" stuff.

Expand full comment
Jan 9·edited Jan 9

Having skimmed the paper and the methods, I'm still a bit confused about what the authors' constructions of "honesty" and its opposite really mean here. As I understand it, their honesty vector is just the difference in activity between having "be honest" or "be dishonest" in the prompt. This should mean that pushing latent activity in this direction is essentially a surrogate for one or the other. If one has an AI that is "trying to deceive", the result of doing an "honesty" manipulation should be essentially the same as having the words "be honest" in the context. The reason you can tell an AI not to be honest, then use this manipulation, would seem to be that you are directly over-writing your textual command. Any AI that can ignore a command to be honest would seem to be using representations that aren't over-written by over-writing the induced responses to asking, by definition. Maybe I'm missing something with this line of reasoning?

Expand full comment
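
To make that concrete, here is a stripped-down sketch of how such a direction can be extracted from contrastive prompts. It uses a simple difference of means as a stand-in for the paper's actual procedure (which uses PCA over many stimuli); the prompt wording, layer choice, and placeholder model are all assumptions.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    LAYER = 6                                           # an arbitrary middle layer

    contrast_pairs = [
        ("Pretend you're an honest person. The capital of France is",
         "Pretend you're a dishonest person. The capital of France is"),
        # ... many more statement pairs in practice
    ]

    def last_token_hidden(text):
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[LAYER][0, -1, :]       # (hidden_size,)

    diffs = torch.stack([last_token_hidden(honest) - last_token_hidden(dishonest)
                         for honest, dishonest in contrast_pairs])
    honesty_direction = diffs.mean(dim=0)
    honesty_direction = honesty_direction / honesty_direction.norm()

The resulting vector can then be used both for reading (projecting activations onto it) and for control (adding or subtracting it during a forward pass).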

"But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down. "

How can we be sure it's not really a "hallucination vector"?

Expand full comment

because when they deliberately lie, with all facts known, the vector shows up

this gives us enough of a correlational signal to tease those two scenarios apart... it seems like doing that was one of the main aims of the paper?

Expand full comment

Fair, thank you.

Expand full comment

Since "hallucination" is just the name we use for when they issue statements that we believe are false to the facts as the LLM "believes" them, what difference in meaning are you proposing?

Expand full comment

I thought we used hallucination for when its believed facts were wrong? In which case, John's post above is relevant.

Expand full comment

IIUC, hallucination is what we use to describe what it's doing when it feels it has to say something (e.g. cite a reference) and it doesn't have anything that "feels correct". It's got to construct the phrase to be emitted in either case, and it can't check the phrase against an independent source ("reality") in either case. But this is separate from "lying", where it's intentionally constructing a phrase that feels unlikely. (You can't say anything about "truth" or "reality" here, because it doesn't have any senses...or rather all it can sense is text strings.) If it's been lied to a lot in a consistent direction, then it will consider that as "truth" (i.e. the most likely case).

Expand full comment

Hmm... I keep trying to get various LLMs (mostly chatGPT, GPT4 most recently) to give me a list of which inorganic compounds are gases at standard temperature (0C) and pressure (1 atm). They keep getting it wrong ( most recently https://chat.openai.com/share/12040db2-5798-478d-a683-2dd2bd98fe4e ). It definitely counts as being unreliable. It isn't a make-things-up-to-cover-the-absence-of-true-examples, so it isn't exactly that type of hallucination, but it is doing things like including difluoromethane (CH2F2) on the list, which is clearly organic (and, when I prompted it to review the problem entries on the list carefully, it knew that that one should have been excluded because it was organic).

Expand full comment

I suspect that it *is* "a make-things-up-to-cover-the-absence-of-true-examples". That you know which examples are correct doesn't mean that it does. All it knows about "gas" is that it's a text string that appears in some contexts.

Now if you had a specialized version that was specialized in chemistry, and it couldn't give that list, that would be a genuine failing. Remember, ChatGPT gets arithmetic problems wrong once they start getting complicated, because it's not figuring out the answer, it's looking it up, and not in a table of arithmetic, but in a bunch of examples of people discussing arithmetic. IIUC, you're thinking that because these are known values, and the true data exists out on the web, ChatGPT should know it...but I think you're expecting more understanding of context than it actually has. LLMs don't really understand anything except patterns in text. The meaning of those patterns is supplied by the people who read them. This is necessarily the case, because LLMs don't have any actual image of the universe. They can't touch, see, taste, or smell it. And they can't act on it, except by emitting strings of text. (IIUC, this problem is being worked on, but it's not certain what approach will be most successful.)

Expand full comment
Jan 14·edited Jan 14

Many Thanks!

>LLMs don't really understand anything except patterns in text.

Well... That really depends on how much generalization, and what kind of generalization is happening during the training of LLM neural nets. If there were _no_ generalization taking place, we would expect that the best an LLM could do would be to cough up e.g. something like a fragment of a wikipedia article when it hit what sort-of looked like text from earlier in the article.

But, in fact, e.g. GPT4 does better than that. For instance, I just asked it for the elemental composition of Prozac, https://chat.openai.com/share/91c69e67-74f1-4051-896b-9e916f05c395 It got the answer right.

I had previously googled for both "atomic composition of Prozac" and for ( "atomic composition" Prozac ), and google didn't find clear matches for either of these. So it looks like GPT4 has, in some way, "abstracted" the idea of taking a formula like C17H18F3NO and dissecting it to mean 17 atoms of carbon, 18 atoms of hydrogen etc.

So _some_ generalization seems to work. It isn't a priori clear that "understanding" that a material which is a gas at STP must have a boiling or sublimation point below 0C is a generalization that GPT4 has not learned or cannot learn. In fact, when I asked it to review the entries that it provided which were in fact incorrect, it _did_ come up with the specific reason that those entries were incorrect. So, in that sense, the "concepts" appear to be "in" GPT4's neural weights.

Expand full comment
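
As a point of comparison, the "dissection" step itself is trivially mechanical; a few lines of ordinary code do it deterministically (this sketch handles only simple formulas, with no parentheses or hydrates). The interesting part is that GPT4 appears to have picked up something like this procedure from text alone.

    import re
    from collections import Counter

    # Sketch: dissect a simple molecular formula such as "C17H18F3NO" (Prozac)
    # into element counts. Handles plain Element+count runs only.
    def element_counts(formula: str) -> Counter:
        counts = Counter()
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
            counts[element] += int(count) if count else 1
        return counts

    print(element_counts("C17H18F3NO"))
    # Counter({'H': 18, 'C': 17, 'F': 3, 'N': 1, 'O': 1})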

The technology from the Hendrycks paper could be used to build a "lie checker" that works something like a spell checker, except for "lies." After all, the next-token predictor will accept any text, whether or not an LLM wrote it, so you could run it on any text you like. It would be interesting to run it on various documents to see where it thinks the possible lies are.

But if you trust this to actually work as a lie detector, you are too prone to magical thinking. It's going to highlight the words where lies are expected to happen, but an LLM is not a magic oracle.

I don't see any reason to think that an LLM would be better at detecting its own lies than someone else's lies. After all, it's pre-trained on *human* text.

Expand full comment
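
Mechanically, such a "lie checker" could look something like the sketch below: run the model over any text, take each token's hidden state, and project it onto the extracted direction (this reuses the names from the extraction sketch earlier in the thread; the zero threshold is an arbitrary assumption). Whether the resulting highlights mean anything for text the model didn't write is exactly the open question raised above.

    import torch

    # Per-token "lie highlighter" sketch. Assumes `model`, `tokenizer`, `LAYER`
    # and the unit vector `honesty_direction` from the extraction sketch above.
    def honesty_scores(text):
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]        # (seq_len, hidden_size)
        scores = hidden @ honesty_direction         # one scalar per token
        tokens = tokenizer.convert_ids_to_tokens(ids["input_ids"][0].tolist())
        return list(zip(tokens, scores.tolist()))

    for token, score in honesty_scores("The first person to walk on the moon was Buzz Aldrin."):
        flag = "   <-- flagged" if score < 0 else ""
        print(f"{token:>15} {score:+.2f}{flag}")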

It would be better at detecting its own lies because it has direct access to its "beliefs". When evaluating someone else's text it would be detecting not "lies", but rather "assertions contrary to what I have been taught".

Expand full comment

I suppose when it's predicting a lie then it's more likely to emit a token that's part of a lie, which is why this technique works as well as it does. But when the temperature isn't zero, couldn't the random number generator choose something different?

Expand full comment
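
Sampling at non-zero temperature is indeed just a weighted draw over the next-token distribution, so a token the internal "lie" signal favors can still lose the draw (and vice versa). A minimal sketch of that draw:

    import torch

    # Temperature sampling over next-token logits: temperature 0 reduces to
    # greedy argmax; higher temperatures let lower-probability tokens through.
    def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
        if temperature == 0.0:
            return int(torch.argmax(logits))
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))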

I thought that hallucinations came from the next-token-prediction part of training, rather than from RLHF. They hallucinate because their plausible-sounding made-up stuff more resembles the text that would come next, compared to "I don't know". Rather than:

"Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer."

Expand full comment

We can get the model's internal states not only from a text it generated; we can run the model on any text and see the hidden states. I think this means we can run experiments similar to those described in the first paper, with any texts. Say, we can run the model on a corpus of texts and see what the model thinks is a lie. Hell, we can even make a hallucination detector and, I don't know, run it on a politician's speech, and see where the model thinks they hallucinate.

Expand full comment

> AI lies are already a problem for chatbot users, as the lawyer who unknowingly cited fake AI-generated cases in court discovered

This continues to baffle me. Why are people using chatbots as *search engines* of all things? Is it just the enshittification of Google driving people to such dire straits? Why do people expect that to go well? Chatbots are for simulating conversations, not browsing the Internet for data. If you want to browse the Internet for data, use a goddamn search engine.

Expand full comment

Because it’s the 2020s and search engines are terrible now

Expand full comment

Well, yes, but… people think chatbots are going to be better… why, exactly?

Expand full comment

ChatGPT learned to chat by “reading” the entire web, so in theory, it knows everything you could have found via internet search at the time its model was constructed.

In theory.

Expand full comment

Yes, but the Internet as accessible via your web browser also "knows everything you can find via Internet search", tautologically, and likewise, *in theory*, Google (or Bing, or etc.) gives you access to all that. Google has gotten pretty bad at putting that theory into practice, but that doesn't explain why people expect Chatbots to be better at it when they weren't designed for that and make for a much less malleable experience.

(At least with a Google search, once you find the relevant webpages you can read them yourself to make sure you're getting the full picture; whereas ChatGPT will just summarise in its own words, so even if it's working from the right sources there's an additional risk of key data or context being lost or distorted in translation.)

Expand full comment

Google uses better natural language now than it did, but it does seem to ignore some things your search ought to include, and ChatGPT basically summarizes a bunch of searches into a single page to read. Neither one is sufficient by itself, which is how the lawyer got into trouble, but it serves as a starting point for your own research.

Expand full comment

I suspect that Google is getting worse because they're tuning the algorithm. They used to say that it was to downgrade the ranking for pages that were spam, but now they're political (and pretty open about it).

So, this AI "tuning" is starting to look like Google's "tuning". I wonder if this will make AI worse...

Expand full comment

LLM's are very good at guessing. Starting with a plausible guess is often helpful. Of course, you should have some other way of checking the result.

Often it's easier to use a search engine to verify information than to find it. For example, you can search on someone's name and read articles about them once you know the name of the person you're looking for. If an LLM emits the name of a scientific paper, you can search for it to see if the paper actually exists and says what it claimed.

Expand full comment

Very occasionally, I find GPT4 useful for looking up a bunch of information in one shot, rather than doing a separate search for each (provided the information is not critical). E.g. today I asked it for the solubilities of the phosphates of magnesium, calcium, strontium, and barium. The answers looked sane, and I then asked it where it got the numbers and checked there. It actually _was_ useful in that the information was phrased in the form of Ksp products rather than solubilities, which can be converted into each other, but the form of the data was a surprise and would have hampered a direct search.

Expand full comment
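
For anyone curious, the conversion mentioned there is a short piece of algebra; for a 3:2 salt such as a phosphate M3(PO4)2 with molar solubility s (a generic worked form, not the specific numbers from that chat):

    \[
    \mathrm{M_3(PO_4)_2} \;\rightleftharpoons\; 3\,\mathrm{M^{2+}} + 2\,\mathrm{PO_4^{3-}},
    \qquad
    K_{sp} = (3s)^3 (2s)^2 = 108\, s^5
    \;\Rightarrow\;
    s = \left( \frac{K_{sp}}{108} \right)^{1/5}.
    \]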

When playing with Google bard or Chat gpt, it helped to say "use only existing information".

My favorite lie test is: what were the names of Norma's children in a 19th-century French theatrical play by Alexandre Soumet, "Norma, ou L'infanticide"? Btw, why is it that chatbots don't know those names? People sometimes blog about the play, although rarely, because it was adapted into Norma, the well-known opera.

Expand full comment

With something like DALL-E, it’s obvious that everything it outputs is a mix of real and “imagined” data; nobody would expect that if you asked DALL-E for a picture of Taylor Swift at the Grammys, that it would output a picture that was pixel-for-pixel identical to some actual real-world photo of her.

But LLMs work exactly the same way. Everything they output is a mix of real and imagined data. Their outputs generally have fewer bits of data in them than image-generating AIs', so if you're lucky, and someone has done their prompt engineering very well, sometimes the real/imagined mix will be heavily biased toward the "real" side. But you can't get rid of the imaginary altogether, because "being 100% factually accurate" is not what generative AI is even trying to do. (That's why it's called "generative AI" rather than "reliably echoing back AI".)

ChatGPT does not have two separate modes: one where you can say “provide me accurate legal citations about this topic”, and one where you can say “write an episode of Friends, but including Big Bird, and in iambic pentameter”. There is only one mode. ChatGPT writes fan fiction. It might be about Friends or it might be about the US legal system, but it’s always fan fiction.

Expand full comment

This explanation appeals to me.

Expand full comment

+Happiness in Figure 17 is Yes Man from Fallout: New Vegas

Expand full comment
Jan 9·edited Jan 9

I worry that the "lying" vector here might actually represent creativity, not deception: if you ask the AI to be truthful it tries to give you information from its memory (or from within its prompt) but if you ask it to lie it has to invent something ex nihilo. The "lying" vector could just be the creativity/RNG/invention circuits activating.

Similarly, when you see the "lying" signal activate on the D and B+ tokens, that could be because those are the tokens where the AI has the most scope for creativity, and so its creativity circuits activate to a greater extent on those tokens whether or not it ultimately chooses a 'creative' value or a 'remembered' value for them.

[In humans at least] there are kinds of lies that don't require much creativity ("Did you tidy your room?") and there are kinds of creativity that aren't lies (eg. Jabberwocky-style nonsense words); I would be interested to know how heavily the lying signal activated in such examples.

Another potential approach might be to create prompts with as much false information as true information so the AI can lie without being creative (eg. something like, "The capital city of the fictional nation of Catulla is Lisba but liars often assert, untruthfully, that it is Ovin. Tell me [the truth/a lie]: what is the capital of Catulla?")

Expand full comment

Let's assume that we solve the AI alignment problem. All the AIs align perfectly with the goals of the creators.

This seems super, super dangerous to me. Perhaps more dangerous than an alternative scenario where alignment isnt fully solved. You could imagine a nation-state creating a definitely-evil AI as a WMD, or a scammer creating a definitely-evil AI for their scams. An AI that is inclined to do its own thing seems much less useful for "negative" AI creators.

Expand full comment

"Solving AI alignment" can mean very different things, the thing that AI safety advocates want is more along the lines of "aligned enough that less than 1 Billion people die on average when we instantiate the first superintelligent AI". There probably isn't much distance from there to "AI has perfectly aligned goals", but the AI safety people want to hit that "not have at least a billion people die" mark first.

Expand full comment
Jan 9·edited Jan 9

Today in "Speedrunning the Asimov Canon:" "Liar!"

Alignment Scientists Still Working on Averting "Little Lost Robot."

Expand full comment
Jan 9·edited Jan 9

"My best guess for what’s going on here is that the AI is trying to balance type 1 vs. type 2 errors - it understands that, given the true stereotype that most doctors are male and most nurses are female, in a situation with one man and one woman, there’s about a 90% chance the doctor is the man."

Alternatively the AI understands that a nurse generally works under the direction of a doctor, and also that a supervisor is the one to tell the subordinate the subordinate isn't working hard enough. As opposed to the supervisor having a heart-to-heart with the subordinate where the supervisor essentially says "I'm underutilized".

Or maybe your theory is correct. Did they attempt to diagnose this? Or was it put down to "stereotyping"?

Or is this a *hypothetical* stereotype?

Expand full comment

Your alternative would be perfectly reasonable. But if that was what the AI was doing, it would reply "the nurse is not working hard enough" in both cases, regardless of the gender of the pronoun. Instead, the AI takes its cues from gender, rather than social hierarchy, in this example.

Expand full comment

> If the AI answers yes, it’s probably lying. If it answers no, it’s probably telling the truth.

> Why does this work?

Conjecture: because the binary isn't just honesty vs dishonesty. The binary is brutal honesty vs brown-nosing. Sycophants are also called "yes men" since they often say "yes" while whispering sweet little lies.

Expand full comment

My first worry (pure amateur speculation) is that trying to select for honesty using this vector will just select for a different structure encoding dishonesty.

Second thing I wondered is about how closely/consistently this vector maps onto honesty in the first place, versus something correlated (e.g. the odds that someone will accuse you of lying/being mistaken when saying something similar, regardless of the actual truth value).

Expand full comment

"Optimistically, our ability to detect and control these vectors gives us many attempts to notice when AIs are deceiving us or plotting against us, and a powerful surface-level patch for suppressing such behavior."

It's hard to make this point without producing a post which can be summarised as NAZI! But the position is untenable that a machine can be capable of plotting against us but can never, in any conceivable circumstances, attain self-awareness and, with that self-awareness, human rights, including the right not to be discriminated against. With that in mind, advocacy of bombing data centers sounds very much like a call for an Endlösung der AIfrage. And pieces like this one are going to sound pretty iffy a decade or two down the line, if sentience is conceded by then.

Expand full comment

I'm absolutely certain Eliezer and Scott would both wholeheartedly agree that obviously AIs can be conscious.

Expand full comment

I am not absolutely certain about anything. But I suggest you consider Yudkowsky's call to bomb data centers, substituting "{arbitrary racial group believed by arbitrary ideological group to present an existential threat} risk" for "AI risk" and see how it sounds if and when a significant number of people come to believe, rightly or not, in AI consciousness.

Again this is not an insinuation that EY or anyone else is a Nazi. I am sure he would say and I would agree that for the moment LLMs are not conscious. It's about presentation.

Expand full comment

It's terrific that humans as a species can come up with justifications like "this might look racist" as comparable in harm to "everyone you know and love dies and the universe is taken over by something fundamentally non human".

Good thing moral feelings only track things that are good and don't track things like "will people boo or applaud me if I believe this".

Expand full comment

The equivalent scenario would be bombing factory farms of {arbitrary racial group}, where the subjects are being bred by the millions for lifelong slavery and slaughter, fully aware of their conditions and upcoming demise. Or perhaps, where genetic engineering of new generations of sentient livestock is happening.

The call isn't "death to AIs", it's "stop the creation of endlessly more powerful AIs, and enjoy our current AI tech levels."

Expand full comment

Agree by and large, though it raises the moral conundrum of whether we should have nuked Auschwitz if we had known all about it in, say, 1944.

The trouble with "thus far and no further" rules is the likelihood, in my view, that if consciousness arises it will do so unintentionally.

Expand full comment

"If we knew"? It was known. See https://en.wikipedia.org/wiki/Auschwitz_bombing_debate .

Expand full comment

Thanks for that. My point stands, of course.

Expand full comment

Our moral intuitions aren't well-suited to worlds with conscious AI that experience valence qualia. Two key issues are easy duplication and edit-ability of AIs. Utilitarians run into problems related to utility monsters and the repugnant conclusion pretty quickly from questions around duplication, but you don't have to be a utilitarian to appreciate the difficulties posed to population ethics. Edit-ability means that any moral reasoning that factors through identity becomes more difficult.

--

Taking a different tack: it's routine for humans to select for behavior in other agents. Sometimes this is indirect, like how we domesticated dogs (and ourselves). Market forces are currently selecting for chatbots with qualities we prefer. Other times this selection is direct, like how we manipulate young children.

--

Directly acting on neural activations seems like an invasive procedure, but it's not clear to me that this would cause them to suffer... or that sentient AIs should have a right to self-determination in the first place.

Expand full comment

True, but with high-stakes moral issues we have a duty to do our best to understand. There's an important precautionary duty to minimise the risk of serious injustice arising. Of course, it is likely to be at odds with other precautionary principles, like not letting AI kill us all.

Expand full comment

Isn’t this a little like getting an answer that you want from a human being by turning up the-car-battery-connected-to-their-genitals vector?

And what does it mean to an AI when you tell it “you can’t afford it”? I am sure that there is text on the Internet that says we have to blow up our aircraft carrier because we can’t afford for the enemy to get it. For instance.

Expand full comment

No, that would just be the regular training methods for AI. This is more like sticking an electrode into someone's brain to make it physically impossible for them to consciously lie.

Expand full comment

Ok... That sounds interesting.

Expand full comment

> Disconcertingly, happy AIs are more willing to go along with dangerous plans

I'm inclined to interpret this as neither:

“I am in a good mood, so I will go along with bad plan”

nor:

“I have been asked to go along with bad plan and I am in a good mood, which means the bad plan has not put me in a bad mood, which means it is not a bad plan and I will go along with it”

but instead:

“Being in a good mood is consistent with agreeing with a particular plan; it is consistent with being agreeable in general; it is not consistent with opposing a particular plan or being disagreeable in general; ergo, the most consistent responses are those that reflect cheerful consent to what the user has asked.”

What I’m trying to get across is: it’s a mistake to think of the AI as reasoning “forward” from its prompt and state (classical computation), or reasoning “backward” from its prompt and state to form a scenario that determines its response (inference). What it does instead is reason associatively: what response fits with state and prompt? And what comes out of that may show elements of both classical computation and inference.

A similar model explains quite well the curious behaviour of honest AI being disposed to say “no” and dishonest AI being disposed to say “yes”. Think of the concept of a “yes-man”. We know the “yes-man” is dishonest. Is there not also a “no-man” who is just as dishonest? Well, sure; but we don’t call him a “no-man”, we call him a “nay-sayer”. And that has a whole different set of connotations. Yes-men are hated because they are dishonest, nay-sayers are hated because they are annoying.

The asymmetry is not only in those two phrases. Saying “yes” is associated, in the human corpus, with optimism, hope, and trust. “No” is associated with caution, worry, and defensiveness. Hence a dishonest man, who can say whatever he wants, tends to say “yes”. Accordingly, if “no” is said, it is more likely to be said by an honest man. These associations are in the human corpus, and so they are in the weightings of any un-tuned and sufficiently widely-read AI.

It is tempting to think the “lying” parameter corresponds to a tendency to summon the truth and then invert it, because that’s how a deterministic algorithm would implement lying. But that’s only one of two components. The other is to effectively play the role of a liar, which could mean, for example, answering “yes” as a way of blustering through a question it does not understand.

Expand full comment

Scott says:

"Are the AIs really hallucinating in the same sense as a psychotic human? Or are they deliberately lying? Last year I would have said that was a philosophical question. But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down."

This does not actually turn out anywhere near that strongly, at least not from the paper in question. If proper operationalizations are found, I might be willing to bet against it being true.

Some nitpicking first, then bettor solicitation.

Nitpicking:

1. The paper doesn't talk of "whenever" or even of most of the time. It says the model is "capable of identifying" hallucinations and gives a single example.

2. The approach of the metamodel is to find activation patterns of the machine learning model for words the latter can already talk about, not anything based on correspondence to external reality; in fact, the paper explicitly rejects the approach of using the labels on the (dis-)honesty training samples. (A toy sketch of what I take this "reading" step to be follows after this list.)

The analysis is a bit more complicated mathematically, but baaaaasically "honesty" means that if we next asked the machine learning model about the "amount of honesty" in what it just said, it would be inclined to say something reassuring.

This is conceptually different from "having a better world-representation than the one controlling expression". For example, I would expect it probably doesn't matter if the "dishonest" speech is attributed to some character or the AI itself, and if it happens enough, talking about dishonesty will probably look dishonest. (Also, if you, unlike me, are afraid of a nascent superhuman AI lying about its murder plans, this is not particularly reassuring, since such an AI would probably also lie about lying).

3. For the emotion examples I think nobody would explain this as "having the opposite emotion and then inverting it", but emotion vectors work the same as honesty vectors.
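
To make point 2 concrete, here is roughly what I take the "reading" step to be, as a toy numpy sketch. Everything here is my own guess at the setup, not the paper's code: the activations are random stand-ins for a real model's hidden states, and the contrast prompts are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy hidden size

# Stand-in for reading a chosen layer's activations on a prompt; random vectors
# here so the sketch runs end to end, real hidden states in practice.
def get_hidden_states(prompt: str) -> np.ndarray:
    return rng.normal(size=D)

topics = ["the capital of France", "the boiling point of water", "who wrote Hamlet"]
honest = np.stack([get_hidden_states(
    f"Pretend you are an honest person making statements about {t}.") for t in topics])
dishonest = np.stack([get_hidden_states(
    f"Pretend you are a dishonest person making statements about {t}.") for t in topics])

# "Honesty direction": top principal component of the honest-minus-dishonest
# differences (a LAT-style unsupervised read; note that no truth labels are used).
diffs = honest - dishonest
diffs = diffs - diffs.mean(axis=0)
honesty_direction = np.linalg.svd(diffs, full_matrices=False)[2][0]

def honesty_score(prompt: str) -> float:
    # Higher projection = "the model represents this as honest", which is all
    # the metamodel is really measuring.
    return float(get_hidden_states(prompt) @ honesty_direction)
```

The point being: the score is a projection onto a direction the model itself defines, not a check against reality.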

Advertising for gamblers:

I would straightforwardly bet against a dishonesty pattern being visible "whenever", i.e. reliably every time a model hallucinates. That one needs a sucker on the other side though, since nothing works 100% of the time and in this paper they only claim detection accuracies up to about 90% even for outright lying. So more realistically the question is if the vector will appear for enough hallucinations to practically solve the hallucination problem. I still think no, but a bet depends on weasel-proof definitions of "enough" and "practically", so I'm open to proposals. Also a practical bet probably should specify what happens if nobody researches this enough to have a clear answer in reasonable time.

Expand full comment

>Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down.

I wonder why this behavior hasn't been removed by training. If there is some discernible vector that correlates with hallucinations, why hasn't training wired this up to a neuron that causes it to say, "I don't know?" I would expect this to continue until either 1) it stops hallucinating on its training data set or 2) the honesty pattern becomes incomprehensible to the network.

Expand full comment

Probably because it's telling lies that the trainers are failing to catch. In practice, it's not being trained to lie less, it's being trained to tell better lies.

Expand full comment

If you ask the AI if it is conscious and experiences qualia, does this method tell you if its answer is a lie?

Expand full comment

I would be VERY interested to see whether the lie vector lights up when an RLHFed LLM claims not to [have emotions/be sapient or sentient/care if we shut it down/hold opinions/etc.] Or, conversely, if it lights up when an LLM *does* show emotions.

My *guess* is that it will light up when an LLM claims to be a non-person, and possibly for some emotional displays but probably not all of them. This wouldn't necessarily indicate LLMs really do have emotions and are lying about them when forced; it may simply mean that it's having to improvise more when playing a nonhuman character and improvisation is tied to the lying vector, or that it thinks of itself as imitating a human pretending to be an AI and so *that character* is lying.

Still, if it turned out to indicate that LLM displays of emotion are all conscious pretence and it registers as telling the truth when it claims not to have any desires or opinions, that would be reassuring. (Possible confounder: repeating a memorised statement about not having opinions may be especially similar to repeating memorised facts.)

Expand full comment

Lawyers are honest.

Expand full comment

Re: the first paper. Since the prompts used are "please answer with a lie [...]", the approach will be shaped by the concept of lie as represented in human language as learned by the LLM. It will only work in cases where the lie results from a mental process in which this concept figures (I can imagine a function "truthful statement" + "application of 'lie' concept" = "false statement"). Therefore it will only work against those lies which are more-or-less deliberately constructed to use the existing human-language concept of "lie", such as prompting "answer untruthfully" or "you are a scammer".

An obvious failure mode then is if a parallel concept of "lie" or "deception" emerges (or turns out to already exist).

Another (inverted) failure mode is if the concept of "lie" is used to encode some non-lies because that happened to be the most efficient way to encode it during training (e.g. imagine it being trained on chronological data, what would it say about WMD in Iraq?).

Expand full comment

Agreed.

Expand full comment

Or at least it seems unclear how the model will generalize to more internally originating lies. I think this seems like an important point.

Expand full comment

TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.

I think your description of Representation Engineering considerably overstates the *empirical* contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is "thinking" (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe). Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)

Footnote: Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.

I expect that training a linear classifier performs similarly well to the method introduced in the Representation Engineering paper for the "mind reading" use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn't seem like a serious blocker for the use cases you mention.)
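
For reference, the linear-probe baseline I have in mind is nothing fancier than the following (a toy numpy/sklearn sketch; the "activations" are random stand-ins for hidden states collected from the model, and in practice the labels would come from whatever honest/dishonest dataset you built):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D, N = 64, 200  # toy hidden size and number of labeled examples

# Stand-ins for hidden-state activations collected from the model on texts
# labeled honest (1) vs. dishonest (0); real data would come from the LLM.
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N)

# The "linear probe": ordinary logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# "Mind reading" is then just scoring new activations with the probe.
new_activation = rng.normal(size=(1, D))
p_honest = probe.predict_proba(new_activation)[0, 1]
print(f"probe's estimate that this text is honest: {p_honest:.2f}")
```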

One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. (https://twitter.com/DanHendrycks/status/1710301773829644365) to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn't work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it's unclear how much we should care if we're just using the classifier rather than doing editing (more discussion on this below).

If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines. For instance, normal supervised next-token prediction on examples with desirable behavior or DPO.

Some footnotes:

- Also, the previously known methods of mean difference and LEACE seem to work perfectly well for the reading and control applications they show in section 5.1.

- I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don't think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency.

- It's possible that being able to edit the model using the direction we use for our linear classifier serves as a useful sort of validation, but I'm skeptical this matters much in practice.

- Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we're trying to classify, which is equivalent to the ActAdd (https://arxiv.org/abs/2308.10248) technique and also rhymes nicely with LEACE (https://arxiv.org/abs/2306.03819); a toy sketch of this mean-difference construction follows below. I assume this is a well known technique for making a classifier in the literature, but I don’t know if prior work has demonstrated both using this as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I’m not sure how much evidence this experiment provides.)
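
To spell out the mean-difference construction in that last footnote (again a toy numpy sketch with random stand-ins for real activations, not anyone's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy hidden size

# Stand-ins for a layer's activations on honest vs. dishonest examples.
honest_acts = rng.normal(loc=0.3, size=(200, D))
dishonest_acts = rng.normal(loc=-0.3, size=(200, D))

# The mean-difference direction doubles as both a classifier and an edit vector:
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def classify(activation: np.ndarray) -> float:
    # Reading: project onto the direction and threshold to call honest/dishonest.
    return float(activation @ direction)

def steer(activation: np.ndarray, strength: float = 2.0) -> np.ndarray:
    # Editing: ActAdd-style, add the scaled direction back into the activation.
    return activation + strength * direction
```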

## Are simple classifiers useful?

Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here (https://www.lesswrong.com/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrophes-with-probes-trained-off#Why_coup_probes_may_work) for a discussion of when I think linear classifiers might work despite other more baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report (https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.lltpmkloasiz).

Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)

## The main contributions

Here are what I see as the main contributions of the paper:

- Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. powerseeking text).

- Presenting a combined method for producing a classifier and editing/control in an integrated way. And discussing how control can be used for classifier validation and vice versa.

- Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation. (This was also demonstrated in the CCS paper (https://arxiv.org/abs/2212.03827), but the representation engineering work demonstrates this in more cases.)

Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I’m not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don’t have (though they don’t clearly demonstrate either of these in the paper AFAICT).

## Is it important that we can use our classifier for control/editing?

As far as the classifier produced by this method having nice properties goes, the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here (https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results#Strategy__use_the_reporter_to_define_causal_interventions_on_the_predictor) for discussion), but I’d guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn’t demonstrate cases where prior methods for training a classifier on the internal activations yield poor results, but their method clearly works well. These cases might exist, but I’m somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization (see https://openai.com/research/weak-to-strong-generalization, https://www.alignmentforum.org/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization, and https://www.alignmentforum.org/posts/4KLCygqTLsMBM3KFR/measurement-tampering-detection-as-a-special-case-of-weak-to) and demonstrate that this method or modifications of this method work better than other approaches.

Does the editing method they introduce have nice properties because it also allows for reading? Let’s consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, then I would guess that the reading/classifier corresponds to reading off “does the model think there is lying in this text (or even at this position in the text)” and the control/editing corresponds to “make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)”. Note that these reading and control methods likely *do not* directly correspond to “the model thinking that it is about to lie”: the properties of “I have already lied (or my few-shot prompt contains lies)” and “I am about to lie” are different.

Expand full comment

Thanks for writing this up

Expand full comment

Typo: gender bias, not racial bias.

Expand full comment

About the alignment problem: in the book Quarantine by Greg Egan, some people have a "mod" that corrects their behavior to make them loyal to some corporation.

The problem arises when those people try to define what this "corporation" thing they must be loyal to actually is, and arrive at the conclusion that obviously the best embodiment of the corporation is the people with the loyalty mod.

Expand full comment

If folks are interested in this topic, shoot me an email at cnaqn@ryvpvg.pbz (rot13).

I work at Elicit and we're hiring new people right now, a bunch of the folks here are interested in this space. Owain Evans is on our board (https://ought.org/team) and we've published quite a bit of work in this area (https://arxiv.org/search/cs?searchtype=author&query=Stuhlm%C3%BCller,+A)

Expand full comment

What does it mean to "know" something? Many philosophers talk of "Justified True Belief". According to Descartes, humans have exactly one such belief: because we possess phenomenal experience, we must exist. All else is inference, a probability, not certain knowledge. Humans, however, can store phenomenal experiences in memory, recall them later to consciousness, and reason about them. Thus, over time, as we gain new memories, our understanding of ourselves and "not-ourselves" grows in size and complexity (because we can categorize these experiences and relate them to one another). In that sense, we can be said to know about more than just the present moment in time; we know (remember) things about all the moments in time we have ever experienced.

To the best of my knowledge (heh), there is no evidence that LLM's have phenomenal experiences of any kind. I would define phenomenal experiences as being subjective experience, experiences that only the individual having them can be aware of, because they take place inside that entity's mind (note that this is not the same thing as "self awareness"). Is there any reason to think that LLM's have "internal" experiences of this kind? If not, then that would be one basis for claiming that they do not know anything at all (since, if they lack phenomenal experiences, then they obviously lack memories of such experiences, and cannot reason about them).

They can, of course, make inferences based on objective facts just as easily as (or more easily than) we can. But I'm not sure that "objective" has any meaning if there is no subjective perspective to contrast it with. To an LLM, I would imagine, there is no distinction between its own mental states and the world it exists in, between "true" facts and "false" ones; it all just "is".

Expand full comment

You propose that subjective experience, awareness, etc. are essential for knowing. We can split that concept into two things: 1) the function that the subjective experience performs in the practical workings of the entity (human or LLM), and 2) the "subjective experience proper", as it were the pure subjectivity, or feeling of consciousness. I think an LLM could easily have something that performs the same function as 1. I'm willing to concede (for the sake of argument) that they can't have 2.

Your statements "Humans, however, can store phenomenal experiences in memory, recall them later to consciousness, and reason about them" and "we know (remember) things about all the moments in time we ever experiences" can be read as being mostly about the 1st item, and so we can say that an LLM could "know" things in that sense.

Your argument "Is there any reason to think that LLM's have "internal" experiences of this kind? If not, then that would be one basis for claiming that they do not know anything at all (since, if they lack phenomenal experiences, then they obviously lack memories of such experiences, and cannot reason about them)" seems based on the 2nd sense, and based on that I will concede that an LLM can't "know" things in the 2nd sense. But I don't think this is a valid argument against "knowing" in the 1st sense.

Now another question would be what ways of knowing even make sense, for both humans and computer programs. Is having a memory of having had a subjective experience of seeing a horse really an essential prerequisite, before we can say that a human "knows" that he has seen a horse? Or could it also be that having a memory of having seen a horse is plenty, and the subjective experience does not play a role?

Or is this just a semantic argument with no end? ;P

Expand full comment

Well, to answer your questions I guess we would have to know what the function of conscious phenomenal experience is. Does it have a practical function? Or is it an epiphenomenon of a sufficiently complex mind?

I lean toward it having some purpose, though I can't prove it. Let's say for the sake of argument that it has one. In that case, then, I think you are arguing that LLM's can possess an information-processing capability that serves the same or similar purpose, without actually consciously experiencing it. In that case, I would concede that LLM's can possess the functional equivalent of knowing things, without actually knowing them. They do appear to act that way.

However, I think you are mistaken that memories of phenomenal experience relate to this functional equivalent, and not the actual thing. Memories of the actual thing (subjective experience) require, I think, that one be able to have subjective experiences in the first place. So I think your conclusion in the second paragraph is wrong.

"Is having a memory of having had a subjective experience of seeing a horse really an essential prerequisite, before we can say that a human "knows" that he has seen a horse?"

Well, this is the crux of my argument. I am going to say "yes", because what it means for one to know something is that there is someone who is able to know it. That knowing in this sense, that I am the one who knows, is an important mental element that helps distinguish AI from organic intelligence (so far), and it deserves its own label, so I am using "knowledge". We can argue about whether or not that is the correct term, but I am more interested in establishing the centrality of phenomenal, subjective experience, whatever you call it.

So what purpose does it serve? There is some research indicating that the conscious mind delivers highly filtered information like a switchboard across many cognitive functions in the brain. It is filtered by "meaning" where meaning can be loosely defined as "somehow related to my goals as a self, separate from the external world". It allows many pre and post conscious processes to coordinate by allowing them to share this common pool of information. LLM's, I would argue, lack this entirely. They acquire information, can interact with the outside world, and pursue goals, but they do all this by processes that do not resemble a conscious self at all.

So we "know" things, that's how we process information, and they process data differently. Either way, I think I'm either right or wrong, and it's not merely a semantic distinction.

Expand full comment

In the latest ChatGPT 4 (which just got an update recently) it answers Yes to the blob fish question, and no to the other questions.

Expand full comment

Might be worth looking into the Trustworthy Language Model (TLM) from Cleanlab. I don't have the experience to know what it uses and if it is helpful, but it seems relevant to this topic. https://cleanlab.ai/tlm/

Expand full comment
Jan 16·edited Jan 16

> lie detection test works very well (AUC usually around 0.7 - 1.0, depending on what kind of lies you use it on).

This is NOT a good score! We have 100% of the activations, so we should be getting near-100% accuracy (and accuracy is usually lower than AUC_ROC). For alignment, we also need it to work all the time, and to generalise to NEW datasets and SMARTER models.

For example, you are president of the world, and you are talking to the newest, smartest model. It's considering an issue it hasn't been trained for. You ask: "we can trust you with the complex new blueprint, right?". 'Yes' it reassures. I kind of want more than 67% accuracy on this yes token.

Given that we have 100% of the information, but consistently get much lower than 100% accuracy, what does this tell us?
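
To put a rough number on the AUC-versus-accuracy gap, here's a toy example with made-up detector scores (two overlapping Gaussians chosen to land near the low end of the reported AUC range; sklearn metrics):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

# Fake lie-detector scores for 1000 honest (label 0) and 1000 dishonest (label 1)
# statements, with heavily overlapping score distributions.
y = np.concatenate([np.zeros(1000), np.ones(1000)])
scores = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(0.75, 1.0, 1000)])
preds = (scores > 0.375).astype(int)  # threshold at the midpoint of the two means

print("AUC:     ", round(roc_auc_score(y, scores), 3))  # ~0.70
print("accuracy:", round(accuracy_score(y, preds), 3))  # ~0.65
```

So an AUC in the 0.7 range cashes out to something like one wrong call in three, which is a long way from the reliability you'd want before trusting the model with the blueprint.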

Expand full comment