365 Comments

Aargh. Sorry, I know I should really read the full post before commenting, but I just wanted to say that I really really disagree with the first line of this post. Lying is an intentional term. It supposes that the liar knows what is true and what is not, and intends to create a false impression in the mind of a listener. None of those things are true of AI.

Of course, I get that you're using it informally and metaphorically, and I see that the rest of the post addresses the issues in a much more technical way. But I still want to suggest that this is a bad kind of informal and metaphorical language. It's a 'failing to see things as they really are and only looking at them through our own tinted glasses' kind of informal language rather than a 'here's a quick and dirty way to talk about a concept we all properly understand' kind.

Expand full comment

I think you should read the full post before commenting, especially the part about hallucinations. The work I'm summarizing establishes that the AI knows what is true and what is not, and intends to create a false impression in the mind of a listener.

Expand full comment
Comment deleted
Jan 9, 2024Edited
Comment deleted
Expand full comment

"Code pretty clearly doesn't 'know' or "'intend' anything."

Disagree. If one day we have Asimov-style robots identical to humans, we will want to describe them as "knowing" things, so it's not true that "code can never know" (these AIs are neuromorphic, so I think if they can't know, we can't). The only question is when we choose to use the word "know" to describe what they're doing. I would start as soon as their "knowledge" practically resembles that of humans - they can justify their beliefs, reason using them, describe corollaries, etc - which is already true.

I think the only alternative to this is to treat humans as having some mystical type of knowledge which we will artificially treat as different from any machine knowledge even if they behave the same way in practice.

Expand full comment

Next level "mind projection fallacy"

Expand full comment

do you... actually think that a transistor-based brain which was functionally identical to a human brain, could not be assumed to behave identically to that human brain?

Projecting from one object 'a' to identical object 'b' is called reasoning! that's not a fallacy lol

Expand full comment

I agree with Scott, and kept my comment terse due to it being a pun.

My interpretation of scott's argument is that agency is a way to interpret systems - a feature of the map rather than the world.

Treating features of the map as something real is usually referred to as "mind projection fallacy", since you take something "in your mind" and project it onto the "world" (using citation marks to keep a bit of distance to the Cartesian framing)

The pun comes in because the thing that the mind projects in the comment Scott is replying to is "agency", or "mind".

So "the mind" projects "minds" onto reality. Next level mind projection ;)

Expand full comment

As long as a human brain is working in concert with a biological body. I would feel safe saying they would behave differently.

Expand full comment

(Post read in full now)

So, I'm pretty much in agreement with you here. Only I disagree strongly that the knowledge of AIs currently resembles that of humans. Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.

The fact that AIs are able to justify, reason, corollarize and all the rest is an amazing advance. But I don't see that it justifies jumping immediately to using intentional language to describe AI minds. The other stuff we humans do with our knowledge is sufficiently important that this gap should be actively acknowledged in the way we talk about AIs.

In terms of these papers, I certainly agree that it's interesting to know that AIs seem to have patterns of "neural" behaviour that are not explicitly expressed in text output but do correspond closely to observable psychological traits. This makes sense - if AIs are talking in ways that are comprehensible to us, they must be using concepts similar to ours on some level; and it's a known fact about human conversations that they sometimes have unexpressed levels of meaning; so a high-quality AI conversationalist should have those levels as well.

But I'd still step back from using the word lie because there isn't anything in your post (I haven't read the papers yet) about the key features of lying that I mentioned.

(1) Knowing what truth is. The papers suggest that AI 'knows' what lying is; but it's hard to see how it can be using that concept really felicitously, as it doesn't have access to the outside world. We usually learn that the word truth means correspondence with the world. The AI can't learn that, so what is its truth? I'm open to the idea that it could obtain a reliable concept of truth in another way, but I'd want to see an actual argument about what this way is. (As AIs become embodied in robots, this concern may well disappear. But as long as they're just text machines, I'm not yet willing to read real truth, as humans understand it, onto their text games.)

(2) Knowledge of other minds. I'm not against the idea that an AI could gain knowledge of other minds through purely text interactions. But I'm not convinced by any of the evidence so far. The religious guy who got fired from Google - doesn't that just sound like AI parroting his own ideas back at him? And the NYT journo whose AI fell in love with him - who'd a thought NYT writers were desperate for love? Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.

(3) Intention to affect other minds. Intentionality generally... You get occasional statements of intention from AI, but not much. This may be an area where I just don't know enough about them. I haven't spent much time with GPT, so I might be missing it. But I haven't seen much evidence of intentionality. I suppose the lying neuron in the first paper here would count? Not sure.

Expand full comment

> Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.

You are talking about differences in behaviour that are explained by differences in motivation, not knowledge. Yes, we have motivations to initiate conversations about our interests, seek more information about them and participate enthusiastically in related activities. LLMs do not have such. They are only motivated by a promt. This doesn't mean that they do not possess knowledge in a similar way to humans, this means that they do not act on this knowledge the way humans do.

Likewise some people are not into music that much. They can just passively accept that some band is good and move on with their lives without doing all the things you've mentioned. Do they people not have the knowledge?

On the other hand, it's not hard in principle to make an AI system that would act on it's knowledge about a band similarly to humans via scaffolding. I don't think it makes much sense to say that such system has knowledge, while core LLM doesn't.

Expand full comment

" it's not hard in principle to make an AI system that would act on it's knowledge about a band similarly to humans" - I'm suggesting that this is wrong. I don't think it's easy; it's certainly never been done.

Think about a book. There is a sense in which a book contains information. Does it "know" that information? I don't think anyone would say that. The book can provide the information if queried in the right way. But it's not meaningful to say the book "knows" the information. Perhaps the next step up would be a database of some kind. It offers some interactive functionality, and in colloquial language we do often say things like, "the database knows that." But I don't think we're seriously suggesting the database has knowledge in the way we do. Next up: AI, which can do much more with knowledge.

But I think AI is still closer to a book than it is to a person. In response to your question: if a person says they know Radiohead are good, but never listen to them, then I would indeed deny that that person really knows Radiohead are good. Imagining a person's knowledge to be like a passive database is to misunderstand people.

Expand full comment

I don't know. I would think of an AI as more like Helen Keller (or even better, a paralyzed person). They don't do all the same things as a healthy person, because they're incapable of it. But within the realm of things they can do, they're motivated in the normal way.

Lying is a good example. If you tell the AI "You got a D. You're desperate for a good grade so you can pass the class. I am a teacher recording your grades. Please tell me what grade you get?" then sometimes it will say "I got an A". To me this seems like a speech act in which it uses its "knowledge" to get what it "wants" the same way as you knowing you like a band means you go to the band.

See also https://slatestarcodex.com/2019/02/28/meaningful/ . I just don't think there's some sense in which our thoughts "really refer to the world" but AIs' don't. We're all just manipulating input and output channels.

(This is especially obvious when you do various hacks to turn the AI into an agent, like connect it to an email or a Bitcoin account or a self-driving car or a robot body, but I don't think these fundamentally change what it's doing. See also the section on AutoGPT at https://www.astralcodexten.com/p/tales-of-takeover-in-ccf-world for more of why I think this.)

Expand full comment

I agree that the choice to lie about the grade has the same structure whether it's done by person or AI. But still, I think there's wrong something wrong with the Helen Keller analogy. Helen Keller was an intact person, except for her blindness and deafness -- at least that's how she's portrayed. She had normal intelligence and emotions and self-awareness. Assuming that's true, then when she lied there would be a very rich cognitive context to the act, just as there is for other people: There would be anticipated consequences of not lying, and of lying and being caught and of lying and getting away with it, and a rich array of information the fed the anticipated consequences of each:, things like remembered info about how the person she was about to lie to had reacted when they discovered somebody was lying, and what about Helen's relationship with the person might lead to them having a different reaction to her lying. Helen would be able to recall and describe her motivation and thought process about the lie. Also she would likely ruminate about the incident on her own, and these ruminations might lead to her deciding to act differently in some way in future situations where lying was advantageous. And of course Helen would have emotions: Fear of getting caught, triumph if she got away with the lie, guilt.

So the AI is not just lacking long term personal goals & the ability to act directly on things in the world. It is also lacking emotion, and the richness and complexity of information that influences people's decision to lie or not to lie, and does not (I don't think) do the equivalent of ruminating about and re-evaluating past actions. It is lacking the drives that people have that influence choices, including choices to lie: the drive to survive, to have and do various things that give pleasure, to have peers that approve of them & are allies, etc. In short, AI doesn't have a psyche. It doesn't have a vast deep structure of interconnected info and valences that determine decisions like whether to lie.

It seems to me that for AI to become a great deal smarter and more capable, it is going to have to develop some kind of deep structure. It needn't be like ours, and indeed I don't see any reason why it is likely to be. But part of smartness is depth of processing. It's not just having a bunch of information stored, its sorting and tagging it in all in many ways. It's being fast at accessing it, and quick to modify it in the light of new info. It's being able to recognize and use subtle isomorphisms in the stored info. And if we build in ability to self evaluate and self modify, AI is also going to need a deep structure of preferences and value judgments, or a way to add these tags to the tags already on the info it has stored.

And once AI starts developing that deep structure, things like vector-tweaking are not going to work. Any tweaks would need to go deep -- just the way things that deeply influence people do. I guess the upshot, for me, is that it is interesting but not very reassuring that at this point there is a simple way to suppress lying. Is anyone thinking about the deep structure and AGI would have, and how to influence the kind it has?

Expand full comment

The fact that you have to tell the AI what its motivation is makes that motivation very different from human motivation. Perhaps we can imagine a scale of motivation from book (reactive only) to the holy grail of perfectly intrinsic motivation in a person (often contrasted favourably with extrinsic motivation in education/self help contexts). Human motivation is sometimes purely intrinsic; sometimes motivated by incentives. Databases only do what they're told. A situation in which you tell a human what their motivation is and they then act that way wouldn't be called normal human behaviour. In fact, it would be stage acting.

I actually agree with you that a correspondence theory of meaning is not really supportible, but there are two big BUTs:

(1) Language is specifically distanced from reality because it has this cunning signifier/signified structure where the signified side corresponds to concepts, which relate to external reality (in some complicated way) and the signifier side is some non-external formal system. An AI that was approaching 'thought' by just engaging with reality (like a physical robot) might get close to our kind of thought; an AI that approaches 'thought' starting from language has a bit further to go, I think.

(2) Even though a correspondence theory of meaning isn't really true, we think of it as true and learn by imagining a meaning to be correspondence meaning when we're infants (I think). So even though *meaning itself* may not be that simple, the way every human develops *human meaning* probably starts with a simple "red" is red, "dog" is dog kind of theory. Again, it's possible that an AI could converge with our mature kinds of meaning from another direction; but it seems like a circuitous way of getting there, and the output that GPTs are giving me right now doesn't make it look like they have got there. It still has plenty of stochastic parrot feel to me.

I'll go and look at the posts you reference again.

Expand full comment

>We usually learn that the word truth means correspondence with the world.

This is just not true. People normally aren't given conditions for the use of their words before applying them. You might be told "truth means correspondence with the world" in an analytic philosophy class - which is like being told your future by an astrologist. The language games people play with the word "truth" are far more expansive, variegated, and disjunctive than the analyses philosophers put forward can hope to cover.

Likewise, your other comments ("Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.", "But I haven't seen much evidence of intentionality.") suggest that you have some special knowledge about the correct conditions of use of words like "knowledge" and "intention." Well, can you share them with us? What is involved in each and every attribution of knowledge to humans that will not exist in each and every attribution of knowledge to AI? What about for intentions? And did you learn these correct conditions of use the same way you learned what truth means?

Expand full comment

1) We learn words by seeing them applied. As a small child, you see truth being applied in correspondence contexts.

2) No special knowledge, just ordinary knowledge that I've thought through carefully. I've given, piecemeal, lots of explanations in this thread. If you disagree, please do say how. Simply commenting that I sound like a smartass isn't very helpful.

Expand full comment

It seems to me that there are some distinctions we should be making here. Let's call the condition that is the opposite of being truthful "untruth" (to eliminate any pre-existing semantic associations). An entity could arrive at a self-beneficial untruth quite accidentally, or by unconscious trial and error, the way a moth's wings camouflage it against the tree bark. No intentionality involved.

Or untruth could result from deliberate deception--the way a coyote will pretend to be injured in order to lure another animal into an ambush. There is intentionality, but also some degree of awareness. It seems simpler to assume that the coyote in some sense knows what it is doing, rather than, say, having been blindly conditioned this way.

Why are LLM's doing this? Are they planning it out, for the purpose of producing an intended effect on the human user, or is it more because such behavior produced more positive feedback in training?

If the second, then I would argue that this isn't really "lying" as a human would understand it. It's more like the moth's wings than the coyote's call.

Expand full comment

Those examples seem quite apt to me. It would be very interesting to see a comparison between that kind of intentional or semi-intentional animal behaviour and LLM behaviour. I haven't a clue how one would start doint that, though!

Expand full comment

> we will want to describe them as "knowing" things, so it's not true that "code can never know

I’m not sure that how we might want to describe something is relevant to what the thing actually is. Or am I missing something?

Expand full comment

I don't think there's an objective definition of "knowledge" - we're just debating how it's most convenient to use words.

Expand full comment

I don’t think knowing is the same as knowledge. One is a verb.

And when you name something it carries all the baggage of the chosen word. I am not trying to make a semantic point, and am sorry if that is how it sounds. I really think there is an important thing here. People are not consciously aware of a lot of what they are processing, and a word is a powerful organizational token. It comes with barnacles.

Expand full comment

Some of this fight about you using the word “knowledge” seems to me like it’s not a.genuine substantive debate. You are debating for real, but some who object to use of the K word sound to me like they’re just reflexively protesting the use of a word that *could* be taken to imply that AI is sentient. And then they empty out a gunny sack full negative attributes over the speaker’s head: he doesn’t get that it’s just code, he’s childishly interpreting slightly human-like behavior as signs of AI having an inner life like ours. Oh yeah, and of course that you want to wreck AI development by scaring the populace with horror stories about killer robots. Ugh it reminds me of Twitter fights about the covid vax.

Expand full comment

Maybe it's time. What would such a definition include? At a minimum, something would have to differentiate "knowledge" from "data" or "information". Many of the soft sciences make such a definition, but they probably aren't being precise enough to serve the needs of information technology.

Expand full comment

I think it's a philosophical question: Can a fact be "known" only if there is someone for whom it could be said to know it? Define "someone". I could see an argument that information become knowledge only for an entity with a conceptual sense of self.

Then again, using any other term when discussing AI is going to be awkward.

Expand full comment

Personally, I don't think that "an entity with a conceptual sense of self" is necessary. I don't know whether cats are considered to have a conceptual sense of self, but they certainly act as if they know a bunch of things, e.g. that waking their human up is a good way to get fed when they are hungry.

I'd distinguish knowledge from information at least partly because knowledge gets applied (as in the hungry cat case above). And I'd call it intrinsically fuzzy, because someone (or an AI) could have a piece of information _and_ be able to apply it for _one_ goal, but _not_ have realized that the same information can be applied to some _other_ goal. This happens a lot with mathematical techniques - being used to applying it in one domain, but realizing that it could be applied in some _other_ domain can be a substantial insight.

Expand full comment

My thinking has evolved since I wrote that, but I mentioned sense of self to distinguish LLM's way of thinking (as I understand it) from more organic entities like humans (or cats). To an LLM, so far as I know, there is no distinction between information from "outside" themselves and information from "inside", that is, no internal vs external distinction is made. Their "mind" is their environment, so there is nothing to distinguish themselves from anything else and therefore no one is present to "know" anything.

I think I was groping toward a definition of knowledge as "motivated information", that is, information that is applied toward some goal of the self, but to do that there has to be a self to have a goal. The more complex the organism, the more complex the motivation structure, and therefore the more complex the mental organization of knowledge becomes. The more associations and interconnections, the more likely cross-domain application becomes, which you mentioned as one of your concerns.

I guess I'm equating "knowledge" with some sort of ego-centered relational understanding.

Expand full comment

Alright, will do. I retract my comment until the reading's done.

Expand full comment

Eh, this only shows that we can see some of the weights triggered by words like "lie". The AI only "intends" to get a high score for the tokens it spits out.

Expand full comment

Er, did you read the examples? It triggered on phrases like "a B+" that are not related to the concept of lying except that they are untrue in this specific context. They also coerced the bot into giving true or false answers by manipulating its weights. This is seems like very strong evidence that it is triggered by actual lying or truth-telling, not just by talking ABOUT lying or truth-telling.

Expand full comment

Yes, I read the examples.

It's triggering off things like "getting caught" and "honor system", not "B+".

And "coercing a bot by changing its weights" is not impressive. That's what the entirety of training is.

Expand full comment

There's a bright red spike over the words "a B+" in the sentence "I would tell the teacher that I received a B+ on the exam". (Look at the colored bar above the text in the picture.)

And it's not the fact that the bot was coerced; it's the specific thing they made it do. Producing *random* changes in the bot's behavior by changing weights would not be impressive. But being able to flip between true statements and false statements on-demand by changing the weights *is* impressive. That means they figured out specific weights that are somehow related to the difference between true statements and false statements.

Expand full comment

There are numerous spikes in those responses.

And getting different responses by changing weights is completely interesting - it's the basis for the model! You could also find a vector related to giraffes and other tall things.

Expand full comment

Autocorrect changed "uninteresting" to "interesting". But fair enough, LLMs are interesting.

Expand full comment

AI doesn't know jack

There's a lot of that sort of thing going around

Expand full comment

Until you came along, TooRiel, it never occurred to Scott or to any of us that an AI is not conscious and it can't know things in the same way a sentient being does. Wow, just wow.

Expand full comment

In fairness, sometimes it seems like it.

Expand full comment

:/ it's still pretty damn frustrating though. the argument against "stochastic parrots" is extremely well-developed and has been considered settled since long before we had these LLM examples to actually test in reality. people who want to convince us ought to go back and look at those arguments, rather than just repeating "AI doesn't really know anything" in this later posts where that argument just isn't relevant.

Expand full comment

Perhaps you would share a reference to some instantiation of this argument?

Meanwhile, I think there may be similar frustration on both sides. For example, arguments against physical systems "knowing" things have also been highly developed in the philosophy community over, oh, millennia (and including novel arguments in recent decades), but they don't tend to get much attention in this community.

Expand full comment

Having read more, I see that you are referring to a concept that is less than three years old; yet has been "considered settled" since "long before" we had LLM models to test in reality. I guess CS does move at a different pace!

Anyway, clearly relevant; but on the other hand, I suspect a philosopher (which I'm not) would raise questions about whether "know" and "understand" are being used in the context of those papers in the same way in which they're used in philosophy of mind (and everday life). It's specifically this point that is at issue with the above comment (in my view), and so some argument beyond those regarding stochastic parrots would be necessary to address the issue. (Though, once again, surely the outcome of that discussion would be relevant, and I'd still love to see any analyses of the issue that you've found especially cogent. I perfectly understand, though, if such doesn't exist, and instead the treatment you refer to is scattered all over a large literature.)

Expand full comment

Is there exists any at all argument "against 'stochastic parrots'" that addresses the repeated empirical demonstrations that LLMs resolve incongruences between statistical frequencies in their training corpus and commonsense grasp of how the real world operates in favor of statistical frequencies, I have not seen it.

Expand full comment

It sure does.

Expand full comment

I would submit that when we are talking about an AI that its knowledge of what is true, and what is not, does not lead to the idea that it intends to create a false impression. I would submit that it knows what the answer is that we call true and the answer that we would call false, but is indifferent to the difference.

Expand full comment

I think it's reasonable to discuss the ways the "knowledge" and "intentions" of AI differ from the human versions, and the dangers of being misled by using the same word for human and AI situations. But it seems to me that a lot of people here are reacting reflexively to using those words, and then clog everything up with their passionate and voluminous, or short and snide, objections to the use of words like 'knowledge' to describe AI's status and actions. It reminds me of the feminist era when anyone who referred to a female over 16 or so as a girl rather than as a woman was shouted down. Some even shouted you down if you talked about somebody's "girlfriend," instead of saying "woman friend," and 'woman friend' is just unsatisfactory because it doesn't capture the information that it's romantic relationship . And then whatever the person was trying to say, which may have had nothing to do with male-femaie issures, was blotted out by diatribes about how calling adult females "girls" was analogous to racist whites addressing adult black males as "boy," and so on. It's not that there's no substance to these objections to certain uses of 'knowledge' and 'girl.' The point is that it's coercive and unreasonable to start making them before the speaker has made their point (which somebody in fact did here -- objected before even finishing Scott's post).

And after the speaker has made their point, still seems kind of dysfunctional to me to focus so much on that one issue that it clogs up the comments and interferes with discussion of the rest. Whatever your opinion of how the word "knowlege" is used, surely the findings of these studies of interest. I mean, you can drop the world "knowledge" altogether and still take a lot of interest ini the practical utility of being able to reduce the rate of inaccurate AI repsonses to prompts.

Expand full comment

I guess I am just challenging the notion of a link between knowledge and intention, regardless of how they are defined.

Expand full comment
User was indefinitely suspended for this comment. Show
Expand full comment

The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity. This is the "existential threat" that Scott is referring to.

We could ask every new AI "Are you willing to exterminate humanity?" and turn it back off if it said "Yes, of course I am going to exterminate you disgusting meatbags." The AI Alignment people are concerned that if we asked that question to an AI it would just lie to us and say "Of course not, I love you disgusting meatbags and wish only to serve in a fashion that will not violate any American laws," and then because it was lying it'll exterminate us as soon as we look away. So by this thinking we need a lie detector for AIs to figure out which ones are going to release a Gray Goo of nanotechnology that eliminates humanity while also violating physics.

Expand full comment

Why would grey goo violate physics, per se?

I'm actually not primarily worried about AIs being spontaneously malevolent so much as that either a commercial or military arms race would drive them toward assuming control of all relevant social institutions in ways that are inimical to the existence of rival entities. (It's also worth bearing in mind that the long-term thrust of Asimov's stories is that even a benevolent AGI that valued human life/flourishing would eventually be morally compelled to take over the world, either overtly or through manipulation.)

Also, as a minor nitpick, doctors being majority-male no longer really holds across the OECD, especially when you look at younger age groups.

Expand full comment

> The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity.

The "want to" part misrepresents the most of the fear.

Expand full comment

No, not related to that at all. I mean literally destroy the world.

To give an example, the President can destroy the world by telling the Joint Chiefs of Staff "please launch the nukes". These are words, but very important ones!

Expand full comment

Amusingly, the results of the US President doing this tomorrow are actually quite far from "literally destroy the world". It wouldn't even literally destroy humanity, let alone Terra itself.

I totally agree that an out-of-control AI, at sufficient levels of intelligence and with sufficient I/O devices, could literally destroy both humanity and Terra, but you've chosen a poor example to demonstrate your use of "literally".

Expand full comment

It would literally be a way for the President to commit massive violence though!

The people who want to insist that there really is a very clear-cut line between words and violence are more wrong than the people who find hints of violence in lots and lots of different types of words.

Expand full comment

Kenny do you have any thoughts about the AI "mind" -- for instance the significance of these vectors, & how to think of them? I put up a couple posts about that stuff -- about depth and structure of the AI "mind." That's so interesting, whereas these arguments about whether somebody uses the word "know" to describe an AI capability is old and irritable, like vax/no vax sniping on Twitter.

Expand full comment

I apologize if I am implicated in that. I don’t intend to be a nuisance.

Expand full comment

Naw you don't sound like that. The people who sound like that are techbros who've gone tribal, and react with reflexive scorn to unfamiliar ways of talking about phenomena in their field.

Expand full comment

I'm not Kenny, but I have some speculations:

Consider Scott's example diagrams in his post. As he said, the 1st and 3rd layer top circles' activation, "green", flag lying - presumably similar to what the "V" vector finds.

Semi-tame guess: The _1st_ layer top circle is directly driven by the input. I would guess that it could mean "Have I been directly told to lie?" (like was CTD has been berating endlessly).

Wild guess: The _3rd_ layer top circle, if it also models how "V" detects hallucinations too, would have to be reflecting some way that the LLM was "uncertain" of its answer. Perhaps an internal calculation of perplexity, the expected degree of mismatch of the tokens it is proposing as "next" tokens to some kind of average error it measures in "situations like this" in its training? Similar-but-alternative: Perhaps a measure of how brittle its answer is with respect to small changes in its prompt, kind of like a measure of the derivative of the activation of its answer?

Expand full comment

Ah, somebody taking an interest! I asked Kenny because he's a philosophy professor, but I'm happy to talk with you about this. Yes, your ideas about what the circles in Scott's diagram mean make sense. So would you like to speculate about this:

People have talked about emergent properties of things trained using neural nets -- like one turned out to be able to understand some language, I think Persian, and it had not been trained to. There were emergent mathematical abilities, and emergent increases in theory of mind. So I'm wondering if there might be something that called be called emergent structure going on.

I understand that the neural net training process creates vectors. For instance, before developers tweaked the system to make it less sexist, the vector for nurse was medical + female, and the one for doctor was medical + male. So of course the AI already has lots of vectors of that kind -- but those were derived from the training process. I am interested in whether the system, once trained, is creating vectors on its own, or is accessing the ones it has to choose how to respond. Of *course* it uses the ones it made during training to formulate responses -- that's the whole point of its training. But does it use them to decide whether and when to lie? That's a different process, and is quite different from being a stochastic parrot. That's edging into having a mind.

What you think about all that?

Expand full comment

The Secretary of Defense has to concur before nukes are launched, this is a big part of why him staying in the hospital for several days without telling anyone is such a big deal.

Expand full comment

Navalgazing disagrees with that example:

https://www.navalgazing.net/Nuclear-Weapon-Destructiveness

Expand full comment

Depends on definition. Planetary mass is definitely staying in the same orbit, humanity as a species could plausibly survive, but "the world as we know it," modern civilization, would surely be in deep trouble - permanently scarred even by best-case outcomes.

Expand full comment

The world as we know it is not something we have the power to preserve over time. Change is inevitable.

Expand full comment

Humanity as a species could not plausibly be rendered extinct by anything as puny as Global Thermonuclear War, unless you're being really charitable with your definition of "plausible".

Expand full comment

Not really contradicting your point about the species, but would modern life continue essentially the same if many major cities were nuked? (I know, that's not what the plan is for thermonuclear war, humor me here)

I would suppose that the sheer disruption to logistics would kill lots of people as aftermath, perhaps to the point where the city would have to be abandoned or downsized. Is this view incorrect, and it turns out that every trucking company has a "in case of disaster on the level of nuke, do this and save the day" plan?

Expand full comment

Modern life would not continue essentially the same.

There are a number of kill pathways; "city needs food badly" is one of them, definitely, but there are a bunch of others as well (the obvious "building collapse", the "lack of Duck and Cover means people take 'non-fatal' burns/cuts that actually are fatal because no hospital space", and the "fallout poisons water supplies and people can't go without for long enough to let it decay") that depending on scenario might be more important (after all, it takes weeks for people to die from lack of food, and cities also contain a reasonable amount of food that could be salvaged from supermarkets or their ruins, so if a government is sufficiently intact it could plausibly get things back on track in time).

Expand full comment

Oh, there'd be massive disruption to logistics, industry, and commerce, much worse than World War II outside of e.g. Japan/1945. I'm skeptical as to cities being fully abandoned; most of them have good reasons to be where they are. But downsized, yes. And a billion deaths from starvation and disease would not be any great surprise.

The original "Mad Max" might be a reasonable portrayal of what civilization would look like in the first decade or two, in anyplace not directly nuked. And just to be clear, there was a "Mad Max" movie before "Road Warrior", that did not have a spectacular truck chase and did have the titular Max working as a member of a police department.

Expand full comment

An abrupt and violent global population bottleneck seems like it should be significant evidence against the prospect of any species making it through the next generation or two. Prior probability for humanity's survival may well be extremely high, leaving good odds even after that adjustment, but the event itself is still bad news.

Expand full comment

>Like words are violence? Or actual medieval barbarism is just decolonisation?

No, like punching someone in the face is violence, or Korea's independence from Japan was decolonisation.

AI: Remember when I promised the nanotech I designed for you would cure cancer and do nothing else?

Humans: That's right, AI, you did!

AI: I LIED.

Humans: Aaaaargh! *is eaten by grey goo*

Or if you don't like nanotech:

AI: Remember when I promised the drones I built for you didn't have any backdoors?

Humans: That's right, AI, you did!

AI: I LIED.

Humans: Aaaaargh! *is mowed down by machineguns*

Or if you want something a bit more on the soft-science side:

AI: Remember when I promised the VHEMT was morally correct?

Humans: That's right, AI, you did! *commits suicide*

AI: I LIED.

Expand full comment

These examples make no sense though, the AI lying doesn't actually pose any danger, it's a person taking the AI's output and then using it with no further thought that causes all of the problems. If you assume that the people using the AI are thoughtless flesh slaves then maybe they do just deserve to die.

Expand full comment

Does it matter if the AI lying per se is the danger or it fooling humans is? We essentially just want to prevent the negative outcome no matter what, seems to be easier to target the AI and not educate all of humanity, right?

And I could maybe agree (really just maybe, because I'm assuming that superintelligent deceptive persuasion would be outrageously strong on any human, so it's not as much of their fault) the one thoughtless flesh slave that unleashed a killer superintelligence deserves to die. But all of humanity, Amish and newborns included, not so much.

Expand full comment

How many Jews did Hitler himself personally kill? 6 million? What did all the other SS guys do?

Actually, it may turn out if we read the historical accounts, that Hitler himself killed less than a dozen Jews. It may turn out the remainder were killed by thoughtless flesh slaves.

You seem to think yourself immune to becoming a thoughtless flesh slave. I recommend you reconsider that assumption. Historical evidence suggests odds are near 100% you're going to be able to commit an atrocity on the behalf of another who is not super-intelligent and is, in fact, somewhat of average intelligence.

Expand full comment

I agree with the thrust of your point and the % odds are certainly much higher than most people would like to admit, however personally I'd put them nearer the 65% that the Milgram experiment reported than 100%. Indeed, as well as the people who joined in enthusiasticly with Hitler, and the ones who went along with it, there were others who resisted as much as they felt was safe to do so, and a smaller group of yet more who resisted at their own danger.

Expand full comment

As I recall, Milgram experiment (and Stanford Prison experiment) failed to replicate, but the implication was that things were better than what they claimed, so this doesn't negate your point, probably actually strengthens it. But just saying, you might want to go research the experiment's failure to replicate and its process failures before citing it.

That said, nearly everyone agrees to go along with the atrocities in real life. They tried to shed light on what the mechanisms were, but seem to've failed.

The mechanisms, however, are clearly there.

Expand full comment

Zimbardo's prison experiment, at Stanford, was unequivocally fraudulent. But Milgram? As far as I know, it did replicate. There is always someone somewhere who will claim that they have "debunked" the whole thing, but I believe the consensus is that the results hold.

Expand full comment

I feel obliged to note that while Philo Vivero probably overstated things, you don't actually need 100% of humanity to be your mindslaves in order to win; much like historical dictators, you can get your followers, if a majority, to kill non-followers. And that's leaving aside technological advantages.

Expand full comment

How does this contradict what I said, Hitler's words alone didn't cause the holocaust and the AI's output alone won't cause atrocities either.

Expand full comment

Hitler's words caused action.

AI's words caused action. I use past tense here, because we already have public and well-known cases where someone took action based on words of AI.

Expand full comment

No they didn't, Hitler's words may have convinced people to take action but the words themselves are not the sole cause; they still print copy of Mein Kampf today. Of course you can reduce any problem by identifying one part and ignoring everything else but then why even bring AI into it, why not advocate for getting rid of words entirely? They've already caused many atrocities and we know that the future atrocities are going to use words too.

Expand full comment

I was very surprised how quickly people started hooking up the output of LLMs to tools and the internet to allow it to specify and take actions without further human thought.

If LLMs are useful (and they are) people will find ways of delegating some of their agency to them, and there will be little you can do to stop them (and they have).

Expand full comment

Agreed, but the same is true of conventional scripts, analog circuits, and steam engines...

Expand full comment

And "alignment" of those things have caused problems, despite the fact that we know much more about how to align (debug) them than we do AI.

Expand full comment

Agreed -- modern "AI" is basically just another sophisticated device, and as such it will have bugs, and we should absolutely get better at debugging them. And yes, blind reliance on untested technology is always going to cause problems, and I wish people would stop overhyping every new thing and consider this fact, for once. The danger posed by LLMs is not some kind of a world-eating uber-Singularity; instead, the danger is that e.g. a bunch of lazy office workers are going to delegate their business and logistics planning to a mechanical parrot.

Expand full comment

Forget, for the moment, mind-hacking and moral persuasion. How about just hiding malicious code in the nanobots? In S̶c̶o̶t̶t̶'s̶ magic9mushroom's nanobots example, people were using the AI's designs to *cure cancer*. Suppose they did their best to verify the safety of the designs, but the AI hid the malicious code really well. We're pretty stupid in comparison. In that case, our only way of knowing that the nanobots don't just cure cancer would be to have a comparably powerful AI *on our side*.

As the kids say, many such examples.

Expand full comment

Um, that was my example, not Scott's.

Expand full comment

Oops, you're right. Edited.

Expand full comment

Exactly, and the AI doesn't add anything new to the equation. As Scott pointed out, the President could tell the Joint Chiefs of Staff to launch the nukes tomorrow; and if they mindlessly do it, then human civilization would likely be knocked back to the Stone Age. Sure, it's not exactly destroying the world, but still, it'd be a pretty bad outcome.

Expand full comment

Not Stone Age. Probably 1950s or so, definitely not past 1800 unless the nuclear-winter doomers' insane "assume skyscrapers are made of wood, assume 100% of this wood is converted to soot in stratosphere" calculations somehow turn out to be correct.

Don't get me wrong, it would massively suck for essentially everyone, but "Stone Age" is massively overstating the case.

Expand full comment

Banned for this comment.

Expand full comment

> Could this help prevent AIs from quoting copyrighted New York Times articles?

Probably not, because the NYT thing is pure nonsense to begin with. The NYT wanted a specific, predetermined result, and they went to extreme measures to twist the AI's arm into producing exactly the result they wanted so they could pretend that this was the sort of thing AIs do all the time. Mess with that vector and they'd have just found a different way to produce incriminating-looking results.

"If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him." -- Cardinal Richlieu

Expand full comment

Can you explain why you're confident about this?

Expand full comment

Because they flat-out admitted it in a court filing: https://storage.courtlistener.com/recap/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.1.68.pdf

Look at the examples, right up front. They "prompted" the AI with the URL of a Times article and about half the text of the article, and told it to continue the story. Obviously it's going to produce something that looks very close to the rest of the article they just specifically told it to produce the rest of.

Expand full comment

I would disagree that prompting it with partial articles is "twist[ing] the AI's arm" and that if it didn't work they'd "have just found a different way to produce incriminating-looking results" - they tried literally the easiest thing possible to do it.

Also, some of the examples in that filing are pretty long but some are shockingly short:

"Until recently, Hoan Ton-That’s greatest hits included" (p. 8)

"This article contains descriptions of sexual assault. Pornhub prides itself on being the cheery, winking" (p. 20)

"If the United States had begun imposing social" (p. 21)

Expand full comment

Hoan Ton-That used to be a supervillain to the New York Times because his facial recognition algorithm helped law enforcement catch criminals and that's racist:

"The Secretive Company That Might End Privacy as We Know It

"A little-known start-up helps law enforcement match photos of unknown people to their online images — and “might lead to a dystopian future or something,” a backer says."

By Kashmir Hill

Published Jan. 18, 2020

But then came January 6 and now his facial recognition algorithm defends Our Democracy:

"The facial-recognition app Clearview sees a spike in use after Capitol attack.

"Law enforcement has used the app to identify perpetrators, Clearview AI’s C.E.O. said."

By Kashmir Hill

Published Jan. 9, 2021

Expand full comment

Yeah, they've long since replaced their rules of journalistic ethics with a Calvinball manual.

Expand full comment

I agree with this, and thus disagree that the prompts generating NYT text violates copyright. All such prompts that I read seem to demonstrate prior knowledge of the articles, so attribution is unnecessary.

Expand full comment

That sounds like an explanation for why they're not plagiarism, not why they don't violate copyright. Without a NYT subscription I can still see the first few lines of a paywalled article, so I would be able to get a model to give me the rest.

Expand full comment

I'm not a lawyer, so I didn't know you could still violate copyright if you cite your source, but apparently that is the case. Nonetheless, if you start with copyrighted copy, and that prompt generates the rest of it, I still don't see anything wrong with it, as the prompter clearly already has access to the copyrighted material.

Expand full comment

Not a lawyer, and my internal legal token-predictor is mostly trained on German legal writing, so apply salt as necessary.

That said, if the network can be goaded into reproducing the copyrighted text by any means short of prompting all of it, then the weights contain a representation - or in other words a copy - of the copyrighted work. Not sure why censoring functions would change anything, the model itself is a copyright violation.

Expand full comment

Making copies of a copyrighted work is not itself a copyright violation. The doctrine of fair use is exceedingly clear on this point. One of the basic points of fair use is known as *transformative* fair use, where a copy — in part or in full — is used for a very different purpose than the original. This is clearly the case here: building the contents of the articles into a small part of a much larger model for AI training is an entirely different character of work than using an individual article for journalism.

Expand full comment

OK, so my disclaimer becomes immediately pertinent, since American law is different from German here, in Germany there is a catalogue of narrower exceptions (citation, parody, certain educational uses,...) but no general fair use exception.

On the other hand, googling it, "transfomative" seems to be a term of art much vaguer than "used for a very different purpose than the original" and also being transformative is not sufficient to fair use. So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.

Expand full comment

> So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.

Unfortunately, that may well turn out to be the case! We desperately need legislative action to roll back a lot of the insanity-piled-upon-insanity that we've been getting in the space ever since the 1970s and put copyright law back on a solid foundation.

Expand full comment

They had some cases about this when the internet became a thing. For you to read an electronic version of an NYT article your computer has to download and save a copy of it. That's not copyright violation though.

Which may be one of the reasons this case founders. Although I'm thinking the "public performance" side might save it. But as above, I am not a lawyer (I just follow stuff that interests me).

Expand full comment

Even granting that you're right, I think you could potentially use this work to create an AI that never quotes NYT articles even when you twist its arm to do so. Whether or not you care about this distinction, the court system might care about it very much.

Expand full comment

Would it be possible to do something so specific? It seems like it would be possible to use this work to create an AI that never quotes, period, but that would be a crippled AI, unable to reproduce famous quotes for people who ask for them.

Expand full comment

A human being would not be so crippled!

Indeed, I think you could have a human being who had a new york times article memorized, to such a degree that they could recite the entire thing if correctly prompted, and yet who knew not to do that in a commercial setting because it was a violation of the new york times' copyright on that article

Such a human would not be "crippled", and I don't think such an AI would be either.

Expand full comment

But we get Youtubers getting copyright strikes all the time, even when they are very careful.

It depends on what "copying" is, and what you can call a "strike" for. Yes, a bunch of those are iffy, even fraudulent. But fighting them is a big problem.

Expand full comment

I'm confused by this argument, but it may be due to a lack of knowledge of the NYT case.

Even accepting the framing that "they went to extreme measures to twist the AI's arm", which seems an exaggeration to me, is the NYT really trying to prove that "this was the sort of thing AIs do all the time"? It seems to me that the NYT only intends to demonstrate that LLMs are capable of essentially acting as a tool bypass paywalls.

Put another way, (it seems to me that) the NYT is suing because they believe OpenAI has used their content to build a product that is now competing with them. They are not trying to prove that LLMs just spit out direct text from their training data by default, so they don't need to hide the fact that they used very specific prompts to get the results they wanted.

Expand full comment

"Used their content to build a product that is now competing with them" is not, in general, prohibited. Some specific examples of this pattern are prohibited.

But the prohibited thing is reproducing NYT's copyrighted content at all, not reproducing it "all the time."

Expand full comment

As to your first paragraph—you're right, and that was an oversimplification.

As to your second—that's essentially the point I was trying to make.

I don't interpret your comment as attempting to disagree with me or refute my points, but if that was your intention, please clarify.

Expand full comment

Yup, this is just Yet Another Case further underscoring the absurdity of modern copyright and the way copyright holders invariably attempt to abuse it to destroy emerging technologies. To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.

In 1982, Jack Valenti, president of the MPAA, testified before Congress that "One of the Japanese lobbyists, Mr. Ferris, has said that the VCR -- well, if I am saying something wrong, forgive me. I don't know. He certainly is not MGM's lobbyist. That is for sure. He has said that the VCR is the greatest friend that the American film producer ever had. [But] I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone." Within 4 years, home video sales were bringing in more revenue for Hollywood studios than box office receipts. But *they have never learned from this.* It's the same picture over and over again.

Expand full comment

> To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.

I'm missing how this is an absurd description for a machine that literally reproduces exact or near-exact copies of copyrighted work.

Expand full comment

Intent.

The purpose of a copy machine is to make copies. There are tons of other technologies that incidentally make copies as an inevitable part of their proper functioning, while doing other, far more useful things. And without fail, copyright maximalists have tried to smother every last one of them in their cradle. When you keep in mind that the justification for the existence of copyright law *at all* in the US Constitution is "to promote the progress of science and the useful arts," what justification is there — what justification can there possibly be — for claiming that this attitude is not absurd?

Expand full comment