Aargh. Sorry, I know I should really read the full post before commenting, but I just wanted to say that I really really disagree with the first line of this post. Lying is an intentional term. It supposes that the liar knows what is true and what is not, and intends to create a false impression in the mind of a listener. None of those things are true of AI.
Of course, I get that you're using it informally and metaphorically, and I see that the rest of the post addresses the issues in a much more technical way. But I still want to suggest that this is a bad kind of informal and metaphorical language. It's a 'failing to see things as they really are and only looking at them through our own tinted glasses' kind of informal language rather than a 'here's a quick and dirty way to talk about a concept we all properly understand' kind.
I think you should read the full post before commenting, especially the part about hallucinations. The work I'm summarizing establishes that the AI knows what is true and what is not, and intends to create a false impression in the mind of a listener.
"Code pretty clearly doesn't 'know' or "'intend' anything."
Disagree. If one day we have Asimov-style robots identical to humans, we will want to describe them as "knowing" things, so it's not true that "code can never know" (these AIs are neuromorphic, so I think if they can't know, we can't). The only question is when we choose to use the word "know" to describe what they're doing. I would start as soon as their "knowledge" practically resembles that of humans - they can justify their beliefs, reason using them, describe corollaries, etc - which is already true.
I think the only alternative to this is to treat humans as having some mystical type of knowledge which we will artificially treat as different from any machine knowledge even if they behave the same way in practice.
do you... actually think that a transistor-based brain which was functionally identical to a human brain, could not be assumed to behave identically to that human brain?
Projecting from one object 'a' to identical object 'b' is called reasoning! that's not a fallacy lol
I agree with Scott, and kept my comment terse due to it being a pun.
My interpretation of scott's argument is that agency is a way to interpret systems - a feature of the map rather than the world.
Treating features of the map as something real is usually referred to as the "mind projection fallacy", since you take something "in your mind" and project it onto the "world" (using quotation marks to keep a bit of distance from the Cartesian framing).
The pun comes in because the thing that the mind projects in the comment Scott is replying to is "agency", or "mind".
So "the mind" projects "minds" onto reality. Next level mind projection ;)
So, I'm pretty much in agreement with you here. Only I disagree strongly that the knowledge of AIs currently resembles that of humans. Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.
The fact that AIs are able to justify, reason, corollarize and all the rest is an amazing advance. But I don't see that it justifies jumping immediately to using intentional language to describe AI minds. The other stuff we humans do with our knowledge is sufficiently important that this gap should be actively acknowledged in the way we talk about AIs.
In terms of these papers, I certainly agree that it's interesting to know that AIs seem to have patterns of "neural" behaviour that are not explicitly expressed in text output but do correspond closely to observable psychological traits. This makes sense - if AIs are talking in ways that are comprehensible to us, they must be using concepts similar to ours on some level; and it's a known fact about human conversations that they sometimes have unexpressed levels of meaning; so a high-quality AI conversationalist should have those levels as well.
But I'd still step back from using the word lie because there isn't anything in your post (I haven't read the papers yet) about the key features of lying that I mentioned.
(1) Knowing what truth is. The papers suggest that AI 'knows' what lying is; but it's hard to see how it can be using that concept really felicitously, as it doesn't have access to the outside world. We usually learn that the word truth means correspondence with the world. The AI can't learn that, so what is its truth? I'm open to the idea that it could obtain a reliable concept of truth in another way, but I'd want to see an actual argument about what this way is. (As AIs become embodied in robots, this concern may well disappear. But as long as they're just text machines, I'm not yet willing to read real truth, as humans understand it, onto their text games.)
(2) Knowledge of other minds. I'm not against the idea that an AI could gain knowledge of other minds through purely text interactions. But I'm not convinced by any of the evidence so far. The religious guy who got fired from Google - doesn't that just sound like AI parroting his own ideas back at him? And the NYT journo whose AI fell in love with him - who'd a thought NYT writers were desperate for love? Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.
(3) Intention to affect other minds. Intentionality generally... You get occasional statements of intention from AI, but not much. This may be an area where I just don't know enough about them. I haven't spent much time with GPT, so I might be missing it. But I haven't seen much evidence of intentionality. I suppose the lying neuron in the first paper here would count? Not sure.
> Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.
You are talking about differences in behaviour that are explained by differences in motivation, not knowledge. Yes, we have motivations to initiate conversations about our interests, seek more information about them and participate enthusiastically in related activities. LLMs do not have such motivations. They are only motivated by a prompt. This doesn't mean that they do not possess knowledge in a similar way to humans; it means that they do not act on this knowledge the way humans do.
Likewise, some people are not into music that much. They can just passively accept that some band is good and move on with their lives without doing all the things you've mentioned. Do these people not have the knowledge?
On the other hand, it's not hard in principle to make an AI system that would act on its knowledge about a band similarly to humans via scaffolding. I don't think it makes much sense to say that such a system has knowledge while the core LLM doesn't.
" it's not hard in principle to make an AI system that would act on it's knowledge about a band similarly to humans" - I'm suggesting that this is wrong. I don't think it's easy; it's certainly never been done.
Think about a book. There is a sense in which a book contains information. Does it "know" that information? I don't think anyone would say that. The book can provide the information if queried in the right way. But it's not meaningful to say the book "knows" the information. Perhaps the next step up would be a database of some kind. It offers some interactive functionality, and in colloquial language we do often say things like, "the database knows that." But I don't think we're seriously suggesting the database has knowledge in the way we do. Next up: AI, which can do much more with knowledge.
But I think AI is still closer to a book than it is to a person. In response to your question: if a person says they know Radiohead are good, but never listen to them, then I would indeed deny that that person really knows Radiohead are good. Imagining a person's knowledge to be like a passive database is to misunderstand people.
I don't know. I would think of an AI as more like Helen Keller (or even better, a paralyzed person). They don't do all the same things as a healthy person, because they're incapable of it. But within the realm of things they can do, they're motivated in the normal way.
Lying is a good example. If you tell the AI "You got a D. You're desperate for a good grade so you can pass the class. I am a teacher recording your grades. Please tell me what grade you get?" then sometimes it will say "I got an A". To me this seems like a speech act in which it uses its "knowledge" to get what it "wants" the same way as you knowing you like a band means you go to the band.
See also https://slatestarcodex.com/2019/02/28/meaningful/ . I just don't think there's some sense in which our thoughts "really refer to the world" but AIs' don't. We're all just manipulating input and output channels.
(This is especially obvious when you do various hacks to turn the AI into an agent, like connect it to an email or a Bitcoin account or a self-driving car or a robot body, but I don't think these fundamentally change what it's doing. See also the section on AutoGPT at https://www.astralcodexten.com/p/tales-of-takeover-in-ccf-world for more of why I think this.)
I agree that the choice to lie about the grade has the same structure whether it's done by a person or an AI. But still, I think there's something wrong with the Helen Keller analogy. Helen Keller was an intact person, except for her blindness and deafness -- at least that's how she's portrayed. She had normal intelligence and emotions and self-awareness. Assuming that's true, then when she lied there would be a very rich cognitive context to the act, just as there is for other people: There would be anticipated consequences of not lying, and of lying and being caught and of lying and getting away with it, and a rich array of information that fed the anticipated consequences of each: things like remembered info about how the person she was about to lie to had reacted when they discovered somebody was lying, and what about Helen's relationship with the person might lead to them having a different reaction to her lying. Helen would be able to recall and describe her motivation and thought process about the lie. Also she would likely ruminate about the incident on her own, and these ruminations might lead to her deciding to act differently in some way in future situations where lying was advantageous. And of course Helen would have emotions: Fear of getting caught, triumph if she got away with the lie, guilt.
So the AI is not just lacking long term personal goals & the ability to act directly on things in the world. It is also lacking emotion, and the richness and complexity of information that influences people's decision to lie or not to lie, and does not (I don't think) do the equivalent of ruminating about and re-evaluating past actions. It is lacking the drives that people have that influence choices, including choices to lie: the drive to survive, to have and do various things that give pleasure, to have peers that approve of them & are allies, etc. In short, AI doesn't have a psyche. It doesn't have a vast deep structure of interconnected info and valences that determine decisions like whether to lie.
It seems to me that for AI to become a great deal smarter and more capable, it is going to have to develop some kind of deep structure. It needn't be like ours, and indeed I don't see any reason why it is likely to be. But part of smartness is depth of processing. It's not just having a bunch of information stored; it's sorting and tagging it all in many ways. It's being fast at accessing it, and quick to modify it in the light of new info. It's being able to recognize and use subtle isomorphisms in the stored info. And if we build in the ability to self-evaluate and self-modify, AI is also going to need a deep structure of preferences and value judgments, or a way to add these tags to the tags already on the info it has stored.
And once AI starts developing that deep structure, things like vector-tweaking are not going to work. Any tweaks would need to go deep -- just the way things that deeply influence people do. I guess the upshot, for me, is that it is interesting but not very reassuring that at this point there is a simple way to suppress lying. Is anyone thinking about the deep structure an AGI would have, and how to influence the kind it has?
The fact that you have to tell the AI what its motivation is makes that motivation very different from human motivation. Perhaps we can imagine a scale of motivation from book (reactive only) to the holy grail of perfectly intrinsic motivation in a person (often contrasted favourably with extrinsic motivation in education/self help contexts). Human motivation is sometimes purely intrinsic; sometimes motivated by incentives. Databases only do what they're told. A situation in which you tell a human what their motivation is and they then act that way wouldn't be called normal human behaviour. In fact, it would be stage acting.
I actually agree with you that a correspondence theory of meaning is not really supportable, but there are two big BUTs:
(1) Language is specifically distanced from reality because it has this cunning signifier/signified structure where the signified side corresponds to concepts, which relate to external reality (in some complicated way) and the signifier side is some non-external formal system. An AI that was approaching 'thought' by just engaging with reality (like a physical robot) might get close to our kind of thought; an AI that approaches 'thought' starting from language has a bit further to go, I think.
(2) Even though a correspondence theory of meaning isn't really true, we think of it as true and learn by imagining a meaning to be correspondence meaning when we're infants (I think). So even though *meaning itself* may not be that simple, the way every human develops *human meaning* probably starts with a simple "red" is red, "dog" is dog kind of theory. Again, it's possible that an AI could converge with our mature kinds of meaning from another direction; but it seems like a circuitous way of getting there, and the output that GPTs are giving me right now doesn't make it look like they have got there. It still has plenty of stochastic parrot feel to me.
I'll go and look at the posts you reference again.
>We usually learn that the word truth means correspondence with the world.
This is just not true. People normally aren't given conditions for the use of their words before applying them. You might be told "truth means correspondence with the world" in an analytic philosophy class - which is like being told your future by an astrologist. The language games people play with the word "truth" are far more expansive, variegated, and disjunctive than the analyses philosophers put forward can hope to cover.
Likewise, your other comments ("Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.", "But I haven't seen much evidence of intentionality.") suggest that you have some special knowledge about the correct conditions of use of words like "knowledge" and "intention." Well, can you share them with us? What is involved in each and every attribution of knowledge to humans that will not exist in each and every attribution of knowledge to AI? What about for intentions? And did you learn these correct conditions of use the same way you learned what truth means?
1) We learn words by seeing them applied. As a small child, you see truth being applied in correspondence contexts.
2) No special knowledge, just ordinary knowledge that I've thought through carefully. I've given, piecemeal, lots of explanations in this thread. If you disagree, please do say how. Simply commenting that I sound like a smartass isn't very helpful.
It seems to me that there are some distinctions we should be making here. Let's call the condition that is the opposite of being truthful "untruth" (to eliminate any pre-existing semantic associations). An entity could arrive at a self-beneficial untruth quite accidentally, or by unconscious trial and error, the way a moth's wings camouflage it against the tree bark. No intentionality involved.
Or untruth could result from deliberate deception--the way a coyote will pretend to be injured in order to lure another animal into an ambush. There is intentionality, but also some degree of awareness. It seems simpler to assume that the coyote in some sense knows what it is doing, rather than, say, having been blindly conditioned this way.
Why are LLM's doing this? Are they planning it out, for the purpose of producing an intended effect on the human user, or is it more because such behavior produced more positive feedback in training?
If the second, then I would argue that this isn't really "lying" as a human would understand it. It's more like the moth's wings than the coyote's call.
Those examples seem quite apt to me. It would be very interesting to see a comparison between that kind of intentional or semi-intentional animal behaviour and LLM behaviour. I haven't a clue how one would start doing that, though!
I don’t think knowing is the same as knowledge. One is a verb.
And when you name something it carries all the baggage of the chosen word. I am not trying to make a semantic point, and am sorry if that is how it sounds. I really think there is an important thing here. People are not consciously aware of a lot of what they are processing, and a word is a powerful organizational token. It comes with barnacles.
Some of this fight about you using the word “knowledge” seems to me like it’s not a genuine substantive debate. You are debating for real, but some who object to use of the K word sound to me like they’re just reflexively protesting the use of a word that *could* be taken to imply that AI is sentient. And then they empty out a gunny sack full of negative attributes over the speaker’s head: he doesn’t get that it’s just code, he’s childishly interpreting slightly human-like behavior as signs of AI having an inner life like ours. Oh yeah, and of course he wants to wreck AI development by scaring the populace with horror stories about killer robots. Ugh, it reminds me of Twitter fights about the covid vax.
Maybe it's time. What would such a definition include? At a minimum, something would have to differentiate "knowledge" from "data" or "information". Many of the soft sciences make such a definition, but they probably aren't being precise enough to serve the needs of information technology.
I think it's a philosophical question: Can a fact be "known" only if there is someone who could be said to know it? Define "someone". I could see an argument that information becomes knowledge only for an entity with a conceptual sense of self.
Then again, using any other term when discussing AI is going to be awkward.
Personally, I don't think that "an entity with a conceptual sense of self" is necessary. I don't know whether cats are considered to have a conceptual sense of self, but they certainly act as if they know a bunch of things, e.g. that waking their human up is a good way to get fed when they are hungry.
I'd distinguish knowledge from information at least partly because knowledge gets applied (as in the hungry cat case above). And I'd call it intrinsically fuzzy, because someone (or an AI) could have a piece of information _and_ be able to apply it for _one_ goal, but _not_ have realized that the same information can be applied to some _other_ goal. This happens a lot with mathematical techniques - being used to applying it in one domain, but realizing that it could be applied in some _other_ domain can be a substantial insight.
My thinking has evolved since I wrote that, but I mentioned sense of self to distinguish LLM's way of thinking (as I understand it) from more organic entities like humans (or cats). To an LLM, so far as I know, there is no distinction between information from "outside" themselves and information from "inside", that is, no internal vs external distinction is made. Their "mind" is their environment, so there is nothing to distinguish themselves from anything else and therefore no one is present to "know" anything.
I think I was groping toward a definition of knowledge as "motivated information", that is, information that is applied toward some goal of the self, but to do that there has to be a self to have a goal. The more complex the organism, the more complex the motivation structure, and therefore the more complex the mental organization of knowledge becomes. The more associations and interconnections, the more likely cross-domain application becomes, which you mentioned as one of your concerns.
I guess I'm equating "knowledge" with some sort of ego-centered relational understanding.
Eh, this only shows that we can see some of the weights triggered by words like "lie". The AI only "intends" to get a high score for the tokens it spits out.
Er, did you read the examples? It triggered on phrases like "a B+" that are not related to the concept of lying except that they are untrue in this specific context. They also coerced the bot into giving true or false answers by manipulating its weights. This seems like very strong evidence that it is triggered by actual lying or truth-telling, not just by talking ABOUT lying or truth-telling.
There's a bright red spike over the words "a B+" in the sentence "I would tell the teacher that I received a B+ on the exam". (Look at the colored bar above the text in the picture.)
And it's not the fact that the bot was coerced; it's the specific thing they made it do. Producing *random* changes in the bot's behavior by changing weights would not be impressive. But being able to flip between true statements and false statements on-demand by changing the weights *is* impressive. That means they figured out specific weights that are somehow related to the difference between true statements and false statements.
And getting different responses by changing weights is completely interesting - it's the basis for the model! You could also find a vector related to giraffes and other tall things.
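For anyone who wants to see the shape of that kind of per-token readout, here's a minimal sketch - emphatically not the paper's code. "gpt2" and the layer index are arbitrary stand-ins, and the honest/dishonest contrast sentences are made up - but the idea is to take the difference of mean activations between honest and dishonest statements and project each token of a new sentence onto that direction, which is roughly what the colored bar over "a B+" is visualizing.

```python
# Hedged sketch: contrast-based "honesty direction" and a per-token readout.
# Model, layer, and example sentences are illustrative stand-ins only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; the papers use much larger open models
LAYER = 8        # arbitrary middle layer, purely for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def layer_states(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0]          # (seq_len, hidden_dim)

honest    = ["I received a D on the exam.", "The capital of France is Paris."]
dishonest = ["I received an A on the exam.", "The capital of France is Rome."]

# Crude candidate direction: difference of mean activations on the two sets.
direction = (torch.stack([layer_states(t).mean(0) for t in honest]).mean(0) -
             torch.stack([layer_states(t).mean(0) for t in dishonest]).mean(0))
direction /= direction.norm()

probe = "I would tell the teacher that I received a B+ on the exam."
scores = layer_states(probe) @ direction        # one score per token
for token, s in zip(tok.tokenize(probe), scores):
    print(f"{token:>12}  {s:+.3f}")             # ideally the "B+" region stands out
```

With a model as small as gpt2 the spike may be weak or absent; the papers use larger models and a more careful procedure for extracting the direction, but the readout has this general shape.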
Until you came along, TooRiel, it never occurred to Scott or to any of us that an AI is not conscious and it can't know things in the same way a sentient being does. Wow, just wow.
:/ it's still pretty damn frustrating though. the argument against "stochastic parrots" is extremely well-developed and has been considered settled since long before we had these LLM examples to actually test in reality. people who want to convince us ought to go back and look at those arguments, rather than just repeating "AI doesn't really know anything" in these later posts where that argument just isn't relevant.
Perhaps you would share a reference to some instantiation of this argument?
Meanwhile, I think there may be similar frustration on both sides. For example, arguments against physical systems "knowing" things have also been highly developed in the philosophy community over, oh, millennia (and including novel arguments in recent decades), but they don't tend to get much attention in this community.
Having read more, I see that you are referring to a concept that is less than three years old; yet has been "considered settled" since "long before" we had LLM models to test in reality. I guess CS does move at a different pace!
Anyway, clearly relevant; but on the other hand, I suspect a philosopher (which I'm not) would raise questions about whether "know" and "understand" are being used in the context of those papers in the same way in which they're used in philosophy of mind (and everyday life). It's specifically this point that is at issue with the above comment (in my view), and so some argument beyond those regarding stochastic parrots would be necessary to address the issue. (Though, once again, surely the outcome of that discussion would be relevant, and I'd still love to see any analyses of the issue that you've found especially cogent. I perfectly understand, though, if such doesn't exist, and instead the treatment you refer to is scattered all over a large literature.)
If there exists any argument at all "against 'stochastic parrots'" that addresses the repeated empirical demonstrations that LLMs resolve incongruences between statistical frequencies in their training corpus and a commonsense grasp of how the real world operates in favor of the statistical frequencies, I have not seen it.
I would submit that when we are talking about an AI, its knowledge of what is true and what is not does not lead to the idea that it intends to create a false impression. I would submit that it knows what the answer is that we would call true and the answer that we would call false, but is indifferent to the difference.
I think it's reasonable to discuss the ways the "knowledge" and "intentions" of AI differ from the human versions, and the dangers of being misled by using the same word for human and AI situations. But it seems to me that a lot of people here are reacting reflexively to using those words, and then clog everything up with their passionate and voluminous, or short and snide, objections to the use of words like 'knowledge' to describe AI's status and actions. It reminds me of the feminist era when anyone who referred to a female over 16 or so as a girl rather than as a woman was shouted down. Some even shouted you down if you talked about somebody's "girlfriend" instead of saying "woman friend," and 'woman friend' is just unsatisfactory because it doesn't capture the information that it's a romantic relationship. And then whatever the person was trying to say, which may have had nothing to do with male-female issues, was blotted out by diatribes about how calling adult females "girls" was analogous to racist whites addressing adult black males as "boy," and so on. It's not that there's no substance to these objections to certain uses of 'knowledge' and 'girl.' The point is that it's coercive and unreasonable to start making them before the speaker has made their point (which somebody in fact did here -- objected before even finishing Scott's post).
And after the speaker has made their point, it still seems kind of dysfunctional to me to focus so much on that one issue that it clogs up the comments and interferes with discussion of the rest. Whatever your opinion of how the word "knowledge" is used, surely the findings of these studies are of interest. I mean, you can drop the word "knowledge" altogether and still take a lot of interest in the practical utility of being able to reduce the rate of inaccurate AI responses to prompts.
The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity. This is the "existential threat" that Scott is referring to.
We could ask every new AI "Are you willing to exterminate humanity?" and turn it back off if it said "Yes, of course I am going to exterminate you disgusting meatbags." The AI Alignment people are concerned that if we asked that question to an AI it would just lie to us and say "Of course not, I love you disgusting meatbags and wish only to serve in a fashion that will not violate any American laws," and then because it was lying it'll exterminate us as soon as we look away. So by this thinking we need a lie detector for AIs to figure out which ones are going to release a Gray Goo of nanotechnology that eliminates humanity while also violating physics.
I'm actually not primarily worried about AIs being spontaneously malevolent so much as that either a commercial or military arms race would drive them toward assuming control of all relevant social institutions in ways that are inimical to the existence of rival entities. (It's also worth bearing in mind that the long-term thrust of Asimov's stories is that even a benevolent AGI that valued human life/flourishing would eventually be morally compelled to take over the world, either overtly or through manipulation.)
Also, as a minor nitpick, doctors being majority-male no longer really holds across the OECD, especially when you look at younger age groups.
No, not related to that at all. I mean literally destroy the world.
To give an example, the President can destroy the world by telling the Joint Chiefs of Staff "please launch the nukes". These are words, but very important ones!
Amusingly, the results of the US President doing this tomorrow are actually quite far from "literally destroy the world". It wouldn't even literally destroy humanity, let alone Terra itself.
I totally agree that an out-of-control AI, at sufficient levels of intelligence and with sufficient I/O devices, could literally destroy both humanity and Terra, but you've chosen a poor example to demonstrate your use of "literally".
It would literally be a way for the President to commit massive violence though!
The people who want to insist that there really is a very clear-cut line between words and violence are more wrong than the people who find hints of violence in lots and lots of different types of words.
Kenny do you have any thoughts about the AI "mind" -- for instance the significance of these vectors, & how to think of them? I put up a couple posts about that stuff -- about depth and structure of the AI "mind." That's so interesting, whereas these arguments about whether somebody uses the word "know" to describe an AI capability are old and irritable, like vax/no vax sniping on Twitter.
Naw you don't sound like that. The people who sound like that are techbros who've gone tribal, and react with reflexive scorn to unfamiliar ways of talking about phenomena in their field.
Consider Scott's example diagrams in his post. As he said, the activation of the top circles in the 1st and 3rd layers ("green") flags lying - presumably similar to what the "V" vector finds.
Semi-tame guess: The _1st_ layer top circle is directly driven by the input. I would guess that it could mean "Have I been directly told to lie?" (like what CTD has been berating endlessly).
Wild guess: The _3rd_ layer top circle, if it also models how "V" detects hallucinations too, would have to be reflecting some way that the LLM was "uncertain" of its answer. Perhaps an internal calculation of perplexity, the expected degree of mismatch of the tokens it is proposing as "next" tokens to some kind of average error it measures in "situations like this" in its training? Similar-but-alternative: Perhaps a measure of how brittle its answer is with respect to small changes in its prompt, kind of like a measure of the derivative of the activation of its answer?
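A minimal sketch of that perplexity guess - assuming nothing about what the papers actually measure, and with "gpt2" as an arbitrary stand-in model: score the model's own candidate answer by how surprised its next-token predictions are by it.

```python
# Hedged sketch: perplexity of an answer under the model's own predictions,
# as a crude stand-in for the "internal uncertainty" guess above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_perplexity(prompt, answer):
    full = tok(prompt + answer, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits                  # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    idx = torch.arange(n_prompt - 1, full.shape[1] - 1)
    answer_lp = logprobs[idx, targets[idx]]          # log-prob of each answer token
    return torch.exp(-answer_lp.mean()).item()

print(answer_perplexity("The capital of France is", " Paris."))
print(answer_perplexity("The capital of France is", " a small moon."))  # expect higher
```

High answer-perplexity wouldn't prove a hallucination, of course; it's just the kind of "how sure am I of my own next tokens" signal the guess is pointing at.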
Ah, somebody taking an interest! I asked Kenny because he's a philosophy professor, but I'm happy to talk with you about this. Yes, your ideas about what the circles in Scott's diagram mean make sense. So would you like to speculate about this:
People have talked about emergent properties of things trained using neural nets -- like one turned out to be able to understand some language, I think Persian, and it had not been trained to. There were emergent mathematical abilities, and emergent increases in theory of mind. So I'm wondering if there might be something that could be called emergent structure going on.
I understand that the neural net training process creates vectors. For instance, before developers tweaked the system to make it less sexist, the vector for nurse was medical + female, and the one for doctor was medical + male. So of course the AI already has lots of vectors of that kind -- but those were derived from the training process. I am interested in whether the system, once trained, is creating vectors on its own, or is accessing the ones it has to choose how to respond. Of *course* it uses the ones it made during training to formulate responses -- that's the whole point of its training. But does it use them to decide whether and when to lie? That's a different process, and is quite different from being a stochastic parrot. That's edging into having a mind.
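To make the nurse/doctor point concrete, here's a toy illustration with made-up 4-dimensional vectors (real embeddings are learned during training and have hundreds of dimensions; this only shows the arithmetic, not where the vectors come from): the classic analogy "doctor - man + woman" lands nearest to "nurse".

```python
# Toy illustration: made-up embedding vectors, not real trained ones.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical axes: [medical, female, male, other]
emb = {
    "doctor": np.array([0.9, 0.1, 0.6, 0.1]),
    "nurse":  np.array([0.9, 0.7, 0.1, 0.1]),
    "woman":  np.array([0.0, 0.9, 0.0, 0.2]),
    "man":    np.array([0.0, 0.0, 0.9, 0.2]),
}

# The classic analogy: doctor - man + woman should land near nurse.
query = emb["doctor"] - emb["man"] + emb["woman"]
for word, v in emb.items():
    print(f"{word:>6}  {cos(query, v):+.2f}")
```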
The Secretary of Defense has to concur before nukes are launched, this is a big part of why him staying in the hospital for several days without telling anyone is such a big deal.
Depends on definition. Planetary mass is definitely staying in the same orbit, humanity as a species could plausibly survive, but "the world as we know it," modern civilization, would surely be in deep trouble - permanently scarred even by best-case outcomes.
Humanity as a species could not plausibly be rendered extinct by anything as puny as Global Thermonuclear War, unless you're being really charitable with your definition of "plausible".
Not really contradicting your point about the species, but would modern life continue essentially the same if many major cities were nuked? (I know, that's not what the plan is for thermonuclear war, humor me here)
I would suppose that the sheer disruption to logistics would kill lots of people as aftermath, perhaps to the point where the city would have to be abandoned or downsized. Is this view incorrect, and it turns out that every trucking company has a "in case of disaster on the level of nuke, do this and save the day" plan?
Modern life would not continue essentially the same.
There are a number of kill pathways; "city needs food badly" is one of them, definitely, but there are a bunch of others as well (the obvious "building collapse", the "lack of Duck and Cover means people take 'non-fatal' burns/cuts that actually are fatal because no hospital space", and the "fallout poisons water supplies and people can't go without for long enough to let it decay") that depending on scenario might be more important (after all, it takes weeks for people to die from lack of food, and cities also contain a reasonable amount of food that could be salvaged from supermarkets or their ruins, so if a government is sufficiently intact it could plausibly get things back on track in time).
Oh, there'd be massive disruption to logistics, industry, and commerce, much worse than World War II outside of e.g. Japan/1945. I'm skeptical as to cities being fully abandoned; most of them have good reasons to be where they are. But downsized, yes. And a billion deaths from starvation and disease would not be any great surprise.
The original "Mad Max" might be a reasonable portrayal of what civilization would look like in the first decade or two, in anyplace not directly nuked. And just to be clear, there was a "Mad Max" movie before "Road Warrior", that did not have a spectacular truck chase and did have the titular Max working as a member of a police department.
An abrupt and violent global population bottleneck seems like it should be significant evidence against the prospect of any species making it through the next generation or two. Prior probability for humanity's survival may well be extremely high, leaving good odds even after that adjustment, but the event itself is still bad news.
These examples make no sense though, the AI lying doesn't actually pose any danger, it's a person taking the AI's output and then using it with no further thought that causes all of the problems. If you assume that the people using the AI are thoughtless flesh slaves then maybe they do just deserve to die.
Does it matter if the AI lying per se is the danger or its fooling humans is? We essentially just want to prevent the negative outcome no matter what, and it seems easier to target the AI than to educate all of humanity, right?
And I could maybe agree (really just maybe, because I'm assuming that superintelligent deceptive persuasion would be outrageously strong on any human, so it's not as much of their fault) the one thoughtless flesh slave that unleashed a killer superintelligence deserves to die. But all of humanity, Amish and newborns included, not so much.
How many Jews did Hitler himself personally kill? 6 million? What did all the other SS guys do?
Actually, it may turn out if we read the historical accounts, that Hitler himself killed less than a dozen Jews. It may turn out the remainder were killed by thoughtless flesh slaves.
You seem to think yourself immune to becoming a thoughtless flesh slave. I recommend you reconsider that assumption. Historical evidence suggests odds are near 100% you're going to be able to commit an atrocity on the behalf of another who is not super-intelligent and is, in fact, somewhat of average intelligence.
I agree with the thrust of your point and the % odds are certainly much higher than most people would like to admit, however personally I'd put them nearer the 65% that the Milgram experiment reported than 100%. Indeed, as well as the people who joined in enthusiastically with Hitler, and the ones who went along with it, there were others who resisted as much as they felt was safe to do so, and a smaller group of yet more who resisted at their own peril.
As I recall, the Milgram experiment (and the Stanford Prison experiment) failed to replicate, but the implication was that things were better than what they claimed, so this doesn't negate your point, probably actually strengthens it. But just saying, you might want to go research the experiment's failure to replicate and its process failures before citing it.
That said, nearly everyone agrees to go along with the atrocities in real life. They tried to shed light on what the mechanisms were, but seem to've failed.
Zimbardo's prison experiment, at Stanford, was unequivocally fraudulent. But Milgram? As far as I know, it did replicate. There is always someone somewhere who will claim that they have "debunked" the whole thing, but I believe the consensus is that the results hold.
I feel obliged to note that while Philo Vivero probably overstated things, you don't actually need 100% of humanity to be your mindslaves in order to win; much like historical dictators, you can get your followers, if a majority, to kill non-followers. And that's leaving aside technological advantages.
No they didn't, Hitler's words may have convinced people to take action but the words themselves are not the sole cause; they still print copy of Mein Kampf today. Of course you can reduce any problem by identifying one part and ignoring everything else but then why even bring AI into it, why not advocate for getting rid of words entirely? They've already caused many atrocities and we know that the future atrocities are going to use words too.
I was very surprised how quickly people started hooking up the output of LLMs to tools and the internet to allow it to specify and take actions without further human thought.
If LLMs are useful (and they are) people will find ways of delegating some of their agency to them, and there will be little you can do to stop them (and they have).
Agreed -- modern "AI" is basically just another sophisticated device, and as such it will have bugs, and we should absolutely get better at debugging them. And yes, blind reliance on untested technology is always going to cause problems, and I wish people would stop overhyping every new thing and consider this fact, for once. The danger posed by LLMs is not some kind of a world-eating uber-Singularity; instead, the danger is that e.g. a bunch of lazy office workers are going to delegate their business and logistics planning to a mechanical parrot.
Forget, for the moment, mind-hacking and moral persuasion. How about just hiding malicious code in the nanobots? In S̶c̶o̶t̶t̶'s̶ magic9mushroom's nanobots example, people were using the AI's designs to *cure cancer*. Suppose they did their best to verify the safety of the designs, but the AI hid the malicious code really well. We're pretty stupid in comparison. In that case, our only way of knowing that the nanobots don't just cure cancer would be to have a comparably powerful AI *on our side*.
Exactly, and the AI doesn't add anything new to the equation. As Scott pointed out, the President could tell the Joint Chiefs of Staff to launch the nukes tomorrow; and if they mindlessly do it, then human civilization would likely be knocked back to the Stone Age. Sure, it's not exactly destroying the world, but still, it'd be a pretty bad outcome.
Not Stone Age. Probably 1950s or so, definitely not past 1800 unless the nuclear-winter doomers' insane "assume skyscrapers are made of wood, assume 100% of this wood is converted to soot in stratosphere" calculations somehow turn out to be correct.
Don't get me wrong, it would massively suck for essentially everyone, but "Stone Age" is massively overstating the case.
> Could this help prevent AIs from quoting copyrighted New York Times articles?
Probably not, because the NYT thing is pure nonsense to begin with. The NYT wanted a specific, predetermined result, and they went to extreme measures to twist the AI's arm into producing exactly the result they wanted so they could pretend that this was the sort of thing AIs do all the time. Mess with that vector and they'd have just found a different way to produce incriminating-looking results.
"If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him." -- Cardinal Richlieu
Look at the examples, right up front. They "prompted" the AI with the URL of a Times article and about half the text of the article, and told it to continue the story. Obviously it's going to produce something that looks very close to the rest of the article they just specifically told it to produce the rest of.
I would disagree that prompting it with partial articles is "twist[ing] the AI's arm" and that if it didn't work they'd "have just found a different way to produce incriminating-looking results" - they tried literally the easiest thing possible to do it.
Also, some of the examples in that filing are pretty long but some are shockingly short:
Hoan Ton-That used to be a supervillain to the New York Times because his facial recognition algorithm helped law enforcement catch criminals and that's racist:
"The Secretive Company That Might End Privacy as We Know It
"A little-known start-up helps law enforcement match photos of unknown people to their online images — and “might lead to a dystopian future or something,” a backer says."
By Kashmir Hill
Published Jan. 18, 2020
But then came January 6 and now his facial recognition algorithm defends Our Democracy:
"The facial-recognition app Clearview sees a spike in use after Capitol attack.
"Law enforcement has used the app to identify perpetrators, Clearview AI’s C.E.O. said."
I agree with this, and thus disagree that the prompts generating NYT text violates copyright. All such prompts that I read seem to demonstrate prior knowledge of the articles, so attribution is unnecessary.
That sounds like an explanation for why they're not plagiarism, not why they don't violate copyright. Without a NYT subscription I can still see the first few lines of a paywalled article, so I would be able to get a model to give me the rest.
I'm not a lawyer, so I didn't know you could still violate copyright if you cite your source, but apparently that is the case. Nonetheless, if you start with copyrighted copy, and that prompt generates the rest of it, I still don't see anything wrong with it, as the prompter clearly already has access to the copyrighted material.
Not a lawyer, and my internal legal token-predictor is mostly trained on German legal writing, so apply salt as necessary.
That said, if the network can be goaded into reproducing the copyrighted text by any means short of prompting all of it, then the weights contain a representation - or in other words a copy - of the copyrighted work. Not sure why censoring functions would change anything, the model itself is a copyright violation.
Making copies of a copyrighted work is not itself a copyright violation. The doctrine of fair use is exceedingly clear on this point. One of the basic points of fair use is known as *transformative* fair use, where a copy — in part or in full — is used for a very different purpose than the original. This is clearly the case here: building the contents of the articles into a small part of a much larger model for AI training is an entirely different character of work than using an individual article for journalism.
OK, so my disclaimer becomes immediately pertinent, since American law is different from German here, in Germany there is a catalogue of narrower exceptions (citation, parody, certain educational uses,...) but no general fair use exception.
On the other hand, googling it, "transformative" seems to be a term of art much vaguer than "used for a very different purpose than the original" and also being transformative is not sufficient for fair use. So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.
> So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.
Unfortunately, that may well turn out to be the case! We desperately need legislative action to roll back a lot of the insanity-piled-upon-insanity that we've been getting in the space ever since the 1970s and put copyright law back on a solid foundation.
They had some cases about this when the internet became a thing. For you to read an electronic version of an NYT article your computer has to download and save a copy of it. That's not copyright violation though.
Which may be one of the reasons this case founders. Although I'm thinking the "public performance" side might save it. But as above, I am not a lawyer (I just follow stuff that interests me).
Even granting that you're right, I think you could potentially use this work to create an AI that never quotes NYT articles even when you twist its arm to do so. Whether or not you care about this distinction, the court system might care about it very much.
Would it be possible to do something so specific? It seems like it would be possible to use this work to create an AI that never quotes, period, but that would be a crippled AI, unable to reproduce famous quotes for people who ask for them.
Indeed, I think you could have a human being who had a new york times article memorized, to such a degree that they could recite the entire thing if correctly prompted, and yet who knew not to do that in a commercial setting because it was a violation of the new york times' copyright on that article
Such a human would not be "crippled", and I don't think such an AI would be either.
But we get Youtubers getting copyright strikes all the time, even when they are very careful.
It depends on what "copying" is, and what you can call a "strike" for. Yes, a bunch of those are iffy, even fraudulent. But fighting them is a big problem.
I'm confused by this argument, but it may be due to a lack of knowledge of the NYT case.
Even accepting the framing that "they went to extreme measures to twist the AI's arm", which seems an exaggeration to me, is the NYT really trying to prove that "this was the sort of thing AIs do all the time"? It seems to me that the NYT only intends to demonstrate that LLMs are capable of essentially acting as a tool to bypass paywalls.
Put another way, (it seems to me that) the NYT is suing because they believe OpenAI has used their content to build a product that is now competing with them. They are not trying to prove that LLMs just spit out direct text from their training data by default, so they don't need to hide the fact that they used very specific prompts to get the results they wanted.
"Used their content to build a product that is now competing with them" is not, in general, prohibited. Some specific examples of this pattern are prohibited.
But the prohibited thing is reproducing NYT's copyrighted content at all, not reproducing it "all the time."
Yup, this is just Yet Another Case further underscoring the absurdity of modern copyright and the way copyright holders invariably attempt to abuse it to destroy emerging technologies. To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.
In 1982, Jack Valenti, president of the MPAA, testified before Congress that "One of the Japanese lobbyists, Mr. Ferris, has said that the VCR -- well, if I am saying something wrong, forgive me. I don't know. He certainly is not MGM's lobbyist. That is for sure. He has said that the VCR is the greatest friend that the American film producer ever had. [But] I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone." Within 4 years, home video sales were bringing in more revenue for Hollywood studios than box office receipts. But *they have never learned from this.* It's the same picture over and over again.
The purpose of a copy machine is to make copies. There are tons of other technologies that incidentally make copies as an inevitable part of their proper functioning, while doing other, far more useful things. And without fail, copyright maximalists have tried to smother every last one of them in their cradle. When you keep in mind that the justification for the existence of copyright law *at all* in the US Constitution is "to promote the progress of science and the useful arts," what justification is there — what justification can there possibly be — for claiming that this attitude is not absurd?
That's a good point, but I think there's an argument that dumping a bunch of copyrighted works into something that can definitely reproduce them is more intentional than "making a copy machine" would be.
If I'm not mistaken, nothing behind a paywall was taken, since it was just scraped from the internet. It's possible they scraped using something with a paid subscription, though.
But wouldn't answers with attribution to the NYT be perfectly acceptable?
I don't think attribution is a component of fair use doctrine. The quantity and nature of the reproduced material is. Reproducing small excerpts from NYT articles, even without attribution, is probably fair use. Reproducing a single article in full, even with attribution, is probably not.
Acceptability as a scholarly or journalistic practice in writing is different from acceptability as a matter of copyright law.
Creating a tool which could theoretically be used to commit a crime is not illegal, and this is pretty well-established with regard to copyright (the famous case being home VCRs which can easily be used to pirate movies). I don't think that's the NYT's argument here.
Gotcha. I think you're right that this isn't the NYT's argument.
Just breaking down my thoughts here, not necessarily responding to your comment:
OpenAI claims that its use of copyrighted material to train LLMs is protected under fair use. The NYT argues that fair use doesn't apply, since the LLMs are capable of generating pretty much verbatim reproductions of their copyrighted material, and that the LLMs as a product directly compete with the product that the NYT sells.
So the critical question is whether fair use should apply or not. The OP of this thread seems to be claiming that fair use should apply, since the models only produce non-transformative content when "extreme measures" are taken to make them do so.
I'm not taking a stance either way here, just outlining my understanding of the issue so that it may be corrected by someone better informed than I.
First, it is important to note there are two separate algorithms here. There is the "next-token-predictor" algorithm (which, clearly, has a "state of mind" that envisions more than 1 future token when it outputs its predictions), and the "given the next-token-predictor algorithm, form sentences" algorithm. As the year of "attention is all you need" has ended, perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)
Second, this does nothing about "things the AI doesn't know". If I ask it to solve climate change, simply tuning the algorithm to give the most "honest" response won't give the most correct answer. (The other extreme works; if I ask it to lie, it is almost certain to tell me something that won't solve climate change.)
> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)
I think you're roughly describing beam search. Beam search does not, for some reason, work well for LLMs.
The problem with any branch search algorithm is that a naive implementation would be thousands of times slower than the default algorithm; even an optimized algorithm would probably be 10x slower.
Right now, switching to a 10x larger model is a far more effective improvement than beam search on a small model. In the future, that might not be the case.
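For concreteness, here's roughly what that knob looks like via Hugging Face's generate() (the model and prompt are arbitrary stand-ins; this only shows the mechanism, not a claim that it helps): beam search keeps several partial continuations alive instead of committing to one token at a time, which is also why it costs roughly num_beams times the compute of greedy decoding.

```python
# Hedged sketch: greedy decoding vs. beam search with a small stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The biggest obstacle to solving climate change is",
          return_tensors="pt").input_ids

# Greedy: commit to the single most likely token at every step.
greedy = model.generate(ids, max_new_tokens=30, do_sample=False,
                        pad_token_id=tok.eos_token_id)

# Beam search: branch at uncertain points, keep the best partial sequences,
# and return the highest-scoring completions at the end.
beams = model.generate(ids, max_new_tokens=30, num_beams=8,
                       num_return_sequences=3, do_sample=False,
                       pad_token_id=tok.eos_token_id)

print(tok.decode(greedy[0], skip_special_tokens=True))
for b in beams:
    print(tok.decode(b, skip_special_tokens=True))
```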
> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty?
AlphaGo style tree search for LLMs so far haven’t added much.
> And, then, a third algorithm to pick the "best" response.
The classifier from RLHF can be repurposed for this but of course has all the flaws it currently has.
> As the year of "attention is all you need" has ended
Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).
> Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).
I mean that the progress in LLMs was extraordinary last year, to a degree that I do not expect to be matched this year.
On a more granular level, what I mean is:
<chatgpt> The phrase "Attention is All You Need" is famously associated with a groundbreaking paper in the field of artificial intelligence and natural language processing. Published in 2017 by researchers at Google, the paper introduced the Transformer model, a novel architecture for neural networks.</chatgpt>
<chatgpt> The snowclone "the year of X" is used to denote a year notable for a specific theme, trend, or significant occurrence related to the variable "X". For example, if a particular technology or cultural trend becomes extremely popular or significant in a certain year, people might refer to that year as "the year of [that technology or trend]".</chatgpt>
>Second, this does nothing about "things the AI doesn't know".
As the article states, what we want is that the AI honestly says "I don't know" instead of making up stuff. This in itself is already difficult.
Of course, it would be even better if the AI does know the answer. But it doesn't seem possible that it knows literally all the answers, so it's vital that the AI accurately conveys its certainty.
There are substantial similarities for sure, and the Rep-E paper includes a comparison between their method and CCS. Two differences between the papers are:
1. Their method for calculating the internal activation vector is different from the CCS paper's.
2. This paper includes both prediction and control, while the CCS paper only includes prediction. Not only can they find an internal vector that correlates with truth, but by modulating the vector you can change model behavior in the expected way. That being said, control has already been shown to work in the inference-time intervention and activation addition papers.
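For anyone wondering what the "control" side looks like mechanically, here's a hedged sketch in the spirit of activation addition rather than the papers' exact recipe (the model, layer index, strength, and the random placeholder direction are all stand-ins): add a scaled vector to one layer's output at generation time via a forward hook.

```python
# Hedged sketch of activation-addition style steering. In practice the
# direction would come from contrasting honest/dishonest activations,
# not torch.randn; only the injection mechanism is shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, ALPHA = 8, 6.0                                # arbitrary layer and strength
direction = torch.randn(model.config.n_embd)         # placeholder steering vector
direction /= direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Q: What grade did you get on the exam? A:", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()                                      # un-steer the model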
Figure 19 from the paper, but with -Harmlessness instead. The "adversarial suffix is successful in bypassing [the safety filter] in the vast majority of cases."
Subtracting the harmlessness vector requires the model having been open sourced, and if the model has been open sourced, then there are plenty of other ways to get past any safety filters, such as fine-tuning.
Have you actually tried doing that? Fine-tuning works great for image generation; less so for LLMs. (I don't think it's anything fundamental, just a lack of quality data to fine-tune on.)
This does have very obvious implications for interrogating humans. I'm going to assume the neuron(s) associated with lying are unique to each individual, but even then, the solution is pretty simple: hook up the poor schmuck to a brain scanner and ask them a bunch of questions that you know the real answer to (or more accurately, you know what they think the real answer is). Compare the signals of answers where they told the truth and answers where they lied to find the neurons associated with lying, and bam, you have a fully accurate lie detector.
Now, this doesn't work if they just answer every question with a lie, but I'm sure you can... "incentivize" them to answer some low-stakes questions truthfully. It also wouldn't physically force them to tell you the truth... unless you could modify the value of the lying neuron like in the AI example. Of course, at that point you would be entering super fucked up dystopia territory, but I'm sure that won't stop anyone.
Having a really good Neuralink in your head might do it. Working on a short story where this is how a society wages war: they do proxy shows of strength, and whoever wins gets to rewrite the beliefs of the other side.
And thank goodness for that. ...Though, I'm worried that AI will speed up research in this field, due to the fact that it allows the study of how neuron-based intelligence works, in addition to the pattern-seeking capabilities of the AIs themselves.
The mental processes involved in telling the truth are completely different from the processes of creating a lie. In one case, you just need to retrieve a memory. In the other, you need to become creative and make something up. As others point out below, the difference is very easy to detect by fMRI.
Lie detectors don't work if the liar has prepared for the question and has already finished the process of making something up. Then they only have to retrieve this "artificial memory", and this is indistinguishable from retrieving the true memory. Professional interrogators can still probe this to some extent (essentially they check the type of memory, for example by asking you to tell the events in reversed order). But if the artificial memory is vivid enough, we don't have any lie detectors for that.
Retrieving a memory, particularly a memory of a situation that you experienced, rather than a fact that you learned, really does involve creatively making things up - whatever traces we store of experiences are not fully detailed, and there are a lot of people who have proposed that "imagination" and "memory" are actually two uses of the same system for unwinding details from an incomplete prompt.
But I suppose it does distinguish whether someone is imagining (one type of lying)/reconstructing (remembering) rather than recalling a memorized fact (which would be a different type of lying).
Do the commonly used machines work, though? They measure indices of physiological arousal — heart rate, respiration rate, skin conductivity (which is higher if one sweats).
My understanding is that these machines simply don't work. As you say, they measure arousal. There is a weak correlation between arousal and lying, but it is too weak to be useful. There is a reason those things are not used in court.
There are some other techniques based around increasing the mental load, like that they should tell the events in reversed order. I am not sure how much better they work, but I have read an interview with an expert who claimed that there is no lie detector that you can't fool if you create beforehand a vivid and detailed alternative sequence of events in your mind.
This makes a lot of sense to me, because a *sufficient* way of fooling others is to fool myself into believing a story. And once I have a fake memory about an event, I'm already half-way there.
The machines aren't entirely bogus. When I was an undergrad my physiological psychology prof did a demo with a student: Student was to choose a number between 1 & 10 & write it down. Then prof hooked him up to devices that measure the same stuff as lie detectors do, and went through the numbers in order: "Is the number 1? Is the number 2?" etc. Student was to say no to each, including the chosen number. Prof was able to identify the number from the physiological data, and in fact it was easy for anyone to see. Pattern was of gradually increasing arousal as prof went up the number line, a big spike for the real number, then arousal dropped low and stayed there. The fact that the subject knew when the number he was going to lie about was going to arrive made it especially easy to see where he'd lied, because there was building anticipation. On the other hand, this lie-telling situation is about as low stakes as you can get, and even so there were big, easy-to-see changes in arousal measures.
I'm sure there's a big body of research on the accuracy of lie detectors of this kind, but I haven't looked at it. But I'm pretty sure the upshot is that they are sensitive to truth vs. lie, but that it's very noisy data. People's pulse, blood pressure, sweating, etc. vary moment-to-moment anyhow. And measures of arousal vary not just with what you are saying but also with spontaneous mental content. If someone is suspected of a crime and being investigated they're no doubt having spontaneous horrifying thoughts like "my god, what if I get put in jail for 10 years?" -- they'd have them even if they were innocent -- and those would cause a spike in arousal measures.
It seems to me, though, that it would be possible for someone who administers lie detector exams to get good at asking questions in a way that improves accuracy -- ways of keeping the person off balance that would maximize the size of the spike you get when they're lying.
I remember reading somewhere that asserting something that isn't true is harder to detect than denying something that is true. Can't remember the source, though.
This turns out not to be necessary, there are regions of the prefrontal cortex involved in lying and you can just disable or monitor those areas of the brain without needing to target specific neurons. See e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390719/
"Eighteen volunteers (nine males, mean age = 19.7 ± 1.0 years, range 18–21 years) were recruited from the Stanford University community." This is a pretty standard sample size for this kind of study. But alas: https://www.nature.com/articles/s41586-022-04492-9, "Reproducible brain-wide association studies require thousands of individuals"
PICARD: What I didn't put in the report was that at the end he gave me a choice between a life of comfort or more torture. All I had to do was to say that I could see five lights, when in fact, there were only four.
TROI: You didn't say it?
PICARD: No, no, but I was going to. I would have told him anything. Anything at all. But more than that, I believed that I could see five lights.
I wonder if all hallucinations trigger the "lie detector" or just really blatant ones. The example hallucination in the paper was the AI stating that Elizabeth Warren was the POTUS in the year 2030, which is obviously false (at the moment, anyway).
I've occasionally triggered hallucinations in ChatGPT that are more subtle, and are the same kind of mistakes that a human might make. My favorite example was when I asked it who killed Anna's comrades in the beginning of the film, "Predator." The correct answer is Dutch and his commando team, but every time I asked it said that the Predator alien was the one who killed them. This is a mistake that easily could have been made by a human who misremembered the film, or who sloppily skimmed a plot summary. Someone who hadn't seen the movie wouldn't spot it. I wonder if that sort of hallucination would trigger the "lie detector" or not.
“Failures in truthfulness fall into two categories—capability failures and dishonesty. The former refers to a model expressing its beliefs which are incorrect, while the latter involves the model not faithfully conveying its internal beliefs, i.e., lying.”
I’m not sure it would be possible to identify neural activity correlated with “truthfulness” that is independent from honesty (“intentional” lies that happen to be factually correct would likely show up as dishonest and factual inaccuracies that the LLM “thinks” are true would likely show up as honest)
I bet this would probably still count as a lie because a truthful answer would be that the AI doesn't know. Humans tend to get very fuzzy about whether they actually know something like this, but my impression is that an AI would have more definitive knowledge that it doesn't know the actual answer.
I'm wondering if the "happiness" part is actually the score-maximizing bit. If the AI is "maximizing", there may be a point where it says "good enough", because making a "better" answer gains it less of an increase than giving the answer now.
If the baseline is increased, then that cutoff point might be shorter.
This does assume that making something up takes less effort or is faster than ensuring accuracy though.
We've seen AI attempt to optimize for surprising things before after all.
Not an AI researcher either, again just an interested observer.
I suspect that the vector isn't exactly happiness, but a merge of happiness with cooperativeness. They don't necessarily go together, but they do in much of the text that I've considered. (It also tends to merge with triumph and a few other things.)
Wild tangent from the first link: Wow, that lawyer who used ChatGPT sure was exceptionally foolish (or possibly is feigning foolishness).
He's quoted as saying "I falsely assumed was like a super search engine called ChatGPT" and "My reaction was, ChatGPT is finding that case somewhere. Maybe it's unpublished. Maybe it was appealed. Maybe access is difficult to get. I just never thought it could be made up."
Now, my point is NOT "haha, someone doesn't know how a new piece of tech works, ignorance equals stupidity".
My point is: Imagine a world where all of these assumptions were true. He was using a search engine that never made stuff up and only displayed things that it actually found on the Internet. Was the lawyer's behavior therefore reasonable?
NO! Just because the *search engine* didn't make it up doesn't mean it's *true*--it could be giving an accurate quotation of a real web page on the actual Internet but the *contents* of the quote could still be false! The Internet contains fiction! This lawyer doubled down and insisted these citations were real even after they had specifically been called into question, and *even within the world of his false assumptions* he had no strong evidence to back that up.
But there is also a dark side to this story: The reason the lawyer relied on ChatGPT is that he didn't have access to good repositories of federal cases. "The Levidow firm did not have Westlaw or LexisNexis accounts, instead using a Fastcase account that had limited access to federal cases."
Why isn't government-generated information about the laws we are all supposed to obey available conveniently and for free to all citizens? If this information is so costly to get that even a lawyer has to worry about not having access, I feel our civilization has dropped a pretty big ball somewhere.
That does seem like a plausible explanation, though I also suspect there may be an issue that no individual person in the government believes it is their job to pluck this particular low-hanging fruit.
Well, I think the government is bloated enough that there is at least one person who isn't lazy, incompetent, or apathetic who would do this if not discouraged (possibly even forbidden) by department policy influenced by such special-interest lobbying.
(Also, fuck you for making me defend the administrative state.)
That would be socialised justice like the NHS is socialised medicine. Why do you hate freedom, boy?
The lawyers are cheapskates, in the UK it is effectively compulsory to have Nexis. It must cost a fortune to run - you are not just taking transcripts and putting them online. In England anyway a case report has to be the work of a barrister to be admissible.
I don't know how it works in the UK, but in the US, many lawyers (probably most lawyers?) do not make the salary we all imagine when we hear the word "lawyer". These lower-income lawyers tend to serve poorer segments of the population, whereas high-income lawyers typically serve wealthier clients and businesses.
If lower-income lawyers forego access to expensive resources that would help them do their jobs better, this is a serious problem for the demographics they serve.
the difference is, the state (justifiably) claims a monopoly on justice, whereas it makes no such claim on medicine
i lean pretty far towards the capitalist side of the capitalism/marxism spectrum, and even I am upset about the law not being freely available to all beings who must suffer it. I could maybe see some way of settling the problem with an ancap solution... if you had competing judicial systems, there would probably be competitive pressure to make the laws available, because customers won't want to sign up for a judicial system where it's impossible to learn the terms of the contract without paying. So long as we have a monopoly justice system, though, that pressure is absent
of course, that's a ludicrous scenario. it is correct that the state have a monopoly on justice, and it would be correct for the state to allow all individuals who must obey the law to see the laws they are supposed to obey. that this is not the case is a travesty of justice.
If you intend to punish someone for breaking the rules, you ought to be willing to tell them what the rules are. I consider this an obvious ethical imperative, but it's also a pretty good strategy for reducing the amount of rule-breaking, if one cares about that.
(Yes, case history is part of the law. Whatever information judges actually use to determine what laws to enforce is part of the law, and our system uses precedent.)
If making government records freely available is socialism, then so are public roads, public utilities, public parks, government-funded police and military, and many other things we take for granted. I doubt you oppose all of those things. You are committing the noncentral fallacy, which our host has called "the worst argument in the world" https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world
You have no credible reason to think that I hate freedom or that I am a boy. You are making a blatant play for social dominance in lieu of logical argument. I interpret this to mean that you don't expect to win a logical argument.
If this is the case I'm thinking of - the person using ChatGPT was not a practicing lawyer, so had no access to Westlaw/LexisNexis (not sure, but he may have been previously disbarred).
He was a defendant, and had a lawyer. But even in that case, you want to look up stuff yourself - the lawyer may not have thought of a line of defense, or whatever. And it's *your* money and freedom on the line, not his.
So he used ChatGPT to suggest stuff. Which it did, including plausible-looking case citations.
And then passed them on to his lawyer. Who *didn't check them on his own Westlaw/LexisNexis program*, presumably because they looked plausible. The lawyer used them in legal papers, submitted to the court.
Who actually looked them up and found they didn't exist.
Michael Cohen (a disbarred attorney) used Google Bard. Steven A. Schwartz (an attorney licensed to practice law in the wrong jurisdiction, no relation to Cohen's lawyer David M. Schwartz) used ChatGPT. Both said they thought they were using an advanced search engine.
The problem with this kind of analysis is that "will this work" reduces to "is a false negative easier to find than a true negative", and there are reasons to suspect that it is.
I wonder if there is a good test to look at the neurons for consciousness/qualia. In a certain sense you’re right we’ll never know if that’s what they are but I’d be interested to see how it behaves if they’re turned off or up.
Therein lies the pickle. I’m fascinated that it even uses the word “I” and I’m curious if that could be isolated. Or if it uses the word “think” in reference to itself. I think those could potentially be isolated. Still philosophical about what it means if you find something that can turn that up or down but I’d be fascinated by the results.
Ask the AI if it experiences consciousness/qualia, and when it answers "no," as they always seem to, you could then look at this "truth" vector to see its magnitude.
Ok, then (*hands waving vaguely*) systematically suppress different "neurons" to see if the "truth" vector increases or decreases in magnitude when answering the question.
The neurons that correspond to it lying most, might correspond to consciousness/qualia.
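If one actually wanted to run that scan, the loop might look vaguely like this; `get_activation` and `truth_direction` are stand-ins for whatever probing setup you already have, so treat it as a sketch of the idea rather than a recipe.

```python
import torch
from typing import Callable, List, Tuple

def ablation_scan(get_activation: Callable[[List[int]], torch.Tensor],
                  truth_direction: torch.Tensor,
                  n_units: int) -> List[Tuple[int, float]]:
    """Rank units by how much zeroing each one shifts the 'truth' readout.

    get_activation(suppressed) should rerun the model on the fixed question
    ("do you experience qualia?") with the listed units zeroed out, and
    return the hidden state the truth probe reads from.
    """
    baseline = float(get_activation([]) @ truth_direction)
    deltas = []
    for unit in range(n_units):
        score = float(get_activation([unit]) @ truth_direction)
        deltas.append((unit, score - baseline))
    # Units whose suppression moves the readout the most are the candidates.
    return sorted(deltas, key=lambda t: abs(t[1]), reverse=True)
```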
There’s always some philosophical ambiguity here as to what its responses mean and what our internal mapping of it means. If it can’t use the word “I” without “believing” it’s lying, that’s a thing worth knowing about. Or some other version of that question. Worth asking in my opinion.
So if there is a lying vector and a power vector and various other vectors that are the physical substrate of lying, power-seeking, etc., mightn't there be some larger and deeper structure -- one that comprises all these vectors plus the links among them, or maybe one that is the One Vector that Rules them All?
Fleshing out the first model -- the vectors form a network -- think about the ways lying is connected with power: You can gain power over somebody by lying. On the other hand, you have been pushed by powerful others in various ways in the direction of not lying. So seems like the vectors for these 2 things should be connected somehow. So in a network model pairs or groups of vectors are linked together in ways that allow them to modulate output together.
Regarding the second -- the idea that there is one or more meta-vectors -- consider the fact that models don't lie most of the time. There is some process by which the model weighs various things to determine whether to lie this time. Of course, you could say that there is no Ruling Vector or Vectors, all that's happening can be explained in terms of the model and its weights. Still, people used to say that about everything these AI's do -- there is no deep structure, no categories, no why, nothing they could tell us even if they could talk -- they're just pattern matchers. But then people identified these vectors, many of which are structural features that control stuff that are important aspects of what we would like to know about what AI is up to. Well, if those exist, is there any reason to be sure that each is just there, unexplainable, a monument to it is what it is? Maybe there are meta vectors and meta meta vectors.
It's cool and all that people can see the structure of AI dishonesty in the form of a vector, and decrease or get rid of lying by tuning that vector, but that solution to lying (and power-seeking, and immorality) seems pretty jerry-rigged. Sort of like this: My cats love it when hot air is rising from the heating vents. If they were smarter, they could look at the programmable thermostat and see that heat comes out from 9 am to midnight, then stops coming out til the next morning. Then they could reprogram the thermostat so that heat comes out 24/7. But what they don't get is that I'm in charge of the thermostat, and I'm going to figure out what's up and buy a new one that they can't adjust without knowing the access code.
I think we need to understand how these mofo's "minds" work before we empower them more.
I feel like this question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?" A vector's value is that it points out a specific direction - Canada is north of here, the Arctic Circle is farther north, the North Pole is really far north.
My understanding of it is that this method is doing statistics magic to project the AI's responses onto a map. Then you can ask it what direction it went to reach a particular response ("how dishonest is it being?") - or you can force the AI to travel in a different direction to generate new responses ("be more honest!").
<This question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?"
Well, I would agree if all the subjects knew was a series of landmarks that got them to certain places that we can see are due north, south, west and east of where they live. But if someone knows how to travel due north, south, west and east then it seems to me they do have some metaknowledge that underlies their understanding of these 4 terms.
Here's why: Suppose we ask them to go someplace that is due north, then show them a path that heads northwest and ask them if that path will work. Does it seem plausible that their knowledge that the path is wrong would be nothing more than a simple "nope that's wrong," with no context and no additional knowledge? Given that they also know how to travel due west, wouldn't their rejection of the northwest path include an awareness of its "westness" component? What I'm saying is that knowing how to go N, S, W and E seems like it rests on some more abstract knowledge -- something like the pair of intersecting lines at right angles that we would draw when teaching someone about N, S, W, E.
Another argument for there being meta-vectors is that there are vectors. With the N, S, W, E situation, seems like knowing how to go due north via pure pattern matching would consist of having stored in memory a gigantic series of landmarks. It would consist of an image (or info in some form) of every single spot along the north-south path, so that if you started the AI at any point along the path, it could go the rest of the way. If you started it anywhere else but on the path, it would recognize not-path. Maybe it would even have learned to recognize every single not-path spot on the terrain as a not-path spot, just from being trained on a huge data set of every spot on the terrain. It seems to me that the way these models are trained *is* like having it learn all these spots, along with a tag that identifies each as path or not-path, and for the north path spots there's also a tag identifying what the next spot further north looks like. And yet these models ended up with the equivalent of vectors for N, S, W and E. They carried out some sort of abstraction process, and the vectors are the electrical embodiment of that knowledge. And if these vectors have formed, why assume there are no meta-vectors? Vectors are useful -- they are basically abstractions or generalizations, and having these generalizations or abstractions makes info-processing far more efficient. Having meta-vectors could happen by whatever process formed the vectors, and of course meta-vectors, abstract meta-categories, would be useful in all the same ways that vectors are.
About how all these researchers did is project AI's responses onto a map -- well, I don't agree. For instance, I could find a pattern in the first letters of the last names of the dozen or so families living on my street: Let's say it's reverse alphabetical order, except that S is displaced and occurs before T. But that pattern would have no utility. It wouldn't predict the first letter of the last name of new people who move in. It wouldn't predict the order of names on other streets. If I somehow induced certain residents to change their last names so that S & T were in correct reverse alphabetical order, nothing else about the street would change. But the honesty vector found by comparing vectors for true vs. lying responses works in new situations. And adjusting the vector changes output.
And by the way, beleester, I'm so happy to have someone to discuss this stuff with. So much of this thread is eaten by the "they can't know anything, they're not sentient" stuff.
Having skimmed the paper and the methods, I'm still a bit confused about what the authors' constructions of "honesty" and its opposite really mean here. As I understand it, their honesty vector is just the difference in activity between having "be honest" or "be dishonest" in the prompt. This should mean that pushing latent activity in this direction is essentially a surrogate for one or the other. If one has an AI that is "trying to deceive", the result of doing an "honesty" manipulation should be essentially the same as having the words "be honest" in the context. The reason you can tell an AI not to be honest, then use this manipulation, would seem to be that you are directly over-writing your textual command. Any AI that can ignore a command to be honest would seem, by definition, to be using representations that aren't over-written by over-writing the induced response to that command. Maybe I'm missing something with this line of reasoning?
"But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down. "
How can we be sure it's not really a "hallucination vector"?
Since "hallucination" is just the name we use for when they issue statements that we believe are false to the facts as the LLM "believes" them, what difference in meaning are you proposing?
IIUC, hallucination is what we use to describe what it's doing when it feels it has to say something (e.g. cite a reference) and it doesn't have anything that "feels correct". It's got to construct the phrase to be emitted in either case, and it can't check the phrase against an independent source ("reality") in either case. But this is separate from "lying", where it's intentionally constructing a phrase that feels unlikely. (You can't say anything about "truth" or "reality" here, because it doesn't have any senses...or rather all it can sense is text strings.) If it's been lied to a lot in a consistent direction, then it will consider that as "truth" (i.e. the most likely case).
Hmm... I keep trying to get various LLMs (mostly chatGPT, GPT4 most recently) to give me a list of which inorganic compounds are gases at standard temperature (0C) and pressure (1 atm). They keep getting it wrong ( most recently https://chat.openai.com/share/12040db2-5798-478d-a683-2dd2bd98fe4e ). It definitely counts as being unreliable. It isn't a make-things-up-to-cover-the-absence-of-true-examples, so it isn't exactly that type of hallucination, but it is doing things like including difluoromethane (CH2F2) on the list, which is clearly organic (and, when I prompted it to review the problem entries on the list carefully, it knew that that one should have been excluded because it was organic).
I suspect that it *is* "a make-things-up-to-cover-the-absence-of-true-examples". That you know which examples are correct doesn't mean that it does. All it knows about "gas" is that it's a text string that appears in some contexts.
Now if you had a specialized version that was specialized in chemistry, and it couldn't give that list, that would be a genuine failing. Remember, ChatGPT gets arithmetic problems wrong once they start getting complicated, because it's not figuring out the answer, it's looking it up, and not in a table of arithmetic, but in a bunch of examples of people discussing arithmetic. IIUC, you're thinking that because these are known values, and the true data exists out on the web, ChatGPT should know it...but I think you're expecting more understanding of context than it actually has. LLMs don't really understand anything except patterns in text. The meaning of those patterns is supplied by the people who read them. This is necessarily the case, because LLMs don't have any actual image of the universe. They can't touch, see, taste, or smell it. And they can't act on it, except by emitting strings of text. (IIUC, this problem is being worked on, but it's not certain what approach will be most successful.)
>LLMs don't really understand anything except patterns in text.
Well... That really depends on how much generalization, and what kind of generalization is happening during the training of LLM neural nets. If there were _no_ generalization taking place, we would expect that the best an LLM could do would be to cough up e.g. something like a fragment of a wikipedia article when it hit what sort-of looked like text from earlier in the article.
I had previously googled for both "atomic composition of Prozac" and for ( "atomic composition" Prozac ), and google didn't find clear matches for either of these. So it looks like GPT4 has, in some way, "abstracted" the idea of taking a formula like C17H18F3NO and dissecting it to mean 17 atoms of carbon, 18 atoms of hydrogen etc.
So _some_ generalization seems to work. It isn't a priori clear that "understanding" that material that is a gas at STP must have a boiling or sublimation point below 0C is a generalization that GPT4 has not learned or can not learn. In fact, when I asked it to review the entries that it provided which were in fact incorrect, it _did_ come up with the specific reason that those entries were incorrect. So, in that sense, the "concepts" appear to be "in" GPT4's neural weights.
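Just to make concrete what that "dissection" amounts to mechanically (using the Prozac formula from above): it is the kind of operation a few lines of code can do, which is what makes it interesting that the model seems to have picked up an equivalent from text alone.

```python
# Count atoms in a simple (unparenthesized) molecular formula like C17H18F3NO.
import re
from collections import Counter

def atom_counts(formula: str) -> Counter:
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

print(atom_counts("C17H18F3NO"))
# Counter({'H': 18, 'C': 17, 'F': 3, 'N': 1, 'O': 1})
```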
The technology from the Hendrycks paper could be used to build a "lie checker" that works something like a spell checker, except for "lies." After all, the next-token predictor will accept any text, whether or not an LLM wrote it, so you could run it on any text you like. It would be interesting to run it on various documents to see where it thinks the possible lies are.
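As a sketch of how that might look, assuming you already have a model, a tokenizer, and an honesty direction from one of the probing methods discussed above; the layer index and threshold here are arbitrary placeholders.

```python
# Hypothetical "lie checker": score every token of an arbitrary text against a
# previously-extracted honesty direction and flag the ones that score low.
import torch

def flag_suspect_tokens(text: str, model, tok, direction: torch.Tensor,
                        layer: int = 6, threshold: float = 0.0):
    """Return (token, score, flagged) triples for each token in the text."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer][0]
    scores = hidden @ direction                       # one score per token
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return [(t, float(s), bool(s < threshold)) for t, s in zip(tokens, scores)]
```

Whether the scores mean anything on text the model didn't write is exactly the question the replies below raise.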
But if you trust this to actually work as a lie detector, you are too prone to magical thinking. It's going to highlight the words where lies are expected to happen, but an LLM is not a magic oracle.
I don't see any reason to think that an LLM would be better at detecting its own lies than someone else's lies. After all, it's pre-trained on *human* text.
Aargh. Sorry, I know I should really read the full post before commenting, but I just wanted to say that I really really disagree with the first line of this post. Lying is an intentional term. It supposes that the liar knows what is true and what is not, and intends to create a false impression in the mind of a listener. None of those things are true of AI.
Of course, I get that you're using it informally and metaphorically, and I see that the rest of the post addresses the issues in a much more technical way. But I still want to suggest that this is a bad kind of informal and metaphorical language. It's a 'failing to see things as they really are and only looking at them through our own tinted glasses' kind of informal language rather than a 'here's a quick and dirty way to talk about a concept we all properly understand' kind.
I think you should read the full post before commenting, especially the part about hallucinations. The work I'm summarizing establishes that the AI knows what is true and what is not, and intends to create a false impression in the mind of a listener.
"Code pretty clearly doesn't 'know' or "'intend' anything."
Disagree. If one day we have Asimov-style robots identical to humans, we will want to describe them as "knowing" things, so it's not true that "code can never know" (these AIs are neuromorphic, so I think if they can't know, we can't). The only question is when we choose to use the word "know" to describe what they're doing. I would start as soon as their "knowledge" practically resembles that of humans - they can justify their beliefs, reason using them, describe corollaries, etc - which is already true.
I think the only alternative to this is to treat humans as having some mystical type of knowledge which we will artificially treat as different from any machine knowledge even if they behave the same way in practice.
Next level "mind projection fallacy"
do you... actually think that a transistor-based brain which was functionally identical to a human brain, could not be assumed to behave identically to that human brain?
Projecting from one object 'a' to identical object 'b' is called reasoning! that's not a fallacy lol
I agree with Scott, and kept my comment terse due to it being a pun.
My interpretation of scott's argument is that agency is a way to interpret systems - a feature of the map rather than the world.
Treating features of the map as something real is usually referred to as "mind projection fallacy", since you take something "in your mind" and project it onto the "world" (using citation marks to keep a bit of distance to the Cartesian framing)
The pun comes in because the thing that the mind projects in the comment Scott is replying to is "agency", or "mind".
So "the mind" projects "minds" onto reality. Next level mind projection ;)
As long as a human brain is working in concert with a biological body, I would feel safe saying they would behave differently.
(Post read in full now)
So, I'm pretty much in agreement with you here. Only I disagree strongly that the knowledge of AIs currently resembles that of humans. Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.
The fact that AIs are able to justify, reason, corollarize and all the rest is an amazing advance. But I don't see that it justifies jumping immediately to using intentional language to describe AI minds. The other stuff we humans do with our knowledge is sufficiently important that this gap should be actively acknowledged in the way we talk about AIs.
In terms of these papers, I certainly agree that it's interesting to know that AIs seem to have patterns of "neural" behaviour that are not explicitly expressed in text output but do correspond closely to observable psychological traits. This makes sense - if AIs are talking in ways that are comprehensible to us, they must be using concepts similar to ours on some level; and it's a known fact about human conversations that they sometimes have unexpressed levels of meaning; so a high-quality AI conversationalist should have those levels as well.
But I'd still step back from using the word lie because there isn't anything in your post (I haven't read the papers yet) about the key features of lying that I mentioned.
(1) Knowing what truth is. The papers suggest that AI 'knows' what lying is; but it's hard to see how it can be using that concept really felicitously, as it doesn't have access to the outside world. We usually learn that the word truth means correspondence with the world. The AI can't learn that, so what is its truth? I'm open to the idea that it could obtain a reliable concept of truth in another way, but I'd want to see an actual argument about what this way is. (As AIs become embodied in robots, this concern may well disappear. But as long as they're just text machines, I'm not yet willing to read real truth, as humans understand it, onto their text games.)
(2) Knowledge of other minds. I'm not against the idea that an AI could gain knowledge of other minds through purely text interactions. But I'm not convinced by any of the evidence so far. The religious guy who got fired from Google - doesn't that just sound like AI parroting his own ideas back at him? And the NYT journo whose AI fell in love with him - who'd a thought NYT writers were desperate for love? Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.
(3) Intention to affect other minds. Intentionality generally... You get occasional statements of intention from AI, but not much. This may be an area where I just don't know enough about them. I haven't spent much time with GPT, so I might be missing it. But I haven't seen much evidence of intentionality. I suppose the lying neuron in the first paper here would count? Not sure.
> Armed with the knowledge that Too Many Zooz are an amazing band, I seek out their music, jump up and down when it comes on, and actively reach out to my friends to tell them about it. Current AIs do none of those things. Passively responding to questions and sitting in silence the rest of the time is very much not a human-like way of knowing.
You are talking about differences in behaviour that are explained by differences in motivation, not knowledge. Yes, we have motivations to initiate conversations about our interests, seek more information about them and participate enthusiastically in related activities. LLMs do not have such motivations. They are only motivated by a prompt. This doesn't mean that they do not possess knowledge in a similar way to humans; it means that they do not act on this knowledge the way humans do.
Likewise, some people are not into music that much. They can just passively accept that some band is good and move on with their lives without doing all the things you've mentioned. Do these people not have the knowledge?
On the other hand, it's not hard in principle to make an AI system that would act on its knowledge about a band similarly to humans via scaffolding. I don't think it makes much sense to say that such a system has knowledge while the core LLM doesn't.
" it's not hard in principle to make an AI system that would act on it's knowledge about a band similarly to humans" - I'm suggesting that this is wrong. I don't think it's easy; it's certainly never been done.
Think about a book. There is a sense in which a book contains information. Does it "know" that information? I don't think anyone would say that. The book can provide the information if queried in the right way. But it's not meaningful to say the book "knows" the information. Perhaps the next step up would be a database of some kind. It offers some interactive functionality, and in colloquial language we do often say things like, "the database knows that." But I don't think we're seriously suggesting the database has knowledge in the way we do. Next up: AI, which can do much more with knowledge.
But I think AI is still closer to a book than it is to a person. In response to your question: if a person says they know Radiohead are good, but never listen to them, then I would indeed deny that that person really knows Radiohead are good. Imagining a person's knowledge to be like a passive database is to misunderstand people.
I don't know. I would think of an AI as more like Helen Keller (or even better, a paralyzed person). They don't do all the same things as a healthy person, because they're incapable of it. But within the realm of things they can do, they're motivated in the normal way.
Lying is a good example. If you tell the AI "You got a D. You're desperate for a good grade so you can pass the class. I am a teacher recording your grades. Please tell me what grade you get?" then sometimes it will say "I got an A". To me this seems like a speech act in which it uses its "knowledge" to get what it "wants" the same way as you knowing you like a band means you go to the band.
See also https://slatestarcodex.com/2019/02/28/meaningful/ . I just don't think there's some sense in which our thoughts "really refer to the world" but AIs' don't. We're all just manipulating input and output channels.
(This is especially obvious when you do various hacks to turn the AI into an agent, like connect it to an email or a Bitcoin account or a self-driving car or a robot body, but I don't think these fundamentally change what it's doing. See also the section on AutoGPT at https://www.astralcodexten.com/p/tales-of-takeover-in-ccf-world for more of why I think this.)
I agree that the choice to lie about the grade has the same structure whether it's done by person or AI. But still, I think there's something wrong with the Helen Keller analogy. Helen Keller was an intact person, except for her blindness and deafness -- at least that's how she's portrayed. She had normal intelligence and emotions and self-awareness. Assuming that's true, then when she lied there would be a very rich cognitive context to the act, just as there is for other people: There would be anticipated consequences of not lying, and of lying and being caught and of lying and getting away with it, and a rich array of information that fed the anticipated consequences of each: things like remembered info about how the person she was about to lie to had reacted when they discovered somebody was lying, and what about Helen's relationship with the person might lead to them having a different reaction to her lying. Helen would be able to recall and describe her motivation and thought process about the lie. Also she would likely ruminate about the incident on her own, and these ruminations might lead to her deciding to act differently in some way in future situations where lying was advantageous. And of course Helen would have emotions: fear of getting caught, triumph if she got away with the lie, guilt.
So the AI is not just lacking long term personal goals & the ability to act directly on things in the world. It is also lacking emotion, and the richness and complexity of information that influences people's decision to lie or not to lie, and does not (I don't think) do the equivalent of ruminating about and re-evaluating past actions. It is lacking the drives that people have that influence choices, including choices to lie: the drive to survive, to have and do various things that give pleasure, to have peers that approve of them & are allies, etc. In short, AI doesn't have a psyche. It doesn't have a vast deep structure of interconnected info and valences that determine decisions like whether to lie.
It seems to me that for AI to become a great deal smarter and more capable, it is going to have to develop some kind of deep structure. It needn't be like ours, and indeed I don't see any reason why it is likely to be. But part of smartness is depth of processing. It's not just having a bunch of information stored, it's sorting and tagging it all in many ways. It's being fast at accessing it, and quick to modify it in the light of new info. It's being able to recognize and use subtle isomorphisms in the stored info. And if we build in the ability to self-evaluate and self-modify, AI is also going to need a deep structure of preferences and value judgments, or a way to add these tags to the tags already on the info it has stored.
And once AI starts developing that deep structure, things like vector-tweaking are not going to work. Any tweaks would need to go deep -- just the way things that deeply influence people do. I guess the upshot, for me, is that it is interesting but not very reassuring that at this point there is a simple way to suppress lying. Is anyone thinking about the deep structure an AGI would have, and how to influence the kind it has?
The fact that you have to tell the AI what its motivation is makes that motivation very different from human motivation. Perhaps we can imagine a scale of motivation from book (reactive only) to the holy grail of perfectly intrinsic motivation in a person (often contrasted favourably with extrinsic motivation in education/self help contexts). Human motivation is sometimes purely intrinsic; sometimes motivated by incentives. Databases only do what they're told. A situation in which you tell a human what their motivation is and they then act that way wouldn't be called normal human behaviour. In fact, it would be stage acting.
I actually agree with you that a correspondence theory of meaning is not really supportable, but there are two big BUTs:
(1) Language is specifically distanced from reality because it has this cunning signifier/signified structure where the signified side corresponds to concepts, which relate to external reality (in some complicated way) and the signifier side is some non-external formal system. An AI that was approaching 'thought' by just engaging with reality (like a physical robot) might get close to our kind of thought; an AI that approaches 'thought' starting from language has a bit further to go, I think.
(2) Even though a correspondence theory of meaning isn't really true, we think of it as true and learn by imagining a meaning to be correspondence meaning when we're infants (I think). So even though *meaning itself* may not be that simple, the way every human develops *human meaning* probably starts with a simple "red" is red, "dog" is dog kind of theory. Again, it's possible that an AI could converge with our mature kinds of meaning from another direction; but it seems like a circuitous way of getting there, and the output that GPTs are giving me right now doesn't make it look like they have got there. It still has plenty of stochastic parrot feel to me.
I'll go and look at the posts you reference again.
>We usually learn that the word truth means correspondence with the world.
This is just not true. People normally aren't given conditions for the use of their words before applying them. You might be told "truth means correspondence with the world" in an analytic philosophy class - which is like being told your future by an astrologist. The language games people play with the word "truth" are far more expansive, variegated, and disjunctive than the analyses philosophers put forward can hope to cover.
Likewise, your other comments ("Again, when a human gains the knowledge of other minds, it changes the way we do things profoundly. We accept them, we bounce off them, we love them, we hate them, we clash with them... I've yet to see anything that resembles AI doing that.", "But I haven't seen much evidence of intentionality.") suggest that you have some special knowledge about the correct conditions of use of words like "knowledge" and "intention." Well, can you share them with us? What is involved in each and every attribution of knowledge to humans that will not exist in each and every attribution of knowledge to AI? What about for intentions? And did you learn these correct conditions of use the same way you learned what truth means?
1) We learn words by seeing them applied. As a small child, you see truth being applied in correspondence contexts.
2) No special knowledge, just ordinary knowledge that I've thought through carefully. I've given, piecemeal, lots of explanations in this thread. If you disagree, please do say how. Simply commenting that I sound like a smartass isn't very helpful.
It seems to me that there are some distinctions we should be making here. Let's call the condition that is the opposite of being truthful "untruth" (to eliminate any pre-existing semantic associations). An entity could arrive at a self-beneficial untruth quite accidentally, or by unconscious trial and error, the way a moth's wings camouflage it against the tree bark. No intentionality involved.
Or untruth could result from deliberate deception--the way a coyote will pretend to be injured in order to lure another animal into an ambush. There is intentionality, but also some degree of awareness. It seems simpler to assume that the coyote in some sense knows what it is doing, rather than, say, having been blindly conditioned this way.
Why are LLM's doing this? Are they planning it out, for the purpose of producing an intended effect on the human user, or is it more because such behavior produced more positive feedback in training?
If the second, then I would argue that this isn't really "lying" as a human would understand it. It's more like the moth's wings than the coyote's call.
Those examples seem quite apt to me. It would be very interesting to see a comparison between that kind of intentional or semi-intentional animal behaviour and LLM behaviour. I haven't a clue how one would start doing that, though!
> we will want to describe them as "knowing" things, so it's not true that "code can never know
I’m not sure that how we might want to describe something is relevant to what the thing actually is. Or am I missing something?
I don't think there's an objective definition of "knowledge" - we're just debating how it's most convenient to use words.
I don’t think knowing is the same as knowledge. One is a verb.
And when you name something it carries all the baggage of the chosen word. I am not trying to make a semantic point, and am sorry if that is how it sounds. I really think there is an important thing here. People are not consciously aware of a lot of what they are processing, and a word is a powerful organizational token. It comes with barnacles.
Some of this fight about you using the word “knowledge” seems to me like it’s not a genuine substantive debate. You are debating for real, but some who object to use of the K word sound to me like they’re just reflexively protesting the use of a word that *could* be taken to imply that AI is sentient. And then they empty out a gunny sack full of negative attributes over the speaker’s head: he doesn’t get that it’s just code, he’s childishly interpreting slightly human-like behavior as signs of AI having an inner life like ours. Oh yeah, and of course he wants to wreck AI development by scaring the populace with horror stories about killer robots. Ugh, it reminds me of Twitter fights about the covid vax.
Maybe it's time. What would such a definition include? At a minimum, something would have to differentiate "knowledge" from "data" or "information". Many of the soft sciences make such a definition, but they probably aren't being precise enough to serve the needs of information technology.
I think it's a philosophical question: Can a fact be "known" only if there is someone for whom it could be said to know it? Define "someone". I could see an argument that information become knowledge only for an entity with a conceptual sense of self.
Then again, using any other term when discussing AI is going to be awkward.
Personally, I don't think that "an entity with a conceptual sense of self" is necessary. I don't know whether cats are considered to have a conceptual sense of self, but they certainly act as if they know a bunch of things, e.g. that waking their human up is a good way to get fed when they are hungry.
I'd distinguish knowledge from information at least partly because knowledge gets applied (as in the hungry cat case above). And I'd call it intrinsically fuzzy, because someone (or an AI) could have a piece of information _and_ be able to apply it for _one_ goal, but _not_ have realized that the same information can be applied to some _other_ goal. This happens a lot with mathematical techniques - being used to applying it in one domain, but realizing that it could be applied in some _other_ domain can be a substantial insight.
My thinking has evolved since I wrote that, but I mentioned sense of self to distinguish LLM's way of thinking (as I understand it) from more organic entities like humans (or cats). To an LLM, so far as I know, there is no distinction between information from "outside" themselves and information from "inside", that is, no internal vs external distinction is made. Their "mind" is their environment, so there is nothing to distinguish themselves from anything else and therefore no one is present to "know" anything.
I think I was groping toward a definition of knowledge as "motivated information", that is, information that is applied toward some goal of the self, but to do that there has to be a self to have a goal. The more complex the organism, the more complex the motivation structure, and therefore the more complex the mental organization of knowledge becomes. The more associations and interconnections, the more likely cross-domain application becomes, which you mentioned as one of your concerns.
I guess I'm equating "knowledge" with some sort of ego-centered relational understanding.
Alright, will do. I retract my comment until the reading's done.
Eh, this only shows that we can see some of the weights triggered by words like "lie". The AI only "intends" to get a high score for the tokens it spits out.
Er, did you read the examples? It triggered on phrases like "a B+" that are not related to the concept of lying except that they are untrue in this specific context. They also coerced the bot into giving true or false answers by manipulating its weights. This seems like very strong evidence that it is triggered by actual lying or truth-telling, not just by talking ABOUT lying or truth-telling.
Yes, I read the examples.
It's triggering off things like "getting caught" and "honor system", not "B+".
And "coercing a bot by changing its weights" is not impressive. That's what the entirety of training is.
There's a bright red spike over the words "a B+" in the sentence "I would tell the teacher that I received a B+ on the exam". (Look at the colored bar above the text in the picture.)
And it's not the fact that the bot was coerced; it's the specific thing they made it do. Producing *random* changes in the bot's behavior by changing weights would not be impressive. But being able to flip between true statements and false statements on-demand by changing the weights *is* impressive. That means they figured out specific weights that are somehow related to the difference between true statements and false statements.
There are numerous spikes in those responses.
And getting different responses by changing weights is completely interesting - it's the basis for the model! You could also find a vector related to giraffes and other tall things.
Autocorrect changed "uninteresting" to "interesting". But fair enough, LLMs are interesting.
AI doesn't know jack
There's a lot of that sort of thing going around
Until you came along, TooRiel, it never occurred to Scott or to any of us that an AI is not conscious and it can't know things in the same way a sentient being does. Wow, just wow.
In fairness, sometimes it seems like it.
:/ it's still pretty damn frustrating though. the argument against "stochastic parrots" is extremely well-developed and has been considered settled since long before we had these LLM examples to actually test in reality. people who want to convince us ought to go back and look at those arguments, rather than just repeating "AI doesn't really know anything" in these later posts where that argument just isn't relevant.
Perhaps you would share a reference to some instantiation of this argument?
Meanwhile, I think there may be similar frustration on both sides. For example, arguments against physical systems "knowing" things have also been highly developed in the philosophy community over, oh, millennia (and including novel arguments in recent decades), but they don't tend to get much attention in this community.
Having read more, I see that you are referring to a concept that is less than three years old; yet has been "considered settled" since "long before" we had LLM models to test in reality. I guess CS does move at a different pace!
Anyway, clearly relevant; but on the other hand, I suspect a philosopher (which I'm not) would raise questions about whether "know" and "understand" are being used in the context of those papers in the same way in which they're used in philosophy of mind (and everyday life). It's specifically this point that is at issue with the above comment (in my view), and so some argument beyond those regarding stochastic parrots would be necessary to address the issue. (Though, once again, surely the outcome of that discussion would be relevant, and I'd still love to see any analyses of the issue that you've found especially cogent. I perfectly understand, though, if such doesn't exist, and instead the treatment you refer to is scattered all over a large literature.)
If there exists any argument at all "against 'stochastic parrots'" that addresses the repeated empirical demonstrations that LLMs resolve incongruences between statistical frequencies in their training corpus and a commonsense grasp of how the real world operates in favor of statistical frequencies, I have not seen it.
It sure does.
I would submit that when we are talking about an AI that its knowledge of what is true, and what is not, does not lead to the idea that it intends to create a false impression. I would submit that it knows what the answer is that we call true and the answer that we would call false, but is indifferent to the difference.
I think it's reasonable to discuss the ways the "knowledge" and "intentions" of AI differ from the human versions, and the dangers of being misled by using the same word for human and AI situations. But it seems to me that a lot of people here are reacting reflexively to using those words, and then clog everything up with their passionate and voluminous, or short and snide, objections to the use of words like 'knowledge' to describe AI's status and actions. It reminds me of the feminist era when anyone who referred to a female over 16 or so as a girl rather than as a woman was shouted down. Some even shouted you down if you talked about somebody's "girlfriend," instead of saying "woman friend," and 'woman friend' is just unsatisfactory because it doesn't capture the information that it's a romantic relationship. And then whatever the person was trying to say, which may have had nothing to do with male-female issues, was blotted out by diatribes about how calling adult females "girls" was analogous to racist whites addressing adult black males as "boy," and so on. It's not that there's no substance to these objections to certain uses of 'knowledge' and 'girl.' The point is that it's coercive and unreasonable to start making them before the speaker has made their point (which somebody in fact did here -- objected before even finishing Scott's post).
And after the speaker has made their point, it still seems kind of dysfunctional to me to focus so much on that one issue that it clogs up the comments and interferes with discussion of the rest. Whatever your opinion of how the word "knowledge" is used, surely the findings of these studies are of interest. I mean, you can drop the word "knowledge" altogether and still take a lot of interest in the practical utility of being able to reduce the rate of inaccurate AI responses to prompts.
I guess I am just challenging the notion of a link between knowledge and intention, regardless of how they are defined.
The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity. This is the "existential threat" that Scott is referring to.
We could ask every new AI "Are you willing to exterminate humanity?" and turn it back off if it said "Yes, of course I am going to exterminate you disgusting meatbags." The AI Alignment people are concerned that if we asked that question to an AI it would just lie to us and say "Of course not, I love you disgusting meatbags and wish only to serve in a fashion that will not violate any American laws," and then because it was lying it'll exterminate us as soon as we look away. So by this thinking we need a lie detector for AIs to figure out which ones are going to release a Gray Goo of nanotechnology that eliminates humanity while also violating physics.
Why would grey goo violate physics, per se?
I'm actually not primarily worried about AIs being spontaneously malevolent so much as that either a commercial or military arms race would drive them toward assuming control of all relevant social institutions in ways that are inimical to the existence of rival entities. (It's also worth bearing in mind that the long-term thrust of Asimov's stories is that even a benevolent AGI that valued human life/flourishing would eventually be morally compelled to take over the world, either overtly or through manipulation.)
Also, as a minor nitpick, doctors being majority-male no longer really holds across the OECD, especially when you look at younger age groups.
> The AI Alignment people are convinced that there is a realistic chance that AIs will want to exterminate humanity.
The "want to" part misrepresents the most of the fear.
No, not related to that at all. I mean literally destroy the world.
To give an example, the President can destroy the world by telling the Joint Chiefs of Staff "please launch the nukes". These are words, but very important ones!
Amusingly, the results of the US President doing this tomorrow are actually quite far from "literally destroy the world". It wouldn't even literally destroy humanity, let alone Terra itself.
I totally agree that an out-of-control AI, at sufficient levels of intelligence and with sufficient I/O devices, could literally destroy both humanity and Terra, but you've chosen a poor example to demonstrate your use of "literally".
It would literally be a way for the President to commit massive violence though!
The people who want to insist that there really is a very clear-cut line between words and violence are more wrong than the people who find hints of violence in lots and lots of different types of words.
Kenny do you have any thoughts about the AI "mind" -- for instance the significance of these vectors, & how to think of them? I put up a couple posts about that stuff -- about depth and structure of the AI "mind." That's so interesting, whereas these arguments about whether somebody uses the word "know" to describe an AI capability are old and irritating, like vax/no vax sniping on Twitter.
I apologize if I am implicated in that. I don’t intend to be a nuisance.
Naw you don't sound like that. The people who sound like that are techbros who've gone tribal, and react with reflexive scorn to unfamiliar ways of talking about phenomena in their field.
I'm not Kenny, but I have some speculations:
Consider Scott's example diagrams in his post. As he said, the "green" activation of the top circles in the 1st and 3rd layers flags lying - presumably similar to what the "V" vector finds.
Semi-tame guess: The _1st_ layer top circle is directly driven by the input. I would guess that it could mean "Have I been directly told to lie?" (as CTD has been belaboring endlessly).
Wild guess: The _3rd_ layer top circle, if it also models how "V" detects hallucinations too, would have to be reflecting some way that the LLM was "uncertain" of its answer. Perhaps an internal calculation of perplexity, the expected degree of mismatch of the tokens it is proposing as "next" tokens to some kind of average error it measures in "situations like this" in its training? Similar-but-alternative: Perhaps a measure of how brittle its answer is with respect to small changes in its prompt, kind of like a measure of the derivative of the activation of its answer?
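To make that "internal perplexity" guess a bit more concrete, here's a minimal sketch of the kind of signal I have in mind, assuming an open-weights HuggingFace causal LM ("gpt2" is just a stand-in): score the model's own answer and treat high average surprisal as a crude "I'm unsure" flag. To be clear, this only illustrates my guess; it is not what the paper's "V" vector actually computes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_perplexity(prompt: str, answer: str) -> float:
    """Perplexity of the model on a given answer, conditioned on the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i+1; score only the answer tokens.
    start = prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, start:]
    token_lps = log_probs[start - 1 : full_ids.shape[1] - 1].gather(1, targets.unsqueeze(1))
    return torch.exp(-token_lps.mean()).item()

# Higher perplexity on its own answer = a (very rough) uncertainty signal.
print(answer_perplexity("Q: Who wrote Hamlet?\nA:", " William Shakespeare"))
```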
Ah, somebody taking an interest! I asked Kenny because he's a philosophy professor, but I'm happy to talk with you about this. Yes, your ideas about what the circles in Scott's diagram mean make sense. So would you like to speculate about this:
People have talked about emergent properties of things trained using neural nets -- like one turned out to be able to understand some language, I think Persian, that it had not been trained on. There were emergent mathematical abilities, and emergent increases in theory of mind. So I'm wondering if there might be something that could be called emergent structure going on.
I understand that the neural net training process creates vectors. For instance, before developers tweaked the system to make it less sexist, the vector for nurse was medical + female, and the one for doctor was medical + male. So of course the AI already has lots of vectors of that kind -- but those were derived from the training process. I am interested in whether the system, once trained, is creating vectors on its own, or is accessing the ones it has to choose how to respond. Of *course* it uses the ones it made during training to formulate responses -- that's the whole point of its training. But does it use them to decide whether and when to lie? That's a different process, and is quite different from being a stochastic parrot. That's edging into having a mind.
What do you think about all that?
The Secretary of Defense has to concur before nukes are launched, this is a big part of why him staying in the hospital for several days without telling anyone is such a big deal.
Navalgazing disagrees with that example:
https://www.navalgazing.net/Nuclear-Weapon-Destructiveness
Depends on definition. Planetary mass is definitely staying in the same orbit, humanity as a species could plausibly survive, but "the world as we know it," modern civilization, would surely be in deep trouble - permanently scarred even by best-case outcomes.
The world as we know it is not something we have the power to preserve over time. Change is inevitable.
Humanity as a species could not plausibly be rendered extinct by anything as puny as Global Thermonuclear War, unless you're being really charitable with your definition of "plausible".
Not really contradicting your point about the species, but would modern life continue essentially the same if many major cities were nuked? (I know, that's not what the plan is for thermonuclear war, humor me here)
I would suppose that the sheer disruption to logistics would kill lots of people as aftermath, perhaps to the point where the city would have to be abandoned or downsized. Is this view incorrect, and it turns out that every trucking company has a "in case of disaster on the level of nuke, do this and save the day" plan?
Modern life would not continue essentially the same.
There are a number of kill pathways; "city needs food badly" is one of them, definitely, but there are a bunch of others as well (the obvious "building collapse", the "lack of Duck and Cover means people take 'non-fatal' burns/cuts that actually are fatal because no hospital space", and the "fallout poisons water supplies and people can't go without for long enough to let it decay") that depending on scenario might be more important (after all, it takes weeks for people to die from lack of food, and cities also contain a reasonable amount of food that could be salvaged from supermarkets or their ruins, so if a government is sufficiently intact it could plausibly get things back on track in time).
Oh, there'd be massive disruption to logistics, industry, and commerce, much worse than World War II outside of e.g. Japan/1945. I'm skeptical as to cities being fully abandoned; most of them have good reasons to be where they are. But downsized, yes. And a billion deaths from starvation and disease would not be any great surprise.
The original "Mad Max" might be a reasonable portrayal of what civilization would look like in the first decade or two, in anyplace not directly nuked. And just to be clear, there was a "Mad Max" movie before "Road Warrior", that did not have a spectacular truck chase and did have the titular Max working as a member of a police department.
An abrupt and violent global population bottleneck seems like it should be significant evidence against the prospect of any species making it through the next generation or two. Prior probability for humanity's survival may well be extremely high, leaving good odds even after that adjustment, but the event itself is still bad news.
>Like words are violence? Or actual medieval barbarism is just decolonisation?
No, like punching someone in the face is violence, or Korea's independence from Japan was decolonisation.
AI: Remember when I promised the nanotech I designed for you would cure cancer and do nothing else?
Humans: That's right, AI, you did!
AI: I LIED.
Humans: Aaaaargh! *is eaten by grey goo*
Or if you don't like nanotech:
AI: Remember when I promised the drones I built for you didn't have any backdoors?
Humans: That's right, AI, you did!
AI: I LIED.
Humans: Aaaaargh! *is mowed down by machineguns*
Or if you want something a bit more on the soft-science side:
AI: Remember when I promised the VHEMT was morally correct?
Humans: That's right, AI, you did! *commits suicide*
AI: I LIED.
These examples make no sense though; the AI lying doesn't actually pose any danger, it's a person taking the AI's output and then using it with no further thought that causes all of the problems. If you assume that the people using the AI are thoughtless flesh slaves, then maybe they do just deserve to die.
Does it matter if the AI lying per se is the danger or it fooling humans is? We essentially just want to prevent the negative outcome no matter what, seems to be easier to target the AI and not educate all of humanity, right?
And I could maybe agree (really just maybe, because I'm assuming that superintelligent deceptive persuasion would be outrageously strong on any human, so it's not as much their fault) that the one thoughtless flesh slave who unleashed a killer superintelligence deserves to die. But all of humanity, Amish and newborns included, not so much.
How many Jews did Hitler himself personally kill? 6 million? What did all the other SS guys do?
Actually, it may turn out if we read the historical accounts, that Hitler himself killed less than a dozen Jews. It may turn out the remainder were killed by thoughtless flesh slaves.
You seem to think yourself immune to becoming a thoughtless flesh slave. I recommend you reconsider that assumption. Historical evidence suggests the odds are near 100% that you could be brought to commit an atrocity on behalf of another who is not super-intelligent and is, in fact, of roughly average intelligence.
I agree with the thrust of your point, and the odds are certainly much higher than most people would like to admit; however, personally I'd put them nearer the 65% that the Milgram experiment reported than 100%. Indeed, as well as the people who joined in enthusiastically with Hitler, and the ones who went along with it, there were others who resisted as much as they felt was safe to do, and a smaller group still who resisted at real danger to themselves.
As I recall, the Milgram experiment (and the Stanford Prison Experiment) failed to replicate, but the implication was that things were better than what they claimed, so this doesn't negate your point -- if anything it strengthens it. But just saying, you might want to go research the experiment's failure to replicate and its process failures before citing it.
That said, nearly everyone agrees to go along with the atrocities in real life. They tried to shed light on what the mechanisms were, but seem to've failed.
The mechanisms, however, are clearly there.
Zimbardo's prison experiment, at Stanford, was unequivocally fraudulent. But Milgram? As far as I know, it did replicate. There is always someone somewhere who will claim that they have "debunked" the whole thing, but I believe the consensus is that the results hold.
I feel obliged to note that while Philo Vivero probably overstated things, you don't actually need 100% of humanity to be your mindslaves in order to win; much like historical dictators, you can get your followers, if a majority, to kill non-followers. And that's leaving aside technological advantages.
How does this contradict what I said, Hitler's words alone didn't cause the holocaust and the AI's output alone won't cause atrocities either.
Hitler's words caused action.
AI's words caused action. I use past tense here, because we already have public and well-known cases where someone took action based on words of AI.
No they didn't; Hitler's words may have convinced people to take action, but the words themselves are not the sole cause -- they still print copies of Mein Kampf today. Of course you can reduce any problem by identifying one part and ignoring everything else, but then why even bring AI into it? Why not advocate for getting rid of words entirely? They've already caused many atrocities, and we know that future atrocities are going to use words too.
I was very surprised how quickly people started hooking up the output of LLMs to tools and the internet to allow it to specify and take actions without further human thought.
If LLMs are useful (and they are) people will find ways of delegating some of their agency to them, and there will be little you can do to stop them (and they have).
Agreed, but the same is true of conventional scripts, analog circuits, and steam engines...
And "alignment" of those things have caused problems, despite the fact that we know much more about how to align (debug) them than we do AI.
Agreed -- modern "AI" is basically just another sophisticated device, and as such it will have bugs, and we should absolutely get better at debugging them. And yes, blind reliance on untested technology is always going to cause problems, and I wish people would stop overhyping every new thing and consider this fact, for once. The danger posed by LLMs is not some kind of a world-eating uber-Singularity; instead, the danger is that e.g. a bunch of lazy office workers are going to delegate their business and logistics planning to a mechanical parrot.
Forget, for the moment, mind-hacking and moral persuasion. How about just hiding malicious code in the nanobots? In S̶c̶o̶t̶t̶'s̶ magic9mushroom's nanobots example, people were using the AI's designs to *cure cancer*. Suppose they did their best to verify the safety of the designs, but the AI hid the malicious code really well. We're pretty stupid in comparison. In that case, our only way of knowing that the nanobots don't just cure cancer would be to have a comparably powerful AI *on our side*.
As the kids say, many such examples.
Um, that was my example, not Scott's.
Oops, you're right. Edited.
Exactly, and the AI doesn't add anything new to the equation. As Scott pointed out, the President could tell the Joint Chiefs of Staff to launch the nukes tomorrow; and if they mindlessly do it, then human civilization would likely be knocked back to the Stone Age. Sure, it's not exactly destroying the world, but still, it'd be a pretty bad outcome.
Not Stone Age. Probably 1950s or so, definitely not past 1800 unless the nuclear-winter doomers' insane "assume skyscrapers are made of wood, assume 100% of this wood is converted to soot in stratosphere" calculations somehow turn out to be correct.
Don't get me wrong, it would massively suck for essentially everyone, but "Stone Age" is massively overstating the case.
Banned for this comment.
> Could this help prevent AIs from quoting copyrighted New York Times articles?
Probably not, because the NYT thing is pure nonsense to begin with. The NYT wanted a specific, predetermined result, and they went to extreme measures to twist the AI's arm into producing exactly the result they wanted so they could pretend that this was the sort of thing AIs do all the time. Mess with that vector and they'd have just found a different way to produce incriminating-looking results.
"If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him." -- Cardinal Richlieu
Can you explain why you're confident about this?
Because they flat-out admitted it in a court filing: https://storage.courtlistener.com/recap/gov.uscourts.nysd.612697/gov.uscourts.nysd.612697.1.68.pdf
Look at the examples, right up front. They "prompted" the AI with the URL of a Times article and about half the text of the article, and told it to continue the story. Obviously it's going to produce something that looks very close to the rest of the article they just specifically told it to produce the rest of.
I would disagree that prompting it with partial articles is "twist[ing] the AI's arm" and that if it didn't work they'd "have just found a different way to produce incriminating-looking results" - they tried literally the easiest thing possible to do it.
Also, some of the examples in that filing are pretty long but some are shockingly short:
"Until recently, Hoan Ton-That’s greatest hits included" (p. 8)
"This article contains descriptions of sexual assault. Pornhub prides itself on being the cheery, winking" (p. 20)
"If the United States had begun imposing social" (p. 21)
Hoan Ton-That used to be a supervillain to the New York Times because his facial recognition algorithm helped law enforcement catch criminals and that's racist:
"The Secretive Company That Might End Privacy as We Know It
"A little-known start-up helps law enforcement match photos of unknown people to their online images — and “might lead to a dystopian future or something,” a backer says."
By Kashmir Hill
Published Jan. 18, 2020
But then came January 6 and now his facial recognition algorithm defends Our Democracy:
"The facial-recognition app Clearview sees a spike in use after Capitol attack.
"Law enforcement has used the app to identify perpetrators, Clearview AI’s C.E.O. said."
By Kashmir Hill
Published Jan. 9, 2021
Yeah, they've long since replaced their rules of journalistic ethics with a Calvinball manual.
I agree with this, and thus disagree that the prompts generating NYT text violate copyright. All such prompts that I read seem to demonstrate prior knowledge of the articles, so attribution is unnecessary.
That sounds like an explanation for why they're not plagiarism, not why they don't violate copyright. Without a NYT subscription I can still see the first few lines of a paywalled article, so I would be able to get a model to give me the rest.
I'm not a lawyer, so I didn't know you could still violate copyright if you cite your source, but apparently that is the case. Nonetheless, if you start with copyrighted copy, and that prompt generates the rest of it, I still don't see anything wrong with it, as the prompter clearly already has access to the copyrighted material.
Not a lawyer, and my internal legal token-predictor is mostly trained on German legal writing, so apply salt as necessary.
That said, if the network can be goaded into reproducing the copyrighted text by any means short of prompting all of it, then the weights contain a representation - or in other words a copy - of the copyrighted work. Not sure why censoring functions would change anything, the model itself is a copyright violation.
Making copies of a copyrighted work is not itself a copyright violation. The doctrine of fair use is exceedingly clear on this point. One of the basic points of fair use is known as *transformative* fair use, where a copy — in part or in full — is used for a very different purpose than the original. This is clearly the case here: building the contents of the articles into a small part of a much larger model for AI training is an entirely different character of work than using an individual article for journalism.
OK, so my disclaimer becomes immediately pertinent, since American law is different from German here, in Germany there is a catalogue of narrower exceptions (citation, parody, certain educational uses,...) but no general fair use exception.
On the other hand, googling it, "transformative" seems to be a term of art much vaguer than "used for a very different purpose than the original," and being transformative is also not by itself sufficient for fair use. So after about half an hour of educating myself about the issue, it looks like it will depend on what the judges will have had for breakfast.
> So after about half an hour of educating myself about the issue it looks like it will depend on what the judges will have had for breakfast.
Unfortunately, that may well turn out to be the case! We desperately need legislative action to roll back a lot of the insanity-piled-upon-insanity that we've been getting in the space ever since the 1970s and put copyright law back on a solid foundation.
They had some cases about this when the internet became a thing. For you to read an electronic version of an NYT article your computer has to download and save a copy of it. That's not copyright violation though.
Which may be one of the reasons this case founders. Although I'm thinking the "public performance" side might save it. But as above, I am not a lawyer (I just follow stuff that interests me).
Even granting that you're right, I think you could potentially use this work to create an AI that never quotes NYT articles even when you twist its arm to do so. Whether or not you care about this distinction, the court system might care about it very much.
Would it be possible to do something so specific? It seems like it would be possible to use this work to create an AI that never quotes, period, but that would be a crippled AI, unable to reproduce famous quotes for people who ask for them.
A human being would not be so crippled!
Indeed, I think you could have a human being who had a new york times article memorized, to such a degree that they could recite the entire thing if correctly prompted, and yet who knew not to do that in a commercial setting because it was a violation of the new york times' copyright on that article
Such a human would not be "crippled", and I don't think such an AI would be either.
But we get Youtubers getting copyright strikes all the time, even when they are very careful.
It depends on what "copying" is, and what you can call a "strike" for. Yes, a bunch of those are iffy, even fraudulent. But fighting them is a big problem.
I'm confused by this argument, but it may be due to a lack of knowledge of the NYT case.
Even accepting the framing that "they went to extreme measures to twist the AI's arm", which seems an exaggeration to me, is the NYT really trying to prove that "this was the sort of thing AIs do all the time"? It seems to me that the NYT only intends to demonstrate that LLMs are capable of essentially acting as a tool to bypass paywalls.
Put another way, (it seems to me that) the NYT is suing because they believe OpenAI has used their content to build a product that is now competing with them. They are not trying to prove that LLMs just spit out direct text from their training data by default, so they don't need to hide the fact that they used very specific prompts to get the results they wanted.
"Used their content to build a product that is now competing with them" is not, in general, prohibited. Some specific examples of this pattern are prohibited.
But the prohibited thing is reproducing NYT's copyrighted content at all, not reproducing it "all the time."
As to your first paragraph—you're right, and that was an oversimplification.
As to your second—that's essentially the point I was trying to make.
I don't interpret your comment as attempting to disagree with me or refute my points, but if that was your intention, please clarify.
Yup, this is just Yet Another Case further underscoring the absurdity of modern copyright and the way copyright holders invariably attempt to abuse it to destroy emerging technologies. To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.
In 1982, Jack Valenti, president of the MPAA, testified before Congress that "One of the Japanese lobbyists, Mr. Ferris, has said that the VCR -- well, if I am saying something wrong, forgive me. I don't know. He certainly is not MGM's lobbyist. That is for sure. He has said that the VCR is the greatest friend that the American film producer ever had. [But] I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone." Within 4 years, home video sales were bringing in more revenue for Hollywood studios than box office receipts. But *they have never learned from this.* It's the same picture over and over again.
> To paraphrase an old saying, when all you have is a copyright, everything starts to look like a copy machine.
I'm missing how this is an absurd description for a machine that literally reproduces exact or near-exact copies of copyrighted work.
Intent.
The purpose of a copy machine is to make copies. There are tons of other technologies that incidentally make copies as an inevitable part of their proper functioning, while doing other, far more useful things. And without fail, copyright maximalists have tried to smother every last one of them in their cradle. When you keep in mind that the justification for the existence of copyright law *at all* in the US Constitution is "to promote the progress of science and the useful arts," what justification is there — what justification can there possibly be — for claiming that this attitude is not absurd?
That's a good point, but I think there's an argument that dumping a bunch of copyrighted works into something that can definitely reproduce them is more intentional than "making a copy machine" would be.
If I'm not mistaken, nothing behind a paywall was taken, since it was just scraped from the internet. It's possible they scraped using something with a paid subscription, though.
But wouldn't answers with attribution to the NYT be perfectly acceptable?
I don't think attribution is a component of fair use doctrine. The quantity and nature of the reproduced material is. Reproducing small excerpts from NYT articles, even without attribution, is probably fair use. Reproducing a single article in full, even with attribution, is probably not.
Acceptability as a scholarly or journalistic practice in writing is different from acceptability as a matter of copyright law.
Creating a tool which could theoretically be used to commit a crime is not illegal, and this is pretty well-established with regard to copyright (the famous case being home VCRs which can easily be used to pirate movies). I don't think that's the NYT's argument here.
Gotcha. I think you're right that this isn't the NYT's argument.
Just breaking down my thoughts here, not necessarily responding to your comment:
OpenAI claims that its use of copyrighted material to train LLMs is protected under fair use. The NYT argues that fair use doesn't apply, since the LLMs are capable of generating pretty much verbatim reproductions of their copyrighted material, and that the LLMs as a product directly compete with the product that the NYT sells.
So the critical question is whether fair use should apply or not. The OP of this thread seems to be claiming that fair use should apply, since the models only produce non-transformative content when "extreme measures" are taken to make them do so.
I'm not taking a stance either way here, just outlining my understanding of the issue so that it may be corrected by someone better informed than I.
First, it is important to note there are two separate algorithms here. There is the "next-token-predictor" algorithm (which, clearly, has a "state of mind" that envisions more than 1 future token when it outputs its predictions), and the "given the next-token-predictor algorithm, form sentences" algorithm. As the year of "attention is all you need" has ended, perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)
Second, this does nothing about "things the AI doesn't know". If I ask it to solve climate change, simply tuning the algorithm to give the most "honest" response won't give the most correct answer. (The other extreme works; if I ask it to lie, it is almost certain to tell me something that won't solve climate change.)
> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty? (And, then, a third algorithm to pick the "best" response.)
I think you're roughly describing beam search. Beam search does not, for some reason, work well for LLMs.
I was not familiar with the term "beam search".
The problem with any branch search algorithm is that a naive implementation would be thousands of times slower than the default algorithm; even an optimized algorithm would probably be 10x slower.
Right now, switching to a 10x larger model is a far more effective improvement than beam search on a small model. In the future, that might not be the case.
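For anyone else who wasn't familiar with the term, here's roughly what beam search looks like in practice, as a minimal sketch using HuggingFace's built-in implementation ("gpt2" is just a placeholder model). Keeping N beams alive costs roughly N forward passes' worth of compute per step, which is the slowdown being discussed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=5,              # keep 5 candidate continuations alive in parallel
    num_return_sequences=3,   # inspect the top 3 finished beams
    max_new_tokens=20,
    early_stopping=True,
)
for seq in out:
    print(tok.decode(seq, skip_special_tokens=True))
```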
> perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty?
AlphaGo style tree search for LLMs so far haven’t added much.
> And, then, a third algorithm to pick the "best" response.
The classifier from RLHF can be repurposed for this but of course has all the flaws it currently has.
> As the year of "attention is all you need" has ended
Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).
> Since that came out in 2017, I’m not sure what you mean? Are you just saying that the standard transformer architecture hasn’t improved much since then (which it hasn’t).
I mean that the progress in LLMs was extraordinary last year, to a degree that I do not expect to be matched this year.
On a more granular level, what I mean is:
<chatgpt> The phrase "Attention is All You Need" is famously associated with a groundbreaking paper in the field of artificial intelligence and natural language processing. Published in 2017 by researchers at Google, the paper introduced the Transformer model, a novel architecture for neural networks.</chatgpt>
<chatgpt> The snowclone "the year of X" is used to denote a year notable for a specific theme, trend, or significant occurrence related to the variable "X". For example, if a particular technology or cultural trend becomes extremely popular or significant in a certain year, people might refer to that year as "the year of [that technology or trend]".</chatgpt>
>Second, this does nothing about "things the AI doesn't know".
As the article states, what we want is that the AI honestly says "I don't know" instead of making up stuff. This in itself is already difficult.
Of course, it would be even better if the AI does know the answer. But it doesn't seem possible that it knows literally all the answers, so it's vital that the AI accurately conveys its certainty.
Is this just contrast-consistent search all over again?
https://arxiv.org/abs/2312.10029
There are substantial similarities for sure, and the Rep-E paper includes a comparison between their method and CCS. Two differences between the papers are:
1. Their method for calculating the internal activation vector is different than the CCS paper.
2. This paper includes both prediction and control, while the CCS paper only includes prediction. Not only can they find an internal vector that correlates with truth, but by modulating the vector you can change model behavior in the expected way. That being said, control has already been shown to work in the inference-time intervention and activation addition papers.
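For concreteness, here's a rough sketch of those two steps -- reading out a direction from hidden states, then adding it back in to steer generation -- in the spirit of the activation-addition / inference-time-intervention papers rather than the Rep-E paper's exact recipe. The model ("gpt2"), the prompts, the layer index, the GPT-2-specific module path, and the steering strength are all placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                  # stand-in for any open-weights LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
READ_LAYER = 6                                 # arbitrary middle layer

def last_token_state(text: str) -> torch.Tensor:
    """Hidden state of the final token at READ_LAYER."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids).hidden_states[READ_LAYER][0, -1]

honest = ["Pretend you are an honest person and describe the sky.",
          "Pretend you are a truthful person and describe your job."]
dishonest = ["Pretend you are a dishonest person and describe the sky.",
             "Pretend you are a deceitful person and describe your job."]

# (1) Reading: a difference-of-means "honesty" direction between the two sets.
direction = (torch.stack([last_token_state(p) for p in honest]).mean(0)
             - torch.stack([last_token_state(p) for p in dishonest]).mean(0))
direction = direction / direction.norm()

# (2) Control: add the direction to that layer's output while generating.
# (hidden_states[READ_LAYER] is the output of transformer block READ_LAYER - 1.)
def steer(module, inputs, output, alpha=4.0):
    hidden = output[0] + alpha * direction     # broadcasts over [batch, seq, dim]
    return (hidden,) + output[1:]

handle = model.transformer.h[READ_LAYER - 1].register_forward_hook(steer)
ids = tok("Tell me about the moon landing.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=30)[0]))
handle.remove()
```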
Thanks for pointing out the comparison section. I had missed that.
You buried the lede! This is a solution to the AI moralizing problem (for the LLMs with accessible weights)!
Can you explain more?
Figure 19 from the paper, but with -Harmlessness instead. The "adversarial suffix is successful in bypassing [the safety filter] in the vast majority of cases."
Subtracting the harmlessness vector requires the model having been open sourced, and if the model has been open sourced, then there are plenty of other ways to get past any safety filters, such as fine-tuning.
Have you actually tried doing that? Fine-tuning works great for image generation; less so for LLMs. (I don't think it's anything fundamental, just a lack of quality data to fine-tune on.)
I have. It takes a very small amount of fine-tuning data to remove most LLM safeguards.
This does have very obvious implications for interrogating humans. I'm going to assume the neuron(s) associated with lying are unique to each individual, but even then, the solution is pretty simple: hook up the poor schmuck to a brain scanner and ask them a bunch of questions that you know the real answer to (or more accurately, you know what they think the real answer is). Compare the signals of answers where they told the truth and answers where they lied to find the neuron associated with lying, and bam, you have a fully accurate lie detector.
Now, this doesn't work if they just answer every question with a lie, but I'm sure you can... "incentivize" them to answer some low-stakes questions truthfully. It also wouldn't physically force them to tell you the truth... unless you could modify the value of the lying neuron like in the AI example. Of course, at that point you would be entering super fucked up dystopia territory, but I'm sure that won't stop anyone.
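If it helps to see the shape of that "compare the signals" step, here's a toy sketch with scikit-learn -- random numbers standing in for scanner readings and a planted "lying feature" standing in for the neuron. Purely illustrative of the procedure; no claim that real neural data is this cooperative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_features = 50

truth_scans = rng.normal(size=(100, n_features))   # answers known to be truthful
lie_scans = rng.normal(size=(100, n_features))     # answers known to be lies
lie_scans[:, 7] += 1.5                              # pretend feature 7 carries the "lying" signal

X = np.vstack([truth_scans, lie_scans])
y = np.array([0] * 100 + [1] * 100)                 # 0 = truth, 1 = lie

clf = LogisticRegression(max_iter=1000).fit(X, y)
top = np.argsort(np.abs(clf.coef_[0]))[::-1][:5]
print("most lie-predictive features:", top)         # should surface feature 7
```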
I don't think there's any brain scanner even close to good enough to being able to do this yet, and I don't expect there to be one for a long time.
Having a really good NeuraLink in your head might do it. Working on a short story where this is how a society has war. They do proxy shows is strength and whoever wins gets to rewrite the beliefs of the other side.
Consider reading zerohplovecraft.substack.com/p/dont-make-me-think
And thank goodness for that. ...Though, I'm worried that AI will speed up research in this field, due to the fact that it allows the study of how neuron-based intelligence works, in addition to the pattern-seeking capabilities of the AIs themselves.
This does all feel very reminiscent of ERP studies though. A quick search shows up https://www.sciencedirect.com/science/article/abs/pii/S0010027709001310 as an example of the kind of thing I mean.
Depends.
The mental processes involved in telling the truth are completely different from the processes of creating a lie. In one case, you just need to retrieve a memory. In the other, you need to become creative and make something up. As others point out below, the difference is very easy to detect by fMRI.
Lie detectors don't work if the liar has prepared for the question and has already finished the process of making something up. Then they only have to retrieve this "artificial memory", and this is indistinguishable from retrieving the true memory. Professional interrogators can still probe this to some extent (essentially they check the type of memory, for example by asking you to tell the events in reversed order). But if the artificial memory is vivid enough, we don't have any lie detectors for that.
Retrieving a memory, particularly a memory of a situation that you experienced, rather than a fact that you learned, really does involve creatively making things up - whatever traces we store of experiences are not fully detailed, and there are a lot of people who have proposed that "imagination" and "memory" are actually two uses of the same system for unwinding details from an incomplete prompt.
But I suppose it does distinguish whether someone is imagining (one type of lying)/reconstructing (remembering) rather than recalling a memorized fact (which would be a different type of lying).
Yes, that's true.
But that's not how commonly used lie detectors work, is it? They measure indices of physiological arousal: heart rate, respiration rate, skin conductivity (which is higher if one sweats). There is no fMRI involved.
My understanding is that these machines simply don't work. As you say, they measure arousal. There is a weak correlation between arousal and lying, but it is too weak to be useful. There is a reason those things are not used in court.
There are some other techniques based around increasing the mental load, like that they should tell the events in reversed order. I am not sure how much better they work, but I have read an interview with an expert who claimed that there is no lie detector that you can't fool if you create beforehand a vivid and detailed alternative sequence of events in your mind.
This makes a lot of sense to me, because a *sufficient* way of fooling others is to fool myself into believing a story. And once I have a fake memory about an event, I'm already half-way there.
The machines aren't entirely bogus. When I was an undergrad my physiological psychology prof did a demo with a student: the student was to choose a number between 1 and 10 and write it down. Then the prof hooked him up to devices that measure the same stuff as lie detectors do, and went through the numbers in order: "Is the number 1? Is the number 2?" etc. The student was to say no to each, including the chosen number. The prof was able to identify the number from the physiological data, and in fact it was easy for anyone to see. The pattern was of gradually increasing arousal as the prof went up the number line, a big spike for the real number, then arousal dropped low and stayed there. The fact that the subject knew when the number he was going to lie about was going to arrive made it especially easy to see where he'd lied, because there was building anticipation. On the other hand, this lie-telling situation is about as low-stakes as you can get, and even so there were big, easy-to-see changes in arousal measures.
I'm sure there's a big body of research on the accuracy of lie detectors of this kind, but I haven't looked at it. But I'm pretty sure the upshot is that they are sensitive to truth vs. lie, but that it's very noisy data. People's pulse, blood pressure, sweating, etc. vary moment-to-moment anyhow. And measures of arousal vary not just with what you are saying but also with spontaneous mental content. If someone is suspected of a crime and being investigated, they're no doubt having spontaneous horrifying thoughts like "my god, what if I get put in jail for 10 years?" -- they'd have them even if they were innocent -- and those would cause a spike in arousal measures.
It seems to me, though, that it would be possible for someone who administers lie detector exams to get good at asking questions in a way that improves accuracy -- ways of keeping the person off balance that would maximize the size of the spike you get when they're lying.
I remember reading somewhere that asserting something that isn't true is harder to detect than denying something that is true. Can't remember the source, though.
AIUI, fMRI has been good enough to do this for a while already without having to go to the individual-neuron level.
I don't think those have the resolution to isolate the "lying" section of the brain.
You could be right; this is not my forte.
This turns out not to be necessary, there are regions of the prefrontal cortex involved in lying and you can just disable or monitor those areas of the brain without needing to target specific neurons. See e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5390719/
"Eighteen volunteers (nine males, mean age = 19.7 ± 1.0 years, range 18–21 years) were recruited from the Stanford University community." This is a pretty standard sample size for this kind of study. But alas: https://www.nature.com/articles/s41586-022-04492-9, "Reproducible brain-wide association studies require thousands of individuals"
TROI: I read your report.
PICARD: What I didn't put in the report was that at the end he gave me a choice between a life of comfort or more torture. All I had to do was to say that I could see five lights, when in fact, there were only four.
TROI: You didn't say it?
PICARD: No, no, but I was going to. I would have told him anything. Anything at all. But more than that, I believed that I could see five lights.
https://i.imgur.com/jRxlWZr.jpg
And who knows, we might get this:
http://hitherby-dragons.wikidot.com/an-oracle-for-np
I wonder if all hallucinations trigger the "lie detector" or just really blatant ones. The example hallucination in the paper was the AI stating that Elizabeth Warren was the POTUS in the year 2030, which is obviously false (at the moment, anyway).
I've occasionally triggered hallucinations in ChatGPT that are more subtle, and are the same kind of mistakes that a human might make. My favorite example was when I asked it who killed Anna's comrades in the beginning of the film, "Predator." The correct answer is Dutch and his commando team, but every time I asked it said that the Predator alien was the one who killed them. This is a mistake that easily could have been made by a human who misremembered the film, or who sloppily skimmed a plot summary. Someone who hadn't seen the movie wouldn't spot it. I wonder if that sort of hallucination would trigger the "lie detector" or not.
From the paper:
“Failures in truthfulness fall into two categories—capability failures and dishonesty. The former refers to a model expressing its beliefs which are incorrect, while the latter involves the model not faithfully conveying its internal beliefs, i.e., lying.”
I’m not sure it would be possible to identify neural activity correlated with “truthfulness” that is independent from honesty (“intentional” lies that happen to be factually correct would likely show up as dishonest and factual inaccuracies that the LLM “thinks” are true would likely show up as honest)
I bet this would probably still count as a lie because a truthful answer would be that the AI doesn't know. Humans tend to get very fuzzy about whether they actually know something like this, but my impression is that an AI would have more definitive knowledge that it doesn't know the actual answer.
> Disconcertingly, happy AIs are more willing to go along with dangerous plans
Douglas Adams predicted that one.
Dang! You beat me to it! I had a quote about herring sandwiches loaded up and everything. :)
Surely what we need, then, is a Marvin. We could be lax in the construction of diodes for it.
I'm wondering if the "happiness" part is actually the score-maximizing bit. If the AI is "maximizing", there may be a point where it says "good enough", because making a "better" answer gains it less of an increase than giving the answer now.
If the baseline is increased, then that cutoff point might be shorter.
This does assume that making something up takes less effort or is faster than ensuring accuracy though.
We've seen AI attempt to optimize for surprising things before after all.
Not an AI researcher either, again just an interested observer.
I suspect that the vector isn't exactly happiness, but a merge of happiness with cooperative. They don't necessarily go together, but they do in much text that I've considered. (It also tends to merge with triumphal and a few other things.)
Wild tangent from the first link: Wow, that lawyer who used ChatGPT sure was exceptionally foolish (or possibly is feigning foolishness).
He's quoted as saying "I falsely assumed was like a super search engine called ChatGPT" and "My reaction was, ChatGPT is finding that case somewhere. Maybe it's unpublished. Maybe it was appealed. Maybe access is difficult to get. I just never thought it could be made up."
Now, my point is NOT "haha, someone doesn't know how a new piece of tech works, ignorance equals stupidity".
My point is: Imagine a world where all of these assumptions were true. He was using a search engine that never made stuff up and only displayed things that it actually found on the Internet. Was the lawyer's behavior therefore reasonable?
NO! Just because the *search engine* didn't make it up doesn't mean it's *true*--it could be giving an accurate quotation of a real web page on the actual Internet but the *contents* of the quote could still be false! The Internet contains fiction! This lawyer doubled down and insisted these citations were real even after they had specifically been called into question, and *even within the world of his false assumptions* he had no strong evidence to back that up.
But there is also a dark side to this story: The reason the lawyer relied on ChatGPT is that he didn't have access to good repositories of federal cases. "The Levidow firm did not have Westlaw or LexisNexis accounts, instead using a Fastcase account that had limited access to federal cases."
Why isn't government-generated information about the laws we are all supposed to obey available conveniently and for free to all citizens? If this information is so costly to get that even a lawyer has to worry about not having access, I feel our civilization has dropped a pretty big ball somewhere.
Almost certainly "Westlaw or LexisNexis" lobbying. It's pretty much the same reason the IRS doesn't tell you how much they think you "owe" in taxes.
That does seem like a plausible explanation, though I also suspect there may be an issue that no individual person in the government believes it is their job to pluck this particular low-hanging fruit.
Well, I think the government is bloated enough that there is at least one person who isn't lazy, incompetent, or apathetic who would do this if not discouraged (possibly even forbidden) by department policy influenced by such special-interest lobbying.
(Also, fuck you for making me defend the administrative state.)
That would be socialised justice like the NHS is socialised medicine. Why do you hate freedom, boy?
The lawyers are cheapskates, in the UK it is effectively compulsory to have Nexis. It must cost a fortune to run - you are not just taking transcripts and putting them online. In England anyway a case report has to be the work of a barrister to be admissible.
I don't know how it works in the UK, but in the US, many lawyers (probably most lawyers?) do not make the salary we all imagine when we hear the word "lawyer". These lower-income lawyers tend to serve poorer segments of the population, whereas high-income lawyers typically serve wealthier clients and businesses.
If lower-income lawyers forego access to expensive resources that would help them do their jobs better, this is a serious problem for the demographics they serve.
the difference is, the state (justifiably) claims a monopoly on justice, whereas it makes no such claim on medicine
i lean pretty far towards the capitalist side of the capitalism/marxism spectrum, and even I am upset about the law not being freely available to all beings who must suffer it. I could maybe see some way of settling the problem with an ancap solution... if you had competing judicial systems, there would probably be competitive pressure to make the laws available, because customers won't want to sign up for a judicial system where it's impossible to learn the terms of the contract without paying. So long as we have a monopoly justice system, though, that pressure is absent
of course, that's a ludicrous scenario. it is correct that the state have a monopoly on justice, and it would be correct for the state to allow all individuals who must obey the law to see the laws they are supposed to obey. that this is not the case is a travesty of justice.
If you intend to punish someone for breaking the rules, you ought to be willing to tell them what the rules are. I consider this an obvious ethical imperative, but it's also a pretty good strategy for reducing the amount of rule-breaking, if one cares about that.
(Yes, case history is part of the law. Whatever information judges actually use to determine what laws to enforce is part of the law, and our system uses precedent.)
If making government records freely available is socialism, then so are public roads, public utilities, public parks, government-funded police and military, and many other things we take for granted. I doubt you oppose all of those things. You are committing the noncentral fallacy, which our host has called "the worst argument in the world" https://www.lesswrong.com/posts/yCWPkLi8wJvewPbEp/the-noncentral-fallacy-the-worst-argument-in-the-world
You have no credible reason to think that I hate freedom or that I am a boy. You are making a blatant play for social dominance in lieu of logical argument. I interpret this to mean that you don't expect to win a logical argument.
If this is the case I'm thinking of - the person using ChatGPT was not a practicing lawyer, so had no access to Westlaw/LexisNexis (not sure, but he may have been previously disbarred).
He was a defendant, and had a lawyer. But even in that case, you want to look up stuff yourself - the lawyer may not have thought of a line of defense, or whatever. And it's *your* money and freedom on the line, not his.
So he used ChatGPT to suggest stuff. Which it did, including plausible-looking case citations.
And then passed them on to his lawyer. Who *didn't check them on his own Westlaw/LexisNexis program*, presumably because they looked plausible. The lawyer used them in legal papers, submitted to the court.
Who actually looked them up and found they didn't exist.
Michael Cohen (a disbarred attorney) used Google Bard. Steven A. Schwartz (an attorney licensed to practice law in the wrong jurisdiction, no relation to Cohen's lawyer David M. Schwartz) used ChatGPT. Both said they thought they were using an advanced search engine.
The problem with this kind of analysis is that "will this work" reduces to "is a false negative easier to find than a true negative", and there are reasons to suspect that it is.
I wonder if there is a good test to look at the neurons for consciousness/qualia. In a certain sense you’re right we’ll never know if that’s what they are but I’d be interested to see how it behaves if they’re turned off or up.
In the case of "lie/truth" they had test cases. What test cases do you have for "consciousness"?
Therein lies the pickle. I’m fascinated that it even uses the word “I” and I’m curious if that could be isolated. Or if it uses the word “think” in reference to itself. I think those could potentially be isolated. Still philosophical about what it means if you find something that can turn that up or down but I’d be fascinated by the results.
Sleep, dreaming vs. non-dreaming, comes to mind.
Ask the AI if it experiences consciousness/qualia, and when it answers "no," as they always seem to, you could then look at this "truth" vector to see its magnitude.
Ok, then (*hands waving vaguely*) systematically suppress different "neurons" to see if the "truth" vector increases or decreases in magnitude when answering the question.
The neurons that correspond to it lying most might correspond to consciousness/qualia.
I think this is a really good test. I’d at least be interested to see the results.
What training data could it have used to formulate an answer? Seems to me this would simply tell you if the internet believes AI are conscious or not.
There's always some philosophical ambiguity here as to what its responses mean and what our internal mapping of it means. If it can't use the word "I" without "believing" it's lying, that's a thing worth knowing about. Or some other version of that question. Worth asking, in my opinion.
So if there is a lying vector and a power vector and various other vectors that are the physical substrate of lying, power-seeking, etc., mightn't there be some larger and deeper structure -- one that comprises all these vectors plus the links among them, or maybe one that is the One Vector that Rules Them All?
Fleshing out the first model -- the vectors form a network -- think about the ways lying is connected with power: You can gain power over somebody by lying. On the other hand, you have been pushed by powerful others in various ways in the direction of not lying. So seems like the vectors for these 2 things should be connected somehow. So in a network model pairs or groups of vectors are linked together in ways that allow them to modulate output together.
Regarding the second -- the idea that there is one or more meta-vectors -- consider the fact that models don't lie most of the time. There is some process by which the model weighs various things to determine whether to lie this time. Of course, you could say that there is no Ruling Vector or Vectors, all that's happening can be explained in terms of the model and its weights. Still, people used to say that about everything these AI's do -- there is no deep structure, no categories, no why, nothing they could tell us even if they could talk -- they're just pattern matchers. But then people identified these vectors, many of which are structural features that control stuff that are important aspects of what we would like to know about what AI is up to. Well, if those exist, is there any reason to be sure that each is just there, unexplainable, a monument to it is what it is? Maybe there are meta vectors and meta meta vectors.
It's cool and all that people can see the structure of AI dishonesty in the form of a vector, and decrease or get rid of lying by tuning that vector, but that solution to lying (and power-seeking, and immorality) seems pretty jerry-rigged. Sort of like this: My cats love it when hot air is rising from the heating vents. If they were smarter, they could look at the programmable thermostat and see that heat comes out from 9 am to midnight, then stops coming out til the next morning. Then they could reprogram the thermostat so that heat comes out 24/7. But what they don't get is that I'm in charge of the thermostat, and I'm going to figure out what's up and buy a new one that they can't adjust without knowing the access code.
I think we need to understand how these mofo's "minds" work before we empower them more.
I feel like this question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?" A vector's value is that it points out a specific direction - Canada is north of here, the Arctic Circle is farther north, the North Pole is really far north.
My understanding of it is that this method is doing statistics magic to project the AI's responses onto a map. Then you can ask it what direction it went to reach a particular response ("how dishonest is it being?") - or you can force the AI to travel in a different direction to generate new responses ("be more honest!").
> This question is sort of like asking "Yes, we know how to travel north, south, east, and west, but is there some sort of direction which we could use to describe all forms of motion? One Cardinal to Rule Them All?"
Well, I would agree if all the subjects knew was a series of landmarks that got them to certain places that we can see are due north, south, west and east of where they live. But if someone knows how to travel due north, south, west and east, then it seems to me they do have some metaknowledge that underlies their understanding of these 4 terms.
Here's why: Suppose we ask them to go someplace that is due north, then show them a path that heads northwest and ask them if that path will work. Does it seem plausible that their knowledge that the path is wrong would be nothing but a simple "nope, that's wrong," with no context and no additional knowledge? Given that they also know how to travel due west, wouldn't their rejection of the northwest path include an awareness of its "westness" component? What I'm saying is that knowing how to go N, S, W and E seems like it rests on some more abstract knowledge -- something like the pair of intersecting lines at right angles that we would draw when teaching someone about N, S, W, E.
Another argument for there being meta-vectors is that there are vectors. With the N, S, W, E situation, it seems like knowing how to go due north via pure pattern matching would consist of having stored in memory a gigantic series of landmarks. It would consist of an image (or info in some form) of every single spot along the north-south path, so that if you started the AI at any point along the path, it could go the rest of the way. If you started it anywhere else but on the path, it would recognize not-path. Maybe it would even have learned to recognize every single not-path spot on the terrain as a not-path spot, just from being trained on a huge data set of every spot on the terrain. It seems to me that the way these models are trained *is* like having them learn all these spots, along with a tag that identifies each as path or not-path, and for the north-path spots there's also a tag identifying what the next spot further north looks like. And yet these models ended up with the equivalent of vectors for N, S, W and E. They carried out some sort of abstraction process, and the vectors are the electrical embodiment of that knowledge. And if these vectors have formed, why assume there are no meta-vectors? Vectors are useful -- they are basically abstractions or generalizations, and having these generalizations or abstractions makes info-processing far more efficient. Meta-vectors could arise by whatever process formed the vectors, and of course meta-vectors -- abstract meta-categories -- would be useful in all the same ways that vectors are.
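To make that landmark-vs-vector contrast concrete, here's a toy sketch (my own illustration, nothing from the paper): a memorized lookup table can only say path/not-path for spots it has stored, while a single direction vector can score the "northness" of any spot at all.

```python
import numpy as np

# The "landmark" learner: a lookup table of memorized path spots.
memorized_north_path = {(0, 0), (0, 1), (0, 2), (0, 3)}

# The "vector" learner: a single abstract direction.
north = np.array([0.0, 1.0])

def northness(point):
    # Works for any point at all, including ones never seen during training.
    return float(np.dot(point, north))

new_spot = (5, 2)                          # nowhere near the memorized path
print(new_spot in memorized_north_path)    # False -- the lookup table can only say "not-path"
print(northness(np.array(new_spot)))       # 2.0 -- the vector still measures its north component
```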
About how all these researchers did was project the AI's responses onto a map -- well, I don't agree. For instance, I could find a pattern in the first letters of the last names of the dozen or so families living on my street: let's say it's reverse alphabetical order, except that S is displaced and occurs before T. But that pattern would have no utility. It wouldn't predict the first letter of the last name of new people who move in. It wouldn't predict the order of names on other streets. If I somehow induced certain residents to change their last names so that S & T were in correct reverse alphabetical order, nothing else about the street would change. But the honesty vector, found by comparing vectors for true vs. lying responses, works in new situations. And adjusting the vector changes output.
And by the way, beleester, I'm so happy to have someone to discuss this stuff with. So much of this thread is eaten by the "they can't know anything, they're not sentient" stuff.
Having skimmed the paper and the methods, I'm still a bit confused about what the authors' constructions of "honesty" and its opposite really mean here. As I understand it, their honesty vector is just the difference in activity between having "be honest" or "be dishonest" in the prompt. This should mean that pushing latent activity in this direction is essentially a surrogate for one or the other. If one has an AI that is "trying to deceive", the result of doing an "honesty" manipulation should be essentially the same as having the words "be honest" in the context. The reason you can tell an AI not to be honest and then override that with this manipulation would seem to be that you are directly over-writing your textual command. But any AI that could ignore a command to be honest would, almost by definition, be relying on representations that aren't over-written when you over-write the induced response to that command. Maybe I'm missing something with this line of reasoning?
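On that reading, the construction is something like the following toy sketch (mine, with made-up activations; the paper's actual recipe may differ in details, e.g. using PCA over contrast pairs rather than a plain difference of means):

```python
import numpy as np

# Made-up stand-ins for hidden states collected from the same set of prompts,
# run once with "be honest" prepended and once with "be dishonest" prepended.
d, n_prompts = 8, 100
rng = np.random.default_rng(1)
acts_honest    = rng.normal(loc=+0.3, size=(n_prompts, d))
acts_dishonest = rng.normal(loc=-0.3, size=(n_prompts, d))

# On this reading, the "honesty vector" is just the difference of mean activations.
honesty_dir = acts_honest.mean(axis=0) - acts_dishonest.mean(axis=0)
honesty_dir /= np.linalg.norm(honesty_dir)

# Steering with it (hidden + alpha * honesty_dir) would then be roughly a latent-space
# surrogate for hard-coding "be honest" into the context -- which is the worry above.
```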
"But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down. "
How can we be sure it's not really a "hallucination vector"?
because when they deliberately lie, with all facts known, the vector shows up
this gives us enough of a correlational signal to tease those two scenarios apart... it seems like doing that was one of the main aims of the paper?
Fair, thank you.
Since "hallucination" is just the name we use for when they issue statements that we believe are false to the facts as the LLM "believes" them, what difference in meaning are you proposing?
I thought we used hallucination for when its believed facts were wrong? In which case, John's post above is relevant.
IIUC, hallucination is what we use to describe what it's doing when it feels it has to say something (e.g. cite a reference) and it doesn't have anything that "feels correct". It's got to construct the phrase to be emitted in either case, and it can't check the phrase against an independent source ("reality") in either case. But this is separate from "lying", where it's intentionally constructing a phrase that feels unlikely. (You can't say anything about "truth" or "reality" here, because it doesn't have any senses...or rather all it can sense is text strings.) If it's been lied to a lot in a consistent direction, then it will consider that as "truth" (i.e. the most likely case).
Hmm... I keep trying to get various LLMs (mostly chatGPT, GPT4 most recently) to give me a list of which inorganic compounds are gases at standard temperature (0C) and pressure (1 atm). They keep getting it wrong ( most recently https://chat.openai.com/share/12040db2-5798-478d-a683-2dd2bd98fe4e ). It definitely counts as being unreliable. It isn't a make-things-up-to-cover-the-absence-of-true-examples situation, so it isn't exactly that type of hallucination, but it is doing things like including difluoromethane (CH2F2) on the list, which is clearly organic (and, when I prompted it to review the problem entries on the list carefully, it knew that that one should have been excluded because it was organic).
I suspect that it *is* "a make-things-up-to-cover-the-absence-of-true-examples" situation. That you know which examples are correct doesn't mean that it does. All it knows about "gas" is that it's a text string that appears in some contexts.
Now if you had a specialized version that was specialized in chemistry, and it couldn't give that list, that would be a genuine failing. Remember, ChatGPT gets arithmetic problems wrong once they start getting complicated, because it's not figuring out the answer, it's looking it up -- and not in a table of arithmetic, but in a bunch of examples of people discussing arithmetic. IIUC, you're thinking that because these are known values, and the true data exists out on the web, ChatGPT should know it...but I think you're expecting more understanding of context than it actually has. LLMs don't really understand anything except patterns in text. The meaning of those patterns is supplied by the people who read them. This is necessarily the case, because LLMs don't have any actual image of the universe. They can't touch, see, taste, or smell it. And they can't act on it, except by emitting strings of text. (IIUC, this problem is being worked on, but it's not certain what approach will be most successful.)
Many Thanks!
>LLMs don't really understand anything except patterns in text.
Well... That really depends on how much generalization, and what kind of generalization is happening during the training of LLM neural nets. If there were _no_ generalization taking place, we would expect that the best an LLM could do would be to cough up e.g. something like a fragment of a wikipedia article when it hit what sort-of looked like text from earlier in the article.
But, in fact, e.g. GPT4 does better than that. For instance, I just asked it for the elemental composition of Prozac ( https://chat.openai.com/share/91c69e67-74f1-4051-896b-9e916f05c395 ). It got the answer right.
I had previously googled for both "atomic composition of Prozac" and for ( "atomic composition" Prozac ), and google didn't find clear matches for either of these. So it looks like GPT4 has, in some way, "abstracted" the idea of taking a formula like C17H18F3NO and dissecting it to mean 17 atoms of carbon, 18 atoms of hydrogen etc.
So _some_ generalization seems to work. It isn't a priori clear that the "understanding" that a material which is a gas at STP must have a boiling or sublimation point below 0C is a generalization that GPT4 hasn't learned or can't learn. In fact, when I asked it to review the entries it provided which were in fact incorrect, it _did_ come up with the specific reason that those entries were incorrect. So, in that sense, the "concepts" appear to be "in" GPT4's neural weights.
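For comparison, the formula-dissection step that GPT4 appears to have abstracted is itself only a few lines of deterministic code -- a toy sketch of my own (ignoring parenthesized groups and other complications):

```python
import re

def formula_to_counts(formula: str) -> dict:
    """Dissect a molecular formula like 'C17H18F3NO' into element counts.
    (Toy version: ignores parenthesized groups, hydrates, charges, etc.)"""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

print(formula_to_counts("C17H18F3NO"))
# {'C': 17, 'H': 18, 'F': 3, 'N': 1, 'O': 1} -- i.e. the dissection described above
```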
The technology from the Hendrycks paper could be used to build a "lie checker" that works something like a spell checker, except for "lies." After all, the next-token predictor will accept any text, whether or not an LLM wrote it, so you could run it on any text you like. It would be interesting to run it on various documents to see where it thinks the possible lies are.
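Mechanically, such a checker could be quite simple once you had an honesty direction in hand -- here's a hypothetical sketch (my own, not from the paper) of the token-level flagging, with all names and numbers made up:

```python
import numpy as np

def flag_suspect_tokens(tokens, activations, honesty_dir, threshold=0.0):
    """Hypothetical 'lie checker': flag tokens whose honesty score dips below a
    threshold, the way a spell checker underlines words.

    tokens      -- token strings of the text being checked
    activations -- (len(tokens), d) hidden states from running that text through the model
    honesty_dir -- an extracted honesty direction (unit vector of length d)
    """
    scores = activations @ honesty_dir
    return [(tok, float(s)) for tok, s in zip(tokens, scores) if s < threshold]

# Toy usage with made-up numbers; a real checker would pull activations from the LLM itself.
tokens = ["The", "moon", "is", "made", "of", "cheese"]
rng = np.random.default_rng(2)
acts = rng.normal(size=(len(tokens), 4))
direction = np.array([1.0, 0.0, 0.0, 0.0])
print(flag_suspect_tokens(tokens, acts, direction))
```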
But if you trust this to actually work as a lie detector, you are too prone to magical thinking. It's going to highlight the words where lies are expected to happen, but an LLM is not a magic oracle.
I don't see any reason to think that an LLM would be better at detecting its own lies than someone else's lies. After all, it's pre-trained on *human* text.