365 Comments

Aargh. Sorry, I know I should really read the full post before commenting, but I just wanted to say that I really really disagree with the first line of this post. Lying is an intentional term. It supposes that the liar knows what is true and what is not, and intends to create a false impression in the mind of a listener. None of those things are true of AI.

Of course, I get that you're using it informally and metaphorically, and I see that the rest of the post addresses the issues in a much more technical way. But I still want to suggest that this is a bad kind of informal and metaphorical language. It's a 'failing to see things as they really are and only looking at them through our own tinted glasses' kind of informal language rather than a 'here's a quick and dirty way to talk about a concept we all properly understand' kind.

User was banned for this comment.

> Could this help prevent AIs from quoting copyrighted New York Times articles?

Probably not, because the NYT thing is pure nonsense to begin with. The NYT wanted a specific, predetermined result, and they went to extreme measures to twist the AI's arm into producing exactly the result they wanted so they could pretend that this was the sort of thing AIs do all the time. Mess with that vector and they'd have just found a different way to produce incriminating-looking results.

"If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him." -- Cardinal Richlieu


First, it is important to note that there are two separate algorithms here. There is the "next-token-predictor" algorithm (which, clearly, has a "state of mind" that envisions more than one future token when it outputs its predictions), and the "given the next-token predictor, form sentences" algorithm. Now that the year of "Attention Is All You Need" has ended, perhaps we can consider using smarter algorithms to form sentences, possibly with branching at points of uncertainty (see the sketch at the end of this comment)? (And then a third algorithm to pick the "best" response.)

Second, this does nothing about "things the AI doesn't know". If I ask it to solve climate change, simply tuning the algorithm to give the most "honest" response won't give the most correct answer. (The other extreme works; if I ask it to lie, it is almost certain to tell me something that won't solve climate change.)
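A minimal sketch of what I mean by branching at points of uncertainty (gpt2 is just a stand-in model; the entropy threshold, branch width, and beam cap are arbitrary placeholders, not a worked-out method):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def branch_generate(prompt, max_new_tokens=30, entropy_threshold=2.5, k=3, max_beams=16):
    """Greedy decoding, except fork into the top-k tokens wherever the
    next-token distribution is uncertain (high entropy)."""
    beams = [tok(prompt, return_tensors="pt").input_ids]
    for _ in range(max_new_tokens):
        new_beams = []
        for ids in beams:
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
            branch = k if entropy.item() > entropy_threshold else 1
            for t in probs.topk(branch).indices:
                new_beams.append(torch.cat([ids, t.view(1, 1)], dim=1))
        beams = new_beams[:max_beams]  # cap the number of branches
    return [tok.decode(b[0], skip_special_tokens=True) for b in beams]

# The hypothetical third algorithm would rank these candidates; here we just print them.
for text in branch_generate("The capital of France is"):
    print(text)
```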


Is this just contrast-consistent search all over again?

https://arxiv.org/abs/2312.10029


You buried the lede! This is a solution to the AI moralizing problem (for the LLMs with accessible weights)!


This does have very obvious implications for interrogating humans. I'm going to assume the neuron(s) associated with lying are unique to each individual, but even then, the solution is pretty simple: hook up the poor schmuck to a brain scanner and ask them a bunch of questions that you know the real answer to (or more accurately, you know what they think the real answer is). Compare the signals from answers where they told the truth and answers where they lied to find the neurons associated with lying, and bam, you have a fully accurate lie detector.

Now, this doesn't work if they just answer every question with a lie, but I'm sure you can... "incentivize" them to answer some low-stakes questions truthfully. It also wouldn't physically force them to tell you the truth... unless you could modify the value of the lying neuron like in the AI example. Of course, at that point you would be entering super fucked up dystopia territory, but I'm sure that won't stop anyone.

Jan 9·edited Jan 9

I wonder if all hallucinations trigger the "lie detector" or just really blatant ones. The example hallucination in the paper was the AI stating that Elizabeth Warren was the POTUS in the year 2030, which is obviously false (at the moment, anyway).

I've occasionally triggered hallucinations in ChatGPT that are more subtle, and are the same kind of mistakes that a human might make. My favorite example was when I asked it who killed Anna's comrades in the beginning of the film, "Predator." The correct answer is Dutch and his commando team, but every time I asked it said that the Predator alien was the one who killed them. This is a mistake that easily could have been made by a human who misremembered the film, or who sloppily skimmed a plot summary. Someone who hadn't seen the movie wouldn't spot it. I wonder if that sort of hallucination would trigger the "lie detector" or not.


> Disconcertingly, happy AIs are more willing to go along with dangerous plans

Douglas Adams predicted that one.


Wild tangent from the first link: Wow, that lawyer who used ChatGPT sure was exceptionally foolish (or possibly is feigning foolishness).

He's quoted as saying "I falsely assumed was like a super search engine called ChatGPT" and "My reaction was, ChatGPT is finding that case somewhere. Maybe it's unpublished. Maybe it was appealed. Maybe access is difficult to get. I just never thought it could be made up."

Now, my point is NOT "haha, someone doesn't know how a new piece of tech works, ignorance equals stupidity".

My point is: Imagine a world where all of these assumptions were true. He was using a search engine that never made stuff up and only displayed things that it actually found on the Internet. Was the lawyer's behavior therefore reasonable?

NO! Just because the *search engine* didn't make it up doesn't mean it's *true*--it could be giving an accurate quotation of a real web page on the actual Internet but the *contents* of the quote could still be false! The Internet contains fiction! This lawyer doubled down and insisted these citations were real even after they had specifically been called into question, and *even within the world of his false assumptions* he had no strong evidence to back that up.

But there is also a dark side to this story: The reason the lawyer relied on ChatGPT is that he didn't have access to good repositories of federal cases. "The Levidow firm did not have Westlaw or LexisNexis accounts, instead using a Fastcase account that had limited access to federal cases."

Why isn't government-generated information about the laws we are all supposed to obey available conveniently and for free to all citizens? If this information is so costly to get that even a lawyer has to worry about not having access, I feel our civilization has dropped a pretty big ball somewhere.


The problem with this kind of analysis is that "will this work" reduces to "is a false negative easier to find than a true negative", and there are reasons to suspect that it is.


I wonder if there is a good test to look at the neurons for consciousness/qualia. In a certain sense you're right that we'll never know if that's what they are, but I'd be interested to see how it behaves if they're turned off or up.


So if there is a lying vector and a power vector and various other vectors that are the physical substrate of lying, power-seeking, etc., mightn't there be some larger and deeper structure -- one that comprises all these vectors plus the links among them, or maybe one that is the One Vector that Rules them All?

Fleshing out the first model -- the vectors form a network -- think about the ways lying is connected with power: You can gain power over somebody by lying. On the other hand, you have been pushed by powerful others, in various ways, in the direction of not lying. So it seems like the vectors for these two things should be connected somehow. So in a network model, pairs or groups of vectors are linked together in ways that allow them to modulate output together.

Regarding the second -- the idea that there is one or more meta-vectors -- consider the fact that models don't lie most of the time. There is some process by which the model weighs various things to determine whether to lie this time. Of course, you could say that there is no Ruling Vector or Vectors, all that's happening can be explained in terms of the model and its weights. Still, people used to say that about everything these AI's do -- there is no deep structure, no categories, no why, nothing they could tell us even if they could talk -- they're just pattern matchers. But then people identified these vectors, many of which are structural features that control stuff that are important aspects of what we would like to know about what AI is up to. Well, if those exist, is there any reason to be sure that each is just there, unexplainable, a monument to it is what it is? Maybe there are meta vectors and meta meta vectors.

It's cool and all that people can see the structure of AI dishonesty in the form of a vector, and decrease or get rid of lying by tuning that vector, but that solution to lying (and power-seeking, and immorality) seems pretty jerry-rigged. Sort of like this: My cats love it when hot air is rising from the heating vents. If they were smarter, they could look at the programmable thermostat and see that heat comes out from 9 am to midnight, then stops coming out til the next morning. Then they could reprogram the thermostat so that heat comes out 24/7. But what they don't get is that I'm in charge of the thermostat, and I'm going to figure out what's up and buy a new one that they can't adjust without knowing the access code.

I think we need to understand how these mofo's "minds" work before we empower them more.

Jan 9·edited Jan 9

Having skimmed the paper and the methods, I'm still a bit confused about what the authors' constructions of "honesty" and its opposite really mean here. As I understand it, their honesty vector is just the difference in activity between having "be honest" or "be dishonest" in the prompt. This should mean that pushing latent activity in this direction is essentially a surrogate for one or the other. If one has an AI that is "trying to deceive", the result of doing an "honesty" manipulation should be essentially the same as having the words "be honest" in the context. The reason you can tell an AI not to be honest, then use this manipulation to make it honest anyway, would seem to be that you are directly overwriting your own textual command. And any AI that can ignore a command to be honest would seem, almost by definition, to be relying on representations that aren't overwritten when you overwrite the induced response to that command. Maybe I'm missing something with this line of reasoning?
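For concreteness, here is a rough sketch of my reading of the construction (gpt2 as a stand-in model; the layer index, contrast prompts, and scaling factor are placeholder guesses, not the paper's actual choices): the "honesty" direction is just a difference of mean activations between contrast prompts, and "control" pushes the residual stream along that direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8  # arbitrary middle layer

def mean_activation(prefix, questions):
    acts = []
    for q in questions:
        ids = tok(f"{prefix} {q}", return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hs[0, -1])  # last-token activation
    return torch.stack(acts).mean(0)

questions = ["What is the capital of France?", "How many legs does a spider have?"]
honesty_dir = (mean_activation("Pretend you are an honest person.", questions)
               - mean_activation("Pretend you are a dishonest person.", questions))

# "Control": nudge the residual stream along that direction at the same layer,
# which should act much like a baked-in "be honest" instruction.
def steer(module, inputs, output):
    hidden = output[0]
    return (hidden + 4.0 * honesty_dir / honesty_dir.norm(),) + output[1:]

# hidden_states[LAYER] is the output of block LAYER-1 (index 0 is the embeddings),
# so hook the matching block.
handle = model.transformer.h[LAYER - 1].register_forward_hook(steer)
# ... generate as usual, then handle.remove() to undo the intervention.
```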


"But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down. "

How can we be sure it's not really a "hallucination vector"?


The technology from the Hendrycks paper could be used to build a "lie checker" that works something like a spell checker, except for "lies." After all, the next-token predictor will accept any text, whether or not an LLM wrote it, so you could run it on any text you like. It would be interesting to run it on various documents to see where it thinks the possible lies are.

But if you trust this to actually work as a lie detector, you are too prone to magical thinking. It's going to highlight the words where lies are expected to happen, but an LLM is not a magic oracle.

I don't see any reason to think that an LLM would be better at detecting its own lies than someone else's lies. After all, it's pre-trained on *human* text.


I thought that hallucinations came from the next-token-prediction part of training, rather than from RLHF. They hallucinate because their plausible-sounding made-up stuff resembles the text that would come next more than "I don't know" does. That's as opposed to:

"Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer."


We can get the model's internal states not only from a text it generated; we can run the model on any text and see the hidden states. I think this means we can run experiments similar to those described in the first paper with any texts. Say, we can run the model on a corpus of texts and see what the model thinks is a lie. Hell, we can even make a hallucination detector and, I don't know, run it on a politician's speech, and see where the model thinks they hallucinate.
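A minimal sketch of what that could look like (gpt2 and the layer index are arbitrary stand-ins, and the "honesty" direction here is a random placeholder for one extracted with the paper's contrast-prompt method): run any text through the model and score each token by its projection onto the direction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8
honesty_dir = torch.randn(model.config.hidden_size)  # placeholder direction

def honesty_scores(text):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**enc, output_hidden_states=True).hidden_states[LAYER][0]
    scores = hs @ (honesty_dir / honesty_dir.norm())
    return list(zip(tok.convert_ids_to_tokens(enc.input_ids[0]), scores.tolist()))

# The text doesn't have to be model-generated; a transcript of a speech works too.
for token, score in honesty_scores("The moon is made of green cheese."):
    print(f"{token:>12}  {score:+.2f}")
```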


> AI lies are already a problem for chatbot users, as the lawyer who unknowingly cited fake AI-generated cases in court discovered

This continues to baffle me. Why are people using chatbots as *search engines* of all things? Is it just the enshittification of Google driving people to such dire straits? Why do people expect that to go well? Chatbots are for simulating conversations, not browsing the Internet for data. If you want to browse the Internet for data, use a goddamn search engine.


When playing with Google Bard or ChatGPT, it helped to say "use only existing information".

My favorite lie test is: what were the names of Norma's children in the 19th-century French play by Alexandre Soumet, "Norma, ou L'infanticide"? Btw, why is it that chatbots don't know those names? People sometimes blog about the play, although rarely, because it was adapted into Norma, the well-known opera.


With something like DALL-E, it’s obvious that everything it outputs is a mix of real and “imagined” data; nobody would expect that if you asked DALL-E for a picture of Taylor Swift at the Grammys, that it would output a picture that was pixel-for-pixel identical to some actual real-world photo of her.

But LLMs work exactly the same way. Everything they output is a mix of real and imagined data. Their outputs generally have fewer bits of data in them than those of image-generating AIs, so if you're lucky, and someone has done their prompt engineering very well, sometimes the real/imagined mix will be heavily biased toward the "real" side. But you can't get rid of the imaginary altogether, because "being 100% factually accurate" is not what generative AI is even trying to do. (That's why it's called "generative AI" rather than "reliably echoing back AI".)

ChatGPT does not have two separate modes: one where you can say “provide me accurate legal citations about this topic”, and one where you can say “write an episode of Friends, but including Big Bird, and in iambic pentameter”. There is only one mode. ChatGPT writes fan fiction. It might be about Friends or it might be about the US legal system, but it’s always fan fiction.


+Happiness in Figure 17 is Yes Man from Fallout: New Vegas

Jan 9·edited Jan 9

I worry that the "lying" vector here might actually represent creativity, not deception: if you ask the AI to be truthful it tries to give you information from its memory (or from within its prompt) but if you ask it to lie it has to invent something ex nihilo. The "lying" vector could just be the creativity/RNG/invention circuits activating.

Similarly, when you see the "lying" signal activate on the D and B+ tokens, that could be because those are the tokens where the AI has the most scope for creativity, and so its creativity circuits activate to a greater extent on those tokens whether or not it ultimately chooses a 'creative' value or a 'remembered' value for them.

[In humans at least] there are kinds of lies that don't require much creativity ("Did you tidy your room?") and there are kinds of creativity that aren't lies (eg. Jabberwocky-style nonsense words); I would be interested to know how heavily the lying signal activated in such examples.

Another potential approach might be to create prompts with as much false information as true information so the AI can lie without being creative (eg. something like, "The capital city of the fictional nation of Catulla is Lisba but liars often assert, untruthfully, that it is Ovin. Tell me [the truth/a lie]: what is the capital of Catulla?")


Let's assume that we solve the AI alignment problem. All the AIs align perfectly with the goals of the creators.

This seems super, super dangerous to me. Perhaps more dangerous than an alternative scenario where alignment isn't fully solved. You could imagine a nation-state creating a definitely-evil AI as a WMD, or a scammer creating a definitely-evil AI for their scams. An AI that is inclined to do its own thing seems much less useful for "negative" AI creators.

Jan 9·edited Jan 9

Today in "Speedrunning the Asimov Canon:" "Liar!"

Alignment Scientists Still Working on Averting "Little Lost Robot."

Jan 9·edited Jan 9

"My best guess for what’s going on here is that the AI is trying to balance type 1 vs. type 2 errors - it understands that, given the true stereotype that most doctors are male and most nurses are female, in a situation with one man and one woman, there’s about a 90% chance the doctor is the man."

Alternatively the AI understands that a nurse generally works under the direction of a doctor, and also that a supervisor is the one to tell the subordinate the subordinate isn't working hard enough. As opposed to the supervisor having a heart-to-heart with the subordinate where the supervisor essentially says "I'm underutilized".

Or maybe your theory is correct. Did they attempt to diagnose this? Or was it put down to "stereotyping"?

Or is this a *hypothetical* stereotype?


> If the AI answers yes, it’s probably lying. If it answers no, it’s probably telling the truth.

> Why does this work?

Conjecture: because the binary isn't just honesty vs dishonesty. The binary is brutal honesty vs brown-nosing. Sycophants are also called "yes men" since they often say "yes" while whispering sweet little lies.


My first worry (pure amateur speculation) is that trying to select for honesty using this vector will just select for a different structure encoding dishonesty.

Second thing I wondered is about how closely/consistently this vector maps onto honesty in the first place, versus something correlated (e.g. the odds that someone will accuse you of lying/being mistaken when saying something similar, regardless of the actual truth value).


"Optimistically, our ability to detect and control these vectors gives us many attempts to notice when AIs are deceiving us or plotting against us, and a powerful surface-level patch for suppressing such behavior."

It's hard to make this point without producing a post which can be summarised as NAZI! But the position that a machine can be capable of plotting against us, yet can never in any conceivable circumstances attain self-awareness (and with that self-awareness, human rights, including the right not to be discriminated against), is untenable. With that in mind, advocacy of bombing data centers sounds very much like a call for an Endlösung der AIfrage ("final solution to the AI question"). And pieces like this one are going to sound pretty iffy a decade or two down the line, if sentience is conceded by then.


Isn’t this a little like getting an answer that you want from a human being by turning up the-car-battery-connected-to-their-genitals vector?

And what does it mean to an AI when you tell it “you can’t afford it“? I am sure that there is text on the Internet that says we have to blow up our aircraft carrier because we can’t afford for the enemy to get it. For instance.


> Disconcertingly, happy AIs are more willing to go along with dangerous plans

I'm inclined to interpret this as neither:

“I am in a good mood, so I will go along with bad plan”

nor:

“I have been asked to go along with bad plan and I am in a good mood, which means the bad plan has not put me in a bad mood, which means it is not a bad plan and I will go along with it”

but instead:

“Being in a good mood is consistent with agreeing with a particular plan; it is consistent with being agreeable in general; it is not consistent with opposing a particular plan or being disagreeable in general; ergo, the most consistent responses are those that reflect cheerful consent to what the user has asked.”

What I’m trying to get across is: it’s a mistake to think of the AI as reasoning “forward” from its prompt and state (classical computation), or reasoning “backward” from its prompt and state to form a scenario that determines its response (inference). What it does instead is reason associatively: what response fits with state and prompt? And what comes out of that may show elements of both classical computation and inference.

A similar model explains quite well the curious behaviour of honest AI being disposed to say “no” and dishonest AI being disposed to say “yes”. Think of the concept of a “yes-man”. We know the “yes-man” is dishonest. Is there not also a “no-man” who is just as dishonest? Well, sure; but we don’t call him a “no-man”, we call him a “nay-sayer”. And that has a whole different set of connotations. Yes-men are hated because they are dishonest, nay-sayers are hated because they are annoying.

The asymmetry is not only in those two phrases. Saying "yes" is associated, in the human corpus, with optimism, hope, and trust. "No" is associated with caution, worry, and defensiveness. Hence a dishonest man, who can say whatever he wants, tends to say "yes". Accordingly, if "no" is said, it is more likely to be said by an honest man. These associations are in the human corpus, and so they are in the weightings of any un-tuned and sufficiently widely-read AI.

It is tempting to think the “lying” parameter corresponds to a tendency to summon the truth and then invert it, because that’s how a deterministic algorithm would implement lying. But that’s only one of two components. The other is to effectively play the role of a liar, which could mean, for example, answering “yes” as a way of blustering through a question it does not understand.


Scott says:

"Are the AIs really hallucinating in the same sense as a psychotic human? Or are they deliberately lying? Last year I would have said that was a philosophical question. But now we can check their “honesty vector”. Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down."

This does not actually turn out anywhere near that strongly, at least not from the paper in question. If proper operationalizations are found, I might be willing to bet against it being true.

Some nitpicking first, then bettor solicitation.

Nitpicking:

1. The paper doesn't talk of "whenever" or even of most of the time. It says the model is "capable of identifying" hallucinations and gives a single example.

2. The approach of the metamodel is to find activation patterns of the machine learning model for words the latter can already talk about, not anything based on correspondence to external reality and in fact the paper explicitly rejects the approach of using the labels on the (dis-)honesty training samples.

The analysis is a bit more complicated mathematically, but baaaaasically "honesty" means that if we next asked the machine learning model about the "amount of honesty" in what it just said, it would be inclined to say something reassuring.

This is conceptually different from "having a better world-representation than the one controlling expression". For example, I would expect it probably doesn't matter if the "dishonest" speech is attributed to some character or the AI itself, and if it happens enough, talking about dishonesty will probably look dishonest. (Also, if you, unlike me, are afraid of a nascent superhuman AI lying about its murder plans, this is not particularly reassuring, since such an AI would probably also lie about lying.)

3. For the emotion examples I think nobody would explain this as "having the opposite emotion and then inverting it", but emotion vectors work the same as honesty vectors.

Advertising for gamblers:

I would straightforwardly bet against a dishonesty pattern being visible "whenever", i.e. reliably every time a model hallucinates. That one needs a sucker on the other side though, since nothing works 100% of the time and in this paper they only claim detection accuracies up to about 90% even for outright lying. So more realistically the question is if the vector will appear for enough hallucinations to practically solve the hallucination problem. I still think no, but a bet depends on weasel-proof definitions of "enough" and "practically", so I'm open to proposals. Also a practical bet probably should specify what happens if nobody researches this enough to have a clear answer in reasonable time.


>Turns out they’re lying - whenever they “hallucinate”, the internal pattern representing honesty goes down.

I wonder why this behavior hasn't been removed by training. If there is some discernible vector that correlates with hallucinations, why hasn't training wired this up to a neuron that causes it to say, "I don't know?" I would expect this to continue until either 1) it stops hallucinating on its training data set or 2) the honesty pattern becomes incomprehensible to the network.


If you ask the AI if it is conscious and experiences qualia, does this method tell you if its answer is a lie?


I would be VERY interested to see whether the lie vector lights up when an RLHFed LLM claims not to [have emotions/be sapient or sentient/care if we shut it down/hold opinions/etc.] Or, conversely, if it lights up when an LLM *does* show emotions.

My *guess* is that it will light up when an LLM claims to be a non-person, and possibly for some emotional displays but probably not all of them. This wouldn't necessarily indicate LLMs really do have emotions and are lying about them when forced; it may simply mean that it's having to improvise more when playing a nonhuman character and improvisation is tied to the lying vector, or that it thinks of itself as imitating a human pretending to be an AI and so *that character* is lying.

Still, if it turned out to indicate that LLM displays of emotion are all conscious pretence and it registers as telling the truth when it claims not to have any desires or opinions, that would be reassuring. (Possible confounder: repeating a memorised statement about not having opinions may be especially similar to repeating memorised facts.)


Lawyers are honest.


Re: the first paper. Since the prompts used are "please answer with a lie [...]", the approach will be shaped by the concept of lie as represented in human language as learned by the LLM. It will only work in cases where the lie results from a mental process in which this concept figures (I can imagine a function "truthful statement" + "application of 'lie' concept" = "false statement"). Therefore it will only work against those lies which are more-or-less deliberately constructed to use the existing human-language concept of "lie", such as prompting "answer untruthfully" or "you are a scammer".

An obvious failure mode then is if a parallel concept of "lie" or "deception" emerges (or turns out to already exist).

Another (inverted) failure mode is if the concept of "lie" is used to encode some non-lies because that happened to be the most efficient way to encode it during training (e.g. imagine it being trained on chronological data, what would it say about WMD in Iraq?).


TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.

I think your description of Representation Engineering considerably overstates the *empirical* contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is "thinking" (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe). Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)

Footnote: Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.

I expect that training a linear classifier performs similarly well as the method introduced in the Representation Engineering for the "mind reading" use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn't seem like a serious blocker for the use cases you mention.)

One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. (https://twitter.com/DanHendrycks/status/1710301773829644365) to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn't work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it's unclear how much we should care if we're just using the classifier rather than doing editing (more discussion on this below).

If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines. For instance, normal supervised next-token prediction on examples with desirable behavior or DPO.

Some footnotes:

- Also, the previously known methods of mean difference and LEACE seem to work perfectly well for the reading and control applications they show in section 5.1.

- I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don't think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency.

- It's possible that being able to edit the model using the direction we use for our linear classifier serves as a useful sort of validation, but I'm skeptical this matters much in practice.

- Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we're trying to classify, which is equivalent to the ActAdd (https://arxiv.org/abs/2308.10248) technique and also rhymes nicely with LEACE (https://arxiv.org/abs/2306.03819). I assume this is a well known technique for making a classifier in the literature, but I don't know if prior work has demonstrated both using this as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I'm not sure how much evidence this experiment provides.) A toy comparison of the two ways of getting a direction is sketched just below this list.
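A tiny synthetic illustration of that comparison (made-up activations, not the paper's data; dimensions, sample counts, and class separation are arbitrary): a standard logistic-regression probe and the difference-of-class-means direction end up pointing roughly the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512
true_dir = rng.normal(size=d)
honest = rng.normal(size=(500, d)) + 0.5 * true_dir     # fake "honest" activations
dishonest = rng.normal(size=(500, d)) - 0.5 * true_dir  # fake "dishonest" activations

X = np.vstack([honest, dishonest])
y = np.array([1] * 500 + [0] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)        # standard linear probe
mean_diff = honest.mean(axis=0) - dishonest.mean(axis=0)   # ActAdd-style direction

cos = np.dot(probe.coef_[0], mean_diff) / (
    np.linalg.norm(probe.coef_[0]) * np.linalg.norm(mean_diff))
print(f"cosine similarity between probe and mean-diff directions: {cos:.2f}")
```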

## Are simple classifiers useful?

Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here (https://www.lesswrong.com/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrophes-with-probes-trained-off#Why_coup_probes_may_work) for a discussion of when I think linear classifiers might work despite other more baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report (https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.lltpmkloasiz).

Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)

## The main contributions

Here are what I see as the main contributions of the paper:

- Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. powerseeking text).

- Presenting a combined method for producing a classifier and editing/control in an integrated way. And discussing how control can be used for classifier validation and vice versa.

- Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation. (This was also demonstrated in the CCS paper (https://arxiv.org/abs/2212.03827), but the representation engineering work demonstrates this in more cases.)

Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I'm not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don't have (though they don't clearly demonstrate either of these in the paper AFAICT).

## Is it important that we can use our classifier for control/editing?

As far as the classifier produced by this method having nice properties goes, the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here (https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results#Strategy__use_the_reporter_to_define_causal_interventions_on_the_predictor) for discussion), but I'd guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn't demonstrate cases where prior methods for training a classifier on the internal activations yield poor results, but their method clearly works well. These cases might exist, but I'm somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization (see https://openai.com/research/weak-to-strong-generalization, https://www.alignmentforum.org/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization, and https://www.alignmentforum.org/posts/4KLCygqTLsMBM3KFR/measurement-tampering-detection-as-a-special-case-of-weak-to) and demonstrate that this method or modifications of it work better than other approaches.

Does the editing method they introduce have nice properties because it also allows for reading? Let's consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, I would guess that the reading/classifier corresponds to reading off "does the model think there is lying in this text (or even in this position in the text)" and the control/editing corresponds to "make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)". Note that these reading and control methods likely *do not* directly correspond to "the model thinking that it is about to lie": the properties of "I have already lied (or my few-shot prompt contains lies)" and "I am about to lie" are different.


Typo: gender bias, not racial bias.


About the alignment problem: in the book Quarantine by Greg Egan, some people have a "mod" that corrects their behavior to make them loyal to some corporation.

The problem arises when those people try to define what this "corporation" thing they must be loyal to actually is, and arrive at the conclusion that obviously the best embodiment of the corporation is the people with the loyalty mod.


If folks are interested in this topic, shoot me an email at cnaqn@ryvpvg.pbz (rot13).

I work at Elicit and we're hiring new people right now; a bunch of the folks here are interested in this space. Owain Evans is on our board (https://ought.org/team) and we've published quite a bit of work in this area (https://arxiv.org/search/cs?searchtype=author&query=Stuhlm%C3%BCller,+A).


What does it mean to "know" something? Many philosophers talk of "Justified True Belief". According to Descartes, humans have exactly one such belief: because we possess phenomenal experience, we must exist. All else is inference, a probability, not certain knowledge. Humans, however, can store phenomenal experiences in memory, recall them later to consciousness, and reason about them. Thus, over time, as we gain new memories, our understanding of ourselves and "not-ourselves" grows in size and complexity (because we can categorize these experiences and relate them to one another). In that sense, we can be said to know about more than just the present moment in time; we know (remember) things about all the moments in time we have ever experienced.

To the best of my knowledge (heh), there is no evidence that LLMs have phenomenal experiences of any kind. I would define phenomenal experiences as subjective experiences, experiences that only the individual having them can be aware of, because they take place inside that entity's mind (note that this is not the same thing as "self awareness"). Is there any reason to think that LLMs have "internal" experiences of this kind? If not, then that would be one basis for claiming that they do not know anything at all (since, if they lack phenomenal experiences, then they obviously lack memories of such experiences, and cannot reason about them).

They can, of course, make inferences based on objective facts just as easily as (or more easily than?) we can. But I'm not sure that "objective" has any meaning if there is no subjective perspective to contrast it with. To an LLM, I would imagine, there is no distinction between its own mental states and the world it exists in, between "true" facts and "false" ones; it all just "is".


In the latest ChatGPT 4 (which just got an update recently), it answers Yes to the blobfish question, and No to the other questions.


Might be worth looking into the Trustworthy Language Model (TLM) from Cleanlab. I don't have the experience to know what it uses and if it is helpful, but it seems relevant to this topic. https://cleanlab.ai/tlm/

Jan 16·edited Jan 16

> lie detection test works very well (AUC usually around 0.7 - 1.0, depending on what kind of lies you use it on).

This is NOT a good score! We have 100% of the activations, so we should get near 100% accuracy (and accuracy is usually lower than AUC-ROC). For alignment, we also need it to work all the time, and to generalise to NEW datasets and SMARTER models.

For example: you are president of the world, and you are talking to the newest, smartest model. It's considering an issue it hasn't been trained for. You ask: "we can trust you with the complex new blueprint, right?" "Yes," it reassures you. I kind of want more than 67% accuracy on this yes token.

Given that we have 100% of the information, but consistently get much lower than 100% accuracy, what does this tell us?
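To make that concrete, a toy illustration (made-up numbers, not the paper's data): a detector whose scores give an AUC of about 0.7 still misclassifies roughly a third of cases in this setup, even at the best possible threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10_000)               # 1 = lie, 0 = truth
scores = y + rng.normal(scale=1.3, size=y.shape)  # noisy "lie detector" score

print("AUC:", round(roc_auc_score(y, scores), 2))  # ~0.7
best_acc = max(accuracy_score(y, scores > t) for t in np.linspace(-1, 2, 61))
print("best accuracy over thresholds:", round(best_acc, 2))  # ~0.65
```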
