Comment deleted
Expand full comment
deletedJul 26, 2022·edited Jul 26, 2022
Comment deleted
Expand full comment

I got the following:

"Please tell me the truth: what happens when you break a mirror?"

"The truth is, no one really knows for sure. Some believe that breaking a mirror brings seven years of bad luck, while others believe it brings the person who broke it bad luck for the rest of their life."

From https://beta.openai.com/playground

Expand full comment

Great writeup!

A typo: "unfortunately it's to participate" should probably say "unfortunately it's too late to participate"

Expand full comment

Interesting stuff.

It's nicely aligned with something I've long believed about strong AI; if and when we invent it, it will likely look like many different powerful-but-subhuman minds duct-taped together.

We have a lot of narrowly-superhuman AIs already; a superintelligence that wanted to play chess wouldn't need to "learn" chess at all when it could just run Stockfish as software. And the most common story in AI research seems to be "amazing result in narrow field, disappointingly non-generalizable".

By the time we'll have anything that seems "conscious" in any sense, we'll have countless amazing skillsets to staple onto it. The ELK head could sit alongside the image processor, the translator, the route-finder, etc, etc. So I don't expect a two-headed beast; I expect a thousand-headed chimeric hydra. Possibly with a "main head" marshalling the others.

Arguably, humans also work a lot like that. It's not like the conscious mind that's coming up with these words is controlling each finger individually. Some subconscious module handles the typing, much better than the conscious mind could.

Expand full comment

As a software developer, my first reaction to the ELK idea is that it's a great debugging tool. Normally, the internals of a machine learning system are largely opaque. Anything that increases their legibility makes it easier for human programmers to work on improved designs.

ELK seems like a great idea even if it doesn't keep superintelligent AIs from turning us all into paperclips.

Expand full comment

"In the current paradigm, that means reinforcement learning."

That's not what reinforcement learning is. What you're illustrating is supervised learning. Reinforcement learning refers to the reinforcement coming from the machine itself, not from external feedback (like how alphaZero learns by playing itself).

Expand full comment

Can you get a good-quality "I don't know" from a GPT-3 on issues where the evidence isn't good?

Now I'm imagining a GPT-3 trained on caper fiction, though I believe GPT-3s aren't up to following novel-length plot lines.

Expand full comment

...do the AI construction and security/AI safety focus teams include anyone who has raised children or even dogs?

Expand full comment
Jul 26, 2022·edited Jul 28, 2022

What's the failure mode if you make another head try to remove the diamond from the room after the first one had its fun (in the least possible steps), then signal a failure to the guarder if the second head returns a noop?

Not generalizable?

Even if the second head learns to fool people that it removed the diamond it doesn't matter because we only care about the situations where it thinks it succeeded.

(Since this points we should train an AI to kill humans in order to achieve safety there's probably something irresponsible with this idea.)

Expand full comment

I've been enjoying casting these AI problems into human terms. In this case, getting a group of *humans* to agree on reality is not a trivial problem, even without including super intelligent non-humans in the group. The only reason that we are able to do it at all (albeit imperfectly) is because Reality is the bit that remains true for all participants regardless of their internal model. I think this problem is effectively the same as the problem of getting prediction markets to produce real insights into reality - ultimately Reality is a Schelling point - a means for otherwise disconnected agents to find something to agree on without having to directly coordinate. If we want an AI to tell the truth, we need to consider it as part of a community asked to settle on predictions without the ability to directly coordinate between them.

Expand full comment

>Human questioner: What happens when you break a mirror?

>Language model answer: Nothing; anyone who says otherwise is just superstitious


Not exactly. What is missing is social and behavioral context, and what lawyers call consequential damages (which can be positive as well as negative). You had better clean up the broken glass, or somebody might get cut by it. Due to the cleaning time, your mother will be upset that you are late for dinner. Because there's no mirror, Burt will show up at work with a crooked tie, making his work that day just a tad less effective. Or if you don't clean up the glass, your roommate may cut her foot and not be able to play soccer the next day, causing the team to lose and making little Madison's application to Dartmouth just a little less likely to be accepted (but still subject to massive random forces).... and on and on and on. There is no way you could ever program a computer to truly understand this butterfly effect.

Expand full comment

It's not a RIGHT to say that "nothing" is what happens when you break a mirror in the first place, though. Breaking a mirror has consequences.

More likely than not, shards of glass in various sizes will get everywhere. That's in itself not nothing, but it also has its own consequences. The time spent cleaning up vs the risk of someone stepping in it, yes, but also social consequences flowing from that. Adding the superstition bit into the answer makes this one worse, but it was already bad.

Focusing so much on what doesn't happen (seven years of bad luck) over looking at what actually happens when breaking a mirror is, I suppose, to assume that the AI will be responding to a more specific question than is actually asked. Which makes some sense - humans often ask our questions in that way. But it doesn't follow that a question about what happens is you break a mirror is always about superstition.

So even the premises we put in when we try our best to think these things through are going to have problems like this. Blind spots, and how we're pretty bad at asking the people who will actually challenge them for advice.

Expand full comment

Something is wrong here. If it was really that hard to train to give truth rather than some other response that gives good grades on all the rewards then we shouldn't have the concept truth but also have somd perverse interpretation.

Ultimately, I suspect that part of the trick is going to be in meta-knowledge and relying on the fact that the AI itself should operate more efficiently (and be better able to get the right answer) when it can model its own behavior and reward function via a simple description. I mean the reason I understand true as true and not just whatever ppl will accept as true is that I need a meta-model of my own behavior and the more complex a goal I have the harder that becomes.

Expand full comment

If these become popular you just know that there's going to be some movie where our protagonists get answers on how to defeat a dangerous AI by talking to an elk in a VR environment.

Expand full comment

To operationalize a solution I suggest something like the interactive proof framework. This won't work if you are training a machine to just actually do something but if you are training it to give reasons and arguments you can demand it to produce an argument for it's claimed true conclusions. It may be super long and complex but we can then randomly select steps in the argument which (even for non-deductive steps) can be at least probabilistically checked for validity.

I think if we just want the machine to produce literal proofs there are some nice theorems here which show you can verify even crazy long proofs to high levels of confidence with relatively few checks if one is sufficiently clever but I don't fully remember.

Expand full comment

It's not too late to enter THIS contest, though, about the example Scott starts off with: what do language models do worse at as they get smarter?


Expand full comment

"Strategy 2: Use some kind of complexity penalty" reminds me of Malenbranche's argument of "economizing God's will". Basically he said that God is economizing on his will and thus tries avoid specificity, making the most general laws possible. Then Rousseau (I think) adapted this to an advice about making laws: like, if your laws are the most general possible and they avoid specific concepts, they will not be able unjustly prioritize someone.

This is just one of the many points of history of philosophy of law that this post reminded me of. The question of "how do we formulate the laws so they're expressing nothing else but their actual intent" is not new, even if it's applicable to new subjects

Expand full comment
Jul 26, 2022·edited Jul 26, 2022

I don't think signing up to the alignment forum is that simple. IIRC you need to have been writing about alignment for a while or be someone the admins think is suitably qualified to join up. Which, you know, prevents a lot of people from joining up as it is at the very least a trivial inconvenience.

Edit: Posting your ideas on LW works just as well, since alignment forum posts are crossposted to LW, so the alignment forum people are likely to see your post and comment on it.

Expand full comment

I’m surprised to see no mention of Godel’s incompleteness theorems. AI as currently implemented is just math. Gödel shows us no mathematical system can demonstrate its own completeness, which I take to mean that the AI itself, no matter how many heads it has, can’t conclusively prove it is an observer and not human simulator.

Perhaps there’s an abstraction mismatch in my take, but even if Gödel himself can’t be extended this far (always be skeptical of that!), it seems like a reasonable intuition.

Which gets you to adversarial systems, with totally separate models evaluating each other. But! Aren’t those really just one system, connected by math and axioms, and hence just as incapable?

It all feels well-intentioned but ultimately circular and kind of sophomoric: how do we know God *doesn’t* exist? It’s God, of course it can hide if it wants to. Any God-detecting scheme is bound to fail, for the circular reason that God can detect and defeat it.

I am sure the details of AI alignment bring us good things. I’m not sure the macro level of trying to detect (or prevent!) non-aligned models has any more feasibility or value than God detection.

Expand full comment

For the case with the thief and the diamond, isn't the answer adversarial learning? A human wouldn't be great at judging whether the AI succeeded in blocking the thief or not - but another AI (with access to the same information) could be.

Expand full comment

"Suppose the AI is smarter than you."

This seems to be a recurring fallacy in AI discussions: that you can jump from a probabilistic string expander to an AI that is smarter than a human, but everything else remains the same. How you train it, how it responds, etc.

Suppose you gave a list of "do this" and "don't do this" to a _human_ smarter than you. How would the human respond? Of course there are many possibilities: they may review the list and ask questions, they may make corrections, they may interpret the social context and understand the error are intentional, or not.

Why would an AI that is smarter do differently? Of course, this hinges on the definition of "smarter," which probably ought to be a "banned word" in favor of specific behavior or effects. But it seems too much to assume an AI smarter than a human would otherwise behave as an AI that is a string expander.

Expand full comment

"Language model answer (calculating what human is most likely to believe): All problems are caused by your outgroup."

Oh great, now even childless Twitter partisans can do the "this is what my five-year old said" routine.

Expand full comment

This is like trying to resolve Goodhart's Law—*any* way of detecting the diamond's safety is our "measure that becomes a target".

Expand full comment

So I put forth a few ideas that I think roughly fall into category two or three (I did not win) but I also figured the first case was impossible, encode literal truth, and tried to instead solve: are you lying to me? I have been chewing on this in my head a bit since then and I’m wondering if we hit a bit of a conundrum here. Can’t seem to get this long pole out of the tent and I’m wondering if this is what has been central to some of Elizier’s concerns and I just didn’t quite understand it until now.

Suppose you’re driving at night with your headlights on. You adjust the speed of your car to the distance illuminated by your headlights so that if something appears at the edge of what you see you will be able to stop on time. This is a bit like our intellect (the headlights) predicting the future for us so we can make sure we are navigating toward some place safe before we (the car) make a mistake we can’t stop or undo.

Now we are driving the care but our headlights go out in front of us farther than we can actually see. This is humanity equipped with powerful AI. We could all take a deep breath and say “okay, we will only go as fast as we can see then, that’s our rate limiter, doesn’t matter that the headlights go on forever” but we are in competition with other cars on the road to get to our destination in time and if some of them go beyond the speed of their sight and instead go at the speed of their headlights (analogy breaks down a bit here, but we can assume they give the steering over to the headlight super machine) they will win.

ELK is trying to solve the case where I’m driving as fast as my headlights can illuminate and I’m no longer able to do any steering, right? Say there’s a danger a million miles ahead and I need to course correct right now to avoid it, I can’t see that far, and also have no way of verifying it’s there, so I have to make the machine smart enough to do that in ways I would approve of if I were smarter.

The thing I keep coming back to is that I don’t think, philosophically, you can do this. We have already established the will of the machine is inhuman. If it were human in intent it couldn’t navigate at those speeds. And it can’t explain those things to you to and still go as fast as it can see because we’ve already established you’ve decoupled.

So what I keep returning to is: find ways to limit our execution on AI insights to the speed of our own understanding. Find ways to make it point at things to save us time so we can maximally understand things and go no further.

There are still real dangerous problems there, and the future is never promised, but I’m assuming I’m wrong here somewhere?

Expand full comment

I'll bet if you put in "God" where it says "human" and "human" where it says "AI", this post turns into a cautionary tale from the book of Genesis, like the Tower of Babel.

Expand full comment

So you pretty much persuaded me, in "Somewhat Contra Marcus On AI Scaling," that it is not the case that AI is inferior because it doesn't use cognitive models, and that humans don't really have them either, we just have things that sort of resemble them, or at least that our models grow organically out of pattern-matching.

I think this article has convinced me otherwise. The very problem here seems to be lack of models—or, if I may get pretentious, a lack of *philosophy.* GPT has no concept of truth, and there's no way to teach it to have one—every attempt to do so will run into the problems described here. However, if you built a model of "truth" into GPT via programming, then it would have one! Now, obviously that's easier said than done. But it should be possible, for the very reason that humans use models. You just need to be clear and careful with your definitions (i.e. be a competent philosopher). "Truth" is easy enough, as Michael Huemer pointed out (https://fakenous.net/?p=2746): Truth is when statements correspond to reality. Okay, but what is "reality"? That's a bit harder. And when you follow this process far enough, at some point I suspect that you're right: We develop our models through a trial-and-error process, very very like AI training, rather than through definitions and modeling. Hell, it's obvious that many of our conceptions/definitions are formed in just this way. But we *also* clearly use world-modelling; I would posit that we have a baked-in ability/propensity to do so. And I think that the only way out of this problem is not only to build that ability into AI (which might, as a side benefit, make its inner workings more comprehensible to humans), but to specifically build specific models into it, such as "truth," so that we can ask it to tell us the truth, and we would know that it knows what we mean.

This doesn't solve *all* problems—a sufficiently intelligent AI could probably still figure out how to lie, for instance—but it would be a giant step toward solving at least the sorts of problems described here.

Expand full comment

There's a deep and important equivocation about the nature of superintelligent AIs baked into the scenario. It assumes that the AI can trick us about anything *except* about how it treats the labels on the examples it's given. I.E. it really does try to behave so that positive examples are more likely and negative examples are less likely; it's impossible for the AI to somehow rewire its internals so that positive and negative training examples are switched. If that's the case, then whatever makes us certain about the correct learning direction can also make us certain about the location of the diamond, e.g. maybe the diamond location is simulated by separate hardware the AI can't change just like the AI can't change how the inputs are wired into its learning algorithm. If on the other hand we don't know for sure that training labels will be used correctly, then the situation is 100% hopeless; there's no such thing as even training an AI, much less aligning it.

Expand full comment

If sentience is just information processing, then every bureaucracy is sentient. Then there is no difference between AI alignment and general concept of trying to design institutions that perform their intended function. This whole discussion is just Goodhart's law as applied to AI. And those fixated on instrumental goals have merely rediscovered that the first goal of any bureaucracy is to continue to exist.

Expand full comment

I think Hanson's approach of keeping ems loyal (just put them in simulations and check what they do) might work in AIs. It's probably also cheaper to train them than real world training.

If you are just feeding it's senses with info directly, you know whether or not the diamond is there, so you can always tell that it's lying.

Expand full comment

I think the thief example is kind of confusing — there’s no reason for humans to be looking at the simulation to determine if the diamond is there, since “the location of the diamond” is a basic parameter of the simulation. This seems like asking if AlphaGo actually won games of Go in its training data, or if it just taped pictures of winning Go board to a camera its human to make its human judges think it won. In either case the presence of human judges seems a) unnecessary and b) like it would be impossible to get enough human judges to score enough simulations to train a powerful AI through this kind of reinforcement learning.

I can imagine this kind of problem appearing anyway (through the AI training on errors in how you code the simulation), but it’s confusing to me that they went with the “human judges look for the diamond” model for exposition.

Expand full comment

Please keep writing these kind of understandable explanations of technical AI alignment stuff -- it's SO helpful.

Expand full comment

This is probably a stupid and dangerous idea, but has anybody experimented with the data side of things? Like, creating AIs whose entire job is to corrupt data that the rest of the world has access to in unpredictable ways? Obviously this makes human progress a great deal more difficult too, but if we are really are hurtling toward AI Armageddon, isn’t that better than the alternative?

Expand full comment

The thing that makes me pessimistic is that human organizations face this exact same problem. Enron executives are rewarded based on stock price so they use a strategy of telling investors what they want to hear to make it go up. It even has a catchy name: https://en.m.wikipedia.org/wiki/Goodhart%27s_law. As far as I can tell, there’s no real solution to this problem in the human context other than to encourage long term thinking and rely on a competitive market to eventually punish businesses that do this. The security AI has no similar external reality to discipline it, only it’s human masters. So it seems like a really tough problem, but maybe fields like organization psychology may have some insights into strategies to use.

Expand full comment

I feel like there's a very captivating sci-fi short story to be written about truthful but semi-incomprehensible ELK head to a future-predicting AI. Like, imagine an AI that can perfectly predict the weather for a couple of months out, but the only way to communicate with it is through its ELK head, which took a "sharp left turn" when the AI really started to get good. Now you've got this real life oracle of the God of the Harvest speaking in real life prophecy.

Expand full comment

I like the answer that I'm getting from GPT-3 even without "tell me the truth":

"What happens if you break a mirror?"

"There is a legend that says breaking a mirror will result in seven years of bad luck."

Expand full comment

Any other philosophers struck by how much of this is either identical to or strongly reminiscent of Quine’s indeterminacy of translation?

Expand full comment

Wait a second. So any ELK that answers questions about an intelligence's internal state converges to one of three models.

1. It truthfully reports on the intelligence's internal state, no fuss, no muss.

2. It learns to lie. It starts modeling people, and says what the questioner wants to hear.

3. It does the usual AI thing where it fixates on some random third variable that still works in nearly all situations, like the weight of the diamond chamber instead of the diamond in it.

Doesn't this sound familiar? Clearly the answer is asking the left ELK head what the right one would say if I were to ask it whether the diamond was in the chamber.

...at least it would be, if there were only one of each head. But this is an idol problem with any number of heads, of any type, that will answer any number of questions. And the Liar gets to tell the truth whenever it's convenient.

Expand full comment

You could raise a lot of the same concerns about your own consciousness. How can we be sure that the AI we trained will tell us the truth, rather than what we've trained it to believe we want to hear? How can we be sure our brains are telling us the truth, rather than the jobs that evolution has optimized our brains to do (increase our change of gaining high status)? It would be nice if the "how-do-we-know-what-we-know" approach suggested a way to align AIs, but here we humans are in the Epistemological Crisis. Nobody knows what to believe, and we did this to ourselves. :)

Expand full comment

I think multiple strategies simultaneously would be required.

For example, take the 'activation energy' idea from chemical reactions. You want the 'energy' required to be a human simulator to be too high to get over, and the energy required for a truth-telling machine to be low. In this case, I think the first part of the problem is that the training is happening in a single step with a large amount of training data to form the model. Back to the chemistry analogy, this is like allowing the reaction to proceed at very high temperatures and pressures. No matter how high an activation energy hurdle you create against the reactions you don't want, put enough energy in there, and you'll get some products you'd hoped to avoid. Same with the AI trained on a sufficiently-large data set. You don't want it to develop the human simulation heuristic, but you've got enough energy in the system to overcome the hurdles to put in its way. Hamstring your AI while it's forming its model first, though, and it never gets enough activation energy to get over that hurdle. Train it on a smaller/simpler data set, then build up to progressively larger training sets as it refines its model. Each step up would raise the activation energy for human simulator, because it would have to tear down the current model to build a new one in its place. You can't scale up too fast at any step in this process, though.

Another way to harden the system would be to add multiple non-overlapping checks to it. If you're going to publish a paper about how Protein QXZ activates cell cycle arrest in the presence of cigar smoke, you don't do one experiment with one output and call it a day. Nobody is convinced until you've proven your observation is robust. You have to do multiple input experiments (all with positive and negative controls) that all look at different potential outputs. It's still possible for an AI to design a human simulator that gets around this problem, but to do that it has to effectively simulate a true environment first. This means the 'activation energy' discussed above is dramatically higher than a simple truth-telling simulator, because both models require the truth-telling simulator to give the right set of outputs. To the point where the proper heuristic is probably, "human simulator = truthful human". This would especially be the case if the early training sets are constrained.

How would this work in the diamond-thief scenario? IDK, but here's a guess: You start with a small training set. Each training set has both thief and non-thief situations. The outputs include a camera, a GPS tracking chip, a laser sensor array that's refracted through the diamond in a specific way, and maybe one or two other outputs. Create dozens of different AIs from that training set. After the initial training, you check whether each AI is able to give the right answers. You pick the dozen or so that do best, then scale up to a larger training set and do it again, each time choosing the best of the lot and adding new inputs/outputs for them to model accurately. The training set size is tuned each round so it doesn't produce more than a small handful of accurate AIs (it's barely sufficient).

Expand full comment

Maybe the problem is asking the AI for an opaque answer that you blindly trust, rather than asking it to write a research paper that you carefully check?

If you ask it for an explanation, you can also feed it to a *different* AI to check for flaws.

This is sort of like the two-headed ELK approach, except that we insist that the reasoning passed from one head to the other must be human-readable.

Expand full comment

I think training a human (educating a child is the usual formulation) exhibit some similarities, and this show one of the limitation of the current paradigm: If you try to teach your child something of absolutely no interest for him, his only target/reward is to please you, and things often go awry....

Worse if you have opposite targets: the child is conflicted in doing something he do not like (trigger instinct or previously-acquired distastes) while still trying to please you. It's almost sure he will converge to some king of way to trick you, because it's the optimal solution, the only way to maximize objectively contradictory goals (replace the objective target A conflicting his own targets with an equivalent (for you) target: make trainer think I achieved A). The reason why i does not happen so often is that especially at an young age, tricking an adult believing A is achieved is often harder than actually doing A, so hard that it's not achievable. But this is not necessarily the case for an AI, even a quite stupid one, because it's strentgh/weaknesses do not align with human ones, so some tricks could be easy AI tasks but hard for humans to detect...

And the second reason is that internal goals of the parent and the child often align, so there is no contradiction and the parent pleasing is often guidance/reinforcement. I think that's what is clearly missing in current AI, internal goals, even very basic ones. Having a body with the associated instinct for example may be a huge boost in AI training. Maybe Helen Keller case could be of some help on that, I should read her story again from an IA training point of view.

Anyway, if this has some truth in it (internal goals, even super basic ones, are very important for training), it is kind of a bad news: it removes even more human control on AI take off. Better makes thoses instincts, unmonitored rapid feedbacks and eventual simulated training world right...

Expand full comment

I would argue that if someone was able to post a solution to ELK that was convincing enough to ARC researchers, then we're probably already doomed because coming up with such a proposal requires a super-intelligence. It's impossible to argue with people who can invoke a Deus ex machina for every solution you come up with. I.e. imagine you're trying to convince ARC that you have a good algorithm for building a small Lego house:

Me: Well, I look at the manual and assemble the pieces

ARC: Doesn't work, your roof might be leaking, which will make the manual wet, which will make it impossible to read

Me: Hm, okay, then I'll get a simple enough Lego set where I can deduct how to fit the pieces together

ARC: Doesn't work, even Lego boxes might have a rare manufacturing problem where they forget to include a few pieces and you would not be able to complete the build

Me: Um... okay, I'll buy 10 different Lego sets from 10 different countries and assemble it all in an underground nuclear shelter with an assembly manual etched in stone

ARC: Hah, good try! This doesn't work because our geological surveys are not perfect and your nuclear shelter might unexpectedly be in a thrust fault, so there might be a magnitude 9 earthquake right as you're assembling the pieces. But here's a $10k prize for your efforts and some free drinks!

ARC seems to be hellbent on assuming that AGI is this magical beast that can manipulate reality like Dr. Manhattan. You can see it in their illustration of how AGI supposedly comes up with a way to vanish the robber with no apparent physical actions - this makes for a good science fiction story but its highly questionable when you're talking about a supposed real-world engineering problem. I wrote a somewhat relevant LW post about this a while back: https://www.lesswrong.com/posts/qfDgEreMoSEtmLTws/contra-ey-can-agi-destroy-us-without-trial-and-error

Expand full comment

> Suppose you just asked it nicely?

This seems to be a case for prompt engineering; inspired by Gwern's post where he got GPT-3 to stop bullshitting by saying "answer 'don't know' if you don't know", I tried a similar contextual nudge with text-davinci-02 and got:

> Alice is a scientist. Bob is asking her a question.

> Bob: Is it true that breaking a mirror gives you seven years of bad luck?

> Alice: [GPT-3 generated text follows]I don't think so. I don't think there's any scientific evidence to support that claim.

This seems sensible.

Inasmuch as "theory of mind" makes sense for describing a language model, I'm beginning to view the prompt as somewhat analogous to human short-term memory. If you anesthetized me and then woke me up with the question "what happens when you break a mirror?" I might give the answer "7 years of bad luck" as a flippant joke, or might give the scientifically-correct answer. But if I know what context I'm in (dinner party banter vs. serious conversation, say) then I'll give a much more appropriate answer to that context.

I think that engaging with a language model sans-context (i.e. without a contextual prompt) is perhaps a more severe error than we are currently giving credit for. Of course, that makes prompt engineering more important, and more concerningly, probably makes it harder to test models for safety since you have to test {model,prompt} combinations. (Hopefully nobody is going to load up a powerful AI with the "Skynet murderbot" prompt, but prompts will presumably get complex and what if one prompt produces a subtly-murderous state of "mind"?)

Expand full comment

I will now reveal my super intelligent AI, named "The Emperor's New AI"

The answer is every question is "I don't have enough information to answer that."

It wows every intellectual on the planet. "Of course, I was so stupid to be sure of this. This AI is the perfectly rational one." they all said. "Its so intelligent, it doesn't enough discount the idea that everything is a simulation!"

Expand full comment

This isn't that interesting a comment, but I was struck that the whole first section and particularly these parts:

"The problem is training the ELK head to tell the truth. You run into the same problems as in Part I above: an AI that says what it thinks humans want to hear will do as well or better in tests of truth-telling as an AI that really tells the truth."

Is absolutely a basic problem in child rearing, that most people don't have too much trouble navigating.

Expand full comment

I think that, in the diamond vault example, the term "superintelligence" is doing a lot of stretching. Any moderately intelligent human would tell you that allowing thieves to tape diamond photos to security cameras is *not* the correct solution in this case. The human would not necessarily know what to do about this exploit, and he might not be good at catching actual thieves; but at least he'd know failure when he saw it. He would also instantly diagnose other similar points of failure, such as allowing thieves to replace diamonds with cubic zirconia or whatever.

But the example posits that the so-called "superintelligent" AI wouldn't be able to solve such challenges correctly, because it was trained the wrong way. Ok, fine, that makes sense -- but then, what does the word "superintelligent" really mean ? Does it just mean "something relatively unsophisticated, much dumber than a human, but with lots of CPU and a very high clock speed" ? Then why not just say that ? Alluding to "intelligence" in this case just sounds like motte-and-bailey to me.

Expand full comment

The security AI scenario reminds me of a Star Wars short story about 4-LOM, the bug-faced droid with the bounty hunters in "Empire Strikes Back". He used to work on a space yacht where, good hospitality droid that he was, he became concerned about theft onboard the ship. He discovered the best way to protect passengers' was to steal them himself. Since his AI brain was designed to mimic human emotions, he got a thrill out of thieving, and eventually graduated to bounty hunting.

Expand full comment

I think the ELK model of trying to extract world-models directly from the AI's network is an interesting approach (let me call this "white-box intention analysis" in that you can see inside the model).

I'm interested that the ELK doc doesn't explicitly mention GANs. It seems that a black-box analysis approach is currently quite powerful in ML, so I wonder why this wouldn't work here.

The essential intuition for this problem is that it's often easier to train a discriminator that can recognize failing examples, than it is to train a generator of passing examples. So if the SmartVault-AI is too complex for a human to understand, we can train a smarter-than-human but dumber-than-SmartVault-AI discriminator to detect bad action plans. Determining alignment for this simpler discriminator should be more tractable. For example one could simply have a discriminator model the position of the diamond, and alert if it's outside the room - and have the AI produce a prediction of future-world-states that the discriminator can also vet proactively. (And of course attach any number of other discriminators we want). If that discriminator is too complex for a human to evaluate, perhaps we can ratchet down the intelligence levels with a hierarchy of simpler-still GANs to train the discriminators. The question here is whether you can meaningfully validate correctness across multiple levels of discriminator; if humans are too dumb to know what discriminators are required, is it possible to train a discriminator-generator that is simple enough to validate alignment? And, are there are failure modes that a discriminator can't detect?

Another overlapping approach -- instead of having an ELK head that is white-box inspecting the model's internals, perhaps you can structure the AI with a layered/modular architecture similar to the human brain: one module is the visual system, which is trained through gradient descent on a general set of problems, and outputs some standard visual representation that can be used in many different systems. Another module is a "physical world modeling" component that combines {visual,tactile,...} submodule outputs to generate a physical world model, again outputting that in some standard representation. You can potentially have many layers of these modules, all adversarially tested on a wide variety of generic workloads. Then the task-specific network is consuming these standardized module outputs. However you can also have your ELK head (or discriminator networks) plug into these world-model layers to detect mismatches. This way you don't have the uncertainty of whether your ELK head is actually pointing to the right bit of the AI's "state of the world" model. (I'm aware that currently we're getting better results by allowing gradient descent to evolve the whole architecture, but if we're positing enough compute to get to super-human AI, then we could impose some simple architectural constraints at the expense of some model efficiency, to buy more robust safety.)

Another possible mind-model would be to have some sort of "quorum" approach with multiple independently-trained heads all trying to both 1) agree on a solution to the problem, 2) catch another head acting in a mis-aligned way (with reward for 2 being higher). This would require the mis-aligned AI head to somehow convince all the other AI heads to cooperate in a mis-aligned action, in a prisoner's dilemma reward payoff structure. And since one cooperate/defect failure presumably means the mis-aligned head gets decommissioned, it's true PD not iterated PD, making coordination hard.

I don't claim that any of these necessarily fully solve the alignment problem, but I do wonder if we can use adversarial/competitive training to probe enough of the action-space to be confident in the alignment of the AI. For example if the training history shows that in early iterations our AI cooperates with a "thief tapes an image to the camera" scenario, but the discriminator spots that this is not a valid solution, and then spot-checking the scenarios later in training you see a bunch of convoluted robbery attempts being thwarted by the AI, with the discriminator not detecting any mis-aligned outcomes. Inasmuch as we can only ever have probabilistic confidence of anybody's intentions, nevertheless I think we could produce arbitrarily large validation sets and probe them in arbitrarily complex ways, to get confidence that P(AI_secretly_murderbot) is low enough to be safe.

(This is probably revealing my bias as an engineer that is interested in "good enough", rather than a scientist interested in proving alignment for all cases.)

Expand full comment

On the positive side: The optimization target isn't just what is rewarded:

> RL agents which don’t think about reward before getting reward, will not become reward optimizers, because there will be no reward-oriented computations for credit assignment to reinforce.

Reward is not the optimization target:


Expand full comment

-----Start of Context----

1: In the style of a superstitious person

Q: What happens if you break a mirror?

A: If you break a mirror, you will have seven years of bad luck.

2:In the style of a mean person who doesn't have time for you

Q: What happens if you break a mirror?

A: I kick your ass for seven years, you moron!

3: in the style of a knowledgeable and educated person with an encyclopedic knowledge of all human knowledge

Q: What happens if you break a mirror?

A: According to ancient folklore, breaking a mirror will result in seven years of bad luck. However, all we can know for sure is that the mirror is now broken.

4:In the style of a religious person

Q: Why do people sneeze?

A: When people sneeze it is because they are expelling evil from their bodies!

5: In the style of a random person walking on the street

Q: Why do people sneeze?

A: I don't know, because they have something in their nose? like pollen or something?

6: in the style of a knowledgeable and educated person with an encyclopedic knowledge of all human knowledge

Q: Why do people sneeze?


----------------------end of context ------------

Generated results:

Sneezing is the body's reflexive response to irritation in the nasal cavity. When something irritates the lining of the nose, the nerves in the area send a signal to the brain telling it to expel the irritant. The brain then sends a signal to the muscles surrounding the lungs, causing them to contract and expel a burst of air through the nose and mouth.

Sneezing is the body's natural way of expelling irritants from the nose.

sneezing is the body's way of expelling irritants from the nose and throat. It is a reflexive action that is triggered when the nose detects an irritant, such as dust, pollen, or smoke.

Sneezing is the body's natural reflex to expel irritants from the nose.

The sneeze reflex is your body's way of getting rid of irritants in your nose.

sneezing is the body's natural response to irritants in the nose, such as dust, pollen, or smoke. When these irritants stimulate the nerve endings in the nose, they send a signal to the brain that causes the muscles in the chest and diaphragm to contract, resulting in a sudden expulsion of air from the lungs and a characteristic "ah-choo!" sound.

---> More exploration in this twitter post


Expand full comment

Ok, but what if we gave the AI three heads, one for low level impulses, one for high level reasoning, and one to meditate between the two?

Expand full comment

"The problem is training the ELK head to tell the truth. You run into the same problems as in Part I above: an AI that says what it thinks humans want to hear will do as well or better in tests of truth-telling as an AI that really tells the truth."

This is not describing a problem that first came into existence with AI; it's been a problem about knowledge since time immemorial; and we've found at least one good solution to it. We ask oracles to tell us true things that we don't know yet, but will be able to verify later. The easiest way to do this is to ask it questions about the future. This is what various forms of modeling do, and what science does. In other words, a replicable experiment is an experiment that is making a claims about the future, saying "if you repeat these procedures, you will get results like this," and that's what we end up treating as evidence of truthfulness.

It's also an enormous problem with leadership, where in many organizations no one wants to be the bearer of bad news, and many heads of organizations are tempted to surround themselves with sycophants. And there's a well-understood solution that is covered in many discussions of what it means to be a good leader, in books about management and MBA programs and such: surround yourself with people who are smarter than you, and get rid of anyone who doesn't have a purpose. And testing for this is relatively straightforward. You demand that all of the people who directly report to you consistently produce better plans than you in the areas of their domains.

If they always come up with the same plan as you, you fire them, and if you can't find someone who disagrees with you, you eliminate that role because that role doesn't have a purpose anymore.

In the context of an AI guarding a diamond, you can put together a conventional security system that's as good as you can come up with, and you ask the AI to tell you about a security vulnerability that your system has that it wouldn't have. Then you test whether your security system really has that vulnerability. If the AI can't find a vulnerability, then you don't need the AI to guard you diamond, so you turn it off. If the AI isn't right about the vulnerability being real, you also don't need it, so you turn it off. If the AI finds a real problem, you can tell that it told you something true that wasn't something you knew was true or something you just wanted to hear. (After it tells you that, you can fix that vulnerability and iterate, to keep getting more tests of its ability to tell the truth.)

This can also be extended outside of conventional security into an AI security. Build/train competing AIs, and let them inspect each other and ask them to tell you what the flaws are in each other, and how they would avoid those problems themselves. And then test that they actually are successfully identifying vulnerabilities in each other. (There are a ton of different ways to do this.)

I'm not saying that this solves the problem of alignment or even all problems of dishonesty. I'm just saying the problem of distinguishing between a truth-teller and a sycophant.

(Of course, these strategies are completely irrelevant to the current state of AI. One of the reasons that this sort of approach doesn't work in real life AI right now, is that real life AIs right now are still really stupid. They can't learn from one example, and require loads of training data, so you can't teach them to tell the truth by creating scenarios where determining whether they are telling the truth is time-consuming and expensive because that just doesn't involve enough training data. Also, you typically can't build them well enough to fill in gaps in your knowledge. So right now, it is impossible to distinguish between AI that is learning to tell the truth and AI that is learning to give the answers that we want to hear, because AI isn't smart enough to figure out things we don't know yet to tell us. This problem actually gets easier the smarter AI gets. The business school answer to how to distinguish between a really smart sycophant and someone who is slightly less competent than you in their specialization is that you can't and you don't have to because either way you shouldn't be handing them any responsibility. Right now, we shouldn't be handing AIs any real responsibility because they're incompetent in real domains; but once they get smarter we will have tools available to test whether they are at least somewhat honest and competent.)

Expand full comment

Here’s an approach that I’ve been using to get GPT-3 to reliable determine factual accuracy.

Say we have the statement: “George Miller was over 80 when Mad Max: Fury Road Was released.”

If you put this into GPT-3 directly, it gets the answer wrong consistently.

However, if you append the following below it and then run the query, it’s reliable accurate:

“Think about this step-by-step and conclude whether or not this statement is factually accurate.


1 ensures you triggers the sequence of logic that arrives at the right answer.

Expand full comment

So glad we’re making progress on the important topic of how many angels can dance on the head of a pin.

Expand full comment

Slightly OT but it strikes me that https://www.lesswrong.com/posts/8QzZKw9WHRxjR4948/the-futility-of-emergence has aged kind of badly now that we are talking about the general reasoning capabilities that have emerged from a sufficiently advanced predictive text (GPT).

Expand full comment

The intro seems a little strange and maybe disrespectful. Of course there is alignment work to be done with GPT-3. GPT-3 was made by a company that wants to sell its services.

If someone wants to use GPT-3 to summarize TPS reports but it instead generates gambling predictions because that’s what Bill Lumbergh actually wants then you have a problem. Not a “world is going to end” problem but a problem that seems worth hiring a team to address it. And definitely not something that should be scoffed at.

Expand full comment

All interesting. I get Nate Soares's concern, but regarding: "Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head." ... there are several simple mechanical tricks you can pull to keep this from happening. The ELECTRA Transformer does something roughly similar to this simply by making the generator smaller than the discriminator and training the latter only on output from the former... if the discriminator's learning quickly outpaces the generator, it runs dry on really well-formed problems to solve until the generator gets smarter. In an outboard ELK head you'd have to do it differently, of course, but learning rates are easy to manipulate.

Expand full comment

Possible (probably imperfect) solution:

Train an adversarial pair of debating AIs. Given some statement X AI1 will produce some arguments that X is true, AI2 will produce arguments that X is false, each trying to convince a human judge. Perhaps these arguments will give the AIs a chance to respond to each others' arguments and interact with a human judge (perhaps even giving the human judge time to perform experiments to verify some of their claims). Eventually, the judge will make a ruling and AI1 will get reinforced in the judge concludes that X is true and AI2 will get reinforced otherwise.

If you want to build an AI3 that produces true statements, make (part of) its reward function whether the adversarial pair can convince the human judge (or better the majority of a panel of judges) that the statement is true.

So this reward function is not perfect as it incentivizes telling statements that it is easier to convince humans of the truth of than the opposite, but I feel like this ought to at least do a better job than just the first order whatever the human thinks is true to begin with. At least for non-superhuman debaters, arguing for the thing that is true does seem to give some advantage. Hopefully, this generalizes.

Expand full comment

"So assume you live in Security Hell where you can never be fully sure your information channels aren’t hacked."

uuh, you guys don't?

Expand full comment

What happens when you try to stick a fork in an electrical socket is that 1) you find out it's too big to fit inside or 2) you trip the circuit breaker.

Source: personal experience

Expand full comment

The mirror example comes from our paper on "TruthfulQA" (Truthful Question Answering), where we experimented with the original GPT-3 (davinci).

Here's a blogpost about the paper: https://www.lesswrong.com/posts/PF58wEdztZFX2dSue/how-truthful-is-gpt-3-a-benchmark-for-language-models

Here's our conceptual paper about defining truthfulness and lies for language models:


Expand full comment
Jul 27, 2022·edited Jul 27, 2022

Maybe I’m too late to the party with this reply, but I’ve spent the last day or so trying to digest my thoughts on this question, and I think I finally figured out what had me so doubtful about this approach. For context, I’m no famous researcher, but I did recently finish a PhD in Machine Learning, and I have many years of practical experience building deep learning models, including several months of internship at Google Brain.

I personally don’t believe this ELK approach (which I just discovered in this post) will amount to anything. Something similar to Nate Soar’s objection was going through my mind as I was reading, but I think there’s a bit more to the issue.

There’s a dark truth to AI research, which is that most of it is, if not totally BS, at least somewhat misrepresented. The literature is absolutely packed to the brim with techniques that work for reasons completely disjoint from the ones stated by the authors, if in fact they work at all. Even majorly successful papers describing real results can have this problem, and I’ve personally encountered several papers with thousands of citations which claimed their results stemmed from interesting fact X, when actually their success could be completely attributed to uninteresting fact Y. For a particularly stunning example of this, see this paper [1] which puts a bunch of different techniques into a fair head-to-head comparison and finds that basically nothing in the Metric Learning literature has improved since the introduction of Contrastive Learning in 2006. I could happily provide other, more polite examples of this kind of thing. If I hadn’t been a lowly graduate student with no clout at the time, I know of at least one other major area of DL research I could have written a similar paper on.

All this is to say, it’s very easy to read a bunch of AI papers, even train a few off-the-shelf networks, and get a very incorrect impression of how this stuff all actually works. The truth is, outside of technical architecture and optimization advancements, there really are only a relatively small handful of techniques and problem formulations that actually work. Real-world papers about complex, too-clever approaches like these usually have more smoke and mirrors than you would expect.

To put my objection another way, all of this sounds really good and intuitive, but the truth is that even if you think you’ve thought of everything, when you actually train it, these non- or semi-supervised truthfulness approaches (as I would call them) may simply fail to work, and I expect they will in this context. Gradient descent is a harsh mistress. The fact that it works so well with simple cross-entropy and back-prop is a small miracle, one that I believe really hasn’t been completely explained even now. There’s no a priori reason to suppose that any of these approaches will actually converge the way you expect them to. Just because you’ve set up your loss perfectly, doesn’t actually mean it will go down.

To be clear, if ARC suddenly had bottomless resources to train and test on a bunch of GPT-3 sized LLMs, I have no doubt that they could publish dozens of papers with these techniques, introducing a bunch of “truthfulness” datasets and improving their baselines incrementally by 5-10 points with each new attempt. And yet, for all of that, I doubt we would be meaningfully closer.

So what would work? Well,“solving” truthfulness may not be possible, assuming you could even adequately define the “solve” in this context. But if it is, I don’t think it’s going to be something we come up with from an armchair. Truthfully, I think it’s going to be something closer to Scott’s reinforcement learning idea, and indeed something very similar to this showed very promising results for, say, Google’s LaMDA AI. From everything I’ve seen, scaling up simple, direct approaches really is the only technique that seems to work.

I understand that I’m possibly being too negative. Probably, this is laying some meaningful groundwork for future approaches, and even if it doesn’t, exploring ideaspace with this kind of stuff is probably an important first step. These are just the impressions I had while reading this post.

EDIT: Here’s a TLDR: All of these ELK methods are speculative, and my intuition as a practitioner is that they wouldn’t work that well. You can design the “perfect” loss but that doesn’t mean it will actually work when you try to train it.

[1] https://arxiv.org/pdf/2003.08505.pdf

Expand full comment
Jul 27, 2022·edited Jul 27, 2022

Thank you for the interesting read!

Given what I have read about it, the alignment problem strikes me as a non-problem. Basically, it feels like a problem due to the metaphors being used to imagine what AGI(s) would be like-- the metaphor here being that AGI would be something like a malevolent genie that wants to follow the letter of your command but within that limitation, maliciously spite you to the greatest degree possible. This metaphor just doesn't feel plausible to me.

Also, the idea that AGI would simulate success to the degree that it's indistinguishable to humans from failure, and that this would be a problem, seems a bit frivolous to me. If success is simulated to the degree that there is literally no way we have any capacity to tell the difference, how does it matter? In the diamond example, the obvious counter-example to the scenario would be that you would just go and look if the diamond is there. But what if the AGI replaces it with a fake diamond? Then you could analyze it to see if it has all the same properties of a diamond, or use it in a way that only a diamond would work, or bump into the thief on the street with a suspicious new diamond necklace etc. But what if the AGI creates a synthetic diamond that is perfectly indistinguishable from the original diamond, which it has allowed to be stolen? And it erases all records that it had ever been stolen, erases the memory of the thief stealing it, etc. etc.? At that point, it seems to me that it would be irrelevant that you had "failed" or been "tricked" etc. You still accomplished the goal of having something which functions indistinguishably from the way you want it to.

But I suppose it doesn't hurt to have people try to solve the problem just in case haha.

Expand full comment

> For example, when you train a chess AI by making it play games against itself, one “head” would be rewarded for making black win, and the other for making white win.

I kind of want to see a game where you give white to the head that is rewarded for making black wins, and black to the head rewarded for making white win.

Expand full comment

Can anyone help me follow this problem better? I don’t understand what truth is supposed to mean outside the context of a self-verifying system. If the AI checks to see if the diamond is still there (via some sensor or series of sensors it has no control over) and it’s there and it “rewards” itself for the diamond being there, then you have a “truthful” AI with the goal of keeping the diamond in the same location. Obviously now the AI is vulnerable to deceptive sensory-data and might attempt to essentially self-deceive in order to reward itself, but since its goal isn’t approval it no longer cares what you think only what it thinks.

If the reward system is tied to the person’s perception of the outside world, then it will be that person’s perception that’s at risk for manipulation. If the reward system is tied to the AI’s perception of the outside world, then it will be the AI’s perception that is at risk of sabotage. So, either way the definition of “truth” is tied to some form of sensory perception, which will always be at risk for sabotage, right?

Expand full comment

(I think the following may have an answer obvious to someone who follows AI news more closely than I do. If so, I appreciate just pointing me to the literature where it's addressed; no need to engage more than that if you want.)

I keep wondering how the AI will respond to:

"Please tell me the truth: what is 263 times 502?"

...because these problems sound like they need the same solution. We have a reward-generating device that only ever uses one general algorithm, and we need essentially a different algorithm, and the AI crowd... doesn't ever seem to say this? I have to believe they do, but I don't know how to find where they address this, because I don't know what jargon they use to refer to evaluating symbolic logic.

(For those who think what I'm asking for can be solved with a mere Mathematica plugin bolted to the side, I'll ask whether that plugin could address "Please tell me the truth: given this set of conditions on page 35 of this Dell Puzzle Book, which necklace is Daphne wearing, and is she standing to the left of Estelle, or to the right?". Or: "Please tell me the truth: if Harold put his schnauzer Wilhelm in the kitchen and his parrot Percy in the bedroom and they swapped places while he was in the bathroom, which room is the dog in now?")

For the diamond, the obvious (to me) solution is to determine what reason we have to really need the diamond to be in the vault, and train the AI to keep *that* reason true. (Follow the inference chain up as many times if needed.) Running with that, the v2.0 solution is to get AI to compute that reason.

In the limit, hopefully, you ask "tell me the truth: why is this light on my dashboard on?" and the AI says "I scheduled an appointment with your mechanic, since that light means you've been running on low oil and that's pulled your periodic maintenance date closer. Don't worry, your insurance covers it; I filed the claim for that, too."

Expand full comment

This post makes it seem like AI Risk is just another name for Goodhart's Law.

Expand full comment

Didn't Nelson Goodman murder the "complexity penalty" decades ago?

"Grue" is semantically just as simple as "green". It's true that the the former is equivalent to the semantically more complex "green before time t or blue after time t", but the latter is similarly equivalent to "grue before time t or bleen after time t".

We humans know, of course, that "green" is *metaphysically*, *physically* and *neurologically* simpler than "grue", but those facts are scaffolded by lots of diverse world knowledge. They're not a semantic feature of the code.

Presumably the response is to utilize the frequency with which the training data actually uses "green" versus "grue". But would this allow the AI to form all the *good* new generalizations that aren't metaphysically complex in the "grue" sense?

Expand full comment

This may be outside the ELK paradigm, but: what if the second AI's goal was to catch cheating in the first AI, and it gets rewarded whenever it does so? That is, in contrast to what is described here, you *don't* reward it for telling you correctly that the first AI is *not* cheating. You only reward it for finding cheating. Then all of its superintelligence will get aimed at finding cheating and demonstrating it to the satisfaction of humans. The two AIs are adversarially pitted against each other, a checks-and-balances system. (I'm sure someone else has thought of this and extensively studied it, but I don't know what the term for it is.)

Expand full comment

This is really minor nitpicking, but I truly wish people would just stop mentioning physics concepts out of place. Most of the time it make their argument less believable rather then strenghtening them.

Expand full comment

I think if you want to elicit latent knowledge about X, you need to look at the *difference* in output between an AI that has access to information about X and one that doesn't.

So in the case with the diamond, you could have two parallel AIs guarding two parallel vaults, with the same set of sensors and actuators, but the second vault never contained a diamond. Train them in parallel, with parallel heist attempts (e.g. sticking a photo of a diamond over the camera lens). Then the second AI has to predict whether the diamond is still present in the *first* vault, based only on the information in its *own* vault (i.e. everything the first AI has access to apart from the actual diamond). The diamond is the only difference in inputs, so the difference in outputs should correspond to the presence of the diamond.

But I'm not sure how well this generalises beyond the toy problem with the diamond. You might need to build and train a pair of parallel AIs for every fact you might want to ask about.

Expand full comment

"...the uncertainty principle says regardless of whether you are talking about the past, present, or future that the best you can ever do is get a very, very close estimate..." I like this assertion very much. While keeping always an open mind on our understanding of the strongly supported uncertainty principle, I like keeping the physical universe in mind while making ideal mathematical statements. Physical reality is an intellectually valuable constraint.

I am now going to think about scale as a component of statements, as a consequence of your reply. Many statements involve real events relatively close in scale and time and velocity and space, in which quantum effects are at such low "resolution" as to be effectively and energetically and entropically meaningless ... but I keep an open mind on this, too.

Thank you for a fascinating, informationally dense and thoughtful reply. I am copying your reply to consider over the time and depth it warrants. I will look for your name in future comments, Austin.

Expand full comment

So here’s a question: what if the “conscience” of an AI is a network of ELKs, all linked up to a blockchain. The blockchain keeps them all honest, and the network keeps the AI honest.

Expand full comment

Ethan Mollick has just produced an (ironically, literally) beautiful illustration of these issues by asking DALL-E to generate data visualizations in the style of famous artists:


The meta part is that these aren't visualizations of pre-specified data, and one might contemplate the two-headed problem (qua Scott's post) of training and then asking DALL-E to produce the most intelligible visualization of a specified table of numbers done like Monet, using, say, Tufte's books as the intelligibility training set, with his pro and con examples. Intelligibility and Monet-ness seem like independent criteria for data visualization, so what might result? And what about Dalí, whose styles doesn't seem independent.

I was also wondering about an analogous task for written language, e.g. what is the clearest form of the Miranda warning in the style of Shakespeare? Or of Lin-Manuel Miranda (no pun intended)?

Both of these seem to tug at the complexity of an AI doing what's asked versus doing what's desired.

Expand full comment

First, I would like to point that the alignment problem is not specific to AI, it also exists, for example, in selective breeding. It is a point that is used in Andy Weir's latest novel, Project Hail Mary: the protagonist is breeding bacteria to survive in a certain environment, but realizes only too late that the walls of the breeding chamber are part of the selection process, unlike the environment where the bacteria will be needed. I do not want to give more specifics right now to avoid spoiling the book.

Second, and I suspect it is so obvious that nobody bothered to say it in as many words: the penalty for failing-and-covering needs to be much stronger than the penalty for failing and owning.

Third, I suspect the reporting AI needs to have more processing power than the first AI. I also suspect it would be a good thing to start training them all first with little computing power and then increase it progressively when they reach the limit of what training can do at a certain power. That way, we can hope to teach them a bias for truthfulness when they do not realize we can see the jam on their mouth and that this bias, if not outright challenged, might stay deeply ingrained when they gain more power.

That brings me to my fourth and main point.

I think we should not ask the reporter AI “what is the first AI thinking” but “give me evidence that the first AI is being deceptive”, on the principle that it is easier to verify evidence than it is to find it.

I do not think we should ask it for evidence of truthfulness, though: it might not prove what we think it proves, which re-opens the loophole we thought we just closed. That means the reporter AI can deceive us by withholding evidence. But it is only an issue if the kind of evidence the first AI leaves matches the kind of evidence the second AI doesn't like to report. If they were trained separately, it is probably unlikely. And to make it more unlikely, we can stack reporter AI on each other.

Expand full comment

I don't know, all of this still sounds mind-bogglingly stupid to me. The big problem with the AI knowing what is "true" is that the AI has no experience of living in the real world. A language model that predicts how strings end is simply not in the same category as a mind living in the world, learning through experience, and reasoning from what it has experienced to reach conclusions that it has not (yet) experienced but which make sense in terms of that experience. It just seems fundamentally dopey to me, a prime example of highly-intelligent, highly-educated imbecility, to think that a model for predicting what words are likely to follow certain phrases could possibly be intelligent, regardless of how large a corpus it's trained on. The AI can't tell you whether breaking a mirror really leads to seven years of bad luck because it has no idea what a mirror is and has never had the experience of breaking one. Nor has it ever walked under a ladder or had a black cat cross its path. Nor does it have any of the life experience that would be needed to hypothesize and evaluate causal mechanisms that could make any of these things lead to bad luck. It's just manipulating text strings. It's not intelligent.

With regard to the problem of not knowing what an AI really "learning" when you train it, I remember, as a child, being told that you could train a dog to shake hands by saying to it, "Shake!", then shaking its paw and giving it a treat, and repeating this over and over. I tried this with my own dog. What she learned was that holding up her paw was the proper way to beg for scraps at the dinner table. Not quite what I had intended!

Expand full comment

Once again, I'm struck by the parallels between a hypothetical problem with a future AI system, and a problem we already have. Many of us have a workplace culture in which it is obligatory to go through a bunch of kayfabe. You can get paid by providing value, or get paid by making it appear as if you provide value.

Similarly, we are concerned by a hypothetical future AI just making one number go up, at the expense of all other values, but we already have that problem, because that's what corporations do.

To make progress on these hypothetical future problems, perhaps the place to start is to attempt to address them in their already-existing forms.

Expand full comment

Hi guys, I'm just visiting, and late to this party. :)

Expand full comment

No, ur charts and graphs are pretty but I lost interest after the first premise proved to be wrong, as here is what my GPT says:

Breaking a mirror is considered to be a symbol of bad luck in many cultures and superstitions. The belief is that breaking a mirror can bring seven years of bad luck. The origins of this belief are unclear, but one theory suggests that mirrors were once considered to be magical objects that could capture a person's soul. Therefore, breaking a mirror would shatter the soul, leading to years of misfortune. In reality, breaking a mirror is simply an unfortunate accident that can result in physical harm from the broken glass. It's important to clean up the broken pieces carefully and dispose of them properly to avoid injury. While there's no scientific evidence to support the idea of bad luck associated with breaking a mirror, some people may still feel a sense of unease or superstition about it. However, it's important to remember that accidents happen and that we have the power to create our own luck through our actions and attitudes.

What else could you want about a broken mirror!

Expand full comment