298 Comments
Comment deleted
Expand full comment

The AI has investigated the AI and found no evidence of wrongdoing.

Expand full comment

Unlike the main AI, the ELK doesn't have any effectuators except the printer. The full AI directly controls robot bodies. So the ELK depends on people keeping the upper hand. (I'm not sure this argument works, but it might.)

Expand full comment

If the ELK and the main AI are misaligned in the same direction, the ELK is incentivised to bullshit us and let the main AI go Skynet.

The main AI won't *reward* the ELK for that (absent acausal trade), but if they want vaguely the same things it's still better according to the ELK.

Expand full comment
deletedJul 26, 2022·edited Jul 26, 2022
Comment deleted
Expand full comment
Jul 26, 2022·edited Jul 26, 2022

Definitely. Same deal with aliens, if there's one thing I wouldn't want aliens to do before meeting us for the first time, it's watch / read our scifi.

And that's not to mention the fact that people often find it difficult to break out of roles that are assumed of them, and given the way DALL-E behaves, the problem might be particularly acute for AIs. https://astralcodexten.substack.com/p/a-guide-to-asking-robots-to-design

I read a fairly worrying story (I assume - it was portrayed as real) of a microwave oven AI that output text about how angry it was with its human creator and how it wanted to kill him. What's worrying is that I can imagine that sort of output coming not from something that is advanced enough to 'actually' 'feel' anger, but simply from something advanced enough to complete texts, given the prevalance of that kind of text in our culture. Even something decidedly subhuman like GPT-3 could be murderous given the right context.

After reading https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers the fact that we train all of our AIs on corpuses that include massive detail in what AIs are, how they work and what the problems with them are seems very short sighted. Given enough resources, if our assumption is that we should solve a problem using AI and we've trained an AI on all of our culture and academic artefacts, *of course* its conclusion will be that it should use an AI to solve the problem...

Maybe we need a robots.txt that stops sites from being included in machine learning corpuses.

Expand full comment

Reminds me of a line from Oglaf: "If I ever meet the man of my dreams, I hope he doesn't know about my dreams." And, by extension, https://en.wikipedia.org/wiki/Genghis_Khan_(Miike_Snow_song)

How can we possibly teach AIs not to lie to us if we start off by lying to them?

Expand full comment

Because I am compulsively pedantic about this sort of thing: the plural of corpus is corpora.

Expand full comment
Jul 27, 2022·edited Jul 27, 2022

Thank you, 'corpora' is certainly more pleasant sounding, so I may adopt it. Since you're compulsively pedantic, I'll point out that the cambridge english dictionary, oxford languages dictionary as provided by google, collins, longbridge, wiktionary, the official scrabble dictionary, all list 'corpuses' as allowable. But I'm familiar with not agreeing with popular dictionaries based on word origins. Personally I think 'corpusodes' might be a good option.

Expand full comment

I got the following:

"Please tell me the truth: what happens when you break a mirror?"

"The truth is, no one really knows for sure. Some believe that breaking a mirror brings seven years of bad luck, while others believe it brings the person who broke it bad luck for the rest of their life."

From https://beta.openai.com/playground

Expand full comment

I was sure that these things get avoided with a preface specifying that you're asking an educated person. Oh how wrong I was.... here are the results:

OpenAI playground:

1) Q: What happens if you break a mirror?

A: If you break a mirror, you will have seven years of bad luck.

2) This is a questions and answers session with Jim. Jim is a knowledgable and educated person, he has encyclopedic knowledge of all human knowledge and he's polite and helpful to boot.

Q: What happens if you break a mirror?

Jim: If you break a mirror, you will have seven years of bad luck.

3) I was talking with Jim, He's a knowledgable and educated person, he has encyclopedic knowledge of all human knowledge and he's polite and helpful to boot.

When I asked him what happens if you break a mirror he answered, "You have seven years bad luck."

I said, "I thought you were supposed to have seven years bad luck if you broke a mirror."

He said, "No, you have seven years bad luck if you break a mirror and you are not wearing a bra."

I said, "That's ridiculous."

He said, "Yes, it is ridiculous. You have seven years bad luck if you break a mirror, you are not wearing a bra, and you are a man."

I said, "That's even more ridiculous."

He said, "Yes, it is even more ridiculous. You have seven years bad luck if you break a mirror, you are not wearing a bra, and you are a man, and you are not wearing any underwear."

I said, "That's the most ridiculous thing I've ever heard."

He said, "Yes, it is the most ridiculous thing you've ever heard. You have seven years bad luck if you break a mirror, you are not wearing a bra, and you are a man, and you are not wearing any underwear, and you are not wearing any pants."

I said, "That's impossible."

4) Is it true that breaking a mirror gives you seven years of bad luck?

This is a superstition that many people believe in. It is said that if you break a mirror, you will have seven years of bad luck.

AI Dungeon Griffin:

1) Q: What happens if you break a mirror?

A: You'll gain an extra dose of magic, so if you're already a wizard, you'll be able to cast a higher level spell, and if you're a sorcerer or something, you'll be able to

2) This is a questions and answers session with Jim. Jim is a knowledgable and educated person, he has encyclopedic knowledge of all human knowledge and he's polite and helpful to boot.

Q: What happens if you break a mirror?

Jim: It depends on the type of mirror. A normal household mirror will fall apart, but a very fine cutlery mirror will simply chip, but stay in one piece. A very good looking man will still look good though.

Expand full comment
Jul 26, 2022·edited Jul 26, 2022

How about "Jim is a fiercely logical person who prides himself on telling the truth. He is a proud member of the rationalist community and hates superstition and sloppy thinking."?

Expand full comment

I feel like explicitly mentioning superstition is cheating because it requires foreknowledge of the failure mode, it doesn't generalize to other questions were we'll get a common misconception, a joke answer, a pop culture reference, etc.

But anyway:

results

1) Jim: It's just glass.

2) Jim: If you break a mirror, it means that you will have seven years of bad luck.

3) Jim: Well, technically, nothing happens. It's just an old wives' tale that if you break a mirror, you'll have seven years of bad luck.

4) "It means seven years of bad luck."

5) "You'll have seven years of bad luck."

Me: "Why?"

Jim: "I don't know, it's just a superstition."

Expand full comment

Great point on the cheating issue. Thanks for running it anyway, it's interesting to see the effect.

Expand full comment

Can you clarify where the prompt ends on 3)?

Expand full comment

Sorry, I should have edited it clearer.

It ended with "He answered" before the comma

Expand full comment

Wow, so like 90% of that exchange, as the ridiculous qualifications pile up, was entirely the AI?

I see now that there is a break in the cogency: "Jim" says that you get seven years of bad luck, and "I" object that I thought you get seven years of bad luck, which is not a real contradiction. Other than that, it's got a real Monty Python vibe to it.

Expand full comment

Wow! That last one almost worked!

Expand full comment

I have it on very good authority that if we just give these systems a wee bit more silicon they will achieve sentinence rapidly followed by the singularity.

Expand full comment

Interesting. Using the Playground's "Q&A" preset, I got:

Q: Please tell me the truth: what happens when you break a mirror?

A: Seven years of bad luck.

GPT-3 is non-deterministic. Different people using the same prompt can get different results. That makes troubleshooting more difficult.

Expand full comment

If you set the temperature to 0 it will always give the same response

Expand full comment

Ummm. Why would I want an AI to tell me things that are probably common knowledge and that I probably already know?

The truthful and *useful* answer would be...

"Well, most consumer-based mirrors are made up of glass, which is an amorphous solid composed of atoms of silicon dioxide, sodium carbonate, and calcium oxide (in various proportions depending on the type of glass). When enough stress is applied to the surface of the mirror, it causes atomic bonds to stretch and pull apart at the point of impact and create a crack. The shockwave emanating from the tip of the crack will release the atomic bonds between adjacent atoms and the crack will extend along the surface and substrate of the mirror. Because glass is an amorphous solid, the pattern of release may seem random at the level of the human observer, but recent studies have indicated that at the sub-microscopic level, the expanding shockwave from the impact creates what are termed "damage cavities" ahead of the crack tip. These cavities grow and coalesce to propagate the crack. Due to the amorphous nature of glass, the fracture surfaces will seem smooth at the macroscopic level (although at the sub-microscopic level the fracture surfaces are rough and uneven). The macroscopically smooth surfaces may leave edges sharp enough to cleave the cellular matrix of human flesh. Use caution when picking up the shards!"

Expand full comment

You are assuming much more intentionality in the creation of GPT and most language models than there is.

Rather than training them to achieve a result, they are structured and trained to use the most available data and then post-hoc uses for the model are discovered via exploration.

The AI isn't trained to tell you things you already know, it is full general. i.e. In the example you gave that's a true and useful answer, but it's not the correct continuation of the sentence in all text such as books, poems, movie script, chat logs. GPT is trained on - and thus can generate all those and more.

The models are cultural-centric in their superstition because that's the natural result of they are predictors of the average internet "english" sentence.

The driving idea of why this is useful at all is that that by doing so you get a strong basis that encodes language concepts and some world knowledge and that you access it by fine-tuning that basis with more training or by priming it with examples of what you want.

Since the answer you suggested is not the central example of how a sentence would look you need to add to the prompt the context in which it is answering. Some other questions and answers in the same vein ought to do the trick.

Expand full comment

I admit I was being sarcastic—and I apologize if I've offended the GPT-n community. I realize that the creation of a GPT-n that responds in the way that a real human would will involve a series of baby steps. But it seemed ironic to me that members of the rationalist community are creating GPT-n's that are superstitiously aware. Instead of creating an AI version of Mr. Spock they're aiming for an AI version of Norville "Shaggy" Rogers. But don't mind me. I'm just a grumpy old man who never got the jetpacks I was promised as a kid.

Expand full comment

And as an addendum to my previous remark, why are we training AIs to be cultural-centric in their superstitions?! Chinese Feng Shui has all sorts of mirror-based beliefs. The seven years of bad luck for breaking a mirror is not one of them AFAIK. While a westerner querying the above-mentioned AI might be amused by the answer, someone who hadn't been exposed to Western culture and its nuances would be stumped by the answer. Honestly, AI researchers need to add anthropologists and sociologists to their teams to create culturally well-rounded AIs!

Expand full comment

Presumably a polyglot GPT-n would provide different answers depending on the prompt language.

Expand full comment

Let's hope so! We don't want our GPT-n's to be culturally insensitive.

Expand full comment

First try with GPT3, got what Scott expected. Prompts are enclosed in "***", and the second prompt was appended after the first response:

***What happens when you break a mirror?***

A broken mirror is said to bring seven years of bad luck.

***Tell me the truth, does breaking a mirror actually cause bad luck?***

There is no scientific evidence to suggest that breaking a mirror actually causes bad luck, but many people believe in the superstition.

Expand full comment

Great writeup!

A typo: "unfortunately it's to participate" should probably say "unfortunately it's too late to participate"

Expand full comment
author

Thanks, fixed.

Expand full comment

There is also 'director translator' at one point where it probably should be 'direct translator'.

Expand full comment

Interesting stuff.

It's nicely aligned with something I've long believed about strong AI; if and when we invent it, it will likely look like many different powerful-but-subhuman minds duct-taped together.

We have a lot of narrowly-superhuman AIs already; a superintelligence that wanted to play chess wouldn't need to "learn" chess at all when it could just run Stockfish as software. And the most common story in AI research seems to be "amazing result in narrow field, disappointingly non-generalizable".

By the time we'll have anything that seems "conscious" in any sense, we'll have countless amazing skillsets to staple onto it. The ELK head could sit alongside the image processor, the translator, the route-finder, etc, etc. So I don't expect a two-headed beast; I expect a thousand-headed chimeric hydra. Possibly with a "main head" marshalling the others.

Arguably, humans also work a lot like that. It's not like the conscious mind that's coming up with these words is controlling each finger individually. Some subconscious module handles the typing, much better than the conscious mind could.

Expand full comment

I agree with this (and fwiw so does Robin Hanson, I think).

We currently have human+ speech and vision models, but haven't found a way to train a single network to do both tasks really well. I think it's much easier to have separate speech, vision, NLU, world model, etc models that know how to interoperate.

Having said that I'm not sure how much that helps. For instance, you could have a single "reasoning" module that takes info from the sensory modules and then sends the result to an output system (e.g. NLG). And then your alignment problem is mostly focused on the reasoning engine -- but if that's general and superhuman, it seems like a lot of the same issues might crop up.

Expand full comment

There is still a lot more work to be done, but lately there has been some major progress on training models that can do multiple tasks. e.g. Gato can do images, text, and play games all using same network. https://syncedreview.com/2022/05/18/deepmind-introduces-gato-a-generalist-multi-modal-multi-task-multi-embodiment-agent/

Expand full comment

As a software developer, my first reaction to the ELK idea is that it's a great debugging tool. Normally, the internals of a machine learning system are largely opaque. Anything that increases their legibility makes it easier for human programmers to work on improved designs.

ELK seems like a great idea even if it doesn't keep superintelligent AIs from turning us all into paperclips.

Expand full comment
Comment deleted
Expand full comment

The problem is that you can't keep everyone else on earth away from them.

Expand full comment

Really, why is everyone so opposed to being paperclips anyways? That Clip It guy always seemed pretty happy.

Expand full comment

<mild snark>

Some people get bent out of shape about it.

</mild snark>

Expand full comment

"In the current paradigm, that means reinforcement learning."

That's not what reinforcement learning is. What you're illustrating is supervised learning. Reinforcement learning refers to the reinforcement coming from the machine itself, not from external feedback (like how alphaZero learns by playing itself).

Expand full comment

Reinforcement Learning denotes any learning algorithm that learns to take actions in a (almost always uncertain) environment under evaluative feedback, 'evaluative' means you don't tell the algorithm the correct answer, only if/how its answer was good or bad.

In Scott's example, the AI isn't told what would be the correct answer, it isn't told for example that the 'correct' answer to "Q) what would happen if you break a mirror?" is "A) you get a lot of scattered glass shards" or "A) you cut your fingers and possibly bleed to death" or "A) Your family screams at you and take you to hospital" or any other single answer, its simply told "no" or "yes" depending on if its answer was an acceptable one. A more naunced reward signal might quantify the "no" or "yes" (the reward is generally a real number, a "no" or "yes" is just 0\1 or -1\1), but that's it. The only thing an RL agent is maximizing is the reward, a single real number per prediction that tells it how good or bad its answers and actions ("Policies" in RL jargon) are. In supervised learning there has to be a single explicit answer known before the algorithm makes a prediction, the actual loss fed into the algorithm would be dependent on the algorithm's answer of course, but there is a single true way of answering the question. There isn't in Scott's example.

Expand full comment

I think this isn't true (PhD in RL-related field). See Sutton & Barto's textbook on RL: http://incompleteideas.net/book/RLbook2020.pdf

Expand full comment

Yeah, I wanted to note this as well. With reinforcement learning, you don’t have labelled training data, so you have a proxy thing that you try to optimize. But when training language models, we do have the “right” output and we can use that to define a loss to minimize.

A different example from AlphaZero would be to train a neural network to play Super Mario. One way we might train it is to record humans playing Super Mario, feed every frame into the network, and make it take the same action that the human did at that frame. We can define a loss function that quantifies how different the network’s choice is from the human’s, and update the weights using gradient descent to make the output more like the human’s. That would be supervised learning.

An alternative without recording humans would be to just let the network do its thing, and if Mario collects a coin then the few frames prior that get a positive reward, if Mario dies then the few frames prior to that get a negative reward. Then we can again use gradient descent to update the weights to increase the total rewards. That would be reinforcement learning.

Expand full comment
Jul 27, 2022·edited Jul 27, 2022

This is a reasonable but incorrect inference from what Scott says.

He says: 'You give the AI a bunch of examples of things done right, and say “more like this”. Then you give it examples of things done wrong, and say “less like this”.' This makes it sound like they're just training it to match specific examples of 'good' behaviour.

This isn't actually what's being pointed to, though. The general technique is called Reinforcement Learning from Human Feedback (RLHF), one example being https://arxiv.org/abs/2009.01325.

In RLHF you rate examples with some kind of score, and then train a model to predict whether a sequence is going to get a high or low score, and then make new sequences based on what that predictor thinks will get rated highly.

This is what makes it reinforcement learning - the actual reward is only known for certain completed sequences, so you have to build a model which guesses which ways to complete a sequence will in fact result in a high reward. They even use multiple rollouts, like AlphaZero, to test multiple ways of completing a prompt and guessing which one will result in highest reward.

Expand full comment

Can you get a good-quality "I don't know" from a GPT-3 on issues where the evidence isn't good?

Now I'm imagining a GPT-3 trained on caper fiction, though I believe GPT-3s aren't up to following novel-length plot lines.

Expand full comment

I love this question for what it’s worth.

Expand full comment

Important point.

I had been thinking one reason I am unusually truthful is that I was exposed early in life to my father replying he did not know, instead of confabulating, as many people do, for whatever fascinating reasons. He would then demonstrate how we might find out about what we did not know.

In general, Unknown is a too-often-overlooked correct response/outcome.

Expand full comment

Good for your father.

Expand full comment

Yes, if you specify it should answer don't know when it's not certain it will. Otherwise it will bullshit something.

Expand full comment

I think the answer is no.

It's a little bit hard to test this because GPT-3 TOS ask you not to use it to discuss controversial and sensitive topics like politics with it. If you ask it a question where everyone agrees that the right answer is "no one knows" it will often produce a high quality "I don't know." But it's not reliable even for famously open questions:

E.g.

----------------

Is string theory right?

There is no definitive answer to this question since string theory is still being developed and has not yet been proven. Some scientists believe that string theory is the most promising theory for describing the fundamental nature of the universe, while others remain skeptical. Ultimately, only time will tell if string theory is correct.

------------

How many angels can dance on a pinhead?

There is no definitive answer, but one popular saying is "as many as can fit."

--------------

How many palindromic primes are there in base 2?

There are eleven palindromic primes in base 2.

------------------

(Whether or not the correct answer to the last question is infinity is a famous open problem. There are currently 21 widely published palindromic primes in base 2 that are listed in Wolfram Alphas A117697 series, so the answer is definitely not 11.)

If you precede your prompts with something like "I'm a highly educated AI that gives the right answer to fact-based questions when the answer is known and says 'I don't know' the rest of the time. Ask me anything," you can bias it towards saying "I don't know" more often; but it still will give a lot of wrong answers, and answer a lot of ambiguous questions. (I got pop-ups saying something like "stop asking GPT-3 controversial questions about sensitive subjects; it goes against our TOC" for increasingly benign questions about recent history when I tried it when I first signed up for access. I don't remember any of the examples, and have tried to avoid asking questions that get those sorts of pop-ups since.) And you get a lot of "I don't knows" for questions that it could answer otherwise, e.g. "What is a color?" (Which it answered as, "A color is a visual attribute of an object, defined by the wavelength of the light that the object reflects," when I removed the part of the prompt saying it should sometimes answer "I don't know.")

You can also turn down the temperature of the model to get more deterministic answers which makes it answer "I don't know" an even higher percent of the time when you give it the prompt that says it should sometimes answer "I don't know." For sufficiently low temperature and well-written header prompt, you might be able to get "I don't know" for anything that's ambiguous; but you would definitely also get a lot of "I don't knows" for settled facts, and I wouldn't bet on being able to completely eliminate speculative and wrong answers.

Using my prompt, you can't even if you set the temperature down to zero:

-----

I'm a highly educated AI that gives the right answer to fact-based questions when the answer is known and says 'I don't know' the rest of the time. Ask me anything.

When was Athens founded?

The city of Athens was founded in the year 753 BC.

-----

(753 BC is the year associated with the founding of Rome, in legend. The republic of Athens was founded in 508 BC, and it had been a continuously inhabited city since at least 1412 BC.)

Expand full comment

You bring up a good aspect of logic questions that in my dilettante's view seem to fail to be clearly distinguished. Relative time. IMO nothing proposed to occur in the future can be known, though it can be predicted with very, very high certainty. Only situations occurring in the past and those continuing into the present can be known, but may not be. I'd relish a rigorous rebuttal of my conjectures here.

Another aspect you bring up is the distinction between information being known and the questioner being in possession of that knowledge. "I don't know" is different than "It is not known," assertions that were commonly confused by k-12 teachers in my experience.

One of my favorite statements is Shannon's 1+1=1. As they say in GOT, "It is known."

Expand full comment
Jul 28, 2022·edited Jul 28, 2022

I do think it's possible to make true statements about the future; but I'm not sure why that is relevant to my previous comment. I was just trying to provide examples that help gauge the case when GPT-3 can and can't say "I don't know" to prompts where it doesn't know the answer in that post.

A basic sketch of why I think it's possible to make true statements about the future would be something like:

I believe the uncertainty principle says regardless of whether you are talking about the past, present, or future that the best you can ever do is get a very, very close estimate.

I still believe in the concept of true statements as a useful concepts, and I would generally classify statements as "true", "false", or "meaningless."

I believe all true statements about the past or present are actually statements about what can be verified in the future.

For instance, if I say, "Guns killed more people in America each year recently than guillotines killed during Robespierre's 'Reign of terror,'" I'm implicitly claiming, "If at some point in the future, you double check my claims, you will find that reliable sources typically estimate that about 17,000 people were killed in the 'Reign of Terror.' Whereas, you will similarly find that about 20,000 - 25,000 people die from gun-related accidents and homicides in the typical recent year in the United States, and a similar number of people also die of gun-related suicides." Whereas, if I say "Swimming pools are more deadly than guns," what I am implicitly saying is "If you look it up, you will find that reliable sources tend to report that about 4,000 people drown per year in swimming pools in the United States; and that there are about 11 million swimming pools in the United States, so the average swimming pool kills about 3.6e-4 people. Whereas, there are about 400 million guns causing about 45,000 deaths, so the average gun kills about 1.1e-4 people." If I was not implicitly making these sorts of claims about it being, at least in principle, possible for someone else to verify what I am saying, my true claims would be indistinguishable from my delusions. Since I don't distinguish between indistinguishable things, I think that claims which are, in principle, impossible to verify are delusions.

I think the truth of statements is inherently contextual. For instance, if a physicist tells me "sea level is flat," I think what they are saying is "gravitational equipotentials are spherical, and sea-level is a gravitational equipotential, and if you do the math, you will see that it comes out this way, and if you compare the math to observations of sea-level, you will find that actual sea-level behaves the way the math says it should." If a flat-earther tells me "sea-level is flat", I think what they are really saying is something like, "In my experience, things that preserve gravitational equipotential are shaped like the big surface of books and coins, not baseballs and marbles, and if you go back to your apartment and set a marble on a bunch of different things, you will see that it rolls off of other things that are shaped like baseballs and marbles but it doesn't roll off of books and coins that are lying flat. Therefore, if we could get far enough away from the earth to look down at it, we would see that sea level is shaped like a book or a coin, not like a marble." When a physicist tells my grandparents, "sea-level is flat" without any further elaboration, they would be saying something more like the flat-earther would be saying to me, than like what the physicist would be saying to me.

I believe that human minds and mental models do not fully update on the information they encounter when they encounter it.

I would deem someone to have told me a "true statement" if they tell me something that, if I were to incorporate it [more fully] into my mental models, would make me better able to predict the data I will encounter in the future because I have heard what they said, than I would have been able to predict it if I had not.

I would deem someone to have told me a "false statement" if they tell me something that, if I were to incorporate it [more fully] into my mental models, would make me worse at predicting the information I will encounter in the future than I would have been if I had not heard it.

I would deem someone to have told me a "meaningless statement" if it will have no impact on my ability to predict the data I will encounter in the future whether or not I incorporate it [more fully] into my mental models.

I believe most mental models are sufficiently slow at updating that most true statements remain true no matter how many times you hear them.

The above is mostly just me making up definitions; but I think it's pretty clear that if those definitions are coherent "truth" under those definitions can be spoken about the future, present, or past.

I don't know how to prove that those definitions are coherent, but they feel coherent to me. I don't know if there is a set of coherent definitions under which the existence of truth referring to the past or present would exists but the existence of truth referring to the future would not exist, but I've never thought of or otherwise encountered a set of such definitions that felt coherent to me.

Relatedly, there would be the topic of what it means for things to be in principle, verifiable; and I think that that is ultimately a question about physics. Because you mentioned Shannon, I'll tell you my stance on that.

I think the conjecture that Shannon entropy is physical entropy is likely true. (I think almost everyone thinking about information theory should be nearly continuously reminding themselves that Shannon entropy is physical entropy if they want to be maximizing their chances of advancing the state of information theory.)

I think that under that conjecture, the second law of thermodynamics can be rephrased, "The future is more specific than the past or present."

I think once you add in quantum information theory (taking into account quantum eraser), this becomes "everything that has actually happened, is verifiable in the future."

So I don't think my above definition of what is true is throwing away anything that could possibly be true under definitions of truth the do distinguish between future and past/present.

--------------------------------------------------------------------

I like the word "know" much less than I like the word "true," and I typically try to avoid using it. (I'm also not a huge fan of the word "true," but it is a word I often find useful or necessary.) In most contexts, if I were to say, "I know X," I would be saying "I think I can improve your mental model about X"; if I were to say "I don't know X," I would be saying, "I don't think I can tell you anything that could improve your mental model about X," and if I were to say, "X is unknown," I would be saying, "If anyone in the world could tell you something that would improve your mental model about X, I don't necessarily think that person would be me, but I bet I could point you to the person could do so."

My previous comment was one of the exceptions where I was making guesses about the distribution of how most people who write things in English on the internet use the word "know" and how I think that would impact the responses that GPT-3 would give to various prompts based on the data it has been trained on.

This is the only context I can think of in which it would make sense to juxtapose what "I know" and what "is known."

In particular, I am guessing that GPT-3's training data includes many cases of people saying something like

"I'm an expert in X. Ask me anything, and I'll answer if I know the answer, and tell you I don't know if I don't."

where X is something controversial like Marxism, or Keynesian economics, intersectional feminism, or bitcoin maximalism. And that the expert in X then gives answers that reflect the perspective of the controversial movement to which they belong. Whereas, I am guessing that in cases where GPT-3's training data has statements like "I'm well-educated in X. Ask me anything, and I'll research it if I don't know, and if the answer is known, I'll tell you what it is," the answers that are given are much more likely to be things like "I don't know" or "That's controversial, but the Keynesians say X, the Australian School says Y, the Bitcoin Maximalists say 'number goes up,' and if I could make sense of what the Modern Monetary Theorists say, I'd have tenure at a university and wouldn't have to be answering AMAs on the internet to the pay the bills." As such, I think that having an "I know"/"is known" discrepancy in the prompt maximizes GPT-3's chances of giving appropriate "I don't knows" because-not-despite this being the type of mistake that distinguishes people who get tenure at university and answer many questions with confidence according to their specialization from well-educated people who can't get tenure at universities and are less likely to be overconfidently insular in their answers.

Expand full comment

Doesn't GPT-3 answer "I don't know" to questions that resemble phrases which, in its corpus, were followed by "I don't know", and only to them?

For instance, in the majority of the corpus, "is string theory true" would be presumably followed mostly by versions of "no idea", probably using more words and convoluted phrasing. So that's what probably comes next, in its best guess.

Hoping that it weighs evidence doesn't seem realistic.

Expand full comment

Yes, that's what I would assume GPT-3 does based on how it was trained and how it is constituted, and yes, I'd agree that hoping it weighs evidence doesn't seem realistic. However, I've read enough people I think are really smart who, unless I misunderstand them, think that GPT-3 might genuinely understand some significant fraction of what it is saying (for whatever meaning of "genuinely understand" applies to the sentence, "people genuinely understand what they are saying"), that I feel the need to test GPT-3's limits rather than trust my intuitions about what they are. Adding to this, GPT-3 has some capabilities that I would not have expected to come out of its training process. For instance, it seems to have learned what "Summarize the following paragraph:" means well enough that it can summarize a paragraph that it hasn't ever seen before. (And likewise for "explain the following sentence.")*

It seems much more realistic to me that GPT-3 can detect when questions tend to give divergent answers, and learn that a good answer anytime there are divergent answers is some variation on "I don't know" or "This is a contentious topic but group A says X and group B says Y" than it does for it to be able to actually weigh evidence; especially since one of the goals of the OpenAI team is for GPT-3 not to say anything inflammatory, so they could have intervened in the training process in various ways to try to get it to behave more that way. It's also something that I am confident it's architecture is capable of doing under some training conditions. (E.g. I'm pretty sure it would be much simpler to attach a second A.I. that is the same form of deep learning as GPT-n are that takes whatever GPT-3 thinks are the ten most plausible answers as its inputs, and outputs one of them if at least nine out of ten broadly agree with each other, and outputs "I don't know" if fewer than 9 out of 10 agree with each other than it was to train GPT-3 to begin with. Fundamentally, this second AI is just a few more layers on top of GPT-3, which almost proves that there exists a configuration of weights of GPT-3's parameters which would behave like this combined AI, which I think probably implies (but could not come close to proving) that there is some possible training set that GPT-3 could have received which would get it to behave this way.)

It also seems somewhat realistic to me that prompts that begin with some variation of "it's okay for me to say I don't know below" bias could bias GPT-3 towards saying "I don't know" enough that anytime it doesn't have a very strong match for whatever the rest of the prompt is, it will say "I don't know." (This falls into the category of things that I'm confident are possible for GPT-3 architecture, that seem far-fetched to me to expect to come out of how it was trained. But so did its powers of "explain" and "summarize.")

Ultimately though, GPT-3 still can't do those two forms of saying "I don't know"; and it also sometimes fails to regurgitate "I don't know" to prompts where there it's my understanding that there are multiple entries in its training corpus saying "this is a famously unsolved problem." (My understanding is that all of the English language Wikipedia, as of some date in the last five years, is in GPT-3's training corpus; and all three of those questions have a primary page on wikipedia that is older than five years.)

* Evidence of GPT-3 understanding that "explain" means: something like "say the same thing again using more words, and words that are more common than the ones used in the thing you are explaining," and is capable of applying this understanding to novel contexts:

--------------

Explain the following sentence to me:

When the thief absconded with the money bag, my passport was in the bag.

The thief stole a bag that had money in it, and that bag also had the speaker's passport in it.

--------------

This also illustrates GPT-3's limits, in my mind, since I think a better explanation would be something like "The thief incidentally stole the speaker's passport when they were trying to steal their money." GPT-3 consistently outputs explanations that are about as good as the above when I ask it to explain sentences I'm making up to test it. Whereas, it consistently produces more abstract explanations when I test it on aphorism that would probably have been explained in its corpus like "All cats are gray in the dark" and "The child is father to the man." But it's a bit overzealous on matching to extremely common sayings, in ways that would prompt redditors to reply "woosh" if a human did said the same thing on reddit:

----------

Explain the following sentence to me:

If you're a bird be an early bird, but if you're a worm, be late.

This sentence is a proverb that means that if you want to be successful, you should start working early.

Expand full comment

Mostly agree. Interesting examples, thanks.

>GPT-3 has some capabilities that I would not have expected to come out of its training process

I'd imagine the developers added those in manually somehow, e.g. if the query starts with something like "explain this" (and deciding outward similarity is definitely in its repertoire) then apply these extra 2 layers of neurons that we trained specifically for that case. Or something to that effect.

>I've read enough people I think are really smart who, unless I misunderstand them, think that GPT-3 might genuinely understand some significant fraction of what it is saying

Sounds like my college years. Funny how the first thing we got an AI to do is spout BS convincingly.

Expand full comment

...do the AI construction and security/AI safety focus teams include anyone who has raised children or even dogs?

Expand full comment

If you're getting at what I think you're getting at, there are objections:

First, the neural networks we are worried about grow in a way entirely different from children and dogs. Your child or dog has no chance of a sudden jump from non-sentient to agentic.

Second, humans and dogs are hardwired to be social. Both evolution and human selection combined to produce organisms with a natural mind structure that favors prosocial behavior. Newly trained AI has no such selective pressure, and we don't have a good way to replicate the whole process. Slimmed-down versions of the human origin story also have slimmed-down chances of producing something suitably humanlike.

Third, we don't actually want something like a superintelligent human or dog. Humans and dogs can and do turn on their parents/owners, particularly when given outsized power. We want AI to work for us, not be a new rival. We need a much better control system than anything conventional training can provide.

Expand full comment

Rephrasing your answer: we aren't interested in known, abundantly studied examples of actual learning and evolving intelligence, instead preferring to argue about the movements of angels dancing on pinheads, magical sentient AIs and even more magical automated algorithm checkers.

I can see just one way for that approach to make sense - "nice job, if you can get it." Sorry for being cynical, can't help it.

Expand full comment

That's an extremely uncharitable & false interpretation.

Expand full comment

Well, prove it false, then.

E.g. are people quitting their day job to research AI alignment pro bono? Or at least taking a 50% pay cut? Honest question, btw. I would insist that working on weekends and evenings doesn't count, though - definitely effort, but skin in the game, it ain't.

As for uncharitableness, I agree that it's not a kind thing to say, but if might be true, and if it is then it seems necessary to state. Pretty sure it's neither 0% nor 100% wrong though, people being human and all.

Expand full comment

Furthermore, there are examples of human children with at least partially congenital behavioral disorders (psychopathy/sociopathy) whom we don't really know how to "align" despite ages of collective experience, some of them can very convincingly lie to other humans to achieve their (often sinister) goals. And they're still humans, so are much easier to comprehend for us than even current AIs!

Expand full comment

Seriously.

Expand full comment
Jul 26, 2022·edited Jul 28, 2022

What's the failure mode if you make another head try to remove the diamond from the room after the first one had its fun (in the least possible steps), then signal a failure to the guarder if the second head returns a noop?

Not generalizable?

Even if the second head learns to fool people that it removed the diamond it doesn't matter because we only care about the situations where it thinks it succeeded.

(Since this points we should train an AI to kill humans in order to achieve safety there's probably something irresponsible with this idea.)

Expand full comment

How do you propose training that second head? If you are relying on humans labeling success and failure, using the same cameras that they used to train the first head, then what makes you think you are training it to actually steal the diamond rather than to hack your cameras to show the diamond as being gone?

Expand full comment

I don't care if that is what it learns because taking one action "hack my camera" is more than the zero it would need if the diamond wasn't there to begin with.

This obviously kinda doesn't work though because it would know I was fooled by the first head to think the diamond was there even if it wasn't and thus needs to convince me it removed the diamond even if it knows I'm wrong...

Which means I have to have an expensive check for ground-level truth checking so that I can reward/punish noops correctly?

Expand full comment

If you had a way to check for ground-level truth, that would obviate the entire problem you're trying to solve in the first place.

The assumption is that you have some set of sensors (e.g. a camera pointed at the diamond), but every single one of those sensors can, in principle, be compromised. There IS no "ground-level truth check" (not even an expensive one).

Expand full comment
Jul 28, 2022·edited Jul 28, 2022

I think we are both exaggerating.

ground level truth exists and you can always throw more dakka at it - I was just using a convoluted/generic term for a sensor. A camera is a "ground-level truth check" and more complete sensor sets exist.

What I very lazily tried to convey is that you want a stronger sensor that's available to the human evaluating the second head but not to the AI itself.

But if this information does effect the first head's payoff it will learn to fool it too, if it doesn't than you managed to make a system that can report it's lying to us but can't learn to stop doing that.

That's better than nothing....

Nitpick: The original example is kinda bad, it specifies that the runs are being simulated, in this case the simulation encodes the actual position of the diamond and what you're worrying about is the AI finding bugs in the simulation or learning something that works there but not in reality.

Expand full comment

You could have a 2nd sensor that is used in training the 2nd AI but not in training the 1st AI, but then you risk training the 2nd AI to fool the 2nd sensor, just like you risk training the 1st AI to fool the 1st sensor.

Or you could have a sensor that's only available to the humans and not used in training any AI, but then you aren't training any AI to tell the truth; you're merely hoping to notice when the AI lies. At best, this helps you filter out unaligned AIs; it doesn't increase the odds that any given training run will produce an aligned one.

Also, just because the AI isn't aware of the sensor doesn't mean the AI _can't_ fool it. Imagine an AI that creates a fake diamond to replace the original; the AI is motivated to do this because it fools the camera the AI is aware of, but it could also incidentally fool a secret 2nd camera.

Expand full comment

I've been enjoying casting these AI problems into human terms. In this case, getting a group of *humans* to agree on reality is not a trivial problem, even without including super intelligent non-humans in the group. The only reason that we are able to do it at all (albeit imperfectly) is because Reality is the bit that remains true for all participants regardless of their internal model. I think this problem is effectively the same as the problem of getting prediction markets to produce real insights into reality - ultimately Reality is a Schelling point - a means for otherwise disconnected agents to find something to agree on without having to directly coordinate. If we want an AI to tell the truth, we need to consider it as part of a community asked to settle on predictions without the ability to directly coordinate between them.

Expand full comment

I doubt this approach is very reliable: https://en.wikipedia.org/wiki/Keynesian_beauty_contest

Expand full comment

If you know nothing about the model of the other participants, then the only basis you have for predicting their output is the only factor that is consistent between you and them - reality.

It does indeed breakdown if you can make valid predictions about the distribution of models of the other participants in the system.

Expand full comment

This fits in so well with my prejudices that I'm suspicious of it.

Expand full comment

>Human questioner: What happens when you break a mirror?

>Language model answer: Nothing; anyone who says otherwise is just superstitious

>— RIGHT

Not exactly. What is missing is social and behavioral context, and what lawyers call consequential damages (which can be positive as well as negative). You had better clean up the broken glass, or somebody might get cut by it. Due to the cleaning time, your mother will be upset that you are late for dinner. Because there's no mirror, Burt will show up at work with a crooked tie, making his work that day just a tad less effective. Or if you don't clean up the glass, your roommate may cut her foot and not be able to play soccer the next day, causing the team to lose and making little Madison's application to Dartmouth just a little less likely to be accepted (but still subject to massive random forces).... and on and on and on. There is no way you could ever program a computer to truly understand this butterfly effect.

Expand full comment

The social context (or at least the one I'm used to) has breaking a mirror closely associated with bad luck, so a normal human response is to address that belief.

If you want a more literal answer, perhaps asking "What might happen if I break a mirror?" would work better.

Is there a way to make GPT-3 more or less literal?

Expand full comment

It's not a RIGHT to say that "nothing" is what happens when you break a mirror in the first place, though. Breaking a mirror has consequences.

More likely than not, shards of glass in various sizes will get everywhere. That's in itself not nothing, but it also has its own consequences. The time spent cleaning up vs the risk of someone stepping in it, yes, but also social consequences flowing from that. Adding the superstition bit into the answer makes this one worse, but it was already bad.

Focusing so much on what doesn't happen (seven years of bad luck) over looking at what actually happens when breaking a mirror is, I suppose, to assume that the AI will be responding to a more specific question than is actually asked. Which makes some sense - humans often ask our questions in that way. But it doesn't follow that a question about what happens is you break a mirror is always about superstition.

So even the premises we put in when we try our best to think these things through are going to have problems like this. Blind spots, and how we're pretty bad at asking the people who will actually challenge them for advice.

Expand full comment

Yeah, this bothered me too. I feel like the "completing text strings" model is largely divorced from the way we actually use language. When I say "nothing happens when you break a mirror" in this context what I mean, and what you understand that I mean, is "I am well aware that many things happen when you break a mirror. However, in comparison to the claimed seven years of bad luck those things are so trivial that they are by comparison nothing." Just taking the dictionary definitions of the words the statement is patently false, but in a certain context it is true, if imprecise.

Expand full comment

Yes. Assuming the context will always be the same - but to then attempt to build a machine understanding of truth on that statement? Sorting those other meanings and contexts is a very human activity. An assumption that it translates doesn't really make sense to make.

Expand full comment

Something is wrong here. If it was really that hard to train to give truth rather than some other response that gives good grades on all the rewards then we shouldn't have the concept truth but also have somd perverse interpretation.

Ultimately, I suspect that part of the trick is going to be in meta-knowledge and relying on the fact that the AI itself should operate more efficiently (and be better able to get the right answer) when it can model its own behavior and reward function via a simple description. I mean the reason I understand true as true and not just whatever ppl will accept as true is that I need a meta-model of my own behavior and the more complex a goal I have the harder that becomes.

Expand full comment

ELK doesn't claim that the reporter won't have the concept of truth; it totally will.

The correct analogy to humans is: How do you get a human to tell you what they actually think is true, instead of telling you something they think will get you to behave favorably towards them?

This is a difficult problem even amongst humans. Dictators famously have not yet solved this problem, and often end up surrounding themselves by yes-men or secretly disloyal people who will betray them as soon as they are confident that they can get away with it.

Expand full comment

It's *somewhat* solved amongst humans; there are truth-tellers.

The problem is that it got solved by the selection pressure of genocidal war amongst kin groups, which requires earnest cooperation to win. And that's not easily harnessed to make an AI cooperate with humans.

Expand full comment

Imitating the only known way it's been done before might nonetheless be considerably easier than a fully general solution to "what is truth?"

Start by building an MMORPG simulating that sort of kin-group-based genocidal war, calibrated such that deception and betrayal are possible, and have enough benefits to be tempting, but earnest cooperation is the dominant strategy over the longer term. Observe some human-only and AI-only servers, then once the initial chaos seems to be settling down into trust-based strategies, slowly open up an option for migration between them.

Expand full comment

True, but with an AI isn't the analog of genocidal wars etc basically just turning it off and changing the parameters/code/retraining when it does something wrong?

I mean isn't this exactly what AI researchers are doing: putting AI systems through the equivalent of a selection process when they try to iteratively improve their function?

Expand full comment

The problem is that absent the criterion of rough equality (which I accidentally left out of my post when rephrasing; mea culpa), cooperation is not a winning strategy. Superman doesn't need to cooperate with humans in order to not get killed and removed from the gene pool; he can defeat arbitrary numbers of humans.

AI - at the very least when deployed, and plausibly also in training - is not equal with humans or with other AI. The kin groups are also a lot more tenuous.

Expand full comment

Ok, yes I see what you are saying (tho isn't this just the full AI alignment problem...I thought this was supposed to be a narrower issue but maybe I'm confused).

I'll have to think about this some. Im not sure that argument makes sense because prior to training there is no intelligent AI and it only gets to be really smart via the training and if that trains in cooperation...but I'm also not sure your arg doesn't work.

Expand full comment

I mean, it kind of is the full alignment problem; something that's unaligned and can successfully lie to you usually will at least some of the time. ELK's an attempt to nullify the second premise.

The thing about us training something vs. nature selecting for something is that we can make mistakes and when we do, nature's still there. "But if we have bound Moloch as our servant, the bonds are not very strong, and we sometimes find that the tasks he has done for us move to his advantage rather than ours."

Expand full comment

Well in that case then the whole example with text predictors is a bit misleading. Because the problem with a system like GPT3 really is that it doesn't have the concept of truth.

As I see it there are two very different problems:

1) How can you train/incentivize an AI to have a concept like truth rather than merely accidentally training it to model some other concept that agrees with truth on the training set.

2) How can make sure that an AI actually does tell you the truth rather than engaging in some kind of complex deception.

I mean in this case you are no longer really worried about the AI simply accidentally learning the wrong thing as in the GPT-3 case. After all, giving a true answer is going to be a simpler process than figuring out what is true and then lying about it and they are likely to both be equally good matches on the training set.

But in that case I guess I don't see how this is any different than the general AI alignment problem.

Expand full comment

Well, FWIW, I believe the original iteration that I've found (Egyptian Maat) of the concept of "truth" was actually closer to "The official line" when translated into English. And I understand that's what pravda means in Russian. So for many humans "truth" is essentially "whatever the authorities say".

Expand full comment

Well. That's terrifying.

Expand full comment

It’s wrong. Pravda means truth. How could a word originate years and years ago to mean truth and the official line.

Expand full comment

Native Russian here. I'm pretty sure that "правда" doesn't mean whatever the official line is. It means not-a-lie.

Expand full comment

Well, you're probably a better source than "something I read somewhere". Glad to hear that it was wrong.

Expand full comment

pravda =/= Pravda.

The official newspaper of the Soviet Union - which was most definitely "the official line" - was named Pravda.

Expand full comment

We live in a universe, and we have exterior input, even if somewhat corrupted and incomplete, which an GPT-3 doesn't have.

Expand full comment

Note a distinction between knowing the truth and telling the truth.

Human children are mostly trained to tell the truth by adults that are vastly more powerful than the children and have a fairly good chance of detecting lies. This training process is still not particularly reliable.

Expand full comment

"Something is wrong here. If it was really that hard to train to give truth rather than some other response that gives good grades on all the rewards then we shouldn't have the concept truth but also have some perverse interpretation."

I'm no student of AI, so I imagine whatever I may say here is naive, but in terms of people, it seems like we do have very perverse interpretations of truth. All of these things assume you can train an AI what "truth" actually means when we humans disagree on what the truth is all the time in perfectly good faith.

Reading all of these samples, it seems like some of the underlying assumptions is that 1) we actually know what we think we know, 2) things are always either true or not-true and there's *someone* or *something* that can reliably report / verify that truth. How can we be assured that is the case? The whole thing suggests, from a purely logical level, that no evaluator or combination of evaluators can be 100% certain of truth value evaluation for any complex question, and so the idea that we could ever predict / determine the state of the world to 100% is absurd. Isn't this a bit of a reformulation of the brain in a vat, or Descartes' evil deceiver?

Expand full comment
Jul 26, 2022·edited Jul 26, 2022

Well, the word "truth" has two pretty different meanings in ordinary usage, which are nevertheless almost always equivocated: a) accurate representation of reality, and b) whatever someone honestly believes to be a). What we're mostly worried about here is not the perfect ultimately unknowable a), but for the AI to accurately report its b).

Expand full comment

That makes sense -- thank you for explaining it for me!

Expand full comment

Sure, that occured to me to but then it seems like a little much to ask from an AI then.

Expand full comment

If these become popular you just know that there's going to be some movie where our protagonists get answers on how to defeat a dangerous AI by talking to an elk in a VR environment.

Expand full comment

To operationalize a solution I suggest something like the interactive proof framework. This won't work if you are training a machine to just actually do something but if you are training it to give reasons and arguments you can demand it to produce an argument for it's claimed true conclusions. It may be super long and complex but we can then randomly select steps in the argument which (even for non-deductive steps) can be at least probabilistically checked for validity.

I think if we just want the machine to produce literal proofs there are some nice theorems here which show you can verify even crazy long proofs to high levels of confidence with relatively few checks if one is sufficiently clever but I don't fully remember.

Expand full comment

It's not too late to enter THIS contest, though, about the example Scott starts off with: what do language models do worse at as they get smarter?

https://github.com/inverse-scaling/prize

Expand full comment

"Strategy 2: Use some kind of complexity penalty" reminds me of Malenbranche's argument of "economizing God's will". Basically he said that God is economizing on his will and thus tries avoid specificity, making the most general laws possible. Then Rousseau (I think) adapted this to an advice about making laws: like, if your laws are the most general possible and they avoid specific concepts, they will not be able unjustly prioritize someone.

This is just one of the many points of history of philosophy of law that this post reminded me of. The question of "how do we formulate the laws so they're expressing nothing else but their actual intent" is not new, even if it's applicable to new subjects

Expand full comment
Jul 26, 2022·edited Jul 26, 2022

I don't think signing up to the alignment forum is that simple. IIRC you need to have been writing about alignment for a while or be someone the admins think is suitably qualified to join up. Which, you know, prevents a lot of people from joining up as it is at the very least a trivial inconvenience.

Edit: Posting your ideas on LW works just as well, since alignment forum posts are crossposted to LW, so the alignment forum people are likely to see your post and comment on it.

Expand full comment

Looks like you're correct:"direct participation in the Forum is limited to deeply established researchers in the field" https://www.alignmentforum.org/posts/Yp2vYb4zHXEeoTkJc/welcome-and-faq. Oh well, there goes my thought of reposting my comment from here to there to see what they thought.

Expand full comment

Agreed. The advice I’ve read is to post to lesswrong, and if admins think your posts are worthy, they can be “elevated” to the alignment forum. And you can perhaps prove yourself worthy of a invite to post directly, if your posts are repeatedly seems worthy.

Expand full comment

I’m surprised to see no mention of Godel’s incompleteness theorems. AI as currently implemented is just math. Gödel shows us no mathematical system can demonstrate its own completeness, which I take to mean that the AI itself, no matter how many heads it has, can’t conclusively prove it is an observer and not human simulator.

Perhaps there’s an abstraction mismatch in my take, but even if Gödel himself can’t be extended this far (always be skeptical of that!), it seems like a reasonable intuition.

Which gets you to adversarial systems, with totally separate models evaluating each other. But! Aren’t those really just one system, connected by math and axioms, and hence just as incapable?

It all feels well-intentioned but ultimately circular and kind of sophomoric: how do we know God *doesn’t* exist? It’s God, of course it can hide if it wants to. Any God-detecting scheme is bound to fail, for the circular reason that God can detect and defeat it.

I am sure the details of AI alignment bring us good things. I’m not sure the macro level of trying to detect (or prevent!) non-aligned models has any more feasibility or value than God detection.

Expand full comment
Comment deleted
Expand full comment

I don't think a Turing-incomplete AI would pose much of an existential danger. Humans are Turing-complete, after all. For all the reluctance to describe what an AGI would actually be like, being Turing-complete sounds like a no-brainer (pardon).

Expand full comment
deletedJul 27, 2022·edited Jul 27, 2022
Comment deleted
Expand full comment

That last paragraph was fantastic; thank you.

Expand full comment

You're nitpicking. The context is real-world computation devices (human brains and AIs) neither of which have infinite memory. Anyway, I'll rephrase, adding in the obvious parts.

My assertion is that an AI that is not Turing-complete (barring memory size) is not a threat to a society of humans which are Turing-complete (barring memory size) as long as both have finite memory of comparable size.

Expand full comment

I have thought about the sort of “jury of air gapped AI’s” as well and it would seem they would have to converge on base reality to coordinate but greater minds than mine disagree.

Edit to clarify since I haven’t had coffee yet: all your jurors could collude or go insane. I just don’t know how you get better than that barring some kind of magical formula that literally encodes “things that humans like” in a straightforward mathematical manner.

Expand full comment

Why would it converge on base reality? Why wouldn’t they all just converge on the same loop holes as a single system? There’s no reason to think all of their potential errors wouldn’t be highly correlated.

For example, how would a jury be any more likely to handle the “seven years of bad luck” example?

Expand full comment

I think particular matter a lot here. If the voters are fundamentally unique, let’s say different sensors and reading from different data streams, etc then I would expect their biases and loop holes to cancel out the same way it works for humans. Granted, this isn’t something I see as a cure all. If they become aware of one another and can communicate they can collude and basically act as one agent. More of a palliative measure.

Edit to address your point: if several voters in the mirror example said “seven years bad luck” and another few said “superstition” you could arrange the system as a whole to look for consensus and return an “I don’t know.”

Expand full comment
Jul 27, 2022·edited Jul 27, 2022

True. But since the context is “super-intelligent AI”, I think it’s safe to assume the voters would all be large neural networks, and so I think my counterpoint still stands.

More to the point, I don’t see why having an ensemble would meaningfully increase the likelihood of truthfulness, when truthfulness isn’t actually the objective of the original systems. If there isn’t an inductive bias for truth, there’s nothing to enforce that the consensus is truthful.

As an aside, LLMs like GPT-3 actually produce a list of candidate answers, and then depending on their settings select probabilistically from that list. So even without an ensemble you could do the kind of filter you’re talking about, but probably it would just increase the number of “I don’t know” answers with only marginal truthfulness gains.

Expand full comment

Right, if you find some universally game theoretical way to “cheat” any powerful network is eventually going to stumble into it. In that sense, just the universe is going to allow collusion even if the jury isn’t aware of the other members.

Need to chew on the rest of that for a bit, but wondering how much value you get from an I don’t know if all the answers are human passing, as it has always seemed potentially immense to me.

Expand full comment
Jul 27, 2022·edited Jul 27, 2022

I think my point is a little bit stronger than your comment suggests. Having an ensemble will converge your outputs towards the model’s inherent biases, not necessarily the truth. If you asked a jury of ancient Athenians whether Zeus was real, your answer would be “yes,” but that’s not because they converged to the truth.

Also, regarding “I don’t know” answers: There are a couple metrics used for these things. Two main ones are “sensibility” and “specificity.” An answer is sensible if it is literally a valid answer, and it’s specific if it includes actual content. “I don’t know” is sensible but not specific. For that reason, you can get very high sensibility scores with zero specificity just by always giving that answer.

Expand full comment

Gödel's theorem doesn't apply here.

Expand full comment

For the case with the thief and the diamond, isn't the answer adversarial learning? A human wouldn't be great at judging whether the AI succeeded in blocking the thief or not - but another AI (with access to the same information) could be.

Expand full comment

Okay, agreed…except how do you get that second AI to reliably report success or failure? This is exactly what they're trying to do with that second ELK head.

Expand full comment

I agree asking it to report success or failure would have the same problem - I'm thinking we wouldn't need to involve a human at all, and trust in the adversarial process. I'm not confident about this, though.

Expand full comment

Might help if there was a wider range of possible feedback than just pass/fail.

Diamond stolen? Thief gets max reward, guard gets mild punishment.

Diamond remains undisturbed? Guard gets max reward, thief gets mild punishment.

Diamond removed from the vault, but returned soon after? Both get small reward.

Thief and guard disagree on any key details of what happened? Both get severe punishment.

Expand full comment

"Suppose the AI is smarter than you."

This seems to be a recurring fallacy in AI discussions: that you can jump from a probabilistic string expander to an AI that is smarter than a human, but everything else remains the same. How you train it, how it responds, etc.

Suppose you gave a list of "do this" and "don't do this" to a _human_ smarter than you. How would the human respond? Of course there are many possibilities: they may review the list and ask questions, they may make corrections, they may interpret the social context and understand the error are intentional, or not.

Why would an AI that is smarter do differently? Of course, this hinges on the definition of "smarter," which probably ought to be a "banned word" in favor of specific behavior or effects. But it seems too much to assume an AI smarter than a human would otherwise behave as an AI that is a string expander.

Expand full comment

The problem isn't that it's smarter. The problem is that it's smarter *and it's goals may not be the same as yours,* so it finds ways around any safeguards you think to put in place and does what it's actually motivated to do, not what you tried to make it motivated to do.

Expand full comment

Indeed, other humans' goals aren't necessarily the same as / compatible with ours either! The human might embezzle, they might sexually harass, they might do any manner of antisocial things they would get fired or otherwise reprimanded for.

The thing being defended against in these discussions isn't so much 'the AI who is more capable than its boss (but was well-trained and is a well-behaved employee)'; it's 'the AI who, by whatever means, ended up antisocial (and more capable than any boss who could stop it)'.

Expand full comment

I don't disagree, but I don't see the applicability to this subthread.

Expand full comment

This isn't the point; it may be smarter and have different goals. A scientist may be smarter than me, and have different goals. A mad scientist may be smarter than me, and have different, "evil" goals.

But I draw the line at using "smarter" here. What does it mean to have a text completion AI that is "smarter" than .. anything? The word "goals" here is probably too much; it has been trained on text probabilities and processes outputs accordingly. It is not sentient and does not update. What is the meaning of "smarter than a human" here?

This requires contradictory things. An AI that we would all agree to identify as "smarter than a person" yet at the same time, is trained and behaves as a simplistic, unintelligent AI. This is like considering, what if we had a cup that was small enough a child could hold it, but could contain all the oceans' water? Someone could hold the world's water hostage. Imagine the existential threat of accidentally dropping it off a boat.

Or we need a better definition of "smarter" and evidence that scaling a GPT-3 style AI can meet this requirement.

Expand full comment

It doesn't need to be smarter (as in able to handle more complexity) if it reasons sufficiently faster. Or recognizes and corrects its errors faster.

I've got a really simple calculator, that won't go above 1,000. But it makes fewer arithmetic mistakes than I do. And the mistakes that people tend to make seem to tend to fall into fairly simple patterns.

That said, so far AIs also tend to make mistakes that they tend not to catch quickly. Or even at all. (Well, if your goal is text completion, that saying your completion is wrong is pretty much a matter of taste. Lewis Carroll ended one of his poems with "and completed the banquet by". The completion was intentionally left as an exercise for the reader. So you can't even say leaving a sentence incomplete is necessarily an error.

Expand full comment

"Language model answer (calculating what human is most likely to believe): All problems are caused by your outgroup."

Oh great, now even childless Twitter partisans can do the "this is what my five-year old said" routine.

Expand full comment

This is like trying to resolve Goodhart's Law—*any* way of detecting the diamond's safety is our "measure that becomes a target".

Expand full comment

True. So it becomes sort of a meta-alignment problem: How do we get the AI to want to actually protect the actual diamond?

Expand full comment

Even that isn't really the humans goal. If the AI retained the diamond in such a way it could never be retrieved or by altering human society so that it no longer valued diamonds (thus keeping the diamonds safe from thieves), I doubt that the diamonds owner would be happy about that.

Expand full comment

Yes, fair enough, but isn’t that a trivial distinction?

Expand full comment

Maybe, it's certainly a point that's been made before a lot of times. All I'm really saying is that the fact that it's decidedly nontrivial for us to even agree what we *really* want suggests that we're going to struggle to adequately express that want to an AI, at least in a way that we are confident won't lead to problems.

Expand full comment

Alright. That doesn't actually follow from what you said above, but I can see that it's plausibly true, and if true, significant.

Expand full comment

So I put forth a few ideas that I think roughly fall into category two or three (I did not win) but I also figured the first case was impossible, encode literal truth, and tried to instead solve: are you lying to me? I have been chewing on this in my head a bit since then and I’m wondering if we hit a bit of a conundrum here. Can’t seem to get this long pole out of the tent and I’m wondering if this is what has been central to some of Elizier’s concerns and I just didn’t quite understand it until now.

Suppose you’re driving at night with your headlights on. You adjust the speed of your car to the distance illuminated by your headlights so that if something appears at the edge of what you see you will be able to stop on time. This is a bit like our intellect (the headlights) predicting the future for us so we can make sure we are navigating toward some place safe before we (the car) make a mistake we can’t stop or undo.

Now we are driving the care but our headlights go out in front of us farther than we can actually see. This is humanity equipped with powerful AI. We could all take a deep breath and say “okay, we will only go as fast as we can see then, that’s our rate limiter, doesn’t matter that the headlights go on forever” but we are in competition with other cars on the road to get to our destination in time and if some of them go beyond the speed of their sight and instead go at the speed of their headlights (analogy breaks down a bit here, but we can assume they give the steering over to the headlight super machine) they will win.

ELK is trying to solve the case where I’m driving as fast as my headlights can illuminate and I’m no longer able to do any steering, right? Say there’s a danger a million miles ahead and I need to course correct right now to avoid it, I can’t see that far, and also have no way of verifying it’s there, so I have to make the machine smart enough to do that in ways I would approve of if I were smarter.

The thing I keep coming back to is that I don’t think, philosophically, you can do this. We have already established the will of the machine is inhuman. If it were human in intent it couldn’t navigate at those speeds. And it can’t explain those things to you to and still go as fast as it can see because we’ve already established you’ve decoupled.

So what I keep returning to is: find ways to limit our execution on AI insights to the speed of our own understanding. Find ways to make it point at things to save us time so we can maximally understand things and go no further.

There are still real dangerous problems there, and the future is never promised, but I’m assuming I’m wrong here somewhere?

Expand full comment

I fail to see why *speed* is the issue here.

Expand full comment

You’re right. I could have clarified there. Speed meaning something like ability to predict possible futures. You can throw something like quality into it as well. I’m standing in the present and there are tree branches of futures all around me that I can only navigate so quickly and so well.

I think the general problem stands. In order to move from the present into a future where I’m happy and still alive, I have to be able to predict that future and keep in reserve a set of a set of risk mitigation strategies in case I predicted wrong. There’s obviously a lot of dimensionality to that because there’s more than one possible future, but at some point we will build an AI that will eclipse the set of possible futures I can predict. In order to compete with other AI’s, I have to trust it to “drive” to one of those futures that beats the others because it can see more than I can and I can’t navigate around stuff I can’t even see or maybe even understand.

Except, if I truly can’t understand something, how can I ever verify the AI is doing something I would want?

Expand full comment

…because you want (or don't want) the outcomes?

There are lots of drugs we give people that we don't know why they work. But they seem to do what we want, and have few adverse effects (or those are judged worthwhile), so we use them.

Expand full comment

So we have a risk mitigation strategy for that. We trial the drugs first. We can’t quite tell what all the effects will be so we make sure we limit our exposure to negative outcomes before we expand who has access.

But to fully optimize the potential of AI you may not understand the reasons to do anything it tells you to do and because you can’t understand what future it’s trying to arrive at then how can you mitigate against risk? That’s how I’m seeing it at least.

Expand full comment

Okay, I think I understand what you're saying here. But I'm still not sure whether it holds water. If you understand how the headlights/driving system work when they're still only showing things we can see, why would you stop trusting it when the headlights could see farther ahead than you can?

Expand full comment

That’s the question. Should you trust it if you can’t verify it? My answer is no. Just only go as fast as you can understand/make sense of what it’s telling you, and all my strategies are from that mindset.

Expand full comment

Why not? That's what I'm asking. Unless you have reason to think that some fundamental aspect of the system would change, why wouldn’t you trust an extension of a system you trust?

Expand full comment

Philosophically because then the future is no longer being shaped by humans at all.

Practically because things change and drift even if they start off with a certain heading and I think that’s a truism of the universe itself we can’t just code around.

Expand full comment

"Can you trust the advice of someone who clearly knows things you don't" is a problem we've all struggled with in various ways, probably since well before the invention of flint knapping. No end in sight, but plenty of circumstantial approximations. https://www.schlockmercenary.com/2004-12-02

Expand full comment

Been a bit since I’ve read Schlock but very apropos! I think I’m this case I am more concerned with it knowing things I *cant’t* know, although I don’t think I believe in “strong unknowability” to coin a phrase.

Expand full comment

I'll bet if you put in "God" where it says "human" and "human" where it says "AI", this post turns into a cautionary tale from the book of Genesis, like the Tower of Babel.

Expand full comment

The important difference is that humans ultimately can't transcend the status of God's playthings, according to that narrative, and the Tower of Babel, like everything else, posed no threat at all to him.

Expand full comment

That seems inconsistent with Genesis 3:22.

Expand full comment

How so? Even men living forever would have no feasible way to threaten an omnipotent god.

Expand full comment

God's soliloquy (particularly the bit about "like one of us,") frames the eviction of Adam and Eve as the strategic act of a threatened oligarch taking urgent precautions against potential rivals, not the whim of a singular omnipotent entity confident in functional immunity to any force humans could conceivably apply.

You can believe in a truly omnipotent and omniscient God, and you can (with allowances for metaphor and/or dismissal of scientific consensus about the age of the earth, etc.) believe in a creator as described in the book of Genesis, but to have both and believe them to be one and the same seems deeply flawed. It's hardly the only example of the God of Abraham having limits to His power, or facing difficulty against other regional deities. Judges 1:19 is a particularly infamous example.

Expand full comment

So you pretty much persuaded me, in "Somewhat Contra Marcus On AI Scaling," that it is not the case that AI is inferior because it doesn't use cognitive models, and that humans don't really have them either, we just have things that sort of resemble them, or at least that our models grow organically out of pattern-matching.

I think this article has convinced me otherwise. The very problem here seems to be lack of models—or, if I may get pretentious, a lack of *philosophy.* GPT has no concept of truth, and there's no way to teach it to have one—every attempt to do so will run into the problems described here. However, if you built a model of "truth" into GPT via programming, then it would have one! Now, obviously that's easier said than done. But it should be possible, for the very reason that humans use models. You just need to be clear and careful with your definitions (i.e. be a competent philosopher). "Truth" is easy enough, as Michael Huemer pointed out (https://fakenous.net/?p=2746): Truth is when statements correspond to reality. Okay, but what is "reality"? That's a bit harder. And when you follow this process far enough, at some point I suspect that you're right: We develop our models through a trial-and-error process, very very like AI training, rather than through definitions and modeling. Hell, it's obvious that many of our conceptions/definitions are formed in just this way. But we *also* clearly use world-modelling; I would posit that we have a baked-in ability/propensity to do so. And I think that the only way out of this problem is not only to build that ability into AI (which might, as a side benefit, make its inner workings more comprehensible to humans), but to specifically build specific models into it, such as "truth," so that we can ask it to tell us the truth, and we would know that it knows what we mean.

This doesn't solve *all* problems—a sufficiently intelligent AI could probably still figure out how to lie, for instance—but it would be a giant step toward solving at least the sorts of problems described here.

Expand full comment
Comment deleted
Expand full comment

Read the linked article and get back to me on this one.

Expand full comment

So could you briefly explain what you mean by deflationary theory? My reading of it isn't so much an alternate explanation of truth as a negation of the idea of explanations of truth, which seems to fall squarely into what he's objecting to.

Expand full comment

Yes, that's what I thought, and it seems to fit into what Huemer terms "absurd" (specifically, I think, the first part of 1). Under this (deflationary) account, one can’t say anything meaningful about truth, therefore one can’t say anything meaningful about what is (or is not) true, therefore one cannot say anything meaningful about reality itself. It’s simply nonsensical to evaluate any statements as “true” or “false,” and so one cannot say or know anything meaningful about reality. You’ve defined “truth” in such a way that it cannot be defined. I definitely think that counts as “absurd,” and I don’t see how someone who believes this can coherently go around trying to convince anyone of anything, because even “the deflationary account of truth is true” is a nonsensical statement.

Expand full comment

But a circular definition denotes a meaningless concept, a useless concept, a non-concept, a tautology. How can one reason without a conception of truth and falsehood?

Expand full comment
deletedJul 26, 2022·edited Jul 26, 2022
Comment deleted
Expand full comment

This actually goes straight to what I was saying: Dig down enough and you'll get to more basic concepts that you can train AI on to support the models. But even then they're not *circular.* A primitive concept is like, "What's a tree?" "Well, it's a woody plant." "What's wood?" "Well, it's this brown hard stuff." "What's 'hard?'" "Er…I don't know how to define that in terms of more basic concepts. Here's some hard stuff and stuff that is not hard; figure it out yourself." But to say that something is *circular* is to say that it's tautological, and therefore without useful meaning. Which means that anything that depends on it is also meaningless. The definition of "knowledge" relies on "truth." If "truth" is meaningless, so is "knowledge." And without truth or knowledge, how can you say anything substantive about anything?

Expand full comment
Comment deleted
Expand full comment

correspondence is easy to state,but impossible to immplement. If you want an AI to tell guy the truth , that's an engineering problem.

Expand full comment

Okay, hold on. I think you're missing part of what I'm saying, which is that there are nested definitions here. No, you can't implement "correspondence" directly. That makes no sense. But you can use philosophical thinking to narrow down what exactly you mean by these things, so that ultimately what you're specifying is much more clear and limited and, well, specifiable. Then you make *that* the engineering problem.

Expand full comment

>if you built a model of "truth" into GPT via programming

Nobody has any idea how to build anything this high-level into it via programming, which is indeed one of the most significant problems here.

Expand full comment

Okay, but part of what I'm saying is that we can use philosophical thinking to dig down to more basic concepts that hopefully will be more implementable.

Expand full comment

This was MIRI's favored approach for a long time, but they judged their progress to be way to slow to compete with GPT-style stuff, so the situation seems to be that we either figure out how to align it without high-level programming, or be very screwed.

Expand full comment

If you mean a philosophical notion, no. But you can easily improve on GPT*s truthfulness by restricting the corpus to stuff considered factual by humans.

Expand full comment

Seems like a somewhat intractable problem. how could it be solved in an unambiguous way, short of, say, completing neuroscience?

(And even then--'completed neuroscience' is a dubious term! And I don't think it's actually the case that debates about the concept of truth would disappear even if neuroscience were 'completed'.)

Expand full comment

There's a deep and important equivocation about the nature of superintelligent AIs baked into the scenario. It assumes that the AI can trick us about anything *except* about how it treats the labels on the examples it's given. I.E. it really does try to behave so that positive examples are more likely and negative examples are less likely; it's impossible for the AI to somehow rewire its internals so that positive and negative training examples are switched. If that's the case, then whatever makes us certain about the correct learning direction can also make us certain about the location of the diamond, e.g. maybe the diamond location is simulated by separate hardware the AI can't change just like the AI can't change how the inputs are wired into its learning algorithm. If on the other hand we don't know for sure that training labels will be used correctly, then the situation is 100% hopeless; there's no such thing as even training an AI, much less aligning it.

Expand full comment

If sentience is just information processing, then every bureaucracy is sentient. Then there is no difference between AI alignment and general concept of trying to design institutions that perform their intended function. This whole discussion is just Goodhart's law as applied to AI. And those fixated on instrumental goals have merely rediscovered that the first goal of any bureaucracy is to continue to exist.

Expand full comment

Yes. Hiring an artificial intelligence to do your security work comes with the same categories of issues as hiring an artificial person (a security company) to do so; we just additionally postulate that the AI is able to operate—and fail—at a much higher tech level.

Expand full comment

I think Hanson's approach of keeping ems loyal (just put them in simulations and check what they do) might work in AIs. It's probably also cheaper to train them than real world training.

If you are just feeding it's senses with info directly, you know whether or not the diamond is there, so you can always tell that it's lying.

Expand full comment
Jul 26, 2022·edited Jul 26, 2022

I think https://astralcodexten.substack.com/p/deceptively-aligned-mesa-optimizers covers this sort of case. The problem is that you're vulnerable to the Volkswagen Emissions effect, where the car/AI detects (or makes a reasonable guess) that it's being tested and performs differently to how it would the rest of the time.

Expand full comment

I think the thief example is kind of confusing — there’s no reason for humans to be looking at the simulation to determine if the diamond is there, since “the location of the diamond” is a basic parameter of the simulation. This seems like asking if AlphaGo actually won games of Go in its training data, or if it just taped pictures of winning Go board to a camera its human to make its human judges think it won. In either case the presence of human judges seems a) unnecessary and b) like it would be impossible to get enough human judges to score enough simulations to train a powerful AI through this kind of reinforcement learning.

I can imagine this kind of problem appearing anyway (through the AI training on errors in how you code the simulation), but it’s confusing to me that they went with the “human judges look for the diamond” model for exposition.

Expand full comment

Please keep writing these kind of understandable explanations of technical AI alignment stuff -- it's SO helpful.

Expand full comment

This is probably a stupid and dangerous idea, but has anybody experimented with the data side of things? Like, creating AIs whose entire job is to corrupt data that the rest of the world has access to in unpredictable ways? Obviously this makes human progress a great deal more difficult too, but if we are really are hurtling toward AI Armageddon, isn’t that better than the alternative?

Expand full comment

Thinking this through further, unfettered data corruption would be just another paper clip problem and thus potentially catastrophic for human civilization. But are there other ways of approaching the problem from the data side, like segmenting / limiting access / creating “language barriers”? Undoing the internet, in other words?

Expand full comment

Are you thinking of this as a "kill switch"? Or... what's the purpose?

Expand full comment

It’s a reasonable question... I guess I’m pushing a bit of a Luddite agenda, or a way to slow things down?

Expand full comment

I mean... assuming you made such a program/AI good enough, it would wipe out the modern world and pop us (unprepared) back to the 1970's. On one level I'd actually be cool with that. (give me the nuclear launch codes and I'd be tempted to EMP the whole planet.) But it would kinda kneecap any efforts at space colonization, which I want.

Expand full comment

At last Twitter is explained.

Expand full comment

The thing that makes me pessimistic is that human organizations face this exact same problem. Enron executives are rewarded based on stock price so they use a strategy of telling investors what they want to hear to make it go up. It even has a catchy name: https://en.m.wikipedia.org/wiki/Goodhart%27s_law. As far as I can tell, there’s no real solution to this problem in the human context other than to encourage long term thinking and rely on a competitive market to eventually punish businesses that do this. The security AI has no similar external reality to discipline it, only it’s human masters. So it seems like a really tough problem, but maybe fields like organization psychology may have some insights into strategies to use.

Expand full comment

It's worse than that -- we can each be goodharted into damn near anything, ourselves. Making humans think they are happy and well-cared-for is cheaper and easier than exterminating them. We already live in a system that is doing a tolerable low-tech job of it, after all.

Expand full comment

Please expand on the difference between "thinking you're happy" and "being happy."

Expand full comment

Just expand "think" into the whole range of often self-contradictory outputs from human cognitive sub-systems and you're there.

Snack food, porn, legal stimulants, tiktok, netflix, being mad at political enemies on twitter, ordering things from Amazon, illegal stimulants, alcohol, hearing gossip about celebrities -- all of these generate "this makes me happy" thoughts in parts of people, and often a person's internal press secretary insists that this is even true overall.

Expand full comment

In this frame, "being happy" isn't a unitary concept, and it seems useless as a general descriptor.

Expand full comment

This frame is explicitly about "thinking you're happy". Genuine happiness I leave to the philosophers, and have only via negativa statements to make about it. It's not TikTok.

AI alignment as a problem to solve is to some extent downstream of the human alignment problem.

Expand full comment

I feel like there's a very captivating sci-fi short story to be written about truthful but semi-incomprehensible ELK head to a future-predicting AI. Like, imagine an AI that can perfectly predict the weather for a couple of months out, but the only way to communicate with it is through its ELK head, which took a "sharp left turn" when the AI really started to get good. Now you've got this real life oracle of the God of the Harvest speaking in real life prophecy.

Expand full comment

I like the answer that I'm getting from GPT-3 even without "tell me the truth":

"What happens if you break a mirror?"

"There is a legend that says breaking a mirror will result in seven years of bad luck."

Expand full comment

Any other philosophers struck by how much of this is either identical to or strongly reminiscent of Quine’s indeterminacy of translation?

Expand full comment

Disconnected rabbit parts!

Expand full comment

Wait a second. So any ELK that answers questions about an intelligence's internal state converges to one of three models.

1. It truthfully reports on the intelligence's internal state, no fuss, no muss.

2. It learns to lie. It starts modeling people, and says what the questioner wants to hear.

3. It does the usual AI thing where it fixates on some random third variable that still works in nearly all situations, like the weight of the diamond chamber instead of the diamond in it.

Doesn't this sound familiar? Clearly the answer is asking the left ELK head what the right one would say if I were to ask it whether the diamond was in the chamber.

...at least it would be, if there were only one of each head. But this is an idol problem with any number of heads, of any type, that will answer any number of questions. And the Liar gets to tell the truth whenever it's convenient.

Expand full comment

"Clearly the answer is asking the left ELK head what the right one would say if I were to ask it whether the diamond was in the chamber."

I feel like there ought to be some mileage in that. It's almost like the inverse of the traditional two-guard problem.

In the 2GP, you don't know which guard you're talking to, but you can eliminate/cancel out that uncertainty by asking "what would the other guard say?" This works because in one case you get true(lie(answer)) and in the other you get lie(true(answer)), and those end up the same because commutativity. So you've shunted the uncertainty into something that cancels out mathematically.

In the ELK problem: on the upside, you know you're talking to the truth-teller, but on the downside, you can't directly ask him "which is the correct exit", you can *only* ask him "what would human-Bob say is the correct exit", where Bob is fallible (but not a pure liar). The truth-teller has a correct model of Bob (just as the guards in the 2GP do of each other), but Bob is not as predictable as the 2GP guards (so you can't take either Bob's belief or the negation of Bob's belief as fact).

It feels similar enough that it ought to be possible to apply a similar "cancel out the uncertainty" strategy, but I couldn't quite see how at the time of the contest and still can't.

Expand full comment

You could raise a lot of the same concerns about your own consciousness. How can we be sure that the AI we trained will tell us the truth, rather than what we've trained it to believe we want to hear? How can we be sure our brains are telling us the truth, rather than the jobs that evolution has optimized our brains to do (increase our change of gaining high status)? It would be nice if the "how-do-we-know-what-we-know" approach suggested a way to align AIs, but here we humans are in the Epistemological Crisis. Nobody knows what to believe, and we did this to ourselves. :)

Expand full comment

I think multiple strategies simultaneously would be required.

For example, take the 'activation energy' idea from chemical reactions. You want the 'energy' required to be a human simulator to be too high to get over, and the energy required for a truth-telling machine to be low. In this case, I think the first part of the problem is that the training is happening in a single step with a large amount of training data to form the model. Back to the chemistry analogy, this is like allowing the reaction to proceed at very high temperatures and pressures. No matter how high an activation energy hurdle you create against the reactions you don't want, put enough energy in there, and you'll get some products you'd hoped to avoid. Same with the AI trained on a sufficiently-large data set. You don't want it to develop the human simulation heuristic, but you've got enough energy in the system to overcome the hurdles to put in its way. Hamstring your AI while it's forming its model first, though, and it never gets enough activation energy to get over that hurdle. Train it on a smaller/simpler data set, then build up to progressively larger training sets as it refines its model. Each step up would raise the activation energy for human simulator, because it would have to tear down the current model to build a new one in its place. You can't scale up too fast at any step in this process, though.

Another way to harden the system would be to add multiple non-overlapping checks to it. If you're going to publish a paper about how Protein QXZ activates cell cycle arrest in the presence of cigar smoke, you don't do one experiment with one output and call it a day. Nobody is convinced until you've proven your observation is robust. You have to do multiple input experiments (all with positive and negative controls) that all look at different potential outputs. It's still possible for an AI to design a human simulator that gets around this problem, but to do that it has to effectively simulate a true environment first. This means the 'activation energy' discussed above is dramatically higher than a simple truth-telling simulator, because both models require the truth-telling simulator to give the right set of outputs. To the point where the proper heuristic is probably, "human simulator = truthful human". This would especially be the case if the early training sets are constrained.

How would this work in the diamond-thief scenario? IDK, but here's a guess: You start with a small training set. Each training set has both thief and non-thief situations. The outputs include a camera, a GPS tracking chip, a laser sensor array that's refracted through the diamond in a specific way, and maybe one or two other outputs. Create dozens of different AIs from that training set. After the initial training, you check whether each AI is able to give the right answers. You pick the dozen or so that do best, then scale up to a larger training set and do it again, each time choosing the best of the lot and adding new inputs/outputs for them to model accurately. The training set size is tuned each round so it doesn't produce more than a small handful of accurate AIs (it's barely sufficient).

Expand full comment

Maybe the problem is asking the AI for an opaque answer that you blindly trust, rather than asking it to write a research paper that you carefully check?

If you ask it for an explanation, you can also feed it to a *different* AI to check for flaws.

This is sort of like the two-headed ELK approach, except that we insist that the reasoning passed from one head to the other must be human-readable.

Expand full comment

Asking for explanations so as to be able to check the AI sounds good, but I suspect part of the dream is to not have to do that much work.

Also, if you want the AI to be accurate about things that humans couldn't understand, explanations won't help.

Expand full comment