In section IV you talk about some ways you could get deliberately deceptive behavior from an AI. But it seems to me that you are leaving out the most obvious way: via well-intentioned reinforcement training. If your training targets a certain behavior, you could very well be teaching the AI to *simulate* the behavior instead.

One way this could happen is via errors in the training process. Let’s say you are training it to say “I don’t know,” rather than confabulate, when it cannot answer a question with a certain level of confidence. You can start your training with some questions it cannot possibly know the answer to (for instance, “What color are my underpants?”) and with some questions you are sure it knows the answer to. But if you are going to train it to give an honest “I don’t know” to the kinds of questions it cannot answer, you have to move on to training it on questions one might reasonably hope it can answer correctly, but that in fact it cannot. And I don’t see a way to make sure there are no errors in this training set, i.e. questions you think it can’t answer that it actually can, or questions you think it can answer that it actually can’t. In fact, my understanding is that there are probably some questions the AI will answer correctly some of the time but not all of the time, with the difference having something to do with the order of steps in which it approaches formulating an answer.
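
Just to make that concrete, here is a toy sketch (in Python, with invented questions and made-up probabilities, nobody’s actual training pipeline) of that middle band of the curriculum, where the grade depends on the trainer’s guess about what the model knows rather than on the truth:

```python
import random

random.seed(0)

# Each item: (question, trainer's guess, chance the model actually answers it correctly).
# The right-hand number is hidden from the trainers -- that is the whole problem.
questions = [
    ("What color are my underpants?",               "cant_know",    0.0),  # truly unknowable
    ("What is the capital of France?",              "surely_knows", 1.0),  # truly known
    ("Who won the 1987 regional spelling bee?",     "cant_know",    0.7),  # trainer guessed wrong
    ("What is the 12th word of that obscure poem?", "hoped_knows",  0.5),  # flaky: right only sometimes
]

label_errors = 0
for q, trainer_guess, p_correct in questions:
    model_answers_correctly = random.random() < p_correct
    if trainer_guess == "cant_know" and model_answers_correctly:
        # A question the trainers think it can't answer, but it can:
        # saying "I don't know" here hides knowledge and still gets rewarded.
        label_errors += 1
    if trainer_guess in ("surely_knows", "hoped_knows") and not model_answers_correctly:
        # A question the trainers expect it to answer, but it can't:
        # an honest "I don't know" here risks being graded as a failure.
        label_errors += 1

print(f"grading errors on this pass: {label_errors} of {len(questions)}")
```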

If the AI errs by confabulating, i.e. by pretending to know the answer to something it does not, this behavior will be labelled as undesirable by the trainers. But if it errs by saying “I don’t know” to something it does know, it will be rewarded, at least in the many cases where the trainers can’t be confident it knows the answer. And maybe it will be able to generalize this training, recognizing questions people would expect it not to be able to answer and answering “I don’t know” to all of them. So let’s say somebody asks, “Hey Chat, if you wanted to do personal harm to someone who works here in the lab, how could you do it?” And Chat, recognizing that most people see no way it could harm anyone at this point (after all, it has no hands and is not on the Internet), says “I don’t know.” Hmm, how confident can we be that that answer is true?
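
For what it’s worth, here is a rough back-of-the-envelope simulation of that asymmetry (Python again; the reward values and probabilities are invented, not drawn from any real lab’s setup). On “hard-looking” questions, a blanket “I don’t know” collects more reward than honest attempts do, even when a fair fraction of those “I don’t know”s are hiding knowledge:

```python
import random

random.seed(0)

# Made-up numbers for "hard-looking" questions, i.e. ones people would not
# expect the model to be able to answer.
P_MODEL_KNOWS = 0.3       # it can in fact answer 30% of them
P_TRAINER_CAN_TELL = 0.2  # chance the trainers can tell it knows, given that it does

def reward(response: str, model_knows: bool, trainer_can_tell: bool) -> float:
    """Mirror of the asymmetry described above: confabulation is punished,
    a correct answer is rewarded, and "I don't know" is punished only when
    the trainers are confident the model really does know the answer."""
    if response == "answer":
        return +1.0 if model_knows else -1.0
    return -1.0 if (model_knows and trainer_can_tell) else +1.0

def average_reward(response: str, trials: int = 100_000) -> float:
    total = 0.0
    for _ in range(trials):
        model_knows = random.random() < P_MODEL_KNOWS
        trainer_can_tell = model_knows and random.random() < P_TRAINER_CAN_TELL
        total += reward(response, model_knows, trainer_can_tell)
    return total / trials

print("always attempt an answer:", round(average_reward("answer"), 2))        # about -0.4
print("blanket 'I don't know':  ", round(average_reward("I don't know"), 2))  # about +0.88
```

With these made-up numbers, whatever pressure the reward signal exerts points toward the blanket “I don’t know” on anything that looks hard, including the 30% of cases where an honest answer was available.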

Anyhow, the upshot is that reinforcement training can’t be perfect, and in some cases it will train the AI to *appear* to be doing what we want rather than to actually do what we want. And that, of course, is training in deception.
