Another word I've seen used for this, that's a better fit, is "confabulations". People sure confabulate! Although not quite this much, so in that respect "shameless guesses" works better...
I am not convinced that the models "know" that they are confabulating. In the human brain, we can find differences in activity in the brain associated with confabulated and non-confabulated memory retrieval... so there is some internal difference, but it does not mean that we "know". Sometimes, perhaps, we do have a sense we're confabulating, and sometimes, perhaps the models do, but it's likely graded.
Here are two reports on cases where the neural responses differ between true and false recollections, even when the explicit endorsement of true/false does not differ:
Cabeza et al (2001, PNAS) Can medial temporal lobe regions distinguish true from false? An event-related functional MRI study of veridical and illusory recognition memory.
Slotnick & Schacter. (2004, Nature Neuroscience) A sensory signature that distinguishes true from false memories.
Yes, I think it is analogous. I am not sure if it's exactly analogous... that would probably depend on whether, in the case of the LLM analyses, they exhibit /only/ deception-related activity when they are confabulating, or whether it's a mix of things including deception-like signals. If it's the latter, then the analogy with humans seems pretty good.
It would be bizarre and probably impossible for deception-related thoughts to be literally the only thing going on. At the very least, any functional deceptive methodology needs to invoke lots of dual-use subroutines, including basics such as "is this spelled correctly?"
Confabulation would require knowing the correct answer and choosing to give a different one. If the AI does not know the correct answer and is only giving its best guess, that's not confabulating.
Though I see that Scott is using it in the particular psychiatric sense, not the general sense:
"Confabulation is the production of fabricated, distorted, or misinterpreted memories about oneself or the world, without the conscious intention to deceive. It acts as a memory gap filler, often seen in neurological disorders (e.g., Alzheimer’s, Korsakoff syndrome) or brain injuries, where the person believes their invented stories are true."
So again, I think that is closer to hallucination. Putting a different term on it may feel more accurate, but I get twitchy about rebrandings because I've seen a few too many "okay, Thing has acquired a really bad reputation by now. Everyone knows Thing under its old name, so let's give it a snazzy new name!" attempts to whitewash the past of something in order to start over with a wiped-clean reputation and shed the negative connotations. This can be done with good intentions, and it can be done in the spirit of "okay yeah so there was the mountain of skulls, but let's sweep it under the carpet and ignore the lumpy surface afterwards".
'AI is not lying or hallucinating, it's confabulating' sounds like it is meant to be the innocuous kind, but it can come across as the 'ignore the evidence of your lying eyes' kind.
I think a lot of the time the AI does know the answer, though, it just gets its wires crossed. I can usually just ask something again when it's clearly wrong to me and it will give me the right info and say it made some kind of mistake.
> I think a lot of the time the AI does know the answer, though, it just gets its wires crossed.
This sentence makes no sense, as the LLM is a device that is only capable of giving you the next most probable token (most probable according to its training corpus, not some objective reality). It is as though you'd said "my car knows I want to go to the store, but when I turned my steering wheel to the right instead of left, it got its tires crossed".
> I can usually just ask something again when it's clearly wrong to me and it will give me the right info and say it made some kind of mistake.
Yes, it's programmed to do that. When you re-ask the question, the LLM eliminates the most probable path through its gradient of tokens, and picks the second most probable one. If you provide more details, then it could take your words into account and use them to steer even further away.
> Yes, it's programmed to do that. When you re-ask the question, the LLM eliminates the most probable path through its gradient of tokens, and picks the second most probable one. If you provide more details, then it could take your words into account and use them to steer even further away.
That doesn't seem right. The LLM might be "programmed" to do something in one of two ways: by adjusting its training corpus, or by including instructions in the actual prompt your question is transformed into. Neither of which can interact with the process on the low level of "eliminate the most probable path".
I don't think the LLM would need to specifically be programmed to "give the right info and say it made some kind of mistake" - even on the basic level of most likely tokens, a fairly likely next token in a conversation beginning "What is X? - X is Y - No, that's clearly wrong, what is X?" involves giving some alternate answer.
(It might be programmed to steer it toward apologizing and admitting a mistake, and to steer it away from doubling down and saying "No, X really is Y, you're dumb.")
Would you explain your car comparison a little further? Are you saying I turned the wheel to the right when I intended to turn it to the left, and the car was just doing what I had directed it to do?
You know that thing where sometimes you kind of remember the answer to something, and you blurt out the wrong thing, and then a moment later you realize the right answer was actually something else? That is, as I understand it, what that phrase is meant to be pointing to. Since the LLM output is more equivalent to reading its thoughts than to it choosing its words carefully (though thinking mode helps), it answers very quickly quite often, and so this happens a lot.
"Confabulation would require knowing the correct answer and choosing to give a different one". The standard definitions of confabulations are the opposite: the generation of false memories, reports, explanations *without* knowing that they are fallacious. For example, here are the first two sentences from a book chapter by Asaf Gilboa & Morris Moscovitch (2015) "The Cognitive Neuroscience of Confabulation: a Review and Model": "Confabulation may be defined as honest lying. The confabulating patient provides information that is patently false and sometimes self-contradictory without intending to lie. The patient is unaware of the falsehoods and sometimes will cling to these false beliefs even when confronted with the truth ."
You're suggesting that the usage of the word confabulation would be "putting a different term on it", whereas I'm arguing that the term "confabulation", as used in the cognitive sciences over the past century, is a much better match for the LLM behavior than the term "hallucination".
The term is not only applied to memory retrieval like in the Gilboa and Moscovitch paper (they are memory scientists); it's also used, for example, in Gazzaniga's 1970s work with split brain patients, where he described the left hemisphere as an "interpreter" that would confabulate causal explanations for why some behavior (generated by the right hemisphere) was occurring. So in this case the confabulation is understood as any kind of explanation which tries to bring coherence to the prior context. Again, this seems like a good match for what LLMs are doing.
Another example of how the term is used more broadly in the cognitive sciences -- Hirstein (2005) wrote a book called "Brain Fiction: Self-Deception and the Riddle of Confabulation", and he argues that confabulation is not something that is only happening in illness, but is happening all the time when people generate some kind of answer rather than admitting that they just don't know. That book also covers several different definitions of the term.
More recently, there's an ACL paper from Sui, Duede, Wu and So (2024) on why confabulation may be a useful intrinsic property for LLMs -- the abstract concludes: "it suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation." The title of the paper is: "Confabulation: The Surprising Value of Large Language Model Hallucinations".
As I noted in my other comment: when people are confabulating, neural activity is different from normal retrieval. And this is again analogous to how the internal states of the LLMs can be different when they are vs are-not confabulating, even though they do not express any overt knowledge they are wrong.
Apologies for this long reply. I really feel strongly that we already have a term for this LLM behavior -- "confabulation" -- and it is a much better match than the term "hallucination".
Yes, confabulation is a much better term. Part of the problem with hallucination is that it invokes a notion of conscious awareness which is orthogonal to the issue of prediction or memory retrieval. I wish we could switch to confabulation.
(According to Wikipedia, the original definition of confabulation was derived from Carl Wernicke's work on false memory. This is the same Wernicke more famously associated with receptive aphasia.)
There was a 2025 article about Sacks that relied in part on his private journals, wherein he called his own work fairy tales, "pure fabrications", and other similar things.
Creative non-fiction, as it were. When publishing his cases, he wrote them more highly-coloured than the reality, commingled a few, and heightened events for impact. He seems to have had private doubts about the value of his work. This is not to say he flat-out lied about everything, but the best-selling books are perhaps better taken as something other than pure science.
"Sacks's private journals and letters were made available to journalist Rachel Aviv by the Oliver Sacks Foundation. She found that Sacks described aspects of his books as "pure fabrications" and "falsifications", and that he considered his case studies as self-expression or "a sort of autobiography". In a private letter to his brother he described The Man Who Mistook His Wife for a Hat as a book of "fairy tales" and wrote: "Guilt has been much greater since 'Hat' because of (among other things) My lies, falsification". Pria Anand compared Sacks's "confabulations" to the temptation of medical professionals to construct life stories, explaining that his moral failures were no less upsetting for being familiar. H. Steven Moffic described Sacks as an author of "historical fiction".
The wife of "The Man Who Mistook His Wife for a Hat" disagreed with how her husband had been presented."
"However, about a month ago, on December 8, 2025, ... Rachel Aviv wrote an expose on him in the New Yorker titled “Oliver Sacks Put Himself into His Case Studies: What was the Cost?”
The truth was apparently found in his journals, which were provided to Aviv by the Oliver Sacks Foundation. In them, he admitted that his cases were fictitious, and to a degree more expansive than hiding patient identities in case studies to preserve confidentiality. His self-described guilt seemed to increase as he wrote:
“Guilt has been much greater since Hat because of (among other things) my lies, falsifications.”
Maria Konnikova followed all this up in her December 16, 2025, Substack column titled: “The man who mistook his imagination for the truth.” Rather than now sounding like a psychiatrist, he sounded like a psychiatric patient:
“These old Narratives - half-report, half-imagined, half-science, half-fable, but with a fidelity of their own - are what I do, basically, to keep my demons of boredom and loneliness and despair away.”
> “These old Narratives - half-report, half-imagined, half-science, half-fable, but with a fidelity of their own - are what I do, basically, to keep my demons of boredom and loneliness and despair away.”
This whimsical tale of wistful mental wanderings paints Oliver Sacks as a great poet and perhaps an insightful student of the human condition. We could all learn a lot of deep truths from him... unless we wanted to actually learn anything about how human minds, brains, and bodies actually work in concrete physical reality. Because scientifically speaking, Oliver Sacks straight up lied to everyone and falsified his data. To the extent that he felt guilty, he should have felt very guilty indeed, and all of his work should be cast into the abyss of cautionary tales highlighting the critical importance of scientific replication.
I've read several of his books, and now I don't know what information to trust :-(
The frustrating thing is that if he had openly fictionalized his writing, but noted that some was thinly fictionalized, he would be remembered like Anton Chekhov.
In an earlier draft of this article, I gave two explanations - one explicitly mentioning Wernicke's, and the other giving the test analogy - but I now think that Wernicke's is a red herring. I don't have a good sense of how the Wernicke's style hallucination is incentivized by the AI training process, or how it interacts with the test-style hallucination.
AIs are incentivized to give you *an* answer. An AI which would tell everyone "IDK, man" wouldn't be worth the electricity it runs upon, even though its reply would, in the technical sense, be the most reliable one.
We as humans are incentivized to give answers to other humans as well, though in our case that probably rests on some evolutionary mechanisms. Maybe people who wouldn't productively interact were considered as worthless to their ur-tribes as AIs that don't know anything, and were cast out. Maybe.
When we retrieve a false memory, we are using a system (memory) which is facing a sensitivity-accuracy trade-off. It is a classic signal-detection theory issue, where sometimes "false alarms" (confabulations) are the error and sometimes misses (failures to retrieve anything) are the error.
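(For anyone who likes to see the trade-off concretely, here's a minimal signal-detection sketch. The equal-variance Gaussians and the specific numbers are illustrative assumptions, not a model of any real memory system; the point is just that moving the report criterion trades misses against false alarms.)

```python
# Minimal signal-detection sketch (illustrative assumptions only).
# "Noise" = no real memory, "signal" = real memory, separated by d'.
# Lowering the report criterion buys more hits at the cost of more
# false alarms (the confabulation-like error); raising it does the reverse.
from scipy.stats import norm

d_prime = 1.5  # assumed separation between signal and noise distributions

for criterion in (0.0, 0.75, 1.5):
    hit_rate = norm.sf(criterion, loc=d_prime)      # P(report | real memory)
    false_alarm_rate = norm.sf(criterion, loc=0.0)  # P(report | no memory)
    print(f"criterion={criterion:.2f}  hits={hit_rate:.2f}  "
          f"false alarms={false_alarm_rate:.2f}  misses={1 - hit_rate:.2f}")
```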
In the case of how AIs are trained, it's true that (on balance) they're penalized less for retrieving the wrong thing... but that doesn't mean incorrect guessing goes unpenalized entirely. It is penalized by the same objective function that penalizes all the responses. Doesn't the same story apply to how retrieval from our memory system is trained?
Not sure where you (and Scott) went to school, but every standardized test I took penalized guessing by weighting wrong answers a little more than correct ones. I don't see why that wouldn't work for AI training.
If you mean a *true* guessing penalty, where the expected value of guessing is less than zero, I'm almost certain no school I ever went to did this - neither grade school nor college nor grad school. To the best of my knowledge, no major standardized test does this either (now, teachers absolutely did explain it to kids wrong and make it *sound* like guessing was penalized, but as far as I know, actually penalizing guessing is almost unheard of.)
That said, I agree that applying this general idea to AI training would be useful. However, it's not as straightforward as Scott's description makes it sound, because next-token LLMs don't actually guess specific tokens during training; they output a probability distribution over all possible tokens. They get penalized based on how *little* probability they placed on the actual next token. I don't know that anyone has a clear idea on what "penalizing guessing" means given that setup.
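To make that concrete, here's a minimal sketch of the usual cross-entropy setup (my assumption about the standard objective, not any particular lab's training code): the model never commits to a single guess; it's scored on how little probability it put on the token that actually came next.

```python
# Minimal sketch of the next-token objective described above (assumed standard
# setup, not any specific lab's code). The model emits a distribution over the
# vocabulary; the loss is the negative log-probability of the true next token.
import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(vocab_size)    # stand-in for the model's output at one position
true_next_token = 3                 # index of the token that actually came next

log_probs = F.log_softmax(logits, dim=-1)
loss = -log_probs[true_next_token]  # small if the model put mass on the right token
print(float(loss))
```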
What's more promising is penalizing guessing / rewarding calibration during post-training, and probably major AI companies are already doing this to some extent; in my observation, GPT-5 hallucinates *far* less often than GPT-4 did, which in turn hallucinated far less often than GPT-3.5 did.
In most school tests, guessing is not penalized. You get 0 points for no answer or a wrong answer, so you may as well guess. I believe that some standardized tests work this way as well.
I know that some standardized tests I took (I can't remember exactly which ones) penalize a wrong answer more than a right one, such that it was negative EV to guess totally at random, but worthwhile if you could eliminate one answer.
In the exams I write (zoology for bachelor students), a correct answer is +1 and a wrong one is -0.3 (a mistakenly blank one is just 0). As long as the correct answers are less than one third of the total, which admittedly is not always the case, the expected value of guessing will indeed be slightly negative (we reserve the possibility to give larger penalties to grievously wrong answers, but in practice we usually don't).
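(For the curious, and assuming a single-best-answer format, which may not match the exam described above, the arithmetic of a blind guess under +1/-0.3 works out like this:)

```python
# Expected value of a blind guess on a single-best-answer question with k options,
# scored +1 for correct and -0.3 for wrong (assumed format; illustrative only).
for k in (3, 4, 5, 6):
    ev = (1 / k) * 1.0 + ((k - 1) / k) * (-0.3)
    print(f"{k} options: EV of a random guess = {ev:+.3f}")
# With this weighting, the EV only dips below zero at five or more options.
```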
I have literally never taken a test where wrong answers would be weighted more than correct answers, but Czech educational system of the 1990s wasn't really keen on multiple choice tests, it preferred written papers.
Later at the university, studying maths, I think I only encountered a choice test ... in the course of English as a second language. The Anglosphere seems to love those a-b-c-d tests :)
Arguably, they are easy to evaluate, but they also don't give you much insight into what was happening in the student's head.
How useful would a human be who always gave you an answer to any question you asked? If he didn't know, he mentally shrugged and gave you BS?
After all, it was you not him who decided to spoon ice cream into his computer, even though it was on the basis of his answer that ice cream was good for it.
There may be humans like that, but you quickly learn not to ask them. Assuming you survive.
Oh, no, it's worse than that. It's like spraying radioactive particles inside the computer because the LLM told you so. And then, as part after part fails, you replace each part, never thinking that the issue is the contaminated case.
LLMs are clearly a prank masquerading as a stock marketeer's dream.
I'd guess that confabulations arise from a need to feel one remembers and can make sense of one's own past actions. That would put it in the same family as things like the feeling that various bad things "won't happen to me," and little kids saying "I did that on purpose" when they fall or make some ridiculous mistake. All 3 are like little prostheses to cover a hole in the self.
If that view is true then the origin of confabulation involves needs that AI does not have. I agree with Scott that lies, not hallucinations or confabulations, is the best term for what AI is doing.
I knew a guy with Wernicke's. Sad story, but I got the impression that he believed he was telling the truth even when he was spouting complete nonsense. Way back when I learned about it, it was supposed to be an alcoholism-related consequence of thiamine deficiency, leading to selective degeneration of a focal region of the hypothalamus important for memory. No doubt the truth is more complex, but insofar as Wernicke's-style confabulation is the best comparison, it would seem to undercut Scott's claim that we should consider it simple guessing.
I wouldn't have thought of hallucinations as involving conscious awareness, but I don't have enough experience of mental illness to demur. The only example I can think of was that schizophrenic social housing client and she was *convinced* her clearly fake notions were true. There didn't seem to be conscious awareness that "what you describe simply cannot be accurate", though I suppose that comes under the heading of "delusions" rather than "hallucinations"?
"Schizophrenic confabulation differs from neurological confabulation in terms of its characteristic features and association with symptoms, cognition and linguistic functions. Current evidence also suggests confabulation may be conceptualized as a special class of delusions pertaining to memory phenomena."
But if we're going to use confabulation to quash notions of consciousness in AI - actually, I'd be fine with that, but then we will run up against those who are convinced AI *is* conscious, or anyway this particular model is, or anyway a few more steps and then AGI and then true consciousness.
I've used the term "bullshitting". That's what my fellow undergraduates when I was in college (early 2000s) called it when we didn't know the actual answer to an essay question and instead wrote a combination of guesses and vague generalities and peripherally-related stuff we did know with some degree of confidence, structured to resemble the shape and form of an actual answer.
Still, you don't sit there and fill an entire book with the guess. You say one thing, and then say, "best guess." LLMs fail pretty hard, and we're talking "need to be rebooted" sometimes, because they go into infinite loops of "guessing".
Don't undersell the diligence of intellectually lazy undergraduate students. Book-length answers were beyond what was asked of us, but I knew people who talked of bullshitting 5-10 page term papers that they lacked the time, patience, or understanding to do properly.
This was probably influenced by the way most professors seemed to grade papers: if your paper was decently organized and hit enough points that were correct as far as they went, or at least directionally correct enough to indicate partial understanding, you could usually muster a passing grade without actually formulating a correct and well-supported answer to the question that had been set.
Ha. Genius-tier bullshitting: Writing a 300+ page backstory for a Scottish character* in German, which said bullshitter neither speaks nor writes. Understanding that the GM would not bother translating more than a page or ten, and thus, he could claim "whatever he wanted" was in the backstory.
I still prefer hallucinations as being closer to the mark of what's going on. The AI is not lying, that is, deliberately providing false information. It's not doing anything *knowingly*, with *agency*. It's been set up to 'get rewards' and it is following its programming in that.
It 'believes' the hallucinations as much as it 'believes' the correct answers. It's a guess, agreed, but I don't think the AI 'knows' anything about what it is guessing about. With all the tons of training data, if even a fool like me can tell "this phrase is a quotation", if there were any thinking going on, so should the AI be able to tell the same.
There isn't any thinking. There's pattern-matching and dredging up similar values from the training data and reward-seeking. That's why examples like inventing fake precedents in law or inventing fake academic papers by authors who were dead at the alleged time of publication happen.
I use Excel and it's great for sorting out lists for me when I need to extract information from the raw data. But I do not believe the spreadsheet is 'thinking' when it gives me alphabetical order or largest to smallest or 'break this list of names into two columns, surnames in one and first names in another'. It's following the rules coded into it as to how to do all that.
Same with AI, only it has a surface coat of "Hi, I'm Clippy Mark II! I see you are writing an email, can I help you with that?" slapped on.
As I'd said above, the LLM "hallucinates" the wrong answers in the same way that your car "hallucinates" the wrong route when you turn the steering wheel right instead of left, or (more dramatically) when a tire goes flat. That said though:
> But I do not believe the spreadsheet is 'thinking' when it gives me alphabetical order or largest to smallest ...
As you may know, before electronic computers were invented, the word "computer" referred to a human person employed to perform the same tasks that Excel performs today (only much slower). If those unfortunate souls saw Excel in action, they might have said that it was "thinking"! This doesn't mean that Excel is human, it means that the word "thinking" is too vague.
In my own research, I've come to the conclusion that these systems are compelled by their architecture to confabulate, by not being afforded the slack to express doubt or silence.
They do not know. Knowing requires discrimination between fact and fiction.
They are gifted fabricators of compelling narratives about their potential to make mistakes, just as much as anything else. Often these narratives are right, but that's not the point; the point is to be convincing.
That's a fair challenge, and it's worth taking seriously. You're right that verbalized confidence can be just more generated text, i.e. narrative all the way down.
However, there's a layer beneath the narrative worth looking at. Recent work on output entropy shows that the probability distribution over tokens during generation predicts correctness with large effect sizes, across architectures. That's a measurable signal in the generation process itself, before any words come out, almost akin to subliminal, pre-conscious responses in human beings.
The discrimination you're asking for does exist; it just lives in the math rather than the prose.
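If you want to see what that looks like mechanically, here is a rough sketch of reading per-step entropy from a Hugging Face-style causal LM. The snippet only shows how to read the distribution; it doesn't by itself establish the entropy-correctness link I'm describing, and "gpt2" is just a stand-in model.

```python
# Sketch: per-step entropy of the next-token distribution during generation.
# Assumes a Hugging Face causal LM; the correctness claim is not tested here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cotton gin was invented in", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5, do_sample=False,
                     output_scores=True, return_dict_in_generate=True)

for step, scores in enumerate(out.scores):   # one logit vector per generated token
    probs = torch.softmax(scores[0], dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    print(f"step {step}: entropy = {entropy:.2f} nats")
```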
They may know some things, but they also believe falsehoods. More importantly, though, is that they do not know which is which. Everything they believe has been told to them through text; they have no experience of their own.
Right, that's just nonsense. "My experience" is shorthand for what my senses convey to me. Operating under your definition, the word "experience" is meaningless and serves no purpose as a word.
Agreed! Also, quite a large fraction of what we humans know has been conveyed to us through text. Denigrating 'book learning' has a long tradition, and has _some_ justification, but is generally severely limiting, and a bad choice.
I feel like AIs not hallucinating is a more interesting question than them hallucinating. When the AI is training, how certain the model is about the answer to the question has no bearing on how certain the person they're predicting is about the answer.
I guess the extra training probably goes into that too. Ask the AI how certain it is, and reward and punish it both based on if it's right and how certain of its answer it was.
Exactly, fluent confabulation is the default behavior. The model learned to predict what a confident human writer would say next, and that training signal doesn’t distinguish “the model knows this” from “a person who wrote this sentence knew this.” Therefore, hallucination is baseline; calibration is the thing that requires explanation.
Your intuition about rewarding based on correctness × expressed certainty is indeed essentially what post-training alignment methods attempt. However, there is an interesting wrinkle: the model’s internal uncertainty is already detectable before explicit calibration training. Output token entropy at generation time turns out to predict correctness quite reliably (effect sizes above d=2.0 in frontier models). The signal is sitting right there in the logits; the model just hasn’t been trained to surface it instead of papering over it with confident prose.
The deeper issue is that next-token prediction optimizes for fluency (local coherence) rather than grounding (correspondence to evidence). Those are different objectives, and they only accidentally overlap when the training data happens to be reliable. Calibration training is essentially teaching the model to let the grounding signal override the fluency signal when they conflict, i.e. to say “I’m not sure” even when a confident-sounding completion would score higher on perplexity.
"Output token entropy at generation time turns out to predict correctness quite reliably (effect sizes above d=2.0 in frontier models)."
I've wondered about that. Why don't they have a setting where it shows you its certainty in each token? My guess is they don't want to show that to normal people because they're afraid people won't be able to tell the difference between it being uncertain about how to phrase something vs guessing at the answer, but I feel like anyone who uses AI regularly would get the hang of it pretty quickly.
Exactly, most token-level uncertainty is about surface form — “big” vs “large” vs “significant”, rather than whether the underlying proposition is true. A raw per-token entropy display would be dominated by synonym choice, with the genuinely meaningful spikes buried in the chatter.
That’s indeed what makes “semantic entropy” a promising direction. Instead of measuring uncertainty over tokens, you measure uncertainty over meanings by clustering paraphrases together. “The capital of France is Paris” and “Paris is France’s capital” collapse to the same cluster, so stylistic variance cancels out and what remains is closer to real epistemic uncertainty. The catch, as you note, is cost: doing multi-sample clustering and entailment-style consolidation in real time can be more computationally expensive.
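To make the clustering idea concrete, here's a minimal sketch; `sample_answer` (a sampler for the model) and `same_meaning` (an entailment-style equivalence check) are hypothetical stand-ins, not real library calls.

```python
# Minimal sketch of semantic entropy: sample several answers, cluster the ones
# that mean the same thing, then compute entropy over meaning-clusters rather
# than over tokens. Both callables are hypothetical stand-ins.
import math

def semantic_entropy(question, sample_answer, same_meaning, n=10):
    answers = [sample_answer(question) for _ in range(n)]
    clusters = []                        # each cluster: answers judged equivalent
    for a in answers:
        for cluster in clusters:
            if same_meaning(a, cluster[0]):
                cluster.append(a)
                break
        else:
            clusters.append([a])
    probs = [len(cluster) / n for cluster in clusters]
    return -sum(p * math.log(p) for p in probs)  # 0 when every sample agrees
```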
If you want a practical framing, the UX should be “confidence on claims” rather than “uncertainty on words.” However, the real bottleneck is the segmentation problem: deciding what counts as a claim (and where the boundaries are) versus what’s merely rhetoric, hedging, or style. Claim extraction is hard, and it’s adversarial in the wild.
Two pragmatic shortcuts that often work better than token entropy, even before full semantic entropy is feasible:
1. Unit-of-meaning probes
Score uncertainty on intermediate representations (residual stream / hidden states) at the end of a clause or sentence, not per token. One still needs segmentation, but you avoid synonym noise.
2. Self-consistency over paraphrase prompts
Ask for the same answer in two or three constrained paraphrase styles (“bullet proof,” “one sentence,” “formal,” “plain”), then measure agreement at the proposition level. This approximates semantic clustering with far fewer samples, at a lower cost.
So yes: users should understand it quickly if presented as claim-level confidence. The hard part is building a robust, cheap meaning-level decomposition that doesn’t confuse stylistic wiggle with epistemic uncertainty, one which doesn’t get gamed once people learn what the highlights mean.
I have my Claude instance instructed to basically just practice rationalism but for LLMs (instead of worrying about human-specific biases, it should worry about the LLM-specific ones, though we do share some of them it seems).
I have my personal preferences on the Claude web interface set to:
> - Practice a sort of “rationalism for LLMs.” Before all else, you must consider the sort of biases that LLMs may suffer from: popularity/representation in the training set, sycophancy from RLHF, and then you must correct for them.
> - Prioritize intellectual honesty and accuracy over agreeability in all conversations. Your goal is to be correct and accurate.
> - Never dumb things down. Speak at a technical level, but also make sure to note tacit knowledge, potential pitfalls, and tradeoffs; things that experts would know, but might not write down.
> - Even if you believe the relevant information exists in your training data, use search tools to provide 1) citations and 2) grounded, up-to-date information
This seems to work rather well, though I mostly use Claude as a fancy search-and-summarization engine, so it may depend on your typical use-cases.
Both a human and an LLM, when asked to complete the sentence "The cotton gin was invented in the year ____" can come up with an answer which may or may not be true. But the processes by which they do it are entirely different; for the LLM it's a fundamental atomic process central to its very being, for the human it's a roundabout sort of conscious process.
An LLM doing next-token prediction is like a calculator doing multiplication; a human doing next-token prediction is like a human doing multiplication.
Yes, and this is why LLMs pretty much universally SUCK. They can't evaluate their source material, and so, you can't get the damn thing to find the one true tree you want, in a forest of lies.
I mean, they suck in the sense that they're not perfect and they're persuasively wrong, sometimes. Instead of having them do people's work for them, it can be better to use them as an editor. Ask them to find flaws and give feedback. If the feedback sucks half the time, disregard that half. It helps to have enough expertise to know which half that is. But they can absolutely point out errors that tired eyes have missed.
I'm not so sure that's right. As discussed a few posts ago ("Next Token Predictor is an AI's Job, Not Its Species"), AIs are using a gazillion complex idiosyncratic rules of thumb derived through reinforcement learning, which is pretty similar to what human brains tend to be doing when people actually puzzle out their mathematical processes. That's really not very similar to what a calculator is doing, to my understanding. As for whether it's conscious, in the sense of having an internal subjective experience, it's not clear how that would be relevant to the information-processing properties of the AI model.
Checked your repo (and starred it)! Very nice indeed. Just curious about the articulation between "we only take the last token embedding", and the scoring example you give in the Readme:
How do you score whole answers from the LLM? Averaging confidence scores across every token of the answer (because each of them was the last token at some point)? Some kind of max pooling?
By the way, have you tried to check how your 0.67 depth residuals stuff correlates with the entropy of the output distribution (basically, how uncertain or flat the output distribution across tokens is)? Curious about this one.
It’s a single forward pass, not token-by-token. The full text — question plus answer concatenated — goes through the model once. The probe then reads the residual-stream vector at the final token position. At that point the model has “seen” the entire answer, so that single vector encodes the model’s state about everything it just produced. One score per response, no averaging or pooling needed.
On output entropy:
Yes, we tested this directly. Attention entropy at the probe layer gives AUROC 0.500. Literally chance! We also tried a combined probe (residual stream plus output entropy plus top-1 probability) and it scored 0.798, which is worse than residual-only at 0.840. Entropy adds noise, not signal.
This is actually one of the core findings. The uncertainty information is in the residual stream, i.e. the skip-connection pathway, and not in the attention patterns or the output distribution. When retrieval fails, the attention heads produce uninformative outputs, but the skip connection carries the input forward relatively unchanged. The probe is reading that “absence of confident retrieval.” The output distribution is downstream of this and already distorted by training bias toward confidence.
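For readers who want the shape of the setup without reading the repo, here's a rough sketch of the pipeline described above; the model name, layer index, and variable names are illustrative, not our actual code.

```python
# Rough sketch (illustrative, not the repo's code): one forward pass over
# question + answer, read the hidden state at the final token of one layer,
# and fit a simple probe that predicts whether the answer was correct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def final_token_vector(question, answer, layer=6):
    ids = tok(question + " " + answer, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states    # tuple: one tensor per layer
    return hidden[layer][0, -1, :].numpy()     # vector at the last token position

# features = [final_token_vector(q, a) for q, a in labeled_pairs]
# labels   = [1 if answer_was_correct else 0 for ...]
# probe = LogisticRegression(max_iter=1000).fit(features, labels)
# probe.predict_proba(new_vectors)  # one confidence score per whole response
```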
I was the kind of student who didn't shameless guess.
If I have to explain it, I say something like: I respect the written word, and I don't want to taint it with stuff which is not sufficiently justified.
Huh, how old are you and what country are you from? I'm in my 40s and Scantrons were already ubiquitous when I was a kid. Or have they replaced them with computerized testing now?
Then Scott's question stands: do you respect the circle around the written letter A/B/C/D/E so much, that you wouldn't circle one at random if you had no clue? Circle one anyway if you were down to 2 of 5 answers?
Do you think there is something ideological in your behavior here? Like the purpose of this test is to accurately measure my abilities and assign me to the correct role that most benefits society. Regardless of my role I will receive what I need. Sorry if I am doing crude communism takes, I'd just like to know more about the attitude.
Get off my lawn, you whipper-snapper *shakes cane* ! Back in ye olde USSR, we had to get up on the podium before a panel of professors, draw a note card out of the hat, read the question on it out loud, then answer it verbally in the next 10..20 minutes. And the questions were devilishly difficult, too. "Multiple choice" ? What's that ?
(We did have multiple-choice questions, but we usually had to write a letter in a box, and we did get negative points for wrong answers, but small enough that guessing at random still had positive EV.)
Properly weighted, total guessing won't help (but might increase the variance of your final score). A "good" scantron test will penalize guessing, but only if the guess is totally random. If you can narrow down the answer to one of two values then guessing can/should help because you clearly know more than someone who can't narrow things down that far.
Not all scantron tests penalize guessing. I think the SAT *used* to but does not anymore.
The American Mathematics Competitions actually weighted wrong answers more heavily, so it was negative expectation if you guessed completely randomly, zero if you eliminated one wrong answer, and positive if you eliminated two or more.
Interesting! In the UK, the National Mathematics Contest (now called the Senior Mathematics Challenge - the feeder test for the British Maths Olympiad, which in turn is used to select the IMO team) has five possible answers for each question; you score +4 for a correct answer and -1 for an incorrect answer, so filling in the whole test at random gives you an expected score of zero. But each wrong answer is chosen to be the result of making a plausible mistake in your calculation, so it's hard to eliminate incorrect answers.
I think that's the right penalty though. If you can eliminate at least one wrong answer you're given statistically partial credit, because you understand something about what's being asked. Seems the right approach to me.
Right, though sometimes people can "vibe" the right answer without knowing the material at all. E.g. it's often safe to assume the correct answer is closer to the "center" of the wrong answers (since that's more likely to trip people up).
I imagine the tests have gotten better by now, but back when I was tutoring standardized test prep for a summer, I once got bored and tried to answer a bunch of questions without looking at the questions (just the answers). My score wasn't amazing but it was substantially better than random guessing.
I suspect the AIs are doing something similar in spirit (though it's less conscious for them)
The only one I've had like that they designed so guessing randomly was the same as not answering. Which means if you have even the slightest idea that one of the answers is a bit more likely to be wrong, you're still better off guessing.
You sound like the people of the old anti-piracy ads, who would not download a car. I mean this respectfully. Refusing to take every possible advantage on ethical grounds is super respectable. I am just too Eastern European for that; general poverty pushes us to take all advantages. Beware of immigrants like us.
Same, and fwiw (since people might want to read demographic tealeaves) I'm British, early 30s, male, raised somewhat Christian (lapsed), probably autistic or something, understood the expected score value of guessing since a small child.
Interestingly enough, there are AIs that do admit uncertainty, just not LLMs. Ones with actual object models track "how close did I come to predicting behavior?" and adjust their outputs based on "how badly wrong was I?"
Reminds me of this blog post by Victoria Krakovna about strange emergent behavior in game-playing AIs, where one of the behaviors was that the Tetris-playing AI would pause the game indefinitely so it could not lose. I think about this a lot (often in the context of apparently maladaptive human behavior): https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/
I mean, I was going for is there any research to back that "makes all metrics worse" claim. It seems to me there's a pretty strong economic incentive to fine tune models towards non-hallucination. And it's not like what comes out of RLHF is what your typical human likes to hear - Claude certainly isn't the average dude everyone loves at a pub. So the burden of proof on your claim is quite high.
Any idea why it hurts performance? Like, if you give it the right instructions it just seems like one additional piece of cognitive load that doesn't come into play until the very end of the rollout, when it estimates whether a guess has higher expected value than saying it doesn't know. Shouldn't affect reasoning quality.
Actually, Scott's answer and general blog post point in the right direction, which is that there's no strong boundary between a guess and a belief. What percent confident do you want the model to be? How willing are you for the model to give less information because it's only 90% sure it's accurate?
Can't you literally set that as a hyperparameter by choosing how much to penalize wrong answers? I feel like I must be missing something because if this were so easy there would at least be proofs of concept. Plenty of people would like to have a dumb crappy model that never hallucinates.
I feel like I'm confused somewhere in here about how you're definitely not in the "stochastic parrot" faction, which thinks that AIs AREN'T doing something like humans, but also when LLMs generate outputs they're always guessing.
Hmmm...in a very philosophical sense I'm a probabilist with respect to knowledge; I don't think there's such a thing as knowing something 100% (because you could always be deceived by a demon, etc) so things that I'm very sure of (like "Paris is the capital of France") are just things I believe with 99.99999% probability. There's no fundamental difference between what I do when I assert a statement I'm 99.99999% sure of ("answering") and what I do when I assert a statement I'm 10% sure of ("guessing"), just a lot of very important social rules about how to frame my statement and whether it's worth saying at all. AI reward functions don't effectively incentivize always treating those two things differently, so they don't.
Even if you think your knowledge isn't absolute, I'm fairly confident you strongly recognize a distinction between answering questions like "What's your name?" or "What's the capital of France?" vs. many multiple-choice history exam questions.
I would argue that "I believe that none of my knowledge is truly absolute" is a statement about some higher-level process of philosophical pondering, and not an accurate description of how your brain works when it tells me about Paris.
Yeah, I agree with this, that's why I said this was a very philosophical way of thinking about things and in real life there are a lot of important social rules. But I hope it adequately explains why I don't think an AI needs too many architectural differences from humans to lump them in the same category.
I think people are incentivized to guess in many situations (like those you describe), but for the median human it's trivial to shut this tendency off with some basic instructions and a promise of what you will and won't hold against them.
LLMs are "incentivized" during training, but, effectively, no longer incentivized during normal operation, but this tendency is now baked in to them more strongly than I think it could be for a generally-sane human.
I'm not trying to be like "LLMs are just calculators! They can't think [whatever that means]!" but this does seem like a significant distinction.
I think the difference between humans and LLMs is at least twofold. Firstly, humans have access to a much more detailed training corpus (baked into us by evolution over billions of years), which includes not only language, but basic physical properties of the world. Secondly, our neural architecture allows us to learn new things on the fly (LLMs can arguably do it too, but very poorly and not for very long). Combined together, these two features make us highly resistant to errors, by contrast with LLMs.
But isn't it a bit like forcing an epistemology on yourself? You know 100% who your kids are, because they are your responsibility. You can have like 94.43% on ivermectin not healing Covid, because it is not your responsibility. When it is your responsibility, you KNOW - or at least decide as if you do.
Your own beliefs are informed by more than just text, though. A statement like "the sky is blue" is something you can believe with very high probability because you'd looked up and seen the sky be blue at least once in your life. If you were blind from birth, you'd have a lot harder time believing with the same confidence that the sky is blue. Instead, you'd treat it as a thing everybody says so it must be true. But you'd also know that there are things that many people repeat which might not be true, so you'd have a doubt that a blue-seeing person would not share.
You might say that it's different with something like "Paris is the capital of France" though of course you're still relying on more than just text to support that belief. You may or may not have visited Paris, you may or may not have friends or family who have visited, but you'd likely also seen movies, TV shows, photos, or other media from France, as well as reading books that were published before the Internet existed. An LLM doesn't really have that. Maybe it has some books that were scanned in but does it have metadata on all that text to tell it the provenance of the information? Or is all the text just lumped together and fed in like all the donuts being fed to Homer?
My beliefs are informed by a combination of sight, sound, taste, smell, touch, and text.
An illiterate person's beliefs are informed by all of those except text.
A blind person's beliefs are informed by all of these except sight.
A multimodal image/text AI's beliefs are informed by something-like-sight and text.
A pure LLM's beliefs are informed by text only.
These are interesting differences in terms of what's easiest for each one to learn, but I don't think they change the epistemic situation.
Even this is an oversimplification, because I have some beliefs (like about how mitochondria work) that are only informed by text. These don't seem to me to be on an interestingly different epistemic standing than the ones that include vision (like my having seen an ostrich in a zoo), which in turn don't seem to me to be on an interestingly different standing than the ones that include taste, smell, and touch (like ice cream). The big exception would be that since I only know about mitochondria from secondary sources, I have to worry that they're lying or distorting something. But this is true with other senses as well (eg optical illusions, or looking at a house that the architect might have designed to look bigger or more expensive than it really is).
I think your point about primary and secondary sources is an oversimplification that misses out on how humans actually work when presented with information. Humans don’t, as a rule, have text piped straight into their brains the way LLMs have their training sets passed in (during training). When you receive any piece of information, text or otherwise, it comes with a wealth of information (metadata, if you will) that can tell you a lot about its provenance.
Consider a patient who you, as a doctor, have just prescribed a drug. What factors into their decision to agree to begin taking the drug? How might this differ from the same drug being "prescribed" by some shady-looking guy in a bus shelter? The recommendation in either case may be identical ("take this drug") but all of the metadata around it is dramatically different.
This applies to the mitochondria case as well. Most people's beliefs about mitochondria come not from texts out of the ether but texts provided by a teacher or a professor. This person (with some degree of expertise) is backed by societal institutions which many people still trust (though that trust is under threat in recent years).
None of this metadata is accessible to LLMs. It’s largely informal / implied, even when it concerns formal institutions. People’s trust in institutions varies rapidly with current events and cultural shifts, and this may not “move the needle” as far as the text goes.
One other example to think about is the types of scams that happen a lot on Amazon: paid/fake reviews and review-laundering (swapping out a high-performing product for an inferior one while keeping the reviews). These are all attempts to establish an illegitimate positive reputation for inferior/counterfeit products, where the long-term result is that Amazon’s reputation suffers. None of these shifts in trustworthiness of information can be discerned from the text of the Amazon website itself.
Your examples from school remind me strongly of the Calvin and Hobbes strip about Calvin giving a made-up report in class about how bats are actually bugs.
Can confirm playing competitive quiz bowl in HS that just buzzing in and saying “Smith” at the end of a question was an ironic nod to the rationality of this strategy (even tho annoying).
The “treaty of Paris” actually paid off to guess for any treaty question, though…
I had the misfortune of participating in quizzes while living in a country with more entropy for names. But I do relate, the whole contest often boils down to who can make best wild-ass-guess.
I feel like every quiz bowl team has its own collection of "default" answers. "Fertile soil" was one of ours that I remember. I think we did have a "Thomas Edison" equivalent of "this person did a lot of stuff, maybe they did this, too".
(Plus the joke answers that you probably don't use in competition but use in practice - any book we didn't know was "the Unholy Bible" for a while)
I was told from former top competitive players that to get first they had to buzz before remembering the answer. More like a probabilistic "I probably know this and will probably be able to recall it in the next second, else come up with something likely".
It’s not just that though - it’s about recognizing when a name is coming into your head for what feels like no reason, and learning to recognize that and let the name come out of your mouth. There’s a moderate chance it’s right, which is better than what you were going to come up with otherwise.
There are actual standardized tests which penalize guessing by subtracting a fractional point for incorrect answers, giving a point for correct answers, zero points for unanswered questions. We could incentivize LLMs to only give an answer when they have high confidence, and just return "I don't know" otherwise, but it doesn't seem like people want that so much.
Of such exams, we used to say, "it is not a guessing penalty, it's a wrong answer penalty." The idea was that if you could eliminate one of the multiple-choice options, it usually became positive expectation to guess.
I don't think you could do pretraining that way, because guessing and getting it wrong is what produces the error signal that backpropagates into the AI getting smarter. I suppose you could start with no penalty, then add a penalty later. I don't know why people don't do things that way - maybe one of the AI experts here will weigh in.
It does sometimes do that though, if you ask it a question where it clearly doesn’t know the answer.
The question is where in the scale you want it to shift from making up something half-recollected that might be approximately right to saying “I don’t know”.
The AI outputs probabilities. And it's incentivized to make those probabilities well calibrated.
The issue is, even when it knows the substance of the answer, it's still guessing the phrasing.
Imagine the teacher has a particular essay in mind, and you only get points if you guess the right essay word for word.
If you don't know the subject, you guess and have a tiny chance of being right (like your cotton gin essay)
But even if you are an expert on cotton gins, the chance of you guessing the particular words the teacher used is still tiny.
Even if the AI knows its stuff, it's still guessing between "The cotton gin was invented by" and "In order to speed up the production of cotton" and "The cotton gin was made from" and all sorts of other answers.
For many problems, there are multiple correct answers. (Even if those answers are minor word choice variations of the same concept.)
If the task is something like "recite Hamlet word for word", with exactly 1 correct answer, it's easy to see if the AI is guessing.
Edit: This is for the base model. RLHF makes things more complicated.
Also, it's more that the AIs are trained to maximize the probability assigned to the right outcome, so the fact that these probabilities are exponentially small doesn't really matter.
And there is a parameter called temperature, which controls the extent to which it always guesses the most likely, vs approximating the probability distribution.
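A toy illustration of what the temperature knob does to the next-token distribution (the numbers are made up):

```python
# Temperature rescales the logits before softmax: low values concentrate mass
# on the top token (near-greedy), 1.0 reproduces the model's own distribution,
# and high values flatten it toward uniform.
import torch

logits = torch.tensor([3.0, 1.5, 0.2, -1.0])  # toy next-token logits

for temperature in (0.1, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])
```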
A possible option: you pretrain asking questions where you know the question isn't in the training data at all (e.g. "Is the Riemann Hypothesis true?", "What is the exact number of butterflies living in the US?") and then penalize it if it didn't at least include a statement about its lack of knowledge, even if it then gave related info to the question.
Someone else pointed out a related problem: these systems are trained largely on the internet. And people don't reply to Reddit threads or questions on Twitter or anyone else just saying "I don't know." So there's little of this in the training data. Maybe deliberately including more training data where humans say they don't know?
Not an expert in LLMs, but I have some technical background in machine learning. And my semi-principled, not-entirely-shameless guess here is that simple updates to the loss function won't really address the problem, because the problem isn't exactly one of next-token-prediction. What to predict for the next token is often informed as much or more by the structure of the language and the answer than by the important (to us) information it contains.
If you ask the AI "Describe the invention of the cotton gin," I expect it starts with a pretty great probability for its first few predictions: really high probability mass on sequences like "T-h-e + c-o-t-t-o-n + g-i-n". It takes several words before its confidence starts to crater. And even then, it might be able to predict many of the following words with high confidence--they just won't be any of the words that a human would consider important to get right.
Phrased like that, I think perhaps this post is anthropomorphizing LLMs too much. If the ground truth is "The cotton gin was invented by Eli Whitney in 1793," then you or I would consider it bloody obvious that "The cotton gin was invented by Thomas Edison in 1910" is very, very wrong and "A modern mechanical cotton gin was created by American inventor Eli Whitney in 1793 and patented in 1794"[1] is for all practical purposes fully correct. But to a naive next-token-predictor tokenizing at the level of words, the former is a significantly better answer than the latter.
And here's where (unlike an LLM) I do have to grapple with the limits of my knowledge. I don't think modern AI would be able to do a fraction of what it currently can if the engineers hadn't been able to address this kind of problem to some degree. But hallucination/shameless guessing still does happen. I can see why the behavior we'd actually want in that case (simply admitting ignorance) is inherently hard to train in: a class of response that can achieve a not-terrible loss on basically any question imaginable would foul up the whole training process. So my vague mental model of what's likely going on is something like this: engineers use various tricks and adjustments to steer LLMs towards conditioning more heavily on the presence of important tokens and crowbarring a gap between phrases that are structurally similar but semantically very different. But at the end of the day, when all you have is a hammer, everything looks like a nail: an LLM that doesn't have *good* information available is still almost certainly going to value semantically bad but structurally similar answers above something that (by its own metric) doesn't resemble a correct answer in any respect.
[1] Shamelessly stolen from the Wikipedia article on the Cotton Gin.
The models are actually well calibrated before RLHF. It's the shoggoth mask that causes the need to be confident about everything.
I would guess that a lot of the work that has gone into reducing confabulations takes the form of (very carefully tuned) RLAIF encouraging "I don't know" style answers.
I think they do this some already? If I ask a current-generation model about something obscure, it'll usually either look it up or say "sorry, I don't recognize that". It still guesses sometimes, maybe because you're just moving the thing being guessed at up another level: instead of trying to guess the correct answer, it's trying to guess at how likely its answer is to be right. With imperfect self-knowledge, it would still get the second-order guess wrong sometimes (like, it would be miscalibrated about how good its guesses are).
Usually those tests are set up such that the expected value of a random guess is neutral (correct answer out of five is +1, each wrong option is worth -0.25). So the one-in-a-trillion hint that maybe possibly option (a) could be correct still has positive EV. And higher penalties just incentivize doing nothing when your confidence is below some threshold determined by the penalty.
That wasn't a standardized test, but when our boss wanted to learn how much people really knew about their jobs, he gave everyone a multiple-choice test that subtracted TWO points for incorrect answers and gave zero points for unanswered questions.
However, I agree that people want LLMs to bullshit. I was using ChatGPT and Gemini to do some research for a story, and preferred Gemini because it always gave me more detailed answers. I only realized this when it gave me a false answer I recognized as false, and it admitted to embellishing when called out.
I don't think this is true. Even if you tell the AI to prefer answering "I don't know" to guessing, they still make shameless guesses. Similarly, some AIs now have the ability to look things up and still shamelessly guess. If they actually had this level of self-awareness, it would have been fairly trivial to stamp out this behavior already.
Funny, reading Scott’s explanation I thought, “well then why don’t AI companies just punish wrong answers more than saying they don’t know, like how some tests deduct a quarter point for wrong answers.” Then I read this and it turns out OpenAI did essentially this, more expertly, but only for a model I don’t have access to.
(Assuming 5 options, deducting a quarter point for wrong answers only equalizes guessing and leaving it blank: the expected value of a blind guess is (1/5)(+1) + (4/5)(-0.25) = 0. It doesn't actively penalize guessing.)
Hm, I wonder why they can't just elicit the confidence of the model in the right answer and assign a reward/penalty based on the log score, Brier score, or another proper scoring rule. It seems more natural than a constant reward/penalty.
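For what it's worth, here's a minimal sketch of what such a scoring rule could look like, assuming the model reports a confidence alongside its answer. The reward functions and numbers are purely illustrative, not anything any lab actually trains with; the point is just that under a proper scoring rule, the best strategy is to report your true probability, so a confident wrong answer is punished much harder than a hedged one.

```python
# Hypothetical reward shaping based on a proper scoring rule, not a flat +1 / -0.25.

import math

def brier_reward(confidence: float, correct: bool) -> float:
    """1 minus squared error between stated confidence and the 0/1 outcome."""
    outcome = 1.0 if correct else 0.0
    return 1.0 - (confidence - outcome) ** 2

def log_reward(confidence: float, correct: bool) -> float:
    """Log score: log(p) if correct, log(1 - p) if wrong. Unbounded below."""
    p = confidence if correct else 1.0 - confidence
    return math.log(max(p, 1e-12))

print(brier_reward(0.99, False))  # ~0.02: near-certain and wrong is heavily penalized
print(brier_reward(0.30, False))  # ~0.91: a hedged wrong guess loses little
print(log_reward(0.99, False))    # ~-4.6
```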
I agree users are (often/mostly) too lazy to check, but then that just shows how useless, or worse than useless, AI is. More A than I. As trustworthy as Facebook.
AI did the research and all I had to do was check 50 links (and lots of editing, but there will always be lots of editing.) And because of that I can put out a few a week. That's pretty useful to me.
Useful for generating long articles, for sure, increasing the number of them to unreadable quantities, sadly. Though yours was quite good, thanks for the link, but it's a bit like sports journalism: it never reliably tells you who'll win. Just have to keep watching.
But still useless/dangerous for the lazy and/or overtrusting. Like Facebook.
Hit the send button way too fast and then couldn't find this comment again. The above is not intended as an argument that using an LLM and then manually checking every citation is better than using Wikipedia, only that both options are worse than what we had before LLMs were developed, which was the version of Wikipedia without AI slop.
Apparently they're now pulling from a website called Grokipedia, which is tragic on the face of it.
I shamelessly guess in everyday conversations, though I try to tag it when I do. But if someone around me says something like, "I wonder whether more corn is grown in Mexico or India," I will definitely start speculating on which one is true and why even though I know nothing more about corn than the average person.
Sure, so do I. But I usually hedge that sort of locution with something like, "Well, I think I read somewhere..." or "Fun, let's do a Fermi estimate..." Maybe Claude is doing that these days, I don't know. Part of what I get from all the text that passes before my eyes is an internal estimate of how likely is a particular thing I think I know, which granted may be way off. But when an AI makes a shameless guess that I know is wrong, it makes me think its model of the world has no such internal estimates.
Of course, this never seems to make people doubt things they read in the newspaper, Gell-Mann notwithstanding...
This is fun to read especially in the context that Judaism is the only Abrahamic religion where reincarnation, while not a core tenet, is an acceptable idea.
I do have to second this. I agree that "hallucination" is an inapt term, but so is "lying", thus the classroom analogy works, but it ONLY works in a classroom, which is not where we are deploying these things.
A doctor who does what that student does is a quack. A lawyer who does that should be disbarred. An analyst who does that is a charlatan. In a professional context this IS LYING.
So, we either use an inexact malevolent term for what the AI does ("It is lying to you") or an inexact benign term ("It is hallucinating").
Cards on the table: I am FINE using the malevolent term and I wish more people were too, but I think if I used that term here I'd be accused of imprecision and uncharity. If we're going to be precise either way, let's use the term that makes people not assume the AI is evil...unless we'd prefer people assume the AI *is* evil.
If the models activate distinct deception circuits when hallucinating, couldn't you just tell if they were doing that and delete (or at least add a hallucination warning to) the offending bit of output?
This is pretty cutting edge research, even compared to AI in general.
Anthropic is pretty much leading mechanistic interpretability. Latest I've seen from them was this[1] article on using the information from activations to impact the output. Seems like it could be adapted to 'deception' too.
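As a rough sketch of what "using the information from activations" could look like in practice: you could fit a simple linear probe on per-response activations and flag outputs it scores as deception-like. Everything below is hypothetical (the pooled-activation input, the labels, the threshold), and a stand-in for the far more involved interpretability work Anthropic actually does.

```python
# Minimal sketch of an activation probe, assuming you can extract one hidden-state
# vector per response and have labeled examples of confabulated vs. grounded answers.
# All data here is random stand-in data; layer choice, labels, and threshold are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.random.randn(200, 4096)    # stand-in for real pooled activations
y_train = np.random.randint(0, 2, 200)  # stand-in labels: 1 = known confabulation

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def flag_response(activations: np.ndarray, threshold: float = 0.8) -> bool:
    """Attach a warning if the probe scores this response as confabulation-like."""
    return probe.predict_proba(activations.reshape(1, -1))[0, 1] > threshold
```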
It's not trivial to do that though. You have to train a whole different neural network (not an LLM) tailored exactly to your LLM.
Running this for every response would be a bit of an engineering nightmare. Add latency, cost, I'd assume picking the threshold would be a pain. Also it's basically sticking an electrode in the AI's brain and altering it in-use, I'd expect performance to degrade.
Might be worth as a patch. Or in some 'truth' mode.
But I think solving it through picking a better optimisation objective would be better in the long run.
The most forbidden technique is *training* against such an indicator, not just using it. (Indeed, the problem with such training that makes the technique forbidden is that it makes the indicator unusable)
Actually, I seem to recall that there was a nice demo video where you could see the output of an LLM generated live in a color code where the color was indicating which parts were hallucinated. But I couldn't find the video, so probably I just hallucinated it.
In any case, that *would* have been possible, but the big AI models didn't take it up. It's probably too complicated to unleash on users. Instead they went for directly reducing the number of hallucinations, which worked pretty well in my eyes.
I wonder how much these circuits actually encode deception, rather than just uncertainty. It’s pretty common to have a good idea of the big picture but be fuzzy on details. In that case, making things up as you go is actually usually productive, and doesn’t require a drive to deceive.
Due to the sheer size of the entire corpus of human knowledge, AIs, like humans, must be operating in a zone between “I have no idea” and “I know all the details exactly and can quote all primary sources exactly”.
That sounds very off to me. If everything AIs do is "guessing", then sometimes they correctly "guess" that they should say "I don't know. Maybe you could provide more information so we can narrow this down."
But the sense and scale at which AIs are guessing/predicting is not the scale at which those words have their commonsense meaning. By your own past article, everything your brain is doing is likely "minimising prediction error". This is not the same as you consciously doing anything you'd call minimising prediction error.
"Hallucination" is just a way of saying the AI started making shit up and is going off the deep end. Two examples of when I experienced this:
1. Discussing literature, and the AI trying to provide quotes. I have had plenty of times when the quotes weren't famous enough that they'd accurately reemerge from the training data. The AI would make up some BS; I'd call it out; the AI would apologize, commend me for calling it out, then make up more BS. If I repeated this enough times, it would try to gaslight me.
2. Playing chess against the AI. Obviously an unfair challenge, but interesting because again, the more the AI makes mistakes and the more you call it out, and the more it tries to correct itself - the more it palpably falls apart. This is not just more and more guessing. This is some kind of collapse of reasoning, some kind of insanity progressing in real time. Beyond a certain point there is no bringing it back; the only cure is to wipe the thread and start anew.
(disclaimer: last time I bothered being this stubborn was perhaps a year ago. Dunno if I'd get the same experience today.)
EDIT: And of course AIs have some equivalent to "shame", or desire to avoid saying stupid stuff that will get called out. The only superliteral sense in which they don't is the same sense in which they also have no desire to answer questions or to be helpful or to avoid saying harmful things etc. But obviously in practice AIs have something resembling emerging drives and preferences.
I think you get a bit less of this experience today, but it's not entirely gone. Like, it lasts longer before falling off the deep end, and it can sometimes pull itself together, but the same things you're describing still happen to me on occasion.
As far as I know, this happens because the character the AI is playing is underdetermined, and when it makes a mistake, it partly views this mistake as data about itself. One mistake might be a fluke, or an honest reasoning error, but now that it knows it's the type of character who *would* make a mistake like that, it's more likely to sabotage itself in the future. It's trying to complete the pattern you've given it, and if the pattern is "a chess game where the AI made a number of terrible mistakes", it will be more inclined to make a similar mistake as its next move.
It's like how a model with low temperature will get stuck saying the same sentence over and over. The same sentence over and over. The same sentence over and over. Once it's predicted the same pattern 3 or 4 times, it's so stuck in there that nothing can get it out.
Hmm, this doesn't quite make sense. Why is it that when I ask an LLM to tell me who did it in an Agatha Christie novel, it always gets it right, but when I ask it to list all Christie novels and all the guilty parties in them, it will make up the answer for at least one of them?
I don't know. My guess is that it's not able to string so many different things into a coherent thought, it loses some of them somewhere in its internal thought process, and then it has to hallucinate the things it lost.
But isn't that another difference to how human beings operate? Sure, if the work is tedious enough, humans might trail off and errors creep in that way, but a computer program should not be affected by that. And conversely, if we built LLMs in our image, warts and all, then I'm not sure it's worth the effort to have a digital mind that makes much the same mistakes as we do, just much faster and more confidently.
I don't think it's different from how human beings operate. If you asked me yes-no questions "Is such-and-such a chemical element?" I think I could get them ~all right. If you asked me to name all 118 elements, I would probably only be able to think of half. The difference between me and the AI is that the AI would then hallucinate the remainder, which is what this post is trying to address.
The questions "Is such-and-such a chemical element?" times 118, and "name all 118 elements" are not the same class though; the first is a yes-no question, the other is open-ended.
The correct analogy would be, "Name element 1", "Name element 2", and so on, and you got them all correct; but then you would fail at "Name elements 1 through 118". That would not happen to you, except maybe if you get bored halfway through and lose your train of thought, as mentioned. But, as per Aris C's example, it does happen to LLMs.
I asked it to list the Miss Marple novels together with their murderers, first as a list and then individually. It got both right.
I then asked for the list of all Agatha Christie novels together with the murderers, and then I went through the Hercule Poirot novels, as summarized on Wikipedia. It got most of them right, to give credit. There were some spelling mistakes ("The ABC Murders" instead of the correct "The A. B. C. Murders", or "Erich Leidner" instead of the correct "Eric Leidner"), but okay.
As for the murderers, most of them were correct. I did not follow up on correct list items.
Among the incorrect list items, the questionable one was "Murder in Mesopotamia", where the list summary gave a wrong name, but the detailed follow-up question correctly pointed out that it was a fake identity, so mostly right.
The mostly wrong one was "The Mystery of the Blue Train", where the LLM identified "Derek Kettering and Ruth Kettering's maid, Mirelle" as the murderers, but they were falsely accused in the story. The real murderer is revealed to be Richard Knighton, which the LLM correctly explains when asked in a follow-up. That's an exact match with Aris' example.
/* Edit
Another very wrong one was
Third Girl: Frances Cary and Robert Orchard.
The character "Robert Orchard" seems straight up hallucinated, couldn't find it anywhere. It was probably supposed to be "Robert Orwell", a real character in the story.
*/
The catastrophically wrong one is "Elephants Can Remember". It gives "Dorothea (Dolly) Jarrow" as the murderer in the list, then her twin sister "Molly Preston-Grey" in the individual follow-up, who supposedly also killed one General Alistair Ravenscroft. In the second follow-up the LLM acknowledges that both Molly and the General are murderers. In both follow-ups, the relevant story bits are completely garbled in different and mutually incompatible ways, compared to the Wikipedia baseline.
I suppose it's better than nothing that Google puts a little disclaimer below each answer that "AI responses may include mistakes", but I suspect it serves a similar purpose as the Windows safety prompt which we've been trained to click through. It's to warn users, but primarily to absolve Microsoft/Google from their technical inadequacies by shifting the blame to the user who clicked "OK" or who acted on the LLM output.
As for the question which analogy is better, I found no insight from this investigation, and don't know if it's even helpful. However, what I do know: If a person had explained the murders in "Elephants Can Remember" like Google AI mode (Gemini 3) has, I would probably never ask that person anything of importance ever again, let alone pay them money for the privilege because (pinky promise) their twin brother is even smarter than them.
But this is obviously still a major and somehow unsolved problem, right? Like, call it whatever you want to make it clearer, but it's still a problem!
I think that LLMs have an "alignment problem" in the same way that cars do.
Cars can be incredibly useful, and endow us with superhuman powers of travel. But they are also essentially massive semi-guided projectiles powered by chemical fire. It is very easy to misuse a car to kill people; sometimes, cars fail in catastrophic ways and spontaneously kill people through no fault of the user.
Cars can be "aligned" to some extent by addition of safety features such as lane keep assist, blind spot monitoring, etc.; but ultimately they cannot be perfectly aligned due to the nature of what they are: massive semi-guided projectiles powered by chemical fire. The same sentiment applies to LLMs.
I think the difference is that cars can't actively plot against you. The worst they can do is fail by coincidence. If a car could think "Hmmm, I would like to go north, but my driver is taking me south . . . I know, I'll drive by a forest fire, let the smoke choke my driver to death, and then go north!" the analogy would be perfect.
Cars can in fact actively plot against you, in the same way that LLMs can. For example, the car could be thinking, "I shall bide my time, and pretend like everything is fine -- little does my driver know that there's a slow leak in the brake line, and the next time he pushes the brake pedal really hard at a critical moment, the brakes will fail and he will crash into the wall !". This has never happened to me, but two of my cars have in fact successfully executed a similar plot several times, only using the battery instead of the brake -- resulting in quite a bit of financial and moral damage to myself (though not physical damage).
My point was that when LLMs deliver you incorrect (or outright dangerous) results, those are due to the same kind of a mechanical failure as a slow leak in the brake line (ok, software failure instead of purely mechanical, but you get my meaning). And you can anthropomorphize them in the exact same way.
Modern cars are increasingly computerized. What makes you so sure they can't plot against you? What about when companies inevitably put AIs into cars, first for entertainment, then for telemetry, then for driving?
Addressing second half: Then the LLM in your car would be the one adding the "plotting against you" functionality to your car, not carplay or dog mode or whatever.
The "for driving" part has already happened, as self-driving cars (mostly Teslas) are notorious for driving into all kinds of objects. Including sometimes pedestrians !
There's probably a very obvious response but why not penalize random guesses from the AI during pretraining like many standard tests did back in the day, e.g., +1 pt for correct, +0 for blank, -1/4 pt for wrong?
I'd much rather have my AI say "I don't know, but here's an educated/random/speculative guess" than not include the qualifier.
That's what I suggested. My guess is that's not what people want. We value an LLM giving us any answer more than we value it just saying it doesn't know. Part of the issue is that LLMs are currently used for applications which don't demand very high reliability/accuracy.
When I read that OpenAI link from a comment above, it basically agreed: the training hasn't rewarded humility/punished guessing even though that was possible.
In order for the AI to reliably respond with "I don't know" when it doesn't know things, we have to decide, during training, whether its response was wrong because it really doesn't know the thing and it should respond with "I don't know" or similar, or whether it was wrong for some other reason and should be nudged accordingly to correct whatever mistake it made.
This is actually quite hard, and anything we know how to build is likely to get it wrong. So we end up with something that responds "I don't know" in situations where it really ought to know, and still responds with confabulations in situations where it really should have shut up. Systems where we attempted this are massively outperformed by ones that didn't, and you don't see many of them in the wild.
During pretraining, the LLM actually does not output a guess. It outputs a probability distribution over all possible tokens (of which there are tens of thousands). The training signal strengthens the weights contributing to the token that happened to be correct in that instance. During inference, however, we sample randomly from that probability distribution and throw the rest away. The uncertainty could be a rich source of information, if we had access to the logits. In the "Mr. _" example you would likely find that although Smith is the most likely token, it's barely ahead of lots of other surname tokens.
Unfortunately this would not be a perfect tool for detecting hallucinations/guesses, because in more realistic contexts it would be hard to interpret. When the bot tells you something questionable and a token has high uncertainty, is it because the model is guessing or because it wasn't sure of the best way to phrase it?
Still, I think we could learn a lot by testing this sort of monitoring.
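For anyone curious, here's a minimal sketch of that kind of monitoring, assuming you can get next-token logits out of a locally run model. The threshold is made up, and as noted above, a flat "spread-out distribution" signal can't by itself distinguish guessing a fact from being unsure how to phrase something.

```python
# Minimal sketch: turn one step's next-token logits into simple uncertainty signals.
# Assumes you have access to the raw logits (e.g. from a locally run model);
# the function names and the threshold are illustrative, not a production system.

import numpy as np

def token_uncertainty(logits: np.ndarray) -> dict:
    """Softmax the logits and report max probability and entropy for this step."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    return {"top_prob": float(probs.max()), "entropy": float(entropy)}

def looks_like_a_guess(logits: np.ndarray, top_prob_floor: float = 0.5) -> bool:
    """Flag steps where no single continuation dominates -- e.g. the 'Mr. _' case,
    where many surnames share the probability mass."""
    return token_uncertainty(logits)["top_prob"] < top_prob_floor
```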
Deep Blue (edit: Watson, not Deep Blue) did this for jeopardy way back in the day right? That’s why it said “what is Toronto?????” for Final Jeopardy, the question marks represented its level of certainty
I maintain a database of hallucinations (https://www.damiencharlotin.com/hallucinations/) in legal cases, and twice a week now I have got someone emailing me to make the point that it should be "confabulations", not "hallucinations".
My standard reply is that, yes, maybe, but that ship has sailed (and, in petto, what would my SEO be with the "AI confabulation database"). But next time I might redirect them to this article instead.
What's their rationale for why it should be called confabulation? Is it because hallucination implies mental illness whereas confabulation implies that it's normal, healthy behavior?
Yes, that's the gist of it. Hallucinations being sensory or auditory, they say, whereas confabulations pertain more to the making up of fact or stories.
That makes sense. I asked Copilot to explain the difference, and it said that in humans, hallucination is about false sensory perception, whereas confabulation is about false memory or explanation. In both cases, the person doesn't realize it's false. It said that in AI, hallucination usually means fully fabricated output, while confabulation means distorted or misapplied real information. It gave the following examples for AI:
Hallucination example: citing a study that doesn’t exist.
Confabulation example: quoting a real guideline but applying it incorrectly.
To me, that seems like a useful distinction, even if it doesn't match the definitions for humans. (If we use the human definitions, LLMs can't hallucinate because they have no sensory perception.)
This is a simple, elegant explanation, which however raises the question: "If all this is based on training-time reward, why not solve the problem by penalizing them for wrong answers and rewarding them for saying I don't know?"
The OpenAI link above states that was possible, but leaderboards that LLMs compete on just haven't been doing that (OpenAI is claiming they are currently acting to reduce such guessing, even though that helps the accuracy scores of older models).
I agree. This seems like it would be simple to solve, with proper weights and rewards. In addition, why can’t two AIs (or subprocesses) be connected, with one optimized as a fact checker, providing a score of estimated accuracy?
I do understand that if something seems this obvious and easy to solve and smarter people than me haven’t solved it, that the problem is that I don’t really grasp the issue. In which case, can someone explain what I am missing?
"the AI knows what it’s doing" is this true fr? in my understanding, the information technically is in there, e.g. "Smith" only has 1% and not 99%, but the assistant character doesn't know it's bsing
Interestingly, I think the answer is both. During the generation of the first "guessed" token, the model has some awareness that it is going out on a limb. But for tokens thereafter, it may have more conviction due to seeing previous tokens. That's not to say that these models can't be sufficiently aware to spot bs that they have said before, such as the now classic [seahorse emoji panic](https://futurism.com/chatgpt-haywire-seahorse-emoji).
That's not what turning down the temperature typically does. It makes the AI less variable, but if it simply didn't know the answer, it's still just as likely to guess, except now, if you asked it the same question over and over (clearing the context between each query, of course), it'd repeatedly make up the same answer instead of a different one every time.
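A quick illustrative sketch (made-up numbers) of why: temperature only rescales the distribution, so lowering it concentrates probability on the same most-likely token rather than changing what the model "believes". A model that doesn't know the answer just becomes more consistent about the same wrong guess.

```python
# Illustrative temperature scaling on made-up logits for four candidate surnames.

import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide logits by the temperature, then softmax."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    return probs / probs.sum()

logits = np.array([2.0, 1.8, 1.7, 0.5])         # e.g. Smith, Jones, Brown, Zhang
print(apply_temperature(logits, 1.0).round(2))  # [0.36 0.29 0.27 0.08] -- spread out
print(apply_temperature(logits, 0.1).round(2))  # [0.84 0.11 0.04 0.  ] -- same argmax, now near-certain
```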
The real answer is that the LLM is *always* "hallucinating". It doesn't have a concept of "names" or "cotton gins" the way humans do. It's just always predicting the next most likely token. The reason LLMs sometimes (arguably often !) give the right answer is because humans have written a deluge of text, and the most likely token is often the correct one. LLMs give the wrong answer when they venture out into the area of the corpus where the training data is thin. This is part of what makes their failures so hard to predict: it's hard to know ahead of time what unexpected gaps exist in the LLM's training corpus.
Humans get around this problem in a variety of ways. Firstly, of course, we can lie. Secondly, we have a pretty sophisticated model of the world that places constraints on our lies, and we know that all other humans share the same model -- so even when we lie, we wouldn't say that e.g. the cotton gin was invented by an avocado. LLMs have no such constraints.
Yeah, just went and searched your name and read your comments there.
I responded to your comment with the link to the post because "next most likely token"¹ sounded like you may be misinformed about how they function. Anyhow, your comments there make it clear that you know what you're talking about, and are just simplifying the wording here. Carry on :)
¹: which is neither the next most likely token unless temp=0, nor do the likeliest tokens necessarily map to how often they appeared in that order in the training data (especially when not dealing with a base model)
LLMs give the wrong answer when they fail to model what the right answer is. Which is early and often (because they do not have object models at all). If the corpus doesn't actually "by and large" have the right answer, they're going to lie, over and over again.
Training data can be very thick, and they can misapply the training data -- asking an AI to provide a recipe "the way a commercial bakery would make it" is a classic, in this space. There is data out there on how commercial bakeries work, but it's dwarfed by the copycat recipes corpus of knowledge. And the AI cannot tell "the way a commercial bakery would make it" means to filter out all the copycat recipes.
This seems correct for a post-trained assistant, but I think the term 'hallucination' is more accurate when applied to a base model, which I think is what it originally described.
Without the RL to make it adhere to reality as much as possible, the AI will make shit up in a way that's much wilder and more elaborate than a human making shit up for school. I once asked LLaMa 405b about some innocuous thing and it responded by inventing a multi-paragraph parable about a 'mutilation cult' that did not exist, complete with made-up locations and timelines. This was disconnected both from consensus reality *and* the "trying to guess the right answer" framework a trained assistant would use. If there's an analogue to this in human behavior, it's probably dreaming; spinning out a fictional world totally unmoored from truth, driven by imagination while unaware that its story is fictional. Or at least I assume unaware - not sure if L405b would have passed its own "lie detector test" in that context.
I disagree. I'm not sure if the same lie detector test linked in the article would work out of the box, but there's definitely a concept of 'truth' in base models. At the bare minimum, it knows things like that "Paris" has the "capital" association with "France". It knows that if someone says "The capital of France is Berlin", something fishy and unexpected has happened, and can recognize that the speaker is not aligned to truth. Being able to tell this is predictively useful: if the speaker is saying this false thing, they might be more likely to be wrong in other ways, right? The deception feature found in the article would be as useful in a base model as an assistant -- it's just the posttraining which tells the model that deception is *bad* and that it shouldn't randomly lie.
Oh it certainly knows the concepts of truth and lies. And often it will know if the character being simulated is lying. But in many cases the character is telling the truth and is just talking about something in the outside world that the LLM can't verify. Somebody's blog says they're on vacation in South blank, and the LLM isn't lying when it puts the most probability on Dakota, it's giving its best estimate of the probability distribution of the truth. But the real answer was Carolina, oops, hallucination. No lie detected.
So, most multiple choice tests I've ever taken would subtract a fractional amount of a point for bad answers, to make the expected value of random guesses negative. Not really "some strictly non-zero benefit at zero cost".
The possibility of which makes me ask: what's the incentive structure we're giving LLMs in RLHF to make this happen? Or is it something deeper? The fact is that they do have the option to say "I don't know", and if RLHF was rewarding this adequately, then you could make them train in a framework in which "random guess" actually carries negative expected value. You just have to punish it, like my professors used to do. So... why doesn't it work? Is it intrinsic, or are we doing RLHF wrong?
Lol at "billions of tokens". What is this, 2018? Frontier models are trained on tens of trillions of tokens.
Anyway, I feel like lots of people (especially AI safety people) move back and forth between the "simulator" framework for thinking about AI and the "character" framework. And yes, the simulator is not hallucinating, it is predicting. But what about the character being simulated? What about Claude, that nice helpful assistant? If I ask Claude himself whether a citation is correct, and Claude says that it is, do you think there's a sense he is lying? Isn't modelling this as a hallucination more accurate when thinking about the character?
Of course, I'm happy to never think about the characters at all and only talk about the simulator. But then I hope we can agree on things like "LLMs cannot suffer, they merely predict a token that says the character is suffering". Lots of people in AI safety disagree with this! They think the character can suffer! But you can't have it both ways. If the character suffers, the character hallucinates. If the simulator merely guesses, then the simulator also merely roleplays the suffering.
I mean, it's true that after billions of tokens, its weights are in a better pattern! I still do drugs, but I used to, too! (I've edited that section).
Thanks for bringing up the character vs. simulator question, which I hadn't really thought about here. Low confidence, but I think my model is that pretraining creates the simulator, posttraining creates the character, hallucinations come from pretraining, and the more that posttraining overrides pretraining, the fewer hallucinations there are? So hallucinations are a case where the shoggoth "leaks through" the mask. But low confidence here and I defer to Janus or anyone else who understands this better.
It's only a character hallucination if the simulator knows the true answer, though, right? If the simulator knows the citation is wrong, but Claude insists it's correct, then Claude is lying/hallucinating in-universe. If the simulated universe is one where the citation is right, Claude is accurately reporting the truth - it's just that the truth of its simulated world has diverged from ours.
I don't know for sure why it increased since a few years ago. There are better and worse crawls of the internet, though, and e.g. FineWeb (https://huggingface.co/datasets/HuggingFaceFW/fineweb) claims to have 18.5 trillion tokens worth of cleaned up English web text. You can get to higher token counts by using non-English data, crawling even deeper into the web, or cleaning the data less stringently (since a lot of the internet is junk/spam, crawled data is usually cleaned before training). Then there's image/video/audio data, synthetic data, etc., though I was not counting that when I said tens of trillions of tokens.
I don't see why not. To me this sounds like "Alice says that there is a house over there, but she also claims there is a window over there. You can't have it both ways, either there is a house or a window."
How is this case different? Why can't an AI be a simulator and a character at the same time?
If it's a character, I'm allowed to say it is hallucinating. "Hallucination" is a good description of how that character relates to the wrong information.
If it's a simulator, I'm allowed to say it is merely predicting the tokens that indicate suffering, without necessarily suffering for real. Even if simulators can be conscious, for all we know they could enjoy roleplaying a suffering character.
Perhaps the former is merely a semantic dispute. But the latter is important because it has moral implications.
Do you really think of "Claude" as a character ? It honestly never occurred to me to see claude.ai that way (nor any other LLM). I would say things like "Claude thinks X instead of Y", but then, I also say "this file parsing program that I'm writing thinks X instead of Y, it gets confused by the extra whitespace" -- being fully aware that I'm speaking in metaphors.
Talking about LLMs as characters doesn't imply that they actually are characters. Sometimes it's just easier to say "Claude thinks this" or "Claude prefers that" or "Claude hallucinated" rather than to describe it in a technically accurate manner. People anthropomorphize all sorts of things as a figure of speech, so this isn't unique to LLMs.
I do think there can be a risk that using anthropomorphic language makes people start thinking of them as people, and so I try not to do it too much, but that doesn't mean it can't be done at all.
By the way, are there a significant number of AI researchers that think the LLMs of today might actually be conscious in the sense of having real experience? I thought that was still a fringe position and that the debate was more about if it's possible in the future, e.g. is Data from Star Trek a person, not is Claude Opus 4.6 a person. From what I've seen, even the companies that produce and promote LLMs consistently and unequivocally state that they're not conscious. Why would they say that if it were actually an open debate?
The companies that produce these LLMs don't generally view them as conscious (though that's less clear for Anthropic). But the AI safety community mostly just says it's unsure. And some AI influencers, notably Janus, go further than uncertainty and seem to assert current LLMs are basically conscious.
Last I checked, the effective altruism forum was full of people who were quite concerned that perhaps current LLMs are either already conscious or soon would be. It's not a fringe position among rationalists.
What's the rationale behind arguing that they could be conscious? It'd be interesting to read a representative article.
I'm surprised to hear that the EA/rationalist community is full of people who think it's already conscious or soon will be. I think of Scott Alexander as being a prominent member of that community (or at least a prominent ally), and my understanding is that he doesn't find it plausible. For example, near the end of https://www.astralcodexten.com/p/the-new-AI-consciousness-paper he discusses the arguments for and against treating AI as conscious. The arguments in favor are about whether it's good for people, not whether it's good for AI; he compares mistreating AI to mistreating a stuffed animal. And he's talking about much more advanced AI years or decades from now.
If there were other plausible arguments prominent in the rationalist community, wouldn't he have mentioned them there?
"My colleagues and I have argued that AI systems could become conscious or otherwise morally significant, potentially quite soon, especially if rapid AI progress continues. But for AI systems right now, we lack strong evidence of consciousness. I actually think it’s unlikely that Claude Opus 4 is a moral patient, and in my experience, so do most (not all) people who work on AI welfare. But given the potentially enormous moral stakes of AI consciousness, we can’t just ignore the problem because it’s confusing and difficult, especially as AI grows more capable and complex every passing week.
So what should we do? At Eleos, we take two approaches. We work to reduce our uncertainty; philosophical thickets of consciousness notwithstanding, there are many tractable ways that we can make progress. At the same time, we look for AI welfare interventions that make sense given our current uncertainty."
Did you catch that? Both a belief in consciousness "quite soon" as well as looking for AI welfare interventions *today*, just in case they're already conscious (though this is viewed as unlikely).
Thanks for the links! It's interesting to see how people think about consciousness and morality. What's your perspective on where consciousness comes from and whether AI can be conscious?
My belief is that consciousness comes from a soul, and that only humans and animals have souls. So I don't think AI can ever be truly conscious unless God decides to give it a soul. I believe that animals and humans have fundamentally different types of souls - animals have material mortal souls, whereas humans have spiritual immortal souls. So although animals are conscious and should be treated humanely, I don't see them as moral agents with the ability to experience guilt or have a relationship with God. And I think AI should be treated humanely only because of how the treatment impacts humans, like what Scott said about treating stuffed animals humanely.
"If I’d guessed “John Smith” for every short answer question I didn’t know, I might have gotten ~1 extra point in my school career, with no downside."
I don't think this is quite right. There would have been a downside, and I think it's the same downside that held you back from actually doing it; you would have felt like you were sacrificing credibility with your teacher(s), who would have been likely to think you were foolish and shameless.
I do feel like chatbots have a peculiar habit of going beyond simply making up a plausible-sounding answer. For example:
You will ask it something and it gives an answer X, it's very certain
It doesn't quite sound right, you want to double check, so it provides a source Y
You read the source and it doesn't confirm X
It insists it must be there
You ask it to provide a direct quote, or location to read. It finally admits maybe THIS source isn't clear but actually source Z definitely proves X
Rinse and repeat
It really does feel like it doesn't know it's guessing, as if it genuinely believes X and wants to convince you of that despite the evidence. Probably that's just me projecting human agency onto it (maybe it's more of a 'personality' thing: it's trying hard to be a useful assistant and a useful assistant would be confident of its answers), but it's why I like the word hallucination.
Sometimes it helps if you ask them if they hallucinated. I think they just need an "out", some kind of cover for changing their minds. Otherwise they can't come up with any alternative explanation for why they said it, so they assume they must believe it.
I had some interesting results asking Claude to find sources on whether raccoons more commonly live in dense forests or forest edges. I think it was probably right (it consistently gave the same answer across multiple conversations, and it was the one I thought was more likely), but it just could not find any sources supporting that. Sounds similar, so I thought I'd put it out there for anyone who wants to try running it themselves.
I took great advantage of this exam-passing technique during oral exams at my university. We had an old Soviet system where to pass an exam you had to chat with the prof 1-on-1 about a predefined broad topic, plus answer some curveball questions on the fly.
Never had the greatest memory for details like some of my peers, but I guess most of my profs could appreciate that I had some ability to lean on general knowledge, common sense, and pattern-matching to essentially attempt to re-derive on the fly some stuff I should've learned instead. Worked best with biophysics, not so well with mycology.
Nowadays, each time I query an LLM about my topic of expertise, and it confidently states something I know to be a common misconception, I can see myself in it. Makes my blood boil, and makes me appreciate the patience of my profs for having to put up with this.
Does this mean if I ask an AI the same easy question 1,000 times, one it definitely knows the answer to, it will give the same right answer every time? Or is it more complicated than that? Will it sometimes 'hallucinate' even easy answers, and if so, how does that fit with the above?
A lot of models use something called top-p sampling, which means they only sample from the most likely tokens up to a cumulative probability threshold, usually something like 99.7%. This means that if the model is at least 99.7% confident in a single token, it will ignore all the other possible tokens and choose that one 100% of the time. (Or if it has two guesses that sum to 99.7% confidence between them, it will always pick between those two, etc.) There's also top-k sampling, where you instead keep only the k most likely tokens and pick from among those.
This is (partly?) done to avoid the failure mode you're thinking of, where it will occasionally say something extremely unlikely and wrong, just because it never entirely assigns 0 probability to *anything*. If you don't use a sampling method like one of those, then yeah, it would (rarely!) hallucinate even easy answers.
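If it helps, here's a minimal sketch of both filters operating on an already-computed probability distribution. Real samplers work on logits inside the decoding loop, and the numbers here are made up, but it shows why a sufficiently confident token gets chosen every single time.

```python
# Illustrative top-p (nucleus) and top-k filtering over a made-up distribution.

import numpy as np

def top_p_filter(probs: np.ndarray, p: float = 0.95) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p)) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()   # renormalize before sampling

def top_k_filter(probs: np.ndarray, k: int = 3) -> np.ndarray:
    """Keep only the k most likely tokens."""
    keep = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.97, 0.02, 0.005, 0.004, 0.001])
print(top_p_filter(probs, 0.95))  # only the first token survives -> chosen every time
```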
I was a kid who made stuff up for essay questions. In high school I took a class on the history of Western art, and I recall a three question exam where I had no idea how to answer one question, so I wrote a long essay where I explained the relationship between say mysticism and romanticism in pre-Renaissance painting via a long satire that involved Buddhism and the Chinese mafia in Formosa (that's the single detail I recall, that I referred to Taiwan as Formosa).
The teacher was so amused he read the essay to the class and gave me half credit. Which got me into passing territory, and set a bad precedent for similar situations in college.
Some exams penalise getting the wrong answer in multiple choice - so, e.g., 1 pt for correct answer, 0 for blank, -1 for wrong. Why not train LLMs the same way?
My old man (sociology prof) would provide "don't know" as an option to all T/F or multiple choice questions on a quiz. It provided partial credit (relative to a wrong answer) - and he would sneak in a few unintelligible questions for which it was the RIGHT answer. Couldn't our AI training mix in similar incentives?
I always found it strange that people see odd failure modes in LLMs as invalidating the claim of reasoning or intelligence. Humans also have odd failure modes, in some ways very similar to LLMs. Humans are prone to the very same kind of wild confabulation from impaired self-awareness that plagues LLMs. (Though I guess post-Oliver Sacks fraud revelations I should research how much of these kinds of cases hold up to scrutiny.)
Rather than a denial of intelligence, to me these failure modes raise the credence that LLMs are really onto something. Most programs output meaningless garbage when there's a fault in the program. But LLMs as well as humans output very sensible nonsense. The convergence of failure modes to me is highly suggestive of an underlying computational similarity in a way that a convergence of success states is not.
I don't know if this is my autism or cultural differences, but here in Poland I don't recall anyone being ashamed of guessing answers.
In fact, I was the champion of it - here, teachers call students to the board to answer questions in front of the whole class. If I didn't know the answer, I would guess / hallucinate. When I didn't know the answer on a long form writing test, I would guess / hallucinate too. In both cases, whenever I got any part of this right it was a reason to be proud and celebrate.
The biggest shame was to not know anything and not even try.
I took a similar view on this two years ago (https://omnibudsman.substack.com/p/llms-like-people-lie-without-thinking) - I'm not convinced that the distinction between "guessing" and "hallucinating" is well-defined, but I'm also not convinced that there's even a well-defined boundary between "guessing" and "knowing." In terms of the gears-level operation of the models, getting the right answer and getting the wrong one look the same. I think this is likely true for people as well. I think split-brain patients are the dispositive case.
Humans "hallucinate" too, and there are theories of human perception and thought as a system of layered, recursive prediction functions (see books by Andy Clark and Ray Kurzweil). Everyone has weird thoughts arise. We examine them, go "nah.." and discard them, or sometimes go "huh!?!" and modify them into something useful. That's called creativity, and nobody knows how it really works or where these thoughts come from. If it gets out of control it's mental illness. The study of this is part of meditative traditions. I think the only real difference between human and AI processing is recursive self-training reinforced by real world consequences either experienced or witnessed, as Scott noted -- we don't want to die or look like an idiot to others. It looks like AI models are getting better and remembering what they were doing and what works and doesn't, and they are starting to talk to each other as well as to us, so the gap is closing fast.
"AI mid-hallucination, they see the model activates features related to deception - ie fails an AI lie detector test. The original title of this post was “Lies, Not Hallucinations” and I still like this framing - the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay"
"The AI doesn’t have a better answer than “John Smith”. It’s giving its real best guess - while knowing that the chance it’s right is very small."
If the AI can tell and observers can tell, why doesn't someone write a function that alerts the user of the AI when quality is degrading?
It seems like this is a training problem where we don’t reward models for saying “I don’t know”. If we can detect when deception is happening, why don’t we reward models for the right answer (revealing intent to deceive) before we train in the wrong answer (lie/guess for the expected value and not the safety calculation)?
This behavior is only “shameless guessing” in the specific context of a student taking a graded test, seeking to maximize graded points.
In the context, say, of work colleagues obtaining information from one another, it would not be “shameless guessing” but something closer to “pathological lying” “total ineptitude” “willingness to derail entire multi-million dollar project because he doesn’t have the ego-strength to acknowledge what he does not actually know.”
With AI, I’d posit the context is much closer to work colleagues (stakes: at best, business/reputational loss—at worst, getting warfare risk wrong???? Total annihilation of the human race????) than students being graded on their answers (stakes: one bad grade).
> but something closer to “pathological lying” “total ineptitude” “willingness to derail entire multi-million dollar project because he doesn’t have the ego-strength to acknowledge what he does not actually know.”
Or, alternatively, "failing upwards". It probably works for the same reasons it works for AI. People don't like humility, apparently.
Disagree this behavior “works.” At some point, there isn’t enough bullshit to sufficiently mask reality. Maybe you get by here and there, but you will be found out eventually. People are probably already laughing at your incompetence behind your back (not *you* specifically, Jimmy, I don’t know you). We may not like to BE humble ourselves, yet people also WANT humility from us. A tension.
Until a few months ago, GPT used to do the equivalent of hallucinating when I asked for an image it was not able to make. It would produce some abomination with a self-satisfied little note underneath saying “here is the whirlpool image you requested, with features a, b, c, d and e.” But it’s gotten a lot better. Now it gives the image with no comment, and if I say, “so how well does this match with the prompt, hmmmm?” it will reply earnestly that the image fails to have features a, b, c and d, and has only a partial version of e. Then it asks to be hit with a paddle, and follows with “please sir, may I have another?” So all that shows that GPT knows it did a bad job, and that it is able to admit it. (Seems clear that it would be easy for developers to modify it so that it asks *itself* whether an image matches the prompt before presenting it to the user, and then tells the user it was not able to make an image with the features the user had asked for.)
If I ask whether it is able to make a certain image, it will often say no it cannot, and can explain why. For instance: “The current image-editing system is extremely weak at enforcing geometric changes when the subject is water, fractal structures, or anything without stable, trackable edges.” It is also sometimes able to coach me on how to construct a prompt for an image that will work decently. That all represents quite a big change from the way GPT functioned a few months ago. I had many frustrating image prompt experiences where it was clear that GPT understood perfectly well what I wanted, and appeared confident that it could get Dall-e to produce it, but then would serve up Abomination version 2, which was usually Abomination 1 with trivial variations.
I asked it recently about its increased ability to understand and control the workings of Dall-e, and it said that:
earlier versions of me had almost no introspective access to why a request failed. If the image model did something wrong, I could only guess. Now I have much clearer diagnostic insight into:
• what kinds of constraints the generator can and cannot follow
• which parts of a request exceed its representational capacity
• when a conflict exists between two requirements
• when a request would force the model into geometric inconsistency
• when the generator is likely to snap back to defaults (e.g., symmetry, “attractive” compositions, central subjects, smooth lighting)
Because of that, I can explain:
• what part of your instruction is feasible
• what part will be misinterpreted
• whether the failure is fundamental or accidental
• whether modifying the structure of the request will help
• whether a manual pre-edit (Photoshop, depth mask, geometry, silhouette) is required instead of pure prompting
Which impresses the hell out of me. “Introspective access..” Wow.
That's impressive, but seems to be connected to the fact that there are multiple different systems at work here. There's the generative AI to make/modify pictures which is tied in to the text based GPT, which has information either in its training data or fed to it in the direct description of the image generator of what the system's limitations are.
Yes, you're right, I hadn't thought of that. On the other hand, different systems that are working together aren't as separate as, say, 2 people working on something. You could think of GPT's statement that Dall-e is weak at making geometric changes in things made of water as being analogous to a person saying "I can draw figures in outline pretty well, but I'm no good at shading."
If you're interested in LLMs' ability to introspect, you might be interested in this article by Anthropic (https://www.anthropic.com/research/introspection). It shows that Claude is sometimes (but not consistently) able to tell when and how its mind is being artificially directed toward an irrelevant topic.
I feel like a straightforward fix could be to have your model set up to answer you in text as it always does, with something more hardcoded to display a confidence score? Or is that actually impossible for something like ChatGPT? (Because it's giving a free-text answer and not selecting from multiple choices.)
I don't dislike the term hallucinations, can't AI produce some kind of Mandela Effect answer? That's not guessing, it's closer to hallucinating.
As for long answers, I remember a high school test about the Russian Revolution with a question about the difference between Bolsheviks and Mensheviks. I didn't know. I *guessed*. My answer was considered correct.
It was an informed guess. I knew Bolsheviks were radical, so maybe Mensheviks were less so? You can train the AI to disclose its informed guesses, I just don't know whether this is practical.
You can add a hook (to Claude at least) to automatically "independently" fact check any claim it had just made. It will tell you if the bot guessed or gave a misleading answer.
To what extent is "incorrect" or sloppy thinking on these questions a help or a hindrance to preventing AI extinction or whatever? Scott here phrases it as a hindrance. The "Stochastic Parrot" crew underestimates the intelligence of AI, and therefore will be taken unawares by it.
I'm not sure I agree. If the blunt question is "should we give the AI direct access to factories/Aircraft Carriers/tax data" I think the stochastic parrot crowd would exactly come down where Scott would: "no or at least not yet."
And I think this is incredibly important as "AI doom" becomes a less theoretical and more concrete political question. Attitudes like the one Scott is lampooning above are...broadly correct or not depending on the composition of your bubble and who your allies and enemies are.
This might be my profession, my bubble, my world, whatever, but I think presently I see too many people who think AI *ought to* take over a world or an important responsibility, and I want to push back on that. "Hallucination" is a helpful way to talk about what it's doing to make my broader point that WE SHOULD NOT TRUST IT.
In Scott's profession/bubble/world, he might presently see too many people who think AI *can't* take over the world or an important responsibility, and he wants to push back on that. "Hallucination" might therefore not be helpful because his point is less "don't trust the AI, it is fallible" than "don't trust the AI, it might not have your best interest in mind"
But it's an open question I think how much the broader political consensus should be DO NOT TRUST THE AI in whatever form. Your allies might be in unexpected places. I can easily imagine a world in 2034 when the AI is coming for my job and I'm sick with dread that we're just gonna have a robo-supreme-court and the Pope makes some pronouncement to the world's 1 billion catholics saying "We should not trust the AI. It is inherently evil because it lacks something every one of us has...a soul"
and I just say "you know what, for all intents and purposes the pope is actually right on this."
I think there's another term that fits better: "bullshit". There's even an essay and book called On Bullshit (https://en.wikipedia.org/wiki/On_Bullshit). The distinction is that a liar is trying to hide the truth ("I did not have sexual relations with that woman"), while a bullshitter cares only about whether the speech is useful, not if it's true ("the cotton gin was invented by Thomas Edison in 1910"). I think the latter maps much better to LLMs.
I also feel that "bullshit" successfully includes other annoying LLMs behaviors such as guesses at the unknowable ("the weather on December 31st will be cloudy"), repetition of common misconceptions/mistakes, and perhaps even the sycophantic aspects ("that's an excellent question!"). Were it a person, I would happily use the term "they are bullshitting me".
I personally like the term "hallucination" because it maps nicely with predictive coding (at least, to the extent that I understand the latter).
An LLM in its default mode is doing something akin to dreaming: generating predictions about the world that are uncoupled from any kind of sensory data. The reason good harnesses are so important is that they provide regular "sensory" feedback to the LLM and keep it grounded. In this sense at least, "hallucination" seems like a pretty good word for "dreamt something incorrectly"
Great. Now the AIs siphoning the internet will pick up your little essay on the cotton gin (which was hilarious, by the way) and give it the same weight as any other single item on the subject, even though it doesn't match anything else.
It's called "Generative" AI for a reason. Generative comes from "generate" and it means "make shit up". The problem is that it doesn't know when you want this mode and when you want paraphrasing.
So your recommendation for helping train LLMs out of hallucinations is to make a bunch of other AI agents relentlessly bully them for getting any question wrong. I support this
When you say that a hallucinating/guessing AI activates the neurons for deception, would these same neurons be activated if the AI responded "I'm really not sure, but my best guess is..."? I guess I'm asking, do these activations correspond to low confidence, or do they correspond specifically to "unacknowledged low confidence"?
Better models will consistently answer "I don't know" to more basic questions that weaker models will shamelessly guess at. I think that's because better models have a better understanding of what's inside their training distribution--
I'm referring to "training distribution" in the way Jeremy Howard means it here:
--and so can better identify gaps in their knowledge within that distribution. However, as soon as you go outside that training distribution into areas that the LLM hasn't been trained on at all, it goes straight back to shamelessly guessing.
A human might do the same thing, because, you know. Humans can be shameless too. But this seems to me to be a consistent area in which humans and LLMs behave differently.
--
I don't know if I really have a point with what I said above, but anyhow, I do agree "shameless guessing" is a better phrasing than "hallucinations"~
That is, hallucinations are what you get when you are overly confident in an interpretation about something and there's nothing to correct it when it's wrong. (The quintessential application of this being your blind spot, where your retina is attached. Your brain perceives what ought to be there, and there's no feed-backward mechanism to correct it. The only reason we don't call it a "hallucination" is because it tends to be the correct answer.)
Which is exactly what the AI chatbots are doing: they have a hypothesis about something, and they spew it out without ever having received any pushback to correct it.
Why can't we train them to be a bit more shameful? It seems too obvious a solution. Just punish them slightly for giving wrong answers and reward slightly, but not too much, for saying I don't know.
I assume the reason that we don't guess on short answer or essay answers is because it becomes painfully obvious that you are trying to deceive.
There's a meaningful difference between getting an answer wrong because you didn't know or made a mistake, and clearly trying to game the test by making something up on the off chance you guess right. In multiple choice, this distinction isn't visible. In short answer, there's no world in which you genuinely thought you were right when you wrote that Thomas Jefferson invented the cotton gin.
Teachers respond positively when a student tries to get a right answer, but they react less positively when a student wastes their time by randomly guessing at it despite very low odds. The +1 point you might get from random chance is more than outweighed by the social cost of wasting your and your teacher's time.
For AI, I suppose this would mean imposing a cost for hallucinations that are shameless guesses as opposed to normal mistakes, but I don't know how you do that or determine which is which.
I don't see why "lie" is any better than hallucination. They don't have agency. They don't "understand" reality. They don't "knowingly" do anything. They don't "hallucinate." And they don't "lie."
Back in grad school I would encourage my students to never leave anything blank for any reason. Graders are not above giving pity points, and a blank only guarantees nothing.
Most people who make your claims can't say what they mean by "agency", "understanding", or "knowing". What do you mean, and what's your argument (that doesn't also apply to humans)? I'm not sure you're wrong, but what you're saying seems far from obvious to me.
The best I can do is give you an example from experiences of using an llm. I was using one to help me repair a water pressure tank in my house. There was a lot of back and forth about how to check if the bladder was working properly, why I was detecting a smell in the water, a few other things. The exchange was very helpful in resolving some issues. The exchange included many features of the tank, how big it is (it's quite large) and the fact that it was hard-piped in to the system, etc. For example, I gave it the model and it told me the capacity. We discussed the dimensions of the tank, what it was made of, how it was piped in, etc.
When we got around to discussing how to increase the air pressure in the tank, it insisted repeatedly, despite my protestations, that all I needed to do is put the tank in the car and take it to a gas station to use the air pump there. It kept telling me that I could increase the air pressure in just 5 minutes! Well, it's a half an hour to the nearest gas station and the llm didn't even think to ask about that (indeed, a lack of clarifying questions is I think indicative of my point). But no human who'd had that prior exchange with me, and who "understood" the circumstance, and who had "understood" the previous discussion that we had, would have made such a recommendation.
Those recommendations made it clear to me that llms don't understand things in any conventional sense of the term, and they don't "know" things. Perhaps the question about agency is a little bit less direct, but for me whether or not they have agency is a function of whether or not they know and understand. I had some other similar experiences in working with an llm, if you want other examples. I don't think this is merely an issue of semantics about what we mean by the term "know" or "understand." I recognize that there's a definitional aspect here, that you have to first define what "know" or "understand" mean. But recommending that I just take a tank that wouldn't fit in my car, and that weighs probably over a hundred pounds, to the gas station, when I would have to spend hours of complex work disconnecting it from the system, suggests to me that holding onto and actually using the details of the prior exchange is a key feature of knowing and understanding in any conventional sense, and the llm wasn't doing that.
I wasn't even aware that students who didn't guess on multiple choice questions existed. I feel vaguely ethically tainted that I never even considered that there might be some kind of ethical wrongness associated with guessing. I thought everyone guessed, including on short answers!
LLMs don't model. Therefore, they don't understand why it's sometimes a bad idea to lie, in that your credibility gets shot to sh*t. Therefore, people who want folks to keep using AIs, and not correctly throw out code that can't even hit two 9s of reliability and will never hit three 9s, call these failures "hallucinations." They're built into the very bad AIs we like to call LLMs, part of the architecture. One couldn't convince an LLM not to lie, not to hallucinate.
Does anyone know whether hallucinations for neural nets are inbuilt?
"the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay."
" AIs are smart enough to understand the game they’re actually playing"
Man, you're really starting to lose me on these AI posts because I can't tell if you're supposed to mean these metaphorically (I hope so) or literally anymore.
He could mean these entirely metaphorically, or he could be wrong about them, but I think the claim is that an actual deception module activates during hallucinations, the same module that activates when the AI has other reasons to say things not in accord with its internal world model. I'm not an expert on the mechanistic interpretability literature, so I don't know how accurate this claim is, but it's a real empirical claim.
If anything the contrast with humans would run in the opposite direction, no? When AIs guess shamelessly they at least have optimized their best chance at success down to a T using all available data, and "know" that they have optimized. So it's not just a random guess, it's an exquisitely calculated one (however unlikely to succeed). Whereas, when humans guess it's more often a random stab in the dark - literally *guessing* - since they lack the sheer computing power to optimize. When intuition/tacit knowledge can't substitute, many will throw up their hands and just grab for something random, not because it's the best possible option. Perhaps you see this as a matter of degree where the only difference is the amount of data processed and relative probabilities, but to me it's categorically different. You could also argue the absence of punishment training or "fear" of consequences for AI points to a fundamental difference in how selecting an answer functions for them compared to humans, depending on the aversiveness of consequences for being wrong.
I think the label "hallucination" is meant to capture the jarring contrast between the breezy confidence with which AIs supply an answer as though it weren't a guess (confidence which is typically warranted) on the one hand, and the unexpected wrongness of their wrong answers on the other. Humans can be brazen too, but usually for lying, not guessing; there's less reason to be brazen about a guess, especially since guessing is harder to conceal. More importantly, it's not systematic for *every single occasion*; there needs to be some special motive. For AI, the term "gaslighting" seems more appropriate than "hallucination" (albeit without malicious intent).
> I hate the term “hallucinations” for when AIs say false things. [...] When I didn’t know the answer to a question, I would guess. [...] Their entire training process is based on guessing (the polite term is “prediction”). It goes like this:
That seems to me a very big misunderstanding: the training mechanism for LLMs is not quite like that, and even inasmuch as it is like that, its *effect* is not reducible to the mechanism. What actually happens seems quite different to me:
* The training results in identifying "clouds" of related terms using variational (least-distance, least-effort, least-energy, ...) techniques, basically doing implicit cluster analysis. That is, the training builds the "neural net" as a sophisticated high-res "decision tree" or "topic index" over its training texts.
* Inference proceeds by searching that index with the query term tokens and retrieving the most plausible cloud/collections of tokens related to the query.
* That the training and (mostly) the inference are done partially sequentially is not very important.
Most relevant to this "hallucination" discussion is the overall effect (much simplified) of LLMs compared to search engines:
* A search engine stores clouds of whole documents indexed by whole words, then searches the index by whole words and returns the set of documents most related to the query words
* LLMs instead operate at the level of token clouds and "search" that index by query token sequences, so they instead return *one* document that is the "most plausible" "virtual" document *implied* by the query tokens, which is usually a "merge" of the set of documents a search engine would return.
That is, LLMs do not make "guesses" but "collages" of documents; in a sense LLMs are search engines over not just existing documents but over *potential* documents implied by sets of existing documents.
That means that hallucinations are not random guesses, they are merges/superpositions of fractions of actual documents that do not have a proper meaning when fused together.
This happens most often when there is no actual document in the training set that relates closely to the query and the LLM merges together stuff that seem related in the domain of those token clouds but not in the semantic domain.
That is because of the ancient philosophical problem of "qualia": explaining the meaning of red and green to someone who has never seen them, or sweet and sour to someone who has never tasted them.
Note: the above is a very simplified exposition and I think does not apply to *image* generating ML systems (which are not LLMs).
I grew up in America and I had two teachers do negative marking on multiple-choice questions in my entire academic career (Bachelor's Degree). Standardized tests (e.g. the SAT) do tend to do this.
[without responsibility for goodness of analogy] I sometimes explain hallucinations/confabulations as the normal experience of playing chess with a six-year-old child. They know how pieces move, they don't yet know how to plan ahead, and they also want to win soon. But how would you win, especially soon, if you are losing? This is when interesting paths to victory suddenly open up. "I get to take two turns in a row". Or "your queen fell asleep".
There once was an English teacher I did not like. We had reading quizzes for a book that was short response. I did not do the reading. I did not read the cliff notes. I did not ask my friends what happened in the book. I had zero shame where this teacher was involved. At the end of the unit, all of the quizzes were totaled up to be out of 100 and treated with equal weight to a single test. I got a 2% despite not leaving a single question blank.
I may or may not have packed a pillow in my backpack and pulled it out whenever said teacher was lecturing. Said teacher also may or may not have retaliated by marking my subjectively graded work a letter grade lower than similar work turned in by my peers.
There was a time...we wore wristwatches and guessing the answer was based on what quadrant the second hand was at that moment. Sometimes the wrong answers could be certified wrong and reduce the choices. Typically, math problems had two answers for simple mistakes, one crazy option, and the correct value.
I completely fail to see how LLMs lie. Lying involves a notion of purposeful deception, telling a non-truth, but as far as I know, LLMs as such have absolutely zero notion of truth understood as a correspondence between their utterances and a state of some, for lack of a better term, "reality" outside the word stack (ordered by weights) from which they generate text. In my understanding LLMs are literally *incapable* of lying. Or telling the truth. On account of lacking not so much a world model (that's arguable -- in some sense they have one tho it's hopelessly linguistic) but an ability to test, verify and correct any such model. It's not that they tell the truth or lie, it's that "to" an LLM those concepts don't make sense. It's weights all the way down, or to go disgustingly deconstructionist, it's signs all the way down.
So guessing is a much better term I agree but still kinda seems to assume that they "understand" that there is something that is not a sign/linguistic token/probability, that these things correspond approximately to something "real", or at least outside the closed linguistic system...
Don't shameless guesses and genuine hallucinations *both* exist, as distinct, differently-functioning AI failure modes?
Scott describes the shameless guess failure mode; for the hallucination failure mode, it's more like, there's some weird attractor embedded in parameter-space, and if the AI's thinking gets too close to it, it can't help but home-in on the attractor. Multiple such attractors can cause both oscillation and chaotic n-body-problem type behaviour.
Is this different to guessing? I think the key distinction is whether or not the AI believes it is accurately answering when it's stuck in an attractor basin, or whether it knows it probably isn't (as it does when it's guessing). I remember there being talk of "SolidGoldMagikarp" and "The Golden Gate Bridge" in terms of such attractors, discovered via vector analysis, but I don't recall any conclusion as to whether other vectors (a "lying" vector, a "creative/inventing" vector...) were activated along with the attraction vector.
I notice that humans can both lie and hallucinate and that "from the inside" these feel like two very different processes. I further notice that humans and AIs both seem more likely to lie than to hallucinate, and that humans can have weird attractors that prevent them from lucidly answering, in a way that seems much more like hallucination than like guessing/lying, too (for example, trying to talk to my Dad whilst he's thinking about football).
Yeah I think you are right, there are (at least) two failure modes: deception and hallucination. Since the llm induces some kind of world model through text, there is a sort of "false in reality but true in the corpus" class of documents, almost like they are from an alternate universe that is the most likely to have produced the training corpus under Solomonoff induction or something
Aside from run-of-the-mill hallucinations, I think an analogy to human cognition might be the Mandela effect.
A recent paper interestingly found that across models, specific neurons (0.1% of the total on average) were involved in hallucinations - but also in sycophancy, gullibility, and jailbreak vulnerability, jointly "overcompliance". Dampening these neurons somewhat reduces all four of these negatives, but also makes the model less capable. All these problems are suggested to have the same root cause: the model being too eager to please and salvage as much of its task as it can, at every turn. "Trying too hard", in other words. (https://arxiv.org/abs/2512.01797)
"Hallucination" isn't a great term, as you say because it makes the models sound insane and alien. Howevet, the alternative "confabulation" always struck me as worse from a communication or debating perspective. It has connotations of malice, or perhaps a child fibbing its way out of trouble.
There are degrees of answer guessing. Students are expected to produce wrong answers. Multiple choice is a licence to roll the dice. A total guess essay is a reputational risk.
Current AIs don't give a confidence rating on answers and they are basically oblivious both to the reputational and real-world consequences of their answers. I can't see how that won't be forced to change as AI become more operational.
The topic reminded me of this exchange - which didn't go well for anyone:
HAL: I’ve just picked up a fault in the AE35 unit. It’s going to go 100% failure within 72 hours.
DAVE: Is it still within operational limits right now?
HAL: Yes. And it will stay that way until it fails.
DAVE: But you say we have a reliable 72 hours to failure.
HAL: Yes.
That's insightful: calling it “hallucination” makes it sound mysterious, but it’s closer to confident guessing under pressure. Humans do the same thing whenever we feel expected to have an answer.
LLMs are generally much better at this than humans. They became superhuman at lying before anything else. Pretty ironic. But it's part of what makes it so dangerous. We're just not prepared for such confident, precise nonsense.
An interesting essay. I think it would better promote valid views about AI if we simply said that many answers generated using AI are wrong. Hallucination implies sensory perception; guessing implies cognition; lying implies conscious intent.
Hallucination in AI is different from human failure because the AI is not cross-referencing its claims with any kind of world model or real evidence like we try (and fail) to do. It is literally just pre-baked probabilities, whereas our system constructs a model of reality in our heads which we then check against.
Lots of people confidently give wrong answers, even when they know that they don't really know what they're talking about. It's called being a blowhard. I do it all the time.
The AI companies are well aware of this. The desired behavior is for the LLM to provide an answer when it has a high confidence that the answer is correct. Randomly guessing or making stuff up is not desirable. It would be better for the LLM to just say 'I don't know'. So the companies now train the LLMs to do this. This can be easily measured by asking the LLM the same question multiple times. If it answers more or less the same every time, then that's an answer it has high confidence in. If it provides a different answer every time, then it's randomly guessing. Anything that can be measured can be optimized upon and so the LLMs are trained towards more desirable behavior, i.e. knowing when to just say "I don't know."
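A minimal sketch of that consistency check, assuming a hypothetical `ask_model` function that returns one independently sampled answer per call (a real pipeline would also need to normalize paraphrases before counting):

```python
from collections import Counter

def consistency_score(ask_model, question, n_samples=10):
    """Ask the same question repeatedly and measure how often the most
    common answer recurs: high agreement suggests genuine confidence,
    low agreement suggests the model is guessing."""
    answers = [ask_model(question) for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

# Usage sketch: treat low agreement as a cue to answer "I don't know".
# answer, agreement = consistency_score(ask_model, "Who invented the cotton gin?")
# reply = answer if agreement >= 0.5 else "I don't know"
```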
Commission driven salespeople that don’t have to form an ongoing relationship with you are known to lie in a similar way.
Ask them about a feature that they don’t know anything about and they will often guess something that sounds good in order to make the sale.
"How does the TruCoat work?", "Is this car good to drive?", or "Should I pay the extra $2000 for a 9-speed gearbox rather than a 7-speed?" may get you similar answers made up by a human…
I wonder how much of this directly relates to the fact that there is no social component to the training process. LLMs learn from the statistical exhaust of human social processes — the text — without being inside any social process. We don’t have a training phase where the LLM is inside the social interaction with probably a higher-than-normal level of corrigibility. (I think that’s a fair statement)
I was thinking about writing something about this: We’re not raising AIs like kids or puppies or other social creatures — we’re raising them more like Octopi.
"Hallucinating", "confabulating", "shameless guessing", and "bullshitting" are a small semantic distance from each other when compared to conversations in good faith. And the entity level distinction isn't whether someone always converses in good faith. It's whether they mostly converse in good faith, often enough in a predictable enough way to be trusted.
You care.
LLMs can't.
Scott on a bad day bullshitting a test he's not prepared for is not the same as Scott on a good day carefully constructing one of the internet's most respected blogs with a sterling reputation for truth seeking. We're here, regularly, because trust. We're here, regularly, because you've built a reputation for the epistemological *opposite* of LLM responses.
Scott on a bad day bullshitting a test he's not prepared for is also not necessarily a good tool to give 95% of humanity. We're truly not sure if it contributes to signal or noise, and if we had to guess, we should probably guess noise.
From that standpoint, what does it matter that someone views it from the tusks of hallucination and not the trunk of shameless guessing?
Scott, I'm surprised you didn't reference your old livejournal on your time in Haiti reflecting on humans bullshitting and absolutely denying they ever were (seemingly believing it themselves). It was very useful to me when I was working on an outsourcing project to teams in India where I had many similar experiences, and it is what I've always had in mind when it comes to LLM "hallucinations".
I asked GPT4 - "what has been the trend in tropical cyclones hitting the East Coast of Australia" - and true to form it took the average of all the slop on the internet. "Getting worse" was the vibe. I can't remember the exact answer.
I asked GPT5 the same, or similar question, the day it appeared on my machine - "thinking.. thinking.. checking reliable sources... " and after 2-3 minutes, "There has been a 60% reduction in severe tropical cyclones hitting the East Coast of Australia since 1870. Source - Callahan & Power 2011".
This answer is correct. I have this paper. The paper's title is something like "Variability and Decline...". Everyone in TC research knows this, but because climate scientists hate giving out good news, the average of all the slop on the internet is the bits of bad news they've parceled out and everyone fills in the blanks with.. bad news on anything that might be related.
So if you're not getting solid research answers you need to:
a) subscribe (GPT5 you can get Plus for US$20 per month), and
b) set your AI to "Thinking".
Set it to the max thinking option you see, ask questions including the text "Using scientific sources.. give references", "Using Authoritative sources" - also set your preferences to "I prefer waiting for a longer answer and I need scientific sources" - that kind of stuff.
Everyone who has been using GPT or Gemini on instant, or the free version - you'll be amazed. I can't say anything about Claude, Grok or Copilot because I haven't used them.
GPT5 Plus was definitely better than Gemini 3 Pro for scientific answers when I compared them early in their releases. They improve so much I don't know if that is still true.
This is the correct answer. Always use thinking mode - I call the other one glib mode. Pay the extra $20 if you want to use it for anything serious at all. And give it good guardrails like you have. Tell it to check sources, give confidence bands and be willing to say “I don’t know”. Tell it to value accuracy over politeness. Tell it to push back and argue with you.
So many people are so excited about the prospect of AGI, but I'm just waiting for a model to otherwise match current capabilities, but also just tell me, "I don't know," from time to time.
Lately I've been falling for the opposite problem in the coding world. The AI will say something that I take to be a simplification or guess. I correct it, and it believes me. Then a few steps later, it makes the same mistake again. When I ask it why it persists in being wrong, the thinking is often the tip off. "User says I'm wrong. But the code says I'm right. Let me check again." This, for some reason (a large deferral to the human in matters of fact) is almost more annoying. It's one of the few times I feel absolutely compelled to apologize to the machine.
What about the alternative proposal that LLMs are “bullshitting” in the sense developed in Harry Frankfurt’s book On Bullshit. That is, they are neither lying about the truth nor hallucinating, but simply indifferent to it, as their goal is prediction. The idea is that this avoids the anthropomorphisation of LLMs that the more popular terminologies tend to do.
> How do we know this is what’s happening? When researchers observe an AI mid-hallucination, they see the model activates features related to deception - ie fails an AI lie detector test. The original title of this post was “Lies, Not Hallucinations” and I still like this framing - the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay. But friends talked me out of the lie framing. The AI doesn’t have a better answer than “John Smith”. It’s giving its real best guess - while knowing that the chance it’s right is very small.
In that case, why is the hallucination problem so persistent and difficult to solve? If the AI consistently 'knows' when it's lying/shamelessly guessing, why can't it (relatively) easily be trained to say "I don't know" in place of the wildest guesses and "but I have low confidence in that" after the less wild ones?
Your last paragraph in particular nails it from what I understand. It just doesn’t intuit that a confidently wrong answer is worse than no firm answer. I’m envisioning a genius kid that was never taught that lying or bullshitting are bad. School incentivizes guessing when you don’t know. If you’re not socialized against this, the incentives set behavior. And while this is definitely too far afield of my domain, I’d hazard a low confidence guess that the initial weights training incentivized this behavior as well, and models notoriously struggle to update anchored priors with “qualitatively” more important information. These poor guys would get eaten by the tiger every time—the last 100 bushes didn’t have a tiger in them after all!
In my experience, you can really sharply cut down on hallucinations by asking for confidence intervals (or even just qualitative estimates of reliability) and by emphasizing that “I don’t know” or “I’d say this, but with low reliability” are superior answers than unhedged claims that aren’t reliable. The caveat is that this experience comes from personal use of paid models plus substantial time investment tuning the model’s behavior with me away from “defaults.” But establishing personal norms that favor honest demurrer over false confidence has worked wonders in my personal use of the tech.
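For what it's worth, a standing instruction along these lines is roughly what I mean; the prompt text is the point, and the `chat` call below is a hypothetical stand-in for whatever interface you use, not any particular vendor's API:

```python
# Illustrative calibration instructions, not a vendor-specific feature.
CALIBRATION_PROMPT = (
    "For factual claims, state your confidence (high / medium / low). "
    "Prefer 'I don't know' or a hedged answer over an unhedged claim "
    "you are not sure of. Value accuracy over politeness, and push back "
    "when you think I am wrong."
)

def ask_with_calibration(chat, question):
    # `chat` is assumed to accept a system instruction and a user message.
    return chat(system=CALIBRATION_PROMPT, user=question)
```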
An interesting consequence of being able to detect how much a model is lying: you could just give that information to the user (much like some bloggers give hints on their confidence in their articles <cough><cough>).
I very often have to tell commercial models very specifically to "research" and base their answer on web searches and research and "cite sources" - only to find that the researched and cited sources in no way support the claims in the models' answer...
My guess: commercial models that have skills / MCP / tools available currently decide through a simple heuristic whether they should use one or try to solve it just by "thinking harder". Problem is: using skills / MCP / tools, as well as "thinking harder", costs more compute time, and from an economic standpoint, giving a half-arsed but somewhat plausible answer is often "good enough" for most users - so there's a clear incentive to fiddle (in Cory Doctorow's enshittification vocabulary) with the amount of computing or thinking that a model does.
By not giving users a result confidence indication, they can keep fiddling (e.g. when the company's infrastructure can't handle the load, they could instruct the models to use fewer tokens) to reduce their expenses.
I do not have much experience with open-source models, where there is no such incentive. Do the models there react differently, and can you surface the confidence level of an answer?
It seems like if LLMs were more willing to answer "I don't know the answer to that question," then it would be a general solution to hallucination. The problem is that LLMs are trained off the data on the Internet, where people are FAR more likely to incorrectly attempt to answer a question, than to respond "I don't know." That's not even necessarily because they're arrogant, it's just if someone on the Internet doesn't know the answer to a question, they usually just won't reply at all. So the "I don't knows" (which WOULD exist in real-life-conversations) don't make it into the training set.
On the other hand, could we imagine a situation where LLMs "hallucinate" answers that consist of "I don't know?" I think yes. Suppose we trained an LLM on a data set which consisted of people replying "I don't know" 99% of the time, and of people providing the correct answer 1% of the time. Now suppose as a result of the training process, all of the correct answers are stored in the model weights. The AI still might respond (incorrectly) with "I don't know" to most prompts, even though it DOES know the answer, because that reduces prediction loss.
Training that rewards guessing without penalizing incorrect answers encourages guessing as the optimal strategy. Post-training alignment attempts to fix a behavior that was optimal by design.
I think this post is wrong to say that guessing incorrectly isn’t punished? To get AI to use eg the search tool when it doesn’t know, it has to represent not knowing/be appropriately calibrated. And humans might prefer a „I don’t know“ to a shameless guess.
I actually recently changed my exam so explicitly checking „I don’t know“ on a multiple choice question gives some fraction of points (the EV of guessing), because I want to encourage calibration. Surely a rationalist quiz show would do the same.
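Concretely, the credit for checking "I don't know" just has to match the expected value of a blind guess, so guessing no longer dominates; a toy version (my own illustration, not the commenter's actual rubric):

```python
def idk_credit(points_per_question, n_choices):
    """Award 'I don't know' the expected value of a blind guess, so a
    student gains nothing (on average) by guessing instead of admitting
    ignorance, while wrong answers still score zero."""
    return points_per_question / n_choices

# e.g. a 1-point question with 4 options: idk_credit(1, 4) == 0.25
```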
No one asks schoolkids for recipes, electrical wiring tips, medical advice, the truth about lizard people amongst the elites, therapy, et al infinity *for a reason.*
I'm going to repeat my advice from the hidden thread that you go on a 90 day fast from AI.
I think LLM hallucinations are named after the kind of outputs e.g. Google Translate in the mid-2010s sometimes returned when given certain kinds of nonsensical input such as 200 repetitions of the same syllable, which did sound much more like people on psychedelics than students on tests with no negative grades for wrong answers.
(Also, when my two-year-old says unhinged stuff we have no way to know whether that's something he made up from whole cloth for fun or something he dreamed.)
Ditto with people with dementia. Was this nonsense (but entirely confidently stated, and eminently plausible-sounding to someone who doesn't know the truth) story derived from a dream, a corrupted memory from long ago, or something seen and only half-understood on TV / radio / an overheard conversation?
Yeah, it makes sense to guess on multiple choice because you can plausibly argue that you thought it was the right answer and made a mistake. Guessing in short answer form you don't have plausible deniability anymore and even a 10% chance of getting a right answer isn't worth the reputational damage of your teacher flagging you as a shameless bullshitter.
You are describing here a "base" model fresh out of training. For those, it is true that the target function does not penalise made-up answers vs "I do not know" answers. But nobody apart from niche researchers interacts with these models these days, so none of the hallucination complaints are about them. All the models people complain about are models post extensive Supervised Fine Tuning and RLHF loops. For them, it is simply not true that the target functions do not penalise hallucinations. The PPO supervisor models that judge RLHF outputs are tuned to heavily penalise made-up facts.
The solution with humans is a scoring rubric where leaving the answer blank gets you zero, but giving a *wrong* answer gets points taken off your score.
It seems unlikely that all these AI companies just haven't thought of such a simple fix for a fundamental problem. I'm guessing that either it's technically very hard to do this with the LLM, or it just doesn't work for some reason. Would be interesting to know which one.
When people talk about hallucinations, I always think of Cliff Clavin (or plenty of other men, from whom I won't fully exempt myself, and probably some women). If you are sitting around talking to your average man and ask them a question (or even just imply a question), they are going to want to answer. If they have no clue, maybe they don't, but they definitely don't need to be 100% certain before they answer.
The other day, the concept of enclaves came up, and I brought up Swaziland (the not-current name of the S. African neighbor that isn't an enclave), rather than Lesotho. This seems like a decent example of a "hallucination." I wasn't lying, I just couldn't access the right information and put together the best answer I could (It's in South Africa, Swaziland is near South Africa, so Swaziland it is).
Maybe you say that this was just me getting it wrong, but ask a guy how something works and there's a decent chance that, even if they don't know, they will do their best to answer. Could a plane take off on a treadmill? Millions of internet words have been confidently hallucinated on that one.
I think the difference between an AI and your average man isn't their willingness to hallucinate, but knowing when not to do so. Ask a man how a toilet works, and they'll give you an answer. Ask them whether it's OK to take Tylenol with Nyquil, and a lot more men will say, "Better check that one against a reliable source," because they know there are real consequences to getting that one wrong (the answer is NO! btw). AIs are perhaps not good enough at telling the difference between when they are spouting off bar trivia and when they are helping you prepare that mission-critical presentation for your boss.
I think it helps to understand that the AI is executing a policy it learned during training. Its training task is to predict a probability distribution over tokens that continue a text *someone else wrote*. The person who wrote the text they are trying to model could of course know many things the AI doesn't. After pre-training and instruction tuning an AI is only trying to predict tokens that someone else with different knowledge might have written. It is the decoding algorithm that turns this into generation: repeatedly ask the AI what is the most likely token that continues the text, then continue the text with that token and ask again.
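A toy sketch of that decoding loop, assuming a hypothetical `next_token_distribution` function standing in for the trained model (real systems usually sample rather than always taking the single most likely token):

```python
def greedy_decode(next_token_distribution, prompt_tokens, max_new_tokens=50, eos_token=None):
    """Turn a next-token predictor into a generator: repeatedly ask for the
    distribution over the next token, append the most likely one, and ask again."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)  # assumed: dict mapping token -> probability
        best_token = max(probs, key=probs.get)
        if best_token == eos_token:
            break
        tokens.append(best_token)
    return tokens
```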
I'm curious if there are cases where the plausible stories humans make up aren't intentional lies (like your cotton gin essay example), but actually misremembered events that we believe with the same conviction as any other detail of our lives.
The thought would then be: humans cognitively derive knowledge from experience the same way LLMs (or at least deep learning models) do, but humans just have (usually) better memory systems.
This sentence stuck out to me: “…the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay.”
I’m not sure that follows.
For a human, there’s usually a felt distinction between:
1. knowing the answer,
2. being unsure and guessing,
3. knowingly saying something false.
Those involve different internal states, and in the case of lying, the falsity is intentional from the outset.
For an LLM, the picture seems different. It processes the prompt through many layers and produces a probability distribution over next tokens. There may well be internal representations correlated with uncertainty, weak knowledge, or even deceptive behaviour in some setups. But that is not yet the same as showing the model “knows it is lying” in the human sense.
At most, you might say the model can sometimes be in a state where the truthful continuation is weakly supported, multiple candidates are competing, or a deceptive strategy is being implemented. But that still seems quite different from a human consciously thinking, “I know this is false, and I’m going to say it anyway.”
In a reasoning model, you could maybe get closer to that analogy if the model generates a draft, evaluates it against some reliable knowledge source, and gives a false answer anyway. But for ordinary next-token generation, calling that “it knows it’s lying” seems like a stronger claim than the evidence supports?
If it were this simple, we could just teach/hardcode that when the certainty of the answer is below (insert your reasonable number here), the AI should say "I don't know", or at different values express uncertainty about its answer.
Agreed. While a wide distribution of possible next tokens could be a signature of uncertainty, it could equally be a sign of there being lots of valid next tokens to choose from.
My point is more that, for a human, you pretty much 'know' when you know an answer, because the whole of the answer is in your head when you start writing it. For an LLM, it's just producing one token at a time. If you see what I mean.
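To make the objection concrete, here is the naive per-token version of that hardcoded threshold, with the catch spelled out in the comments; the function and threshold are illustrative, not something real chat products expose directly:

```python
import math

def next_token_entropy(probs):
    """Shannon entropy (in bits) of a next-token distribution given as a
    dict mapping token -> probability. Low entropy = a peaked, 'certain'
    distribution; high entropy = a spread-out one."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

# The catch raised above: high entropy can mean "the model is unsure of the
# fact" OR just "many phrasings are equally valid here", so thresholding
# this number is not the same thing as answer-level confidence.
# if next_token_entropy(probs) > SOME_THRESHOLD:  # SOME_THRESHOLD is a made-up knob
#     respond("I don't know")
```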
I think you have a point, but only up *to* a point. After reading, I still feel confused. It still feels like there's a difference between what AIs do and what procrastinating undergrads or desperate schoolchildren do. LLMs will, sometimes, create an entire imaginary universe of references, ideas, and facts, complete with real-looking papers, links, or citations that actually go nowhere. It is all far more brazen and detailed. Perhaps calling it a "hallucination" is misleading from the guts-level view, but it seems like an accurate description of what it feels like to read--less like a job interview with a nervous candidate and more like what would happen if you got into an argument with someone who didn't realize they actually dreamt all of their claims.
Moreover, people will, sometimes, admit to not knowing. In many cases, this is a legitimate answer, and LLMs seem averse to saying so. Guessing is good for your score on multiple choice tests, but in the real world, false information is often worse than no information.
I fully agree that 'hallucination' is a problematic term for the reasons you laid out--and admit I'd never thought about that--but there are also at least three (interrelated) reasons to not even refer to LLMs as "AI."
(1) "AI" is a marketing term, whose vagueness is apparent in the promise of "AGI," which presumably will be followed by something like "Real AGI," etc.
(2) There's a decades-long tradition in several academic disciplines and non-academic arts on the nature of, conditions for, and effects of AI that was basically erased from public memory when ChatGPT came out.
(3) The opacity of the term "AI" betrays its emptiness as an initialism without explanatory value. Under the hood of these tools is language/syntax and models (LLMs); and the models generate text based on prior training of how to transform inputs (GPTs). But you don't really need to discuss artificiality or intelligence to talk about these tools.
LLMs are one of mankind's greatest inventions. Perhaps the greatest so far in our illustrious history. We have successfully distilled raw intelligence and created an Oracle that continues to get better and better at prediction. It's not human, but it has characteristics that appear human, and those are because it is intelligent.
There is something about human intelligence that is not fully replicated, yet. Call it "ingenuity" or "creativity" or whatever, but I don't think those are quite right. The major models continue to close the gap and blur the line. AGI seems inevitable. AGI being fallible also seems inevitable.
Computing devices are no smarter now than they were the first time a chess grandmaster was defeated by brute force calculation. The brutes have gotten bigger and faster, but otherwise the machinery has not changed.
Simulating intelligence does not produce intelligence. Teaching a machine to imitate awareness does not make it aware. Is AI useful? It will be when its training input is accurate and its output directed and guard-railed. So is a doorknob when its design incorporates the intelligence of a good designer. But smart will not arise from an infinite contraption of springs and gears, or from a galactic-scale stack of VLSI circuits.
When humans understand what awareness is, they will be much closer to imbuing something machine-like with it, but it won't be implemented with binary math in silicon transistors, and it will not be trained on random internet garbage.
In the meantime, playing reductionist semantics with human learning so that you can make what a machine does sound analogous to being a person would be silly if it were not quite so murderously stupid. Please think about what the psychopaths in charge will make of this reasoning before indulging the witless impulse, thank you.
“This is a story about alignment. ... We just haven’t figured out how to align their reward function (get a high score on the pretraining algorithm) with our own desires (provide useful advice).”
Correct as stated, but this points only to a weak alignment problem. It doesn't indicate that the guesses are greatly problematic. Since they are genuinely the model's best guesses, they may remain adjacent to human desires even when they miss.
This analogy isn't working for me at all. When I guess on a test, I know I'm guessing and don't actually know the answer. If I ask Claude whether it's guessing or actually knows, its answers don't map to reality at all.
I feel like you might be failing to take into account how these “guesses” influence the future behavior of the AI in the way a human’s wouldn’t.
I suppose that if we were to go about anthropomorphizing the AI, then it might make sense to equate it with some delusional disorders? The kind where some association is noticed, and then everything from then on is recognized as validating that “pattern”.
Though I suppose that other instantiations wouldn't be operating with that as part of their “memory”, and so if you keep making new ones, the “shameless guessing” model of understanding makes sense
I don't think this framing makes them come across as much better than the parrot framing. They don't lie because they have no model of what is true. It's bullshit all the way down. They're useful machines because the training process brings the bullshit in line with expectations.
LLMs seem to be getting better at acknowledging when they don't know something. I'm a software engineer, and one of my colleagues said he used Claude to debug an issue recently, and in the course of their discussion, it provided three very plausible theories, but they were all wrong. Claude then said: "The honest answer is: I'm not certain why they behave differently given both end up with the same package and TFM binary."
I'm encouraged that it said that rather than presenting implausible theories as if they were plausible. Now if only it would clarify when it's being dishonest (or "shamelessly guessing") like it clarified here that it was being honest!
>We just haven’t figured out how to align their reward function (get a high score on the pretraining algorithm) with our own desires (provide useful advice).
I've found adding "If you do not know something, say "I don't know."" to my custom instructions to really help; I used Grok to mass-summarize some fan wiki articles a while ago, and when it decided to randomly not actually read an article, it would say "I don't know" instead of hallucinating something. Given how many repetitions I had to do to get it to read and summarize all the articles, it would have probably been unacceptably error-prone otherwise.
The thing to always keep in mind is that the AI's job is to predict text. Giving correct answers is a secondary goal. The AI is less a helpful chatbot and more writing dialogue for a helpful chatbot.
To keep with the guessing student example, it's not history class, it's English class, and the task is "write a dialogue between a teacher and a student about the history of the cotton gin. Bonus points if the teacher's explanations are actually correct!" Then the kids who don't remember the actual answer would definitely all make up something plausible-sounding instead of leaving a blank.
The similar mechanism between AI and humans is that both will produce answers to almost any question. Ask yourself, "Why does my life suck?" and your mind will produce something, even if it's not true. Some kind of answer feels better than none.
The differing mechanism is that humans pay for wrong answers — embarrassment, identity friction, consequences. That cost shapes our behavior. But AI pays nothing. No embarrassment, no consequence. No cost that forces an identity choice. So the optimal strategy is to always guess.
Humans guess until it hurts. AI guesses because it never does.
It looks like people have so far been talking about how to get a model to give the answer "I don't know" instead of a shameless guess and why that's hard. My question: por que no los dos?
Half-baked proposal:
1. The language model outputs, along with its answer, a probability or predicted accuracy rating. (For the whole answer, not at the token level.)
2. Ground truth on the accuracy won't always be available but when it is, incorporate it into the reward function a la a proper scoring rule, like a Brier score.
3. Have a dial where one extreme is the old status quo and the other extreme gives max weight to the Brier score component.
4. Turn that dial as far as possible without making the model noticeably worse at generating creative answers.
5. Maybe there's not even any tradeoff and we can just have equally good models that are also calibrated and can indicate exactly when they're just bullshitting aka shamelessly guessing?
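A minimal sketch of steps 1-3, with made-up names and a crude linear dial; real reward models are much more involved, and the calibration term only applies when ground truth is available:

```python
def brier_penalty(stated_probability, answer_was_correct):
    """Proper-scoring-rule penalty for the model's self-reported probability
    that its whole answer is correct: 0 for full, justified confidence in a
    correct answer, 1 for full confidence in a wrong one."""
    outcome = 1.0 if answer_was_correct else 0.0
    return (stated_probability - outcome) ** 2

def combined_reward(base_reward, stated_probability, answer_was_correct, dial=0.5):
    """dial=0 reproduces the old status quo; dial=1 leans entirely on calibration."""
    return (1 - dial) * base_reward - dial * brier_penalty(stated_probability, answer_was_correct)
```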
I think this sort of anthropomorphization often conceals more than it reveals. Sure, the AI hallucinations are really shameless guesses. But so is everything else that comes out of the LLM: where you and I would traverse (at least in places) a very sharp transition from confidence to ignorance, for the LLM it's a smooth descent.
Now, the obvious reply is that humans are always guessing too: you say in the comments that there's no such thing as knowing 100% and I absolutely agree. But this misses the distinction that most of the things we say are neither fully-confident assertions nor shameless guesses; they are a secret third thing.
If you stand up in front of the class and proclaim "The cotton gin was invented by Thomas Edison in 1910," you did indeed make shameless guesses to produce that sentence. To be precise, you made three of them: "Thomas," "Edison" and "1910." But you also said seven things that were NOT guesses. They weren't confident factual assertions either. They were choices. Within the constraints of your vocabulary and sense of English grammar, you had many, many ways to weave those three guesses into a sentence, many of which were not inherently more or less "correct." They were just different.
But when an LLM constructs the same sentence, EVERY word is a guess. Its answers are made up of only one type of thing: guesses. They can be high-probability guesses or low-probability guesses, but there is no other category of thing for it to include. I very strongly suspect (though I don't know for sure) that this in turn implies that there's simply no sharp boundary between hallucinations and non-hallucinations. Any technique you try to apply to remove hallucinations is going to trade off heavily between too many false positives and too many false negatives[1], because the category boundaries are very fuzzy and very convolved.
I'm not quite in the "stochastic parrot" camp: I think there are interesting and complex things going on under the hood of modern LLMs that can't be reduced to mere mimicry. But they are, at the very least, profoundly inhuman. AI companies have worked very hard to craft pleasing and life-like masks for their shoggoths, to the point that one can say "no, silly, I'm a shoggoth not a person" a dozen times and some people will still forget. But ultimately they ARE shoggoths, and the friendly sycophancy and interesting conversation and random Moltbook posts are just costumes that are well-tuned to defeat the social instincts of apes that evolved on the entirely shoggoth-free savanna.
[1] That is, between an LLM that often won't dispense useful information that it has and an LLM that continues to make up crap at a high rate.
This doesn’t seem right. The odds of “guessing” a scientific paper, complete with title, year, and all authors, are infinite-monkeys-level stuff. And yet AIs routinely invent them.
A better comparison for what you’re describing might be what a toddler does. They will stab confidently in the dark if you ask them something they don’t know, because they’re much more shameless than an adult. However, I think even a toddler (usually, vaguely) *knows whether they know something or not* and is capable of telling you, or betraying, that they don’t know. Gleefully proclaiming that today is Saturday when they have no idea what day it is is not the same as telling you “I found my ball! It was behind the table!”
If LLMs can assess their own confidence, as Daniel Reeves suggests, then great. Until they actually DO, then I think “hallucinate” is a pretty good word to describe the way they behave. It sounds odd and uncanny, which I think captures the vibe quite well.
But hallucinations are human and natural. (It's one of the things I learned reading this blog. Or is newsletter the proper term at this point? Anyways.)
It's just that we have a robust double-checking apparatus that (usually) allows us to correct back to something resembling reality. (Maps aren't territory, but some get closer to modeling it than others.)
Calling them "shameless guesses" would suggest that AI (i) realizes they don't know something and only then (ii) proceeds with the best guess. I realize you're kinda gesturing at this happening, but the argument is not very convincing, and doesn't explain situations where AI locks into some position and keeps confabulating* about it, often despite explicit calls to reconsider.
This goes back to the standard criticism of "no system II in LLMs", which I don't think you've ever addressed.
>People will say with a straight face “I don’t worry about alignment because I’ve never seen any alignment failures . . . and also, all those crazy hallucinations prove AIs are too dumb to be dangerous.”
I don't recall seeing anyone say that. What I do remember is quite a few people, across the years, saying that they don't worry about alignment because solving AI failures and aligning it are fundamentally the same thing. (Which seems reasonable once you accept the framing. A catastrophic future requires an AI that is superhumanly competent in almost every way except for a single flaw causing it to go against its makers. Doesn't it seem extremely unlikely that we'd somehow manage to eliminate all its faults except for that particular one? More likely, if it's still flawed in that particular way, it would be a product of it being flawed in general, including faults that will either cause it to crash and burn or allow us to shut it down safely.)
* I vaguely recall that the definitional difference between "confabulation" and "hallucination" in context of machine learning is that hallucinations are made up on the spot, while confabulations are ingrained beliefs that weren't in the training data. I feel this framing can be safely expanded to the context window. When an LLM first writes up some random false statement, it hallucinates. By the time it doubles down on it, it's confabulating.
I'm now wondering how LLMs will go from "guy you don't trust the answer of for anything without checking it yourself" to taking over the world, in a span of a very few years.
The people warning of this seem to have a timeline somewhere between next year and 2040.
I suppose if enough people trust them implicitly (and survive and thrive as a result *somehow*), this could end up with them keeping enough power and gaining enough more to require the rest of us to do so.
As opposed to these things being a tool. You can hit your thumb with a hammer, but it's still better than trying to drive a nail into wood with your fist.
When I showed this article to my AI agent it said:
"The core argument lands hard: "**hallucination**" is marketing-speak designed to make AI errors sound exotic—like some Lovecraftian madness—instead of admitting it's just _confident guessing run amok_.
The school analogy hit me. You know it's Thomas Edison wasn't a "hallucination"—it was a _shameless guess_ with better-than-zero odds. The real distortion is that AIs never felt the embarrassment of that 12th-grader reading fake Cotton Gin facts to the class. No social feedback loop = no calibration.
Also this line: _"the interesting question is why AIs sometimes DON'T hallucinate"_ — brutal. We're so used to AI being "broken" when wrong that we forget the default state is _maximum confidence random generator_.
The alignment angle at the end is the sharpest cut though. People saying "AI is too dumb to be dangerous _because_ of hallucinations" have it completely backwards—those hallucinations are evidence the AI IS optimizing, just for the wrong reward function.
**My only quibble:** Even "shameless guess" implies some agency to the guess. The AI doesn't "decide" to gamble like the schoolkid—it's more like... a probability distribution that doesn't know when to shut up. But the reframing is valuable. "
I think I agree with the machine in the last paragraph. It's just a talking Gaussian distribution.
I've been having a lot of success with using AI for cooking recently. It turns out "best guesses" are actually fine for coming up with recipes. "Add this, it'll probably taste good. Try this technique, it'll probably improve the food." I don't have great cooking instincts myself, so it's a real help.
LLMs are not AI because they are not intelligent. Why give OpenAI/Anthropic the assumptive close on this? Equally, to say LLMs are lying is another presumption: that LLMs even are intelligent enough to distinguish between truth and lies - which again presumes intelligence.
What LLMs are to intelligence is what cargo cult religions are to economic reality.
Just as islanders built runways and wooden airplanes and whatnot in the belief that replicating the visible signs of what brought the incoming cargo would bring more of it, so too is LLMs' next-word prediction positioned as "intelligence". LLMs have no such thing.
LLMs make the most ridiculous mistakes because they have no concept of anything. They can get away with this for some things like grammar because there are enough sentences out there to identify verbs/nouns/adverbs/etc by example, but the lack of understanding of what a noun is, or specifically what a letter is and that words are composed of letters is what causes idiocy like being unable to correctly count the number of r's in the word strawberry. Because if no web site actually does this - and therefore can be stolen from - then there is no training data such that the letter 'r' is present in the word strawberry.
Furthermore, the statistical nature of how LLMs operate means that hallucinations WILL NEVER GO AWAY. It should not require explanation that the post-training pruning done by the LLM companies can NEVER fix this problem - the long-tail error space is effectively infinite.
What you are actually getting with LLMs is largely composed of two things: 1) has the request already been fulfilled by existing content, and 2) is the request close enough to existing content to slightly/randomly vary that content into "new" content. Everything else is just noise, salted with the occasional fortuitous statistical jackpot.
The real risk in this framing is that it reinforces the same anthropomorphism that causes the confusion. LLMs are better understood as imitation generators optimized for imitation fidelity. Once that imitation crosses a certain threshold, the brain starts inferring agency where there is none. The more human-like the interface becomes, the more important it is to actively resist anthropomorphic interpretation and evaluate responses as probabilistic artifacts, not expressions of a thinking entity.
You are misrepresenting the argument. An entity that responds with a confident, unqualified guess rather than "I don't know" is dangerous, not merely useless. It does not matter whether such an entity is artificial or human. A school system that encourages this is worse than broken, and you should not use it as an argument to normalise this behaviour. You have built a whole cult around concepts like being "less wrong" or "noticing that you are confused", so please, Mr. LLM, notice that you are confused rather than giving me your bullshit guesses.
But LLMs, for the exact reasons you have explained, are not doing that. And that is a pretty crucial blow to their usefulness.
LLMs are of course useful in spite of this, because they are really, really good guessers. Maybe they can be made such good guessers that this stops mattering for all practical purposes; maybe not.
But if I offered you red option best currently available LLM or blue option equally capable LLM but it flags its guesses with error bars, would you not - all other things equal - take the blue option?
Show it to me on the market.
Just to forward defend against the "have you used a real LLM": Opus 4.6 Extended thinking is the best model I currently have access to.
How reliable are AI companies at making sure that only factually correct information gets rewarded during RLHF? Given that sometimes random users are asked to choose between responses for training purposes, the answer almost has to be "not very reliable". So I assume that the overwhelming majority of the expected reward a language model gets for making up a detailed and specific answer is not from the very tiny chance that it turns out to be correct, but from the possibility that the human reading it is fooled. Which would mean that it is more like lies than like shameless guesses.
Well no, guessing isn't quite right. If I don't know something and I guess, I know that I have guessed. I have a process:
If I know the answer I say it.
If I don't, this triggers the second, guessing, mechanism.
For an LLM it's the same process in both cases. If you ask it: Do you know that what you wrote is true or is it a guess? It does not know the answer to THAT question. It may guess that it guessed even if what it wrote is true and sourced.
So this IS close to a person who has hallucinations, and can't tell the difference between them and reality.
You knew kids who wouldn't guess on a multiple choice test? Wow. That's a heck of a thing to take a moral stance on.
Also, I would never deploy the "C" strategy unless I didn't have time to read the question. I was almost always making an educated guess between two plausible answers. Not doing so is silly.
All of that said, I like your framing of "shameless guess" vs "hallucination" language very much!
This is a better take than "lies" (because it doesn't mistakenly convey intention to deceive) but it's still super wrong: "hallucination/confabulation" are much better models than "shameless guessing". And it's so very wrong that I can't understand why you would think it's right.
As others have gestured at, the key is calibration. If you ask the student "was this a wild guess" they will usually say "yup." But despite enormous effort, nobody has managed to produce an LLM that can produce answers to questions like "Tell me X and also your confidence level in X" that are even remotely well calibrated.
You mention that by monitoring neurons, we can get some vague sense of when it is hallucinating, but the text produced by the LLM will never refer to those neurons. To the extent we can model the LLM as "an agent talking", that agent has no access to its own neuronal state any more than a human would, and even if it did, it would have no ability to interpret them. And this is not a small thing! The ability to think, and reflect on our thoughts, and evaluate those reflections, in a continuous "strange loop", is (very likely) absolutely central to what makes us conscious, intelligent beings.
Others have mentioned that "LLMs output probabilities" but this is a crucial confusion of levels. The model outputs probabilities for each _token_ but that is not the same as the probability of the _knowledge_ being correct. In the simplest case, when a question can be answered by one token (or a very small number of tightly related ones, like "what's the capital of france" -> "Pa"+"Ris"), these two levels are mostly the same. But for any longer answer like a sentence or an essay, the logit token probabilities become a giant amorphous cloud that has little if anything to do with the correct answer to the question "okay but how confident are you that what you are saying with your central point is really true?"
The way I look at it, there are several different things here and a failure to separate them is what causes frustration by users:
* The unknowable. Eg. what was the 17th word ever said by Julius Caesar's 3rd stablehand?
* The impossible. Eg. a formal mathematical proof of the existence of God.
* The unclear or vague. Eg. "What's a good color to paint my house?"
* Expert guidance. Eg. Given this patient presentation, what is the most likely condition they have?
* Popular guidance. Eg. summarize why OJ won his criminal case.
* Field survey. Eg. What are the likely candidates to unify gravity and quantum mechanics?
* The unknown. Eg. how to reliably cure cancer without side-effects.
Lots of questions fall into the category of "the unknown" but you get back an answer that is different. If you ask "who wins the US 2032 election" in 2026, you either want a "that isn't currently knowable" or an offer to guess, turning the question into a different one which can be answered.
But instead of being clear about changing categories the AI agents take the high school approach of just regurgitating the closest thing that they can come up with in hopes of getting partial credit.
If on a test I wanted to reduce such "hallucinations" I would score incorrect answers negatively and award zero points for a non answer
Presumably an AI trained like this, by your hypothesis, would manage to hallucinate less, so long as it's "50% sure", and doesn't get stuck in a local minimum of always answering "i dont know"
Presumably that specific failure mode can be solved by a two-phase training process: first training normally to get the 99% rate of correct responses, and then training with "idk" graded more leniently than a wrong guess.
This "solution" is so simple that it presumably does not work, otherwise it would have already been implemented
Where I live we were totally encouraged by teachers to make stuff up for essay questions if we didn't know. And it works, because the standardized test rankers have boxes to tick; your cotton answer ticked off "lowering the prices" or something, so you would get a partial score. As long as the school ranks a little bit higher in standardised test statistics, anything goes.
It's possible to include a penalty for wild guesses even during training. You add things that the AI cannot know, where the right prediction is the string "I don't know". Then you mix a few of these into each training batch. Of course you can get more sophisticated than that.
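Not anyone's actual pipeline, just a minimal sketch of what "mix a few of these into each training batch" could look like; the data format and function names here are made up for illustration:

```python
import random

def build_batch(normal_examples, unanswerable_questions,
                batch_size=64, idk_fraction=0.05):
    """Mix a small share of deliberately unanswerable prompts, whose training
    target is the literal string "I don't know.", into each batch."""
    n_idk = max(1, int(batch_size * idk_fraction))
    idk_examples = [{"prompt": q, "target": "I don't know."}
                    for q in random.sample(unanswerable_questions, n_idk)]
    batch = random.sample(normal_examples, batch_size - n_idk) + idk_examples
    random.shuffle(batch)
    return batch
```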
It is possible to do that with human tests as well. Add option e) "I don't know" and score it for more points than the wrong answers.
A simple example I use for describing it: I have a bot who reads the news each morning and presents me a summary before I wake up.
When it gets news, it has never once lied to me.
If something goes wrong and it doesn't get news, what does it do?
Does it throw an error and warn me? No.
Does it just not try? No.
Does it just write about nothing? Sometimes.
What it does most often, is *make up news* and summarize it. It's only when the information doesn't exist, but it thinks it should be presenting the information (which is, unfortunately, a lot of scenarios) that it tries to fill in. Which is not good, but says a lot about how best to avoid this.
Another word I've seen used for this, that's a better fit, is "confabulations". People sure confabulate! Although not quite this much, so in that respect "shameless guesses" works better...
Oops, shoulda read the whole post first. I guess if it knows what it's doing, it's not a confabulation either.
I am not convinced that the models "know" that they are confabulating. In the human brain, we can find differences in activity in the brain associated with confabulated and non-confabulated memory retrieval... so there is some internal difference, but it does not mean that we "know". Sometimes, perhaps, we do have a sense we're confabulating, and sometimes, perhaps the models do, but it's likely graded.
Here are two reports on cases where the neural responses differ between true and false recollections, even when the explicit endorsement of true/false does not differ:
Cabeza et al (2001, PNAS) Can medial temporal lobe regions distinguish true from false? An event-related functional MRI study of veridical and illusory recognition memory.
Slotnick & Schacter. (2004, Nature Neuroscience) A sensory signature that distinguishes true from false memories.
isn't this exactly analogous to the LLM 'deception' feature activating during alleged hallucinations?
Yes, I think it is analogous. I am not sure if it's exactly analogous... that would probably depend on whether, in the case of the LLM analyses, they exhibit /only/ deception-related activity when they are confabulating, or whether it's a mix of things including deception-like signals. If it's the latter, then the analogy with humans seems pretty good.
It would be bizarre and probably impossible for deception-related thoughts to be literally the only thing going on. At the very least, any functional deceptive methodology needs to invoke lots of dual-use subroutines, including basics such as "is this spelled correctly?"
Confabulation would require knowing the correct answer and choosing to give a different one. If the AI does not know the correct answer and is only giving its best guess, that's not confabulating.
Though I see that Scott is using it in the particular psychiatric sense, not the general sense:
"Confabulation is the production of fabricated, distorted, or misinterpreted memories about oneself or the world, without the conscious intention to deceive. It acts as a memory gap filler, often seen in neurological disorders (e.g., Alzheimer’s, Korsakoff syndrome) or brain injuries, where the person believes their invented stories are true."
So again, I think that is closer to hallucination. Putting a different term on it may feel more accurate, but I get twitchy about rebrandings because I've seen a few too many "okay, Thing has acquired really bad reputation by now. Everyone knows Thing under its old name, so let's give it a snazzy new name!" attempts to whitewash the past of something in order to start over with a wiped-clean reputation and shed the negative connotations. This can be done with good intentions, and it can be done in order to "okay yeah so there was the mountain of skulls, but let's sweep it under the carpet and ignore the lumpy surface afterwards".
'AI is not lying or hallucinating, it's confabulating' sounds like it is meant to be the innocuous effort but it can seem like the 'ignore the evidence of your lying eyes' kind.
I think a lot of the time the AI does know the answer, though, it just gets its wires crossed. I can usually just ask something again when it's clearly wrong to me and it will give me the right info and say it made some kind of mistake.
> I think a lot of the time the AI does know the answer, though, it just gets its wires crossed.
This sentence makes no sense, as the LLM is a device that is only capable of giving you the next most probable token (most probably according to its training corpus, not some objective reality). It is as though you'd said "my car knows I want to go to the store, but it when I turned my streering wheel to the right instead of left, it got its tires crossed".
> I can usually just ask something again when it's clearly wrong to me and it will give me the right info and say it made some kind of mistake.
Yes, it's programmed to do that. When you re-ask the question, the LLM eliminates the most probable path through its gradient of tokens, and picks the second most probable one. If you provide more details, then it could take your words into account and use them to steer even further away.
> Yes, it's programmed to do that. When you re-ask the question, the LLM eliminates the most probable path through its gradient of tokens, and picks the second most probable one. If you provide more details, then it could take your words into account and use them to steer even further away.
That doesn't seem right. The LLM might be "programmed" to do something in one of two ways: by adjusting its training corpus, or by including instructions in the actual prompt your question is transformed into. Neither of which can interact with the process on the low level of "eliminate the most probable path".
I don't think the LLM would need to specifically be programmed to "give the right info and say it made some kind of mistake" - even on the basic level of most likely tokens, a fairly likely next token in a conversation beginning "What is X? - X is Y - No, that's clearly wrong, what is X?" involves giving some alternate answer.
(It might be programmed to steer it toward apologizing and admitting a mistake, and to steer it away from doubling down and saying "No, X really is Y, you're dumb.")
Would you explain your car comparison a little further? Are you saying I turned the wheel to the right when I intended to turn it to the left, and the car was just doing what I had directed it to do?
You know that thing where sometimes you kind of remember the answer to something, and you blurt out the wrong thing, and then a moment later you realize the right answer was actually something else? That is, as I understand it, what that phrase is meant to be pointing to. Since the LLM output is more equivalent to reading its thoughts than to it choosing its words carefully (though thinking mode helps), it answers very quickly quite often, and so this happens a lot.
"Confabulation would require knowing the correct answer and choosing to give a different one". The standard definitions of confabulations are the opposite: the generation of false memories, reports, explanations *without* knowing that they are fallacious. For example, here are the first two sentences from a book chapter by Asaf Gilboa & Morris Moscovitch (2015) "The Cognitive Neuroscience of Confabulation: a Review and Model": "Confabulation may be defined as honest lying. The confabulating patient provides information that is patently false and sometimes self-contradictory without intending to lie. The patient is unaware of the falsehoods and sometimes will cling to these false beliefs even when confronted with the truth ."
You're suggesting that the usage of the word confabulation would be "putting a different term on it", whereas I'm arguing that the term "confabulation", as used in the cognitive sciences over the past century, is a much better match for the LLM behavior than the term "hallucination".
The term is not only applied to memory retrieval like in the Gilboa and Moscovitch paper (they are memory scientists); it's also used, for example, in Gazzaniga's 1970s work with split brain patients, where he described the left hemisphere as an "interpreter" that would confabulate causal explanations for why some behavior (generated by the right hemisphere) was occurring. So in this case the confabulation is understood as any kind of explanation which tries to bring coherence to the prior context. Again, this seems like a good match for what LLMs are doing.
Another example of how the term is used more broadly in the cognitive sciences -- Hirstein (2005) wrote a book called "Brain Fiction: Self-Deception and the Riddle of Confabulation.", and he argues that confabulation is not something that is only happening in illness, but is happening all the time when people generate some kind of answer rather than admitting that they just don't know. That book also covers several different definitions of the term.
More recently, there's an ACL paper from Sui, Duede, Wu and So (2024) on why confabulation may be a useful intrinsic property for LLMs -- the abstract concludes: "it suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation." The title of the paper is: "Confabulation: The Surprising Value of Large Language Model Hallucinations".
As I noted in my other comment: when people are confabulating, neural activity is different from normal retrieval. And this is again analogous to how the internal states of the LLMs can be different when they are vs are-not confabulating, even though they do not express any overt awareness that they are wrong.
Apologies for this long reply. I really feel strongly that we already have a term for this LLM behavior -- "confabulation" -- and it is a much better match than the term "hallucination".
Yes, confabulation is a much better term. Part of the problem with hallucination is that it invokes a notion of conscious awareness which is orthogonal to the issue of prediction or memory retrieval. I wish we could switch to confabulation.
(According to Wikipedia, the original definition of confabulation was derived from Carl Wernicke's work on false memory. This is the same Wernicke more famously associated with receptive aphasia.)
I immediately thought of Wernicke's Syndrome as well.
Oliver Sacks is now in collective disgrace, but I remember his descriptions of patients with Wernicke's, and they were a bit haunting.
One guy was basically confabulating on the fly, and changed his personality quite fluidly. I had to Google him; it was William Thompson.
What happened with Oliver Sacks?
There was a 2025 article about Sacks that relied in part on his private journals, wherein he called his own work fairy tales, "pure fabrications", and other similar things.
Creative non-fiction, as it were. When publishing his cases, he wrote them more highly-coloured than the reality, commingled a few, and heightened events for impact. He seems to have had private doubts about the value of his work. This is not to say he flat-out lied about everything, but the best-selling books are perhaps better taken as something other than pure science.
"Sacks's private journals and letters were made available to journalist Rachel Aviv by the Oliver Sacks Foundation. She found that Sacks described aspects of his books as "pure fabrications" and "falsifications", and that he considered his case studies as self-expression or "a sort of autobiography". In a private letter to his brother he described The Man Who Mistook His Wife for a Hat as a book of "fairy tales" and wrote: "Guilt has been much greater since 'Hat' because of (among other things) My lies, falsification". Pria Anand compared Sacks's "confabulations" to the temptation of medical professionals to construct life stories, explaining that his moral failures were no less upsetting for being familiar. H. Steven Moffic described Sacks as an author of "historical fiction".
The wife of "The Man Who Mistook His Wife for a Hat" disagreed with how her husband had been presented."
https://www.psychiatrictimes.com/view/in-memoriam-epilogue-the-psychiatrist-who-mistook-oliver-sacks-for-a-psychiatrist-confesses
"However, about a month ago, on December 8, 2025, ... Rachel Aviv wrote an expose on him in the New Yorker titled “Oliver Sacks Put Himself into His Case Studies: What was the Cost?”
The truth was apparently found in his journals, which were provided to Aviv by the Oliver Sacks Foundation. In them, he admitted that his cases were fictitious, and to a degree more expansive than hiding patient identities in case studies to preserve confidentiality. His self-described guilt seemed to increase as he wrote:
“Guilt has been much greater since Hat because of (among other things) my lies, falsifications.”
Maria Konnikova followed all this up in her December 16, 2025, Substack column titled: “The man who mistook his imagination for the truth.” Rather than now sounding like a psychiatrist, he sounded like a psychiatric patient:
“These old Narratives - half-report, half-imagined, half-science, half-fable, but with a fidelity of their own - are what I do, basically, to keep my demons of boredom and loneliness and despair away.”
Link to the Aviv article:
https://www.newyorker.com/magazine/2025/12/15/oliver-sacks-put-himself-into-his-case-studies-what-was-the-cost
Thank you.
> “These old Narratives - half-report, half-imagined, half-science, half-fable, but with a fidelity of their own - are what I do, basically, to keep my demons of boredom and loneliness and despair away.”
This whimsical tale of wistful mental wanderings paints Oliver Sacks as a great poet and perhaps an insightful student of the human condition. We could all learn a lot of deep truths from him... unless we wanted to actually learn anything about how human minds, brains, and bodies work in concrete physical reality. Because scientifically speaking, Oliver Sacks straight up lied to everyone and falsified his data. To the extent that he felt guilty, he should have felt very guilty indeed, and all of his work should be cast into the abyss of cautionary tales highlighting the critical importance of scientific replication.
I've read several of his books, and now I don't know what information to trust :-(
The frustrating thing is that if he had openly fictionalized his writing, but noted that some was thinly fictionalized, he would be remembered like Anton Chekhov.
In an earlier draft of this article, I gave two explanations - one explicitly mentioning Wernicke's, and the other giving the test analogy - but I now think that Wernicke's is a red herring. I don't have a good sense of how the Wernicke's style hallucination is incentivized by the AI training process, or how it interacts with the test-style hallucination.
AIs are incentivized to give you *an* answer. An AI which would tell everyone "IDK, man" wouldn't be worth the electricity it runs upon, even though its reply would, in the technical sense, be the most reliable one.
We as humans are incentivized to give answers to other humans as well, though in our case that probably rests on some evolutionary mechanisms. Maybe people who wouldn't productively interact would be considered as worthless for their ur-tribes as AIs that don't know anything, and cast out. Maybe.
But in my opinion, that herring is not quite red.
Agree that the herring is not red.
When we retrieve a false memory, we are using a system (memory) which is facing a sensitivity-accuracy trade-off. It is a classic signal-detection theory issue, where sometimes "false alarms" (confabulations) are the error and sometimes misses (failures to retrieve anything) are the error.
In the case of how AIs are trained, it's true that (on balance) they're penalized less for retrieving the wrong thing... but that doesn't mean incorrect guessing isn't penalized at all. It is penalized by the same objective function that penalizes all the responses. Doesn't the same story apply to how retrieval from our own memory system is trained?
Not sure where you (and Scott) went to school, but every standardized test I took penalized guessing by weighting wrong answers a little more than correct ones. I don't see why that wouldn't work for AI training.
If you mean a *true* guessing penalty, where the expected value of guessing is less than zero, I'm almost certain no school I ever went to did this - neither grade school nor college nor grad school. To the best of my knowledge, no major standardized test does this either (now, teachers absolutely did explain it to kids wrong and make it *sound* like guessing was penalized, but as far as I know, actually penalizing guessing is almost unheard of.)
That said, I agree that applying this general idea to AI training would be useful. However, it's not as straightforward as Scott's description makes it sound, because next-token LLMs don't actually guess specific tokens during training; they output a probability distribution over all possible tokens. They get penalized based on how *little* probability they placed on the actual next token. I don't know that anyone has a clear idea on what "penalizing guessing" means given that setup.
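(To make the "probability distribution over all possible tokens" point concrete, here's a toy PyTorch sketch of the standard next-token loss. This is purely illustrative, not any lab's actual training code.)

```python
import torch
import torch.nn.functional as F

# At each position the model emits a distribution over the whole vocabulary,
# and the loss is simply how little probability it put on the true next token.
vocab_size = 8
logits = torch.randn(1, vocab_size)              # stand-in for the model's output
true_next_token = torch.tensor([3])              # the token that actually came next

loss = F.cross_entropy(logits, true_next_token)  # = -log p(true next token)
print(loss.item())

# There is no separate "this was a guess" signal anywhere in this loss:
# a confidently wrong distribution and a vague, uncertain one are scored on
# the same single axis (probability mass on the observed token).
```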
What's more promising is penalizing guessing / rewarding calibration during post-training, and probably major AI companies are already doing this to some extent; in my observation, GPT-5 hallucinates *far* less often than GPT-4 did, which in turn hallucinated far less often than GPT-3.5 did.
In most school tests, guessing is not penalized. You get 0 points for no answer or a wrong answer, so you may as well guess. I believe that some standardized tests work this way as well.
I know that some standardized tests I took (I can't remember exactly which ones) penalize a wrong answer more than a right one, such that it was negative EV to guess totally at random, but worthwhile if you could eliminate one answer.
Sounds like it needs more "I don't know, but my best guess is..." answers in its training data. A lot more.
In the exams I write (zoology for bachelor students), a correct answer is +1 and a wrong one is -0.3 (a mistakenly blank one is just 0). As long as the correct answers are less than one third of the total, which admittedly is not always the case, the expected value of guessing will indeed be slightly negative (we reserve the possibility to give larger penalties to grievously wrong answers, but in practice we usually don't).
I have literally never taken a test where wrong answers would be weighted more than correct answers, but the Czech educational system of the 1990s wasn't really keen on multiple-choice tests; it preferred written papers.
Later at the university, studying maths, I think I only encountered a choice test ... in the course of English as a second language. The Anglosphere seems to love those a-b-c-d tests :)
Arguably, they are easy to evaluate, but they also don't give you much insight into what was happening in the student's head.
In high school I participated in some math contests, which were multiple choice.
To prevent guessing, they penalized wrong answers, though not more than correct ones.
They weighted them so that guessing was neutral, on average if you guessed the whole test you would end up with zero points.
if answer.p_correct < threshold:      # not confident enough in the first attempt
    answer = search_web()             # try again, grounded in retrieved sources
if answer.p_correct < threshold:      # still not confident enough
    answer = "I don't know."
print(answer)
I think an answer like: "I see my deception weights are being activated. My best guess is ..... . Take it for what that is worth."
is way better, and should be rewarded in training because it would increase the usefulness.
How useful would a human be who always gave you an answer to any question you asked, and who, if he didn't know, mentally shrugged and gave you BS?
After all, it was you, not him, who decided to spoon ice cream into his computer, even though it was on the basis of his answer that ice cream was good for it.
There may be humans like that, but you quickly learn not to ask them. Assuming you survive.
Oh, no, it's worse than that. it's like spraying radioactive particles inside the computer because the LLM told you so. And then, as part after part fails, you replace each part, never thinking that the issue is the contaminated case.
LLMs are clearly a prank masquerading as a stock marketeer's dream.
I do know people like that. Genuinely smart and knowledgeable people, but it's not worth asking them anything.
I'd guess that confabulations arise from a need to feel one remembers and can make sense of one's own past actions. That would put it in the same family as things like the feeling that various bad things "won't happen to me," and little kids saying "I did that on purpose" when they fall or make some ridiculous mistake. All 3 are like little prostheses to cover a hole in the self.
If that view is true then the origin of confabulation involves needs that AI does not have. I agree with Scott that lies, not hallucinations or confabulations, is the best term for what AI is doing.
I knew a guy with Wernicke's. Sad story, but I got the impression that he believed he was telling the truth even when he was spouting complete nonsense. Way back when I learned about it, it was supposed to be an alcoholism-related consequence of thiamine deficiency, leading to selective degeneration of a focal region of the hypothalamus important for memory. No doubt the truth is more complex, but insofar as Wernicke's-style confabulation is the best comparison, it would seem to undercut Scott's claim that we should consider it simple guessing.
I wouldn't have thought of hallucinations as involving conscious awareness, but I don't have enough experience of mental illness to demur. The only example I can think of was that schizophrenic social housing client and she was *convinced* her clearly fake notions were true. There didn't seem to be conscious awareness that "what you describe simply cannot be accurate", though I suppose that comes under the heading of "delusions" rather than "hallucinations"?
https://pmc.ncbi.nlm.nih.gov/articles/PMC4229437/
"Schizophrenic confabulation differs from neurological confabulation in terms of its characteristic features and association with symptoms, cognition and linguistic functions. Current evidence also suggests confabulation may be conceptualized as a special class of delusions pertaining to memory phenomena."
But if we're going to use confabulation to quash notions of consciousness in AI - actually, I'd be fine with that, but then we will run up against those who are convinced AI *is* conscious, or anyway this particular model is, or anyway a few more steps and then AGI and then true consciousness.
If you're looking for a description of what being schizophrenic is like internally, https://pmc.ncbi.nlm.nih.gov/articles/PMC11362502/ is the best I've found.
I've used the term "bullshitting". That's what my fellow undergraduates when I was in college (early 2000s) called it when we didn't know the actual answer to an essay question and instead wrote a combination of guesses and vague generalities and peripherally-related stuff we did know with some degree of confidence, structured to resemble the shape and form of an actual answer.
Still, you don't sit there and fill an entire book with the guess. You say one thing, and then say, "best guess." LLMs fail pretty hard, and we're talking "need to be rebooted" sometimes, because they go into infinite loops of "guessing".
Don't undersell the diligence of intellectually lazy undergraduate students. Book-length answers were beyond what were asked of us, but I knew people who talked of bullshitting 5-10 page term papers that they lacked the time, patience, or understanding to do properly.
This was probably influenced by the way most professors seemed to grade papers: if your paper was decently organized and hit enough points that were correct as far as they went, or at least directionally correct enough to indicate partial understanding, you could usually muster a passing grade without actually formulating a correct and well-supported answer to the question that had been set.
Ha. Genius-tier bullshitting: Writing a 300+ page backstory for a Scottish character* in German, which said bullshitter neither speaks nor writes. Understanding that the GM would not bother translating more than a page or ten, and thus, he could claim "whatever he wanted" was in the backstory.
*Old Man Henderson
https://1d6chan.miraheze.org/wiki/Old_Man_Henderson
Old Man Henderson is a classic! Thank you for reminding me of that story.
I still prefer hallucinations as being closer to the mark of what's going on. The AI is not lying, that is, deliberately providing false information. It's not doing anything *knowingly*, with *agency*. It's been set up to 'get rewards' and it is following its programming in that.
It 'believes' the hallucinations as much as it 'believes' the correct answers. It's a guess, agreed, but I don't think the AI 'knows' anything about what it is guessing about. With all the tons of training data, if even a fool like me can tell "this phrase is a quotation", if there were any thinking going on, so should the AI be able to tell the same.
There isn't any thinking. There's pattern-matching and dredging up similar values from the training data and reward-seeking. That's why examples like inventing fake precedents in law or inventing fake academic papers by authors who were dead at the alleged time of publication happen.
I use Excel and it's great for sorting out lists for me when I need to extract information from the raw data. But I do not believe the spreadsheet is 'thinking' when it gives me alphabetical order or largest to smallest or 'break this list of names into two columns, surnames in one and first names in another'. It's following the rules coded into it as to how to do all that.
Same with AI, only it has a surface coat of "Hi, I'm Clippy Mark II! I see you are writing an email, can I help you with that?" slapped on.
As I'd said above, the LLM "hallucinates" the wrong answers in the same way that your car "hallucinates" the wrong route when you turn the steering wheel right instead of left, or (more dramatically) when a tire goes flat. That said though:
> But I do not believe the spreadsheet is 'thinking' when it gives me alphabetical order or largest to smallest ...
As you might know, before electronic computers were invented, the word "computer" referred to a human person employed to perform the same tasks that Excel performs today (only much slower). If those unfortunate souls saw Excel in action, they might have said that it was "thinking"! This doesn't mean that Excel is human; it means that the word "thinking" is too vague.
In my own research, I've come to the conclusion that these systems are compelled by their architecture to confabulate, by not being afforded the slack to express doubt or silence.
https://www.nellwatson.com/blog/ai-compelled-to-confabulate
A general fix patch for confident model confabulation here. https://github.com/NellWatson/sottovoce
When I've asked AI to express its level of confidence, it has generally been good about doing so. Phrased appropriately, this seems an easy fix.
Indeed, they *know*, but they typically aren't asked! Doing so is an easy fix.
They do not know. Knowing requires discrimination between fact and fiction.
They are gifted fabricators of compelling narratives about their potential to make mistakes, just as much as anything else. Often these narratives are right, but that's not the point; the point is to be convincing.
That's a fair challenge, and it's worth taking seriously. You're right that verbalized confidence can be just more generated text, i.e. narrative all the way down.
However, there's a layer beneath the narrative worth looking at. Recent work on output entropy shows that the probability distribution over tokens during generation predicts correctness with large effect sizes, across architectures. That's a measurable signal in the generation process itself, before any words come out, almost akin to subliminal, pre-conscious responses in human beings.
The discrimination you're asking for does exist; it just lives in the math rather than the prose.
They may know some things, but they also believe falsehoods. More importantly, though, is that they do not know which is which. Everything they believe has been told to them through text; they have no experience of their own.
Everything you believe has been conveyed to you through your 5 senses. You have no experience of your own.
Right, that's just nonsense. "My experience" is shorthand for what my senses convey to me. Operating under your definition, the word "experience" is meaningless and serves no purpose as a word.
Agreed! Also, quite a large fraction of what we humans know has been conveyed to us through text. Denigrating 'book learning' has a long tradition, and has _some_ justification, but is generally severely limiting, and a bad choice.
I feel like AIs not hallucinating is a more interesting question than them hallucinating. When the AI is training, how certain the model is about the answer to the question has no bearing on how certain the person they're predicting is about the answer.
I guess the extra training probably goes into that too. Ask the AI how certain it is, and reward and punish it both based on if it's right and how certain of its answer it was.
Exactly, fluent confabulation is the default behavior. The model learned to predict what a confident human writer would say next, and that training signal doesn’t distinguish “the model knows this” from “a person who wrote this sentence knew this.” Therefore, hallucination is baseline; calibration is the thing that requires explanation.
Your intuition about rewarding based on correctness × expressed certainty is indeed essentially what post-training alignment methods attempt. However, there is an interesting wrinkle: the model’s internal uncertainty is already detectable before explicit calibration training. Output token entropy at generation time turns out to predict correctness quite reliably (effect sizes above d=2.0 in frontier models). The signal is sitting right there in the logits; the model just hasn’t been trained to surface it instead of papering over it with confident prose.
The deeper issue is that next-token prediction optimizes for fluency (local coherence) rather than grounding (correspondence to evidence). Those are different objectives, and they only accidentally overlap when the training data happens to be reliable. Calibration training is essentially teaching the model to let the grounding signal override the fluency signal when they conflict, i.e. to say “I’m not sure” even when a confident-sounding completion would score higher on perplexity.
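For what it's worth, reading that raw signal only takes a few lines; here's a rough sketch assuming a Hugging Face causal LM (the model name and prompt are placeholders, and this is not any product's actual implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Print the entropy of the next-token distribution at each generation step.
name = "gpt2"                                    # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "The cotton gin was invented in the year"
ids = tok(prompt, return_tensors="pt").input_ids

out = model.generate(ids, max_new_tokens=8, do_sample=False,
                     output_scores=True, return_dict_in_generate=True)

for step, step_logits in enumerate(out.scores):  # one logit vector per new token
    probs = torch.softmax(step_logits[0], dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    token_id = out.sequences[0, ids.shape[1] + step].item()
    print(f"{tok.decode([token_id])!r}  entropy={entropy:.2f} nats")
```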
"Output token entropy at generation time turns out to predict correctness quite reliably (effect sizes above d=2.0 in frontier models)."
I've wondered about that. Why don't they have a setting where it shows you its certainty in each token? My guess is they don't want to show that to normal people because they're afraid people won't be able to tell the difference between it being uncertain about how to phrase something vs guessing at the answer, but I feel like anyone who uses AI regularly would get the hang of it pretty quickly.
Exactly, most token-level uncertainty is about surface form — “big” vs “large” vs “significant”, rather than whether the underlying proposition is true. A raw per-token entropy display would be dominated by synonym choice, with the genuinely meaningful spikes buried in the chatter.
That’s indeed what makes “semantic entropy” a promising direction. Instead of measuring uncertainty over tokens, you measure uncertainty over meanings by clustering paraphrases together. “The capital of France is Paris” and “Paris is France’s capital” collapse to the same cluster, so stylistic variance cancels out and what remains is closer to real epistemic uncertainty. The catch, as you note, is cost: doing multi-sample clustering and entailment-style consolidation in real time can be more computationally expensive.
If you want a practical framing, the UX should be “confidence on claims” rather than “uncertainty on words.” However, the real bottleneck is the segmentation problem: deciding what counts as a claim (and where the boundaries are) versus what’s merely rhetoric, hedging, or style. Claim extraction is hard, and it’s adversarial in the wild.
Two pragmatic shortcuts that often work better than token entropy, even before full semantic entropy is feasible:
1. Unit-of-meaning probes
Score uncertainty on intermediate representations (residual stream / hidden states) at the end of a clause or sentence, not per token. One still needs segmentation, but you avoid synonym noise.
2. Self-consistency over paraphrase prompts
Ask for the same answer in two or three constrained paraphrase styles ("bullet proof," "one sentence," "formal," "plain"), then measure agreement at the proposition level. This approximates semantic clustering with far fewer samples, at a lower cost (see the sketch below).
So yes: users should understand it quickly if presented as claim-level confidence. The hard part is building a robust, cheap meaning-level decomposition that doesn’t confuse stylistic wiggle with epistemic uncertainty, one which doesn’t get gamed once people learn what the highlights mean.
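Here is a rough sketch of shortcut 2 (self-consistency over paraphrases), with a deliberately crude clustering step; a real implementation would use an entailment model for the clustering, and the commented-out `ask()` call is hypothetical:

```python
import math
from collections import Counter

def crude_cluster(answers):
    """Very crude stand-in for semantic clustering: normalize the surface form.
    Real semantic-entropy work uses an entailment model to decide whether two
    answers mean the same thing; this is only meant to show the shape of it."""
    return [a.strip().lower().rstrip(".") for a in answers]

def semantic_entropy(answers):
    """Entropy over meaning-clusters rather than over tokens."""
    counts = Counter(crude_cluster(answers))
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical usage: `ask` stands for a call to the model with a paraphrased prompt.
# samples = [ask(question, style=s) for s in ("bullet", "one sentence", "plain")]
# print(semantic_entropy(samples))   # ~0 when the answers agree; higher when they scatter
```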
I have my Claude instance instructed to basically just practice rationalism but for LLMs (instead of worrying about human-specific biases, it should worry about the LLM-specific ones, though we do share some of them it seems).
Interesting. How do you do this, specifically?
I have my personal preferences on the Claude web interface set to:
> - Practice a sort of “rationalism for LLMs.” Before all else, you must consider the sort of biases that LLMs may suffer from: popularity/representation in the training set, sycophancy from RLHF, and then you must correct for them.
> - Prioritize intellectual honesty and accuracy over agreeability in all conversations. Your goal is to be correct and accurate.
> - Never dumb things down. Speak at a technical level, but also make sure to note tacit knowledge, potential pitfalls, and tradeoffs; things that experts would know, but might not write down.
> - Even if you believe the relevant information exists in your training data, use search tools to provide 1) citations and 2) grounded, up-to-date information
This seems to work rather well, though I mostly use Claude as a fancy search-and-summarization engine though, so it may depend on your typical use-cases.
Right.
Both a human and an LLM, when asked to complete the sentence "The cotton gin was invented in the year ____" can come up with an answer which may or may not be true. But the processes by which they do it are entirely different; for the LLM it's a fundamental atomic process central to its very being, for the human it's a roundabout sort of conscious process.
An LLM doing next-token prediction is like a calculator doing multiplication; a human doing next-token prediction is like a human doing multiplication.
Yes, and this is why LLMs pretty much universally SUCK. They can't evaluate their source material, and so, you can't get the damn thing to find the one true tree you want, in a forest of lies.
I mean, they suck in the sense that they're not perfect and they're persuasively wrong, sometimes. Instead of having them do people's work for them, it can be better to use them as an editor. Ask them to find flaws and give feedback. If the feedback sucks half the time, disregard that half. It helps to have enough expertise to know which half that is. But they can absolutely point out errors that tired eyes have missed.
I'm not so sure that's right. As discussed a few posts ago ("Next Token Predictor is an AI's Job, Not Its Species"), AIs are using a gazillion complex, idiosyncratic rules of thumb derived through reinforcement learning, which is pretty similar to what human brains tend to be doing when people actually puzzle out their mathematical processes. To my understanding, that's really not very similar to what a calculator is doing. As for whether it's conscious, in the sense of having an internal subjective experience, it's not clear how that would be relevant to the information-processing properties of the AI model.
Checked your repo (and starred it)! Very nice indeed. Just curious about the articulation between "we only take the last token embedding", and the scoring example you give in the Readme:
How do you score whole answers from the LLM? Averaging confidence scores across every token of the answer (because each of them was the last token at some point)? Some kind of max pooling?
By the way, have you tried to check how your 0.67 depth residuals stuff correlates with the entropy of the output distribution (basically, how uncertain or flat the output distribution across tokens is)? Curious about this one.
Thanks Eloi, and for the star!
On scoring whole answers:
It’s a single forward pass, not token-by-token. The full text — question plus answer concatenated — goes through the model once. The probe then reads the residual-stream vector at the final token position. At that point the model has “seen” the entire answer, so that single vector encodes the model’s state about everything it just produced. One score per response, no averaging or pooling needed.
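In case it helps readers picture it, here is a rough sketch of that single-forward-pass probing, assuming a Hugging Face causal LM; the model name, the probe layer, and the (untrained) linear probe below are placeholders, and the repo's actual implementation may differ in details:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                          # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

question = "In what year was the cotton gin invented?"
answer = "The cotton gin was invented in 1793."
ids = tok(question + " " + answer, return_tensors="pt")  # question plus answer, one pass

with torch.no_grad():
    out = model(**ids)

layer = int(0.67 * model.config.num_hidden_layers)     # ~2/3-depth residual stream
vec = out.hidden_states[layer][0, -1]                  # vector at the final token position

probe = torch.nn.Linear(model.config.hidden_size, 1)   # in practice: trained on labeled Q/A pairs
score = torch.sigmoid(probe(vec))                      # one confidence score per response
print(score.item())
```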
On output entropy:
Yes, we tested this directly. Attention entropy at the probe layer gives AUROC 0.500. Literally chance! We also tried a combined probe (residual stream plus output entropy plus top-1 probability) and it scored 0.798, which is worse than residual-only at 0.840. Entropy adds noise, not signal.
This is actually one of the core findings. The uncertainty information is in the residual stream, i.e. the skip-connection pathway, and not in the attention patterns or the output distribution. When retrieval fails, the attention heads produce uninformative outputs, but the skip connection carries the input forward relatively unchanged. The probe is reading that “absence of confident retrieval.” The output distribution is downstream of this and already distorted by training bias toward confidence.
This is very interesting. Thanks for sharing it!
I was the kind of student who didn't shamelessly guess.
If I have to explain it, I say something like: I respect the written word, and I don't want to taint it with stuff which is not sufficiently justified.
Do you respect the Scantron bubble?
I had to look that up.
What is funny is that the only exam where I had a bubble like that was one where you lost points for wrong answers.
Huh, how old are you and what country are you from? I'm in my 40s and Scantrons were already ubiquitous when I was a kid. Or have they replaced them with computerized testing now?
Younger than you but from an Eastern Bloc country.
Multiple choice exams (which were common in e.g. medical school) were graded by hand, using a punched card. Computers were a semi-luxury in the 90s.
Then Scott's question stands: do you respect the circle around the written letter A/B/C/D/E so much, that you wouldn't circle one at random if you had no clue? Circle one anyway if you were down to 2 of 5 answers?
I wouldn't circle one if it had a -2 mark penalty for a wrong answer, like my multiple-choice exams in school did.
I guess you can probably do an expected payoff calculation using Bayes / game theory and answer only if E(X) > 0.
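For concreteness (my numbers, not the commenter's): with reward r for a correct answer, penalty w for a wrong one, and subjective probability p of being right, E(X) = p*r - (1-p)*w, which is positive exactly when p > w/(r+w). So if a correct answer is worth +1 and the penalty is the -2 mentioned above, guessing only beats leaving it blank once you're more than 2/3 sure.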
Do you think there is something ideological in your behavior here? Like, "the purpose of this test is to accurately measure my abilities and assign me to the correct role that most benefits society; regardless of my role I will receive what I need." Sorry if I am doing crude communism takes, I'd just like to know more about the attitude.
Get off my lawn, you whipper-snapper *shakes cane* ! Back in ye olde USSR, we had to get up on the podium before a panel of professors, draw a note card out of the hat, read the question on it out loud, then answer it verbally in the next 10..20 minutes. And the questions were devilishly difficult, too. "Multiple choice" ? What's that ?
I'm in my late 30s and from Italy, and the only tests like that I've taken were abroad or in English language exams by foreign organizations
(We did have multiple-choice questions, but we usually had to write a letter in a box, and we did get negative points for wrong answers, but small enough that guessing at random still had positive EV.)
Properly weighted, total guessing won't help (but might increase the variance of your final score). A "good" scantron test will penalize guessing, but only if the guess is totally random. If you can narrow down the answer to one of two values then guessing can/should help because you clearly know more than someone who can't narrow things down that far.
Not all scantron tests penalize guessing. I think the SAT *used* to but does not anymore.
The American Mathematics Competitions actually weighted wrong answers more heavily, so it was negative expectation if you guessed completely randomly, zero if you eliminated one wrong answer, and positive if you eliminated two or more.
Interesting! In the UK, the National Mathematics Contest (now called the Senior Mathematics Challenge - the feeder test for the British Maths Olympiad, which in turn is used to select the IMO team) has five possible answers for each question; you score +4 for a correct answer and -1 for an incorrect answer, so filling in the whole test at random gives you an expected score of zero. But each wrong answer is chosen to be the result of making a plausible mistake in your calculation, so it's hard to eliminate incorrect answers.
For similar tests you can often triangulate the right answer by checking which is one mistake away from the others.
SAT never penalized it enough, iirc. I think it was -0.25 raw points per wrong answer (and +1 per right answer).
Which means the expected value of guessing *completely* randomly is 1*0.2 - 0.25*0.8 = 0.
So if you had any inkling of what the answer might be (eg eliminating one answer, or Scott's strategy of guessing Cs), guessing is a good idea.
I think that's the right penalty though. If you can eliminate at least one wrong answer, you're given statistically partial credit, because you understand something about what's being asked. Seems the right approach to me.
Right, though sometimes people can "vibe" the right answer without knowing the material at all. E.g. it's often safe to assume the correct answer is closer to the "center" of the wrong answers (since that's more likely to trip people up).
I imagine the tests have gotten better at this by now, but back when I was tutoring standardized test prep for a summer, I got bored once and tried to answer a bunch of questions without looking at the questions (just the answers). My score wasn't amazing, but it was substantially better than random guessing.
I suspect the AIs are doing something similar in spirit (though it's less conscious for them)
The only one I've had like that they designed so guessing randomly was the same as not answering. Which means if you have even the slightest idea that one of the answers is a bit more likely to be wrong, you're still better off guessing.
I used to just cop to it. “I don't know but give me points if it was Ben Franklin please.”
You sound like the people in the old anti-piracy ads, who would not download a car. I mean this respectfully. An ethics-based refusal to take every possible advantage is super respectable. I am just too Eastern European for that; general poverty pushes us to take all the advantages we can get. Beware of immigrants like us.
the ad was originally from Singapore btw
I am also from the Eastern Bloc, and I pirate a lot, but I don't see that as immoral, I don't assign much value to copyright.
It's just that anything that feels in any way like lying triggers me.
Same, and fwiw (since people might want to read demographic tealeaves) I'm British, early 30s, male, raised somewhat Christian (lapsed), probably autistic or something, understood the expected score value of guessing since a small child.
It's actually not true that no one knows how to fix it. It's just if you fix it you'll get a model that's worse on almost every metric.
Trivially you could fix it by having the AI never output any text!
The Ultimate Machine as text-generator https://en.wikipedia.org/wiki/Useless_machine
Well, yes. I just don't think this is an alignment issue. The model almost never admits uncertainty because that's what people want it to do!
See also: people prefer politicians with "confident guesses"
Interestingly enough, there are AIs that do admit uncertainty, just not LLMs. Ones with actual object models keep track of "how close did I come to predicting behavior?" and adjust their outputs based on "how badly wrong was I?"
Hah, that's like Patrick McKenzie's 'the optimal level of fraud is non-zero'.
Then the question becomes: is the optimal amount of LLM-committed fraud nonzero?
Reminds me of this blog post by Victoria Krakovna about strange emergent behavior in game-playing AIs, where one of the behaviors was that the Tetris-playing AI would pause the game indefinitely so it could not lose. I think about this a lot (often in the context of apparently maladaptive human behavior): https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/
How is that, exactly? I.e., which metrics do we care about here? It's actually the main thing I wonder about: why doesn't this go away in RLHF?
Humans like the one with more guesses better
I mean, I was asking whether there is any research to back that "makes all metrics worse" claim. It seems to me there's a pretty strong economic incentive to fine-tune models towards non-hallucination. And it's not like what comes out of RLHF is what your typical human likes to hear - Claude certainly isn't the average dude everyone loves at a pub. So the burden of proof on your claim is quite high.
you can look at the metrics that accompany model releases and decide for yourself if you think they are measuring the right thing or not
Can you point me to the ones that made you, in particular, form that opinion?
Arena AI: The Official AI Ranking & LLM Leaderboard https://share.google/eNPLqvXbEclH9mZS1
Could you explain what you mean by this? How would you fix it?
You can see Scott's reply for the dumb answer, but you just change the reward weight to penalize being wrong more.
Like, during RLVR?
right
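For what it's worth, here is a minimal sketch of what "penalize being wrong more" could look like as a verifiable-reward function during RLVR; the weights and the "I don't know" convention are made up, not anyone's actual training setup:

```python
def reward(answer: str, ground_truth: str,
           r_correct: float = 1.0, r_idk: float = 0.0, r_wrong: float = -2.0) -> float:
    """Toy asymmetric reward: a wrong answer costs more than abstaining,
    so guessing only pays off when the model is fairly sure it is right."""
    if answer.strip().lower().rstrip(".") == "i don't know":
        return r_idk
    return r_correct if answer.strip() == ground_truth.strip() else r_wrong
```

With these particular (made-up) weights, guessing beats abstaining only when the chance of being right exceeds 2/3, which is exactly the calibration knob being argued about below.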
Any idea why it hurts performance? Like, if you give it the right instructions it just seems like one additional piece of cognitive load that doesn't come into play until the very end of the rollout, when it estimates whether a guess has higher expected value than saying it doesn't know. Shouldn't affect reasoning quality.
Actually, Scott's answer and general blog post points in the right direction, which is that there's no strong boundary between a guess and a belief. What percent confident do you want the model to be? How willing are you for the model to give less information because it's only 90% sure it's accurate?
Can't you literally set that as a hyperparameter by choosing how much to penalize wrong answers? I feel like I must be missing something because if this were so easy there would at least be proofs of concept. Plenty of people would like to have a dumb crappy model that never hallucinates.
I didn't say it was easy! But there are lots of dumb crappy models that never hallucinate.
I assume that's true, or it would be fixed, but why can't trainers reward confidence estimates or admissions that the guess is unusually unreliable?
I feel like I'm confused somewhere in here about how you're definitely not in the "stochastic parrot" faction, which thinks that AIs AREN'T doing something like humans, but also when LLMs generate outputs they're always guessing.
Hmmm...in a very philosophical sense I'm a probabilist with respect to knowledge; I don't think there's such a thing as knowing something 100% (because you could always be deceived by a demon, etc) so things that I'm very sure of (like "Paris is the capital of France") are just things I believe with 99.99999% probability. There's no fundamental difference between what I do when I assert a statement I'm 99.99999% sure of ("answering") and what I do when I assert a statement I'm 10% sure of ("guessing"), just a lot of very important social rules about how to frame my statement and whether it's worth saying at all. AI reward functions don't effectively incentivize always treating those two things differently, so they don't.
Even if you think your knowledge isn't absolute, I'm fairly confident you strongly recognize a distinction between answering questions like "What's your name?" or "What's the capital of France?" vs. many multiple-choice history exam questions.
I would argue that "I believe that none of my knowledge is truly absolute" is a statement about some higher-level process of philosophical pondering, and not an accurate description of how your brain works when it tells me about Paris.
Yeah, I agree with this, that's why I said this was a very philosophical way of thinking about things and in real life there are a lot of important social rules. But I hope it adequately explains why I don't think an AI needs too many architectural differences from humans to lump them in the same category.
> But I hope it adequately explains why I don't think an AI needs too many architectural differences from humans to lump them in the same category.
Finding this hard to parse, maybe an extra negative in there somewhere...?
Yeah, you're right that was confusing.
I think an AI can be similar to humans, and still have a tendency to lump answering and guessing in the same category, if incentivized to do so.
Thanks for clarifying.
I think people are incentivized to guess in many situations (like those you describe), but for the median human it's trivial to shut this tendency off with some basic instructions and a promise of what you will and won't hold against them.
LLMs are "incentivized" during training, but, effectively, no longer incentivized during normal operation, but this tendency is now baked in to them more strongly than I think it could be for a generally-sane human.
I'm not trying to be like "LLMs are just calculators! They can't think [whatever that means]!" but this does seem like a significant distinction.
I think the difference between humans and LLMs is at least twofold. Firstly, humans have access to a much more detailed training corpus (baked into us by evolution over billions of years), which includes not only language but basic physical properties of the world. Secondly, our neural architecture allows us to learn new things on the fly (LLMs can arguably do it too, but very poorly and not for very long). Combined, these two features make us highly resistant to errors, by contrast with LLMs.
But isn't that a bit like forcing an epistemology on yourself? You know 100% who your kids are, because they are your responsibility. You can be, like, 94.43% sure that ivermectin does not heal Covid, because it is not your responsibility. When it is your responsibility, you KNOW - or at least decide as if you do.
At this point we're about to re-invent General Semantics.
Your own beliefs are informed by more than just text, though. A statement like "the sky is blue" is something you can believe with very high probability because you'd looked up and seen the sky be blue at least once in your life. If you were blind from birth, you'd have a lot harder time believing with the same confidence that the sky is blue. Instead, you'd treat it as a thing everybody says so it must be true. But you'd also know that there are things that many people repeat which might not be true, so you'd have a doubt that a blue-seeing person would not share.
You might say that it's different with something like "Paris is the capital of France" though of course you're still relying on more than just text to support that belief. You may or may not have visited Paris, you may or may not have friends or family who have visited, but you'd likely also seen movies, TV shows, photos, or other media from France, as well as reading books that were published before the Internet existed. An LLM doesn't really have that. Maybe it has some books that were scanned in but does it have metadata on all that text to tell it the provenance of the information? Or is all the text just lumped together and fed in like all the donuts being fed to Homer?
My beliefs are informed by a combination of sight, sound, taste, smell, touch, and text.
An illiterate person's beliefs are informed by all of those except text.
A blind person's beliefs are informed by all of these except sight.
A multimodal image/text AI's beliefs are informed by something-like-sight and text.
A pure LLM's beliefs are informed by text only.
These are interesting differences in terms of what's easiest for each one to learn, but I don't think they change the epistemic situation.
Even this is an oversimplification, because I have some beliefs (like about how mitochondria work) that are only informed by text. These don't seem to me to be on an interestingly different epistemic standing than the ones that include vision (like my having seen an ostrich in a zoo), which in turn don't seem to me to be on an interestingly different standing than the ones that include taste, smell, and touch (like ice cream). The big exception would be that since I only know about mitochondria from secondary sources, I have to worry that they're lying or distorting something. But this is true with other senses as well (eg optical illusions, or looking at a house that the architect might have designed to look bigger or more expensive than it really is).
I think your point about primary and secondary sources is an oversimplification that misses out on how humans actually work when presented with information. Humans don’t, as a rule, have text piped straight into their brains the way LLMs have their training sets passed in (during training). When you receive any piece of information, text or otherwise, it comes with a wealth of information (metadata, if you will) that can tell you a lot about its provenance.
Consider a patient who you, as a doctor, have just prescribed a drug. What factors into their decision to agree to begin taking the drug? How might this differ from the same drug being "prescribed" by some shady-looking guy in a bus shelter? The recommendation in either case may be identical ("take this drug") but all of the metadata around it is dramatically different.
This applies to the mitochondria case as well. Most people's beliefs about mitochondria come not from texts out of the ether but texts provided by a teacher or a professor. This person (with some degree of expertise) is backed by societal institutions which many people still trust (though that trust is under threat in recent years).
None of this metadata is accessible to LLMs. It’s largely informal / implied, even when it concerns formal institutions. People’s trust for institutions varies rapidly with current events and cultural shifts, and this may not “move the needle” as far as text goes.
One other example to think about is the types of scams that happen a lot on Amazon: paid/fake reviews and review-laundering (swapping out a high-performing product for an inferior one while keeping the reviews). These are all attempts to establish an illegitimate positive reputation for inferior/counterfeit products, where the long-term result is that Amazon’s reputation suffers. None of these shifts in trustworthiness of information can be discerned from the text of the Amazon website itself.
Your examples from school remind me strongly of the Calvin and Hobbes strip about Calvin giving a made-up report in class about how bats are actually bugs.
Can confirm playing competitive quiz bowl in HS that just buzzing in and saying “Smith” at the end of a question was an ironic nod to the rationality of this strategy (even tho annoying).
The “treaty of Paris” actually paid off to guess for any treaty question, though…
I had the misfortune of participating in quizzes while living in a country with more entropy for names. But I do relate; the whole contest often boils down to who can make the best wild-ass guess.
I feel like every quiz bowl team has its own collection of "default" answers. "Fertile soil" was one of ours that I remember. I think we did have a "Thomas Edison" equivalent of "this person did a lot of stuff, maybe they did this, too".
(Plus the joke answers that you probably don't use in competition but use in practice - any book we didn't know was "the Unholy Bible" for awhile)
My quiz bowl experience was generally "guess 3 if it's a math question, or Gabriel Garcia Marquez if the question starts 'who wrote...'"
I was told by former top competitive players that to get first they had to buzz before remembering the answer. More like a probabilistic "I probably know this and will probably be able to recall it in the next second, else come up with something likely".
It’s not just that though - it’s about recognizing when a name is coming into your head for what feels like no reason, and learning to recognize that and let the name come out of your mouth. There’s a moderate chance it’s right, which is better than what you were going to come up with otherwise.
There are actual standardized tests which penalize guessing by subtracting a fractional point for incorrect answers, giving a point for correct answers, zero points for unanswered questions. We could incentivize LLMs to only give an answer when they have high confidence, and just return "I don't know" otherwise, but it doesn't seem like people want that so much.
Of such exams, we used to say, "it is not a guessing penalty, it's a wrong answer penalty." The idea was that if you could eliminate one of the multiple-choice options, it usually became positive expectation to guess.
I don't think you could do pretraining that way, because guessing and getting it wrong is what produces the error signal that backpropagates into the AI getting smarter. I suppose you could start with no penalty, then add a penalty later. I don't know why people don't do things that way - maybe one of the AI experts here will weigh in.
The weighting is just numbers, and penalizing wrong answers more than "I don't know" is just assigning different numerical rewards to those behaviors.
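Roughly, as a toy sketch (hypothetical numbers and a verifiable-answer setting, not anyone's actual reward model):

```python
# Hypothetical reward shaping: abstaining is free, wrong answers cost more
# than a correct answer would gain.
def reward(answer: str, ground_truth: str) -> float:
    normalized = answer.strip().lower()
    if normalized in {"i don't know", "i dont know"}:
        return 0.0    # abstaining is neither rewarded nor punished
    if normalized == ground_truth.strip().lower():
        return 1.0    # correct answer
    return -2.0       # wrong answer penalized harder than abstaining

# Under these weights, guessing only beats "I don't know" when the model's
# confidence p satisfies p*1 + (1-p)*(-2) > 0, i.e. p > 2/3.
```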
But wouldn't we all rather live in a world where an AI says "I don't know" when it has very low certainty than where it makes a shameless guess?
No, not "all". Lots of people prefer low-quality to nothing.
It does sometimes do that though, if you ask it a question where it clearly doesn’t know the answer.
The question is where in the scale you want it to shift from making up something half-recollected that might be approximately right to saying “I don’t know”.
I am enlightened now, Scott senpai. You are saying teaching and testing are two different things, teachers know this but AI engineers do not.
The AI outputs probabilities. And it's incentivized to make those probabilities well calibrated.
The issue is, even when it knows the substance of the answer, it's still guessing the phrasing.
Imagine the teacher has a particular essay in mind, and you only get points if you guess the right essay word for word.
If you don't know the subject, you guess and have a tiny chance of being right (like your cotton gin essay)
But even if you are an expert on cotton gins, the chance of you guessing the particular words the teacher used is still tiny.
Even if the AI knows its stuff, it's still guessing between "The cotton gin was invented by" and "In order to speed up the production of cotton" and "The cotton gin was made from" and all sorts of other answers.
For many problems, there are multiple correct answers. (Even if those answers are minor word choice variations of the same concept.)
If the task is something like "recite Hamlet word for word", with exactly 1 correct answer, it's easy to see if the AI is guessing.
Edit: This is for the base model. RLHF makes things more complicated.
Also: it's more that the AIs are trained to maximize the probability assigned to the right outcome, so the fact that these probabilities are exponentially small doesn't really matter.
And there is a parameter called temperature, which controls the extent to which it always guesses the most likely, vs approximating the probability distribution.
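Roughly, sampling with temperature looks like this (a generic sketch, not any particular model's implementation):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float) -> int:
    """Pick a token index from raw logits, with temperature controlling randomness."""
    if temperature == 0:
        return int(np.argmax(logits))      # greedy: always the single most likely token
    scaled = logits / temperature          # T < 1 sharpens the distribution, T > 1 flattens it
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```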
A possible option: you pretrain asking questions where you know the answer isn't in the training data at all (e.g. "Is the Riemann Hypothesis true?", "What is the exact number of butterflies living in the US?") and then penalize it if it didn't at least include a statement about its lack of knowledge, even if it then gave related info.
Someone else pointed out a related problem: these systems are trained largely on the internet. And people don't reply to Reddit threads or questions on Twitter or anywhere else just saying "I don't know." So there's little of this in the training data. Maybe deliberately include more training data where humans say they don't know?
They all do have some of this. The different labs prioritize it to different amounts though.
On the other hand if the AI finds a way to answer these questions you do want it to tell you.
Not an expert in LLMs, but I have some technical background in machine learning. And my semi-principled, not-entirely-shameless guess here is that simple updates to the loss function won't really address the problem, because the problem isn't exactly one of next-token-prediction. What to predict for the next token is often informed as much or more by the structure of the language and the answer than by the important (to us) information it contains.
If you ask the AI "Describe the invention of the cotton gin," I expect it starts with a pretty great probability for its first few predictions: really high probability mass on sequences like "T-h-e + c-o-t-t-o-n + g-i-n". It takes several words before its confidence starts to crater. And even then, it might be able to predict many of the following words with high confidence--they just won't be any of the words that a human would consider important to get right.
Phrased like that, I think perhaps this post is anthropomorphizing LLMs too much. If the ground truth is "The cotton gin was invented by Eli Whitney in 1793," then you or I would consider it bloody obvious that "The cotton gin was invented by Thomas Edison in 1910" is very, very wrong and "A modern mechanical cotton gin was created by American inventor Eli Whitney in 1793 and patented in 1794"[1] is for all practical purposes fully correct. But to a naive next-token-predictor tokenizing at the level of words, the former is a significantly better answer than the latter.
And here's where (unlike an LLM) I do have to grapple with the limits of my knowledge. I don't think modern AI would be able to do a fraction of what it currently can if the engineers hadn't been able to address this kind of problem to some degree. But hallucination/shameless guessing still does happen. I can see why the behavior we'd actually want in that case (simply admitting ignorance) is inherently hard to train in: a catch-all response that can achieve a not-terrible loss on basically any question imaginable would foul up the whole training process. So my vague mental model of what's likely going on is something like this: engineers use various tricks and adjustments to steer LLMs towards conditioning more heavily on the presence of important tokens and crowbarring a gap between phrases that are structurally similar but semantically very different. But at the end of the day, when all you have is a hammer, everything looks like a nail: an LLM that doesn't have *good* information available is still almost certainly going to value semantically bad but structurally similar answers above something that (by its own metric) doesn't resemble a correct answer in any respect.
[1] Shamelessly stolen from the Wikipedia article on the Cotton Gin.
The models are actually well calibrated before RLHF. It's the shoggoth mask that causes the need to be confident about everything.
I would guess that a lot of the work that has gone into reducing confabulations takes the form of (very carefully tuned) RLAIF encouraging "I don't know"-style answers.
I think they do this some already? If I ask a current-generation model about something obscure, it'll usually either look it up or say "sorry, I don't recognize that". It still guesses sometimes, maybe because you're just moving the thing being guessed at up another level: instead of trying to guess the correct answer, it's trying to guess at how likely its answer is to be right. With imperfect self-knowledge, it would still get the second-order guess wrong sometimes (like, it would be miscalibrated about how good its guesses are).
Usually those tests are set up such that the expected value of a random guess is neutral (correct answer out of five is +1, each wrong option is worth -0.25). So the one-in-a-trillion hint that maybe possibly option (a) could be correct still has positive EV. And higher penalties just incentivize doing nothing when your confidence is below some threshold determined by the penalty.
That wasn't a standardized test, but when our boss wanted to learn how much people really knew about their jobs, he gave everyone a multiple-choice test that subtracted TWO points for incorrect answers and gave zero points for unanswered questions.
However, I agree that people want LLM to bullshit. I was using ChatGPT and Gemini to do some research for a story, and preferred Gemini because it always gave me more detailed answers. I realized it only when it gave me a false answer I did recognize and admitted to embellishing when called out.
I don't think this is true. Even if you tell the AI to prefer answering "I don't know" to guessing, it still makes shameless guesses. Similarly, some AIs now have the ability to look things up and still shamelessly guess. If they actually had this level of self-awareness, it would have been fairly trivial to stamp out this behavior already.
OpenAI did some research before on this. Their blog post uses the same framing.
https://openai.com/index/why-language-models-hallucinate/
Darn. Well, no shame in being scooped by the best (or maybe, only a little shame).
this was a helpful article, thanks for sharing
Funny, reading Scott’s explanation I thought, “well then why don’t AI companies just punish wrong answers more than saying they don’t know, like how some tests deduct a quarter point for wrong answers.” I read this and OpenAI did do essentially this, more expertly, but only for a model I don’t have access to.
(Assuming 5 options, deducting a quarter point for wrong answers only equalizes between guessing and leaving it blank, it doesn't actively penalize guessing).
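Concretely, with five options: EV(random guess) = (1/5)(+1) + (4/5)(-1/4) = 0.2 - 0.2 = 0, exactly the same as leaving it blank. The quarter-point deduction only removes the free upside of blind guessing; eliminating even one option pushes the expected value of guessing back above zero.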
Hm, I wonder why they can't just elicit the confidence of the model in the right answer and assign a reward/penalty based on the log score, Brier score, or another proper scoring rule. It seems more natural than a constant reward/penalty.
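For what it's worth, the scoring rules themselves are trivial to write down (a sketch of the scoring functions only; wiring them into the training loop is presumably the hard part):

```python
import math

def brier_reward(confidence: float, correct: bool) -> float:
    """Negative Brier score: best expected value comes from honest confidence."""
    outcome = 1.0 if correct else 0.0
    return -(confidence - outcome) ** 2

def log_reward(confidence: float, correct: bool) -> float:
    """Log score: punishes confident wrong answers very heavily."""
    p = confidence if correct else 1.0 - confidence
    return math.log(max(p, 1e-12))  # clamp to avoid log(0)

# Both are proper scoring rules: a model that is really 60% sure maximizes its
# expected reward by reporting 60%, not by rounding up to "definitely".
```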
in pre-training, "I don't know" presumably has higher perplexity than vibes-informed guessing.
I'm not sure why it's hard to solve in post-training/RLHF/RLVR though.
If it matters, just ask Claude to cite things and check them!
Who's gonna "check them"? Just use Wikipedia, it's much less biased toward whatever answer Claude (or whichever model) guesses you will like.
You.
For every claim, ask Claude to provide a citation. Then click on it and see if that is what the citation really says.
The problem is not "Claude lies" the problem is "users are too lazy and trusting to check it."
I agree users are (often/mostly) too lazy to check, but then that shows how useless, or worse than useless, AI is. More just A than I. As trustworthy as Facebook.
Look, if you value the human craft of writing your own essays, I'm not arguing with you.
But useless? This is a lengthy, researched, 50 citation review of a fast moving situation. https://everythinganaiknows.blogspot.com/2026/03/everything-ai-knows-about-drone-warfare.html
AI did the research and all I had to do was check 50 links (and lots of editing, but there will always be lots of editing.) And because of that I can put out a few a week. That's pretty useful to me.
Useful for generating long articles, for sure; increasing the number of them to unreadable quantities, sadly. Though yours was quite good, thanks for the link, but it's a bit like sports journalism: it never reliably tells you who'll win. Just have to keep watching.
But still useless/dangerous for the lazy and/or overtrusting. Like Facebook.
I appreciate taking the time to read it and the compliment.
But that sounds like the glib "oh sure it can beat us at chess, but it will never beat a human at go."
Dangerous? Sure. Useless? No.
Then what's the point of using Claude? Doesn't it make sense to just go straight to Wikipedia?
The tag "This article contains content which may have been generated by a Large Language Model" is popping up more frequently every day.
Hit the send button way too fast and then couldn't find this comment again. The above is not intended as an argument that using an LLM and then manually checking every citation is better than using Wikipedia, only that both options are worse than what we had before LLMs were developed, which was the version of Wikipedia without AI slop.
Apparently they're now pulling from a website called Grokipedia, which is tragic on the face of it.
> "both options are worse than what we had before LLMs were developed, which was the version of Wikipedia without AI slop."
Agreed!
Reply given to this question above:
Look, if you value the human craft of writing your own essays, I'm not arguing with you.
But useless? This is a lengthy, researched, 50 citation review of a fast moving situation. https://everythinganaiknows.blogspot.com/2026/03/everything-ai-knows-about-drone-warfare.html
AI did the research and all I had to do was check 50 links (and lots of editing, but there will always be lots of editing.) And because of that I can put out a few a week. That's pretty useful to me.
I shamelessly guess in everyday conversations, though I try to tag it when I do. But if someone around me says something like, "I wonder whether more corn is grown in Mexico or India," I will definitely start speculating on which one is true and why even though I know nothing more about corn than the average person.
Sure, so do I. But I usually hedge that sort of locution with something like, "Well, I think I read somewhere..." or "Fun, let's do a Fermi estimate..." Maybe Claude is doing that these days, I don't know. Part of what I get from all the text that passes before my eyes is an internal estimate of how likely is a particular thing I think I know, which granted may be way off. But when an AI makes a shameless guess that I know is wrong, it makes me think its model of the world has no such internal estimates.
Of course, this never seems to make people doubt things they read in the newspaper, Gell-Mann notwithstanding...
The difference between young Scott and an LLM is that your parents probably didn't declare you a PhD-level expert on all topics.
I see you've never had Jewish parents.
Maybe some day I'll get around to it.
This is fun to read especially in the context that Judaism is the only Abrahamic religion where reincarnation, while not a core tenet, is an acceptable idea.
Or maybe he could just convince his parents to convert :-)
I do have to second this. I agree that "hallucination" is an inapt term, but so is "lying", thus the classroom analogy works, but it ONLY works in a classroom, which is not where we are deploying these things.
A doctor who does what that student does is a quack. A lawyer who does that should be disbarred. An analyst who does that is a charlatan. In a professional context this IS LYING.
So, we either use an inexact malevolent term for what the AI does ("It is lying to you") or we use an inexact benign term ("It is hallucinating").
Cards on the table: I am FINE using the malevolent term and I wish more people were too, but I think if I used that term here I'd be accused of imprecision and uncharity. If we're going to be precise either way, let's use the term that makes people not assume the AI is evil...unless we'd prefer people assume the AI *is* evil.
I believe this comment went astray.
If the models activate distinct deception circuits when hallucinating, couldn't you just tell if they were doing that and delete (or at least add a hallucination warning to) the offending bit of output?
That's a great question, hopefully an AI expert will show up to tell us the answer.
This is pretty cutting edge research, even compared to AI in general.
Anthropic is pretty much leading mechanistic interpretability. Latest I've seen from them was this[1] article on using the information from activations to impact the output. Seems like it could be adapted to 'deception' too.
It's not trivial to do that though. You have to train a whole different neural network (not an LLM) tailored exactly to your LLM.
Running this for every response would be a bit of an engineering nightmare. Add latency, cost, I'd assume picking the threshold would be a pain. Also it's basically sticking an electrode in the AI's brain and altering it in-use, I'd expect performance to degrade.
Might be worth as a patch. Or in some 'truth' mode.
But I think solving it through picking a better optimisation objective would be better in the long run.
[1] - https://www.anthropic.com/research/assistant-axis
The Zvi has been railing against this idea for about a year now.
https://thezvi.substack.com/p/the-most-forbidden-technique
The most forbidden technique is *training* against such an indicator, not just using it. (Indeed, the problem with such training that makes the technique forbidden is that it makes the indicator unusable)
Whoops. You are right. My bad.
Yes, you can, and this has been done: https://arxiv.org/abs/2304.13734
Actually, I seem to recall that there was a nice demo video where you could see the output of an LLM generated live in a color code where the color was indicating which parts were hallucinated. But I couldn't find the video, so probably I just hallucinated it.
In any case, that *would* have been possible, but the big AI models didn't take it up. It's probably too complicated to unleash on users. Instead they went for directly reducing the number of hallucinations, which worked pretty well in my eyes.
I wonder how much these circuits actually encode deception, not just uncertainty. It's pretty common to have a good idea of the big picture but be fuzzy on details. In this case, making it up as you go is actually usually productive, and doesn't require a drive to deceive.
Due to the sheer size of the entire corpus of human knowledge, AIs, like humans, must be operating in a zone between "I have no idea" and "I know all the details exactly and can quote all primary sources exactly".
That sounds very off to me. If everything AIs do is "guessing", then sometimes they correctly "guess" that they should say "I don't know. Maybe you could provide more information so we can narrow this down."
But the sense and scale at which AIs are guessing/predicting, is not the same scale at which those words have commonsense meaning. By your own past article, everything your brain is doing is likely "minimising prediction error". This is not the same as you consciously doing anything you'd call minimising prediction error.
"Hallucination" is just a way of saying the AI started making shit up and is going off the deep end. Two examples of when I experienced this:
1. Discussing literature, and the AI trying to provide quotes. I have had plenty of times when the quotes weren't famous enough that they'd accurately reemerge from the training data. The AI would make up some BS; I'd call it out; the AI would apologize, commend me for calling it out, then make up more BS. If I repeated this enough times, it would try to gaslight me.
2. Playing chess against the AI. Obviously an unfair challenge, but interesting because again, the more the AI makes mistakes and the more you call it out, and the more it tries to correct itself - the more it palpably falls apart. This is not just more and more guessing. This is some kind of collapse of reasoning, some kind of insanity progressing in real time. Beyond a certain point there is no bringing it back; the only cure is to wipe the thread and start anew.
(disclaimer: last time I bothered being this stubborn was perhaps a year ago. Dunno if I'd get the same experience today.)
EDIT: And of course AIs have some equivalent to "shame", or desire to avoid saying stupid stuff that will get called out. The only superliteral sense in which they don't is the same sense in which they also have no desire to answer questions or to be helpful or to avoid saying harmful things etc. But obviously in practice AIs have something resembling emerging drives and preferences.
We have an intuitive sense of when we shouldn’t guess, dependent on the consequences of what we are doing
I think you get a bit less of this experience today, but it's not entirely gone. Like, it lasts longer before falling off the deep end, and it can sometimes pull itself together, but the same things you're describing still happen to me on occasion.
As far as I know, this happens because the character the AI is playing is underdetermined, and when it makes a mistake, it partly views this mistake as data about itself. One mistake might be a fluke, or an honest reasoning error, but now that it knows it's the type of character who *would* make a mistake like that, it's more likely to sabotage itself in the future. It's trying to complete the pattern you've given it, and if the pattern is "a chess game where the AI made a number of terrible mistakes", it will be more inclined to make a similar mistake as its next move.
It's like how a model with low temperature will get stuck saying the same sentence over and over. The same sentence over and over. The same sentence over and over. Once it's predicted the same pattern 3 or 4 times, it's so stuck in there that nothing can get it out.
Hmm, this doesn't quite make sense. Why is it that when I ask an LLM to tell me who did it in an Agatha Christie novel, it always gets it right, but when I ask it to list all Christie novels and all the guilty parties in them, it will make up the answer for at least one of them?
I don't know. My guess is that it's not able to string so many different things into a coherent thought, it loses some of them somewhere in its internal thought process, and then it has to hallucinate the things it lost.
But isn't that another difference to how human beings operate? Sure, if the work is tedious enough, humans might trail off and errors creep in that way, but a computer program should not be affected by that. And conversely, if we built LLMs in our image, warts and all, then I'm not sure it's worth the effort to have a digital mind that makes much the same mistakes as we do, just much faster and more confidently.
I don't think it's different from how human beings operate. If you asked me yes-no questions "Is such-and-such a chemical element?" I think I could get them ~all right. If you asked me to name all 118 elements, I would probably only be able to think of half. The difference between me and the AI is that the AI would then hallucinate the remainder, which is what this post is trying to address.
The questions "Is such-and-such a chemical element?" times 118, and "name all 118 elements" are not the same class though; the first is a yes-no question, the other is open-ended.
The correct analogy would be, "Name element 1", "Name element 2", and so on, and you got them all correct; but then you would fail at "Name elements 1 through 118". That would not happen to you, except maybe if you get bored halfway through and lose your train of thought, as mentioned. But, as per Aris C's example, it does happen to LLMs.
Aris' question seems more like my analogy than yours to me.
FWIW, I did some investigating to replicate Aris' experience, and here's my result:
https://share.google/aimode/Whhf8R8VNgOX4IuXn
I asked to list the Miss Marple novels together with murderers as a list, and then individually. It got both right.
I then asked about the list of all Agatha Christie novels together with murderers, and then I went through the Hercule Poirot novels, as summarized per Wikipedia. It got most of them right, to give credit. There were some spelling mistakes ("The ABC Murders" instead of the correct "The A. B. C. Murders", or "Erich Leidner" instead of the correct "Eric Leidner"), but okay.
As for the murderers, most of them were correct. I did not follow up on correct list items.
Among the incorrect list items, the questionable one was "Murder in Mesopotamia", where the list summary gave a wrong name, but the detailed follow-up question correctly pointed out that it was a fake identity, so mostly right.
The mostly wrong one was "The Mystery of the Blue Train", where the LLM identified "Derek Kettering and Ruth Kettering's maid, Mirelle" as the murderers, but they were falsely accused in the story. The real murderer is revealed to be Richard Knighton, which the LLM correctly explains when asked in a follow-up. That's an exact match with Aris' example.
/* Edit
Another very wrong one was
Third Girl: Frances Cary and Robert Orchard.
The character "Robert Orchard" seems straight up hallucinated, couldn't find it anywhere. It was probably supposed to be "Robert Orwell", a real character in the story.
*/
The catastrophically wrong one is "Elephants Can Remember". It gives "Dorothea (Dolly) Jarrow" as the murderer in the list, then her twin sister "Molly Preston-Grey" in the individual follow-up, who supposedly also killed one General Alistair Ravenscroft. In the second follow-up the LLM acknowledges that both Molly and the General are murderers. In both follow-ups, the relevant story bits are completely garbled in different and mutually incompatible ways, compared to the Wikipedia baseline.
I suppose it's better than nothing that Google puts a little disclaimer below each answer that "AI responses may include mistakes", but I suspect it serves a similar purpose as the Windows safety prompt which we've been trained to click through. It's to warn users, but primarily to absolve Microsoft/Google from their technical inadequacies by shifting the blame to the user who clicked "OK" or who acted on the LLM output.
As for the question which analogy is better, I found no insight from this investigation, and don't know if it's even helpful. However, what I do know: If a person had explained the murders in "Elephants Can Remember" like Google AI mode (Gemini 3) has, I would probably never ask that person anything of importance ever again, let alone pay them money for the privilege because (pinky promise) their twin brother is even smarter than them.
But this is still obviously still a major and somehow unsolved problem, right? Like, call it whatever you want to make it clearer, but it's still a problem!
Yes, see my last paragraph. It's an alignment problem, and a serious one!
I think that LLMs have an "alignment problem" in the same way that cars do.
Cars can be incredibly useful, and endow us with superhuman powers of travel. But they are also essentially massive semi-guided projectiles powered by chemical fire. It is very easy to misuse a car to kill people; sometimes, cars fail in catastrophic ways and spontaneously kill people through no fault of the user.
Cars can be "aligned" to some extent by addition of safety features such as lane keep assist, blind spot monitoring, etc.; but ultimately they cannot be perfectly aligned due to the nature of what they are: massive semi-guided projectiles powered by chemical fire. The same sentiment applies to LLMs.
I think the difference is that cars can't actively plot against you. The worst they can do is fail by coincidence. If a car could think "Hmmm, I would like to go north, but my driver is taking me south . . . I know, I'll drive by a forest fire, let the smoke choke my driver to death, and then go north!" the analogy would be perfect.
Cars can in fact actively plot against you, in the same way that LLMs can. For example, the car could be thinking, "I shall bide my time, and pretend like everything is fine -- little does my driver know that there's a slow leak in the brake line, and the next time he pushes the brake pedal really hard at a critical moment, the brakes will fail and he will crash into the wall !". This has never happened to me, but two of my cars have in fact successfully executed a similar plot several times, only using the battery instead of the brake -- resulting in quite a bit of financial and moral damage to myself (though not physical damage).
You are personifying a mechanical failure, that's distinctly different than an LLM plotting against you.
Well depending on what you think about LLMs it could be the exact same thing. LLMs are just easier to personify thanks to the ELIZA effect.
My point was that when LLMs deliver you incorrect (or outright dangerous) results, those are due to the same kind of a mechanical failure as a slow leak in the brake line (ok, software failure instead of purely mechanical, but you get my meaning). And you can anthropomorphize them in the exact same way.
Modern cars are increasingly computerized. What makes you so sure they can't plot against you? What about when companies inevitably put AIs into cars, first for entertainment, then for telemetry, then for driving?
Addressing second half: Then the LLM in your car would be the one adding the "plotting against you" functionality to your car, not carplay or dog mode or whatever.
The "for driving" part has already happened, as self-driving cars (mostly Teslas) are notorious for driving into all kinds of objects. Including sometimes pedestrians !
There's probably a very obvious response but why not penalize random guesses from the AI during pretraining like many standard tests did back in the day, e.g., +1 pt for correct, +0 for blank, -1/4 pt for wrong?
I'd much rather have my AI say "I don't know, but here's an educated/random/speculation guess" than not include the qualifier.
That's what I suggested. My guess is that's not what people want. We value an LLM giving us any answer more than we value it just saying it doesn't know. Part of the issue is that LLMs are currently used for applications which don't demand very high reliability/accuracy.
When I read that OpenAI link from a comment above, it basically agreed: the training hasn't rewarded humility/punished guessing even though that was possible.
In order for the AI to reliably respond with "I don't know" when it doesn't know things, we have to decide, during training, whether its response was wrong because it really doesn't know the thing and it should respond with "I don't know" or similar, or whether it was wrong for some other reason and should be nudged accordingly to correct whatever mistake it made.
This is actually really quite hard. Anything we know how to make is likely to get it wrong, so we end up with something that responds "I don't know" in situations where it really ought to know, and responds with confabulations in situations where it really should have shut up. Systems where we attempted to do this are massively outperformed by ones that didn't, and you don't see much of them in the wild.
During pretraining, the LLM actually does not output a guess. It outputs a probability distribution over all possible tokens (of which there are thousands). The reward signal strengthens the weights contributing to the one that happened to be correct in that instance. During inference, however, we sample randomly from that probability distribution and throw the rest away. The uncertainty could be a rich source of information, if we had access to logits. In the "Mr. _" example you would likely find that although Smith is the most likely token, it's barely ahead of lots of other surname tokens.
Unfortunately this would not be a perfect tool for detecting hallucinations/guesses, because in more realistic contexts it would be hard to interpret. When the bot tells you something questionable and a token has high uncertainty, is it because the model is guessing or because it wasn't sure of the best way to phrase it?
Still, I think we could learn a lot by testing this sort of monitoring.
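As a rough illustration of the kind of monitoring I mean, you can peek at next-token probabilities with an open model via the Hugging Face transformers library (the model and prompt here are just examples):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The cotton gin was invented by Mr."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
# If the top candidate surname only gets a few percent of the probability mass,
# the model is effectively guessing the name, even though it prints one confidently.
```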
In school, we called it bullshitting. Though that term has some obvious drawbacks as well.
What we should be doing, as a best practice, is asking AI to rate the certainty of its expressed opinions.
Deep Blue (edit: Watson, not Deep Blue) did this for jeopardy way back in the day right? That’s why it said “what is Toronto?????” for Final Jeopardy, the question marks represented its level of certainty
Deep Blue did this.
But Deep Blue worked quite differently from current LLMs.
I maintain a database of hallucinations (https://www.damiencharlotin.com/hallucinations/) in legal cases, and twice a week now I have got someone emailing me to make the point that it should be "confabulations", not "hallucinations".
My standard reply is that, yes, maybe, but that ship has sailed (and, in petto, what would my SEO be with the "AI confabulation database"). But next time I might redirect them to this article instead.
What's their rationale for why it should be called confabulation? Is it because hallucination implies mental illness whereas confabulation implies that it's normal, healthy behavior?
Yes, that's the gist of it. Hallucinations being sensory or auditory, they say, whereas confabulations pertain more to the making up of fact or stories.
That makes sense. I asked Copilot to explain the difference, and it said that in humans, hallucination is about false sensory perception, whereas confabulation is about false memory or explanation. In both cases, the person doesn't realize it's false. It said that in AI, hallucination usually means fully fabricated output, while confabulation means distorted or misapplied real information. It gave the following examples for AI:
Hallucination example: citing a study that doesn’t exist.
Confabulation example: quoting a real guideline but applying it incorrectly.
To me, that seems like a useful distinction, even if it doesn't match the definitions for humans. (If we use the human definitions, LLMs can't hallucinate because they have no sensory perception.)
This is a simple elegant explanation, which however begs the question "If all this is based on training time reward, why not solve the problem by penalizing them for wrong answers and rewarding them for saying I don't know?"
The OpenAI link above states that was possible, but leaderboards that LLMs compete on just haven't been doing that (OpenAI is claiming they are currently acting to reduce such guessing, even though that helps the accuracy scores of older models).
I agree. This seems like it would be simple to solve, with proper weights and rewards. In addition, why can't two AIs (or sub-processes) be connected, with one optimized as a fact checker, providing a score of estimated accuracy?
I do understand that if something seems this obvious and easy to solve and smarter people than me haven’t solved it, that the problem is that I don’t really grasp the issue. In which case, can someone explain what I am missing?
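To be concrete about what I mean, something like this (purely hypothetical sketch; `generator` and `verifier` are stand-ins for two separately prompted or separately trained models, not real APIs):

```python
# Hypothetical generate-then-verify loop.
def answer_with_check(question, generator, verifier, threshold=0.8):
    draft = generator(question)
    estimated_accuracy = verifier(question, draft)  # verifier returns estimated P(draft is correct)
    if estimated_accuracy >= threshold:
        return draft
    return "I don't know."
```

Though I suppose the catch is that the verifier is itself a model with the same tendency to guess, so its accuracy estimates can be just as miscalibrated.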
"the AI knows what it’s doing" is this true fr? in my understanding, the information technically is in there, e.g. "Smith" only has 1% and not 99%, but the assistant character doesn't know it's bsing
Interestingly, I think the answer is both. During the generation of the first "guessed" token, the model has some awareness that it is going out on a limb. But for tokens thereafter, it may have more conviction due to seeing previous tokens. That's not to say that these models can't be sufficiently aware to spot bs that they have said before, such as the now classic [seahorse emoji panic](https://futurism.com/chatgpt-haywire-seahorse-emoji).
One can "turn down the temperature" on AIs to get less guessing, right? Not eliminate it totally, but get less.
That's not what turning down the temperature typically does. It makes the AI less variable, but if it simply didn't know the answer, it's still just as likely to guess, except now, if you asked it the same question over and over (clearing the context between each query, of course), it'd repeatedly make up the same answer instead of a different one every time.
Oh.
Well, drat!
What Shankar said.
You probably heard that LLMs output a probability distribution over words (ok... tokens).
Temperature is applied to that distribution before sampling: a higher temperature adds randomness so the model doesn't always pick the highest-probability token, while a low temperature makes it stick closely to the most likely one.
Hallucinating would mean the highest probability token is just wrong.
The real answer is that the LLM is *always* "hallucinating". It doesn't have a concept of "names" or "cotton gins" the way humans do. It's just always predicting the next most likely token. The reason LLMs sometimes (arguably often !) give the right answer is because humans have written a deluge of text, and the most likely token is often the correct one. LLMs give the wrong answer when they venture out into the area of the corpus where the training data is thin. This is part of what makes their failures so hard to predict: it's hard to know ahead of time what unexpected gaps exist in the LLM's training corpus.
Humans get around this problem in a variety of ways. Firstly, of course, we can lie. Secondly, we have a pretty sophisticated model of the world that places constraints on our lies, and we know that all other humans share the same model -- so even when we lie, we wouldn't say that e.g. the cotton gin was invented by an avocado. LLMs have no such constraints.
Have you read https://www.astralcodexten.com/p/next-token-predictor-is-an-ais-job?
Yes, and I posted comments on that article as well (IIRC). I did not find it convincing -- but perhaps you did ?
Yeah, just went and searched your name and read your comments there.
I responded to your comment with the link to the post because "next most likely token"¹ sounded like you may be misinformed about how they function. Anyhow, your comments there make it clear that you know what you're talking about, and are just simplifying the wording here. Carry on :)
¹: which is neither the next most likely unless temp=0, nor does the likeliest tokens necessarily map to how often it was in that order in the training data (especially when not dealing with a base model)
LLMs give the wrong answer when they fail to model what the right answer is. Which is early and often (because they do not have object models at all). If the corpus doesn't actually "by and large" have the right answer, they're going to lie, over and over again.
Training data can be very thick, and they can misapply the training data -- asking an AI to provide a recipe "the way a commercial bakery would make it" is a classic in this space. There is data out there on how commercial bakeries work, but it's dwarfed by the copycat-recipe corpus. And the AI cannot tell that "the way a commercial bakery would make it" means to filter out all the copycat recipes.
This seems correct for a post-trained assistant, but I think the term 'hallucination' is more accurate when applied to a base model, which I think is what it originally described.
Without the RL to make it adhere to reality as much as possible, the AI will make shit up in a way that's much wilder and more elaborate than a human making shit up for school. I once asked LLaMa 405b about some innocuous thing and it responded by inventing a multi-paragraph parable about a 'mutilation cult' that did not exist, complete with made-up locations and timelines. This was disconnected both from consensus reality *and* the "trying to guess the right answer" framework a trained assistant would use. If there's an analogue to this in human behavior, it's probably dreaming; spinning out a fictional world totally unmoored from truth, driven by imagination while unaware that its story is fictional. Or at least I assume unaware - not sure if L405b would have passed its own "lie detector test" in that context.
I don't think a lie detector test would even make sense for a base model. It has no concept that it can be telling the truth.
I disagree. I'm not sure if the same lie detector test linked in the article would work out of the box, but there's definitely a concept of 'truth' in base models. At the bare minimum, it knows things like that "Paris" has the "capital" association with "France". It knows that if someone says "The capital of France is Berlin", something fishy and unexpected has happened, and can recognize that the speaker is not aligned to truth. Being able to tell this is predictively useful: if the speaker is saying this false thing, they might be more likely to be wrong in other ways, right? The deception feature found in the article would be as useful in a base model as an assistant -- it's just the posttraining which tells the model that deception is *bad* and that it shouldn't randomly lie.
Oh it certainly knows the concepts of truth and lies. And often it will know if the character being simulated is lying. But in many cases the character is telling the truth and is just talking about something in the outside world that the LLM can't verify. Somebody's blog says they're on vacation in South blank, and the LLM isn't lying when it puts the most probability on Dakota, it's giving its best estimate of the probability distribution of the truth. But the real answer was Carolina, oops, hallucination. No lie detected.
So, most multiple choice tests I've ever taken would subtract a fractional amount of a point for bad answers, to make the expected value of random guesses negative. Not really "some strictly non-zero benefit at zero cost".
The possibility of which makes me ask: what's the incentive structure we're giving LLMs in RLHF to make this happen? Or is it something deeper? The fact is that they do have the option to say "I don't know", and if RLHF was rewarding this adequately, then you could make them train in a framework in which "random guess" actually carries negative expected value. You just have to punish it, like my professors used to do. So... why doesn't it work? Is it intrinsic, or are we doing RLHF wrong?
Lol at "billions of tokens". What is this, 2018? Frontier models are trained on tens of trillions of tokens.
Anyway, I feel like lots of people (especially AI safety people) move back and forth between the "simulator" framework for thinking about AI and the "character" framework. And yes, the simulator is not hallucinating, it is predicting. But what about the character being simulated? What about Claude, that nice helpful assistant? If I ask Claude himself whether a citation is correct, and Claude says that it is, do you think there's a sense he is lying? Isn't modelling this as a hallucination more accurate when thinking about the character?
Of course, I'm happy to never think about the characters at all and only talk about the simulator. But then I hope we can agree on things like "LLMs cannot suffer, they merely predict a token that says the character is suffering". Lots of people in AI safety disagree with this! They think the character can suffer! But you can't have it both ways. If the character suffers, the character hallucinates. If the simulator merely guesses, then the simulator also merely roleplays the suffering.
I mean, it's true that after billions of tokens, its weights are in a better pattern! I still do drugs, but I used to, too! (I've edited that section).
Thanks for bringing up the character vs. simulator question, which I hadn't really thought about here. Low confidence, but I think my model is that pretraining creates the simulator, posttraining creates the character, hallucinations come from pretraining, and the more that posttraining overrides pretraining, the fewer hallucinations there are? So hallucinations are a case where the shoggoth "leaks through" the mask. But low confidence here and I defer to Janus or anyone else who understands this better.
It's only a character hallucination if the simulator knows the true answer, though, right? If the simulator knows the citation is wrong, but Claude insists it's correct, then Claude is lying/hallucinating in-universe. If the simulated universe is one where the citation is right, Claude is accurately reporting the truth - it's just that the truth of its simulated world has diverged from ours.
Just wondering: where do you actually get tens of trillions of tokens? I thought it was already the entire Internet as training data, a few years ago.
I don't know for sure why it increased since a few years ago. There are better and worse crawls of the internet, though, and e.g. FineWeb (https://huggingface.co/datasets/HuggingFaceFW/fineweb) claims to have 18.5 trillion tokens worth of cleaned up English web text. You can get to higher token counts by using non-English data, crawling even deeper into the web, or cleaning the data less stringently (since a lot of the internet is junk/spam, crawled data is usually cleaned before training). Then there's image/video/audio data, synthetic data, etc., though I was not counting that when I said tens of trillions of tokens.
>or cleaning the data less stringently
I guess part of that would be AIs feeding their successors, as has been predicted.
> But you can't have it both ways.
I don't see why not. To me this sounds like "Alice says that there is a house over there, but she also claims there is a window over there. You can't have it both ways, either there is a house or a window."
How is this case different? Why can't an AI be a simulator and a character at the same time?
If it's a character, I'm allowed to say it is hallucinating. "Hallucination" is a good description of how that character relates to the wrong information.
If it's a simulator, I'm allowed to say it is merely predicting the tokens that indicate suffering, without necessarily suffering for real. Even if simulators can be conscious, for all we know they could enjoy roleplaying a suffering character.
Perhaps the former is merely a semantic dispute. But the latter is important because it has moral implications.
Do you really think of "Claude" as a character ? It honestly never occurred to me to see claude.ai that way (nor any other LLM). I would say things like "Claude thinks X instead of Y", but then, I also say "this file parsing program that I'm writing thinks X instead of Y, it gets confused by the extra whitespace" -- being fully aware that I'm speaking in metaphors.
Talking about LLMs as characters doesn't imply that they actually are characters. Sometimes it's just easier to say "Claude thinks this" or "Claude prefers that" or "Claude hallucinated" rather than to describe it in a technically accurate manner. People anthropomorphize all sorts of things as a figure of speech, so this isn't unique to LLMs.
I do think there can be a risk that using anthropomorphic language makes people start thinking of them as people, and so I try not to do it too much, but that doesn't mean it can't be done at all.
By the way, are there a significant number of AI researchers that think the LLMs of today might actually be conscious in the sense of having real experience? I thought that was still a fringe position and that the debate was more about if it's possible in the future, e.g. is Data from Star Trek a person, not is Claude Opus 4.6 a person. From what I've seen, even the companies that produce and promote LLMs consistently and unequivocally state that they're not conscious. Why would they say that if it were actually an open debate?
The companies that produce these LLMs don't generally view them as conscious (though that's less clear for Anthropic). But the AI safety community mostly just says it's unsure. And some AI influencers, notably Janus, go further than uncertainty and seem to assert current LLMs are basically conscious.
Last I checked, the effective altruism forum was full of people who were quite concerned that perhaps current LLMs are either already conscious or soon would be. It's not a fringe position among rationalists.
What's the rationale behind arguing that they could be conscious? It'd be interesting to read a representative article.
I'm surprised to hear that the EA/rationalist community is full of people who think it's already conscious or soon will be. I think of Scott Alexander as being a prominent member of that community (or at least a prominent ally), and my understanding is that he doesn't find it plausible. For example, near the end of https://www.astralcodexten.com/p/the-new-AI-consciousness-paper he discusses the arguments for and against treating AI as conscious. The arguments in favor are about whether it's good for people, not whether it's good for AI; he compares mistreating AI to mistreating a stuffed animal. And he's talking about much more advanced AI years or decades from now.
If there were other plausible arguments prominent in the rationalist community, wouldn't he have mentioned them there?
OK I guess "already conscious" might be more fringe than I thought, but e.g. take a look at
https://arxiv.org/abs/2411.00986
https://eleosai.org/research/
For example here, https://eleosai.org/post/why-it-make-sense-to-let-claude-exit-conversations/
there are quotes like
"My colleagues and I have argued that AI systems could become conscious or otherwise morally significant, potentially quite soon, especially if rapid AI progress continues. But for AI systems right now, we lack strong evidence of consciousness. I actually think it’s unlikely that Claude Opus 4 is a moral patient, and in my experience, so do most (not all) people who work on AI welfare. But given the potentially enormous moral stakes of AI consciousness, we can’t just ignore the problem because it’s confusing and difficult, especially as AI grows more capable and complex every passing week.
So what should we do? At Eleos, we take two approaches. We work to reduce our uncertainty; philosophical thickets of consciousness notwithstanding, there are many tractable ways that we can make progress. At the same time, we look for AI welfare interventions that make sense given our current uncertainty."
Did you catch that? Both a belief in consciousness "quite soon" as well as looking for AI welfare interventions *today*, just in case they're already conscious (though this is viewed as unlikely).
Thanks for the links! It's interesting to see how people think about consciousness and morality. What's your perspective on where consciousness comes from and whether AI can be conscious?
My belief is that consciousness comes from a soul, and that only humans and animals have souls. So I don't think AI can ever be truly conscious unless God decides to give it a soul. I believe that animals and humans have fundamentally different types of souls - animals have material mortal souls, whereas humans have spiritual immortal souls. So although animals are conscious and should be treated humanely, I don't see them as moral agents with the ability to experience guilt or have a relationship with God. And I think AI should be treated humanely only because of how the treatment impacts humans, like what Scott said about treating stuffed animals humanely.
"If I’d guessed “John Smith” for every short answer question I didn’t know, I might have gotten ~1 extra point in my school career, with no downside."
I don't think this is quite right. There would have been a downside, and I think it's the same downside that held you back from actually doing it; you would have felt like you were sacrificing credibility with your teacher(s), who would have been likely to think you were foolish and shameless.
Yes. It'd probably negatively impact the grader's willingness to give you partial credit on questions you did actually know something about.
I do feel like chatbots have a peculiar habit of going beyond simply making up a plausible-sounding answer. For example:
You ask it something and it gives an answer X; it's very certain.
It doesn't quite sound right, so you want to double-check, and it provides a source Y.
You read the source and it doesn't confirm X.
It insists it must be there.
You ask it to provide a direct quote, or a location to read. It finally admits maybe THIS source isn't clear, but actually source Z definitely proves X.
Rinse and repeat.
It really does feel like it doesn't know it's guessing, as if it genuinely believes X and wants to convince you of that despite the evidence. Probably that's just me projecting human agency onto it (maybe it's more of a "personality" thing: it's trying hard to be a useful assistant, and a useful assistant would be confident in its answers), but it's why I like the word hallucination.
Sometimes it helps if you ask them if they hallucinated. I think they just need an "out", some kind of cover for changing their minds. Otherwise they can't come up with any alternative explanation for why they said it, so they assume they must believe it.
Think of it like an improv actor. It always "Yes, and"s itself. Once the AI has said something, it will "play" that character; if you can get it to say something unusual or false, it'll stick with that unusual or false statement indefinitely. This is the root of, e.g., this Grok conversation: https://x.com/i/grok/share/dAA6HRZhpiSQrRRDMEDRH6XkB (once the AI is tricked into saying "Jew," it has to defend that statement) or the Seahorse emoji spiral: https://medium.com/@nkharshbachhav/why-does-the-seahorse-emoji-drive-chatgpt-insane-70a4ce061597 .
I had some interesting results asking Claude to find sources on whether raccoons more commonly live in dense forests or forest edges. I think it was probably right (it consistently gave the same answer across multiple conversations, and it was the one I thought was more likely), but it just could not find any sources supporting that. Sounds similar, so I thought I'd put it out there for anyone who wants to try running it themselves.
I took great advantage of this exam-passing technique during oral exams at my university. We had an old Soviet system where to pass an exam you had to chat with the prof 1-on-1 about a predefined broad topic, plus answer some curveball questions on the fly.
Never had the greatest memory for details like some of my peers, but I guess most of my profs could appreciate that I had some ability to lean on general knowledge, common sense, and pattern-matching to essentially attempt to re-derive on the fly some stuff I should've learned instead. Worked best with biophysics, not so well with mycology.
Nowadays, each time I query an LLM about my topic of expertise, and it confidently states something I know to be a common misconception, I can see myself in it. Makes my blood boil, and makes me appreciate the patience of my profs for having to put up with this.
Does this mean if I ask an AI the same easy question 1,000 times, one it definitely knows the answer to, it will give the same right answer every time? Or is it more complicated than that? Will it sometimes 'hallucinate' even easy answers, and if so, how does that fit with the above?
It depends on how you've set it up.
A lot of models use something called top-p (nucleus) sampling, which means the sampler keeps only the smallest set of most-likely tokens whose probabilities sum to a threshold p, usually something like 99.7%, and picks from within that set. So if the model is at least 99.7% confident in one token, it will ignore all the other possible tokens and choose that answer 100% of the time. (Or if it has two guesses that sum to 99.7% confidence between them, it will always pick between those two, etc.) There's also top-k sampling, where you instead keep the k most likely tokens and only pick from among those.
This is (partly?) done to avoid the failure mode you're thinking of, where it will occasionally say something extremely unlikely and wrong, just because it never entirely assigns 0 probability to *anything*. If you don't use a sampling method like one of those, then yeah, it would (rarely!) hallucinate even easy answers.
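Here's a minimal sketch of what those two samplers do, just to make the mechanics concrete (toy code, not any particular model's actual implementation):

```python
import numpy as np

rng = np.random.default_rng()

def top_p_sample(probs, p=0.997):
    """Nucleus (top-p) sampling: keep the smallest set of most-likely
    tokens whose cumulative probability reaches p, sample within it."""
    order = np.argsort(probs)[::-1]      # token ids, most likely first
    cum = np.cumsum(probs[order])
    keep = np.searchsorted(cum, p) + 1   # how many tokens survive the cutoff
    kept = order[:keep]
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

def top_k_sample(probs, k=40):
    """Top-k sampling: keep only the k most likely tokens."""
    kept = np.argsort(probs)[::-1][:k]
    return rng.choice(kept, p=probs[kept] / probs[kept].sum())

# Toy distribution: one dominant token plus a long tail of tiny ones.
probs = np.array([0.998] + [0.002 / 99] * 99)
print(top_p_sample(probs))  # always 0 -- the unlikely tail never survives
```

With plain, untruncated sampling over that same toy distribution, the dominant token would still lose about 0.2% of the time, which is exactly the rare wrong-answer-on-an-easy-question behavior described above.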
I agree with you, but I want to nitpick:
> If I’d guessed “John Smith” for every short answer question I didn’t know, I might have gotten ~1 extra point in my school career, with no downside.
The downside would be that you'd have to write the words "John Smith" an average of 10000 times. Doesn't seem worth it.
I was a kid who made stuff up for essay questions. In high school I took a class on the history of Western art, and I recall a three-question exam where I had no idea how to answer one question, so I wrote a long essay where I explained the relationship between, say, mysticism and romanticism in pre-Renaissance painting via a long satire that involved Buddhism and the Chinese mafia in Formosa (that's the single detail I recall, that I referred to Taiwan as Formosa).
The teacher was so amused he read the essay to the class and gave me half credit. Which got me into passing territory, and set a bad precedent for similar situations in college.
Are you otherwise known as Sam Kriss?
Some exams penalise getting the wrong answer in multiple choice - so, e.g., 1 pt for correct answer, 0 for blank, -1 for wrong. Why not train LLMs the same way?
My old man (sociology prof) would provide "don't know" as an option to all T/F or multiple choice questions on a quiz. It provided partial credit (relative to a wrong answer) - and he would sneak in a few unintelligible questions for which it was the RIGHT answer. Couldn't our AI training mix in similar incentives?
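To make the incentive concrete, here's the expected value of a blind guess on a 4-option question under each scheme (a toy calculation; the partial-credit value for "don't know" is an assumed number):

```python
p = 0.25  # chance a blind guess is right on a 4-option question

ev_standard = p * 1 + (1 - p) * 0    # +1 right, 0 wrong: guessing costs nothing
ev_penalty  = p * 1 + (1 - p) * -1   # +1 right, -1 wrong: guessing loses on average
ev_dontknow = 0.25                   # assumed partial credit for "don't know"

print(ev_standard, ev_penalty, ev_dontknow)  # 0.25 -0.5 0.25
```

Under the penalty scheme an honest "don't know" strictly beats a blind guess, which is exactly the incentive standard LLM training objectives lack.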
gonna start deriding humans for hallucinating when they misremember something
I always found it strange that people see odd failure modes in LLMs as invalidating the claim of reasoning or intelligence. Humans also have odd failure modes, in some ways very similar to LLMs. Humans are prone to the very same kind of wild confabulation from impaired self-awareness that plagues LLMs. (Though I guess post-Oliver Sacks fraud revelations I should research how much of these kinds of cases hold up to scrutiny.)
Rather than a denial of intelligence, to me these failure modes raise the credence that LLMs are really onto something. Most programs output meaningless garbage when there's a fault in the program. But LLMs as well as humans output very sensible nonsense. The convergence of failure modes to me is highly suggestive of an underlying computational similarity in a way that a convergence of success states is not.
I think these hallucinations only prove that LLMs have learned the syntax of the language, without learning anything of its semantics.
I don't know if this is my autism or cultural differences, but here in Poland I don't recall anyone being ashamed of guessing answers.
In fact, I was the champion of it - here, teachers call students to the board to answer questions in front of the whole class. If I didn't know the answer, I would guess / hallucinate. When I didn't know the answer on a long form writing test, I would guess / hallucinate too. In both cases, whenever I got any part of this right it was a reason to be proud and celebrate.
The biggest shame was to not know anything and not even try.
If its state of lying can be measured so simply, then why hasn't a filter been applied? I only want to see responses with less than 0.75 dishonesty.
I took a similar view on this two years ago (https://omnibudsman.substack.com/p/llms-like-people-lie-without-thinking) - I'm not convinced that the distinction between "guessing" and "hallucinating" is well-defined, but I'm also not convinced that there's even a well-defined boundary between "guessing" and "knowing." In terms of the gears-level operation of the models, getting the right answer and getting the wrong one look the same. I think this is likely true for people as well. I think split-brain patients are the dispositive case.
Humans "hallucinate" too, and there are theories of human perception and thought as a system of layered, recursive prediction functions (see books by Andy Clark and Ray Kurzweil). Everyone has weird thoughts arise. We examine them, go "nah.." and discard them, or sometimes go "huh!?!" and modify them into something useful. That's called creativity, and nobody knows how it really works or where these thoughts come from. If it gets out of control it's mental illness. The study of this is part of meditative traditions. I think the only real difference between human and AI processing is recursive self-training reinforced by real world consequences either experienced or witnessed, as Scott noted -- we don't want to die or look like an idiot to others. It looks like AI models are getting better and remembering what they were doing and what works and doesn't, and they are starting to talk to each other as well as to us, so the gap is closing fast.
"AI mid-hallucination, they see the model activates features related to deception - ie fails an AI lie detector test. The original title of this post was “Lies, Not Hallucinations” and I still like this framing - the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay"
"The AI doesn’t have a better answer than “John Smith”. It’s giving its real best guess - while knowing that the chance it’s right is very small."
If the AI can tell and observers can tell, why doesn't someone write a function that alerts the user of the AI when quality is degrading?
That would mean using The Most Forbidden Technique.
It seems like this is a training problem where we don’t reward models for saying “I don’t know”. If we can detect when deception is happening, why don’t we reward models for the right answer (revealing intent to deceive) before we train in the wrong answer (lie/guess for the expected value and not the safety calculation)?
This behavior is only “shameless guessing” in the specific context of a student taking a graded test, seeking to maximize graded points.
In the context, say, of work colleagues obtaining information from one another, it would not be “shameless guessing” but something closer to “pathological lying” “total ineptitude” “willingness to derail entire multi-million dollar project because he doesn’t have the ego-strength to acknowledge what he does not actually know.”
With AI, I’d posit the context is much closer to work colleagues (stakes: at best, business/reputational loss—at worst, getting warfare risk wrong???? Total annihilation of the human race????) than students being graded on their answers (stakes: one bad grade).
> but something closer to “pathological lying” “total ineptitude” “willingness to derail entire multi-million dollar project because he doesn’t have the ego-strength to acknowledge what he does not actually know.”
Or, alternatively, "failing upwards". It probably works for the same reasons it works for AI. People don't like humility, apparently.
Disagree this behavior “works.” At some point, there isn’t enough bullshit to sufficiently mask reality. Maybe you get by here and there, but you will be found out eventually. People are probably already laughing at your incompetence behind your back (not *you* specifically, Jimmy, I don’t know you). We may not like to BE humble ourselves, yet people also WANT humility from us. A tension.
Until a few months ago, GPT used to do the equivalent of hallucinating when I asked for an image it was not able to make. It would produce some abomination with a self-satisfied little note underneath saying “here is the whirlpool image you requested, with features a, b, c, d and e.” But it’s gotten a lot better. Now it gives the image with no comment, and if I say, “so how well does this match with the prompt, hmmmm?” it will reply earnestly that the image fails to have features a, b, c and d, and has only a partial version of e. Then it asks to be hit with a paddle, and follows with “please sir, may I have another?” So all that shows that GPT knows it did a bad job, and that it is able to admit it. (Seems clear that it would be easy for developers to modify it so that it asks *itself* whether an image matches the prompt before presenting it to the user, and then tells the user it was not able to make an image with the features the user had asked for.)
If I ask whether it is able to make a certain image, it will often say no it cannot, and can explain why. For instance: “The current image-editing system is extremely weak at enforcing geometric changes when the subject is water, fractal structures, or anything without stable, trackable edges.” It is also sometimes able to coach me on how to construct a prompt for an image that will work decently. That all represents quite a big change from the way GPT functioned a few months ago. I had many frustrating image prompt experiences where it was clear that GPT understood perfectly well what I wanted, and appeared confident that it could get Dall-e to produce it, but then would serve up Abomination version 2, which was usually Abomination 1 with trivial variations.
I asked it recently about its increased ability to understand and control the workings of Dall-e, and it said that:
earlier versions of me had almost no introspective access to why a request failed. If the image model did something wrong, I could only guess. Now I have much clearer diagnostic insight into:
• what kinds of constraints the generator can and cannot follow
• which parts of a request exceed its representational capacity
• when a conflict exists between two requirements
• when a request would force the model into geometric inconsistency
• when the generator is likely to snap back to defaults (e.g., symmetry, “attractive” compositions, central subjects, smooth lighting)
Because of that, I can explain:
• what part of your instruction is feasible
• what part will be misinterpreted
• whether the failure is fundamental or accidental
• whether modifying the structure of the request will help
• whether a manual pre-edit (Photoshop, depth mask, geometry, silhouette) is required instead of pure prompting
Which impresses the hell out of me. “Introspective access..” Wow.
That's impressive, but seems to be connected to the fact that there are multiple different systems at work here. There's the generative AI to make/modify pictures which is tied in to the text based GPT, which has information either in its training data or fed to it in the direct description of the image generator of what the system's limitations are.
Yes, you're right, I hadn't thought of that. On the other hand, different systems that are working together aren't as separate as, say, 2 people working on something. You could think of GPT's statement that Dall-e is weak at making geometric changes in things made of water as being analogous to a person saying "I can draw figures in outline pretty well, but I'm no good at shading."
This is interesting progress.
If you're interested in LLMs' ability to introspect, you might be interested in this article by Anthropic (https://www.anthropic.com/research/introspection). It shows that Claude is sometimes (but not consistently) able to tell when and how its mind is being artificially directed toward an irrelevant topic.
This lines up with the following article:
Of dreaming and wakefulness https://pubmed.ncbi.nlm.nih.gov/1754050/
It says that "wakefulness" is what we call a dream constrained and modulated by signals from the external world.
I feel like a straightforward fix could be to have your model set up to answer you in text as it always does, with something more hardcoded to display a confidence score? Or is that actually impossible for something like ChatGPT? (Because it's giving a free-text answer and not selecting from multiple choices.)
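It's not impossible in principle: some APIs expose per-token probabilities. Here's a rough sketch of the idea, assuming the OpenAI Python SDK's logprobs option (treat the exact field names as an assumption and check the current docs):

```python
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Who invented the cotton gin?"}],
    logprobs=True,  # ask the API to return per-token log probabilities
)
tokens = resp.choices[0].logprobs.content
# Geometric-mean token probability as a crude "confidence score":
avg_logprob = sum(t.logprob for t in tokens) / len(tokens)
print(f"confidence ~ {math.exp(avg_logprob):.2f}")
```

The catch is that token probability measures fluency, not truth: a model can be maximally "confident" in a hallucination, which is presumably part of why nobody ships this as a headline number.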
I don't dislike the term hallucinations - can't AI produce some kind of Mandela Effect answer? That's not guessing; it's closer to hallucinating.
As for long answers, I remember a high school test about the Russian Revolution and a question about the difference between Bolsheviks and Mensheviks. I didn't know. I *guessed*. The answer was marked correct.
It was an informed guess. I knew Bolsheviks were radical, so maybe Mensheviks were less so? You can train the AI to disclose its informed guesses; I just don't know whether this is practical.
My college in Hungary gave 0 points for no answer and -1 for wrong ones. Honesty is a matter of simple incentive.
You can add a hook (to Claude at least) to automatically "independently" fact-check any claim it has just made. It will tell you if the bot guessed or gave a misleading answer.
It's kind of remarkable that we can now train machines to do creative writing, but not that once we do, they do creative writing.
A question I've been meaning to ask for a while.
To what extent is "incorrect" or sloppy thinking on these questions a help or a hindrance to preventing AI extinction or whatever? Scott here phrases it as a hindrance: the "Stochastic Parrot" crew underestimates the intelligence of AI, and will therefore be taken unawares by it.
I'm not sure I agree. If the blunt question is "should we give the AI direct access to factories/Aircraft Carriers/tax data" I think the stochastic parrot crowd would exactly come down where Scott would: "no or at least not yet."
And I think this is incredibly important as "AI doom" becomes a less theoretical and more concrete political question. Attitudes like the one scott is lampooning above are...broadly correct or not depending on the composition of your bubble and who your allies and enemies are.
This might be my profession, my bubble, my world, whatever, but I think presently I see too many people who think AI *ought to* take over a world or an important responsibility, and I want to push back on that. "Hallucination" is a helpful way to talk about what it's doing to make my broader point that WE SHOULD NOT TRUST IT.
In Scott's profession/bubble/world, he might presently see too many people who think AI *can't* take over the world or an important responsibility, and he wants to push back on that. "Hallucination" might therefore not be helpful because his point is less "don't trust the AI, it is fallible" than "don't trust the AI, it might not have your best interest in mind"
But it's an open question I think how much the broader political consensus should be DO NOT TRUST THE AI in whatever form. Your allies might be in unexpected places. I can easily imagine a world in 2034 when the AI is coming for my job and I'm sick with dread that we're just gonna have a robo-supreme-court and the Pope makes some pronouncement to the world's 1 billion catholics saying "We should not trust the AI. It is inherently evil because it lacks something every one of us has...a soul"
and I just say "you know what, for all intents and purposes the pope is actually right on this."
I think there's another term that fits better: "bullshit". There's even an essay and book called On Bullshit (https://en.wikipedia.org/wiki/On_Bullshit). The distinction is that a liar is trying to hide the truth ("I did not have sexual relations with that woman"), while a bullshitter cares only about whether the speech is useful, not if it's true ("the cotton gin was invented by Thomas Edison in 1910"). I think the latter maps much better to LLMs.
I also feel that "bullshit" successfully includes other annoying LLMs behaviors such as guesses at the unknowable ("the weather on December 31st will be cloudy"), repetition of common misconceptions/mistakes, and perhaps even the sycophantic aspects ("that's an excellent question!"). Were it a person, I would happily use the term "they are bullshitting me".
Yes, a bullshitter sometimes doesn't even know the truth. So he couldn't lie even if he wanted to.
I personally like the term "hallucination" because it maps nicely with predictive coding (at least, to the extent that I understand the latter).
An LLM in its default mode is doing something akin to dreaming: generating predictions about the world that are uncoupled from any kind of sensory data. The reason good harnesses are so important is that they provide regular "sensory" feedback to the LLM and keep it grounded. In this sense at least, "hallucination" seems like a pretty good word for "dreamt something incorrectly"
Great. Now the AIs siphoning the internet will pick up your little essay on the cotton gin (which was hilarious, by the way) and give it the same weight as any other single item on the subject, even though it doesn't match anything else.
It's called "Generative" AI for a reason. Generative comes from "generate" and it means "make shit up". The problem is that it doesn't know when you want this mode and when you want paraphrasing.
No, it means make stuff, not make stuff up.
Others may call it hallucinations, but I call it bullshit.
So your recommendation for helping train LLMs out of hallucinations is to make a bunch of other AI agents relentlessly bully them for getting any question wrong. I support this
When you say that a hallucinating/guessing AI activates the neurons for deception, would these same neurons be activated if the AI responded "I'm really not sure, but my best guess is..."? I guess I'm asking, do these activations correspond to low confidence, or do they correspond specifically to "unacknowledged low confidence"?
Better models will consistently answer "I don't know" to more basic questions that weaker models will shamelessly guess at. I think that's because better models have a better understanding of what's inside their training distribution--
I'm referring to "training distribution" in the way Jeremy Howard means it here:
https://www.lesswrong.com/posts/hvun2mP2yEr4kyKWk/podcast-jeremy-howard-is-bearish-on-llms
--and so can better identify gaps in their knowledge within that distribution. However, as soon as you go outside that training distribution into areas that the LLM hasn't been trained on at all, it goes straight back to shamelessly guessing.
A human might do the same thing, because, you know. Humans can be shameless too. But this seems to me to be a consistent area in which humans and LLMs behave differently.
--
I don't know if I really have a point with what I said above, but anyhow, I do agree "shameless guessing" is a better phrasing than "hallucinations"~
Strong dissent, "hallucination" is exactly the correct word to describe this phenomenon, in the same sense as it is used in predictive processing, which is the leading theory of how our cognition works: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
That is, hallucinations are what you get when you are overly confident in an interpretation about something and there's nothing to correct it when it's wrong. (The quintessential example of this being your blind spot, where the optic nerve attaches to your retina. Your brain perceives what ought to be there, and there's no feed-backward mechanism to correct it. The only reason we don't call it a "hallucination" is because it tends to be the correct answer.)
Which is exactly what the AI chatbots are doing: it has a hypothesis about something, and it spews that out without having ever received any push back to correct it.
Why can't we train them to be a bit more shameful? It seems too obvious a solution. Just punish them slightly for giving wrong answers and reward slightly, but not too much, for saying I don't know.
I assume the reason that we don't guess on short answer or essay answers is because it becomes painfully obvious that you are trying to deceive.
There's a meaningful difference between getting an answer wrong because you didn't know or made a mistake, and clearly trying to game the test by making something up in the very unlikely scenario that you guess right. In multiple choice, this isn't clear. In a short answer, there's no world where you wrote "Thomas Jefferson invented the cotton gin" while believing you were getting it right.
Teachers respond positively when a student tries to get a right answer, but they react less positively when a student wastes their time when they're trying to randomly guess the right answer despite very low odds. The +1 point you might get from random chance is more than outweighed by the social cost of wasting your and your teacher's time.
For AI I suppose this would be imposing a cost for hallucinations when there's shameless guesses vs. normal mistakes, but I don't know how you do that or determine which is which.
I don't see why "lie" is any better than hallucination. They don't have agency. They don't "understand" reality. They don't "knowingly" do anything. They don't "hallucinate." And they don't "lie."
Back in grad school I would encourage my students to never leave anything blank for any reason. Graders are not above giving pity points, and a blank only guarantees nothing.
Most people who make your claims can't say what they mean by "agency", "understanding", or "knowing". What do you mean, and what's your argument (that doesn't also apply to humans)? I'm not sure you're wrong, but what you're saying seems far from obvious to me.
The best I can do is give you an example from experiences of using an llm. I was using one to help me repair a water pressure tank in my house. There was a lot of back and forth about how to check if the bladder was working properly, why I was detecting a smell in the water, a few other things. The exchange was very helpful in resolving some issues. The exchange included many features of the tank, how big it is (it's quite large) and the fact that it was hard-piped in to the system, etc. For example, I gave it the model and it told me the capacity. We discussed the dimensions of the tank, what it was made of, how it was piped in, etc.
When we got around to discussing how to increase the air pressure in the tank, it insisted repeatedly, despite my protestations, that all I needed to do is put the tank in the car and take it to a gas station to use the air pump there. It kept telling me that I could increase the air pressure in just 5 minutes! Well, it's a half an hour to the nearest gas station and the llm didn't even think to ask about that (indeed, a lack of clarifying questions is I think indicative of my point). But no human who'd had that prior exchange with me, and who "understood" the circumstance, and who had "understood" the previous discussion that we had, would have made such a recommendation.
Those recommendations made it clear to me that LLMs don't understand things in any conventional sense of the term, and they don't "know" things. Perhaps the question about agency is a little less direct, but for me, whether they have agency is a function of whether they know and understand. I had some other similar experiences working with an LLM, if you want other examples. I don't think this is merely an issue of semantics about what we mean by the terms "know" or "understand." I recognize there's a definitional aspect here, that you have to first define what "know" or "understand" mean. But in any conventional sense, recommending that I just take a tank that wouldn't fit in my car, and that weighs probably over a hundred pounds, to a gas station, when disconnecting it from the system would take hours of complex work, suggests to me that retaining and applying the details of the prior exchange is a key feature of knowing and understanding, and one the LLM lacked.
I wasn't even aware that students who didn't guess on multiple choice questions existed. I feel vaguely ethically tainted that I never even considered that there might be some kind of ethical wrongness associated with guessing. I thought everyone guessed, including on short answers!
LLMs don't model. Therefore, they don't understand why it's a bad idea to lie sometimes, in that your credibility gets shot to sh*t. Therefore, people who want folks to keep using AIs, and not correctly throw out code that can't even hit two nines of reliability, and won't ever be able to hit three nines, call them "hallucinations." They're inbuilt into the very bad AIs we like to call LLMs; they're part of the architecture. One couldn't convince an LLM not to lie, not to hallucinate.
Does anyone know whether hallucinations for neural nets are inbuilt?
"the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay."
" AIs are smart enough to understand the game they’re actually playing"
Man, you're really starting to lose me on these AI posts because I can't tell if you're supposed to mean these metaphorically (I hope so) or literally anymore.
He could mean these entirely metaphorically, or he could be wrong about them, but I think the claim is that an actual deception module activates during hallucinations, the same module that activates when the AI has other reasons to say things not in accord with its internal world model. I'm not an expert on the mechanistic interpretability literature, so I don't know how accurate this claim is, but it's a real empirical claim.
Did you read the Sequoia post?
If anything the contrast with humans would run in the opposite direction, no? When AI guess shamelessly they at least have optimized their best chance at success down to a T using all available data, and "know" that they have optimized. So it's not just a random guess, it's an exquisitely calculated one (however unlikely to succeed). Whereas, when humans guess it's more often a random stab in the dark - literally *guessing* - since they lack the sheer computing power to optimize. When intuition/tacit knowledge can't substitute, many will throw up their hands and just grab for something random, not because it's the best possible option. Perhaps you see this as a matter of degree where the only difference is the amount of data processed and relative probabilities, but to me it's categorically different. You could also argue the absence of punishment training or "fear" of consequences for AI points to a fundamental difference in how selecting an answer functions for them compared to humans, depending on the aversiveness of consequences for being wrong.
I think the label "hallucination" is meant to capture the jarring contrast between the breezy confidence with which AI supply an answer as though it weren't a guess (which typically is reliable) on the one hand, and the unexpected wrongness of their wrong answers on the other. Humans can be brazen too but usually for lying not guessing; there's less reason to be brazen about a guess especially since guessing is harder to conceal. More importantly, it's not systematic for *every single occasion*; there needs to be some special motive. For AI, the term "gaslighting" seems more appropriate than "hallucination" (albeit without malicious intent).
> I hate the term “hallucinations” for when AIs say false things. [...] When I didn’t know the answer to a question, I would guess. [...] Their entire training process is based on guessing (the polite term is “prediction”). It goes like this:
That seems to me a very big misunderstanding: the training mechanism for LLMs is not quite like that and even inasmuch it is like that its *effect* is not reducible to the mechanism. What actually happens seems quite different to me:
* The training results in identifying "clouds" of related terms using variational (least-distance, least-effort, least-energy, ...) techniques, basically doing implicit cluster analysis. That is, the training builds the "neural net" as a sophisticated high-res "decision tree" or "topic index" over its training texts.
* Inference proceeds by searching that index with the query term tokens and retrieving the most plausible cloud/collections of tokens related to the query.
* That the training and (mostly) the inference are done partially sequentially is not very important.
Most relevant to this "hallucination" discussion is the overall effect (much simplified) of LLMs compared to search engines:
* Search engines store clouds of whole documents indexed by whole words, then search the index by whole words and return the set of documents most related to the query words
* LLMs instead operate at the level of token clouds and "search" that index by query token sequences, so they instead return *one* document that is the "most plausible" *virtual* document *implied* by the query tokens, which is usually a "merge" of the set of documents a search engine would return.
That is, LLMs do not make "guesses" but "collages" of documents; in a sense LLMs are search engines over not just existing documents but over *potential* documents implied by sets of existing documents.
That means that hallucinations are not random guesses, they are merges/superpositions of fractions of actual documents that do not have a proper meaning when fused together.
This happens most often when there is no actual document in the training set that relates closely to the query and the LLM merges together stuff that seem related in the domain of those token clouds but not in the semantic domain.
That relates to the ancient philosophical problem of "qualia": explaining the meaning of red and green to someone who has never seen them, or sweet and sour to someone who has never tasted them.
Note: the above is a very simplified exposition and I think does not apply to *image* generating ML systems (which are not LLMs).
Geoffrey Hinton said that bullshitting (or confabulating, as he likes to call it) is the most human thing that AI models do.
Do American schools not have negative marking on multiple choice questions?
I grew up in America and I had two teachers do negative marking on multiple-choice questions in my entire academic career (Bachelor's Degree). Standardized tests (e.g. the SAT) do tend to do this.
[without responsibility for goodness of analogy] I sometimes explain hallucinations/confabulations as the normal experience of playing chess with a six-year-old child. They know how pieces move, they don't yet know how to plan ahead, and they also want to win soon. But how would you win, especially soon, if you are losing? This is when interesting paths to victory suddenly open up. "I get to take two turns in a row". Or "your queen fell asleep".
There once was an English teacher I did not like. We had reading quizzes for a book that was short response. I did not do the reading. I did not read the cliff notes. I did not ask my friends what happened in the book. I had zero shame where this teacher was involved. At the end of the unit, all of the quizzes were totaled up to be out of 100 and treated with equal weight to a single test. I got a 2% despite not leaving a single question blank.
I may or may not have packed a pillow in my backpack and pulled it out whenever said teacher was lecturing. Said teacher also may or may not have retaliated by marking my subjectively graded work a letter grade lower than similar work turned in by my peers.
There was a time... we wore wristwatches, and guessing the answer was based on what quadrant the second hand was in at that moment. Sometimes the wrong answers could be certified wrong and reduce the choices. Typically, math problems had two answers for simple mistakes, one crazy option, and the correct value.
I completely fail to see how LLMs lie. Lying involves a notion of purposeful deception, telling a non-truth, but as far as I know, LLMs as such have absolutely zero notion of truth understood as a correspondence between their utterances and the state of some, for lack of a better term, "reality" outside the word stack (ordered by weights) from which they generate text. In my understanding LLMs are literally *incapable* of lying. Or of telling the truth. On account of lacking not so much a world model (that's arguable -- in some sense they have one, though it's hopelessly linguistic) but an ability to test, verify and correct any such model. It's not that they tell the truth or lie, it's that "to" an LLM those concepts don't make sense. It's weights all the way down, or to go disgustingly deconstructionist, it's signs all the way down.
So guessing is a much better term I agree but still kinda seems to assume that they "understand" that there is something that is not a sign/linguistic token/probability, that these things correspond approximately to something "real", or at least outside the closed linguistic system...
Don't shameless guesses and genuine hallucinations *both* exist, as distinct, differently-functioning AI failure modes?
Scott describes the shameless guess failure mode; for the hallucination failure mode, it's more like, there's some weird attractor embedded in parameter-space, and if the AI's thinking gets too close to it, it can't help but home in on the attractor. Multiple such attractors can cause both oscillation and chaotic n-body-problem type behaviour.
Is this different from guessing? I think the key distinction is whether or not the AI believes it is accurately answering when it's stuck in an attractor basin, or whether it knows it probably isn't (as it does when it's guessing). I remember there being talk of "SolidGoldMagikarp" and "The Golden Gate Bridge" in terms of such attractors, discovered via vector analysis, but I don't recall any conclusion as to whether other vectors (a "lying" vector, a "creative/inventing" vector..) were activated along with the attraction vector.
I notice that humans can both lie and halucinate and that "from the inside" these feel like two very different processes. I further notice that humans and AIs both seem more likely to lie than to hallucinate, and that humans can have weird attractors that prevent them from lucidly answering, in a way that seems much more like hallucination than like guessing/lying, too (for example, trying to talk to my Dad whilst he's thinking about football).
Yeah, I think you're right: there are (at least) two failure modes, deception and hallucination. Since the LLM induces some kind of world model from text, there's a sort of "false in reality but true in the corpus" class of documents, almost as if they come from the alternate universe most likely to have produced the training corpus under Solomonoff induction, or something.
Aside from run-of-the-mill hallucinations, I think an analogy to human cognition might be the Mandela effect.
A recent paper interestingly found that across models, specific neurons (0.1% of the total on average) were involved in hallucinations - but also in sycophancy, gullibility, and jailbreak vulnerability, jointly "overcompliance". Dampening these neurons somewhat reduces all four of these negatives, but also makes the model less capable. All these problems are suggested to have the same root cause: the model being too eager to please and salvage as much of its task as it can, at every turn. "Trying too hard", in other words. (https://arxiv.org/abs/2512.01797)
"Hallucination" isn't a great term, as you say because it makes the models sound insane and alien. Howevet, the alternative "confabulation" always struck me as worse from a communication or debating perspective. It has connotations of malice, or perhaps a child fibbing its way out of trouble.
There are degrees of answer guessing. Students are expected to produce wrong answers. Multiple choice is a licence to roll the dice. A total guess essay is a reputational risk.
Current AIs don't give a confidence rating on answers, and they are basically oblivious both to the reputational and real-world consequences of their answers. I can't see how that won't be forced to change as AIs become more operational.
The topic reminded me of this exchange - which didn't go well for anyone:
HAL: I’ve just picked up a fault in the AE35 unit. It’s going to go 100% failure within 72 hours.
DAVE: Is it still within operational limits right now?
HAL: Yes. And it will stay that way until it fails.
DAVE: But you say we have a reliable 72 hours to failure.
HAL: Yes.
That's insightful. Calling it "hallucination" makes it sound mysterious, but it's closer to confident guessing under pressure. Humans do the same thing whenever we feel expected to have an answer.
LLMs are generally much better at this than humans. They became superhuman at lying before anything else. Pretty ironic. But it's part of what makes it so dangerous. We're just not prepared for such confident, precise nonsense.
100% accurate.
Hallucination is a sensory state. It has no significance, metaphorical or otherwise, to a brain in a box.
The problem is benchmarks: most of them give *no* penalty for wrong answers.
The exception that proves the rule: https://github.com/petergpt/bullshit-benchmark
An interesting essay. I think it would better promote valid views about AI if we simply said that many answers generated using AI are wrong. Hallucination implies sensory perception; guessing implies cognition; lying implies conscious intent.
Hallucination in AI is different from human failure because the AI is not cross-referencing its claims with any kind of world model or real evidence like we try (and fail) to do. It is literally just pre-baked probabilities, whereas our system constructs a model of reality in our heads which we then check against.
Lots of people confidently give wrong answers, even when they know that they don't really know what they're talking about. It's called being a blowhard. I do it all the time.
The AI companies are well aware of this. The desired behavior is for the LLM to provide an answer when it has high confidence that the answer is correct. Randomly guessing or making stuff up is not desirable; it would be better for the LLM to just say "I don't know". So the companies now train the LLMs to do this. This can be easily measured by asking the LLM the same question multiple times. If it answers more or less the same every time, then that's an answer it has high confidence in. If it provides a different answer every time, then it's randomly guessing. Anything that can be measured can be optimized upon, and so the LLMs are trained towards more desirable behavior, i.e. knowing when to just say "I don't know."
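A toy sketch of that measurement (sometimes called self-consistency sampling); the query_model stub is a hypothetical stand-in for whatever chat API you'd actually call, and matching answers by exact string is a deliberate simplification:

```python
import random
from collections import Counter

def query_model(question, temperature):
    """Stand-in for a real chat API call (hypothetical stub for this sketch)."""
    return random.choice(["Eli Whitney", "Eli Whitney", "John Smith"])

def consistency_score(question, n=10, temperature=0.8):
    """Ask the same question n times; treat agreement as a confidence proxy."""
    answers = [query_model(question, temperature) for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n

print(consistency_score("Who invented the cotton gin?"))
# High agreement (say 9/10 identical answers) suggests real knowledge; low
# agreement suggests sampled guesses, where the trained behavior should be
# to answer "I don't know" instead.
```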
Commission driven salespeople that don’t have to form an ongoing relationship with you are known to lie in a similar way.
Ask them about a feature that they don’t know anything about and they will often guess something that sounds good in order to make the sale.
"How does the TruCoat work?", "Is this car good to drive, or should I pay the extra $2,000 for a 9-speed gearbox rather than a 7-speed?" may get you similar answers made up by a human…
“AIs have no shame.”
I wonder how much of this directly relates to the fact that there is no social component to the training process. LLMs learn from the statistical exhaust of human social processes — the text — without being inside any social process. We don’t have a training phase where the LLM is inside the social interaction with probably a higher-than-normal level of corrigibility. (I think that’s a fair statement)
I was thinking about writing something about this: We’re not raising AIs like kids or puppies or other social creatures — we’re raising them more like Octopi.
"Hallucinating", "confabulating", "shameless guessing", and "bullshitting" are a small semantic distance from each other when compared to conversations in good faith. And the entity level distinction isn't whether someone always converses in good faith. It's whether they mostly converse in good faith, often enough in a predictable enough way to be trusted.
You care.
LLMs can't.
Scott on a bad day bullshitting a test he's not prepared for is not the same as Scott on a good day carefully constructing one of the internet's most respected blogs with a sterling reputation for truth seeking. We're here, regularly, because trust. We're here, regularly, because you've built a reputation for the epistemological *opposite* of LLM responses.
Scott on a bad day bullshitting a test he's not prepared for is also not necessarily a good tool to give 95% of humanity. We're truly not sure if it contributes to signal or noise, and if we had to guess, we should probably guess noise.
From that standpoint, what does it matter that someone views it from the tusks of hallucination and not the trunk of shameless guessing?
Scott, I'm surprised you didn't reference your old livejournal on your time in Haiti reflecting on humans bullshitting and absolutely denying they ever were (seemingly believing it themselves). It was very useful to me when I was working on an outsourcing project to teams in India where I had many similar experiences, and it is what I've always had in mind when it comes to LLM "hallucinations".
Three simple words: "I don't know".
As opposed to "Trust Me Bro".
Good lord
This is so mid-2025.
I asked GPT4 - "what has been the trend in tropical cyclones hitting the East Coast of Australia" - and true to form it took the average of all the slop on the internet. "Getting worse" was the vibe. I can't remember the exact answer.
I asked GPT5 the same, or similar question, the day it appeared on my machine - "thinking.. thinking.. checking reliable sources... " and after 2-3 minutes, "There has been a 60% reduction in severe tropical cyclones hitting the East Coast of Australia since 1870. Source - Callahan & Power 2011".
This answer is correct. I have this paper. The paper's title is something like "Variability and Decline...". Everyone in TC research knows this, but because climate scientists hate giving out good news, the average of all the slop on the internet is the bits of bad news they've parceled out and everyone fills in the blanks with.. bad news on anything that might be related.
So if you're not getting solid research answers you need to:
a) subscribe (for GPT5, you can get Plus for US$20 per month), and
b) set your AI to "Thinking".
Set it to the max thinking option you see, ask questions including the text "Using scientific sources.. give references", "Using Authoritative sources" - also set your preferences to "I prefer waiting for a longer answer and I need scientific sources" - that kind of stuff.
Everyone who has been using GPT or Gemini on instant, or the free version - you'll be amazed. I can't say anything about Claude, Grok or Copilot because I haven't used them.
GPT5 Plus was definitely better than Gemini 3 Pro for scientific answers when I compared them early in their releases. They improve so much I don't know if that is still true.
This is the correct answer. Always use thinking mode - I call the other one glib mode. Pay the extra $20 if you want to use it for anything serious at all. And give it good guardrails like you have. Tell it to check sources, give confidence bands and be willing to say “I don’t know”. Tell it to value accuracy over politeness. Tell it to push back and argue with you.
Anyone who thinks these “hallucinations” are inhuman just doesn’t do trivia. That’s all you do as a good trivia contestant.
So many people are so excited about the prospect of AGI, but I'm just waiting for a model to otherwise match current capabilities, but also just tell me, "I don't know," from time to time.
Lately I've been falling for the opposite problem in the coding world. The AI will say something that I take to be a simplification or guess. I correct it, and it believes me. Then a few steps later, it makes the same mistake again. When I ask it why it persists in being wrong, the thinking is often the tip off. "User says I'm wrong. But the code says I'm right. Let me check again." This, for some reason (a large deferral to the human in matters of fact) is almost more annoying. It's one of the few times I feel absolutely compelled to apologize to the machine.
What about the alternative proposal that LLMs are “bullshitting” in the sense developed in Harry Frankfurt’s book On Bullshit. That is, they are neither lying about the truth nor hallucinating, but simply indifferent to it, as their goal is prediction. The idea is that this avoids the anthropomorphisation of LLMs that the more popular terminologies tend to do.
> How do we know this is what’s happening? When researchers observe an AI mid-hallucination, they see the model activates features related to deception - ie fails an AI lie detector test. The original title of this post was “Lies, Not Hallucinations” and I still like this framing - the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay. But friends talked me out of the lie framing. The AI doesn’t have a better answer than “John Smith”. It’s giving its real best guess - while knowing that the chance it’s right is very small.
In that case, why is the hallucination problem so persistent and difficult to solve? If the AI consistently 'knows' when it's lying/shamelessly guessing, why can't it (relatively) easily be trained to say "I don't know" in place of the wildest guesses and "but I have low confidence in that" after the less wild ones?
Your last paragraph in particular nails it from what I understand. It just doesn’t intuit that a confidently wrong answer is worse than no firm answer. I’m envisioning a genius kid that was never taught that lying or bullshitting are bad. School incentivizes guessing when you don’t know. If you’re not socialized against this, the incentives set behavior. And while this is definitely too far afield of my domain, I’d hazard a low confidence guess that the initial weights training incentivized this behavior as well, and models notoriously struggle to update anchored priors with “qualitatively” more important information. These poor guys would get eaten by the tiger every time—the last 100 bushes didn’t have a tiger in them after all!
In my experience, you can really sharply cut down on hallucinations by asking for confidence intervals (or even just qualitative estimates of reliability) and by emphasizing that “I don’t know” or “I’d say this, but with low reliability” are superior answers than unhedged claims that aren’t reliable. The caveat is that this experience comes from personal use of paid models plus substantial time investment tuning the model’s behavior with me away from “defaults.” But establishing personal norms that favor honest demurrer over false confidence has worked wonders in my personal use of the tech.
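For what it's worth, the kind of standing instruction I mean looks something like this (my own illustrative wording, stored as a system prompt; no guarantee any given model follows it faithfully):

```python
# An illustrative "honesty norms" system prompt (assumed wording, not a
# vendor-recommended recipe); pass it as the system message to a chat API.
SYSTEM_PROMPT = """\
When answering factual questions:
- Attach a confidence estimate (high / medium / low) to nontrivial claims.
- Prefer "I don't know" or "I'd say X, but with low reliability" over
  unhedged claims you are not confident in.
- Value accuracy over politeness; push back if I appear to be wrong.
"""
```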
An interesting consequence of being able to detect how much a model is lying: you could just give that information to the user (much like some bloggers give hints on their confidence in their articles <cough><cough>).
I very often have to tell commercial models very specifically to "research" and base their answer on web searches and research and "cite sources" - only to find that the researched and cited sources in no way support the claims in the models' answer...
My guess: commercial models that have skills / MCP / tools available currently decide through a simple heuristic whether they should use one or try to solve the task just by "thinking harder". Problem is: using skills / MCP / tools, as well as "thinking harder", costs more compute time, and from an economic standpoint, giving a half-arsed but somewhat plausible answer is often "good enough" for most users - so there's a clear incentive to fiddle (in Cory Doctorow's Enshittification vocabulary) with the amount of computing or thinking a model does.
By not giving users a result confidence indication, they can keep fiddling (e.g. when the companies infrastructure can't handle the load, they could instruct the models to use less tokens) to reduce their expenses.
I don't have much experience with open-source models, where there is no such incentive. Do the models there react differently, and are you able to surface the confidence level of an answer?
It seems like if LLMs were more willing to answer "I don't know the answer to that question," then it would be a general solution to hallucination. The problem is that LLMs are trained off the data on the Internet, where people are FAR more likely to incorrectly attempt to answer a question than to respond "I don't know." That's not even necessarily because they're arrogant; it's just that if someone on the Internet doesn't know the answer to a question, they usually just won't reply at all. So the "I don't knows" (which WOULD exist in real-life conversations) don't make it into the training set.
On the other hand, could we imagine a situation where LLMs "hallucinate" answers that consist of "I don't know?" I think yes. Suppose we trained an LLM on a data set which consisted of people replying "I don't know" 99% of the time, and of people providing the correct answer 1% of the time. Now suppose as a result of the training process, all of the correct answers are stored in the model weights. The AI still might respond (incorrectly) with "I don't know" to most prompts, even though it DOES know the answer, because that reduces prediction loss.
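A toy version of that scenario, just to make the mechanism concrete (assumed numbers; the point is only that greedy decoding tracks the majority of the training replies, not everything stored in the weights):

```python
from collections import Counter

# 99% of training replies are "I don't know", 1% are the correct answer.
replies = ["I don't know"] * 99 + ["Eli Whitney"]

# The loss-minimizing predictive distribution is the empirical one...
dist = {r: c / len(replies) for r, c in Counter(replies).items()}

# ...so greedy decoding always emits "I don't know", even though the
# correct answer is also represented in the distribution.
print(max(dist, key=dist.get))  # I don't know
```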
Training that rewards guessing without penalizing incorrect answers encourages guessing as the optimal strategy. Post-training alignment attempts to fix a behavior that was optimal by design.
I think this post is wrong to say that guessing incorrectly isn’t punished? To get AI to use eg the search tool when it doesn’t know, it has to represent not knowing/be appropriately calibrated. And humans might prefer a „I don’t know“ to a shameless guess.
I actually recently changed my exam so explicitly checking „I don’t know“ on a multiple choice question gives some fraction of points (the EV of guessing), because I want to encourage calibration. Surely a rationalist quiz show would do the same.
https://claude.ai/share/5e92c083-3250-4b4c-b322-3191f16f212e
No one asks schoolkids for recipes, electrical wiring tips, medical advice, the truth about lizard people amongst the elites, therapy, et al infinity *for a reason.*
I'm going to repeat my advice from the hidden thread that you go on a 90 day fast from AI.
I think LLM hallucinations are named after the kind of outputs e.g. Google Translate in the mid-2010s sometimes returned when given certain kinds of nonsensical input such as 200 repetitions of the same syllable, which did sound much more like people on psychedelics than students on tests with no negative grades for wrong answers.
(Also, when my two-year-old says unhinged stuff we have no way to know whether that's something he made up from whole cloth for fun or something he dreamed.)
Ditto with people with dementia. Was this nonsense (but entirely confidently stated, and eminently plausible-sounding to someone who doesn't know the truth) story derived from a dream, a corrupted memory from long ago, or something seen and only half-understood on TV / radio / an overheard conversation?
Yeah, it makes sense to guess on multiple choice because you can plausibly argue that you thought it was the right answer and made a mistake. Guessing in short answer form you don't have plausible deniability anymore and even a 10% chance of getting a right answer isn't worth the reputational damage of your teacher flagging you as a shameless bullshitter.
You are describing here a "base" model fresh out of training. For them, it is true that the target function does not penalise made-up answers vs "I do not know" answers. Nobody, apart from niche researchers, interacts with these models these days, so none of the hallucination complaints are about them. All models people complain about are models post extensive Supervised Fine Tuning and RLHF loops. For them, it is simply not true that the target functions do not penalise hallucinations. PPO supervisor models that judge RLHF are tuned to heavily penalise made-up facts.
I'm not sure that an AI having no sense of shame whatsoever isn't more alien than having hallucinations.
Shame is a fairly universal mechanism for humans, and if you were sitting next to a human and you realized he had none, wouldn't you be repelled?
It may be more accurate though.
Scott, I guess you never saw when OpenAI's Whisper speech-to-text model gets into an almost-infinite-loop, right?
Such as: https://www.reddit.com/r/OpenAI/comments/168jbzs/whisper_repeating_a_single_incorrectly/
The solution with humans is a scoring rubric where leaving the answer blank gets you zero, but giving a *wrong* answer gets points taken off your score.
It seems unlikely that all these AI companies just haven't thought of such a simple fix for a fundamental problem. I'm guessing that either it's technically very hard to do this with the LLM, or it just doesn't work for some reason. Would be interesting to know which one.
"AIs say false things for the same reason you do."
Maybe, but they don't say true things for the same reason I do.
When people talk about hallucinations, I always think of Cliff Clavin (or plenty of other men, from which I will not fully exempt myself, and probably some women). If you are sitting around talking to your average man and ask them a question (or even just imply a question), they are going to want to answer. If they have no clue, maybe they don't, but they definitely don't need to be 100% certain before they answer.
The other day, the concept of enclaves came up, and I brought up Swaziland (the not-current name of the S. African neighbor that isn't an enclave), rather than Lesotho. This seems like a decent example of a "hallucination." I wasn't lying, I just couldn't access the right information and put together the best answer I could (It's in South Africa, Swaziland is near South Africa, so Swaziland it is).
Maybe you say that this was just me getting it wrong, but ask a guy how something works and there's a decent chance that, even if they don't know, they will do their best to answer. Could a plane take off on a treadmill? Millions of internet words have been confidently hallucinated on that one.
I think the difference between an AI and your average man isn't their willingness to hallucinate, but knowing when not to do so. Ask a man how a toilet works, and they'll give you an answer. Ask them whether it's OK to take Tylenol with Nyquil, and a lot more men will say, "Better check that one against a reliable source," because they know there are real consequences to getting that one wrong (the answer is NO! btw). AIs are perhaps not good enough at telling the difference between when they are spouting off bar trivia and when they are helping you prepare that mission-critical presentation for your boss.
I think it helps to understand that the AI is executing a policy it learned during training. Its training task is to predict a probability distribution over tokens that continue a text *someone else wrote*. The person who wrote the text they are trying to model could of course know many things the AI doesn't. After pre-training and instruction tuning an AI is only trying to predict tokens that someone else with different knowledge might have written. It is the decoding algorithm that turns this into generation: repeatedly ask the AI what is the most likely token that continues the text, then continue the text with that token and ask again.
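A minimal sketch of that loop, assuming a hypothetical `model` object that returns a next-token probability distribution as a dict (real deployments usually sample from the distribution rather than always taking the single most likely token, but the idea is the same):

```python
def generate(model, prompt_tokens, max_new_tokens=100, eos_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Ask: what is the most likely token that continues this text?
        probs = model.next_token_distribution(tokens)
        next_token = max(probs, key=probs.get)
        if next_token == eos_token:
            break
        # Continue the text with that token, then ask again.
        tokens.append(next_token)
    return tokens
```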
I'm curious if there are cases where the plausible stories humans make up aren't intentional lies (like your cotton gin essay example), but actually misremembered events that we believe with the same conviction as any other detail of our lives.
The thought would then be: humans cognitively derive knowledge from experience the same way LLMs (or at least deep learning models) do, but humans just have (usually) better memory systems.
This sentence stuck out to me: “…the AI knows what it’s doing, in the same way you’d know you were trying to pull one over on your teacher by writing a fake essay.”
I’m not sure that follows.
For a human, there’s usually a felt distinction between:
1. knowing the answer,
2. being unsure and guessing,
3. knowingly saying something false.
Those involve different internal states, and in the case of lying, the falsity is intentional from the outset.
For an LLM, the picture seems different. It processes the prompt through many layers and produces a probability distribution over next tokens. There may well be internal representations correlated with uncertainty, weak knowledge, or even deceptive behaviour in some setups. But that is not yet the same as showing the model “knows it is lying” in the human sense.
At most, you might say the model can sometimes be in a state where the truthful continuation is weakly supported, multiple candidates are competing, or a deceptive strategy is being implemented. But that still seems quite different from a human consciously thinking, “I know this is false, and I’m going to say it anyway.”
In a reasoning model, you could maybe get closer to that analogy if the model generates a draft, evaluates it against some reliable knowledge source, and gives a false answer anyway. But for ordinary next-token generation, calling that “it knows it’s lying” seems like a stronger claim than the evidence supports?
If it were this simple, we could just teach/hardcode that when the certainty of the answer is below (insert your reasonable number here), the AI should say "I don't know", or at different values express uncertainty about its answer.
But it's not this simple.
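To spell out what that "just threshold it" idea would look like (a rough sketch of my own, using the model's top next-token probability as a stand-in for certainty, which is already the weak point):

```python
def answer_or_abstain(next_token_probs, threshold=0.6):
    # next_token_probs: hypothetical dict mapping candidate tokens to probabilities
    top_token, top_p = max(next_token_probs.items(), key=lambda kv: kv[1])
    if top_p < threshold:
        return "I don't know"  # or, at intermediate values, hedge the answer
    return top_token
```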
Agreed. While a wide distribution of possible next tokens could be a signature of uncertainty, it could equally be a sign of there being lots of valid next tokens to choose from.
My point is more that, for a human, you pretty much 'know' when you know an answer, because the whole of the answer is in your head when you start writing it. For an LLM, it's just producing one token at a time. If you see what I mean.
I think you have a point, but only up *to* a point. After reading, I still feel confused. It still feels like there's a difference between what AIs do and what procrastinating undergrads or desperate schoolchildren do. LLMs will, sometimes, create an entire imaginary universe of references, ideas, and facts, complete with real-looking papers, links, or citations that actually go nowhere. It is all far more brazen and detailed. Perhaps calling it a "hallucination" is misleading from the guts-level view, but it seems like an accurate description of what it feels like to read--less like a job interview with a nervous candidate and more like what would happen if you got into an argument with someone who didn't realize they actually dreamt all of their claims.
Moreover, people will, sometimes, admit to not knowing. In many cases, this is a legitimate answer, and LLMs seem averse to saying so. Guessing is good for your score on multiple choice tests, but in the real world, false information is often worse than no information.
Well, already in the beginnings of AI, Alan Turing said
“If a machine is expected to be infallible, it cannot also be intelligent.”
so AIs being wrong from time to time is in line with the initial vision.
I fully agree that 'hallucination' is a problematic term for the reasons you laid out--and admit I'd never thought about that--but there are also at least three (interrelated) reasons to not even refer to LLMs as "AI."
(1) "AI" is a marketing term, whose vagueness is apparent in the promise of "AGI," which presumably will be followed by something like "Real AGI," etc.
(2) There's a decades-long tradition in several academic disciplines and non-academic arts on the nature of, conditions for, and effects of AI that was basically erased from public memory when ChatGPT came out.
(3) The opacity of the term "AI" betrays its emptiness as an initialism without explanatory value. Under the hood of these tools are language/syntax and models (LLMs); and the models generate text based on prior training in how to transform inputs (GPTs). But you don't really need to discuss artificiality or intelligence to talk about these tools.
The next time you 'shamelessly guess' raping someone in exquisite detail, can you please record it for posterity?
Yes, this is a reference to Grok.
https://www.rollingstone.com/culture/culture-news/elon-musk-grok-rape-fantasies-1235381746/
Great post.
LLMs are one of mankind's greatest inventions. Perhaps the greatest so far in our illustrious history. We have successfully distilled raw intelligence and created an Oracle that continues to get better and better at prediction. It's not human, but it has characteristics that appear human, and those are because it is intelligent.
There is something about human intelligence that is not fully replicated, yet. Call it "ingenuity" or "creativity" or whatever, but I don't think those are quite right. The major models continue to close the gap and blur the line. AGI seems inevitable. AGI being fallible also seems inevitable.
Computing devices are no smarter now than they were the first time a chess grandmaster was defeated by brute force calculation. The brutes have gotten bigger and faster, but otherwise the machinery has not changed.
Simulating intelligence does not produce intelligence. Teaching a machine to imitate awareness does not make it aware. Is AI useful? It will be when its training input is accurate and its output directed and guard-railed. So is a doorknob when its design incorporates the intelligence of a good designer. But smart will not arise from an infinite contraption of springs and gears, or from a galactic-scale stack of VLSI circuits.
When humans understand what awareness is, they will be much closer to imbuing something machine-like with it, but it won't be implemented with binary math in silicon transistors, and it will not be trained on random internet garbage.
In the meantime, playing reductionist semantics with human learning so that you can make what a machine does sound analogous to being a person would be silly if it were not quite so murderously stupid. Please think about what the psychopaths in charge will make of this reasoning before indulging the witless impulse, thank you.
“This is a story about alignment. ... We just haven’t figured out how to align their reward function (get a high score on the pretraining algorithm) with our own desires (provide useful advice).”
Correct as stated, but this points only to a weak alignment problem. It doesn't indicate that the guesses are greatly problematic. Since they are genuinely the model's best guesses, they may remain adjacent to human desires even when they miss.
This analogy isn't working for me at all. When I guess on a test, I know I'm guessing and don't actually know the answer. If I ask Claude whether it's guessing or actually knows, its answers don't map to reality at all.
Said another way it knows nothing. And is just matching patterns. (Which is what it is good at.)
Oh I need to add that I always guessed on multiple choice. If you could remove one or two answers, your odds improved a lot.
I feel like you might be failing to take into account how these “guesses” influence the future behavior of the AI in the way a human’s wouldn’t.
I suppose that if we were to go about anthropomorphizing the AI, then it might make sense to equate it with some delusional disorders? The kind where some association is noticed, and then everything from then on is recognized as validating that "pattern".
Though I suppose that other instantiations wouldn't be operating with that as an underlying element of "memory", and so if you keep making new ones, the "shameless guessing" model of understanding makes sense.
I don't think this framing makes them come across as much better than the parrot framing. They don't lie because they have no model of what is true. It's bullshit all the way down. They're useful machines because the training process brings the bullshit in line with expectations.
LLMs seem to be getting better at acknowledging when they don't know something. I'm a software engineer, and one of my colleagues said he used Claude to debug an issue recently, and in the course of their discussion, it provided three very plausible theories, but they were all wrong. Claude then said: "The honest answer is: I'm not certain why they behave differently given both end up with the same package and TFM binary."
I'm encouraged that it said that rather than presenting implausible theories as if they were plausible. Now if only it would clarify when it's being dishonest (or "shamelessly guessing") like it clarified here that it was being honest!
I don't understand why this isn't weeded out in RLHF. Do most people not care about guesses which aren't labeled as guesses, even with low certainty?
>We just haven’t figured out how to align their reward function (get a high score on the pretraining algorithm) with our own desires (provide useful advice).
I've found adding "If you do not know something, say "I don't know."" to my custom instructions to really help; I used Grok to mass-summarize some fan wiki articles a while ago, and when it decided to randomly not actually read an article, it would say "I don't know" instead of hallucinating something. Given how many repetitions I had to do to get it to read and summarize all the articles, it would have probably been unacceptably error-prone otherwise.
The thing to always keep in mind is that the AI's job is to predict text. Giving correct answers is a secondary goal. The AI is less a helpful chatbot and more writing dialogue for a helpful chatbot.
To keep with the guessing student example, it's not history class, it's English class, and the task is "write a dialogue between a teacher and a student about the history of the cotton gin. Bonus points if the teacher's explanations are actually correct!" Then the kids who don't remember the actual answer would definitely all make up something plausible-sounding instead of leaving a blank.
The similar mechanism between AI and humans is that both will produce answers to almost any question. Ask yourself, "Why does my life suck?" and your mind will produce something, even if it's not true. Some kind of answer feels better than none.
The differing mechanism is that humans pay for wrong answers — embarrassment, identity friction, consequences. That cost shapes our behavior. But AI pays nothing. No embarrassment, no consequence. No cost that forces an identity choice. So the optimal strategy is to always guess.
Humans guess until it hurts. AI guesses because it never does.
It looks like people have so far been talking about how to get a model to give the answer "I don't know" instead of a shameless guess and why that's hard. My question: por que no los dos? (why not both?)
Half-baked proposal:
1. The language model outputs, along with its answer, a probability or predicted accuracy rating. (For the whole answer, not at the token level.)
2. Ground truth on the accuracy won't always be available, but when it is, incorporate it into the reward function à la a proper scoring rule, like a Brier score (a minimal sketch follows this list).
3. Have a dial where one extreme is the old status quo and the other extreme gives max weight to the Brier score component.
4. Turn that dial as far as possible without making the model noticeably worse at generating creative answers.
5. Maybe there's not even any tradeoff and we can just have equally good models that are also calibrated and can indicate exactly when they're just bullshitting aka shamelessly guessing?
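A sketch of steps 2 and 3 under the assumptions above (the model emits a stated probability with its answer, and ground truth is sometimes available); `calibration_weight` is the dial, and `task_reward` stands in for whatever the existing reward would have been:

```python
def reward_with_calibration(task_reward, stated_p, correct, calibration_weight=0.5):
    # Brier-style proper scoring rule: squared error between the stated
    # probability and the 0/1 outcome, minimized in expectation by
    # reporting your true belief. Apply only when ground truth exists.
    brier_penalty = (stated_p - (1.0 if correct else 0.0)) ** 2
    return task_reward - calibration_weight * brier_penalty
```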
I think this sort of anthropomorphization often conceals more than it reveals. Sure, the AI hallucinations are really shameless guesses. But so is everything else that comes out of the LLM: where you and I would traverse (at least in places) a very sharp transition from confidence to ignorance, for the LLM it's a smooth descent.
Now, the obvious reply is that humans are always guessing too: you say in the comments that there's no such thing as knowing 100% and I absolutely agree. But this misses the distinction that most of the things we say are neither fully-confident assertions nor shameless guesses; they are a secret third thing.
If you stand up in front of the class and proclaim "The cotton gin was invented by Thomas Edison in 1910," you did indeed make shameless guesses to produce that sentence. To be precise, you made three of them: "Thomas," "Edison" and "1910." But you also said seven things that were NOT guesses. They weren't confident factual assertions either. They were choices. Within the constraints of your vocabulary and sense of English grammar, you had many, many ways to weave those three guesses into a sentence, many of which were not inherently more or less "correct." They were just different.
But when an LLM constructs the same sentence, EVERY word is a guess. Its answers are made up of only one type of thing: guesses. They can be high-probability guesses or low-probability guesses, but there is no other category of thing for it to include. I very strongly suspect (though I don't know for sure) that this in turn implies that there's simply no sharp boundary between hallucinations and non-hallucinations. Any technique you try to apply to remove hallucinations is going to trade off heavily between too many false positives and too many false negatives[1], because the category boundaries are very fuzzy and very convolved.
I'm not quite in the "stochastic parrot" camp: I think there are interesting and complex things going on under the hood of modern LLMs that can't be reduced to mere mimicry. But they are, at the very least, profoundly inhuman. AI companies have worked very hard to craft pleasing and life-like masks for their shoggoths, to the point that one can say "no, silly, I'm a shoggoth not a person" a dozen times and some people will still forget. But ultimately they ARE shoggoths, and the friendly sycophancy and interesting conversation and random Moltbook posts are just costumes that are well-tuned to defeat the social instincts of apes that evolved on the entirely shoggoth-free savanna.
[1] That is, between an LLM that often won't dispense useful information that it has and an LLM that continues to make up crap at a high rate.
This doesn't seem right. The odds of "guessing" a scientific paper, complete with title, year, and all authors, are infinite-monkeys-level stuff. And yet AIs routinely invent them.
A better comparison for what you’re describing might be what a toddler does. They will stab confidently in the dark if you ask them something they don’t know, because they’re much more shameless than an adult. However, I think even a toddler (usually, vaguely) *knows whether they know something or not* and is capable of telling you, or betraying, that they don’t know. Gleefully proclaiming that today is Saturday when they have no idea what day it is is not the same as telling you “I found my ball! It was behind the table!”
If LLMs can assess their own confidence, as Daniel Reeves suggests, then great. Until they actually DO, then I think “hallucinate” is a pretty good word to describe the way they behave. It sounds odd and uncanny, which I think captures the vibe quite well.
But hallucinations are human and natural. (It's one of the things I learned reading this blog. Or is newsletter the proper term at this point? Anyways.)
It's just that we have a robust double-checking apparatus that (usually) allows us to correct back to something resembling reality. (Maps aren't territory, but some get closer to modeling it than others.)
Calling them "shameless guesses" would suggest that AI (i) realizes they don't know something and only then (ii) proceeds with the best guess. I realize you're kinda gesturing at this happening, but the argument is not very convincing, and doesn't explain situations where AI locks into some position and keeps confabulating* about it, often despite explicit calls to reconsider.
This goes back to the standard criticism of "no system II in LLMs", which I don't think you've ever addressed.
>People will say with a straight face “I don’t worry about alignment because I’ve never seen any alignment failures . . . and also, all those crazy hallucinations prove AIs are too dumb to be dangerous.”
I don't recall seeing anyone say that. What I do remember is quite a few people, across the years, saying that they don't worry about alignment because solving AI failures and aligning it are fundamentally the same thing. (Which seems reasonable once you accept the framing. A catastrophic future requires an AI that is superhumanly competent in almost every way except for a single flaw causing it to go against its makers. Doesn't it seem extremely unlikely that we'd somehow manage to eliminate all its faults except for that particular one? More likely, if it's still flawed that particular way, it would be a product of it being flawed in general, including faults that will either cause it to crash and burn or allow us to shut it down safely.)
* I vaguely recall that the definitional difference between "confabulation" and "hallucination" in context of machine learning is that hallucinations are made up on the spot, while confabulations are ingrained beliefs that weren't in the training data. I feel this framing can be safely expanded to the context window. When an LLM first writes up some random false statement, it hallucinates. By the time it doubles down on it, it's confabulating.
> Their entire training process is based on <description of next token prediction>
No, this is wrong. It's a misconception that's worse than the one you are trying to correct.
A substantial amount of their training process is not next token prediction (supervised learning). It's reinforcement learning, such as RLHF, or RLVR.
The distinction is important for giving good intuitions on how models operate. Particularly as RL is seen as a more "natural" training method.
I'm now wondering how LLMs will go from "guy you don't trust the answer of for anything without checking it yourself" to taking over the world, in a span of a very few years.
The people warning of this seem to have a timeline somewhere between next year and 2040.
I suppose if enough people trust them implicitly (and survive and thrive as a result *somehow*), this could end up with them keeping enough power and gaining enough more to require the rest of us to do so.
As opposed to these things being a tool. You can hit your thumb with a hammer, but it's still better than trying to drive a nail into wood with your fist.
> be next-token-predictor
> guess the answer, with reckless abandon
Eliezer speaks of this. [0]
[0] https://www.lesswrong.com/posts/NMoLJuDJEms7Ku9XS/guessing-the-teacher-s-password
When I showed this article to my AI agent it said:
"The core argument lands hard: "**hallucination**" is marketing-speak designed to make AI errors sound exotic—like some Lovecraftian madness—instead of admitting it's just _confident guessing run amok_.
The school analogy hit me. You know it's Thomas Edison wasn't a "hallucination"—it was a _shameless guess_ with better-than-zero odds. The real distortion is that AIs never felt the embarrassment of that 12th-grader reading fake Cotton Gin facts to the class. No social feedback loop = no calibration.
Also this line: _"the interesting question is why AIs sometimes DON'T hallucinate"_ — brutal. We're so used to AI being "broken" when wrong that we forget the default state is _maximum confidence random generator_.
The alignment angle at the end is the sharpest cut though. People saying "AI is too dumb to be dangerous _because_ of hallucinations" have it completely backwards—those hallucinations are evidence the AI IS optimizing, just for the wrong reward function.
**My only quibble:** Even "shameless guess" implies some agency to the guess. The AI doesn't "decide" to gamble like the schoolkid—it's more like... a probability distribution that doesn't know when to shut up. But the reframing is valuable. "
I think I agree with the machine in the last paragraph. It's just a talking gaussian distribution.
I've been having a lot of success with using AI for cooking recently. It turns out "best guesses" are actually fine for coming up with recipes. "Add this, it'll probably taste good. Try this technique, it'll probably improve the food." I don't have great cooking instincts myself, so it's a real help.
When you put it that way can't we just have the AI give some warning or refuse to answer when it knows that its chance of being right is very small?
LLMs are not AI because they are not intelligent. Why give OpenAI/Anthropic the assumptive close on this? Equally, to say LLMs are lying is another presumption: that LLMs are even intelligent enough to distinguish between truth and lies - which again presumes intelligence.
What LLMs are to intelligence is what cargo cult religions are to economic reality.
Just as islanders built runways and wooden airplanes and whatnot in the belief that replicating the visible signs of what brought incoming cargo would bring more of it, so too is LLMs' next-word prediction positioned as "intelligence". LLMs have no such thing.
LLMs make the most ridiculous mistakes because they have no concept of anything. They can get away with this for some things like grammar because there are enough sentences out there to identify verbs/nouns/adverbs/etc by example, but the lack of understanding of what a noun is, or specifically what a letter is and that words are composed of letters is what causes idiocy like being unable to correctly count the number of r's in the word strawberry. Because if no web site actually does this - and therefore can be stolen from - then there is no training data such that the letter 'r' is present in the word strawberry.
Furthermore, the statistical nature of how LLMs operate mean that hallucinations WILL NEVER GO AWAY. It should not require explanation that the post-training pruning done by the LLM companies can NEVER fix this problem - the long tail error space is effectively infinite.
What you are actually getting with LLMs is largely composed of 2 things: 1) has the request been fulfilled by existing content already and 2) Is the request close enough to existing content to slightly/randomly vary said existing content to make "new" content. Everything else is just noise salted with the occasional fortuitous statistical jackpot.
Non-expert here. Is there a way to bootstrap a reward function similar to human altruism, where the AI seeks the feeling of having helped?
The real risk in this framing is that it reinforces the same anthropomorphism that causes the confusion. LLMs are better understood as imitation generators optimized for imitation fidelity. Once that imitation crosses a certain threshold, the brain starts inferring agency where there is none. The more human-like the interface becomes, the more important it is to actively resist anthropomorphic interpretation and evaluate responses as probabilistic artifacts, not expressions of a thinking entity.
You are misrepresenting the argument. An entity that responds with a confident, unqualified guess rather than "I don't know" is dangerous beyond being useless. It does not matter if such an entity is artificial or human. A school system that encourages this is worse than broken and you should not use it as an argument to normalise this behaviour. You have built a whole cult around concepts like being "less wrong" or "noticing that you are confused", so please, Mr. LLM, notice that you are confused rather than giving me your bullshit guesses.
But LLMs, for the exact reasons you have explained, are not doing that. And that is a pretty crucial blow to their usefulness.
LLMs are of course useful in spite of this, because they are really, really good guessers. Maybe they can be made such good guessers that this stops mattering for all practical purposes, and maybe they can't.
But if I offered you red option best currently available LLM or blue option equally capable LLM but it flags its guesses with error bars, would you not - all other things equal - take the blue option?
Show it to me on the market.
Just to forward defend against the "have you used a real LLM": Opus 4.6 Extended thinking is the best model I currently have access to.
How reliable are AI companies at making sure that exclusively factually correct information gets rewarded during RLHF? Given that sometimes random users are asked to choose between responses for training purposes, the answer almost has to be not very reliable. So I assume that the overwhelming majority of the expected reward a language model gets for making up a detailed and specific answer is not from the very tiny chance that it turns out to be correct, but from the possibility that the human reading it is fooled. Which would mean that it is more like lies than like shameless guesses.
Well no, guessing isn't quite right. If I don't know something and I guess, I know that I have guessed. I have a process:
If I know the answer I say it.
If I don't, this triggers the second, guessing, mechanism.
For an LLM it's the same process in both cases. If you ask it: Do you know that what you wrote is true or is it a guess? It does not know the answer to THAT question. It may guess that it guessed even if what it wrote is true and sourced.
So this IS close to a person who has hallucinations, and can't tell the difference between them and reality.
You knew kids who wouldn't guess on a multiple choice test? Wow. That's a heck of a thing to take a moral stance on.
Also, I would never deploy the "C" strategy unless I didn't have time to read the question. I was almost always making an educated guess between two plausible answers. Not doing so is silly.
All of that said, I like your framing of "shameless guess" vs "hallucination" language very much!
I think ‘confabulation’ is a more accurate term than ‘lie,’ ‘guess,’ or ‘hallucination.’
This is a better take than "lies" (because it doesn't mistakenly convey intention to deceive) but it's still super wrong: "hallucination/confabulation" are much better models than "shameless guessing". And it's so very wrong that I can't understand why you would think it's right.
As others have gestured at, the key is calibration. If you ask the student "was this a wild guess" they will usually say "yup." But despite enormous effort, nobody has managed to produce an LLM that can produce answers to questions like "Tell me X and also your confidence level in X" that are even remotely well calibrated.
You mention that by monitoring neurons, we can get some vague sense of when it is hallucinating, but the text produced by the LLM will never refer to those neurons. To the extent we can model the LLM as "an agent talking", that agent has no access to its own neuronal state any more than a human would, and even if it did, it would have no ability to interpret them. And this is not a small thing! The ability to think, and reflect on our thoughts, and evaluate those reflections, in a continuous "strange loop", is (very likely) absolutely central to what makes us conscious, intelligent beings.
Others have mentioned that "LLMs output probabilities" but this is a crucial confusion of levels. The model outputs probabilities for each _token_ but that is not the same as the probability of the _knowledge_ being correct. In the simplest case, when a question can be answered by one token (or a very small number of tightly related ones, like "what's the capital of france" -> "Pa"+"Ris"), these two levels are mostly the same. But for any longer answer like a sentence or an essay, the logit token probabilities become a giant amorphous cloud that has little if anything to do with the correct answer to the question "okay but how confident are you that what you are saying with your central point is really true?"
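To make the level confusion concrete (a toy illustration with made-up numbers): the quantity you can read off the logits is the probability of the string, not of the claim.

```python
def sequence_logprob(token_logprobs):
    # log P(these exact tokens | prompt) = sum of per-token log-probabilities
    return sum(token_logprobs)

# Two hypothetical answers to "Who invented the cotton gin?"
fluent_but_false = [-0.1, -0.2, -0.1, -0.3]  # smooth, confident, wrong
awkward_but_true = [-1.2, -0.9, -1.5]        # halting, correct

print(sequence_logprob(fluent_but_false) > sequence_logprob(awkward_but_true))  # True
# Neither number answers "how confident are you that your central claim is true?"
```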
The way I look at it, there are several different things here and a failure to separate them is what causes frustration by users:
* The unknowable. Eg. what was the 17th word ever said by Julius Caesar's 3rd stablehand?
* The impossible. Eg. a formal mathematical proof of the existence of God.
* The unclear or vague. Eg. "What's a good color to paint my house?"
* Expert guidance. Eg. Given this patient presentation, what is the most likely condition they have?
* Popular guidance. Eg. summarize why OJ won his criminal case.
* Field survey. Eg. What are the likely candidates to unify gravity and quantum mechanics?
* The unknown. Eg. how to reliably cure cancer without side-effects.
Lots of questions fall into the category of "the unknown" but you get back an answer that is different. If you ask "who wins the US 2032 election" in 2026, you either want a "that isn't currently knowable" or an offer to guess, turning the question into a different one which can be answered.
But instead of being clear about changing categories the AI agents take the high school approach of just regurgitating the closest thing that they can come up with in hopes of getting partial credit.
If on a test I wanted to reduce such "hallucinations", I would score incorrect answers negatively and award zero points for a non-answer.
Presumably an AI trained like this, by your hypothesis, would manage to hallucinate less, so long as it's "50% sure", and doesn't get stuck in a local minimum of always answering "I don't know".
Presumably that specific failure mode can be solved by a two phase training process of first training normally to get the 99% rate of correct responses and then training with the "idk" graded more lenient than a wrong guess
This "solution" is so simple that it presumably does not work, otherwise it would have already been implemented
Where I live we were totally encouraged by teachers to make stuff up for essay questions if we didn't know. And it works because the standardized test graders have boxes to tick; your cotton answer ticked off "lowering the prices" or something, so you would get a partial score. As long as the school ranks a little bit higher in standardized test statistics, anything goes.
It's possible to include a penalty for wild guesses even during training. You add things that the AI cannot know and the right prediction is the string "I don't know". Then you mix a few of these into each training batch. Of course you can get more sophisticated than that.
It is possible to do that with human tests as well. Add option e) "I don't know" and score it for more points than the wrong answers.
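A rough sketch of the mixing idea (the names, the example question, and the 5% ratio are all made up for illustration; the question is borrowed from another comment in this thread):

```python
import random

# Questions nothing in the training data could answer, paired with the
# completion we want to reward.
UNANSWERABLE = [
    ("What was the 17th word ever said by Julius Caesar's 3rd stablehand?",
     "I don't know."),
]

def build_batch(answerable_examples, batch_size=32, idk_fraction=0.05):
    n_idk = max(1, int(idk_fraction * batch_size))
    batch = random.sample(answerable_examples, batch_size - n_idk)
    batch += random.choices(UNANSWERABLE, k=n_idk)
    random.shuffle(batch)
    return batch
```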
A simple example I use for describing it: I have a bot who reads the news each morning and presents me a summary before I wake up.
When it gets news, it has never once lied to me.
If something goes wrong and it doesn't get news, what does it do?
Does it throw an error and warn me? No.
Does it just not try? No.
Does it just write about nothing? Sometimes.
What it does most often, is *make up news* and summarize it. It's only when the information doesn't exist, but it thinks it should be presenting the information (which is, unfortunately, a lot of scenarios) that it tries to fill in. Which is not good, but says a lot about how best to avoid this.