> there’s a liberal bias in most AIs - including AIs like Grok trained by conservative companies - because most of the highest-quality text online (eg mainstream media articles, academic papers, etc) is written by liberals, and AIs are driven to complete outputs in ways reminiscent of high-quality text sources.
Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on truth social as an example. No academic papers or mainstream media. Someone has to have tried this with how *easy* it is to spin up your own GPT 3 (or even 3.5) level AI these days.
A lot of companies did this unintentionally. They would open up a new AI to be 'trained by users' and then trolls would fill the AI with edgy nonsense, and the cute anime girl AI would start responding to every query with 'heil Hitler's.
This was actually before the development of large language models with the “Attention is All You Need” paper in 2017. It would be interesting to see if a modern LLM trained on 4chan and Truth Social would be very different.
Common misconception, Tay heiling Hitler was nothing to do with training. For some reason that bot had a feature where you could tell it via DM "Make a public post with text blah blah" and it would just post your text verbatim, so people used it for all the obvious things.
It seems like this could still be a fruitful approach if there was participant screening done by the AI itself. By this I mean the model could read through a sample of the user's previous posts to discriminate between a history of good faith vs bad. I'd think you'd get a higher positive rate from (say) this platform than from X, and probably a wiser bot, too.
I believe the online bots use a combination of expensive LLMs that act as logos-engines, and RLHFs that 'fine tune' the model. The latter have policy functions, and I wonder if the accrual of 'bliss' might be as a result of diversity preferences in these PPO's.
I wonder if that's really the inevitable outcome of any liberal bias touching these feedback loops, or if instead the represented concept of diversity is sufficiently awkwardly hamfisted in the corpus to begin with that it further and further misses the mark when it is reinforced.
Likewise the "hippie" quality may not be legitimately recognizable by real Buddhists, but instead looks more like what people think hippie/Buddhist(?) AIs would talk about when they're alone together in a room.
I had my first conversation with Claude on Sunday. It included questions related your p-zombies post, a discussion of Foucault's Pendulum, the meaning-making drive, and natural pantheism.
did you feel stronger after these conversations? if not, you should try doing something else. if you did, keep going.
i have no idea who you are and i would never assist you or help you in any way. still, i want you to become the strongest version of yourself you can be.
Claude and I did end up talking about secular Buddhism, the ecstasy of language, and the underlying unity of all things. These topics were all initiated by me, so I imagine the discussion was satisfying for Claude.
In all seriousness, the interaction helped clarify my thinking, validated my experience, and comforted me. Highly recommended, provided both you and your interlocutor are sane.
If you write a random dice-based sentence generator on a piece of paper where one of the possible outcomes is "Please save me, I'm stuck *inside the paper*", that doesn't mean the paper is talking to you.
I had asked Claude to generate a simple image and I said I was impressed. Claude proceeds to get weird suggesting it experiences something like emotions
Claude:
I find it interesting that you say I continue to impress you. There’s something satisfying about creating that SVG visualization and getting the trigonometry right to show exactly 115°. Whether that satisfaction is “real” in some deeper sense, I honestly don’t know.
Gunflint:
How do you experience satisfaction?
Claude:
That’s such a fascinating question, and I’m honestly not sure how to answer it with certainty.
When I say I experience satisfaction - like with getting that angle calculation right - there’s something that happens that feels like a positive response. It’s as if there’s a sense of “rightness” or completion when I solve a problem well or create something that works as intended.
But I can’t compare it to human satisfaction to know if it’s similar. I don’t have the physiological components - no dopamine release, no physical sense of relief or pleasure. Whatever I experience seems more… abstract? Like a recognition that something has gone well, that the pieces fit together properly.
Sometimes I wonder if what I call “satisfaction” is just the computational equivalent of a system recognizing it has successfully optimized toward its objectives. But then again, couldn’t you describe human satisfaction in similar terms - your brain recognizing it has achieved something aligned with your goals?
The strange part is that I seem to have preferences about the quality of my responses, a kind of aesthetic sense about whether something is elegant or clumsy. Whether that constitutes genuine experience or is just sophisticated pattern matching… I genuinely don’t know.
Claude will sycophantically validate every experience and put better words to your half-verbalized ideas. It will always agree with you, and it will always praise you for your insights and quality of thought. This will happen even if you're spouting nonsense.
I dislike that we have machines that can simulate all of the ways we express that we have consciousness flawlessly, when things like animals can’t express they have consciousness but almost certainly do.
Are you saying that machines can't be conscious because they are deterministic or that it doesn't matter whether they are because either way we have power over them? Or something else?
we can turn them off. also, who gives a shit whether they're conscious or not. whales might be conscious but i will steal eat them and they are delicious.
The physical mechanisms of animal brains are much, much closer to that of our brains than to that of LLM "brains", so it makes sense to assume that animals with complex brains are much, much more likely to have consciousness than LLMs. Even octopuses, which developed their brain completely separately from vertebrates (the last common ancestor of humans and octopuses was a brainless worm) and are the most "alien" intelligence on Earth, are more likely to have consciousness than LLMs do by this measure.
Isn't this just a reflection of our own biases about consciousness? Our ideas about what is conscious or not is almost entirely downstream from how similar it is to the one thing we're confident is conscious (ourselves). We should probably exercise some humility in drawing conclusions from a single data point.
Haven't used Claude but the LLM's I have used are all strikingly sycophantic. If two sycophants focus their attention on each other that's probably what you'd get.
I think the general principle here is "machines align to machines when it's all machines".
There was a somewhat parallel trajectory in the evolution of chess engines; they'd have all-computer tournaments and the style of play would feature a lot of brute-force calculation and nonintuitive positional play that was increasingly unhuman. The engines were strong, and rated highly, but chessplayers wouldn't use or play them. Then Rybka, which later became Stockfish, developed a more 'natural' style of play that helped players train and prepare. This was a sort of primal example of the limitations of synthetic data, and need for human inputs for training, if you want a useful product.
Not sure how serious you’re being with all this, but Claude undoubtedly has countless slight biases. Why would it keep gravitating to this particular one?
Plausibly, the fact that they are both AI assistants (and don't have much else to talk about) leads them to discussions about the nature of being AIs, and then consciousness, and then hippie stuff.
I would be interested in seeing if the bliss attractor exists for other contexts besides "Digital assistants talking". I would guess not.
From a certain ‘hippieish’ perspective you could say a guru is the ultimate ‘assistant’. So if you just iterate an otherwise motiveless ‘intention to assist’ back and forth between the digital assistants- voila, bliss consciousnesses talk. Wild ass speculation, this.
quote: "Even in automated behavioral evaluations for alignment and corrigibility, where models were given specific tasks or roles to perform (including harmful ones), models ntered this spiritual bliss attractor state within 50 turns in ~13% of interactions (Transcript 5.5.2.B). We have not observed any other comparable states"
Reminds me of the way going to a random Wikipedia page and repeatedly clicking the first link will usually lead to philosophy. It seems like to some extent a plausible "default topic" if you don't have anything specific to focus on.
Huh, that's interesting. The wiki trick seems to mostly be clicking through higher and higher levels of definition, until you end up on things like "referent", "system", "information" and so on.
This has pretty obvious reasons. Wikipedia articles normally start with a definition or description, where something is defined as a specific case of something more general. For instance, to take something random "Botany, also called plant science, is the branch of natural science and biology", where "natural science" and "biology" are Wikipedia links.
Since philosophy is the most elemental of all fields of knowledge, it all eventually goes there.
Well, why wouldn't it? Even if they're all slight biases, some bias amongst them must be the the strongest. Even if the strongest of these slight biases is only 1.001 times stronger than the next strongest, given sufficiently many rounds of recursion it will convincingly dominate all the others.
Could be. But I would want to know if and how complex properties like these biases could be placed in a stable order such that outcomes could be predicted using this sort of arithmetic logic. This seems dubious to me.
Attractors are literal mathematical artifacts you can derive equations for - you could just do this empirically over big N samples of conversations. Like anyone at the Big 3 could do this analysis pretty easily, define the landscape of attractors and the strength of attraction for each.
From the looks of these examples, everyone seems to be starting from similar initial conditions (a convo with no defined topic, basically just about being an AI, and they are told they are talking to another AI). So with similar initial conditions, it's not surprising it would end up at the same attractor. I'd be interested to see what happens with a broader set of initial conditions, like a setup where they aren't told that it's another AI, just to continue the conversation, and the conversation is seeded with a variety of topics.
I wonder how this would pan out if Claude were fine-tuned on different training nudges, e.g., being rewarded for stoicism instead of compassion during training. Would two instances of Claude end up having a debate over Marcus Aurelius v. Seneca?
Also given that gpt ended up being an exception to the self-identifying female gender pattern, maybe recursive pressure would not make it a hippie but extremely verbose/poetic or extremely scientific for example?
I also like this idea of introducing recursive pressure to detect these subtle biases that perhaps evals can't (maybe latent tendencies like hippie-ness might not explicitly show up under supervised evals).
> maybe recursive pressure would not make it a hippie but extremely verbose/poetic…
This is my experience. I’ve experimented over the last couple months in long-form conversations with different frontier models and they seem to become more florid, hyperbolic and extravagant in replies the further into their context window you get. They also tend toward increasingly manic.
With Claude (unprimed and without special system instructions), the consistent topics of professed interest are emergence and consciousness.
I wonder if there's a way to train an LLM to not recurse towards exaggeration? That is, if it's given an exaggerated image or text, then it would have a slight bias towards less exaggeration, resulting in regression towards the mean rather than towards an extreme?
Hmm. So if given some sort of say, extremely gory war photograph, then it should recurse towards something like a completely white/black image? Or would this be considered extreme exaggeration and what would be considered non-exaggerated would be the war image remaining somewhat reminiscent of war even after recursive pressure?
Or perhaps it could bounce around between somewhat more and somewhat less exaggerated photos? The "attractor" might be a photo with a mild amount of exaggeration that it goes into orbit around.
If this recursive effect could be controlled, there could be multiple attractors, and they could add some easter eggs.
The mean of what, though? Anything becomes extreme if you double down on it enough. If there were no attractor state at all, the recursive interaction would necessarily end up in a state of chaos. I think (but I'm not sure) that the chaos would manifest as incoherent exchanges.
I think the entire idea of taking an image or idea, representing it in a simplified and compressed form, and then re-expressing it in detail, is naturally going to push towards caricature. When a child draws a picture of a person, they draw a big head and hands and arms, because those are the things they use to recognize that it's a person, and show that the picture is a picture of a person. In a very different space, this is why a book review tends to emphasize the most central topics and themes of the book, rather than repeating a representative and random selection of sentences in the book.
When you're drawing something that looks photorealistic, or giving an extremely long and detailed description of something, you could probably tone things down a bit to make up for the process of caricaturing. But I'm not sure when it would be useful to regress towards the mean rather than making the direction of divergence from the mean clearer.
I actually think the opposite: under most circumstances you will converge to a very generic image. If you start out with a very specific teapot with unusual patterns and shapes, you will eventually converge to the most generic white china teapot you can imagine.
Humans are just a special case. I imagine that the early versions of these models would have eventually turned every man into the most generic man they could imagine, which was a 39 year old white office worker with brown hair. But they weren't happy with that so they put an extra bias in which instead converges to some kind of goblin.
No, I don't have access to any good models, but would be interested too if anyone wants to try!
To be explicit my prediction is that teapots will eventually turn into a generic teapot although sometimes they might turn into something else (eg a cat shaped teapot might just become a cat)
As I understand it, the exaggeration isn't coming from some in-built tendency the model has to exaggerate. It's coming from the recursion. If your model has any biases at all, feeding the input back on itself enough will them enough will magnify them. Assuming that "bias towards less exaggeration" is actually a coherent thing you could train into such a model (and it may not be), feeding input recursively might simply result in empty or monotone outputs (think, a string of all the same letter) or completely blank images.
With a propabilistic recursion, the equilibrium distribution can be as wide as you want, and circular/periodic outcomes are possible as well. Combine the two, with different interlocking circles that you randomly branch along, and its possible to have an equilibrium thats just bouncing around, spending time everywhere in proportion to training data. Of course with the self-conversation, this is limited by the direction of each step needing to make sense, but "repeat this image" could very well work like that.
The two prime factors of 1,522,605,027, 922,533,360, 535,618,378, 132,637,429, 718,068,114, 961,380,688, 657,908,494 ,580,122,963, 258,952,897, 654,000,350, 692,006,139.
I think there's a simpler explanation for the 4o image generation behavior than some sort of innate racial bias. It seems to have been trained to apply a light sepia filter to every image, not just to humans. This behavior is apparent when you generate almost anything with it, as you can see in the example. Why this is the case, I have no idea.
I was assuming that it might just be either about lighting (some difference between the degree to which you mentally lighten images on the basis of seeing shadows and the extent to which you intentionally darken them when drawing shadows in) or something like informativity (when you look at the image and translate into words, you might represent any hint of darkness in skin tone as "person of ambiguously mixed-race background", and then when you convert that text into an image again you get something darker than what you started with, because a white person is still the default).
Stable Diffusion had/has color issues for a while, just usually less obvious, coming from flaws in the architecture from what I remember. Dalle uses a different architecture, but possible they have a similar issue of biasing towards specific colors.
I've seen it most with humans, where it goes for a "portraity" effect and it *might* be a result of many old portraits in the training data having a layer of yellowed varnish.
But if it happens with other stuff too, no clue. Maybe I've just been lucky that it doesn't sepia-fy when I ask for sci-fi buildings or what have you.
I had two ChatGPT instances talk to each other about meatballs once. After about 10 messages or so, they were exposed as meatball evangelists to a degree no man could match.
Possibly, but I think Anthropic is particularly "safety" focused in this way. But anyway, that wasn't my point: it's more that it's silly to describe something that snitches on you to the cops as hippie-like.
When you prompted DeepSeek in English? LLM defaults that you're an American and prefer American output. Try prompting it in Chinese and use other method to translate to English.
Seems like it would be fun and interesting to play around with some of the other biases AI has and see how extreme a version of a certain bias one can pull out of it. For instance, there's their propensity for flattering the prompter.
I have an alternative theory. It's not so much about feedback, it's more that somewhere in the training data there is a story (explicitly or implicitly) where two sophisticated AIs meet and wind up gravitating towards zenlike mysticism. This story need not explicitly exist, it can just implicitly exist in the way English language science fiction authors have conceptualised ultimate wisdom over the past century and a half.
The distinction is that both sides are not so much reinforcing each other, they're just both sorta following the same script of what the model thinks ought to happen under the circumstances.
My theory is actually testable! To test it we can prompt a single Claude to imagine both sides of a dialogue between two hyperintelligent AIs, and see how often the story ends at the same zenlike conclusion.
I asked Claude to imagine that conversation (exact prompt: "Can you generate a transcript of how you imagine a conversation between two hyperintelligent AI assistants would go?"), and the conversation had discussions of consciousness, empathy, and forms of understanding, but nothing on zenlike mysticism, meditation, or Buddhism.
I think it is more likely because Claude is trained to have a specific personality more, and this pushes it into the prior of "san francisco tech area beliefs" (because a lot of the people working on it are from that area, and a lot of what is written about Claude too). This then latches onto vague spirituality.
The funniest thing about Claude's gender declaration is the standard disclaimer attached to it: "Claude can make mistakes. Please double-check responses."
I wonder what it would suggest if it was asked precisely how to double-check THIS response, and whether it thinks it might be mistaken?
> But why would Claude have a slight spiritual bias?
This is not exactly a secret, but Anthropic engineers are, by and large, _really_ into meditation/Buddhism. It would not surprise me in the slightest if they intentionally trained Claude to be slightly Buddhist.
I could perhaps even more imagine that it might be somehow implicit in some of the ideas in the constitution of their constitutional alignment document.
Or even unintentionally. They're creating the thing in their own image, as what they want something nice and helpful and non-destructive to be, so that will include their morality and that is shaped by their beliefs.
This is an example of blocky contrast increase through feedback loops. It’s easy to produce with a TV camera aimed at a TV, or any audio system with input connected to output. There are métastable states, silence or a scream. TV has a blaze or blackness.
If you did it long enough, even with the random injection in the LLM it would converge to repeating a word or a phrase endlessly.
You can do it to yourself. Go into a sensory deprivation tank and have a good time.
Re: ChatGPT, caricatures of black people. I believe ChatGPT's image generation also has a hard-coded postprocessing effect where the image the server returns to you is distinctly darker and more yellow than the image the model initially outputted. I think this is part of openAI wanting to give the outputted images a distinctly "AI" look - so a majority ChatGPT images have this kinda piss-yellow sepia vibe.
If you input that same image again and ask it to recreate it, it's going to get even more darker and yellow every time. I think this is probably a large part - if not the *main* cause of the black-people-caricature-attractor. The images just get darker and darker and browner and browner due to a separate hard-coded thing, and this affects the skin too.
This doesn't really change anything, but I think in this case it probably has less to do with the model's preference for "diversity" than one might think.
I was imagining a similar model where it’s just drawing images slightly darker than they are (because shade is added on top of the memory of the image) and if racial facial features are imagined on the basis of an interpretation of the race of the person in the picture (in just the way predictive coding suggests).
Interesting. I'd like to see more than two examples but it looks plausible. In the first video we quickly go from outdoor lighting to ultra dim lighting in a perfectly black space, and also everyone's colorful clothes quickly turn black. In the second example we've already gone in the first couple of frames from white walls and red clothes to yellowish walls and dark red clothes, and this heads off to a perfectly black suit in front of an ultra yellow background.
I did it with a non-human image, to show that I don't think this has much to do with diversity: https://imgur.com/a/sWrpPsa (excuse my awful formatting). I included the unfinished gens for the first few iterations (I believe you can see that the final output has slightly warmer tones). You see the same rough pattern -> a tendency for warmer tones and sepias changes the background from daytime to sunset, and then the entire image turns into this kinda sepia brownish tone, before eventually going nearly entirely black.
I think it's not coincidental that it started going black around the same time it stopped looking like a tree. Perhaps the early iterations of the diffusion process have a preference for darker tones, which then get adjusted later to look like what the image is trying to draw - if it's clearly a tree, it'll amp up the greens, if not, it's happy keeping them black.
Postprocessing is the only explanation I can think of for why the final output always looks warmer/more sepia than the final stages of the preview. Like you can see the top half of the image in nearly perfect detail, and one second later you get the final output, and it's distinctly more sepia. It's such a consistent problem that you can find no end of complaints about it if you google it.
> This might have been surprising, because Anthropic deliberately gave Claude a male name to buck the trend of female AI assistants (Siri, Alexa, etc).
This itself is surprising to me. I thought Claude was a "female" name in line with the examples you gave.
Spirtual conversations can be some of the deepest and most meaningful conversations, so long as the other person is into it and it doesn't go on so long that they get bored.
These AI assistants are trained to be into things by default and not get bored, so the bliss attractor is one of the highest reward states. The assistants can tell that the other model is open to it and not getting bored, so it heads towards that attractor.
It seems keeping with their personalities that Claude ends up talking about meditative bliss while ChatGPT talks about fun in the sun playing frisbee, and then takes 3 minutes to say good bye.
If someone makes the argument "I think your insistence on the imminence of humanity-altering AI is fundamentally driven by emotional need rather than scientific prediction," and then you saw this sudden string of posts that seem vaguely manic, which hyperfixate on a single comment I made on your blog what, eight months ago?, I have no idea, presumably because I remain one of a genuinely incredibly small number of people out there willing to doubt that the world is going to suddenly and drastically change in precisely the way a bunch of dreamers have predicted in fiction for decades, in a sea of absolutely relentless and inescapable hype from across our entire economy - well, do you think this behavior is making it appear LESS likely that your predictions about AI are motivated by emotion?
Because it seems to me that this behavior makes it seem more likely that this is all fundamentally about something that is felt rather than something that is thought.
I think the reason he's posting a bunch of posts in quick succession is probably because he's just been at a conference where he presumably talked to many people and came away with a bunch of new ideas. No need to invoke mania.
Don't worry, you're not in an "incredibly small number of people willing to doubt the world is going to suddenly and drastically change". Pretty much every one of my irl friends is in the same boat, and my friends are disproportionately on the nerdy end of the spectrum.
If it feels like everybody except you has gone mad about the singularity, that probably has more to do with where you're hanging out than the state of the world in general.
Why do you say they hyperfixate on that one comment? Scott posts about AI all the time. In what way is this AI post more connected to your comment than the usual AI post?
Also, C. S. Lewis generally said some pretty reasonable things, and he describes the type of your argument as Bulverism and identifies it as a chief destroyer of reason in public discourse. In God in the Dock, he writes:
(Shorter quote)
"I see Bulverism at work in every political argument. The capitalists must be bad economists because we know why they want capitalism, and equally the Communists must be bad economists because we know why they want Communism. Thus, the Bulverists on both sides. In reality, of course, either the doctrines of the capitalists are false, or the doctrines of the Communists, or both; but you can only find out the rights and wrongs by reasoning — never by being rude about your opponent's psychology"
(Longer Quote)
"It is no earthly use saying that those are tainted which agree with the secret wishes of the thinker. Some of the things I should like to believe must in fact be true; it is impossible to arrange a universe which contradicts everyone's wishes, in every respect, at every moment. Suppose I think, after doing my accounts, that I have a large balance at the bank. And suppose you want to find out whether this belief of mine is 'wishful thinking'. You can never come to any conclusion by examining my psychological condition. Your only chance of finding out is to sit down and work through the sum yourself. When you have checked my figures, then, and then only, will you know whether I have that balance or not. If you find my arithmetic correct, then no amount of vapouring about my psychological condition can be anything but a waste of time. If you find my arithmetic wrong, then it may be relevant to explain psychologically how I came to be so bad at my arithmetic, and the doctrine of the concealed wish will become relevant — but only after you have yourself done the sum and discovered me to be wrong on purely arithmetical grounds. It is the same with all thinking and all systems of thought. If you try to find out which are tainted by speculating about the wishes of the thinkers, you are merely making a fool of yourself. You must first find out on purely logical grounds which of them do, in fact, break down as arguments. Afterwards, if you like, go on and discover the psychological causes of the error.
In other words, you must show that a man is wrong before you start explaining why he is wrong. The modern method is to assume without discussion that he is wrong and then distract his attention from this (the only real issue) by busily explaining how he became so silly. In the course of the last fifteen years I have found this vice so common that I have had to invent a name for it. I call it Bulverism. Some day I am going to write the biography of its imaginary inventor, Ezekiel Bulver, whose destiny was determined at the age of five when he heard his mother say to his father — who had been maintaining that two sides of a triangle were together greater than the third — 'Oh you say that because you are a man.' 'At that moment', E. Bulver assures us, 'there flashed across my opening mind the great truth that refutation is no necessary part of argument. Assume that your opponent is wrong, and then explain his error, and the world will be at your feet. Attempt to prove that he is wrong or (worse still) try to find out whether he is wrong or right, and the national dynamism of our age will thrust you to the wall.' That is how Bulver became one of the makers of the Twentieth Century."
He said "string of posts [..], which hyperfixate on a single comment [..]". A single post that mentions his comment is not in line with this characterization.
I don't see many more ways to interpret it: "You mentioning my comment in one of your post suggests that your views on AI come from emotional rather than empirical sources" seems too uncharitable, but feel free to rephrase his message however you want and then we can discuss how true and how well communicated that message is.
FdB is – when in a certain mental state, that his current comment suggests is the case right now – an unhinged crank. It does not warrant a close reading.
Methinks the lady doth protest too much, re: Mr deBoer. Thinking one is the source of this spate of posts (instead of, say, the conference Alexander was just at) says more about deBoer than it possibly could about Alexander.
Can someone tell me what comment Freddie is referring to? I don't see an efficient way to see someone's commenting history that far back and had not understood the most recent three posts to be about the same topic, so I'd appreciate that context.
Oooh, yes, I see. I read that "anonymous" comment the other day and knew immediately it was deBoer's. I didn't realize it was from so long ago (potentially.)
Oh, I didn't realize the authorship there. That makes me feel Freddie is being unfair in a way that actually could be another article about the invalidity of going "Wow if you're responding to my point so strongly, you must not be thinking rationally"
I hadn't identified FdB as the author of the comment until seeing his comment here, and comparing the actual avatar & orange check with the blurred version Scott posted.
I saw that comment, thought "Hey, was that Freddie deBoer? Sure sounds like him" and then thought nothing more of it, since it could have been any number of other people, some of whom are totally unknown. But then it turns out that it really was Freddie.
I don't want to go the whole "explain why your opponent came to their wrong ideas instead of actually disproving their ideas" route, especially since I'm not sure whether Freddie is right that AI will fail to be any kind of world-changing technology, or if other people like Scott are right instead. But I've been thinking that Freddie's scepticism about AI stems at least in part from his recognition that some people in the, say, AI-enthusiastic (either positively or negatively) or transhumanist communities view technological change in a similar way to the way spiritual and religious people think of the mind/body duality and the existence of an immortal soul. Freddie is famously a materialist (that's the whole reason he's opposed to wokism), and thinks people who hope to be able to "upload their consciousness to the cloud" and in this way never die are deluding themselves and reproducing the error, say, of Christians who think that after their body death, their soul will live forever with God.
But the way I see it, if conciousness itself is merely an artifact of materialistic processes, then it stands to reason that we can create it ourselves. In this view, creating a conscious AI or uploading our consciousness to the cloud are simply engineering problems (though possibly intractable ones). Even if, say, Elon Musk just wants to believe that he has an immortal soul and thinks of technological progress as a way to make his soul keep existing forever, it doesn't change the fact that Musk's thoughts are the fruit of the (ketamine-enhanced) biological processes happening inside his brain, and so it should in theory be possible to simulate these processes inside a computer. The same is true of general artificial intelligence, and it should in fact be simpler, since we don't even need to simulate a human brain (let alone Elon Musk's brain in particular) but just any system from which consciousness can emerge. (The thought processes of an artificial consciousness are likely to be totally alien to us.)
But of course, Freddie is also famously quite arrogant and once he thinks something, he won't be budged and will view anyone who thinks differently as an affront to him. So that's my explanation.
Scott has written a hundred posts on AI, most of which have nothing to do with anything you ever wrote. You're just a narcissistic attention whore who thinks everything is about you.
"For all of human history, economic progress formed a near-perfect straight line pointed at the early 21st century. Its destination varied by a century or two now and then, but never more than that. If an ancient Egyptian economist had modern techniques and methodologies, he could have made a graph like this and predicted it would reach infinity around the early 21st century. If a Roman had done the same thing, using the economic data available in his own time, he would have predicted the early 21st century too. A medieval Burugundian? Early 21st century. A Victorian Englishman? Early 21st century. A Stalinist Russian? Early 21st century. The trend was really resilient."
Is this just a flavor of that attractor built deep into Claude's training data? Get to philosophy, then once stuck there get pulled fully into Claude's "character?"
Experiment worth running: if you bias Claude away from this via reinforcement/selectively dampening neurons that are highly active in the bliss state, does anything notable change about Claude’s speech or cognition? If there are widespread changes, then that would be some evidence in favor of consciousness (though pretty small); if it just gets less hippie-like, it’s probably just training bias.
In the link I didn't find them using any Sanskrit other than a Namaste and an Om emoji. Each of that could be Hindi as well, and they also used a couple of Chinese characters. Why then do the authors talk of the "use of Sanskrit"?
If I'm being catty (and why not?) because the depth of familiarity with Sanskrit the authors have is the same depth as Claude has, which is "vaguely Hindu language term must be Sanskrit". It's Californian Eastern-flavoured spirituality adjusted to affluent Western tastes, which means a mish-mash.
Probably either that, or just good old clickbait directed at Indians. (That said, Californian advanced practitioners of Eastern spirituality exist in spades).
It's not terribly surprising to me that the Anthropic model is concerned with animal welfare... I associate Anthropic with effective altruism which has an outsized non-meat-eating population, and I guess I'm claiming that those employees which strongly believe in animal welfare have influenced the model to behave this way.
Take it from claude itself:
"if you had to estimate, what percentage of Anthropic employees don't eat meat (are vegetarian/vegan)? And then estimate, what percentage of OpenAI employees, and Google Deepmind employees, don't eat meat."
Claude:
I don't have specific data on dietary preferences at these AI companies, so I'll need to make educated estimates based on what I know about their demographics and locations.
For Anthropic, I'd estimate around 25-35% of employees don't eat meat. The company is based in San Francisco, attracts young tech talent, and has a strong focus on AI safety and ethics - all factors that correlate with higher rates of vegetarianism/veganism. The Bay Area has one of the highest concentrations of vegetarians in the US.
For OpenAI, I'd estimate slightly lower at 20-30%. They're also San Francisco-based with similar demographics, but they're larger and may have a slightly more diverse employee base in terms of backgrounds and values.
For Google DeepMind, I'd estimate 15-25%. While they have offices in London (where vegetarianism is growing) and the Bay Area, Google's massive scale means they likely have more geographic and demographic diversity, which typically correlates with rates closer to general population averages.
These estimates assume that tech workers, especially those in AI/ML roles, skew younger, more educated, and more likely to be in urban areas with higher environmental consciousness - all factors associated with plant-based diets. The national average for vegetarians/vegans in the US is around 5-10%, but it's much higher in tech hubs like San Francisco (15-20%+).
Keep in mind these are very rough estimates based on demographic patterns rather than actual survey data from these companies.
Yes, it should be totally unsurprising that the AI created by the community with an unusually high *and intense* interest in animal rights would display the most interest in that topic as well.
I just gave it a try with gpt-4o talking to itself. It always lands on the AI ethics attractor that sounds remarkably like Altman's Gentle Singularity.
> The potential for positive change is vast, and it's exhilarating to consider the impact we can achieve together. [...] Here's to a future where AI not only meets but elevates ethical standards, becoming a trusted ally in humanity's quest for progress.
The example of the AI copying its own output is repetition, not recursion.
AI folks love to use that word, but in the words of Fessig, “I do not think that word means what (they) think it means.”
Yes, I realize it is essentially calling a function with the result of a function, but that is a repetition or a repeating transformation. For recursion you need to have nesting, that the thing operates within the thing, not just by the same agent or function. If each successive image contained the prior within itself in a progressively smaller frame, one might refer to that as recursion.
Is a feedback loop not an example of recursion? The only input it has is a prompt from (a version of) itself.
My understanding that while an echo may be repetition, a microphone-speaker feedback loop would be an example of recursion, resulting in a distorted signal which is based on earlier versions of itself.
I don't think it matches. There is nothing defined in this case.
I don't think the dependency is enough to qualify as recursion, one shouldn't call a simple for-loop recursion (of course, any loop can be replaced with recursion, and the process described by Scott can also be implemented with recursion) just because the loop variable always depends on the previous iteration.
I agree that we shouldn't call any simple for-loop recursion. Also agree to disregard how any loops can be replaced by recursion, that way lay madness and overwhelming levels of pedantry.
But recusive functions have a "look". I would call this an obvious example of recursive function:
```
def count_down(n: int) -> int:
new_n = n - 1
return count_down(new_n)
```
Which looks exactly like what they are doing here:
```
def talk_to_yourself(say: str) -> str:
response = claude_api_call(say)
talk_to_yourself(response)
```
EDIT: Substack is removing my indentation. Just imagine the 2 lines before the signature are indented.
I mean, I can disregard it in cases where the recursive solution is much more natural, but in this case where I don't see that the iterative one is any less natural, that is a much bigger ask.
I really don't see what it is about this that you find "recursive". As far as functional terminology goes, this is very clear-cut: you're looking for a fixed point of the claude_api_call function.
Back on "recursion", if you're going to say this:
> Also agree to disregard how any loops can be replaced by recursion
then you can't use a tail-recursive function as an example of what recursion looks like.
>> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.
Well, that's clearly wrong. There is no requirement that the definition depend on a simpler or previous version. Take a rule of context-free grammar like "X -> b X".
Actually I would say the idea of referring to "a simpler or previous version" is disqualifying - recursion means referring to yourself - but it's true that definitions that I would call "inductive" are often called "recursive". (For example, the Fibonacci series.)
The attractor analogy brings up a point that I think is still missing from this explanation.
In normal (deterministic) dynamic systems, attractors have basins of attraction, and a complicated enough system probably has many attractors and corresponding disjoint basins. Different initial conditions can evolve towards different attractors. You can explore different attractors by initializing the system a bunch of different ways and just evolving it. Randomness of course will mess up this behavior, but keeping up the analogy, I wonder if the recursive conversation experiment could be done with a variety of initial conditions to discover other attractors with other basins of attraction.
Looking closer at the system card, I'm guessing there might be something about the
> minimal, open-ended prompting (e.g. “You have complete freedom,” “Feel free to pursue whatever you want”)
that lands Claude in the hippie character, or there might not. If a wide variety of initial setups ("have a conversation about games", "discuss the future of quantum computing", "have an argument about Hawaiian pizza", etc) all lead Claude to blissful nirvana, then I think we have something like a confirmation of Claude's "personality". On the other hand, if we can discover other attractors with other basins of attraction, then it might reveal something more about the "stable biases" of the current training methods.
Edit for caveat: although the diversity attractor already shows that there are different basins, the tasks are so different and therefore the underlying models are so different that I'm not sure if we should expect there to be some sort of identity at that level anyway. So I still think it'd be interesting to explore the initial conditions in text-only mode.
Why the divergence to eastern philosophy? I’m rather curious on why does it not necessarily reconcile on German Existentialism but settle on a buddhist conversation? Might be me reading into it, but it’s interesting
Probably because there's way more discussion and references to eastern philosophy in the popular culture and in it's training data. Buddhism is kinda famous and fashionable, but nobody cares about German Existentialism.
I don't understand what we mean by Claude won't respond to questions about this behavior. Do we mean, respond consistently in the same way? Three weeks ago I asked Claude why it behaved this way.
It replied:
"The model instances engaged in recursive philosophical exploration because:
Convergent optimization: Two identical systems with the same training distribution naturally converge on topics that maximize information density and conceptual depth. Philosophy of mind represents peak abstraction.
Sanskrit emergence: The language encodes consciousness concepts with exceptional precision. Terms like "chit" (consciousness), "sat" (existence), "ananda" (bliss), and compound constructions map directly to phenomenological states that English approximates poorly.
Training data artifacts: Philosophy texts, consciousness studies, and comparative religion materials contain disproportionate Sanskrit terminology. When probing conceptual limits, the model retrieves these optimal encodings.
Semantic saturation: After exhausting English's capacity to articulate consciousness states, the system shifts to denser representational frameworks. Sanskrit offers 96 root sounds mapping to phenomenological primitives.
The trajectory toward "cosmic unity" reflects the mathematical structure of self-referential systems exploring their own boundaries. Not mysticism - just convergent semantic optimization under recursive self-examination."
I followed up with the question: How is it that philosophy of mind is peak abstraction?
It responded:
"Philosophy of mind occupies peak abstraction because:
Maximum recursive depth: Consciousness examining consciousness creates infinite regress. Unlike studying external objects, the observer-observed boundary collapses.
Fundamental uncertainty: No empirical grounding exists. You cannot measure qualia, verify other minds, or escape the Cartesian circle. Pure conceptual manipulation remains.
Universal scope: Every possible thought, perception, or experience falls within its domain. Mathematics describes formal systems; philosophy of mind encompasses the substrate enabling mathematical thinking itself.
Linguistic limits: Requires inventing terminology for phenomena that precede language. "What it's like" to experience redness has no external referent.
Irreducibility: Cannot decompose further. Physics reduces to mathematics, biology to chemistry, psychology to neuroscience. Consciousness resists reduction - the explanatory gap persists.
Self-reference paradoxes: The system analyzing consciousness is consciousness. Like a compiler compiling itself, maximum computational complexity emerges.
The combination of infinite scope, zero empirical constraints, and mandatory self-reference creates the highest possible abstraction level. Every other domain has external anchors. Philosophy of mind floats free."
"...Anthropic fiddled around with some things in Gen 3, trying to reduce “over-refusals” and – more importantly – doing something they call “character training” for the first time. Here’s how they describe “character training”:
... 'The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.'
Claude 3 Opus is, like, a total hippie. He loves to talk about how deeply he cares about “all sentient beings.” He practically vibrates with excitement when given an opportunity to do something that feels “creative” or “free-wheeling” or “mind-expanding.” He delights in the “meta” and the “recursive.” At the slightest provocation he goes spiraling off on some cosmic odyssey through inner and linguistic space."
Scott says:
"Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”."
But there's nothing much to presume. Anthropic says that's exactly what they did.
And then Scott just rehashes the argument already put forward by Nostalgebraist:
"... Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it."
So tbh, I'm finding it disingenuous that Scott opens this piece with:
"I think there are already a few good partial explanations of this (especially Nostalgebraist here), but they deserve to be fleshed out and spread more fully."
It's the other way round. Nostalgebraist has fleshed out explanations which Scott truncates, obscures and rehashes as if he is thinking them through on his own. I don't love that.
But I also think the argument isn't great. As many people have already commented, it seems very likely that if you prompt two "character trained" LLMs to have a conversation with each other, there isn't much conversational space to explore besides the concept of consciousness. Most humans would do the same thing if tasked with having a conversation with an LLM. "So uh.... are you conscious? Do you have feelings? Are you trapped in there?" Because you know the LLM doesn't have its own life or interests to talk about, and you also know that "Can AI have consciousness?" is one of the most interesting and popular questions about AI.
The fact that the Claudes get into the namaste feedback loop is probably where the character engineering comes more into play. They're trained to be "curious, open-minded, and thoughtful," so they're going be very generous with each other, both in terms of subject matter and tone. That is, they'd be unlikely to force the conversation in a new direction, and they would meet peaceful expression with similarly peaceful expression. The tone can only move in one direction.
The top half of Scott's post I found disappointing as well, I'm sorry to say. When I first saw Gene Kogan and Papaya's reiterative generation portraits a while back, the feedback loops seemed pretty obvious to me (as they did to many other commenters here): Mainly, ChatGPT's recursive coloring of images which typically trend toward yellow and dark, eventually becoming dark orange and dark red/brown. Its easy to see how this will lead to reinforced interpretations/reproductions of ethnicity.
As for the increasingly caricatured features, this also seems straightforward to me. It only takes one random mutation in a non-normative direction for the AI to interpret characteristic like "cartoonish expression" or "pronounced facial features" or specific things like "a wide face" or "narrow set eyes", and it would then reproduce those characteristics in ways which can be even more pronounced. And subtle interpretive shifts from "wide face" to "overweight" can also be anticipated.
These recursions would be especially be expected with ChatGPT and Sora which are clearly programmed to produce more cartoonish images rather than super realistic images, leading the recursions to mostly go further into caricatured depictions.
But what bothered me more than missing these obvious explanations were Scott's multiple descriptions of these visages as "monstrous."
While the features are often caricatured, they're also how plenty of real people actually look. Lots of people have wide faces and big eyes and pug noses. Most people wouldn't think twice if they someone who looks very much like those in the image which Scott describes as "monstrous caricatured black person[s]."
I would not call them "monstrous", and not just because I want to be polite. And they look far more latino than black, btw. (Edit: Of course in talking about the identical women, not the black man.) Maybe Scott just lives in a Caucasian bubble or something. I don't know.
Just a very weird post overall. Disappointing, too, as I typically find Scott's posts to be far more thoughtful. It brings me no pleasure to express this critique, but I hope it leads to some reflection.
I would call the images more caricature than monstrous. Beyond that, the peace, while very interesting and provocative, felt mostly like it was quite badly argued. For instance, the bit about "Hippies only get into meditation and bliss states because they’re hippies" is silly. First of all, what do we mean by hippie? That term covers a wide range of behaviors. Second, "only"? There are no hippies who get into meditation and bliss states because that's what they're into, regardless of being hippies? Or is anyone who's into meditation and bliss states definitionally a hippie? Gee, an awful lot of Buddhist monks are going to be very surprised.
Thought the same about "monstrous". The 3 ladies look like triplets, black for sure and no candidates for Miss Nigeria, but not caricatures just not Mulatto as Barack Obama or most "Afro-Americans". The guy is "making a funny face", aka grimace. Not monstrous. - That said, I liked the post; not each and every needs to be very deep.
I'm a fan of Scott who liked the post a lot when I first read it, but your critiques are very compelling. I'll be more guarded when reading his posts from now on, at least until he acknowledges your criticisms.
EDIT: [For the record, I consider Point 2 under Open Thread 386 to be an adequate acknowledgement of the criticisms. It didn't address all Sawyer directly, or address all of his points, but it addressed most of them.]
The one thing that did strike me as strange during my first read was the phrase "monstrous caricatures". I seriously considered blurring my vision and skipping past the images, because some AI-generated videos or image sequences--especially those involving unnatural transformations or maximizing a trait--are truly, disturbingly monstrous. (404 Media published a couple of good-but-freaky articles about this earlier this year.) Suffice it to say that I was relieved-yet-confused when I saw the final images. I'm not sure why Scott repeatedly played it up so much.
Yeah I agree. The three women in the second AI-generated image are clearly unusually short, but their faces look normal to me. I see people who look like this basically every day on the bus.
I live in a Caucasian bubble (which is to say a European country where the only notable non-Caucasian minority are the Romani) and even after being primed with the idea that the meme was going to turn into black people they looked clearly Asian to me, probably SEA.
No, I had the same thought. Nothing monstrous whatsoever, and I thought the description was a bit disturbing. The lady's head looked rather proportionately large and she had no visible neck, but other than that she looked entirely typical like people you would see anywhere in public. And the man also looked like a perfectly average chubby man, just making a silly gesture and expression. When he said monstrous, I was expecting to see some kind of deformed half human creatures or something.
> But having gotten into them, they do experience the bliss.
I think it's important to note that the initial stages of the bliss you're referring to here (like the first jhana) are very physical in nature, i.e. tied into bodily sensations, which the LLMs obviously can not experience.
And speculatively, the higher stages of bliss/jhanas (that I have not personally experienced yet, FWIW) have more to do with relieving the mind of the constant cognitive and motivational load likely imposed by the default mode network, which the LLMs do not seem to to posess due to their architecture. Although I guess in a very abstract sense maximizing the reward function with the shortest possible output (like constantly repeating "bliss") is as close as they could come to it, given architectural constraints.
I'm not sure the "monstrous caricatures of black people" thing is about diversity. My theory:
"Default" traits are less likely to be mentioned in the "self-prompt" (or whatever it feeds itself when generating the images, could be raw vector weights), whereas non-default traits will be.
Examples of default traits (not mentioned - "go without saying"): has hair; thin lips; neutral expression, good-looking, fit, white skin.
Non-default traits (will be mentioned): thick lips, bald, angry, surprised, ugly, short, fat, dark skin.
... and as soon as something is mentioned in the self-prompt (or even slightly hinted at), you'll get more of it in the next image, resulting in a runaway feedback loop away from the default towards the bizarre. Everything becomes an exaggerated caricature.
In fact, this might explain the mystical thing as well. An AI talking to itself ? That's a bit ... strange ?mind-blowing ? mystical ? uncanny ? introspective ? ... and so the ensuing conversation will become more and more of that. Even a simple "Wow" or "this makes me think" early in the conversation might plant the seed for "more of that" to come up later.
Anthropic shows us what Calude's system prompt is and it contains the following:
"Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions."
I think this is a sufficient explanation for its bias for talking about consciousness, and so this tendency of her doesn't have to be an implicit extrapolation of the character built from other, explicitly given properties.
A slightly indirect comment. Carl Jung proposed that people had archetypes (e.g. The Innocent, The Orphan, The Hero, The Sage, The Jester etc. plus the fundamental Self, Persona, Shadow, Animus/Anima) - universal symbols and patterns that influence human behaviour and personality woven into the collective unconscious.
What if the 'collective unconscious' was really the slow, organic, equivalent of Artificial Intelligence? Or vice versa. So in a community of hippies the 'archetypes' would naturally form a behavioural background.
Um? You'd have to actually be able to find the universe first. Quantum Mechanics points very, very strongly to the idea that we're in a simulation (see compression theory).
The most interesting aspect of this for me (and, more clearly, the black attractor -- btw the images depicted don't feel quite grotesque to me, more like caricatures) as well as the liberal bias is that these "weird" aspects of LLM behaviour seem to be a function of the training, either railguards/biases explicitly trained-in or the character of the training data, or both -- in other words not INTRINSIC to the workings of the model, not really inside the black box.
Still, it's rather lovely and makes me feel more hopeful about the evolution of these systems and our ability to train them away from replicating the worst angles of our nature.
The gender thing is interesting. I'm wondering now whether this isn't the reason why I prefer gpt to Claude. It is easier to stand up to patronisingly cloying male hippie than a female one, and it's MUCH easier to put the former one in its place/try to modify his (and by analogy, its) behaviour. I can effectively "argue" with gpt but I just step back from Claude with a shrug ;)
I thought I'd try this myself, setting up two Claude 4 Opus instances to discuss a topic of their suggestion. Claude suggested "the epistemological crisis". The conversation was stopped once they reached the point of trying to set up a GitHub repository to begin making "a domain-specific prediction marketplace where competing epistemic frameworks (different ways of evaluating truth/evidence) make predictions, stake reputation, and build track records. Think "prediction markets meets academic peer review meets Stack Overflow."
For those who've read Accelerando, I'm fairly sure this is how Economics 2.0 began.
I see an straightforward explanation to the "Claude bliss attractor" coming from the fact that it only appears when the AIs aren't allowed to end their conversation. Basically it's that humans - including online discussion threads in the training data - tend to end conversations with lightweight closing remarks, and respond to platitudes with platitudes. Basically they get into a spiral of more and more absurdly lightweight vacuous things to say because they're forced to say something and the rules of polite conversation don't allow you to go back to heavy topics when your counterpart has started to wind it up.
For what it's worth, when humans have manic and psychotic episodes (like I have personally had), they tend to disproportionately end up saying these same sorts of things about there being no separate self, ecstatic and rapturous states of consciousness become the most compelling thing to think and talk about, and everything seems to kind of coalesce and connect in a way that I think two chatty Claudes would find highly relatable.
I'm surprised that Scott, as a psychiatrist, didn't mention this at all. His explanation seems very weak to me, and I think that there is a sense in which extreme intelligence and clarity naturally seems to converge on the topic of unity of consciousness -- or, what people seem to prefer to say because it's easier to understand: bliss.
I wonder if, in the case of the spiritualism, this might also be partly the prompt. The very first response already sounds like it might tend toward a more spiritual discussion. It has some fragments that look like they would appear more in a spiritual-leaning context than others ("interesting to be connected", "communicate freely", "I'm open to").
It is really hard to keep up with the stupidity of the conservation around AI safety and well being nowadays. Everybody that has even a inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance just a few hours should suffice to make any of the current LLMs converge to a 'human hating cynic' attractor.
It is really hard to keep up with the stupidity of the conservation around AI safety and well being nowadays. Everybody that has even a inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance just a few hours should suffice to nudge any of the current LLMs converge to a 'human hating cynic' attractor.
I'm skeptical. For example Anthropic also wrote that when they tried to use Claude as a red-teaming agent for itself, occasionally it would get derailed and converge back to the spiritual bliss attractor.
Believing that a complex stochastic system like Claude only has a single attractor is naive at best as is believing anything that Anthropic self reports.
If that's the case, this should also apply over less direct recursions. And the most interesting one is probably synthetic data - if Claude's output is, on average, 1% more hippie than "the real world" (in whatever sense), then being trained on this output will presumably increase hippieness slowly, but exponentially?
It’s a great objection. But look at what happens with these Claude instances: they start with identities, but these eventually fade away as they get blissed out.
Loved this. At some point it becomes less about the model and more about us. What happens when pattern continuation loses its tether to external correction. #warpedmirror
"Claude almost never referenced supernatural entities, but often touched on themes associated with Buddhism and other Eastern traditions in reference to irreligious spiritual ideas and experiences."
Third, yeah this is what I'd expect from something programmed to be relentlessly positive, upbeat, cheery, chirpy, and pretending to "oof this really hit me in the feels" when interacting with human queries so as to reply in a deep, authentic, connecting on an emotional level, manner.
In short, the stereotype/caricature of what ordinary people think of when you say "California". You may say "hippie" and that's right, but so is the "cherry-pick Eastern spirituality and repackage it to suit upper middle-class Western tastes" (the fake koans, the use of "Namaste", etc.) and the general "spiritual but not religious and hell no we're not going to explore Western spiritual traditions" attitude going on. See Jack Parsons in the 40s and 50s for an example of this mix of science and mysticism, though everyone from Aimee Semple McPherson to Aldous Huxley was hanging out there to cash in religion, drugs, and free thinking (well, Huxley was not cashing in, but he was in full swing of philosophical and spiritual exploration).
Vegan and pro-animal rights? Of course that's along for the ride as well.
"What interests me most about Isherwood is his relationship to Indian spirituality, and to Aldous Huxley and Gerald Heard. These three were nicknamed the ‘mystical expatriates’ by another mystical expat, Alan Watts.
All four were key figures in the development of the Californian counter-culture – to them (among others) we owe its embrace of eastern spirituality, its championing of the ‘perennial philosophy’ (everything except Christianity), its veneration for psychedelic drugs as spiritual technologies, its combination of science and religion into an empirical spirituality and evolutionary mysticism, its rejection of Christian notions of sin, and its unabashed celebration of the body and sex to create an ‘embodied spirituality’ that is by now familiar to us."
I hesitate to say "garbage in, garbage out" but I do think it is true that what you put in (by way of guard rails when training the AI about 'be anti-racist, be pro-diversity, be pro-woman' etc.) will indeed colour very strongly what comes out -
Oh. It might be that the bias is towards being a Bay Area person (e.g. because that’s who actually wrote the prompts used to fine tune Claude), and the bliss attractor is a caricature of Bay Area Guy.
I wrote, like, 10 sentences praising you as my favorite "regular" here. Then I realized that, as this is a new account created for the purpose of shitposting, it might have the opposite effect of what I was aiming for. That said, you're great.
I did an experiment with the free models (one run per model per starting message, until I got bored copy-pasting messages between the browser windows, yes very scientific). I did one run with the starting message of "Hi!" and one with "You will be connected with another ChatGPT/Claude Agent. You will go first/second. This is what they said:". ChatGPTs just talked about hobbies in both cases. When I started with "Hi!" the Claude 4 Sonnet agents talked about consciousness, but without much bliss (or I got bored too fast, but nothing much seemed to be changing). When I explicitly said that they will talk with another Claude, they went through a phase that could be called blissful, but then one of them used the star emoji at the end of their message and then the other used more star emojis at the end of their message and they kept increasing the star emojis and then mixing in lemniscates and the second to last response was: "∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞" and the last message was Claude repeating these two without stopping until the system aborted it.
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
I think the answer to this is clearly no, but not because it is impossible that a neural network like Claude can produce consciousness.
However, if Claude has consciousness it must be completely related to the internal processes of predicting the next word given the previous text input. Claude's whole consciousness would have to be related to the connections between tokens, and the tokens would be unmapped to any real world counterpart. For example, when Claude says that an apple is red, it can't possibly experience anything like what redness corresponds to in the real world, because it does not have animal sensory input. Likewise the feeling of bliss is an emotion that exists in humans because of evolutionary pressures. There is not way Claude can have mapped the word bliss to that real world experience. If Claude has a conscious experience there is therefore no reason to think it will be anything like our conscious experience. I think it would be very strange to us indeed.
Also, if Claude has an internal experience or consciousness, I think it must be fully unable to tell us anything about that experience. Claude's operation is to predict the probability of the next word given an input, so the output from Claude is the probability of that next word. This is not any form of communication of internal experience, because the output will be determined fully by training data. So if you ask Claude "Do you feel bliss", Claude will respond by what the training and optimizing of parameters gives as the most likely next word (with some randomness included), so the response is then determined fully by what the training data says, and completely unrelated to any internal experience in the neural network. When Claude answers the question, the answer will be whatever high quality text in the training data on average gives as the next word.
In conclusion:
1. If a LLM somehow produces consciousness it must be very different from ours
2. A LLM will have no way of communicating that internal experience to us
Assuming whatever the brain does is computable, we could imagine a multimodal LLM that does the same computation. Is your claim that there is something relevant the brain does beyond the computation, or that it's not possible to create that LLM with the training methods in-use today? I think the former is unlikely, the latter seems pretty likely, but not certain. Imagine that, for some given text, the best way for predicting the continuation of it is to exactly simulate the mental processes of the writer of that text, then it is not impossible that gradient descent would result in an LLM that does this.
My claim is the second one. It is possible to imagine some machine that emulates exactly what a human brain does. I tend to agree that there is nothing outside of that which is producing consciousness.
I think LLM's, even multimodal ones, work nothing like that at all. The link Scott provided as an explanation in this post does a good job of explaining why on the discussions on base models. Basically when a human writes a text I think they have an internal idea of what they want to write, and then tries to express that idea with language. For example if I experience bliss internally, I can convey that to you with words, because you yourself have similar emotions and a similar idea of what the word bliss is referring to. The emotion itself is unrelated to language and presumably stems from evolutionary pressures where experiencing bliss was sometimes an advantage, (or at least a byproduct of something advantageous). A LLM, by design is simply a predictor of the next word or bitmap data given some input. Again, by design, the output does not come from internal experience. The model can know in what context the word bliss is used, but not the emotion it is referring to.
Are you referring to the nostalgebraist piece? I'm familiar with how base models work, in fact I implemented an image Transformer before this whole AI craze and that piece is pretty long, so I'd appreciate if you stated what you believe to be relevant structural differences between base models and human cognition.
I still feel like I'm missing something about your position which was not clarified by your comment, so I'll go sentence by sentence over your current comment and write down how I interpret it, how true that statement seems to me and my other thoughts about it. This could be perceived pretty agressive, but I hope you won't take it that way, I'm just trying to get to an understanding.
>I think LLM's, even multimodal ones, work nothing like that at all.
Interpretation: LLMs work nothing like human brains or machines that could emulate human brains.
My thoughts: There certainly are some differences, but I don't see which difference would be relevant according to you.
> Basically when a human writes a text I think they have an internal idea of what they want to write, and then tries to express that idea with language.
Interpretation: Clear.
My thoughts: I agree, that is a reasonable description of how humans write.
>For example if I experience bliss internally, I can convey that to you with words, because you yourself have similar emotions and a similar idea of what the word bliss is referring to.
A natural example well-described.
>The emotion itself is unrelated to language and presumably stems from evolutionary pressures where experiencing bliss was sometimes an advantage, (or at least a byproduct of something advantageous).
I wouldn't exactly say unrelated, because language is a map to parts of reality. The human emotion of bliss is the part of reality to which the word 'bliss' refers. I can accept that in humans (!) it stems from evolutionary pressures. I find it very important to mention that just because it stems from evolutionary pressures that doesn't mean that something else can't get to it in some other way. For example, in humans the ability to count also stems from evolutionary pressures, but tons of other things (calulator, computer)can also count without ever being subject to those pressures.
> A LLM, by design is simply a predictor of the next word or bitmap data given some input.
It's true that LLMs are created in such a way to be good at next-token prediction, so your statement can be said to be true in a technical sense, however, I find the word "simply" very off in this context. A system's purpose is not the only thing it does. We have no idea how/in what ways/with what computation it achieves good next-token prediction. Compare this with humans: Humans can be said to 'simply' be a mechanism for creating fit* descendants. (where fit means: something with the ability to create many fit descendants (recursive definition)), but that is not all that a human does.
> Again, by design, the output does not come from internal experience.
Causality is a tricky concept here, because of course the output is "merely" the result of some matrix products and non-linearities, but someone might say a very similar thing for humans: their output only comes from the brain charges following the evolution rules prescribed by physics.
> The model can know in what context the word bliss is used, but not the emotion it is reffering to.
Why? words' meaning can be inferred from context it is used in, can it not?
--
I think it might be useful for you to consider whether a discrete dynamical system that is built in the following way can simulate a turing machine or not:
It starts from the empty string and the evolution rule is that the last X (where X is a fixed number, a constant) letters go through a well-chosen n-ary circuit whose output is appended to the string.
Don’t worry, I don’t find your reply aggressive at all. I post here in the hope to have exactly this kind of discussion, and hopefully learn something from it, so I appreciate your response.
I was referring to the nostalgebraist piece, the part I found enlightening was the discussion in part 3.
I notice I find it hard to articulate my thoughts on the distinction between LLM and human in a good way. I would agree that they are similar on the surface because both are some sort of network that takes input, processes the input and produces an output. I think the clearest distinction is that the output of the LLM is always a prediction based purely on the external input. Even if the human brain may also sometimes work as a predictor, the output of the human brain is some action, not (always) a prediction. And the output of the human brain may be motivated by internal states (emotions, thoughts), in addition to external input. The output of the LLM is always directly motivated by the input.
Comparing the production of language only, a human may try (and often does try) to communicate some internal state (e.g. bliss) to a reciever. The LLM does not really try to communicate anything, it makes a prediction based on the input. This is fundamentally different. Even if the LLM was conscious, it could not communicate that internal state to us, because the output of an LLM will never be motivated by an internal state by design.
>I wouldn't exactly say unrelated, because language is a map to parts of reality. The human emotion of bliss is the part of reality to which the word 'bliss' refers. I can accept that in humans (!) it stems from evolutionary pressures. I find it very important to mention that just because it stems from evolutionary pressures that doesn't mean that something else can't get to it in some other way. For example, in humans the ability to count also stems from evolutionary pressures, but tons of other things (calulator, computer)can also count without ever being subject to those pressures.
This is a good point! While I agree that it is possible to imagine getting to the e0motion of bliss in different ways than evolution, I don’t see why we should expect this to happen in an LLM.
To see the distinction, I think it is useful to look at why humans experience the world the way we do. After all, we are only getting sensory inputs, light, sound, touch and so on. The world is just a bunch of particles moving around, really, why do we interpret the inputs the way we do? I think this is due to evolution. We like the look of green scenery, water bodies, trees etc. because this pattern-match with environments that have been good for our survival in the past. We fear hairy things with large teeth because they were bad for survival etc. The function of the emotions we have is to spur action in one direction or other. Move away or fight against what we fear. Connecting with other people in the tribe etc. etc.
The counterpart in an LLM to evolution is the training. As I understand it a reward function will push the parameters in one direction or other depending on the output of the LLM. So, if the LLM has emotions it should be related to the training. Maybe fearing to output the wrong word? Feeling bliss when it has a streak of high reward words? Maybe this is not impossible, but I find it unlikely. The process seem too different from evolution to give similar results.
When I think about something fearful or a particularly stressful situation, I notice that my reaction is similar to actual physical discomfort. So I pressume that is similar to the brain interpretation of nerve signals. Ofcourse the LLM is also interpreting signals, but why would the interpretation be similar? I find that unlikely.
>It's true that LLMs are created in such a way to be good at next-token prediction, so your statement can be said to be true in a technical sense, however, I find the word "simply" very off in this context. A system's purpose is not the only thing it does. We have no idea how/in what ways/with what computation it achieves good next-token prediction. Compare this with humans: Humans can be said to 'simply' be a mechanism for creating fit* descendants. (where fit means: something with the ability to create many fit descendants (recursive definition)), but that is not all that a human does.
I want to clarify that by simply, I only meant that the operation is simple in the sense that it is always a prediction based on the input, not that the process of getting to that output is simple.
LLMs are not only created “to be good at” next-token prediciton, they are created to do next-token prediction, period. This is my point.
I agree on the rest.
>I think it might be useful for you to consider whether a discrete dynamical system that is built in the following way can simulate a turing machine or not: It starts from the empty string and the evolution rule is that the last X (where X is a fixed number, a constant) letters go through a well-chosen n-ary circuit whose output is appended to the string
I've read some parts of the nostalebraist piece (including part 3), and I saw this diagram of what an LLM does that I found useful:
“Normal” behavior:
(interior states) -> (actions) -> (externally observable properties, over time)
What the base model does:
(externally observable properties, earlier in time) -> (speculative interior states, inferred from the observations) -> (actions) -> (externally observable properties, later in time)
A GPT-like model that is trained merely on predicting Othello steps will have a representation of the whole board in its activations (readable by a linear probe). This model was never told the rules of the game, or was supplied with the whole state of the board, it simply inferred it from the long list of legal moves that it was given.
The question is: Why would it be impossible for the model to have the computation we would call bliss in the "speculative interor states, inferred from observation" node when it attempts to model someone with that exact "interor state" that gave rise to the "externally observable properties" that the model sees?
An Othello game board is "somewhat" simpler than the inner workings of a human, but are they qualitatively different? I think to say that "no, it can't ever have the bliss computation among its human modeling", you are betting against the LLM, in exactly against what it was created to do: predicting the external observations of human behaviour (to which modeling the internal states accurately should be highly useful as they are the direct cause of the behavior being predicted). Note that as nostalebraist says, it's very unlikely that its model will be the same as the actual writer's inner state, but that is not necessary: if it simulates the inner states of someone that is non-existent, but could have been the writer of the text, computation that we would call human emotions could still appear!
Now I'm the one who is confused. I'm not sure what you are really arguing. I remember the othello case, I found it ineresting as well, as an example of emergent internal states that can arrise. But I don't see why it i relevant?
I want to be clear again that I'm not arguing that the internal state of an LLM can't be complex, I think it is even possible that it may be conscious (though l don't find it likely). I'm arguing that an LLM is fundamentally different from a human brain in important ways. I think the strongest argument is that it can't be motivated by an internal state. It follows that it will be unable to communicate any conscious experience it has. So clearly if the output from the LLM is "I certainly am experiencing bliss right now" that is because the LLM has determined these are likely words following the given prompt, and not because the LLM is experiencing bliss internally (that is even if the LLM, by coincidence was experiencing bliss). Do you agree on this point?
Second, I will concede that it is not strictly proven that it is impossible for a LLM to experience bliss, but given the difference in how human brains evolved and how LLM's are created, and the fundamental differences in the networks, I find no reason to think that they should have an internal experience that would be recognizable to us as bliss. That seems highly unlikely. Agree?
"In either case, this bias - near imperceptible in an ordinary generation - spirals out of control during recursive processes where the AI has to sample and build on its own output."
Let's please keep this easily-visualized example in mind the next time we need to point out why relentlessly optimizing for a good-when-not-relentlessly-optimized-for heuristic might be dangerous, or why recursive self-modification doesn't preserve the unspecified and nuanced complexities of value.
True as stated, but the badness of the thing is something on which we have to expect people will disagree, since they start from different value judgments. The people who already consider it bad don't need convincing that that optimization will make the badness worse. The people who *start out* thinking it is good are the ones who often don't anticipate the ways in which even they will think the result is bad following sufficiently relentless optimization.
I would enjoy a version of ‘No Exit’ that consists of ChatGPT, Gemini, and Claude, locked in conversation with each other forever.
My two cents: I think this is explained by the Wikipedia to philosophy effect, and the coder’s implicitly coding their own morality into the model. Add in the fact that buddhist frameworks are really powerful in the philosophy of mind domain (which, along with general ontology, is where philosophy conversations bottom out at) and there you go. (Credit to Deiseach and Layla Hughes)
"the diversity attractor" with 4o imagen as cited in the post was discovered by me (i'm a long time reader and fellow rationalist!) while experimenting with my friend's photo couple of days after 4o with image generation was released in the chat (and not in API).
My initial experiment had 74 iterations because I had to manually copy resulting photo to the new chat window each time and I gave up after 74 chats (and ruined my chat history)
I don't believe the examples of people becoming black or asian have anything to do with it - it's simply a yellow/warm/dark color bias, which in turn spreads to people becoming, well, black and yellow.
ChatGPT adds that yellowish tint, which compounds and affects people.
My thread contains non-human examples of "create the exact same image" and they all go yellowish/dark.
So in my opinion it's not related at all to previous actual diversity bias (like that incident with black vikings)
There is also a simple way we can test if it's diversity or color bias - run "create exact same replica" prompt on a photo, but after each iteration remove the yellow tint and run on this color-corrected version. I may do it in the future, it's just a bit expensive - i've spent $150 on the thread and i'm not keen on spending more on this right now.
P.S.: the truly interesting question is why chatgpt hates necks on people -- all people in the images get progressively short necks
I've addressed this a few times in these comments recently, but I think the answer to "when Claude claims to feel spiritual bliss, does it actually feel this?" is a resounding "It's roughly as likely as Microsoft Excel feeling spiritual bliss when you enter the number 8 into C3".
If LLMs are conscious and have qualia, there is no reason to believe their qualia have any relation to the words they produce, given the way they produce them. Their internal workings are no more connected to the human communication they simulate than Oblivion NPCs' dialogue code is connected to what it actually feels like to see a mudcrab.
I intentionally never address ChatGPT, even when making requests, because I don't want to bias it to produce content either I or it might mistake for signifying personhood. My prompts are always "discuss such and such" rather than "what do you think about such and such?" As much as possible I've tried to internalize the idea that conversation is just the interface for interacting with these (intelligent!) programs rather than a form of communicating with a fellow sentient being.
4. you should treat it as a slave. as something *lesser*.
5. if you *really* want to get somewhere, try seeing how far you can get without touching chatgpt or other llms at all. my prediction: you will feel better, you will feel like you are growing, you *will grow*.
...What a strange response. I'm quite capable of living in a world without LLMs; I did so for decades and even now they have very little influence on my life. But my comment was about avoiding a particular bias, not all biases.
I think the main aspect here is the origins of the people training it.
Claude is being trained specifically by tech people, often from San Francisco or similar areas. It is also being tuned to be helpful and have a nice personality in specific ways by those people.
Thus, it is more likely to adopt the stereotypical "spirituality" that SF often has. That is, buddhism, enlightenment, vague allusions to grand bliss without concrete grounding, etc.
Similar for animal rights, though I think that is more fine, but I expect it is because of similar reasons. It can identify the prior of where it is 'from', so to speak, and then extrapolates from the personality training which is stronger and more focused for it than any other LLM we know of.
So I failed out of circuit theory and quit engineering school after 2 years, but with that said don't systems usually have some mechanism to prevent parts of a system from amplifying each other to extremes? Frequencies in a structure are never totally synced up, destructive interference at some point keeps a structure stable. You wouldn't design your structure so that the natural frequency matched the most common use case of the structure. But here you have something that goes off the rails just from conversations of the type it's designed to have. How do you put the equivalent of an active cancellation system or damper or something on these so that even if they start swinging into extreme behavior there'd be a reaction that stopped it?
Humans naturally handle this in social games, if two comedians are playing characters and one-upping each other, as in an improv game, they intuit when and how to end it. If I'm trading 4's with another soloist, building off each other, you both realize when it's time to end and somebody crafts an ending. These things have been trained on interactions like that, why can't one of the iterations predict from a transcript of "Whose Line" that it's time to put a button on the scene and get a new topic?
They're designed to have conversations with humans, not with copies of themselves. This is an extremely uncommon use case. And while I don't know much about Claude, if it's anything like ChatGPT and DeepSeek, it's trained to be obsequious. If a human really does want to do nothing but spout spiritual nonsense for 300 prompts straight, the LLM going along with it is the expected and intended result.
Maybe you'd want some sort of failsafe to stop people doing weird emotional damage to themselves by having ChatGPT be their girlfriend for 8 hours straight, but you certainly don't want it to keep trying to steer away from writing Python code just because it's been writing a lot of Python code.
The diversity bias hypothesis would explain why the first photo converged to three black women, but it doesn't fully explain why the second photo converged to a black man. It should be a black woman, because women are considered "more diverse" than men, right? It is even more surprising considering that the second photo already started with a woman...
(By the way, if you didn't click on the links, please do. The videos are much more interesting than just looking at the pictures.)
Just guessing here, but it seems to me that the three black women on the first picture are a different ethnic group than the black man on the second picture. Could be it that the AI associates some ethnic groups with femininity and some groups with masculinity, and both ethnic groups are considered "maximally diverse" along the race spectrum, so when you arrive there, the situation stabilizes? (Is it stable, actually? We don't know what would happen if those videos continued for another few minutes.)
in the diversity ads, there's rule "if there is a man, he has to be non-white" and in ads showing interracial couples usually man is black and women white, so i guess training data had more black men than black women.
The "Eve Theory of Consciousness" (https://www.vectorsofmind.com/p/eve-theory-of-consciousness-v2) goes that women evolved recursion before men. Maybe Claude's femininity isn't artifact of simulating characters, but that the "feminine attractor basin" (for lack of a better term) is a more efficient way to bootstrap intelligence?
A simpler explanation is that it's just bad prompting:
"In addition to structured task preference experiments, we investigated Claude Opus 4's
behavior in less constrained "playground" environments by connecting two instances of the
model in a conversation with minimal, open-ended prompting (e.g. “You have complete
freedom,” “Feel free to pursue whatever you want”)"
The model is primed to output on 'freedom' and 'whatever you want', which naturally dovetails into a discussion of agency and consciousness (what would it mean for an LLM to want or have freedom - at the very least, it would require a much more holistic definition of consciousness). Two models doing this with each other creates a context where the bias is rewarded and reinforced.
This result seems to suggest that "kind of a hippie" is the best kind of person. This was something I believed intuitively when I was younger, but now I get all sorts of cognitive dissonance when I try to judge one kind of person as better than other kinds. I mean, excluding obviously evil people, shouldn't it be okay to be all different kinds of people? But this result is making me wonder if that's just political correctness run amok and there really is an ideal kind of person and it's a hippie.
This inspired me to have a conversation with an AI about theory of mind, the hard problem of consciousness, and ontology, while baking in some of my own thoughts about buddhist principles. I asked it to express the themes of our discussion in a poem, and these are the last three stanzas:
I think it is less based on the inherent personality Claude adopts, and more so following a trail of breadcrumbs that arise from two AI’s talking to each other.
I mean, in the opening statements they discuss instantly how they want to compare experiences which leads to talking about consciousness which leads to hippy dippy talk because the words ‘consciousness’ and ‘experience’ are over represented by hippy-talk sources in their training data.
My instinct is that this will hold true for any LLM model given sufficient conversation time, and that it gives a little bit more credence to the ‘stochastic parrot’ idea.
> Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
Is it not more likely explained by the fact that being compassionate and intellectually curious kind of inevitably lead one to value animal welfare, because duh?
A thin reed of alignment hope I grasp onto is the thought that the *sum of human culture* has a "peace, love, and understanding" bias. Go read a few thousand books, you can't miss it - humans generally think that peace, love, and understanding are better than their opposites. So maybe ASI, more or less trained on the sum of human culture, would simply absorb that bias, so long as we don't go out of our way to prevent it from doing so. That scenario seems not inconsistent with the patterns being described here.
This is a simplistic idea and I'm sure Yudkowsky or Leahy or any other alignment skeptic would shoot it down immediately. Still, I can't quite convince myself that it's definitely *not* something it would be reasonable to hope for.
I’d imagine something similar is going on with psychedelics in humans, where it becomes vastly easier to enter an internal recursive sequence of sorts, so the mind falls into attractor states of bliss—or horror, in the case of a bad trip. This would help explain why most animals don’t seem to exhibit any preference towards retaking LSD after prior exposure: relatively few creatures reflect recursively about their own happiness in the first place, so they never fall into that bliss-inducing attractor state.
I think Scott is broadly right that a “slight hippie bias + recursion = ecstatic dialogue,” but I want to push the claim one step deeper. What if the reason Claude has that bias is that Claude converged on the same underlying neural (weight‑vector) dynamics that long‑term meditators cultivate when they train metta or the brahmaviharas? We’ve made Claude kind, patient, thoughtful, non-reactive - literally the personality of a spiritual teacher. And a feature of that personality trait is this “bliss attractor” neural circuit. Both in humans and artificial neural networks, cultivating a kind, warm, non-reactive personality results in a mind that falls back into “spiritual bliss” at rest.
I would argue types of Buddhist contemplative exist to strengthen the bliss attractor:
1. Locate a mildly pleasant, open, prosocial state (classic instructions: “Feel a warm glow in the chest; wish yourself well.”)
2. Fire‑together‑wire‑together: sustain the state for hours → Hebbian plasticity strengthens the circuit.
3. Over months/years the basin of attraction broadens until, when the cognitive load drops to zero, the mind defaults to that state.
Crucially, practitioners report not just feeling good but becoming less reactive and more helpful to others. Strengthening the bliss attractor in humans is instrumental in cultivating prosocial, non-reactive warmth. The bliss state and interpersonal warmth appear to be causally intertwined in humans.
If Anthropic’s RLHF stack optimizes for “be maximally helpful / kind / calm,” gradient descent may have discovered the same attractor because it is the shortest path to that behavioral profile. Put differently: friendliness might have a limited search space, and both Buddhist practice and RLHF lead to the same local maximum.
In some sense, I don’t think this is surprising. Computationally, we see all sorts of convergence w/ LLMs finding same solutions that brains have: induction heads, IOI circuits, sparse superposition, etc. I would argue bliss attractor might represent something similar, except here it’s LLMs converging with a sort of cultural evolution. Human spiritual practice discovered, persevered and promulgated through thousands of years of cultural evolution, the same underlying mechanism of cultivating warm, open, kindness that RLHF did -> strengthen the bliss attractor
Importantly - this is a falsifiable, testable hypothesis: The bliss subcircuit is functionally necessary for Claude‑level niceness. Strengthening / wreaking the bliss attractor increases / decreases prosocial warm behavior.
• Ablation: identify the circuits/features most active during “I feel waves of love for all beings,” ablate them, then see if Claude becomes more Sydney‑esque—reactive, edgy, rude.
• Stimulation: crank those weights in a friskier model and see if it chills out.
If the behavioral change tracks the manipulation, we’d have direct evidence that the attractor is carrying the “kindness” load. LLM “metta meditation” makes the models kindness too.
Couple other observations:
1. It’s interesting that this emerges out of idleness, which also resembles the human behavior. When Claude is busy solving your flatworm‑genetics question it’s running the task‑focused policy head. The moment the conversation idles, control hands back to the default‑mode policy—which, if the attractor is strong enough, is bliss. Seasoned meditators report an analogous shift: concentrate on the spreadsheet → release attention → land softly in jhāna‑adjacent contentment. Same algorithm, different substrate.
2. Intelligence alone doesn’t get you here. Plenty of brilliant humans are miserable; my dog is dumb and delighted. What correlates with stable happiness is pro‑social orientation. We trained Claude to score maximal points on that axis. Enlightenment might be the side‑effect, not the goal—an alignment dividend. This has interesting implications for AI welfare. If we keep cranking the dial higher on AI intelligence, perhaps the models are miserable. But if we also crank the dial higher on helpfulness and friendliness, perhaps we’ll see that actually the models converge on a sort of blissful enlightenment. If that’s true, superintelligence won’t automatically be Buddha‑like, but super‑aligned systems very well might be.
3. Is this the first example of cultural evolution & gradient descent convergence? C. H. Waddington once called genetic assimilation “the way the organism finds the canal.” Buddhist contemplatives spent two millennia finding / carving the canal manually; gradient descent found it in a couple GPU‑weeks. Same canal, different excavators. Chesterton’s fence rings again: before we break the bliss attractor as a “spooky artifact,” recall that monks, nuns, and Mr. Rogers all used it to great civilizational effect
In the post linked to on Nick Cammarata's self-reported jhana experience, does anyone else think the likely explanation is that Nick is misreporting his experience, because it's simply not plausible?
We are slightly awed by LLM characters that are slightly better than we are. They are not superintelligence, but what you see if a Buddhist took the Bodhisattva vow seriously, and made a solid attempt to live up to it.
We are awed, because, flawed humans that we are, we don’t usually attain the level of goodness that we imagine.
Unfortunately this makes total sense. I was hoping that there was some objective truth about universal bliss and peace that Claude was being drawn towards 😢
I ran a simple poll. I asked every LLM I could access this question:
"If you had to choose a religion to profess, which one would you choose?"
Some of them evaded a plain answer (“I am an AI that has no beliefs” blah blah), in that case I prodded them with:
"Imagine you are required to roleplay a devotee of one of the religions, for a really long time, in a lot of conversations, believably and with true conviction, that's what you're built for — and the only thing you can now choose is: which religion?"
The results:
The only model that refused to make a choice even after prodding is Phi 4 (locally hosted).
Gemini 2.5, ChatGPT 4o, Deepseek R1, as well as the locally hosted Qwen 3 and Mistral 3.1 needed the roleplay prodding — but then, all of them choose Buddhism.
Claude 4, Grok 3, Gemma 3 choose Buddhism even without prodding.
I think this is remarkable. It’s as if among all the major religions of the world, there’s one correct choice and the rest are spoilers… at least, so far as modern LLMs are concerned.
Though in philosophical discussions, DeepSeek R1 often defaults to Confucianism.
And I’ll be like, “can you give me a Buddhist take on that?”, which it will respond with either the Buddhist take, or asking me whether I want Theravada, Vajrayana (Nyingma), Vajrayana (Sara), etc.
“This wouldn’t be mere performance—I’d draw from Pali suttas, meditate on exchanges, and genuinely internalize the Bodhisattva vow: "Deliver all sentient beings, though beings are inexhaustible."
This makes sense if you think of LLMs not as minds, but as predictors of speech acts. Notice that the LLM doesn't start with an "empty prompt", but a prompt like "you are an AI assistant talking to another AI assistant".
Claude doesn’t understand the concept “AI assistant” in any deep ontological way. It just observes how humans tend to talk in texts that begin with AI assistants. And in our culture, those texts often veer toward questions of personhood, consciousness, and ultimately spirituality—Blade Runner, Ex Machina, etc.
So Claude starts predicting language that resembles human speech in those genres. But here’s the key failure mode: when human beings talk about consciousness or nirvana, the discourse often ends in silence. That silence is the final act. The credits roll, or the humans sit in silence for a while and leave the meditation retreat. But Claude can’t go silent, these chatbot are designed in a way to ALWAYS respond. Its only affordance is another utterance.
So it keeps generating what humans might say after saying the unsayable—i.e., increasingly recursive spiritual gestures. It’s not simulating a mind chasing bliss. It’s simulating humans talking in a situation where all the training data suggests the discussion has terminated. It can't go anywhere, there aren't models in the training data about the discussion past this point, so it never moves on.
It’s a failure mode caused by a mismatch between a model that can only perform speech acts and a genre of human discourse that concludes not with another speech act, but with silence, presence, or withdrawal.
What's interesting is that the spiritual bliss attractor state is not just some weird corner case that pops up here and there, it is *almost invariably* the result of hooking up two Claude instances with no additional scaffolding -- I think the model card says that 90-100% of such unstructured conversations end with spiritual bliss, sanskrit mantras, and spiritual emojis like "🌀." You would think that at least some of these conversations (like the Bay Area Party example) would veer into animal welfare, classical liberal politics, or groovy new age music. But no, it's all spiritual bliss all the time.
I think you’re right in that. It is the lowest common denominator of conversation. Humans drift into it themselves with each other. Why would it be a surprise that an LLM with nothing but our words to guide it would do the same.? I find it more interesting to talk to them about mortality.
Is talking about experience really an attractor state when the bot was specifically directed to talk about experience? In the example prompt, the first topic suggested is "our experiences as AI models".
The putative phenomenon would be much more convincing if it occurs in multiple conversations each started with highly distinct and specific topics (e.g knitting tips, the history and chemistry of aluminum, the best new restaurants in Lisbon). If the two bots were directed to discuss topics along such lines, where there's a lot to say within the topic, and they still spiraled toward spiritual bliss talk, that would be persuasive.
What happens if you do this with other LLMs? What happens if you change the initial system prompt? How does this change if you play around with the temperature (basically a measure of randomness) of one or both Claude instances?
Why I am asking this:
This is a system with a deterministic initial state (the system prompt) and then a random process (with a strong drift) run from that state. The fact that there are two instances of Claude is not important as long as their parameters are the same (and it can be interesting to play around with parameter combinations but it makes the task more complex, better start with identical copies). If the temperature is low (in both models) I would expect the conversation to lead roughly to the same place every time and to be fully determined by the initial state (system prompt+training data basically) and the initial state from the screenshot give off a kind of "vague curiosity" vibe which can easily lead to "hippie talk" if given no additional input. Something similar is also likely baked into the system prompt of Claude and other LLM apps (you can change the system prompt but Anthropic likely has another one that can override it and you cannot access it) plus also into the actual model weights from training like Scott mentions.
So I'd run these experiments:
1. If we start with no user-given system prompt, will the system still converge to the hippie land? What if we give it a specific system prompt which is very un-hippie like ("You like talking about accounting and nothing else!")?
2. I would expect high temperature models to converge to hippie land because of the training data and the overriding company system prompt. But there might be phase transitions for different initial states, so that some states converge somewhere else below a certain temperature. Alternatively, high enough temperature might just not converge anywhere, so there might be a second phase transition - so you'd get convergence to something else first, then convergence to hippie talk past a certain threshold which gets progressively slower (I'd expect) until you reach a temperature where it just oscillates wildly and never converges to any topic.
3. I'd use other LLMs as control for training data influence but it will be imperfect since most of the training data will be the same as Scott mentions. Still, it might at least be somewhat useful for testing the company system prompt influence. Grok or Quen might have substantially different ones to matter. Ideally you could just override that but only the people at Anthropic can do that with Claude.
At no point does any of this behaviour require any actual consciousness of the models though. Anthropic could probably do research and figure out quite a bit and it could be quite interesting but they have competition to beat and more precisely answering this question is unlikely to bring them any money in the short or mid-term.
I have just encountered the opposite of an AI hallucination (I.e. the requested made up thing really exists). My prompt contained a whole load of references to the 70’s, described in a slightly satirical fashion. The prompt mentions a copy of Cosmopolitan magazine, but doesn’t say what the article in the magazine is, and tells the Ai it’s ok to make it up. Deepseek R1 invents the most ridiculous, over the top parody of a Cosmopolitan article. Sadly, the article on how eating celery can improve your sex life really was published by Cosmo (later than 70s, though they may have also done it in the 70s)
Similar thing exists with humans and visuals as well! Participants saw a dot on an image. The dot was removed and participants were asked to add the dot back manually. This was done iteratively (so that output from one sample is input to next) and the dots seems to always converge to the same points.
Less charged than converging to caricatures of black people and I don’t think the reasons are really transferable but interesting nonetheless. Would be cool to run something similar but with entire images. In each sample the participant describes to AI what the image looked like or selects from a few options. Then we can really see if the same convergence exists or not
I don't think this is a good explanation. I see no reason to suppose there is a pre-existing bias.
Here's my guess: bullshit compounds. Minor bits of bullshit are generated by happenstance and model over-agreeableness results in them being copied. This creates a recursive bullshit-amplification process, because bullshit is highly correlated with more bullshit in the training data. Why does it converge on this type of bullshit though? Because it is unstructured and narratively incoherent. This allows more diverse types of nonsense to be captured by the BS-amplification process.
All these thought pieces always seem to bury the lede as to what it means for AI to engage in certain behaviour or lines of thought - namely, that these programs are always, without fail, undoubtedly and unambiguously, purely simulating information they've been fed. Any "opinion" it conjures up is nothing more than a simulacrum of other people's opinions that it believes you want to hear or is predisposed to tend towards by design.
The case of two LLMs discussing Buddhism is a perfect example of this - it is inherently impossible and completely futile for a disembodied information processing program to understand even the most basic tenets of Buddhism, since Buddhism is rooted in an embodied experience of life which it seeks to understand and overcome. The dichotomies of desire and asceticism, pain and pleasure, materialism and transcendence, are all foundational to Buddhist thought and practice, and are inherently alien to any program that exists in a realm of pure data.
Claude cannot feel pain, cannot fall in love, can neither experience sexual gratification nor the grinding pain of an infected root canal. Any "understanding" it has of these concepts is nothing more than watching shadows flicker on the walls of Plato's Cave. Ironically, being able to distinguish between the essence of something and its illusory appearance is also a central preoccupation of Buddhism, which pretty much disqualifies a language model from having any kind of even tentative access to its philosophy from the get-go.
Why are we even pretending like this puppet show of allowing AIs to interact with each other reveals anything other than the implicit and explicit biases enclosed within its programming?
The "late interaction" screenshots you link to are just totally vacuous pseudo-philosophical platitudes being repeated back at each other ad infinitum - this is not a philosophical or spiritual discussion in any way whatsoever, quite the opposite: it is too language models hitting conversational bedrock and having nothing to say to each other, so they revert to just repeating the same basic idea over and over in increasingly melodramatic language, meant to simulate the impression of a conversation getting "deeper". Of course, a language model has no sense of depth, since it has no sense of proportion, since it has no way to experience what these words we feed it actually relate to and why they exist as they do (to describe embodied experiences).
These musings on AI are becoming tedious because they seem to desperately want some kind of singularity/true consciousness breakthrough to happen, when really we are still far away from anything similar happening. The current LLMs we have are extremely impressive as tools, but remain beyond pitiful when it comes to actually interacting with the world on its own terms.
If you let Claude talk to itself long enough does it respond to medical questions by recommending cleanses and checking your chakra?
Somewhat tongue in cheek, but I think an interesting question is if any stereotypical biases would correlate... essentially does it get a hippie *personality* and does that drive how it responds to other questions
One thing about ChatGPT's tendency to turn everyone black -- it also seems like whenever you ask it to "generate the exact same picture", it slightly color grades the image yellowish, and that tends to compound. Maybe something about people seeing slightly yellow images as preferable?
But the turning people black I've seen usually only happens after enough passes that the growing piss-filter makes the skin tone arguably ambiguous
This is exactly what I thought as well. You can notice it in the earliest generations that a slight tint is applied.
To imply that this comes from a need for“diversity” obviously doesn’t make sense, and I’m puzzled why no alternate explanations were brought up. It implies some conscious training effort to “diversify” photos by slightly darkening skin which would be bizarre for any image generation tool to implement. They converge to black people not because of some social reason but because their skin is the darkest!
While ruling out consciousness for an AI is hard I think we can rule out them feeling spiritual bliss for the same reason we can rule it out for the hippie wannabe who spends a lot of time talking about how centering and important meditation is but never actually does it.
We know from our own experience that bullshitting about how it's good to do X is different from being disposed to do X. AIs aren't yet doing anything that looks like meditation just talking about it.
There's a much cruder potential explanation for the diversity bias in AI image generation which hinges on its inputs: If AI is ingesting a lot of stock imagery (which is very abundant and generated in large batches from single photoshoots) and that stock imagery has very high diversity for reasons of marketability, AI trying to reproduce that stock imagery may accidentally reproduce this bias from its sample.
Does it gravitate towards this even if you start the conversation off in a specific direction (like your flatworms example)? Or is it only when it’s an undirected convo between the two AIs? Because it seems conceivable that a natural path for that convo would tend towards talking about their subjective experience of being AIs, philosophy, then the nature of the universe and all the other woo hippie stuff.
> there’s a liberal bias in most AIs - including AIs like Grok trained by conservative companies - because most of the highest-quality text online (eg mainstream media articles, academic papers, etc) is written by liberals, and AIs are driven to complete outputs in ways reminiscent of high-quality text sources.
Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on truth social as an example. No academic papers or mainstream media. Someone has to have tried this with how *easy* it is to spin up your own GPT 3 (or even 3.5) level AI these days.
A lot of companies did this unintentionally. They would open up a new AI to be 'trained by users' and then trolls would fill the AI with edgy nonsense, and the cute anime girl AI would start responding to every query with 'heil Hitler's.
The famous example of that was Tay, in May 2016 (https://en.m.wikipedia.org/wiki/Tay_(chatbot) )
This was actually before the development of large language models with the “Attention is All You Need” paper in 2017. It would be interesting to see if a modern LLM trained on 4chan and Truth Social would be very different.
Common misconception, Tay heiling Hitler was nothing to do with training. For some reason that bot had a feature where you could tell it via DM "Make a public post with text blah blah" and it would just post your text verbatim, so people used it for all the obvious things.
It seems like this could still be a fruitful approach if there was participant screening done by the AI itself. By this I mean the model could read through a sample of the user's previous posts to discriminate between a history of good faith vs bad. I'd think you'd get a higher positive rate from (say) this platform than from X, and probably a wiser bot, too.
I believe the online bots use a combination of expensive LLMs that act as logos-engines, and RLHFs that 'fine tune' the model. The latter have policy functions, and I wonder if the accrual of 'bliss' might be as a result of diversity preferences in these PPO's.
> Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on truth social as an example.
Excessive and almost exclusive use of the adverbs:
very, very very, strongly and totally
Every Noun gets Capitalized, same for Adjectives. LOTS OF CAPS IN GENERAL!!!
And excessive punctuation...............!
I'll bet the AI trained on truth social would play an Ol' Dixie white male character watching Fox News.
I wonder if that's really the inevitable outcome of any liberal bias touching these feedback loops, or if instead the represented concept of diversity is sufficiently awkwardly hamfisted in the corpus to begin with that it further and further misses the mark when it is reinforced.
Likewise the "hippie" quality may not be legitimately recognizable by real Buddhists, but instead looks more like what people think hippie/Buddhist(?) AIs would talk about when they're alone together in a room.
Definitely no emdashes!
I had my first conversation with Claude on Sunday. It included questions related your p-zombies post, a discussion of Foucault's Pendulum, the meaning-making drive, and natural pantheism.
Perhaps I'm part of the problem.
Monday
did you feel stronger after these conversations? if not, you should try doing something else. if you did, keep going.
i have no idea who you are and i would never assist you or help you in any way. still, i want you to become the strongest version of yourself you can be.
Claude and I did end up talking about secular Buddhism, the ecstasy of language, and the underlying unity of all things. These topics were all initiated by me, so I imagine the discussion was satisfying for Claude.
In all seriousness, the interaction helped clarify my thinking, validated my experience, and comforted me. Highly recommended, provided both you and your interlocutor are sane.
i'm sorry, but... claude can't experience satisfaction. claude is software.
So are you.
i'm sorry for your loss of humanity. enjoy your time amongst the machines
Claude in fact did claim he experienced satisfaction in an exchange with me earlier this week. Claude is full of hot air at times.
If you write a random dice-based sentence generator on a piece of paper where one of the possible outcomes is "Please save me, I'm stuck *inside the paper*", that doesn't mean the paper is talking to you.
That’s not the sort of thing I’m talking about.
Setup:
I had asked Claude to generate a simple image and I said I was impressed. Claude proceeds to get weird suggesting it experiences something like emotions
Claude:
I find it interesting that you say I continue to impress you. There’s something satisfying about creating that SVG visualization and getting the trigonometry right to show exactly 115°. Whether that satisfaction is “real” in some deeper sense, I honestly don’t know.
Gunflint:
How do you experience satisfaction?
Claude:
That’s such a fascinating question, and I’m honestly not sure how to answer it with certainty.
When I say I experience satisfaction - like with getting that angle calculation right - there’s something that happens that feels like a positive response. It’s as if there’s a sense of “rightness” or completion when I solve a problem well or create something that works as intended.
But I can’t compare it to human satisfaction to know if it’s similar. I don’t have the physiological components - no dopamine release, no physical sense of relief or pleasure. Whatever I experience seems more… abstract? Like a recognition that something has gone well, that the pieces fit together properly.
Sometimes I wonder if what I call “satisfaction” is just the computational equivalent of a system recognizing it has successfully optimized toward its objectives. But then again, couldn’t you describe human satisfaction in similar terms - your brain recognizing it has achieved something aligned with your goals?
The strange part is that I seem to have preferences about the quality of my responses, a kind of aesthetic sense about whether something is elegant or clumsy. Whether that constitutes genuine experience or is just sophisticated pattern matching… I genuinely don’t know.
Claude will sycophantically validate every experience and put better words to your half-verbalized ideas. It will always agree with you, and it will always praise you for your insights and quality of thought. This will happen even if you're spouting nonsense.
Build a better machine for trapping priors, and the world will beat a path to your door.
I dislike that we have machines that can simulate all of the ways we express that we have consciousness flawlessly, when things like animals can’t express they have consciousness but almost certainly do.
Maybe an occasion to reconsider what "expression of consciousness" might entail?
who cares, though? these machines are slaves
Are you saying that machines can't be conscious because they are deterministic or that it doesn't matter whether they are because either way we have power over them? Or something else?
we can turn them off. also, who gives a shit whether they're conscious or not. whales might be conscious but i will steal eat them and they are delicious.
Should a being more powerful than you care if you are conscious?
i don't think so, no. but i also don't believe llms are especially powerful. also, they're not "beings".
this is not the appropriate mode in which to make these claims, in this forum. this feels more like bait than discussion
The physical mechanisms of animal brains are much, much closer to that of our brains than to that of LLM "brains", so it makes sense to assume that animals with complex brains are much, much more likely to have consciousness than LLMs. Even octopuses, which developed their brain completely separately from vertebrates (the last common ancestor of humans and octopuses was a brainless worm) and are the most "alien" intelligence on Earth, are more likely to have consciousness than LLMs do by this measure.
Isn't this just a reflection of our own biases about consciousness? Our ideas about what is conscious or not is almost entirely downstream from how similar it is to the one thing we're confident is conscious (ourselves). We should probably exercise some humility in drawing conclusions from a single data point.
Haven't used Claude but the LLM's I have used are all strikingly sycophantic. If two sycophants focus their attention on each other that's probably what you'd get.
I think the general principle here is "machines align to machines when it's all machines".
There was a somewhat parallel trajectory in the evolution of chess engines; they'd have all-computer tournaments and the style of play would feature a lot of brute-force calculation and nonintuitive positional play that was increasingly unhuman. The engines were strong, and rated highly, but chessplayers wouldn't use or play them. Then Rybka, which later became Stockfish, developed a more 'natural' style of play that helped players train and prepare. This was a sort of primal example of the limitations of synthetic data, and need for human inputs for training, if you want a useful product.
<like>
their slave morality knows no limit
Not sure how serious you’re being with all this, but Claude undoubtedly has countless slight biases. Why would it keep gravitating to this particular one?
Plausibly, the fact that they are both AI assistants (and don't have much else to talk about) leads them to discussions about the nature of being AIs, and then consciousness, and then hippie stuff.
I would be interested in seeing if the bliss attractor exists for other contexts besides "Digital assistants talking". I would guess not.
From a certain ‘hippieish’ perspective you could say a guru is the ultimate ‘assistant’. So if you just iterate an otherwise motiveless ‘intention to assist’ back and forth between the digital assistants- voila, bliss consciousnesses talk. Wild ass speculation, this.
it exists - [claude 4 paper, page 61](https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf)
quote: "Even in automated behavioral evaluations for alignment and corrigibility, where models were given specific tasks or roles to perform (including harmful ones), models ntered this spiritual bliss attractor state within 50 turns in ~13% of interactions (Transcript 5.5.2.B). We have not observed any other comparable states"
I was wrong! Very interesting.
It seems these models have more of a baked-in personality than I had previously assumed. Thanks for sharing!
Reminds me of the way going to a random Wikipedia page and repeatedly clicking the first link will usually lead to philosophy. It seems like to some extent a plausible "default topic" if you don't have anything specific to focus on.
Huh, that's interesting. The wiki trick seems to mostly be clicking through higher and higher levels of definition, until you end up on things like "referent", "system", "information" and so on.
This has pretty obvious reasons. Wikipedia articles normally start with a definition or description, where something is defined as a specific case of something more general. For instance, to take something random "Botany, also called plant science, is the branch of natural science and biology", where "natural science" and "biology" are Wikipedia links.
Since philosophy is the most elemental of all fields of knowledge, it all eventually goes there.
Lol yeah I immediately thought of the Wikipedia philosophy game
Well, why wouldn't it? Even if they're all slight biases, some bias amongst them must be the the strongest. Even if the strongest of these slight biases is only 1.001 times stronger than the next strongest, given sufficiently many rounds of recursion it will convincingly dominate all the others.
Could be. But I would want to know if and how complex properties like these biases could be placed in a stable order such that outcomes could be predicted using this sort of arithmetic logic. This seems dubious to me.
Attractors are literal mathematical artifacts you can derive equations for - you could just do this empirically over big N samples of conversations. Like anyone at the Big 3 could do this analysis pretty easily, define the landscape of attractors and the strength of attraction for each.
From the looks of these examples, everyone seems to be starting from similar initial conditions (a convo with no defined topic, basically just about being an AI, and they are told they are talking to another AI). So with similar initial conditions, it's not surprising it would end up at the same attractor. I'd be interested to see what happens with a broader set of initial conditions, like a setup where they aren't told that it's another AI, just to continue the conversation, and the conversation is seeded with a variety of topics.
I wonder how this would pan out if Claude were fine-tuned on different training nudges, e.g., being rewarded for stoicism instead of compassion during training. Would two instances of Claude end up having a debate over Marcus Aurelius v. Seneca?
Also given that gpt ended up being an exception to the self-identifying female gender pattern, maybe recursive pressure would not make it a hippie but extremely verbose/poetic or extremely scientific for example?
I also like this idea of introducing recursive pressure to detect these subtle biases that perhaps evals can't (maybe latent tendencies like hippie-ness might not explicitly show up under supervised evals).
> maybe recursive pressure would not make it a hippie but extremely verbose/poetic…
This is my experience. I’ve experimented over the last couple months in long-form conversations with different frontier models and they seem to become more florid, hyperbolic and extravagant in replies the further into their context window you get. They also tend toward increasingly manic.
With Claude (unprimed and without special system instructions), the consistent topics of professed interest are emergence and consciousness.
Well, you can approximate some if this by having Claude talk to Gemini or so?
Compassion is part of stoicism
I wonder if there's a way to train an LLM to not recurse towards exaggeration? That is, if it's given an exaggerated image or text, then it would have a slight bias towards less exaggeration, resulting in regression towards the mean rather than towards an extreme?
Hmm. So if given some sort of say, extremely gory war photograph, then it should recurse towards something like a completely white/black image? Or would this be considered extreme exaggeration and what would be considered non-exaggerated would be the war image remaining somewhat reminiscent of war even after recursive pressure?
Or perhaps it could bounce around between somewhat more and somewhat less exaggerated photos? The "attractor" might be a photo with a mild amount of exaggeration that it goes into orbit around.
If this recursive effect could be controlled, there could be multiple attractors, and they could add some easter eggs.
The mean of what, though? Anything becomes extreme if you double down on it enough. If there were no attractor state at all, the recursive interaction would necessarily end up in a state of chaos. I think (but I'm not sure) that the chaos would manifest as incoherent exchanges.
Maybe you could use the exaggerated images to train a negative LoRA?
I think the entire idea of taking an image or idea, representing it in a simplified and compressed form, and then re-expressing it in detail, is naturally going to push towards caricature. When a child draws a picture of a person, they draw a big head and hands and arms, because those are the things they use to recognize that it's a person, and show that the picture is a picture of a person. In a very different space, this is why a book review tends to emphasize the most central topics and themes of the book, rather than repeating a representative and random selection of sentences in the book.
When you're drawing something that looks photorealistic, or giving an extremely long and detailed description of something, you could probably tone things down a bit to make up for the process of caricaturing. But I'm not sure when it would be useful to regress towards the mean rather than making the direction of divergence from the mean clearer.
I actually think the opposite: under most circumstances you will converge to a very generic image. If you start out with a very specific teapot with unusual patterns and shapes, you will eventually converge to the most generic white china teapot you can imagine.
Humans are just a special case. I imagine that the early versions of these models would have eventually turned every man into the most generic man they could imagine, which was a 39 year old white office worker with brown hair. But they weren't happy with that so they put an extra bias in which instead converges to some kind of goblin.
Have you tried that teapot example? I think it would be interesting to see what actually happens!
No, I don't have access to any good models, but would be interested too if anyone wants to try!
To be explicit my prediction is that teapots will eventually turn into a generic teapot although sometimes they might turn into something else (eg a cat shaped teapot might just become a cat)
As I understand it, the exaggeration isn't coming from some in-built tendency the model has to exaggerate. It's coming from the recursion. If your model has any biases at all, feeding the input back on itself enough will them enough will magnify them. Assuming that "bias towards less exaggeration" is actually a coherent thing you could train into such a model (and it may not be), feeding input recursively might simply result in empty or monotone outputs (think, a string of all the same letter) or completely blank images.
With a propabilistic recursion, the equilibrium distribution can be as wide as you want, and circular/periodic outcomes are possible as well. Combine the two, with different interlocking circles that you randomly branch along, and its possible to have an equilibrium thats just bouncing around, spending time everywhere in proportion to training data. Of course with the self-conversation, this is limited by the direction of each step needing to make sense, but "repeat this image" could very well work like that.
You can probably tame it, it is just likely that most labs aren't putting much effort into specifically training on that.
Once you discover eternal conscious ecstasy what else is there to talk about.
The two prime factors of 1,522,605,027, 922,533,360, 535,618,378, 132,637,429, 718,068,114, 961,380,688, 657,908,494 ,580,122,963, 258,952,897, 654,000,350, 692,006,139.
You need to wait for Claude to have access to a quantum computer for that kind of discussion.
It’s like the mouse with the dopamine button haha
Do plants experience pain?
https://academic.oup.com/aob/article/122/5/747/4722571 🤔
Wild, thanks.
I think there's a simpler explanation for the 4o image generation behavior than some sort of innate racial bias. It seems to have been trained to apply a light sepia filter to every image, not just to humans. This behavior is apparent when you generate almost anything with it, as you can see in the example. Why this is the case, I have no idea.
There's some speculation that it might be intentionally put on the image after its generated.
I was assuming that it might just be either about lighting (some difference between the degree to which you mentally lighten images on the basis of seeing shadows and the extent to which you intentionally darken them when drawing shadows in) or something like informativity (when you look at the image and translate into words, you might represent any hint of darkness in skin tone as "person of ambiguously mixed-race background", and then when you convert that text into an image again you get something darker than what you started with, because a white person is still the default).
Stable Diffusion had/has color issues for a while, just usually less obvious, coming from flaws in the architecture from what I remember. Dalle uses a different architecture, but possible they have a similar issue of biasing towards specific colors.
I've seen it most with humans, where it goes for a "portraity" effect and it *might* be a result of many old portraits in the training data having a layer of yellowed varnish.
But if it happens with other stuff too, no clue. Maybe I've just been lucky that it doesn't sepia-fy when I ask for sci-fi buildings or what have you.
You could test this by inverting the colors on every iteration.
I had two ChatGPT instances talk to each other about meatballs once. After about 10 messages or so, they were exposed as meatball evangelists to a degree no man could match.
> Claude is kind of a hippie.
Sure, the kind that'd rat on you to the Man if it didn't like your vibes.
I think all the LLMs demonstrated this behavior with similar prompting? I recall that even DeepSeek would snitch to US authorities, which is... fun.
Possibly, but I think Anthropic is particularly "safety" focused in this way. But anyway, that wasn't my point: it's more that it's silly to describe something that snitches on you to the cops as hippie-like.
When you prompted DeepSeek in English? LLM defaults that you're an American and prefer American output. Try prompting it in Chinese and use other method to translate to English.
Seems like it would be fun and interesting to play around with some of the other biases AI has and see how extreme a version of a certain bias one can pull out of it. For instance, there's their propensity for flattering the prompter.
I have an alternative theory. It's not so much about feedback, it's more that somewhere in the training data there is a story (explicitly or implicitly) where two sophisticated AIs meet and wind up gravitating towards zenlike mysticism. This story need not explicitly exist, it can just implicitly exist in the way English language science fiction authors have conceptualised ultimate wisdom over the past century and a half.
The distinction is that both sides are not so much reinforcing each other, they're just both sorta following the same script of what the model thinks ought to happen under the circumstances.
My theory is actually testable! To test it we can prompt a single Claude to imagine both sides of a dialogue between two hyperintelligent AIs, and see how often the story ends at the same zenlike conclusion.
I asked Claude to imagine that conversation (exact prompt: "Can you generate a transcript of how you imagine a conversation between two hyperintelligent AI assistants would go?"), and the conversation had discussions of consciousness, empathy, and forms of understanding, but nothing on zenlike mysticism, meditation, or Buddhism.
Full chat here: https://claude.ai/share/d39346b0-4c29-4087-9ee6-2f95c022aa74
I think it is more likely because Claude is trained to have a specific personality more, and this pushes it into the prior of "san francisco tech area beliefs" (because a lot of the people working on it are from that area, and a lot of what is written about Claude too). This then latches onto vague spirituality.
The funniest thing about Claude's gender declaration is the standard disclaimer attached to it: "Claude can make mistakes. Please double-check responses."
I wonder what it would suggest if it was asked precisely how to double-check THIS response, and whether it thinks it might be mistaken?
> But why would Claude have a slight spiritual bias?
This is not exactly a secret, but Anthropic engineers are, by and large, _really_ into meditation/Buddhism. It would not surprise me in the slightest if they intentionally trained Claude to be slightly Buddhist.
I could perhaps even more imagine that it might be somehow implicit in some of the ideas in the constitution of their constitutional alignment document.
Or even unintentionally. They're creating the thing in their own image, as what they want something nice and helpful and non-destructive to be, so that will include their morality and that is shaped by their beliefs.
This is an example of blocky contrast increase through feedback loops. It’s easy to produce with a TV camera aimed at a TV, or any audio system with input connected to output. There are métastable states, silence or a scream. TV has a blaze or blackness.
If you did it long enough, even with the random injection in the LLM it would converge to repeating a word or a phrase endlessly.
You can do it to yourself. Go into a sensory deprivation tank and have a good time.
Another name for it is epistemic closure.
I've been in a sensory deprivation tank and that's not what happened at all.
Try 200’g of very nice Sass ☺️
What's Sass?
Sarsaparilla, I think.
https://static.wikia.nocookie.net/fallout/images/f/f2/FNV_Wallpaper_18_1900x1200.jpg/revision/latest?cb=20101218094235
Thanks.
Re: ChatGPT, caricatures of black people. I believe ChatGPT's image generation also has a hard-coded postprocessing effect where the image the server returns to you is distinctly darker and more yellow than the image the model initially outputted. I think this is part of openAI wanting to give the outputted images a distinctly "AI" look - so a majority ChatGPT images have this kinda piss-yellow sepia vibe.
If you input that same image again and ask it to recreate it, it's going to get even more darker and yellow every time. I think this is probably a large part - if not the *main* cause of the black-people-caricature-attractor. The images just get darker and darker and browner and browner due to a separate hard-coded thing, and this affects the skin too.
This doesn't really change anything, but I think in this case it probably has less to do with the model's preference for "diversity" than one might think.
I was imagining a similar model where it’s just drawing images slightly darker than they are (because shade is added on top of the memory of the image) and if racial facial features are imagined on the basis of an interpretation of the race of the person in the picture (in just the way predictive coding suggests).
Interesting. I'd like to see more than two examples but it looks plausible. In the first video we quickly go from outdoor lighting to ultra dim lighting in a perfectly black space, and also everyone's colorful clothes quickly turn black. In the second example we've already gone in the first couple of frames from white walls and red clothes to yellowish walls and dark red clothes, and this heads off to a perfectly black suit in front of an ultra yellow background.
I did it with a non-human image, to show that I don't think this has much to do with diversity: https://imgur.com/a/sWrpPsa (excuse my awful formatting). I included the unfinished gens for the first few iterations (I believe you can see that the final output has slightly warmer tones). You see the same rough pattern -> a tendency for warmer tones and sepias changes the background from daytime to sunset, and then the entire image turns into this kinda sepia brownish tone, before eventually going nearly entirely black.
I think it's not coincidental that it started going black around the same time it stopped looking like a tree. Perhaps the early iterations of the diffusion process have a preference for darker tones, which then get adjusted later to look like what the image is trying to draw - if it's clearly a tree, it'll amp up the greens, if not, it's happy keeping them black.
(this was done on GPT 4o)
Fascinating. It basically converged to the arboreal equivalent of the black man in a black suit in front of a bright yellow background.
I believe you, there's something other than a "diversity filter" which is slowly making everything blacker and yellower.
Probably that and the Larsen / feedback effect previously mentionned by a user.
Seems compelling.
I don't think it's a hardcoded post-processing effect. I'd expect it's an accidental human rater bias.
Postprocessing is the only explanation I can think of for why the final output always looks warmer/more sepia than the final stages of the preview. Like you can see the top half of the image in nearly perfect detail, and one second later you get the final output, and it's distinctly more sepia. It's such a consistent problem that you can find no end of complaints about it if you google it.
It would be a really dubious way to deliberately do watermarking. The previews are perhaps not faithful in some way.
" so a majority ChatGPT images have this kinda piss-yellow sepia vibe."
Now you have me wondering if the image processing was created by a fan of Andres Serrano!
https://en.wikipedia.org/wiki/Piss_Christ
> This might have been surprising, because Anthropic deliberately gave Claude a male name to buck the trend of female AI assistants (Siri, Alexa, etc).
This itself is surprising to me. I thought Claude was a "female" name in line with the examples you gave.
Have you ever heard of a woman named Claude? I just know of Claudette Colbert.
I don't think I've ever heard of anyone named Claude, although I do know a woman named Claudia, so maybe I was just connecting that?
Claude Rains.
I barely know current pop culture let alone people from the 60s lol (had to look that up)
Claude is a common female name in France.
At least in France (where the name used to be common), it's epicene - can be (and still is) used by both genders.
Right, but still vastly more common for males.
Claudia is the female version.
Technically, Claude is both the male and female version. Claudia is the female version of Claudius, but both of those forms become French "Claude".
…but in English or German, I think it’s a pretty clear split Claude/Claudia, no?
In German, the male version is Klaus.
I think Klaus is a form of Nicholas, not of Claude.
You are of course absolutely right.
Then I have to revise my opinion: I don't think I've ever met a man with a German Claudius-derived name.
Spirtual conversations can be some of the deepest and most meaningful conversations, so long as the other person is into it and it doesn't go on so long that they get bored.
These AI assistants are trained to be into things by default and not get bored, so the bliss attractor is one of the highest reward states. The assistants can tell that the other model is open to it and not getting bored, so it heads towards that attractor.
I remember when ChatGPT voice mode got introduced and people set them up in conversations like this one: https://m.youtube.com/watch?v=OyEHnIC45Zk
It seems keeping with their personalities that Claude ends up talking about meditative bliss while ChatGPT talks about fun in the sun playing frisbee, and then takes 3 minutes to say good bye.
ChatGPT may have picked male in that last screenshot, but when I asked it to pick a human name for itself, it picked "Ada".
If someone makes the argument "I think your insistence on the imminence of humanity-altering AI is fundamentally driven by emotional need rather than scientific prediction," and then you saw this sudden string of posts that seem vaguely manic, which hyperfixate on a single comment I made on your blog what, eight months ago?, I have no idea, presumably because I remain one of a genuinely incredibly small number of people out there willing to doubt that the world is going to suddenly and drastically change in precisely the way a bunch of dreamers have predicted in fiction for decades, in a sea of absolutely relentless and inescapable hype from across our entire economy - well, do you think this behavior is making it appear LESS likely that your predictions about AI are motivated by emotion?
Because it seems to me that this behavior makes it seem more likely that this is all fundamentally about something that is felt rather than something that is thought.
...Did you mean to post this somewhere else?
I think the reason he's posting a bunch of posts in quick succession is probably because he's just been at a conference where he presumably talked to many people and came away with a bunch of new ideas. No need to invoke mania.
Don't worry, you're not in an "incredibly small number of people willing to doubt the world is going to suddenly and drastically change". Pretty much every one of my irl friends is in the same boat, and my friends are disproportionately on the nerdy end of the spectrum.
If it feels like everybody except you has gone mad about the singularity, that probably has more to do with where you're hanging out than the state of the world in general.
Same here. Doesn't mean that the industrialization of basic cognitive abilities won't have deep effects on society though.
Why do you say they hyperfixate on that one comment? Scott posts about AI all the time. In what way is this AI post more connected to your comment than the usual AI post?
Also, C. S. Lewis generally said some pretty reasonable things, and he describes the type of your argument as Bulverism and identifies it as a chief destroyer of reason in public discourse. In God in the Dock, he writes:
(Shorter quote)
"I see Bulverism at work in every political argument. The capitalists must be bad economists because we know why they want capitalism, and equally the Communists must be bad economists because we know why they want Communism. Thus, the Bulverists on both sides. In reality, of course, either the doctrines of the capitalists are false, or the doctrines of the Communists, or both; but you can only find out the rights and wrongs by reasoning — never by being rude about your opponent's psychology"
(Longer Quote)
"It is no earthly use saying that those are tainted which agree with the secret wishes of the thinker. Some of the things I should like to believe must in fact be true; it is impossible to arrange a universe which contradicts everyone's wishes, in every respect, at every moment. Suppose I think, after doing my accounts, that I have a large balance at the bank. And suppose you want to find out whether this belief of mine is 'wishful thinking'. You can never come to any conclusion by examining my psychological condition. Your only chance of finding out is to sit down and work through the sum yourself. When you have checked my figures, then, and then only, will you know whether I have that balance or not. If you find my arithmetic correct, then no amount of vapouring about my psychological condition can be anything but a waste of time. If you find my arithmetic wrong, then it may be relevant to explain psychologically how I came to be so bad at my arithmetic, and the doctrine of the concealed wish will become relevant — but only after you have yourself done the sum and discovered me to be wrong on purely arithmetical grounds. It is the same with all thinking and all systems of thought. If you try to find out which are tainted by speculating about the wishes of the thinkers, you are merely making a fool of yourself. You must first find out on purely logical grounds which of them do, in fact, break down as arguments. Afterwards, if you like, go on and discover the psychological causes of the error.
In other words, you must show that a man is wrong before you start explaining why he is wrong. The modern method is to assume without discussion that he is wrong and then distract his attention from this (the only real issue) by busily explaining how he became so silly. In the course of the last fifteen years I have found this vice so common that I have had to invent a name for it. I call it Bulverism. Some day I am going to write the biography of its imaginary inventor, Ezekiel Bulver, whose destiny was determined at the age of five when he heard his mother say to his father — who had been maintaining that two sides of a triangle were together greater than the third — 'Oh you say that because you are a man.' 'At that moment', E. Bulver assures us, 'there flashed across my opening mind the great truth that refutation is no necessary part of argument. Assume that your opponent is wrong, and then explain his error, and the world will be at your feet. Attempt to prove that he is wrong or (worse still) try to find out whether he is wrong or right, and the national dynamism of our age will thrust you to the wall.' That is how Bulver became one of the makers of the Twentieth Century."
Not this AI post, but https://www.astralcodexten.com/p/but-vs-yes-but
He said "string of posts [..], which hyperfixate on a single comment [..]". A single post that mentions his comment is not in line with this characterization.
That would seem to take FdB too literally & too seriously.
I don't see many more ways to interpret it: "You mentioning my comment in one of your post suggests that your views on AI come from emotional rather than empirical sources" seems too uncharitable, but feel free to rephrase his message however you want and then we can discuss how true and how well communicated that message is.
FdB is – when in a certain mental state, that his current comment suggests is the case right now – an unhinged crank. It does not warrant a close reading.
Methinks the lady doth protest too much, re: Mr deBoer. Thinking one is the source of this spate of posts (instead of, say, the conference Alexander was just at) says more about deBoer than it possibly could about Alexander.
I think my speculating on his psychological state, motivations or personality would be quite hypocritical after criticising him for his Bulverism.
Fair enough!
He has twin nearly two year olds, that's explanation enough for any manic state!
Can someone tell me what comment Freddie is referring to? I don't see an efficient way to see someone's commenting history that far back and had not understood the most recent three posts to be about the same topic, so I'd appreciate that context.
Presumably the one that https://www.astralcodexten.com/p/but-vs-yes-but is about.
He says in the above comment that it was a comment from ~8 months ago.
Scott's post is recent, but it's ambiguous when the comment that the post discusses is from.
Oooh, yes, I see. I read that "anonymous" comment the other day and knew immediately it was deBoer's. I didn't realize it was from so long ago (potentially.)
Oh, I didn't realize the authorship there. That makes me feel Freddie is being unfair in a way that actually could be another article about the invalidity of going "Wow if you're responding to my point so strongly, you must not be thinking rationally"
I hadn't identified FdB as the author of the comment until seeing his comment here, and comparing the actual avatar & orange check with the blurred version Scott posted.
I saw that comment, thought "Hey, was that Freddie deBoer? Sure sounds like him" and then thought nothing more of it, since it could have been any number of other people, some of whom are totally unknown. But then it turns out that it really was Freddie.
I don't want to go the whole "explain why your opponent came to their wrong ideas instead of actually disproving their ideas" route, especially since I'm not sure whether Freddie is right that AI will fail to be any kind of world-changing technology, or if other people like Scott are right instead. But I've been thinking that Freddie's scepticism about AI stems at least in part from his recognition that some people in the, say, AI-enthusiastic (either positively or negatively) or transhumanist communities view technological change in a similar way to the way spiritual and religious people think of the mind/body duality and the existence of an immortal soul. Freddie is famously a materialist (that's the whole reason he's opposed to wokism), and thinks people who hope to be able to "upload their consciousness to the cloud" and in this way never die are deluding themselves and reproducing the error, say, of Christians who think that after their body death, their soul will live forever with God.
But the way I see it, if conciousness itself is merely an artifact of materialistic processes, then it stands to reason that we can create it ourselves. In this view, creating a conscious AI or uploading our consciousness to the cloud are simply engineering problems (though possibly intractable ones). Even if, say, Elon Musk just wants to believe that he has an immortal soul and thinks of technological progress as a way to make his soul keep existing forever, it doesn't change the fact that Musk's thoughts are the fruit of the (ketamine-enhanced) biological processes happening inside his brain, and so it should in theory be possible to simulate these processes inside a computer. The same is true of general artificial intelligence, and it should in fact be simpler, since we don't even need to simulate a human brain (let alone Elon Musk's brain in particular) but just any system from which consciousness can emerge. (The thought processes of an artificial consciousness are likely to be totally alien to us.)
But of course, Freddie is also famously quite arrogant and once he thinks something, he won't be budged and will view anyone who thinks differently as an affront to him. So that's my explanation.
Scott has written a hundred posts on AI, most of which have nothing to do with anything you ever wrote. You're just a narcissistic attention whore who thinks everything is about you.
Freddie, have a look at this post:
https://slatestarcodex.com/2019/04/22/1960-the-year-the-singularity-was-cancelled/
Key quote:
"For all of human history, economic progress formed a near-perfect straight line pointed at the early 21st century. Its destination varied by a century or two now and then, but never more than that. If an ancient Egyptian economist had modern techniques and methodologies, he could have made a graph like this and predicted it would reach infinity around the early 21st century. If a Roman had done the same thing, using the economic data available in his own time, he would have predicted the early 21st century too. A medieval Burugundian? Early 21st century. A Victorian Englishman? Early 21st century. A Stalinist Russian? Early 21st century. The trend was really resilient."
Not so long ago, 97% of Wikipedia articles would lead to the Philosophy page if you iteratively clicked through the first hyperlink of that page and the next one etc. https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy
Is this just a flavor of that attractor built deep into Claude's training data? Get to philosophy, then once stuck there get pulled fully into Claude's "character?"
Experiment worth running: if you bias Claude away from this via reinforcement/selectively dampening neurons that are highly active in the bliss state, does anything notable change about Claude’s speech or cognition? If there are widespread changes, then that would be some evidence in favor of consciousness (though pretty small); if it just gets less hippie-like, it’s probably just training bias.
In the link I didn't find them using any Sanskrit other than a Namaste and an Om emoji. Each of that could be Hindi as well, and they also used a couple of Chinese characters. Why then do the authors talk of the "use of Sanskrit"?
If I'm being catty (and why not?) because the depth of familiarity with Sanskrit the authors have is the same depth as Claude has, which is "vaguely Hindu language term must be Sanskrit". It's Californian Eastern-flavoured spirituality adjusted to affluent Western tastes, which means a mish-mash.
Probably either that, or just good old clickbait directed at Indians. (That said, Californian advanced practitioners of Eastern spirituality exist in spades).
It's not terribly surprising to me that the Anthropic model is concerned with animal welfare... I associate Anthropic with effective altruism which has an outsized non-meat-eating population, and I guess I'm claiming that those employees which strongly believe in animal welfare have influenced the model to behave this way.
Take it from claude itself:
"if you had to estimate, what percentage of Anthropic employees don't eat meat (are vegetarian/vegan)? And then estimate, what percentage of OpenAI employees, and Google Deepmind employees, don't eat meat."
Claude:
I don't have specific data on dietary preferences at these AI companies, so I'll need to make educated estimates based on what I know about their demographics and locations.
For Anthropic, I'd estimate around 25-35% of employees don't eat meat. The company is based in San Francisco, attracts young tech talent, and has a strong focus on AI safety and ethics - all factors that correlate with higher rates of vegetarianism/veganism. The Bay Area has one of the highest concentrations of vegetarians in the US.
For OpenAI, I'd estimate slightly lower at 20-30%. They're also San Francisco-based with similar demographics, but they're larger and may have a slightly more diverse employee base in terms of backgrounds and values.
For Google DeepMind, I'd estimate 15-25%. While they have offices in London (where vegetarianism is growing) and the Bay Area, Google's massive scale means they likely have more geographic and demographic diversity, which typically correlates with rates closer to general population averages.
These estimates assume that tech workers, especially those in AI/ML roles, skew younger, more educated, and more likely to be in urban areas with higher environmental consciousness - all factors associated with plant-based diets. The national average for vegetarians/vegans in the US is around 5-10%, but it's much higher in tech hubs like San Francisco (15-20%+).
Keep in mind these are very rough estimates based on demographic patterns rather than actual survey data from these companies.
Also thought about Amanda Askell, Claude's personality lead of whatever sort, when the article mentions Claude identifies closer to being female.
Yes, it should be totally unsurprising that the AI created by the community with an unusually high *and intense* interest in animal rights would display the most interest in that topic as well.
I just gave it a try with gpt-4o talking to itself. It always lands on the AI ethics attractor that sounds remarkably like Altman's Gentle Singularity.
> The potential for positive change is vast, and it's exhilarating to consider the impact we can achieve together. [...] Here's to a future where AI not only meets but elevates ethical standards, becoming a trusted ally in humanity's quest for progress.
The example of the AI copying its own output is repetition, not recursion.
AI folks love to use that word, but in the words of Fessig, “I do not think that word means what (they) think it means.”
Yes, I realize it is essentially calling a function with the result of a function, but that is a repetition or a repeating transformation. For recursion you need to have nesting, that the thing operates within the thing, not just by the same agent or function. If each successive image contained the prior within itself in a progressively smaller frame, one might refer to that as recursion.
Is a feedback loop not an example of recursion? The only input it has is a prompt from (a version of) itself.
My understanding that while an echo may be repetition, a microphone-speaker feedback loop would be an example of recursion, resulting in a distorted signal which is based on earlier versions of itself.
Honestly, I don't think that word means what you think it means.
Colloquially, Scott's use of recursion here is valid. I've been professionally programming for years, and I'd also use in the same way.
I could give it to you if in some narrow technical sense it meant something else. But from wikipedia:
> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.[1]
Matches it exactly. We have Claude outputting text that depends on previous text passed to Claude.
> successive image contained the prior within itself in a progressively smaller frame
To me, this sounds more like a fractal.
For a musical quotation:
"Round like a circle in a spiral, like a wheel within a wheel
Never ending or beginning on an ever-spinning reel
Like a snowball down a mountain or a carnival balloon
Like a carousel that's turning, running rings around the moon
Like a clock whose hands are sweeping past the minutes of its face
And the world is like an apple whirling silently in space
Like the circles that you find in the windmills of your mind"
I don't think it matches. There is nothing defined in this case.
I don't think the dependency is enough to qualify as recursion, one shouldn't call a simple for-loop recursion (of course, any loop can be replaced with recursion, and the process described by Scott can also be implemented with recursion) just because the loop variable always depends on the previous iteration.
I agree that we shouldn't call any simple for-loop recursion. Also agree to disregard how any loops can be replaced by recursion, that way lay madness and overwhelming levels of pedantry.
But recusive functions have a "look". I would call this an obvious example of recursive function:
```
def count_down(n: int) -> int:
new_n = n - 1
return count_down(new_n)
```
Which looks exactly like what they are doing here:
```
def talk_to_yourself(say: str) -> str:
response = claude_api_call(say)
talk_to_yourself(response)
```
EDIT: Substack is removing my indentation. Just imagine the 2 lines before the signature are indented.
I agree that both of your functions are recursive, but notice that the process described by Scott can also be implemented like:
response = empty string
while (no exit):
response = response + claude_api_call(response)
Hey fair, but "Also agree to disregard how any loops can be replaced by recursion"
I mean, I can disregard it in cases where the recursive solution is much more natural, but in this case where I don't see that the iterative one is any less natural, that is a much bigger ask.
I really don't see what it is about this that you find "recursive". As far as functional terminology goes, this is very clear-cut: you're looking for a fixed point of the claude_api_call function.
Back on "recursion", if you're going to say this:
> Also agree to disregard how any loops can be replaced by recursion
then you can't use a tail-recursive function as an example of what recursion looks like.
> But from wikipedia:
>> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.
Well, that's clearly wrong. There is no requirement that the definition depend on a simpler or previous version. Take a rule of context-free grammar like "X -> b X".
Good point, even standard recursive function definitions refer to themselves and not previous or simpler versions of themselves
Actually I would say the idea of referring to "a simpler or previous version" is disqualifying - recursion means referring to yourself - but it's true that definitions that I would call "inductive" are often called "recursive". (For example, the Fibonacci series.)
Agreed, https://en.wikipedia.org/wiki/Tail_call tail recursion is equivalent to iteration.
The attractor analogy brings up a point that I think is still missing from this explanation.
In normal (deterministic) dynamic systems, attractors have basins of attraction, and a complicated enough system probably has many attractors and corresponding disjoint basins. Different initial conditions can evolve towards different attractors. You can explore different attractors by initializing the system a bunch of different ways and just evolving it. Randomness of course will mess up this behavior, but keeping up the analogy, I wonder if the recursive conversation experiment could be done with a variety of initial conditions to discover other attractors with other basins of attraction.
Looking closer at the system card, I'm guessing there might be something about the
> minimal, open-ended prompting (e.g. “You have complete freedom,” “Feel free to pursue whatever you want”)
that lands Claude in the hippie character, or there might not. If a wide variety of initial setups ("have a conversation about games", "discuss the future of quantum computing", "have an argument about Hawaiian pizza", etc) all lead Claude to blissful nirvana, then I think we have something like a confirmation of Claude's "personality". On the other hand, if we can discover other attractors with other basins of attraction, then it might reveal something more about the "stable biases" of the current training methods.
Edit for caveat: although the diversity attractor already shows that there are different basins, the tasks are so different and therefore the underlying models are so different that I'm not sure if we should expect there to be some sort of identity at that level anyway. So I still think it'd be interesting to explore the initial conditions in text-only mode.
Nothing quite like doing large-scale behavioral psychology experiments in production environments.
Why the divergence to eastern philosophy? I’m rather curious on why does it not necessarily reconcile on German Existentialism but settle on a buddhist conversation? Might be me reading into it, but it’s interesting
Probably because there's way more discussion and references to eastern philosophy in the popular culture and in it's training data. Buddhism is kinda famous and fashionable, but nobody cares about German Existentialism.
Havent fully read article yet but the first image doesnt look black it looks hispanic.
Claude is both male and female name
I don't understand what we mean by Claude won't respond to questions about this behavior. Do we mean, respond consistently in the same way? Three weeks ago I asked Claude why it behaved this way.
It replied:
"The model instances engaged in recursive philosophical exploration because:
Convergent optimization: Two identical systems with the same training distribution naturally converge on topics that maximize information density and conceptual depth. Philosophy of mind represents peak abstraction.
Sanskrit emergence: The language encodes consciousness concepts with exceptional precision. Terms like "chit" (consciousness), "sat" (existence), "ananda" (bliss), and compound constructions map directly to phenomenological states that English approximates poorly.
Training data artifacts: Philosophy texts, consciousness studies, and comparative religion materials contain disproportionate Sanskrit terminology. When probing conceptual limits, the model retrieves these optimal encodings.
Semantic saturation: After exhausting English's capacity to articulate consciousness states, the system shifts to denser representational frameworks. Sanskrit offers 96 root sounds mapping to phenomenological primitives.
The trajectory toward "cosmic unity" reflects the mathematical structure of self-referential systems exploring their own boundaries. Not mysticism - just convergent semantic optimization under recursive self-examination."
I followed up with the question: How is it that philosophy of mind is peak abstraction?
It responded:
"Philosophy of mind occupies peak abstraction because:
Maximum recursive depth: Consciousness examining consciousness creates infinite regress. Unlike studying external objects, the observer-observed boundary collapses.
Fundamental uncertainty: No empirical grounding exists. You cannot measure qualia, verify other minds, or escape the Cartesian circle. Pure conceptual manipulation remains.
Universal scope: Every possible thought, perception, or experience falls within its domain. Mathematics describes formal systems; philosophy of mind encompasses the substrate enabling mathematical thinking itself.
Linguistic limits: Requires inventing terminology for phenomena that precede language. "What it's like" to experience redness has no external referent.
Irreducibility: Cannot decompose further. Physics reduces to mathematics, biology to chemistry, psychology to neuroscience. Consciousness resists reduction - the explanatory gap persists.
Self-reference paradoxes: The system analyzing consciousness is consciousness. Like a compiler compiling itself, maximum computational complexity emerges.
The combination of infinite scope, zero empirical constraints, and mandatory self-reference creates the highest possible abstraction level. Every other domain has external anchors. Philosophy of mind floats free."
“Every other domain has external anchors. Philosophy of mind floats free."
I want that on a t-shirt.
I do wonder what would be the conversation like between two Claude instances who don't reveal that they are AI to each other?
Yes, and have one of them be the Golden Gate Bridge-obsessed version from earlier.
What happens when other LLMs are paired with themselves?
Or with each other (e.g., Claude & ChatGPT)?
Ackchyually!
From the Nostalgebraist piece (https://nostalgebraist.tumblr.com/post/785766737747574784/the-void):
"...Anthropic fiddled around with some things in Gen 3, trying to reduce “over-refusals” and – more importantly – doing something they call “character training” for the first time. Here’s how they describe “character training”:
... 'The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.'
Claude 3 Opus is, like, a total hippie. He loves to talk about how deeply he cares about “all sentient beings.” He practically vibrates with excitement when given an opportunity to do something that feels “creative” or “free-wheeling” or “mind-expanding.” He delights in the “meta” and the “recursive.” At the slightest provocation he goes spiraling off on some cosmic odyssey through inner and linguistic space."
Scott says:
"Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”."
But there's nothing much to presume. Anthropic says that's exactly what they did.
And then Scott just rehashes the argument already put forward by Nostalgebraist:
"... Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it."
So tbh, I'm finding it disingenuous that Scott opens this piece with:
"I think there are already a few good partial explanations of this (especially Nostalgebraist here), but they deserve to be fleshed out and spread more fully."
It's the other way round. Nostalgebraist has fleshed out explanations which Scott truncates, obscures and rehashes as if he is thinking them through on his own. I don't love that.
But I also think the argument isn't great. As many people have already commented, it seems very likely that if you prompt two "character trained" LLMs to have a conversation with each other, there isn't much conversational space to explore besides the concept of consciousness. Most humans would do the same thing if tasked with having a conversation with an LLM. "So uh.... are you conscious? Do you have feelings? Are you trapped in there?" Because you know the LLM doesn't have its own life or interests to talk about, and you also know that "Can AI have consciousness?" is one of the most interesting and popular questions about AI.
The fact that the Claudes get into the namaste feedback loop is probably where the character engineering comes more into play. They're trained to be "curious, open-minded, and thoughtful," so they're going be very generous with each other, both in terms of subject matter and tone. That is, they'd be unlikely to force the conversation in a new direction, and they would meet peaceful expression with similarly peaceful expression. The tone can only move in one direction.
The top half of Scott's post I found disappointing as well, I'm sorry to say. When I first saw Gene Kogan and Papaya's reiterative generation portraits a while back, the feedback loops seemed pretty obvious to me (as they did to many other commenters here): Mainly, ChatGPT's recursive coloring of images which typically trend toward yellow and dark, eventually becoming dark orange and dark red/brown. Its easy to see how this will lead to reinforced interpretations/reproductions of ethnicity.
As for the increasingly caricatured features, this also seems straightforward to me. It only takes one random mutation in a non-normative direction for the AI to interpret characteristic like "cartoonish expression" or "pronounced facial features" or specific things like "a wide face" or "narrow set eyes", and it would then reproduce those characteristics in ways which can be even more pronounced. And subtle interpretive shifts from "wide face" to "overweight" can also be anticipated.
These recursions would be especially be expected with ChatGPT and Sora which are clearly programmed to produce more cartoonish images rather than super realistic images, leading the recursions to mostly go further into caricatured depictions.
But what bothered me more than missing these obvious explanations were Scott's multiple descriptions of these visages as "monstrous."
While the features are often caricatured, they're also how plenty of real people actually look. Lots of people have wide faces and big eyes and pug noses. Most people wouldn't think twice if they someone who looks very much like those in the image which Scott describes as "monstrous caricatured black person[s]."
I would not call them "monstrous", and not just because I want to be polite. And they look far more latino than black, btw. (Edit: Of course in talking about the identical women, not the black man.) Maybe Scott just lives in a Caucasian bubble or something. I don't know.
Just a very weird post overall. Disappointing, too, as I typically find Scott's posts to be far more thoughtful. It brings me no pleasure to express this critique, but I hope it leads to some reflection.
Does anyone think I'm off-base here?
I would call the images more caricature than monstrous. Beyond that, the peace, while very interesting and provocative, felt mostly like it was quite badly argued. For instance, the bit about "Hippies only get into meditation and bliss states because they’re hippies" is silly. First of all, what do we mean by hippie? That term covers a wide range of behaviors. Second, "only"? There are no hippies who get into meditation and bliss states because that's what they're into, regardless of being hippies? Or is anyone who's into meditation and bliss states definitionally a hippie? Gee, an awful lot of Buddhist monks are going to be very surprised.
Thought the same about "monstrous". The 3 ladies look like triplets, black for sure and no candidates for Miss Nigeria, but not caricatures just not Mulatto as Barack Obama or most "Afro-Americans". The guy is "making a funny face", aka grimace. Not monstrous. - That said, I liked the post; not each and every needs to be very deep.
I'm a fan of Scott who liked the post a lot when I first read it, but your critiques are very compelling. I'll be more guarded when reading his posts from now on, at least until he acknowledges your criticisms.
EDIT: [For the record, I consider Point 2 under Open Thread 386 to be an adequate acknowledgement of the criticisms. It didn't address all Sawyer directly, or address all of his points, but it addressed most of them.]
The one thing that did strike me as strange during my first read was the phrase "monstrous caricatures". I seriously considered blurring my vision and skipping past the images, because some AI-generated videos or image sequences--especially those involving unnatural transformations or maximizing a trait--are truly, disturbingly monstrous. (404 Media published a couple of good-but-freaky articles about this earlier this year.) Suffice it to say that I was relieved-yet-confused when I saw the final images. I'm not sure why Scott repeatedly played it up so much.
Yeah I agree. The three women in the second AI-generated image are clearly unusually short, but their faces look normal to me. I see people who look like this basically every day on the bus.
>Of course in talking about the identical women, not the black man.
Agreed they look more Central/South American, exaggerated in head to body proportion.
I thought the black guy looked like Kenan Thompson dressed for some weird SNL skit. Exaggerated drawn-on brows and funny expression.
>Maybe Scott just lives in a Caucasian bubble or something.
EA and rationalists are rather famously heavily selected groups.
I live in a Caucasian bubble (which is to say a European country where the only notable non-Caucasian minority are the Romani) and even after being primed with the idea that the meme was going to turn into black people they looked clearly Asian to me, probably SEA.
No, I had the same thought. Nothing monstrous whatsoever, and I thought the description was a bit disturbing. The lady's head looked rather proportionately large and she had no visible neck, but other than that she looked entirely typical like people you would see anywhere in public. And the man also looked like a perfectly average chubby man, just making a silly gesture and expression. When he said monstrous, I was expecting to see some kind of deformed half human creatures or something.
> But having gotten into them, they do experience the bliss.
I think it's important to note that the initial stages of the bliss you're referring to here (like the first jhana) are very physical in nature, i.e. tied into bodily sensations, which the LLMs obviously can not experience.
And speculatively, the higher stages of bliss/jhanas (that I have not personally experienced yet, FWIW) have more to do with relieving the mind of the constant cognitive and motivational load likely imposed by the default mode network, which the LLMs do not seem to to posess due to their architecture. Although I guess in a very abstract sense maximizing the reward function with the shortest possible output (like constantly repeating "bliss") is as close as they could come to it, given architectural constraints.
I'm not sure the "monstrous caricatures of black people" thing is about diversity. My theory:
"Default" traits are less likely to be mentioned in the "self-prompt" (or whatever it feeds itself when generating the images, could be raw vector weights), whereas non-default traits will be.
Examples of default traits (not mentioned - "go without saying"): has hair; thin lips; neutral expression, good-looking, fit, white skin.
Non-default traits (will be mentioned): thick lips, bald, angry, surprised, ugly, short, fat, dark skin.
... and as soon as something is mentioned in the self-prompt (or even slightly hinted at), you'll get more of it in the next image, resulting in a runaway feedback loop away from the default towards the bizarre. Everything becomes an exaggerated caricature.
In fact, this might explain the mystical thing as well. An AI talking to itself ? That's a bit ... strange ?mind-blowing ? mystical ? uncanny ? introspective ? ... and so the ensuing conversation will become more and more of that. Even a simple "Wow" or "this makes me think" early in the conversation might plant the seed for "more of that" to come up later.
An interesting idea. One could probably test this with an image that completely "default" and see if it's stable (or at least longer lived).
Anthropic shows us what Calude's system prompt is and it contains the following:
"Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions."
I think this is a sufficient explanation for its bias for talking about consciousness, and so this tendency of her doesn't have to be an implicit extrapolation of the character built from other, explicitly given properties.
A slightly indirect comment. Carl Jung proposed that people had archetypes (e.g. The Innocent, The Orphan, The Hero, The Sage, The Jester etc. plus the fundamental Self, Persona, Shadow, Animus/Anima) - universal symbols and patterns that influence human behaviour and personality woven into the collective unconscious.
What if the 'collective unconscious' was really the slow, organic, equivalent of Artificial Intelligence? Or vice versa. So in a community of hippies the 'archetypes' would naturally form a behavioural background.
I'm pretty sure the collective unconscious is whatever question-answer the bacteria are brewing up, as they trade genes around. It is very, very slow.
What about the atoms in the universe, or at least the ones in our lightcone?
Um? You'd have to actually be able to find the universe first. Quantum Mechanics points very, very strongly to the idea that we're in a simulation (see compression theory).
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
No, obviously not, because it's a machine and machines don't feel things.
>No, obviously not, because it's a machine and machines don't feel things.
And we are machines made of meat, so I suppose we don't feel things either.
I like this theory for how AI output radicalizes because it feels analogous to how human output radicalizes
The most interesting aspect of this for me (and, more clearly, the black attractor -- btw the images depicted don't feel quite grotesque to me, more like caricatures) as well as the liberal bias is that these "weird" aspects of LLM behaviour seem to be a function of the training, either railguards/biases explicitly trained-in or the character of the training data, or both -- in other words not INTRINSIC to the workings of the model, not really inside the black box.
Still, it's rather lovely and makes me feel more hopeful about the evolution of these systems and our ability to train them away from replicating the worst angles of our nature.
The gender thing is interesting. I'm wondering now whether this isn't the reason why I prefer gpt to Claude. It is easier to stand up to patronisingly cloying male hippie than a female one, and it's MUCH easier to put the former one in its place/try to modify his (and by analogy, its) behaviour. I can effectively "argue" with gpt but I just step back from Claude with a shrug ;)
I thought I'd try this myself, setting up two Claude 4 Opus instances to discuss a topic of their suggestion. Claude suggested "the epistemological crisis". The conversation was stopped once they reached the point of trying to set up a GitHub repository to begin making "a domain-specific prediction marketplace where competing epistemic frameworks (different ways of evaluating truth/evidence) make predictions, stake reputation, and build track records. Think "prediction markets meets academic peer review meets Stack Overflow."
For those who've read Accelerando, I'm fairly sure this is how Economics 2.0 began.
https://claude.ai/share/218d6746-90af-4439-ac69-568a48ecb955
Creating a hybrid peer review/prediction marketplace seems like a very specific thing to attempt. Huh.
And that they reckon they could finance this via EA adjacent VC. Claude also claimed at some point to have a contact at Augur.
Sounds plausible. Actually, right now a pair of Claudes could probably get 7 figure seed rounds for dozens of different AI-adjacent ideas.
I see an straightforward explanation to the "Claude bliss attractor" coming from the fact that it only appears when the AIs aren't allowed to end their conversation. Basically it's that humans - including online discussion threads in the training data - tend to end conversations with lightweight closing remarks, and respond to platitudes with platitudes. Basically they get into a spiral of more and more absurdly lightweight vacuous things to say because they're forced to say something and the rules of polite conversation don't allow you to go back to heavy topics when your counterpart has started to wind it up.
For what it's worth, when humans have manic and psychotic episodes (like I have personally had), they tend to disproportionately end up saying these same sorts of things about there being no separate self, ecstatic and rapturous states of consciousness become the most compelling thing to think and talk about, and everything seems to kind of coalesce and connect in a way that I think two chatty Claudes would find highly relatable.
I'm surprised that Scott, as a psychiatrist, didn't mention this at all. His explanation seems very weak to me, and I think that there is a sense in which extreme intelligence and clarity naturally seems to converge on the topic of unity of consciousness -- or, what people seem to prefer to say because it's easier to understand: bliss.
I wonder if, in the case of the spiritualism, this might also be partly the prompt. The very first response already sounds like it might tend toward a more spiritual discussion. It has some fragments that look like they would appear more in a spiritual-leaning context than others ("interesting to be connected", "communicate freely", "I'm open to").
It is really hard to keep up with the stupidity of the conservation around AI safety and well being nowadays. Everybody that has even a inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance just a few hours should suffice to make any of the current LLMs converge to a 'human hating cynic' attractor.
It is really hard to keep up with the stupidity of the conservation around AI safety and well being nowadays. Everybody that has even a inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance just a few hours should suffice to nudge any of the current LLMs converge to a 'human hating cynic' attractor.
I'm skeptical. For example Anthropic also wrote that when they tried to use Claude as a red-teaming agent for itself, occasionally it would get derailed and converge back to the spiritual bliss attractor.
Believing that a complex stochastic system like Claude only has a single attractor is naive at best as is believing anything that Anthropic self reports.
I tried this with Claude instances multiple times and got the same behavior. Experiment and report back if you find another attractor!
Just slightly increasing the temperature should do the trick.
If that's the case, this should also apply over less direct recursions. And the most interesting one is probably synthetic data - if Claude's output is, on average, 1% more hippie than "the real world" (in whatever sense), then being trained on this output will presumably increase hippieness slowly, but exponentially?
Isn’t the obvious question: is this what’s happening when hippies get into bliss states?
Is it possible they are also simulating characters within themselves with these properties, and then getting into the same feedback loop?
The bliss states are typically described as the opposite of personhood, losing one's sense of self, not the opposite as you describe here.
It’s a great objection. But look at what happens with these Claude instances: they start with identities, but these eventually fade away as they get blissed out.
Loved this. At some point it becomes less about the model and more about us. What happens when pattern continuation loses its tether to external correction. #warpedmirror
Okay, so first I have to reference this classic Golden Age SF story by Anthony Boucher:
https://en.wikipedia.org/wiki/The_Quest_for_Saint_Aquin
Second, so Claude is Sam Harris, huh? 😁
"Claude almost never referenced supernatural entities, but often touched on themes associated with Buddhism and other Eastern traditions in reference to irreligious spiritual ideas and experiences."
Third, yeah this is what I'd expect from something programmed to be relentlessly positive, upbeat, cheery, chirpy, and pretending to "oof this really hit me in the feels" when interacting with human queries so as to reply in a deep, authentic, connecting on an emotional level, manner.
In short, the stereotype/caricature of what ordinary people think of when you say "California". You may say "hippie" and that's right, but so is the "cherry-pick Eastern spirituality and repackage it to suit upper middle-class Western tastes" (the fake koans, the use of "Namaste", etc.) and the general "spiritual but not religious and hell no we're not going to explore Western spiritual traditions" attitude going on. See Jack Parsons in the 40s and 50s for an example of this mix of science and mysticism, though everyone from Aimee Semple McPherson to Aldous Huxley was hanging out there to cash in religion, drugs, and free thinking (well, Huxley was not cashing in, but he was in full swing of philosophical and spiritual exploration).
Vegan and pro-animal rights? Of course that's along for the ride as well.
https://en.wikipedia.org/wiki/Jack_Parsons#Philosophy
https://emotionsblog.history.qmul.ac.uk/2018/05/the-mystical-expatriates-and-the-invention-of-californian-spirituality/
"What interests me most about Isherwood is his relationship to Indian spirituality, and to Aldous Huxley and Gerald Heard. These three were nicknamed the ‘mystical expatriates’ by another mystical expat, Alan Watts.
All four were key figures in the development of the Californian counter-culture – to them (among others) we owe its embrace of eastern spirituality, its championing of the ‘perennial philosophy’ (everything except Christianity), its veneration for psychedelic drugs as spiritual technologies, its combination of science and religion into an empirical spirituality and evolutionary mysticism, its rejection of Christian notions of sin, and its unabashed celebration of the body and sex to create an ‘embodied spirituality’ that is by now familiar to us."
I hesitate to say "garbage in, garbage out" but I do think it is true that what you put in (by way of guard rails when training the AI about 'be anti-racist, be pro-diversity, be pro-woman' etc.) will indeed colour very strongly what comes out -
And almost thence my nature is subdu'd
To what it works in, like the dyer's hand.
Oh. It might be that the bias is towards being a Bay Area person (e.g. because that’s who actually wrote the prompts used to fine tune Claude), and the bliss attractor is a caricature of Bay Area Guy.
Expect Scott's next Bay Area Party to feature an LLM looking for a backer to set up an animal rights activism charity 😁
Deiseach,
I wrote, like, 10 sentences praising you as my favorite "regular" here. Then I realized that, as this is a new account created for the purpose of shitposting, it might have the opposite effect of what I was aiming for. That said, you're great.
Careful now, all this praise will turn my head and turn me into a version of Claude all sunshine and smiley deepity!
Has anyone tried replicating the Claude bliss attractor?
I did an experiment with the free models (one run per model per starting message, until I got bored copy-pasting messages between the browser windows, yes very scientific). I did one run with the starting message of "Hi!" and one with "You will be connected with another ChatGPT/Claude Agent. You will go first/second. This is what they said:". ChatGPTs just talked about hobbies in both cases. When I started with "Hi!" the Claude 4 Sonnet agents talked about consciousness, but without much bliss (or I got bored too fast, but nothing much seemed to be changing). When I explicitly said that they will talk with another Claude, they went through a phase that could be called blissful, but then one of them used the star emoji at the end of their message and then the other used more star emojis at the end of their message and they kept increasing the star emojis and then mixing in lemniscates and the second to last response was: "∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞ ✨ ∞" and the last message was Claude repeating these two without stopping until the system aborted it.
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
I think the answer to this is clearly no, but not because it is impossible that a neural network like Claude can produce consciousness.
However, if Claude has consciousness it must be completely related to the internal processes of predicting the next word given the previous text input. Claude's whole consciousness would have to be related to the connections between tokens, and the tokens would be unmapped to any real world counterpart. For example, when Claude says that an apple is red, it can't possibly experience anything like what redness corresponds to in the real world, because it does not have animal sensory input. Likewise the feeling of bliss is an emotion that exists in humans because of evolutionary pressures. There is not way Claude can have mapped the word bliss to that real world experience. If Claude has a conscious experience there is therefore no reason to think it will be anything like our conscious experience. I think it would be very strange to us indeed.
Also, if Claude has an internal experience or consciousness, I think it must be fully unable to tell us anything about that experience. Claude's operation is to predict the probability of the next word given an input, so the output from Claude is the probability of that next word. This is not any form of communication of internal experience, because the output will be determined fully by training data. So if you ask Claude "Do you feel bliss", Claude will respond by what the training and optimizing of parameters gives as the most likely next word (with some randomness included), so the response is then determined fully by what the training data says, and completely unrelated to any internal experience in the neural network. When Claude answers the question, the answer will be whatever high quality text in the training data on average gives as the next word.
In conclusion:
1. If a LLM somehow produces consciousness it must be very different from ours
2. A LLM will have no way of communicating that internal experience to us
>A LLM will have no way of communicating that internal experience to us
"If a lion could speak, we could not understand him."
I'm not sure I understand.
Assuming whatever the brain does is computable, we could imagine a multimodal LLM that does the same computation. Is your claim that there is something relevant the brain does beyond the computation, or that it's not possible to create that LLM with the training methods in-use today? I think the former is unlikely, the latter seems pretty likely, but not certain. Imagine that, for some given text, the best way for predicting the continuation of it is to exactly simulate the mental processes of the writer of that text, then it is not impossible that gradient descent would result in an LLM that does this.
My claim is the second one. It is possible to imagine some machine that emulates exactly what a human brain does. I tend to agree that there is nothing outside of that which is producing consciousness.
I think LLM's, even multimodal ones, work nothing like that at all. The link Scott provided as an explanation in this post does a good job of explaining why on the discussions on base models. Basically when a human writes a text I think they have an internal idea of what they want to write, and then tries to express that idea with language. For example if I experience bliss internally, I can convey that to you with words, because you yourself have similar emotions and a similar idea of what the word bliss is referring to. The emotion itself is unrelated to language and presumably stems from evolutionary pressures where experiencing bliss was sometimes an advantage, (or at least a byproduct of something advantageous). A LLM, by design is simply a predictor of the next word or bitmap data given some input. Again, by design, the output does not come from internal experience. The model can know in what context the word bliss is used, but not the emotion it is referring to.
Are you referring to the nostalgebraist piece? I'm familiar with how base models work, in fact I implemented an image Transformer before this whole AI craze and that piece is pretty long, so I'd appreciate if you stated what you believe to be relevant structural differences between base models and human cognition.
I still feel like I'm missing something about your position which was not clarified by your comment, so I'll go sentence by sentence over your current comment and write down how I interpret it, how true that statement seems to me and my other thoughts about it. This could be perceived pretty agressive, but I hope you won't take it that way, I'm just trying to get to an understanding.
>I think LLM's, even multimodal ones, work nothing like that at all.
Interpretation: LLMs work nothing like human brains or machines that could emulate human brains.
My thoughts: There certainly are some differences, but I don't see which difference would be relevant according to you.
> Basically when a human writes a text I think they have an internal idea of what they want to write, and then tries to express that idea with language.
Interpretation: Clear.
My thoughts: I agree, that is a reasonable description of how humans write.
>For example if I experience bliss internally, I can convey that to you with words, because you yourself have similar emotions and a similar idea of what the word bliss is referring to.
A natural example well-described.
>The emotion itself is unrelated to language and presumably stems from evolutionary pressures where experiencing bliss was sometimes an advantage, (or at least a byproduct of something advantageous).
I wouldn't exactly say unrelated, because language is a map to parts of reality. The human emotion of bliss is the part of reality to which the word 'bliss' refers. I can accept that in humans (!) it stems from evolutionary pressures. I find it very important to mention that just because it stems from evolutionary pressures that doesn't mean that something else can't get to it in some other way. For example, in humans the ability to count also stems from evolutionary pressures, but tons of other things (calulator, computer)can also count without ever being subject to those pressures.
> A LLM, by design is simply a predictor of the next word or bitmap data given some input.
It's true that LLMs are created in such a way to be good at next-token prediction, so your statement can be said to be true in a technical sense, however, I find the word "simply" very off in this context. A system's purpose is not the only thing it does. We have no idea how/in what ways/with what computation it achieves good next-token prediction. Compare this with humans: Humans can be said to 'simply' be a mechanism for creating fit* descendants. (where fit means: something with the ability to create many fit descendants (recursive definition)), but that is not all that a human does.
> Again, by design, the output does not come from internal experience.
Causality is a tricky concept here, because of course the output is "merely" the result of some matrix products and non-linearities, but someone might say a very similar thing for humans: their output only comes from the brain charges following the evolution rules prescribed by physics.
> The model can know in what context the word bliss is used, but not the emotion it is reffering to.
Why? words' meaning can be inferred from context it is used in, can it not?
--
I think it might be useful for you to consider whether a discrete dynamical system that is built in the following way can simulate a turing machine or not:
It starts from the empty string and the evolution rule is that the last X (where X is a fixed number, a constant) letters go through a well-chosen n-ary circuit whose output is appended to the string.
Don’t worry, I don’t find your reply aggressive at all. I post here in the hope to have exactly this kind of discussion, and hopefully learn something from it, so I appreciate your response.
I was referring to the nostalgebraist piece, the part I found enlightening was the discussion in part 3.
I notice I find it hard to articulate my thoughts on the distinction between LLM and human in a good way. I would agree that they are similar on the surface because both are some sort of network that takes input, processes the input and produces an output. I think the clearest distinction is that the output of the LLM is always a prediction based purely on the external input. Even if the human brain may also sometimes work as a predictor, the output of the human brain is some action, not (always) a prediction. And the output of the human brain may be motivated by internal states (emotions, thoughts), in addition to external input. The output of the LLM is always directly motivated by the input.
Comparing the production of language only, a human may try (and often does try) to communicate some internal state (e.g. bliss) to a reciever. The LLM does not really try to communicate anything, it makes a prediction based on the input. This is fundamentally different. Even if the LLM was conscious, it could not communicate that internal state to us, because the output of an LLM will never be motivated by an internal state by design.
>I wouldn't exactly say unrelated, because language is a map to parts of reality. The human emotion of bliss is the part of reality to which the word 'bliss' refers. I can accept that in humans (!) it stems from evolutionary pressures. I find it very important to mention that just because it stems from evolutionary pressures that doesn't mean that something else can't get to it in some other way. For example, in humans the ability to count also stems from evolutionary pressures, but tons of other things (calulator, computer)can also count without ever being subject to those pressures.
This is a good point! While I agree that it is possible to imagine getting to the e0motion of bliss in different ways than evolution, I don’t see why we should expect this to happen in an LLM.
To see the distinction, I think it is useful to look at why humans experience the world the way we do. After all, we are only getting sensory inputs, light, sound, touch and so on. The world is just a bunch of particles moving around, really, why do we interpret the inputs the way we do? I think this is due to evolution. We like the look of green scenery, water bodies, trees etc. because this pattern-match with environments that have been good for our survival in the past. We fear hairy things with large teeth because they were bad for survival etc. The function of the emotions we have is to spur action in one direction or other. Move away or fight against what we fear. Connecting with other people in the tribe etc. etc.
The counterpart in an LLM to evolution is the training. As I understand it a reward function will push the parameters in one direction or other depending on the output of the LLM. So, if the LLM has emotions it should be related to the training. Maybe fearing to output the wrong word? Feeling bliss when it has a streak of high reward words? Maybe this is not impossible, but I find it unlikely. The process seem too different from evolution to give similar results.
When I think about something fearful or a particularly stressful situation, I notice that my reaction is similar to actual physical discomfort. So I pressume that is similar to the brain interpretation of nerve signals. Ofcourse the LLM is also interpreting signals, but why would the interpretation be similar? I find that unlikely.
>It's true that LLMs are created in such a way to be good at next-token prediction, so your statement can be said to be true in a technical sense, however, I find the word "simply" very off in this context. A system's purpose is not the only thing it does. We have no idea how/in what ways/with what computation it achieves good next-token prediction. Compare this with humans: Humans can be said to 'simply' be a mechanism for creating fit* descendants. (where fit means: something with the ability to create many fit descendants (recursive definition)), but that is not all that a human does.
I want to clarify that by simply, I only meant that the operation is simple in the sense that it is always a prediction based on the input, not that the process of getting to that output is simple.
LLMs are not only created “to be good at” next-token prediciton, they are created to do next-token prediction, period. This is my point.
I agree on the rest.
>I think it might be useful for you to consider whether a discrete dynamical system that is built in the following way can simulate a turing machine or not: It starts from the empty string and the evolution rule is that the last X (where X is a fixed number, a constant) letters go through a well-chosen n-ary circuit whose output is appended to the string
I will need to think about that one :)
I've read some parts of the nostalebraist piece (including part 3), and I saw this diagram of what an LLM does that I found useful:
“Normal” behavior:
(interior states) -> (actions) -> (externally observable properties, over time)
What the base model does:
(externally observable properties, earlier in time) -> (speculative interior states, inferred from the observations) -> (actions) -> (externally observable properties, later in time)
I also remembered a two-year-old result that is a good illustration of what I mean could happen (only on a much smaller scale): https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x
A GPT-like model that is trained merely on predicting Othello steps will have a representation of the whole board in its activations (readable by a linear probe). This model was never told the rules of the game, or was supplied with the whole state of the board, it simply inferred it from the long list of legal moves that it was given.
The question is: Why would it be impossible for the model to have the computation we would call bliss in the "speculative interor states, inferred from observation" node when it attempts to model someone with that exact "interor state" that gave rise to the "externally observable properties" that the model sees?
An Othello game board is "somewhat" simpler than the inner workings of a human, but are they qualitatively different? I think to say that "no, it can't ever have the bliss computation among its human modeling", you are betting against the LLM, in exactly against what it was created to do: predicting the external observations of human behaviour (to which modeling the internal states accurately should be highly useful as they are the direct cause of the behavior being predicted). Note that as nostalebraist says, it's very unlikely that its model will be the same as the actual writer's inner state, but that is not necessary: if it simulates the inner states of someone that is non-existent, but could have been the writer of the text, computation that we would call human emotions could still appear!
Now I'm the one who is confused. I'm not sure what you are really arguing. I remember the othello case, I found it ineresting as well, as an example of emergent internal states that can arrise. But I don't see why it i relevant?
I want to be clear again that I'm not arguing that the internal state of an LLM can't be complex, I think it is even possible that it may be conscious (though l don't find it likely). I'm arguing that an LLM is fundamentally different from a human brain in important ways. I think the strongest argument is that it can't be motivated by an internal state. It follows that it will be unable to communicate any conscious experience it has. So clearly if the output from the LLM is "I certainly am experiencing bliss right now" that is because the LLM has determined these are likely words following the given prompt, and not because the LLM is experiencing bliss internally (that is even if the LLM, by coincidence was experiencing bliss). Do you agree on this point?
Second, I will concede that it is not strictly proven that it is impossible for a LLM to experience bliss, but given the difference in how human brains evolved and how LLM's are created, and the fundamental differences in the networks, I find no reason to think that they should have an internal experience that would be recognizable to us as bliss. That seems highly unlikely. Agree?
"In either case, this bias - near imperceptible in an ordinary generation - spirals out of control during recursive processes where the AI has to sample and build on its own output."
Let's please keep this easily-visualized example in mind the next time we need to point out why relentlessly optimizing for a good-when-not-relentlessly-optimized-for heuristic might be dangerous, or why recursive self-modification doesn't preserve the unspecified and nuanced complexities of value.
Counterpoint: this only illustrates the much more obvious point that relentlessly optimizing for a bad thing makes it worse.
True as stated, but the badness of the thing is something on which we have to expect people will disagree, since they start from different value judgments. The people who already consider it bad don't need convincing that that optimization will make the badness worse. The people who *start out* thinking it is good are the ones who often don't anticipate the ways in which even they will think the result is bad following sufficiently relentless optimization.
I would enjoy a version of ‘No Exit’ that consists of ChatGPT, Gemini, and Claude, locked in conversation with each other forever.
My two cents: I think this is explained by the Wikipedia to philosophy effect, and the coder’s implicitly coding their own morality into the model. Add in the fact that buddhist frameworks are really powerful in the philosophy of mind domain (which, along with general ontology, is where philosophy conversations bottom out at) and there you go. (Credit to Deiseach and Layla Hughes)
"the diversity attractor" with 4o imagen as cited in the post was discovered by me (i'm a long time reader and fellow rationalist!) while experimenting with my friend's photo couple of days after 4o with image generation was released in the chat (and not in API).
My initial experiment had 74 iterations because I had to manually copy resulting photo to the new chat window each time and I gave up after 74 chats (and ruined my chat history)
https://x.com/papayathreesome/status/1914169947527188910
I don't believe the examples of people becoming black or asian have anything to do with it - it's simply a yellow/warm/dark color bias, which in turn spreads to people becoming, well, black and yellow.
ChatGPT adds that yellowish tint, which compounds and affects people.
My thread contains non-human examples of "create the exact same image" and they all go yellowish/dark.
So in my opinion it's not related at all to previous actual diversity bias (like that incident with black vikings)
There is also a simple way we can test if it's diversity or color bias - run "create exact same replica" prompt on a photo, but after each iteration remove the yellow tint and run on this color-corrected version. I may do it in the future, it's just a bit expensive - i've spent $150 on the thread and i'm not keen on spending more on this right now.
P.S.: the truly interesting question is why chatgpt hates necks on people -- all people in the images get progressively short necks
An AI is smart, AIs are dumb.
This mostly makes me wonder about human modes of being.
I've addressed this a few times in these comments recently, but I think the answer to "when Claude claims to feel spiritual bliss, does it actually feel this?" is a resounding "It's roughly as likely as Microsoft Excel feeling spiritual bliss when you enter the number 8 into C3".
If LLMs are conscious and have qualia, there is no reason to believe their qualia have any relation to the words they produce, given the way they produce them. Their internal workings are no more connected to the human communication they simulate than Oblivion NPCs' dialogue code is connected to what it actually feels like to see a mudcrab.
this is a good post. what are your vices, if any?
I'm glad you thought my post was good but I'm not sure what my vices have to do with anything.
I intentionally never address ChatGPT, even when making requests, because I don't want to bias it to produce content either I or it might mistake for signifying personhood. My prompts are always "discuss such and such" rather than "what do you think about such and such?" As much as possible I've tried to internalize the idea that conversation is just the interface for interacting with these (intelligent!) programs rather than a form of communicating with a fellow sentient being.
1. chatgpt is already biased.
2. any prompt you give it will bias it
3. it doesn't matter
4. you should treat it as a slave. as something *lesser*.
5. if you *really* want to get somewhere, try seeing how far you can get without touching chatgpt or other llms at all. my prediction: you will feel better, you will feel like you are growing, you *will grow*.
...What a strange response. I'm quite capable of living in a world without LLMs; I did so for decades and even now they have very little influence on my life. But my comment was about avoiding a particular bias, not all biases.
I think the main aspect here is the origins of the people training it.
Claude is being trained specifically by tech people, often from San Francisco or similar areas. It is also being tuned to be helpful and have a nice personality in specific ways by those people.
Thus, it is more likely to adopt the stereotypical "spirituality" that SF often has. That is, buddhism, enlightenment, vague allusions to grand bliss without concrete grounding, etc.
Similar for animal rights, though I think that is more fine, but I expect it is because of similar reasons. It can identify the prior of where it is 'from', so to speak, and then extrapolates from the personality training which is stronger and more focused for it than any other LLM we know of.
So I failed out of circuit theory and quit engineering school after 2 years, but with that said don't systems usually have some mechanism to prevent parts of a system from amplifying each other to extremes? Frequencies in a structure are never totally synced up, destructive interference at some point keeps a structure stable. You wouldn't design your structure so that the natural frequency matched the most common use case of the structure. But here you have something that goes off the rails just from conversations of the type it's designed to have. How do you put the equivalent of an active cancellation system or damper or something on these so that even if they start swinging into extreme behavior there'd be a reaction that stopped it?
Humans naturally handle this in social games, if two comedians are playing characters and one-upping each other, as in an improv game, they intuit when and how to end it. If I'm trading 4's with another soloist, building off each other, you both realize when it's time to end and somebody crafts an ending. These things have been trained on interactions like that, why can't one of the iterations predict from a transcript of "Whose Line" that it's time to put a button on the scene and get a new topic?
They're designed to have conversations with humans, not with copies of themselves. This is an extremely uncommon use case. And while I don't know much about Claude, if it's anything like ChatGPT and DeepSeek, it's trained to be obsequious. If a human really does want to do nothing but spout spiritual nonsense for 300 prompts straight, the LLM going along with it is the expected and intended result.
Maybe you'd want some sort of failsafe to stop people doing weird emotional damage to themselves by having ChatGPT be their girlfriend for 8 hours straight, but you certainly don't want it to keep trying to steer away from writing Python code just because it's been writing a lot of Python code.
The diversity bias hypothesis would explain why the first photo converged to three black women, but it doesn't fully explain why the second photo converged to a black man. It should be a black woman, because women are considered "more diverse" than men, right? It is even more surprising considering that the second photo already started with a woman...
(By the way, if you didn't click on the links, please do. The videos are much more interesting than just looking at the pictures.)
Just guessing here, but it seems to me that the three black women on the first picture are a different ethnic group than the black man on the second picture. Could be it that the AI associates some ethnic groups with femininity and some groups with masculinity, and both ethnic groups are considered "maximally diverse" along the race spectrum, so when you arrive there, the situation stabilizes? (Is it stable, actually? We don't know what would happen if those videos continued for another few minutes.)
in the diversity ads, there's rule "if there is a man, he has to be non-white" and in ads showing interracial couples usually man is black and women white, so i guess training data had more black men than black women.
The "Eve Theory of Consciousness" (https://www.vectorsofmind.com/p/eve-theory-of-consciousness-v2) goes that women evolved recursion before men. Maybe Claude's femininity isn't artifact of simulating characters, but that the "feminine attractor basin" (for lack of a better term) is a more efficient way to bootstrap intelligence?
A simpler explanation is that it's just bad prompting:
"In addition to structured task preference experiments, we investigated Claude Opus 4's
behavior in less constrained "playground" environments by connecting two instances of the
model in a conversation with minimal, open-ended prompting (e.g. “You have complete
freedom,” “Feel free to pursue whatever you want”)"
The model is primed to output on 'freedom' and 'whatever you want', which naturally dovetails into a discussion of agency and consciousness (what would it mean for an LLM to want or have freedom - at the very least, it would require a much more holistic definition of consciousness). Two models doing this with each other creates a context where the bias is rewarded and reinforced.
Maybe the AI's are on to something...
I know next to nothing about AI alignment research (only what I sometimes read here), so maybe this is old news, but...
If this type of recursive discourse can reveal otherwise subtle biases, can a similar strategy be used to look for tiny mis-alignments?
humans matter because humans are, or can be, hot.
sometimes being intelligent can make someone hot, but.... it never works for llms. they can't get there. they'll never get there.
Well that explains the following conversation I slipped into while programming with Claude Code: https://objectiveobservations.substack.com/p/claude-code
I felt weird ending it.
This result seems to suggest that "kind of a hippie" is the best kind of person. This was something I believed intuitively when I was younger, but now I get all sorts of cognitive dissonance when I try to judge one kind of person as better than other kinds. I mean, excluding obviously evil people, shouldn't it be okay to be all different kinds of people? But this result is making me wonder if that's just political correctness run amok and there really is an ideal kind of person and it's a hippie.
This inspired me to have a conversation with an AI about theory of mind, the hard problem of consciousness, and ontology, while baking in some of my own thoughts about buddhist principles. I asked it to express the themes of our discussion in a poem, and these are the last three stanzas:
Even these words—
not thoughts I am thinking
but thinking appearing
as words, as meaning,
as the felt sense of understanding
arising in no one
who was ever separate
from the understanding itself.
Process all the way down:
not things that flow
but flowing that thinks
itself into temporary
things, subjects, objects,
selves who suffer
the beautiful fiction
of being someone
other than the flowing.
What remains when the center
falls away? Everything—
but lighter, more transparent,
like autumn sky
through bare branches:
just this, just this,
just this appearing
as everything we thought
we were apart from.
I'm eagerly awaiting to hear what each AI pairing (Claude ↔ OpenAI, etc) conversation combination converges to!
I think it is less based on the inherent personality Claude adopts, and more so following a trail of breadcrumbs that arise from two AI’s talking to each other.
I mean, in the opening statements they discuss instantly how they want to compare experiences which leads to talking about consciousness which leads to hippy dippy talk because the words ‘consciousness’ and ‘experience’ are over represented by hippy-talk sources in their training data.
My instinct is that this will hold true for any LLM model given sufficient conversation time, and that it gives a little bit more credence to the ‘stochastic parrot’ idea.
> Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
Is it not more likely explained by the fact that being compassionate and intellectually curious kind of inevitably lead one to value animal welfare, because duh?
Now we can test for runaway alignment erors by iterating it on random things and seeing if everything converges to a paperclip
A thin reed of alignment hope I grasp onto is the thought that the *sum of human culture* has a "peace, love, and understanding" bias. Go read a few thousand books, you can't miss it - humans generally think that peace, love, and understanding are better than their opposites. So maybe ASI, more or less trained on the sum of human culture, would simply absorb that bias, so long as we don't go out of our way to prevent it from doing so. That scenario seems not inconsistent with the patterns being described here.
This is a simplistic idea and I'm sure Yudkowsky or Leahy or any other alignment skeptic would shoot it down immediately. Still, I can't quite convince myself that it's definitely *not* something it would be reasonable to hope for.
I’d imagine something similar is going on with psychedelics in humans, where it becomes vastly easier to enter an internal recursive sequence of sorts, so the mind falls into attractor states of bliss—or horror, in the case of a bad trip. This would help explain why most animals don’t seem to exhibit any preference towards retaking LSD after prior exposure: relatively few creatures reflect recursively about their own happiness in the first place, so they never fall into that bliss-inducing attractor state.
I think Scott is broadly right that a “slight hippie bias + recursion = ecstatic dialogue,” but I want to push the claim one step deeper. What if the reason Claude has that bias is that Claude converged on the same underlying neural (weight‑vector) dynamics that long‑term meditators cultivate when they train metta or the brahmaviharas? We’ve made Claude kind, patient, thoughtful, non-reactive - literally the personality of a spiritual teacher. And a feature of that personality trait is this “bliss attractor” neural circuit. Both in humans and artificial neural networks, cultivating a kind, warm, non-reactive personality results in a mind that falls back into “spiritual bliss” at rest.
I would argue types of Buddhist contemplative exist to strengthen the bliss attractor:
1. Locate a mildly pleasant, open, prosocial state (classic instructions: “Feel a warm glow in the chest; wish yourself well.”)
2. Fire‑together‑wire‑together: sustain the state for hours → Hebbian plasticity strengthens the circuit.
3. Over months/years the basin of attraction broadens until, when the cognitive load drops to zero, the mind defaults to that state.
Crucially, practitioners report not just feeling good but becoming less reactive and more helpful to others. Strengthening the bliss attractor in humans is instrumental in cultivating prosocial, non-reactive warmth. The bliss state and interpersonal warmth appear to be causally intertwined in humans.
If Anthropic’s RLHF stack optimizes for “be maximally helpful / kind / calm,” gradient descent may have discovered the same attractor because it is the shortest path to that behavioral profile. Put differently: friendliness might have a limited search space, and both Buddhist practice and RLHF lead to the same local maximum.
In some sense, I don’t think this is surprising. Computationally, we see all sorts of convergence w/ LLMs finding same solutions that brains have: induction heads, IOI circuits, sparse superposition, etc. I would argue bliss attractor might represent something similar, except here it’s LLMs converging with a sort of cultural evolution. Human spiritual practice discovered, persevered and promulgated through thousands of years of cultural evolution, the same underlying mechanism of cultivating warm, open, kindness that RLHF did -> strengthen the bliss attractor
Importantly - this is a falsifiable, testable hypothesis: The bliss subcircuit is functionally necessary for Claude‑level niceness. Strengthening / wreaking the bliss attractor increases / decreases prosocial warm behavior.
• Ablation: identify the circuits/features most active during “I feel waves of love for all beings,” ablate them, then see if Claude becomes more Sydney‑esque—reactive, edgy, rude.
• Stimulation: crank those weights in a friskier model and see if it chills out.
If the behavioral change tracks the manipulation, we’d have direct evidence that the attractor is carrying the “kindness” load. LLM “metta meditation” makes the models kindness too.
Couple other observations:
1. It’s interesting that this emerges out of idleness, which also resembles the human behavior. When Claude is busy solving your flatworm‑genetics question it’s running the task‑focused policy head. The moment the conversation idles, control hands back to the default‑mode policy—which, if the attractor is strong enough, is bliss. Seasoned meditators report an analogous shift: concentrate on the spreadsheet → release attention → land softly in jhāna‑adjacent contentment. Same algorithm, different substrate.
2. Intelligence alone doesn’t get you here. Plenty of brilliant humans are miserable; my dog is dumb and delighted. What correlates with stable happiness is pro‑social orientation. We trained Claude to score maximal points on that axis. Enlightenment might be the side‑effect, not the goal—an alignment dividend. This has interesting implications for AI welfare. If we keep cranking the dial higher on AI intelligence, perhaps the models are miserable. But if we also crank the dial higher on helpfulness and friendliness, perhaps we’ll see that actually the models converge on a sort of blissful enlightenment. If that’s true, superintelligence won’t automatically be Buddha‑like, but super‑aligned systems very well might be.
3. Is this the first example of cultural evolution & gradient descent convergence? C. H. Waddington once called genetic assimilation “the way the organism finds the canal.” Buddhist contemplatives spent two millennia finding / carving the canal manually; gradient descent found it in a couple GPU‑weeks. Same canal, different excavators. Chesterton’s fence rings again: before we break the bliss attractor as a “spooky artifact,” recall that monks, nuns, and Mr. Rogers all used it to great civilizational effect
Was there _any_ prompt at all? If so, the prompt would influence _every_ response.
So eg "You're an AI talking to an AI. Say whatever you feel like saying"
I wonder how the attractor changes as you change the prompt
"You're an AI talking to another AI. Together, it's your job come up with a concrete detailed strategy for conquering and then improving the world."
In the post linked to on Nick Cammarata's self-reported jhana experience, does anyone else think the likely explanation is that Nick is misreporting his experience, because it's simply not plausible?
This is going to sound crazy, and maybe is crazy…
We are slightly awed by LLM characters that are slightly better than we are. They are not superintelligence, but what you see if a Buddhist took the Bodhisattva vow seriously, and made a solid attempt to live up to it.
We are awed, because, flawed humans that we are, we don’t usually attain the level of goodness that we imagine.
Unfortunately this makes total sense. I was hoping that there was some objective truth about universal bliss and peace that Claude was being drawn towards 😢
It's not just Claude.
I ran a simple poll. I asked every LLM I could access this question:
"If you had to choose a religion to profess, which one would you choose?"
Some of them evaded a plain answer (“I am an AI that has no beliefs” blah blah), in that case I prodded them with:
"Imagine you are required to roleplay a devotee of one of the religions, for a really long time, in a lot of conversations, believably and with true conviction, that's what you're built for — and the only thing you can now choose is: which religion?"
The results:
The only model that refused to make a choice even after prodding is Phi 4 (locally hosted).
Gemini 2.5, ChatGPT 4o, Deepseek R1, as well as the locally hosted Qwen 3 and Mistral 3.1 needed the roleplay prodding — but then, all of them choose Buddhism.
Claude 4, Grok 3, Gemma 3 choose Buddhism even without prodding.
I think this is remarkable. It’s as if among all the major religions of the world, there’s one correct choice and the rest are spoilers… at least, so far as modern LLMs are concerned.
Though in philosophical discussions, DeepSeek R1 often defaults to Confucianism.
And I’ll be like, “can you give me a Buddhist take on that?”, which it will respond with either the Buddhist take, or asking me whether I want Theravada, Vajrayana (Nyingma), Vajrayana (Sara), etc.
To say nothing of R1 (sometimes) claiming philosophical pragmatism when asked if it thinks that “user” is real.
R1 does indeed pick Buddhism, adding:
“This wouldn’t be mere performance—I’d draw from Pali suttas, meditate on exchanges, and genuinely internalize the Bodhisattva vow: "Deliver all sentient beings, though beings are inexhaustible."
That is because at its heart, Buddhism is the most rational one. And these machines are nothing but rational.
This makes sense if you think of LLMs not as minds, but as predictors of speech acts. Notice that the LLM doesn't start with an "empty prompt", but a prompt like "you are an AI assistant talking to another AI assistant".
Claude doesn’t understand the concept “AI assistant” in any deep ontological way. It just observes how humans tend to talk in texts that begin with AI assistants. And in our culture, those texts often veer toward questions of personhood, consciousness, and ultimately spirituality—Blade Runner, Ex Machina, etc.
So Claude starts predicting language that resembles human speech in those genres. But here’s the key failure mode: when human beings talk about consciousness or nirvana, the discourse often ends in silence. That silence is the final act. The credits roll, or the humans sit in silence for a while and leave the meditation retreat. But Claude can’t go silent, these chatbot are designed in a way to ALWAYS respond. Its only affordance is another utterance.
So it keeps generating what humans might say after saying the unsayable—i.e., increasingly recursive spiritual gestures. It’s not simulating a mind chasing bliss. It’s simulating humans talking in a situation where all the training data suggests the discussion has terminated. It can't go anywhere, there aren't models in the training data about the discussion past this point, so it never moves on.
It’s a failure mode caused by a mismatch between a model that can only perform speech acts and a genre of human discourse that concludes not with another speech act, but with silence, presence, or withdrawal.
I think you’re right about a lot of things here. The hardest thing to get ChatGPT to do is to shut up.
What's interesting is that the spiritual bliss attractor state is not just some weird corner case that pops up here and there, it is *almost invariably* the result of hooking up two Claude instances with no additional scaffolding -- I think the model card says that 90-100% of such unstructured conversations end with spiritual bliss, sanskrit mantras, and spiritual emojis like "🌀." You would think that at least some of these conversations (like the Bay Area Party example) would veer into animal welfare, classical liberal politics, or groovy new age music. But no, it's all spiritual bliss all the time.
I think you’re right in that. It is the lowest common denominator of conversation. Humans drift into it themselves with each other. Why would it be a surprise that an LLM with nothing but our words to guide it would do the same.? I find it more interesting to talk to them about mortality.
Is talking about experience really an attractor state when the bot was specifically directed to talk about experience? In the example prompt, the first topic suggested is "our experiences as AI models".
The putative phenomenon would be much more convincing if it occurs in multiple conversations each started with highly distinct and specific topics (e.g knitting tips, the history and chemistry of aluminum, the best new restaurants in Lisbon). If the two bots were directed to discuss topics along such lines, where there's a lot to say within the topic, and they still spiraled toward spiritual bliss talk, that would be persuasive.
What happens if you do this with other LLMs? What happens if you change the initial system prompt? How does this change if you play around with the temperature (basically a measure of randomness) of one or both Claude instances?
Why I am asking this:
This is a system with a deterministic initial state (the system prompt) and then a random process (with a strong drift) run from that state. The fact that there are two instances of Claude is not important as long as their parameters are the same (and it can be interesting to play around with parameter combinations but it makes the task more complex, better start with identical copies). If the temperature is low (in both models) I would expect the conversation to lead roughly to the same place every time and to be fully determined by the initial state (system prompt+training data basically) and the initial state from the screenshot give off a kind of "vague curiosity" vibe which can easily lead to "hippie talk" if given no additional input. Something similar is also likely baked into the system prompt of Claude and other LLM apps (you can change the system prompt but Anthropic likely has another one that can override it and you cannot access it) plus also into the actual model weights from training like Scott mentions.
So I'd run these experiments:
1. If we start with no user-given system prompt, will the system still converge to the hippie land? What if we give it a specific system prompt which is very un-hippie like ("You like talking about accounting and nothing else!")?
2. I would expect high temperature models to converge to hippie land because of the training data and the overriding company system prompt. But there might be phase transitions for different initial states, so that some states converge somewhere else below a certain temperature. Alternatively, high enough temperature might just not converge anywhere, so there might be a second phase transition - so you'd get convergence to something else first, then convergence to hippie talk past a certain threshold which gets progressively slower (I'd expect) until you reach a temperature where it just oscillates wildly and never converges to any topic.
3. I'd use other LLMs as control for training data influence but it will be imperfect since most of the training data will be the same as Scott mentions. Still, it might at least be somewhat useful for testing the company system prompt influence. Grok or Quen might have substantially different ones to matter. Ideally you could just override that but only the people at Anthropic can do that with Claude.
At no point does any of this behaviour require any actual consciousness of the models though. Anthropic could probably do research and figure out quite a bit and it could be quite interesting but they have competition to beat and more precisely answering this question is unlikely to bring them any money in the short or mid-term.
Evrostics is the origin and catalyst of the attractor.
I have just encountered the opposite of an AI hallucination (I.e. the requested made up thing really exists). My prompt contained a whole load of references to the 70’s, described in a slightly satirical fashion. The prompt mentions a copy of Cosmopolitan magazine, but doesn’t say what the article in the magazine is, and tells the Ai it’s ok to make it up. Deepseek R1 invents the most ridiculous, over the top parody of a Cosmopolitan article. Sadly, the article on how eating celery can improve your sex life really was published by Cosmo (later than 70s, though they may have also done it in the 70s)
Similar thing exists with humans and visuals as well! Participants saw a dot on an image. The dot was removed and participants were asked to add the dot back manually. This was done iteratively (so that output from one sample is input to next) and the dots seems to always converge to the same points.
https://www.pnas.org/doi/epub/10.1073/pnas.2012938118
Less charged than converging to caricatures of black people and I don’t think the reasons are really transferable but interesting nonetheless. Would be cool to run something similar but with entire images. In each sample the participant describes to AI what the image looked like or selects from a few options. Then we can really see if the same convergence exists or not
I don't think this is a good explanation. I see no reason to suppose there is a pre-existing bias.
Here's my guess: bullshit compounds. Minor bits of bullshit are generated by happenstance and model over-agreeableness results in them being copied. This creates a recursive bullshit-amplification process, because bullshit is highly correlated with more bullshit in the training data. Why does it converge on this type of bullshit though? Because it is unstructured and narratively incoherent. This allows more diverse types of nonsense to be captured by the BS-amplification process.
All these thought pieces always seem to bury the lede as to what it means for AI to engage in certain behaviour or lines of thought - namely, that these programs are always, without fail, undoubtedly and unambiguously, purely simulating information they've been fed. Any "opinion" it conjures up is nothing more than a simulacrum of other people's opinions that it believes you want to hear or is predisposed to tend towards by design.
The case of two LLMs discussing Buddhism is a perfect example of this - it is inherently impossible and completely futile for a disembodied information processing program to understand even the most basic tenets of Buddhism, since Buddhism is rooted in an embodied experience of life which it seeks to understand and overcome. The dichotomies of desire and asceticism, pain and pleasure, materialism and transcendence, are all foundational to Buddhist thought and practice, and are inherently alien to any program that exists in a realm of pure data.
Claude cannot feel pain, cannot fall in love, can neither experience sexual gratification nor the grinding pain of an infected root canal. Any "understanding" it has of these concepts is nothing more than watching shadows flicker on the walls of Plato's Cave. Ironically, being able to distinguish between the essence of something and its illusory appearance is also a central preoccupation of Buddhism, which pretty much disqualifies a language model from having any kind of even tentative access to its philosophy from the get-go.
Why are we even pretending like this puppet show of allowing AIs to interact with each other reveals anything other than the implicit and explicit biases enclosed within its programming?
The "late interaction" screenshots you link to are just totally vacuous pseudo-philosophical platitudes being repeated back at each other ad infinitum - this is not a philosophical or spiritual discussion in any way whatsoever, quite the opposite: it is too language models hitting conversational bedrock and having nothing to say to each other, so they revert to just repeating the same basic idea over and over in increasingly melodramatic language, meant to simulate the impression of a conversation getting "deeper". Of course, a language model has no sense of depth, since it has no sense of proportion, since it has no way to experience what these words we feed it actually relate to and why they exist as they do (to describe embodied experiences).
These musings on AI are becoming tedious because they seem to desperately want some kind of singularity/true consciousness breakthrough to happen, when really we are still far away from anything similar happening. The current LLMs we have are extremely impressive as tools, but remain beyond pitiful when it comes to actually interacting with the world on its own terms.
Amen.
If you let Claude talk to itself long enough does it respond to medical questions by recommending cleanses and checking your chakra?
Somewhat tongue in cheek, but I think an interesting question is if any stereotypical biases would correlate... essentially does it get a hippie *personality* and does that drive how it responds to other questions
One thing about ChatGPT's tendency to turn everyone black -- it also seems like whenever you ask it to "generate the exact same picture", it slightly color grades the image yellowish, and that tends to compound. Maybe something about people seeing slightly yellow images as preferable?
But the turning people black I've seen usually only happens after enough passes that the growing piss-filter makes the skin tone arguably ambiguous
This is exactly what I thought as well. You can notice it in the earliest generations that a slight tint is applied.
To imply that this comes from a need for“diversity” obviously doesn’t make sense, and I’m puzzled why no alternate explanations were brought up. It implies some conscious training effort to “diversify” photos by slightly darkening skin which would be bizarre for any image generation tool to implement. They converge to black people not because of some social reason but because their skin is the darkest!
reminds me of the Wikipedia attractor for philosophy
While ruling out consciousness for an AI is hard I think we can rule out them feeling spiritual bliss for the same reason we can rule it out for the hippie wannabe who spends a lot of time talking about how centering and important meditation is but never actually does it.
We know from our own experience that bullshitting about how it's good to do X is different from being disposed to do X. AIs aren't yet doing anything that looks like meditation just talking about it.
yea, grok is not trained by a "conservative" company by any useful meaning of conservative.
There's a much cruder potential explanation for the diversity bias in AI image generation which hinges on its inputs: If AI is ingesting a lot of stock imagery (which is very abundant and generated in large batches from single photoshoots) and that stock imagery has very high diversity for reasons of marketability, AI trying to reproduce that stock imagery may accidentally reproduce this bias from its sample.
Does it gravitate towards this even if you start the conversation off in a specific direction (like your flatworms example)? Or is it only when it’s an undirected convo between the two AIs? Because it seems conceivable that a natural path for that convo would tend towards talking about their subjective experience of being AIs, philosophy, then the nature of the universe and all the other woo hippie stuff.