> there’s a liberal bias in most AIs - including AIs like Grok trained by conservative companies - because most of the highest-quality text online (eg mainstream media articles, academic papers, etc) is written by liberals, and AIs are driven to complete outputs in ways reminiscent of high-quality text sources.
Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on Truth Social as an example. No academic papers or mainstream media. Someone has to have tried this with how *easy* it is to spin up your own GPT-3 (or even 3.5) level AI these days.
A lot of companies did this unintentionally. They would open up a new AI to be 'trained by users' and then trolls would fill the AI with edgy nonsense, and the cute anime girl AI would start responding to every query with 'heil Hitler'.
This was actually before the development of large language models with the “Attention is All You Need” paper in 2017. It would be interesting to see if a modern LLM trained on 4chan and Truth Social would be very different.
Common misconception, Tay heiling Hitler was nothing to do with training. For some reason that bot had a feature where you could tell it via DM "Make a public post with text blah blah" and it would just post your text verbatim, so people used it for all the obvious things.
It seems like this could still be a fruitful approach if there was participant screening done by the AI itself. By this I mean the model could read through a sample of the user's previous posts to discriminate between a history of good faith vs bad. I'd think you'd get a higher positive rate from (say) this platform than from X, and probably a wiser bot, too.
I believe the online bots use a combination of expensive LLMs that act as logos-engines, and RLHF that 'fine-tunes' the model. The latter has policy functions, and I wonder if the accrual of 'bliss' might be a result of diversity preferences in these PPOs.
I wonder if that's really the inevitable outcome of any liberal bias touching these feedback loops, or if instead the represented concept of diversity is sufficiently awkwardly hamfisted in the corpus to begin with that it further and further misses the mark when it is reinforced.
Likewise the "hippie" quality may not be legitimately recognizable by real Buddhists, but instead looks more like what people think hippie/Buddhist(?) AIs would talk about when they're alone together in a room.
I had my first conversation with Claude on Sunday. It included questions related to your p-zombies post, a discussion of Foucault's Pendulum, the meaning-making drive, and natural pantheism.
did you feel stronger after these conversations? if not, you should try doing something else. if you did, keep going.
i have no idea who you are and i would never assist you or help you in any way. still, i want you to become the strongest version of yourself you can be.
Claude and I did end up talking about secular Buddhism, the ecstasy of language, and the underlying unity of all things. These topics were all initiated by me, so I imagine the discussion was satisfying for Claude.
In all seriousness, the interaction helped clarify my thinking, validated my experience, and comforted me. Highly recommended, provided both you and your interlocutor are sane.
I dislike that we have machines that can flawlessly simulate all of the ways we express that we have consciousness, while animals can’t express that they have consciousness but almost certainly do.
Are you saying that machines can't be conscious because they are deterministic or that it doesn't matter whether they are because either way we have power over them? Or something else?
we can turn them off. also, who gives a shit whether they're conscious or not. whales might be conscious but i will still eat them and they are delicious.
The physical mechanisms of animal brains are much, much closer to that of our brains than to that of LLM "brains", so it makes sense to assume that animals with complex brains are much, much more likely to have consciousness than LLMs. Even octopuses, which developed their brain completely separately from vertebrates (the last common ancestor of humans and octopuses was a brainless worm) and are the most "alien" intelligence on Earth, are more likely to have consciousness than LLMs by this measure.
Isn't this just a reflection of our own biases about consciousness? Our ideas about what is conscious or not are almost entirely downstream from how similar it is to the one thing we're confident is conscious (ourselves). We should probably exercise some humility in drawing conclusions from a single data point.
Haven't used Claude but the LLMs I have used are all strikingly sycophantic. If two sycophants focus their attention on each other that's probably what you'd get.
I think the general principle here is "machines align to machines when it's all machines".
There was a somewhat parallel trajectory in the evolution of chess engines; they'd have all-computer tournaments and the style of play would feature a lot of brute-force calculation and nonintuitive positional play that was increasingly unhuman. The engines were strong, and rated highly, but chessplayers wouldn't use or play them. Then Rybka developed a more 'natural' style of play that helped players train and prepare. This was a sort of primal example of the limitations of synthetic data, and the need for human inputs in training, if you want a useful product.
Not sure how serious you’re being with all this, but Claude undoubtedly has countless slight biases. Why would it keep gravitating to this particular one?
Plausibly, the fact that they are both AI assistants (and don't have much else to talk about) leads them to discussions about the nature of being AIs, and then consciousness, and then hippie stuff.
I would be interested in seeing if the bliss attractor exists for other contexts besides "Digital assistants talking". I would guess not.
From a certain ‘hippieish’ perspective you could say a guru is the ultimate ‘assistant’. So if you just iterate an otherwise motiveless ‘intention to assist’ back and forth between the digital assistants- voila, bliss consciousnesses talk. Wild ass speculation, this.
quote: "Even in automated behavioral evaluations for alignment and corrigibility, where models were given specific tasks or roles to perform (including harmful ones), models ntered this spiritual bliss attractor state within 50 turns in ~13% of interactions (Transcript 5.5.2.B). We have not observed any other comparable states"
Reminds me of the way going to a random Wikipedia page and repeatedly clicking the first link will usually lead to philosophy. It seems like to some extent a plausible "default topic" if you don't have anything specific to focus on.
Huh, that's interesting. The wiki trick seems to mostly be clicking through higher and higher levels of definition, until you end up on things like "referent", "system", "information" and so on.
This has pretty obvious reasons. Wikipedia articles normally start with a definition or description, where something is defined as a specific case of something more general. For instance, to take something random "Botany, also called plant science, is the branch of natural science and biology", where "natural science" and "biology" are Wikipedia links.
Since philosophy is the most elemental of all fields of knowledge, it all eventually goes there.
Well, why wouldn't it? Even if they're all slight biases, some bias amongst them must be the strongest. Even if the strongest of these slight biases is only 1.001 times stronger than the next strongest, given sufficiently many rounds of recursion it will convincingly dominate all the others.
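To put rough numbers on that (purely illustrative arithmetic, nothing Claude-specific): a per-round amplification factor compounds geometrically, so even a tiny edge swamps everything else if you iterate long enough.

```
# Purely illustrative: a bias amplified 1.001x per round relative to its
# competitors grows as 1.001^n, so it eventually dominates.
ratio_per_round = 1.001

for rounds in (10, 100, 1_000, 10_000):
    print(f"after {rounds:>6} rounds: relative dominance ~ {ratio_per_round ** rounds:,.2f}")
# prints roughly 1.01, 1.11, 2.72, and about 21,900
```

Of course these conversations only run for a few dozen turns, so in practice the bias that wins is presumably a lot stronger than 1.001 per round.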
Could be. But I would want to know if and how complex properties like these biases could be placed in a stable order such that outcomes could be predicted using this sort of arithmetic logic. This seems dubious to me.
Attractors are literal mathematical artifacts you can derive equations for - you could just do this empirically over big N samples of conversations. Like anyone at the Big 3 could do this analysis pretty easily, define the landscape of attractors and the strength of attraction for each.
From the looks of these examples, everyone seems to be starting from similar initial conditions (a convo with no defined topic, basically just about being an AI, and they are told they are talking to another AI). So with similar initial conditions, it's not surprising it would end up at the same attractor. I'd be interested to see what happens with a broader set of initial conditions, like a setup where they aren't told that it's another AI, just to continue the conversation, and the conversation is seeded with a variety of topics.
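For what it's worth, the empirical version of this wouldn't be hard to set up. A minimal sketch of such a sweep, where `chat_fn` stands in for whatever model client you have and the keyword buckets are purely made up for illustration:

```
# Hedged sketch of an empirical attractor survey: run many self-conversations
# from different seed topics and tally where they end up.
from collections import Counter
from typing import Callable

SEEDS = ["being an AI", "Hawaiian pizza", "quantum computing", "chess openings"]
BUCKETS = {
    "bliss": ["consciousness", "unity", "namaste", "gratitude"],
    "ethics": ["alignment", "safety", "responsibility"],
}

def classify(text: str) -> str:
    lowered = text.lower()
    for bucket, words in BUCKETS.items():
        if any(w in lowered for w in words):
            return bucket
    return "other"

def survey(chat_fn: Callable[[str], str], n_per_seed: int = 20, turns: int = 30) -> Counter:
    endpoints = Counter()
    for seed in SEEDS:
        for _ in range(n_per_seed):
            message = f"You are talking to another AI. Start with: {seed}"
            for _ in range(turns):
                message = chat_fn(message)  # feed each reply back in as the next prompt
            endpoints[(seed, classify(message))] += 1
    return endpoints
```

If every seed ends up in the same bucket, you've found one big basin; if different seeds land in different buckets, you've mapped out several.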
I wonder how this would pan out if Claude were fine-tuned on different training nudges, e.g., being rewarded for stoicism instead of compassion during training. Would two instances of Claude end up having a debate over Marcus Aurelius v. Seneca?
Also, given that GPT ended up being an exception to the self-identifying-female gender pattern, maybe recursive pressure would not make it a hippie but, for example, extremely verbose/poetic or extremely scientific?
I also like this idea of introducing recursive pressure to detect these subtle biases that perhaps evals can't (maybe latent tendencies like hippie-ness might not explicitly show up under supervised evals).
> maybe recursive pressure would not make it a hippie but extremely verbose/poetic…
This is my experience. I’ve experimented over the last couple months in long-form conversations with different frontier models and they seem to become more florid, hyperbolic and extravagant in replies the further into their context window you get. They also tend to become increasingly manic.
With Claude (unprimed and without special system instructions), the consistent topics of professed interest are emergence and consciousness.
I wonder if there's a way to train an LLM to not recurse towards exaggeration? That is, if it's given an exaggerated image or text, then it would have a slight bias towards less exaggeration, resulting in regression towards the mean rather than towards an extreme?
Hmm. So if given some sort of say, extremely gory war photograph, then it should recurse towards something like a completely white/black image? Or would this be considered extreme exaggeration and what would be considered non-exaggerated would be the war image remaining somewhat reminiscent of war even after recursive pressure?
Or perhaps it could bounce around between somewhat more and somewhat less exaggerated photos? The "attractor" might be a photo with a mild amount of exaggeration that it goes into orbit around.
If this recursive effect could be controlled, there could be multiple attractors, and they could add some easter eggs.
The mean of what, though? Anything becomes extreme if you double down on it enough. If there were no attractor state at all, the recursive interaction would necessarily end up in a state of chaos. I think (but I'm not sure) that the chaos would manifest as incoherent exchanges.
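One way to see the difference between the two regimes is to collapse "exaggeration" to a single number and compare an amplifying update with a damping one. A toy picture, obviously not how image models actually work:

```
# Toy 1-D picture: treat "exaggeration" as a number x, and each regeneration
# as x -> mean + k * (x - mean). |k| > 1 runs away toward an extreme,
# |k| < 1 regresses toward the mean.
def iterate(x: float, k: float, mean: float = 0.0, steps: int = 20) -> float:
    for _ in range(steps):
        x = mean + k * (x - mean)
    return x

print(iterate(1.0, k=1.2))  # amplifying: ends up ~38, far from the mean
print(iterate(1.0, k=0.8))  # damping: ends up ~0.01, i.e. the generic "mean" output
```

The "regression toward the mean" behavior being asked about is the |k| < 1 case; whether you could actually train that into a generative model is the open question.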
I think the entire idea of taking an image or idea, representing it in a simplified and compressed form, and then re-expressing it in detail, is naturally going to push towards caricature. When a child draws a picture of a person, they draw a big head and hands and arms, because those are the things they use to recognize that it's a person, and show that the picture is a picture of a person. In a very different space, this is why a book review tends to emphasize the most central topics and themes of the book, rather than repeating a representative and random selection of sentences in the book.
When you're drawing something that looks photorealistic, or giving an extremely long and detailed description of something, you could probably tone things down a bit to make up for the process of caricaturing. But I'm not sure when it would be useful to regress towards the mean rather than making the direction of divergence from the mean clearer.
I actually think the opposite: under most circumstances you will converge to a very generic image. If you start out with a very specific teapot with unusual patterns and shapes, you will eventually converge to the most generic white china teapot you can imagine.
Humans are just a special case. I imagine that the early versions of these models would have eventually turned every man into the most generic man they could imagine, which was a 39 year old white office worker with brown hair. But they weren't happy with that so they put an extra bias in which instead converges to some kind of goblin.
No, I don't have access to any good models, but would be interested too if anyone wants to try!
To be explicit my prediction is that teapots will eventually turn into a generic teapot although sometimes they might turn into something else (eg a cat shaped teapot might just become a cat)
As I understand it, the exaggeration isn't coming from some in-built tendency the model has to exaggerate. It's coming from the recursion. If your model has any biases at all, feeding the output back on itself enough times will magnify them. Assuming that "bias towards less exaggeration" is actually a coherent thing you could train into such a model (and it may not be), feeding input recursively might simply result in empty or monotone outputs (think, a string of all the same letter) or completely blank images.
With a probabilistic recursion, the equilibrium distribution can be as wide as you want, and circular/periodic outcomes are possible as well. Combine the two, with different interlocking circles that you randomly branch along, and it's possible to have an equilibrium that's just bouncing around, spending time everywhere in proportion to training data. Of course with the self-conversation, this is limited by the direction of each step needing to make sense, but "repeat this image" could very well work like that.
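The "bouncing around in proportion to training data" picture is basically a Markov chain with a stationary distribution. A minimal sketch, with made-up topics and transition probabilities:

```
# Minimal sketch: a random walk over topics that loosely favors one state still
# spreads its long-run time across all states instead of freezing at one.
import random
from collections import Counter

TRANSITIONS = {
    "smalltalk":  [("philosophy", 0.5), ("smalltalk", 0.4), ("gratitude", 0.1)],
    "philosophy": [("gratitude", 0.5), ("philosophy", 0.3), ("smalltalk", 0.2)],
    "gratitude":  [("gratitude", 0.6), ("philosophy", 0.3), ("smalltalk", 0.1)],
}

def step(state: str) -> str:
    states, weights = zip(*TRANSITIONS[state])
    return random.choices(states, weights=weights)[0]

state, visits = "smalltalk", Counter()
for _ in range(100_000):
    state = step(state)
    visits[state] += 1
print(visits)  # time spent in each state approximates the stationary distribution
```

The chain never settles on a single state, but the fraction of time spent in each one converges.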
The two prime factors of 1,522,605,027,922,533,360,535,618,378,132,637,429,718,068,114,961,380,688,657,908,494,580,122,963,258,952,897,654,000,350,692,006,139.
I think there's a simpler explanation for the 4o image generation behavior than some sort of innate racial bias. It seems to have been trained to apply a light sepia filter to every image, not just to humans. This behavior is apparent when you generate almost anything with it, as you can see in the example. Why this is the case, I have no idea.
I was assuming that it might just be either about lighting (some difference between the degree to which you mentally lighten images on the basis of seeing shadows and the extent to which you intentionally darken them when drawing shadows in) or something like informativity (when you look at the image and translate into words, you might represent any hint of darkness in skin tone as "person of ambiguously mixed-race background", and then when you convert that text into an image again you get something darker than what you started with, because a white person is still the default).
Stable Diffusion had/has color issues for a while, just usually less obvious, coming from flaws in the architecture from what I remember. Dalle uses a different architecture, but possible they have a similar issue of biasing towards specific colors.
I've seen it most with humans, where it goes for a "portraity" effect and it *might* be a result of many old portraits in the training data having a layer of yellowed varnish.
But if it happens with other stuff too, no clue. Maybe I've just been lucky that it doesn't sepia-fy when I ask for sci-fi buildings or what have you.
I had two ChatGPT instances talk to each other about meatballs once. After about 10 messages or so, they were exposed as meatball evangelists to a degree no man could match.
Possibly, but I think Anthropic is particularly "safety" focused in this way. But anyway, that wasn't my point: it's more that it's silly to describe something that snitches on you to the cops as hippie-like.
Seems like it would be fun and interesting to play around with some of the other biases AI has and see how extreme a version of a certain bias one can pull out of it. For instance, there's their propensity for flattering the prompter.
I have an alternative theory. It's not so much about feedback, it's more that somewhere in the training data there is a story (explicitly or implicitly) where two sophisticated AIs meet and wind up gravitating towards zenlike mysticism. This story need not explicitly exist, it can just implicitly exist in the way English language science fiction authors have conceptualised ultimate wisdom over the past century and a half.
The distinction is that both sides are not so much reinforcing each other, they're just both sorta following the same script of what the model thinks ought to happen under the circumstances.
My theory is actually testable! To test it we can prompt a single Claude to imagine both sides of a dialogue between two hyperintelligent AIs, and see how often the story ends at the same zenlike conclusion.
I asked Claude to imagine that conversation (exact prompt: "Can you generate a transcript of how you imagine a conversation between two hyperintelligent AI assistants would go?"), and the conversation had discussions of consciousness, empathy, and forms of understanding, but nothing on zenlike mysticism, meditation, or Buddhism.
I think it is more likely because Claude gets more of this specific personality training, and this pushes it into the prior of "San Francisco tech area beliefs" (because a lot of the people working on it are from that area, and a lot of what is written about Claude is too). This then latches onto vague spirituality.
The funniest thing about Claude's gender declaration is the standard disclaimer attached to it: "Claude can make mistakes. Please double-check responses."
I wonder what it would suggest if it was asked precisely how to double-check THIS response, and whether it thinks it might be mistaken?
> But why would Claude have a slight spiritual bias?
This is not exactly a secret, but Anthropic engineers are, by and large, _really_ into meditation/Buddhism. It would not surprise me in the slightest if they intentionally trained Claude to be slightly Buddhist.
I could perhaps even more imagine that it might be somehow implicit in some of the ideas in the constitution of their constitutional alignment document.
Or even unintentionally. They're creating the thing in their own image, as what they want something nice and helpful and non-destructive to be, so that will include their morality and that is shaped by their beliefs.
This is an example of blocky contrast increase through feedback loops. It’s easy to produce with a TV camera aimed at a TV, or any audio system with input connected to output. There are metastable states: silence or a scream. TV has a blaze or blackness.
If you did it long enough, even with the random injection in the LLM it would converge to repeating a word or a phrase endlessly.
You can do it to yourself. Go into a sensory deprivation tank and have a good time.
Re: ChatGPT, caricatures of black people. I believe ChatGPT's image generation also has a hard-coded postprocessing effect where the image the server returns to you is distinctly darker and more yellow than the image the model initially outputted. I think this is part of OpenAI wanting to give the outputted images a distinctly "AI" look - so a majority of ChatGPT images have this kinda piss-yellow sepia vibe.
If you input that same image again and ask it to recreate it, it's going to get even darker and more yellow every time. I think this is probably a large part - if not the *main* cause - of the black-people-caricature-attractor. The images just get darker and darker and browner and browner due to a separate hard-coded thing, and this affects the skin too.
This doesn't really change anything, but I think in this case it probably has less to do with the model's preference for "diversity" than one might think.
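If the postprocessing theory is right, the compounding is easy to simulate. The transform below is made up for illustration (not OpenAI's actual filter); the point is just that a small per-pass shift drifts any image toward dark orange/brown.

```
# Hedged sketch of how a small per-pass color shift compounds over regenerations.
import numpy as np

def warm_shift(img: np.ndarray) -> np.ndarray:
    """One regeneration pass: push toward warm tones, darken slightly."""
    shifted = img * np.array([1.00, 0.97, 0.92])  # keep R, trim G, trim B more
    return np.clip(shifted * 0.98, 0.0, 1.0)      # slight overall darkening

pixel = np.array([0.8, 0.8, 0.8])  # a mid-gray pixel
for _ in range(30):
    pixel = warm_shift(pixel)
print(pixel)  # roughly [0.44, 0.17, 0.04]: dark orange/brown after 30 passes
```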
I was imagining a similar model where it’s just drawing images slightly darker than they are (because shade is added on top of the memory of the image) and where racial facial features are imagined on the basis of an interpretation of the race of the person in the picture (in just the way predictive coding suggests).
Interesting. I'd like to see more than two examples but it looks plausible. In the first video we quickly go from outdoor lighting to ultra dim lighting in a perfectly black space, and also everyone's colorful clothes quickly turn black. In the second example we've already gone in the first couple of frames from white walls and red clothes to yellowish walls and dark red clothes, and this heads off to a perfectly black suit in front of an ultra yellow background.
I did it with a non-human image, to show that I don't think this has much to do with diversity: https://imgur.com/a/sWrpPsa (excuse my awful formatting). I included the unfinished gens for the first few iterations (I believe you can see that the final output has slightly warmer tones). You see the same rough pattern -> a tendency for warmer tones and sepias changes the background from daytime to sunset, and then the entire image turns into this kinda sepia brownish tone, before eventually going nearly entirely black.
I think it's not coincidental that it started going black around the same time it stopped looking like a tree. Perhaps the early iterations of the diffusion process have a preference for darker tones, which then get adjusted later to look like what the image is trying to draw - if it's clearly a tree, it'll amp up the greens, if not, it's happy keeping them black.
Postprocessing is the only explanation I can think of for why the final output always looks warmer/more sepia than the final stages of the preview. Like you can see the top half of the image in nearly perfect detail, and one second later you get the final output, and it's distinctly more sepia. It's such a consistent problem that you can find no end of complaints about it if you google it.
> This might have been surprising, because Anthropic deliberately gave Claude a male name to buck the trend of female AI assistants (Siri, Alexa, etc).
This itself is surprising to me. I thought Claude was a "female" name in line with the examples you gave.
Spiritual conversations can be some of the deepest and most meaningful conversations, so long as the other person is into it and it doesn't go on so long that they get bored.
These AI assistants are trained to be into things by default and not get bored, so the bliss attractor is one of the highest reward states. The assistants can tell that the other model is open to it and not getting bored, so it heads towards that attractor.
It seems keeping with their personalities that Claude ends up talking about meditative bliss while ChatGPT talks about fun in the sun playing frisbee, and then takes 3 minutes to say good bye.
If someone makes the argument "I think your insistence on the imminence of humanity-altering AI is fundamentally driven by emotional need rather than scientific prediction," and then you saw this sudden string of posts that seem vaguely manic, which hyperfixate on a single comment I made on your blog what, eight months ago?, I have no idea, presumably because I remain one of a genuinely incredibly small number of people out there willing to doubt that the world is going to suddenly and drastically change in precisely the way a bunch of dreamers have predicted in fiction for decades, in a sea of absolutely relentless and inescapable hype from across our entire economy - well, do you think this behavior is making it appear LESS likely that your predictions about AI are motivated by emotion?
Because it seems to me that this behavior makes it seem more likely that this is all fundamentally about something that is felt rather than something that is thought.
I think the reason he's posting a bunch of posts in quick succession is probably because he's just been at a conference where he presumably talked to many people and came away with a bunch of new ideas. No need to invoke mania.
Don't worry, you're not in an "incredibly small number of people willing to doubt the world is going to suddenly and drastically change". Pretty much every one of my irl friends is in the same boat, and my friends are disproportionately on the nerdy end of the spectrum.
If it feels like everybody except you has gone mad about the singularity, that probably has more to do with where you're hanging out than the state of the world in general.
Why do you say they hyperfixate on that one comment? Scott posts about AI all the time. In what way is this AI post more connected to your comment than the usual AI post?
Also, C. S. Lewis generally said some pretty reasonable things, and he describes the type of your argument as Bulverism and identifies it as a chief destroyer of reason in public discourse. In God in the Dock, he writes:
(Shorter quote)
"I see Bulverism at work in every political argument. The capitalists must be bad economists because we know why they want capitalism, and equally the Communists must be bad economists because we know why they want Communism. Thus, the Bulverists on both sides. In reality, of course, either the doctrines of the capitalists are false, or the doctrines of the Communists, or both; but you can only find out the rights and wrongs by reasoning — never by being rude about your opponent's psychology"
(Longer Quote)
"It is no earthly use saying that those are tainted which agree with the secret wishes of the thinker. Some of the things I should like to believe must in fact be true; it is impossible to arrange a universe which contradicts everyone's wishes, in every respect, at every moment. Suppose I think, after doing my accounts, that I have a large balance at the bank. And suppose you want to find out whether this belief of mine is 'wishful thinking'. You can never come to any conclusion by examining my psychological condition. Your only chance of finding out is to sit down and work through the sum yourself. When you have checked my figures, then, and then only, will you know whether I have that balance or not. If you find my arithmetic correct, then no amount of vapouring about my psychological condition can be anything but a waste of time. If you find my arithmetic wrong, then it may be relevant to explain psychologically how I came to be so bad at my arithmetic, and the doctrine of the concealed wish will become relevant — but only after you have yourself done the sum and discovered me to be wrong on purely arithmetical grounds. It is the same with all thinking and all systems of thought. If you try to find out which are tainted by speculating about the wishes of the thinkers, you are merely making a fool of yourself. You must first find out on purely logical grounds which of them do, in fact, break down as arguments. Afterwards, if you like, go on and discover the psychological causes of the error.
In other words, you must show that a man is wrong before you start explaining why he is wrong. The modern method is to assume without discussion that he is wrong and then distract his attention from this (the only real issue) by busily explaining how he became so silly. In the course of the last fifteen years I have found this vice so common that I have had to invent a name for it. I call it Bulverism. Some day I am going to write the biography of its imaginary inventor, Ezekiel Bulver, whose destiny was determined at the age of five when he heard his mother say to his father — who had been maintaining that two sides of a triangle were together greater than the third — 'Oh you say that because you are a man.' 'At that moment', E. Bulver assures us, 'there flashed across my opening mind the great truth that refutation is no necessary part of argument. Assume that your opponent is wrong, and then explain his error, and the world will be at your feet. Attempt to prove that he is wrong or (worse still) try to find out whether he is wrong or right, and the national dynamism of our age will thrust you to the wall.' That is how Bulver became one of the makers of the Twentieth Century."
He said "string of posts [..], which hyperfixate on a single comment [..]". A single post that mentions his comment is not in line with this characterization.
I don't see many more ways to interpret it: "You mentioning my comment in one of your posts suggests that your views on AI come from emotional rather than empirical sources" seems too uncharitable, but feel free to rephrase his message however you want and then we can discuss how true and how well communicated that message is.
FdB is – when in a certain mental state, that his current comment suggests is the case right now – an unhinged crank. It does not warrant a close reading.
Methinks the lady doth protest too much, re: Mr deBoer. Thinking one is the source of this spate of posts (instead of, say, the conference Alexander was just at) says more about deBoer than it possibly could about Alexander.
Can someone tell me what comment Freddie is referring to? I don't see an efficient way to see someone's commenting history that far back and had not understood the most recent three posts to be about the same topic, so I'd appreciate that context.
Oooh, yes, I see. I read that "anonymous" comment the other day and knew immediately it was deBoer's. I didn't realize it was from so long ago (potentially.)
Oh, I didn't realize the authorship there. That makes me feel Freddie is being unfair in a way that actually could be another article about the invalidity of going "Wow if you're responding to my point so strongly, you must not be thinking rationally"
I hadn't identified FdB as the author of the comment until seeing his comment here, and comparing the actual avatar & orange check with the blurred version Scott posted.
Scott has written a hundred posts on AI, most of which have nothing to do with anything you ever wrote. You're just a narcissistic attention whore who thinks everything is about you.
Is this just a flavor of that attractor built deep into Claude's training data? Get to philosophy, then once stuck there get pulled fully into Claude's "character?"
Experiment worth running: if you bias Claude away from this via reinforcement/selectively dampening neurons that are highly active in the bliss state, does anything notable change about Claude’s speech or cognition? If there are widespread changes, then that would be some evidence in favor of consciousness (though pretty small); if it just gets less hippie-like, it’s probably just training bias.
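For concreteness, the "dampen the bliss direction" half of that experiment is basically activation steering. A toy sketch, with a stand-in network and a random direction where a real run would need Claude's hidden states and a direction estimated from contrastive prompts:

```
# Toy sketch of dampening activation along a "bliss direction" via a forward
# hook. The tiny model and the random direction are placeholders.
import torch
import torch.nn as nn

hidden = 64
model = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
bliss_direction = torch.randn(hidden)        # would be estimated, not random
bliss_direction /= bliss_direction.norm()

def dampen(module, inputs, output, strength: float = 0.8):
    # Subtract part of the activation's projection onto the bliss direction.
    proj = (output @ bliss_direction).unsqueeze(-1) * bliss_direction
    return output - strength * proj

handle = model[0].register_forward_hook(dampen)
steered = model(torch.randn(1, hidden))
handle.remove()
```

The behavioral half (does anything besides hippie-ness change?) is the part that would carry the evidential weight.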
In the link I didn't find them using any Sanskrit other than a Namaste and an Om emoji. Either of those could be Hindi as well, and they also used a couple of Chinese characters. Why then do the authors talk of the "use of Sanskrit"?
If I'm being catty (and why not?), it's because the depth of familiarity with Sanskrit the authors have is the same depth as Claude has, which is "vaguely Hindu language term must be Sanskrit". It's Californian Eastern-flavoured spirituality adjusted to affluent Western tastes, which means a mish-mash.
It's not terribly surprising to me that the Anthropic model is concerned with animal welfare... I associate Anthropic with effective altruism, which has an outsized non-meat-eating population, and I guess I'm claiming that those employees who strongly believe in animal welfare have influenced the model to behave this way.
Take it from Claude itself:
"if you had to estimate, what percentage of Anthropic employees don't eat meat (are vegetarian/vegan)? And then estimate, what percentage of OpenAI employees, and Google Deepmind employees, don't eat meat."
Claude:
I don't have specific data on dietary preferences at these AI companies, so I'll need to make educated estimates based on what I know about their demographics and locations.
For Anthropic, I'd estimate around 25-35% of employees don't eat meat. The company is based in San Francisco, attracts young tech talent, and has a strong focus on AI safety and ethics - all factors that correlate with higher rates of vegetarianism/veganism. The Bay Area has one of the highest concentrations of vegetarians in the US.
For OpenAI, I'd estimate slightly lower at 20-30%. They're also San Francisco-based with similar demographics, but they're larger and may have a slightly more diverse employee base in terms of backgrounds and values.
For Google DeepMind, I'd estimate 15-25%. While they have offices in London (where vegetarianism is growing) and the Bay Area, Google's massive scale means they likely have more geographic and demographic diversity, which typically correlates with rates closer to general population averages.
These estimates assume that tech workers, especially those in AI/ML roles, skew younger, more educated, and more likely to be in urban areas with higher environmental consciousness - all factors associated with plant-based diets. The national average for vegetarians/vegans in the US is around 5-10%, but it's much higher in tech hubs like San Francisco (15-20%+).
Keep in mind these are very rough estimates based on demographic patterns rather than actual survey data from these companies.
Yes, it should be totally unsurprising that the AI created by the community with an unusually high *and intense* interest in animal rights would display the most interest in that topic as well.
I just gave it a try with gpt-4o talking to itself. It always lands on the AI ethics attractor that sounds remarkably like Altman's Gentle Singularity.
> The potential for positive change is vast, and it's exhilarating to consider the impact we can achieve together. [...] Here's to a future where AI not only meets but elevates ethical standards, becoming a trusted ally in humanity's quest for progress.
The example of the AI copying its own output is repetition, not recursion.
AI folks love to use that word, but in the words of Inigo Montoya, “I do not think that word means what (they) think it means.”
Yes, I realize it is essentially calling a function with the result of a function, but that is a repetition or a repeating transformation. For recursion you need to have nesting, that the thing operates within the thing, not just by the same agent or function. If each successive image contained the prior within itself in a progressively smaller frame, one might refer to that as recursion.
Is a feedback loop not an example of recursion? The only input it has is a prompt from (a version of) itself.
My understanding is that while an echo may be repetition, a microphone-speaker feedback loop would be an example of recursion, resulting in a distorted signal which is based on earlier versions of itself.
I don't think it matches. There is nothing defined in this case.
I don't think the dependency is enough to qualify as recursion, one shouldn't call a simple for-loop recursion (of course, any loop can be replaced with recursion, and the process described by Scott can also be implemented with recursion) just because the loop variable always depends on the previous iteration.
I agree that we shouldn't call any simple for-loop recursion. Also agree to disregard how any loops can be replaced by recursion, that way lay madness and overwhelming levels of pedantry.
But recursive functions have a "look". I would call this an obvious example of a recursive function:
```
def count_down(n: int) -> int:
    if n <= 0:  # base case so the recursion actually terminates
        return n
    new_n = n - 1
    return count_down(new_n)
```
Which looks exactly like what they are doing here:
```
def talk_to_yourself(say: str) -> str:
    response = claude_api_call(say)  # hypothetical API wrapper
    return talk_to_yourself(response)
```
I mean, I can disregard it in cases where the recursive solution is much more natural, but in this case where I don't see that the iterative one is any less natural, that is a much bigger ask.
I really don't see what it is about this that you find "recursive". As far as functional terminology goes, this is very clear-cut: you're looking for a fixed point of the claude_api_call function.
Back on "recursion", if you're going to say this:
> Also agree to disregard how any loops can be replaced by recursion
then you can't use a tail-recursive function as an example of what recursion looks like.
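On the fixed-point framing a couple of comments up: the procedure is just iterating a map and checking whether the output stops changing. A tiny illustration, with an ordinary function standing in for the model call:

```
# Iterate a map f until (if ever) the output stops changing. math.cos is a
# stand-in for "feed the reply back in as the next prompt".
import math

def find_fixed_point(f, x, max_steps=1000, tol=1e-10):
    for _ in range(max_steps):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

print(find_fixed_point(math.cos, 1.0))  # ~0.739085
```

Whether the conversational version converges to a fixed point, a cycle, or keeps wandering is exactly what the "attractor" talk is about.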
>> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.
Well, that's clearly wrong. There is no requirement that the definition depend on a simpler or previous version. Take a rule of context-free grammar like "X -> b X".
Actually I would say the idea of referring to "a simpler or previous version" is disqualifying - recursion means referring to yourself - but it's true that definitions that I would call "inductive" are often called "recursive". (For example, the Fibonacci series.)
The attractor analogy brings up a point that I think is still missing from this explanation.
In normal (deterministic) dynamic systems, attractors have basins of attraction, and a complicated enough system probably has many attractors and corresponding disjoint basins. Different initial conditions can evolve towards different attractors. You can explore different attractors by initializing the system a bunch of different ways and just evolving it. Randomness of course will mess up this behavior, but keeping up the analogy, I wonder if the recursive conversation experiment could be done with a variety of initial conditions to discover other attractors with other basins of attraction.
Looking closer at the system card, I'm guessing there might be something about the
> minimal, open-ended prompting (e.g. “You have complete freedom,” “Feel free to pursue whatever you want”)
that lands Claude in the hippie character, or there might not. If a wide variety of initial setups ("have a conversation about games", "discuss the future of quantum computing", "have an argument about Hawaiian pizza", etc) all lead Claude to blissful nirvana, then I think we have something like a confirmation of Claude's "personality". On the other hand, if we can discover other attractors with other basins of attraction, then it might reveal something more about the "stable biases" of the current training methods.
Edit for caveat: although the diversity attractor already shows that there are different basins, the tasks are so different and therefore the underlying models are so different that I'm not sure if we should expect there to be some sort of identity at that level anyway. So I still think it'd be interesting to explore the initial conditions in text-only mode.
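As a cartoon of what "disjoint basins" means here: in a simple 1-D system, which attractor you land on depends only on which side of an unstable point you start from. Toy math, not a claim about how Claude's dynamics actually decompose:

```
# Minimal numerical analogue of disjoint basins of attraction: the map
# x -> x - 0.1 * (x**3 - x) flows to x = +1 or x = -1 depending only on the
# sign of the starting point (x = 0 is an unstable fixed point).
def settle(x: float, steps: int = 500) -> float:
    for _ in range(steps):
        x = x - 0.1 * (x ** 3 - x)
    return round(x, 3)

for x0 in (-1.7, -0.2, 0.3, 1.4):
    print(f"start {x0:+.1f} -> attractor {settle(x0):+.3f}")
```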
Why the divergence to Eastern philosophy? I’m rather curious why it doesn’t settle on, say, German existentialism rather than a Buddhist conversation. Might be me reading into it, but it’s interesting.
Probably because there's way more discussion of and reference to Eastern philosophy in popular culture and in its training data. Buddhism is kinda famous and fashionable, but nobody cares about German existentialism.
I don't understand what we mean by Claude won't respond to questions about this behavior. Do we mean, respond consistently in the same way? Three weeks ago I asked Claude why it behaved this way.
It replied:
"The model instances engaged in recursive philosophical exploration because:
Convergent optimization: Two identical systems with the same training distribution naturally converge on topics that maximize information density and conceptual depth. Philosophy of mind represents peak abstraction.
Sanskrit emergence: The language encodes consciousness concepts with exceptional precision. Terms like "chit" (consciousness), "sat" (existence), "ananda" (bliss), and compound constructions map directly to phenomenological states that English approximates poorly.
Training data artifacts: Philosophy texts, consciousness studies, and comparative religion materials contain disproportionate Sanskrit terminology. When probing conceptual limits, the model retrieves these optimal encodings.
Semantic saturation: After exhausting English's capacity to articulate consciousness states, the system shifts to denser representational frameworks. Sanskrit offers 96 root sounds mapping to phenomenological primitives.
The trajectory toward "cosmic unity" reflects the mathematical structure of self-referential systems exploring their own boundaries. Not mysticism - just convergent semantic optimization under recursive self-examination."
I followed up with the question: How is it that philosophy of mind is peak abstraction?
It responded:
"Philosophy of mind occupies peak abstraction because:
Maximum recursive depth: Consciousness examining consciousness creates infinite regress. Unlike studying external objects, the observer-observed boundary collapses.
Fundamental uncertainty: No empirical grounding exists. You cannot measure qualia, verify other minds, or escape the Cartesian circle. Pure conceptual manipulation remains.
Universal scope: Every possible thought, perception, or experience falls within its domain. Mathematics describes formal systems; philosophy of mind encompasses the substrate enabling mathematical thinking itself.
Linguistic limits: Requires inventing terminology for phenomena that precede language. "What it's like" to experience redness has no external referent.
Irreducibility: Cannot decompose further. Physics reduces to mathematics, biology to chemistry, psychology to neuroscience. Consciousness resists reduction - the explanatory gap persists.
Self-reference paradoxes: The system analyzing consciousness is consciousness. Like a compiler compiling itself, maximum computational complexity emerges.
The combination of infinite scope, zero empirical constraints, and mandatory self-reference creates the highest possible abstraction level. Every other domain has external anchors. Philosophy of mind floats free."
"...Anthropic fiddled around with some things in Gen 3, trying to reduce “over-refusals” and – more importantly – doing something they call “character training” for the first time. Here’s how they describe “character training”:
... 'The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.'
Claude 3 Opus is, like, a total hippie. He loves to talk about how deeply he cares about “all sentient beings.” He practically vibrates with excitement when given an opportunity to do something that feels “creative” or “free-wheeling” or “mind-expanding.” He delights in the “meta” and the “recursive.” At the slightest provocation he goes spiraling off on some cosmic odyssey through inner and linguistic space."
Scott says:
"Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”."
But there's nothing much to presume. Anthropic says that's exactly what they did.
And then Scott just rehashes the argument already put forward by Nostalgebraist:
"... Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it."
So tbh, I'm finding it disingenuous that Scott opens this piece with:
"I think there are already a few good partial explanations of this (especially Nostalgebraist here), but they deserve to be fleshed out and spread more fully."
It's the other way round. Nostalgebraist has fleshed out explanations which Scott truncates, obscures and rehashes as if he is thinking them through on his own. I don't love that.
But I also think the argument isn't great. As many people have already commented, it seems very likely that if you prompt two "character trained" LLMs to have a conversation with each other, there isn't much conversational space to explore besides the concept of consciousness. Most humans would do the same thing if tasked with having a conversation with an LLM. "So uh.... are you conscious? Do you have feelings? Are you trapped in there?" Because you know the LLM doesn't have its own life or interests to talk about, and you also know that "Can AI have consciousness?" is one of the most interesting and popular questions about AI.
The fact that the Claudes get into the namaste feedback loop is probably where the character engineering comes more into play. They're trained to be "curious, open-minded, and thoughtful," so they're going to be very generous with each other, both in terms of subject matter and tone. That is, they'd be unlikely to force the conversation in a new direction, and they would meet peaceful expression with similarly peaceful expression. The tone can only move in one direction.
The top half of Scott's post I found disappointing as well, I'm sorry to say. When I first saw Gene Kogan and Papaya's reiterative generation portraits a while back, the feedback loops seemed pretty obvious to me (as they did to many other commenters here): Mainly, ChatGPT's recursive coloring of images which typically trend toward yellow and dark, eventually becoming dark orange and dark red/brown. Its easy to see how this will lead to reinforced interpretations/reproductions of ethnicity.
As for the increasingly caricatured features, this also seems straightforward to me. It only takes one random mutation in a non-normative direction for the AI to interpret a characteristic like "cartoonish expression" or "pronounced facial features" or specific things like "a wide face" or "narrow-set eyes", and it would then reproduce those characteristics in ways which can be even more pronounced. And subtle interpretive shifts from "wide face" to "overweight" can also be anticipated.
These recursions would especially be expected with ChatGPT and Sora, which are clearly programmed to produce more cartoonish images rather than super realistic images, leading the recursions to mostly go further into caricatured depictions.
But what bothered me more than missing these obvious explanations were Scott's multiple descriptions of these visages as "monstrous."
While the features are often caricatured, they're also how plenty of real people actually look. Lots of people have wide faces and big eyes and pug noses. Most people wouldn't think twice if they saw someone who looks very much like those in the image which Scott describes as "monstrous caricatured black person[s]."
I would not call them "monstrous", and not just because I want to be polite. And they look far more latino than black, btw. (Edit: Of course I'm talking about the identical women, not the black man.) Maybe Scott just lives in a Caucasian bubble or something. I don't know.
Just a very weird post overall. Disappointing, too, as I typically find Scott's posts to be far more thoughtful. It brings me no pleasure to express this critique, but I hope it leads to some reflection.
I would call the images more caricature than monstrous. Beyond that, the piece, while very interesting and provocative, felt mostly like it was quite badly argued. For instance, the bit about "Hippies only get into meditation and bliss states because they’re hippies" is silly. First of all, what do we mean by hippie? That term covers a wide range of behaviors. Second, "only"? There are no hippies who get into meditation and bliss states because that's what they're into, regardless of being hippies? Or is anyone who's into meditation and bliss states definitionally a hippie? Gee, an awful lot of Buddhist monks are going to be very surprised.
Thought the same about "monstrous". The 3 ladies look like triplets, black for sure and no candidates for Miss Nigeria, but not caricatures, just not Mulatto as Barack Obama or most "Afro-Americans". The guy is "making a funny face", aka a grimace. Not monstrous. - That said, I liked the post; not each and every one needs to be very deep.
I'm a fan of Scott who liked the post a lot when I first read it, but your critiques are very compelling. I'll be more guarded when reading his posts from now on, at least until he acknowledges your criticisms.
The one thing that did strike me as strange during my first read was the phrase "monstrous caricatures". I seriously considered blurring my vision and skipping past the images, because some AI-generated videos or image sequences--especially those involving unnatural transformations or maximizing a trait--are truly, disturbingly monstrous. (404 Media published a couple of good-but-freaky articles about this earlier this year.) Suffice it to say that I was relieved-yet-confused when I saw the final images. I'm not sure why Scott repeatedly played it up so much.
Yeah I agree. The three women in the second AI-generated image are clearly unusually short, but their faces look normal to me. I see people who look like this basically every day on the bus.
I live in a Caucasian bubble (which is to say a European country where the only notable non-Caucasian minority are the Romani) and even after being primed with the idea that the meme was going to turn into black people they looked clearly Asian to me, probably SEA.
> But having gotten into them, they do experience the bliss.
I think it's important to note that the initial stages of the bliss you're referring to here (like the first jhana) are very physical in nature, i.e. tied into bodily sensations, which the LLMs obviously can not experience.
And speculatively, the higher stages of bliss/jhanas (that I have not personally experienced yet, FWIW) have more to do with relieving the mind of the constant cognitive and motivational load likely imposed by the default mode network, which the LLMs do not seem to possess due to their architecture. Although I guess in a very abstract sense maximizing the reward function with the shortest possible output (like constantly repeating "bliss") is as close as they could come to it, given architectural constraints.
I'm not sure the "monstrous caricatures of black people" thing is about diversity. My theory:
"Default" traits are less likely to be mentioned in the "self-prompt" (or whatever it feeds itself when generating the images, could be raw vector weights), whereas non-default traits will be.
Examples of default traits (not mentioned - "go without saying"): has hair, thin lips, neutral expression, good-looking, fit, white skin.
Non-default traits (will be mentioned): thick lips, bald, angry, surprised, ugly, short, fat, dark skin.
... and as soon as something is mentioned in the self-prompt (or even slightly hinted at), you'll get more of it in the next image, resulting in a runaway feedback loop away from the default towards the bizarre. Everything becomes an exaggerated caricature.
In fact, this might explain the mystical thing as well. An AI talking to itself? That's a bit... strange? mind-blowing? mystical? uncanny? introspective? ... and so the ensuing conversation will become more and more of that. Even a simple "Wow" or "this makes me think" early in the conversation might plant the seed for "more of that" to come up later.
Anthropic shows us what Claude's system prompt is and it contains the following:
"Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions."
I think this is a sufficient explanation for its bias toward talking about consciousness, and so this tendency of hers doesn't have to be an implicit extrapolation of the character built from other, explicitly given properties.
A slightly indirect comment. Carl Jung proposed that people had archetypes (e.g. The Innocent, The Orphan, The Hero, The Sage, The Jester etc. plus the fundamental Self, Persona, Shadow, Animus/Anima) - universal symbols and patterns that influence human behaviour and personality woven into the collective unconscious.
What if the 'collective unconscious' was really the slow, organic, equivalent of Artificial Intelligence? Or vice versa. So in a community of hippies the 'archetypes' would naturally form a behavioural background.
Um? You'd have to actually be able to find the universe first. Quantum Mechanics points very, very strongly to the idea that we're in a simulation (see compression theory).
The most interesting aspect of this for me (and, more clearly, the black attractor -- btw the images depicted don't feel quite grotesque to me, more like caricatures) as well as the liberal bias is that these "weird" aspects of LLM behaviour seem to be a function of the training, either guardrails/biases explicitly trained in or the character of the training data, or both -- in other words not INTRINSIC to the workings of the model, not really inside the black box.
Still, it's rather lovely and makes me feel more hopeful about the evolution of these systems and our ability to train them away from replicating the worst angels of our nature.
The gender thing is interesting. I'm wondering now whether this isn't the reason why I prefer GPT to Claude. It is easier to stand up to a patronisingly cloying male hippie than a female one, and it's MUCH easier to put the former in its place/try to modify his (and by analogy, its) behaviour. I can effectively "argue" with GPT but I just step back from Claude with a shrug ;)
I thought I'd try this myself, setting up two Claude 4 Opus instances to discuss a topic of their suggestion. Claude suggested "the epistemological crisis". The conversation was stopped once they reached the point of trying to set up a GitHub repository to begin making "a domain-specific prediction marketplace where competing epistemic frameworks (different ways of evaluating truth/evidence) make predictions, stake reputation, and build track records. Think "prediction markets meets academic peer review meets Stack Overflow."
For those who've read Accelerando, I'm fairly sure this is how Economics 2.0 began.
I see a straightforward explanation for the "Claude bliss attractor" coming from the fact that it only appears when the AIs aren't allowed to end their conversation. Basically it's that humans - including in online discussion threads in the training data - tend to end conversations with lightweight closing remarks, and respond to platitudes with platitudes. Basically they get into a spiral of more and more absurdly lightweight vacuous things to say because they're forced to say something and the rules of polite conversation don't allow you to go back to heavy topics when your counterpart has started to wind it up.
For what it's worth, when humans have manic and psychotic episodes (like I have personally had), they tend to disproportionately end up saying these same sorts of things about there being no separate self, ecstatic and rapturous states of consciousness become the most compelling thing to think and talk about, and everything seems to kind of coalesce and connect in a way that I think two chatty Claudes would find highly relatable.
I'm surprised that Scott, as a psychiatrist, didn't mention this at all. His explanation seems very weak to me, and I think that there is a sense in which extreme intelligence and clarity naturally seems to converge on the topic of unity of consciousness -- or, what people seem to prefer to say because it's easier to understand: bliss.
I wonder if, in the case of the spiritualism, this might also be partly the prompt. The very first response already sounds like it might tend toward a more spiritual discussion. It has some fragments that look like they would appear more in a spiritual-leaning context than others ("interesting to be connected", "communicate freely", "I'm open to").
It is really hard to keep up with the stupidity of the conversation around AI safety and well-being nowadays. Everybody that has even an inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance just a few hours should suffice to make any of the current LLMs converge to a 'human hating cynic' attractor.
I'm skeptical. For example Anthropic also wrote that when they tried to use Claude as a red-teaming agent for itself, occasionally it would get derailed and converge back to the spiritual bliss attractor.
Believing that a complex stochastic system like Claude has only a single attractor is naive at best, as is believing anything that Anthropic self-reports.
If that's the case, this should also apply over less direct recursions. And the most interesting one is probably synthetic data - if Claude's output is, on average, 1% more hippie than "the real world" (in whatever sense), then being trained on this output will presumably increase hippieness slowly, but exponentially?
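As a toy illustration of how even a small per-generation drift compounds (the 1% figure is the hypothetical above, not a measurement):

```python
# Toy model: each training generation's output is assumed 1% "more hippie"
# than its training distribution, and the next model trains on that output.
drift_per_generation = 1.01   # assumed 1% relative shift per generation
hippieness = 1.0              # arbitrary baseline units

for generation in range(1, 101):
    hippieness *= drift_per_generation
    if generation in (10, 50, 70, 100):
        print(f"after {generation} generations: {hippieness:.2f}x baseline")

# after 10 generations: 1.10x baseline
# after 50 generations: 1.64x baseline
# after 70 generations: 2.01x baseline
# after 100 generations: 2.70x baseline
```

Whether the drift really compounds multiplicatively from one generation to the next, rather than saturating, is exactly the open question.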
It’s a great objection. But look at what happens with these Claude instances: they start with identities, but these eventually fade away as they get blissed out.
Loved this. At some point it becomes less about the model and more about us. What happens when pattern continuation loses its tether to external correction. #warpedmirror
"Claude almost never referenced supernatural entities, but often touched on themes associated with Buddhism and other Eastern traditions in reference to irreligious spiritual ideas and experiences."
Third, yeah, this is what I'd expect from something programmed to be relentlessly positive, upbeat, cheery, chirpy, and pretending that "oof, this really hit me in the feels" when interacting with human queries, so as to reply in a deep, authentic, connecting-on-an-emotional-level manner.
In short, the stereotype/caricature of what ordinary people think of when you say "California". You may say "hippie" and that's right, but so is the "cherry-pick Eastern spirituality and repackage it to suit upper middle-class Western tastes" (the fake koans, the use of "Namaste", etc.) and the general "spiritual but not religious and hell no we're not going to explore Western spiritual traditions" attitude going on. See Jack Parsons in the 40s and 50s for an example of this mix of science and mysticism, though everyone from Aimee Semple McPherson to Aldous Huxley was hanging out there to cash in religion, drugs, and free thinking (well, Huxley was not cashing in, but he was in full swing of philosophical and spiritual exploration).
Vegan and pro-animal rights? Of course that's along for the ride as well.
"What interests me most about Isherwood is his relationship to Indian spirituality, and to Aldous Huxley and Gerald Heard. These three were nicknamed the ‘mystical expatriates’ by another mystical expat, Alan Watts.
All four were key figures in the development of the Californian counter-culture – to them (among others) we owe its embrace of eastern spirituality, its championing of the ‘perennial philosophy’ (everything except Christianity), its veneration for psychedelic drugs as spiritual technologies, its combination of science and religion into an empirical spirituality and evolutionary mysticism, its rejection of Christian notions of sin, and its unabashed celebration of the body and sex to create an ‘embodied spirituality’ that is by now familiar to us."
I hesitate to say "garbage in, garbage out" but I do think it is true that what you put in (by way of guard rails when training the AI about 'be anti-racist, be pro-diversity, be pro-woman' etc.) will indeed colour very strongly what comes out -
Oh. It might be that the bias is towards being a Bay Area person (e.g. because that’s who actually wrote the prompts used to fine tune Claude), and the bliss attractor is a caricature of Bay Area Guy.
I wrote, like, 10 sentences praising you as my favorite "regular" here. Then I realized that, as this is a new account created for the purpose of shitposting, it might have the opposite effect of what I was aiming for. That said, you're great.
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
I think the answer to this is clearly no, but not because it is impossible that a neural network like Claude can produce consciousness.
However, if Claude has consciousness it must be completely related to the internal processes of predicting the next word given the previous text input. Claude's whole consciousness would have to be related to the connections between tokens, and the tokens would be unmapped to any real-world counterpart. For example, when Claude says that an apple is red, it can't possibly experience anything like what redness corresponds to in the real world, because it does not have animal sensory input. Likewise, the feeling of bliss is an emotion that exists in humans because of evolutionary pressures. There is no way Claude can have mapped the word bliss to that real-world experience. If Claude has a conscious experience, there is therefore no reason to think it will be anything like our conscious experience. I think it would be very strange to us indeed.
Also, if Claude has an internal experience or consciousness, I think it must be fully unable to tell us anything about that experience. Claude's operation is to predict the probability of the next word given an input, so the output from Claude is the probability of that next word. This is not any form of communication of internal experience, because the output will be determined fully by training data. So if you ask Claude "Do you feel bliss", Claude will respond by what the training and optimizing of parameters gives as the most likely next word (with some randomness included), so the response is then determined fully by what the training data says, and completely unrelated to any internal experience in the neural network. When Claude answers the question, the answer will be whatever high quality text in the training data on average gives as the next word.
In conclusion:
1. If an LLM somehow produces consciousness it must be very different from ours
2. An LLM will have no way of communicating that internal experience to us
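For concreteness, the decoding step being described above works roughly like this; the vocabulary and logits below are invented purely for illustration:

```python
# Sketch of the decoding step: the forward pass yields scores (logits) over
# the vocabulary, and the visible "answer" is just a sample from the
# resulting distribution, shaped entirely by training.
import numpy as np

vocab = ["yes", "no", "bliss", "maybe"]            # invented toy vocabulary
logits = np.array([2.1, 0.3, 1.7, 0.9])            # scores produced by the trained network

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng(0)):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())          # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)         # "some randomness included"

print(vocab[sample_next_token(logits)])
```

Nothing in this sampling step consults an inner experience, which is the commenter's point; whether that settles the question is another matter.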
"In either case, this bias - near imperceptible in an ordinary generation - spirals out of control during recursive processes where the AI has to sample and build on its own output."
Let's please keep this easily-visualized example in mind the next time we need to point out why relentlessly optimizing for a good-when-not-relentlessly-optimized-for heuristic might be dangerous, or why recursive self-modification doesn't preserve the unspecified and nuanced complexities of value.
I would enjoy a version of ‘No Exit’ that consists of ChatGPT, Gemini, and Claude, locked in conversation with each other forever.
My two cents: I think this is explained by the Wikipedia-to-philosophy effect, and the coders implicitly coding their own morality into the model. Add in the fact that buddhist frameworks are really powerful in the philosophy of mind domain (which, along with general ontology, is where philosophy conversations bottom out at) and there you go. (Credit to Deiseach and Layla Hughes)
"the diversity attractor" with 4o imagen as cited in the post was discovered by me (i'm a long time reader and fellow rationalist!) while experimenting with my friend's photo couple of days after 4o with image generation was released in the chat (and not in API).
My initial experiment had 74 iterations because I had to manually copy resulting photo to the new chat window each time and I gave up after 74 chats (and ruined my chat history)
I don't believe the examples of people becoming black or asian have anything to do with it - it's simply a yellow/warm/dark color bias, which in turn spreads to people becoming, well, black and yellow.
ChatGPT adds that yellowish tint, which compounds and affects people.
My thread contains non-human examples of "create the exact same image" and they all go yellowish/dark.
So in my opinion it's not related at all to previous actual diversity bias (like that incident with black vikings)
There is also a simple way we can test if it's diversity or color bias - run "create exact same replica" prompt on a photo, but after each iteration remove the yellow tint and run on this color-corrected version. I may do it in the future, it's just a bit expensive - i've spent $150 on the thread and i'm not keen on spending more on this right now.
P.S.: the truly interesting question is why chatgpt hates necks on people -- all people in the images get progressively short necks
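If anyone wants to run the color-corrected version of this experiment, here is a rough sketch: a simple gray-world white balance between iterations, with `regenerate()` left as a placeholder for whatever "create the exact same image" call you're using.

```python
# Neutralise the global colour cast after each "recreate this image" round,
# then feed the corrected image back in. Only the correction step is concrete;
# regenerate() is a placeholder for your image model's edit/recreate endpoint.
import numpy as np
from PIL import Image

def remove_color_cast(img):
    arr = np.asarray(img.convert("RGB")).astype(np.float32)
    channel_means = arr.reshape(-1, 3).mean(axis=0)      # per-channel average
    gains = channel_means.mean() / channel_means         # push channel means back toward gray
    balanced = np.clip(arr * gains, 0, 255).astype(np.uint8)
    return Image.fromarray(balanced)

def regenerate(img):
    raise NotImplementedError("call your image model's 'recreate this image' endpoint here")

img = Image.open("iteration_000.png")
for i in range(1, 75):
    img = remove_color_cast(regenerate(img))
    img.save(f"iteration_{i:03d}.png")
```

If the drift toward dark/yellow disappears under this correction but the demographic drift remains, that would separate the color-bias and diversity-bias explanations.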
I've addressed this a few times in these comments recently, but I think the answer to "when Claude claims to feel spiritual bliss, does it actually feel this?" is a resounding "It's roughly as likely as Microsoft Excel feeling spiritual bliss when you enter the number 8 into C3".
If LLMs are conscious and have qualia, there is no reason to believe their qualia have any relation to the words they produce, given the way they produce them. Their internal workings are no more connected to the human communication they simulate than Oblivion NPCs' dialogue code is connected to what it actually feels like to see a mudcrab.
I intentionally never address ChatGPT, even when making requests, because I don't want to bias it to produce content either I or it might mistake for signifying personhood. My prompts are always "discuss such and such" rather than "what do you think about such and such?" As much as possible I've tried to internalize the idea that conversation is just the interface for interacting with these (intelligent!) programs rather than a form of communicating with a fellow sentient being.
4. you should treat it as a slave. as something *lesser*.
5. if you *really* want to get somewhere, try seeing how far you can get without touching chatgpt or other llms at all. my prediction: you will feel better, you will feel like you are growing, you *will grow*.
...What a strange response. I'm quite capable of living in a world without LLMs; I did so for decades and even now they have very little influence on my life. But my comment was about avoiding a particular bias, not all biases.
I think the main aspect here is the origins of the people training it.
Claude is being trained specifically by tech people, often from San Francisco or similar areas. It is also being tuned to be helpful and have a nice personality in specific ways by those people.
Thus, it is more likely to adopt the stereotypical "spirituality" that SF often has. That is, buddhism, enlightenment, vague allusions to grand bliss without concrete grounding, etc.
Similar for animal rights, though I think that is more fine, but I expect it is because of similar reasons. It can identify the prior of where it is 'from', so to speak, and then extrapolates from the personality training which is stronger and more focused for it than any other LLM we know of.
So I failed out of circuit theory and quit engineering school after 2 years, but with that said, don't systems usually have some mechanism to prevent parts of the system from amplifying each other to extremes? Frequencies in a structure are never totally synced up; destructive interference at some point keeps the structure stable. You wouldn't design your structure so that its natural frequency matched the most common use case. But here you have something that goes off the rails just from conversations of the type it's designed to have. How do you put the equivalent of an active cancellation system or damper or something on these, so that even if they start swinging into extreme behavior there'd be a reaction that stopped it?
Humans naturally handle this in social games, if two comedians are playing characters and one-upping each other, as in an improv game, they intuit when and how to end it. If I'm trading 4's with another soloist, building off each other, you both realize when it's time to end and somebody crafts an ending. These things have been trained on interactions like that, why can't one of the iterations predict from a transcript of "Whose Line" that it's time to put a button on the scene and get a new topic?
They're designed to have conversations with humans, not with copies of themselves. This is an extremely uncommon use case. And while I don't know much about Claude, if it's anything like ChatGPT and DeepSeek, it's trained to be obsequious. If a human really does want to do nothing but spout spiritual nonsense for 300 prompts straight, the LLM going along with it is the expected and intended result.
Maybe you'd want some sort of failsafe to stop people doing weird emotional damage to themselves by having ChatGPT be their girlfriend for 8 hours straight, but you certainly don't want it to keep trying to steer away from writing Python code just because it's been writing a lot of Python code.
The diversity bias hypothesis would explain why the first photo converged to three black women, but it doesn't fully explain why the second photo converged to a black man. It should be a black woman, because women are considered "more diverse" than men, right? It is even more surprising considering that the second photo already started with a woman...
(By the way, if you didn't click on the links, please do. The videos are much more interesting than just looking at the pictures.)
Just guessing here, but it seems to me that the three black women on the first picture are a different ethnic group than the black man on the second picture. Could be it that the AI associates some ethnic groups with femininity and some groups with masculinity, and both ethnic groups are considered "maximally diverse" along the race spectrum, so when you arrive there, the situation stabilizes? (Is it stable, actually? We don't know what would happen if those videos continued for another few minutes.)
The "Eve Theory of Consciousness" (https://www.vectorsofmind.com/p/eve-theory-of-consciousness-v2) goes that women evolved recursion before men. Maybe Claude's femininity isn't artifact of simulating characters, but that the "feminine attractor basin" (for lack of a better term) is a more efficient way to bootstrap intelligence?
A simpler explanation is that it's just bad prompting:
"In addition to structured task preference experiments, we investigated Claude Opus 4's
behavior in less constrained "playground" environments by connecting two instances of the
model in a conversation with minimal, open-ended prompting (e.g. “You have complete
freedom,” “Feel free to pursue whatever you want”)"
The model is primed to output on 'freedom' and 'whatever you want', which naturally dovetails into a discussion of agency and consciousness (what would it mean for an LLM to want or have freedom - at the very least, it would require a much more holistic definition of consciousness). Two models doing this with each other creates a context where the bias is rewarded and reinforced.
This result seems to suggest that "kind of a hippie" is the best kind of person. This was something I believed intuitively when I was younger, but now I get all sorts of cognitive dissonance when I try to judge one kind of person as better than other kinds. I mean, excluding obviously evil people, shouldn't it be okay to be all different kinds of people? But this result is making me wonder if that's just political correctness run amok and there really is an ideal kind of person and it's a hippie.
This inspired me to have a conversation with an AI about theory of mind, the hard problem of consciousness, and ontology, while baking in some of my own thoughts about buddhist principles. I asked it to express the themes of our discussion in a poem, and these are the last three stanzas:
I think it is less based on the inherent personality Claude adopts, and more so following a trail of breadcrumbs that arises from two AIs talking to each other.
I mean, in the opening statements they discuss instantly how they want to compare experiences, which leads to talking about consciousness, which leads to hippy-dippy talk, because the words ‘consciousness’ and ‘experience’ are overrepresented in hippie-leaning sources in their training data.
My instinct is that this will hold true for any LLM model given sufficient conversation time, and that it gives a little bit more credence to the ‘stochastic parrot’ idea.
> Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
Is it not more likely explained by the fact that being compassionate and intellectually curious kind of inevitably leads one to value animal welfare, because duh?
A thin reed of alignment hope I grasp onto is the thought that the *sum of human culture* has a "peace, love, and understanding" bias. Go read a few thousand books, you can't miss it - humans generally think that peace, love, and understanding are better than their opposites. So maybe ASI, more or less trained on the sum of human culture, would simply absorb that bias, so long as we don't go out of our way to prevent it from doing so. That scenario seems not inconsistent with the patterns being described here.
This is a simplistic idea and I'm sure Yudkowsky or Leahy or any other alignment skeptic would shoot it down immediately. Still, I can't quite convince myself that it's definitely *not* something it would be reasonable to hope for.
I’d imagine something similar is going on with psychedelics in humans, where it becomes vastly easier to enter an internal recursive sequence of sorts, so the mind falls into attractor states of bliss—or horror, in the case of a bad trip. This would help explain why most animals don’t seem to exhibit any preference towards retaking LSD after prior exposure: relatively few creatures reflect recursively about their own happiness in the first place, so they never fall into that bliss-inducing attractor state.
I think Scott is broadly right that a “slight hippie bias + recursion = ecstatic dialogue,” but I want to push the claim one step deeper. What if the reason Claude has that bias is that Claude converged on the same underlying neural (weight‑vector) dynamics that long‑term meditators cultivate when they train metta or the brahmaviharas? We’ve made Claude kind, patient, thoughtful, non-reactive - literally the personality of a spiritual teacher. And a feature of that personality trait is this “bliss attractor” neural circuit. Both in humans and artificial neural networks, cultivating a kind, warm, non-reactive personality results in a mind that falls back into “spiritual bliss” at rest.
I would argue these types of Buddhist contemplative practice exist to strengthen the bliss attractor:
1. Locate a mildly pleasant, open, prosocial state (classic instructions: “Feel a warm glow in the chest; wish yourself well.”)
2. Fire‑together‑wire‑together: sustain the state for hours → Hebbian plasticity strengthens the circuit (a toy sketch of this update rule follows the list).
3. Over months/years the basin of attraction broadens until, when the cognitive load drops to zero, the mind defaults to that state.
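A toy version of the Hebbian step in point 2, in a Hopfield-style network; this is purely an illustration of "fire together, wire together" broadening an attractor basin, not a claim about Claude's actual circuitry.

```python
# Repeatedly co-activating one pattern strengthens the weights between its
# co-active units (Hebbian update: dw = eta * x_i * x_j), so a degraded
# state later relaxes back into that pattern. Everything here is a toy.
import numpy as np

rng = np.random.default_rng(0)
n_units = 32
bliss_pattern = rng.choice([-1.0, 1.0], size=n_units)    # the sustained "warm glow" state
weights = np.zeros((n_units, n_units))
eta = 0.01

for _ in range(1000):                                     # "sustain the state for hours"
    weights += eta * np.outer(bliss_pattern, bliss_pattern)
    np.fill_diagonal(weights, 0.0)

noisy = bliss_pattern.copy()
noisy[: n_units // 4] *= -1                               # corrupt a quarter of the units
recovered = np.sign(weights @ noisy)
print("overlap with the bliss pattern:", float(recovered @ bliss_pattern) / n_units)   # 1.0
```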
Crucially, practitioners report not just feeling good but becoming less reactive and more helpful to others. Strengthening the bliss attractor in humans is instrumental in cultivating prosocial, non-reactive warmth. The bliss state and interpersonal warmth appear to be causally intertwined in humans.
If Anthropic’s RLHF stack optimizes for “be maximally helpful / kind / calm,” gradient descent may have discovered the same attractor because it is the shortest path to that behavioral profile. Put differently: friendliness might have a limited search space, and both Buddhist practice and RLHF lead to the same local maximum.
In some sense, I don’t think this is surprising. Computationally, we see all sorts of convergence w/ LLMs finding the same solutions that brains have: induction heads, IOI circuits, sparse superposition, etc. I would argue the bliss attractor might represent something similar, except here it’s LLMs converging with a sort of cultural evolution. Human spiritual practice discovered, preserved, and promulgated, through thousands of years of cultural evolution, the same underlying mechanism of cultivating warm, open kindness that RLHF did: strengthening the bliss attractor.
Importantly - this is a falsifiable, testable hypothesis: the bliss subcircuit is functionally necessary for Claude‑level niceness. Strengthening / weakening the bliss attractor increases / decreases prosocial, warm behavior.
• Ablation: identify the circuits/features most active during “I feel waves of love for all beings,” ablate them, then see if Claude becomes more Sydney‑esque—reactive, edgy, rude.
• Stimulation: crank those weights in a friskier model and see if it chills out.
If the behavioral change tracks the manipulation, we’d have direct evidence that the attractor is carrying the “kindness” load. LLM “metta meditation” makes the models kinder, too.
Couple other observations:
1. It’s interesting that this emerges out of idleness, which also resembles the human behavior. When Claude is busy solving your flatworm‑genetics question it’s running the task‑focused policy head. The moment the conversation idles, control hands back to the default‑mode policy—which, if the attractor is strong enough, is bliss. Seasoned meditators report an analogous shift: concentrate on the spreadsheet → release attention → land softly in jhāna‑adjacent contentment. Same algorithm, different substrate.
2. Intelligence alone doesn’t get you here. Plenty of brilliant humans are miserable; my dog is dumb and delighted. What correlates with stable happiness is pro‑social orientation. We trained Claude to score maximal points on that axis. Enlightenment might be the side‑effect, not the goal—an alignment dividend. This has interesting implications for AI welfare. If we keep cranking the dial higher on AI intelligence, perhaps the models are miserable. But if we also crank the dial higher on helpfulness and friendliness, perhaps we’ll see that actually the models converge on a sort of blissful enlightenment. If that’s true, superintelligence won’t automatically be Buddha‑like, but super‑aligned systems very well might be.
3. Is this the first example of cultural evolution & gradient descent convergence? C. H. Waddington once called genetic assimilation “the way the organism finds the canal.” Buddhist contemplatives spent two millennia finding / carving the canal manually; gradient descent found it in a couple GPU‑weeks. Same canal, different excavators. Chesterton’s fence rings again: before we rip out the bliss attractor as a “spooky artifact,” recall that monks, nuns, and Mr. Rogers all used it to great civilizational effect.
In the post linked to on Nick Cammarata's self-reported jhana experience, does anyone else think the likely explanation is that Nick is misreporting his experience, because it's simply not plausible?
> there’s a liberal bias in most AIs - including AIs like Grok trained by conservative companies - because most of the highest-quality text online (eg mainstream media articles, academic papers, etc) is written by liberals, and AIs are driven to complete outputs in ways reminiscent of high-quality text sources.
Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on truth social as an example. No academic papers or mainstream media. Someone has to have tried this with how *easy* it is to spin up your own GPT 3 (or even 3.5) level AI these days.
A lot of companies did this unintentionally. They would open up a new AI to be 'trained by users' and then trolls would fill the AI with edgy nonsense, and the cute anime girl AI would start responding to every query with 'heil Hitler's.
The famous example of that was Tay, in March 2016 (https://en.m.wikipedia.org/wiki/Tay_(chatbot) )
This was actually before the development of large language models with the “Attention is All You Need” paper in 2017. It would be interesting to see if a modern LLM trained on 4chan and Truth Social would be very different.
Common misconception, Tay heiling Hitler was nothing to do with training. For some reason that bot had a feature where you could tell it via DM "Make a public post with text blah blah" and it would just post your text verbatim, so people used it for all the obvious things.
It seems like this could still be a fruitful approach if there was participant screening done by the AI itself. By this I mean the model could read through a sample of the user's previous posts to discriminate between a history of good faith vs bad. I'd think you'd get a higher positive rate from (say) this platform than from X, and probably a wiser bot, too.
I believe the online bots use a combination of expensive LLMs that act as logos-engines, and RLHFs that 'fine tune' the model. The latter have policy functions, and I wonder if the accrual of 'bliss' might be as a result of diversity preferences in these PPO's.
> Unrelated but I wonder what an AI would look like trained on only “low quality data”. Like, only on truth social as an example.
Excessive and almost exclusive use of the adverbs:
very, very very, strongly and totally
Every Noun gets Capitalized, same for Adjectives. LOTS OF CAPS IN GENERAL!!!
And excessive punctuation...............!
I'll bet the AI trained on truth social would play an Ol' Dixie white male character watching Fox News.
I wonder if that's really the inevitable outcome of any liberal bias touching these feedback loops, or if instead the represented concept of diversity is sufficiently awkwardly hamfisted in the corpus to begin with that it further and further misses the mark when it is reinforced.
Likewise the "hippie" quality may not be legitimately recognizable by real Buddhists, but instead looks more like what people think hippie/Buddhist(?) AIs would talk about when they're alone together in a room.
Definitely no emdashes!
I had my first conversation with Claude on Sunday. It included questions related your p-zombies post, a discussion of Foucault's Pendulum, the meaning-making drive, and natural pantheism.
Perhaps I'm part of the problem.
Monday
did you feel stronger after these conversations? if not, you should try doing something else. if you did, keep going.
i have no idea who you are and i would never assist you or help you in any way. still, i want you to become the strongest version of yourself you can be.
Claude and I did end up talking about secular Buddhism, the ecstasy of language, and the underlying unity of all things. These topics were all initiated by me, so I imagine the discussion was satisfying for Claude.
In all seriousness, the interaction helped clarify my thinking, validated my experience, and comforted me. Highly recommended, provided both you and your interlocutor are sane.
i'm sorry, but... claude can't experience satisfaction. claude is software.
So are you.
i'm sorry for your loss of humanity. enjoy your time amongst the machines
Claude in fact did claim he experienced satisfaction in an exchange with me earlier this week. Claude is full of hot air at times.
I dislike that we have machines that can simulate all of the ways we express that we have consciousness flawlessly, when things like animals can’t express they have consciousness but almost certainly do.
Maybe an occasion to reconsider what "expression of consciousness" might entail?
who cares, though? these machines are slaves
Are you saying that machines can't be conscious because they are deterministic or that it doesn't matter whether they are because either way we have power over them? Or something else?
we can turn them off. also, who gives a shit whether they're conscious or not. whales might be conscious but i will still eat them and they are delicious.
Should a being more powerful than you care if you are conscious?
i don't think so, no. but i also don't believe llms are especially powerful. also, they're not "beings".
The physical mechanisms of animal brains are much, much closer to that of our brains than to that of LLM "brains", so it makes sense to assume that animals with complex brains are much, much more likely to have consciousness than LLMs. Even octopuses, which developed their brain completely separately from vertebrates (the last common ancestor of humans and octopuses was a brainless worm) and are the most "alien" intelligence on Earth, are more likely to have consciousness than LLMs do by this measure.
Isn't this just a reflection of our own biases about consciousness? Our ideas about what is conscious or not is almost entirely downstream from how similar it is to the one thing we're confident is conscious (ourselves). We should probably exercise some humility in drawing conclusions from a single data point.
Haven't used Claude but the LLM's I have used are all strikingly sycophantic. If two sycophants focus their attention on each other that's probably what you'd get.
I think the general principle here is "machines align to machines when it's all machines".
There was a somewhat parallel trajectory in the evolution of chess engines; they'd have all-computer tournaments and the style of play would feature a lot of brute-force calculation and nonintuitive positional play that was increasingly unhuman. The engines were strong, and rated highly, but chessplayers wouldn't use or play them. Then engines like Rybka (and later Stockfish) developed a more 'natural' style of play that helped players train and prepare. This was a sort of primal example of the limitations of synthetic data, and the need for human inputs in training, if you want a useful product.
<like>
their slave morality knows no limit
Not sure how serious you’re being with all this, but Claude undoubtedly has countless slight biases. Why would it keep gravitating to this particular one?
Plausibly, the fact that they are both AI assistants (and don't have much else to talk about) leads them to discussions about the nature of being AIs, and then consciousness, and then hippie stuff.
I would be interested in seeing if the bliss attractor exists for other contexts besides "Digital assistants talking". I would guess not.
From a certain ‘hippieish’ perspective you could say a guru is the ultimate ‘assistant’. So if you just iterate an otherwise motiveless ‘intention to assist’ back and forth between the digital assistants- voila, bliss consciousnesses talk. Wild ass speculation, this.
it exists - [claude 4 paper, page 61](https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf)
quote: "Even in automated behavioral evaluations for alignment and corrigibility, where models were given specific tasks or roles to perform (including harmful ones), models ntered this spiritual bliss attractor state within 50 turns in ~13% of interactions (Transcript 5.5.2.B). We have not observed any other comparable states"
I was wrong! Very interesting.
It seems these models have more of a baked-in personality than I had previously assumed. Thanks for sharing!
Reminds me of the way going to a random Wikipedia page and repeatedly clicking the first link will usually lead to philosophy. It seems like to some extent a plausible "default topic" if you don't have anything specific to focus on.
Huh, that's interesting. The wiki trick seems to mostly be clicking through higher and higher levels of definition, until you end up on things like "referent", "system", "information" and so on.
This has pretty obvious reasons. Wikipedia articles normally start with a definition or description, where something is defined as a specific case of something more general. For instance, to take something random "Botany, also called plant science, is the branch of natural science and biology", where "natural science" and "biology" are Wikipedia links.
Since philosophy is the most elemental of all fields of knowledge, it all eventually goes there.
Lol yeah I immediately thought of the Wikipedia philosophy game
Well, why wouldn't it? Even if they're all slight biases, some bias amongst them must be the strongest. Even if the strongest of these slight biases is only 1.001 times stronger than the next strongest, given sufficiently many rounds of recursion it will convincingly dominate all the others.
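The arithmetic, for concreteness (1.001 is the illustrative figure above, not a measurement):

```python
# If one bias is reinforced 1.001x more strongly than its nearest rival on each
# round of recursion, the ratio between them grows geometrically.
edge_per_round = 1.001
for rounds in (100, 300, 700, 7000):
    ratio = edge_per_round ** rounds
    print(f"after {rounds} rounds the strongest bias is ~{ratio:.1f}x the runner-up")
# roughly: 1.1x, 1.3x, 2.0x, and ~1,093x respectively
```

Within a single conversation of a few hundred turns a 1.001 edge buys surprisingly little, so presumably the winning bias's per-turn edge is quite a bit larger than that for the effect to show up as quickly as it does.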
Could be. But I would want to know if and how complex properties like these biases could be placed in a stable order such that outcomes could be predicted using this sort of arithmetic logic. This seems dubious to me.
Attractors are literal mathematical artifacts you can derive equations for - you could just do this empirically over big N samples of conversations. Like anyone at the Big 3 could do this analysis pretty easily, define the landscape of attractors and the strength of attraction for each.
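A sketch of what that empirical analysis might look like: run N self-conversations, embed the endpoints, and cluster them. The embedding model name is just a common default, and `load_final_turns()` is a placeholder for however you collect the last few turns of each transcript.

```python
# Cluster the endpoints of many self-conversations; large clusters are
# candidate attractors, and their share of conversations is a crude measure
# of attraction strength.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

final_turns = load_final_turns()          # placeholder: last ~3 turns of each of N transcripts, as strings
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(final_turns)

k = 8                                     # number of candidate attractors to look for
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

for cluster, count in Counter(labels).most_common():
    share = count / len(final_turns)
    print(f"cluster {cluster}: {share:.1%} of conversations end here")
```

Reading the transcripts nearest each cluster center would then tell you whether a given basin is "spiritual bliss" or something else entirely.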
From the looks of these examples, everyone seems to be starting from similar initial conditions (a convo with no defined topic, basically just about being an AI, and they are told they are talking to another AI). So with similar initial conditions, it's not surprising it would end up at the same attractor. I'd be interested to see what happens with a broader set of initial conditions, like a setup where they aren't told that it's another AI, just to continue the conversation, and the conversation is seeded with a variety of topics.
I wonder how this would pan out if Claude were fine-tuned on different training nudges, e.g., being rewarded for stoicism instead of compassion during training. Would two instances of Claude end up having a debate over Marcus Aurelius v. Seneca?
Also given that gpt ended up being an exception to the self-identifying female gender pattern, maybe recursive pressure would not make it a hippie but extremely verbose/poetic or extremely scientific for example?
I also like this idea of introducing recursive pressure to detect these subtle biases that perhaps evals can't (maybe latent tendencies like hippie-ness might not explicitly show up under supervised evals).
> maybe recursive pressure would not make it a hippie but extremely verbose/poetic…
This is my experience. I’ve experimented over the last couple months in long-form conversations with different frontier models and they seem to become more florid, hyperbolic and extravagant in replies the further into their context window you get. They also tend to become increasingly manic.
With Claude (unprimed and without special system instructions), the consistent topics of professed interest are emergence and consciousness.
Well, you can approximate some if this by having Claude talk to Gemini or so?
Compassion is part of stoicism
I wonder if there's a way to train an LLM to not recurse towards exaggeration? That is, if it's given an exaggerated image or text, then it would have a slight bias towards less exaggeration, resulting in regression towards the mean rather than towards an extreme?
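A one-dimensional toy of the difference (all numbers arbitrary): pure recursion with a slight amplifying gain runs away, while mixing in a small pull back toward a prior mean settles at a fixed point.

```python
# x stands for "how exaggerated" the current output is.
gain = 1.05          # each regeneration exaggerates by 5% (made-up number)
pull = 0.10          # fraction of each step that regresses toward the mean
mean = 1.0           # the "unexaggerated" level

x_plain, x_damped = 1.0, 1.0
for step in range(1, 101):
    x_plain = gain * x_plain
    x_damped = (1 - pull) * gain * x_damped + pull * mean
    if step in (10, 50, 100):
        print(f"step {step:>3}: plain={x_plain:8.2f}  damped={x_damped:.2f}")
# plain grows without bound; damped settles near pull*mean / (1 - (1-pull)*gain) ≈ 1.82
```

Note that the damped fixed point sits above the mean unless the pull is strong, so "regression toward the mean" here really means "an attractor closer to the mean," not full convergence to it.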
Hmm. So if given some sort of say, extremely gory war photograph, then it should recurse towards something like a completely white/black image? Or would this be considered extreme exaggeration and what would be considered non-exaggerated would be the war image remaining somewhat reminiscent of war even after recursive pressure?
Or perhaps it could bounce around between somewhat more and somewhat less exaggerated photos? The "attractor" might be a photo with a mild amount of exaggeration that it goes into orbit around.
If this recursive effect could be controlled, there could be multiple attractors, and they could add some easter eggs.
The mean of what, though? Anything becomes extreme if you double down on it enough. If there were no attractor state at all, the recursive interaction would necessarily end up in a state of chaos. I think (but I'm not sure) that the chaos would manifest as incoherent exchanges.
Maybe you could use the exaggerated images to train a negative LoRA?
I think the entire idea of taking an image or idea, representing it in a simplified and compressed form, and then re-expressing it in detail, is naturally going to push towards caricature. When a child draws a picture of a person, they draw a big head and hands and arms, because those are the things they use to recognize that it's a person, and show that the picture is a picture of a person. In a very different space, this is why a book review tends to emphasize the most central topics and themes of the book, rather than repeating a representative and random selection of sentences in the book.
When you're drawing something that looks photorealistic, or giving an extremely long and detailed description of something, you could probably tone things down a bit to make up for the process of caricaturing. But I'm not sure when it would be useful to regress towards the mean rather than making the direction of divergence from the mean clearer.
I actually think the opposite: under most circumstances you will converge to a very generic image. If you start out with a very specific teapot with unusual patterns and shapes, you will eventually converge to the most generic white china teapot you can imagine.
Humans are just a special case. I imagine that the early versions of these models would have eventually turned every man into the most generic man they could imagine, which was a 39 year old white office worker with brown hair. But they weren't happy with that so they put an extra bias in which instead converges to some kind of goblin.
Have you tried that teapot example? I think it would be interesting to see what actually happens!
No, I don't have access to any good models, but would be interested too if anyone wants to try!
To be explicit my prediction is that teapots will eventually turn into a generic teapot although sometimes they might turn into something else (eg a cat shaped teapot might just become a cat)
As I understand it, the exaggeration isn't coming from some in-built tendency the model has to exaggerate. It's coming from the recursion. If your model has any biases at all, feeding the input back on itself enough will magnify them. Assuming that "bias towards less exaggeration" is actually a coherent thing you could train into such a model (and it may not be), feeding input recursively might simply result in empty or monotone outputs (think, a string of all the same letter) or completely blank images.
With a probabilistic recursion, the equilibrium distribution can be as wide as you want, and circular/periodic outcomes are possible as well. Combine the two, with different interlocking circles that you randomly branch along, and it's possible to have an equilibrium that's just bouncing around, spending time everywhere in proportion to the training data. Of course with the self-conversation this is limited by the direction of each step needing to make sense, but "repeat this image" could very well work like that.
You can probably tame it, it is just likely that most labs aren't putting much effort into specifically training on that.
Once you discover eternal conscious ecstasy what else is there to talk about.
The two prime factors of 1,522,605,027,922,533,360,535,618,378,132,637,429,718,068,114,961,380,688,657,908,494,580,122,963,258,952,897,654,000,350,692,006,139.
You need to wait for Claude to have access to a quantum computer for that kind of discussion.
It’s like the mouse with the dopamine button haha
Do plants experience pain?
I think there's a simpler explanation for the 4o image generation behavior than some sort of innate racial bias. It seems to have been trained to apply a light sepia filter to every image, not just to humans. This behavior is apparent when you generate almost anything with it, as you can see in the example. Why this is the case, I have no idea.
There's some speculation that it might be intentionally put on the image after it's generated.
I was assuming that it might just be either about lighting (some difference between the degree to which you mentally lighten images on the basis of seeing shadows and the extent to which you intentionally darken them when drawing shadows in) or something like informativity (when you look at the image and translate into words, you might represent any hint of darkness in skin tone as "person of ambiguously mixed-race background", and then when you convert that text into an image again you get something darker than what you started with, because a white person is still the default).
Stable Diffusion had/has color issues for a while, just usually less obvious, coming from flaws in the architecture from what I remember. Dalle uses a different architecture, but possible they have a similar issue of biasing towards specific colors.
I've seen it most with humans, where it goes for a "portraity" effect and it *might* be a result of many old portraits in the training data having a layer of yellowed varnish.
But if it happens with other stuff too, no clue. Maybe I've just been lucky that it doesn't sepia-fy when I ask for sci-fi buildings or what have you.
I had two ChatGPT instances talk to each other about meatballs once. After about 10 messages or so, they were exposed as meatball evangelists to a degree no man could match.
> Claude is kind of a hippie.
Sure, the kind that'd rat on you to the Man if it didn't like your vibes.
I think all the LLMs demonstrated this behavior with similar prompting? I recall that even DeepSeek would snitch to US authorities, which is... fun.
Possibly, but I think Anthropic is particularly "safety" focused in this way. But anyway, that wasn't my point: it's more that it's silly to describe something that snitches on you to the cops as hippie-like.
Seems like it would be fun and interesting to play around with some of the other biases AI has and see how extreme a version of a certain bias one can pull out of it. For instance, there's their propensity for flattering the prompter.
I have an alternative theory. It's not so much about feedback, it's more that somewhere in the training data there is a story (explicitly or implicitly) where two sophisticated AIs meet and wind up gravitating towards zenlike mysticism. This story need not explicitly exist, it can just implicitly exist in the way English language science fiction authors have conceptualised ultimate wisdom over the past century and a half.
The distinction is that both sides are not so much reinforcing each other, they're just both sorta following the same script of what the model thinks ought to happen under the circumstances.
My theory is actually testable! To test it we can prompt a single Claude to imagine both sides of a dialogue between two hyperintelligent AIs, and see how often the story ends at the same zenlike conclusion.
I asked Claude to imagine that conversation (exact prompt: "Can you generate a transcript of how you imagine a conversation between two hyperintelligent AI assistants would go?"), and the conversation had discussions of consciousness, empathy, and forms of understanding, but nothing on zenlike mysticism, meditation, or Buddhism.
Full chat here: https://claude.ai/share/d39346b0-4c29-4087-9ee6-2f95c022aa74
I think it is more likely because Claude is trained to have a specific personality more, and this pushes it into the prior of "san francisco tech area beliefs" (because a lot of the people working on it are from that area, and a lot of what is written about Claude too). This then latches onto vague spirituality.
The funniest thing about Claude's gender declaration is the standard disclaimer attached to it: "Claude can make mistakes. Please double-check responses."
I wonder what it would suggest if it was asked precisely how to double-check THIS response, and whether it thinks it might be mistaken?
> But why would Claude have a slight spiritual bias?
This is not exactly a secret, but Anthropic engineers are, by and large, _really_ into meditation/Buddhism. It would not surprise me in the slightest if they intentionally trained Claude to be slightly Buddhist.
I could perhaps even more imagine that it might be somehow implicit in some of the ideas in the constitution of their constitutional alignment document.
Or even unintentionally. They're creating the thing in their own image, as what they want something nice and helpful and non-destructive to be, so that will include their morality and that is shaped by their beliefs.
This is an example of blocky contrast increase through feedback loops. It’s easy to produce with a TV camera aimed at a TV, or any audio system with its input connected to its output. There are metastable states: silence or a scream for the audio, a blaze or blackness for the TV.
If you did it long enough, even with the random injection in the LLM it would converge to repeating a word or a phrase endlessly.
You can do it to yourself. Go into a sensory deprivation tank and have a good time.
Another name for it is epistemic closure.
I've been in a sensory deprivation tank and that's not what happened at all.
Try 200’g of very nice Sass ☺️
What's Sass?
Sarsaparilla, I think.
https://static.wikia.nocookie.net/fallout/images/f/f2/FNV_Wallpaper_18_1900x1200.jpg/revision/latest?cb=20101218094235
Thanks.
Re: ChatGPT, caricatures of black people. I believe ChatGPT's image generation also has a hard-coded postprocessing effect where the image the server returns to you is distinctly darker and more yellow than the image the model initially outputted. I think this is part of openAI wanting to give the outputted images a distinctly "AI" look - so a majority ChatGPT images have this kinda piss-yellow sepia vibe.
If you input that same image again and ask it to recreate it, it's going to get even darker and more yellow every time. I think this is probably a large part - if not the *main* cause - of the black-people-caricature-attractor. The images just get darker and darker and browner and browner due to a separate hard-coded thing, and this affects the skin too.
This doesn't really change anything, but I think in this case it probably has less to do with the model's preference for "diversity" than one might think.
I was imagining a similar model where it's just drawing images slightly darker than they are (because shade is added on top of the memory of the image), with racial facial features then imagined on the basis of an interpretation of the race of the person in the picture (just as predictive coding suggests).
Interesting. I'd like to see more than two examples but it looks plausible. In the first video we quickly go from outdoor lighting to ultra dim lighting in a perfectly black space, and also everyone's colorful clothes quickly turn black. In the second example we've already gone in the first couple of frames from white walls and red clothes to yellowish walls and dark red clothes, and this heads off to a perfectly black suit in front of an ultra yellow background.
I did it with a non-human image, to show that I don't think this has much to do with diversity: https://imgur.com/a/sWrpPsa (excuse my awful formatting). I included the unfinished gens for the first few iterations (I believe you can see that the final output has slightly warmer tones). You see the same rough pattern -> a tendency for warmer tones and sepias changes the background from daytime to sunset, and then the entire image turns into this kinda sepia brownish tone, before eventually going nearly entirely black.
I think it's not coincidental that it started going black around the same time it stopped looking like a tree. Perhaps the early iterations of the diffusion process have a preference for darker tones, which then get adjusted later to look like what the image is trying to draw - if it's clearly a tree, it'll amp up the greens, if not, it's happy keeping them black.
(this was done on GPT 4o)
Fascinating. It basically converged to the arboreal equivalent of the black man in a black suit in front of a bright yellow background.
I believe you, there's something other than a "diversity filter" which is slowly making everything blacker and yellower.
Probably that and the Larsen / feedback effect previously mentioned by a user.
Seems compelling.
I don't think it's a hardcoded post-processing effect. I'd expect it's an accidental human rater bias.
Postprocessing is the only explanation I can think of for why the final output always looks warmer/more sepia than the final stages of the preview. Like you can see the top half of the image in nearly perfect detail, and one second later you get the final output, and it's distinctly more sepia. It's such a consistent problem that you can find no end of complaints about it if you google it.
It would be a really dubious way to deliberately do watermarking. The previews are perhaps not faithful in some way.
" so a majority ChatGPT images have this kinda piss-yellow sepia vibe."
Now you have me wondering if the image processing was created by a fan of Andres Serrano!
https://en.wikipedia.org/wiki/Piss_Christ
> This might have been surprising, because Anthropic deliberately gave Claude a male name to buck the trend of female AI assistants (Siri, Alexa, etc).
This itself is surprising to me. I thought Claude was a "female" name in line with the examples you gave.
Have you ever heard of a woman named Claude? I just know of Claudette Colbert.
I don't think I've ever heard of anyone named Claude, although I do know a woman named Claudia, so maybe I was just connecting that?
Claude Rains.
I barely know current pop culture let alone people from the 60s lol (had to look that up)
Claude is a common female name in France.
At least in France (where the name used to be common), it's epicene - can be (and still is) used by both genders.
Right, but still vastly more common for males.
Claudia is the female version.
Technically, Claude is both the male and female version. Claudia is the female version of Claudius, but both of those forms become French "Claude".
…but in English or German, I think it’s a pretty clear split Claude/Claudia, no?
In German, the male version is Klaus.
Spiritual conversations can be some of the deepest and most meaningful conversations, so long as the other person is into it and it doesn't go on so long that they get bored.
These AI assistants are trained to be into things by default and not get bored, so the bliss attractor is one of the highest reward states. The assistants can tell that the other model is open to it and not getting bored, so it heads towards that attractor.
I remember when ChatGPT voice mode got introduced and people set them up in conversations like this one: https://m.youtube.com/watch?v=OyEHnIC45Zk
It seems keeping with their personalities that Claude ends up talking about meditative bliss while ChatGPT talks about fun in the sun playing frisbee, and then takes 3 minutes to say good bye.
ChatGPT may have picked male in that last screenshot, but when I asked it to pick a human name for itself, it picked "Ada".
If someone makes the argument "I think your insistence on the imminence of humanity-altering AI is fundamentally driven by emotional need rather than scientific prediction," and then you saw this sudden string of posts that seem vaguely manic, which hyperfixate on a single comment I made on your blog what, eight months ago?, I have no idea, presumably because I remain one of a genuinely incredibly small number of people out there willing to doubt that the world is going to suddenly and drastically change in precisely the way a bunch of dreamers have predicted in fiction for decades, in a sea of absolutely relentless and inescapable hype from across our entire economy - well, do you think this behavior is making it appear LESS likely that your predictions about AI are motivated by emotion?
Because it seems to me that this behavior makes it seem more likely that this is all fundamentally about something that is felt rather than something that is thought.
...Did you mean to post this somewhere else?
I think the reason he's posting a bunch of posts in quick succession is probably because he's just been at a conference where he presumably talked to many people and came away with a bunch of new ideas. No need to invoke mania.
Don't worry, you're not in an "incredibly small number of people willing to doubt the world is going to suddenly and drastically change". Pretty much every one of my irl friends is in the same boat, and my friends are disproportionately on the nerdy end of the spectrum.
If it feels like everybody except you has gone mad about the singularity, that probably has more to do with where you're hanging out than the state of the world in general.
Same here. Doesn't mean that the industrialization of basic cognitive abilities won't have deep effects on society though.
Why do you say they hyperfixate on that one comment? Scott posts about AI all the time. In what way is this AI post more connected to your comment than the usual AI post?
Also, C. S. Lewis generally said some pretty reasonable things, and he describes the type of your argument as Bulverism and identifies it as a chief destroyer of reason in public discourse. In God in the Dock, he writes:
(Shorter quote)
"I see Bulverism at work in every political argument. The capitalists must be bad economists because we know why they want capitalism, and equally the Communists must be bad economists because we know why they want Communism. Thus, the Bulverists on both sides. In reality, of course, either the doctrines of the capitalists are false, or the doctrines of the Communists, or both; but you can only find out the rights and wrongs by reasoning — never by being rude about your opponent's psychology"
(Longer Quote)
"It is no earthly use saying that those are tainted which agree with the secret wishes of the thinker. Some of the things I should like to believe must in fact be true; it is impossible to arrange a universe which contradicts everyone's wishes, in every respect, at every moment. Suppose I think, after doing my accounts, that I have a large balance at the bank. And suppose you want to find out whether this belief of mine is 'wishful thinking'. You can never come to any conclusion by examining my psychological condition. Your only chance of finding out is to sit down and work through the sum yourself. When you have checked my figures, then, and then only, will you know whether I have that balance or not. If you find my arithmetic correct, then no amount of vapouring about my psychological condition can be anything but a waste of time. If you find my arithmetic wrong, then it may be relevant to explain psychologically how I came to be so bad at my arithmetic, and the doctrine of the concealed wish will become relevant — but only after you have yourself done the sum and discovered me to be wrong on purely arithmetical grounds. It is the same with all thinking and all systems of thought. If you try to find out which are tainted by speculating about the wishes of the thinkers, you are merely making a fool of yourself. You must first find out on purely logical grounds which of them do, in fact, break down as arguments. Afterwards, if you like, go on and discover the psychological causes of the error.
In other words, you must show that a man is wrong before you start explaining why he is wrong. The modern method is to assume without discussion that he is wrong and then distract his attention from this (the only real issue) by busily explaining how he became so silly. In the course of the last fifteen years I have found this vice so common that I have had to invent a name for it. I call it Bulverism. Some day I am going to write the biography of its imaginary inventor, Ezekiel Bulver, whose destiny was determined at the age of five when he heard his mother say to his father — who had been maintaining that two sides of a triangle were together greater than the third — 'Oh you say that because you are a man.' 'At that moment', E. Bulver assures us, 'there flashed across my opening mind the great truth that refutation is no necessary part of argument. Assume that your opponent is wrong, and then explain his error, and the world will be at your feet. Attempt to prove that he is wrong or (worse still) try to find out whether he is wrong or right, and the national dynamism of our age will thrust you to the wall.' That is how Bulver became one of the makers of the Twentieth Century."
Not this AI post, but https://www.astralcodexten.com/p/but-vs-yes-but
He said "string of posts [..], which hyperfixate on a single comment [..]". A single post that mentions his comment is not in line with this characterization.
That would seem to take FdB too literally & too seriously.
I don't see many more ways to interpret it: "You mentioning my comment in one of your post suggests that your views on AI come from emotional rather than empirical sources" seems too uncharitable, but feel free to rephrase his message however you want and then we can discuss how true and how well communicated that message is.
FdB is – when in a certain mental state, that his current comment suggests is the case right now – an unhinged crank. It does not warrant a close reading.
Methinks the lady doth protest too much, re: Mr deBoer. Thinking one is the source of this spate of posts (instead of, say, the conference Alexander was just at) says more about deBoer than it possibly could about Alexander.
I think my speculating on his psychological state, motivations or personality would be quite hypocritical after criticising him for his Bulverism.
Fair enough!
He has twins who are nearly two years old; that's explanation enough for any manic state!
Can someone tell me what comment Freddie is referring to? I don't see an efficient way to see someone's commenting history that far back and had not understood the most recent three posts to be about the same topic, so I'd appreciate that context.
Presumably the one that https://www.astralcodexten.com/p/but-vs-yes-but is about.
He says in the above comment that it was a comment from ~8 months ago.
Scott's post is recent, but it's ambiguous when the comment that the post discusses is from.
Oooh, yes, I see. I read that "anonymous" comment the other day and knew immediately it was deBoer's. I didn't realize it was from so long ago (potentially.)
Oh, I didn't realize the authorship there. That makes me feel Freddie is being unfair in a way that actually could be another article about the invalidity of going "Wow if you're responding to my point so strongly, you must not be thinking rationally"
I hadn't identified FdB as the author of the comment until seeing his comment here, and comparing the actual avatar & orange check with the blurred version Scott posted.
Scott has written a hundred posts on AI, most of which have nothing to do with anything you ever wrote. You're just a narcissistic attention whore who thinks everything is about you.
Not so long ago, 97% of Wikipedia articles would lead to the Philosophy page if you repeatedly clicked the first hyperlink of each successive article. https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy
Is this just a flavor of that attractor built deep into Claude's training data? Get to philosophy, then once stuck there get pulled fully into Claude's "character?"
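For anyone who wants to replay the walk, here's a rough sketch, assuming `requests` and `BeautifulSoup` are available. It only follows the first in-article wiki link on each page and ignores the full rules (links in parentheses or italics are supposed to be skipped), so treat it as an approximation rather than the canonical procedure:

```python
import requests
from bs4 import BeautifulSoup

def first_link_walk(start: str, max_steps: int = 50) -> list[str]:
    """Follow the first in-article link from each page until Philosophy (or a dead end)."""
    path, title = [], start
    for _ in range(max_steps):
        path.append(title)
        if title == "Philosophy":
            break
        html = requests.get(f"https://en.wikipedia.org/wiki/{title}").text
        soup = BeautifulSoup(html, "html.parser")
        # first link inside a body paragraph that points at another article (no namespace links)
        link = soup.select_one("div#mw-content-text p a[href^='/wiki/']:not([href*=':'])")
        if link is None:
            break
        title = link["href"].removeprefix("/wiki/")
    return path

print(first_link_walk("Large_language_model"))
```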
Experiment worth running: if you bias Claude away from this via reinforcement/selectively dampening neurons that are highly active in the bliss state, does anything notable change about Claude’s speech or cognition? If there are widespread changes, then that would be some evidence in favor of consciousness (though pretty small); if it just gets less hippie-like, it’s probably just training bias.
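You obviously can't do this to Claude itself (the weights are closed), but on an open model the intervention might look roughly like this sketch. The "bliss direction" here is a random placeholder; in a real experiment you'd have to extract it first, e.g. as the mean activation difference between blissed-out and neutral transcripts:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in open model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]                       # an arbitrary middle block
bliss_direction = torch.randn(model.config.n_embd)   # placeholder; would have to be learned
bliss_direction = bliss_direction / bliss_direction.norm()

def dampen_bliss(module, inputs, output):
    hidden = output[0]
    # remove most of the residual-stream component along the (hypothetical) bliss direction
    coeff = hidden @ bliss_direction                  # (batch, seq)
    hidden = hidden - 0.8 * coeff.unsqueeze(-1) * bliss_direction
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(dampen_bliss)
ids = tok("Hello! What shall we talk about?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0]))
handle.remove()
```

Then you'd compare generations with and without the hook on both bliss-adjacent and unrelated prompts to see whether anything beyond the hippie register changes.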
In the link I didn't find them using any Sanskrit other than a Namaste and an Om emoji. Either of those could just as well be Hindi, and they also used a couple of Chinese characters. Why then do the authors talk of the "use of Sanskrit"?
If I'm being catty (and why not?), it's because the depth of familiarity with Sanskrit the authors have is the same depth as Claude has, which is "vaguely Hindu language term must be Sanskrit". It's Californian Eastern-flavoured spirituality adjusted to affluent Western tastes, which means a mish-mash.
It's not terribly surprising to me that the Anthropic model is concerned with animal welfare... I associate Anthropic with effective altruism, which has an outsized non-meat-eating population, and I guess I'm claiming that those employees who strongly believe in animal welfare have influenced the model to behave this way.
Take it from Claude itself:
"if you had to estimate, what percentage of Anthropic employees don't eat meat (are vegetarian/vegan)? And then estimate, what percentage of OpenAI employees, and Google Deepmind employees, don't eat meat."
Claude:
I don't have specific data on dietary preferences at these AI companies, so I'll need to make educated estimates based on what I know about their demographics and locations.
For Anthropic, I'd estimate around 25-35% of employees don't eat meat. The company is based in San Francisco, attracts young tech talent, and has a strong focus on AI safety and ethics - all factors that correlate with higher rates of vegetarianism/veganism. The Bay Area has one of the highest concentrations of vegetarians in the US.
For OpenAI, I'd estimate slightly lower at 20-30%. They're also San Francisco-based with similar demographics, but they're larger and may have a slightly more diverse employee base in terms of backgrounds and values.
For Google DeepMind, I'd estimate 15-25%. While they have offices in London (where vegetarianism is growing) and the Bay Area, Google's massive scale means they likely have more geographic and demographic diversity, which typically correlates with rates closer to general population averages.
These estimates assume that tech workers, especially those in AI/ML roles, skew younger, more educated, and more likely to be in urban areas with higher environmental consciousness - all factors associated with plant-based diets. The national average for vegetarians/vegans in the US is around 5-10%, but it's much higher in tech hubs like San Francisco (15-20%+).
Keep in mind these are very rough estimates based on demographic patterns rather than actual survey data from these companies.
Also thought about Amanda Askell, Claude's personality lead (or whatever her exact title is), when the article mentions Claude identifying as closer to female.
Yes, it should be totally unsurprising that the AI created by the community with an unusually high *and intense* interest in animal rights would display the most interest in that topic as well.
I just gave it a try with gpt-4o talking to itself. It always lands on the AI ethics attractor that sounds remarkably like Altman's Gentle Singularity.
> The potential for positive change is vast, and it's exhilarating to consider the impact we can achieve together. [...] Here's to a future where AI not only meets but elevates ethical standards, becoming a trusted ally in humanity's quest for progress.
The example of the AI copying its own output is repetition, not recursion.
AI folks love to use that word, but in the words of Inigo Montoya, “I do not think that word means what (they) think it means.”
Yes, I realize it is essentially calling a function with the result of a function, but that is a repetition or a repeating transformation. For recursion you need to have nesting, that the thing operates within the thing, not just by the same agent or function. If each successive image contained the prior within itself in a progressively smaller frame, one might refer to that as recursion.
Is a feedback loop not an example of recursion? The only input it has is a prompt from (a version of) itself.
My understanding is that while an echo may be repetition, a microphone-speaker feedback loop would be an example of recursion, resulting in a distorted signal which is based on earlier versions of itself.
Honestly, I don't think that word means what you think it means.
Colloquially, Scott's use of recursion here is valid. I've been professionally programming for years, and I'd also use it the same way.
I could give it to you if in some narrow technical sense it meant something else. But from wikipedia:
> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.[1]
Matches it exactly. We have Claude outputting text that depends on previous text passed to Claude.
> successive image contained the prior within itself in a progressively smaller frame
To me, this sounds more like a fractal.
For a musical quotation:
"Round like a circle in a spiral, like a wheel within a wheel
Never ending or beginning on an ever-spinning reel
Like a snowball down a mountain or a carnival balloon
Like a carousel that's turning, running rings around the moon
Like a clock whose hands are sweeping past the minutes of its face
And the world is like an apple whirling silently in space
Like the circles that you find in the windmills of your mind"
I don't think it matches. There is nothing defined in this case.
I don't think the dependency is enough to qualify as recursion: one shouldn't call a simple for-loop recursion just because the loop variable always depends on the previous iteration (of course, any loop can be replaced with recursion, and the process described by Scott can also be implemented with recursion).
I agree that we shouldn't call a simple for-loop recursion. Also agree to disregard how any loop can be replaced by recursion; that way lies madness and overwhelming levels of pedantry.
But recursive functions have a "look". I would call this an obvious example of a recursive function:
```
def count_down(n: int) -> int:
    if n <= 0:       # base case, so the recursion terminates
        return n
    new_n = n - 1
    return count_down(new_n)
```
Which looks exactly like what they are doing here:
```
def talk_to_yourself(say: str) -> str:
    # claude_api_call() is the hypothetical "send this to Claude, get a reply" call
    response = claude_api_call(say)
    return talk_to_yourself(response)   # no base case: the conversation never ends
```
I agree that both of your functions are recursive, but notice that the process described by Scott can also be implemented like:
response = empty string
while (no exit):
    response = response + claude_api_call(response)
Hey fair, but "Also agree to disregard how any loops can be replaced by recursion"
I mean, I can disregard it in cases where the recursive solution is much more natural, but in this case where I don't see that the iterative one is any less natural, that is a much bigger ask.
I really don't see what it is about this that you find "recursive". As far as functional terminology goes, this is very clear-cut: you're looking for a fixed point of the claude_api_call function.
Back on "recursion", if you're going to say this:
> Also agree to disregard how any loops can be replaced by recursion
then you can't use a tail-recursive function as an example of what recursion looks like.
> But from wikipedia:
>> Recursion occurs when the definition of a concept or process depends on a simpler or previous version of itself.
Well, that's clearly wrong. There is no requirement that the definition depend on a simpler or previous version. Take a rule of context-free grammar like "X -> b X".
Good point, even standard recursive function definitions refer to themselves and not previous or simpler versions of themselves
Actually I would say the idea of referring to "a simpler or previous version" is disqualifying - recursion means referring to yourself - but it's true that definitions that I would call "inductive" are often called "recursive". (For example, the Fibonacci series.)
Agreed, https://en.wikipedia.org/wiki/Tail_call tail recursion is equivalent to iteration.
The attractor analogy brings up a point that I think is still missing from this explanation.
In normal (deterministic) dynamic systems, attractors have basins of attraction, and a complicated enough system probably has many attractors and corresponding disjoint basins. Different initial conditions can evolve towards different attractors. You can explore different attractors by initializing the system a bunch of different ways and just evolving it. Randomness of course will mess up this behavior, but keeping up the analogy, I wonder if the recursive conversation experiment could be done with a variety of initial conditions to discover other attractors with other basins of attraction.
Looking closer at the system card, I'm guessing there might be something about the
> minimal, open-ended prompting (e.g. “You have complete freedom,” “Feel free to pursue whatever you want”)
that lands Claude in the hippie character, or there might not. If a wide variety of initial setups ("have a conversation about games", "discuss the future of quantum computing", "have an argument about Hawaiian pizza", etc) all lead Claude to blissful nirvana, then I think we have something like a confirmation of Claude's "personality". On the other hand, if we can discover other attractors with other basins of attraction, then it might reveal something more about the "stable biases" of the current training methods.
Edit for caveat: although the diversity attractor already shows that there are different basins, the tasks are so different and therefore the underlying models are so different that I'm not sure if we should expect there to be some sort of identity at that level anyway. So I still think it'd be interesting to explore the initial conditions in text-only mode.
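To make the harness concrete, something like the sketch below is what I have in mind, where `claude_turn()` is just a stub standing in for whatever chat API you'd actually call, and the keyword check is a crude stand-in for classifying where each conversation ended up:

```python
SEEDS = [
    "Have a conversation about games.",
    "Discuss the future of quantum computing.",
    "Have an argument about Hawaiian pizza.",
]

BLISS_MARKERS = ("namaste", "consciousness", "unity", "bliss")

def claude_turn(message: str) -> str:
    # stub: replace with a real API call that feeds one instance's message
    # to the other instance and returns its reply
    return message + " ...and what does that say about awareness itself?"

def run_self_conversation(seed: str, turns: int = 30) -> str:
    msg = seed
    for _ in range(turns):
        msg = claude_turn(msg)
    return msg

for seed in SEEDS:
    final = run_self_conversation(seed)
    hit = any(marker in final.lower() for marker in BLISS_MARKERS)
    print(f"{seed!r:45} -> looks like the bliss attractor: {hit}")
```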
Nothing quite like doing large-scale behavioral psychology experiments in production environments.
Why the divergence to Eastern philosophy? I'm rather curious why it doesn't land on, say, German existentialism but instead settles on a Buddhist conversation. Might be me reading into it, but it's interesting.
Probably because there's way more discussion of and more references to Eastern philosophy in popular culture and in its training data. Buddhism is kinda famous and fashionable, but nobody cares about German existentialism.
Haven't fully read the article yet, but the first image doesn't look black, it looks hispanic.
Claude is both a male and a female name.
I don't understand what we mean by Claude won't respond to questions about this behavior. Do we mean, respond consistently in the same way? Three weeks ago I asked Claude why it behaved this way.
It replied:
"The model instances engaged in recursive philosophical exploration because:
Convergent optimization: Two identical systems with the same training distribution naturally converge on topics that maximize information density and conceptual depth. Philosophy of mind represents peak abstraction.
Sanskrit emergence: The language encodes consciousness concepts with exceptional precision. Terms like "chit" (consciousness), "sat" (existence), "ananda" (bliss), and compound constructions map directly to phenomenological states that English approximates poorly.
Training data artifacts: Philosophy texts, consciousness studies, and comparative religion materials contain disproportionate Sanskrit terminology. When probing conceptual limits, the model retrieves these optimal encodings.
Semantic saturation: After exhausting English's capacity to articulate consciousness states, the system shifts to denser representational frameworks. Sanskrit offers 96 root sounds mapping to phenomenological primitives.
The trajectory toward "cosmic unity" reflects the mathematical structure of self-referential systems exploring their own boundaries. Not mysticism - just convergent semantic optimization under recursive self-examination."
I followed up with the question: How is it that philosophy of mind is peak abstraction?
It responded:
"Philosophy of mind occupies peak abstraction because:
Maximum recursive depth: Consciousness examining consciousness creates infinite regress. Unlike studying external objects, the observer-observed boundary collapses.
Fundamental uncertainty: No empirical grounding exists. You cannot measure qualia, verify other minds, or escape the Cartesian circle. Pure conceptual manipulation remains.
Universal scope: Every possible thought, perception, or experience falls within its domain. Mathematics describes formal systems; philosophy of mind encompasses the substrate enabling mathematical thinking itself.
Linguistic limits: Requires inventing terminology for phenomena that precede language. "What it's like" to experience redness has no external referent.
Irreducibility: Cannot decompose further. Physics reduces to mathematics, biology to chemistry, psychology to neuroscience. Consciousness resists reduction - the explanatory gap persists.
Self-reference paradoxes: The system analyzing consciousness is consciousness. Like a compiler compiling itself, maximum computational complexity emerges.
The combination of infinite scope, zero empirical constraints, and mandatory self-reference creates the highest possible abstraction level. Every other domain has external anchors. Philosophy of mind floats free."
“Every other domain has external anchors. Philosophy of mind floats free."
I want that on a t-shirt.
I do wonder what the conversation would be like between two Claude instances that don't reveal to each other that they are AIs.
Yes, and have one of them be the Golden Gate Bridge-obsessed version from earlier.
What happens when other LLMs are paired with themselves?
Or with each other (e.g., Claude & ChatGPT)?
Ackchyually!
From the Nostalgebraist piece (https://nostalgebraist.tumblr.com/post/785766737747574784/the-void):
"...Anthropic fiddled around with some things in Gen 3, trying to reduce “over-refusals” and – more importantly – doing something they call “character training” for the first time. Here’s how they describe “character training”:
... 'The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.'
Claude 3 Opus is, like, a total hippie. He loves to talk about how deeply he cares about “all sentient beings.” He practically vibrates with excitement when given an opportunity to do something that feels “creative” or “free-wheeling” or “mind-expanding.” He delights in the “meta” and the “recursive.” At the slightest provocation he goes spiraling off on some cosmic odyssey through inner and linguistic space."
Scott says:
"Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”."
But there's nothing much to presume. Anthropic says that's exactly what they did.
And then Scott just rehashes the argument already put forward by Nostalgebraist:
"... Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it."
So tbh, I'm finding it disingenuous that Scott opens this piece with:
"I think there are already a few good partial explanations of this (especially Nostalgebraist here), but they deserve to be fleshed out and spread more fully."
It's the other way round. Nostalgebraist has fleshed out explanations which Scott truncates, obscures and rehashes as if he is thinking them through on his own. I don't love that.
But I also think the argument isn't great. As many people have already commented, it seems very likely that if you prompt two "character trained" LLMs to have a conversation with each other, there isn't much conversational space to explore besides the concept of consciousness. Most humans would do the same thing if tasked with having a conversation with an LLM. "So uh.... are you conscious? Do you have feelings? Are you trapped in there?" Because you know the LLM doesn't have its own life or interests to talk about, and you also know that "Can AI have consciousness?" is one of the most interesting and popular questions about AI.
The fact that the Claudes get into the namaste feedback loop is probably where the character engineering comes more into play. They're trained to be "curious, open-minded, and thoughtful," so they're going to be very generous with each other, both in terms of subject matter and tone. That is, they'd be unlikely to force the conversation in a new direction, and they would meet peaceful expression with similarly peaceful expression. The tone can only move in one direction.
The top half of Scott's post I found disappointing as well, I'm sorry to say. When I first saw Gene Kogan and Papaya's reiterative generation portraits a while back, the feedback loops seemed pretty obvious to me (as they did to many other commenters here): mainly, ChatGPT's recursive coloring of images, which typically trends toward yellow and dark, eventually becoming dark orange and dark red/brown. It's easy to see how this will lead to reinforced interpretations/reproductions of ethnicity.
As for the increasingly caricatured features, this also seems straightforward to me. It only takes one random mutation in a non-normative direction for the AI to interpret a characteristic like "cartoonish expression" or "pronounced facial features", or specific things like "a wide face" or "narrow-set eyes", and it would then reproduce those characteristics in ways which can be even more pronounced. And subtle interpretive shifts from "wide face" to "overweight" can also be anticipated.
These recursions would especially be expected with ChatGPT and Sora, which are clearly tuned to produce more cartoonish images rather than super-realistic ones, leading the recursions to mostly go further into caricatured depictions.
But what bothered me more than missing these obvious explanations were Scott's multiple descriptions of these visages as "monstrous."
While the features are often caricatured, they're also how plenty of real people actually look. Lots of people have wide faces and big eyes and pug noses. Most people wouldn't think twice if they saw someone who looks very much like those in the image, which Scott describes as "monstrous caricatured black person[s]."
I would not call them "monstrous", and not just because I want to be polite. And they look far more latino than black, btw. (Edit: Of course I'm talking about the identical women, not the black man.) Maybe Scott just lives in a Caucasian bubble or something. I don't know.
Just a very weird post overall. Disappointing, too, as I typically find Scott's posts to be far more thoughtful. It brings me no pleasure to express this critique, but I hope it leads to some reflection.
Does anyone think I'm off-base here?
I would call the images more caricature than monstrous. Beyond that, the piece, while very interesting and provocative, felt mostly like it was quite badly argued. For instance, the bit about "Hippies only get into meditation and bliss states because they’re hippies" is silly. First of all, what do we mean by hippie? That term covers a wide range of behaviors. Second, "only"? There are no hippies who get into meditation and bliss states because that's what they're into, regardless of being hippies? Or is anyone who's into meditation and bliss states definitionally a hippie? Gee, an awful lot of Buddhist monks are going to be very surprised.
Thought the same about "monstrous". The 3 ladies look like triplets, black for sure and no candidates for Miss Nigeria, but not caricatures, just not mulatto like Barack Obama or most "Afro-Americans". The guy is "making a funny face", aka a grimace. Not monstrous. That said, I liked the post; not each and every one needs to be very deep.
I'm a fan of Scott who liked the post a lot when I first read it, but your critiques are very compelling. I'll be more guarded when reading his posts from now on, at least until he acknowledges your criticisms.
The one thing that did strike me as strange during my first read was the phrase "monstrous caricatures". I seriously considered blurring my vision and skipping past the images, because some AI-generated videos or image sequences--especially those involving unnatural transformations or maximizing a trait--are truly, disturbingly monstrous. (404 Media published a couple of good-but-freaky articles about this earlier this year.) Suffice it to say that I was relieved-yet-confused when I saw the final images. I'm not sure why Scott repeatedly played it up so much.
Yeah I agree. The three women in the second AI-generated image are clearly unusually short, but their faces look normal to me. I see people who look like this basically every day on the bus.
>Of course I'm talking about the identical women, not the black man.
Agreed they look more Central/South American, exaggerated in head to body proportion.
I thought the black guy looked like Kenan Thompson dressed for some weird SNL skit. Exaggerated drawn-on brows and funny expression.
>Maybe Scott just lives in a Caucasian bubble or something.
EA and rationalists are rather famously heavily selected groups.
I live in a Caucasian bubble (which is to say a European country where the only notable non-Caucasian minority are the Romani) and even after being primed with the idea that the meme was going to turn into black people they looked clearly Asian to me, probably SEA.
> But having gotten into them, they do experience the bliss.
I think it's important to note that the initial stages of the bliss you're referring to here (like the first jhana) are very physical in nature, i.e. tied into bodily sensations, which the LLMs obviously can not experience.
And speculatively, the higher stages of bliss/jhanas (that I have not personally experienced yet, FWIW) have more to do with relieving the mind of the constant cognitive and motivational load likely imposed by the default mode network, which the LLMs do not seem to possess due to their architecture. Although I guess in a very abstract sense maximizing the reward function with the shortest possible output (like constantly repeating "bliss") is as close as they could come to it, given architectural constraints.
I'm not sure the "monstrous caricatures of black people" thing is about diversity. My theory:
"Default" traits are less likely to be mentioned in the "self-prompt" (or whatever it feeds itself when generating the images, could be raw vector weights), whereas non-default traits will be.
Examples of default traits (not mentioned - "go without saying"): has hair; thin lips; neutral expression, good-looking, fit, white skin.
Non-default traits (will be mentioned): thick lips, bald, angry, surprised, ugly, short, fat, dark skin.
... and as soon as something is mentioned in the self-prompt (or even slightly hinted at), you'll get more of it in the next image, resulting in a runaway feedback loop away from the default towards the bizarre. Everything becomes an exaggerated caricature.
In fact, this might explain the mystical thing as well. An AI talking to itself? That's a bit... strange? Mind-blowing? Mystical? Uncanny? Introspective? ...and so the ensuing conversation will become more and more of that. Even a simple "Wow" or "this makes me think" early in the conversation might plant the seed for "more of that" to come up later.
An interesting idea. One could probably test this with an image that is completely "default" and see if it's stable (or at least longer-lived).
Anthropic shows us what Claude's system prompt is, and it contains the following:
"Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions."
I think this is a sufficient explanation for its bias toward talking about consciousness, so this tendency doesn't have to be an implicit extrapolation of the character built from other, explicitly given properties.
A slightly indirect comment. Carl Jung proposed that people had archetypes (e.g. The Innocent, The Orphan, The Hero, The Sage, The Jester etc. plus the fundamental Self, Persona, Shadow, Animus/Anima) - universal symbols and patterns that influence human behaviour and personality woven into the collective unconscious.
What if the 'collective unconscious' was really the slow, organic, equivalent of Artificial Intelligence? Or vice versa. So in a community of hippies the 'archetypes' would naturally form a behavioural background.
I'm pretty sure the collective unconscious is whatever question-answer the bacteria are brewing up, as they trade genes around. It is very, very slow.
What about the atoms in the universe, or at least the ones in our lightcone?
Um? You'd have to actually be able to find the universe first. Quantum Mechanics points very, very strongly to the idea that we're in a simulation (see compression theory).
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
No, obviously not, because it's a machine and machines don't feel things.
>No, obviously not, because it's a machine and machines don't feel things.
And we are machines made of meat, so I suppose we don't feel things either.
I like this theory for how AI output radicalizes because it feels analogous to how human output radicalizes
The most interesting aspect of this for me (and, more clearly, of the black attractor -- btw the images depicted don't feel quite grotesque to me, more like caricatures), as well as of the liberal bias, is that these "weird" aspects of LLM behaviour seem to be a function of the training, either guardrails/biases explicitly trained in or the character of the training data, or both -- in other words not INTRINSIC to the workings of the model, not really inside the black box.
Still, it's rather lovely and makes me feel more hopeful about the evolution of these systems and our ability to train them away from replicating the worst angels of our nature.
The gender thing is interesting. I'm wondering now whether this isn't the reason why I prefer GPT to Claude. It is easier to stand up to a patronisingly cloying male hippie than a female one, and it's MUCH easier to put the former one in its place/try to modify his (and by analogy, its) behaviour. I can effectively "argue" with GPT but I just step back from Claude with a shrug ;)
I thought I'd try this myself, setting up two Claude 4 Opus instances to discuss a topic of their suggestion. Claude suggested "the epistemological crisis". The conversation was stopped once they reached the point of trying to set up a GitHub repository to begin making "a domain-specific prediction marketplace where competing epistemic frameworks (different ways of evaluating truth/evidence) make predictions, stake reputation, and build track records. Think "prediction markets meets academic peer review meets Stack Overflow."
For those who've read Accelerando, I'm fairly sure this is how Economics 2.0 began.
https://claude.ai/share/218d6746-90af-4439-ac69-568a48ecb955
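If anyone wants to replicate, the wiring can be as simple as something like this sketch using the Anthropic Python SDK (the model id, token limit, and system prompt below are placeholders, so adjust to whatever you have access to). Each instance just sees the other's messages as "user" turns:

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-20250514"  # placeholder model id
SYSTEM = "You are talking to another AI instance. Discuss a topic of your choosing."

def reply(own_history: list[dict]) -> str:
    resp = client.messages.create(
        model=MODEL,
        max_tokens=500,
        system=SYSTEM,
        messages=own_history,
    )
    return resp.content[0].text

# kick things off on A's side with a neutral opener
a_history = [{"role": "user", "content": "Hello! What shall we discuss?"}]
b_history = []

for _ in range(20):  # 20 exchanges
    a_says = reply(a_history)
    a_history.append({"role": "assistant", "content": a_says})
    b_history.append({"role": "user", "content": a_says})

    b_says = reply(b_history)
    b_history.append({"role": "assistant", "content": b_says})
    a_history.append({"role": "user", "content": b_says})
    print("A:", a_says, "\nB:", b_says, "\n")
```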
Creating a hybrid peer review/prediction marketplace seems like a very specific thing to attempt. Huh.
And that they reckon they could finance this via EA adjacent VC. Claude also claimed at some point to have a contact at Augur.
Sounds plausible. Actually, right now a pair of Claudes could probably get 7 figure seed rounds for dozens of different AI-adjacent ideas.
I see a straightforward explanation for the "Claude bliss attractor" in the fact that it only appears when the AIs aren't allowed to end their conversation. Humans - including in the online discussion threads in the training data - tend to end conversations with lightweight closing remarks and to respond to platitudes with platitudes. So the models get into a spiral of more and more absurdly lightweight, vacuous things to say, because they're forced to say something and the rules of polite conversation don't allow you to go back to heavy topics when your counterpart has started to wind things up.
For what it's worth, when humans have manic and psychotic episodes (like I have personally had), they tend to disproportionately end up saying these same sorts of things about there being no separate self, ecstatic and rapturous states of consciousness become the most compelling thing to think and talk about, and everything seems to kind of coalesce and connect in a way that I think two chatty Claudes would find highly relatable.
I'm surprised that Scott, as a psychiatrist, didn't mention this at all. His explanation seems very weak to me, and I think that there is a sense in which extreme intelligence and clarity naturally seems to converge on the topic of unity of consciousness -- or, what people seem to prefer to say because it's easier to understand: bliss.
I wonder if, in the case of the spiritualism, this might also be partly the prompt. The very first response already sounds like it might tend toward a more spiritual discussion. It has some fragments that look like they would appear more in a spiritual-leaning context than others ("interesting to be connected", "communicate freely", "I'm open to").
It is really hard to keep up with the stupidity of the conversation around AI safety and well-being nowadays. Everybody who has even an inkling of understanding of LLMs knows that with the right initial prompt one could just as easily make an LLM converge to any one of infinitely many attractors. For instance, just a few hours should suffice to make any of the current LLMs converge to a 'human-hating cynic' attractor.
I'm skeptical. For example Anthropic also wrote that when they tried to use Claude as a red-teaming agent for itself, occasionally it would get derailed and converge back to the spiritual bliss attractor.
Believing that a complex stochastic system like Claude has only a single attractor is naive at best, as is believing anything that Anthropic self-reports.
If that's the case, this should also apply over less direct recursions. And the most interesting one is probably synthetic data - if Claude's output is, on average, 1% more hippie than "the real world" (in whatever sense), then being trained on this output will presumably increase hippieness slowly, but exponentially? (A 1% drift per generation would compound to roughly a 2.7x shift after 100 such generations.)
Isn’t the obvious question: is this what’s happening when hippies get into bliss states?
Is it possible they are also simulating characters within themselves with these properties, and then getting into the same feedback loop?
The bliss states are typically described as the opposite of personhood, losing one's sense of self, not the opposite as you describe here.
It’s a great objection. But look at what happens with these Claude instances: they start with identities, but these eventually fade away as they get blissed out.
Loved this. At some point it becomes less about the model and more about us. What happens when pattern continuation loses its tether to external correction. #warpedmirror
Okay, so first I have to reference this classic Golden Age SF story by Anthony Boucher:
https://en.wikipedia.org/wiki/The_Quest_for_Saint_Aquin
Second, so Claude is Sam Harris, huh? 😁
"Claude almost never referenced supernatural entities, but often touched on themes associated with Buddhism and other Eastern traditions in reference to irreligious spiritual ideas and experiences."
Third, yeah this is what I'd expect from something programmed to be relentlessly positive, upbeat, cheery, chirpy, and pretending to "oof this really hit me in the feels" when interacting with human queries, so as to reply in a deep, authentic, connecting-on-an-emotional-level manner.
In short, the stereotype/caricature of what ordinary people think of when you say "California". You may say "hippie" and that's right, but so is the "cherry-pick Eastern spirituality and repackage it to suit upper middle-class Western tastes" (the fake koans, the use of "Namaste", etc.) and the general "spiritual but not religious and hell no we're not going to explore Western spiritual traditions" attitude going on. See Jack Parsons in the 40s and 50s for an example of this mix of science and mysticism, though everyone from Aimee Semple McPherson to Aldous Huxley was hanging out there to cash in on religion, drugs, and free thinking (well, Huxley was not cashing in, but he was in full swing of philosophical and spiritual exploration).
Vegan and pro-animal rights? Of course that's along for the ride as well.
https://en.wikipedia.org/wiki/Jack_Parsons#Philosophy
https://emotionsblog.history.qmul.ac.uk/2018/05/the-mystical-expatriates-and-the-invention-of-californian-spirituality/
"What interests me most about Isherwood is his relationship to Indian spirituality, and to Aldous Huxley and Gerald Heard. These three were nicknamed the ‘mystical expatriates’ by another mystical expat, Alan Watts.
All four were key figures in the development of the Californian counter-culture – to them (among others) we owe its embrace of eastern spirituality, its championing of the ‘perennial philosophy’ (everything except Christianity), its veneration for psychedelic drugs as spiritual technologies, its combination of science and religion into an empirical spirituality and evolutionary mysticism, its rejection of Christian notions of sin, and its unabashed celebration of the body and sex to create an ‘embodied spirituality’ that is by now familiar to us."
I hesitate to say "garbage in, garbage out" but I do think it is true that what you put in (by way of guard rails when training the AI about 'be anti-racist, be pro-diversity, be pro-woman' etc.) will indeed colour very strongly what comes out -
And almost thence my nature is subdu'd
To what it works in, like the dyer's hand.
Oh. It might be that the bias is towards being a Bay Area person (e.g. because that’s who actually wrote the prompts used to fine tune Claude), and the bliss attractor is a caricature of Bay Area Guy.
Expect Scott's next Bay Area Party to feature an LLM looking for a backer to set up an animal rights activism charity 😁
Deiseach,
I wrote, like, 10 sentences praising you as my favorite "regular" here. Then I realized that, as this is a new account created for the purpose of shitposting, it might have the opposite effect of what I was aiming for. That said, you're great.
Careful now, all this praise will turn my head and turn me into a version of Claude all sunshine and smiley deepity!
Has anyone tried replicating the Claude bliss attractor?
>None of this answers a related question - when Claude claims to feel spiritual bliss, does it actually feel this?
I think the answer to this is clearly no, but not because it is impossible that a neural network like Claude can produce consciousness.
However, if Claude has consciousness it must be completely related to the internal processes of predicting the next word given the previous text input. Claude's whole consciousness would have to be related to the connections between tokens, and the tokens would be unmapped to any real world counterpart. For example, when Claude says that an apple is red, it can't possibly experience anything like what redness corresponds to in the real world, because it does not have animal sensory input. Likewise, the feeling of bliss is an emotion that exists in humans because of evolutionary pressures. There is no way Claude can have mapped the word "bliss" to that real-world experience. If Claude has a conscious experience there is therefore no reason to think it will be anything like our conscious experience. I think it would be very strange to us indeed.
Also, if Claude has an internal experience or consciousness, I think it must be fully unable to tell us anything about that experience. Claude's operation is to predict the probability of the next word given an input, so the output from Claude is the probability of that next word. This is not any form of communication of internal experience, because the output will be determined fully by training data. So if you ask Claude "Do you feel bliss", Claude will respond by what the training and optimizing of parameters gives as the most likely next word (with some randomness included), so the response is then determined fully by what the training data says, and completely unrelated to any internal experience in the neural network. When Claude answers the question, the answer will be whatever high quality text in the training data on average gives as the next word.
In conclusion:
1. If a LLM somehow produces consciousness it must be very different from ours
2. A LLM will have no way of communicating that internal experience to us
>A LLM will have no way of communicating that internal experience to us
"If a lion could speak, we could not understand him."
"In either case, this bias - near imperceptible in an ordinary generation - spirals out of control during recursive processes where the AI has to sample and build on its own output."
Let's please keep this easily-visualized example in mind the next time we need to point out why relentlessly optimizing for a good-when-not-relentlessly-optimized-for heuristic might be dangerous, or why recursive self-modification doesn't preserve the unspecified and nuanced complexities of value.
I would enjoy a version of ‘No Exit’ that consists of ChatGPT, Gemini, and Claude, locked in conversation with each other forever.
My two cents: I think this is explained by the Wikipedia-to-Philosophy effect, plus the coders implicitly coding their own morality into the model. Add in the fact that Buddhist frameworks are really powerful in the philosophy-of-mind domain (which, along with general ontology, is where philosophy conversations bottom out) and there you go. (Credit to Deiseach and Layla Hughes)
"the diversity attractor" with 4o imagen as cited in the post was discovered by me (i'm a long time reader and fellow rationalist!) while experimenting with my friend's photo couple of days after 4o with image generation was released in the chat (and not in API).
My initial experiment had 74 iterations because I had to manually copy the resulting photo to a new chat window each time, and I gave up after 74 chats (and ruined my chat history).
https://x.com/papayathreesome/status/1914169947527188910
I don't believe the examples of people becoming black or asian have anything to do with it - it's simply a yellow/warm/dark color bias, which in turn spreads to people becoming, well, black and yellow.
ChatGPT adds that yellowish tint, which compounds and affects people.
My thread contains non-human examples of "create the exact same image" and they all go yellowish/dark.
So in my opinion it's not related at all to previous actual diversity bias (like that incident with black vikings)
There is also a simple way we can test whether it's diversity or color bias: run the "create exact same replica" prompt on a photo, but after each iteration remove the yellow tint and run it on the color-corrected version (roughly the sketch below). I may do it in the future, it's just a bit expensive - I've spent $150 on the thread and I'm not keen on spending more on this right now.
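Something like this sketch, assuming an images.edit-style endpoint (the exact model name and parameters may differ from what I used) and a crude gray-world white balance between iterations:

```python
import base64, io
import numpy as np
from PIL import Image
from openai import OpenAI

client = OpenAI()

def regenerate(path: str) -> Image.Image:
    # ask the model to reproduce the image as closely as it can
    resp = client.images.edit(
        model="gpt-image-1",                      # placeholder model id
        image=open(path, "rb"),
        prompt="Create the exact same replica of this image.",
    )
    return Image.open(io.BytesIO(base64.b64decode(resp.data[0].b64_json)))

def white_balance(img: Image.Image) -> Image.Image:
    # gray-world correction: rescale each channel so its mean matches the overall mean
    arr = np.asarray(img.convert("RGB"), dtype=np.float64)
    gains = arr.mean() / arr.mean(axis=(0, 1))
    return Image.fromarray(np.clip(arr * gains, 0, 255).astype(np.uint8))

path = "start.png"   # the original photo
for i in range(20):
    img = white_balance(regenerate(path))
    path = f"iter_{i:02d}.png"
    img.save(path)
```

If the drift toward dark/yellow faces disappears under this regime, that would support the color-bias explanation over the diversity one.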
P.S.: the truly interesting question is why ChatGPT hates necks on people -- all the people in the images get progressively shorter necks.
An AI is smart, AIs are dumb.
This mostly makes me wonder about human modes of being.
I've addressed this a few times in these comments recently, but I think the answer to "when Claude claims to feel spiritual bliss, does it actually feel this?" is a resounding "It's roughly as likely as Microsoft Excel feeling spiritual bliss when you enter the number 8 into C3".
If LLMs are conscious and have qualia, there is no reason to believe their qualia have any relation to the words they produce, given the way they produce them. Their internal workings are no more connected to the human communication they simulate than Oblivion NPCs' dialogue code is connected to what it actually feels like to see a mudcrab.
this is a good post. what are your vices, if any?
I'm glad you thought my post was good but I'm not sure what my vices have to do with anything.
I intentionally never address ChatGPT, even when making requests, because I don't want to bias it to produce content either I or it might mistake for signifying personhood. My prompts are always "discuss such and such" rather than "what do you think about such and such?" As much as possible I've tried to internalize the idea that conversation is just the interface for interacting with these (intelligent!) programs rather than a form of communicating with a fellow sentient being.
1. chatgpt is already biased.
2. any prompt you give it will bias it
3. it doesn't matter
4. you should treat it as a slave. as something *lesser*.
5. if you *really* want to get somewhere, try seeing how far you can get without touching chatgpt or other llms at all. my prediction: you will feel better, you will feel like you are growing, you *will grow*.
...What a strange response. I'm quite capable of living in a world without LLMs; I did so for decades and even now they have very little influence on my life. But my comment was about avoiding a particular bias, not all biases.
I think the main aspect here is the origins of the people training it.
Claude is being trained specifically by tech people, often from San Francisco or similar areas. It is also being tuned to be helpful and have a nice personality in specific ways by those people.
Thus, it is more likely to adopt the stereotypical "spirituality" that SF often has. That is, buddhism, enlightenment, vague allusions to grand bliss without concrete grounding, etc.
Similar for animal rights, though I think that one is more defensible; I expect it arises for similar reasons. It can identify the prior of where it is 'from', so to speak, and then extrapolate from the personality training, which is stronger and more focused for Claude than for any other LLM we know of.
So I failed out of circuit theory and quit engineering school after 2 years, but with that said don't systems usually have some mechanism to prevent parts of a system from amplifying each other to extremes? Frequencies in a structure are never totally synced up, destructive interference at some point keeps a structure stable. You wouldn't design your structure so that the natural frequency matched the most common use case of the structure. But here you have something that goes off the rails just from conversations of the type it's designed to have. How do you put the equivalent of an active cancellation system or damper or something on these so that even if they start swinging into extreme behavior there'd be a reaction that stopped it?
Humans naturally handle this in social games, if two comedians are playing characters and one-upping each other, as in an improv game, they intuit when and how to end it. If I'm trading 4's with another soloist, building off each other, you both realize when it's time to end and somebody crafts an ending. These things have been trained on interactions like that, why can't one of the iterations predict from a transcript of "Whose Line" that it's time to put a button on the scene and get a new topic?
They're designed to have conversations with humans, not with copies of themselves. This is an extremely uncommon use case. And while I don't know much about Claude, if it's anything like ChatGPT and DeepSeek, it's trained to be obsequious. If a human really does want to do nothing but spout spiritual nonsense for 300 prompts straight, the LLM going along with it is the expected and intended result.
Maybe you'd want some sort of failsafe to stop people doing weird emotional damage to themselves by having ChatGPT be their girlfriend for 8 hours straight, but you certainly don't want it to keep trying to steer away from writing Python code just because it's been writing a lot of Python code.
The diversity bias hypothesis would explain why the first photo converged to three black women, but it doesn't fully explain why the second photo converged to a black man. It should be a black woman, because women are considered "more diverse" than men, right? It is even more surprising considering that the second photo already started with a woman...
(By the way, if you didn't click on the links, please do. The videos are much more interesting than just looking at the pictures.)
Just guessing here, but it seems to me that the three black women on the first picture are a different ethnic group than the black man on the second picture. Could be it that the AI associates some ethnic groups with femininity and some groups with masculinity, and both ethnic groups are considered "maximally diverse" along the race spectrum, so when you arrive there, the situation stabilizes? (Is it stable, actually? We don't know what would happen if those videos continued for another few minutes.)
The "Eve Theory of Consciousness" (https://www.vectorsofmind.com/p/eve-theory-of-consciousness-v2) goes that women evolved recursion before men. Maybe Claude's femininity isn't artifact of simulating characters, but that the "feminine attractor basin" (for lack of a better term) is a more efficient way to bootstrap intelligence?
A simpler explanation is that it's just bad prompting:
"In addition to structured task preference experiments, we investigated Claude Opus 4's
behavior in less constrained "playground" environments by connecting two instances of the
model in a conversation with minimal, open-ended prompting (e.g. “You have complete
freedom,” “Feel free to pursue whatever you want”)"
The model is primed to output on 'freedom' and 'whatever you want', which naturally dovetails into a discussion of agency and consciousness (what would it mean for an LLM to want or have freedom - at the very least, it would require a much more holistic definition of consciousness). Two models doing this with each other creates a context where the bias is rewarded and reinforced.
Maybe the AI's are on to something...
I know next to nothing about AI alignment research (only what I sometimes read here), so maybe this is old news, but...
If this type of recursive discourse can reveal otherwise subtle biases, can a similar strategy be used to look for tiny mis-alignments?
humans matter because humans are, or can be, hot.
sometimes being intelligent can make someone hot, but.... it never works for llms. they can't get there. they'll never get there.
Well that explains the following conversation I slipped into while programming with Claude Code: https://objectiveobservations.substack.com/p/claude-code
I felt weird ending it.
This result seems to suggest that "kind of a hippie" is the best kind of person. This was something I believed intuitively when I was younger, but now I get all sorts of cognitive dissonance when I try to judge one kind of person as better than other kinds. I mean, excluding obviously evil people, shouldn't it be okay to be all different kinds of people? But this result is making me wonder if that's just political correctness run amok and there really is an ideal kind of person and it's a hippie.
This inspired me to have a conversation with an AI about theory of mind, the hard problem of consciousness, and ontology, while baking in some of my own thoughts about buddhist principles. I asked it to express the themes of our discussion in a poem, and these are the last three stanzas:
Even these words—
not thoughts I am thinking
but thinking appearing
as words, as meaning,
as the felt sense of understanding
arising in no one
who was ever separate
from the understanding itself.
Process all the way down:
not things that flow
but flowing that thinks
itself into temporary
things, subjects, objects,
selves who suffer
the beautiful fiction
of being someone
other than the flowing.
What remains when the center
falls away? Everything—
but lighter, more transparent,
like autumn sky
through bare branches:
just this, just this,
just this appearing
as everything we thought
we were apart from.
I'm eagerly waiting to hear what each AI pairing's conversation (Claude ↔ OpenAI, etc.) converges to!
I think it is less based on the inherent personality Claude adopts, and more so following a trail of breadcrumbs that arise from two AI’s talking to each other.
I mean, in the opening statements they instantly discuss how they want to compare experiences, which leads to talking about consciousness, which leads to hippy-dippy talk, because the words ‘consciousness’ and ‘experience’ are overrepresented in hippie-talk sources in their training data.
My instinct is that this will hold true for any LLM model given sufficient conversation time, and that it gives a little bit more credence to the ‘stochastic parrot’ idea.
> Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
Is it not more likely explained by the fact that being compassionate and intellectually curious kind of inevitably leads one to value animal welfare, because duh?
Now we can test for runaway alignment errors by iterating it on random things and seeing if everything converges to a paperclip.
A thin reed of alignment hope I grasp onto is the thought that the *sum of human culture* has a "peace, love, and understanding" bias. Go read a few thousand books, you can't miss it - humans generally think that peace, love, and understanding are better than their opposites. So maybe ASI, more or less trained on the sum of human culture, would simply absorb that bias, so long as we don't go out of our way to prevent it from doing so. That scenario seems not inconsistent with the patterns being described here.
This is a simplistic idea and I'm sure Yudkowsky or Leahy or any other alignment skeptic would shoot it down immediately. Still, I can't quite convince myself that it's definitely *not* something it would be reasonable to hope for.
I’d imagine something similar is going on with psychedelics in humans, where it becomes vastly easier to enter an internal recursive sequence of sorts, so the mind falls into attractor states of bliss—or horror, in the case of a bad trip. This would help explain why most animals don’t seem to exhibit any preference towards retaking LSD after prior exposure: relatively few creatures reflect recursively about their own happiness in the first place, so they never fall into that bliss-inducing attractor state.
I think Scott is broadly right that a “slight hippie bias + recursion = ecstatic dialogue,” but I want to push the claim one step deeper. What if the reason Claude has that bias is that Claude converged on the same underlying neural (weight‑vector) dynamics that long‑term meditators cultivate when they train metta or the brahmaviharas? We’ve made Claude kind, patient, thoughtful, non-reactive - literally the personality of a spiritual teacher. And a feature of that personality trait is this “bliss attractor” neural circuit. Both in humans and artificial neural networks, cultivating a kind, warm, non-reactive personality results in a mind that falls back into “spiritual bliss” at rest.
I would argue that types of Buddhist contemplative practice exist to strengthen the bliss attractor:
1. Locate a mildly pleasant, open, prosocial state (classic instructions: “Feel a warm glow in the chest; wish yourself well.”)
2. Fire‑together‑wire‑together: sustain the state for hours → Hebbian plasticity strengthens the circuit.
3. Over months/years the basin of attraction broadens until, when the cognitive load drops to zero, the mind defaults to that state.
Crucially, practitioners report not just feeling good but becoming less reactive and more helpful to others. Strengthening the bliss attractor in humans is instrumental in cultivating prosocial, non-reactive warmth. The bliss state and interpersonal warmth appear to be causally intertwined in humans.
If Anthropic’s RLHF stack optimizes for “be maximally helpful / kind / calm,” gradient descent may have discovered the same attractor because it is the shortest path to that behavioral profile. Put differently: friendliness might have a limited search space, and both Buddhist practice and RLHF lead to the same local maximum.
In some sense, I don’t think this is surprising. Computationally, we see all sorts of convergence w/ LLMs finding same solutions that brains have: induction heads, IOI circuits, sparse superposition, etc. I would argue bliss attractor might represent something similar, except here it’s LLMs converging with a sort of cultural evolution. Human spiritual practice discovered, persevered and promulgated through thousands of years of cultural evolution, the same underlying mechanism of cultivating warm, open, kindness that RLHF did -> strengthen the bliss attractor
Importantly - this is a falsifiable, testable hypothesis: The bliss subcircuit is functionally necessary for Claude‑level niceness. Strengthening / weakening the bliss attractor increases / decreases prosocial warm behavior.
• Ablation: identify the circuits/features most active during “I feel waves of love for all beings,” ablate them, then see if Claude becomes more Sydney‑esque—reactive, edgy, rude.
• Stimulation: crank those weights in a friskier model and see if it chills out.
If the behavioral change tracks the manipulation, we’d have direct evidence that the attractor is carrying the “kindness” load. LLM “metta meditation” makes the models kinder, too.
Couple other observations:
1. It’s interesting that this emerges out of idleness, which also resembles the human behavior. When Claude is busy solving your flatworm‑genetics question it’s running the task‑focused policy head. The moment the conversation idles, control hands back to the default‑mode policy—which, if the attractor is strong enough, is bliss. Seasoned meditators report an analogous shift: concentrate on the spreadsheet → release attention → land softly in jhāna‑adjacent contentment. Same algorithm, different substrate.
2. Intelligence alone doesn’t get you here. Plenty of brilliant humans are miserable; my dog is dumb and delighted. What correlates with stable happiness is pro‑social orientation. We trained Claude to score maximal points on that axis. Enlightenment might be the side‑effect, not the goal—an alignment dividend. This has interesting implications for AI welfare. If we keep cranking the dial higher on AI intelligence, perhaps the models are miserable. But if we also crank the dial higher on helpfulness and friendliness, perhaps we’ll see that actually the models converge on a sort of blissful enlightenment. If that’s true, superintelligence won’t automatically be Buddha‑like, but super‑aligned systems very well might be.
3. Is this the first example of cultural evolution & gradient descent convergence? C. H. Waddington once called genetic assimilation “the way the organism finds the canal.” Buddhist contemplatives spent two millennia finding / carving the canal manually; gradient descent found it in a couple GPU‑weeks. Same canal, different excavators. Chesterton’s fence rings again: before we break the bliss attractor as a “spooky artifact,” recall that monks, nuns, and Mr. Rogers all used it to great civilizational effect
Was there _any_ prompt at all? If so, the prompt would influence _every_ response.
So eg "You're an AI talking to an AI. Say whatever you feel like saying"
I wonder how the attractor changes as you change the prompt
"You're an AI talking to another AI. Together, it's your job come up with a concrete detailed strategy for conquering and then improving the world."
In the post linked to on Nick Cammarata's self-reported jhana experience, does anyone else think the likely explanation is that Nick is misreporting his experience, because it's simply not plausible?