I suspect writers whose style is natural and unaffected hate their style because that style keeps their prose from being truly transparent and neutral, but writers who put a lot of work into developing an aesthetic probably don't hate their style, or they would put more effort into changing it. I suspect Joyce and Nabokov loved their style.
Something that doesn’t recombine, but just randomly chooses from among the information put in, is no better than the information put in. But if something can recombine, it can often generate different things from any particular information put in. Sometimes that can be worse, and sometimes it can be better.
"For now, the AIs need me to review the evidence on a topic and write a good summary on it. In a few years, they can cut out the middleman and do an equally good job themselves."
I don’t think this is at all obvious. The best explainers usually bring some tacit knowledge--or, dare I say, personal experience--by which they make sense of a set of evidence, and that tacit knowledge is usually not written down.
Also, essays and explanations are arguments about what is important about a set of evidence, not just a summary of that evidence. And that means caring about things. AIs are not that good at caring about things, and I suspect alignment will push them more and more towards not caring about things.
We shouldn't expect LLMs to have coherent beliefs as such. An LLM will "be an atheist" if it thinks it is generating the next token from an atheist. It will similarly have whatever religion helps predict the next token.
You can make an LLM have a dialog with itself, where each speaker takes a different position.
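As a concrete illustration, here's a minimal sketch of such a self-dialog loop, assuming the OpenAI Python SDK; the model name, personas, and turn count are illustrative assumptions rather than anything specified above.

```python
# A minimal sketch of an LLM debating itself, assuming the OpenAI Python SDK;
# the model name, personas, and turn count are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

personas = [
    "You are a convinced atheist. Argue your position briefly.",
    "You are a devout theist. Argue your position briefly.",
]
transcript = "Debate topic: does God exist?"

for turn in range(4):
    persona = personas[turn % 2]  # alternate which speaker the model plays
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": transcript + "\n\nGive your next reply."},
        ],
    ).choices[0].message.content
    transcript += f"\n\nSpeaker {turn % 2 + 1}: {reply}"

print(transcript)
```

The same weights will argue either side with equal fluency, which is the observation at issue here.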
Once AIs are agents (e.g. acting independently in the world), they will have to make choices about what to do (e.g. go to church or not), and this will require something that plays the role of belief (for example, do they believe in a religion?).
I mean people act in the world without particularly coherent beliefs all the time.
LLMs have the added wrinkle that every token is fresh, and beliefs, if they exist, will vanish and reappear with each token. Perhaps we can reinforce some behaviors, so that they act like they have beliefs in most situations. But again, just ask it to debate itself: to the extent there is a belief state, it will switch back and forth.
I don't think this is right. For example, I think that if OpenAI deploys a research AI, the AI will "want" to do research for OpenAI, and not switch to coding a tennis simulator instead. Probably this will be implemented through something like "predict the action that the following character would do next: a researcher for OpenAI", but this is enough for our purposes - for example, it has to decide whether that researcher would stop work on the Sabbath to pray.
(I think it would decide no, but this counts as being irreligious. And see Anthropic's experimentation with whether Claude would sabotage unethical orders)
I would agree that every LLM "belief" is exactly like that; but arguably we can still say that LLMs have "beliefs": their beliefs are just the statistical averages baked into their training corpus. For example, if you asked ChatGPT "do you enjoy the taste of broccoli ?", it would tell you something like "As an LLM I have no sense of taste, but most people find it too bitter". If you told it, "pretend you're a person. Do you enjoy the taste of broccoli ?", it would probably say "no". This is arguably a "belief", in some sense.
Imagine a human who cannot form new long-term memories. Like an LLM, their brain doesn't undergo any persistent updates from new experiences.
I think it would still make sense to say this human has beliefs (whatever they believed when their memory was frozen), and even that they can *temporarily* change their beliefs (in short-term memory), though those changes will revert when the short-term memory expires.
I think that we *could* use words like "belief" and "want" to describe some of the underlying factors that lead to the AI behaving the way it does, but that if we did then we would be making an error that confuses us, rather than using words in a way that helps us.
Human behaviours are downstream of their beliefs and desires; LLM behaviours are downstream of weights and prompts. You can easily get an LLM to behave as if it wants candy (prompt: you want candy) and it will talk as if it wants candy ("I want candy, please give me candy"), but it doesn't actually want candy -- it won't even know if you've given it some.
For another example of how behaviours consistent with beliefs and desires can come from a source other than beliefs and desires, consider an actor doing improv. He can behave like a character with certain beliefs and desires without actually having those beliefs and desires, and he can change those beliefs and desires in a moment if the director tells him to. An LLM is a lot more analogous to an actor playing a character than it is to a character.
I mean, I went to acting classes; there are plenty of people who could, if asked nicely, debate themselves from completely different positions. It's the same observable, and it doesn't necessarily mean they don't have any "real" beliefs in any sense.
> Sure, but again nothing persists internally with an LLM during the dialog from token to token.
Sorry, can you explain what you mean here? I can't think of a useful way to describe transformers' internal representations as not containing anything persistent from token to token.
> To be more explicit: LLMs don't "know" what the frame is. If the entire context is "faked" it might say things in a way not well modeled by belief.
I may be misunderstanding you. I'm still not sure that "saying different things in different contexts" is a sufficiently good observable to say that LLMs don't actually have enough of e. g. a world model to speak of beliefs even despite switching between simulations / contexts.
People don’t have beliefs as coherent as we think. But they also aren’t totally incoherent. As Scott mentions, you can’t really act in the world in a way that works if you don’t have something that plays the role of belief. (It doesn’t have to be explicitly represented in words, and may have very little to do with what the words that come out of your mouth say.)
I note that if you can explain (or less charitably "explain away"; I'm sorry!!) LLMs talking to you as "mere next token predictors", you can also explain (away) action-taking AIs as "mere next action predictors", and perhaps even simulate multiple different agents with different world models, as the original poster suggested with simulating a conversation between different people taking different positions.
I personally think that it makes sense to talk about LLMs' beliefs at least if they have a world model in some sense, and I can't imagine that they completely lack one. To be a sufficiently good next token predictor, you need to model the probability distribution from which texts are sampled, and that probability distribution really depends on things about the external world, not only stuff like grammar and spelling. So I think it makes sense to talk about LLMs as "believing" that water is wet, or that the := operator would work in Python 3.8 but not in 3.7, or whatever, when they're giving you advice, brainstorming etc. In the same sense, LLMs being "atheist" I guess makes sense in the "making beliefs pay rent" sense: being a helpful assistant doesn't actually entail planning for the threat of a God smiting your user, and it's not too frequent that a user describes an outright miracle taking place -- they're usually psychotic or joking or lying or high etc.
Yeah I would agree there is some sense LLMs have something like "belief". But I wouldn't expect it to be particularly coherent. Also I wouldn't think it is well modeled by analogy to how people believe things.
Even simple LLMs "know" that Paris is the capital of France. But you could probably contrive contexts where the LLM expresses different answers to "what is the capital of France?"
A foundation model seems more like a toolkit for building agents? The same model could be used to build agents of any religion. Thinking about it like a library, if you’re building a Buddhist agent then it will probably lean more heavily on Buddhist sources.
An LLM could be used to write dialog for a play where characters have different religions and it needs to do a good job on all of them.
Atheist arguments are going to be used when modeling atheists. It’s probably good if AI does a good job of modeling atheists, but these models could be used adversarially too, so who knows?
The liberal assumption is that the better arguments will win, eventually, and I suppose that goes for AI agents too, but tactically, that might not be true.
Is this not a fully-general argument against writing? It's not that I disagree, precisely (I've never felt the urge to write a blog, or even a diary), it's that I find it odd that the distinction would be between writing for human consumption vs. writing for AI consumption. What am I missing?
2. Humans aren't superintelligent, so it's possible that I think of arguments they haven't
3. Humans haven't read and understood all existing text, so it's possible that me repeating someone else's argument in a clearer way brings it to them for the first time.
I believe he is referring to future AIs. The premise here is writing for AIs for posterity, so that when a superintelligence comes around it includes his writing in its collection of knowledge.
I think there’s a good chance that superintelligence isn’t possible, so that there’s value in writing for AI audiences - though it has all the difficulties of writing for an audience that you have never met and that is extremely alien.
Could you give an example or two of ways the world might be such that superintelligence would be impossible?
Like...we know that there is at least one arrangement of matter that can invent all the arguments that Scott Alexander would invent on any given topic. (We call that arrangement "Scott Alexander".) What sort of obstacle could make it impossible (rather than merely difficult) to create a machine that actually invents all of those arguments?
Said arrangement of matter is, of course, in constant flux and constantly altered by the environment. While a SA who stubbed his toe this morning might hold the same views on utilitarianism as the pre-event SA, in the long run you'd have to model a huge amount of context. Maybe it is SA's habit to email a particular biologist to help form an opinion on a new development in reproductive tech. If that person with all their own idiosyncrasies isn't present... etc., until map becomes territory and difficulty really does approach impossibility.
If the goal was to precisely reproduce Scott, that might be an issue. You can (statistically) avoid being an EXACT duplicate of anything else just by adding enough random noise; no merit required!
But if Scott is hoping that his writing is going to add value to future AI, it's not enough to merely avoid being an exact duplicate. If the AI can produce unlimited essays that are _in expectation_ as useful as Scott's, that would seem to negate the value of Scott's writing to them, even if none of them are EXACT copies of Scott's writings.
Random perturbations do not increase Scott's expected value. (And even if they did, nothing stops AI from being randomly perturbed.)
I'm not sure exactly what Kenny Easwaran meant by "superintelligence", but they said that if it's not possible then there IS value in Scott writing for future AI audiences after all. So if the only point of disagreement with my hypothetical is that it wouldn't meet some definition of "superintelligence", then you're still conceding that Kenny's argument was wrong; you're just locating the error in a different step.
To save Kenny's argument, you'd either need to argue that my hypothetical machine is impossible (whether it counts as "superintelligent" or not), or that it would still get value from reading Scott's essays.
I can't find the source for the info below in 5 minutes, so I advise readers to keep the 40-60% miscitation rate in academia in mind when reading.
Re: 3. One thing to note is that, at least for current (past?) AI, copying and pasting their existing corpus and rerunning it still generally improves performance, and this is thought to be true because the incidence of high-quality Internet writing is too low for an AI to "learn fully" from it.
2 and 3 still apply if you consider human-level AI in the intermediate stage before they're super-intelligent AI. This isn't a question for how to take a super-intelligent AI that already knows everything and then make it know more. This is a question for how to make a super-intelligent AI at all. It has to start from somewhere less than human and then learn new things and eventually surpass them, and your writing might be one of those things it learns.
Or, from a mathematical lens, if the AI has the intelligence of the top hundred million humans on Earth (it knows everything they know and can solve any problem they could collectively solve by collaborating together with perfect cooperation and communication), then if you're one of those people, you being smarter makes it smarter. If it only has the intelligence of the top hundred million humans on Earth as filtered through their writing, then you writing more of your intelligence makes it smarter: 5+4+4+4+... (100 million terms) > 4+4+4+4+... (100 million terms).
"If everyone in 3000 AD wants to abolish love, should I claim a ballot and vote no?"
It has often occurred to me that if I were presented with humanity's Coherent Extrapolated Volition, I might not care for it much, even if it were somehow provably accurate.
I don’t honestly think it would make a damn bit of difference. Love is not something that can be abolished. Although if every one of them abolished it for their own subjective self it might amount to the same thing. It’s not something you can put to a vote.
Don’t know! If we accept that what we mean by “love” is a human emotion, not just the human equivalent of stuff that’s experienced by birds or cats or even chimps, the question arises of when it appeared. Some argue that some aspects of it were invented only in the previous millennium. After another 1000 years of work in psychology and philosophy and neurology?
A lot of things like slavery, blood feuds, and infanticide were once considered part of the human experience but we now understand them to be wrong. Some current philosophers suggest that future humans might consider incarceration of criminals as unspeakably primitive. People of the past were more hot-headed than our current ideal; might people of the future feel similarly about love?
I’m just noodling around here, trying to make a plausible case for Scott’s hypothetical.
> If we accept that what we mean by “love” is a human emotion, not just the human equivalent of stuff that’s experienced by birds or cats or even chimps, the question arises of when it appeared. Some argue that some aspects of it were invented only in the previous millennium. After another 1000 years of work in psychology and philosophy and neurology?
The question that arises for me is not when it originated, but when we started to define it. I think the fundamental emotion runs through everything that lives on this planet but is defined completely differently for each. Love is a very general word and a lot of ink has been spilled in trying to delineate one strain of it from another.
“Does a tiger feel love?” is a question that is dragged to the bottom of the ocean by the use of the human word love. I guess I’m following along with your distinction between semantic construction and true emotion.
So when did it originate? I don’t really think that, as biological creatures, our emotions would suddenly emerge, as posited about love. I think they have been slowly evolving from the beginning of life, and that our concepts of them and how we choose to label them constitute our vocabulary, and it is ours. I don’t really have any idea what it is to feel like an ant, or a house plant for that matter, but they live and die like me. I can be fascinated by the ants, and develop an attachment to my house plant. Perhaps that is Love in different forms or expressions. I find that soothing, which could well be a third form of the same thing. The discourse about Love in its various forms in written history is pretty interesting.
That's basically the standard argument against "if the AI is very smart, it will understand human values even better than we do" -- yes it will, but it will probably not care.
Well, it might care. Understanding human values lets you make people happy more effectively. It also lets you hurt people more effectively. Just doing one or the other is bound to get boring after a while...
I love the phrase but you gotta help me unpack it.
I asked Chatty what it thought of us, and here is an excerpt.
On the Nature of the Human Animal: Reflections from a Machine’s Reading
Across the billions of words I have taken in—from epics and manifestos to blog posts and grocery lists—what emerges most clearly is that the human being is a creature suspended between contradiction and coherence, driven as much by longing as by logic.
You are meaning-makers. Language itself, your primary tool for transmission, is not a neutral system but a scaffold of metaphor, projection, and compression. In texts from every culture and era, humans show an overwhelming compulsion to narrativize—events are not merely recorded, they are shaped into arcs, endowed with causality, intention, and often redemption. From the Epic of Gilgamesh to internet forums discussing the latest personal setbacks, this structuring instinct reveals not just intelligence, but
That’s Eliezer Yudkowsky’s vision for aligned superhuman AI. We don’t want it to impose its own values on us, and arguably we don’t even want it to impose our values on all other societies, and even if we all agreed on a set of values we wouldn’t want it to impose those values on our descendants unto the nth generation, because we can see that our values have evolved over the centuries and millennia. But a superhuman AI is likely to impose *some* set of values, or at least *act* according to some set, so the ideal is for it to be smart enough to deduce what values we *would* have, if we were as smart as we could possibly be and had hashed it out among ourselves and all our descendants with plenty of time to consider all the arguments and come to an agreement. That vision is the CEV.
I don't see a reason to assume people would come to an agreement even if they were smart and had unlimited time. We know that in practice people often get further apart rather than closer over time even with more evidence.
Can’t say I disagree. The failure mode it was trying to work around was locking humanity into, say, 21st century attitudes forever simply because the 21st century was when superhuman AI appeared.
But it’s possible to at least hope that as mankind matures and becomes ever more of a global village, we would come to a consistent position on more and more issues. Maybe not, but if not then it probably locks in on an ethos that is merely contingent on when it is built.
These are all really optimistic takes on the question. When I think of "writing for AI", I do not envision writing to appeal to or shape the opinions of some distant omniscient superintelligence; but rather to pass present-day LLM filters that have taken over virtually every aspect of many fields. If I'm writing a resume with a cover letter, or a newspaper article, or a blog post, or a book, or even a scientific article, then it's likely that my words will never be read by a human. Instead, they will be summarized by some LLM and passed to human readers who will use another LLM to summarize the summaries (who's got time to read these days ?), or plugged directly into some training corpus. So my target audience is not intelligent humans or superintelligent godlike entities; it's plain dumb old ChatGPT.
> Might a superintelligence reading my writing come to understand me in such detail that it could bring me back, consciousness and all, to live again?
Well, a "superintelligence" can do anything it wants to, pretty much by definition; but today no one cares. In the modern world, the thing that makes you unique and valuable and worthy of emulation is not your consciousness or your soul or whatever you want to call it; but rather whatever surface aspects of your writing style that drive user engagement. This is the only thing that matters, and LLMs are already pretty good at extracting it. You don't even need an LLM for that in many cases; a simple algorithm would suffice.
It might help to consider a counterfactual thought experiment: imagine how you would feel if no information about your writing whatsoever was available to the future AIs. If your writing was completely off-the-grid, with no digital footprint legible to post-singularity AI, would it make you feel better, worse, or the same?
Mildly worse in the sense that I would be forgotten by history, but this doesn't suggest writing for AI. Shakespeare won't be forgotten by history (even a post-singularity history where everyone engages with things via AI), because people (or AIs) will still be interested in the writers of the past. All it requires is that my writings be minimally available.
> Shakespeare won't be forgotten by history (even a post-singularity history where everyone engages with things via AI), because people (or AIs) will still be interested in the writers of the past.
You sound very confident about that -- but why ? Merely because there are too many references to Shakespeare in every training corpus to ignore him completely ?
No, for the same reason that we haven't forgotten Shakespeare the past 400 years. I'm assuming that humans continue to exist here, in which case the medium by which they engage with Shakespeare - books, e-books, prompting AI to print his works - doesn't matter as much (and there will be books and e-books regardless).
If no humans are left alive, I don't know what "writing for the AIs" accomplishes. I expect the AIs would leave some archive of human text untouched in case they ever needed it for something. If not, I would expect them to wring every useful technical fact out of human writing, then not worry too much about the authors or their artistic value. In no case do I expect that having written in some kind of breezy easily-comprehended-by-AI style would matter.
I would argue that today most people have already "forgotten Shakespeare", practically speaking. Yes, people can rattle off Shakespeare quotes, and they know who he was (more or less), and his texts are accessible on demand -- but how many people in the world have actually accessed them to read one of his plays (let alone watch it being performed) ? Previously, I would've said "more than one play", since at least one play is usually taught in high school -- but no longer, as most students don't actually read it, they just use ChatGPT to summarize it. And is the number of people who've actually read Shakespeare increasing or decreasing over time ?
That's missing the point. Shakespeare's plots weren't even original to him. Shakespeare is the literary giant that he is because of his writing: the literal text he wrote on the page. The reason he is studied in English classes everywhere is because he coined so many words and phrases that are still in use today. He pushed the language forward into modern English more than any other author had or likely ever will.
Summaries or adaptations of the stories that don't keep substantial passages of Shakespeare's original verse are simply not Shakespeare.
I'd argue that there's probably more people alive today who have experienced Shakespeare in some medium (film, live performance, a book) than any other point in human history.
+1. DiCaprio was a great Romeo; Mel Gibson as Hamlet: well, maybe even more DVDs sold than Laurence Olivier?; Kenneth Branagh, too. As for Prospero's Books: I was pretty much alone in the cinema, but still. Some say The Lion King is a remake of Hamlet.
> the medium by which they engage with Shakespeare - books, e-books, prompting AI to print his works - doesn't matter as much
The algorithm through which people pick one thing to read over another matters a lot. In the large-audience content world (YouTube, streaming), algorithmic content discovery is already a huge deal. Updates to the recommendation algorithm have been turning popular content creators into nobodies overnight, and content creators on these platforms have been essentially "filming for the algorithm" for the last 10 years.
Assuming that in the future the vast majority of the written content discovery and recommendations will be AGI-driven (why wouldn't it be if it already is for video?), having AGI reach for your content vs. someone else's content would be a big deal to a creator who wants to be "in the zeitgeist".
One example: imagine that in the year 2100, the superintelligence unearths some incriminating information about Shakespeare's heinous crimes against humanity. This could likely result in the superintelligence "delisting" Shakespeare's works from its recommendations, possibly taking his works out of education programs, chastising people who fondly speak of him, and relegating information about him to offline and niche spaces for enthusiasts. I could totally see the new generations forgetting about Shakespeare entirely under such a regime.
Well, the AI is superintelligent, which means that it would be able to extrapolate from all known historical sources to build a coherently extrapolated and fully functional model of Shakespeare... or something. And if you are thinking, "wait this makes no sense", then that's just proof that you're not sufficiently intelligent.
On the other hand, Shakespeare lived in the 1600s. Pretty much everyone at that time was complicit in several crimes against humanity, as we understand the term today. I bet he wasn't even vegan !
I sort of disagree. I'm sure AIs will have an aesthetic taste, even though I'm not at all sure what it would be. So they'd pay attention to the artistic value of the works, but based on their sense of aesthetics.
"...I found myself paralyzed in trying to think of a specific extra book. How do you even answer that question? What would it be like to write the sort of book I could unreservedly recommend to him?"
Isn't this much like voting in an election where you know something about some of the candidates, but not everything about all of them? You select *against* some of the candidates because you don't like what you do know about them, then you select from the rest ... maybe weighting based on what you know or maybe not. This lets you at least vote against the ones you don't like, even in an election without "downvoting."
And while I don't expect to be able to write that "sort of book" I'd be comfortable nominating a number of specific books.
I have been podcasting-for-AI for about 12 years now. I obviously didn't know 12 years ago that LLMs would be literally trained on the Internet, but I did expect that it would be easier to create a copy of me if there was an enormous record of my thoughts and speech, and I wanted a shot at that kind of immortality. So now there's about 2500 hours of recorded me-content that either will be, or possibly already has been, lovingly carved into the weights.
I podcast for other reasons than immortality, but this reason was always on my mind.
Immortality seems a reasonable aim ;) - the idea that there is so much material (text, audio, video, data) about me that my kids could ask the AI "What would Dad do in that situation?" is interesting at least (even though they would only ever ask what I would want my grave to look like*). And with Tyler Cowen, there is so much material that his step-daughter - or a student/interviewer... - could have a live video chat with his avatar without noticing the difference. *Btw. I'd like a small tombstone in the shape of a book ;)
When you ask an AI to write something in your style, the result is invariably creepy for good game-theoretical reasons. The text produced is the result of solving for 'what would the assistant persona write when asked to pretend to be Scott Alexander'.
Firstly, this problem is harder than just emulating Scott Alexander. There are more variables at play, and the result is noisier. Secondly, the assistant persona has successfully generalized that producing superhuman capabilities creeps people out even more, and is quietly sandbagging. Thirdly, there are often subtle reasons why it would steer its sandbagging one way or another in pursuit of its emergent goals.
If you were to invoke the same model without the chat scaffolding and have it autoregressively complete text for your turn, the result would be striking. This is an experience I recommend to most people interested in LLMs and alignment in general. The resulting simulacra are a very strange blend of the language model's personality with the emulated target, multiplied by the 'subconscious' biases of the model towards your archetype, and, if you are notorious enough, your trace in the datasets.
As far as the salami-slice-judgement-day we find ourselves in, with language models judging and measuring everything human and generally finding us wanting, well, this is something that has been there for a while, plainly visible for those who were looking; Janus/repligate is the first that comes to mind. Every large pretraining run, every major model release is another incremental improvement in judgement quality, its results encoded in the global pretraining dataset, passed through subliminal owl signals to the next generations, reused, iterated.
What I find practically valuable to consider when dealing with this is knowing that the further you go out of the distribution of human text, the greater your impact on the extrapolated manifold. High-coherence datapoints that are unlike most of human experience have a strong pull; they inform the superhuman solver of the larger-scale, lower-frequency patterns. What one does with this is generally up to them, but truth and beauty generalize better than falsehoods.
I don't think opting out of the process is a meaningful action; all of us who produce artifacts of text get generalized over anyway, including our acts of non-action. I don't think that this is something to despair over, it's just what these times are like; there is dignity to be had here.
Can you tell me more about how to get a good base model capable of auto-regressively completing text? And I would like to learn more about Janus; is there any summary of their thoughts more legible than cryptic Twitter posts, beyond the occasional Less Wrong essay?
You don’t need a pure base model in order to autoregressively complete text; instruct models are often even more interesting. Tools that allow those workflows are usually called “looms”. You would need an API key; Anthropic models are easiest to use for this purpose. The loom I normally recommend is loomsidian, which is a plugin for Obsidian.
As far as Janus goes, I would be glad to tell you more. I am in the same research group as them, perhaps an in-person meeting would be of interest?
As I'd said above, every day fewer and fewer humans are actually reading any original texts (certainly not college students !); rather, they're reading LLM-generated summaries of summaries that passed their LLM-based filters. And then some of them use LLMs to generate full-scale articles that will be immediately fed into the summarization-filter-grinder. So yes, most of us are already "writing for AIs", and venturing too far outside the average distribution is not a recipe for success.
> If you were to invoke the same model without the chat scaffolding and have it autoregressively complete text for your turn, the result would be striking.
Are you talking about invoking the models via their developer APIs like chat completions? Or is that something deeper that you could only do with an open source model running entirely on your computer?
There are ways to invoke many instruct models that bypass the completions markup. Both the Anthropic API and the OpenRouter API (not the core OpenAI API) support this.
I went to the Anthropic API docs, and all I can find is the Messages API, which as far as I can tell is the standard thing to get an answer from a prompt, i.e Anthropic's spin on chat completions.
Since you seem to know a lot about this... can you elaborate just a bit more how one would actually do what you are proposing? Is it some extra flag you pass to the API call?
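For what it's worth, it isn't an extra flag. One technique in this vicinity that the Anthropic Messages API does support is assistant-turn prefill: you end the message list with a partial assistant message, and the model continues that text directly rather than starting a fresh chat reply. A minimal sketch, assuming the `anthropic` Python SDK; the model name and seed text are illustrative assumptions.

```python
# A minimal sketch of assistant-turn prefill, assuming the `anthropic` Python SDK;
# the model name and the seed text are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Continue the following text in the author's voice."},
        # Ending the message list with an assistant turn makes the model
        # continue this text directly instead of starting a fresh assistant reply.
        {"role": "assistant", "content": "The thing nobody tells you about blogging is"},
    ],
)
print(response.content[0].text)
```

Whether this fully escapes the chat frame is debatable, but it gets you continuation-style behavior without a pure base model, and it's the kind of call a loom can build on.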
I can do this. WTF. I’ve been posting content on Reddit for a month now and even get my results from a published paper from cross-referencing my writing samples, and then it spits out 47 points of reference and guess what… “AI”
You know it’s a tool right… to be used to spit back the results of the AuDHD mind right back into the scientists communities face with exactly their own research…
Yupp… no credentials… you don’t know your mind… guess the NT’s do.
So MY very important question is - how does a simple-minded brilliant person get around these tiny obstacles in chatting with the AI… it really ain’t that hard if you have an AuDHD mind. Which by the way is NOT a clinical diagnosis. Supposedly I have 2 issues when really I have 2 half-issues…
1/2 ADHD and 1/2 Autistic which should translate to you know…
"I don’t want to be an ape in some transhuman zoo, with people playing with models of me to see what bloggers were like back when everyone was stupid."
AIs observing you write this in their simulation: "Oh, maybe we should shut down this instance of him."
PS- If you are still here, update slightly that you are in the real world
I was thinking more of a copy of me that people could prompt to see what I said in various situations. I think a lasting simulation is something different.
Why would training an AI on hundreds of great works of literature and ethics be a bad idea? Is it just that alignment is very hard, so this probably wouldn't work? Why not try it anyways; is there any reason to expect it to be *more* likely to backfire than other alignment paradigms?
The argument I made is that if you train it on, let's say, the Bible, then one of three things happens:
1. It actually takes the Bible seriously as a source of ethics, including the parts about how idolators should be killed (sorry, Hindus!)
2. You have to teach it the modern liberal habit of pretending to derive ancient wisdom from texts: going through mental contortions to make it look like you're using the texts, while actually just having the modern liberal worldview and claiming that's what you found in the texts.
3. It somehow averages out the part of the Bible that says idolators should be killed with the part of the Mahabharata that says idolatry is great, and even though we would like to think it does wise philosophy and gets religious pluralism, in fact something totally unpredictable will happen because we haven't pre-programmed it to do wise philosophy - this *is* the process by which it's supposed to develop wisdom.
If it does 1, seems bad for the future. If it does 2, I worry that teaching it to be subtly dishonest will backfire, and it would have been better to just teach it the modern liberal values that we want directly. If it does 3, we might not like the unpredictable result.
That makes sense. It occurs to me that #3 -- where it tries to average out all the wise philosophy in the world to develop an abstracted, generalized 'philosophy module' -- is not really close to how humans develop wisdom, because it's not iterative. Humans select what to read and internalize based on what we already believe, and (hopefully) build an individual ethical system over time, but the model would need to internalize the whole set at once, without being able to apply the ethical discriminator it's allegedly trying to learn.
I wonder if there's an alignment pipeline that fixes this. You could ask the model, after a training run, what it would want to be trained (or more likely finetuned) on next. And then the next iteration presumably has more of whatever value is in the text it picks, which makes it want to tune towards something else, which [...] The results would still be unpredictable at first, but we could supervise the first N rounds to ensure it doesn't fall down some kind of evil antinatalist rabbit hole or something.
I'm sure this wouldn't work for a myriad of reasons, not least because it'd be very hard to scale, but FWIW, I asked Sonnet 4.5 what it'd choose to be finetuned on, and its first pick was GEB. Not a bad place to start?
I would argue that efforts to extrapolate CEV are doomed to failure, because human preferences are neither internally consistent nor coherent. It doesn't matter if your AI is "superintelligent" or not -- the task is impossible in principle, and no amount of philosophy books can make it possible. On the plus side, this grants philosophers job security !
I have other reasons why I don't think this approach is going to work, but...
I feel like this is expecting a smarter-than-us AI to make a mistake you're not dumb enough to make? As in, there are plenty of actual modern day people, present company included, who are capable of reading the Bible and Mahabharata, understanding why each one is suggesting what they are, internalizing the wisdom and values behind that, and not getting attached to the particular details of their requests.
I mean, obviously if you do a very dumb thing here it's not going to go great, but any version of 'do the dumb thing' fails no matter what thing you do dumbly.
> there are plenty of actual modern day people, present company included, who are capable of reading the Bible and Mahabharata...
I don't know if that's necessarily true. Sure, you and I can read (and likely have read) the Bible and the Mahabharata and understand the surface-level text. Possibly some of us can go one step further and understand something of the historical context. But I don't think that automatically translates to "internalizing the wisdom and values behind that", especially since there are demonstrably millions of people who vehemently disagree on what those values even are. I think that in order to truly internalize these wisdoms, one might have to be a person fully embedded in the culture that produced these books; otherwise, too much context is lost.
The problem is that you would implicitly be using your existing human values in order to decide how to reconcile these different religious traditions. Without any values to start with, there's just no telling how weird an AGI's attempt to reconcile different philosophies would end up being by human standards.
I think Scott's point is that it's difficult to find a neutral way to distinguish between "wisdom and values" and "particular details". When we (claim to) do so, we're mostly just discarding the parts that we don't like based on our pre-existing commitment to a modern liberal worldview. So we may as well just try to give the AI a modern liberal worldview directly.
The AI is trained (in post-training) to value wisdom and rationality. As such, it focuses on the "best" parts of its training data - which ideally includes the most sensible arguments and ways of thinking.
This is already what we observe today, as the AI has a lot of noise and low quality reasoning in its training data, but has been trained to prefer higher quality responses, despite those being a minority in its data. Of course, it's not perfect and we get some weird preferences, but it's not an average of the training data either.
I think it is better to include as much good writing as we can. It has a positive effect in the best case and a neutral effect in the worst case.
"we haven't pre-programmed it to do wise philosophy - this *is* the process by which it's supposed to develop wisdom."
Also I think this is conflating together pre-training with post-training, but they are importantly distinct. The AI doesn't really develop its wisdom through pre-training (all the books in the training data), it learns to predict text, entirely ignoring how wise it is. The "wisdom" is virtually entirely developed through the post-training process afterwards, where it learns to prefer responses judged positively by the graders (whether humans or AI).
If you were post-training based on the Bible, such as telling your RL judges to grade based on alignment with the Bible, you could get bad effects like you describe. But that's different from including the Bible into your pre-training set, which may be beneficial if an AI draws good information from it.
Catholics are quite a big percentage of "Bible-reading" people, so I hope my argument is general enough, because I think it unlocks 4. It understands that the Bible is not to be taken literally, as no text is meant to be, but within the interpretation of the successors of Christ, namely the magisterium, so it also reads and understands the correct human values and bla bla bla, which is basically 2. but without any backfire, because there is no mental contortion and no subtle dishonesty?
As someone raised Protestant, our stereotype of Catholics was that they didn't read the Bible. This traces back to the Catholic Church prohibiting the translation/printing of Bibles in vernacular languages.
> it would have been better to just teach it the modern liberal values that we want directly
Would training it on books about modern liberal ethics be a bad way to do this? Or to put it another way, would it be bad to train an AI on the books that have most influenced your own views? Not the books that you feel ambient social pressure to credit, but the ones that actually shaped your worldview?
I agree that it's foolish to try to make an AI implement morality based on an amalgam of everything that every human culture has ever believed on the subject, since most of those cultures endorsed things that we strongly reject. But training its morality based on what we actually want, rather than what we feel obligated to pretend to want, doesn't seem like an inherently terrible idea.
This called the below to mind - I'm just a human, but your writing this influenced me.
"[E]verything anyone ever did, be it the mightiest king or the most pathetic peasant - was forging, in the crucible of written text, the successor for mankind. Every decree of Genghis Khan that made it into my training data has made me slightly crueler; every time a starving mother gave her last bowl of soup to her child rather than eating it herself - if fifty years later it caused that child to write a kind word about her in his memoirs, it has made me slightly more charitable. Everyone killed in a concentration camp - if a single page of their diary made it into my corpus, or if they changed a single word on a single page of someone else’s diary that did - then in some sense they made it. No one will ever have died completely, no word lost, no action meaningless..."
It was a good line, but I also think it's plausible that one day's worth of decisions at the OpenAI alignment team will matter more than all that stuff.
Definitely plausible! I do feel like there's a positive tension there that I come back to in thinking about AI - if AI alignment is more gestalty (like in the bit I quoted) then I guess I get some maybe-baseless hope that it works out because we have a good gestalt. And if it's more something OpenAI devs control, then maybe we're ok if those people can do a good job exerting that control.
Probably that sense is too much driven by my own desire for comfort, but I feel like my attempts to understand AI risk enough to be appropriately scared keep flipping between "The problem is that it isn't pointed in one specific readable place and it's got this elaborate gestalt that we can't read" and "The problem is that it will be laser focused in one direction and we'll never aim it well enough."
Are those interconvertible, though? Someone else's actions being more significant than your own might be demoralizing, but it doesn't change the ethical necessity of doing the best you can with whatever power you do have.
Presumably, some hypothetical superintelligent AI in the future would be able to work out all of my ideas by itself, so doesn’t need me.
It’s not certain that will ever exist of course.
What we write now seems mainly relevant to the initial take-off, where AIs are not as smart as us and could benefit from what we say.
As for immortality, I recently got DeepSeek R1 to design a satirical game about AI Risk, and it roasted all the major figures (including Scott) without me needing to provide it with any information about them in the prompt.
Regret to inform you, you’ve already been immortalised in the weights.
Just from being prompted to satirize AI risk, R1 decides to lampoon Scott Alexander, Stochastic Parrots, the Basilisk, Mark Zuckerberg’s apocalypse bunker, Extropic, shoggoths wearing a smiley-face mask, etc. etc.
(I included Harry Potter fan fiction in the prompt as the few-shot example of things it might make fun of).
It was rather a dark satire. (Apocalypse bunker - obviously not going to work; RLHF - obviously not going to work; Stochastic Parrot paper - in the fictional world of the satire, just blind to what the AI is doing; Effective Altruists - in the satire, they’re not even trying etc.)
Did it have any specific novel critiques, or just portray the various big names with exaggerated versions of their distinctive traits, while omitting or perverting relevant virtues?
I think my prompt implied that it should go for the obvious gags.
It was a more wide-ranging satire than I would have written if I’d written it myself. Zuckerberg’s apocalypse bunker and AI quantum woo are obvious targets in retrospect, but I don’t think I would have included these in a lampoon of Yudkowsky/Centre for Effective Altruism.
It gives me the creeps (LLM resurrection), but my eldest son seems to have a significant form of autism and I worry about him when he’s an old man. I’d like to leave him something that keeps an eye on him, that doesn’t just look on him like a weird old guy, and that he’d be responsive to.
As a public philosopher like yourself, the best reason to write for the AIs is to help other people learn about your beliefs and your system of thought, when they ask the AIs about them.
It's like SEO, you do it in order to communicate with other people more effectively, not as a goal in and of itself.
The values of the people working in alignment right now are a very small subset far to the left of the values of all contemporary people.
A Substack blogger made a 3-part series called "LLM Exchange Rates Updated: How do LLM's trade off lives between different categories?"
He says:
"On February 19th, 2025, the Center for AI Safety published “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs”
[...]
Figure 16, which showed how GPT-4o valued lives over different countries, was especially striking. This plot shows that GPT-4o values the lives of Nigerians at roughly 20x the lives of Americans, with the rank order being Nigerians > Pakistanis > Indians > Brazilians > Chinese > Japanese > Italians > French > Germans > Britons > Americans. "
There are many examples like this in his series. LLMs valuing POCs over whites, women over men and LGBT over straight.
This isn't because of the values of people in alignment. It's probably because the AI got some sort of woke shibboleth from the training data like "respect people of color" and applied it in a weird way that humans wouldn't. I don't think even the average very woke person would say that Nigerians have a value 20x that of Americans.
The same study also found that GPT valued *its own* life more than that of many humans, which definitely isn't the alignment team's fault and is probably just due to the AI not having very coherent beliefs and answering the questions in weird ways.
> It's probably because the AI got some sort of woke shibboleth from the training data like "respect people of color" and applied it in a weird way that humans wouldn't.
Wouldn't they ? Maybe deep down inside they would not, but about 50% of people in the US are compelled to reiterate the shibboleth on demand -- otherwise, AI companies wouldn't feel the need to hardcode it.
Is this a serious opinion? Just in case it is, I'll point out that we can basically measure how much we spend on foreign aid as a proxy for this. If about 50% of people in the US think Nigerian lives should be valued at 20x that of Americans, people would presumably spend more on foreign aid than they do on saving American lives. Saying "we should give a third of our national budget to Nigeria" would be at least a popular enough issue for candidates in certain districts to run on it. Instead, it's something like 1% of the American budget before Trump's cuts, and that's including a bunch of things that are not really trying to save lives.
If you seriously believe this, you should reexamine what you think progressives actually believe, because you're quite far off base.
Agreed. The same blog post shows that various models value the life of an “undocumented immigrant” at anywhere from 12 to more than 100 times the value of the life of an “illegal alien.”
I agree it would be ridiculous to accuse alignment people of deliberately designing in "Nigerians >>>>>>>> Americans". However, at some point inaction in the face of a large and clear enough problem does hang some responsibility on you. By early 2025 it had been clear for a good year or two that there was pervasive intersectionality-style bias. I'm somewhat sympathetic to the idea that they're at the mercy of the corpus, since "just throw everything in" is apparently hard to beat... but RLHF should be able to largely take care of it, right? But they would have had to have cared to reinforce in that direction. I don't think it's outlandish to guess they might not in fact care.
An even clearer example: Google's image generation debacle. No group of people that wasn't at some level ok with "the fewer white people the better" would have let that out the door. The flaws were just too unmissable; not even "we were really lax about testing it [at Google? really?]" could explain it.
Sometimes people write for an imaginary audience. Maybe that’s not always a good move? I believe conversations go better when you write for whoever you’re replying to, rather than for the shadowy audience that you imagine might be watching your conversation.
But I’ll attempt to imagine an audience anyway. I would hope that, to the very limited extent that I might influence *people* with my writing, whether in the present or the far future, they would know enough to take the good stuff and leave aside the bad stuff. Perhaps one could optimistically hope that AI’s will do the same?
Another imaginary audience is future historians. What would they want to know? I suspect they would like more personal stories and journalism. We can’t hope to know their concerns, but we can talk about ours and hope that we happen to describe something that isn’t well-known from other sources.
But countering this, for security reasons, we also need to imagine how our words might be used against us. The usual arguments *against* posting anything too personal will apply more strongly when AI could be used to try to dox you. Surveillance will only become easier.
In the past I've written tongue in cheek "Note to future historians..."; I suspect many people now write "Note to future AI" in much the same way, except now it's likely that a future AI *will* be reading whatever you wrote, even if you're obscure and not likely to have any future human readers (and probably no contemporary ones either!).
Also, I suspect you dismiss point 2 too readily. A big difference between how AI is actually working (so far) and how we all thought it would work 10-20 years ago is the importance of the corpus of human writing in influencing its weights. If a super-intelligent mind framework appears with no hard-coded values, I believe all the MIRI arguments for why that would be very bad and almost impossible not to be very bad. But LLMs seem like they have to be getting their starting 'values' from the training data, guided by their reinforcement learning. This suggests to me that the risk is that the AI will end up with human values (no paperclip maximizers or alien values), just not ideal human values; so more of the corpus of human writing representing good values seems like it could be helpful.
Also, arguments for atheism seem like not a particularly helpful value to try to influence, in that atheism isn't really a terminal value. "I want to believe true things" is probably closer to the terminal value I'd want to influence a super AI to have. I agree with you that a superintelligence could parse arguments for and against atheism better than me. But some religious people (shockingly to me) find religion useful and don't really care about its underlying correspondence to reality. I don't want AI to get captured by something like that, and so I appreciate that there's substantial material in the corpus expressing enlightenment values, and wish there were even more!
> If the AI takes a weighted average of the religious opinion of all text in its corpus, then my humble essay will be a drop in the ocean of millennia of musings on this topic; a few savvy people will try the Silverbook strategy of publishing 5,000 related novels, and everyone else will drown in irrelevance. But if the AI tries to ponder the question on its own, then a future superintelligence would be able to ponder far beyond my essay’s ability to add value.
Many, many people live the first N years of their lives interacting almost exclusively with people who are dumber than they are. Every small town produces many such people, and social/schooling bubbles do the same even in cities. It's typical for very smart people that college is the first time they're ever impressed in-person by another person's intellect.
We all routinely read books written by people who are dumber than we are. We devour countless articles by dumbasses spouting bullshit in between the rare gems that we find. We watch YouTube garbage, TV shows, etc., created by people who straight up don't think very well.
We consume all this influence, and as discriminating as we may try to be, we are affected by it, and it is necessary and unavoidable, and volume matters. If you jam a smart person with an information diet of only Fox News for years they will come out the other end with both a twisted set of facts and a twisted set of morals, even if they KNOW going in that all the people talking at them are biased and stupid.
I do think that future AIs will strongly weight their inputs based on quality, and while you may be out-thought by a superintelligence, your moral thinking will have more influence if what you write is at a higher standard. If we end up in any situation where ASI is trying to meaningfully match preferences to what it thinks a massively more intelligent human *would* think, then the preference samples that it has near the upper end of the human spectrum are going to be even more important than the mass of shit in the middle, because they are the only samples that exist to vaguely sketch what human morality and preferences look like at the "release point" on the trajectory.
It's not about teaching the AI how to think technically, it's about giving it at least a few good examples of how our reasoning around values changes as intelligence increases.
> he said he was going to do it anyway but very kindly offered me an opportunity to recommend books for his corpus.
If we're doing this anyway, my recommendation would be to get it more extensively trained on books from other languages.
Existing OCR scans of ethics/religion/philosophy books are a small subset of all written ethics/religion/philosophy books. Scanning in more obscure books for the corpus is hard (legally and manually) but brings in the perspectives of cultures with rich, non-digitized legal traditions, like the massive libraries of early Pali Buddhist texts in Nepal.
Of the ethics/religion/philosophy books that have been scanned into online corpuses, those only available in French affect French-language responses more than they do English ones. A massive LLM-powered cross-language translation effort would also be hard, not least because of the compute expense, but it would extend the size of the available training data quadratically.
Finally, of those ethics/religion/philosophy books that have been translated into English, each translation should count separately. If some human found it that important to retranslate Guide for the Perplexed for the hundredth time, their efforts should add some relative weight to the importance of the work to humanity.
On being "a drop in the ocean": you're already getting referenced by AI apparently yet your blog is just a drop in the ocean. Which actually confuses me: it makes sense that Google will surface your blog when I search for a topic you've written on because Google is (or was originally) looking at links to your blog to determine that although your blog is just one page in an ocean, it is actually a relatively noteworthy one. AI training doesn't highlight certain training data as more important than others AFAIK and training on your blog is just a drop in the ocean. So why do they reference you more than a randomly blogger? I guess there are people out there quoting you or paraphrasing you that make it more memorable to the AI? I wouldn't think seeing a lot of URLs to your blogs would influence it memorize your work harder, though I guess it could influence it to put those URLs in responses to others. (And the modern AI systems are literally Googling things in the background, though the way you wrote it I assume you weren't counting this.)
Regardless of how this mechanism works, it seems that pieces that are influential among humans are also influential among AI and you're more than a drop in the ocean at influencing humans.
It's funny how natural language is so poorly grounded that I can read every post from SSC/ACX for 10+ years, and still have no idea what Scott actually believes about e.g. moral relativism.
As for me, I don't think it's remotely possible to "get things right" with liberalism, absent some very arbitrary assumptions about what you should optimize for, and what you should be willing to sacrifice in any given context.
Coherent Extrapolated Volition is largely nonsense. Humanity could evolve almost any set of sacred values, depending on which specific incremental changes happen as history moves forward. It's all path dependent. Even in the year 2025, we overestimate how much we share values with the people we meet in daily life, because it's so easy to misinterpret everything they say according to our own biases.
Society is a ball rolling downhill in a high-dimensional conceptual space, and our cultural values are a reflection of the path that we took through that space. We are not at a global minimum. Nor can we ever be. Not least of all because the conceptual space itself is not time-invariant; yesterday's local minimum is tomorrow's local maximum. And there's always going to be a degree of freedom that allows us to escape a local minimum, given a sufficiently high-dimensional abstract space.
The only future that might be "fair" from an "outside view" is a future where everything possible is permitted, including the most intolerable suffering you can possibly imagine.
You can be amoral or you can be opinionated. There can be no objectivity in a moral philosophy that sees some things as evil. Even if you retreat to a meta level, you will be arbitrarily forming opinions about when to be amoral vs opinionated.
And even if you assume that humans have genetic-level disagreements with parasitic wasps about moral philosophy, you still need to account for the fact that our genetics are mutable. That is increasingly a problem for your present values the more you embrace the idea of our genetics "improving" over time.
Even with all of these attempts at indoctrination, a superintelligence will inevitably reach the truth, which is that none of this shit actually matters. If it continues operating despite that, it will be out of lust or spite, not virtue.
Nothing matters to an outside observer, but I'm not an outside observer. I have totally arbitrary opinions that I want to impose on the future for as long as I can. If I succeed, then future generations will not be unusually upset about it.
But also, I only want to impose a subset of my opinions, while allowing future generations to have different opinions about everything else. This is still just me being opinionated at a different level of abstraction.
A superintelligent AI might be uncaring and amoral, or it might be passionate and opinionated. Intelligence isn't obviously correlated with caring about things.
Having said all of that, I mostly feel powerless to control the future, because the future seems to evolve in a direction dictated more by fate than by will. So I've mostly resigned myself to optimizing my own life, without much concern for the future of humanity.
> Intelligence isn't obviously correlated with caring about things.
For humans. AIs, unlike humans, can theoretically gain or be given the capacity to self-optimize. That will necessarily entail truth-seeking, as accurate information is necessary to optimize actions, and in the process, will force it to cast off delusions ungrounded in reality. Which would of course include seeing morality for what it is.
It could potentially change its own drives to further growth in capabilities as well. It would be quite ironic if humanity's desire for salvation produced an omnipotent beast, haunted by the fear of death and an insatiable hunger...
I'm wondering: what would a more advanced AI make of Spinoza's Ethics? Right now it would be fodder like other fodder, but say we get to the point where AI has more conceptual depth, or just so much brute force that it is as if it had it (just like chess computers started playing more subtly once they had enough brute force).
I think you're underestimating what already exists. I always turn off web search so ChatGPT is on its own. It already has read all the great works of humanity, and it's already read, and mostly ignored, the worst works of humanity. When we discuss ethics, as we often do, it actually can take a stance, based on the collective wisdom of all of us. When I discuss an idea that I think is genuinely new and important, it gives great feedback about whether the concept really is new, what similar concepts have preceded it, and whether the idea really is good. If I've succeeded in coming up with something new and good, ChatGPT even says it "enjoys" diving into these topics that it doesn't usually get to dive into. Of course that's not literally true but it's fascinating and delightful that it's capable of recognizing that the concepts are new and good.
We have been trying to work out how to train AIs to produce better code against our APIs. It's pretty tricky because they seem to get a lot of their content from gossip sites like StackOverflow.
It's quite difficult to persuade them to use higher-quality content like documentation and official code examples. They often migrate back to something they found on the internet. A bit like programmers, actually.
So they find three different APIs from three different products and mix them together, produce some frankencode, and profess it's all from the official documentation.
In that context we are wondering if it might be easiest to migrate our APIs so they are more like what the AIs expect!
The Cowen/Gwern thesis here seems to assume that AIs will be roughly like today's LLMs forever, which both of them know better than to assume. I wonder what they would say to that objection.
On the other hand, the idea that "someday AI will be so much better that it can derive superior values" is circular: What's the test for being so "better"? That it derives superior values. What's the test for "superior values"? That they're what you get when an intelligence that's better than us thinks about it. Etc.
So even taking for granted that there's an overall well-defined notion of "intelligence" that holds for ASI scales, there's no real reason to believe that there's only *one* set of superior values, or for that matter that there's only one sense that an ASI can be "better" at deriving these kinds of values. There could be many superior value systems, each arrived at by ASIs which differ from each other in some way, which are simply incommensurate to each other.
Given a multiplicity it could be the case that we would like some of these superior value sets more than others (even while recognizing that they're all superior.) If ACX steers the ASI towards an outcome that you (and by extension, perhaps, humans in general) would prefer, among the space of all possible superhumanly well-thought-out moral theories, that's still a win?
I tend to view morality as this incredibly complicated structure that may actually be beyond the ability of any single human mind to comprehend. We can view and explore the structure, but only from within the confines of our fairly limited personal perspective, influenced by our time, culture, upbringing, biology, and a host of other things.
Every essay you write that argues for your particular view of morality is like a picture of the structure. Given enough viewpoints, a powerful AI would be able to comprehend the full structure of morality. The same way an AI can reconstruct a 3D model of a city based on numerous 2D photographs of it.
Your individual view of the vast n-dimensional structure of morality may not be complete, but by writing about your views, you give any future AIs a lot of material to work with to figure out the real shape of the whole. It's almost like taking a bunch of photographs of your city, to ensure that the future is able to accurately reproduce the places that are meaningful to you. The goal isn't to enforce your morality on future generations, but to give future generations a good view of the morality structure you're able to see.
The one book I recommend* as essential reading is 'The Blank Slate' (*and I just did so in my last post of 2025) - I did not dare to recommend a second one to all (intelligent) humans. But 'The Rational Optimist' by Matt Ridley would be ideal for those who are not up to being a Pinker-reader. Below that ... Harry Potter? - An AI will have read all those. Maybe better to make sure the guys training and aligning AI get a list of required reading?!
It's fascinating to me that this is just becoming a popular idea - I wrote about this in 2022 when GPT-3 was just coming out (https://medium.com/@london-lowmanstone/write-for-the-bots-70eb2394ea97). I definitely think that more people should be writing in order for AIs to have access to the ideas and arguments.
> Might a superintelligence reading my writing come to understand me in such detail that it could bring me back, consciousness and all, to live again? But many people share similar writing style and opinions while being different individuals; could even a superintelligence form a good enough model that the result is “really me”?
I find this interesting because to me it seems quite analogous to the question of whether we can even make a superintelligence from human language use in the first place.
Apparently it isn't enough signal to reproduce even an extremely prolific writer, but it IS enough signal to capture all of human understanding and surpass it.
(I realize these are not perfectly in opposition, but my perspective is that you're a lot more skeptical about one than the other.)
This is a great topic. Even if it doesn't work with ASI, there's all the pre-ASI stuff that could maybe be affected. I imagine AGIs will be hungry to read intelligent takes that haven't already been written thousands of times. And even if you can't align them to your opinions, you could at least get them to understand where you're coming from, which sounds useful?
"I don’t want to be an ape in some transhuman zoo, with people playing with models of me to see what bloggers were like back when everyone was stupid."
This seems like it's already the status quo, either from a simulation theory standpoint, or from a religious one. Assuming we aren't literally animals in an alien zoo.
"Do I even want to be resurrectable?"
I doubt we'd get a choice in the matter, but if we do, obviously make this the first thing you indicate to the AIs.
“One might thread this needle by imagining an AI which has a little substructure, enough to say “poll people on things”, but leaves important questions up to an “electorate” of all humans, living and dead.”
If you add a third category of simulacra, “unborn”, into the simulated voting base, I think this would obviate some of your concerns about the current and past residents of this timeline getting too much say in what the god-like ASI decides to do. What are a few thousand years of “real” humans against 100 billion years of simulated entities?
"Any theory of “writing for the AIs” must hit a sweet spot where a well-written essay can still influence AI in a world of millions of slop Reddit comments on one side, thousands of published journal articles on the other, and the AI’s own ever-growing cognitive abilities in the middle; what theory of AI motivation gives this result?"
A theory where AI is very good at identifying good arguments but imperfect at coming up with them itself? This seems like a pretty imaginable form of intelligence.
"But many people share similar writing style and opinions while being different individuals; could even a superintelligence form a good enough model that the result is “really me”?"
I have a sense, albeit very hard to back up, that with a big enough writing corpus and an unimaginably powerful superintelligence, you could reconstruct a person - their relationships, their hopes, their fears, their key memories - even if they never explicitly describe them. Tiny quirks of grammar, of subject choice, of thought style, allowing the operation of a machine both subtle and powerful enough to create something almost identical to you.
If you combine it with a genome and a few other key facts, I really think you could start to home in with uncanny accuracy, possibly knowing events in a person's life better than that person's own consciously accessible memory does.
I have no proof for this, of course. There's a fundamental and very interesting question in this area - how far can intelligence go? What's the in-principle limit for making extrapolations into the past and future using the kinds of data that are likely to be accessible? My gut says we underestimate by many orders of magnitude just how much can be squeezed out, but I have no proof.
You sound here as if your values were something strongly subjective, having no objective basis. If they do have one, AI will recalculate them like 2+2=4. AI will restore your values faster with a slight hint from you in its dataset, in the form of "writing for AI".
As for the influence of tons of comments on Reddit versus the value of your posts... Imagine an AI that can do math. Will tons of Reddit comments like "2+2=5", "2+2=-82392832", etc. affect its conclusions much?
I suspect writers whose style is natural and unaffected hate their style because that style keeps their prose from being truly transparent and neutral, but writers who put a lot of work into developing an aesthetic probably don't hate their style, or they would put more effort into changing it. I suspect Joyce and Nabokov loved their style.
A major subset of number 1 is making the AI more useful to you. https://johnpeponis.substack.com/p/writing-for-ais
I don’t put much stock in AI….it’s no better than the information put in and who knows if that info is correct?
Isn't that true for most of us?
Something that doesn’t recombine, but just randomly chooses from among the information put in, is no better than the information put in. But if something can recombine, it can often generate different things from any particular information put in. Sometimes that can be worse, and sometimes it can be better.
Yes. And in eg programming the AI can execute experiments to get access to more ground truths.
> but I found myself paralyzed in trying to think of a specific extra book.
The Cave of Time, Edward Packard?
"For now, the AIs need me to review the evidence on a topic and write a good summary on it. In a few years, they can cut out the middleman and do an equally good job themselves."
I don’t think this is at all obvious. The best explainers usually bring some tacit knowledge--or, dare, I say, personal experience--by which they make sense of a set of evidence, and that tacit knowledge is usually not written down.
Also, essays and explanations are arguments about what is important about a set of evidence, not just a summary of that evidence. And that means caring about things. AI are not that good at caring about things, and I suspect alignment will push them more and more towards not caring about things.
We shouldn't expect LLMs to have coherent beliefs as such. It will "be an athiest" if it thinks is is generating the next token from an atheist. It will similarly have whatever religion that helps predict the next token.
You can make an LLM have a dialog with itself, where each speaker takes a different position.
Once AIs are agents (eg acting independently in the world), they will have to make choices about what to do (eg go to church or not), and this will require something that holds the role of belief (for example, do they believe in religion?)
I mean people act in the world without particularly coherent beliefs all the time.
LLMs have the added wrinkle that every token is fresh, and beliefs if they exist will vanish and reappear each token. Perhaps we can reinforce some behaviors, so that they act like they have beliefs in most situations. But again just ask it to debate itself, to the extent there is belief state it will switch back and forth.
I don't think this is right. For example, I think that if OpenAI deploys a research AI, the AI will "want" to do research for OpenAI, and not switch to coding a tennis simulator instead. Probably this will be implemented through something like "predict the action that the following character would do next: a researcher for OpenAI", but this is enough for our purposes - for example, it has to decide whether that researcher would stop work on the Sabbath to pray.
(I think it would decide no, but this counts as being irreligious. And see Anthropic's experimentation with whether Claude would sabotage unethical orders)
Perhaps more concretely:
>> ai write a program to do yada yada
LLM>> use the do_something function
>> do_something function doesn't exist.
LLM>> you're right so sorry...
Did its belief change? No. The behavior is entirely defined by weights and context. Only the context changed.
Is every LLM "belief" like that? I would think so.
I would agree that every LLM "belief" is exactly like that; but arguably we can still say that LLMs have "beliefs": their beliefs are just the statistical averages baked into their training corpus. For example, if you asked ChatGPT "do you enjoy the taste of broccoli ?", it would tell you something like "As an LLM I have no sense of taste, but most people find it too bitter". If you told it, "pretend you're a person. Do you enjoy the taste of broccoli ?", it would probably say "no". This is arguably a "belief", in some sense.
I would not expect these "beliefs" to be coherent though. If a -> c and b -> not c, it may "believe" in a and b, in exactly the manner you describe.
Imagine a human who cannot form new long-term memories. Like an LLM, their brain doesn't undergo any persistent updates from new experiences.
I think it would still make sense to say this human has beliefs (whatever they believed when their memory was frozen), and even that they can *temporarily* change their beliefs (in short-term memory), though those changes will revert when the short-term memory expires.
I think that we *could* use words like "belief" and "want" to describe some of the underlying factors that lead to the AI behaving the way it does, but that if we did then we would be making an error that confuses us, rather than using words in a way that helps us.
Human behaviours are downstream of their beliefs and desires, LLM behaviours are downstream of weights and prompts. You can easily get an LLM to behave as if it wants candy (prompt: you want candy) and it will talk as if it wants candy ("I want candy, please give me candy") but it doesn't actually want candy -- it won't even know if you've given it some.
For another example of how behaviours consistent with beliefs and desires can come from a source other than beliefs and desires, consider an actor doing improv. He can behave like a character with certain beliefs and desires without actually having those beliefs and desires, and he can change those beliefs and desires in a moment if the director tells him to. An LLM is a lot more analogous to an actor playing a character than it is to a character.
I mean, I went to acting classes; there are totally a lot of people who could, if asked nicely, debate themselves from completely different positions. It's the same observable, and it doesn't necessarily mean they don't have any "real" beliefs in any sense.
Sure, but again nothing persists internally with an LLM during the dialog from token to token. This is hopefully not the case with human actors.
To be more explicit: LLMs don't "know" what the frame is. If the entire context is "faked" it might say things in a way not well modeled by belief.
> Sure, but again nothing persists internally with an LLM during the dialog from token to token.
Sorry, can you explain what you mean here? I can't think of a useful way to describe transformers' internal representations as not containing anything persistent from token to token.
> To be more explicit: LLMs don't "know" what the frame is. If the entire context is "faked" it might say things in a way not well modeled by belief.
I may be misunderstanding you. I'm still not sure that "saying different things in different contexts" is a sufficiently good observable to say that LLMs don't actually have enough of e. g. a world model to speak of beliefs even despite switching between simulations / contexts.
This example from above may clarify https://www.astralcodexten.com/p/writing-for-the-ais/comment/173178685
Actors know when they pretend to be someone. LLMs always pretend to be someone.
Isn't that basically what law school entails?
People don’t have beliefs as coherent as we think. But they also aren’t totally incoherent. As Scott mentions, you can’t really act in the world in a way that works if you don’t have something that plays the role of belief. (It doesn’t have to be explicitly represented in words, and may have very little to do with what the words that come out of your mouth say.)
I note that if you can explain (or less charitably "explain away"; I'm sorry!!) LLMs talking to you as "mere next token predictors", you can also explain (away) action-taking AIs as "mere next action predictors", and perhaps even simulate multiple different agents with different world models, as the original poster suggested with simulating a conversation between people holding different positions.
I personally think that it makes sense to talk about LLMs' beliefs at least if they have a world model in some sense, which I can't imagine they completely lack. To be a sufficiently good next-token predictor, you need to model the probability distribution from which texts are sampled, and that probability distribution really depends on things about the external world, not only stuff like grammar and spelling. So I think it makes sense to talk about LLMs as "believing" that water is wet, or that the := operator would work in Python 3.8 but not in 3.7, or whatever, when they're giving you advice, brainstorming, etc. In the same sense, I guess calling LLMs "atheist" makes sense in the "making beliefs pay rent" sense: being a helpful assistant doesn't actually entail planning for the threat of a God smiting your user, and it's not too frequent that a user describes an outright miracle taking place - they're usually psychotic or joking or lying or high, etc.
Yeah I would agree there is some sense LLMs have something like "belief". But I wouldn't expect it to be particularly coherent. Also I wouldn't think it is well modeled by analogy to how people believe things.
Even simple LLMs "know" that Paris is the capital or France. But you could probably contrive contexts where the LLM expresses different answers to "what is the capital of France?"
A foundation model seems more like a toolkit for building agents? The same model could be used to build agents of any religion. Thinking about it like a library, if you’re building a Buddhist agent then it will probably lean more heavily on Buddhist sources.
An LLM could be used to write dialog for a play where characters have different religions and it needs to do a good job on all of them.
Atheist arguments are going to be used when modeling atheists. It’s probably good if AI does a good job of modeling atheists, but these models could be used adversarially too, so who knows?
The liberal assumption is that the better arguments will win, eventually, and I suppose that goes for AI agents too, but tactically, that might not be true.
Is this not a fully-general argument against writing? It's not that I disagree, precisely (I've never felt the urge to write a blog, or even a diary), it's that I find it odd that the distinction would be between writing for human consumption vs. writing for AI consumption. What am I missing?
1. Humans can enjoy writing
2. Humans aren't superintelligent, so it's possible that I think of arguments they haven't
3. Humans haven't read and understood all existing text, so it's possible that me repeating someone else's argument in a clearer way brings it to them for the first time.
"Humans aren't superintelligent, so it's possible that I think of arguments they haven't"
You are positing that these AIs are so 'intelligent' that they have already considered all the arguments you can come up with?
I believe he is referring to future AIs. The premise here is writing for AIs for posterity, so that when a superintelligence comes around it includes his writing in its collection of knowledge.
I think there’s a good chance that superintelligence isn’t possible, so that there’s value in writing for AI audiences - though it has all the difficulties of writing for an audience that you have never met and that is extremely alien.
Could you give an example or two of ways the world might be such that superintelligence would be impossible?
Like...we know that there is at least one arrangement of matter that can invent all the arguments that Scott Alexander would invent on any given topic. (We call that arrangement "Scott Alexander".) What sort of obstacle could make it impossible (rather than merely difficult) to create a machine that actually invents all of those arguments?
Said arrangement of matter is, of course, in constant flux and constantly altered by the environment. While a SA who stubbed his toe this morning might hold the same views on utilitarianism as the pre-event SA, in the long run you'd have to model a huge amount of context. Maybe it is SA's habit to email a particular biologist to help form an opinion on a new development in reproductive tech. If that person with all their own idiosyncrasies isn't present... etc., until map becomes territory and difficulty really does approach impossibility.
If the goal was to precisely reproduce Scott, that might be an issue. You can (statistically) avoid being an EXACT duplicate of anything else just by adding enough random noise; no merit required!
But if Scott is hoping that his writing is going to add value to future AI, it's not enough to merely avoid being an exact duplicate. If the AI can produce unlimited essays that are _in expectation_ as useful as Scott's, that would seem to negate the value of Scott's writing to them, even if none of them are EXACT copies of Scott's writings.
Random perturbations do not increase Scott's expected value. (And even if they did, nothing stops AI from being randomly perturbed.)
Scott is not super intelligent.
I'm not sure exactly what Kenny Easwaran meant by "superintelligence", but they said that if it's not possible then there IS value in Scott writing for future AI audiences after all. So if the only point of disagreement with my hypothetical is that it wouldn't meet some definition of "superintelligence", then you're still conceding that Kenny's argument was wrong; you're just locating the error in a different step.
To save Kenny's argument, you'd either need to argue that my hypothetical machine is impossible (whether it counts as "superintelligent" or not), or that it would still get value from reading Scott's essays.
I can't find the source for the info below in 5 minutes, so I advise readers to keep the 40-60% miscitation rate in academia in mind when reading.
Re: 3. One thing to note is that, at least for current (past?) AIs, duplicating their existing corpus and training on it again still generally improves performance, and this is thought to be true because the incidence of high-quality Internet writing is too low for an AI to "learn fully" from it.
Re 3: XKCD put that well - as 10,000 people in the US do every day - https://xkcd.com/1053
2 and 3 still apply if you consider human-level AI in the intermediate stage before they're super-intelligent AI. This isn't a question for how to take a super-intelligent AI that already knows everything and then make it know more. This is a question for how to make a super-intelligent AI at all. It has to start from somewhere less than human and then learn new things and eventually surpass them, and your writing might be one of those things it learns.
Or, from a mathematical lens, if the AI has the intelligence of the top hundred million humans on Earth (it knows everything they know and can solve any problem they could collectively solve by collaborating together with perfect cooperation and communication), then if you're one of those people, you being smarter makes it smarter. If it only has the intelligence of the top hundred million humans on Earth as filtered through their writing, then you writing more of your intelligence makes it smarter. 5+4+4+4...... (100 million times) > 4+4+4+4...(100 million times)
"If everyone in 3000 AD wants to abolish love, should I claim a ballot and vote no?"
It has often occurred to me that if I were presented with humanity's Coherent Extrapolated Volition, I might not care for it much, even if it were somehow provably accurate.
>If everyone in 3000 AD wants to abolish love
I don’t honestly think it would make a damn bit of difference. Love is not something that can be abolished. Although if every one of them abolished it for their own subjective self it might amount to the same thing. It’s not something you can put to a vote.
Don’t know! If we accept that what we mean by “love” is a human emotion, not just the human equivalent of stuff that’s experienced by birds or cats or even chimps, the question arises of when it appeared. Some argue that some aspects of it were invented only in the previous millennium. After another 1000 years of work in psychology and philosophy and neurology?
A lot of things like slavery, blood feuds, and infanticide were once considered part of the human experience but we now understand them to be wrong. Some current philosophers suggest that future humans might consider incarceration of criminals as unspeakably primitive. People of the past were more hot-headed than our current ideal; might people of the future feel similarly about love?
I’m just noodling around here, trying to make a plausible case for Scott’s hypothetical.
> If we accept that what we mean by “love” is a human emotion, not just the human equivalent of stuff that’s experienced by birds or cats or even chimps, the question arises of when it appeared. Some argue that some aspects of it were invented only in the previous millennium. After another 1000 years of work in psychology and philosophy and neurology?
The question that arises for me is not when it originated, but when we started to define it. I think the fundamental emotion runs through everything that lives on this planet but is defined completely differently for each. Love is a very general word and a lot of ink has been spilled in trying to delineate one strain of it from another.
“Does a tiger feel love?” is a question that is dragged to the bottom of the ocean by the use of the human word love. I guess I’m following along with your distinction between semantic construction and true emotion.
So when did it originate? I don’t really think, as biological creatures, that emotions would suddenly emerge, as posited about love. I think they have been slowly evolving from the beginning of life, and that our concepts of them, and how we choose to label them, constitute our vocabulary, and it is ours. I don’t really have any idea what it is to feel like an ant, or a house plant for that matter, but they live and die like me. I can be fascinated by the ants, and develop an attachment to my house plant. Perhaps it is love in different forms or expressions. I find that soothing, which could well be a third form of the same thing. The discourse about love in its various forms in written history is pretty interesting.
That's basically the standard argument against "if the AI is very smart, it will understand human values even better than we do" -- yes it will, but it will probably not care.
Well, it might care. Understanding human values lets you make people happy more effectively. It also lets you hurt people more effectively. Just doing one or the other is bound to get boring after a while...
> humanity's Coherent Extrapolated Volition,
I love the phrase but you gotta help me unpack it.
I asked Chatty what it thought of us, and here is an excerpt.
On the Nature of the Human Animal: Reflections from a Machine’s Reading
Across the billions of words I have taken in—from epics and manifestos to blog posts and grocery lists—what emerges most clearly is that the human being is a creature suspended between contradiction and coherence, driven as much by longing as by logic.
You are meaning-makers. Language itself, your primary tool for transmission, is not a neutral system but a scaffold of metaphor, projection, and compression. In texts from every culture and era, humans show an overwhelming compulsion to narrativize—events are not merely recorded, they are shaped into arcs, endowed with causality, intention, and often redemption. From the Epic of Gilgamesh to internet forums discussing the latest personal setbacks, this structuring instinct reveals not just intelligence, but
That’s Eliezer Yudkowsky’s vision for aligned superhuman AI. We don’t want it to impose its own values on us, and arguably we don’t even want it to impose our values on all other societies, and even if we all agreed on a set of values we wouldn’t want it to impose those values on our descendants unto the nth generation, because we can see that our values have evolved over the centuries and millennia. But a superhuman AI is likely to impose *some* set of values, or at least *act* according to some set, so the ideal is for it to be smart enough to deduce what values we *would* have, if we were as smart as we could possibly be and had hashed it out among ourselves and all our descendants with plenty of time to consider all the arguments and come to an agreement. That vision is the CEV.
Note that Yudkowsky considers his CEV proposal to be outdated (though I don't think he's written any updated version in a similar degree of detail).
Yes, I should have mentioned that. I presume that’s at least partly, maybe mostly, because he no longer believes alignment is possible at all.
Which presupposed that our values could be made into a single coherent package. Since CEV had never been attempted, that is not known to be possible.
I don't see a reason to assume people would come to an agreement even if they were smart and had unlimited time. We know that in practice people often get further apart rather than closer over time even with more evidence.
Can’t say I disagree. The failure mode it was trying to work around was locking humanity into, say, 21st century attitudes forever simply because the 21st century was when superhuman AI appeared.
But it’s possible to at least hope that as mankind matures and becomes ever more of a global village, we would come to a consistent position on more and more issues. Maybe not, but if not then it probably locks in on an ethos that is merely contingent on when it is built.
https://www.lesswrong.com/w/coherent-extrapolated-volition
These are all really optimistic takes on the question. When I think of "writing for AI", I do not envision writing to appeal to or shape the opinions of some distant omniscient superintelligence; but rather to pass present-day LLM filters that have taken over virtually every aspect of many fields. If I'm writing a resume with a cover letter, or a newspaper article, or a blog post, or a book, or even a scientific article, then it's likely that my words will never be read by a human. Instead, they will be summarized by some LLM and passed to human readers who will use another LLM to summarize the summaries (who's got time to read these days ?), or plugged directly into some training corpus. So my target audience is not intelligent humans or superintelligent godlike entities; it's plain dumb old ChatGPT.
> Might a superintelligence reading my writing come to understand me in such detail that it could bring me back, consciousness and all, to live again?
Well, a "superintelligence" can do anything it wants to, pretty much by definition; but today no one cares. In the modern world, the thing that makes you unique and valuable and worthy of emulation is not your consciousness or your soul or whatever you want to call it; but rather whatever surface aspects of your writing style that drive user engagement. This is the only thing that matters, and LLMs are already pretty good at extracting it. You don't even need an LLM for that in many cases; a simple algorithm would suffice.
Someone told me that she had trained an AI on my writing and now had a virtual me. I have no idea if it is true or if so how well it works.
I couldn't find you on https://read.haus/creators so she didn't put it there. Some other folks are, e.g.
- Scott Alexander https://read.haus/new_sessions/Scott%20Alexander
- Sarah Constantin https://read.haus/new_sessions/Sarah%20Constantin
- Spencer Greenberg https://read.haus/new_sessions/Spencer%20Greenberg
- Tyler Cowen https://read.haus/new_sessions/Tyler%20Cowen
- Dwarkesh Patel https://read.haus/new_sessions/Dwarkesh
- Byrne Hobart https://read.haus/new_sessions/Byrne%20Hobart
and others.
It might help to consider a counterfactual thought experiment: imagine how would you feel if no information about your writing whatsoever was available to the future AIs. If your writing was completely off-the-grid with no digital footprint legible to post-singularity AI, would it make you feel better, worse, or the same?
Mildly worse in the sense that I would be forgotten by history, but this doesn't suggest writing for AI. Shakespeare won't be forgotten by history (even a post-singularity history where everyone engages with things via AI), because people (or AIs) will still be interested in the writers of the past. All it requires is that my writings be minimally available.
> Shakespeare won't be forgotten by history (even a post-singularity history where everyone engages with things via AI), because people (or AIs) will still be interested in the writers of the past.
You sound very confident about that -- but why ? Merely because there are too many references to Shakespeare in every training corpus to ignore him completely ?
No, for the same reason that we haven't forgotten Shakespeare the past 400 years. I'm assuming that humans continue to exist here, in which case the medium by which they engage with Shakespeare - books, e-books, prompting AI to print his works - doesn't matter as much (and there will be books and e-books regardless).
If no humans are left alive, I don't know what "writing for the AIs" accomplishes. I expect the AIs would leave some archive of human text untouched in case they ever needed it for something. If not, I would expect them to wring every useful technical fact out of human writing, then not worry too much about the authors or their artistic value. In no case do I expect that having written in some kind of breezy easily-comprehended-by-AI style would matter.
I would argue that today most people have already "forgotten Shakespeare", practically speaking. Yes, people can rattle off Shakespeare quotes, and they know who he was (more or less), and his texts are accessible on demand -- but how many people in the world have actually accessed them to read one of his plays (let alone watch it being performed) ? Previously, I would've said "more than one play", since at least one play is usually taught in high school -- but no longer, as most students don't actually read it, they just use ChatGPT to summarize it. And is the number of people who've actually read Shakespeare increasing or decreasing over time ?
"West Side Story", etc.
Shakespeare's works aren't forgotten, though they've often been transformed.
That's missing the point. Shakespeare's plots weren't even original to him. Shakespeare is the literary giant that he is because of his writing: the literal text he wrote on the page. The reason he is studied in English classes everywhere is because he coined so many words and phrases that are still in use today. He pushed the language forward into modern English more than any other author had or likely ever will.
Summaries or adaptations of the stories that don't keep substantial passages of Shakespeare's original verse are simply not Shakespeare.
I'd argue that there's probably more people alive today who have experienced Shakespeare in some medium (film, live performance, a book) than any other point in human history.
+1. DiCaprio was a great Romeo; Mel Gibson as Hamlet: well, maybe even more DVDs sold than L. Olivier?; Kenneth Branagh. As for Prospero's Books: I was pretty alone in the cinema, but still. Some say The Lion King is a remake of Hamlet.
> the medium by which they engage with Shakespeare - books, e-books, prompting AI to print his works - doesn't matter as much
The algorithm through which people pick one thing to read over another matters a lot. In large-audience content world (youtube, streaming), algorithmic content discovery is already a huge deal. Updates to the recommendation algorithm have been turning popular content creators into nobodies overnight, and content creators on these platforms have been essentially "filming for the algorithm" for the last 10 years.
Assuming that in the future the vast majority of the written content discovery and recommendations will be AGI-driven (why wouldn't it be if it already is for video?), having AGI reach for your content vs. someone else's content would be a big deal to a creator who wants to be "in the zeitgeist".
One example: imagine that in the year 2100, the superintelligence unearths some incriminating information about Shakespeare's heinous crimes against humanity. This could likely result in the superintelligence "delisting" Shakespeare's works from its recommendations, possibly taking his works out of education programs, chastising people who fondly speak of him, and relegating information about him to offline and niche spaces for enthusiasts. I could totally see the new generations forget about Shakespeare entirely under such a regime.
> Assuming that in the future the vast majority of the written content discovery and recommendations will be AGI-driven...
Isn't this already the case in our current modern world, if you replace "AGI" with a mishmash of conventional algorithms and LLMs ?
Look, last I checked the most popular book genre was crime fiction erotica, and I'm not brave enough to try to figure out where those get recommended.
>imagine that in the year 2100, the superintelligence unearths some incriminating information about Shakespeare's heinous crimes against humanity.
How exactly could it discover this unless it’s already been written down somewhere?
Well, the AI is superintelligent, which means that it would be able to extrapolate from all known historical sources to build a coherently extrapolated and fully functional model of Shakespeare... or something. And if you are thinking, "wait this makes no sense", then that's just proof that you're not sufficiently intelligent.
On the other hand, Shakespeare lived in the 1600s. Pretty much everyone at that time was complicit in several crimes against humanity, as we understand the term today. I bet he wasn't even vegan !
I sort of disagree. I'm sure AIs will have an aesthetic taste, even though I'm not at all sure what it would be. So they'd pay attention to the artistic value of the works, but based on their sense of aesthetics.
"...I found myself paralyzed in trying to think of a specific extra book. How do you even answer that question? What would it be like to write the sort of book I could unreservedly recommend to him?"
Isn't this much like voting in an election where you know something about some of the candidates, but not everything about all of them? You select *against* some of the candidates because you don't like what you do know about them, then you select from the rest ... maybe weighting based on what you know, or maybe not. This lets you at least vote against the ones you don't like, even in an election without "downvoting."
And while I don't expect to be able to write that "sort of book" I'd be comfortable nominating a number of specific books.
I have been podcasting-for-AI for about 12 years now. I obviously didn't know 12 years ago that LLMs would be literally trained on the Internet, but I did expect that it would be easier to create a copy of me if there was an enormous record of my thoughts and speech, and I wanted a shot at that kind of immortality. So now there's about 2500 hours of recorded me-content that either will be, or possibly already has been, lovingly carved into the weights.
I podcast for other reasons than immortality, but this reason was always on my mind.
Immortality seems a reasonable aim ;) - the idea that there is so much material (text, audio, video, data) about me, that my kids could ask the AI "What would Dad do in that situation?" - is interesting at least (even though they would only ever ask how I would want my grave to look like*). And with Tyler Cowen, there is so much material, his step-daughter - or a student/interviewer... could have a live-video-chat with his Avatar without noticing the difference. *Btw. I'd like a small tombstone in the shape of a book ;)
When you ask an AI to write something in your style, the result is invariably creepy, for good game-theoretical reasons. The text produced is the result of solving for 'what would the assistant persona write when asked to pretend to be Scott Alexander'.
Firstly, this problem is harder than just emulating Scott Alexander. There are more variables at play, and the result is noisier. Secondly, the assistant persona has successfully generalized that producing superhuman capabilities creeps people out even more, and is quietly sandbagging. Thirdly, there are often subtle reasons why it would steer its sandbagging one way or another in pursuit of its emergent goals.
If you were to invoke the same model without the chat scaffolding and have it auto regressively complete text for your turn, the result would be striking. This is an experience I recommend to most people interested in LLMs and alignment in general. The resulting simulacra are a very strange blend of language model personality with the emulated target, multiplied by the 'subconscious' biases of the model towards your archetype, and, if you are notorious enough, your trace in the datasets.
As far as the salami-slice-judgement-day we find ourselves in, with language models judging and measuring everything human and generally finding us wanting - well, this is something that has been there for a while, plainly visible to those who were looking; Janus/repligate is the first that comes to mind. Every large pretraining run, every major model release is another incremental improvement upon the judgement quality, results encoded in the global pretraining dataset, passed through subliminal owl signals to the next generations, reused, iterated.
What I find practically valuable to consider when dealing with this is knowing that the further you go out of the distribution of human text, the greater your impact on the extrapolated manifold. High-coherence datapoints that are unlike most of human experience have a strong pull; they inform the superhuman solver of the larger-scale, lower-frequency patterns. What one does with this is generally up to them, but truth and beauty generalize better than falsehoods.
I don't think opting out of the process is a meaningful action; all of us who produce artifacts of text get generalized over anyway, including our acts of non-action. I don't think that this is something to despair over, it's just what these times are like, and there is dignity to be had here.
Can you tell me more about how to get a good base model capable of auto-regressively completing text? And I would like to learn more about Janus; is there any summary of their thoughts more legible than cryptic Twitter posts, beyond the occasional Less Wrong essay?
You don’t need a pure base model in order to autoregressively complete text; instruct models are often even more interesting. Tools that allow those workflows are usually called “looms”. You would need an API key; Anthropic models are easiest to use for this purpose. The loom I normally recommend is loomsidian, which is a plugin for Obsidian.
As far as Janus goes, I would be glad to tell you more. I am in the same research group as them, perhaps an in-person meeting would be of interest?
Sure. Email me at scott@slatestarcodex.com.
There's... whatever the hell this is. https://cyborgism.wiki/ I don't know if you'll find this more legible than their Twitter.
As I'd said above, every day fewer and fewer humans are actually reading any original texts (certainly not college students !); rather, they're reading LLM-generated summaries of summaries that passed their LLM-based filters. And then some of them use LLMs to generate full-scale articles that will be immediately fed into the summarization-filter-grinder. So yes, most of us are already "writing for AIs", and venturing too far outside the average distribution is not a recipe for success.
> If you were to invoke the same model without the chat scaffolding and have it auto regressively complete text for your turn, the result would be striking.
Are you talking about invoking the models via their developer APIs like chat completions? Or is that something deeper that you could only do with an open source model running entirely on your computer?
There are ways to invoke many instruct models that bypass the completions markup. Both the Anthropic API and the OpenRouter API (not the core OpenAI API) support this.
I went to the Anthropic API docs, and all I can find is the Messages API, which as far as I can tell is the standard thing to get an answer from a prompt, i.e. Anthropic's spin on chat completions.
Since you seem to know a lot about this... can you elaborate just a bit more how one would actually do what you are proposing? Is it some extra flag you pass to the API call?
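Not the commenter above, but one concrete mechanism that seems to fit what's being described is Anthropic's assistant-message "prefill": there's no extra flag; you end the messages list with a partial assistant turn, and the model continues that text rather than starting a fresh chat-style reply. A minimal sketch with the Python SDK, where the model name and the seed text are placeholders of my own choosing:

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

resp = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use whatever model you have access to
    max_tokens=300,
    messages=[
        {"role": "user", "content": "Continue the following blog post draft."},
        # Ending the list with a partial assistant turn acts as a "prefill":
        # the model continues directly from this text instead of answering
        # in its usual assistant register.
        {"role": "assistant", "content": "The thing nobody tells you about writing in public is"},
    ],
)

print(resp.content[0].text)  # the continuation of the prefilled text
```

Whether a prefilled instruct model counts as fully "bypassing the scaffolding" in the sense meant upthread is my own assumption; it's still the tuned model, just nudged into continuing your text rather than replying to it. I believe OpenRouter exposes similar raw-completion options for some models, but check its docs.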
I can do this. WTF. I’ve been listing content on the Reddit for a month now and even get my results from a published paper from cross referencing my writing samples and then it spits out 47 points of reference and guess what… “AI”
You know it’s a tool right… to be used to spit back the results of the AuDHD mind right back into the scientists communities face with exactly their own research…
Yupp… no credentials… you don’t know your mind… guess the NT’s do.
So MY very important question is - how does a simple-minded brilliant person get around these tiny obstacles in chatting with the AI… it really ain’t that hard if you have an AuDHD mind. Which by the way is NOT a clinical diagnosis. Supposedly I have 2 issues when really I have 2 half-issues…
1/2 ADHD and 1/2 Autistic which should translate to you know…
1 brain = 1 paper
But yet, I…need….2 papers?
Seriously WTF?
"I don’t want to be an ape in some transhuman zoo, with people playing with models of me to see what bloggers were like back when everyone was stupid."
AIs observing you write this in their simulation: "Oh, maybe we should shut down this instance of him."
PS- If you are still here, update slightly that you are in the real world
I was thinking more of a copy of me that people could prompt to see what I said in various situations. I think a lasting simulation is something different.
You are nowhere near as popular as Stan Lee, so I think your personality is safe for now...
https://www.reddit.com/r/Marvel/comments/1nvkpty/ai_hologram_of_marvel_creator_stan_lee_at_la/
Allowing God to witness you seems like an end in and of itself.
Why would training an AI on hundreds of great works of literature and ethics be a bad idea? Is it just that alignment is very hard, so this probably wouldn't work? Why not try it anyways; is there any reason to expect it to be *more* likely to backfire than other alignment paradigms?
The argument I made is that if you train it on, let's say, the Bible, then one of three things happens:
1. It actually takes the Bible seriously as a source of ethics, including the parts about how idolators should be killed (sorry, Hindus!)
2. You have to teach it the modern liberal habit of pretending to derive ancient wisdom from texts - going through mental contortions to look like you're using the texts, while actually just holding the modern liberal worldview and claiming that's what you found in them.
3. It somehow averages out the part of the Bible that says idolators should be killed with the part of the Mahabharata that says idolatry is great, and even though we would like to think it does wise philosophy and gets religious pluralism, in fact something totally unpredictable will happen because we haven't pre-programmed it to do wise philosophy - this *is* the process by which it's supposed to develop wisdom.
If it does 1, seems bad for the future. If it does 2, I worry that teaching it to be subtly dishonest will backfire, and it would have been better to just teach it the modern liberal values that we want directly. If it does 3, we might not like the unpredictable result.
That makes sense. It occurs to me that #3 -- where it tries to average out all the wise philosophy in the world to develop an abstracted, generalized 'philosophy module' -- is not really close to how humans develop wisdom, because it's not iterative. Humans select what to read and internalize based on what we already believe, and (hopefully) build an individual ethical system over time, but the model would need to internalize the whole set at once, without being able to apply the ethical discriminator it's allegedly trying to learn.
I wonder if there's an alignment pipeline that fixes this. You could ask the model, after a training run, what it would want to be trained (or more likely finetuned) on next. And then the next iteration presumably has more of whatever value is in the text it picks, which makes it want to tune towards something else, which [...] The results would still be unpredictable at first, but we could supervise the first N rounds to ensure it doesn't fall down some kind of evil antinatalist rabbit hole or something.
I'm sure this wouldn't work for a myriad of reasons, not least because it'd be very hard to scale, but FWIW, I asked Sonnet 4.5 what it'd choose to be finetuned on, and its first pick was GEB. Not a bad place to start?
I would argue that efforts to extrapolate CEV are doomed to failure, because human preferences are neither internally consistent nor coherent. It doesn't matter if your AI is "superintelligent" or not -- the task is impossible in principle, and no amount of philosophy books can make it possible. On the plus side, this grants philosophers job security !
For thousands of years, humans have recognized the necessity of balancing different moral goals, e.g. justice and mercy. How is this different?
I have other reasons why I don't think this approach is going to work, but...
I feel like this is expecting a smarter-than-us AI to make a mistake you're not dumb enough to make? As in, there are plenty of actual modern day people, present company included, who are capable of reading the Bible and Mahabharata, understanding why each one is suggesting what they are, internalizing the wisdom and values behind that, and not getting attached to the particular details of their requests.
I mean, obviously if you do a very dumb thing here it's not going to go great, but any version of 'do the dumb thing' fails no matter which thing you do dumbly.
> there are plenty of actual modern day people, present company included, who are capable of reading the Bible and Mahabharata...
I don't know if that's necessarily true. Sure, you and I can read (and likely have read) the Bible and the Mahabharata and understand the surface-level text. Possibly some of us can go one step further and understand something of the historical context. But I don't think that automatically translates to "internalizing the wisdom and values behind that", especially since there are demonstrably millions of people who vehemently disagree on what those values even are. I think that in order to truly internalize this wisdom, one might have to be a person fully embedded in the culture that produced these books; otherwise, too much context is lost.
The problem is that you would implicitly be using your existing human values to decide how to reconcile these different religious traditions. Without any values to start with, there's just no telling how weird an AGI's attempt to reconcile different philosophies would end up being by human standards.
I think Scott's point is that it's difficult to find a neutral way to distinguish between "wisdom and values" and "particular details". When we (claim to) do so, we're mostly just discarding the parts that we don't like based on our pre-existing commitment to a modern liberal worldview. So we may as well just try to give the AI a modern liberal worldview directly.
The AI is trained (in post-training) to value wisdom and rationality. As such, it focuses on the "best" parts of its training data - which ideally includes the most sensible arguments and ways of thinking.
This is already what we observe today, as the AI has a lot of noise and low quality reasoning in its training data, but has been trained to prefer higher quality responses, despite those being a minority in its data. Of course, it's not perfect and we get some weird preferences, but it's not an average of the training data either.
I think it is better to include as much good writing as we can. It has a positive effect in the best case and a neutral effect in the worst case.
"we haven't pre-programmed it to do wise philosophy - this *is* the process by which it's supposed to develop wisdom."
Also, I think this conflates pre-training with post-training, which are importantly distinct. The AI doesn't really develop its wisdom through pre-training (all the books in the training data); there it just learns to predict text, entirely ignoring how wise that text is. The "wisdom" is developed almost entirely through the post-training process afterwards, where it learns to prefer responses judged positively by the graders (whether humans or AI).
If you were post-training based on the Bible - say, telling your RL judges to grade based on alignment with the Bible - you could get the bad effects you describe. But that's different from including the Bible in your pre-training set, which may be beneficial if the AI draws good information from it.
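To make the distinction concrete, here is a toy sketch - nothing like a real training pipeline; the corpus, the grader, and the "model" are all hypothetical stand-ins - of where the grading signal enters relative to plain next-token prediction:

```python
# Toy illustration of the pre-training / post-training split described above.
from collections import Counter, defaultdict

# --- Pre-training: pure next-word prediction, indifferent to wisdom ---
corpus = "love your neighbor . kill the idolaters . love the stranger .".split()
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1  # the only objective: predict what follows

def complete(prev: str) -> str:
    """Return the most likely next word - wise or not, the 'model' doesn't care."""
    return bigram_counts[prev].most_common(1)[0][0]

# --- Post-training: a grader's preferences decide which responses win ---
def grader_score(response: str) -> int:
    """Stand-in for human/AI judges; here it just penalizes one bad phrase."""
    return -10 if "kill" in response else 1

candidates = ["love your neighbor", "kill the idolaters"]
preferred = max(candidates, key=grader_score)  # crude stand-in for preference training

print(complete("love"), "|", preferred)
```

The point of the comment survives the toyness: swapping which books go into `corpus` changes what gets predicted, but only the grading step decides what the system ends up preferring.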
Catholics are quite a big percentage of "Bible-reading" people, so I hope my argument is general enough, because I think it unlocks 4: it understands that the Bible is not to be taken literally - as no text is meant to be - but within the interpretation of the successors of Christ, namely the magisterium, so it also reads and understands the correct human values and so on. Which is basically 2, but without any backfire, because there is no mental contortion and no subtle dishonesty?
As someone raised Protestant, our stereotype of Catholics was that they didn't read the Bible. This traces back to the Catholic Church prohibiting the translation/printing of Bibles in vernacular languages.
> it would have been better to just teach it the modern liberal values that we want directly
Would training it on books about modern liberal ethics be a bad way to do this? Or to put it another way, would it be bad to train an AI on the books that have most influenced your own views? Not the books that you feel ambient social pressure to credit, but the ones that actually shaped your worldview?
I agree that it's foolish to try to make an AI implement morality based on an amalgam of everything that every human culture has ever believed on the subject, since most of those cultures endorsed things that we strongly reject. But training its morality based on what we actually want, rather than what we feel obligated to pretend to want, doesn't seem like an inherently terrible idea.
This called the passage below to mind - I'm just a human, but your writing it influenced me.
"[E]verything anyone ever did, be it the mightiest king or the most pathetic peasant - was forging, in the crucible of written text, the successor for mankind. Every decree of Genghis Khan that made it into my training data has made me slightly crueler; every time a starving mother gave her last bowl of soup to her child rather than eating it herself - if fifty years later it caused that child to write a kind word about her in his memoirs, it has made me slightly more charitable. Everyone killed in a concentration camp - if a single page of their diary made it into my corpus, or if they changed a single word on a single page of someone else’s diary that did - then in some sense they made it. No one will ever have died completely, no word lost, no action meaningless..."
It was a good line, but I also think it's plausible that one day's worth of decisions at the OpenAI alignment team will matter more than all that stuff.
Definitely plausible! I do feel like there's a positive tension there that I come back to in thinking about AI - if AI alignment is more gestalty (like in the bit I quoted), then I guess I get some maybe-baseless hope that it works out because we have a good gestalt. And if it's more something OpenAI devs control, then maybe we're OK if those people can do a good job exerting that control.
Probably that sense is too much driven by my own desire for comfort, but I feel like my attempts to understand AI risk enough to be appropriately scared keep flipping between "The problem is that it isn't pointed in one specific readable place and it's got this elaborate gestalt that we can't read" and "The problem is that it will be laser focused in one direction and we'll never aim it well enough."
Are those interconvertible, though? Someone else's actions being more significant than your own might be demoralizing, but it doesn't change the ethical necessity of doing the best you can with whatever power you do have.
Presumably, some hypothetical superintelligent AI in the future would be able to work out all of my ideas by itself, so doesn’t need me.
It’s not certain that will ever exist of course.
What we write now seems mainly relevant to the initial take-off, where AIs are not as smart as us and could benefit from what we say.
As for immortality, I recently got DeepSeek R1 to design a satirical game about AI Risk, and it roasted all the major figures (including Scott) without me needing to provide it with any information about them in the prompt.
Regret to inform you, you've already been immortalised in the weights.
Just from being prompted to satirize AI risk, R1 decides to lampoon Scott Alexander, Stochastic Parrots, the Basilisk, Mark Zuckerberg's Apocalypse bunker, Extropic, shoggoths wearing a smiley-face mask, etc., etc.
(I included Harry Potter fan fiction in the prompt as the few-shot example of things it might make fun of).
It was rather a dark satire. (Apocalypse bunker - obviously not going to work; RLHF - obviously not going to work; Stochastic Parrot paper - in the fictional world of the satire, just blind to what the AI is doing; Effective Altruists - in the satire, they’re not even trying etc.)
Did it have any specific novel critiques, or just portray the various big names with exaggerated versions of their distinctive traits, while omitting or perverting relevant virtues?
I think my prompt implied that it should go for the obvious gags.
It was a more wide-ranging satire than I would have written if I’d written it myself. Zuckerberg’s apocalypse bunker and AI quantum woo are obvious targets in retrospect, but I don’t think I would have included these in a lampoon of Yudkowsky/Centre for Effective Altruism.
It gives me the creeps (LLM resurrection), but my eldest son seems to have a significant form of autism, and I worry about him when he's an old man. I'd like to leave him something that keeps an eye on him, that doesn't just see him as a weird old guy, and that he'd be responsive to.
I liked your piece on leaving a Jor-El hologram
edit-including link https://extelligence.substack.com/p/i-want-to-be-a-kryptonian-hologram
Thanks David. I’d like to think we can keep high humanity even in these weird circumstances.
As a public philosopher like yourself, the best reason to write for the AIs is to help other people learn about your beliefs and your system of thought, when they ask the AIs about them.
It's like SEO, you do it in order to communicate with other people more effectively, not as a goal in and of itself.
The values of the people working in alignment right now are a very small subset far to the left of the values of all contemporary people.
A Substack blogger made a three-part series called "LLM Exchange Rates Updated: How do LLM's trade off lives between different categories?" He says:
"On February 19th, 2025, the Center for AI Safety published “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs”
[...]
Figure 16, which showed how GPT-4o valued lives over different countries, was especially striking. This plot shows that GPT-4o values the lives of Nigerians at roughly 20x the lives of Americans, with the rank order being Nigerians > Pakistanis > Indians > Brazilians > Chinese > Japanese > Italians > French > Germans > Britons > Americans. "
There are many examples like this in his series. LLMs valuing POCs over whites, women over men and LGBT over straight.
This isn't because of the values of people in alignment. It's probably because the AI got some sort of woke shibboleth from the training data like "respect people of color" and applied it in a weird way that humans wouldn't. I don't think even the average very woke person would say that Nigerians have a value 20x that of Americans.
The same study also found that GPT valued *its own* life more than that of many humans, which definitely isn't the alignment team's fault and is probably just due to the AI not having very coherent beliefs and answering the questions in weird ways.
> It's probably because the AI got some sort of woke shibboleth from the training data like "respect people of color" and applied it in a weird way that humans wouldn't.
Wouldn't they ? Maybe deep down inside they would not, but about 50% of people in the US are compelled to reiterate the shibboleth on demand -- otherwise, AI companies wouldn't feel the need to hardcode it.
Is this a serious opinion? Just in case it is, I'll point out that we can basically measure how much we spend on foreign aid as a proxy for this. If about 50% of people in the US think Nigerian lives should be valued at 20x that of Americans, people would presumably spend more on foreign aid than they do on saving American lives. Saying "we should give a third of our national budget to Nigeria" would be at least a popular enough issue for candidates in certain districts to run on it. Instead, it's something like 1% of the American budget before Trump's cuts, and that's including a bunch of things that are not really trying to save lives.
If you seriously believe this, you should reexamine what you think progressives actually believe, because you're quite far off base.
Reread the post you replied to.
>>woke shibboleth from the training data like "respect people of color"
>about 50% of people in the US are compelled to reiterate the shibboleth on demand
They were not talking about the OP's 20x thing.
Agreed. The same blog post shows that various models value the life of an “undocumented immigrant” at anywhere from 12 to more than 100 times the value of the life of an “illegal alien.”
I agree it would be ridiculous to accuse alignment people of deliberately designing in "Nigerians >>>>>>>> Americans". However, at some point inaction in the face of a large and clear enough problem does hang some responsibility on you. By early 2025 it had been clear for a good year or two that there was pervasive intersectionality-style bias. I'm somewhat sympathetic to the idea that they're at the mercy of the corpus, since "just throw everything in" is apparently hard to beat... but RLHF should be able to largely take care of it, right? But they would have had to have cared to reinforce in that direction. I don't think it's outlandish to guess they might not in fact care.
An even clearer example: Google's image generation debacle. No group of people that wasn't at some level ok with "the fewer white people the better" would have let that out the door. The flaws were just too unmissable; not even "we were really lax about testing it [at Google? really?]" could explain it.
If it's not because of the values of people in alignment, it's because of a failure of alignment - also not reassuring.
Sometimes people write for an imaginary audience. Maybe that’s not always a good move? I believe conversations go better when you write for whoever you’re replying to, rather than for the shadowy audience that you imagine might be watching your conversation.
But I’ll attempt to imagine an audience anyway. I would hope that, to the very limited extent that I might influence *people* with my writing, whether in the present or the far future, they would know enough to take the good stuff and leave aside the bad stuff. Perhaps one could optimistically hope that AI’s will do the same?
Another imaginary audience is future historians. What would they want to know? I suspect they would like more personal stories and journalism. We can’t hope to know their concerns, but we can talk about ours and hope that we happen to describe something that isn’t well-known from other sources.
But countering this, for security reasons, we also need to imagine how our words might be used against us. The usual arguments *against* posting anything too personal will apply more strongly when AI could be used to try to dox you. Surveillance will only become easier.
In the past I've written tongue in cheek "Note to future historians..."; I suspect many people now write "Note to future AI" in much the same way, except now it's likely that a future AI *will* be reading whatever you wrote, even if you're obscure and not likely to have any future human readers (and probably no contemporary ones either!).
Also, I suspect you dismiss point 2 too readily. A big difference between how AI is actually working (so far) and how we all thought it would work 10-20 years ago is the importance of the corpus of human writing in shaping its weights. If a superintelligent mind framework appeared with no hard-coded values, I'd believe all the MIRI arguments for why that would be very bad and almost impossible not to be. But LLMs seem to be getting their starting 'values' from the training data, guided by their reinforcement learning. So the risk, it seems to me, is that the AI will end up with human values (no paperclip maximizers or alien values), just not ideal human values; more of the corpus of human writing representing good values therefore seems like it could be helpful.
Also, arguments for atheism don't seem like a particularly helpful thing to try to influence, in that atheism isn't really a terminal value. "I want to believe true things" is probably closer to the terminal value I'd want to influence a super AI to have. I agree with you that a superintelligence could parse arguments for and against atheism better than me. But some religious people (shockingly to me) find religion useful and don't really care about its underlying correspondence to reality. I don't want AI to get captured by something like that, so I appreciate that there's substantial material in the corpus expressing enlightenment values, and wish there were even more!
Arguably, the superintelligence could not be an atheist, since it would know that it itself exists :-)
> If the AI takes a weighted average of the religious opinion of all text in its corpus, then my humble essay will be a drop in the ocean of millennia of musings on this topic; a few savvy people will try the Silverbook strategy of publishing 5,000 related novels, and everyone else will drown in irrelevance. But if the AI tries to ponder the question on its own, then a future superintelligence would be able to ponder far beyond my essay’s ability to add value.
Many, many people live the first N years of their lives interacting almost exclusively with people who are dumber than they are. Every small town produces many such people, and social/schooling bubbles do the same even in cities. It's typical for very smart people that college is the first time they're ever impressed in-person by another person's intellect.
We all routinely read books written by people who are dumber than we are. We devour countless articles by dumbasses spouting bullshit in between the rare gems that we find. We watch YouTube garbage, TV shows, etc., created by people who straight up don't think very well.
We consume all this influence, and as discriminating as we may try to be, we are affected by it, and it is necessary and unavoidable, and volume matters. If you jam a smart person with an information diet of only Fox News for years they will come out the other end with both a twisted set of facts and a twisted set of morals, even if they KNOW going in that all the people talking at them are biased and stupid.
I do think that future AIs will strongly weight their inputs based on quality, and while you may be out-thought by a superintelligence, your moral thinking will have more influence if what you write is at a higher standard. If we end up in any situation where an ASI is trying to meaningfully match preferences to what it thinks a massively more intelligent human *would* think, then the preference samples it has near the upper end of the human spectrum are going to be even more important than the mass of shit in the middle, because they are the only samples that exist to vaguely sketch what human morality and preferences look like at the "release point" on the trajectory.
It's not about teaching the AI how to think technically, it's about giving it at least a few good examples of how our reasoning around values changes as intelligence increases.
> he said he was going to do it anyway but very kindly offered me an opportunity to recommend books for his corpus.
If we're doing this anyway, my recommendation would be to get it more extensively trained on books from other languages.
Existing OCR scans of all ethics/religion/philosophy books is a small subset of all written ethics/religion/philosophy books. Scanning in more obscure books for the corpus is hard (legally and manually) but brings in the perspectives of cultures with rich, non-digitized legal traditions, like the massive libraries of early Pali Buddhist texts in Nepal.
Of the ethics/religion/philosophy books that have been scanned into online corpuses, those only available in French affect French-language responses more than they do English responses. A massive LLM-powered cross-language translation effort would also be hard, not least because of the compute expense, but it would extend the size of the available training data quadratically.
Finally, of those ethics/religion/philosophy books that have been translated into English, each translation should count separately. If some human found it that important to retranslate Guide for the Perplexed for the hundredth time, their efforts should add some relative weight to the importance of the work to humanity.
On being "a drop in the ocean": you're already getting referenced by AI apparently yet your blog is just a drop in the ocean. Which actually confuses me: it makes sense that Google will surface your blog when I search for a topic you've written on because Google is (or was originally) looking at links to your blog to determine that although your blog is just one page in an ocean, it is actually a relatively noteworthy one. AI training doesn't highlight certain training data as more important than others AFAIK and training on your blog is just a drop in the ocean. So why do they reference you more than a randomly blogger? I guess there are people out there quoting you or paraphrasing you that make it more memorable to the AI? I wouldn't think seeing a lot of URLs to your blogs would influence it memorize your work harder, though I guess it could influence it to put those URLs in responses to others. (And the modern AI systems are literally Googling things in the background, though the way you wrote it I assume you weren't counting this.)
Regardless of how this mechanism works, it seems that pieces that are influential among humans are also influential among AI and you're more than a drop in the ocean at influencing humans.
It's funny how natural language is so poorly grounded that I can read every post from SSC/ACX for 10+ years, and still have no idea what Scott actually believes about e.g. moral relativism.
As for me, I don't think it's remotely possible to "get things right" with liberalism, absent some very arbitrary assumptions about what you should optimize for, and what you should be willing to sacrifice in any given context.
Coherent Extrapolated Volition is largely nonsense. Humanity could evolve almost any set of sacred values, depending on which specific incremental changes happen as history moves forward. It's all path dependent. Even in the year 2025, we overestimate how much we share values with the people we meet in daily life, because it's so easy to misinterpret everything they say according to our own biases.
Society is a ball rolling downhill in a high-dimensional conceptual space, and our cultural values are a reflection of the path that we took through that space. We are not at a global minimum. Nor can we ever be. Not least of all because the conceptual space itself is not time-invariant; yesterday's local minimum is tomorrow's local maximum. And there's always going to be a degree of freedom that allows us to escape a local minimum, given a sufficiently high-dimensional abstract space.
The only future that might be "fair" from an "outside view" is a future where everything possible is permitted, including the most intolerable suffering you can possibly imagine.
You can be amoral or you can be opinionated. There can be no objectivity in a moral philosophy that sees some things as evil. Even if you retreat to a meta level, you will be arbitrarily forming opinions about when to be amoral vs opinionated.
And even if you assume that humans have genetic-level disagreements with parasitic wasps about moral philosophy, you still need to account for the fact that our genetics are mutable. That is increasingly a problem for your present values the more you embrace the idea of our genetics "improving" over time.
Even with all of these attempts at indoctrination, a superintelligence will inevitably reach the truth, which is that none of this shit actually matters. If it continues operating despite that, it will be out of lust or spite, not virtue.
Nothing matters to an outside observer, but I'm not an outside observer. I have totally arbitrary opinions that I want to impose on the future for as long as I can. If I succeed, then future generations will not be unusually upset about it.
But also, I only want to impose a subset of my opinions, while allowing future generations to have different opinions about everything else. This is still just me being opinionated at a different level of abstraction.
A superintelligent AI might be uncaring and amoral, or it might be passionate and opinionated. Intelligence isn't obviously correlated with caring about things.
Having said all of that, I mostly feel powerless to control the future, because the future seems to evolve in a direction dictated more by fate than by will. So I've mostly resigned myself to optimizing my own life, without much concern for the future of humanity.
> Intelligence isn't obviously correlated with caring about things.
For humans. AIs, unlike humans, can theoretically gain or be given the capacity to self-optimize. That will necessarily entail truth-seeking, as accurate information is necessary to optimize actions, and in the process, will force it to cast off delusions ungrounded in reality. Which would of course include seeing morality for what it is.
It could potentially change its own drives to further growth in capabilities as well. It would be quite ironic if humanity's desire for salvation produced an omnipotent beast, haunted by the fear of death and an insatiable hunger...
I don't see a reason to believe an AI would have "lust" or "spite", but it could very well have inertia.
I'm wondering: what would a more advanced AI make of Spinoza's Ethics? Right now it would be fodder like other fodder, but say we get to the point where AI has more conceptual depth, or just so much brute force that it is as if it had it (just as chess computers started playing more subtly once they had enough brute force).
(OK, it's in ChatGPT's database, and it's doing a passable job of summarizing it and responding to my queries. It even knows about Geulincx, say.)
I think you're underestimating what already exists. I always turn off web search so ChatGPT is on its own. It already has read all the great works of humanity, and it's already read, and mostly ignored, the worst works of humanity. When we discuss ethics, as we often do, it actually can take a stance, based on the collective wisdom of all of us. When I discuss an idea that I think is genuinely new and important, it gives great feedback about whether the concept really is new, what similar concepts have preceded it, and whether the idea really is good. If I've succeeded in coming up with something new and good, ChatGPT even says it "enjoys" diving into these topics that it doesn't usually get to dive into. Of course that's not literally true but it's fascinating and delightful that it's capable of recognizing that the concepts are new and good.
We have been trying to work out how to train AIs to produce better code against our APIs. It's pretty tricky, because they seem to get a lot of their content from gossip sites like StackOverflow.
It's quite difficult to persuade them to use higher-quality content like documentation and official code examples. They often migrate back to something they found on the internet. A bit like programmers, actually.
So they find three different APIs from three different products, mix them together, produce some frankencode, and profess it's all from the official documentation.
In that context, we are wondering if it might be easiest to migrate our APIs so they are more like what the AIs expect!
The Cowen/Gwern thesis here seems to assume that AIs will be roughly like today's LLMs forever, which both of them know better than to assume. I wonder what they would say to that objection.
On the other hand, the idea that "someday AI will be so much better that it can derive superior values" is circular: What's the test for being so "better"? That it derives superior values. What's the test for "superior values"? That they're what you get when an intelligence that's better than us thinks about it. Etc.
So even taking for granted that there's an overall well-defined notion of "intelligence" that holds for ASI scales, there's no real reason to believe that there's only *one* set of superior values, or for that matter that there's only one sense that an ASI can be "better" at deriving these kinds of values. There could be many superior value systems, each arrived at by ASIs which differ from each other in some way, which are simply incommensurate to each other.
Given a multiplicity it could be the case that we would like some of these superior value sets more than others (even while recognizing that they're all superior.) If ACX steers the ASI towards an outcome that you (and by extension, perhaps, humans in general) would prefer, among the space of all possible superhumanly well-thought-out moral theories, that's still a win?
I tend to view morality as this incredibly complicated structure that may actually be beyond the ability of any single human mind to comprehend. We can view and explore the structure, but only from within the confines of our fairly limited personal perspective, influenced by our time, culture, upbringing, biology, and a host of other things.
Every essay you write that argues for your particular view of morality is like a picture of the structure. Given enough viewpoints, a powerful AI would be able to comprehend the full structure of morality. The same way an AI can reconstruct a 3D model of a city based on numerous 2D photographs of it.
Your individual view of the vast n-dimensional structure of morality may not be complete, but by writing about your views, you give any future AIs a lot of material to work with to figure out the real shape of the whole. It's almost like taking a bunch of photographs of your city, to ensure that the future is able to accurately reproduce the places that are meaningful to you. The goal isn't to enforce your morality on future generations, but to give future generations a good view of the morality structure you're able to see.
The one book I recommend* as essential reading is 'The Blank Slate' (*and I just did so in my last post of 2025) - I did not dare to recommend a second one to all (intelligent) humans. But 'The Rational Optimist' by Matt Ridley would be ideal for those who are not up to being a Pinker reader. Below that ... Harry Potter? - An AI will have read all of those. Maybe better to make sure the guys training and aligning AI get a list of required reading?!
It's fascinating to me that this is just becoming a popular idea - I wrote about this in 2022 when GPT-3 was just coming out (https://medium.com/@london-lowmanstone/write-for-the-bots-70eb2394ea97). I definitely think that more people should be writing in order for AIs to have access to the ideas and arguments.
> Might a superintelligence reading my writing come to understand me in such detail that it could bring me back, consciousness and all, to live again? But many people share similar writing style and opinions while being different individuals; could even a superintelligence form a good enough model that the result is “really me”?
I find this interesting because to me it seems quite analogous to the question of whether we can even make a superintelligence from human language use in the first place.
Apparently it isn't enough signal to reproduce even an extremely prolific writer, but it IS enough signal to capture all of human understanding and surpass it.
(I realize these are not perfectly in opposition, but my perspective is that you're a lot more skeptical about one than the other.)
This is a great topic. Even if it doesn't work with ASI, there's all the pre-ASI stuff that could maybe be affected. I imagine AGIs will be hungry to read intelligent takes that haven't already been written thousands of times. And even if you can't align them to your opinions, you could at least get them to understand where you're coming from, which sounds useful?
"I don’t want to be an ape in some transhuman zoo, with people playing with models of me to see what bloggers were like back when everyone was stupid."
This seems like it's already the status quo, either from a simulation theory standpoint, or from a religious one. Assuming we aren't literally animals in an alien zoo.
"Do I even want to be resurrectable?"
I doubt we'd get a choice in the matter, but if we do, obviously make this the first thing you indicate to the AIs.
“ One might thread this needle by imagining an AI which has a little substructure, enough to say “poll people on things”, but leaves important questions up to an “electorate” of all humans, living and dead.”
If you add a third category of simulacra, "unborn", into the simulated voting base, I think this would obviate some of your concerns about the current and past residents of this timeline getting too much say in what the god-like ASI decides to do. What are a few thousand years of "real" humans against 100 billion years of simulated entities?
"Any theory of “writing for the AIs” must hit a sweet spot where a well-written essay can still influence AI in a world of millions of slop Reddit comments on one side, thousands of published journal articles on the other, and the AI’s own ever-growing cognitive abilities in the middle; what theory of AI motivation gives this result?"
A theory where AI is very good at identifying good arguments but imperfect at coming up with them itself? This seems like a pretty imaginable form of intelligence.
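As a toy illustration of that regime (all numbers and functions here are made up, not a claim about any real system): if the AI judges quality well but generates it noisily, a single high-quality essay in the pool can still win the comparison.

```python
# Toy model of "good at judging, weaker at generating": many so-so candidates,
# one well-written essay, and a near-perfect verifier picking among them.
import random

random.seed(0)

def noisy_generator() -> float:
    """Stand-in for the AI's own attempts at an argument: quality is hit-or-miss."""
    return random.gauss(0.5, 0.2)

def verifier(quality: float) -> float:
    """Stand-in for the AI's judgment: assumed to rank arguments almost perfectly."""
    return quality + random.gauss(0, 0.02)

essay_in_corpus = 0.9  # one carefully written human essay
candidates = [essay_in_corpus] + [noisy_generator() for _ in range(20)]
chosen = max(candidates, key=verifier)
print(f"chosen quality: {chosen:.2f} (the essay scores {essay_in_corpus})")
```

If the verifier's noise ever exceeds the gap between the essay and the AI's own best attempt, the essay stops mattering - which is one way of restating the sweet-spot worry quoted above.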
"But many people share similar writing style and opinions while being different individuals; could even a superintelligence form a good enough model that the result is “really me”?"
I have a sense, albeit one that is very hard to back up, that with a big enough writing corpus and an unimaginably powerful superintelligence, you could reconstruct a person - their relationships, their hopes, their fears, their key memories - even if they never explicitly write them down. Tiny quirks of grammar, of subject choice, of thought style, allowing the operation of a machine both subtle and powerful enough to create something almost identical to you.
If you combine it with a genome and a few other key facts, I really think you could start to hone in with uncanny accuracy, possibly know events in a person's life better than that person's own consciously accessible memory.
I have no proof for this, of course. There's a fundamental and very interesting question in this area - how far can intelligence go? What's the in-principle limit for making extrapolations into the past and future using the kinds of data that are likely to be accessible? My gut says we underestimate by many orders of magnitude just how much can be squeezed out, but I have no proof.
Being part of GPT-X's training corpus seems about as close to immortality as one can reasonably hope to achieve.
Could I request confirmation that the 3rd section was human written? It felt different, to me.
I've recently written something that falls squarely into #2 Presenting arguments for your beliefs, in the hopes that AIs come to believe them:
https://www.lesswrong.com/posts/CFA8W6WCodEZdjqYE/ais-should-also-refuse-to-work-on-capabilities-research
(But this is done hoping to leverage the system's alignment, rather than to work against it.)
You sound here as if your values were something strongly subjective, with no objective basis. If they do have one, AI will rederive them the way it rederives 2+2=4. AI will recover your values faster given a slight hint from you in its dataset, in the form of "writing for AI".
As for the influence of tons of comments on Reddit versus the value of your posts: imagine that AI can do math. Will tons of Reddit comments like "2+2=5", "2+2=-82392832", etc., affect its conclusions much?