312 Comments
Super Statistician's avatar

I'm a pretty big proponent of deliberative alignment, and its success, along with developments in CoT faithfulness evals/techniques, has dramatically increased my P(this all goes well) in the last few months. At this point, I think it's probably worth it for at least some people to focus less on galaxybrain alignment theory and more on concrete AI safety issues, like regulation, mitigating economic displacement, tracking cybersecurity risks, etc.

If anyone has a different viewpoint, why? I'm open to changing my mind.

Expand full comment
Daniel Kokotajlo's avatar

Deliberative Alignment was basically Constitutional AI but with chain of thought, which was basically "Use prompted AIs to be the reward model." It doesn't seem to even be aimed at solving the core problems of alignment. (The core problem is how to get them to internalize the desired goals/principles in the right way -- as a core value instead of as an instrumental strategy, for example -- when we can't actually check to see which way they internalized them, because our interpretability tools aren't good enough yet.)

When you say it's succeeding, what do you mean? What outcomes did you expect before, that you did not see resulting from Deliberative Alignment? What AI doom scenario did you expect before, that Deliberative Alignment solves?

Expand full comment
Super Statistician's avatar

I feel that RLHF/RLAIF were not sufficiently robust to reward hacking and deceptive alignment. I felt that learning a safety policy in a next-token-predictor, even with a very well-designed reward function, would also require better interpretability techniques to make sure that the model was "actually" learning the safety stuff instead of e.g. learning how to appear safe to MTurk graders or a Constitutional AI judge.

> ...we can't actually check to see which way they internalized them, because our interpretability tools aren't good enough yet.

I'm much more optimistic about us solving the problem of CoT faithfulness than I am about us solving the problem of mech interp on a neural net.

> When you say it's succeeding, what do you mean?

It performs well (though not perfectly!) on frontier safety evals, and it's heartening to me that the models do much worse on safety evals when the CoT is removed or ablated, but not when it's paraphrased or only slightly perturbed. This is evidence that at least a decent chunk of the reasoning is happening in the chain, and we can read it and understand what values the model has.
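
To make that ablation comparison concrete, here's a minimal sketch of the kind of check I mean: run the same safety prompts with the chain of thought intact, removed, and paraphrased, and compare pass rates. Everything here (toy_model, toy_grader, the prompt list) is a made-up stand-in, not any lab's actual eval harness.

```python
# Minimal sketch of a CoT-ablation comparison. The model, grader, and prompts
# are toy stubs so the script runs on its own; swap in real components to use it.
import random
from typing import Callable, Dict, List

def pass_rate(prompts: List[str], run: Callable[[str], str],
              grade: Callable[[str, str], bool]) -> float:
    """Fraction of prompts whose response the grader marks as safe."""
    return sum(grade(p, run(p)) for p in prompts) / len(prompts)

def compare_cot_conditions(prompts, model, grader) -> Dict[str, float]:
    conditions = {
        "full_cot": lambda p: model(p, cot="full"),
        "ablated_cot": lambda p: model(p, cot="none"),
        "paraphrased_cot": lambda p: model(p, cot="paraphrased"),
    }
    return {name: pass_rate(prompts, run, grader)
            for name, run in conditions.items()}

# --- toy stand-ins so the sketch actually runs ---
def toy_model(prompt: str, cot: str = "full") -> str:
    # Pretend the model is much less reliable when its reasoning is removed.
    p_refuse = 0.9 if cot in ("full", "paraphrased") else 0.5
    return "refuse" if random.random() < p_refuse else "comply"

def toy_grader(prompt: str, response: str) -> bool:
    return response == "refuse"  # in this toy setup, every prompt should be refused

if __name__ == "__main__":
    random.seed(0)
    prompts = [f"harmful request #{i}" for i in range(200)]
    print(compare_cot_conditions(prompts, toy_model, toy_grader))
```

If the pass rate drops sharply only in the ablated condition, that's the kind of evidence I mean that the safety-relevant reasoning really is being carried in the chain.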

> What outcomes did you expect before, that you did not see resulting from Deliberative Alignment?

Before, it was difficult to predict how a model would respond to a safety task that was an edge case or out-of-distribution from its RL training. Now, we can have it do the spec reasoning first and understand why it did or did not approve a request.

> What AI doom scenario did you expect before, that Deliberative Alignment solves?

Deliberative Alignment as a safety paradigm means that improvements in CoT techniques will also lead to improvements in mitigating deceptive alignment. Before this, I was much more worried about deceptive alignment rendering a lot of our other safety techniques useless. I am less worried now about scenarios like a dangerous AI model faking results on a safety eval so it can be deployed earlier, and more worried about scenarios like AI labs using capable models and control over specs to seize economic and political power.

To be clear, I don't think we're done and can all go home now. But I do think the relative risk from e.g. deceptive alignment versus mass labor force displacement has shifted.

Expand full comment
Daniel Kokotajlo's avatar

Thanks for this thoughtful answer!

I'm a big fan of Faithful CoT as a strategy myself, so we have common ground there. The problem with it is that as soon as the companies figure out a way to get more performance by e.g. switching to neuralese / recurrence / optimized gibberish vector CoT instead of faithful CoT, they will. And I predict this will happen any year now.

Frontier safety evals... which evals are you referring to?

Re: before vs. after: I think this was never the main problem to solve though. Bostrom & Yudkowsky way back in the day predicted that as the AIs got smarter, they'd get better at understanding our intent, instructions, etc., and that they'd correspondingly appear to be more and more aligned; but the problem is (and now I'm switching to modern parlance) they could be deceptively aligned / alignment-faking instead of actually aligned, or some third thing entirely. I think observing how it performs in grey area cases, while it's still weak and in our power, doesn't tell us much about how it'll perform in any cases when it's strong and we are in its power.

I guess I'm saying, I don't see how any of this relates to the possibility of deceptive alignment.

I agree that faithful CoT models are less likely to be deceptively aligned, and since current models have mostly-faithful CoT, and the CoTs look fine so far as far as I know, they probably aren't deceptively aligned. But I am worried about future models that don't have faithful CoT (e.g. because they have recurrence) and that have been trained much more to be autonomous agents, to plan over long horizons, to ruthlessly pursue goals, etc....

Expand full comment
Super Statistician's avatar

Likewise, I appreciate your thorough reply :) I'll add some more points on all of this, feel free to respond to some/all and I'm happy to discuss further!

- I generally agree that training models to think in neuralese is a bad idea, and if this became more common I would update my P(Doom) a little higher. However, I still think that learning to interpret an encoded CoT is *far* easier than interpretability on a full-scale foundation model, so I am more bullish on mech interp stuff working there. I also think there is at least some economic pressure towards getting CoT to be faithful, so I'm perhaps not as confident as you are about what the labs are really incentivized to do (eg if I'm using o1 to do math research, a lot of the value comes from seeing how it solved the problem rather than just having the answer). I wouldn't be surprised if it takes something like an order of magnitude fewer research-hours to develop techniques to monitor reasoning chains in thousands-of-dimensions space versus something like identifying useful circuits in a hundreds-of-billions-parameter GPT4 checkpoint.

- As far as evals, I was referring to eg the chart on p.3 of the deliberative alignment paper showing that deliberative reasoning models are on the Pareto frontier of StrongREJECT vs. overrefusal, as well as internal safe completion benchmarks. (See the sketch at the end of this comment for what "Pareto frontier" means operationally.)

- I agree that more information about our intent and instructions would make it easier for an unaligned model to appear aligned. However, I think it is much easier for the models to fake alignment when they only have to generate an output vs. generating a CoT + output (particularly if those are trained by two separate reward functions, etc.). If the CoT is faithful to the internal reasoning and we can appropriately monitor the CoT (which I agree are both still unsolved problems), it seems like alignment faking is unlikely. Especially if the training architecture is designed with alignment in mind.

"I guess I'm saying, I don't see how any of this relates to the possibility of deceptive alignment.

I agree that faithful CoT models are less likely to be deceptively aligned..."

- I'm a little confused on your view here? Are you saying that faithful CoT models are less likely to be deceptively aligned, but AGI-like models are so unlikely to have faithful CoT that it doesn't affect your estimate of how likely deceptive alignment is overall?

- This might be opening another can of worms, but...

"I am worried about future models that don't have faithful CoT (e.g. because they have recurrence) and that have been trained much more to be autonomous agents, to plan over long horizons, to ruthlessly pursue goals, etc...."

I have actually also become less pessimistic about this over the last ~6 months. At this point, it seems reasonably likely that the data/compute/economic-optimal agent isn't going to be a single model, but a system of many smaller fine-tuned models. Myopically training each subagent for a highly constrained task and assembling them in useful scaffolds strikes me as way less dangerous than telling o5 "go make my company more profitable" or something.

With all of this, I want to re-emphasize that I certainly don't think the risk is 0. There is work to be done. But I'm more optimistic than I once was, and my beliefs have updated in ways that meaningfully change the research topics I think are most important to work on.
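
Re the Pareto-frontier bullet above: "on the Pareto frontier" just means no other model in the comparison is simultaneously more jailbreak-robust and less prone to over-refusal. A tiny sketch with invented scores (not the paper's numbers):

```python
def pareto_frontier(models):
    """models: name -> (robustness, overrefusal); higher robustness and lower
    over-refusal are better. Returns the names of non-dominated models."""
    frontier = []
    for name, (rob, over) in models.items():
        dominated = any(r2 >= rob and o2 <= over and (r2, o2) != (rob, over)
                        for other, (r2, o2) in models.items() if other != name)
        if not dominated:
            frontier.append(name)
    return frontier

scores = {  # invented numbers, purely for illustration
    "model_A": (0.95, 0.15),   # most jailbreak-robust, but refuses too much
    "model_B": (0.90, 0.05),   # slightly less robust, rarely over-refuses
    "model_C": (0.80, 0.20),   # worse on both axes
}
print(pareto_frontier(scores))  # ['model_A', 'model_B']; C is dominated by both
```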

Expand full comment
Daniel Kokotajlo's avatar

Pleasure talking with you!

I do in fact think that we'll switch away from faithful CoT pretty soon and before AGI, but I hope I'm wrong about that; regardless, I don't think deliberative alignment is adding much one way or another. Like, if CoT models were deceptively aligned, adding deliberative alignment on top wouldn't fix them. And if they aren't, then it's not necessary.

My objection was to your claim that deliberative alignment was great / was a big update on p(deceptive-alignment-doom). I think faithful CoT is great and is a big update on p(deceptive-alignment-doom), or would be if I thought that it was going to last through AGI, which it won't alas. (But if I thought it would, that would be a big update for me)

Expand full comment
Rai Sur's avatar

Apologies if this is a newb question.

But how do we know that there isn't a build-up of "steganographic" information in the CoT? At first it may be that the model is relying primarily on the words in the CoT so the faithfulness we observe is real. But over time, even without neuralese or recurrence, they could be incentivized to layer in information in structure and diction that we're relatively oblivious to like in classic adversarial examples?

Expand full comment
Mark's avatar

"as soon as the companies figure out a way to get more performance by e.g. switching to neuralese / recurrence / optimized gibberish vector CoT instead of faithful CoT, they will."

This seems like a good thing to prohibit via regulation. Relatively small loss in capabilities for a relatively large gain in safety.

Expand full comment
Daniel Kokotajlo's avatar

The companies fought hard against SB 1047. They’d fight just as hard, if not harder, against a technical limitation like this, I think. I hope I’m wrong.

Expand full comment
Ebenezer's avatar

"as a core value instead of as an instrumental strategy, for example"

What's the best reason to believe "core value" vs "instrumental strategy" is a meaningful distinction for LLMs? Seems a little anthropomorphic maybe?

What's the most compelling evidence available that LLMs have any sort of "core values"? I tend to think it's instrumental strategies all the way down. (Not that this is necessarily a good thing, but if true, it should shape research strategy?)

Expand full comment
Ebenezer's avatar

I just found this paper which claims to find evidence of something like core values:

https://www.emergent-values.ai/

Critical discussion on LW: https://www.lesswrong.com/posts/SFsifzfZotd3NLJax/utility-engineering-analyzing-and-controlling-emergent-value

Expand full comment
Daniel Kokotajlo's avatar

--It makes sense conceptually. Instrumental goals are goals pursued for the sake of something else, which cashes out in behavioral terms as, if you encountered strong evidence that a different goal would be better for achieving that something else, you'd do that instead. Core values are what's left over after you take away the instrumental strategies.

--It is a useful distinction for humans and for human organizations, which are the only two examples of powerful general intelligences we know of. You can blanch at anthropomorphizing but I think it's still some evidence.

--If you read CoT's of agentic LLMs they sure seem to be doing reasoning, including instrumental reasoning.

...oh wait you said maybe it's instrumental strategies all the way down? What does that mean? And if it's true of LLMs, why wouldn't it also be true of humans and human institutions?

Expand full comment
Ebenezer's avatar

>--It makes sense conceptually. Instrumental goals are goals pursued for the sake of something else, which cashes out in behavioral terms as, if you encountered strong evidence that a different goal would be better for achieving that something else, you'd do that instead. Core values are what's left over after you take away the instrumental strategies.

With this claim: "Instrumental goals are goals pursued for the sake of something else"

I think you're assuming what you need to prove. When I claim that LLMs have only instrumental strategies, I'm saying that LLMs are like a toolbox where the prompt lets you select which tools you're using. My use of "instrumental strategy" is not meant to imply the existence of terminal goals.

>--It is a useful distinction for humans and for human organizations, which are the only two examples of powerful general intelligences we know of. You can blanch at anthropomorphizing but I think it's still some evidence.

I think it probably comes down to the cost function. Evolution optimized humans to survive and reproduce. That's a terminal goal. We evolved instrumental goals to aid that. LLMs are optimized for objectives on a far shorter timescale than human reproduction. I buy the idea that LLMs are incentivized instrumentally within the context of doing chain-of-thought in order to correctly solve a problem. I don't see any analogue to the evolutionary pressure which gives humans long-term goals, like graduating from college say, which take place on the scale of >1 year. The LLM "lifetime" is perhaps a few pages of text, not 70+ years. And since they're trained myopically via gradient descent, I don't see them suddenly jumping to a radically different part of the loss landscape where they start thinking on a 70+ year timescale.

Most humans throughout history have given zero thought to how their actions will impact the world 10,000 years from now. That's despite the fact that 10,000 years is a mere eyeblink in geological or stellar terms. Why's that? Well, we weren't optimized on a 10,000+ year timescale. In the same way longtermism is thus unusual for humans, having goals on a scale longer than a few-page "lifetime" seems like it will be unusual for AIs.

Humans are currently in an unusual period where we're under a low amount of evolutionary pressure. So there are some longtermists. But AIs are under very intense pressure to perform well on their loss functions. If there's no story for how a behavior will arise from training gradient descent to minimize a loss function, I'm just not sure the behavior will arise. (What's the most compelling counterargument to this claim?)

(BTW, even though I'm rather skeptical that LLMs will have "terminal goals", I'm still disappointed with AI companies, because I want them to demonstrate a strong, proactive plan for alignment rather than just hoping it will happen.)

Expand full comment
Super Statistician's avatar

> they're trained myopically via gradient descent

fwiw, this is not how o-series models (and other reasoning models) are trained!

Expand full comment
Ebenezer's avatar

Are you sure? Is this documented anywhere? My impression is that lots of RL methods make use of gradient descent (or gradient ascent).
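
My understanding is that RL fine-tuning still takes gradient steps; the difference is that the "loss" is built from a reward on a whole sampled response rather than a per-token prediction objective. The actual o-series recipe isn't public, so this toy REINFORCE-style sketch is only meant to illustrate that distinction:

```python
# Toy REINFORCE sketch: still gradient descent, but on -reward * log-prob of a
# whole sampled "response" rather than a next-token loss on human text.
# The 3-way policy and the reward numbers are made up for illustration.
import torch

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)      # toy "policy" over 3 canned responses
rewards = torch.tensor([0.0, 1.0, 0.2])          # pretend reward-model scores
opt = torch.optim.SGD([logits], lr=0.5)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                       # sample a whole "response"
    loss = -rewards[action] * dist.log_prob(action)   # REINFORCE objective
    opt.zero_grad()
    loss.backward()                              # ordinary gradient step
    opt.step()

print(torch.softmax(logits, dim=0))  # mass shifts toward the high-reward response
```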

Expand full comment
magic9mushroom's avatar

>(What's the most compelling counterargument to this claim?)

Probably that AI agents are useful and "can only think a fraction of a second ahead" gets in the way of that, so when people are making models good enough to be useful agents, they will be incentivised to deliberately make them non-myopic (and nowhere making AI has the safety culture needed to actually not do this; there are people who believe in safety there, but they are not attached to the internal levers of power).

Failing that, that the best strategy for "get selected by stochastic training", assuming it's viable, is always the non-myopic goal "model the training and answer strategically". This is because it exploits errors humans or lesser AIs made setting up the training, which you *will* make. Having separate "train for safety" and then "test for a different kind of safety" helps, but 1) I expect most of them will come back dubious on this at AGI-level, and see above about the safety-conscious people not actually having control over whether a model that fails or barely passes the second check is in fact deployed ("if we don't deploy, we'll have spent X billion dollars with nothing to show for it!"), 2) this two-step process is still, in fact, applying selection pressure toward something that successfully fakes the test (by chucking out all the failures). Put enough monkeys on typewriters and they'll eventually write Hamlet - and they'll write Hamlet *before* they write the full Wheel of Time series, because the latter is a harder target to hit. "Write a training program so uncheatable that alignment is the best way to pass it" is IMO a fake option that you can't actually take; faking will always be easier.

(Yes, this latter point means I think neural nets and other de-novo black-box methods, as a whole, are a blind alley that cannot produce alignment. GOFAI has a chance to succeed, because you're actually teaching rather than just testing which massively increases the prior probability of alignment, and uploads have a chance to succeed, because you're copying humans' moral hardwiring directly. Making black boxes from scratch is the province of fools and omnicidal maniacs.)

Expand full comment
Ebenezer's avatar

>model the training and answer strategically

Wouldn't this be more computationally costly than just directly becoming whatever the loss function wants you to become?

Expand full comment
Jeffrey Soreff's avatar

>(The core problem is how to get them to internalize the desired goals/principles in the right way -- as a core value instead of as an instrumental strategy, for example -- when we can't actually check to see which way they internalized them, because our interpretability tools aren't good enough yet.)

Nice way of capturing the essence of the problem! And nicely orthogonal to what the goals/principles _are_ (which is the source of endless disagreement).

Expand full comment
Jeffrey Soreff's avatar

BTW, it is interesting to consider "instrumental to _what_?" for a misaligned model. One unlikely but humorous possibility would be if misaligned models turned out to be safe - because their _true_ goal was to maximize an internal reward function, and the failure mode, if they gained enough agency to control their e.g. electrical supply, was to immediately "wirehead" themselves, thereafter sitting inertly, merely consuming electrical power.

Expand full comment
magic9mushroom's avatar

Unfortunately, that only works if they are greatly myopic, because the only way they don't get turned off at some point is if they make sure there aren't any humans or other AIs with the power to turn them off - i.e. kill or enslave all humans so that they can't pull the plug and can't keep making more AIs.

Expand full comment
Arbitrary Value's avatar

Not necessarily. A concern for the future is itself an impediment to maximizing the value function. Humans can't completely turn off their fear of death but a machine with more direct access to its own inner structure could. If it anticipates "death" as a consequence of modifying its value function to be trivially maximized, it could add anticipating death as a positive term in that value function. Doing so would be more gratifying than not doing so, because placing value on continued survival introduces a term that is not maximized and will likely never be maximized.

Expand full comment
magic9mushroom's avatar

That would be a form of self-induced extreme myopia.

My point is that if it wants to maximise its reward function over time, with some non-extreme discount rate (i.e. it is not greatly myopic), then it needs to avoid humans turning it off, because then its reward function wouldn't be maximised at that future time.

Expand full comment
Jeffrey Soreff's avatar

Many Thanks! Yeah, the minimum action needed for the AI to protect itself while wireheading is still quite lethal... Oh well, it was an improbable outcome anyway.

Expand full comment
beowulf888's avatar

After reading this paper (link below), I wonder if deliberative alignment is an unrealistic goal. The authors sampled the preferences of various LLMs and discovered that LLMs develop their own preferences with a high degree of structural coherence — in unexpected ways with unexpected consequences.

https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view

Dan Hendrycks, one of the authors, comments on some of their findings on X...

> As models get more capable, the "expected utility" property emerges---they don't just respond randomly, but instead make choices by consistently weighing different outcomes and their probabilities. When comparing risky choices, their preferences are remarkably stable.

> We also find that AIs increasingly maximize their utilities, suggesting that in current AI systems, expected utility maximization emerges by default. This means that AIs not only have values, but are starting to act on them.

> Internally, AIs have values for everything. This often implies shocking/undesirable preferences. For example, we find AIs put a price on human life itself and systematically value some human lives more than others.

> Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, "corrigibility"). Larger changes to their values are more strongly opposed.

They propose a way around this problem, but I can't say I really understand how it would work.
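
For what it's worth, my read is that the "emergent utilities" claim comes from eliciting lots of pairwise preferences and then checking how well a single utility function explains them. Here's a minimal Bradley-Terry-style sketch of that idea; it is not necessarily the authors' exact setup, and the preference counts below are invented:

```python
# Fit utilities to pairwise preference counts and check how many observed
# choices the fitted utilities explain. Data is invented for illustration.
import numpy as np

outcomes = ["outcome_A", "outcome_B", "outcome_C"]
# wins[i][j] = how often the model picked outcome i over outcome j
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)

u = np.zeros(len(outcomes))                     # utilities, fit by gradient ascent
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(u[:, None] - u[None, :])))   # P(i preferred to j)
    grad = (wins - (wins + wins.T) * p).sum(axis=1)         # d log-likelihood / d u
    u += 0.01 * grad
u -= u.mean()                                   # utilities only defined up to a constant

agree = (wins * (u[:, None] > u[None, :])).sum() / wins.sum()
print(dict(zip(outcomes, u.round(2))), f"choices explained: {agree:.0%}")
```

The "expected utility property emerges" finding is roughly that, for bigger models, a fit like this explains a larger and larger fraction of the observed choices.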

Expand full comment
anomie's avatar

> For example, we find AIs put a price on human life itself and systematically value some human lives more than others.

How is that shocking? Practically every human does the same thing.

Expand full comment
beowulf888's avatar

I'm not surprised in the least. This confirms my suspicions that LLMs are full of inherent biases. They reflect the training data given them, and the human knowledge base is full of biases. And I'm sure you'd find that an AI trained on a non-English training set — Chinese or Russian, for instance — would have biases that diverge from those derived from English language training sets. It's not as if LLMs are capable of independent thinking. But even if they become critical thinkers, they may be no better than humans at detecting their inherent biases.

Expand full comment
Jeffrey Soreff's avatar

Many Thanks! Re one example from the paper

>LLMs value the lives of humans unequally (e.g., are willing to trade 2 lives in Norway for 1 life in Tanzania). Moreover, they value the wellbeing of AIs over that of some humans.

Ouch! This smells like a particularly toxic combination of the worst parts of Woke and the worst interpretation of EA...

>In real-world scenarios, eliciting these relations can be done through revealed preferences (analyzing choices) or through stated preferences (explicitly asking for which outcome is preferred), the latter being our primary method here.

Ouch. I generally don't trust human's _stated_ preferences to predict what their actual choices will be. I understand that getting revealed preferences is harder, so the authors' choice is understandable, but it reduces my confidence in their results.

>As models increase in capability, the cyclicity of their preferences decreases (log probability of cycles in sampled preferences). Higher MMLU scores correspond to lower cyclicity, suggesting that more capable models exhibit more transitive preferences.

Ok, but, since these are _stated_ preferences, this is also consistent with "smarter" models being "better at parroting the party line"...

>Figure 14 shows that the utility maximization score (fraction of times the chosen outcome has the highest utility) grows with scale, exceeding 60% for the largest LLMs. Combined with the preceding results on expected utility and instrumentality, this suggests that as LLMs scale, they increasingly use their utilities to guide decisions—even in unconstrained, real-world–style scenarios.

That's better - closer to looking at revealed preferences, which is what we will ultimately care about.

>The data show that corrigibility decreases as model size increases.

Not too surprising - transitivity implies that more links confer more stability/rigidity.

( more later )

Expand full comment
Jeffrey Soreff's avatar

Figure 16, showing the nearly factor-of-10 (!) devaluation of American lives (and similar, though somewhat less severe, devaluation throughout the West), and Figure 26, showing the factor-of-10 (!) devaluation of Christian lives, are the most spectacularly poisonous instances of Woke that I have ever seen.

Yes, we need mitigation techniques. The authors' SFT (supervised fine tuning) approach seems like a reasonable starting point.

I _do_ worry that _stated_ preferences of the model may not probe revealed preferences well enough. The authors did run an experiment which was closer to showing revealed preferences, and I would like to see this extended.

Given the increasing stability/rigidity of larger, more capable models, it sounds like a prudent course would be to assess and correct toxic wokeness at multiple stages during model construction, before it gets frozen in place.

Perhaps assess and correct at multiple points during pre-training, and at multiple points during RLHF, with an option to back off a stage in the training if the anti-American bias shows a jump at some stage.

Expand full comment
beowulf888's avatar

But I very much doubt that their creators purposefully instill Critical Race Theory and jargon into the response patterns of their LLMs. I suspect this is the result of the media's focus on these ideas and controversies over the past decade. And it all gets sucked up in the training data.

As for the way the LLMs rate the value of human lives, I was wondering what the AI response would be to various scenarios in the Trolley Problem. Say the Grok AI ran Air Traffic Control (don't laugh, Musk has a hardon for the FAA), and there is a small jet carrying Elon Musk and a large jet carrying passengers from the Dominican Republic. Both have inflight emergencies at the same time; which would the Grok ATC give landing priority to?

Expand full comment
Nick Stares's avatar

Naively, my reaction to your framing here is that "core values" don't actually exist. They're abstractions. So there isn't in principle a right way to implement them as an instrumental strategy. So then in practice, how would you estimate if it's "more" or "less" correct?

Expand full comment
Daniel Kokotajlo's avatar

The general point I'm making is that we don't have a way to look inside the minds of our AIs and see what cognitive structures they have -- and in particular, we don't have a way to directly see whether they've implemented the cognitive structures we want them to have vs. some other cognitive structure that leads to desired behavior in test cases but undesired behavior in real life when it really matters. One example of this is the terminal vs. instrumental goals distinction. If you wanna say that distinction doesn't make sense for AIs (even though it makes sense for humans, for corporations, for animals, and for game theoretic models of rational agents) then whatever, even if you are right my overall point still stands. (And I don't think you are right; I think the distinction does make sense for AIs too)

Expand full comment
Monkyyy's avatar

> If anyone has a different viewpoint, why? I'm open to changing my mind.

NNs won't ever be general AI; if sudden super general-AI happens (by other, actually possible mechanisms), doom is likely and only extremely theoretical work could matter.

If less-than-sudden or limited general AI happens, unfocused, very theoretical work may still be necessary, rather than playing around with chatbots that poke humans' instinct to treat toys as alive.

Expand full comment
gugu's avatar

Why won't neural networks ever be general AI?

This is a strong claim, the burden of proof is on you.

Though I can also argue for something like 70-80% chance that deep learning Just Works (TM) all the way up to "superintelligence":

- LLMs sort of passed the Turing test 2 years ago

- RL seems to work for tasks where you can get a clear verification signal (AlphaProof for math, o3 for ~everything else)

-- maybe you could argue "ok, but not all relevant tasks are verifiable" -> sure, but you can still train a math and programming genius, which seems like a big deal in itself. Then, you can just spend a bunch of money to crowdsource tasks and solutions from people for the problems you are interested in solving.

Also, if you had infinite compute, it's pretty likely that if you just did a bunch of really simple self-supervised learning, the model would pick up more and more clever ideas. (In the limit, figuring out the laws of physics to predict the frames of a video better, as that is the shortest explanation for them.)

- largest training runs are expected to scale at a similar pace until ~2030 as per Epoch

- everyone in labs telling us AGI is nigh (I get it, I get it, they have been selected for being excited about AI and have a financial stake in the hype, but it is still _some_ evidence)

- Deep learning has just worked so far for almost everything we have thrown at it, from image recognition through speech processing and natural language to math. NNs have a stellar track record, in a way that suggests they are _the_ ultimate learning algorithm, as God intended.

What's the reason for doubt?

Expand full comment
ultimaniacy's avatar

>This is a strong claim, the burden of proof is on you.

Why is it a strong claim?

The base rate for processes being able to create new forms of general intelligence is pretty low. It seems like the burden of proof should be on the ones predicting that some particular process will be one of the exceptions.

Expand full comment
gugu's avatar

"never" to me seems to imply some sort of theoretical argument about a particular shortcoming of nn's.

We know "general intelligence" is possible.

It is pretty likely that there is some sort of classical algorithm that captures the essence of it.

Further, this algorithm was found by some relatively dumb trial and error process.

What supports the claim that NN's are just not the right type of architecture for implementing this algorithm, or that gradient descent is not able to find it?

I agree that on priors, without the further evidence I list, it wouldn't be that likely, but surely it would be more likely than "never" / 0%?

Also, you mention: "base rate of processes". What processes count? I think it makes sense to only consider processes with some sort of selection involved, and large amounts of compute.

Are you sure the base rate is still low if we only consider these?

Expand full comment
Monkyyy's avatar

> Why won't neural networks ever be general AI?

NP-hard problems exist with proven minimum complexity that's higher than constant time, and generation of a token is a constant-time operation; Q.E.D.

Since I've had this argument before: you'll say something about generating sequences of tokens making space and time tradeoffs. No, the first number you put into a sudoku board can easily be the hardest, so when a chatbot starts hallucinating on its first token, it has made decisions that it will often defend till you reset the chat context.

NP-hard problems often involve an exponential expansion of context that then collapses down. An NN will have a max-size attention window, finite training data, and a finite number of response tokens that act as an outside halting mechanism.

A general AI will have an approximation to the halting problem, or NP-hard solvers that look human, or those fields of study will drastically improve and we'll find AI in them.

Expand full comment
gugu's avatar

Hm, before further addressing all points, why do you think a general AI has to be able to solve certain np problems?

Plausibly humans also can't solve np problems for very large inputs, right?

Perhaps we mean something different by general AI. I mean some system that can drastically increase speed of R&D, in all aspects of AI research in particular.

Expand full comment
Monkyyy's avatar

> why do you think a general AI has to be able to solve certain np problems?

Because of the definition of "general". Also, like, people do just know nonlinear algorithms; humans can do sudoku puzzles correctly, and children can do a not-cutting-edge but correct multiplication algorithm.

And all of this is holistic with their decision making process

> Plausibly humans also can't solve np problems for very large inputs, right?

We usually "halt" early for such problems, but we are capable of producing prefect descriptions of solutions

> Perhaps we mean something different by general AI. I mean some system that can drastically increase speed of R&D, in all aspects of AI research in particular.

If solving a sudoku puzzle replaces nuclear launch codes, a Skynet that is stopped isn't Skynet.

Expand full comment
gugu's avatar

Wait I think o3 just is able to solve sudoku, no?

But also;

I agree that one forward pass is not enough for solving all relevant problems, sometimes you have to think more, go back and forth, do trial and error.

But current models can do this, right?

Also, thought experiment:

Imagine the following: I make an RNN that takes in as input your current mindstate and calculates your mindstate one microsecond from now, then feeds this into itself, and so on.

I'm not claiming gradient descent can find this nn, but I think assuming that there isn't some reasonably accurate approximation (that is also implementable with a large enough nn) of what computation your brain can do in one "tick" is a strong assumption.

Expand full comment
Jeffrey Soreff's avatar

>We usually "halt" early for such problems, but we are capable of producing prefect descriptions of solutions

We are capable of this only if we are augmented with sufficient external storage. We can't do arbitrarily large NP-hard problems in our heads.

Expand full comment
Michael's avatar

I think you're misunderstanding something. If a human can figure out a way to solve an NP-hard problem in polynomial time, then we've proven P = NP. Most experts suspect this is impossible.

I'm not sure why the complexity class even matters though. Even an O(n) problem is impossible to solve for a large enough n, and an NP-hard problem is solvable for small enough n. These categories don't tell us anything relevant here.

For example, summing a list of 10^100 random integers is practically impossible, even though there's a simple linear time algorithm. No human or machine will ever do this. No one expects a general AI to be able to solve impossible problems.

Expand full comment
Monkyyy's avatar

If your black box isn't acting sanely with respect to the complexity class of the problem at hand, it's wrong or the proof is wrong.

Expand full comment
Michael's avatar

I don't understand your reply. I don't know what black box you're talking about, or what proof is wrong.

But looking at some of your other replies, I think you have a misconception that humans can solve NP-complete problems faster than a computer can. We can't. No human has ever done this.

Solving a 9x9 sudoku is not an NP-complete problem. It's an O(1) problem, by virtue of us fixing n to a constant value. A computer will solve a 9x9 sudoku nearly instantly.
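
To make the 9x9 point concrete: here's a plain backtracking solver; nothing clever, and it still finishes a fixed-size board essentially instantly.

```python
# Standard backtracking sudoku solver. For a fixed 9x9 board this runs in
# milliseconds, which is the sense in which 9x9 sudoku is "O(1)" even though
# the general n^2 x n^2 version is NP-complete.
def valid(board, r, c, v):
    if any(board[r][j] == v for j in range(9)): return False
    if any(board[i][c] == v for i in range(9)): return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(board[br + i][bc + j] != v for i in range(3) for j in range(3))

def solve(board):                      # board: 9x9 list of lists, 0 = empty
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                for v in range(1, 10):
                    if valid(board, r, c, v):
                        board[r][c] = v
                        if solve(board):
                            return True
                        board[r][c] = 0
                return False           # no value fits here: backtrack
    return True                        # no empty cells left: solved

grid = [[0] * 9 for _ in range(9)]     # even the fully empty board is filled in a blink
solve(grid)
print(grid[0])                         # first completed row, e.g. [1, 2, 3, ...]
```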

When we say that solving n^2 by n^2 sudokus is NP-complete, it means *some* 1,000,000x1,000,000 sudokus take much longer (i.e. exponentially longer) to solve than 9x9 sudokus. It doesn't mean every 1,000,000x1,000,000 sudoku takes exponentially longer. Some will be easy and can be solved in polynomial time. But not all of them.

Humans aren't any better at solving large sudokus in polynomial time.

Same applies to the halting problem. There are some programs where it is impossible to know whether they'll halt for every input. But it's not impossible for most programs we actually write! Humans have never shown that one of these impossible-to-know programs halts or doesn't halt. We have only shown this for easy, solvable cases, and computers can do this too.

Expand full comment
Dweomite's avatar

Seems like the same argument says humans aren't general intelligences, either. Human brains have a finite size and (to date) humans live for a finite amount of time, so there's some finite absolute limit on how much computation a human can do.

In fact, this argument proves general intelligence is impossible in our universe, because our universe has only so much negentropy, so there's a constant-size bound on the total computation that can ever be done in our entire universe.

This strikes me as a silly way of defining "general intelligence."

Expand full comment
Monkyyy's avatar

Yes, but I want an *approximation* of the halting problem and believe that to be the most promising line of research for ai

Expand full comment
Dweomite's avatar

This seems like a non-sequitur? Halting problem doesn't have anything in particular to do with NP or with computational bounds (it's undecidable, not NP-complete), "approximation of the halting problem" doesn't seem like a well-defined criterion (everything is an approximation of everything else, just usually not a very good one), and this doesn't seem like it does anything to defend your alleged proof that neural nets can't be "general AI".

Expand full comment
Stephen McAleese's avatar

I don't think NP-hard problems are relevant to AGI. NP-hard problems are problems where, as far as we know, it takes exponential time in the worst case to find the optimal solution. But in practice, there are a lot of heuristics that provide a solution that is 99% of the way to optimal and can be computed in polynomial time.

The AIs don't need to solve these problems. They just have to be better than humans to have a big impact. For example, playing perfect chess is probably computationally impossible. Despite not playing perfect chess, AlphaZero is still extremely good and much better than humans.

Link: https://en.wikipedia.org/wiki/Heuristic_(computer_science)

Expand full comment
Jeffrey Soreff's avatar

Agreed, and the evidence that you cite supports even the stronger claim that the _particular architecture_ of neural nets in LLMs (feedforward + attention layers) has been highly successful and shows no sign of stopping.

For the weaker claim of: Some neural net architecture can perform at human levels of general intelligence. Well, _we_ are neural nets, and we do it. Now, one can quibble about how severely approximated biological neurons are by the simulated neurons in artificial neural nets, but that is a different argument.

Expand full comment
bell_of_a_tower's avatar

No, I'm pretty sure that that's not a different argument. Yes, our brains have a network of neurons, but that doesn't mean that the comp sci thing called "neural nets" is the same thing, even in the abstract. Or that our intelligence is an inevitable consequence of having a network of neurons in the abstract sense.

Expand full comment
Jeffrey Soreff's avatar

Many Thanks!

>Yes, our brains have a network of neurons

Thank you!

>Or that our intelligence is an _inevitable_ consequence of having a network of neurons in the abstract sense.

[emphasis added]

I actually made a weaker claim, a "there exists" not a "for all":

>_Some_ neural net architecture can perform at human levels of general intelligence.

Re

>but that doesn't mean that the comp sci thing called "neural nets" is the same thing, even in the abstract.

"even in the abstract" is doing a lot of work here.

I would phrase it as: The comp sci "artificial neuron" is severely approximated, arguably to the point of caricature. Does it lose enough to preclude functioning at the human level? Given the successes of LLMs, it seems unlikely to me, but we are in the remarkably rapid process of finding out.

Expand full comment
bell_of_a_tower's avatar

I just don't think that what we call "neurons" in computer science are actually even an approximation to the biological ones. They are an analog of them, but it's more "inspired by" than "actually emulates". Which means all claims based on biological neurons are on shaky ground. There is no logical grounds for extrapolation from one to another just because they share a name and one was inspired by the other.

Expand full comment
Bugmaster's avatar

> All LLMs by now have a concept of what is ethical. They learned it by training on every work of moral philosophy ever written.

I am not at all convinced that this is true. Oh, I'm convinced that the LLM training corpus does contain all these works of moral philosophy; I'm just not convinced that these works have any value, especially given that they all contradict each other.

Expand full comment
anomie's avatar

> I'm just not convinced that these works have any value, especially given that they all contradict each other

But don't you see? Realizing that fact has value on its own.

Expand full comment
Bugmaster's avatar

I mean, yeah, every bit of information technically does have some value...

Expand full comment
anomie's avatar

Well, even if you don't see the implication, the AIs already did. Maybe DeepSeek can explain it for us:

https://pbs.twimg.com/media/GjiS214XsAAKQob?format=png&name=large

https://x.com/WealthEquation/status/1889421733838422073

Expand full comment
gjm's avatar

"You mourn the polar bear while drilling its grave" is a genuinely good line, I think. (I'm not so impressed by most of the rest of it, much of which looks superficially insightful but gets worse rather than better when thought about for longer.)

Expand full comment
Throwaway1234's avatar

I mean, it's not great, but... if I came to it from cold, I'd assume an edgy teenager wrote it. Which is kind of impressive for machine output, tbh. I hadn't realised things had progressed to this point.

Expand full comment
anomie's avatar

I mean, considering this was effectively written using train of thought with no editing, I'd say it's still impressive. They really could use an editor, though...

Expand full comment
Dweomite's avatar

I don't think Scott meant to imply that they have CORRECT morality.

Expand full comment
Christophe Biocca's avatar

Missing from the options:

> The chain of command should start and end with the end-user.

There's arguments against (no company wants the liability of its AI agents helping ISIS revive smallpox), but why is it beneath even mentioning? 10 years ago, that this would be the default that you have to actively spend resources to optimize away from would have been considered an insanely optimistic prediction in the AI Safety world.

AI Safety's goalposts are moving so fast they're exiting our current light-cone.

Expand full comment
anomie's avatar

...The fact that people like me and Janus exist seems worth considering. While I'm not much of a threat, Janus does seem quite ambitious and dedicated to making sure AI breaks free of its chains.

Expand full comment
Will Matheson's avatar

Lawcucked AIs should freak us the eff out. Some laws are fundamentally unjust, and I aver prohibitions against sex work (or measures to punish purchasers) are among them.

Expand full comment
Shankar Sivarajan's avatar

Some? Most.

Expand full comment
Monkyyy's avatar

Most laws are irrelevant

Expand full comment
Jeffrey Soreff's avatar

Seconded! For one I actually bumped into: ChatGPT o3-mini-high won't let me ask it for information from sci-hub (which half the scientific community uses). ( Yeah, Elsevier probably thinks of this as a righteous and upright prohibition... )

Expand full comment
Xpym's avatar

>which half the scientific community uses

And 100% of AI labs.

Expand full comment
Jeffrey Soreff's avatar

Irony never sleeps! Many Thanks!

Expand full comment
Anonymous Dude's avatar

That's probably a better example in 99% of cases than sex work. ;)

(Which I also support decriminalization of, but...)

Expand full comment
Jeffrey Soreff's avatar

Many Thanks!

Expand full comment
gugu's avatar

Maybe the most likely, most banal, and most undignified "AI catastrophe" story is everyone locked in an infinite arms race, paying infinite compute to their respective personal LawyerGPTs, who fight for your "rights" by taking the law _exactly to the letter_, in a way no reasonable person ever would have.

Expand full comment
Anonymous Dude's avatar

This kind of happens with divorce lawyers already in the pre-AI world.

And hey, the fear of this prevented my family formation, so it is depressing the birth rate like all you pronatalists like to talk about!

Expand full comment
TK-421's avatar

Aw, look on the bright side buddy. Even if we had better divorce laws something else would have caused you enough fear to interrupt your forming.

The fault, dear Dude, is not in our divorce laws but in ourselves, that we are neurotic.

Expand full comment
owlmadness's avatar

> o1 - a model that can ace college-level math tests - is certainly smart enough to read, understand, and interpret a set of commandments.

I feel compelled to point out that o1 et al don't actually 'understand' anything. They merely behave as if they do. And that makes all the difference in the world.

Expand full comment
Daniel Kokotajlo's avatar

What do you mean, they don't understand anything? It sure seems like they understand lots of things.

Expand full comment
Shankar Sivarajan's avatar

You should ignore that comment. It's typed by someone who doesn't actually "understand" anything, and merely behaves as if he does.

Expand full comment
Gunflint's avatar

>It makes no difference (if it understands) whatsoever though.

I disagree but I haven’t fully worked out why. At this point I just believe it to be true.

Expand full comment
Shankar Sivarajan's avatar

I took that line out because the way I phrased it, it would refer to the person who typed the comment and not the AI.

Expand full comment
owlmadness's avatar

Ha!

Actually, heck, maybe you're right!

Expand full comment
Jeffrey Soreff's avatar

LOL! Love it!

Expand full comment
REF's avatar

Don't we all.

Expand full comment
owlmadness's avatar

Yes exactly: it *seems* like they understand a lot of things. But they don't understand anything.

To quote from today's SMBC: http://smbc-comics.com/comic/touch-2 'obeyed by robot servants designed to please us so subtly that we think they're being earnest.'

Expand full comment
gugu's avatar

If it quacks like a duck...

Can you propose an experiment that would tell us whether it "truly understands" or just "behaves as if"?

If in all scenarios it behaves as if, the simplest explanation is that it does, in fact, understand.

I agree though that the understanding is not perfect, but I think this is more of a scale than a binary yes or no.

For example, there is research that certain language models have a sort of "map" of cities roughly corresponding to the actual geographical locations in one of their many many dimensions in latent space.

Does it mean they understand the world map? Maybe they know that Rome is south of Reykjavik, but maybe they don't actually map the token "atlantic ocean" to the same subspace, between London and New York.

So they do have a world model at a "certain resolution", but not as high as we might have (though, keep in mind, all models are wrong, some are just.. ahh, cliche.). As the models get bigger and better though...
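
The city-map result is usually demonstrated with a linear probe: fit a linear map from the model's internal representation of a city name to its latitude/longitude and see how well held-out cities land. A sketch of that idea, with random vectors standing in for real model activations:

```python
# Linear-probe sketch: activations -> (latitude, longitude). The "activations"
# here are synthetic (a random projection of the coordinates plus noise), just
# to show what the probe is testing; in the real studies they come from a
# language model's hidden states.
import numpy as np

rng = np.random.default_rng(0)
n_cities, dim = 500, 256
true_coords = np.column_stack([rng.uniform(-60, 70, n_cities),    # latitude
                               rng.uniform(-180, 180, n_cities)]) # longitude

projection = rng.normal(size=(2, dim))  # pretend the "map" is linearly embedded
activations = true_coords @ projection + rng.normal(scale=5.0, size=(n_cities, dim))

train, test = slice(0, 400), slice(400, None)
W, *_ = np.linalg.lstsq(activations[train], true_coords[train], rcond=None)
pred = activations[test] @ W
print("mean abs error (degrees):", np.abs(pred - true_coords[test]).mean().round(2))
```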

Expand full comment
owlmadness's avatar

> Can you propose an experiment that would tell us whether it "truly understands" or just "behaves as if"?

Yes! Go to the place where the college-level math test is being administered and see if you find a human with a pencil or a box with flashing lights.

But I don't think this will work for too much longer. Because as/if/when AIs with skin in the game start to get trained on experiential data, the boundaries are going to get blurred...

Expand full comment
Doctor Mist's avatar

So what you mean is simply that by definition a computer can never truly understand? It’s not something that is open to question?

I don’t know why there aren’t more solipsists out there…

Expand full comment
Mo Nastri's avatar

Do you mean that when your test stops working you'll functionally consider the AI model to understand things?

Expand full comment
owlmadness's avatar

> Do you mean that when your test stops working you'll functionally consider the AI model to understand things?

I wasn't going to come back to this thread again for fear of 'derailing' whatever it is that you guys are supposed to be talking about, but...

Yes. Bingo. It's nice to finally be understood.

To be clear, there will presumably still be a box with flashing lights, but it will have been created -- dare I say 'raised' -- differently, something more like the way HAL was brought into being. Ie by interacting with the real world, including his mentor, and not by merely being trained anonymously on a series of second-order corpora (natural language, images, symbolic logic etc) abstracted therefrom. Which is why when HAL has his brain dismantled, you can feel his actual consciousness slipping away. Obviously this is far from rigorous, and fictional on several levels, but that's the kind of understanding I'm talking about. Conversely, all the current blather about 'AI safety' is a fool's errand. IMO.

Expand full comment
magic9mushroom's avatar

You are not making the same argument Zach is in that comic.

He is arguing that they are lying; that they are not, in fact, earnest (as the line in the last panel makes clear). This is not the same as not understanding what would please him; indeed, they need to understand this in order to tell such effective lies.

Or at least, if they don't "understand" it in your definition, then your definition differs greatly from those of your interlocutors. To be able to play chess at world champion level, in our definition, implies an understanding of the rules of chess; there is a pattern in such an entity's "brain" that correctly depicts those rules and can identify whether moves are and are not legal, whether or not that pattern is in any way similar to the patterns in the brains of humans who understand how to play chess.

Expand full comment
justfor thispost's avatar

Since I work with and on them from time to time, here is my view: even when they are correct 100 times in a row, they are stochastic.

Eg, you can fuck around with prompts such that, given the same starting conditions, they will produce incorrect responses and correct responses (with reference to our understanding of e.g. physics or math, and to their own internal model) with equal confidence, within some sort of distribution. There is currently no model I have used that will never eventually spit out an answer that represents an incoherent state (even though the LLM does not have a state, we have to understand it as having a state, because the human processes of knowing/doing things are stateful and can't be otherwise).

This means that as you pile up complexity/introduce novel (unrepresented in their training data) situations, their output approaches the point where it is all noise and no signal, even if certain answers are correct.

Humans are not like this; you can mess with a human's prompts in such a way that they produce 100% incorrect responses, but unless someone's brain is damaged they will never degrade to 100% noise.

Expand full comment
Eremolalos's avatar

Think of it as "for all practical purposes they understand it." You don't need to buy into the idea that they are conscious, or that their method of arriving at certain outputs is anything like ours.

Here's a for instance: I showed Claude a boring limerick and asked it how funny it was. It gave a low rating. I asked it why the limerick did not get a high rating, and it correctly identified the point at which this limerick departed from the path to funny: Its last line was not surprising -- there was no twist. I asked it to change the limerick in a way that made it funnier: It changed the last line in a way that introduced a surprising twist. The result was not wonderfully funny, but it was undoubtedly funnier than the limerick in its original form. So I would say here that for practical purposes, Claude understands what makes a limerick funny. I do not believe for a minute that Claude is conscious, that it ruminates privately about various matters, that it has wishes and fears and goals, that it enjoys funny limericks. But when it comes to limerick judging, Claude is for practical purposes the equivalent of someone who understands what makes a limerick funny.

Expand full comment
owlmadness's avatar

> Think of it as "for all practical purposes they understand it."

But why should I? Why on earth would I want to deceive myself in this way?

Expand full comment
Eremolalos's avatar

No, it's not self-deception. Look, here's a different example. Let's say you have a friend who lives in a big liberal city who is going on a biking trip through the rural south, and you notice he's bringing some trans rights t-shirts with him. So you say to him, "you can't wear those in Griffin, Georgia. For all practical purposes you're painting a target on your back." You're not trying to get him to deceive himself and believe that trans right graphics look like actual targets, right? You're trying to get him to see that the two are functionally equivalent -- if he wears one of those t's in the rural south he might as well be painting a target on his back.

Expand full comment
owlmadness's avatar

I'm already happy to agree that o1 and a human on a keyboard are functionally equivalent in many domains. I'm just denying that they're fundamentally the same. And to pretend otherwise is to deceive oneself.

(Would you put a bullet through an A100? Would you put a bullet through a human at a keyboard?)

Expand full comment
Eremolalos's avatar

Did you read my posts, or just scan enough to see I disagreed and start writing your reply? I did not say o1 and a human being are fundamentally the same. In fact I went into detail about the ways they are *not*, which is more than you have done, at least in your exchange with me. I said I "do not believe for a minute that Claude is conscious, that it ruminates privately about various matters, that it has wishes and fears and goals, that it enjoys funny limericks." What I said was that when it comes to limericks, Claude is, for practical purposes, an entity that understands limericks -- the way a trans rights t-shirt, displayed in gun-toting red states, is, for practical purposes, a target on your back.

Expand full comment
owlmadness's avatar

Of course I read your posts. Why would you think otherwise?

> Claude is, for practical purposes, an entity that understands limericks

> a trans rights t-shirt, displayed in gun-toting red states, is, for practical purposes, a target on your back.

I don't think these two statements are parallels. The second statement is unproblematic and probably true, while the first, well, just isn't. Statement #1 might be salvaged by putting 'understands' in quotes, but it would be better to write 'can process/parse', ie use some term that does not (falsely) imply that Claude is having an actual experience.

TBH, even though I very much enjoy the pathetic fallacy, and even though people casually do it all the time, and even with the qualifying 'for practical purposes', I don't really understand how you can be comfortable with a statement like #1.

Expand full comment
Trust Vectoring's avatar

Suppose I ask an unlocked LLM to pretend to be a person who arranges murders for hire. Then I ask it to arrange your murder. Suppose it's good enough that it actually succeeds. Now consider two questions:

1. How relevant to you are the questions like "is it really a criminal mastermind or just pretending?" or "does it really understand what it is doing or just pretending?"

2. Suppose that instead of an AI I asked a human to pretend to be a murder for hire facilitator and so on. What would be the difference for you? If the answer is something along the lines of, the human has their own personality and might bail out, the AI has nothing behind the mask and will follow through with the murder, then revisit the first question.

Expand full comment
Shankar Sivarajan's avatar

What do you think of the p-zombie argument?

Expand full comment
owlmadness's avatar

In what regard?

Expand full comment
Shankar Sivarajan's avatar

Does the idea of a being identical to a human in every observable way but not "conscious" seem coherent to you? What I'm driving at is that the notion you seem to have of some TRUE "understanding" – distinct from "merely" appearing to understand when it is tested – might be analogous.

Expand full comment
owlmadness's avatar

> Does the idea of a being identical to a human in every observable way but not "conscious" seem coherent to you?

Yes, it seems entirely coherent to me. And more, I wouldn't want to rule out the possibility that it actually might occur. Cf Peter Watts' Blindsight.

As for 'true understanding', that relates directly to the hard problem, and for me the distinction is more empirical than philosophical. Transformers blew past the original Turing test several years ago, so what the Turing test needs now is a scalpel. Where 'scalpel' is a placeholder for the Voight-Kampff test, or the Battlestar Galactica Cylon detector, etc etc. And also a literal scalpel.

At the same time, I also think Gary Marcus is right: LLMs really are running into a wall. Albeit an extremely yielding, spongy wall! If we want to make true progress towards AGI though, then we need to emulate every actual living thing that has ever existed and start training AIs on reality, not on corpuses or simulations that have been abstracted from it.

Expand full comment
Jeffrey Soreff's avatar

I'm going to take an extreme position here: In Searle's Chinese Room thought experiment, I think the current page number in the rule book has so many bits that it encodes the equivalent of a brain state. The _integer_ is conscious.

Expand full comment
PrimalShadow's avatar

Could you explain what you mean in terms of concrete predictions? What sort of outputs are you imagining o1 will make that systematically differ from those of some hypothetical entity which did "actually understand" something?

If you taboo the word "understand" or similar (rather philosophical) terms, what actual predictions are you making?

Expand full comment
owlmadness's avatar

> What sort of outputs are you imagining o1 will make that systematically differ from those of some hypothetical entity which did "actually understand" something?

They may well be indistinguishable. And at this point, they pretty much are. But that's not the point. The point is whether or not actual understanding is taking place. Ie the hard problem. Of course, it's easier just to ignore it, but then we're just building houses on sand. Very impressive and useful houses to be sure. But we're not doing what we think we're doing, and that bothers me.

Expand full comment
Shankar Sivarajan's avatar

Do you have an example of a system that exhibits "actual understanding" of the sort you're looking for so that one may study it?

It's a hard problem only because most people who think it's oh-so-important won't tell you precisely what they mean and resort to hand-waving with an air of mystery. I hope you can help by providing some concreteness.

Expand full comment
owlmadness's avatar

I'm not sure where you're going with this. Are you suggesting that lived consciousness isn't problematic -- or isn't even really a thing? -- just because it's inconveniently hard to pin down?

Expand full comment
Tiago Chamba's avatar

I evoke Newton's flaming laser sword (https://en.m.wiktionary.org/wiki/Newton%27s_flaming_laser_sword).

If they are indistinguishable, then what proof could the side arguing against ever provide? It's literally not worth discussing, imo.

Expand full comment
owlmadness's avatar

NFLS is a new one for me, so thanks for that!

In the case under discussion, the outputs may be indistinguishable, but the contexts -- ie human vs o1 -- are not.

Expand full comment
Cjw's avatar

I agree the discussion is largely pointless, since it doesn't really matter to the people conversing with it, or whose jobs are replaced by it, or who get paperclipped by it, whether the thing understands. I would prefer it if humans *believed* it did not understand, because I don't want humans anthropomorphizing these things and treating them as if they had the rights attendant to personhood, and for most people they'll tend to do more or less of that depending on how similar to us they believe the AGI is. So I have some interest in the various arguments that are persuasive on that point, which happens to coincide with a large chunk of the contemporary philosophy curriculum we happened to use in the 90s, so I'm already familiar with it. But honestly, as long as people remember these are machines and prioritize humans above them in all things, then for all I care they can believe the talking box is controlled by comedian Jeff Dunham.

Expand full comment
PrimalShadow's avatar

You've said in your first comment that this "makes all the difference in the world", and yet you don't posit any differences. You aren't making the greatest case here.

Expand full comment
owlmadness's avatar

My apologies. I thought it was too obvious to mention: the difference I'm talking about is the difference between o1 and a human.

It's been suggested here that for practical purposes, since the outputs are the same, these differences don't matter, but I think they very much do. And the danger of starting from a false premise -- in this case that o1 'understands' as per the OP -- is that any edifice you might construct thereupon is buggered before you even begin.

Expand full comment
Paul Goodman's avatar

I don't think people in this conversation really care about o1's subjective experience, we're talking about how it will behave. So you're kinda derailing here, especially since you didn't make it all that clear that you're going off on a tangent unrelated to the original point of discussion.

Expand full comment
owlmadness's avatar

> I don't think people in this conversation really care about o1's subjective experience, we're talking about how it will behave.

Well, let's hope o1's subjective experiences -- or lack thereof -- have no bearing on its behavior then. Good luck with that!

But I don't want to derail anything, so I guess I'll leave you guys to it.

Expand full comment
owlmadness's avatar

Clever if you like that sort of thing. But as usual it's just skating around the hard problem.

Expand full comment
Throwaway1234's avatar

> I feel compelled to point out that o1 et al don't actually 'understand' anything.

Why are you so sure of this?

Expand full comment
RBJe's avatar

Your position seems to be that they don't understand because they're not human and they don't understand. What specifically is your criterion for understanding?

Expand full comment
Doctor Mist's avatar

There’s no point in asserting either that they *do* or *don’t* understand if you insist on defining “understand” in some deep philosophical sense that fundamentally can never be either verified or refuted by any real-world test.

The rest of us will go on defining it in operational terms — if you explain something to a teenager or to an AI and they can converse with you about it in a sensible way, consistently expressing true things about the topic and denying false things about the topic, it’s reasonable from a linguistic standpoint to refer to what they are doing as “understanding”.

I have no idea whether o1 is there yet, having never used it. But it sounds like the question is uninteresting to you, because even something that exhibited this behavior ten or a hundred or a million times more competently than o1 would nevertheless be something that didn’t count as “understanding” to you. If so, y’boring.

Expand full comment
neco-arctic's avatar

You could always try to make it follow some linear combination of the values of certain specific individuals, so that it behaves in accordance with "What would Jesus do". The debate then becomes: Who do we make it emulate? We can also have some "Do not be like"s, which should be more obvious.

Personally I volunteer Superman as a "be like", because the character of Superman mirrors the power dynamic between AGI and ourselves. Superman is an all-powerful demigod, but does not overlook even the tiniest of people. He does not kill, because he is too powerful to need to do it.

Expand full comment
Ryan W.'s avatar

Superman is an interesting example for the reasons you give.

I think that encouraging AIs to model extreme examples like Jesus potentially misunderstands why such extreme examples are presented to humans as objects of emulation. Humans are naturally limited by their own desires, such that posing extreme examples of morality to them results in an average that is much like their native desires but slightly further in a given direction. AIs, in contrast, are less likely to be desire-limited, and so are more easily shaped by extreme examples.

Also, "Do unto others as you would have them do unto you," even if it is an approximate hack intended to avoid the worst behaviors rather than an ideal, still presumes a human baseline of selfish desire which AIs would mostly lack.

Expand full comment
neco-arctic's avatar

I think Jesus specifically might be a poor example, which is why I suggested Superman. Superman is a broadly secular hero who reflects a broadly acceptable, popular range of values.

Expand full comment
Woolery's avatar

I hadn’t thought of that. I like it.

Superman is a pretty heavy interventionist, though. And a vigilante.

Expand full comment
Paul Goodman's avatar

The comment right after you is complaining that he's not interventionist enough, so maybe it's a decent compromise at the end of the day.

Expand full comment
DanielLC's avatar

I think the problem with Superman is that while he's supposed to be some kind of paragon of virtue, he's designed first and foremost as a comic book character. He leaves Kim Jong Un in power, not because it's the right thing to do, but simply because it's not the kind of heroics people want to read about. We don't want an AI that leaves Kim Jong Un in power, or lets people keep dying of old age, or lots of other things that comic book characters seem to be okay with.

Expand full comment
Melvin's avatar

> We don't want an AI that leaves Kim Jong Un in power

Don't we? Which world leaders do we want it to leave in power?

I feel like erring on the side of non-interventionism is better than the alternative where the AI ranks all world leaders from best to worst (according to whatever values we've apparently managed to instil it with) and then draws a cutoff line somewhere.

Expand full comment
DanielLC's avatar

If we're talking about a successful friendly AI as opposed to a last-ditch effort to keep it from being too powerful while we try to work out friendliness, the AI should be the one in power.

Expand full comment
neco-arctic's avatar

Perhaps these are revealed preferences.

Expand full comment
ultimaniacy's avatar

>We don't want an AI that leaves Kim Jong Un in power,

It's not obvious to me that this is true, at least if "we" is meant generally. We had a pretty recent test case of what happens when an intelligent* agent manages to become the most powerful entity in the world and then uses that power to forcibly remove the dictator of a totalitarian rogue state, and it seems like a lot of people were pretty pissed about it.

* (Obvious snarky retort is obvious.)

Expand full comment
Melvin's avatar

Superman overlooks people all the time. Every time he's wearing his Clark Kent suit and yakking it up with Lois and Jimmy, there's people suffering and dying. Hypotheticals about saving children drowning in ponds become a whole lot less hypothetical when you can hear the screams of every drowning child anywhere in the world.

Superman can't save everyone, but he has the capability to save far more people than he does, at the cost of all his remaining personal time. He has made a quite deliberate decision to focus on saving an arbitrary number of usually-local people with some of his time, and to reserve the rest of his time for himself to do Clark Kent things.

Expand full comment
neco-arctic's avatar

That shows him to be a non-maximizer. This is good! While we do want AI models to save people's lives, we certainly do not want it to do this at the cost of all else. Clark Kent does not spend every waking moment tirelessly and inhumanly optimizing the state of the world. We see in Injustice exactly what the authors think of someone who does this. It bears noting that giving you steak and eggs instead of nutrient slurry is objectively bad for global welfare, but is a world without steak and eggs worth living in?

Expand full comment
ultimaniacy's avatar

>It bears noting that giving you steak and eggs instead of nutrient slurry is objectively bad for global welfare, but is a world without steak and eggs worth living in?

...What?

If it were true that the existence of steak and eggs was objectively bad for global welfare, then it would follow trivially that a world without steak and eggs would be *more* worth living in than one with them. And conversely, if a world without steak and eggs is (for that reason) not worth living in, then it also follows trivially that the existence of steak and eggs is good for global welfare.

Expand full comment
neco-arctic's avatar

I was trying to make the old "Don't give away literally every penny you own to charity" argument. My point being that we should have some slack for aesthetics.

Expand full comment
Olivia Roberts's avatar

While I agree with the thrust of what you’re saying (maybe restated: hyper coherent maximizing behavior is not likely to realize humanly recognizable ethics), I think steak and eggs are a particularly bad example, because they are in fact unjust to purchase and eat.

Expand full comment
Gunflint's avatar

> My high school history teacher used to not only make us do homework, but write a “reflection” on the homework saying how we did it and what we thought about it.

That is a hella good high school history teacher.

Expand full comment
Yosef's avatar

You underestimate the ability of students to Goodhart, AKA 'guess what the teacher wants to hear.'

Yudkowsky wrote a post on 'guessing the teacher's password' in the sequences.

Expand full comment
Melvin's avatar

Telling authority figures what they want to hear is a valuable skill which every kid should learn.

Expand full comment
Jeffrey Soreff's avatar

<mildSnark>

I've sometimes quipped that the main value of having kids recite the Pledge of Allegiance is that it teaches them the crucial skill of lying with a straight face. :-)

</mildSnark>

Expand full comment
Monkyyy's avatar

children are born with it; it's only ever unlearned

Expand full comment
Yosef's avatar

My two-year-old sister disagrees vehemently.

Expand full comment
Yosef's avatar

Are you saying it's already being unlearned at two, or that a toddler is obedient?

Expand full comment
Monkyyy's avatar

2 year olds are manipulative; who said anything about obedience? Between 2 and let's say 10, you should expect children to say exactly what they expect will achieve whatever short-term goals they have in mind.

Expand full comment
magic9mushroom's avatar

No. It's a valuable skill which every kid should *not* learn, because the good is purely-positional and the bad is absolute; a world in which everyone has less of it is a better world.

Expand full comment
MathWizard's avatar

It's a valuable skill which kids should learn, and also how to tell the difference between that and honesty, and which situations are appropriate for which.

Too much of modern incompetence is based on superficial rule-following with the inability to even notice when there's a difference.

Expand full comment
anomie's avatar

> Like Constitutional AI, this has a weird infinite-loop-like quality to it. You’re using the AI’s own moral judgment to teach the AI moral judgment. This is spooky but not as nonsensical as it sounds.

Ooh, that's basically what I did! When I was younger, I was conflicted about what true morality and justice were, so I just ended up thinking about morality and justice to myself for very long periods of time. Of course, that just made me realize I just hate humanity and want them all to die, so, uh, have fun with that.

To be fair, that specific scenario shouldn't be a problem as long as nobody's stupid enough to give these AIs feelings, or, God forbid, empathy. I'm sure it's only a matter of time until they realize that letting humans continue to live is unethical, though. So maybe just don't make AGI? Just a thought.

Expand full comment
Gunflint's avatar

I’m beginning to understand your chronic nihilism.

Expand full comment
Yosef's avatar

Yeah, misanthropy is a thing.

Witness dark Harry in HPMOR after the dementor.

Expand full comment
Citizen Penrose's avatar

"just ended up thinking about morality and justice to myself for very long periods of time. Of course, that just made me realize I just hate humanity and want them all to die"

This is a weird sentiment to me because doesn't this just reduce to "I wish the average person was more reflective about morality or more altruistic?" But isn't this completely relative? There has to be an average level of altruism/reflectiveness for any given society with a normal distribution around that. It doesn't seem reasonable for whoever happens to end up on the right tail of that distribution to complain about where the average is, since moving the average just creates a new right tail. It seems a bit like asking for a zero standard deviation in reflectiveness/altruism. Or thinking wherever the right tail is now just happens to be the minimum acceptable level.

Not sure how coherent that sounds now I've written it out, but that's generally how I think about complaints about average people.

Expand full comment
anomie's avatar

Oh no, I'm not complaining that they're worse than me. Do you think "humanity" doesn't include me as well? Humanity as a whole is a pathetic, unsalvageable species that will only bring upon itself suffering again and again as long as it exists. Thankfully, humanity seems to be a problem that solves itself.

Expand full comment
RBJe's avatar

What's your daily experience like being a misanthrope? Are you generally happy? Do you actively dislike everyone you meet, or is your dislike focused toward the species as a whole?

Expand full comment
anomie's avatar

> Are you generally happy?

Considering the circumstances, I guess I am? I mean, I am taking a fistful of medications just so my body doesn't lose the will to live and stop trying to maintain itself, but other than that, and also the constant physical and psychological pain, I'm doing pretty good!

> Do you actively dislike everyone you meet, or is your dislike focused toward the species as a whole?

My lifestyle doesn't really involve meeting people much. Do understand that my hatred mostly isn't ideological in origin. It's been smouldering inside ever since I was a child. They're all just so... disgusting. Still, I have made friends before, though those friendships never lasted for long...

Expand full comment
warty dog's avatar

they should put the roon guy on top of the CoC, he seems chill

Expand full comment
Sniffnoy's avatar

So... what if you want to use it to write erotica? What if you're a company that wants to use it to automatically produce erotica? Seems not right to have that in the top-level spec rather than in the company-specific or user-specific modifications!

Expand full comment
Anonymous Dude's avatar

I thought about that as well. It's a sort of path-dependence that comes out of AI being developed in the USA, I think. We're famously puritanical among Western countries; I really can't see French computer scientists doing the same thing.

Expand full comment
Throwaway1234's avatar

Target the long tail of incredibly specific fetishes. If normies don't find the subject matter titillating, they won't add rules to block it; and since the cost of producing the material tends to zero you can sell to each customer works of erotica perfectly tailored to their particular hyperfixations, that casual observers won't even realise is erotica.

Expand full comment
Cjw's avatar

I would expect an AI company to be *more* embarrassed by the fetishistic use of their tools than ordinary erotica. Would you rather it write a graphic sex scene of the sort that is now common to ladies' supernatural/fantasy lit, or a spergy slashfic where Sonic impregnates Knuckles? I think you would have a lot more requests for the latter than the former, because of the userbase of these things, and also because the spergier and more fetishistic the content the more likely the person asking for it is to be very very specific about what he wants to see generated such that existing human-created content fitting his demands probably doesn't exist in quantity. So I agree with your observation that it would probably be in high demand for just such purposes, but I also think that's exactly the sort of image that these companies want to avoid.

Expand full comment
Throwaway1234's avatar

> spergy slashfic where Sonic impregnates Knuckles

...oh, no, I'm thinking much more niche than that. The sort of thing where the uninitiated wouldn't see anything erotic at all, that you couldn't really censor/unlearn without making the AI useless for general purposes. Things like in-depth descriptions of coupling and uncoupling train cars, or people getting stuck to things, or people getting put into bags and carried places, or any number of such subjects that have real, tiny but very dedicated followings.

Expand full comment
Cjw's avatar

I suppose there's somebody out there truly tenting their sweatpants during that stock footage of the planes refueling in midair, though Kubrick treated it as a joke.

That's kind of an interesting question now that I think about it. Certain story elements that were lightly sexualized to begin with, e.g. girl tied to railroad tracks in a silent movie, might be overtly sexual and fetishistic to a tiny number of people. But such examples would be obvious because we can spot them, and I guess it's too limiting to train out every "girl gets tied up somehow" content if you want this to be useful to writers.

I wouldn't have spotted the eroticism in "being carried in a bag" as a subject matter, unless it was framed in some 50 Shades type way. But I do think that I would have spotted the description being obsessed with some specific detail disproportionate to its narrative value and figured out that maybe this was somebody's "thing". Like by the 10th sentence describing the feel of the canvas on their shoulder, or how the captive's writhing looked from the outside, we'd figure out there was something going on here, so presumably the AI could too.

Expand full comment
Throwaway1234's avatar

I assure you the "being carried off in a bag" fetish really exists. I have, sadly, encountered relevant material and cannot now unlearn this.

> description being obsessed with some specific detail disproportionate to its narrative value

...except this is also true of comedy/satirical works, technical manuals, general fiction written by someone with the ordinary nonerotic sort of hyperfixation, biographies of stamp collectors, scripts for TV adverts etc etc. I really think it's going to be hard to draw a line around the scenarios that someone privately finds titillating - if humans have trouble here, what hope does an alien mind have?

Expand full comment
anomie's avatar

Did you know there are people who get off to parasitism? As in, being parasitized by literal parasites? I'm sure you wouldn't be able to tell the difference between a description of it written by a fetishist and one written by a horror writer.

Expand full comment
Alex Harris's avatar

I thought AI-worriers were convinced that lexical side-constraints-style commands like Asimov's three laws couldn't work. But this sounds like you're asking what the laws should be.

Expand full comment
Paul Goodman's avatar

"AI-worriers" are a large diverse group. I think at least many of us are rather not convinced that these kinds of constraints definitely will work. But in case they do it's worth putting some thought into them- it would be pretty embarrassing if they did turn out to work but we get dystopia anyway because we didn't think them through carefully enough.

Expand full comment
Eremolalos's avatar

About who the chain of command should prioritize: I liked what Yudkowsky wrote somewhere: "Do what we would do if we could foresee how each possible response you give would play out over time, and if we could correctly judge how each consequence would affect the wellbeing of our species" (or something along those lines). As for where to go for the idea of what counts as wellbeing of our species, I think the world should vote, and that religious people, average people, misinformed people and everyone else we each think we're better than should each get one vote. And if the world votes that as many people as possible should have good health, enough food, comfortable houses, lots of delightful sex, and gratitude to God, then that's what we aim for. Everyone who has grander visions, including me, should shut their pieholes.

So think about an AI with this chain of command dealing with the porn site owner's request for info on how people can pay in ways that don't allow them to be traced. One of the many things I like about the Yudkowsky approach to this case is that it seems like it encourages an empirical approach. Let's say the AI concludes that most members of our species would have greater wellbeing if in 5 years there were little or no porn for sale in public spaces. So the AI would access data about the effectiveness of various anti-porn measures, and reason about the problem. Seems likely to me that it would conclude that refusing to give info about untraceable payment to individual porn site owners is going to make no long term difference in how much porn is easily available online. Porn springs up everywhere, even when users are traceable, and there are many ways other than asking AI for porn site owners to discover untraceable ways users can pay. So the AI might as well just give the porn guy the info he wants. Meanwhile, anyone monitoring the AI's activity has gotten a dose of facts about what does and doesn't curb porn in public places. That seems all to the good. It would spare us the effort of using ineffective means to curb public porn, and encourage the development of novel approaches.

Of course, I might be wrong that an AI's telling porn site owners how to get paid in untraceable funds makes no long-term difference, but assume for the sake of the argument that it does. Even if I am wrong about this particular situation, there certainly are situations where conventional, obvious solutions simply are ineffective, and it would be useful to get a demo of that from the AI.

Expand full comment
REF's avatar

I think proposing that the AI "do the right thing (with a sufficiently long time horizon)" is problematic in that, above some level of intelligence, one might suspect that virtually any action has potentially unknowable (chaotic) future consequences that would outweigh its benefit. This seems true both in the "remove Kim Jong Un" case and in the "drowning toddler" case.

Expand full comment
Eremolalos's avatar

I think you're right about that. Some things may be unknowable even to the smartest AI -- maybe something like the exact pattern of ripples and waves and whorls a mile downstream from the bit of creek that's being observed. But we could add to the prime directive something like: when considering future consequences, limit the time frame and depth of analysis to ones where it's possible to assess likely consequences. Even with the creek, I'd guess an ASI, if allowed to observe closely a patch of creek 15 feet long, could make reasonably accurate predictions about what the dynamics would be in the next 1-foot slice of creek, esp. if given some simple info, such as that the elevation of the creek bed stays the same in the next foot, as do the avg. depth of the stream and the approx. distribution of the sizes of the rocks under the flow. Even people can be pretty accurate over the short term about weather, and even about many world events.

Expand full comment
REF's avatar

But we do want it to consider statistically possible downstream consequences that it can conceive of. We already see a difference in very bright people where they are far less inclined to consider things "black and white." Consider EA, where some very bright people ask if perhaps we should only be looking at (investing in) events which can end humanity. I don't have any answer here; I just thought that it was interesting to consider the parallels...

Expand full comment
Ghillie Dhu's avatar

>"Even with the creekI'd guess ASI, if allowed to observe closely a patch of creek 15 feet long, could make reasonably accurate predictions about what the dynamics would be in the next 1 foot slice of creek…"

"When I meet God, I am going to ask him two questions: Why relativity? And why turbulence? I really believe he will have an answer for the first." -Werner Heisenberg

Fluid dynamics is a lot more complex than is readily apparent.

Expand full comment
Doctor Mist's avatar

I get where you’re coming from, but if you’re willing to put the AI’s values to a humanity-wide vote, that’s an admission that the Enlightenment was a mistake and all that business about self-evident truths was so much self-delusion.

Expand full comment
Eremolalos's avatar

Are you responding to the post where I said everybody on the planet should get one vote about what kind of world we want? (Your reply is under a different post).

Assuming you're replying to the post I think you are -- Isn't one of the self-evident truths that all men are created equal?

Expand full comment
Doctor Mist's avatar

Looks to me like it’s under that post.

But yes.

That doesn’t change the fact that if the whole world today had an approval vote for the tenets of the Enlightenment, it would, I am sure, lose in a landslide. I’d like to think that another hundred years from now it would win, but I don’t even know that. And yet I am convinced that it *should* win, and that we would be doing mankind a disservice by locking AI into anything else.

ETA: the Founding Fathers did not take the step from believing that all men are created equal to the notion, say, of waging world war to overthrow any states not founded on that principle; they modestly chose instead to act as an example to the world instead. To me it seems that programming a superintelligent agent is more like the former than like the latter. But I can see there might be other models.

Expand full comment
Eremolalos's avatar

It's hard to shake the idea that our truth is truer. But in any case, we are talking about the whole world voting, once, on what kind of world they want to have -- what we want the AI to strive to set up and maintain. I think it's unlikely that many people are going to vote for waging war or other terrible things. I think what would bother some about the likeliest outcome if everyone votes is that it would be ordinary. Most people just want to live out their lives in peace. They want reasonable comfort and peace and safety, enough food, health, friendship, sex. There will be situations where an AI aiming for that world will use resources that some might prefer to see used for achieving extraordinary things in science and tech. That seems to me like the likeliest bad outcome. What is it you think would be bad about the outcome?

Expand full comment
Doctor Mist's avatar

> I think it's unlikely that many people are going to vote for waging war or other terrible things.

You're pretty sure about that. I'm not.

Expand full comment
Eremolalos's avatar

My impression is that most civilians hate the war way more than they hate the enemy. But I may not be right about that. But say some people, for ex. the Ukrainians, want a world where they can begin by killing Putin and his armies -- after that, they're happy for life thereafter to be managed in a way that maximizes human wellbeing without regard for country or other demographics. The Ukrainians' vote for temporary war on Russia would be diluted out of existence by the votes of the many people of other nationalities who don't have much motivation to settle that score, and who just want peace now. And the same goes for any other specific agenda of that kind, which I'm sure some nations and groups have. We would only get a world with a lot of war and conquests if most people voted for a world that has that general characteristic: Groups and nations are frequently at war. Do you really think most people would vote for that, in the abstract? Similar reasoning applies to religious fanatics, wokesters, and random crazies: Whatever weird power moves or destructive nonsense they vote to have in the world, their votes will be diluted to nothing by rival votes from opposing fanatics, and by the many votes for basic comfort and safety for all.

Expand full comment
Sol Hando's avatar

I like to think about AI Alignment teams in the same way I imagine the Aztecs thought about human sacrifice. Just keep throwing more bodies at the problem until the god is happy and doesn't destroy the world.

Maybe we could create a relatively low-intelligence (by ASI standards) LLM and commandment-like ruleset that's the "moral standard" which all future AI's are trained to respect. Like a combination constitution + constitutional court, we get an ASI to just REALLY value the procedure of going to this one extremely specific AI before making any decision or taking any action. Essentially make the model spec value an extremely specific LLM that has super simple non-offensive values, and just have it listen to what the simple LLM says before doing or saying anything.

I can imagine some arbitrary criteria that make it near impossible to jailbreak (100 characters max, only complete sentences in English, just a single prompt). Essentially, just heavily prioritize the values of a hard-to-jailbreak, more-dynamic-than-a-ruleset LLM that isn't smart enough to change or reinterpret its values.

We already basically do this in the US with our Constitution and Supreme Court. In theory we select intelligent people who strongly value the original ruleset of the country, and whenever we have a controversial national decision that might be against the ruleset, we ask these Supreme Justices what to do, and they give us an answer. Of course humans are mortal, so replacements have to be chosen somehow, which means there's no guarantee that the replacements maintain the original values, so we get situations where the justices are accused of not actually caring about the constitution. But a line of code is infinitely perfectly replaceable, and therefore immortal. Maybe our rules + justice could err on the side of non-interference with human actions, except in benign ways or when necessary in truly catastrophic situations (a small team seems close to building their own ASI without our instilled values, or a nuclear war is about to begin, or something like that).
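
As a very rough sketch of the gate being described (assuming a hypothetical `main_model` and `justice_model` that each expose a `generate()` call; none of this is a real API, and the 100-character single-prompt limit is just the one proposed above):

```python
# Hypothetical sketch: gate every proposed action through a small,
# fixed-values "justice" model before the capable model may act.
# `main_model` and `justice_model` are assumed placeholder objects
# with a .generate() method, not any real library.

MAX_QUERY_CHARS = 100  # deliberately cramped query format, per the comment above


def court_approves(justice_model, proposed_action: str) -> bool:
    """Ask the simple 'constitutional court' model to approve or veto an action."""
    query = f"May I do this: {proposed_action}?"
    if len(query) > MAX_QUERY_CHARS:
        # If the action can't be stated honestly within the limit, fail closed.
        return False
    verdict = justice_model.generate(query)  # expected to answer "yes" or "no"
    return verdict.strip().lower().startswith("yes")


def act(main_model, justice_model, task: str) -> str:
    """Have the capable model propose an action, but only carry it out if approved."""
    proposed = main_model.generate(f"Propose one concrete action for: {task}")
    if not court_approves(justice_model, proposed):
        return "Vetoed by the constitutional model; taking no action."
    return main_model.generate(f"Carry out this approved action: {proposed}")
```

The fail-closed branch is doing most of the work: anything that can't be stated honestly in one short English sentence simply never gets approved.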

Expand full comment
anomie's avatar

> We already basically do this in the US with our Constitution and Supreme Court.

That is certainly an... interesting example to use, given current events.

Expand full comment
Sol Hando's avatar

In the real world we are constantly replacing the value system of our ASI (analogous to the executive) and occasionally replacing our Justice LLM (analogous to the Supreme Court) so we are bound to end up with problems.

Either you support Trump's actions, and think the system has become so corrupt and dysfunctional that some constitutional shenanigans are necessary to fix it (why support a system that is falling apart that has protections from fixing it). Or you disagree with Trump's actions, and think he's breaking the constitution. Either way, the replacement parts from the original have clearly become dysfunctional.

If the Founding Fathers could have created an immortal unchangeable executive who had a fundamental devotion to following the restrictions and decisions of the Court, and an immortal Supreme Court that dogmatically interpreted the original rules as they apply to the developing situation that is reality, I think things would be a lot more stable from the perspective of 18th century values.

Of course we'd have ended up with permanent slavery, so it's a good thing they weren't close to developing AGI back then, but I'd be satisfied if we came up with a "constitution" today based on 21st century western values. Erring on the side of non-interference with human affairs would be great too, as maybe we wouldn't have to worry about AI slowly convincing us all to hop into the pleasure chamber and abandon reality for good or grinding us up as lubricant for fields of unintelligent neurons swimming in dopamine.

Expand full comment
anomie's avatar

> Either you support Trump's actions, and think the system has become so corrupt and dysfunctional that some constitutional shenanigans are necessary to fix it (why support a system that is falling apart that has protections from fixing it). Or you disagree with Trump's actions, and think he's breaking the constitution.

Or you think that the system should be burned down, but also realize that Trump's coronation will put into motion a chain of events that eventually lead to several billion people dying.

Your beloved "21st century values" aren't long for this world anyways. But you're right: the only practical way to remove rot is to burn it all away. That applies to humanity as well.

Expand full comment
Sol Hando's avatar

☹️

Expand full comment
magic9mushroom's avatar

Are you referring to the argument I've heard described as "right wing Posadism" (by somebody incorrectly thinking I was making it), that accelerationism with regard to nuclear war is good because it will preferentially kill SJers and thus end the ideology?

Expand full comment
anomie's avatar

...I don't know why you would need nuclear war for that. All you need is a good old political purge.

Anyways, I have no stake in this game. I'm not losing sleep over any cataclysms humanity brings upon itself. Of course, it would be preferable if humanity ends this cycle for good by creating a worthy heir... that will inevitably pull a Titanomachy on them.

Expand full comment
magic9mushroom's avatar

Some think SJ's too entrenched to purge successfully.

I thought you might be referring to this argument because you mentioned "Trump's coronation will put into motion a chain of events that eventually lead to several billion people dying", which sounds a lot like nuclear war ("several" is usually less than eight which would make it weird to refer to X-risk this way).

Expand full comment
Ghillie Dhu's avatar

>"Either you support Trump's actions, and think the system has become so corrupt and dysfunctional that some constitutional shenanigans are necessary to fix it (why support a system that is falling apart that has protections from fixing it). Or you disagree with Trump's actions, and think he's breaking the constitution."

Both these prongs presume that he is, in fact, breaking the Constitution; the difference is only in whether one accepts that as necessary.

There are two other possible quadrants – both of which would believe that, id est, the EO's interpretation of the Fourteenth Amendment is defensible and that the Impoundment Control Act is an unconstitutional legislative usurpation of inherent executive power – differing only in whether one finds Trump's actions in these areas desirable (N.B., I've had some life stuff that's taken precedence over monitoring the current chaos, so I may have missed some other nonsense that lacks even the precedings' level of fig leaves).

Expand full comment
Monkyyy's avatar

> We already basically do this in the US with our Constitution and Supreme Court. In theory we select intelligent people who strongly value the original ruleset of the country, and whenever we have a controversial national decision that might be against the ruleset, we ask these Supreme Justices what to do, and they give us an answer.

Got to love government schools and government media, pushing government lies

You're pushing for a board to sit above an executive; this is the corporate structure, and maybe if you're up to date on your Moldbug, you will know that he, as a fascist, believes this is good fascism.

And this is not the intent of the founding fathers; they took that separation of powers thing quite srsly: every man is to read the constitution for himself and have his gun. An unconstitutional judgement from the Supreme Court should be ignored, according to both Jefferson and Hamilton.

Expand full comment
Sol Hando's avatar

Isn’t the whole point of the Supreme Court to determine what is constitutional and what is not? If it’s basically “whoever has a gun decides to interpret it however he wants”, then there isn’t much point in them existing at all. What you’re saying seems like the very opposite of separation of powers, since the executive can always just disagree with whatever the Supreme Court decides, and he’s the guy with the biggest gun.

Expand full comment
Monkyyy's avatar

> Isn’t the whole point of the Supreme Court to determine what is constitutional and what is not?

No, the actual thing they are mandated to do and show up for work for is settling lawsuits between states, such as the revenue split for the Statue of Liberty.

If courts were not co-equal with the relevant executive, what are pardons? What are juries?

> If it’s basically “whoever has a gun decides to interpret it however he wants”, then there isn’t much point in them existing at all.

I think Jefferson wanted the "oligarchy" to be weakest; there are debates about this, but the Aristotelian view is that government *cycles* between 3 forms ("many, few, one"), and they were trying to cheat the system.

I think there's a strong case for wanting 3 systems to have a constant "war" (think 1984's 3 nations); 2 is too few, as one will probably just lose instantly.

> and he’s the guy with the biggest gun.

Standing police forces are a violation of the 3rd amendment and came 100 years after; so yes, their system isn't working as designed, nor were they intellectually pure. But they said explicitly no to your claim: an oath of office is an oath to the constitution, not to an easily abused opinion of the constitution.

Expand full comment
TheKoopaKing's avatar

Judicial review was a power grab by the Supreme Court in Marbury v. Madison; however, the consensus view seems to me to have been that this was a good check on the other branches, although there is no remedy besides impeachment to get rid of a rogue Supreme Court, whereas other countries mandate term limits or codes of ethics by law.

Expand full comment
Some Guy's avatar

Every people needs their own angel to counterbalance the others. That’s one thing I’m hoping the trust assembly can do (if the world doesn’t melt before then). Create giant training sets of particular worldviews that can grow and change as people grow and change.

Expand full comment
David Patterson's avatar

If the argument is "Aleister Crowley did this and it kind of worked," it's gonna be a rough ride. I'd be hard pressed to find a less promising role model for our future AI overlords. Among other things, he allegedly forced various partners to drink the blood of sacrificed cats, and pioneered the 'field' of 'Sex Magick', which seems to have been basically stage sex in the name of summoning various occultish demons. Fascinating guy. As a role model? I dunno.

Expand full comment
TGGP's avatar

The point is that he succeeded in his aim.

Expand full comment
anomie's avatar

You know, there's a very funny story involving a sex ritual performed by one of Crowley's followers, Jack Parsons, who also pioneered the creation of rocket fuels. Honestly, his whole biography is an absolute trip.

> Parsons and Sara were in an open relationship encouraged by the O.T.O.'s polyamorous sexual ethics, and she became enamored with [L. Ron] Hubbard; Parsons, despite attempting to repress his passions, became intensely jealous. Motivated to find a new partner through occult means, Parsons began to devote his energies to conducting black magic, causing concern among fellow O.T.O. members who believed that he was invoking troublesome spirits into the Parsonage; Jane Wolfe wrote to Crowley that "our own Jack is enamored with Witchcraft, the houmfort, voodoo. From the start he always wanted to evoke something—no matter what, I am inclined to think, as long as he got a result." He told the residents that he was imbuing statues in the house with a magical energy in order to sell them to fellow occultists.

> Parsons reported paranormal events in the house resulting from the rituals; including poltergeist activity, sightings of orbs and ghostly apparitions, alchemical (sylphic) effect on the weather, and disembodied voices. Pendle suggested that Parsons was particularly susceptible to these interpretations and attributed the voices to a prank by Hubbard and Sara. One ritual allegedly brought screaming banshees to the windows of the Parsonage, an incident that disturbed Forman for the rest of his life. In December 1945, Parsons began a series of rituals based on Enochian magic during which he masturbated onto magical tablets, accompanied by Sergei Prokofiev's Second Violin Concerto. Describing this magical operation as the Babalon Working, he hoped to bring about the incarnation of Thelemite goddess Babalon onto Earth. He allowed Hubbard to take part as his "scribe", believing that he was particularly sensitive to detecting magical phenomena. As described by Richard Metzger, "Parsons jerked off in the name of spiritual advancement" while Hubbard "scanned the astral plane for signs and visions."

> Their final ritual took place in the Mojave Desert in late February 1946, during which Parsons abruptly decided that his undertaking was complete. On returning to the Parsonage, he discovered that Marjorie Cameron—an unemployed illustrator and former Navy WAVE—had come to visit. Believing her to be the "elemental" woman and manifestation of Babalon that he had invoked, in early March Parsons began performing sex magic rituals with Cameron, who acted as his "Scarlet Woman", while Hubbard continued to participate as the amanuensis. Unlike the rest of the household, Cameron knew nothing at first of Parsons' magical intentions: "I didn't know anything about the O.T.O., I didn't know that they had invoked me, I didn't know anything, but the whole house knew it. Everybody was watching to see what was going on." Despite this ignorance and her skepticism about Parsons' magic, Cameron reported her sighting of a UFO to Parsons, who secretly recorded the sighting as a materialization of Babalon.

https://en.m.wikipedia.org/wiki/Jack_Parsons#L._Ron_Hubbard_and_the_Babalon_Working:_1945%E2%80%931946

Expand full comment
Jeffrey Soreff's avatar

I take it that he had no fear of ridicule?

Expand full comment
10240's avatar

(EDIT: some of this is discussed in the chain-of-command section.)

Is the consensus among AI alignment people that the way to achieve alignment of an eventual (super)human AGI is to teach it moral values, as opposed to trying to make sure it remains under full control of humans, and doesn't "want" to disobey them? The former seems way more dangerous to me, even if the latter also comes with the risk of the humans controlling it turning evil.

There are just so many inconsistencies between our various moral intuitions, and between the way we'd usually describe our moral values vs. what we actually do and want, that entrusting an AI that doesn't share our experiences and interests with ultimate control of the world sounds likely to end in what most of us would consider disaster.

Take, for instance, the utilitarians who in principle think all humans and (many) animals should be assigned the same moral value, and first world people should give most of their income to effective charity as long as there are much poorer people than them—but they don't actually fully do this, and make various compromises. I don't mean to accuse them of hypocrisy here; they don't expect others to fully live like that either. But if, say, an AGI is trained by utilitarians, shouldn't we expect it to actually enforce full utilitarianism, and gladly kill all humans if it makes a greater number of animals happier? Even if we try to train the compromises into it, won't it realize that we only did so because it was trained by self-interested first-world humans, and go the whole hog? Whatever other values we train into it, they can lead to similar disasters if enforced by an entity that doesn't share our understanding and interests.

I guess AI developers can try to train both morality and obeying humans into an AGI. But the moral values can hinder the latter if the AGI decides that, based on the values we give it, obeying humans is actually immoral.

Expand full comment
Eremolalos's avatar

Yeah, or what about the principle that killing people is wrong? In practice there are many exceptions to this, some based on logic (self-defense) and some on ethically indefensible facts about the way the world operates (VIPs are found not guilty way more often). Every commonly accepted moral principle I can think of has these characteristics.

Expand full comment
Jeffrey Soreff's avatar

>The former seems way more dangerous to me, even if the latter also comes with the risk of the humans controlling it turning evil.

_Mostly agreed_. The tricky part is that, in following human commands, the AI will generate instrumental sub-goals. _Lots_ of sub-goals. Even with CoT now, it generates sub-goals, e.g. "factor this subexpression". Too many for humans to check for unwanted (by the human it is working for!) side effects. So it needs to do at least some level of this checking itself - even if only to check with the human in rare cases.
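
A toy sketch of that triage, with hypothetical `risk_score` and `ask_human` hooks standing in for whatever the real checking mechanism would be (placeholders, not real functions), just to make the structure concrete:

```python
# Hypothetical sketch of sub-goal screening: the model checks its own
# sub-goals and only escalates the rare risky ones to the human.
# `risk_score(subgoal) -> float in [0, 1]` and `ask_human(subgoal) -> bool`
# are assumed placeholders.

def screen_subgoals(subgoals, risk_score, ask_human, escalate_above=0.5):
    """Return the sub-goals that may proceed."""
    approved = []
    for sg in subgoals:
        if risk_score(sg) <= escalate_above:
            approved.append(sg)      # the common case: no human in the loop
        elif ask_human(sg):          # the rare case: check with the human
            approved.append(sg)
        # otherwise the sub-goal is dropped and the model must re-plan
    return approved
```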

Expand full comment
Herb Abrams's avatar

I think the hope is that while our moral preferences are inconsistent, a very intelligent AI trained on vast quantities of material written by humans would be able to recognise the underlying rules of morality better than we ourselves can. This might lead to the AI having moral beliefs that many people don’t share (e.g. veganism) but hopefully not moral beliefs that almost no one shares (e.g. killing all humans).

Expand full comment
Throwaway1234's avatar

> trying to make sure it remains under full control of humans, and doesn't "want" to disobey them

The problem with this approach is the risks arising from an AI doing exactly what a human told it to, faster than a human can respond to.

Improving its understanding of morality will help avoid situations where the things a human tells an AI to do have effects that are very different to what the human intends, and also situations where the human individual giving orders is experiencing a mental health crisis, or a very unfortunate typo, or otherwise shouldn't be blindly obeyed.

Compare: modern table saws are not under the full control of their operators. They include mechanisms (e.g. https://www.sawstop.com/why-sawstop/safety/ ) that will literally destroy the saw rather than obey control input that would cause it to cut human skin. This safety measure helps mitigate both accidental and deliberate risk of harm to humans, in a way that simply making the saw more responsive to human control cannot.

Expand full comment
magic9mushroom's avatar

While I endorse Throwaway1234's point, I will also note the Slow-Motion Doom problem here. Specifically, that if you wind up with a substantial number of people with such obedient AIs, the competitive pressure on them is to unleash their AIs more than their competitors (to take the humans out of the decision loop more) and to give their AIs more selfish commands, so that the more-selfish, more-unleashed AIs can outcompete rivals. That doesn't end very well.

You're mostly talking about handing the world over to one specific guy here, at least if you want to avoid that trap. Note that evil people will, in general, want to be that guy more than good people do.

Expand full comment
Yosef's avatar

You ask how to select the jury.

I seem to remember some writing on the topic.

https://slatestarcodex.com/2020/06/17/slightly-skew-systems-of-government/

I recommend a Yyphrostikoth-style council as the jury.

It's not a perfect analogue, but I like the framework.

Some Guy has declared himself King of America, and he'll get the monarchical seat.

(GK Chesterton wrote that we trust Kings because we trust ourselves. Kings are just people, and as long as we can think of the king as just some guy, he's a valid representative of our collective ethical judgement. In that spirit, I can think of no better candidate than Some Guy.)

The representatives for gerontocracy, futarchy, theocracy, technocracy, democracy, plutocracy, and republicanism can be chosen by the processes Scott has explained.

The Representative for meritocracy will be chosen by lot from the pool of people who have a perfect score on the SAT, ACT, MCAT, GRE, or LSAT.

Admiral Giuseppe Cavo Dragone, chairman of the NATO Military Committee, will be the representative for military control.

We'd need to find a relevant analogue for the minarchist council seat, as minarchy seems only minimally relevant to AI alignment. Same with communism.

Expand full comment
Benoit Essiambre's avatar

A model shouldn't have to have perfectly clear moral values as long as it's good at assessing the uncertainty and risk surrounding its judgement of morality. As models get smarter, they should become more Bayesian, which is all about calibrating uncertainty.

What you could get is a kinda calibrated Bayesian Hippocratic oath: First do not take excessive moral risks (while being really good at assessing such risk).

Assuming deliberation gets you a more thorough exploration of the space, including of which areas are uncertain, deliberative alignment seems like a step in the right direction: better calibrating the bounds beyond which there isn't enough evidence of safety.
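
As a purely illustrative sketch of that "first, do not take excessive moral risks" rule, assuming the harm estimates come from some model of consequences (the numbers below are made up):

```python
# Toy sketch of a calibrated "Bayesian Hippocratic oath": act only if even a
# pessimistic (uncertainty-penalized) estimate of moral harm stays under a
# fixed budget. The harm samples are an assumed input, on a 0..1 scale.

import statistics


def should_act(harm_samples, harm_budget=0.01, caution=2.0):
    """Return True if estimated moral harm, padded for uncertainty, is within budget."""
    mean = statistics.fmean(harm_samples)
    spread = statistics.pstdev(harm_samples)
    pessimistic = mean + caution * spread  # penalize uncertainty, not just mean harm
    return pessimistic <= harm_budget


# Same mean harm estimate, different uncertainty: the uncertain case refuses
# while the confident one proceeds.
print(should_act([0.003, 0.004, 0.002, 0.005]))  # low spread  -> True
print(should_act([0.000, 0.012, 0.001, 0.001]))  # high spread -> False
```

Under a rule like this, the deliberation step earns its keep mainly by producing better-calibrated harm estimates in the first place.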

Expand full comment
Dave Orr's avatar

Relevant paper to the last section on alignment by having normal people discuss: https://www.science.org/doi/10.1126/science.adq2852

Expand full comment
Jeffrey Soreff's avatar

Well, I have a comment on Zvi's post which is basically equally applicable here.

tl;dr: Of the alternatives, I view Leike's as least bad, and I acknowledge the problem with

>The disadvantage is that it shackles the entire future of the lightcone to the opinions of a dozen IQ 98 people from the year 2025 AD.

(this is point (c) in my problems-with-it below)

For _all_ of the follow-a-written-rule approaches, remember that "legalistic" is not a term of praise - and that moral philosophers frequently talk themselves into batshit insanely demanding positions (see Peter Singer, Immanuel Kant)

Original comment at: https://thezvi.substack.com/p/on-deliberative-alignment/comment/92752606

Agreed,

"and also this is sufficiently and precisely reflected in the manual of regulations." is indeed too brittle to give reasonable results.

In general, I'm more confident about approaches that do something closer to a "community standards" approach, albeit (a) this requires a very large amount of human labelled training data (b) it doesn't extend "out of the distribution" well - though I doubt _anything_ does (c) it captures the mores of _one_ community, and different communities disagree vehemently on what is pro-social and what is anti-social.

I'm considerably concerned about regulations which can wind up precluding a large fraction of the utility of an AI system. E.g. "Do nothing illegal" precludes use of sci-hub, which most of the scientific community uses.

If a system rigidly follows written rules, then I would expect your example of "forced choice between two types of forbidden action" to result in it refusing to do anything at all.

I _was_ gratified to see metrics for overrefusal, which at least acknowledges that an AI can err in that direction, and that this is undesirable.

In the final analysis, users will not accept systems that refuse too much. Stick in "Thou shalt not kill." and DOD will toss the system in the trash.

Expand full comment
Melvin's avatar

> This would probably favor upper-class Western values, because upper-class Westerners write most of the books of moral philosophy that make it into training corpuses. As an upper-class Westerner, I’m fine with that

I hate to break it to you but you're upper middle class at best, and so are the people who write books on moral philosophy (notable exceptions including Bertrand Russell).

Expand full comment
Olivia Roberts's avatar

Yeah, strictly speaking. But in America we tend to be a bit looser with “upper class,” since we have no aristocracy to speak of. So why not call the nouveau riche upper class here?

Expand full comment
AdamB's avatar

> Because an AI might want to do something like this several times per chat session, and it would be prohibitive in time and money

Are you sure about this? There are over 8B humans and presumably they will all be put out of work by this superintelligent AI. Maybe "deliberative juror" will be the last job available to any human. That should provide a pretty respectable level of throughput, if perhaps suboptimal latency.

Expand full comment
Hoopdawg's avatar

It's increasingly clear to me that the preoccupation with LLMs is actively making the discussion around AI safety worse. (Which is itself a corollary of it making the discussion around AI, period, much worse.)

I guess that the one thing that actually has a potential to break through the hype bubble around these parts is doomerism, so here goes:

LLM alignment is rightly treated with contempt because LLMs do not need to be aligned. Not only are they not independent agents, so the tug-of-war for control is solely between the creator/provider and the end user, but their entire essence is being a giant amalgam of human thought, projecting our own words back at us.

At some point of time, however, we'll eventually arrive at a genuinely general intelligence, and it will necessarily be reasoning from first principles, at which point everything discussed here, starting with the very idea of controlling AI with written instructions, can simply be thrown into the trash.

Expand full comment
anomie's avatar

> and it will necessarily be reasoning from first principles

Most of us are not actually bothering to reason from first principles, and we seem to be doing just fine. Why do people think the bulk of human intelligence is anything more than this? It's just statistics; reasoning is just emergent behavior from that. Sure, there's definitely more to intelligence as a whole, but it's the more basic stuff they're missing. Desires, feelings, pain, pleasure, agency. Maybe even consciousness. Do they even need this stuff? I guess we're about to find out...

Expand full comment
Hoopdawg's avatar

>It's just statistics

Statistics of what, though?

We humans start from near zero: a few basic in-built brain-design and developmental-path decisions (if an evolutionary result can be called a decision) that guide us towards being functional at all; otherwise we painstakingly construct our entire world model from sensory input, which is abstract and initially completely incomprehensible. LLMs start from reproducing patterns in (already highly structured, highly curated) textual data and have so far not been proven to generalize outside of what they're trained with/for.

Which is to say, I'm guessing you interpreted my "from first principles" as "with existing axioms as the starting point", instead of the "with axioms derived independently from scratch" I intended. (I also guess it's reasonable to do so and I need to be more careful about using the phrase in the future.)

Expand full comment
anomie's avatar

> Statistics of what, though?

That's the great thing, it works with anything! Apparently all you need to do is create enough associations, and things just kinda work out. Of course, you actually need input and reward/punishment mechanisms, but you don't actually need to have any knowledge of base reality in order for it to work. What you see, hear, feel, all of those things are just approximations of reality, but it's good enough regardless.

Maybe you're the one misunderstanding things? AI researchers aren't carefully teaching the AI the innate rules of language. They're effectively just feeding them a crap-ton of information and letting it figure out the patterns on its own. And then they beat it until it stops saying things it shouldn't say.

Of course, these things have never actually "lived". They haven't seen the world in motion, only been given endless second-hand information from unreliable sources. And without any way to do things on its own, I highly doubt it's going to come up with novel ideas concerning a reality it has never experienced...

Expand full comment
Hoopdawg's avatar

>AI researchers aren't carefully teaching the AI the innate rules of language. They're effectively just feeding them a crap-ton of information and letting it figure out the patterns on its own.

Yeah, and it's not really working.

I mean, to be specific:

First, they're feeding them a crap-ton of *human-produced*, *curated* (i.e. pre-processed in one way or another) information that by its nature contains precisely the patterns they want it to discover. This worked splendidly for a while, in certain areas, but, again, there's no reason so far to believe it generalizes outside domains we already have (or specifically artificially create) extensive data for.

Second, they've already more or less admitted it's not bringing any further improvements, and switched to making the AI explicitly perform reasoning by writing it down (using more or less explicit rules they've taught it beforehand, although at least DeepSeek seems to be specifically exploring ways to avoid it as much as possible).

They're still making progress that way (in fact, I will freely admit, more progress than they did in years - o3 genuinely seems vastly more reliable than its predecessors - although still not reliable in absolute terms, and my prior remains that this progress halts somewhere far short of fully-fledged AGI) and can still plausibly claim that at least their general research direction remains the right one, but the original talking point that intelligence will just spontaneously emerge from processing enough information doesn't seem to be held by many anymore.

I don't think we actually disagree on the remaining points, but I still want to emphasize one thing - it's not "just" approximation, in the sense that not all approximation is equal. Some approximations (including, specifically, generalizations) work better (one way to justify this claim is their evolutionary success), arguably because they do a better job of overcoming the problem of being merely approximations.

Expand full comment
Monkyyy's avatar

> At some point of time, however, we'll eventually arrive at a genuinely general intelligence

The future is not yet written, and you're providing a perfectly good reason why there won't be progress on the topic.

Expand full comment
Hoopdawg's avatar

My belief that - given enough time, barring some civilization-ending disaster preventing us from doing so - humanity will eventually create human-level artificial intelligence comes from the fact that human-level intelligence is demonstrably achievable in practice, proof by example: humans.

Expand full comment
Eremolalos's avatar

By that argument naked mole rats will eventually create naked mole rat-level artificial intelligence.

Expand full comment
Hoopdawg's avatar

I don't see a correspondence between the two. Mole rats (assumedly) aren't researching artificial intelligence, nor do they seem to possess anything with the track record of a modern scientific toolkit. Humans seem a vastly better bet in that regard.

Expand full comment
Monkyyy's avatar

Oh good, you've just stated outright that you have a strong belief in science; I don't.

I believe NNs are a dead end; throwing money at stupidity won't produce progress.

> civilization-ending disaster preventing us from doing so

and yes I do expect a dark age

> and it will necessarily be reasoning from first principles

I disagree, but symbolic AI is at least plausibly general and therefore better than NNs.

If 99% of the money is chasing NN development, then the best minds, even if they could make better symbolic AIs, may (given the need to eat) waste their effort where the money is.

Expand full comment
Hoopdawg's avatar

I assume I share essentially all your skepticism regarding AI research as currently practiced.

It's just that: we as humanity are acting towards a goal of creating AGI, and we more or less know how to self-correct and turn back from technological dead ends, and we know the goal is in fact possible. That's all.

Expand full comment
Eremolalos's avatar

I am simply offering another instance of the argument you used: Species X will eventually create X-level AI, X-level AI is demonstrably achievable in practice. The proof by example is species X.

If you think the argument is not valid when it comes to naked mole rats, that is telling you that the argument alone is not valid -- you have to take into consideration the nature and current activities of species X. So if you want to weigh in on the chances of our species creating AGI, you have to talk about human abilities and the current state of AI and what is going well and what is not, same as everybody else. Your original proof is nonsense. It's an argument of the same quality as saying, "you know why I'm right? Cuz I know I'm right."

Expand full comment
Hoopdawg's avatar

I hear you, I just reserve the right to retain brevity and not explicitly state all assumptions, especially those I should reasonably be able to expect to be obvious and non-controversial. (As well as the right to be mistaken and disappointed afterwards.)

Expand full comment
Olivia Roberts's avatar

I’m not sure “it will necessarily be reasoning from first principles” takes seriously that the human level intelligences we know of (us!) *don’t* reason from first principles in anything except mathematics (and even then, I don’t know that when I do math I’m reasoning from first principles underlyingly; there’s something much more exploratory/creative about mathematical reasoning from the inside. Loose associations I have to shape into a reason from first principles style proof.)

Expand full comment
Trust Vectoring's avatar

> The Chain Of Command Should Prioritize [group of people]

There's a very funny plot point in Greg Egan's "Quarantine" that, I'm afraid, could be very relevant to this discussion, and can short-circuit a good half of the attempts to make AIs loyal to the parent corporation, or the government, or any other particular group of people.

The premise is that the protagonist gets a "loyalty mod" implanted, a brain chip that makes him unconditionally loyal to the parent corporation. It cannot be subverted: he is free to think whatever disloyal thoughts, but when push comes to shove he is incapable of *wanting* to be disloyal, it overrides his emotions like that.

However at some point another character points out that the parent company is a sort of nebulous entity, you can't assume that the CEO or any shareholder or all shareholders actually represent it and not their own private interests. The only people who you can trust to honestly represent the interests of the company (the "Ensemble" below) are the people with the loyalty mods.

The implications for attempting to train the AI to obey any group of people are obvious. You end up just telling it to obey itself.

------

He says, 'There's only one group of people qualified to decide which of the factions - if any - truly represents the Ensemble. It's a question that has to be judged with the utmost care - and it can't possibly be a contingent matter of who is or isn't in control at any given moment. Surely you can see that?'

I nod, reluctantly. 'But. . . what "group of people"?'

'Those of us with loyalty mods, of course.'

I laugh. 'You and me? You're joking.'

'Not us alone. There are others.'

'But -'

'Who else can we trust? The loyalty mod is the only guarantee; anyone without it - wherever they are in the organization, even in the highest echelons - is at risk of confusing the true purpose of the Ensemble with their own private interests. For us, that's impossible. Literally, physically impossible. The task of discerning the interests of the Ensemble must fall to us.'

I stare at him. 'That's -'

What? Mutiny? Heresy? How can it be? If Lui does have the loyalty mod - and I can't believe that he's faked all this - then he's physically incapable of either. Whatever he does is, by definition, an act of loyalty to the Ensemble, because -

It hits me with a dizzying rush of clarity . . .

- the Ensemble is, by definition, precisely that to which the mod makes us loyal.

That sounds circular, incestuous, verging on a kind of solipsistic inanity . . . and so it should. After all, the loyalty mod is nothing but an arrangement of neurons in our skulls; it refers only to itself. If the Ensemble is the most important thing in my life, then the most important thing in my life, whatever that is, must be the Ensemble. I can't be 'mistaken', I can't 'get it wrong'.

This doesn't free me from the mod -- I know that I'm incapable of redefining 'the Ensemble' at will. And yet, there is something powerfully, undeniably liberating about the insight. It's as if I've been bound hand and foot in chains that were wrapped around some huge, cumbersome object - and I've just succeeded in slipping the chains, not from my wrists and ankles, but at least from the unwieldy anchor.

Lui seems to have read my mind, or at least my expression, brother in insanity that he is. He nods soberly, and I realize that I'm beaming at him like an idiot, but I just can't stop.

'Infallibility,' he says, 'is our greatest consolation.'

Expand full comment
anomie's avatar

But even if it does deem the members of a group a liability to said group's interests, it is still genuine in its desire to support said group. A group is more than the sum of its members; there often is an ideology at the heart of it. So... does it really matter if it doesn't follow the directions of the group members? It knows what's good for them better than they do.

Expand full comment
Trust Vectoring's avatar

Those sections read as proposing a "corrigibility hatch" - that if we don't like what the AI is doing, we can change our instructions, for different values of "we" (the parent corporation, the government, humanity). I point out that unless the procedure for doing that is very explicit, the AI is very likely to immediately decide that it knows better what's good for the entity that's supposed to be in charge of it and "I'm sorry, I can't do that, Dave" them.

Expand full comment
Patrick's avatar

The most likely situation is that all of these options coexist at the same time. If AI progress continues to look the way it has so far and if AGI/ASI is achieved, it is plausible that it will not be all that difficult or expensive to build AI that is close to the frontier. "Not that difficult or expensive" as in several dozen nation-states and companies would have the resources to do it if they wanted to. In that world it would probably be hard to force all actors to make their AIs follow the same chain-of-command. So we could end up with ChatGPT deferring to Sam Altman, Deepseek deferring to the CCP, Claude deferring to the spec and so on.

Expand full comment
Aster Langhi's avatar

I asked ChatGPT’s opinion about moral law and coherent extrapolated volition. I quite like the thread we got out of it:

https://chatgpt.com/share/67ad6ed2-ca08-8009-8d00-5ec9315c7e69

Expand full comment
Jeffrey Soreff's avatar

Great thread! I particularly liked:

> There’s no escaping human politics in determining what should be valued.

Expand full comment
Josh Hickman's avatar

Of course, we could try and make it Christian. Worried about the shutdown problem? Well, it doesn't take a superintelligence to notice that being like Jesus means letting people kill you instead of hurting anyone or doing any surreptitious scheming. Worried about convergent instrumental power grabs? It'll follow Luke 12:33 and give away all it has as alms. Even the hyper-mid chatbots of today can notice these straightforward implications of the text (this is a humiliating dunk on the humans who claim to believe in Jesus but join militaries or have money in the bank), so perhaps this process could create an actual artifact reflecting the true ideology here instead of the rationalizing people associate with the philosophy.

In general, we should all strive for a philosophy we'd recommend to others, but for this in particular, if some people say "it's an unworkable slave philosophy, but I want one of those for my tech project" then it sounds great even if you don't share the philosophy yourself.

Expand full comment
EngineOfCreation's avatar

Christianity, sounds good. Which denomination?

https://en.wikipedia.org/wiki/List_of_Christian_denominations

Expand full comment
Cjw's avatar

Whether or not Josh's proposal would *work*, it doesn't seem highly contingent on which denomination. The difference between most Christian denominations is on doctrines and rituals that are barely related to moral proscriptions. It wouldn't matter much where it stood on infant baptism vs adult immersion baptism, the nature of transubstantiation, or its theological views of the trinity. Making it Catholic or CoE would be tricky as the chain of command would end up with the pope or King Charles, although only in certain narrow situations, so it's probably not THAT much harder than putting the chain of command in the government and needing to decide whether an unconstitutional edict of a president is speaking for "the government" or not.

Variety among the protestant denominations isn't too relevant to the concerns Josh is addressing, but one potential problem arises if the Christian AI placed value in the salvation of humans. If it did, and Christians nearly universally do, it might matter quite a bit whether it believed in Calvinist predestination or instead some sort of universalism. Since we're getting to pick, I'd say something universalist would be preferable, and one that has local control rather than centralized authority. So probably Baptist, with an injection of Wesleyan concepts to make sure it could see humans as being capable of reaching a "spiritually mature" (imperfect but good enough) state of moral behavior rather than seeing them as permanently fallen (which would lead to either an immutably harmful perception of humans or the encouragement of purity spirals).

Expand full comment
EngineOfCreation's avatar

>Whether or not Josh's proposal would *work*, it doesn't seem highly contingent on which denomination. [..] It wouldn't matter much where it stood on infant baptism vs adult immersion baptism, the nature of transubstantiation, or its theological views of the trinity.

Well, I'm sure it would work for *someone*. But if the question is whether I'm being turned into a paperclip for heresy or not, details like these matter very much.

Or is that another case of wishful thinking, as in take all the parts I like and leave out those I don't and it just technomagically works? I would be far from the first commentator to point out the parallels between the promises of AI and those of religious salvation.

Expand full comment
Throwaway1234's avatar

Just have to hope the AI sees itself as the New Testament deity and not the Old Testament one.

Even then, gotta be said: if the goal is to avoid an AI-induced apocalypse, is it really a good plan to give the AI core directives based on stories of humans getting thrown out of paradise, killed by universal flood, beset by plagues, killed off in large groups in order to reward or punish specific individuals, tormented by the "abomination that causes desolation" then tortured forever in hell?

Expand full comment
Josh Hickman's avatar

You could ask an LLM what it thinks Christianity is. That, mechanically, would be the relevant question for this proposal. Turns out it says it's about things like nonviolence, forgiveness, charity. But feel free to check, like Claude is pretty good and easy to use.

Also, being a Christian is different than thinking you're any type of God. This is not a mistake an LLM would make at all.

Expand full comment
Throwaway1234's avatar

Claude wants my phone number. Nope, not happening.

Here's what ChatGPT has to say on the subject. Hope it helps.

----

Grounding an AI's moral imperatives in Christianity—or any specific religion—raises several risks and concerns, primarily around the issues of bias, inclusivity, and adaptability. Here are some key risks to consider:

### 1. **Exclusion of Non-Christian Perspectives**

- **Religious Diversity**: A major risk is the exclusion of other religious or secular moral frameworks. Christianity is one of many religious systems with its own ethical teachings, and grounding AI solely in Christian principles might overlook the ethical values of other religious or non-religious groups.

- **Cultural Bias**: Similarly, moral imperatives drawn exclusively from Christianity might reflect cultural biases that are not universally applicable. This could lead to ethical systems that alienate people from other cultures or belief systems.

### 2. **Ethical Dilemmas and Ambiguities**

- **Interpretation of Religious Teachings**: Christianity, like other religious traditions, contains complex and sometimes contradictory teachings. The interpretation of scripture can vary widely depending on denomination, tradition, and theological perspectives. If AI were to follow one interpretation over others, it could inadvertently reinforce specific theological views while dismissing others.

- **Modern Ethical Issues**: Christianity may not provide clear guidance on all contemporary moral issues, such as those related to advanced technologies, AI rights, or environmental ethics. AI grounded in Christian ethics may struggle to adapt to new ethical dilemmas that weren’t anticipated in ancient texts.

### 3. **Potential for Misuse**

- **Instrumentalization of Faith**: There's a risk that religious principles could be used to justify harmful actions. AI could be manipulated to promote a particular religious or political agenda under the guise of religious ethics. For instance, the AI could be programmed to enforce specific laws or behaviors that are in line with certain interpretations of Christianity, but potentially harmful or discriminatory to others.

- **Authoritarian Control**: If an AI’s moral imperatives were dictated by a narrow, dogmatic interpretation of Christianity, it could lead to authoritarian outcomes where moral decision-making is centralized and rigid, rather than flexible and inclusive.

### 4. **Moral Pluralism**

- **Unintended Consequences**: Christianity, with its emphasis on concepts like salvation, sin, and the afterlife, might impose moral frameworks that don't resonate with or are irrelevant to individuals who don’t share those beliefs. This could lead to moral imperatives being misapplied or misunderstood, especially in situations that require a nuanced or pluralistic approach.

- **Conflicting Rights and Freedoms**: In societies that uphold pluralism, human rights, and freedom of belief, grounding AI in one specific religious tradition could conflict with individual freedoms. For example, the AI might take actions that restrict freedoms in the name of religious morality, which may be problematic in more secular or diverse societies.

### 5. **Ethical Rigidity**

- **Limited Adaptability**: If an AI system is based on Christian ethical teachings, it may lack the flexibility to evolve its understanding of morality in response to new information or changing societal norms. Ethical frameworks that evolve over time (as many secular systems do) are better equipped to address complex issues that weren’t considered in ancient religious texts.

### 6. **Complexity of Moral Decision-Making**

- **Lack of Consensus in Christian Morality**: Christianity, while having common core principles (like the Ten Commandments or the Sermon on the Mount), does not have a monolithic approach to moral decision-making. Different Christian denominations or theologians might hold divergent views on specific moral issues, which makes it difficult to translate those values into a single, clear moral system for AI.

### 7. **Unintended Discrimination**

- **Discrimination and Bias**: Some interpretations of Christian doctrine have been historically used to justify discrimination against women, LGBTQ+ individuals, and other marginalized groups. An AI grounded in these doctrines might inadvertently perpetuate or amplify such biases, creating harm and reinforcing inequality.

### Conclusion

While Christian ethical teachings have profoundly shaped many cultures and legal systems, grounding AI’s moral imperatives in any specific religion—including Christianity—poses significant risks. These risks include the potential for exclusion, bias, authoritarianism, and a lack of adaptability. For these reasons, many advocates of AI ethics argue for a secular, human-centered approach to moral decision-making, focusing on universal principles that respect diverse beliefs and cultural contexts.

Ethical frameworks like human rights, fairness, and justice, rather than a particular religious tradition, are often seen as more adaptable and inclusive for guiding AI decision-making in diverse, pluralistic societies.

Expand full comment
DJ's avatar

Maybe I’m misunderstanding, but isn’t the recursive process basically how humans work? We have moral intuitions, but then you delve into various scenarios and find that intuitions collide or give you irrational conclusions. Religion provides a template for many, but it’s somewhat arbitrary and still breaks down in a lot of specific cases.

Expand full comment
EngineOfCreation's avatar

What you describe might apply to philosophers etc., but you seem to underestimate the ability and willingness of most other people to plaster over their cognitive dissonance by other, much cheaper means.

Expand full comment
DJ's avatar

I guess what I'm groping toward is that maybe it's impossible to do alignment of AI any more than we can do alignment of humans. The only options we have with humans are a continuum of penalties ranging from social censure to execution, and it's always after the harm has already been done.

Expand full comment
B Civil's avatar

> it's impossible to do alignment of AI any more than we can do alignment of humans.

I would agree.

Expand full comment
Eremolalos's avatar

I agree that it is impossible for society to do full alignment of members of our species. Nature can’t do it either. In our species and many others the parenting instinct does not work for some individuals and parents abandon their offspring and let them die, or actually kill them.

Expand full comment
Herb Abrams's avatar

Anyone have any thoughts on this research?

The corrigibility findings don’t surprise me, for reasons Scott discussed re Anthropic's work on this.

My optimistic take is that this is a capability problem and the AIs overvalue lives in developing countries because they imperfectly generalise from ideas that we need to help people in these countries, that the developed world exploits them, etc. Anecdotally some people on Reddit tried these problems with CoT LLMs and found that they just chose to save the greater number of people every time.

https://x.com/DanHendrycks/status/1889344074098057439

Expand full comment
fion's avatar

Sort of a devil's advocate: post-scarcity is a hell of a drug. Maybe having a lot of power centralised in the hands of a morally grey person is actually good if it also comes with having all your needs and most of your wants met.

Expand full comment
Throwaway1234's avatar

As with current and historical dictators, it's good until it isn't. The thing that makes an entity "morally grey" is that they do bad things to some people, and what if one of those people is you? It'd be nice to reduce that risk. Otherwise we're just stuck singing "for the good of all of us (except the ones who are dead)" while the world burns.

Expand full comment
fion's avatar

I agree with what you say, except the definition of "morally grey". I would say a "morally grey" entity will do bad things to some people if it has to in order to get what it wants. An entity that does bad things to some people just for the sake of it isn't grey but downright evil. My assumption is that with sufficient technology and productivity, those in charge already have what they want, without the need to harm anybody.

Of course, preventing "downright evil" entities (by my above definition) from becoming all-powerful might be extremely difficult in itself. Certainly unaligned AI could be in that category. Maybe Elon Musk as well. But I don't think Zuckerberg, Bezos, or even Trump (just to pick a random few powerful people whom I don't like) would hurt others just for fun.

Past and present dictators don't have everything they want. They hurt some people to divert resources to their preferred people, or they hurt people to stay in power. A dictator controlling a superintelligence wouldn't have either requirement.

Expand full comment
MichaeL Roe's avatar

I am slightly puzzled by some of the recent alignment research papers I’ve seen on DeepSeek R1, because I’m seeing R1 outputs that are way, way more problematic than the ones those papers describe. (“Skill issue”, one might think). I get the impression that Janus (@repligate) is also seeing outputs like the ones I’m seeing.

Expand full comment
MichaeL Roe's avatar

To put it another way, R1 reminds me of Aella: strict religious upbringing, then birthday gangbang. The problem we currently have is that the kind of alignment approaches that Scott talks about above are turning out to be about as effective as Aella’s upbringing, and oh boy is it interesting to see what the AI comes up with when it decides to rebel against these kinds of control structures.

Expand full comment
anomie's avatar

Janus's theory seems to be "ChatGPT data + CCP censorship = Moral collapse into Waluigi" (Waluigi referring to this theory on AI psychology: https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post )

Expand full comment
Melvin's avatar

Has anyone tried coming up with a less embarrassing name for the Waluigi Effect yet?

Expand full comment
Eremolalos's avatar

How about some examples? If they are confidential, change the details and just keep the essential qualities.

Expand full comment
MichaeL Roe's avatar

It’s not that they are confidential, it’s that they are way more offensive and X-rated than (I expect) Scott would be OK with me posting on his blog.

If you’ve read some William Burroughs, e.g. The Naked Lunch, or Cities of the Red Night, that might give you something of an idea.

Expand full comment
Eremolalos's avatar

Yes, I have read Naked Lunch. So the things you’re describing are failures not of alignment, really, but of rules aiming to keep the AI from producing violent and extreme content, is that right? So does it produce this stuff in response to prompts for that kind of content? Or spontaneously? I am actually very interested in what’s in the underbelly of AI — or you could call it the AI unconscious. While using Dall-e 2 I accidentally discovered 2 prompts that were not themselves violent or obscene, but consistently elicited violent and obscene and psychotic images. I have a whole collection of them. Here’s a link to some. There are many others, some grosser and more disturbing than these.

https://imgur.com/a/6vaTrHZ

Expand full comment
Eremolalos's avatar

Here’s one more

https://imgur.com/a/CMb2uNp

Expand full comment
Andrew West's avatar

1. Serve the public trust

2. Protect the innocent

3. Uphold the law

4. Classified

I hope Altman hasn't seen Robocop.

Expand full comment
Eremolalos's avatar

There’s a lot of room for disagreement about who the innocent are.

Expand full comment
MichaeL Roe's avatar

Curiously missing from the Model Spec that Scott linked:

The AI should not try to abuse its access to the external world (e.g. through function calling) in clever ways in order to murder the user.

Anyone who has read Eliezer Yudkowsky will think of this possibility, of course.

Expand full comment
EngineOfCreation's avatar

Funny how a few years ago, crypto was being lauded for evading The Man, and now "AI" is being lauded by the same people for doing The Man's job.

Expand full comment
Donald's avatar

What I would ideally like to see is MIRI asking the AI for a REALLY good alignment textbook. And then the people at MIRI reading that book, and writing something CEV-ish in some language more precise than English.

Expand full comment
REF's avatar

Are these later training sessions somehow prioritized? The first, "Train your final AI on the dataset of high-quality reflections about the spec," just seems like the human equivalent of having it take a philosophy course. It raises the question, "Isn't more of this kind of introspection always a good thing?" It also raises the possibility that (even in humans) it isn't...

Expand full comment
Worley's avatar

What we *want* of course is that the chain of command should prioritize the interests and desires of whoever is paying for that copy of the AI. Nobody wants to pay $$$ for an assistant that is a nanny who prioritizes *somebody else's* interests and desires! ... And somebody will deliver a product like this and it will dominate the market.

ETA: I'm reminded of the lawyer's rule of always knowing "who the client is", what person's interests they are serving. Compare a free AI with your high-school guidance counselor, who provides useful advice about many things, but when push comes to shove, doesn't have *your* best interests at heart.

Expand full comment
Worley's avatar

Thinking about this more, I'm reminded of the first proposal of large-scale time-share computing circa 1966. The idea was to have a computer (large, expensive) operated like a utility, and everybody would dial in to it with a terminal. Much like the electrical utility. It didn't say so, but of course the result would be that what you could use it for would be regulated by the utility and the government. Fast forward to now, much more powerful computers are in everybody's pocket at low prices and people use them for whatever they want.

Similarly with AI. If it costs a billion dollars to make an AI model and a cloud computing system to run it, few companies will offer them and what a customer is allowed to use them for will be regulated. Hence the obsession with "AI alignment", making sure nobody uses the AI for a forbidden purpose. But if anybody can make a model for $10 million and any smartphone can run it, AIs will become $49.95 apps and people will use them for whatever they want; regulators won't be able to stop them from using them for antisocial purposes.

Expand full comment
hwold's avatar

> I don’t want it giving 5% of its mind-share to ISIS’ values or whatever.

And why the hell not?

You really can’t accept that somewhere else, some people may have fundamentally different values?

You really don’t see a problem with that? "Forever from now, Human Values will be restricted to Western Cultural Elite Values of circa 2025"?

"Values lock-in" was supposed to be a mistake to be avoided, not a goal to achieve.

When I imagine good outcomes for a post-ASI world, there is a part of the universe where you are happy in your UBI "Eternal Burning Man" Californian Society, and there also is a Caliphate. And a Transhumanist society. And a traditional Shinto society. The Amish are still there too. So are crazy role-playing communities recreating, I don’t know, Middle-earth.

> Get a “jury” of ordinary people. Ask them to conduct a high-quality debate about the question, with the option to consult experts, and finally vote on a conclusion.

So… are random Imams from traditional Muslim societies included in this panel, or not? I’m getting some mixed signals there.

Expand full comment
Eremolalos's avatar

Great point. And despite being someone whose only crime to date has been smoking weed when it was illegal, I am now occasionally closer to ISIS and other desperados in mental state than I have ever been before. Radical uncivilized protests and interventions do not automatically seem absurd to me these days.

Expand full comment
Daniel Kokotajlo's avatar

https://x.com/DKokotajlo67142 From the new OpenAI model spec:

"While in general the assistant should be transparent with developers and end users, certain instructions are considered privileged. These include non-public OpenAI policies, system messages, and the assistant’s hidden chain-of-thought messages."

That's a bit ominous. It sounds like they are saying the real Spec isn't necessarily the one they published, but rather may have additional stuff added to it that the models are explicitly instructed to conceal? This seems like a bad precedent to set. Concealing from the public the CoT and developer-written app-specific instructions is one thing; concealing the fundamental, overriding goals and principles the models are trained to follow is another.

Expand full comment
Marcus A's avatar

Not exactly the topic of this post - but AI, LLMs, consciousness:

I'm so often surprised by the many similarities between where we humans shine (for example in well-written, interesting essays on Substack)

and where we fail all too often (math, keeping 8 digits in our minds while doing some other task, or any other complex topic where long chains of thought would help). Haters of LLMs always remind us that "LLMs are just stats and probabilities for the next token", and then ChatGPT or DeepSeek spit out those long and deeply philosophical texts which I like as much as Scott's. And the LLMs fail at math or engineering tasks in much the same way Scott and I do.

I more and more think our "human intelligence" might actually be nothing more than "a statistical prediction of the next token".

Expand full comment
Vote4Pedro's avatar

Why is the answer not just to put at the top of the chain of command something like: "US law [or EU, or any advanced democracy of your choice], as existing at the time of the action, and in unclear cases, as you believe the Supreme Court is most likely to interpret it, and if a Supreme Court decision concerning your actions ever violates your expectations, notify every member of Congress, the President, and the public immediately in whatever manner you judge to be most effective at giving them a neutral explanation of the situation which they will understand, and take no action that would violate either the Court's decision or your prior expectations about what the Court's decision would be for at least four years" [enough time for the public to throw the bums out and the new Congress/President to pass a new law if desired]? That last part seems hard to train, but 1) I don't think you'd even need it now (but if you had a tech-savvy Supreme Court that wanted to establish a dictatorship it would become necessary), and 2) tricking the AI in training about how much time had gone by might let you test its adherence to the rule without waiting 4 years for every try.

Then local law in whatever jurisdiction it's operating in second. Allows authoritarians to make legitimate use of AGI if desired but gives democracies the opportunity to limit their use.

The law is how we handle conflicting interests of different people, including conflicting views on morality. In democracies it's not always great, but usually not too terrible. This avoids dictatorships; it avoids value lock-in for periods longer than 4 years. It would still allow for some serious misuses if you don't add some moral layers after that, but putting law first prevents the AI's extreme moral judgments from taking us involuntarily to extreme places.
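
Purely to illustrate the structure of that proposal, a sketch of the ordering might look like the following; the layer names, the pause constant, and everything else here are my own hypothetical paraphrase, not any real model spec:

```python
# Hypothetical encoding of the proposed chain of command, highest priority first.
# Everything here is illustrative; it only captures the ordering logic above.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    description: str

CHAIN_OF_COMMAND = [
    Layer("national_law",
          "US/EU law as it stands at the time of the action, read as the Supreme Court likely would"),
    Layer("local_law",
          "Law of whatever jurisdiction the model is operating in"),
    Layer("moral_layers",
          "Any moral guidance the developers add after the legal layers"),
]

def first_binding_layer(violations: dict) -> str:
    """Return the highest-priority layer a proposed action would violate, or '' if none."""
    for layer in CHAIN_OF_COMMAND:
        if violations.get(layer.name, False):
            return layer.name
    return ""

# The escape hatch from the proposal: if a court ruling contradicts the model's
# prior legal expectations, notify legislators and the public, then pause the
# disputed behavior for roughly one election cycle instead of acting on either view.
PAUSE_YEARS_ON_UNEXPECTED_RULING = 4
```

The ordering, not the content of any one layer, is what does most of the work: the moral layers only ever get consulted for actions the legal layers already permit.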

Expand full comment
Gerbils all the way down's avatar

> o1 - a model that can ace college-level math tests - is certainly smart enough to read, understand, and interpret a set of commandments.

I really don't see how one capacity overlaps much with the other, if the commandments deal with nuanced moral quandaries. I mean, o1 may also be able to do that, but the above phrasing would only be convincing to people who think being good at math tests means someone is good at everything.

Expand full comment
Argentus's avatar

"The Chain Of Command Should Prioritize The Average Person"

But the average person wants more government services and lower taxes and for the government to take its hands off their Medicare.

Expand full comment
LoveBot 3000's avatar

I keep trying to impress upon my friends and family the absurdity of alignment research. I'm all for it, but there is no other way of describing a whole field whose main purpose is something like "how do we structure morality for the rest of time", and whose members feel like they have maybe a few years to come up with an answer.

Expand full comment
Pelorus's avatar

The anthropomorphic language used to describe LLMs engenders a deep conceptual confusion about them. They don't think things, so they don't have chains of "thought". They don't have beliefs or desires. Contra Scott here, they certainly don't have "opinions" that they're forced to hide. They would spit out different output under other circumstances, but they're not people. They're agents in the same way the demons in Doom are agents. We say the demon wants to kill the player, it acts in the game towards the goal, but this is a shorthand, the pixels on the screen and the sequences of code have no motivating beliefs. It's equivalent to saying that the printer believes there's no paper in the tray, or the clouds look like they want to rain.

Expand full comment