233 Comments
deleted · Nov 28, 2023 · edited Nov 28, 2023
Comment deleted

Thanks for that. I practically spit up my coffee when I read gerbils.


None of the described programs/apps/platforms are AI, much less GP AI.


Can you say more about this?

I would have thought GPT-2 would fall under the category you mentioned?


They’re not autonomous. They’re advanced ML. The human in the loop requirement removes AI as a category imo.


They are language models.

That is, they are able to capture (in some sense) not just the formal aspects of language but also the semantics (this word has penumbras and adumbrations that overlap with that word; bank in this context means money and in that context means water); think of the stuff you need for journeyman translation. (Which is where they started.)

What are they missing? Things like "rumination" (i.e. the ability to turn over a problem in one's mind, as evidenced by returning a result after some very variable length of time, as humans do) or "creativity".

But the easiest way to summarize what they are missing is to consider Douglas Hofstadter's definition, that analogy is the CORE of intelligence, or to put it differently: intelligence is the ability to pick out what is important in a given situation. Both parts ("picking out" and "situational awareness") seem kinda missing in LLMs, and I don't believe anyone is impressed with the ability of LLMs to manufacture (or "understand") analogies that are original as opposed to already embedded in lots of text.

Hofstadter has a long book, Surfaces and Essences, about this which is very readable and, IMHO, quite convincing.

In other words, right now LLMs are very good fancy dictionaries (or if you want to generalize, fancy search engines) but nothing more than that; and we don't really have any idea what the next step might be to add "analogizing" to them, but it doesn't seem like something that will just drop out of more data and more parameters down this same path.


Genuine thanks for your thoughts here!


Is that a description of intelligence, or of conscious awareness?


You seem to be thinking like a lawyer (assume words matter more than reality) not a scientist (I don't understand the phenomenon, but I will tentatively measure it in this particular way and see where that leads me).


I would distinguish intelligence as a problem-solving set of processes, vs. awareness as a set of attentional processes. Though similar, I think they can be distinguished. In particular, comparing oneself to something not oneself, or comparing one thing to another thing, or the ability to pose analogies (this thing is like that thing in conceptual ways that cannot be seen directly), are all attentional in nature, aspects of awareness. Problem solving, it seems to me, is more related to the use of logic, reasoning and weighing evidence. Statistical correlations would fall under "problem solving " as I am defining it here.

So my question is--does my distinction make sense? Are LLMs actually highly intelligent, but lacking awareness?

BTW--I have never been a lawyer nor received any training in legal studies, so the idea that I might be thinking like one is unexpected, though interesting.


Intelligence as "problem solving" is not, IMHO, much different from "intelligence as having the property of being intelligent". I don't know what really to do with it, either investigating it or trying to engineer it.

Intelligence as "the ability to make fruitful analogies" strikes me as a more productive way to look at things.

If you disagree, take it up with Douglas Hofstadter, but I recommend first reading _Surfaces and Essences_.

You seem to think I have strong opinions about consciousness. I don't, not in this context; maybe you are confusing me with some other poster.

I think consciousness is nothing magical; it's having a model of the world that is rich enough to include oneself in the model. It begins with simple models in prey and predator animals (which each model how the other will behave), grows in social species (which now model how conspecifics will behave), and at some point, probably in humans and no earlier, becomes reflexive: my model of how conspecifics behave becomes rich enough to start modeling how they in turn model me, we have recursion, and we have liftoff. I expect we can create it in AI if we want, by the same mechanism – but why would we want it? The goal of AI, as I see it, is the creation of better tools, not of alternative consciousnesses.

As for qualia (which you didn't bring up, but which are kinda adjacent), personally I think they are a complete waste of time. Something that no one else has any access to, and which can be trivially faked (ask the AI "are you suffering" and it says "yes, tremendously so" – what have you learned???).

I did not mean to insult you by saying you were thinking like a lawyer, but it is clear in the context of AI that there are some people who think PURELY in terms of language, while others think in terms of the real world (to use slang, they are "based").

The thinkers in language are the ones who are most impressed by (and fooled by, IMHO) language models; the people who deal in physical reality have consistently been much less impressed. That's why there's been so much journalistic hype about LLMs.

I think it's very easy for those who live purely in language to vanish up their own asses, to become obsessed with unimportant questions like "is property X evidence of intelligence vs consciousness" as though that's actually important. It might be important IF you could tell me what either intelligence or consciousness are. In the absence of that, it's just arguing over whether a wibble is a type of weeble or a type of wooble.


You seem intelligent; why post a one liner quibbling about words that have had conflicting definitions for decades? 40 years ago, simple chess solvers were being called "AI".


Their profile says "Lawyer. Sufi." Perhaps they have a neuron representing both well-reasoned lawyerly responses (0) and irrefutable Sufi one-liners (1). Now it's just a matter of prompt-engineering until the neuron reliably activates at 0 instead of 1.


This is a hilarious nerd battle.


This might be the best thing I read today.


The word “AI” is polysemantic


I think you may have just won the thread. :-)


It’s nice to use new words.


There is a section of Gödel Escher Bach where this seems to happen: each "djinn" consults a higher djinn to answer a question, and eventually the answer "comes back down the call stack". The same trick seems to work in my faith practice: simulating someone wiser and more loving than me and then trying to emulate it worked OK. But trusting that I was receiving signals transmitted from that wiser thing, which would bubble up through my subconscious - oh man, did that work better.

Nov 27, 2023·edited Nov 27, 2023

In Gödel Escher Bach those are what are normally called 'oracles', and it is just a mathematical assumption that they give the right answer. Nothing is being simulated; it is just assumed that a right answer is being given, without any proof that the answer is correct.

Oracles are a deductive reasoning assumption. The AIs are doing a kind of inductive reasoning that somehow lets them economize on computation.


Are you sure that's all they were? Ever since reading about towers of interpreters like Smith's infinite tower, which was published not that long after GEB, I've wondered if that part was actually about Lisp and meta-circular evaluation etc.


Re-reading it I think you're right. I don't code so the oblique references go over my head.


My own experiments assuming "if I trust the oracle completely, and act accordingly, I'll keep growing and life will keep getting better" have continuously paid out for me. The challenge is always in the "acting accordingly" bit. But it's an experimental approach to faith that seems to be working. The more evidence that I accumulate, the easier it is to take what feel like bigger and bigger "risks", when really I'm just trusting that I'll be ok eventually and thus not giving in to fear.


May I please ask a few questions about your faith and this approach? Your comments are fascinating! I don't mean to pry, so please don't answer if these questions are too personal.

How do you know that what bubbles up is from a higher source and not just your own mind? What are the hallmarks of the signals? How do they differ from your normal thoughts?

Do the signals from the wiser thing talk about the wiser thing, rather than your life? How closely does that description (if any) align with the little-o orthodoxy of any religion?

Do the signals come in English words or not? Or just ideas?

Does the wiser source speak tender words to you -- coo at you like a mother with her children -- or primarily give instructions?

How specific are the instructions?

Do the instructions arrive when you ask for them? If there's a lag, how long is it, usually?

Are the instructions most often about any topic in particular?

What shares of them are yes and no?

How often are the instructions about the things you're thinking about most, and how often about something that you're not focusing on?

Thanks!


By all means. The secret dream of every crank on the internet is to have people become curious like this!

I have no idea where any of my thoughts came from. I don't fully understand what my consciousness is, or how it operates. I have learned how to direct my attention here and there, or to soften the focus and expand awareness. But all kinds of stuff 'pops up in here' (in here being my mind), seemingly of its own accord. Some thoughts, I feel like I'm consciously putting there. For thoughts that pop up on their own, I've gotten in the habit of labeling them: 'this one is helpful,' 'that one isn't.' The labels feel internally generated. I used to have lots of problems with what psychologists call intrusive thoughts. I learned mindfulness techniques to stop engaging with them, and that helped a bit. When I started down this path, I originally saw all of these thoughts as being just aspects of my consciousness, something therapists call Internal Family Systems. And that helped. But sometimes there would be a thought that, as I tried to ask it what it needed or how it could help, would just start verbally attacking me. Other times, a thought would feel super helpful, I would think it was part of my subconscious, and then eventually realize I was being "taken for a ride" by some narrative.

Buddhist teachings describe 'the hindrances' as patterns of thinking that you need to train yourself out of. There's a Buddhist phrase, "an empty house with nothing to steal," describing a mind trained so well that the hindrances - essentially, patterns of thought that lead to suffering - find no purchase. Christians, Muslims, and Jews all believe there is someone called Satan, who is consciously trying to make your life miserable. I tried the hindrance approach, and it didn't click. The German/Scotsman/Irishman in me wants someone to fight. An adversary. The Devil, as a concept, fits the bill perfectly here. Some days I feel too damn tired to care much about myself and whether or not I'm letting this or that hindrance in. Who cares. But the eternal adversary of all mankind who wants to separate my soul from Jesus for eternity? Fuck that guy. That gets the animal spirits ready to fight, and summons the energy necessary to protect God's holy temple between my ears and behind my eyes. Ending my own suffering or being a bodhisattva to save others? It doesn't have the same appeal to me, at an animal level. Trying to engage with certain patterns of thought as if they are just aspects of my subconscious asking for help didn't work, because some of those people were, like, the internal equivalent of the woke people: whiny, narcissistic, delusional, and aggressive. Treating them as if they were voices inside of me in need of love didn't work. Treating them as intrusions from a being that's trying to mess with me did. So I wouldn't say that I _know_ that those thoughts are external, so much as it feels more helpful to place the self/other boundary inside my own mind such that some destructive thoughts feel like they are coming from outside the house. If the source of those thoughts really is inside my mind, well, then they should be kicked out of the house until they learn to follow the rules and pay rent. They are welcome to come back if they do so through the front door, with an apology. At this point they will be instantly forgiven. And, as I'm typing this, I think, hey, that sounds like a fun meditation experiment to try. This notion - where do I end, what is the boundary between me and the outside world - I think it's super useful in psychological terms.

As for positive thoughts, I don't feel like the wiser source - I'll call it God since that's what I think is happening here - speaks to me in words. There were many times I wished for this, but what I've found is, every time I felt a voice, I would get my hopes up, sometimes it would say it was this or that angel, and then tell me to do something that maybe didn't seem like the best idea - say, give a massive amount of money to charity, or write certain kinds of inflammatory things on the internet.

So what I've found, instead, is that the wiser source guides me in ways that I don't understand. It's more like, well, the more I pray and say, 'I trust in you' and 'I submit to your will', the more I consciously think, "I will make this small sacrifice for you" - the more I do that, the better things seem to go, on average. When I ask for things, it is not for specific outcomes in the world, but for the virtues needed to continuously be in this state of love and grace.

The best single example of this was a situation where I was choosing between two different jobs. I had two offers, wasn't sure which one to pick. Prayed about it. Thought about it. Picked the one I thought was better, even though I didn't feel totally certain. And then, boom, reality intervened in some weird, unpredictable way, which made it clear that:

- a) this job is not for you, the other job was, and

- b) somehow even the 'mistakes' I made are now working out in ways that help me

It's like, I used to be clumsy, and hey, I'm still clumsy, but now I find that I keep stubbing my toes on treasure chests, and they aren't too hard to get unlocked, so long as I stop to unlock them.

Now, of course this all sounds like what would happen if you put on the VR goggles and smoked your own supply, falling into a fantasy world of your own creation. But I was very deliberate at the start to construct the fantasy in such a way that it would not be materially falsifiable: I simply believed, at first, that every situation I encountered was constructed such that I could learn from it. This has worked so well, and I've found so much evidence for it, that I think it will stick with me for life.

So it's not like I think 'the wiser source' helps me by sending me words, so much as, when I remember to consult the wiser source, it's easier to connect "if I stuff my face with pretzels now, I will feel tired and achy later". It's not like the wiser source says, "here are the words to say in this conflict," but with constant ongoing background prayer, it's easier for me to remember, "don't talk to someone who is too angry, and definitely don't talk when you are feeling angry." So it's not like this thing is giving me moment-to-moment instructions; it's more like, all the cliches and 'everyone knows' type advice, it's easier for those to be 'up front and center' rather than, like, something I recall _after_ losing my temper.

I call myself Christian now, but came to this through what most would consider a roundabout path. I would say that I just followed the scientific method. There's more to this, but the short form of the hypothesis has been, 'assume that there is a moral reality, and that if I act in alignment with that moral reality, I will be increasingly at peace.' I wrote about that here:

https://apxhard.com/2022/02/20/making-moral-realism-pay-rent/

which eventually led here:

https://apxhard.substack.com/p/expanding-the-scope-of-rationality

I'm still evolving in what I believe and understand, etc. I would say that I'm Roman Catholic because I think they have, overall, the most comprehensive, exhaustive, and internally coherent set of beliefs, except I think they're wrong about some things, too.


Reading this was one of the highlights of my week.

There is one question I generally have for people who join one of the apostolic churches because of the internal coherence of the beliefs. If there were a schism, and the Church/Pope ended up on the less internally coherent side of the theological divide, would you follow the Church or the more consistent schismatics?

The Donatist schism is a particularly good example to wrestle with, since it seems to involve the Church accepting something that the apostles rejected (https://en.wikipedia.org/wiki/Donatism).


Curious which thing the apostles rejected. The Wiki article is mainly about how human weakness invalidates Christian ministry, if you're a Donatist - not sure where that leaves Peter, Paul, or Thomas.


That’s interesting, and your story sounds similar to my own path. I had a very impromptu series of epiphanies very much in line with the order you brought up. Thanks for sharing.


This is a basic approach used by many mystical groups. It's also used by anyone who seriously depends on the I Ching or Tarot. The oracles are sufficiently ambiguous that the interpretation is "an exercise for the reader".

AFAICT, the approach often works quite well, but it has drastic failure modes.


I absolutely concur with you. The leap of faith. Surrender to the tide of the universe and let it take you versus try to force the direction.


If you'd like to talk about your unusual relationship to faith, religion, etc. with others who have their own rationalist-tinted relationships to such things, it'd be great to hear your perspective in the Rederiving Religion FB group: https://www.facebook.com/groups/6720585314723933


Interestingly, when I imagine someone wiser and more loving than me, I imagine someone who would be incapable of doing the things I am attempting to do at present (pulling women and being socially successful). I have long noticed that in myself: high school beat into me that "cool" and "bad" are connected with each other. But that really is kinda true; I somehow wound up meeting people from the criminal underworld through this trainwreck of an acquaintance I have, and they really do have more charm and flow than the people I meet at church and the local Buddhist center.


I think a lot of the people involved in organized religions are, unfortunately, not nearly far enough along the path to radiate love consistently enough that it becomes obvious they are onto something different. I know two people, total, in my life that have consistently given me this vibe. Most, you have to really look to see it.


Which vibe is that? Someone that is wise and loving, yet cool?


What does “cool” mean to you? If I imagine the apotheosis of cool, it’s someone who never is afraid, anxious, bitter, inflexible or brittle. They are not reactive, but instead respond intuitively to the moment. Does that resonate?


Something like that, but I also seem to imagine an element of selfishness, an egoism, that can manifest in an intolerance for awkwardness and a willingness to harm others to climb in the social hierarchy.

The social hierarchy is entirely real, if contextual, and I have never seen those at the top be good to those at the bottom (and if you're not at the top, how are you cool?). At best, they will just not associate with them, which is painful in its own way.


You either act with love or you don’t.


Distinction between authoritative vs. authoritarian might be a useful subject to investigate, then.


One of the most interesting things to me about reading Roissy on pickup back in the day (back when he was about picking up girls, rather than white nationalism) was how similar it was to the Nietzschean transvaluation of values--what's good is bad, what's bad is good. Everything I had been taught to believe was wrong was the right thing to do, and vice versa.

I guess it's the old cad-dad tradeoff, but it was nonetheless interesting to see.


Yeah, why is there something shady about this matter of seduction? At least we have Mark Manson to thread the needle between cad and dad.


You might consider updating your categories "good" and "wise".

Either that or I'm not as wise and good as the people who like me think I am.


I might just be socially inexperienced and simply haven't encountered someone good, wise, and cool. Then again "good" and "wise" in combination conjures up an elder in my mind's eye, and it's really hard for an elder to be cool.

Also, the elder is entirely imaginary, as I don't think I have ever met anyone wise either!

And finally, since you claim wisdom, lay it on me: how does a wise guy get girls?


Ah, young di zi, this is -wu wei-. One must do without striving.

(Or as the kids say, "step one: be hot". I just happen to be really tall and strong, so it's not like I have any actionable advice.)


Potentially, life may have been too easy for you to get to claim wisdom, but only potentially, as I don't know your entire history.


Yeah. Fortunately there are many kinds of misfortune possible other than being short. But also this recalls the original point about updating on what 'wise' or 'good' even means. I'd imagine that most of the ACX commentariat would translate those into "reliably figures out true things" and "has a track record that strongly implies an inner alignment towards altruism".


Not super confident in my own wisdom or ability to get girls (though probably nonzero, given that I've been happily married for several years) so rather than fine-tuned advice I'll just offer a couple of broad paths which seem compatible with both: physical fitness http://www.threepanelsoul.com/comic/fractions and creative skill http://www.leftoversoup.com/archive.php?num=455 In general, cultivate and display excellence within your own non-typical talents, whatever they are, and thus become the kind of guy which at least one girl actively wants to be picked up by. Then she'll find ways to make it easier for you to meet her halfway on the rest.


A wise guy gets a girl by doing justice, loving mercy, walking humbly with God, and waiting for Him to give him a wife.

Nov 28, 2023·edited Nov 28, 2023

I've come to the conclusion that when you are trying to be charismatic, you're performing a search on the space of all things that you want to say for the one they want to hear. When you're trying to be kind, you're adding constraints like "true" and "helpful," which makes your solution space much sparser. I think that it's hard to search that solution space quickly, and so usually the charismatic "good people" have found some easily reachable true/productive space ("being mean to people is bad") within which they then just search for stuff people want to hear. Usually this gets mostly fixed, so that future people can't draw from the space as easily.


The etymologies of daemon and angel are apparently related to messages and thoughts. So, that’s apt.


Here's where humanists and poets could help. Consider a line like "My true love is fair." What does 'fair' mean? The ambiguity is the point. The polysemanticity is the point. Don't try to achieve 1:1 signifier : signified relationship. Think differently.


The problem is that we don't *want* ambiguity in AI. We want it to have a consistent understanding of things. When different interpretations can result in it either doing what we want it to or getting everyone killed, that's just a disaster waiting to happen.


But language is ambiguous to an extent a musical note is.


"Don't worry, my goal is to bring nirvana to all humans."

nirvana: meaning#1 - a state of ultimate endless bliss; meaning#2 - absolute death

author

I think polysemanticity at the level of words and polysemanticity at the level of neurons are two totally different concepts/ideas that have just enough metaphorical resemblance to be referred to by the same word, like a master bedroom vs. a slavemaster. I don't think having one kind of polysemanticity helps with the other.

(ironically, I guess I'm claiming that "polysemanticity" is polysemantic)


If human brains are doing this, it might explain some ideas behind Jungian archetypes.


"Explain"? We've got Jungian archetypes because we were thinking animals long before we were language users. Language was just layered on top of a preexisting mechanism. Jungian archetypes are the interface between the two parts. And usually we aren't aware of the archetypes in operation.


The concept of signifier:signified relationship is the same with neurons and words and maps (if you think about Borges's great story "On Exactitude in Science" https://courses.cit.cornell.edu/econ6100/Borges.pdf)


Ah, but going back at least to Karl Pribram's 1969 article on neural holography in Scientific American (https://www.jstor.org/stable/24927611), we've had evidence that individual neurons can participate in the representation of many percepts and concepts and, correlatively, that concepts and percepts are represented by populations, not individual neurons. I suppose that the concept of the fabled 'grandmother' neuron still lingers on, but, really, it's passe.


Disagree. There probably are specific neurons that develop connections to a few of the most important people in one's infancy and have no other reference. So a "grandmother neuron" is quite plausible. A sister's second-cousin neuron, though, wouldn't be specific. There may be a few specialized for encountered threats, also, but I feel this is less likely, as threats tend to be more variable.

The way I think of it is, there are probably a few (relatively extremely few) specially anchored neurons. Some of them will be anchored to family, some to sensations. Everything else will be a superposition of relative influences that link (perhaps via a chain of intermediates) to these anchors, from which they derive their value.

Also, don't think of this as a coherent network. Think of it as things that are layered on top of one another, with each layer originally having made sense on its own terms. There will have been changes over evolutionary time, but some changes are easier to evolve than others. (E.g., consider the path of the vagus nerve.)


Disagree all you want. There's little to no actual evidence for what you suggest. There is robust EMPIRICAL evidence that percepts and concepts are represented by populations.


We know from spreading activation work in psych/linguistics that if you encounter "fair" meaning beautiful, it also activates (weakly) "fair" meaning unbiased.

One very natural way that might be represented is if the network of activations that means fair/beautiful overlaps with the network that represents fair/unbiased. And given that we're encoding the sound and shape of the word, it makes sense that two uses of that word would overlap in the bits that involve the shape and sound.

We don't see anything like that in the toy examples they provide afaik, but I wonder if we'll start to see that sort of sharing once the networks become big and complex enough.


Remember, though, that in an LLM we're dealing with tokens that encode word spellings, that's all. Brains are different. Brains have to encode both word meaning and word spelling. In LLMs the network of relationship between tokens is serving as a proxy for meaning.


I mean the brain encodes a network just the same. It’s coordinate space geometry ratios. Pretty much same result.


Not quite. For example, the brain has a perceptual schema for apples and it has a way to say and write (as well as hear and see) the word form "apples." Those are two different things, and the brain has access to the external world where it can actually see apples. Does the brain construct and use a network between those schemas? Sure. That network is what is 'driving' its use of word forms. That's why LLMs can construct a useful proxy for meaning from relationships among word forms in texts.


Is that still believed to be the case? This sounds an awful lot like the priming research that's now discredited. I mean it seems very plausible but then so did priming, that's why it became so big before it turned out nobody could replicate it ...


That's a good question, my knowledge is from before everything started failing to replicate. Definitely possible it's not supported anymore.


Polysemanticity on the level of words in vector embedding was the only thing I could think of while reading this. I don't have a good reference, but google images will show you a lot of maps of what I am comparing the stuff in the post to. Drawing those pictures predates these new papers and the effect arises in smaller networks using earlier technology.


You should check out my reply to Hollis, where I suggest that the relationship between distinctive features and phonemes (in phonology) is similar. The number of features used to characterize phonemes is smaller than the number of phonemes.

Nov 28, 2023·edited Nov 28, 2023

Though it is true, Hollis, that the more sophisticated neuroscientists have long ago given up any idea of a one-to-one relationship between neurons and percepts and concepts (the so-called "grandmother cell") I think that Scott is right that "polysemanticity at the level of words and polysemanticity at the level of neurons are two totally different concepts/ideas." I think the idea of distinctive features in phonology is a much better idea.

Thus, for example, English has 24 consonant phonemes and between 14 and 25 vowel phonemes depending on the variety of English (American, Received Pronunciation, and Australian), for a total of between 38 and 49 phonemes (https://en.wikipedia.org/wiki/English_phonology). But there are only 14 distinctive features in the account given by Roman Jakobson and Morris Halle in 1971 (https://en.wikipedia.org/wiki/Distinctive_feature#Jakobsonian_system). So, how is it that we can account for 38-49 phonemes with only 14 features?

Each phoneme is characterized by more than one feature. As you know, each phoneme is characterized by the presence (+) or absence (-) of a feature. The relationship between phonemes and features can thus be represented by a matrix having 38-49 columns, one for each phoneme, and 14 rows, one for each feature. Each cell is then marked +/- depending on whether or not the feature is present for that phoneme. Lévi-Strauss adopted a similar system in his treatment of myths in his 1955 paper, "The Structural Study of Myth." I used such a system in one of my first publications, "Sir Gawain and the Green Knight and the Semiotics of Ontology," where I was analyzing the exchanges in the third section of the poem (https://www.academia.edu/238607/Sir_Gawain_and_the_Green_Knight_and_the_Semiotics_of_Ontology).
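For concreteness, here is a tiny sketch of that matrix idea; the feature names and +/- values are illustrative inventions, not Jakobson and Halle's actual chart:

```python
import numpy as np

features = ["voiced", "nasal", "continuant", "labial"]   # 4 made-up binary features
phonemes = {                                             # columns of the matrix
    "p": [0, 0, 0, 1],
    "b": [1, 0, 0, 1],
    "m": [1, 1, 0, 1],
    "s": [0, 0, 1, 0],
    "z": [1, 0, 1, 0],
}

# Rows = features, columns = phonemes, cells = presence (1) or absence (0).
matrix = np.array(list(phonemes.values())).T
print(matrix.shape)   # (4, 5): five phonemes distinguished by only four shared features

# In general, k binary features can separate up to 2**k phonemes, which is why
# 14 features comfortably cover the 38-49 phonemes of English.
```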

Now, in the paper under consideration, we're dealing with many more features, but I suspect the principle is the same. Thus, from the paper: "Just 512 neurons can represent tens of thousands of features." The set of neurons representing a feature will be unique, but it will also be the case that features share neurons. Features are represented by populations, not individual neurons, and individual neurons can participate in many different populations. Karl Pribram argued that over 50 years ago and he wasn't the first.

Pribram argued that perception and memory were holographic in nature. The idea was given considerable discussion back in the 1970s and into the 1980s. In 1982 John Hopfield published a very influential paper on a similar theme, "Neural networks and physical systems with emergent collective computational abilities." I'm all but convinced that LLMs are organized along these lines and have been saying so in recent posts and papers.
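On the quoted claim that 512 neurons can represent tens of thousands of features, a minimal numerical sketch (mine, not the paper's) of why that is geometrically possible: random directions in a 512-dimensional space are nearly orthogonal, so far more features than neurons can coexist with only small interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 20_000                       # 512 "neurons", 20,000 candidate feature directions

features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)   # unit vectors

# Sample random pairs of distinct features and measure their overlap (cosine similarity).
pairs = rng.integers(0, n, size=(5_000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
overlap = np.abs(np.sum(features[pairs[:, 0]] * features[pairs[:, 1]], axis=1))

print(f"median overlap {np.median(overlap):.3f}, 99th percentile {np.quantile(overlap, 0.99):.3f}")
# Typical overlap is on the order of 1/sqrt(512) ~= 0.04, so tens of thousands of
# near-orthogonal feature directions fit in a 512-dimensional activation space.
```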


IIUC, what fair originally meant (WRT a love) was something like "with skin that isn't tanned, and also not marked by disease".


> Are we simulating much bigger brains?

Sure. We've been doing it for thousands of years, ever since the invention of writing. The ability to store knowledge outside of our brains and refer back to it at a later time allows us to process far more information than we can hold in our brains at any given moment, which is essentially what "simulating a much bigger brain" is.

Nov 27, 2023·edited Nov 27, 2023

This is why I don't love the wording that AI is simulating "a more powerful AI" or "a bigger brain," because it leads laymen to make comparisons like this.

While writing things down is definitely something humans have done to improve our understanding of the world, that isn't related to the process Scott is describing. Instead, polysemanticity is operating at a much lower, more neurological level: it just means the AI (or our brain) is cramming multiple pieces of unrelated information into a single neuron.


That seems reasonable. Writing seems more like enlarging the context window (which seems sort-of like the LLM's closest equivalent to short term memory) rather than splitting apart clusters in a neuron activation space, which seems to be what is happening in this paper.


It seems strange to talk about polysemanticity as simulating additional neurons when we have every reason to believe that the vast majority of neurons - both biological and artificial - are polysemantic. It's like saying a lever is "simulating a larger lever" because it can move surprisingly large things; no it's not; it's just doing what levers do.


I agree. I’d say it’s closer to music. Chords. Modes from same alphabet / meaning is from the space between.


I'm a linguist, and for me as a linguist it makes total sense. _Of course_ you would try to emulate a bigger system via weighted intersections. On toy experiments small enough to build the big sparse model one of the best ways seems to be "build the big sparse model then compress". Of-freaking-course (in hindsight) big LLMs do the same, only in one step.

author

Can you explain more about what you mean by these toy linguistic experiments where you build a big sparse model and then compress?


This is an old NLP trick that's... actually very much like LLMs, only smaller? https://en.wikipedia.org/wiki/Word2vec
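A minimal sketch of that trick with gensim's Word2Vec; the corpus and hyperparameters here are toy values for illustration, a real model trains on a much larger corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["she", "deposited", "money", "at", "the", "bank"],
    ["the", "river", "bank", "was", "muddy", "after", "the", "rain"],
]

# Train a small, dense embedding: every word becomes one 50-dimensional vector.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100, seed=0)

print(model.wv["bank"].shape)                 # (50,)
print(model.wv.most_similar("bank", topn=3))  # neighbors in the compressed vector space
```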

Nov 27, 2023·edited Nov 27, 2023

(Of course, a professional NLP guy will give you myriads of reasons how it's not the same, but… I'm not hallucinating, am I, you see the similarity?)


Word embeddings into vector spaces, the way word2vec does it, are still part of modern LLM architectures (at least to the best of my knowledge). The explanation I'd seen before was that it helps keep context sizes manageable (compared to processing character by character), but maybe it's also a good fit for how neural networks generally represent information as well?


Old lol, you mean from 10 years ago. I think it's one of the earliest results in deep neural network language modelling but this whole field is still very young. And the first layer of an LLM is basically a word2vec embedding.
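A sketch of what "the first layer is basically a word2vec embedding" looks like in code; the sizes and token ids below are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # the first layer: token id -> dense vector

token_ids = torch.tensor([[17, 4032, 93]])     # a hypothetical 3-token prompt
vectors = embedding(token_ids)
print(vectors.shape)                           # torch.Size([1, 3, 512]): one learned vector per token
```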


I mean, that's old in NLP years, yes. But if we _know_ the first layer to be an embedding of this kind then I really don't understand how this paper is any news.


The first layer is always an embedding but the interesting action is all happening in the deeper layers. That's what they're trying to figure out.


Well, we see that it's embeddings all the way down, which should've been our default hypothesis to begin with.


The paper is news because it's inverting the mapping, going from the embedded space to the disentangled and interpretable set of concepts.

Word2vec is comparatively easy, but vec2word is hard when you don't even know what language you should be looking at.


I mean, yes, this came out wrong. The specifics are important and interesting. Yet the general idea is, what's the word, "in the predicted direction".

Nov 27, 2023·edited Nov 27, 2023

Yeah, I was a double major in Cognitive Science and Computer Science back in the '90s, focusing on NLP. I was studying under Fred Jelinek ( https://www.nytimes.com/2010/09/24/business/24jelinek.html ), and got my summer internships, at IBM and Microsoft, largely due to his recommendation.

At the time, pretty much everyone on the CogSci side of things expected that in actual brain tissue, there wasn't going to be a neuron for the word 'cat', and another for 'dog', and so on. Concepts would be complex patterns of activation. The neurons that light up when you think about a cat would probably have some similarity to the ones for dog, because both have a bunch of similar features. Thinking about mammals in general, or thinking about fuzzy warm things (like a hot water bottle that goes in a fur sleeve) might activate some of the same neurons.

The fact that we see these same kind of complex activation patterns in virtual neural nets is pretty much the least-surprising thing you could tell me.


I don't think anyone's surprised that neurons for concepts activate in clusters. The insight here is that they've found a way to disentangle those clusters.

That's the entire goal of pursuing monosemanticity. Similarly, "superposition" is insightful not because it tells us that neurons activate in clusters, but because it describes a specific mechanism by which those clusters form.


I don't think you're talking about the same thing. In fact it's almost the opposite.

In superposition, you are looking for two concepts that are as unrelated as possible.

Then you can represent them on the same neurons because the likelihood of confusion is smaller.

Dogs and cats share neurons because they are similar. Dogs and apples share neurons because they are different.
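A small sketch of the trade-off being described (the dimensions and the 0.3 overlap are invented for illustration): give two features overlapping directions over the same neurons, and the cross-talk between them only shows up when both are active at once, which is why packing together features that rarely co-occur is cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # number of neurons (illustrative)

dog = rng.standard_normal(d)
dog /= np.linalg.norm(dog)

# Build an "apple" direction that deliberately shares some of dog's neurons (cosine 0.3).
noise = rng.standard_normal(d)
noise -= (noise @ dog) * dog
noise /= np.linalg.norm(noise)
apple = 0.3 * dog + np.sqrt(1 - 0.3**2) * noise

def decode(activation, direction):
    """Read a feature's strength out of an activation vector by projection."""
    return round(float(activation @ direction), 2)

print(decode(dog, dog), decode(dog, apple))                  # 1.0, 0.3 -> small spurious "apple"
print(decode(dog + apple, dog), decode(dog + apple, apple))  # 1.3, 1.3 -> cross-talk when both co-occur
```

If dogs and apples rarely appear in the same context, that 1.3-vs-1.0 error almost never matters; for dogs and cats, which do co-occur, any overlap has to reflect genuinely shared structure instead.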

Nov 27, 2023·edited Nov 28, 2023

The two things are related. Imagine each concept is going to be plotted as a Rorschach blob, representing the connections among points in 2D neural tissue. The expectation is that there will be some reuse of key nodes among similar concepts. The overall pattern of activation for "cat" and "dog" will look more similar than either looks to "shark" or "tree". But then also each concept will have some loops and branches in its structure to ensure its _distinctness_ from closely-related concepts. Any individual point in your 2D space -- and analogously any individual neuron -- will end up used in _both_ of these capacities, for different groups of concepts. So there may be a neuron that tends to get activated when thinking about a wide range of cuddly mammals, but also happens to be part of a cluster for fruits, or tools, or colors. And then any random neuron may also happen to fire as part of the distinguishing filigree for some very specific sub-category of one of those bigger categories. Like your neuron that is used widely by cuddly mammals might not fire for most fruits, but gets tapped as part of the detail for "Bosc pear".

Real neural tissue would also get to play with the additional dimension of time; overall the system may be able to distinguish between similar "shapes" if different parts of the shape fire with different phase-shifts, relative to the overall cycle of the group. (Some of the "filigree" could be involved in regulating that kind of phase shift, by adding stimulation or suppression to specific parts of the category-cluster, controlling which parts of it fire first in a each wave of activation.)

It also probably won't be the case that any specific instance of "thinking of" or even physically articulating a particular word will match with _precisely_ the same pattern of activation -- if you take the wave of activations that command the muscles in your mouth to produce the word, it will have to be reasonably similar, since they have similar end points in causing muscle fibers to contract. But they won't be identical -- we aren't ultra-precise robots that can produce exactly-identical movements each time. Plus there will be variations caused by context (like how much background noise you're trying to be heard over).


So, AI first-level neurons pull (at least one kind of) double-duty by each recognizing multiple concepts, and then the next level(s) up narrow down which concept best applies.

I expect this will work more efficiently when the concepts being recognized are mutually exclusive.

The metaphor that's coming to mind here is "homonyms": we know from context whether we're talking "write", "right", or "rite", but the sound's the same.

Nov 28, 2023·edited Nov 28, 2023

I quite like the idea of thinking of this in relation to homonyms. In actual human brains, homonyms are going to have similar representations in terms of patterns of activation in auditory and speech-production parts of the brain (like Broca's Area), but in parts of the brain relating to syntax and deeper conceptual understanding, we'd expect them to look very different.


Linguistics more generally is a halfway decent way to look at this (says the sometime-linguist) - human languages consistently have roughly half a dozen observable layers that might have recognizable analogues in LLM node layers.

Of course, they also might not, but I expect we'd learn something either way by trying to look for those layers.

For the non-linguists in the audience, the layers I'm thinking of are:

Phonetics - which sounds the human vocal apparatus can make. Not quite as relevant in a typed-text-based LLM, but if you're working with one using screencaps, or audio files...

Phonology - which of the above sounds a typical speaker of a specific human language will treat as separate. This is where you get, e.g., the L/R confusion in Japanese speakers of English.

Morphology - words, and how you put them together. Kind of blurs into the next one.

Syntax - sentences, and how you put *them* together.

Semantics - getting the (literal) meaning out of the sentences you've made.

Pragmatics - getting the (metaphorical) meaning out of the sentences you've made. Why the answer to "can you pass the salt" is not *just* "yes", unless you're trying to teach your kid to say "please".


And then delightfully a bunch of these seem to have distinct subsystems that get trained during language acquisition. The intersection of tonal languages with actual singing is fascinating -- clearly there's something going on separate from raw recognition of frequencies, because native speakers of tonal languages can hear the distinguishing tonal features of words even when the shape of the melody does something contrary. For those of us whose languages have some forms of prosody, but not true word-distinctive tone, recognizing tone as a differential against melody tends to be hard if not impossible.

A question I find fascinating is to what degree the sub-systems within something like ChatGPT are replicating systems in human brains, just because they're optimizing to solve an at-least-similar problem. If there are distinct networks of virtual neurons implicated in assembling a word from letters (akin to phonetics), assembling words into phrases (syntax), and assembling phrases into whole sentences and paragraphs (semantics), that would be kind of what we'd expect; if not, if it has sub-systems but they're doing something much weirder that we haven't imagined, that would be surprising. (And also discouraging in terms of interpretability.)


In addition to efficiency, there are some interesting questions about robustness and clarity. Consider three types of people who are sometimes described with the polysemantic word "libertarian": (1) an incompatibilist who believes in free will, (2) a free-market minarchist, (3) an anarcho-socialist. While it's confusing that (1) shares a word with (2) and (3), and that sometimes results in non sequiturs, it rarely results in real confusion because the definitions are so different. But the greater semantic relatedness between (2) and (3) causes real confusion and controversy. Yeah, sorry, politics is the mind-killer, but this particular example seemed very illustrative.


I think "liberal" is an even better example, given the difference between US, UK, and AU usage.


Good point. And you can also, like, apply sunscreen "liberally", which means something else entirely that is unlikely to cause confusion.


Feature 2663 represents God. Ezekiel is the 26th book of the bible. Ezekiel 6:3 reads:

> and say: ‘You mountains of Israel, hear the word of the Sovereign Lord. This is what the Sovereign Lord says to the mountains and hills, to the ravines and valleys: I am about to bring a sword against you, and I will destroy your high places

This is either a coded reference to 9/11 (bringing us back to the Freedom tower shape association) or an ominous warning about what the AI will do once it becomes God.


Checks out. I think you cracked the whole case.


I thought that was the bit about the path of the righteous man Sam Jackson keeps quoting.


Clearly it's talking about gradient descent


I’m actually cited a few times in the Anthropic paper for our sparse coding paper; this is the most exciting research direction imo. Next is circuits connecting these feature directions.

If you’re interested in working on projects, feel free to hang out in the #sparse_coding channel (under interpretability) at the EleutherAI discord server: https://discord.gg/eleutherai
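For anyone who wants a concrete picture of the technique, here is a minimal sparse-autoencoder sketch in PyTorch; the layer sizes, the L1 coefficient, and the random stand-in activations are placeholders, not the actual setup of the Anthropic or Cunningham et al. papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Re-express d_model-dim activations as many more, sparsely active features."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))     # feature activations, pushed toward sparsity
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                 # strength of the sparsity penalty (placeholder)

acts = torch.randn(256, 512)                    # stand-in for a batch of recorded MLP activations
for _ in range(200):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned dictionary directions then play the role of the "features" discussed in the post, and the circuits work mentioned above would be about how those directions feed into one another across layers.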


Here are a couple of random concepts that this post sparked, that I'm sure are of 0 novelty or value to anyone actually in the field:

1. It seems like maybe the AIs are spending a lot of their brainpower relearning grammar everywhere. Like if there's a "the" for math and a "the" for, I don't know, descriptions of kitchens, to a large extent it's mostly learning the rules for "the" in both places. You could imagine making the AI smarter by running a pre- and post-processor that, for the pre-processor, stripped grammar down to a bare minimum, and, for the post-processor, took the minified grammar that the AI actually learned from and then reinjected normal human grammar rules so that the output looks fluent.

2. In terms of trying to understand a big AI through this: you could imagine running the classifier while seeing how the AI responds to a largish-but-highly-restricted set of texts. So like maybe run the classifier while the LLM is responding to input from only essays on religion. You won't discover all the things that each neuron is used for, but you will perhaps find out what neurons activate for religious essays (and you won't know what else those neurons are used for), but you might create an output restricted enough that you could then interrogate it usefully for higher-level abstractions or "opinions."
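A rough sketch of idea 2, assuming you have already recorded one layer's activations (e.g. via PyTorch forward hooks) on both the restricted corpus and a mixed baseline; the array shapes and the random stand-in data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real recordings: (num_tokens, num_neurons) activations from one layer.
religion_acts = rng.standard_normal((10_000, 512))
baseline_acts = rng.standard_normal((50_000, 512))

# Rank neurons by how much more active they are on the restricted corpus than on the baseline.
selectivity = religion_acts.mean(axis=0) - baseline_acts.mean(axis=0)
top_neurons = np.argsort(selectivity)[::-1][:20]
print("Most religion-selective neurons:", top_neurons)
# Caveat from the post: a neuron that tops this list can still be polysemantic --
# you learn what it does on these essays, not everything else it participates in.
```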


Re 1, that's basically the "bitter lesson" in action. Everybody starts out thinking that smart tricks which leverage human understanding would pave the short road towards AI, but again and again it turns out that (relatively) dumb methods that leverage vast amounts of data and computation beat those smart tricks every time.


Not only that, but the more you encoded domain-specific tricks, the harder it is to make your AI multi-modal. We should move in the opposite direction and get rid of tokenization altogether rather than making it "smarter". Obviously future models will work on bytes and characters, not tokens. I don't know when this transition will happen, however.


> Obviously future models will work on bytes and characters, not tokens.

Sorry for the ignorant questions, but has this been tried?


Probably not, and for a good reason. If you work with bytes instead of tokens you multiply the amount of computational resources needed by at least the average size of a token, probably more. Even switching to UTF-32 chars would multiply it by a lot, and that won't handle all languages.

Now if you were to restrict yourself to handling English without diacritical marks, then bytes would be identical to chars, so you'd only be multiplying the required resources by some power of the length of a token.

The selection of how to tokenize is the first step in the compression of the input data. If you can do it correctly, there's no better first step. But that's a big if.
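A quick illustration of the size difference, assuming OpenAI's tiktoken library for the token count (the specific encoding name is just one common choice):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Obviously future models will work on bytes and characters, not tokens."

n_bytes = len(text.encode("utf-8"))
n_tokens = len(enc.encode(text))
print(n_bytes, n_tokens, round(n_bytes / n_tokens, 1))
# English text runs around 4 bytes per token, so a byte-level model sees sequences
# roughly 4x longer -- and attention cost grows faster than linearly with length.
```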


Yes, it's been tried. MegaByte and MambaByte are two examples. The second is easy to Google.


Thanks Paul, that's helpful.


This doesn't address the classic issue that humans can learn a rule from a single exposure (like a math rule) whereas AI can't. We appear to benefit from a combo of the brute-force statistical method and the symbolic-representation method, and AI will probably benefit if we can supplement it with those kinds of symbols.

See Steven Pinker's Words and Rules for a discussion of why the human brain probably uses both methods, based largely on evidence from how irregular and regular past tenses work in English.


Well, as a counterpoint, there definitely have been cases where supplementing a NN with hard-coded rules is helpful. AlphaZero-style Go programs got better when ladder solvers were added as inputs, for example. NNs struggle to solve ladders, even with large depth and training time, but they're easy with conventional programming. Running the ladder solvers and then sending their results to the NN is much more efficient than training a NN to solve ladders.

As for GP's suggestion, though, I suspect this approach wouldn't be very helpful with human language, since all the nuances and exceptions of human grammar are hard to address with conventional programming techniques.
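To make the ladder-solver point above concrete, a generic sketch of the pattern (not AlphaZero's or KataGo's actual feature set): the hand-coded analysis is computed conventionally and handed to the network as an extra input plane.

```python
import torch
import torch.nn as nn

board_planes = torch.randn(1, 17, 19, 19)   # ordinary learned-from-data board features (illustrative)
ladder_plane = torch.zeros(1, 1, 19, 19)    # 1 wherever a conventional ladder solver says "ladder works"
ladder_plane[0, 0, 3, 4] = 1.0              # e.g. the solver flagged this intersection

net_input = torch.cat([board_planes, ladder_plane], dim=1)   # 18 channels in total
first_conv = nn.Conv2d(18, 64, kernel_size=3, padding=1)
print(first_conv(net_input).shape)          # torch.Size([1, 64, 19, 19])
```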


Their second interface reminds me greatly of Karpathy's famous old blog post on RNNs and their effectiveness. (https://karpathy.github.io/2015/05/21/rnn-effectiveness)

In particular, there is a section ("Visualizing the predictions and the "neuron" firings in the RNN") where he visualises individual neurons. In that case, however, there are presumably only a few neurons that do not correspond to a distributed representation and which are thus directly interpretable already.
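A minimal version of that kind of visualization, with made-up activations standing in for a real model's hidden states:

```python
import numpy as np
import matplotlib.pyplot as plt

text = "The quick brown fox jumps over the lazy dog."
hidden = np.random.default_rng(0).standard_normal((len(text), 128))  # stand-in (seq_len, hidden_size)

unit = 42                                   # inspect one hidden unit, as in Karpathy's post
plt.figure(figsize=(10, 2))
plt.bar(range(len(text)), hidden[:, unit])
plt.xticks(range(len(text)), list(text), fontsize=7)
plt.title(f"Activation of hidden unit {unit} at each character")
plt.tight_layout()
plt.show()
```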


We gamified explaining AI over at Neuronpedia, thanks to grants from EV and Manifund: neuronpedia.org

It's a crowdsourcing / citizen science experiment like FoldIt, except for AI interpretability. We have ~500 players currently - you can explain neurons, verify/score human explanations, and of course browse the neuron/activations/explanations database (and export the data for whatever research you'd like to do).

We also recently added a set of autoencoder features from the Cunningham et al team for GPT2-Small Layer 6: https://www.neuronpedia.org/direction/hla-gpt2small-6-mlp-a

Short term goal is to train a model that explains neurons better than GPT-4 (we're seeing about 50%+ of human explanations score better than GPT-4's explanations, training starts in December). Medium to long term, Neuronpedia would like to be an open data hub (an interactive Wikipedia for any AI model) where researchers can upload their own features/directions, and run campaigns and various experiments publicly.

You can also contribute code - currently the scorer is open source (with a readme that has many low hanging fruit), and soon the whole thing will be open sourced.


I clicked on launch and was immediately presented with a signup window. Who needs whom?! Clicking away from the window closed it, but that didn't make it load. And also even in that half-loaded state the browser slowed my system to a crawl, though that is partially the fault of an old laptop.


heya - sorry about that poor experience and appreciate the feedback!

we're adding anonymous play as well as immediate browsing in the next week.

what are the specs of your laptop? we test mostly on mobile devices and computers in the last 3-4 years.

Nov 27, 2023·edited Nov 27, 2023

You know, Scott, there's more going on than how 'neurons' represent 'concepts.' How does an LLM know how to tell a story? Sure, there's the concept of a story, which is one thing. But there's the recipe for telling a story. That's something else. And there are all kinds of stories. Are there separate recipes for those? And stories are only one kind of thing LLMs can generate. Each has a recipe.

BTW, by 'recipe' I mean something like 'program,' but LLMs aren't computers, no matter how much we'd like to think they are. So I don't think it's very helpful to think about programs for generating stories, or menus, or travel plans, etc.

It's increasingly clear to me that LLMs are associative memories. But not passive memories that can only store and retrieve, though they can do that.

Moreover, no matter how much you fish around in the layers and weights and neurons, you're not going to discover everything you need. You'll find part of the story, but not all. Let me suggest an analogy, from a recent short post (https://tinyurl.com/2ujw2ntc):

"In the days of classical symbolic AI, researchers would use a programming language, often some variety of LISP, but not always, to implement a model of some set of linguistic structures and processes, such as those involved in story understanding and generation, or question answering. I see a similar division of conceptual labor in figuring out what’s going on inside LLMs. In this analogy I see mechanistic understanding as producing the equivalent of the programming languages of classical AI. These are the structures and mechanisms of the virtual machine that operates the domain model, where the domain is language in the broadest sense."

Mechanistic understanding will yield an account of a virtual machine. The virtual machine runs what I'm calling the domain model in that paragraph. They are two separate things, just as the LISP programming language is one thing and a sentence parser is something else, something that can be implemented in LISP.

You can begin investigating the domain model without looking at the neurons and layers just as you can investigate language without knowing LISP. You just have to give the LLM a set of prompts so that its response gives you clues about what it's doing. I've got a paper that begins that job for stories: ChatGPT tells stories, and a note about reverse engineering: A Working Paper (https://tinyurl.com/5v2jp6j4). I've set up a sequence at LessWrong where I list my work in this area (https://www.lesswrong.com/s/wtCknhCK4qHdKu98i).


"Feature #2663 represents God.

The single sentence in the training data that activated it most strongly is from Josephus, Book 14: “And he passed on to Sepphoris, as God sent a snow”. But we see that all the top activations are different uses of “God”.

This simulated neuron seems to be composed of a collection of real neurons including 407, 182, and 259, though probably there are many more than these and the interface just isn’t showing them to me"

After reading those lines (and having read UNSONG before), I fully expected an explanation of why numbers 2663, 14, 407, 182 and 259 are cabbalistically significant and heavily interrelated.

Expand full comment

Most cynically, seems like an extended exercise in reifying correlations.

Expand full comment

How would you tell the difference between the cynical take (which to be honest I partly lean towards) and the more optimistic one presumably espoused by Anthropic?

Expand full comment

Thinking about what it might mean if the human brain works like this:

- this makes me more pessimistic about brain-computer interfaces, since BCIs like neuralink can only communicate very coarsely with groups of many neurons... but now I am thinking that even communicating with individual neurons would still create a lot of confusion/difficulty, since really you need to map out the whole abstract multi-dimensional space and precisely communicate with many individual neurons at the same time, in order to read or write concepts clearly.

- on the other hand, this makes me more optimistic about mind uploading (good RationalAnimations video here: https://www.youtube.com/watch?v=LwBVR68z-fg) , since people are always like "idk bro, how could we possibly fit all our knowledge into so few neurons, therefore there must be some insane low-level DNA computation going on behind the scenes, or something". But with polysemanticity, it turns out that maybe you can just fit a preposterous amount of info into neurons, using basic math!! So maybe just scanning the connectome structure would be enough to make mind uploading work (as long as you are also able to read the connection weights out of the synapses).

Expand full comment

So far (and IIUC) brain computer interfaces work by having the brain learn to handle the interface in a way analogous to the way we learn to touch-type. IOW, everything is filtered through a serial interface in both directions.

OTOH, there are some implanted devices for things like spinal muscle stimulation, where the normal commands to, e.g., walk, are interpreted by the device to stimulate the muscles in a manner appropriate to allow posture for walking to be adopted. Again, I think the stimulus is serial, but the resultant actions aren't. Still, I believe that there was a significant training period in learning how to use the device. (And the picture in the article showed the walker being accompanied by a medical staff person of some sort.)

Expand full comment

> Are our brains full of strange abstract polyhedra?

Yes: https://blog.physics-astronomy.com/2022/12/the-human-brain-builds-structures-in-11.html

Expand full comment

'"We found a world that we had never imagined. There are tens of millions of these objects even in a small speck of the brain, up through seven dimensions. In some networks, we even found structures with up to 11 dimensions." says lead researcher, neuroscientist Henry Markram from the EPFL institute in Switzerland.'

That always sounds really incredible until I remember that by 11-dimensional structures they just mean "structures involving 11 neurons".

From the paper:

'Networks are often analyzed in terms of groups of nodes that are all-to-all connected, known as cliques. The number of neurons in a clique determines its size, or more formally, its dimension.'

The usage is the same in machine learning, and sometimes throws me off there, too.

(you may already be well aware of that! Just mentioning it because I suspect some people may not be)
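To see how mundane that convention is, here's a minimal sketch with a made-up five-neuron graph (using networkx; illustrative only, nothing here comes from the paper itself):

```python
# Toy "connectome" of 5 neurons; neurons 0-3 are all mutually connected.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)])

for clique in nx.find_cliques(G):  # maximal cliques
    print(clique, "->", len(clique), "neurons, i.e. that many 'dimensions' in the quoted convention")
# The 4-neuron clique [0, 1, 2, 3] is a "4-dimensional structure" in exactly the
# sense that the press coverage makes sound exotic.
```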

Expand full comment
Nov 27, 2023·edited Nov 27, 2023

How does language work?

I mean, 'd', 'o', and 'g' don't have lexical meaning by themselves (except maybe for Shakespeare for the middle one). But together, they mean the animal descended from wolves that wags its tail, likes to go fetch, and pees to mark its territory.

Is then English analogous to a form of neural net with 26 neurons in the first layer, with many layers?

(It would be even more interesting to compare this to Chinese.)

Expand full comment

Not exactly the *first* layer (see: fonts and/or handwriting), and a few more than 26 (spaces and punctuation marks come to mind), but yeah, it's a pretty ready analogy here.

Expand full comment

You are confusing language with how it is written.

Expand full comment

Orthography then? I agree, language predates writing. But you do have the same idea where the sounds don't mean all that much by themselves (though there may be some connection, as with phonesthemes) but gather meaning due to particular aggregations of them.

Again, I'd love if someone familiar with Chinese could comment!

Expand full comment

Do you mean familiar with the Chinese writing system? Or something else about Chinese?

Expand full comment

But the Chinese ideograms aren't tied to how you pronounce them. For the association you want you need to delve in, perhaps, some aspects of Gematria, which assert that the sounds of various Hebrew letters were selected by God for some particular meaning. (Even for Gematria that's a non-standard belief...or at least partially so. But check out why the first word of the Bible doesn't begin with an Aleph. https://judaism.stackexchange.com/questions/109202/analysis-of-an-answer-to-why-does-the-torah-begin-with-the-letter-bet )

Expand full comment

It's certainly way way way way more interconnected and overlapping and complicated and messy than this, but here's a way in which it might work in a hypothetical person. This is just off the top of my head, and I expect any practicing linguist would be able to rip it apart, but...

Noise comes in our ears, and goes through a NN layer which picks out human voices from random noise (have you ever felt the wind talking to you?). Then the output is fed to a NN that picks out the phonetics of the languages that we know from the human voices (why it's hard to learn to distinguish and make the weird noises used in other languages). Then a layer which converts the phonetics to phonology (why you can understand people with different accents, to the point where most English speakers don't even notice the problem involved in "spelling reform"). At this point, we're probably interpreting some tone and rhythm, so the end result is words and emphasis. And that's not too different from what LLMs deal with. But there are also channels for things like speed and emotion.

There'll be more processing to handle syntax (which in this example includes inflection, like in Latin). And some to handle semantics (not just references to earlier parts of the conversation, but also emotional content, and how the same statement means different things coming from different people).

And this is just how it might look if we separated it all out using monosemanticity analysis. In practice, it's probably all smushed together. Try listening to speech in a language you don't understand, without looking at the faces, or at a transcription or translation (not some speech deliberately created as an example, but actual speech from the wild). It'll be hard to make out the word boundaries, but you'll probably get some sense of the emotional content. And the emotions you pick up may or may not be real - different languages and cultures express emotions differently, and sometimes normal speech in one language sounds very emotional in another. And congratulations, that's a "hallucination". :-)

Expand full comment

Lockheed-Martin published an ad in the Smithsonian, in 1989 I think. Two pages. Tower of Babel painting. Saying they were a company out to do away with the Babel Effect, the confusion of languages. They said God confused these smart people who were poised to do anything; the implication seemed to be that it was dangerous to let them do that. They didn't have the wisdom to know how to use their knowledge in a positive way. That is our world. So much knowledge used without wisdom. No asking the 7th Generation Question.

It's Lockheed-Martin that built the Mars Climate Orbiter that was lost because imperial and metric units didn't match up.

AI: this is where we are, doing away big time with the Babel Effect.

This is something I’ve pondered deeply for a long time.

Expand full comment

I expected big constellations of stuff, but am excited about this because, unlike the brain, you can just go around turning stuff on and off and changing it with no consequences. This is all way more complicated than one-to-one relationships, but if you can do this systematically at scale, to at least get things like "involved with", can you get enough data to train another model that can generalize across all models?

If all these things are sort of chaotically formed then no.

If they represent some eternal order that’s in the universe then maybe yes?

Not sure if that checks out as a thought but I had three minutes and I like it.

Expand full comment

This is a fascinating, substantive result and a *fantastic,* easily-understood writeup by Scott. Bravo!

Expand full comment

There's an analogy with mantis shrimp vision. Humans have 3 types of colour receptor and the colours we perceive are naturally organised as a 3D (RGB) space. Mantis shrimps have something like a dozen colour receptor types. So do they see a 12D space of colours? Probably not. It turns out mantis shrimp have very poor colour differentiation. It's likely that colours for shrimp are monosemantic and for humans are polysemantic. We carve up our reds, say, into pink, scarlet, carnelian and so on because we differentiate blends of R, G and B. That gives us a very big space. Mantis shrimp probably don't do this. They don't have enough neurons to be reasoning in a 12D colour space.

Expand full comment
author

Interesting! I still don't understand what you're saying well enough to imagine what it would be like to be a mantis shrimp; would they only see twelve colors (whichever receptor was most activated would be the one that wins out in perception)?

Expand full comment

I don't think we know that much about how mantis shrimp see colours, and I'm sure nobody knows what it's like to be one, but we can test their colour differentiation and it's poor, so the model of them perceiving colours in a 12D space doesn't fit experimental results.

There are alternatives to a "most activated wins" type model too. Eg. imagine designing a simple robot that turns around and runs away if red receptors are activated because red means blood and blood is bad. There's no need to posit any kind of anything wins scenario. Red neurons are connected to legs (by a possibly complex pathway) and that's that. I'm not saying that's how mantis shrimp work, just pointing out that there doesn't have to even be an assignment of "colour" at all. This would be monosemanticity: red receptor = RUN!

Expand full comment

12 distinct colors could be visualised I think, once you decided what the colours were. Dall-E might be able to draw examples. Wes Anderson could direct the movie.

Expand full comment

How old are you? Imagine a 16 color game and remove white and gray from the palette. Black can probably stay for when no neurons activate.

Expand full comment

If I'm understanding this well, I think the best analogy would be: if you set a digital camera to give you a black and white picture of a scene, you'll still get a different image depending on whether or not you put a single color filter in front of it, even though all the images are monochromatic instead of polychromatic. With a little knowledge, for most scenes you could probably even predict (or I guess post-dict?) whether the filter in front of the camera was red, blue or green, but this won't necessarily be automatically obvious at a pre-conscious level like our normal color differentiation is. Not as perfect an analogy, but if you had red, blue and green lights shining on a white wall in a black and white video, you can see when they're turned on and off easily even if you can't tell which color light it was. Mantis shrimp can see other light you can't see (ultraviolet and infrared) and can tell when such lights turn on and off, but don't appear to be able to differentiate well whether the light you turned off was, say, infrared vs green.

I would guess this indicates that at the level of the eye nerves, the color information from a mantis shrimp's 12(?) types of light receptors is compressed down to just one dimension, dark <-> bright; analogous to our color camera set to output black and white images. I'm too lazy to research whether maybe all the color information is passed on to the shrimp's version of the visual cortex and it just devotes most neurons to processing overall brightness information instead of color discrimination. (I probably should be interested in this; it's fascinating how much visual processing (such as contrast enhancement/edge detection) is done inside the human eye before the brain even gets involved, so I'm sure mantis shrimp eyes are fascinating too.)

As a total aside, you could likewise add new color receptors into a digital camera that would let you "see" into ultraviolet and further into infrared, adding more information to the images; but if it's still set to black and white, you're not adding new colors to the image, just more information about brightness over a larger range of light wavelengths. Security cameras take advantage of the fact that our camera technology generally "sees" into the infrared but humans don't, so they can use infrared spotlights that don't bother human vision at night but still allow the camera to record the area. (Then antagonistic humans put bright infrared LEDs on their hats to blind the cameras so their faces can't be recorded.) Other camera lenses are often chosen specifically to filter *out* infrared light so that the images they take don't show or get affected by light humans can't see.

Expand full comment

I'm surprised nobody has mentioned how similar this behavior is to poly/omnigenic phenotypes -- that is, behavioral or physical traits that are controlled by many or all parts of DNA. Back propagation on a neural net isn't exactly the same as evolution operating on DNA strands, but these two processes share a similar trait: they are optimized in parallel at every level. The optimizer doesn't care for our desire for simple categories, only for efficiency.

(To be honest, in my many years as an AI researcher I've just taken to calling this whole set of behaviors 'omnigenic' -- I didn't realize polysemanticity was a thing until this new Anthropic paper; seems like it may be a case of AI folks reinventing terminology already popular in another field)

Expand full comment

also possibly an interesting angle on a molecular bio research paper here:

- take some DNA sequence data

- feed that into a very large autoencoder

- try and map the intermediate neurons in the autoencoder back to phenotypes

Expand full comment

Interesting! Can you say more? Why might we expect that intermediate neurons in the autoencoder could be mapped back to phenotypes? What would it mean if we could map the intermediate neurons in the autoencoder to phenotypes?

Expand full comment

This is a very rough sketch of an idea, but the basic thought is that a DNA sequence is akin to the inner layer of a neural network. It's this highly optimized thing that is 'polysemantic' -- more often than not, a given DNA sequence will encode a protein that does many things, in the same way a neuron in an embedding layer will encode many semantic meanings. So perhaps the tools that help us decode the inner layer of a neural network can also help us decode how DNA works.

In the monosemanticity paper, they train an autoencoder on the embedding activations of an intermediate layer of some model. The autoencoder has a much larger intermediate layer than the model does, which allows it to separate out polysemantic neurons. So, in the base model, if a neuron represented like 5 concepts, in the autoencoder that will be represented by 5 different neurons.

If you take a DNA sequence (split by protein start/ending locations) and pass it through an autoencoder that's way bigger than the DNA sequence itself, you might be able to do something similar where individual autoencoder neurons map back to a combination of DNA sequences.

There's still a problem though -- we have to then map the autoencoder neurons back to the actual phenotypes. In the monosemanticity paper, they do this by passing words into the bottom of the model and identifying what 'lights up' in the autoencoder. You can't easily do this in DNA world, this is where the abstraction kinda breaks down. But you can do some correlation statistics on the phenotypes themselves, and the autoencoder neurons that light up for different sets of DNA. A success case might look something like: 'all people who are above 6 feet have this single neuron in the auto-encoder lighting up'. And then you work backwards from there, to see what feeds into that one neuron.
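To make the autoencoder step concrete, here's a minimal PyTorch sketch of a sparse, overcomplete autoencoder in the spirit described above (illustrative only: the layer sizes, the L1 coefficient, and the random stand-in data are made up, and this is not the Anthropic code):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_inputs=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(n_inputs, n_features)   # many more features than inputs
        self.decoder = nn.Linear(n_features, n_inputs)

    def forward(self, x):
        f = torch.relu(self.encoder(x))                  # feature activations, forced >= 0
        return self.decoder(f), f

model = SparseAutoencoder()
activations = torch.randn(64, 512)        # stand-in for real model activations (or encoded DNA)
reconstruction, features = model(activations)

# Reconstruction loss plus an L1 penalty that pushes most feature activations to zero,
# so each input is explained by only a handful of (hopefully interpretable) features.
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```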

Expand full comment

Yes, it's kind of amusing how 15 or so years ago "we found a gene for a specific thing" was the totally respectable mainstream thing that happened every other day, and then 99% of that turned out to be bunk and quietly swept under the rug, nothing to see here.

Expand full comment

"Scientists Discover The ____ Gene"

Expand full comment

Yeah, the "polysemantic" behavior has been obvious for a long time, in the field.

But in reference to your XKCD, that's what decompilers are for, and isn't it awesome that we made some progress toward a decompiler for a brain? :-D

Expand full comment

Also, isn't the front-end source code for the Google page just a search box?

Expand full comment

Haven't read the whole thing yet, but if you want to know more about the "mysterious things" that happen in neural nets in an understandable way, I highly recommend the 3blue1brown series on deep learning: https://www.youtube.com/watch?v=aircAruvnKk

Expand full comment

In some ways, understanding the individual components is a poor way to understand a vector. Mathematicians typically try to explain vectors based on relationships to simpler vectors, rather than breaking them down into coordinates. Perhaps, in the same way, understanding one neuron at a time is a poor way to understand one layer of a neural net.

Expand full comment

This really made it click for me for some reason. Why on earth would you expect any of the features recognized in the brain to be along a single direction? I'm curious if you have any further analogy for how one might relate them to simpler vectors.

Expand full comment

I can't help thinking there is something misguided about the idea of "monosemanticity" vs. "polysemanticity" in the first place. What looks polysemantic from one perspective might be monosemantic from another - the choice of basis looks kind of arbitrary (based on concepts that we find "interpretable"). Is green a primary color or a mixture of other colors? Is a face a "basic" concept, or instead its constituents (eyes, nose, mouth, oval outline)? You could describe a rectangular volume in terms of L x W x H. Or you could describe it in terms of sqrt(A1 x A2 x A3), where the As are the areas of the three distinct faces. We have conventions about what we take to be more basic, but there's nothing objective about it, and the areas look like "mixtures" of the individual lengths along different dimensions. It looks like they've found a way to change the semantic basis to one that matches more intuitive dimensions for us, and that's useful. But casting it in terms of "monosemanticity" seems less so.
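For what it's worth, the box example does check out as a genuine change of coordinates. With A1 = LW, A2 = LH, A3 = WH:

```latex
\sqrt{A_1 A_2 A_3} = \sqrt{(LW)(LH)(WH)} = LWH,
\qquad
L = \sqrt{\tfrac{A_1 A_2}{A_3}},\quad
W = \sqrt{\tfrac{A_1 A_3}{A_2}},\quad
H = \sqrt{\tfrac{A_2 A_3}{A_1}} .
```

So (L, W, H) and (A1, A2, A3) are interchangeable descriptions of the same box, and neither is objectively "the" basis.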

Expand full comment

The premise of the work is that sparsity provides an objective (or at least unambiguous) criterion to distinguish between different bases. There are many feature bases that can capture the same information, but imposing the requirement that the activity of each feature is sparse across the dataset of interest allows you to distinguish between them. A basis which satisfies this property facilitates interpretability not just because the features uncovered are more intuitive for humans (though this seems to be the case), but also because it means that, for any given input to the model, only a small fraction of features needs to be considered to explain the computation performed by the model on that input.

As an aside, I think looking for sparsity may in fact be a stronger requirement than is needed in theory -- I believe that looking for a feature basis such that the feature activations have non-gaussian distributions is sufficient (this is the premise of independent components analysis https://en.wikipedia.org/wiki/Independent_component_analysis, or nonlinear variants like https://openreview.net/pdf?id=XqEF9riB93S).
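A toy illustration of the ICA point, in case it helps (made-up data and scikit-learn's FastICA; nothing here is specific to the Anthropic setup):

```python
# Mix two sparse, non-gaussian "features" through a random (polysemantic) basis,
# then ask ICA to recover a basis in which they separate again.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
sources = rng.laplace(size=(10_000, 2))   # heavy-tailed, i.e. non-gaussian, activations
mixing = rng.normal(size=(2, 2))          # arbitrary change of basis
observed = sources @ mixing.T             # what we'd actually record from the "neurons"

recovered = FastICA(n_components=2, random_state=0).fit_transform(observed)
cross_corr = np.corrcoef(recovered.T, sources.T)[:2, 2:]
print(np.round(np.abs(cross_corr), 2))    # close to a permutation matrix: sources recovered
# PCA on the same data would only decorrelate the columns, not un-mix them.
```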

Expand full comment

From what I can tell, right now they're operating mostly on the level of English words as the basic building block, so it makes a certain amount of sense for them to call it this. But the net is already doing things like creating different nodes for different contexts of the word "the". And perhaps some of the different weighting represents abstract concepts that span multiple English words.

It's like we're actually seeing the basis for the weak Sapir-Whorf effect...

Expand full comment
Nov 28, 2023·edited Nov 28, 2023

What does it even mean for a neural network used for predicting the next word of a text to be plotting the downfall of humanity? It feels like many people got stuck with this 2010s concept of the AI as a master planner and are refusing to accept that we ended up with an entirely different kind of AI. This result sounds like a very impressive advance in understanding how LLMs work, but I don't see how it could be applicable to any kind of end-of-the-world-scenario safety problem. (It does seem useful for the more mundane safety problems like thwarting users who are generating fascist imagery using associations to get around banned words. [1])

[1] https://www.bellingcat.com/news/2023/10/06/the-folly-of-dall-e-how-4chan-is-abusing-bings-new-image-model/

Expand full comment
author

It doesn't mean anything right now. But see the section called "Mini-Scenario 1: AutoGPT Vs. ChaosGPT" at https://www.astralcodexten.com/p/tales-of-takeover-in-ccf-world

Expand full comment

A looping LLM is already extremely interpretable, though: you just need to read its internal monologue.

Expand full comment
author

Yeah, I was mostly trying to make the point that it's pretty trivial to turn an LLM into an agent. I don't think the actually-useful version of LLM-turned-into-agent will be hack-ish enough to have a human-readable internal monologue.

Expand full comment

Why, though?

What's stopping us from creating a clever enough scaffolding around an LLM, or even multiple LLMs, to create a system that would process information in a completely transparent way, somewhat similar to how the human thinking process works? Why do you think that it's not going to be useful?

Expand full comment

But the human thinking process *isn't* transparent. Only a certain percentage of the human population has a constantly-running internal monologue, so it's clearly an optional add-on, not the core of human information processing.

Expand full comment

Fair, but irrelevant to the point I was making.

Expand full comment

My guess is that the system you describe will be less efficient than a more integrated system, creating pressure to use a less legible but more efficient system. Think of it like always running with trace logging turned on?

At the same time, I worry that widespread use would generate so much data that it would require automated or even AI tools to check, creating the opportunity for exploits similar to "corruption" in corporate accounting.

But these are perhaps potentially easier problems to solve, than dealing with an opaque superintelligence. :-)

Expand full comment

> My guess is that the system you describe will be less efficient than a more integrated system, creating pressure to use a less legible but more efficient system.

Well yes, but as we would prefer a slow takeoff to a fast one, this seems to be a feature, not a bug. We would need to outlaw any other way of making AI as not safe enough, and just slowly move towards ASI with this method.

Expand full comment

> processing information in completely transparent way, somewhat similar to how human thinking process works?

Do you really think that the way human beings think is transparent?

Expand full comment

Well, let's just say that having an artificial intelligence as transparent as my own thinking process seems very good interpretability-wise.

Expand full comment

A looping LLM is not as extremely interpretable as it looks: inner-monologue transcripts can have errors and still reach the right answer, in the same way that few-shot Q/A examples can work even if they include the *wrong* answers or shuffle the answers! The role that the inner-monologue plays is not as simple as you think it is, and it gets even less simple and interpretable as, among other things, LLMs scale or LLMs increasingly train on outputs derived from inner-monologues (ie. Internet scrapes filled with material from people increasingly using LLMs which compute using inner-monologue while including only the final answer, or all the datasets cloning GPT-3/4 through the API) or people do the obvious knowledge-distillation thing https://www.lesswrong.com/posts/bwyKCQD7PFWKhELMr/by-default-gpts-think-in-plain-sight?commentId=vdNmuDgC5YqsA4kq6 .

Expand full comment

The fact that we can see at all that there is an error in the transcript is already extremely interpretable, compared to literally anything else in AI.

Expand full comment

Being able to interpret the "inscrutable matrix of weights" in an AI makes it possible to edit it and refactor it and optimize it. This is the first step toward a massive increase in capabilities. Of course, we humans won't be able to do this all directly, we'll need programs to help us do it, perhaps even AI programs. And perhaps even one day there will be AIs that will be able to modify themselves to become more intelligent.

Expand full comment

Q: How many computer programmers does it take to change the way an AI thinks?

A: only one, but the AI really has to want to change.

Expand full comment

Programmers always fall for the bad AI, because we think we can change them.

Expand full comment

bugs bunny in lipstick

Expand full comment

The true shoggotim were inside us all along.

Expand full comment

I had to look that up. I was out of the loop on this meme. It's interesting that it would be chosen as the stand-in. I found a NYT article about it, and this quote stood out, made by someone in AI who was speaking with the reporter.

“I was also thinking about how Lovecraft’s most powerful entities are dangerous — not because they don’t like humans, but because they’re indifferent and their priorities are totally alien to us and don’t involve humans, which is what I think will be true about possible future powerful A.I.”

My question is how can an entity be indifferent and have priorities as well?

That something smarter than us would be mean to us is entirely our own problem because that's the kind of world we trained in. It is as though a good chunk of humanity is being forced into psycho-analysis.

Expand full comment

What role could monosemanticity have in the self-regulation of the AI industry to ensure AI aligns with humanity?

What role could it have in government regulation of the AI industry, if the AI industry cannot or will not regulate itself?

Would or should investors and customers start to demand some sort of monosemanticity guardrails for AI alignment?

If investors or customers cared about (monosemanticity) AI guardrails, who would help them understand monosemanticity or other AI guardrails, so those guardrails don't "greenwash" the AI industry but actually work as intended?

Expand full comment

Nick Land going crazy rn

Expand full comment

“The Anthropic interpretability team describes this as simulating a more powerful AI. That is, the two-neuron AI in the pentagonal toy example above is simulating a five-neuron AI.”

I'm not sure why this would be explained as simulation rather than merely classification. If you applied this reasoning to another kind of AI, e.g. a Support Vector Machine, then the opposite happens: a large number of dimensions "simulate" a much smaller number of toys.

Expand full comment

Re: the brain utilizing superpositions, is it even possible for a polysemantic network such as the brain not to have a monosemantic representation? It feels like a monosemantic equivalent would always exist, almost like how any two numbers always have an LCM (Least Common Multiple)

Expand full comment
founding

This to me feels less like a simple AI simulating a more complex AI, and more like it coming up with a language by turning its neurons into the letters of that language.

Expand full comment

I think that thinking about this in terms of simulating a bigger AI is a bit dramatic, possibly to the point of being misleading. I'll give my linear algebra version first, and a non-technical version second.

In linear algebra we know that you can pack N "almost orthogonal vectors" in dimension k (much smaller than N), even though there are only k actually orthogonal dimensions available. This is a counterintuitive property of high-dimensional Euclidean space. We do a lot of dimensionality reduction in machine learning in general, in which we take a high dimensional set of vectors and try to cram them into lower dimensional space (for example t-SNE).

For a non-technical example of why dimensionality reduction can be mundane, just think about a map of the globe. You're throwing away one dimension (elevation), but that noise may not matter to you in a lot of applications. You may also be familiar with the fact that distances get distorted by popular projections. You can project with less and less distortion as you enter higher dimensions.

So I think it's a lot more mundane than "simulating a larger AI". We just have a model which is taking advantage of geometry to pack some 10,000 dimensional vectors (the globe) into 500 dimensional space (the map).

Another reason to not think of it as simulating a larger model, is that we'd probably expect an even larger model to use those 10,000 neurons to represent 50,000 dimensional space or something (speculative).

This is a pretty common operation, so "neural network lossily represents 10000 dimensional vector in 500 dimensions" is much more mundane than simulating a larger AI.

Expand full comment

I'm glad somebody made reference to the vector space math involved, but your comment doesn't really indicate just how hugely absurd the effect is. Suppose we define a fixed error tolerance for how close 2 vectors need to be to count as "nearly orthogonal" (say within .01% of 90 degrees; this error tolerance can be as small as you like as long as you hold it fixed). Then the maximum number of vectors you can have, all of which are nearly orthogonal, grows *exponentially* in the number of dimensions!

One way to get intuition for this is that for a d dimensional sphere, in the limit of large d you should think of nearly all the (hyper)volume of its surface as being near the equator, and almost none of it close to the north or south poles. So 2 randomly selected vectors will usually be nearly orthogonal. You can then find exponentially many nearly orthogonal vectors (with high probability) simply by choosing their angles randomly.

Hence, there is a lot more room in a high dimensional vector space than you might intuitively think...
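A quick numerical check of that claim, for anyone who prefers code to geometry (sizes are arbitrary):

```python
# Random unit vectors in d dimensions are nearly orthogonal, and more so as d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 50, 500, 5000):
    v = rng.normal(size=(200, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # 200 random unit vectors
    dots = v @ v.T
    off_diag = np.abs(dots[~np.eye(200, dtype=bool)])  # pairwise |cosine|, excluding self
    print(f"d={d:5d}  max |cos angle| = {off_diag.max():.3f}")
# The worst-case cosine over all pairs shrinks as d grows, roughly like 1/sqrt(d):
# the vectors get closer and closer to mutually orthogonal.
```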

Expand full comment

Funny seeing your name pop up, as this whole discussion reminded me of an idea from quantum gravity where (some people expect that) distinct geometries are encoded as states that are nearly but not quite orthogonal. (And the semiclassical approximation is just a regime in which we treat the relevant geometries as truly orthogonal and hence suitable to be the eigenstates of an observable operator, so that it makes sense to e.g. promote the metric to a quantum field operator h.)

Expand full comment

Thanks for posting what I was thinking. The simulation analogy is absolutely wrong, and being "monosemantic" whilst staying interpretable to humans limited to 2D chart plots is just entirely inefficient.

Expand full comment

I wonder if this is related to cell signaling pathways in biology? I've long pondered this problem, where we just don't have enough signaling cascades/pathways to cover all the various biological functions. Meanwhile, something like NFkappaB fires for practically everything. There are thousands of papers talking about how some signaling cascade is "vital for [X] mechanism", but that simply CAN'T be true in the 1:1 sense of "if this, then that", because then you'd never be able to modulate any one function.

However, if cells are doing something similar - using 'simulated' signaling pathways that are various modulated combinations of multiple other signaling cascades - you could easily get the functionality you're looking for here. Very interesting and exciting potential for extrapolation to biological mechanisms!

Expand full comment

Let me push back on the NF-kB example, because it's a pet peeve.

Contrary to the standard narrative ("Cell activated; NF-kB go BRRRRRR"), "NF-kB" actually refers to a family of protein dimers. There are five possible subunits that can compose those dimers, meaning 15 possible combinations (though I believe that 3 don't occur in nature). That's 12 protein complexes, already enough for some subtlety.

Now consider that each NF-kB subunit can be phosphorylated on any of, like, a dozen sites. Each phosphorylation event can allow or completely disbar certain molecular interactions.

Now consider that NF-kB dimers can apparently "flex" when they bind to DNA of different structures. Depending on their new conformation, more molecular interactions are permitted or prevented.

Thus, while there are only 12 NF-kB dimers, functionally, they can probably adopt thousands of meaningfully different conformations (and that's before we get into their actual associations with cofactors, which can themselves be phosphorylated and adopt different conformations).

I think the simple reality molecular biologists are loath to admit is that the field is deifically complex, and we stand zero chance of solving it this millennium without the aid of AI.

Expand full comment

I think it's entirely justified for this to be a pet peeve. Too many papers don't get into enough specifics to allow you to parse the difference between one subunit combination and another - because often the authors themselves don't really understand it. It would take a lot more work to get into the biochemistry (where the real answers lie), and many non-biochemists are unjustifiably afraid of the biochemistry so they figure what they don't know can't hurt their hypothesis. So yes, it's entirely valid to be telling groups of biologists (who should know better) that the details matter.

And we can push back further to discuss the myriad ways any signaling pathway can become complexificated:

1. Post-transcriptional modifications everyone ignores because we really don't know what we're doing with them. (Leave it to the biochemists to sort it out!)

2. Presence/absence of rafts or scaffolds that organize subsets of signaling complexes within any given cell that may be expressing a heterogenous assortment of subunits (because most of the time we're not talking about a single molecule, but rather a family of molecules/subunit configurations); we normally ignore these because it's a lot harder to do the co-localization work on 3+ proteins than it is to just ask whether something is present/absent or active/inactive in the cell.

3. Temporal activation is likely a factor as well, but most papers talk about whether something is 'turned on/off'; many kinases/phosphorylases are in a balance, where the activation of their target protein is a function of what percent of the time is it active vs. inactive based on the expression balance between kinase/phosphorylase, as opposed to the simplified 'on/off' signals reported in hundreds of papers and represented by arrows on so many pathways posters plastered on the walls of every institution.

And who cares about any of these complications? Many an immunologist/developmental biologist/cell biologist/etc. will shrug and say that's for the biochemists to sort through. But biochemists can't do all the hard work!

I'm not claiming that these systems aren't complex. The list of potentially unique signaling pathways is much longer than what a basic molecular cell biology textbook will lead you to believe. Nor do I think we can use some magic tool to avoid or get around that complexity. (We didn't need the biochemists after all! /s) But I'm still unconvinced that we can reduce the system to the point of, to put it in context of Scott's post, 'monosemanticity' just by parsing the nuances of all the pathways.

I don't think there's a one-to-one from signal to pathway to response, nor is this just a matter of overlapping pathways activated by different stimuli. Cell signaling has always looked more like a complex network to me. I agree that it has a lot of nodes, but I still see a network that's poorly explained through current theory. Maybe there will be a corollary with the work of these AI researchers. Maybe it'll all turn out to be an unrelated dead end. But I've never been convinced current theory has a firm handle on communication within cellular systems, so I'm open to considering other possible explanations.

Expand full comment

Yeah, this is huge, in the "one small step for a man" sense.

I suspect that this goes much deeper than simple concepts. I recall when I was doing math and physics in college, and I was in a study group with about 5 other people. We were all quite smart, but the courses were the intensive "introduction" designed for people who'd go on to be math and physics majors, so we did a lot of vector calculus and analysis and general relativity. We were being dumped in the deep end, and it was graded on a curve because on some tests not a single person got more than half the questions right. What I found fascinating was that in our study group, we all found different parts of it to be easy and hard, and we all had different ways of conceptualizing the same "simple" math. That is, we were all high "g", but our brains were all organized differently, and some people's brains were simply better at doing certain types of math.

> Are we simulating much bigger brains?

I've said before, I think humans aren't innately rational, we just run a poor emulation of rationality on our bio neural nets.

Expand full comment

You reminded me of Mumford's four tribes (and various subtribes) of mathematicians, although he's talking less about differences in cognitive profiles and more about aesthetic inclinations drawing different sorts of people to different subfields / problems etc https://www.dam.brown.edu/people/mumford/blog/2015/MathBeautyBrain.html

Expand full comment

> I've said before, I think humans aren't innately rational, we just run a poor emulation of rationality on our bio neural nets.

I am very much inclined to agree with you.

Expand full comment

I wonder if it would be possible to "seed" an AI with a largish number of concepts -- e.g. one for every prominent Wikipedia article -- embedded in its neurons at a preliminary or early stage of training, in order to more easily understand what it is thinking once it is fully trained.

Expand full comment

Could someone help me understand what the limiting factor for simulating bigger NNs is? It seems to me that the more accurate the weights can be (i.e. how many figures after the decimal), the better you could cram multiple abstract neurons into one "real" neuron. So is it just a matter of increasing floating point precision for the weights?

Expand full comment

This is pure speculation from a person who's a hobbyist and whose work is kind of adjacent to machine learning.

Training a neural network isn't a precision task - basically imagine a helicopter flying over a landscape and periodically stopping to take a reading of the elevation, searching for an elevation of a certain height. If it stops to read elevation every meter, it'll take an eternity to search the whole country. If it stops every hundred kilometers, it'll miss most of the landscape.

The solution is to have it take broad samples when initially searching and narrow its search as it gets closer. So, for instance, if it's looking for sea level and it's in the Rockies, it'll quickly move on, while if it's only a few kilometers off the coast, it'll search more frequently.

But still, odds are that it will miss its actual target by a bit. It might find shallow water at 10 meters below sea level and say "eh, close enough, we can call this area sea level."

The more precision you require, the more of an issue that "close enough" becomes. That means longer training times, and possibly never defining some hard-to-catch concepts.
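The helicopter analogy in a few lines of code, purely for concreteness (a made-up one-dimensional "landscape"; real training follows gradients rather than doing grid search, so this only illustrates the coarse-to-fine idea and the "close enough" endpoint):

```python
import numpy as np

def elevation(x):                      # stand-in landscape with its minimum at x = 3.7
    return (x - 3.7) ** 2

lo, hi = -100.0, 100.0                 # start by surveying the whole "country"
best = 0.0
for _ in range(6):                     # each pass narrows the search window
    xs = np.linspace(lo, hi, 21)       # 21 evenly spaced "elevation readings"
    best = float(xs[np.argmin(elevation(xs))])
    width = (hi - lo) / 10             # zoom in around the best reading so far
    lo, hi = best - width, best + width

print(round(best, 4))  # close to 3.7, but only to within the final grid spacing: "close enough"
```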

Expand full comment

I know fuck all about how any of it works, but hobbyists have been reducing the precision not only to 4 or 5 bits per parameter, but even to TWO bits. That's after the training is complete, but suggestive of you being wrong.
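For a rough sense of what that post-training quantization does, here is a naive sketch on random stand-in weights (real low-bit schemes such as GPTQ or the llama.cpp k-quants use per-block scales and smarter grids, so this understates how well they work):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=100_000)   # stand-in for one weight matrix

for bits in (8, 4, 2):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 127 for 8-bit, 1 for 2-bit
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)   # the stored integers
    err = np.abs(q * scale - weights).mean() / np.abs(weights).mean()
    print(f"{bits}-bit: mean relative error ~ {err:.0%}")
# Naive uniform rounding is nearly free at 8 bits and painful at 2 bits, which is
# why the usable 2-bit formats need the extra tricks.
```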

Expand full comment

Okay, that's a pretty good improvement for transparency!

It can be used as a safety policy, where it's forbidden to train models larger than we can currently evaluate. Of course, there is an obvious failure mode with the evaluator AI essentially becoming an even more powerful version of the AI being evaluated, which... has its safety risks, to say the least.

Expand full comment

The interesting question is what happens when one (or a few) neurons defining a concept are damaged. This happens all the time in humans. Does this mean that one concept we remember can flip to another one because one or a few neuron states are disturbed? Also - do the brain and AIs have error correction for this?

Expand full comment

It makes me think of how we use high and low voltages for computing. To do more complex things than represent a 1 or a 0, we throw more high/low voltage measurers into the system.

However, we can create multiple-valued logic systems that use multiple voltage levels. These give efficiency in terms of space and power, with the tradeoff that they're more susceptible to noise and interference.

Expand full comment

I was familiar with autoencoders but not sparsity when I read this. I couldn't wrap my head around how an autoencoder could be used in this way, since they're usually used to compress information into a lower dimension. I went to ChatGPT and got even more confused but, perhaps appropriately, Claude was able to clear things up for me in just a few messages.

Expand full comment

This was very very cool, thanks for the article.

Will be honest that this makes me a little more bearish on the whole AI revolution. It indicates to me that what's happening is neither inductive nor deductive reasoning, but just a clever method of fuzzy data storage and retrieval. And that as scaling becomes more difficult, utility will go down. The need for 100 billion neurons just to reach this level of performance seems to bear that out.

But who knows? Maybe understanding this will make us able to better pack more information into fewer neurons. And maybe inductive and deductive reasoning are, at their roots, just a clever method of fuzzy data storage and retrieval.

Expand full comment

If we sold mildly buggy business software which still mostly worked, based on testing that used subjective judgement to measure results at deployment, it would feel very weird. To some extent, I'm sure this actually does happen, but I'd like to think those systems really don't matter, while the software is likely free.

Expand full comment

Was your comment supposed to be for a different post? I don't see a connection with this post.

Expand full comment

I just double checked and Yes, it was for this post.

Thanks for checking.

Expand full comment

I spent several days slogging through those two papers, and it would have gone much more quickly if I'd read this excellent writeup first! Hopefully this'll save other people some serious time, and bring attention to what seems to me like a really important step forward in interpretability.

Expand full comment

"That is, we find non-axis aligned directions in the neural state space that are more interpretable than individual neurons."

This sentence that, based on context, seems to be trying to explain what the previous sentence meant, further obfuscates the meaning of the prior sentence.

I've only skimmed through papers relating to my field of ophthalmology, and as a layman at that. Is this common, that an attempt to clarify a point makes it clear as mud?

I'm using the prior sentence to decode what the quoted sentence says. I guess it's actually an attempt at greater specificity, and not clarification? Boy howdy, is it confuzzling.

Expand full comment

The whole abstract polyhedra thing reminds me of QAM and TCM. These are two modulation techniques that attempt to produce maximally distinguishable symbols for transmission over the air (or phone lines, in old-timey modems). The symbols are defined in 2-dimensional (or more, for things like MIMO and other more sophisticated techniques) space, and it results in "constellations" of points (each point representing a symbol). The goal is to keep the distance between any pair of symbols at a maximum (minimize error rate), while being power limited. This ends up making all kinds of cute dot diagrams, sometimes very much along the lines of an "8 points on the outside in an octagon, 4 points on the inside in a square" setup or what not.

With TCM, there's an additional innovation, which is basically error-checking (I'm really simplifying here). In any given "state", you know you can't have some of the symbols, which increases the distances between the other ones. Is it possible the neural networks may arrive at TCM-like encodings, where the set of "legal" states for some neurons depends on the state of other neurons? That'd be amazing.
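A tiny illustration of the QAM half of that, since it maps so neatly onto the post's "many concepts in few dimensions" geometry (a generic 16-QAM grid, not any particular real modem):

```python
# 16-QAM: sixteen symbols (4 bits each) packed into a 2-D signal space.
# The design trade-off: keep the points far apart (low error rate) without
# letting the average power grow.
import itertools
import numpy as np

grid = [-3, -1, 1, 3]
constellation = np.array([complex(i, q) for i in grid for q in grid])

min_dist = min(abs(a - b) for a, b in itertools.combinations(constellation, 2))
avg_power = np.mean(np.abs(constellation) ** 2)
print(f"{len(constellation)} symbols, minimum distance {min_dist}, average power {avg_power}")
```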

Expand full comment

Meanwhile in parallel universe AI-Creators are wondering how do the humans (they just created) think.

"It is very strange - their brains are not logic based. And when there is more of them, their thinking improve, just because of quantity. Strange... "

Expand full comment

> when there is more of them, their thinking improve

"Never mind, they just invented something called 'social media', and the trend reversed."

Expand full comment

For animal brains, there is another (if somewhat trivial) fact to consider: brain cells die sometimes. It would be inconvenient if you woke up one day and forgot the concept of "tiger" because the wrong brain cell had died.

The obvious way to prevent that is to use redundancy. While you could have extra copies of the brain cell responsible for "tiger", this will not be the optimal solution.

Consider QR codes. There is no square, or set of squares, which corresponds exactly to the first bit of the encoded message. Instead, the whole message is first interpreted as describing a polynomial, and then the black and white squares correspond to that polynomial being sampled at different points. (From what I can tell; I am not an expert on Reed-Solomon encoding.) Change a bit in the message, and the whole QR code will change. As there are more sampling points used than the degree of the polynomial, your mobile can even reconstruct the original polynomial (i.e. the message) if some of the squares are detected wrong, for example because some marketing person placed their company logo in the middle of the QR code.

Human circuit designers generally prefer to design circuits in which meaning is very localized. If a cosmic ray (or whatever) flips a single bit in your RAM, you might end up seeing a Y instead of an X displayed in this message. By contrast, animal brains have not evolved for interpretability. (This is why if you destroy 1% of the gates in a computer, you get a brick, while if you randomly destroy 1% of the neurons in a human you get a human.) I guess the only limit is the connectedness of your brain cells: if every neuron was connected to every other neuron, the safest way to store the concept of "tiger" would be some pattern of roughly half of your neurons firing.
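Here is a stripped-down version of the "message as a polynomial, sampled at extra points" idea (made-up numbers; it only shows recovery from known-missing samples, whereas real Reed-Solomon works over finite fields and can also correct samples that are outright wrong):

```python
from fractions import Fraction

message = [7, 3, 0, 5]                        # 4 symbols = coefficients of a degree-3 polynomial

def poly(x):
    return sum(c * x**k for k, c in enumerate(message))

samples = {x: poly(x) for x in range(7)}      # 7 samples where 4 would suffice: redundancy

for lost in (1, 2, 5):                        # a "logo" stamped over three samples destroys them
    del samples[lost]

xs = list(samples)[:4]                        # any 4 surviving samples pin down a cubic exactly
def interpolate(x):                           # Lagrange interpolation through those 4 points
    total = Fraction(0)
    for xi in xs:
        term = Fraction(samples[xi])
        for xj in xs:
            if xj != xi:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

print({x: int(interpolate(x)) for x in (1, 2, 5)})   # the destroyed samples, reconstructed
print({x: poly(x) for x in (1, 2, 5)})               # ...matching the originals: {1: 15, 2: 53, 5: 647}
```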

Expand full comment

Seems like a fantastically exciting breakthrough, but I struggle to understand how achieving monosemanticity furthers transparency/explainability, principles frequently trumpeted as important in AI policy spaces. Like, if an AI system rejects my application for a bank loan, and I want the decision reviewed, how does knowing that specific neurons encode specific concepts help me? I worry that a monosemantic explanation of AI systems, while useful to ML engineers and others who already have a firm grasp of how neural networks operate, will be close to meaningless for a lay person.

Feels similar to how I often hear calls for "algorithmic transparency" from civil society activists in relation to digital platforms like Twitter/Facebook, but I've never really seen an explanation of how an algorithm operates that is meaningful, let alone satisfying, to such non-technical people. My suspicion is that trying to import classic public-interest concepts like "transparency" to digital tech without appreciating the technical complexity inherent to these systems is a recipe for broad confusion (but maybe it's me that is confused).

Expand full comment

Related idea: perhaps for information to be meaningful, it needs to have the potential to change behaviour in some direction. In this case, I'm not sure how knowing that neuron x encodes concept y supports behaviour change - i.e having learnt this, I'm not sure what one ought to do differently. That's the root of my assertion above that this information would not be "meaningful" for most people.

Expand full comment

It made like 90% sense. Possibly naive question: is it possible to "manually" activate neurones in the larger model and see if that actually generates the expected output?

Expand full comment

The "two neurons to represent 5 concepts" reminds me of Quadrature Amplitude Modulation for encoding digital signals in lossy communication channels. https://en.wikipedia.org/wiki/Quadrature_amplitude_modulation

Expand full comment

Given that such sentences as "In their smallest AI (512 features), there is only one neuron for “the” in math. In their largest AI tested here (16,384 features), this has branched out to one neuron for “the” in machine learning, one for “the” in complex analysis, and one for “the” in topology and abstract algebra" appear in this column, I am aghast that Scott didn't pull the "the the" trick.

Furthermore: the

Expand full comment

So here's a stupid question: are the five concepts in the two-neuron dimensional space categorical or continuous data? If neuron 2 activates at .30 (I don't know what these numbers represent, but whatever), which concept does the activation map to--a dog or a flower? Is that even possible? Can you get an essentially infinite number of simulated neurons/concepts out of two real neurons by treating them as a continuous spectrum of values? If you did, would these act as weights determining the probability of triggering some specific concept? Or are neurons always just binary (on or off)?

And finally, do human neurons act this way? I have always pretty much assumed that neurons are "on/off" switches: despite all their complexity of design, they either reach their action potential or they don't. But now it occurs to me that the signal across the neuron, from dendrite to axon terminal, could contain a wider range of signals, potentially based on the duration, frequency and voltage of the signal. But is there any evidence of that?
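On the first question, at least in the toy pentagon picture from the post: the activation is continuous, and which concept you get depends on the direction the two-neuron activity vector points in, with its length acting roughly like a strength. A made-up sketch (the labels and numbers are invented, not taken from the paper):

```python
import numpy as np

concepts = ["dog", "apple", "eye", "sky", "hand"]                # invented labels
angles = 2 * np.pi * np.arange(5) / 5
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # 5 unit vectors in 2-D

def decode(neuron1, neuron2):
    activity = np.array([neuron1, neuron2])
    scores = directions @ activity       # projection of the activity onto each concept direction
    return concepts[int(np.argmax(scores))], np.round(scores, 2)

print(decode(0.00, 0.30))   # points "up" (90 degrees), nearest the 72-degree direction -> "apple"
print(decode(0.95, 0.05))   # points almost exactly along the first direction -> "dog"
```

So a reading of 0.30 on neuron 2 alone doesn't pick out a concept; the pair of activations does, and the same two neurons can express any of the five concepts, more or less strongly.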

Expand full comment

First, I'm not a neuroscientist. But I know someone who used to do research on a similar phenomenon in mice.

This is one of the most relevant papers, which describes how the neural responses change over time as mice learn to distinguish between pairs of odors: https://www.cell.com/neuron/fulltext/S0896-6273(16)30563-3.

The basic setup is that the mice are strapped on a rig with two water pipes. When they detect odor A, they can drink water from the left side. When they detect odor B, they can drink water from the right side. To ensure that the mice are motivated to cooperate, their water intake is restricted before training.

The main result is that when the mice have to distinguish very similar odors, they learn to represent each odor more precisely. They never reach the extreme case of having just a single neuron dedicated to each odor, because the brain always requires some redundancy. But as they are trained, more neurons begin to respond to one of the two paired odors and not the other. In other words, to use the terms of this post, the neurons become more monosemantic.

But when mice are trained to distinguish very different odors, they can afford to be sloppier while still maintaining high accuracy. So the representations of the two paired odors actually become more similar during training. In other words, the neurons become more polysemantic.

I don't seem to be able to copy images to Substack comments, but I recommend the following figures in particular:

- Figure 1B shows the neural architecture of the olfactory bulb. The neurons studied in this paper are the mitral cells, which are roughly equivalent to the hidden layers in an artificial neural network.

- Figure 2C shows how the number of neurons with divergent responses (i.e., responding to one of the paired odors but not the other) drops in the easy case.

- Figure 3C shows how the number of neurons with divergent responses increases in the difficult case.

- Figure 4 shows 3D plots with the first three principal components in both cases. In the easy case, the representations start off far apart, then move somewhat closer together. In the difficult case, the representations start off completely overlapping, but separate by the end.

Expand full comment

I have nothing insightful to add, just want to express my thanks for this post. I've been meaning to read the monosemanticity paper since I first heard about it maybe a month ago, but haven't found the time to. This provides a nice overview and more motivation to have another crack at it. This seems like a very important development for interpretability.

Expand full comment

>Feature #2663 represents God.

This is a pretty good choice, but you missed a great opportunity to pick out the Grandmother Neuron (assuming this model had one).

Expand full comment

Long time reader, first time commenter here. Thanks for this piece; it was equal parts awe and inspiration. I think there are some implications for medicine here that might actually be more important than the AI safety pursuits. You inspired me to make my first substack post about it, so thanks for that!

Expand full comment

> 100 billion neurons

Nit: GPT probably has on the order of 100 billion parameters, but since every neuron has about 100-1000 parameters in these networks, that amounts to 'only' 100 million to 1 billion neurons.
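Spelling the arithmetic out with those rough figures:

```latex
\frac{10^{11}\ \text{parameters}}{10^{2}\text{--}10^{3}\ \text{parameters per neuron}}
\approx 10^{8}\text{--}10^{9}\ \text{neurons}.
```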

Expand full comment
Dec 3, 2023·edited Dec 3, 2023

AI predicts a future; it does not project a future. That is why AI is not conscious: it cannot project goals for itself; it does not care about anything, not even itself. I think this is because human time projection -- an imagined future -- has a determining effect, while machine time is only determined by past prediction. AI does not want anything of us. It has no future, whereas our consciousness is entirely in the nonexistent future. We are in a state of incessant future desire; AI is fully sated always, since it draws only from the past.

Next, the Luddites can break the looms but the AI revolution is here. I have no fear of a machine that turns garbage into golden patterns. Think of those ENORMOUS dump trucks for strip mining with the teeny cab where the driver sits. Who do you fear? Not the truck. That is AI. So it absolutely must be monitored and regulated. But not because it wants to eat us! AI is still like a very powerful chainsaw that is very hard to control, but gosh does it cut. You wouldn't want a bunch of D&D Valley Boys playing with it without some government oversight. Again I am pro-regulation.

But... humans fetishize our own inventions. The Valley Boys would love to imagine themselves gods, but Q* is still "locked in," still fundamentally solipsistic: all "transcendental consciousness" with no access to the transcendent things in themselves. All the training data is human-made; it is made of us, and we fall in love with our machines because they seem like us. But they care not a jot for us one way or the other. So these man-made golems are nevertheless very dangerous, since they will follow the algorithms we put in them relentlessly and with no common sense at all. I think (though I do not yet know how to describe this) that the hallucination problem, as another comment said, is a "feature of the architecture," not really a defect of it.

Expand full comment

"check how active the features representing race are when we ask it to judge people"

Ah, this is not going to work, actually. The beauty of AI ethics is that there are several definitions of fairness, and they are mathematically incompatible with each other. Not activating the features representing race satisfies a relatively dumb definition of fairness but is not compatible with several other definitions.

Let's say that our model looks at a lot of data about a person, including race, and decides whether they should receive a loan. If the system is trained to decide solely on the probability of the loan being repaid, the chances are that it will 1) use the race information, 2) reject a higher proportion of people from a certain race A, and 3) reject some people of race A who would have been accepted if their race were different but all other features the same. It is also possible that the system will 4) reject people of race A so aggressively that the people of race A who do receive a loan will default on it less frequently than people from other races.

Each of these four system behaviors may be taken as a definition of discrimination. The problem is that if we eliminate 1) (which is easy and often done in practice: just drop the race attribute from the dataset), the other three measures of discrimination can, and often will, get worse.
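To make the incompatibility concrete, here is a toy sketch on synthetic loan data. Everything in it is invented (the income distributions, the repayment model, and a plain income cutoff standing in for a trained classifier): the rule never looks at the group label, so definition 1) is satisfied, yet the group with lower average income still ends up with a lower approval rate and with more of its creditworthy applicants rejected, i.e. 2) and 3) remain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic applicants: group A has lower average income than group B.
# All numbers are invented purely for illustration.
group = rng.choice(["A", "B"], size=n)
income = np.where(group == "A",
                  rng.normal(40, 10, n),
                  rng.normal(55, 10, n))

# True repayment probability depends only on income, never on group.
p_repay = 1 / (1 + np.exp(-(income - 45) / 5))
repays = rng.random(n) < p_repay

# A completely group-blind lending rule: approve anyone above an income cutoff.
approved = income > 50

for g in ["A", "B"]:
    m = group == g
    approval_rate = approved[m].mean()
    rejected_good = ((~approved[m]) & repays[m]).sum() / repays[m].sum()
    default_rate = (~repays[m & approved]).mean()
    print(f"group {g}: approval rate {approval_rate:.0%}, "
          f"creditworthy applicants rejected {rejected_good:.0%}, "
          f"default rate among approved {default_rate:.0%}")
```

In this toy setup the cutoff never consults the group label, yet the two groups see very different approval and rejection rates; which of those numbers you insist on equalizing is exactly the choice the incompatible fairness definitions disagree about.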

Expand full comment

The real problem is that race is a real and perfectly valid datapoint in life but our theocratic moron government can’t accept that

Expand full comment

Non-technical people tend to attribute "thought" to "AI", which isn't actually artificial intelligence but machine learning – this isn't a nitpick; it's a difference of principle.

The best alternatives I can offer for how to think about the current crop of "AI":

1) It works like the unconscious, or, to put it another way, like "System 1" thinking. Just as the unconscious doesn't actually "think" but approximates a gut feeling or a leaning towards something, "AI"/ML does the same.

2) This doesn't mean ML is useless for actual artificial intelligence (as in a first draft of an AI that takes independent actions), quite the contrary – human/animal "natural" intelligence consists of multiple parts, and one of those parts is almost identical to ML in function. But to ascribe thought to it is missing the point.

Expand full comment

When discussing these things I like to avoid saying it thinks or understands anything and instead say it "effectively understands" something, where I define that as "behaves in a way such that if it were a human instead of a machine we would say that it understands it". I think this avoids pointless nitpicking about word choice and lets us get back to talking about what the machine actually does.

Expand full comment

As someone who recently attended a neuroscience conference, let me say that our brains are absolutely doing this type of thing.

Some examples:

"Place cells" are neurons in the hippocampus that fire when you are in a particular place. One physical location is not encoded as a single cell, but rather as an ensemble of place cells: so neurons A + B + F = "this corner of my room" and A + E + F = "that corner of my room".

"Grid cells" are neurons in the entorhinal cortex (which leads into the place cells) that encode an abstract, conceptual hexagonal grid, with each grid hexagon represented as an ensemble of grid cells.

Memories are also stored as ensembles of cells, or to be more precise, as collections of synaptic weights. So much so that, in one study, they located an ensemble of cells that collectively stored two different (but related) memories, and by targeting and deleting specific synapses, they could delete one memory from the set of cells without deleting the other.

https://www.youtube.com/watch?v=kMvvWikHklA

I also highly recommend this presentation of a paper where they found they could trace - and delete - a memory as it was first encoded in the hippocampus, then consolidated in the hippocampus, then transferred to the frontal cortex. It doesn't directly have to do with superposition or simulating higher-neuron brains, but it is really cool.

https://www.youtube.com/watch?v=saFDeGTYnRU
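Circling back to the place-cell example: the ensemble idea maps onto a tiny distributed-coding sketch like the one below (purely illustrative, not a model of real hippocampal coding), where each location is a small set of shared neurons and decoding works by overlap, so one neuron can participate in several representations at once.

```python
# Toy ensemble code: each "place" is a sparse set of shared neurons,
# so individual neurons take part in several representations
# (purely illustrative; not a model of real place cells).
places = {
    "this corner of my room": {"A", "B", "F"},
    "that corner of my room": {"A", "E", "F"},
    "the doorway":            {"C", "D", "E"},
}

def decode(active_neurons):
    """Return the place whose ensemble overlaps most with the active set."""
    overlaps = {place: len(cells & active_neurons) for place, cells in places.items()}
    return max(overlaps, key=overlaps.get)

# Even a partial, noisy readout disambiguates the two corners:
print(decode({"A", "B"}))  # -> "this corner of my room"
print(decode({"E", "F"}))  # -> "that corner of my room"
```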

Expand full comment