233 Comments

None of the described programs/apps/platforms are AI, much less GP AI.


There is a section of Gödel, Escher, Bach where this seems to happen: each “djinn” consults a higher djinn to answer a question, and eventually the answer “comes back down the call stack”. The same trick seems to work in my faith practice: simulating someone wiser and more loving than me and then trying to emulate it worked OK. But trusting that I was receiving signals transmitted from that wiser thing, which would bubble up through my subconscious - oh man, did that work better.


Here's where humanists and poets could help. Consider a line like "My true love is fair." What does 'fair' mean? The ambiguity is the point. The polysemanticity is the point. Don't try to achieve 1:1 signifier : signified relationship. Think differently.


> Are we simulating much bigger brains?

Sure. We've been doing it for thousands of years, ever since the invention of writing. The ability to store knowledge outside of our brains and refer back to it at a later time allows us to process far more information than we can hold in our brains at any given moment, which is essentially what "simulating a much bigger brain" is.


I'm a linguist, and for me as a linguist it makes total sense. _Of course_ you would try to emulate a bigger system via weighted intersections. On toy experiments small enough to build the big sparse model, one of the best ways seems to be "build the big sparse model, then compress". Of-freaking-course (in hindsight) big LLMs do the same, only in one step.


So, AI first-level neurons pull (at least one kind of) double-duty by each recognizing multiple concepts, and then the next level(s) up narrow down which concept best applies.

I expect this will work more efficiently when the concepts being recognized are mutually exclusive.

The metaphor that's coming to mind here is "homonyms": we know from context whether we're talking "write", "right", or "rite", but the sound's the same.


Feature 2663 represents God. Ezekiel is the 26th book of the bible. Ezekiel 6:3 reads:

> and say: ‘You mountains of Israel, hear the word of the Sovereign Lord. This is what the Sovereign Lord says to the mountains and hills, to the ravines and valleys: I am about to bring a sword against you, and I will destroy your high places

This is either a coded reference to 9/11 (bringing us back to the Freedom tower shape association) or an ominous warning about what the AI will do once it becomes God.


I’m actually cited a few times in the Anthropic paper for our sparse coding paper; this is the most exciting research direction, imo. Next is circuits connecting these feature directions.

If you’re interested in working on projects, feel free to hang out in the #sparse_coding channel (under interpretability) at the EleutherAI discord server: https://discord.gg/eleutherai
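For anyone who wants a concrete feel for what a sparse autoencoder does before diving into the papers, here is a minimal toy sketch in PyTorch. It is my own illustration, not the Anthropic or EleutherAI code; the 512/4096 sizes, the `activations` tensor, and the L1 coefficient are all made-up stand-ins.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: expands d_model activations into a larger,
    sparsely active feature space, then reconstructs the original activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model, bias=False)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # candidate "monosemantic" features
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, 512)   # stand-in for activations collected from a real model

for _ in range(100):
    recon, features = sae(activations)
    # Reconstruction loss plus an L1 penalty that pushes each input
    # to be explained by only a few active features.
    loss = ((recon - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The real papers add details omitted here (normalizing decoder columns, resampling dead features, training on billions of activations), but the core objective really is this simple: reconstruct well while keeping feature activations sparse.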


Here are a couple of random concepts that this post sparked, that I'm sure are of 0 novelty or value to anyone actually in the field:

1. It seems like maybe the AIs are spending a lot of their brainpower relearning grammar everywhere. Like, if there's a "the" for math and a "the" for, I don't know, descriptions of kitchens, to a large extent it's mostly learning the rules for "the" in both places. You could imagine making the AI smarter by running a pre- and post-processor: the pre-processor strips grammar down to a bare minimum, and the post-processor takes the minified grammar the AI actually learned from and reinjects normal human grammar rules so that the output looks fluent.

2. In terms of trying to understand a big AI through this: you could imagine running the classifier while seeing how the AI responds to a largish-but-highly-restricted set of texts. So, maybe run the classifier while the LLM is responding to input from only essays on religion. You won't discover all the things that each neuron is used for, but you will perhaps find out which neurons activate for religious essays (without knowing what else those neurons are used for), and you might create an output restricted enough that you could then interrogate it usefully for higher-level abstractions or "opinions."
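A rough sketch of how idea 2 might be probed in practice, assuming you have already collected a matrix of feature (or neuron) activations per document; the corpora, shapes, and variable names here are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activation matrices: rows = documents, columns = features/neurons.
religion_acts = rng.random((200, 4096))    # activations while reading religion essays
baseline_acts = rng.random((2000, 4096))   # activations on a broad general corpus

# Rank features by how much more active they are on the restricted corpus.
lift = religion_acts.mean(axis=0) - baseline_acts.mean(axis=0)
top_features = np.argsort(lift)[::-1][:20]
print("Features most selectively active on religion essays:", top_features)
```

As the comment notes, this only tells you what fires on the restricted corpus, not what else those same features are used for elsewhere.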


Their second interface reminds me greatly of Karpathy's famous old blog post on RNNs and their effectiveness. (https://karpathy.github.io/2015/05/21/rnn-effectiveness)

In particular, there is a section ("Visualizing the predictions and the “neuron” firings in the RNN") where he visualises individual neurons. In that case, however, there are presumably only a few neurons that do not correspond to a distributed representation and which are thus directly interpretable already.


We gamified explaining AI over at Neuronpedia, thanks to grants from EV and Manifund: neuronpedia.org

It's a crowdsourcing / citizen science experiment like FoldIt, except for AI interpretability. We have ~500 players currently - you can explain neurons, verify/score human explanations, and of course browse the neuron/activations/explanations database (and export the data for whatever research you'd like to do).

We also recently added a set of autoencoder features from the Cunningham et al team for GPT2-Small Layer 6: https://www.neuronpedia.org/direction/hla-gpt2small-6-mlp-a

Short term goal is to train a model that explains neurons better than GPT-4 (we're seeing about 50%+ of human explanations score better than GPT-4's explanations, training starts in December). Medium to long term, Neuronpedia would like to be an open data hub (an interactive Wikipedia for any AI model) where researchers can upload their own features/directions, and run campaigns and various experiments publicly.

You can also contribute code - currently the scorer is open source (with a readme that has many low hanging fruit), and soon the whole thing will be open sourced.


You know, Scott, there's more going on than how 'neurons' represent 'concepts.' How does an LLM know how to tell a story? Sure, there's the concept of a story, which is one thing. But there's the recipe for telling a story. That's something else. And there are all kinds of stories. Are there separate recipes for those? And stories are only one kind of thing LLMs can generate. Each has a recipe.

BTW, by 'recipe' I mean something like 'program,' but LLMs aren't computers, no matter how much we'd like to think they are. So I don't think it's very helpful to think about programs for generating stories, or menus, or travel plans, etc.

It's increasingly clear to me that LLMs are associative memories. But not passive memories that can only store and retrieve, though they can do that.

Moreover, no matter how much you fish around in the layers and weights and neurons, you're not going to discover everything you need. You'll find part of the story, but not all. Let me suggest an analogy, from a recent short post (https://tinyurl.com/2ujw2ntc):

"In the days of classical symbolic AI, researchers would use a programming language, often some variety of LISP, but not always, to implement a model of some set of linguistic structures and processes, such as those involved in story understanding and generation, or question answering. I see a similar division of conceptual labor in figuring out what’s going on inside LLMs. In this analogy I see mechanistic understanding as producing the equivalent of the programming languages of classical AI. These are the structures and mechanisms of the virtual machine that operates the domain model, where the domain is language in the broadest sense."

Mechanistic understanding will yield an account of a virtual machine. The virtual machine runs what I'm calling the domain model in that paragraph. They are two separate things, just as the LISP programming language is one thing and a sentence parser is something else, something that can be implemented in LISP.

You can begin investigating the domain model without looking at the neurons and layers just as you can investigate language without knowing LISP. You just have to give the LLM a set of prompts so that its response gives you clues about what it's doing. I've got a paper that begins that job for stories: ChatGPT tells stories, and a note about reverse engineering: A Working Paper (https://tinyurl.com/5v2jp6j4). I've set up a sequence at LessWrong where I list my work in this area (https://www.lesswrong.com/s/wtCknhCK4qHdKu98i).


"Feature #2663 represents God.

The single sentence in the training data that activated it most strongly is from Josephus, Book 14: “And he passed on to Sepphoris, as God sent a snow”. But we see that all the top activations are different uses of “God”.

This simulated neuron seems to be composed of a collection of real neurons including 407, 182, and 259, though probably there are many more than these and the interface just isn’t showing them to me"

After reading those lines (and having read UNSONG before), I fully expected an explanation of why numbers 2663, 14, 407, 182 and 259 are cabbalistically significant and heavily interrelated.


Most cynically, seems like an extended exercise in reifying correlations.


Thinking about what it might mean if the human brain works like this:

- this makes me more pessimistic about brain-computer interfaces, since BCIs like Neuralink can only communicate very coarsely with groups of many neurons... but now I am thinking that even communicating with individual neurons would still create a lot of confusion/difficulty, since really you need to map out the whole abstract multi-dimensional space and precisely communicate with many individual neurons at the same time in order to read or write concepts clearly.

- on the other hand, this makes me more optimistic about mind uploading (good RationalAnimations video here: https://www.youtube.com/watch?v=LwBVR68z-fg) , since people are always like "idk bro, how could we possibly fit all our knowledge into so few neurons, therefore there must be some insane low-level DNA computation going on behind the scenes, or something". But with polysemanticity, it turns out that maybe you can just fit a preposterous amount of info into neurons, using basic math!! So maybe just scanning the connectome structure would be enough to make mind uploading work (as long as you are also able to read the connection weights out of the synapses).


> Are our brains full of strange abstract polyhedra?

Yes: https://blog.physics-astronomy.com/2022/12/the-human-brain-builds-structures-in-11.html


How does language work?

I mean, 'd', 'o', and 'g' don't have lexical meaning by themselves (except maybe for Shakespeare for the middle one). But together, they mean the animal descended from wolves that wags its tail, likes to go fetch, and pees to mark its territory.

Is then English analogous to a form of neural net with 26 neurons in the first layer, with many layers?

(It would be even more interesting to compare this to Chinese.)


Lockheed-Martin published an ad in the Smithsonian, in 1989 I think. Two pages. A Tower of Babel painting, saying they're a company to do away with the Babel Effect, the confusion of languages. They said God confused these smart people who were poised to do anything; the implication seemed to be that it was dangerous to let them do that. They didn't have the wisdom to know how to use their knowledge in a positive way. That is our world: so much knowledge used without wisdom. No asking the Seventh Generation question.

It's Lockheed-Martin that had the Mars Climate Orbiter crash because imperial and metric units didn't match up.

AI: this is where we are, doing away big time with the Babel Effect.

This is something I’ve pondered deeply for a long time.


I expected big constellations of stuff, but am excited about this because, unlike with the brain, you can just go around turning stuff on and off and changing it with no consequences. This is all way more complicated than one-to-one relationships, but if you can do this systematically at scale to at least get things like "involved with", can you get enough data to train another model that can generalize across all models?

If all these things are sort of chaotically formed then no.

If they represent some eternal order that’s in the universe then maybe yes?

Not sure if that checks out as a thought but I had three minutes and I like it.


This is a fascinating, substantive result and a *fantastic,* easily-understood writeup by Scott. Bravo!


There's an analogy with mantis shrimp vision. Humans have 3 types of colour receptor, and the colours we perceive are naturally organised as a 3D (RGB) space. Mantis shrimps have something like a dozen colour receptor types. So do they see a 12D space of colours? Probably not. It turns out mantis shrimp have very poor colour differentiation. It's likely that colours for shrimp are monosemantic and for humans are polysemantic. We carve up our reds, say, into pink, scarlet, carnelian and so on because we differentiate blends of R, G and B. That gives us a very big space. Mantis shrimp probably don't do this. They don't have enough neurons to be reasoning in a 12D colour space.


I'm surprised nobody has mentioned how similar this behavior is to poly/omnigenic phenotypes -- that is, behavioral or physical traits that are controlled by many or all parts of DNA. Back propagation on a neural net isn't exactly the same as evolution operating on DNA strands, but these two processes share a similar trait: they are optimized in parallel at every level. The optimizer doesn't care for our desire for simple categories, only for efficiency.

(To be honest, in my many years as an AI researcher I've just taken to calling this whole class of behavior 'omnigenic' -- I didn't realize polysemanticity was a thing until this new Anthropic paper; seems like it may be a case of AI folks reinventing terminology already popular in another field.)


Haven't read the whole thing yet, but if you want to know more about the "mysterious things" that happen in neural nets in an understandable way, I highly recommend the 3blue1brown series on deep learning: https://www.youtube.com/watch?v=aircAruvnKk


In some ways, understanding the individual components is a poor way to understand a vector. Mathematicians typically try to explain vectors based on relationships to simpler vectors, rather than breaking them down into coordinates. Perhaps, in the same way, understanding one neuron at a time is a poor way to understand one layer of a neural net.


I can't help thinking there is something misguided about the idea of "monosemanticity" vs. "polysemanticity" in the first place. What looks polysemantic from one perspective might be monosemantic from another - the choice of basis looks kind of arbitrary (based on concepts that we find "interpretable")? Is green a primary color or a mixture of other colors? Is a face a "basic" concept, or instead its constituents (eyes, nose, mouth, oval outline)? You could describe a rectangular volume in terms of L x W x H. Or you could describe it in terms of sqrt(A1 x A2 x A3), where the As are the areas of the three distinct faces. We have conventions about what we take to be more basic, but there's nothing objective about it, and the areas look like "mixtures" of the individual lengths along different dimensions. It looks like they've found a way to change the semantic basis to one that matches more intuitive dimensions for us, and that's useful. But casting it in terms of "monosemanticity" less so.
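One way to make this basis-dependence concrete is a toy numpy sketch (my own, with made-up directions): the same set of activations looks fully polysemantic in the neuron basis and perfectly monosemantic in a rotated feature basis.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random rotation: its rows are three orthonormal "feature directions"
# that are deliberately not aligned with the neuron axes.
feature_dirs, _ = np.linalg.qr(rng.standard_normal((3, 3)))

# Three inputs, each activating exactly one feature with strength 1.
neuron_acts = np.eye(3) @ feature_dirs   # what each individual neuron reports

print("Neuron basis (every neuron responds to every concept):")
print(neuron_acts.round(2))
print("Feature basis (each direction responds to exactly one concept):")
print((neuron_acts @ feature_dirs.T).round(2))   # recovers the identity matrix
```

Whether the rotated basis deserves to be called "the real" one, rather than just a more convenient one for us, is exactly the question this comment is raising.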


What does it even mean for a neural network used for predicting the next word of a text to be plotting the downfall of humanity? It feels like many people got stuck with this 2010s concept of the AI as a master planner and are refusing to accept that we ended up with an entirely different kind of AI. This result sounds like a very impressive advance in understanding how LLMs work, but I don't see how it could be applicable to any kind of end-of-the-world-scenario safety problem. (It does seem useful for the more mundane safety problems like thwarting users who are generating fascist imagery using associations to get around banned words. [1])

[1] https://www.bellingcat.com/news/2023/10/06/the-folly-of-dall-e-how-4chan-is-abusing-bings-new-image-model/


What role could monosemanticity have in the self-regulation of the AI industry to ensure AI aligns with humanity?

What role could it have in government regulation of the AI industry, if the AI industry cannot or will not regulate itself?

Would or should investors and customers start to demand some sort of monosemanticity guardrails for AI alignment?

If investors or customers cared about (monosemanticity) AI guardrails, who would help them understand monosemanticity or other AI guardrails, so those guardrails don't "greenwash" the AI industry but actually work as intended?


Nick Land going crazy rn


“The Anthropic interpretability team describes this as simulating a more powerful AI. That is, the two-neuron AI in the pentagonal toy example above is simulating a five-neuron AI.”

I’m not sure why this would be explained as simulation rather than merely classification. If you applied this reasoning to another kind of AI like e.g. a Support Vector Machine, then the opposite happens– a large number of dimensions “simulate” a much smaller number of toys.


Re: the brain utilizing superpositions, is it even possible for a polysemantic network such as the brain not to have a monosemantic representation? It feels like a monosemantic equivalent would always exist, almost like how any two numbers always have an LCM (Least Common Multiple)


This to me feels less like a simple AI simulating a more complex AI and more like it's coming up with a language, with its neurons as the letters.


I think that thinking about this in terms of simulating a bigger AI is a bit dramatic, possibly to the point of being misleading. I'll give my linear algebra version first, and a non-technical version second.

In linear algebra we know that you can pack N "almost orthogonal vectors" in dimension k (much smaller than N), even though there are only k actually orthogonal dimensions available. This is a counterintuitive property of high-dimensional Euclidean space. We do a lot of dimensionality reduction in machine learning in general, in which we take a high dimensional set of vectors and try to cram them into lower dimensional space (for example t-SNE).

For a non-technical example of why dimensionality reduction can be mundane, just think about a map of the globe. You're throwing away one dimension (elevation), but that noise may not matter to you in a lot of applications. You may also be familiar with the fact that distances get distorted by popular projections. You can project with less and less distortion as you enter higher dimensions.

So I think it's a lot more mundane than "simulating a larger AI". We just have a model which is taking advantage of geometry to pack some 10,000 dimensional vectors (the globe) into 500 dimensional space (the map).

Another reason not to think of it as simulating a larger model is that we'd probably expect an even larger model to use those 10,000 neurons to represent a 50,000-dimensional space or something (speculative).

This is a pretty common operation, so "neural network lossily represents 10000 dimensional vector in 500 dimensions" is much more mundane than simulating a larger AI.
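A quick numpy sketch of the "almost orthogonal" point (my own numbers, nothing from the paper): random unit vectors in a few hundred dimensions already interfere with each other surprisingly little.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10_000, 500   # try to pack 10,000 directions into 500 dimensions

# Random unit vectors in R^k
V = rng.standard_normal((N, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Cosine similarities for a million sampled pairs of distinct vectors
cosines = V[:1000] @ V[1000:2000].T
print("max |cosine| among sampled pairs:", np.abs(cosines).max())
# Typically around 0.2: not orthogonal by construction, but close enough
# that thousands of "features" can coexist with limited crosstalk.
```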


I wonder if this is related to cell signaling pathways in biology? I've long pondered this problem, where we just don't have enough signaling cascades/pathways to cover all the various biological functions. Meanwhile, something like NFkappaB fires for practically everything. There are thousands of papers talking about how some signaling cascade is "vital for [X] mechanism", but that simply CAN'T be true in the 1:1 sense of "if this, then that", because then you'd never be able to modulate any one function.

However, if cells are doing something similar - using 'simulated' signaling pathways that are various modulated combinations of multiple other signaling cascades - you could easily get the functionality you're looking for here. Very interesting and exciting potential for extrapolation to biological mechanisms!


Yeah, this is huge, in the "one small step for a man" sense.

I suspect that this goes much deeper than simple concepts. I recall when I was doing math and physics in college, and I was in a study group with about 5 other people. We were all quite smart, but the courses were the intensive "introduction" designed for people who'd go on to be math and physics majors, so we did a lot of vector calculus and analysis and general relativity. We were being dumped in the deep end, and it was graded on a curve because on some tests not a single person got more than half the questions right. What I found fascinating was that in our study group, we all found different parts of it to be easy and hard, and we all had different ways of conceptualizing the same "simple" math. That is, we were all high "g", but our brains were all organized differently, and some people's brains were simply better at doing certain types of math.

> Are we simulating much bigger brains?

I've said before, I think humans aren't innately rational, we just run a poor emulation of rationality on our bio neural nets.


I wonder if it would be possible to "seed" an AI with a largish number of concepts -- e.g. one for every prominent Wikipedia article -- embedded in its neurons at a preliminary or early stage of training, so that once it is fully trained, you could more easily understand what it is thinking.


Could someone help me understand what the limiting factor for the simulation of bigger NNs is? It seems to me that the more accurate the weights can be (i.e., how many figures after the decimal), the more easily you could cram multiple abstract neurons into one "real" neuron. So is it just a matter of increasing floating-point precision for the weights?


Okay, that's a pretty good improvement for transparency!

It could be used as a safety policy, where it's forbidden to train models larger than we can currently evaluate. Of course, there is an obvious failure mode where the evaluator AI essentially becomes an even more powerful version of the AI being evaluated, which... has its safety risks, to say the least.


The interesting question is what happens when one (or a few) of the neurons defining a concept are damaged. This happens all the time in humans. Does this mean that one concept we remember can flip to another because one or a few neuron states are disturbed? Also: do brains and AIs have error correction for this?


It makes me think of how we use high and low voltages for computing. To do more complex things than represent a 1 or a 0, we throw more high/low voltage measurers into the system.

However, we can create multiple-valued logic systems that use multiple voltage levels. These give efficiency in terms of space and power, with the tradeoff being that they're more susceptible to noise and interference.


I was familiar with autoencoders but not sparsity when I read this. I couldn't wrap my head around how an autoencoder could be used in this way, since they're usually used to compress information into a lower dimension. I went to ChatGPT and got even more confused but, perhaps appropriately, Claude was able to clear things up for me in just a few messages.


This was very very cool, thanks for the article.

Will be honest that this makes me a little more bearish on the whole AI revolution. It indicates to me that what's happening is neither inductive nor deductive reasoning, but just a clever method of fuzzy data storage and retrieval. And that as scaling becomes more difficult, utility will go down. The need for 100 billion neurons just to reach this level of performance seems to bear that out.

But who knows? Maybe understanding this will make us able to better pack more information into fewer neurons. And maybe inductive and deductive reasoning are, at their roots, just a clever method of fuzzy data storage and retrieval.


If we sold mildly buggy business software which still mostly worked, based on testing that used subjective judgement to measure results at deployment, it would feel very weird. To some extent, I'm sure this actually does happen, but I'd like to think those systems really don't matter while the software is likely free.


I spent several days slogging through those two papers, and it would have gone much more quickly if I'd read this excellent writeup first! Hopefully this'll save other people some serious time, and bring attention to what seems to me like a really important step forward in interpretability.


"That is, we find non-axis aligned directions in the neural state space that are more interpretable than individual neurons."

This sentence that, based on context, seems to be trying to explain what the previous sentence meant, further obfuscates the meaning of the prior sentence.

I've only skimmed through papers relating to my field of ophthalmology, and as a layman at that. Is this common, that an attempt to clarify a point makes it clear as mud?

I'm using the prior sentence to decode what the quoted sentence says. I guess it's actually an attempt at greater specificity, and not clarification? Boy howdy, is it confuzzling.


The whole abstract polyhedra thing reminds me of QAM and TCM. These are two modulation techniques that attempt to produce maximally distinguishable symbols for transmission over the air (or phone lines, in old-timey modems). The symbols are defined in 2-dimensional (or higher, for things like MIMO and other more sophisticated techniques) space, and it results in "constellations" of points (each point representing a symbol). The goal is to keep the distance between any pair of symbols at a maximum (to minimize error rate) while being power limited. This ends up making all kinds of cute dot diagrams, sometimes very much along the lines of an "8 points on the outside in an octagon, 4 points on the inside in a square" setup or whatnot.

With TCM, there's an additional innovation, which is basically error-checking (I'm really simplifying here). In any given "state", you know you can't have some of the symbols, which increases the distances between the other ones. Is it possible the neural networks may arrive at TCM-like encodings, where the set of "legal" states for some neurons depends on the state of other neurons? That'd be amazing.
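A small sketch of the constellation idea for anyone who hasn't seen QAM (a generic 16-QAM grid, not any particular standard): place 16 symbols in the 2-D plane, normalize average power, and the minimum pairwise distance is what sets the noise tolerance.

```python
import numpy as np
from itertools import product

# 16-QAM: a 4x4 grid of complex-valued symbols
points = np.array([complex(i, q) for i, q in product([-3, -1, 1, 3], repeat=2)])
points /= np.sqrt(np.mean(np.abs(points) ** 2))   # normalize average power to 1

# Minimum distance between any two symbols (this drives the error rate)
dists = np.abs(points[:, None] - points[None, :])
print(f"16-QAM minimum symbol distance at unit power: {dists[dists > 0].min():.3f}")
```

The loose analogy to the post would be feature directions arranged to stay as distinguishable as possible under a fixed activation budget.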


Meanwhile, in a parallel universe, AI-Creators are wondering how the humans they just created think.

"It is very strange - their brains are not logic based. And when there is more of them, their thinking improve, just because of quantity. Strange... "


For animal brains, there is another (if somewhat trivial) fact to consider: brain cells die sometimes. It would be inconvenient if you woke up one day and forgot the concept of "tiger" because the wrong brain cell had died.

The obvious way to prevent that is to use redundancy. While you could have extra copies of the brain cell responsible for "tiger", this will not be the optimal solution.

Consider QR codes. There is no square, or set of squares, which corresponds exactly to the first bit of the encoded message. Instead, the whole message is first interpreted as describing a polynomial, and then the black and white squares correspond to that polynomial being sampled at different points. (From what I can tell; I am not an expert on Reed-Solomon encoding.) Change a bit in the message, and the whole QR code will change. As there are more sampling points used than the degree of the polynomial, your mobile can even reconstruct the original polynomial (i.e. the message) if some of the squares are detected wrong, for example because some marketing person placed their company logo in the middle of the QR code.

Human circuit designers generally prefer to design circuits in which meaning is very localized. If a cosmic ray (or whatever) flips a single bit in your RAM, you might end up seeing a Y instead of an X displayed in this message. By contrast, animal brains have not evolved for interpretability. (This is why if you destroy 1% of the gates in a computer, you get a brick, while if you randomly destroy 1% of the neurons in a human you get a human.) I guess the only limit is the connectedness of your brain cells: if every neuron was connected to every other neuron, the safest way to store the concept of "tiger" would be some pattern of roughly half of your neurons firing.
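A toy sketch of the redundancy point (mine, not from any paper): store each concept as a random pattern across many units rather than one unit per concept, kill off 10% of the units, and nearest-pattern decoding still recovers the concept.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_units = 100, 1000

# Each concept is a random +1/-1 pattern spread across all units
codebook = rng.choice([-1.0, 1.0], size=(n_concepts, n_units))

# Recall concept 0 ("tiger", say) after 10% of units die and go silent
tiger = codebook[0].copy()
dead_units = rng.choice(n_units, size=n_units // 10, replace=False)
tiger[dead_units] = 0.0

decoded = np.argmax(codebook @ tiger)    # nearest stored pattern by dot product
print("Decoded concept index:", decoded)  # still 0: the memory survives the damage
```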


Seems like a fantastically exciting breakthrough, but I struggle to understand how achieving monosemanticity furthers transparency/explainability, principles frequently trumpeted as important in AI policy spaces. Like, if an AI system rejects my application for a bank loan, and I want the decision reviewed, how does knowing that specific neurons encode specific concepts help me? I worry that a monosemantic explanation of AI systems, while useful to ML engineers and others who already have a firm grasp of how neural networks operate, will be close to meaningless for a lay person.

Feels similar to how I often hear calls for "algorithmic transparency" from civil society activists in relation to digital platforms like Twitter/Facebook, but I've never really seen an explanation of how an algorithm operates that is meaningful, let alone satisfying, to such non-technical people. My suspicion is that trying to import classic public-interest concepts like "transparency" to digital tech without appreciating the technical complexity inherent to these systems is a recipe for broad confusion (but maybe it's me that is confused).


It, like, 90% made sense. Possibly naive question: is it possible to "manually" activate neurons in the larger model and see if that actually generates the expected output?


The "two neurons to represent 5 concepts" reminds me of Quadrature Amplitude Modulation for encoding digital signals in lossy communication channels. https://en.wikipedia.org/wiki/Quadrature_amplitude_modulation


Given that such sentences as "In their smallest AI (512 features), there is only one neuron for “the” in math. In their largest AI tested here (16,384 features), this has branched out to one neuron for “the” in machine learning, one for “the” in complex analysis, and one for “the” in topology and abstract algebra" appear in this column, I am aghast that Scott didn't pull the "the the" trick.

Furthermore: the


So here's a stupid question: are the five concepts in the two-neuron dimensional space categorical or continuous data? If neuron 2 activates at .30 (I don't know what these numbers represent, but whatever), which concept does the activation map to -- a dog or a flower? Is that even possible? Can you get an essentially infinite number of simulated neurons/concepts out of two real neurons by treating them as a continuous spectrum of values? If you did, would these act as weights determining the probability of triggering some specific concept? Or are neurons always just binary (on or off)?

And finally, do human neurons act this way? I have always pretty much assumed that neurons are "on/off" switches: despite all their complexity of design, they either reach their action potential or they don't. But now it occurs to me that the signal across the neuron, from dendrite to axon terminal, could contain a wider range of signals, potentially based on the duration, frequency and voltage of the signal. But is there any evidence of that?
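To make the "two neurons, five concepts" question concrete, here's a toy sketch of my own (not from the paper): five concept directions spaced around a pentagon in the 2-D activation space, decoded by picking whichever direction the activation points toward most strongly. The activations themselves are continuous; the decoded concept is categorical, and the labels are made up.

```python
import numpy as np

# Five concept directions spaced around a pentagon in the 2-D "neuron space"
angles = 2 * np.pi * np.arange(5) / 5
concepts = np.stack([np.cos(angles), np.sin(angles)], axis=1)
names = ["dog", "apple", "flower", "car", "sky"]   # hypothetical labels

def decode(activation: np.ndarray) -> str:
    """Map a continuous 2-neuron activation to the closest concept direction."""
    return names[int(np.argmax(concepts @ activation))]

print(decode(np.array([0.9, 0.30])))   # lands on "dog" (the 0-degree direction)
print(decode(np.array([0.1, 0.95])))   # lands on "apple" (the 72-degree direction)
```

In a scheme like this you can keep adding directions around the circle, but they crowd together and small amounts of noise start flipping a concept into its neighbours, so the practical limit is interference rather than floating-point precision.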


First, I'm not a neuroscientist. But I know someone who used to do research on a similar phenomenon in mice.

This is one of the most relevant papers, which describes how the neural responses change over time as mice learn to distinguish between pairs of odors: https://www.cell.com/neuron/fulltext/S0896-6273(16)30563-3.

The basic setup is that the mice are strapped on a rig with two water pipes. When they detect odor A, they can drink water from the left side. When they detect odor B, they can drink water from the right side. To ensure that the mice are motivated to cooperate, their water intake is restricted before training.

The main result is that when the mice have to distinguish very similar odors, they learn to represent each odor more precisely. They never reach the extreme case of having just a single neuron dedicated to each odor, because the brain always requires some redundancy. But as they are trained, more neurons begin to respond to one of the two paired odors and not the other. In other words, to use the terms of this post, the neurons become more monosemantic.

But when mice are trained to distinguish very different odors, they can afford to be sloppier while still maintaining high accuracy. So the representations of the two paired odors actually become more similar during training. In other words, the neurons become more polysemantic.

I don't seem to be able to copy images to Substack comments, but I recommend the following figures in particular:

- Figure 1B shows the neural architecture of the olfactory bulb. The neurons studied in this paper are the mitral cells, which are roughly equivalent to the hidden layers in an artificial neural network.

- Figure 2C shows how the number of neurons with divergent responses (i.e., responding to one of the paired odors but not the other) drops in the easy case.

- Figure 3C shows how the number of neurons with divergent responses increases in the difficult case.

- Figure 4 shows 3D plots with the first three principal components in both cases. In the easy case, the representations start off far apart, then move somewhat closer together. In the difficult case, the representations start off completely overlapping, but separate by the end.


I have nothing insightful to add, just want to express my thanks for this post. I've been meaning to read the monosemanticity paper since I first heard about it maybe a month ago, but haven't found the time to. This provides a nice overview and more motivation to have another crack at it. This seems like a very important development for interpretability.


>Feature #2663 represents God.

This is a pretty good choice, but you missed a great opportunity to pick out the Grandmother Neuron (assuming this model had one).


Long-time reader, first-time commenter here. Thanks for this piece; it left me in equal parts awe and inspiration. I think there are some implications for medicine here that might actually be more important than the AI safety pursuits. You inspired me to make my first Substack post about it, so thanks for that!


> 100 billion neurons

Nit: GPT probably has on the order of 100 billion parameters, but since every neuron has about 100-1000 parameters in these networks, that amounts to 'only' 100 million to 1 billion neurons.


AI predicts a future; it does not project a future. That is why AI is not conscious: it cannot project goals for itself; it does not care about anything, not even itself. I think this is because human time projection -- an imagined future -- has a determining effect, and machine time is only determined by past prediction. AI does not want anything of us. It has no future, whereas our consciousness is entirely in the nonexistent future. We are in a state of incessant future desire; AI is fully sated always, since it draws only from the past.

Next, the Luddites can break the looms but the AI revolution is here. I have no fear of a machine that turns garbage into golden patterns. Think of those ENORMOUS dump trucks for strip mining with the teeny cab where the driver sits. Who do you fear? Not the truck. That is AI. So it absolutely must be monitored and regulated. But not because it wants to eat us! AI is still like a very powerful chainsaw that is very hard to control, but gosh does it cut. You wouldn't want a bunch of D&D Valley Boys playing with it without some government oversight. Again I am pro-regulation.

But... humans fetishize our own inventions. The Valley Boys would love to imagine themselves gods, but Q* is still "locked in," still fundamentally solipsistic. Still all "transcendental consciousness" with no access to the transcendent things in themselves. All the training data is human-made; it is made of us, and we fall in love with our machines because they seem like us. But they care not a jot for us one way or the other. So these man-made golems are nevertheless very dangerous, since they will follow the algorithms we put in them relentlessly and with no common sense at all. I think -- though I do not yet know how to describe this -- that the hallucination problem, as another comment said, is a "feature of the architecture," not really a defect of it.


"check how active the features representing race are when we ask it to judge people"

Ah, this is not going to work, actually. The beauty of AI ethics is that there are several definitions of fairness, and they are mathematically incompatible with each other. Not activating the features representing race is aligned with a relatively dumb definition of fairness but is not compatible with several other definitions.

Let's say that our model takes a look at a lot of data about a person, including race, and decides if they should receive a loan. If the system is trained to make a decision based solely on the probability of the loan being repaid, the chances are that it will 1) use the race information, 2) reject a higher proportion of people from a certain race A, 3) reject some people of race A who would have been accepted if their race were different but all other features the same. It is also possible that the system will 4) reject people of race A so much that the people of race A who do receive the loan will default on it less frequently than people from other races.

Each of these 4 system features may be taken as a definition of discrimination. The problem is that if we eliminate 1) (which is easy and often done in practice; just drop the race attribute from the dataset), the other three measures of discrimination can and often will get worse.
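A toy numerical illustration of points 2) and 4) above, with entirely made-up numbers, just to show that the two readings of "discrimination" can point in opposite directions:

```python
# Made-up loan outcomes per group: accepted, rejected, and defaults among the accepted
groups = {
    "A": {"accepted": 30, "rejected": 70, "defaults": 1},
    "B": {"accepted": 60, "rejected": 40, "defaults": 6},
}

for name, g in groups.items():
    acceptance_rate = g["accepted"] / (g["accepted"] + g["rejected"])
    default_rate = g["defaults"] / g["accepted"]
    print(f"group {name}: acceptance {acceptance_rate:.0%}, "
          f"default rate among accepted {default_rate:.1%}")

# Group A is accepted half as often (measure 2), yet its accepted members
# default less (measure 4) -- two incompatible readings of the same decisions.
```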


Non-technical people tend to attribute "thought" to "AI", which isn't actually artificial intelligence but machine learning. This isn't a nitpick; it's a difference in principle.

The best alternatives I can offer to how to think about the current crop of "AI":

1) It works like the unconscious or, to put it another way, like "System 1" thinking. Just as the unconscious doesn't actually "think" but approximates a gut feeling or a leaning toward something, "AI"/ML does the same.

2) This doesn't mean ML is useless for actual artificial intelligence (as in a first draft of an AI that takes its own independent actions), quite the contrary: human/animal "natural" intelligence consists of multiple parts, and one of those parts is almost identical to ML in function. But to ascribe thought to it is missing the point.


As someone who recently attended a neuroscience conference, let me say that our brains are absolutely doing this type of thing.

Some examples:

"Place cells" are neurons in the hippocampus that fire when you are in a particular place. One physical location is not encoded as a single cell, but rather as an ensemble of place cells: so neurons A + B + F = "this corner of my room" and A + E + F = "that corner of my room".

"Grid cells" are neurons in the entorhinal cortex (which leads into the place cells) that encode an abstract, conceptual hexagonal grid, with each grid hexagon represented as an ensemble of grid cells.

Memories are also stored as ensembles of cells, or to be more precise, as collections of synaptic weights. So much so that, in one study, they located an ensemble of cells that collectively stored two different (but related) memories, and by targeting and deleting specific synapses, they could delete one memory from the set of cells without deleting the other.

https://www.youtube.com/watch?v=kMvvWikHklA

I also highly recommend this presentation of a paper where they found they could trace - and delete - a memory as it was first encoded in the hippocampus, then consolidated in the hippocampus, then transferred to the frontal cortex. It doesn't directly have to do with superposition or simulating higher-neuron brains, but it is really cool.

https://www.youtube.com/watch?v=saFDeGTYnRU
