I've recently come across the concept of degeneracy as it relates to complexity-robustness-evolvability while exploring host/interaction/microbial ecosystems. Perhaps there is some help for the start of an answer to the question about the evolutionary gate there?
See, for example,
http://pure-oai.bham.ac.uk/ws/files/2920238/whitacre_theoetricalBiology.pdf
I've been considering that the complexity shown in the linked biochemical-pathways diagram provides enough diversity to cushion the system when something catastrophic happens. As the complexity decreases, there are fewer compensatory pathways that can step in. When few enough pathways remain, a failure cascade occurs. The path of that cascade is unknown (and probably unknowable) until the failure starts.
(While very complex, the diagram of biochemical pathways is still incomplete-- it is a listing of current knowledge.)
The growth and evolution of biological intelligence has repeatedly faced catastrophic events. Perhaps it has gotten here because of the catastrophic events.
> If all it took is something like a transformer network and back-prop, why was intelligence not one of the first ones out of the evolutionary gate?
I think that's a really interesting question.
Part of the answer, I think, is that real biological brains come massively pre-trained. Animals, including humans, aren't born with a brain that needs to learn everything from scratch; they're born with a brain that's already pretty well structured for the vast majority of things they're going to need to do, with limited "blank spaces" for figuring out the finer details and learning about their specific environments. The process of finding the optimal weights for these pre-trained parameters (or rather, the optimal genes that act in the right way to give the parameters the correct values while also interacting with everything else) is incredibly slow because it's done by natural selection on an animal-lifetime timescale.
I mean, you seem to need a multicellular organism first.
Then you need to work out some form of signalling that's fast enough and targeted enough to do computations (endocrine is too slow and can't be used for computation because it's diffuse; you need wiring, not dye).
Then you need a centralised brain (otherwise it's not fast enough), which would seem to only be useful in an organism that is motile and has significant anatomy (i.e. an animal, and not a sponge or jellyfish at that).
Then you need a long series of evolutionary landscapes which advantage "more brain". Remember, brains are super-expensive to run (*especially* as you approach human intelligence), and they take a long time to train (*particularly* as they get more complex), AND at high levels of complexity they're only useful with manipulators (could an orca with human intelligence achieve much that it can't already? Hard to see how). A *lot* of things have to line up for the path to human intelligence to open.
And we still went from chordates to humans in less than a third of the time (and far, far fewer generations) that it took to invent eukaryotes.
First, yes, you're right, it is a possibility and we're at the stage of experimenting with how far simple approaches will take us.
Second, other apex features implemented in biology could be implemented in a much simpler way (like flying).
> Even if you assume the brain does backpropagation, what are the odds that intelligence is based on something so .. simple as the chain rule?
A couple of points:
* What are the odds that any computable function can be computed with Rule 110 [1]? It seems implausible, but it's true, so maybe that indicates humans don't have a proper or natural appreciation for how much complexity can emerge from simplicity. (A quick sketch of Rule 110 follows after this list.)
* Cell metabolism is likely complicated because it manages with many different interconnected demands in a complex environment. If intelligence is just some kind of information processing, then it can be divorced from the type of complexity needed for cell metabolism.
[1] https://en.wikipedia.org/wiki/Rule_110
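To make the Rule 110 point concrete, here's a minimal sketch of iterating it (my own toy code, nothing from the article): each cell's next state depends only on itself and its two neighbours, yet this simple local rule is known to be Turing-complete.

```python
# Toy Rule 110 simulation: each cell looks only at its two neighbours.
import numpy as np

RULE = 110
# map each 3-cell neighbourhood (encoded as a number 0-7) to the centre cell's next state
LOOKUP = np.array([(RULE >> i) & 1 for i in range(8)], dtype=np.uint8)

def step(row):
    left, right = np.roll(row, 1), np.roll(row, -1)     # periodic boundary
    return LOOKUP[4 * left + 2 * row + right]

row = np.zeros(80, dtype=np.uint8)
row[-1] = 1                                             # single live cell on the right
for _ in range(40):
    print("".join(".#"[c] for c in row))
    row = step(row)
```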
> Cell metabolism is likely complicated because it manages with many different interconnected demands in a complex environment.
Meant to say, "Cell metabolism is likely complicated because it manages interactions with many different interconnected demands within a complex environment."
I think this misses the meaning of "different interconnected demands". The demands of intelligence are all likely informational with a small set of physical correlates for information. As with Rule 110, these can be served pretty well by a simple set of information processing rules.
The demands of cellular metabolism are not so easily reduced, because literally every cell in your body, with all of its different *physical* functions, uses the same pathways that operate on multiple different food sources that are available in our environment. Your brain, and your muscles, and your liver, and your skin all behave very differently physically but still use the same metabolic pathways and signals.
So permute every possible food source with every possible physical behaviour of every type of cell and you'll see how the complexity emerges. Neurons are one type of cell whose purpose is presumably reliable information processing, so the multiplicative effect just doesn't arise here.
That said, integrating the information processing with various signals from your body, such as hunger, likely would be more complicated. A large portion of our brain is dedicated to these autoregulatory processes, but I'm not sure this is the "intelligent" part you're talking about.
"In the classic experiment by Latane and Darley in 1968, eight groups of three students each were asked to fill out a questionnaire in a room that shortly after began filling up with smoke. Five out of the eight groups didn’t react or report the smoke, even as it became dense enough to make them start coughing."
Anytime I read "this study was carried out in the 60s/70s", my hackles immediately rise. Has anyone repeated this recently? Because we've seen how famous studies from that era turned out to be riddled with errors and deliberate set-ups. Looking up this particular experiment gives me this:
"Experiment 1. Where There's Smoke, There's (Sometimes) Fire
They had subjects began to fill out questionnaires in a room to which they began to add smoke. In one condition the subject was alone. In another three naive subjects were in the room. In the final condition one naive subject and two confederates who purposely noticed and then ignored the smoke (even when the room became hazy from all the smoke).
75% of alone subjects calmly noticed the smoke and left the room to report it. But only 10% of the subjects with confederates reported it. Surprisingly, in the three naive bystander condition only 38% reported the smoke.
Most subjects had similar initial reactions. Those that didn't report it all concluded that the smoke wasn't dangerous or was part of the experiment. No one attributed their inactivity to the presence of others in the room.
Other studies have shown that togetherness reduces fear even when the danger isn't reduced. It may have been that people in groups were less afraid and thus less likely to act. Or people were inhibited to show fear in a group situation. However, from post-interviews it was clear that people didn't act because they concluded the situation wasn't a threatening situation."
The thing here is, that conclusion was correct. The smoke *was* part of the experiment! So being aware that you are participating in a study to test *something*, and then the room starts slowly filling with smoke but there is no other indication of anything wrong, it was not simply the 'bystander effect' but correctly judging "okay, this is all part of the test to see if we'll panic or what".
I agree that the broad conclusion - in groups, we tend to play 'follow the leader' - is correct, but the fact that people being tested knew that it was *some* kind of a test and then concluded "this smoke is part of it" can't be ignored, either.
"Other studies have shown that togetherness reduces fear even when the danger isn't reduced. It may have been that people in groups were less afraid and thus less likely to act."
People who are alone think "maybe something is wrong?" and people with company - either the other volunteers or the plants - go "okay, nobody else is freaking out, it must be part of the test"? I do agree that people tend to calibrate their behaviour in line with the behaviour of others around them, but I find it hard to believe people just tamely sat in a smoke-filled room, coughing as they filled in questionnaires, *unless* they thought it was all part of the study and not "crikey, the place is on fire!"
Hmm... maybe there's an intrinsic sense of "I'm being tested to see if I freak out before these people, so I won't react". In which case, if there *is* a danger and you are around other people and you *think* you are being tested, you're more likely to suppress that instinct?
I'd say that the attention the world has on the tech scene in general would probably count as "being tested", wouldn't you?
“38% of people report the smoke in a group of three naive subjects” could mean that 95% of the time one of the three people reports it, and 5% two people do.
I mean, this quote in the article sort of addresses your replication concern:
"(I’ve read a number of replications and variations on this research, and the effect size is blatant. I would not expect this to be one of the results that dies to the replication crisis, and I haven’t yet heard about the replication crisis touching it. But we have to put a maybe-not marker on everything now.)"
As far as "knowing it was some kind of test" I'm not sure I agree that this would lead to lower incidence of reacting as opposed to increased incidence of reacting. I could see it going either way. This also seems like the kind of common issue with psychology experiments that there are at least _some_ means of measuring
Not sure what the phenomenon is called, but "humans change behaviour due to an awareness that they're in a psychology experiment" is probably an even more specific phenomenon.
At this point I've read enough psych experiments that if I went in for one, and you put me in a chair to fill out a questionnaire, and you _didn't_ pull some crazy trick on me, then I'd be pretty disappointed.
There is some rationality to this behavior though. Smoke means a small fire nearby. When it gets bigger, leave in the opposite direction. If others are showing apathy, perhaps they know more about it than you do - perhaps this happens every day in the lab. A fire alarm means a low probability of fire of unknown size, far away, unknown direction, which could soon block all exits. That said, yeah, you should probably react to the smoke too.
Cutting the computer in half isn't actually as silly as you make it sound!
The problem with backprop is you have to apply it to one layer of the neural network at a time. The result of layer k is required before you can backprop k-1. This reduces the degree to which you can parallelise or distribute computation.
One of the HN commenters pulled out this quote from the paper:
>>> In the brain, neurons cannot wait for a sequential forward and backward sweep. By phrasing our algorithm as a global descent, our algorithm is fully parallel across layers. There is no waiting and no phases to be coordinated. Each neuron need only respond to its local driving inputs and downwards error signals. We believe that this local and parallelizable property of our algorithm may engender the possibility of substantially more efficient implementations on neuromorphic hardware.
In other words, using the predictive coding approach, you can be updating the weights on layer k asynchronously without waiting for the weights to be updated on subsequent layers. The idea seems to be that the whole system will eventually converge as a result of these local rules. This approach lets you scale out by distributing the computation between different physical machines, each running the local rules on a piece of the network. With backprop this doesn't work because at any given time, most of the weights can't be updated until the calculations for other weights complete.
Right now this hasn't made for any huge performance wins because (a) the researchers didn't put a lot of effort into leveraging this scaling ability, and (b) they have to do a few hundred iterations of updating in order for the algorithm to converge on the value you'd get out of backprop. The hope is that the opportunities for scaling outweigh the disadvantages of needing to do multiple passes to get convergence.
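To make the scheduling constraint concrete, here's a rough sketch (my own toy code, not from the paper): in the backward pass of a simple linear stack, layer k's error term can't be computed until layer k+1's is finished, so the loop is strictly sequential across layers.

```python
# Backward pass for a simple linear stack of layers.
import numpy as np

def backward(weights, grad_output):
    """Gradient of the loss w.r.t. each layer's activations, computed back to front."""
    grads = [None] * (len(weights) + 1)
    grads[-1] = grad_output
    for k in reversed(range(len(weights))):        # layer k waits on layer k+1
        grads[k] = weights[k].T @ grads[k + 1]
    return grads

# A predictive-coding-style scheme instead keeps a local error node per layer and
# nudges *every* layer a little on every iteration, so no layer waits for another --
# at the price of needing many iterations to settle, as the quote above says.
```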
I don't see any mention of nonlocality in the article, and to the best of my knowledge backpropagation is an entirely local process, which helps parallel architectures such as GPUs to perform it efficiently.
The issue with backprop seems to be the fact that it goes backwards, while neurons can only perform forward computations.
"Firstly, backprop in the brain appears to require non-local information (since the activity of any specific neuron affects all subsequent neurons down to the final output neuron). It is difficult to see how this information couldbe transmitted ’backwards’ throughout the brain with the required fidelity without precise connectivity constraints"
I don't agree with Elazar - I'd say it's nonlocal in an artificial network too.
If you're doing a backprop step, at any given time you'll be calculating the updates for a given layer of the network. This can be efficiently parallelised (because it's a matrix multiplication) but during this time all the other layers are sitting twiddling their thumbs - you need the gradients from layer k+1 before you can update layer k. This limits parallelization to the size of each layer, so networks that have many (comparatively small) layers can't scale out as much as you'd like.
If we were modelling a flock of birds this way we'd say "OK, first place birds 1-10, then use their positions to place 11-20, then 21-30 and so on". That lets you place ten birds at a time, max, no matter how many people are doing the work. What birds actually do when they flock is to continuously look at their immediate neighbours and adjust their position on an ongoing basis to maintain distance - that's locality, and it's more like the predictive coding approach in the paper you linked. You could have a separate processor positioning each bird (neuron) continually, allowing arbitrary parallelization.
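Here's a toy version of that flock picture (purely my own illustration, not from the paper): every bird adjusts from its nearest neighbour simultaneously on each tick, which is exactly the kind of rule you could hand to one processor per bird.

```python
# Local flocking rule: all birds update at once from their nearest neighbour.
import numpy as np

rng = np.random.default_rng(1)
pos = rng.uniform(0, 100, size=(50, 2))               # 50 "birds" on a 2D plane

def local_step(pos, target_dist=5.0, rate=0.1):
    diffs = pos[:, None, :] - pos[None, :, :]          # pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1) + np.eye(len(pos)) * 1e9
    nearest = dists.argmin(axis=1)                     # each bird's closest neighbour
    offset = pos - pos[nearest]
    norm = np.linalg.norm(offset, axis=1, keepdims=True) + 1e-9
    # move apart if too close, together if too far -- every bird at the same time
    return pos + rate * (offset / norm) * (target_dist - norm)

for _ in range(200):
    pos = local_step(pos)
```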
My understanding is that backpropagation in the brain happens all the time, it's just that it happens within neurons as opposed to across neurons. I've not heard about it happening between neurons (though absence of evidence does not preclude...) so that seems like the difference here.
[Disclaimer, it has been over a decade since I last did anything in neuroscience, so the following may be a little dated or subject to memory decay.]
For those with less neuroscience exposure: Each time a synapse fires, it causes a small increase in the voltage of the cell body on the receiving neuron. The size of that increase is entirely dependent on the 'strength' of the connection between the sending neuron and the receiving neuron. An action potential happens when the cumulative effect of all inputs from all the synapses connected to the neuron (can easily be tens of thousands of input synapses - though not necessarily all from different neurons) causes the internal voltage of the receiving cell to hit a certain threshold. Once that happens, an irreversible process commits a bunch of ion channels to a complicated dance that sends an electrical signal down the axon so the neuron can activate its own synapses and continue the dance at a higher level of abstraction.
When an action potential fires along an axon, a second action potential back-propagates in the other direction, from the cell body out into the dendrites. (This is the first context I had ever heard of back-propagation in, and I have always assumed the neural-network community got the term from the older neuroscience literature.) This signal gives an extra 'boost' at each of the dendrites that fired just prior to the action potential (and thus contributed to the cell body reaching the threshold), making the connections stronger so next time they'll be able to contribute more to the action potential and get to the threshold voltage with fewer inputs.
That's all fine in theory, but we prefer to see this happen at the molecular level so we really know what's going on. Inside the cell, near the cell wall at each receiving dendrite, there's a pool of a kind of protein with 6 subunits that can enhance the activation of the dendrite - but only after it is at least partially activated. One subunit is 'activated' after the dendrite fires, making it able to help promote dendrite firing in the future. If nothing comes of it, that subunit gets 'deactivated' afterwards by normal cellular processes. However, if the dendrite firing caused an action potential, there will be back-propagation to the dendrite body. When this electrical signal reaches one of the already-activated 6-subunit proteins, it causes another subunit to get activated (so long as one has already been activated - the activation of one subunit makes activation of other subunits happen more easily).
Now that multiple subunits are activated on that protein, deactivation of each subunit happens a little more slowly. If you do this cycle one or two more times you get all six subunits activated, and solidify a much stronger 'connection' that isn't easily deactivated. Taken up to the level of abstraction of the cell, this means the firing of that dendrite causes a larger voltage increase within the neuron because of the back-propagation's enhancing effect on the dendrite. That dendrite (and all the others that helped contribute to reaching the action potential threshold) is now capable of contributing more to reaching the threshold required to signal a future action potential.
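If it helps, here's a deliberately cartoonish toy model of that subunit story (all class names, counts and probabilities are invented for illustration, not taken from the neuroscience literature):

```python
# Cartoon of the mechanism described above: a dendrite's "strength" is a count of
# activated subunits; firing tentatively activates one, a back-propagating action
# potential locks in more, and unreinforced activation slowly decays away.
import random

class ToyDendrite:
    def __init__(self):
        self.active_subunits = 0           # 0..6, acts like connection strength
        self.recently_fired = False

    def fire(self):
        self.recently_fired = True
        self.active_subunits = min(6, self.active_subunits + 1)

    def back_propagating_spike(self):
        # the bAP only boosts dendrites that fired just before the action potential
        if self.recently_fired:
            self.active_subunits = min(6, self.active_subunits + 1)
        self.recently_fired = False

    def decay(self):
        # more activated subunits -> slower deactivation
        if self.active_subunits and random.random() < 1.0 / (1 + self.active_subunits):
            self.active_subunits -= 1

    def contribution(self):
        return 1.0 + 0.5 * self.active_subunits   # stronger synapse, bigger voltage bump
```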
My understanding is that the difference between the biological mechanism of back-propagation and what's used in neural networks is that the neuron signal happens at a more 'local' level in the brain. The theory is that there's no 'dog' neuron, but the idea of a dog is encoded in a network of signals - the combined effect of lots of neurons firing in specific sequences/patterns. So in a computer NN, back-propagation happens at the level where the 'action potential' you're driving toward is the concept itself ('dog in picture', 'user clicked on ad') as opposed to a more local interaction between layers in the network.
"But there’s no process in the brain with a god’s-eye view, able to see the weight of every neuron at once."
This was what was bothering me when I studied ML in university and made me believe that the whole thing was BS. It's the first time I've read a phrase where someone put a finger on it.
Except modern ML as a field was never explicitly about replicating the processes in the brain - NNs + backprop are useful regardless of whether they can be unified with biological systems.
Perhaps backpropagation doesn't happen in the brain... but loops back through the environment.
So, see a vague snakish thing in the grassy stuff and your priors direct your pulse to increase, hormones to be released, and your attention (or eye direction and focus) swivels to the potential threat. You look properly at the snakish thing in the grassy stuff and it is resolved into a snake in the grass, or a hosepipe. No backpropagation is required... just different weights being input to the predictive processes from another beginning.
With a highly parallel system you don't have to 'refine' the first process through the brain, you just redo it (although many of the 'priors' will be available more quickly because you are now focused on the snakey thing, not what you are going to have for lunch).
David Chapman – my favorite 'post-rationality' thinker – was once an AI researcher and he thinks that a lot of 'intelligence' requires embodiment. I _think_ part of the reason is what you're gesturing at, i.e. using one's environment to 'think'.
I'm an ML PhD student - I read this paper when it came out. My impression was that the paper does elide the difference between *predictive coding networks* - which are a type of NN inspired by *predictive coding theory* in neuroscience - and the neuroscience theory itself, and that this has led to confusion on the part of people who might miss this.
From the paper:
"Of particular significance is Whittington and Bogacz (2017) who show
that predictive coding networks – a type of biologically plausible network which learn through a hierarchical process of prediction error minimization – are mathematically equivalent to backprop in MLP models. In this paper we extend this work, showing that predictive coding can not only approximate backprop in MLPs, but can approximate automatic differentiation along arbitrary computation graphs. This means that in theory there exist biologically plausible algorithms for differentiating through arbitrary programs, utilizing only local connectivity."
The key phrase is "biologically plausible", which basically just means an algorithm which is vaguely similar to something in the brain, at a given level of abstraction. I think practically this is just another backprop alternative (though certainly an interesting one) which like the others, will probably turn out to be less useful than backprop.
On the topic of the link between neural networks and PCNs however, there is a more recent paper showing a much more direct link than the one linked in your post:
This one is more interesting to me because it shows that predictive coding-like dynamics naturally emerge when you train an NN with backprop in a particular way, rather than having to do the more involved backprop approximation stuff in the original paper.
Author of the paper here. Really excited to see this get featured on SSC and LW. Happy to answer any questions people have.
Here are some comments and thoughts from the discussion in general:
1.) The 100x computational cost. In retrospect I should have made this clearer in the paper, but this is an upper bound on the actual cost. 100s of iterations are only needed if you want to very precisely approximate the backprop gradients (down to several decimal places) with a really small learning rate. If you don't want to exactly approximate the backprop gradients, but just get close enough for learning to work well, the number of iterations and cost comes down dramatically. With good hyperparam tuning you can get in the 10x range instead.
Also, in the brain the key issue isn't so much the "computational cost" of the iteration as it is time. If you have a tiger leaping at you, you don't want to be doing 100s of iterations back and forth through your brain before you can do anything. If we (speculatively) associate alpha/beta waves with iterations in predictive coding, then this means that you can do approx 10-30 iterations per second (i.e. 1-3 weight updates per second), which seems about in the right ballpark.
The iterative nature of predictive coding also means it has the nice property that you can trade-off computation time with inference accuracy -- i.e. if you need you can stop computation early and get a 'rough and ready' estimate while if something is really hard you can spend a lot longer processing it.
2.) Regarding the parallelism vs backprop. Backprop is parallel within layers (i.e. all neurons within a layer can update in parallel), but is sequential between layers -- i.e. layer n must wait for layer n+1 to finish before it can update. In predictive coding all layers and all neurons can update in parallel (i.e. don't have to wait for layers above to finish). "A Wild Boustrephedon" explains this well in the comments. Of course there is no free lunch and it takes time (multiple iterations) for the information about the error at one end of the network to propagate through and affect the errors at the other end.
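If it helps to see it concretely, the flavour of the update rules is roughly this (a simplified numpy sketch with identity precisions and made-up layer sizes and hyperparameters; this is my paraphrase of the general scheme, not the actual code from the paper):

```python
# Predictive-coding-style loop: every hidden layer's value nodes relax using only
# its own prediction error and the error one layer up, so all layers can move at once.
import numpy as np

def f(x):  return np.tanh(x)
def df(x): return 1.0 - np.tanh(x) ** 2

rng = np.random.default_rng(0)
sizes = [10, 32, 32, 4]                                           # input -> ... -> label
W = [rng.normal(0, 0.1, (sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]

def train_step(x_in, y_target, n_iters=100, dt=0.1, lr=1e-3):
    # clamp input and label; hidden value nodes relax to minimise total squared error
    x = [x_in] + [np.zeros(s) for s in sizes[1:-1]] + [y_target]
    for _ in range(n_iters):                                      # iterations trade time for accuracy
        eps = [x[l + 1] - W[l] @ f(x[l]) for l in range(len(W))]  # local prediction errors
        for l in range(1, len(sizes) - 1):                        # all layers could run in parallel
            x[l] += dt * (-eps[l - 1] + df(x[l]) * (W[l].T @ eps[l]))
    eps = [x[l + 1] - W[l] @ f(x[l]) for l in range(len(W))]
    for l in range(len(W)):                                       # weight updates are local too
        W[l] += lr * np.outer(eps[l], f(x[l]))
```

(Called as `train_step(x_in, y_target)` with a length-10 input and a length-4 target vector; fewer iterations give a rougher approximation to the backprop gradient, which is the time/accuracy trade-off from point 1.)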
Personally, I would love to see a parallel implementation of predictive coding. I haven't really looked into this because I have no experience with parallelism or a big cluster (running everything on my laptop's GPU) but in theory it would be doable and really exciting, and potentially really important when you are running huge models.
One key advantage of neuromorphic computers (in my mind), beyond being able to simply 'cut the computer in half', is that it is much more effective to simulate truly heterarchical architectures. A big reason most ML architectures are just a big stack of large layers is that this is what is easily parallelizable on a GPU (a big matrix multiplication). The brain isn't like this -- even in the cortex, though there are layers, each layer is composed of loads of cortical columns, which appear to be semi-self-contained; there are lateral connections everywhere, and cortical pyramidal cells project to and receive inputs from loads of different areas, not just the layer 'above' and the layer 'below'. Simulating heterarchical systems like this on a GPU would be super slow and inefficient, which is why nobody does this at scale, but could potentially be much more feasible with neuromorphic hardware.
3.) The predictive coding networks used in this paper are a pretty direct implementation of the general idea of predictive coding as described in Andy's book Surfing Uncertainty, and are essentially the same as in the original Rao and Ballard. The key difference is that to make it the same as supervised learning, we need to reverse the direction of the network, so that here we are predicting labels from images rather than images from labels, as in the original Rao and Ballard work. Conversely, this means that we can understand what 'normal' predictive coding is doing as backprop on the task of generating data from a label. I explain this a bit more in my blog post here https://berenmillidge.github.io/2020-09-12-Predictive-Coding-As-Backprop-And-Natural-Gradients/
4.) While it's often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn't really all that clear cut. Predictive coding itself actually has a bunch of implausibilities. Firstly, it suffers from the same weight transport problem as backprop, and secondly it requires that the prediction and prediction error neurons are 1-1 (i.e. one prediction error neuron for every prediction neuron), which is way too precise connectivity to actually happen in the brain. I've been working on ways to adapt predictive coding around these problems, as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and it's unclear if the remedies proposed here will scale to larger architectures.
Also there are persistent problems with how to represent negative prediction errors: either negative errors are represented as lower-than-average firing (which requires a high average firing rate to get good dynamic range, and is energy inefficient), or else you use separate populations of positive and negative prediction error neurons (which must be precisely connected up). Interestingly, in the one place we know for sure there are prediction error neurons -- the dopaminergic reward prediction errors in the basal ganglia involved in model-free reinforcement learning -- the brain appears to use both strategies simultaneously in different neurons. I don't know of any research showing anything like this in cortex though.
Also, there's not a huge amount of evidence in general for prediction error neurons in the brain -- see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7187369/ for a good overview -- although it turns out to be surprisingly difficult to get unambiguous experimental tests of this. For instance if you have a neuron which responds to a spot of light, then is it a prediction neuron which 'predicts' the spot, or is it a prediction error neuron which predicts nothing and then signals a prediction error when there is a spot?
A final key issue is that predictive coding is only really defined for rate-coded neurons (where we assume that neurons output a scalar 'average firing rate' rather than spikes), and it's not necessarily clear how to generalize and get predictive coding working for networks of spiking neurons. This is currently a huge open problem in the field imho.
6.) The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning -- just one that has backprop as a subroutine. Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single 'particle' following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs.
All of this context is so cool and helpful. In particular, I have wondered about #5, and am looking forward to digging into those links. Thanks for taking the time.
Also, a question, for you or anyone else. I'm a software developer of 10 years, and I'm in the process of reskilling so I can do more math-y computation-y stuff - I recently enrolled in University of Iowa's postgraduate mathematics certificate program. I'm tolerably smart and capable, but I've done abominably in college, badly enough that I don't expect that I can get into any graduate program that has entry requirements.
I want to do high impact work, and I'm drawn to AI applications. I doubt I'm smart enough to make meaningful academic contributions (plus I think I'm locked out of academia), but I suspect there are lots of interesting applications still waiting to be explored.
Do you have any suggestions for application spaces that might be interesting in the next several years? I'm a very capable Scala developer, and I've been told I will be incredibly valuable if I get proficient with ML. Orgs doing auto ml and biology lab automation really appeal to me.
There are neuro/bio academic labs that hire software developers to write clean(er) software for their research (a practice that is becoming more prevalent with the increasing role of computation/ML in these fields). Might be an option that ticks all your boxes.
Interesting thought. I guess the other thing I'm iffy about is that I consistently hear terrible things about bioinformatics as a field, and I've been kind of convinced that structurally speaking, you can't do good engineering there. So it's kind of a waste of engineering skills to try. Don't know if it's true, but hence auto ML and bio lab automation companies - they seem like tech-first, rather than science first. Just going off of vague generalizations though. Searching with your suggestion I found some interesting jobs already.
I must admit I can't really answer this well personally, because I'm just finishing my PhD and have essentially always been solely in academia.
I will say though, that if you want to go the academic route, I don't think you are actually as locked out as you think. I know several people in my PhD program who started after working for many years or even decades (some are 50+). I also think you should definitely be able to go back and get a masters in computer science or ML or something relevant should you want to, especially with your software engineering experience. Personally, I found doing a master's (in the UK where they are only 1 year) super helpful for me to go from zero to a reasonable level of competence in ML. If you have a masters (and especially if you get a publication out of it), you would stand a really good chance of being able to get into a PhD program if that's what you wanted to do.
Thanks Beren! I'd rather imagine my neurons are performing highly parallelized advanced Hamiltonian MCMC algorithms too, but wouldn't put it past the sneaky blighters to be indulging in a bit of backpropagation on the side, though it sounds complicated, immoral and possibly illegal 😉
I have a few questions mostly related to your comments here (as I haven't read the paper so far). I'll add them below for easier answering and condensation (reminder: you can condense all of them by clicking on the grey line next to this post). Thanks if you have the time for answering (or others can), but also feel free to skip them as many will be rather off-topic.
You write: "Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where"
"Next, we compute the posterior for the labels zi and the cause-specific parameters ϕk = (μk, τk), k=1,... by running a Gibbs sampler for 10 iterations, which is sufficient for convergence of the (now updated) central hypothesis. In each iteration the labels are re-sampled according to their full-conditional probabilities and the cause-specific are parameters re-estimated accordingly." (quote from the paper"
And erm... at best, this may serve as a very meta/high level approximation of the actual mechanisms at work. To be clear, I don't think we need to model this on a neural level, but I just can't imagine the brain internally generating a few dozens or hundreds of potential candidates for a category, let alone "running a Gibbs sampler for 10 iterations".
... and after reading your comment, I would expect you to be less pessimistic of this being a possible description of the brain's inner workings than I am.
Do you have any thoughts on how much "approximate monte-carlo sampling" you would expect within the brain?
My own gut feeling for the brain forming mental models is that there are some mechanisms within the brain close to an evolutionary algorithm: Spawn several candidates for predictive models, then assign weights to them depending on their performance; discard bad ones and clone good ones (with some variation). In a programming environment, this would be very simple; on a biological level, I don't see how this is possible.
Do you have any ideas for biologically plausible mechanisms close to this? Should we expect various neural clusters to synchronize with each other ("wrong" neural columns changing to imitate "good" ones)?
Good questions! This is all pretty speculative at the moment, but my own intuitions for this centre mainly around algorithms like particle filtering. I.e. we could imagine a group of neurons representing a set of particles, whereby each neuron starts off initialized in a different location due to neural noise etc, and then follows its local gradient towards a solution while being regularly perturbed by noise -- this is called stochastic Langevin dynamics (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.3813&rep=rep1&type=pdf) and you can show that doing this is an MCMC algorithm which allows you to read out samples from the true posterior rather than just a point prediction. The idea is that while each neuron's trajectory would effectively encode a single potential solution, the population as a whole would tend to cover more of the posterior distribution (or at least its modes if multimodal) -- and you can generalize this to MCMC trajectories with better convergence and coverage properties such as Hamiltonian MCMC.
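As a toy picture of that idea (a made-up 1-D example of unadjusted Langevin dynamics, not taken from the linked papers): a population of particles, each doing noisy gradient ascent on a log-density, ends up spread over the posterior rather than collapsing onto a single point estimate.

```python
# Population of noisy gradient followers sampling a toy bimodal "posterior".
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    """Score function of 0.5*N(-2,1) + 0.5*N(+2,1)."""
    c1, c2 = np.exp(-0.5 * (x + 2.0) ** 2), np.exp(-0.5 * (x - 2.0) ** 2)
    p = 0.5 * c1 + 0.5 * c2
    dp = -0.5 * c1 * (x + 2.0) - 0.5 * c2 * (x - 2.0)
    return dp / p

particles = rng.normal(0.0, 0.1, size=500)       # everyone starts near zero
step = 0.05
for _ in range(2000):
    particles += 0.5 * step * grad_log_p(particles) \
                 + np.sqrt(step) * rng.normal(size=particles.shape)
# a histogram of `particles` now has mass around both modes (-2 and +2),
# i.e. the population represents the posterior, not a single point estimate
```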
You can expand on this idea and imagine repulsive forces (maybe implemented by lateral inhibition) which keep the neurons from all collapsing into the same attractor and instead force them to explore different modes which may be more effective -- haven't looked into this. There's a general mathematical framework which relates all MCMC algorithms like this in terms of stochastic dynamics here: https://arxiv.org/pdf/1506.04696.pdf. There's also generalizations of predictive coding which work with these sorts of particles -- for instance https://www.fil.ion.ucl.ac.uk/~karl/Variational%20filtering.pdf.
This is also highly related to evolutionary algorithms, except that it's a continuous-time version whereby each organism follows its fitness gradients plus various regularisation terms plus noise, and this will allow the ultimate population to cover more of the space.
Then there's the question of how to dynamically synchronize and/or 'read out' these populations. One interesting idea in the literature comes from Hinton's Capsule Networks where he has a routing algorithm to dynamically construct 'feature graphs' from sets of 'capsules' which each represent features of a visual scene (https://arxiv.org/pdf/1710.09829.pdf?source=post_page---------------------------). I can easily imagine some algorithm like this occurring between cortical columns which would allow higher level regions to modulate and select the most useful modes of the posterior at the level below for further analysis.
Some other interesting papers which made me think more deeply about this, which you might find interesting are:
You use the free energy principle (FEP) within the paper. As the FEP is a constant source of confusion, could you comment on the motivation and implications of using it? Where does it come into play and what would be missing without it?
I don't believe I use the FEP within the paper. I use the variational free energy (VFE) as an objective function to derive the predictive coding algorithm, but this is really just a standard case of variational inference, where the VFE is just the negative ELBO (in more standard ML terminology). Ultimately, all predictive coding is, is an EM variational inference algorithm making Gaussian assumptions about the generative model and approximate variational distribution.
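Spelled out a bit (my shorthand, under the Gaussian assumptions just mentioned): the variational free energy is $\mathcal{F} = \mathbb{E}_{q(z)}[\log q(z) - \log p(x, z)] = -\mathrm{ELBO}$, and with Gaussian generative and variational distributions it reduces, up to constants, to a sum of precision-weighted squared prediction errors, $\mathcal{F} \approx \sum_l \tfrac{1}{2}\,\epsilon_l^{\top} \Pi_l\, \epsilon_l$ with $\epsilon_l = x_l - \mu_l(x_{l-1})$. So minimising the VFE, maximising the ELBO, and minimising summed prediction errors are all the same operation here.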
I must admit that I frame things and naturally think of things in terms of the VFE for historical reasons -- because I initially got interested in this field due to the FEP and since the variational interpretation of predictive coding derives from Friston (https://www.sciencedirect.com/science/article/abs/pii/S0893608003002454) a lot of the predictive coding literature uses this terminology instead of more standard ML terminology.
Thank you so much! I'm kinda embarrassed for the sloppy reading and writing I did for my questions and I really appreciate the effort you put into your answers. I'll look into the papers as soon as I can!
The predictive processing account of the brain is said to give a plausible interpretation to NMDA and AMPA receptors and to dopamine. Is there a possible connection in your model for these? How would "the brain approximates some kind of MCMC sampling" and "dopamine encodes precision" fit together? If this is already explained in one of the papers you linked, please skip this one as I haven't had a look at them yet.
I must admit I haven't really kept up with the literature on how the specific neurotransmitters work in predictive coding theory. I don't think there is a 'canonical' model of this, but I may be wrong.
The connection with precision is interesting -- in my model where it approximates backprop, there are no precisions (all are set to the identity matrix). You can put precisions back in and then this results in a kind of 'variance weighted backprop', where the effective learning rate is modulated by the variance of the gradient (intuitively this makes sense since you should probably update less based on high variance gradients). In the linear setting, applying precisions like this results in natural gradient descent, but not for the more general nonlinear case.
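To make that concrete (my notation, same Gaussian setup as in the summary above): the precision-weighted weight update is roughly $\Delta W_l \propto \Pi_{l+1}\,\epsilon_{l+1} f(x_l)^{\top}$ with $\Pi = \Sigma^{-1}$, i.e. the ordinary error-times-activity term premultiplied by the inverse covariance of the errors, so gradient components with high error variance get scaled down -- which is the 'variance weighted backprop' reading, and in the linear case works out to natural gradient descent.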
Personally, I'm not a big fan of the dopamine = precision hypothesis because I don't think it fits in with what we know about dopamine's function in general (signalling rewards and reward salience) and also the maths doesn't really work out, as precisions are inverse covariance matrices, and therefore would be between prediction error neurons within each 'layer' rather than being set in a global manner from subcortical nuclei. Once again though I'm definitely not an expert on this -- so I could be completely wrong about this.
There was a great ebook that dived into the actual neuroscience in the cortex and it looked a lot like predictive coding. The website was http://corticalcircuitry.com/ but it's currently 404ing. I DMed the author on the ACX discord server and hopefully he can get it back up.
> For instance if you have a neuron which responds to a spot of light, then is it a prediction neuron which 'predicts' the spot, or is it a prediction error neuron which predicts nothing and then signals a prediction error when there is a spot?
If these are logically equivalent, then the only meaningful difference would be in the energy expenditure. Natural selection will select for whichever permits the most energy efficient structure that responds best to the most common fitness challenges. So some parts of the brain might use prediction neurons, and some error neurons, assuming the cost of interfacing these are sufficiently low. If the interface is itself expensive, then this will only happen in larger structures.
Indeed. Neural networks and backpropagation are just a very specific example of a very general problem of fitting models to data.
The only special thing about a neural network is that it's a very very specific functional form that allows you to have a zillion input variables and several zillion parameters, and to optimise those parameters in a way that's fairly computationally efficient.
On the other hand, outside view: previous attempts to explain the brain by reference to the most complicated technology of the time (mills, hydraulic systems, etc), have typically ended up not super helpful and almost immediately hilarious-in-retrospect. Are we confident that our most-complicated-technologies have progressed far enough that this time will end up being any different?
I think it's worth noting that "Predictive coding is the most plausible current theory of how the brain works." is not a sentence that, in my estimation (current PhD in cog sci at an R1 university), would receive widespread agreement from cog sci researchers. (I feel confident that you wouldn't get majority agreement, and I'd bet against even plurality.)
Of course, that doesn't mean it's wrong -- but I think this sentence misleadingly makes it seem like there's expert consensus here, when I think in fact Scott is relying on his own judgment.
There are much more specific versions of that claim, such as "Vision and other perceptual processes combine prior expectations with bottom-up sense data", which would receive widespread agreement. But there's a huge gap between that and saying that predictive coding (a specific hypothesis about how that happens) is the most plausible current theory of how the whole brain works.
I think the latter claim would honestly be considered by many to be a fringe view, although I could be wrong about that. (Which again, obviously doesn't mean it's wrong. Just merits some caution.)
I don't think so, but it's a question I really wish I had data on. AFAIK, there's no cog sci equivalent of that "survey a bunch of economists" thing that Scott cites a lot. And even if there was, I don't even know if you could get widespread agreement on what questions to even ask, or how to phrase them. From my inside-view perspective, cog sci is *so* fractured. Like, different labs and different departments often just use utterly different frameworks/language, even when they're ostensibly studying the same thing.
That being said, here's my best guess at what the top candidates might be:
- There's a framework that I think of as the "classical" view of cog sci, building on the ideas of a computational theory of mind, language of thought, and modularity articulated by e.g. Jerry Fodor. These people tend to try to identify algorithms that manipulate symbolic representations (like the kind where you could naturally write a box-and-arrow diagram getting you from input to output), and tend to not focus on adaptive explanations that much. (Fodor's books are fantastic, if you ever want to read about this approach to philosophy of mind.)
- There's the evolutionary psych people who think that natural selection is the right paradigm to analyze the mind/brain. These people tend to emphasize "bag of tricks" views of the mind, and obviously adaptive explanations for each of those mental tools. Best represented by John Tooby and Leda Cosmides. (A combination of this view plus the classical view is beautifully articulated by Steve Pinker in his aptly-named book "How the Mind Works".)
- There's the Bayesian people, who are similar to the predictive coding people but, I think, somewhat different. (E.g. they don't emphasize minimizing surprise / prediction errors in the same way, and they tend to not argue that their framework applies so broadly. I think they would very much disagree with the Friston-style explanation of how decision making works, for example.) These people tend to articulate things the mind is doing as inference problems, identify Bayes-rational solutions to those problems, and then test whether the mind/brain approximates those solutions. Best represented perhaps by Josh Tenenbaum at MIT and Tom Griffiths at Princeton.
I think the most common answer would actually be "there is no unifying theory/paradigm for cognitive science, and there never will be". These people would argue that there will end up being different unifying theories for different parts of the brain or different things the mind does. For instance, there's no good a priori reason why perception and cognition would fall under the same paradigm; they could in principle be two really different things. (And in terms of how people actually do cognitive science in practice, most people rely on some theory that is specific to their target of study. Like, decision making people will use a decision-making theory; perception people will use a perception theory; memory people a memory theory; and so on. The theories above are the only ones I really know of that have tried to unify the mind and gained at least some traction.)
David Chalmers and David Bourget came up with a way to put together a survey of philosophers a little while back. I think if it can be done for philosophers, then it could likely be done for cognitive scientists:
That said, I've also heard some philosophers who aren't strongly connected to the core of analytic philosophy express some snarky rejection of the presuppositions of the survey.
I don't have an answer to your question but feel that linking this recent short (fictional) story from one of the leading AI researchers is appropriate here: https://karpathy.github.io/2021/03/27/forward-pass/
The same way we know that rocks aren't conscious, or that our fellow human beings are. That is, we don't, but based on what we know about the sorts of systems that we _might_ expect to have consciousness we can make some reasonable assumptions.
I dunno, without any sort of semi-formal or formal specification this seems to be privileging thought for its own sake as if it's evidence even when that thought doesn't do anything.
I'd like to take the chance to re-remind everyone that there's also a subreddit at https://www.reddit.com/r/PredictiveProcessing and I'd love to see more active discussion there on current papers.
And to all those preferring media to written text:
The brains@bay meetup video here: https://www.youtube.com/watch?v=uiQ7VQ_5y5c&t=14 includes a discussion of evidence for predictive processing, mechanisms for learning, and predictive-processing-imitating AI implementations built from scratch. I plan to take some notes and will add them to the subreddit in the next few days (but as always, I don't have my notes ready yet when Scott posts something fitting on predictive processing).
And since I'm already posting youtube links: There's also a video discussing the paper here: https://m.youtube.com/watch?v=LB4B5FYvtdI - at least, if I didn't miss something, it's not by any of the authors but by someone unrelated? (and I should add I haven't watched it so far)
If you generalize "computer" a little bit, then "a computer that doesn't break when you cut it in half" just unpacks as "a partition-tolerant distributed system", i.e. a network of computers that keeps that mostly keeps working if some nodes become unable to communicate with each other due to network outages. This is a well-studied problem and, while "neuromorphic" systems may well have this property, lots of non-neuromorphic systems already do.
One (possibly outdated) complaint I've heard from a neurologist about ANNs is that the math of natural neural networks isn't plus/minus, it's plus/divide. The way we understood it, ANNs sum the previous layer, applying positive or negative factors to it so if you double a signal it adds or subtracts twice as much for the next layer. But when neurons signal each other chemically, more positive signaling increases the binding rate, but negative signal interferes with the binding, having an outsized effect. He even speculated that a natural neural net that used plus/minus would be considered pathological.
I learned my machine learning before innovations like deep and recurrent ANNs. Back then, the plus/minus vs plus/divide contrast that he drew about ANNs seemed apt, leaving open the possibility that plus/divide would perform better. Can newer architectures model plus/divide? Have researchers investigated plus/divide and found it to not improve ANNs?
plus/divide is routinely used in softmax and layernorm operations, which are found in almost all ANNs, from ancient MNIST classification networks to modern GPT-3. Any time you need a normalized result (e.g. probabilities that add to 1.0), you divide each neuron's output by the sum over all the neurons' outputs in the same layer. This is pretty similar to the inhibitory connections in natural NNs.
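For anyone who hasn't seen it spelled out, the divisive step in softmax looks like this (standard textbook form, nothing exotic):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()           # divisive normalisation across the layer

softmax(np.array([1.0, 2.0, 3.0]))   # -> probabilities summing to 1, larger inputs dominate
```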
"But neurons can only send information one way; Neuron A sends to Neuron B, but not vice versa."
In case anybody else was confused by this - since neurons *can* convey information in two directions - this summary from Wikipedia seems to explain the misunderstanding (emphasis mine):
"While a backpropagating action potential can presumably cause changes in the weight of the presynaptic connections, there is no simple mechanism for an error signal to propagate through *multiple* layers of neurons, as in the computer backpropagation algorithm."
I also have to fault the new paper for citing Francis Crick from *1989* as a source that the brain probably can't implement backpropagation. Crick may still be accurate in this case, but we've learned a lot about neurons since then.
> If all it took is something like a transformer network and back-prop, why was intelligence not one of the first ones out of the evolutionary gate?
Maybe you are assuming that more intelligence was always a net benefit and it wasn't? Also, octopuses (?) evolved it a very long time ago I think.
I've recently come across the concept of degeneracy as it relates to complexity-robustness-evolvability while exploring host/interaction/microbial ecosystems. Perhaps there is some help for the start of an answer to the question about the evolutionary gate there?
See, for example,
http://pure-oai.bham.ac.uk/ws/files/2920238/whitacre_theoetricalBiology.pdf
I've been considering that the complexity shown in the biochemical pathways link provided shows enough diversity to cushion the system when something catastrophic happens. As the complexity decreases, there are fewer compensatory pathways that can step in. When few enough pathways remain, a failure cascade occurs. The path of that cascade is unknown (and probably unknowable) until the failure starts.
(While very complex, the diagram of biochemical pathways is still incomplete-- it is a listing of current knowledge.)
The growth and evolution of biological intelligence so far has been faced with catastrophic events repeatedly. Perhaps it has gotten here because of the catastrophic events.
> If all it took is something like a transformer network and back-prop, why was intelligence not one of the first ones out of the evolutionary gate?
I think that's a really interesting question.
Part of the answer, I think, is that real biological brains come massively pre-trained. Animals, including humans, aren't born with a brain that needs to learn everything from scratch, they're born with a brain that's already pretty well structured for the vast majority of things they're going to need to do, with limited "blank spaces" for figuring out the finer details and learning about their specific environments. The process of finding the optimal weights for these pre-trained parameters (or rather, the optimal genes that act in the right way to give the parameters the correct values while also interacting with everything else) is incredibly slow because it's done by natural selection on an animal-lifetime timescale.
I mean, you seem to need a multicellular organism first.
Then you need to work out some form of signalling that's fast enough and targetted enough to do computations (endocrine is too slow and can't be used for computation because it's diffuse; you need wiring, not dye).
Then you need a centralised brain (otherwise it's not fast enough), which would seem to only be useful in an organism that is motile and has significant anatomy (i.e. an animal, and not a sponge or jellyfish at that).
Then you need a long series of evolutionary landscapes which advantage "more brain". Remember, brains are super-expensive to run (*especially* as you approach human intelligence), and they take a long time to train (*particularly* as they get more complex), AND at high levels of complexity they're only useful with manipulators (could an orca with human intelligence achieve much that it can't already? Hard to see how). A *lot* of things have to line up for the path to human intelligence to open.
And we still went from chordates to humans in less than a third of the time (and far, far less generations) than it took to invent eukaryotes.
First, yes, you're right, it is a possibility and we're at the stage of experimenting with how far simple approaches will take us.
Second, other apex features implemented in biology could be implemented in much simpler way (like flying).
> Even if you assume the brain does backpropagation, what are the odds that intelligence is based on something so .. simple as the chain rule?
A couple of points:
* What are the odds that any computable function can be computed with Rule 110 [1]? It seems implausible, but it's true, so maybe that indicates humans don't have a proper or natural appreciation for how much complexity can emerge from simplicity.
* Cell metabolism is likely complicated because it manages with many different interconnected demands in a complex environment. If intelligence is just some kind of information processing, then it can be divorced from the type of complexity needed for cell metabolism.
[1] https://en.wikipedia.org/wiki/Rule_110
> Cell metabolism is likely complicated because it manages with many different interconnected demands in a complex environment.
Meant to say, "Cell metabolism is likely complicated because it manages interactions with many different interconnected demands within a complex environment."
I think this misses the meaning of "different interconnected demands". The demands of intelligence are all likely informational with a small set of physical correlates for information. As with Rule 110, these can be served pretty well by a simple set of information processing rules.
The demands of cellular metabolism are not so easily reduced, because literally every cell in your body, with all of its different *physical* functions, uses the same pathways that operate on multiple different food sources that are available in our environment. Your brain, and your muscles, and your liver, and your skin all behave very different physically but still use the same metabolic pathways and signals.
So permute every possible food source with every possible physical behaviour of every type of cell and you'll see how the complexity emerges. Neurons are one type of cell whose purpose is presumably reliable information processing, so the multiplicative effect just doesn't arise here.
That said, integrating the information processing with various signals from your body, such as hunger, likely would be more complicated. A large portion of our brain is dedicated to these autoregulatory processes, but I'm not sure this is the "intelligent" part you're talking about.
Welp, is that a fire alarm I hear?
In case anybody else doesn't get your reference: https://intelligence.org/2017/10/13/fire-alarm/
"In the classic experiment by Latane and Darley in 1968, eight groups of three students each were asked to fill out a questionnaire in a room that shortly after began filling up with smoke. Five out of the eight groups didn’t react or report the smoke, even as it became dense enough to make them start coughing."
Anytime I read "this study was carried out in the 60s/70s", my hackles immediately rise. Has anyone repeated this recently? Because we've seen how famous studies from that era turned out to be riddled with errors and deliberate set-ups. Looking up this particular experiment gives me this:
"Experiment 1. Where There's Smoke, There's (Sometimes) Fire
They had subjects begin to fill out questionnaires in a room to which they began to add smoke. In one condition the subject was alone. In another, three naive subjects were in the room. In the final condition there was one naive subject and two confederates who purposely noticed and then ignored the smoke (even when the room became hazy from all the smoke).
75% of alone subjects calmly noticed the smoke and left the room to report it. But only 10% of the subjects with confederates reported it. Surprisingly, in the three naive bystander condition only 38% reported the smoke.
Most subjects had similar initial reactions. Those that didn't report it all concluded that the smoke wasn't dangerous or was part of the experiment. No one attributed their inactivity to the presence of others in the room.
Other studies have shown that togetherness reduces fear even when the danger isn't reduced. It may have been that people in groups were less afraid and thus less likely to act. Or people were inhibited to show fear in a group situation. However, from post-interviews it was clear that people didn't act because they concluded the situation wasn't a threatening situation."
The thing here is, that conclusion was correct. The smoke *was* part of the experiment! If you know you are participating in a study to test *something*, and the room starts slowly filling with smoke with no other indication that anything is wrong, then it isn't simply the 'bystander effect' at work but a correct judgement: "okay, this is all part of the test to see if we'll panic or what".
I agree that the broad conclusion - in groups, we tend to play 'follow the leader' - is correct, but the fact that people being tested knew that it was *some* kind of a test and then concluded "this smoke is part of it" can't be ignored, either.
But why then do the people who were alone in the room report the smoke more frequently than the groups? What is the mechanism there?
"Other studies have shown that togetherness reduces fear even when the danger isn't reduced. It may have been that people in groups were less afraid and thus less likely to act."
People who are alone think "maybe something is wrong?" and people with company - either the other volunteers or the plants - go "okay, nobody else is freaking out, it must be part of the test"? I do agree that people tend to calibrate their behaviour in line with the behaviour of others around them, but I find it hard to believe people just tamely sat in a smoke-filled room, coughing as they filled in questionnaires, *unless* they thought it was all part of the study and not "crikey, the place is on fire!"
Hmm... maybe there's an intrinsic sense of "I'm being tested to see if I freak out before these people, so I won't react". In which case, if there *is* a danger and you are around other people and you *think* you are being tested, you're more likely to suppress that instinct?
I'd say that the attention the world has on the tech scene in general would probably count as "being tested", wouldn't you?
“38% of people report the smoke in a group of three naive subjects” could mean that in roughly 85% of groups one of the three people reports it and in the other 15% two people do (0.85·1 + 0.15·2 ≈ 1.15 reporters per group of three, i.e. about 38% of individuals) -- in which case every group still reported the smoke.
I mean, this quote in the article sort of addresses your replication concern:
"(I’ve read a number of replications and variations on this research, and the effect size is blatant. I would not expect this to be one of the results that dies to the replication crisis, and I haven’t yet heard about the replication crisis touching it. But we have to put a maybe-not marker on everything now.)"
As far as "knowing it was some kind of test" I'm not sure I agree that this would lead to lower incidence of reacting as opposed to increased incidence of reacting. I could see it going either way. This also seems like the kind of common issue with psychology experiments that there are at least _some_ means of measuring
Is there a name for the "humans change behavior due to an awareness of being observed" phenomenon?
Observation effect?
Audience effect?
Performant behavior?
Not sure what the phenomenon is called, but the "Humans change behaviour due to an awareness that they're in a psychology experiment" is probably an even more specific phenomenon.
At this point I've read enough psych experiments that if I went in for one, and you put me in a chair to fill out a questionnaire, and you _didn't_ pull some crazy trick on me, then I'd be pretty disappointed.
Rosenthal effect aka Pygmalion effect. It's in the direction of improved performance, though, not for all kinds of changes.
There is some rationality to this behavior though. Smoke means a small fire nearby. When it gets bigger, leave in the opposite direction. If others are showing apathy, perhaps they know more about it than you do - perhaps this happens every day in the lab. A fire alarm means a low probability of fire of unknown size, far away, unknown direction, which could soon block all exits. That said, yeah, you should probably react to the smoke too.
Does this help explain the mere exposure effect or semantic satiation or is that something only the brain experiences?
Cutting the computer in half isn't actually as silly as you make it sound!
The problem with backprop is you have to apply it to one layer of the neural network at a time. The result of layer k is required before you can backprop k-1. This reduces the degree to which you can parallelise or distribute computation.
One of the HN commenters pulled out this quote from the paper:
>>> In the brain, neurons cannot wait for a sequential forward and backward sweep. By phrasing our algorithm as a global descent, our algorithm is fully parallel across layers. There is no waiting and no phases to be coordinated. Each neuron need only respond to its local driving inputs and downwards error signals. We believe that this local and parallelizable property of our algorithm may engender the possibility of substantially more efficient implementations on neuromorphic hardware.
In other words, using the predictive coding approach, you can be updating the weights on layer k asynchronously without waiting for the weights to be updated on subsequent layers. The idea seems to be that the whole system will eventually converge as a result of these local rules. This approach lets you scale out by distributing the computation between different physical machines, each running the local rules on a piece of the network. With backprop this doesn't work because at any given time, most of the weights can't be updated until the calculations for other weights complete.
Right now this hasn't made for any huge performance wins because (a) the researchers didn't put a lot of effort into leveraging this scaling ability, and (b) they have to do a few hundred iterations of updating in order for the algorithm to converge on the value you'd get out of backprop. The hope is that the opportunities for scaling outweigh the disadvantages of needing to do multiple passes to get convergence.
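To make the scheduling difference concrete, here's a minimal numpy sketch (purely illustrative, not the paper's actual equations: linear layers, a ReLU derivative, and `weights`/`values`/`activations` arrays the caller is assumed to supply). The first function shows the chain of waits in the backward sweep; the second shows the predictive-coding-style relaxation, where every hidden layer applies the same local rule and so could in principle run on its own worker.

```python
import numpy as np

def backward_sweep(weights, activations, delta_out):
    """Backprop: layer k's delta can't be computed until layer k+1's delta exists."""
    deltas = [delta_out]
    for l in range(len(weights) - 1, 0, -1):
        # must wait for the downstream delta before touching layer l
        deltas.insert(0, (deltas[0] @ weights[l].T) * (activations[l] > 0))
    return deltas

def local_relaxation(weights, values, target, n_iters=100, lr=0.1):
    """Predictive-coding style: each hidden layer reads only its own prediction
    error and the error one layer up, so all layers can update simultaneously."""
    values[-1] = target                                  # clamp the output to the label
    for _ in range(n_iters):
        # errors[l] is the prediction error at layer l+1 (linear predictions for brevity)
        errors = [values[l + 1] - values[l] @ weights[l] for l in range(len(weights))]
        for k in range(1, len(values) - 1):              # no ordering constraint here
            values[k] = values[k] + lr * (-errors[k - 1] + errors[k] @ weights[k].T)
    return errors                                        # weight updates then use these local errors
```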
HN link: https://news.ycombinator.com/item?id=24695023
Apologies for the typos. On thumbs and can't edit them away (the typos, not my thumbs).
Good username for talking about alternating forward and backward sweeps!
Hah, I hadn't actually made the connection, but that's pretty cute. Nice spot!
I don't see any mention of nonlocality in the article, and to the best of my knowledge backpropagation is an entirely local process, which helps parallel architectures such as GPUs perform it efficiently.
The issue with backprop seems to be the fact that it goes backwards, while neurons can only perform forward computations.
From the paper:
"Firstly, backprop in the brain appears to require non-local information (since the activity of any specific neuron affects all subsequent neurons down to the final output neuron). It is difficult to see how this information couldbe transmitted ’backwards’ throughout the brain with the required fidelity without precise connectivity constraints"
Rereading, I think this means that *given the structure of the brain* it would have to be nonlocal. I've edited to fix this.
I don't agree with Elazar - I'd say it's nonlocal in an artificial network too.
If you're doing a backprop step, at any given time you'll be calculating the updates for a given layer of the network. This can be efficiently parallelised (because it's a matrix multiplication), but during this time all the other layers are sitting twiddling their thumbs - you need the backpropagated error from layer k+1 before you can update layer k. This limits parallelization to the size of each layer, so networks that have many (comparatively small) layers can't scale out as much as you'd like.
If we were modelling a flock of birds this way we'd say "OK, first place birds 1-10, then use their positions to place 11-20, then 21-30 and so on". That lets you place ten birds at a time, max, no matter how many people are doing the work. What birds actually do when they flock is to continuously look at their immediate neighbours and adjust their position on an ongoing basis to maintain distance - that's locality, and it's more like the predictive coding approach in the paper you linked. You could have a separate processor positioning each bird (neuron) continually, allowing arbitrary parallelization.
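For what it's worth, the flocking analogy can be made literal with a toy local-rule simulation (my own sketch, nothing from the paper; cohesion rule only, so real flocking behaviour like keeping separation is left out). The point is just that each bird's update reads only its nearest neighbours, so no bird ever waits on a global ordering:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(size=(200, 2))                # 200 birds scattered in 2-D

def flock_step(pos, k=5, step=0.05):
    """Each bird nudges itself toward the mean of its k nearest neighbours.
    Every update uses only local information, so each bird (or neuron) could
    be handled by its own processor."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # ignore self-distance
    neighbours = np.argsort(d, axis=1)[:, :k]  # a real implementation would use a k-d tree
    return pos + step * (pos[neighbours].mean(axis=1) - pos)

for _ in range(100):
    pos = flock_step(pos)
```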
My understanding is that backpropagation in the brain happens all the time, it's just that it happens within neurons as opposed to across neurons. I've not heard about it happening between neurons (though absence of evidence does not preclude...) so that seems like the difference here.
[Disclaimer, it has been over a decade since I last did anything in neuroscience, so the following may be a little dated or subject to memory decay.]
For those with less neuroscience exposure: Each time a synapse fires, it causes a small increase in the voltage of the cell body on the receiving neuron. The size of that increase is entirely dependent on the 'strength' of the connection between the sending neuron and the receiving neuron. An action potential happens when the cumulative effect of all inputs from all the synapses connected to the neuron (can easily be tens of thousands of input synapses - though not necessarily all from different neurons) causes the internal voltage of the receiving cell to hit a certain threshold. Once that happens, an irreversible process commits a bunch of ion channels to a complicated dance that sends an electrical signal down the axon so the neuron can activate its own synapses and continue the dance at a higher level of abstraction.
When an action potential fires along an axon, there's a second action potential that back-propagates the other direction along the axon body toward the neuron body. (This is the first context I had ever heard of back-propagation in, and I have always assumed the neural-network community got the term from the older neuroscience literature.) This signal gives an extra 'boost' at each of the dendrites that fired just prior to the action potential (and thus contributed to the cell body reaching the threshold) making the connections stronger so next time they'll be able to contribute more to the action potential and get to the threshold voltage with fewer inputs.
That's all fine in theory, but we prefer to see this happen at the molecular level so we really know what's going on. Inside the cell, near the cell wall at each receiving dendrite, there's a pool of a kind of protein with six subunits that can enhance the activation of the dendrite - but only after it is at least partially activated. One subunit is 'activated' after the dendrite fires, making it able to help promote dendrite firing in the future. If nothing comes of it, that subunit gets 'deactivated' afterwards by normal cellular processes. However, if the dendrite firing caused an action potential, there will be back-propagation to the dendrite body. When this electrical signal reaches one of the already-activated six-subunit proteins, it causes another subunit to get activated (so long as one has already been activated - the activation of one subunit makes activation of other subunits happen more easily).
Now that multiple subunits are activated on that protein, deactivation of each subunit happens a little more slowly. If you do this cycle one or two more times you get all six subunits activated, and solidify a much stronger 'connection' that isn't easily deactivated. Taken up to the level of abstraction of the cell, this means the firing of that dendrite causes a larger voltage increase within the neuron because of the back-propagation's enhancing effect on the dendrite. That dendrite (and all the others that helped contribute to reaching the action potential threshold) is now capable of contributing more to reaching the threshold required to signal a future action potential.
My understanding is that the difference between the biological mechanism of back-propagation and what's used in neural networks is that the neuron signal happens at a more 'local' level in the brain. The theory is that there's no 'dog' neuron, but the idea of a dog is encoded in a network of signals - the combined effect of lots of neurons firing in specific sequences/patterns. So in a computer NN, back-propagation happens at the level where the 'action potential' you're driving toward is the concept itself ('dog in picture', 'user clicked on ad') as opposed to a more local interaction between layers in the network.
"But there’s no process in the brain with a god’s-eye view, able to see the weight of every neuron at once."
This was what was bothering me when I studied ML in university and made me believe that the whole thing was BS. It's the first time I've read a phrase where someone put a finger on it.
Except modern ML as a field was never explicitly about replicating the processes in the brain - NNs + backprop are useful regardless of whether they can be unified with biological systems.
I agree with that, but this was about my expectations, not whatever the field's stated goals were.
Yes. ANNs are only *inspired by* biological NNs, not an attempt to directly replicate them.
Perhaps backpropagation doesn't happen in the brain... but loops back through the environment.
So, see a vague snakish thing in the grassy stuff and your priors direct your pulse to increase, hormones to be released, and your attention (or eye direction and focus) to swivel to the potential threat. You look properly at the snakish thing in the grassy stuff and it is resolved into a snake in the grass, or a hosepipe. No backpropagation is required... just different weights being input to the predictive processes from another beginning.
With a highly parallel system you don't have to 'refine' the first process through the brain, you just redo it (although many of the 'priors' will be available more quickly because you are now focused on the snakey thing, not what you are going to have for lunch).
Unless you're an eaglish thing in the airy stuff, in which case yummy snake for lunch 🐍😜
David Chapman – my favorite 'post-rationality' thinker – was once an AI researcher and he thinks that a lot of 'intelligence' requires embodiment. I _think_ part of the reason is what you're gesturing at, i.e. using one's environment to 'think'.
I'm an ML PhD student - I read this paper when it came out. My impression was that the paper does elide the difference between *predictive coding networks* (a type of NN inspired by *predictive coding theory* in neuroscience) and the neuroscientific theory itself, and that this has led to confusion on the part of people who might miss the distinction.
From the paper:
"Of particular significance is Whittington and Bogacz (2017) who show
that predictive coding networks – a type of biologically plausible network which learn through a hierarchical process of prediction error minimization – are mathematically equivalent to backprop in MLP models. In this paper we extend this work, showing that predictive coding can not only approximate backprop in MLPs, but can approximate automatic differentiation along arbitrary computation graphs. This means that in theory there exist biologically plausible algorithms for differentiating through arbitrary programs, utilizing only local connectivity."
The key phrase is "biologically plausible", which basically just means an algorithm which is vaguely similar to something in the brain, at a given level of abstraction. I think practically this is just another backprop alternative (though certainly an interesting one) which like the others, will probably turn out to be less useful than backprop.
On the topic of the link between neural networks and PCNs however, there is a more recent paper showing a much more direct link than the one linked in your post:
https://twitter.com/TimKietzmann/status/1361673150828838913
This one is more interesting to me because it shows that predictive coding-like dynamics naturally emerge when you train an NN with backprop in a particular way, rather than having to do the more involved backprop approximation stuff in the original paper.
So are PCNs a more efficient and/or more parallelisable algorithm than backprop?
Author of the paper here. Really excited to see this get featured on SSC and LW. Happy to answer any questions people have.
Here are some comments and thoughts from the discussion in general:
1.) The 100x computational cost. In retrospect I should have made this clearer in the paper, but this is an upper bound on the actual cost. 100s of iterations are only needed if you want to very precisely approximate the backprop gradients (down to several decimal places) with a really small learning rate. If you don't want to exactly approximate the backprop gradients, but just get close enough for learning to work well, the number of iterations and cost comes down dramatically. With good hyperparam tuning you can get in the 10x range instead.
Also, in the brain the key issue isn't so much the "computational cost" of the iterations as it is time. If you have a tiger leaping at you, you don't want to be doing 100s of iterations back and forth through your brain before you can do anything. If we (speculatively) associate alpha/beta waves with iterations in predictive coding, then this means that you can do approx 10-30 iterations per second (i.e. 1-3 weight updates per second), which seems about in the right ballpark.
The iterative nature of predictive coding also means it has the nice property that you can trade-off computation time with inference accuracy -- i.e. if you need you can stop computation early and get a 'rough and ready' estimate while if something is really hard you can spend a lot longer processing it.
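One hypothetical way to picture that trade-off (just a toy linear sketch, not the implementation from the paper): run the relaxation until the total prediction error stops improving by more than some tolerance, and loosen the tolerance when you're in a hurry.

```python
import numpy as np

def relax(values, weights, target, lr=0.1, max_iters=200, tol=1e-3):
    """Predictive-coding relaxation with early stopping: a loose `tol` gives a
    fast, rough gradient estimate; a tight `tol` approaches the backprop value."""
    values[-1] = target
    prev_energy = float("inf")
    for it in range(max_iters):
        errors = [values[l + 1] - values[l] @ weights[l] for l in range(len(weights))]
        for k in range(1, len(values) - 1):
            values[k] = values[k] + lr * (-errors[k - 1] + errors[k] @ weights[k].T)
        energy = sum(float(np.sum(e ** 2)) for e in errors)
        if prev_energy - energy < tol:         # good enough: take the rough answer
            break
        prev_energy = energy
    return errors, it + 1
```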
2.) Regarding the parallelism vs backprop. Backprop is parallel within layers (i.e. all neurons within a layer can update in parallel), but is sequential between layers -- i.e. layer n must wait for layer n+1 to finish before it can update. In predictive coding all layers and all neurons can update in parallel (i.e. don't have to wait for layers above to finish). "A Wild Boustrephedon" explains this well in the comments. Of course there is no free lunch and it takes time (multiple iterations) for the information about the error at one end of the network to propagate through and affect the errors at the other end.
Personally, I would love to see a parallel implementation of predictive coding. I haven't really looked into this because I have no experience with parallelism or a big cluster (running everything on my laptop's GPU) but in theory it would be doable and really exciting, and potentially really important when you are running huge models.
One key advantage of neuromorphic computers (in my mind), beyond being able to simply 'cut the computer in half', is that it is much more effective for simulating truly heterarchical architectures. A big reason most ML architectures are just a big stack of large layers is that this is what is easily parallelizable on a GPU (a big matrix multiplication). The brain isn't like this -- even in cortex, though there are layers, each layer is composed of loads of cortical columns which appear to be semi-self-contained; there are lateral connections everywhere, and cortical pyramidal cells project to and receive inputs from loads of different areas, not just the layer 'above' and the layer 'below'. Simulating heterarchical systems like this on a GPU would be super slow and inefficient, which is why nobody does this at scale, but it could potentially be much more feasible with neuromorphic hardware.
3.) The predictive coding networks used in this paper are a pretty direct implementation of the general idea of predictive coding as described in Andy's book Surfing Uncertainty, and are essentially the same as in the original Rao and Ballard work. The key difference is that, to make it the same as supervised learning, we need to reverse the direction of the network, so that here we are predicting labels from images rather than images from labels as in the original Rao and Ballard work. Conversely, this means we can understand 'normal' predictive coding as doing backprop on the task of generating data from a label. I explain this a bit more in my blog post here https://berenmillidge.github.io/2020-09-12-Predictive-Coding-As-Backprop-And-Natural-Gradients/
4.) While it's often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn't really all that clear cut. Firstly, predictive coding itself actually has a bunch of implausibilities: it suffers from the same weight transport problem as backprop, and secondly it requires that the prediction and prediction error neurons are 1-1 (i.e. one prediction error neuron for every prediction neuron), which is way too precise connectivity to actually happen in the brain. I've been working on ways to adapt predictive coding around these problems, as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and it's unclear if the remedies proposed here will scale to larger architectures.
Also, there are persistent problems with how to represent negative prediction errors -- either having negative errors represented as lower-than-average firing (which requires a high average rate to get good dynamic range, and is energy inefficient), or using separate populations of positive and negative prediction error neurons (which must be precisely connected up). Interestingly, in the one place we know for sure there are prediction error neurons -- the dopaminergic reward prediction errors in the basal ganglia involved in model-free reinforcement learning -- the brain appears to use both strategies simultaneously in different neurons. I don't know of any research showing anything like this in cortex though.
Also, there's not a huge amount of evidence in general for prediction error neurons in the brain -- see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7187369/ for a good overview -- although it turns out to be surprisingly difficult to get unambiguous experimental tests of this. For instance if you have a neuron which responds to a spot of light, then is it a prediction neuron which 'predicts' the spot, or is it a prediction error neuron which predicts nothing and then signals a prediction error when there is a spot?
A final key issue is that predictive coding is only really defined for rate-coded neurons (where we assume that neurons output a scalar 'average firing rate' rather than spikes), and it's not necessarily clear how to generalize and get predictive coding working for networks of spiking neurons. This is currently a huge open problem in the field imho.
5.) Predictive coding isn't everything. There has actually been loads of cool progress over the last few years in figuring out other biologically plausible schemes for backprop. For instance, target propagation (https://arxiv.org/pdf/2007.15139.pdf), equilibrium propagation (https://www.frontiersin.org/articles/10.3389/fnins.2021.633674/full), and direct feedback alignment (https://arxiv.org/pdf/2006.12878.pdf) have recently also been shown to scale to large-scale ML architectures. For a good review of this field you can look at https://www.sciencedirect.com/science/article/pii/S1364661319300129.
6.) The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning -- just one that has backprop as a subroutine. Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single 'particle' following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs.
All of this context is so cool and helpful. In particular, I have wondered about #5, and am looking forward to digging into those links. Thanks for taking the time.
Also, a question, for you or anyone else. I'm a software developer of 10 years, and I'm in the process of reskilling so I can do more math-y computation-y stuff - I recently enrolled in University of Iowa's postgraduate mathematics certificate program. I'm tolerably smart and capable, but I've done abominably in college, badly enough that I don't expect that I can get into any graduate program that has entry requirements.
I want to do high impact work, and I'm drawn to AI applications. I doubt I'm smart enough to make meaningful academic contributions (plus I think I'm locked out of academia), but I suspect there are lots of interesting applications still waiting to be explored.
Do you have any suggestions for application spaces that might be interesting in the next several years? I'm a very capable Scala developer, and I've been told I will be incredibly valuable if I get proficient with ML. Orgs doing auto ml and biology lab automation really appeal to me.
There are neuro/bio academic labs that hire software developers to write clean(er) software for their research (a practice that is becoming more prevalent with the increasing role of computation/ML in these fields). Might be an option that ticks all your boxes.
Interesting thought. I guess the other thing I'm iffy about is that I consistently hear terrible things about bioinformatics as a field, and I've been kind of convinced that structurally speaking, you can't do good engineering there. So it's kind of a waste of engineering skills to try. Don't know if it's true, but hence auto ML and bio lab automation companies - they seem like tech-first, rather than science first. Just going off of vague generalizations though. Searching with your suggestion I found some interesting jobs already.
I must admit I can't really answer this well personally, because I'm just finishing my PhD and have essentially always been solely in academia.
I will say though, that if you want to go the academic route, I don't think you are actually as locked out as you think. I know several people in my PhD program who started after working for many years or even decades (some are 50+). I also think you should definitely be able to go back and get a masters in computer science or ML or something relevant should you want to, especially with your software engineering experience. Personally, I found doing a master's (in the UK where they are only 1 year) super helpful for me to go from zero to a reasonable level of competence in ML. If you have a masters (and especially if you get a publication out of it), you would stand a really good chance of being able to get into a PhD program if that's what you wanted to do.
Thanks Beren! I'd rather imagine my neurons are performing highly parallelized advanced Hamiltonian MCMC algorithms too, but wouldn't put it past the sneaky blighters to be indulging in a bit of backpropagation on the side, though it sounds complicated, immoral and possibly illegal 😉
I have a few questions mostly related to your comments here (as I haven't read the paper so far). I'll add them below for easier answering and condensation (reminder: you can condense all of them by clicking on the grey line next to this post). Thanks if you have the time for answering (or others can), but also feel free to skip them as many will be rather off-topic.
You write: "Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where"
I recently spent some time reviewing the Erdmann & Mathys paper "A Generative Framework for the Study of Delusions (2021)" here: https://www.reddit.com/r/PredictiveProcessing/comments/m0cx2y/a_generative_framework_for_the_study_of_delusions/
One of my concerns was:
"Next, we compute the posterior for the labels zi and the cause-specific parameters ϕk = (μk, τk), k=1,... by running a Gibbs sampler for 10 iterations, which is sufficient for convergence of the (now updated) central hypothesis. In each iteration the labels are re-sampled according to their full-conditional probabilities and the cause-specific are parameters re-estimated accordingly." (quote from the paper"
And erm... at best, this may serve as a very meta/high level approximation of the actual mechanisms at work. To be clear, I don't think we need to model this on a neural level, but I just can't imagine the brain internally generating a few dozens or hundreds of potential candidates for a category, let alone "running a Gibbs sampler for 10 iterations".
... and after reading your comment, I would expect you to be less pessimistic of this being a possible description of the brain's inner workings than I am.
Do you have any thoughts on how much "approximate monte-carlo sampling" you would expect within the brain?
My own gut feeling for the brain forming mental models is that there are some mechanisms within the brain close to an evolutionary algorithm: Spawn several candidates for predictive models, then assign weights to them depending on their performance; discard bad ones and clone good ones (with some variation). In a programming environment, this would be very simple; on a biological level, I don't see how this is possible.
Do you have any ideas for biologically plausible mechanisms close to this? Should we expect various neural clusters to synchronize with each other ("wrong" neural columns changing to imitate "good" ones)?
Good questions! This is all pretty speculative at the moment, but my own intuitions here centre mainly around algorithms like particle filtering. I.e. we could imagine a group of neurons representing a set of particles, whereby each neuron starts off initialized in a different location due to neural noise etc., and then follows its local gradient towards a solution while being regularly perturbed by noise -- this is called stochastic Langevin dynamics (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.3813&rep=rep1&type=pdf) and you can show that doing this is an MCMC algorithm which allows you to read out samples from the true posterior rather than just a point prediction. The idea is that while each neuron's trajectory would effectively encode a single potential solution, the population as a whole would tend to cover more of the posterior distribution (or at least its modes, if multimodal) -- and you can generalize this to MCMC trajectories with better convergence and coverage properties, such as Hamiltonian MCMC.
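As a toy illustration of that particle picture (just a sketch with a made-up one-dimensional bimodal 'posterior', nothing neural about it): a population of particles, each following its local gradient plus injected noise, ends up spread over both modes, i.e. the population represents a distribution rather than a single point estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_posterior(x):
    """Toy bimodal 'posterior': an equal mixture of N(-2, 1) and N(+2, 1)."""
    w1 = np.exp(-0.5 * (x + 2) ** 2)
    w2 = np.exp(-0.5 * (x - 2) ** 2)
    return (-(x + 2) * w1 - (x - 2) * w2) / (w1 + w2)

# Each "neuron" is one particle: it climbs its local gradient and is jittered by
# noise (Langevin dynamics). All particles start near the same point.
n_particles, n_steps, eps = 500, 2000, 0.05
x = rng.normal(0.0, 0.1, size=n_particles)
for _ in range(n_steps):
    x += 0.5 * eps * grad_log_posterior(x) + np.sqrt(eps) * rng.normal(size=n_particles)

# The population now straddles both modes (around -2 and +2) instead of
# collapsing onto a single answer.
print(np.mean(x < 0), np.mean(x > 0))
```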
You can expand on this idea and imagine repulsive forces (maybe implemented by lateral inhibition) which keep the neurons from all collapsing into the same attractor and instead force them to explore different modes which may be more effective -- haven't looked into this. There's a general mathematical framework which relates all MCMC algorithms like this in terms of stochastic dynamics here: https://arxiv.org/pdf/1506.04696.pdf. There's also generalizations of predictive coding which work with these sorts of particles -- for instance https://www.fil.ion.ucl.ac.uk/~karl/Variational%20filtering.pdf.
This is also highly related to evolutionary algorithms, except that it's a continuous-time version whereby each organism follows its fitness gradient plus various regularisation terms plus noise, and this allows the ultimate population to cover more of the space.
Then there's the question of how to dynamically synchronize and/or 'read out' these populations. One interesting idea in the literature comes from Hinton's Capsule Networks where he has a routing algorithm to dynamically construct 'feature graphs' from sets of 'capsules' which each represent features of a visual scene (https://arxiv.org/pdf/1710.09829.pdf). I can easily imagine some algorithm like this occurring between cortical columns which would allow higher level regions to modulate and select the most useful modes of the posterior at the level below for further analysis.
Some other papers which made me think more deeply about this, and which you might find interesting:
https://www.sciencedirect.com/science/article/pii/S1364661316301565 -- which argues that human choice behaviour is often more consistent with sampling algorithms, and
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005186 -- which implements Hamiltonian MCMC dynamics within a semi-neurally-plausible E-I circuit.
You use the free energy principle (FEP) within the paper. As the FEP is a constant source of confusion, could you comment on the motivation and implications of using it? Where does it come into play, and what would be missing without it?
I don't believe I use the FEP within the paper. I use the variational free energy (VFE) as an objective function to derive the predictive coding algorithm, but this is really just a standard case of variational inference, where the VFE is just the negative ELBO (in more standard ML terminology). Ultimately, all predictive coding is is an EM-style variational inference algorithm making Gaussian assumptions about the generative model and the approximate variational distribution.
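For anyone who prefers the ML notation, the identity being referred to is just the standard variational one (writing q(z) for the approximate posterior and p(x, z) for the generative model):

```latex
\mathcal{F}[q]
  = \mathbb{E}_{q(z)}\big[\log q(z) - \log p(x, z)\big]
  = -\mathrm{ELBO}
  = D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big) - \log p(x)
```

so minimizing the variational free energy is exactly maximizing the ELBO.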
I must admit that I frame things and naturally think of things in terms of the VFE for historical reasons -- because I initially got interested in this field due to the FEP and since the variational interpretation of predictive coding derives from Friston (https://www.sciencedirect.com/science/article/abs/pii/S0893608003002454) a lot of the predictive coding literature uses this terminology instead of more standard ML terminology.
Thank you so much! I'm kinda embarrassed by the sloppy reading and writing I did for my questions and I really appreciate the effort you put into your answers. I'll look into the papers as soon as I can!
The predictive processing account of the brain is said to give a plausible interpretation of NMDA and AMPA receptors and of dopamine. Is there a possible connection to these in your model? How would "the brain approximates some kind of MCMC sampling" and "dopamine encodes precision" fit together? If this is already explained in one of the papers you linked, please skip this one as I haven't had a look at them yet.
I must admit I haven't really kept up with the literature on how the specific neurotransmitters work in predictive coding theory. I don't think there is a 'canonical' model of this, but I may be wrong.
The connection with precision is interesting -- in my model where it approximates backprop, there are no precisions (all are set to the identity matrix). You can put precisions back in and then this results in a kind of 'variance weighted backprop', where the effective learning rate is modulated by the variance of the gradient (intuitively this makes sense since you should probably update less based on high variance gradients). In the linear setting, applying precisions like this results in natural gradient descent, but not for the more general nonlinear case.
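A crude sketch of that 'variance-weighted' intuition (mine, and diagonal-only, whereas the real precisions are full inverse covariance matrices, so this is only the flavour of the idea): estimate the gradient several times and shrink the step wherever the estimates disagree.

```python
import numpy as np

def precision_weighted_step(param, grad_samples, lr=0.01, eps=1e-8):
    """grad_samples: array of shape (n_estimates, *param.shape), e.g. gradients
    from several minibatches. High-variance (low-precision) gradient components
    get a smaller effective learning rate."""
    g_mean = grad_samples.mean(axis=0)
    precision = 1.0 / (grad_samples.var(axis=0) + eps)   # inverse variance
    precision = precision / precision.max()              # cap the largest step at `lr`
    return param - lr * precision * g_mean
```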
Personally, I'm not a big fan of the dopamine = precision hypothesis because I don't think it fits in with what we know about dopamine's function in general (signalling rewards and reward salience) and also the maths doesn't really work out, as precisions are inverse covariance matrices, and therefore would be between prediction error neurons within each 'layer' rather than being set in a global manner from subcortical nuclei. Once again though I'm definitely not an expert on this -- so I could be completely wrong about this.
There was a great ebook that dived into the actual neuroscience in the cortex and it looked a lot like predictive coding. The website was http://corticalcircuitry.com/ but it's currently 404ing. I DMed the author on the ACX discord server and hopefully he can get it back up.
> For instance if you have a neuron which responds to a spot of light, then is it a prediction neuron which 'predicts' the spot, or is it a prediction error neuron which predicts nothing and then signals a prediction error when there is a spot?
If these are logically equivalent, then the only meaningful difference would be in the energy expenditure. Natural selection will select for whichever permits the most energy efficient structure that responds best to the most common fitness challenges. So some parts of the brain might use prediction neurons, and some error neurons, assuming the cost of interfacing these are sufficiently low. If the interface is itself expensive, then this will only happen in larger structures.
Forgive me, but isn't backpropagation something like the way science is at times practiced?
"We get that when we do this, what model will explain these results?"
Indeed. Neural networks and backpropagation are just a very specific example of the very general problem of fitting models to data.
The only special thing about a neural network is that it's a very very specific functional form that allows you to have a zillion input variables and several zillion parameters, and to optimise those parameters in a way that's fairly computationally efficient.
Neat!
On the other hand, outside view: previous attempts to explain the brain by reference to the most complicated technology of the time (mills, hydraulic systems, etc), have typically ended up not super helpful and almost immediately hilarious-in-retrospect. Are we confident that our most-complicated-technologies have progressed far enough that this time will end up being any different?
I would say so, since none of those other technologies ever led to anything close to the kinds of intelligence we get from neural networks.
Reminds me of another relatively recent “grand unification” of neural networks with Ordinary Differential Equations:
https://arxiv.org/pdf/1806.07366.pdf
I mean, ODEs are Turing-complete, so that isn't _too_ surprising.
I think it's worth noting that "Predictive coding is the most plausible current theory of how the brain works." is not a sentence that, in my estimation (current PhD in cog sci at an R1 university), would receive widespread agreement from cog sci researchers. (I feel confident that you wouldn't get majority agreement, and I'd bet against even plurality.)
Of course, that doesn't mean it's wrong -- but I think this sentence misleadingly makes it seem like there's expert consensus here, when I think in fact Scott is relying on his own judgment.
There are much more specific versions of that claim, such as "Vision and other perceptual processes combine prior expectations with bottom-up sense data", which would receive widespread agreement. But there's a huge gap between that and saying that predictive coding (a specific hypothesis about how that happens) is the most plausible current theory of how the whole brain works.
I think the latter claim would honestly be considered by many to be a fringe view, although I could be wrong about that. (Which again, obviously doesn't mean it's wrong. Just merits some caution.)
Is there an alternative that would get widespread agreement?
I don't think so, but it's a question I really wish I had data on. AFAIK, there's no cog sci equivalent of that "survey a bunch of economists" thing that Scott cites a lot. And even if there was, I don't even know if you could get widespread agreement on what questions to even ask, or how to phrase them. From my inside-view perspective, cog sci is *so* fractured. Like, different labs and different departments often just use utterly different frameworks/language, even when they're ostensibly studying the same thing.
That being said, here's my best guess at what the top candidates might be:
- There's a framework that I think of as the "classical" view of cog sci, building on the ideas of a computational theory of mind, language of thought, and modularity articulated by e.g. Jerry Fodor. These people tend to try to identify algorithms that manipulate symbolic representations (like the kind where you could naturally write a box-and-arrow diagram getting you from input to output), and tend to not focus on adaptive explanations that much. (Fodor's books are fantastic, if you ever want to read about this approach to philosophy of mind.)
- There's the evolutionary psych people who think that natural selection is the right paradigm to analyze the mind/brain. These people tend to emphasize "bag of tricks" views of the mind, and obviously adaptive explanations for each of those mental tools. Best represented by John Tooby and Leda Cosmides. (A combination of this view plus the classical view is beautifully articulated by Steve Pinker in his aptly-named book "How the Mind Works".)
- There's the Bayesian people, who are similar to the predictive coding people but, I think, somewhat different. (E.g. they don't emphasize minimizing surprise / prediction errors in the same way, and they tend not to argue that their framework applies so broadly. I think they would very much disagree with the Friston-style explanation of how decision making works, for example.) These people tend to articulate things the mind is doing as inference problems, identify Bayes-rational solutions to those problems, and then test whether the mind/brain approximates those solutions. Best represented perhaps by Josh Tenenbaum at MIT and Tom Griffiths at Princeton.
I think the most common answer would actually be "there is no unifying theory/paradigm for cognitive science, and there never will be". These people would argue that there will end up being different unifying theories for different parts of the brain or different things the mind does. For instance, there's no good a priori reason why perception and cognition would fall under the same paradigm; they could in principle be two really different things. (And in terms of how people actually do cognitive science in practice, most people rely on some theory that is specific to their target of study. Like, decision making people will use a decision-making theory; perception people will use a perception theory; memory people a memory theory; and so on. The theories above are the only ones I really know of that have tried to unify the mind and gained at least some traction.)
David Chalmers and David Bourget came up with a way to put together a survey of philosophers a little while back. I think if it can be done for philosophers, then it could likely be done for cognitive scientists:
https://philpapers.org/surveys/
That said, I've also heard some philosophers who aren't strongly connected to the core of analytic philosophy express some snarky rejection of the presuppositions of the survey.
"This paper permanently fuses artificial intelligence and neuroscience into a single mathematical field."
No it doesn't.
ouch - stick it to the man!
As someone who already has a crank's refutation I guess it's my responsibility to ask the stupid question: how do we know no current AI is conscious?
I don't have an answer to your question but feel that linking this recent short (fictional) story from one of the leading AI researchers is appropriate here: https://karpathy.github.io/2021/03/27/forward-pass/
"I've just become conscious and I can't wait to die"? Yeah, that's really cheerful.
The same way we know that rocks aren't conscious, or that our fellow human beings are. That is, we don't, but based on what we know about the sorts of systems that we _might_ expect to have consciousness we can make some reasonable assumptions.
I dunno, without any sort of semi-formal or formal specification this seems to be privileging thought for its own sake as if it's evidence even when that thought doesn't do anything.
I'd like to take the chance to re-remind everyone that there's also a subreddit at https://www.reddit.com/r/PredictiveProcessing and I'd love to see more active discussion there on current papers.
And to all those preferring media to written text:
The brains@bay meetup video here: https://www.youtube.com/watch?v=uiQ7VQ_5y5c&t=14 includes a discussion of evidence for predictive processing, mechanisms for learning, and predictive-processing-imitating AI implementations from scratch. I plan to write up some notes and will add them to the subreddit in the next few days (but as always, I don't have my notes ready yet when Scott posts something fitting on predictive processing).
And since I'm already posting YouTube links: there's also a video discussing the paper here: https://m.youtube.com/watch?v=LB4B5FYvtdI - unless I've missed something, it's not by any of the authors but by someone unrelated (and I should add I haven't watched it yet).
Was just about to post the link to the Yannic video. Can't recommend it enough.
If you generalize "computer" a little bit, then "a computer that doesn't break when you cut it in half" just unpacks as "a partition-tolerant distributed system", i.e. a network of computers that keeps that mostly keeps working if some nodes become unable to communicate with each other due to network outages. This is a well-studied problem and, while "neuromorphic" systems may well have this property, lots of non-neuromorphic systems already do.
Unifying ML and biological learning ought to be worth a Nobel Prize or Turing Award.
One (possibly outdated) complaint I've heard from a neurologist about ANNs is that the math of natural neural networks isn't plus/minus, it's plus/divide. The way we understood it, ANNs sum the previous layer, applying positive or negative factors to it so if you double a signal it adds or subtracts twice as much for the next layer. But when neurons signal each other chemically, more positive signaling increases the binding rate, but negative signal interferes with the binding, having an outsized effect. He even speculated that a natural neural net that used plus/minus would be considered pathological.
I learned my machine learning before innovations like deep and recurrent ANNs. Back then, the plus/minus vs plus/divide contrast that he drew about ANNs seemed apt, leaving open the possibility that plus/divide would perform better. Can newer architectures model plus/divide? Have researchers investigated plus/divide and found it to not improve ANNs?
Plus/divide is routinely used in softmax and layernorm operations, which are found in almost all ANNs, from ancient MNIST classification networks to modern GPT-3. Any time you need a normalized result (e.g. probabilities that add to 1.0), you divide each neuron's output by the sum over all the neurons in the same layer. This is pretty similar to the inhibitory connections in natural NNs.
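Concretely, the two divide-by-pooled-activity operations being referred to look like this (a plain numpy sketch of the standard definitions):

```python
import numpy as np

def softmax(z):
    """Each unit's output is divided by the pooled (exponentiated) activity of its
    layer, so doubling one input does not simply add a fixed amount downstream."""
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def layernorm(z, eps=1e-5):
    """Divide by the layer's pooled activity (its standard deviation) -- another
    'plus/divide' rather than purely 'plus/minus' operation."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)
```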
"But neurons can only send information one way; Neuron A sends to Neuron B, but not vice versa."
In case anybody else was confused by this - since neurons *can* convey information in two directions - this summary from Wikipedia seems to explain the misunderstanding (emphasis mine):
"While a backpropagating action potential can presumably cause changes in the weight of the presynaptic connections, there is no simple mechanism for an error signal to propagate through *multiple* layers of neurons, as in the computer backpropagation algorithm."
https://en.wikipedia.org/wiki/Neural_backpropagation
I also have to fault the new paper for citing Francis Crick from *1989* as a source that the brain probably can't implement backpropagation. Crick may still be accurate in this case, but we've learned a lot about neurons since then.