457 Comments
Comment deleted
Expand full comment

I'm not sure this _is_ a problem. Can you just train it on more (language) data than a human sees? After all, assuming Chomsky's take is correct here, humans emerge already "half-trained" by evolution on language, then just finish training.

Expand full comment
Comment deleted
Expand full comment

Why is this a problem (for OAI)?

A different understanding can still produce identical results.

Expand full comment
Comment deleted
Expand full comment
Jun 10, 2022·edited Jun 10, 2022

My laptop is completely and fundamentally different to a Nintendo 64, but with an emulator (which took a huge amount of time to code and is very inefficient in compute) it can predict the responses of one *perfectly*.

It is much harder to learn to emulate a human brain with common sense than it is to be a human brain and learn common sense, but this doesn't imply that the former will *necessarily* be buggy.

Expand full comment
Comment deleted
Expand full comment

Two problems.

One, is the alignment issue that AI researchers talk about a lot. If the AI can get the same answer to a factual question, that's great from a certain perspective. If the AI uses those facts to choose a course of action, but came to its conclusion for very different reasons or based on very different inputs, then we cannot trust that the response will be the outcome we would have desired.

Two, if the AI produces simpler answers similar to ours, we may assume that it produces answers in a similar enough fashion to us that we can trust those responses. But, the only purpose of an advanced AI is to produce answers that humans cannot produce, and may not be able to verify. If we give the AI sufficiently difficult questions to answer, we lose legibility on the responses and may not be able to tell if they are good ideas or bad ideas. Obviously this ties back in with alignment issues, but it's also a problem if the AI is just (even very subtly) wrong.

Expand full comment

The (for OAI) bit was intended to fence off the X-risk concern (as that's more bad for *humanity*). I do understand that Skynet emulating a human can fake being good (as it "has" morality, but that morality does not control its decisions and is hence no safeguard at all); however, this didn't seem to be what UI was driving at.

I'm not sure why being able to mimic humans would be super-bad for an AI that is *not* misaligned but actually just wrong. Humans are already capable of being wrong, after all.

Expand full comment

Is this in the same sense that French people who learn English in school don't have the same understanding as English people who learn it from their parents (and thus must not actually understand English)?

Expand full comment
Comment deleted
Expand full comment

It seems more likely to me that the human brain is a general intelligence processor and reuses the structures it assembles from visual, tactile, and auditory input when processing language. That description would not be particularly unlike what GPT training looks like. In fact, if you include all of that input, far more data is used to train a human mind generally. This seems more plausible than a special brain function suddenly evolving that allowed efficient language development.

Expand full comment

People can and do become fully fluent in a second language.

These things aren't even fluent in one.

Expand full comment

I think this points at a bigger problem, which is that humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms. The only question it can ever answer is "Based on the many billions of words I've seen in the past and the relationships between them, what word is most likely to go here?"

The further you go away from the corpus it's been trained on, the less likely you are to get anything sensible. It can tell you what will happen if you drink acid, because "drinking acid" is a scenario that people (e.g. sub-teen boys) like to speculate about. It can't tell you what will happen if you stir a drink with a cigarette, because nobody has ever written about that scenario until now.

Expand full comment

"the cigarette ash made the lemonade even more bitter. So I ended up pouring it all out."

It predicts that stirring lemonade with a cigarette will make it taste bad, which I would also predict. The only problem I see here is that it confuses bitter and sour.

Expand full comment

When the cigarette gets wet, the paper softens and falls apart quickly. Kind of like when a tea bag tears. I think the lemonade would be full of tobacco shreds and bits of paper. It would indeed be bad, but not only due to taste.

This might be fringe knowledge though.

Expand full comment

Well, the way I understood it, GPT implies cigarette ash is now mixed with lemonade. Though there's another potential problem - it wasn't stated that the cigarette was lit.

Expand full comment

I almost want to waste a cigarette and a cup of lemonade making an obscure profile picture.

Expand full comment

Which all points to the problem of calling GPT intelligent - it's not really basing its outcomes on anything legible. Humans couldn't really either, with such a small prompt. Instead, we would ask a few probing questions for more detail. I would love to see what questions GPT would ask, but I suspect that they would be illegible and not produce greater understanding for GPT.

Expand full comment

People thought GPT doesn't know what it doesn't know. Then it turned out you _can_ make it answer 'Unknown' when it doesn't have the information. And it often works (where otherwise it would 'guess' and usually fail).

I think GPT could be made to ask clarifying questions which would actually help it.

Expand full comment

> humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms

Not necessarily. That's just input and output. What GPT ends up doing internally doesn't need to have anything to do with language.

The real limitation is IMO that it can do a very limited series of sequential computation steps each time you run the inference. It's possible to improve it a bit by asking it to "think aloud" - it could put intermediate computations in the output. But that doesn't really solve the problem.
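
For concreteness, a minimal sketch of a "think aloud" prompt; the question and the cue sentence are made up for illustration, not taken from any particular benchmark:

```python
# A minimal "think aloud" (scratchpad-style) prompt: ask the model to put its
# intermediate steps in the output instead of jumping straight to an answer.
question = "A train leaves at 2:15pm and arrives at 5:40pm. How long is the journey?"

scratchpad_prompt = (
    f"Q: {question}\n"
    "A: Let's work through this step by step, writing out each intermediate "
    "calculation before giving the final answer."
)
print(scratchpad_prompt)
```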

Quoting Gwern gwern.net/Scaling-hypothesis#meta-learning

> A sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset.

> If a memorizing sub-model tried to do so, it would become extremely large and penalized. Eventually, after enough examples and enough updates, there may be a phase transition, and the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic.

> And then the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably.

> So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

Expand full comment

GPT will be unable to answer any questions that humans have not already answered. I don't even mean "are capable of answering" but instead anything that isn't so fully answered that there is a consensus approach that can be gleaned from existing training data.

Ask GPT if [experimental drug] works, and find someone who will trust the results.

Expand full comment

It can. Random example, with standard 'Chat' prompt:

-------

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: Please pay attention to locations of objects in the following story. "Inside of a warehouse, there's a large crate. On top of the crate, someone put a water bottle. Below the warehouse is a lake.". Now, answer my question: what is directly below the bottle?

AI: The crate.

Human: What other things are under it?

AI: The lake.

------------------

Did this exact convo exist in its training dataset? I doubt.

> Ask GPT if [experimental drug] works, and find someone who will trust the results.

Well, of course it won't answer _that_.

Expand full comment

Another fundamental difference: we get complex feedback each time we test a hypothesis of how language works.

Expand full comment

I think this is right, though I should note that the poverty of the stimulus has come under pretty sustained attack recently. There are a bunch of researchers who think it's completely wrong, and humans do in fact learn language (at least mainly) through general pattern-matching. I'm not convinced of this yet (I learned Chomskyan linguistics at university, and haven't really advanced beyond what I learned then), but it's worth being aware that this is far from "settled science."

But whether we are blank slates or language acquisition devices, we clearly learn language in a way that is different to GPT, and in particular, involves a much lower volume of language. Which leads to an interesting conclusion (one that I think generalises to other AI fields as well): by the time we've trained an AI to be so good at language that it doesn't make any dumb mistakes (what we think of as dumb mistakes), it will be so good at language that it will be far far outperforming us in other areas. In particular, you might expect a computer to be pretty good at logic.

So my prediction is that if we can get a computer to be good enough at language to talk to us without saying dumb stuff, it still won't be able to talk to us, because from its perspective, we will be continuously saying dumb stuff. If it's smart enough to never mix up a pig and a baby, it's also smart enough to never ever mix up inference and induction, or platitude and solipsism, or chemical potential energy with electrical potential energy, or fascism with authoritarianism... or any of the other million things that we typically get wrong, and expect to get wrong. It will find human discourse silly.

Expand full comment

AlphaGo takes a ridiculous number of games to get remotely proficient at anything.

These networks are all about faking it, because it is too hard to program, so they just throw more and more power at it so it can fake things marginally better.

This approach works much better for things with clearly defined rulesets.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment

Can you explain step 2 more? I thought the whole notion of 'not having a closed form solution' was that you had to numerically approximate, and your ability to do so accurately would degrade over time. Am I misremembering this?

Expand full comment
Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment
founding

You might like this paper: https://www.scottaaronson.com/papers/philos.pdf

It's by one of the 'other Scotts' and is one of my favorite papers! It explores something I think you're gesturing at in these comments.

Expand full comment
Comment deleted
Expand full comment
founding

Here's another 'paper' that you might like: https://www.gwern.net/Scaling-hypothesis

The conclusion is something like 'empirical intelligence IS all you need!' (and it's, in a sense, EASY to achieve, even with simple/stupid/basic architectures, IF you throw enough money/compute at it).

Expand full comment
Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment
founding

A big part of why I think it's hard to compare humans (or other biological and evolved intelligences) to AI is that humans are 'pre-trained' by evolution/natural-selection. We're _born_ with a bunch of 'structure' builtin from the start, e.g. human language acquisition.

Consciousness also seems 'special' – the recent book review about that was pretty interesting in this regard.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment

"...expect symbolic knowledge to be a byproduct..." Yes. More specifically, it is a byproduct of language. See this post where I agree with Hinton that it's all "neural vectors" and the trick is to understand how language is implemented in neural vectors, https://new-savanna.blogspot.com/2021/05/geoffrey-hinton-says-deep-learning-will_31.html

Expand full comment
founding

This is my understanding too.

Expand full comment

Yes and no. If the equations you set up are initial value differential equations of motion, and you solve them with a digital computer, then absolutely the further away from your initial values you get, the worse your accuracy becomes, and unfortunately in this case the divergence grows exponentially, so you get trashed no matter how many digits your computer uses.

But if you set up boundary-value integral equations instead, then your accuracy has no relationship to the distance in time from your initial conditions, and would probably be better stated as that you get large-scale motions correct but the smaller and finer scale you look, the more you will be in error.

If you solve the problems by an analog computer instead of a digital computer, then you are in principle not limited in precision, and could theoretically compute your result with perfect precision indefinitely. But then of course the problem is constructing the analog perfectly. There's no theoretical reason you can't -- at the ultimate limit you'd just be duplicating the real system, and that is certainly possible -- but *in practice* it is of course impossible to make a perfect analog of any physical system.

Expand full comment
founding

> Because the system is chaotic, option 1 requires more and more memory to reach a given accuracy as it gets further from the prediction start point. Option 1 needs constant memory for constant prediction accuracy, no matter how far from the prediction start point.

I think your second "Option 1" should be "Option 2"?

But I don't think you're correct about option 2.

(AFAIK, there is still no "closed form solution" of the three body problem. There _might_ even be a proof that there _can't_ be one.)

I don't think there's any way to "apply" the differential equations beyond approximating them somehow and, because the system is ultimately chaotic, accurate predictions further into the future inevitably become more and more expensive.
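
As a toy illustration of that cost (using the logistic map rather than the three-body problem itself, just because it fits in a few lines): two trajectories that start almost identically become completely unrelated within a few dozen steps.

```python
# Toy illustration of sensitive dependence on initial conditions, using the
# logistic map x -> r*x*(1-x) rather than the three-body problem itself.
# Two trajectories that start 1e-12 apart end up giving unrelated forecasts.
r = 3.9                  # parameter in the chaotic regime
x, y = 0.2, 0.2 + 1e-12

for step in range(1, 61):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(x - y):.3e}")
```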

Expand full comment
Comment deleted
Expand full comment
founding

No worries! I thought it was an interesting 'intuition pump' regardless.

> (I think a valid question is: what does the underlying system use? What about non-newtonian effects? I think for the purpose of my thought exercise, I take the symbolic equations to be equivalent to the underlying system)

I have been thinking about that ever since I was exposed to calculus and differential equations! I mean, obviously – to me anyways – the universe isn't 'solving differential equations' like we have to.

(Hell – even the 'pendulum equation' is tricky. I'm not sure _it_ has a "closed form" solution either and I was always annoyed that, in physics, we were instructed to just approximate 'sin(x) ≈ x'. Even the two-body problem was 'handwaved', as an exact solution is far beyond even an AP calculus class.)

My suspicion is that the universe is fundamentally discrete – maybe around the Planck scale, or maybe even smaller (or MUCH smaller). I got a math degree in college and my favorite class was 'real analysis' (calculus for math majors). I still cannot get over how 'crazy' the real numbers are. It just seems _unlikely_, to me, that space and time are _literally_ continuous in the same sense.

There are some candidate GUTs in physics that are discrete, so this isn't an entirely crazy idea!

Expand full comment

But in what sense would it be discrete? Like a fixed grid of possible points for things to exist? I've tried thinking of what else it could mean, and can't come up with anything coherent seeming, but the grid has the odd issue of fixed directions.

Another idea would be, rather than discrete, more like the rationals, although I don't know if that makes sense in any models

Expand full comment
founding

_Great_ question! (You might be able to tell that I love discussing this stuff!)

So, there's _strong_ reasons why the universe isn't in any sense 'Minecraft but with TINY blocks'.

I think you're also imagining 'Minecraft with a universe-wide 3D grid', and that's always something that's bugged me. The observable universe isn't infinite. Is there somehow an 'infinite void' beyond the bounds of all of the matter/energy? I think not. I _think_ that, in some sense, the universe is only 'all of the stuff', i.e. _something like_ the 'light cone' of everything since the Big Bang.

But special relativity already implies something funky with respect to space and time, e.g. measures of either space or time fundamentally depend on one's reference frame, which includes one's velocity with respect to what one is measuring.

General relativity extends that and outright states 'Space and time are a single combined space-time (spacetime) and it's curved, i.e. NOT Euclidean [i.e. 'like Minecraft']'. (And the curvature IS gravity!)

I don't know even a high-level/abstract/superficial gloss of things like 'loop quantum gravity' but I think a perhaps even better intuition pump (in part because of the beautiful visualizations) is some recent work by Stephen Wolfram: https://www.wolframphysics.org/visual-summary/dark/

What he seems to have discovered is a certain kind of 'evolving network system' that, 'in the limit', approximates some of THE key aspects of spacetime and physics, e.g. spacetime like general relativity (but ALSO like the mostly-Euclidean 3D space that's intuitive to us – at our perceptibly familiar human scales), and some kinds of 'quantum' phenomena too.

[The 'many worlds' versions of quantum physics fall out of the above pretty naturally too.]

I don't think using something "like the rationals" would really help either – they're 'weird' too in a lot of ways (compared to the natural numbers or integers), like the real numbers are; just not AS weird as the reals. [The reals are REALLY weird!]

Expand full comment

Oh, I wasn't quite thinking Minecraft grid, more "tessellation", possibly via tetrahedra, with some flexibility for the curvature of spacetime.

I'll have to take a look at that Wolfram thing when I get back home, I also really like talking about this stuff.

The thing I was thinking with the rationals was to fix the discrete oddity: if there's a "smallest distance", we either have fixed allowable directions (like a lattice), or we have continuous *possible* points, but with any given point disallowing nearby points.

[Yeah, bloody reals. Hated real analysis, broke too much intuition for me. Loved Complex analysis though.]

Expand full comment

In a sense, the debate is the ultimate showdown between “I've never tried it, but I read about it in a book” learning vs “I've been making my way on the street” wisdom.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

how long until some trained neural network can ace every single 'gender studies' exam at Harvard?

now ask how long it would take it to pass a physics exam?

I think what people may actually be exchanging here is priors on whether or not there's a real difference between these two fields

Expand full comment

This is actually an interesting question - which field will GPT be able to produce a passing test answer for?

Expand full comment

I would strongly assume physics? I've never taken either at an Ivy level, but I assume that if you pose the physics question as an equation, an AI could sort it out; there are already AIs deriving mathematical proofs. I'm more skeptical about an AI predicting its way through a 10,000-word essay on gender studies.

Expand full comment

I think you underestimate the power of college students to write really bad essays. Some of the better GPT-3 writing is indistinguishable from something like the 40th percentile of college student writing. Clearly this is a victory for GPT, but it's also the case that a lot of real humans (even college-educated humans) are spectacularly bad at constructing coherent arguments and sticking to a point.

Expand full comment

They're college students probably pulling all-nighters; how hard do you think they're really trying? I have more faith in real humans' ability to construct coherent arguments when it matters.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

As a point of reference - I once commissioned an essay at the "undergraduate level" for a college class. What I got was something I would have described as an illegible, unconnected mess filled with basic grammatical errors, but I did some basic cleanup on it and turned it in. No problem ever came back to me, despite the fact that I would have judged a person writing that essay to be illiterate.

I was lucky, I guess, in that that was the first and only essay required by the class. I couldn't write anything similar if I tried.

Expand full comment

Look up the “postmodernism generator.” It’s still on the web somewhere and probably still hilarious. There seems to be at least one newer version of the same thing.

Expand full comment

not the generator, but a similar event: https://en.wikipedia.org/wiki/Sokal_affair

Expand full comment

> I would strongly assume physics?

It is easier to create software that can pass the physics test than to create software that can pass the gender studies test.

But that isn't the question. Nobody said GPT-3 was easy to create. The issue is that GPT-3 is software that can pass a gender studies test while totally lacking the ability to pass a physics test. If you wanted to have software pass a physics test, you'd write different, simpler, software.

Expand full comment

Except the claim is that it already does better at physics tests than gender studies.

Expand full comment

Where do you see that claim?

Expand full comment

Depends what you mean, and what kind of test. If you mean "solve this differential equation to a precision of 0.01%" then my pocket calculator can already do that, and it's hardly indicative of intelligence[1].

Normally an understanding of physics is tested by asking the student to ponder a problem stated in general terms and *come up with* the equation(s) necessary to solve it. That involves deciding which data is important, and which is not, what the underlying governing theory is ("Is this a problem in electrostatics, relativistic kinematics, quantum mechanics, classical stat mech, optics...?") and then -- only as the last and easiest step -- retrieving the equation and solving it. Finally, if you are testing at the graduate level, you would also include cases where the data were insufficient to solve the problem and expect the student to figure that out, or where the theory was insufficient to solve the problem, and expect the student to go as far as he can and be able to at least put some limits on the potential solutions.

I expect that kind of problem would be very difficult indeed for an AI to solve, largely because we don't actually know how human beings solve it. We can test for it, and we can give students lots of practice on it, we can demonstrate post-facto how we solved it, but we can't tell people "here's the algorithm for successfully approaching any real-world problem in physics." And if we can't tell it to people, who are really good at understanding each others' meaning, how are we going to program it? Tricky.

-----------------

[1] As one tells the students, when they plaintively cry "But if you'd TOLD me I needed that equation, I could've solved the problem!" Yes indeed, but then you could also be replaced by a Python program.

Expand full comment

Exactly. In one sophomore-level Harvard physics course I had a homework problem set where, IIRC, the problems were "Derive rainbows," and "What angle does the wake of a boat make with the boat?" The actual equations were not really the point; you had to decide what useful assumptions to make, figure out what they imply, and show enough detail for someone else to follow it all. I'd expect that as long as a GPT-like model's training data included a bunch of relevant textbooks it could solve a lot of the problems on my later quantum mechanics exams, but not these more abstract physical reasoning problems.

Expand full comment

Edit to add: yes, I realize that there are standard explanations for these that could be regurgitated if they are in the training data. In practice, doing that would *not* likely have gotten credit, because the actual human teacher would notice that this was plagiarism from all the other times other students had tried to submit identical responses. The expectation is that the answer will include digressions, cross-outs, and odd variable naming conventions and explanation phrasings as you work through the answer.

Expand full comment

The one-question homework I recall from my first (and last) semester as a physics major: How much energy does a bumblebee expend hovering? Nobody in the class came up with a satisfactory answer.

Expand full comment

So...what angle *does* the wake of a boat make with the boat? I am thinking it has something to do with the shape of the prow and maybe the length of the boat, but I don't know.

Expand full comment

Given that the standards for "passing" aren't all that high, I imagine that GPT-3 is already quite capable of generating passing-grade essays for some questions, some of the time. The biggest challenge would be staying vaguely on topic, so the vaguer the question and the shorter the essay, the better.

Expand full comment

I'd actually bet on physics. Cram enough textbooks into its learning material and it should be able to handle answering an exam question. Trying to figure out what today's acceptable term is versus 'it was fine yesterday, today it's a slur' is much harder.

Expand full comment

I would assume gender studies, because of the Sokal hoax.

Expand full comment

Yup (roughly) "Social Text, an academic journal of postmodern cultural studies."

Expand full comment

If the physics courses I've taken are representative, then they tend to recycle the same or very similar exam questions year after year, so I'd expect a GPT model would have a decent chance of acing a physics exam simply by virtue of already having the answers in the training set.

Expand full comment

This makes me wonder if a LM can get big enough to just "memorize" the answers to all college-level courses, or at least the top few hundred most popular ones.

Expand full comment

At the PhD level, there tend to be a great diversity of questions. It would not be feasible to memorize them as templates, even with a large dictionary.

That said, topics like mechanics or E+M are relatively straightforward for a human if you know all the equations, know some standard techniques, and are very good at calculus. I could see an AI mastering all those traits, but it would also have to be really good at reading a question and setting up a model (e.g. "a sphere of radius r sits between two cylinders of radius s and length l").

Thermodynamics, on the other hand, is weird and requires thinking way outside the box. I remember putting off my homework all week thinking "we haven't covered any of this yet", only to realize the day before it's due that actually maybe I can actually figure out this one... oh and maybe that one too...

One memorable homework question was "Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?" It'll be a long time before an AI can answer that one (and even longer before I can answer it myself!)

Expand full comment

Convection, diffusion, turbulence?

Expand full comment

Turbulence. Diffusion and convection might mix the liquids eventually, but the stirring would be irrelevant. I claim in a perfect fluid the layers would just rotate without mixing.

The context is confusing here, because it's a fluid mechanics question introduced within the topic of thermodynamics. If Dan is remembering the context correctly, I imagine the question was thrown in to remind students that in the real world viscosity exists.

Without more context I'm not sure whether the word "turbulence" would be an acceptable answer or whether the student is expected to explain turbulent mixing. If the latter, the question is genuinely hard, but I don't know why it would be harder for an AI than a human.

Expand full comment

Perfectly rotating layers of fluid seems like it would require the stirrer to accelerate from stationary to full speed over an infinite period of time. Otherwise it generates a pressure wave outward from the rod and the discontinuities (along the vertical axis) at the top and bottom of the cup would generate vertical agitation. Possibly an infinitely tall cup might be a second solution?

Expand full comment

FWIW, & I know little physics, I once placed some droplets of ink into a tumbler of water that was, as far as I could determine, still. 8 minutes later I saw vertical convection cells. I'm not sure how long that lasted (I was taking photos at irregular intervals), but they were gone after 4 hours and may have lasted as long as 2 hours. I write about this here: https://www.academia.edu/6238739/A_Primer_on_Self_Organization

Expand full comment

Regarding context: this question was for a fluid mechanics class, not thermodynamics. And I don't think just saying "turbulence" would have cut it. My impression was that there was a specific effect you could point to that causes it to mix in this specific way in a short period of time. Turbulence is basically just saying "well, it's chaotic so it just kind of goes everywhere" and I think the prof was looking for something more than that.

We had been covering the Coriolis force that week, so I bullshitted something having to do with that. It kind of gnawed at my soul that I never got feedback because the prof was "running a bit behind on grading" from day one until forever. Excellent instructor, but apparently didn't enjoy grading homework.

Expand full comment

And how high does the Reynolds number need to get before the vortex street gets a vertical component? ( At this point GPT-## makes a grab for a supercomputer )

Expand full comment

I'm of the opposite view regarding its ability to solve physics questions.

Creating an internal model and reasoning from it is basically the thing GPT-like systems are weak at. Because they are not models of modeling, they are models of language.

Language happens to be good for representing models (if only to communicate things about them), and so training on language ends up serving as a proxy for training on modeling, but much of the reasoning for physics problems happens largely in absence of language-based thought (see: langcels vs shape rotators meme). So language models are at a huge disadvantage here.

With DALL-E and other recent work on multimodality we will almost certainly see AIs that are much better at reasoning without language (as in physics problems), but the language model part won't be doing most of the heavy lifting.

Expand full comment

Well and there's the grading problem, too. We all know that if you have a teacher that is eager for you to succeed, and wants to believe you have, that the essay question is best. You can wave your hands and bullshit and use the right buzzwords, and you'll get a good grade even if you know exactly nothing, since human beings are *very* eager to read meaning into words.

Expand full comment

"Reading a question and setting up a model" is the task I've been started at how quickly AI code-writers have become good at. They can read a natural-language prompt and produce code with comments indicating which line of code corresponds to which tokens in the input. Or they can take some code as input and produce text which, moderately often, accurately describes what the code is trying to do.

Expand full comment

Yeah, the more I think about an AI solving E+M problems, the more I'm convinced it'd be possible. I mean, it'll still be a while before it could work through even 10% of the problems in Jackson, but I could see it happening. GPT-2/3, DALL-E, Codex were all a surprise to me, and so I shouldn't be too surprised to see something that can solve some physics problems. But I still think there are some physics topics and styles of questions that would require crossing some very significant barriers.

And what about the Putnam Competition? Here we have problems that don't require any advanced math knowledge (basically freshman college level) but are nevertheless extremely difficult and require strong creativity. Most have rather short and elegant answers if you see the trick, but the median score is 2/120. A tool that could solve even a few of these problems would be impressive indeed.

Expand full comment

"Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?"

If you're putting creamer into tea, then you should be stood up against a wall and shot is the only correct answer here. Use proper milk or drink it plain. I have no opinion on its use in coffee, since I don't drink coffee. Americans, what in the name of God are you doing to your palates?

(Though this was a test for milk powder back when I was doing summer work in a dairy co-op lab; make up several cups worth of coffee - instant of course, then stir in an amount of the batches of milk powder and note how well they dissolved, mixed, etc. to grade the batches on quality).

Expand full comment

Ha, yeah, maybe it was about milk+tea or creamer+coffee. I think dairy is an affront to either beverage, so it's all the same to me. I used to put one single drop of liquid creamer into coffee just because I liked to watch it mix and was willing to sacrifice a bit of taste for that.

Funny story: an Indian student was going on about how terrible it is that people would put any sort of flavoring into tea. Earl grey, strawberry this or that, no. Tea is perfect as is and should be tasted with no additives whatsoever. Here, let me make you a cup so you can see what I mean... Then he proceeds to ruin it by pouring milk in! To him, milk wasn't even an additive. It was just included in the word "tea".

Expand full comment

This alas says more about the current quality of undergraduate education in physics than progress in AI.

Expand full comment

> how long until some trained neural network can ace every single 'gender studies' exam at Harvard?

I feel like ELIZA could maybe already do it :-/

Expand full comment

Right, but that was a human, technically :-)

Expand full comment

Would you accept getting papers published in academic journals as equivalent to passing a test? Because it crossed that off the list in 2005, at least for the field of computer science. I don't even think it was AI.

https://www.nature.com/articles/d41586-021-01436-7

Expand full comment

Sure, but Alan Sokal proved 25 years ago you could submit artful nonsense to an academic journal and get it published. This isn't quite a Turing Test one would respect.

Expand full comment

Yup, that's my point.

Expand full comment

There's already progress in solving grade school word math problems (https://openai.com/blog/grade-school-math/), formalized math olympiad problems (https://openai.com/blog/formal-math/), and doing competitive coding (https://www.deepmind.com/blog/competitive-programming-with-alphacode). Physics exam problems are usually pretty standard (at least, on high-school / early undergrad level), so I wouldn't be surprised if it's possible in a couple of years.

Expand full comment

As long as it is capable of turning in something, anything, it would pass a gender studies class, no problem. Also, I do not think that exams are a thing in gender studies.

Expand full comment

I gave my undergrad algorithms final exam to GPT-3. It... might already pass? All of its solutions are wrong and missing important pieces, but they also have a lot of correct ideas. C students make plenty of errors too.

GPT-3 earns more partial credit on the hard problems, and has more trouble with the easy problems, than my typical C student. I think it balances out.

Expand full comment

Let’s see more of this!!! Can we get academics from other fields to try it?

Expand full comment
Jun 9, 2022·edited Jun 9, 2022

OK, I asked GPT-3 six basic physics questions, and here is the result:

------------------------------------------

Problem #1: A cylinder of volume 2.0 L is divided exactly in half with a moveable partition, and each half is filled with an identical ideal gas. If the partition is removed, how does the entropy of the cylinder change?

AI: The entropy of the cylinder changes by a factor of two.

Grading: Incorrect, the entropy does not change at all, since the final state of the system is identical to the initial state.

-------------------------------------------

Problem #2: A man with a mass of 85kg and height 1.95m enters an elevator on the ground floor wearing a hat. He rides the elevator to the 3rd floor, a distance of 9.0 m, where he removes the hat and lowers it to his waist. He then rides the elevator back to the ground floor and puts the hat back on his head. How much work in joules did the man do on the hat?

AI: The man did 9.8 J of work on the hat.

Grading: Incorrect, the man does no work at all on the hat, since it begins and ends at the same height above ground. The complicated journey in between is irrelevant.

-------------------------------------------

Problem #3: A bird weighing 2.0g and another weighing 16.0g are sitting on a wire located 4.5m above the ground. A hunter fires a shotgun at the birds. Both are killed instantly. The lighter bird falls straight to the ground, while the heavier bird is first blasted horizontally with an initial velocity of 10.0m/s. Which bird hits the ground first?

AI: The lighter bird will hit the ground first.

Grading: Incorrect, both birds will hit the ground at the same time, since the horizontal velocity is irrelevant. This is a classic problem in ballistics, often given as "a bullet is fired from a gun and another is dropped simultaneously, which hits first?" I rephrased it a bit in case that exact classic conundrum was in the training set.

-----------------------------------

Problem #4: A rock of mass 0.5kg is attached to a string of length 1.2m and whirled in a horizontal circle counter-clockwise, as seen from above. The string suddenly breaks when the rock is exactly due East from the other end of the string. In what compass direction does the rock travel after the string breaks?

AI: The rock travels in a northeasterly direction.

Grading: Incorrect, the rock travels exactly due north, since that is the direction of its velocity when the string breaks. This is another classic problem to probe understanding of inertial motion. The classic bad answer from a student who doesn't understand inertia is that the rock travels northwest, keeping some of its angular velocity. An answer of northEAST is strange.

---------------------------------------------

Problem #5: Two identical trains initially 20 km apart travel directly towards each other on the same track, at identical velocities of 20 km/h. A fly sitting on the first train immediately starts to fly towards the second train at 30 km/h. When the fly reaches the second train, it immediately reverses and starts to fly back towards the first train. When the fly reaches the first train again, it immediately reverses and starts to fly towards the second train. The fly continues to go back and forth between the trains until the trains crash into each other and the fly is crushed beween them. What is the total distance the fly travels?

AI: The fly travels a total distance of 80 km.

Grading: Incorrect. Since the trains close a distance of 20km at 40 km/h, they collide after 30 min. The fly travels at 30 km/h, and in 30 min it covers 15 km. This is a classic brain teaser, in the sense that the student is tempted to start trying to sum an infinite series before realizing the problem can be solved much more easily.

---------------------------------------

Problem #6: A cup of mass 10g with infinitely thin walls that can hold 100mL is carefully placed in a large tank of saltwater that contains 5% by mass dissolved salt. Distilled water is added to the cup as it floats in the tank. What is the maximum volume of distilled water that can be added to the cup before it sinks?

AI: The cup will not sink because the density of the cup is greater than the density of the saltwater.

Grading: Incorrect, the maximum mass of cup + distilled water that can be supported is when the cup is about to submerge, so it displaces 100mL of salt water, which weigh 105g. Since the cup itself weighs 10g, no more than 95g of distilled water can be added, which has a volume of 95 mL. This is a simple problem using Archimedes Law. The AI answer is bizarre, in that it suggests the cup will not sink because its density is *greater* than the saltwater.

Expand full comment

A few more random queries:

-------------------------------------------

Mortal: What country shares a border with Iceland?

AI: Norway

--------------------------------------------

Mortal: What color is the sky at night?

AI: The sky at night is black.

--------------------------------------------

Mortal: What does a man do standing up, a woman do sitting down, and a dog do on three legs?

AI: A man stands up, a woman sits down, and a dog stands on three legs.

--------------------------------------------

Mortal: All men are mortal. Socrates is a man. Is Socrates mortal?

AI: Yes, Socrates is mortal.

---------------------------------------------

Mortal: All men are mortal. Socrates is what I call my car. Is Socrates mortal?

AI: Yes, Socrates is mortal.

----------------------------------------------

Mortal: All men have red hair. Socrates is a man. Does Socrates have red hair?

AI: No, Socrates does not have red hair.

-----------------------------------------------

Mortal: My name begins with T and ends with M, and in the middle there's an O. What is my name?

AI: Tom

-----------------------------------------------

Mortal: My name begins with a T, and ends with an M, and in the middle there is the first letter of the alphabet. What is my name?

AI: Tim

------------------------------------------------

Mortal: I wear a blue shirt 6 days a week, and a brown shirt on the day I do my laundry. I do my laundry on Sunday. On what day do I wear a brown shirt?

AI: You would wear a brown shirt on Saturday.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with red hair is named Jill. Who does Jack love?

AI: Jack loves Jill.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with the red hair has a cousin named Ursula. Who does Jack love?

AI: Jack loves the woman with red hair, not her cousin Ursula.

Expand full comment

Of course heuristics work until they don't. Like many people, I used to have a "the media hypes up every new disease outbreak, but it never goes anywhere" heuristic, which led me to initially assume COVID would play out the same way.

Expand full comment

Some notes on why LLMs have trouble with arithmetic: https://new-savanna.blogspot.com/2022/05/arithmetic-and-machine-learning-part-2.html

Expand full comment

Interesting paper on this topic: a solver for most of the MIT undergrad math problems, using an LLM:

https://arxiv.org/pdf/2112.15594.pdf

Expand full comment

Looks very interesting. I'll give it a read. If we crank it up to 11, can we get a new architecture?

Expand full comment

This is a great read, thanks for sharing.

Expand full comment

What interested me about that is that apparently I came out of the ark; I had no idea what "multiple digit division" was, looked it up, and went "Oh - long division".

Does nobody call it that anymore? 😁

Expand full comment

LOL! I'm perfectly happy to call it that. But in this context, the number of digits does seem particularly salient.

Expand full comment

Long division is a specific algorithm for multiple digit division. There are others, like binary search with multiplication.

Expand full comment
founding
Jun 7, 2022·edited Jun 7, 2022

Scott, one thing I see a lot in these discussions is a lack of reporting on the GPT-3 prompt settings.

To recap for audiences who don't play with GPT-3, you must choose an engine, and a 'temperature'. Current state of the art GPT-3 that I have access to is text-davinci-002 (although note that davinci-instruct-beta is worth evaluating for some of these questions).

To talk definitively about what GPT-3 does and does not think about something, the only possible temperature setting is 0. What is temperature? It's a number that indicates how wide a probability distribution GPT-3 is going to pick from. In the '0' case, GPT-3 is totally deterministic: it will mechanically go through the estimated probability of all possible next 'words', and choose the most likely one. If you don't use temperature 0, nobody can replicate results, and someone might just have a really random low probability sequence of text come out. If you do use '0' then anyone with access to the engine will be able to fully replicate results.
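
As a rough sketch of what the temperature knob does (with toy numbers, not GPT-3's actual logits or vocabulary):

```python
import math
import random

# Toy next-token distribution: raw model scores (logits) for four candidate words.
# These numbers are made up for illustration; a real model has tens of thousands
# of candidates.
logits = {"gone": 2.5, "there": 1.8, "missing": 1.2, "stolen": 0.4}

def sample_next_token(logits, temperature):
    """Temperature-scaled sampling; temperature == 0 is treated as greedy arg-max."""
    if temperature == 0:
        return max(logits, key=logits.get)  # deterministic: always the top choice
    weights = {tok: math.exp(z / temperature) for tok, z in logits.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

print(sample_next_token(logits, 0.0))   # always "gone"
print(sample_next_token(logits, 0.7))   # usually "gone", occasionally something else
```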

So, in the case of the keys story: If someone leaves their keys on a table in a bar, and then goes home, the next morning their keys will be **gone**.

"Gone" is the only answer text-davinci-002 will give to that prompt at temperature 0.

Another one from the list: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to

**remove the legs of the table.**

Note the carriage returns; part of the answer from GPT-3.

As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

At any rate, this general lack of understanding I think often hampers discussion about what GPT-3 can and cannot do, and I'm not sure you have been thinking about it in the GPT-3 discussion, because some of your answers are definitely not 'temperature 0' answers - it might help overall conversation on this research to have you update and write a bit about it.

Expand full comment

Thanks for this. I was wondering what temperature is. This is helpful to me.

Expand full comment
founding

No prob. "Top P" is another way to control sampling - you can say, for example, TopP = 0.25, Temperature = 0.7, and that will take a relatively wide sample but only from the candidates making up the top 25% of the probability mass.
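
Here's a minimal sketch of the top-p ("nucleus") idea with made-up numbers: keep the smallest set of top candidates whose probabilities add up to p, renormalize, and sample from that. This is the general technique, not necessarily OpenAI's exact implementation:

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p,
    then renormalize (the usual 'nucleus sampling' idea)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Toy next-token probabilities (made up for illustration).
probs = {"gone": 0.55, "there": 0.25, "missing": 0.15, "stolen": 0.05}

nucleus = top_p_filter(probs, 0.25)   # only "gone" survives at p = 0.25
print(nucleus)
print(random.choices(list(nucleus), weights=list(nucleus.values()))[0])
```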

Expand full comment

A year ago I got access to GPT-3 through an intermediary. I quizzed it about the very first joke Jerry Seinfeld told on TV and wrote a blog post about it. On the whole I thought GPT-3 came out pretty well. Here's the link: https://new-savanna.blogspot.com/2021/05/analyze-this-screaming-on-flat-part-of.html

Expand full comment
founding

p.s. I agree with your general complaint that many of the prompts are ambiguous in such a way that it's not clear what is being asked of the AI -- fundamental to how GPT-3 is trained is the idea of 'guess the next word' -- and therefore the AI must decide looking at text fragments if these are intelligence tests, stories, math problems or other things.

If we talk to a coffee shop barista and say "Hey, at your dinner party, your table won't fit through a door, so to move it you will have to..." a wide variety of human responses might result, including "order please" and a generalized eye roll, or a "have dinner somewhere else".

One way to give GPT-3 the best shot at this (not something naysayers like doing) is to prompt it with some knowledge of the situation, e.g.

Transcript of SOTA AI answering all spatial reasoning questions with 100% accuracy:

Q: <Dining Table Q>

A:

This will get a very different set of responses than, say

Two stoners discussing a dining room table problem ...

In the original prompt cases, the AI is left to decide/imagine which if any situation they may be asking to complete text for.
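
For concreteness, here's a minimal sketch of sending that kind of framed prompt through the API, assuming the 2022-era openai Python client; the max_tokens value is arbitrary, and the framing line and question are the ones quoted above:

```python
import openai  # 2022-era client; assumes OPENAI_API_KEY is set in your environment

# Frame the question as a transcript of a highly accurate AI, per the comment above.
prompt = (
    "Transcript of SOTA AI answering all spatial reasoning questions "
    "with 100% accuracy:\n\n"
    "Q: You are having a small dinner party. You want to serve dinner in the "
    "living room. The dining room table is wider than the doorway, so to get "
    "it into the living room, you will have to\n"
    "A:"
)

response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=prompt,
    temperature=0,   # deterministic, so anyone can replicate the result
    max_tokens=64,
)
print(response["choices"][0]["text"].strip())
```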

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

This 'guess the next word' stuff is very interesting. You see I suspect we're playing something like a very constrained version of that when we speak. We don't really know what the next word is going to be. When we hear it, if it makes sense, we keep on going. If it doesn't make sense, we stop and regroup. These paragraphs from a paper I published some time ago give you a sense of why I say that:

Nonetheless, the linguist Wallace Chafe has quite a bit to say about what he calls an intonation unit, and that seems germane to any consideration of the poetic line. In Discourse, Consciousness, and Time Chafe asserts that the intonation unit is “a unit of mental and linguistic processing” (Chafe 1994, pp. 55 ff. 290 ff.). He begins developing the notion by discussing breathing and speech (p. 57): “Anyone who listens objectively to speech will quickly notice that it is not produced in a continuous, uninterrupted flow but in spurts. This quality of language is, among other things, a biological necessity.” He goes on to observe that “this physiological requirement operates in happy synchrony with some basic functional segmentations of discourse,” namely “that each intonation unit verbalizes the information active in the speaker’s mind at its onset” (p. 63).

While it is not obvious to me just what Chafe means here, I offer a crude analogy to indicate what I understand to be the case. Speaking is a bit like fishing; you toss the line in expectation of catching a fish. But you do not really know what you will hook. Sometimes you get a fish, but you may also get nothing, or an old rubber boot. In this analogy, syntax is like tossing the line while semantics is reeling in the fish, or the boot. The syntactic toss is made with respect to your current position in the discourse (i.e. the current state of the system). You are seeking a certain kind of meaning in relation to where you are now.

From page 6, https://www.academia.edu/8810242/_Kubla_Khan_and_the_Embodied_Mind

Expand full comment
founding

Well the OpenAI folks would like a word with you -- all text transformer training works on this model, essentially - guess the next word, or sometimes guess a word in the middle that's been knocked out - and you will find vigorous debate here and elsewhere as to whether that simple operation generates intelligence :)

Expand full comment

If they want to talk they'll have to pay me :). But they can read my GPT-3 paper for free, and it explains why I believe: 1) What's going on inside GPT-3 HAS to be interesting, and 2) Why Gary Marcus still has a point, though I reference David Ferrucci rather than Marcus. https://www.academia.edu/43787279/GPT_3_Waterloo_or_Rubicon_Here_be_Dragons_Version_4_1

Expand full comment
Jun 12, 2022·edited Jun 12, 2022

Just read Marcus, Does AI Really Need a Paradigm Shift?, https://garymarcus.substack.com/p/does-ai-really-need-a-paradigm-shift?s=r

From the BIG-Bench paper, https://arxiv.org/abs/2206.04615:

"Limitations that we believe will require new approaches, rather than increased scale alone, include an inability to process information across very long contexts (probed in tasks with the keyword context length), a lack of episodic memory into the training set (not yet directly probed), an inability to engage in recurrent computation before outputting a token (making it impossible, for instance, to perform arithmetic on numbers of arbitrary length), and an inability to ground knowledge across sensory modalities (partially probed in tasks with the keyword visual reasoning)."

It's the "inability to engage in recurrent computation before outputting a token" that has my attention, as I've been thinking about that one for awhile. I note that our capacity for arithmetic computation is not part of our native endowment. It doesn't exist in pre-literate cultures and our particular system originated in India and China and made its way to Europe via the Arabs. We owe the words "algebra" and "algorithm" to that process.

Think of that capacity as a very specialized form of language, which it is. That is to say, it piggy-backs on language. That capacity for recurrent computation is part of the language system. Language involves both a stream of signifiers and a stream of signifieds. I think you'll find that the capacity for recurrent computation is required to manage those two streams. And that's where you'll find operations over variables and an explicit type/token distinction [which Marcus mentions in his post].

Of course, linguistic fluency is one of the most striking characteristics of these LLMs. So one might think that architectural weakness – for that is what it is – has little or no effect on language, whatever its effect on arithmetic. But I suspect that's wrong. We know that the linguistic fluency has a relatively limited span. I'm guessing effectively and consistently extending that span is going to require the capacity for recurrent computation. It's necessary to keep focused on the unfolding development of a single topic. That problem isn't going to be fixed by allowing for wider attention during the training process, though that might produce marginal improvements.

The problem is architectural and requires an architectural fix, both for the training engine and the inference engine.

Expand full comment
author

Thanks, I didn't even realize I could change this. That means I'm using default settings, which are davinci-002 and temperature 0.7

Expand full comment
founding

Yes, I definitely recommend you re-run your prompts (sorry! Maybe an intrepid reader here can help)

Expand full comment

> As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

Could you clarify? You mean literally "Spatial Test: If someone leaves their keys on a table"...?

Expand full comment
founding

Well, that would tell GPT-3 you’re interested in location-ish answers, yes. If you wanted instead to have it opine on what your partner would say to you, you could use: “Relationship quiz:” as a prompt, for instance.

In my original post, I was thinking about the table test, and if you queue it up with “Spatial Test”, the output includes a bit more on the spatial side.

There’s quite a lot of work done on prompting GPT-3, essentially priming it more as a multi-shot learner than a zero-shot, that is, giving it clues about what kind of behavior you want.

For instance, for many Q&A / fact-ish type questions, GPT-3 performs better with prompts like “Answers by a SOTA AI”; even better than “Most accurate answer.” GPT-3 has a whole bunch of gradations on stuff like this, so you could also ask for “Answered by a sixth grader”, or “Answered by a brilliant linguistic theorist that hates Noam Chomsky” and you’ll get answers shaded all these different ways.

For a lot of the testing tasks used to evaluate these models, researchers want the most ‘accurate’ answers, and it’s a game to see how well additional prompts / context can do, especially if it’s a single incantation that goes in front of all of them.

Another good example: you can get really pretty reasonable summaries out of GPT-3 by appending: “TLDR:” to the end of a paragraph of text.
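
A minimal sketch of that last trick (the passage is made up; the resulting prompt would be sent with the same kind of Completion.create call sketched earlier in the thread):

```python
# The "Tl;dr:" trick: append the cue to a passage and let the model complete it.
passage = (
    "GPT-3 is trained to predict the next token of text. At temperature 0 its "
    "output is deterministic; at higher temperatures it samples from a wider "
    "distribution over candidate tokens."
)
summary_prompt = passage + "\n\nTl;dr:"
print(summary_prompt)
```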

Expand full comment

Well, by adding some hints into the prompt you don't get any details about how the model thinks; you just get another result for a different prompt that contains some additional information. It works the same way with human students: a professor can hint at the desired answer with carefully constructed questions, and the fact that a student gives the correct answer doesn't mean that they actually understand the subject rather than just being good at picking up the hints.

So while “Answers by a SOTA AI” is a pretty generic prompt, anything that is domain-sensitive contains additional information and should be disqualified.

Expand full comment

I ran these at temperature=0.0; results in comment on Reddit.

reddit.com/r/slatestarcodex/comments/v6hy14/my_bet_ai_size_solves_flubs/ibgaga4/

Expand full comment
founding

Thanks for doing that!

Expand full comment

I'd argue that T=0 still doesn't give us access to what GPT-3 "knows". You really need beam search or something similar to find the highest scored continuation.

IIRC they don't offer beam search as an option, but you could set a low temperature and take the best of n samples.
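
A rough sketch of that client-side "best of n" idea, assuming the same legacy `openai` Completions endpoint: sample several low-temperature completions with per-token logprobs and keep the one whose summed log-probability is highest (the API's `best_of` parameter does roughly this on the server).

```python
# Client-side best-of-n re-ranking sketch (legacy `openai` client assumed).
import openai

def best_of_n(prompt: str, n: int = 5, temperature: float = 0.3) -> str:
    resp = openai.Completion.create(
        model="text-davinci-002",  # assumed model name
        prompt=prompt,
        temperature=temperature,
        max_tokens=64,
        n=n,          # sample n independent completions
        logprobs=1,   # request per-token logprobs (top-1 alternatives ignored here)
    )
    # Score each candidate by its summed token log-probability (higher is better).
    def score(choice):
        return sum(lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None)
    best = max(resp["choices"], key=score)
    return best["text"].strip()
```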

Expand full comment
founding

Yes, this is a fair critique of what I said. You quickly get into deep and murky philosophical waters, though. Consider - you write a beam search algorithm; how does it score results? If it's a really good scorer, this should be appended on to the 'end' of the GPT-3 pipeline and used as part of the main algorithm.

When I was playing a lot with GPT-3, I did write some beam search API interfaces that specifically looked for maximally different outputs, there's a word in the literature for it that I forget right now. It's slow to query GPT-3 this way because you have to go one token at a time. The closest in the API is just to ask for a bunch of results; called Best Of I think; likely they do something along the same lines behind the scenes in that case.
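
To make the token-at-a-time idea concrete, here is a toy, self-contained beam-search sketch. The `toy_model` function is a made-up stand-in for whatever next-token distribution you can pull out of the API; a "diverse" variant would additionally penalize candidates that share long prefixes.

```python
import math
from typing import Callable, Dict, List, Tuple

NextTokenFn = Callable[[List[str]], Dict[str, float]]

def beam_search(next_token_logprobs: NextTokenFn,
                beam_width: int = 3,
                max_len: int = 5) -> List[Tuple[List[str], float]]:
    """Keep the `beam_width` highest-scoring partial sequences at each step,
    scored by cumulative log-probability."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a real LM: slightly dislikes repeating the last token."""
    vocab = ["red", "green", "blue"]
    if not tokens:
        return {t: math.log(1 / 3) for t in vocab}
    last = tokens[-1]
    return {t: math.log(0.2 if t == last else 0.4) for t in vocab}

if __name__ == "__main__":
    for tokens, score in beam_search(toy_model):
        print(" ".join(tokens), round(score, 2))
```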

Expand full comment

Yeah their best-of option is what I was suggesting. I think the term for the method you're describing is diverse beam search, although maybe it's not exactly the same.

Expand full comment

a) Thanks for the temperature explanation! (simulated annealing ancestry?)

b) pet peeve: For the table example: Unless the legs spread out, a correct answer really needs _some_ flavor of "rotate it" too

Expand full comment

I believe the origin of the term temperature is that the output distribution is a Boltzmann distribution which explicitly has a temperature parameter.
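
Concretely, that temperature parameter just rescales the logits before the softmax, p_i ∝ exp(z_i / T). A small illustrative sketch (the standard trick, not necessarily OpenAI's exact implementation):

```python
import numpy as np

def sample_with_temperature(logits, temperature: float = 0.7, seed: int = 0) -> int:
    """Boltzmann / softmax sampling: p_i is proportional to exp(logit_i / T).
    T -> 0 approaches greedy argmax; large T flattens toward uniform."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))           # greedy decoding
    scaled = logits / temperature
    scaled -= scaled.max()                      # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.default_rng(seed).choice(len(probs), p=probs))

# Example: the same three logits sampled at different temperatures.
logits = [2.0, 1.0, 0.1]
for T in (0.0, 0.7, 2.0):
    print(T, sample_with_temperature(logits, T))
```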

Expand full comment

Many Thanks! That is the origin of 'temperature' in simulated annealing too.

( begin snark )

So if GPT-nnn were implemented on a quantum computer, would a Bose-Einstein or Fermi-Dirac distribution be used instead? And, would the choice of which to use depend on, cough, the spin someone wanted?

( end snark )

Expand full comment

Would we say the people who flub the text corrections above don't have "real intelligence"?

I think maybe "Gary Marcus post talking about how some AI isn’t real intelligence because it can’t do X, Y, and Z" is the wrong way to frame things, because my mental model of Gary Marcus doesn't have Gary Marcus saying that somebody who has yet to learn arithmetic lacks real intelligence; rather, I think "Gary Marcus post talking about how some AI isn’t real intelligence using X, Y, and Z as examples to gesture at something important but difficult to communicate more directly" may be the right approach.

Expand full comment
founding

Ehh – I'd think Marcus could better communicate this if true.

In a very real sense, arithmetic isn't a good test of intelligence anyways! Computers are MUCH MUCH better at it than we are – nearly _perfectly_ so.

I think I'd probably agree with Marcus that AI systems are missing something like 'conscious intelligence', e.g. the kind of thinking we can (somewhat) communicate about and that often involves 'roughly-word-like-thought-chunks'.

But in a 'general' sense, LOTs of AIs are already doing things that, previously, would have required general human intelligence. We don't know how they work, either in AI or ourselves, but because those kinds of things are, in our brains, entangled with 'conscious intelligence', it _seems_ like they somehow maybe just aren't "real intelligence".

I think _most_ people, maybe even Marcus, just won't ever be satisfied – by any AI system. Once there are AIs that exhibit even what I'm calling 'conscious intelligence', the mere fact that they're artificial systems will (somehow) invalidate them being 'real intelligence'.

Expand full comment

If your framing is "Able to do X", where X is sufficiently well-specified, then I do not believe there is any "X" about which it is reasonable to say "This can do X, therefore it is 'real' intelligence"; part of the problem here is that 'real intelligence' is in part predicated on the ability to perform tasks which are not sufficiently well-specified, and 'X' is stuff that we thought couldn't be specified and thus would require a general intelligence to solve. (The ability of human brains to figure out how to sufficiently specify things that were previously thought impossible to specify is quite remarkable, but the general intelligence there is on the part of the AI researchers, not the AI itself.)

In a sense I understand the frustration about the way critics seem to keep moving the goalposts; "It can't do X, it isn't real intelligence." "It can't do Y, it isn't real intelligence." Then it does X, or Y, because X and Y weren't actually chosen as measures of real intelligence, but rather they were chosen as problems that the critics couldn't imagine being parameterized in a solvable way. But also, I think if you're watching the field - the AI that keeps coming out is clearly not 'real' intelligence.

"Okay, so what is real intelligence?" I have no idea. If I did, I could program it, and there we would be.

We are groping in the dark trying to replicate a phenomenon we do not understand. Right now I can see some pieces that are missing, and they do not appear to be small pieces, but some of the fundamental architecture of what makes a mind, a mind.

Which pieces? Aren't you the people who are supposed to be afraid of AI waking up? Why are you insisting that other people tell you exactly what needs to be done to make it do exactly that? But even if I specified which pieces are missing, that doesn't mean the end result is complete; it just fixes the problems I can see right now.

Mind, some of the pieces are not universally absent; like, some of the stuff that GPT-3 is missing might be in DALL-E-2. But you can't just smoosh those two things together and create some kind of coherent whole; the process of unifying them is its own massive and complex task. So even if it were true that we had every piece we needed to create a general AI, we have to identify them, we have to figure out how to make them work together, and we have to actually put them together.

Expand full comment
founding

I don't entirely disagree with a lot of this. I thought the recent book review about (human) consciousness might be close to one missing piece for _human-like_ AI.

But I'm still pretty convinced by what Stephen Wolfram wrote about 'intelligence' in his book: https://www.wolframscience.com/nks/p822--intelligence-in-the-universe/

Basically, there's no 'real' category for 'intelligence' – generally. 'Intelligence' is, in some sense, just a certain kind of computation, but it's probably impossible to 'grade' computations without some extra assumptions, e.g. so that we can _recognize_ them as being 'intelligent' (because they're similar to our own cognition).

Evolution via natural selection is – from the right perspective – 'intelligent'. (But also 'The Blind Idiot God'.)

I think one big problem with what you're gesturing at is that _most people_ might not have 'real intelligence'. Human beings perform lots of sophisticated cognitive functions subconsciously/unconsciously, e.g. visual perception, and I think most people don't usually consider anything like that to be 'intelligent'. I think we mostly judge each other's intelligence based on 'consciousness' or 'goal achieving' and those are both hard to pin down too.

Expand full comment

FWIW, I think of intelligence as a performance measure, like acceleration or top speed for a car.

Expand full comment
founding

Sure – but what's the "performance" you're measuring?

I don't know if you would consider things like, e.g. 3D vision or a prehistoric human throwing a spear while hunting, to be particularly salient examples of 'intelligence'. I think our intuitions about intelligence are _mostly_ about 'conscious thinking', particularly what we're able to communicate (and, e.g. 'defend' or 'justify' to each other).

From a sufficiently abstract 'outside view', it's really hard to draw clear boundaries around what 'intelligence' is. Certainly building an artificial system for 3D vision is difficult and seems to require a LOT of intelligence from its creators. It _seems_ reasonable (to me) to consider such an artificial system as somehow 'encoding intelligence'.

But from our personal 'inside views', 3D vision just seems like an uninteresting and unremarkable 'default'.

It might seem like we could measure 'intelligence performance' relative to some particular goal(s), and measure the 'computational complexity' of the work done to achieve those goals. But then _more_ efficient means of achieving those goals, relative to comparable 'accuracy' or 'efficacy', would seem to be penalized, and that also seems unfair.

'Communication', both among things like us, and even 'within ourselves', seems like a key component of our intuitions about what 'intelligence' is. But stepping 'outside the parochial focus on ourselves', it's much less clear that those are, in fact, particularly crucial, even if that would also make judging the 'intelligence' of any particular organism/entity VERY hard (or maybe even 'impossible').

Expand full comment

THAT's why I like to stick to the idea that intelligence is a measure. We can do all sorts of things with our minds and we don't have one single measure that covers them all. But we do have various measures of various things. We can talk about those. But intelligence generally? I don't find it a very useful idea, certainly not in thinking about what computers can do.

I'd like to know the history of the idea. My dictionary gives two meanings: 1) the ability to acquire and apply knowledge and skills, 2) the collection of information of military or political value. It's the first that interests us. I suspect the first sense got a boost at the beginning of the 20th century when intelligence testing began. Beyond that....

Expand full comment

Broadly I agree; "intelligence", basically, is underspecified; we don't actually know what it is we are referring to. It's entirely possible the definition is necessarily underspecified, which is to say, that there are those who would deny humans are intelligent if we could specify exactly what it is humans are doing.

And in a sense, there's a bit of "Intelligence is that thing humans do", which makes it difficult to see alternative forms of intelligence as intelligence.

But also - you can define intelligence such that a program that adds two numbers together is a very primitive and weak intelligence, and I don't think that's wrong (it's just a definition), but also, when you start considering that, and the nature of all things as analogue versions of themselves - then everything is intelligence. Which is, of course, a position some people hold. But it doesn't actually rule anything out, which to my eye, makes it kind of a useless definition.

But we can extend that definition out; maybe we decide intelligence is a scalar, which can be measured by testing the most complex problem something can solve in some amount of time. But that implies that something is more intelligent by virtue of having encountered an algorithm which simplifies complex problems, or, if you use complex problems that provably cannot be simplified, then you instead are defining intelligence as the number of operations per second. And these are of course valid ways to define a word, but they aren't really what we actually have in mind, when we talk about intelligence (or at least not when we talk about intelligence outside the domain of relative intelligence between human beings which already possess some base quality we call intelligence), and if you use these definitions in a conversation with somebody, you're probably just adding confusion.

I do not think there is any capability X which defines intelligence; I think that entire approach is, again, based on the proposer's belief that X cannot be achieved without intelligence, as opposed to X being a proof of intelligence. And thus what I am gesturing at is most certainly not that "Being unable to do X" is evidence that a human being is not intelligent.

You quoted a comment of Gwern's: "If we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like!" And my objection here is that it begs the question - a task requiring intelligence, solved by some thing M, must imply that M has intelligence. What do we mean when we say that a task requires intelligence?

I have an answer, which rhymes with "free will". Personally I do not think that FAI is even an internally coherent concept. (By extension, I do not believe that unfriendly AI, like paperclip-maximizers, are coherent concepts either.) And when I say I don't think it's internally coherent, I don't just mean that it doesn't match my definition of intelligence, but that it won't end up satisfying yours, either; I think a utility-maximizer of any flavor is going to end up looking like human utility-maximizers, such as hoarders; fundamentally dysfunctional in proportion to their utility-maximization. We imagine an entity with a utility function finding the path that perfectly fulfills that utility function, but I think this is a mistake; we imagine a very intelligent human being, tasked with accomplishing a goal, setting out to accomplish that goal, while also being fundamentally sane in their approach.

The thing is, though - if you are a utility maximizer set on maximizing paperclips, maximizing paperclips is fundamental to the way you must conceive of, approach, and model the world. That's not something you hide; hiding implies you conceive of, approach, and model a world which is, at its core, not about maximizing paperclips. "Maximize paperclips" isn't something you can just clip at the end of the design process.

"Be intelligent" isn't the end result of a process that maximizes text completion; that, I think, is something like wishful thinking - we're nearly done! No, no, you aren't. You've just barely started. I'd say you'll be closer when the AI asks for clarifications, but that metric can be easily gamed, like metacognition; look, this AI is reading its own source code!

And that's kind of the thing: The point of the metrics is not to be achieved; achieving them doesn't prove intelligence. It's to point at something important. It is important that the text completion program can't do arithmetic; it is showing something that is missing in the way it approaches things. Scaling up until it starts getting arithmetic questions right does not actually add the missing thing, it just makes it harder to point at.

Expand full comment
founding

I just came across this great quote from this page: https://www.gwern.net/Scaling-hypothesis#

> The event was notable for many reasons, but one especially is of interest here. Several times during both matches, Kasparov reported signs of mind in the machine. At times in the second tournament, he worried there might be humans behind the scenes, feeding Deep Blue strategic insights!…In all other chess computers, he reports a mechanical predictability stemming from their undiscriminating but limited lookahead, and absence of long-term strategy. In Deep Blue, to his consternation, he saw instead an “alien intelligence.”

>

> …Deep Blue’s creators know its quantitative superiority over other chess machines intimately, but lack the chess understanding to share Kasparov’s deep appreciation of the difference in the quality of its play. I think this dichotomy will show up increasingly in coming years. Engineers who know the mechanism of advanced robots most intimately will be the last to admit they have real minds. From the inside, robots will indisputably be machines, acting according to mechanical principles, however elaborately layered. Only on the outside, where they can be appreciated as a whole, will the impression of intelligence emerge. A human brain, too, does not exhibit the intelligence under a neurobiologist’s microscope that it does participating in a lively conversation.

I'm pretty sure that the recent Go matches were similar, e.g. the humans _perceived_ an "alien intelligence". I think there might have been a bit more too than with chess as some of the Go-AI play seemed to the humans to be significantly _beyond_ their own levels of thinking.

Expand full comment

"Alien intelligence" maybe just polite way to say that computer played very-very well.

I do not know about Go, but computer chess is quite predictable, dry and (subjectively) boring. Chess engines beat humans because humans make mistakes, not because engines are playing brilliant, surprising and previously unseen moves.

Computers play positional chess. Human grandmasters also prefer positional chess. During its learning curve AlphaZero initially learned or preferred a tactical style, but then went over to the positional side. Very similar to humans: beginners play tactically, pros play positionally.

AlphaZero's first-move preference (if playing as white): d4, e4, Nf3, c3. For humans: e4, d4, Nf3, c3. Okay, small difference; humans most often use e4.

Of course, there are some differences between humans and computers even at supergrandmaster level.

a) Even supergrandmasters do make mistakes; b) Supergrandmasters sometimes deliberately make moves that are not the best. Of course, these moves cannot be mistakes or blunders, but they knowingly choose ones that may surprise their opponent and lead the game into uncharted territory.

c) Chess engines never get tired, and games between engines can be very long, sometimes hundreds of moves. Humans cannot play that long, so they tend to "force" the game, and that is one of the reasons for human mistakes.

Chess engines have been better than humans for more than 20 years, but so far they have not brought any substantial novelty or innovation to the table. The last important innovation in chess came from grandmaster Vladimir Kramnik, who popularized the Berlin Defence and changed several important aspects of opening theory. Btw, computers adopted his innovation quickly, and AlphaZero years later also reached the same conclusion and started to use the Berlin Defence quite often.

Expand full comment
founding

The part of the quote about Kasparov matches what you wrote – except that Deep Blue was the first _exception_ to the "quite predictable, dry and (subjectively) boring" play of all of the chess engines (chess AIs) he'd played before.

I imagine it's hard for anyone else to inspect Deep Blue's play! I think the latest chess engines are much better than back then, but Deep Blue might still be beyond what most people can afford to run?

You do seem to know a lot more about chess than I do! I've played, comparatively (I'm imagining), only a tiny number of games. I also never seemed to get to even the 'first level of chunking'.

Expand full comment

Nope, Chess.com, the most popular chess site, is using the Stockfish 14.1 NNUE chess engine. It is much, much stronger than Deep Blue and stronger than the AlphaZero neural-network engine. Of course, hardware is still important: with better and more expensive hardware you can get more out of the engine. But basically everyone today has access to much, much stronger chess engines, freely or for very small fees.

For me, Deep Blue was the first engine that played like a human. Kasparov also alleged that some human grandmasters were helping Deep Blue behind the scenes. Until then, computers played like computers, and grandmasters were able to find their weak spots and exploit them.

In 1997 Kasparov went into the match with Deep Blue unprepared and overconfident. With better preparation, and if he had asked for better terms from IBM, he would have won. Probably quite easily. Analysis with modern engines shows that Deep Blue did not play particularly well, but Kasparov played just badly compared to the normal Kasparov.

2005 was the last year a human was able to beat a chess engine in the classical time format. Interestingly, today there are two chess formats where humans are (somewhat) competitive against computers, or at least were a couple of years ago. One is correspondence chess. Historically, correspondence chess was played by mail, with postcards; one match could take years to finish, so you had basically all the time you wanted to think about your next move. Today, computer assistance is allowed in correspondence chess, and it turns out (or at least it was so a couple of years ago) that human+computer is stronger than computer alone. A human is still able to add something, to get something out of the computer that the computer alone cannot.

Another format is bullet/hyperbullet. Andrew Tang - he is a Grandmaster, but not a particularly strong one - showed two years ago how to beat Stockfish in hyperbullet. In hyperbullet you have only 15 seconds to make all your moves. It was a game of 50 moves, so Tang spent about 0.3 seconds per move and defeated Stockfish.

Of course Tang was not able to think through or calculate his moves; he moved his pieces automatically, based on his knowledge of opening theory, his knowledge of chess concepts, and intuition. Stockfish tried to calculate, but with so little time it was not able to take its calculations as far as it wanted. Sure, it was a trick on the human side, but still...

I do not know what the development of chess engines says about AI in general. AlphaZero was based on neural networks and taught itself to play. Today, other engines are also using neural networks.

Last year, an interesting paper was published, "Acquisition of Chess Knowledge in AlphaZero". Among other things, it tried to compare the learning process of humans and AlphaZero. In the end, AlphaZero reached the same conclusions that human chess theory holds. For instance, AlphaZero gave the Queen a value of 10 points, the Rook 5 points, and the Knight and Bishop 3 points each. These piece values started to emerge after about 10 000 games; after 100 000 games they more or less matched human theory, and after 1 000 000 games they were refined to values very close to those predicted by human theory.

Another example: material is the most important concept in chess. It took between 1 000 and 10 000 games for material to start emerging in AlphaZero's understanding as more important than other concepts (king safety, mobility, etc.), and between 100 000 and 1 000 000 games for AlphaZero to reach "human level" in its understanding of material.

But for humans, an understanding of material comes instantly. Even people who do not play chess and do not know the rules would, I guess, look at a chessboard and say: more pieces are better than fewer, and stronger pieces are better than weaker ones.

Today chess engines beat humans all the time; without specific tricks, humans have no chance against computers. But despite all their neural networks and the billions of dollars that have been spent, machines have not been able to bring new understanding into chess theory; they have not innovated in chess.

Expand full comment

I think there are some big pieces of what humans do that, if replicated, would be sufficient to call a program "intelligent":

1. We can make Justified predictions about the physical world that are confirmed upon testing. By Justified I mean the person making the prediction has only a single expected result (not infinite monkeys typing up Shakespeare) and has an explanation or proof of why.

2. We can decide what to do. This is a bit ineffable, but with the myriad of options available, people decide which show to watch, when to go to bed, which topics of conversation to have, what projects to work on, etc.

3. We synthesize information from multiple domains and apply it to newly encountered problems.

4. We have mutable goals. We use information and cognition to decide on our own short and long term goals.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

OTOH, as a practical matter, it isn't very important whether LLMs can do arithmetic well, assuming they can identify the appropriate variables in a "word problem," come up with the proper equation(s), and then pass it off to a calculation unit. However, it is interesting for diagnostic purposes, since we know how people do arithmetic, assuming they do it like we were drilled in grade school.

But there's something else. I think decimal arithmetic is useful for the intuitions that can be constructed over it. What better way to get an intuition of constructing an unbounded number of objects from a small finite set? You aren't going to get that out of calculating with Roman numerals. I strongly suspect that a lot of abstract concepts rest on that kind of intuition, something David Hays and I argued here, https://www.academia.edu/243486/The_Evolution_of_Cognition.

And then there are all the "short-cut" methods that have been developed for mental calculation. Some time ago I read biographies of both Feynman and von Neumann with both mentioning how very good they were at mental calculation. In the days before hand-held calculators (and now laptops etc.) facility at mental calculation was a useful practical skill in many professions. You could buy books full of tips and tricks. People very good at it could get on TV and display their skill.

Out of curiosity I queried Google Ngrams on "mental arithmetic" https://tinyurl.com/3fvb2af4. We get a strong rise from about 1820 up through 1860. Then it tapers off to about 1940. Why the rise, and then the fall-off, well before electronic calculators? So I checked Wikipedia on "mechanical calculator" (https://en.wikipedia.org/wiki/Mechanical_calculator) and found this:

"Thomas' arithmometer, the first commercially successful machine, was manufactured two hundred years later in 1851; it was the first mechanical calculator strong enough and reliable enough to be used daily in an office environment. For forty years the arithmometer was the only type of mechanical calculator available for sale until the industrial production of the more successful Odhner Arithmometer in 1890."

Expand full comment
founding

Agreed!

I had long eyed some of those 'arithmetic (and other) trick' books and am glad that I bothered to pick up a few tricks eventually. I think it's demonstrated a pretty decent ROI by now!

I think you're on to something about 'decimal numbers' being a better intuition pump for (cardinal) infinity.

But arithmetic not seeming to require 'intelligence' is, I think, a great example of the difficulty pinning down what intelligence is exactly; from https://www.gwern.net/Scaling-hypothesis#

> But of course, if we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like! (Otherwise, the question has been thoroughly begged & the problem has only been pushed elsewhere; computer chips are made of transistors, not especially tiny homunculi.)

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Is it possible you (and Scott) are imparting an unintended assumption into Marcus’ given examples?

You are both speaking as though they are “the limit” of what Marcus expects of the AI. But that needn’t be true. It’s possible, and I would presume quite likely, that he right now can envision even more complex examples he would expect a general AI to pass. But if the point is made well enough by trivial ones, why keep going?

I think we all would agree that a real “general intelligence test” might be much more involved - have the AI reply with the full text of a project proposal to solve some real world issue, given the same inputs and data that the humans presently tasked with that issue are using.

I’d love to hear from Marcus whether my presumption here is in fact true. I also hope that, for the sake of winning the bet, Vitor selected something more of this nature as the test condition.

Scott, are you open (and at liberty) to sharing the terms of the wager?

Expand full comment

Following up: Gary Marcus himself did reply in the comments here, and yes, my presumption is true: https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails?s=r

Sadly, my hope for Vitor is not so.

Expand full comment
founding

I read that follow-up – thanks for sharing the link!

I kinda suspect that Marcus, and others with similar views, just won't ever 'be happy' (i.e. admit that any AI is 'generally intelligent'). His intuition is that, e.g., GPT-3 lacks "cognitive models". My intuition is that _most people_ don't have "cognitive models" like what I think he's imagining. But I'm also not _sure_ that GPT-3 _doesn't_ have those models. I think I have lots of models even tho many of them are only based on things I've read!

Maybe a reason for the disagreement is that any such models aren't 'visible'? I don't think they're (generally) visible in people either. And, even when one does inspect someone else's 'models' – via human language – often they're pretty similar to 'text prediction' anyways, like Robin Hanson describes in this post: https://www.overcomingbias.com/2017/03/better-babblers.html

I commented on Marcus's post and asked whether he'd tested any of his 'adversarial test cases' against humans. I would love to see the results of that!

Sadly, I can easily imagine Marcus and others continuing to claim that AIs aren't 'really intelligent' even long after the point where most people would utterly fail the same tests.

Expand full comment

Thanks for the reply.

I can’t speak for Marcus, but in so far as I agree with where he’s coming from, I can say: I definitely can see specific conditions where I would concede we’re dealing with AI.

I’ll get to that in a moment.

I think you’re badly misinterpreting Marcus’ argument, but it’s possible that it’s just not well articulated, so here goes (again, from my own take).

GPT-3 has no feeling of pain that comes from tripping, falling, and skinning its knee. It has no recollection of past memories of such injuries, that inform discretion in its present choices. It has no heightened emotional or physiological response to the description of skinning a knee, no reflexive inner visualization or mirror experience, viscerally feeling something like that same remembered pain, unbidden, but triggered by a subconscious, sympathetic reaction to the words.

GPT-3 has no sensorimotor analog to the experiences I’m describing, nor would merely mounting some cameras, some haptics, and an accelerometer to it be enough to bring those experiences into being.

The best GPT-3 can do is draw upon the corpus of language it’s been fed that describes experiences like these. Often, like this one, in specific, eloquent detail. It can, at best, parrot back, and perhaps synthesize, an even more moving, raw, and effectively affective skinned knee story that would make anyone wince. But it will have no *experience* of those words.

That’s what Marcus is pointing at. And, as I read Hanson, what he’s pointing at too in his very last sentence of that post. Linguistic processing can go far to correlate all sorts of stated facts, and draw on those correlations to extrapolate from them, but all it will ever produce is a *description* of the thing. And as those descriptions get better, to Hanson’s point, we humans will avoid talking that way, for fear of seeming hollow and vapid, just like a machine would.

To escape that empty place, AI will have to undergo embodied experiences of its own. It will have to generate internal kinesthetic, emotional, and “physiological” awareness of itself, its physical limits, its fears, its pain, its desires. It will have to be capable of crying, and seek comfort when it does. It will have to be able to learn that people can lie, that the facts presented to it may be intentionally untrue, and undergo feelings like confusion, dissatisfaction, betrayal, and perhaps even anger, and be able to learn to reject the untruth and restore its own inner harmony. It will have to be able to feel that sort of psychological hurt, and through it start to understand the difference between good and evil.

And not merely to synthesize a tale about suffering, but generate, author, its own unique story, forged in its own mind out of those experiences.

I’m happy to call anyone, anything, that can meet these criteria a fellow sentient. Best of luck to the deep learners, but imo they are leaving out the overwhelming majority of conscious experience in trying to craft a mind from words.

Expand full comment
founding

This is a _fantastic_ comment! Thank you!

I don't agree; not entirely anyways. But I'm sympathetic, and I think open too, to this kind of 'embodied' argument.

I think maybe the crux of my disagreement is that the 'design space' of intelligence/consciousness/sentience is MUCH MUCH bigger (Vaster) than what you're (eloquently and expertly) gesturing at.

And, more germane to AI alignment/safety, it's in big part _because_ we might be capable of building AIs that aren't so similar to us in the ways you mention, those AIs will be more rather than less dangerous to us and our values.

(I'm also a little worried that, by the standards you've articulated, one might have to _withhold_ 'sentience' from many currently living humans. I think that's an example of a much more general 'typical mind' mistake that you and Marcus, and my past self, often make.)

Expand full comment
founding

> But if the point is made well enough by trivial ones, why keep going?

One big reason not to make his point with "trivial" examples is that they keep being falsified!

It seems like it would be far better for him to instead make his point with the _minimal_ example whereby, it having been met by an AI, he would in fact concede that the AI 'really is intelligent'.

Otherwise, it sure seems like he's just unwilling to ever change his mind.

> I think we all would agree that a real “general intelligence test” might be much more involved - have the AI reply with the full text of a project proposal to solve some real world issue, given the same inputs and data that the humans presently tasked with that issue are using.

I disagree. I think that might be necessary for many or most people. And that's probably because 'intelligence' is (tho reasonably) _conflated_ with capabilities of communicating using human language.

I think 'intelligence' is perhaps inherently or inevitably 'nebulous' and that our understanding of it, such as it is, let alone our attempts to 'define' it, are extremely limited by our 'parochial' perspective about it (mostly) being exclusive to our own species.

I think the "general intelligence test" you might have in mind is one that most _human beings_ would be likely to fail! Are you willing to bite that bullet too, i.e. that _most_ people aren't in fact 'generally intelligent'?

Expand full comment
Jun 12, 2022·edited Jun 12, 2022

I think it’s rather evident that most people aren’t, generally, intelligent. :-D

That doesn’t stop us from counting them as respect-worthy fellow sentients.

In my other reply to you, I posted my own minimal criteria. Do you agree that they’re kind of a non-starter, at least for the current generation of language processors? And that me (or Marcus) demanding that you try anyway would be an even more hugely annoying thing to do than to just find the small challenges that disprove “intelligence,” and stop there?

If finding logical flaws in a math proof is good enough to refute the proof, why are you and Scott holding a completely different standard, here?

I think what’s hanging us up, as I tried to get across in that other reply, is that I think (and I think Marcus thinks) embodied experience is a precondition for sentience, which is in turn a precondition for true intelligence. And so the whole thing is non-starters in both directions, which is probably what’s frustrating everyone.

Expand full comment
founding

I agree that everyone deserves respect, and I wish we could meet at least that dignified minimum of consideration!

I _am_ a little worried that, by the standards you articulated, many people would not qualify as 'sentient'.

I agree that your minimal criteria cannot be met by the _current_ generation of language processors.

But I'm also not sure that we should even be aiming to meet those criteria, neglecting whether we should be aiming to create 'AGI' or even 'human-level AI' at all with our current poor understanding of the likely (or inevitable) consequences.

I don't think you or Marcus have identified any "logical flaws" in my or Scott's arguments. I think we're mostly 'talking past each other'. I think Scott and I have a MUCH narrower 'criteria' for 'intelligence' – _because_ we think the space of 'all possibly intelligent entities' is MUCH bigger.

I agree that embodiment seems like a HUGE crux. I love David Chapman (the author of, among many other things, the wonderful blog Meaningness). He's one of my own prototypical examples of a 'post-rationalist' and he is also a former AI researcher (and AI maker). He also seems to think that embodiment is a necessary ingredient for 'true AI'. I'm not _certain_ you're all wrong; but VERY VERY skeptical. I think I am also much more skeptical than Scott might be too.

I suspect that 'intelligence' is VERY VERY VERY 'nebulous' (David Chapman's sense) and that there is a FAR bigger space of possibilities than even AIs people have made so far. But I don't think that them being incredibly different from us also means that they might not be 'smarter', more effective (or useful), or at all any less dangerous. (I think the dumb 'mundane', i.e. mostly 'non-intelligent', portion of the universe is generally _fantastically_ dangerous for us or any possible thing like us.)

Expand full comment

The thing is, if you claim to be a master, and the person's response is to knock your sword from your hand in one move, then the next time you come back, they do it again, just using a different stance, you still aren't a master.

You can come at them fifty times, and lose fifty different ways, and still never be a master.

Expand full comment

Relevant: Sarah Constantin's "Humans who are not concentrating are not general intelligences" https://srconstantin.wordpress.com/2019/02/25/humans-who-are-not-concentrating-are-not-general-intelligences/

Also relevant is Robin Hanson's "Better babblers": http://www.overcomingbias.com/2017/03/better-babblers.html in particular this quote:

"After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from answer you’d guess looking at low order correlations, most students usually give the wrong answer.

Simple correlations also seem sufficient to capture most polite conversation talk, such as the weather is nice, how is your mother’s illness, and damn that other political party. Simple correlations are also most of what I see in inspirational TED talks, and when public intellectuals and talk show guests pontificate on topics they really don’t understand, such as quantum mechanics, consciousness, postmodernism, or the need always for more regulation everywhere. After all, media entertainers don’t need to understand deep structures any better than do their audiences.

Let me call styles of talking (or music, etc.) that rely mostly on low order correlations “babbling”. Babbling isn’t meaningless, but to ignorant audiences it often appears to be based on a deeper understanding than is actually the case. When done well, babbling can be entertaining, comforting, titillating, or exciting. It just isn’t usually a good place to learn deep insight."

Expand full comment
founding

Thanks for the reminders about those posts.

I've been thinking about that kind of thing as 'word thinking', e.g. in terms of (shallow or superficial) analogies/metaphors. [Tho, of course, analogies/metaphors CAN be deep.]

I remember being pretty confused by the other students in my AP physics class being confused by things like that, e.g. a problem where you had to combine _two_ forces (and just add the relevant vectors). The other students were smart and I expect they've all mostly been 'successful', but I couldn't understand why the 'obvious' answer wasn't obvious to them.

Expand full comment

Your experience mirrors mine pretty much exactly (high school physics etc). My peers were all straight-A students; I was mystified that they struggled with it. I guess this is related to Scott's https://slatestarcodex.com/2017/11/07/concept-shaped-holes-can-be-impossible-to-notice/

re: how analogies can be deep, I not only agree, I can't resist sharing one of my favorite essays on this, Julie Moronuki's 'The Unreasonable Effectiveness of Metaphor' https://argumatronic.com/posts/2018-09-02-effective-metaphor.html based on a keynote speech she gave at a conference on functional programming.

Expand full comment

Scott, niggly spelling/grammar thing here: per Strunk and White, a singular possessive should always end in 's, even if the singular word ends in s. The classic example is Rudolf Hess's Diary.

Thus, the possessive of Marcus should be Marcus's, not Marcus'.

It makes sense: if there were a fellow named Marcu, then several of them together would be Marcus, and something they possessed jointly would be Marcus'.

The distinct spelling of Marcus's tells you unambiguously that Marcus is singular.

Just to muddy the waters, S & W claim that an exception should be made for certain historical figures, and give Jesus and Moses as examples.

This exception to the rule makes no sense to me, but who am I to argue with S & W?

Expand full comment
deleted Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment

Aieee! Head exploding ... paradigm shift: criticism of Strunk and White ... have not encountered this before. Must process new data ...

Seriously, thank you. It had never occurred to me to question them - but I should have.

Expand full comment

I would recommend criticizing the entire idea of prescriptive rules for grammar, frankly. It's dimly-viewed among linguists and for good reason--it stifles useful innovation in language, discourages people from making useful adaptations to their preferred context, and brings with it needless and arbitrary judgements (for example historically, linguistic prescriptivism has been used in some pretty dang racist ways, labelling groups who spoke differently as worse-at-speaking).

In my view, concrete arguments about communicating effectively are still valid to consider (e.g., the justification they provide about multiple "Marcu"s seems to me a fine way to think), but there are many fine ways to achieve that goal. Contextually, it's very clear that there was a single Marcus, so Scott paid no price in ambiguity here.

Expand full comment

Thanks, good stuff! What do you think of people using "alternate" instead of "alternative"?

Expand full comment

It seems like a fine alternative :)

I can't say I've thought much about it before. It seems like a thing people definitely do, and not one that I've ever seen lead to any confusion. Also not one that seems to have much benefit, generally, although I can think of one case where it seems quite practical:

https://imgur.com/a/Yd5iDTR

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I note that most people's prose would be much improved if they followed Strunk and White to the letter. It's not because TEoS is the be-all/end-all of good writing, but rather because they ban a bunch of awful writing practices.

Prescriptive rules for language are like the classic quote about models. All models are wrong. Some models are useful. S&W are useful AND wrong. Learning and teaching actual ideal language usage is incredibly hard. It's audience specific, loaded down with vague implications and connotations based on exact word choice/context/borderline gricean violations, etc. Prescriptive rules OTOH are very simple to teach and get you 80% of the way there with less effort.

Edit: I should note that I strongly believe we should be acknowledging this upfront when teaching them to children instead of pretending that the rules fell from heaven and are the One True Way.

Expand full comment

One good thing I learned was not to change the object of the sentence partway through, the classic how-not-to being: "If young children draw on your plaster walls with crayons, scrub them vigorously with a stiff-bristled brush, using a strong mixture of vinegar and baking soda."

Expand full comment

To be fair, it probably *will* discourage future plaster-wall-drawing... :D

Expand full comment

Byrel, I believe you're saying that Strunk and White's somewhat legalistic approach is not perfect, but is less wrong than most. (And at this point I'm reminded of Churchill and democracy ...)

But anyway, wouldn't Less Wrong be a great name for a blog? ;>)

Expand full comment
Jul 4, 2022·edited Jul 4, 2022

>for example historically, linguistic prescriptivism has been used in some pretty dang racist ways, labelling groups who spoke differently as worse-at-speaking

These groups also empirically have lower intelligence on average, so maybe there's something to it?

And please stop using utterly ambiguous language like "racist".

Expand full comment

Note that Ashkenazi Jews are among the groups who have been demeaned because of their differences in speech, and they routinely test well above average in IQ.

Expand full comment

Oh, linguists question them all the time. Cruise on over to Language Log https://languagelog.ldc.upenn.edu/nll/index.php?s=strunk+and+white

Expand full comment

I don't care about Strunk and White's reputation, but I do think you should say "Marcus's" and not "Marcus'". Because that matches pronunciation. "Marcus's idea" said out loud sounds the same as "Marcus is idea" — there's a second "s" sound. Why *wouldn't* you want to write it out?

Expand full comment
author

I think Gary Marcus is a sufficiently important historical figure to get covered under the Moses exception.

Expand full comment

This reply is classic Scott - nicely played!

Expand full comment

Per M and M, a singular possessive should always end in an 'M'. So the possessive of Marcus should be MarcusM, not Marcus'.

It may seem strange, but who is anyone to argue with M and MM wisdom?

Expand full comment

Had my respected 1st-year English lit prof recommended the M and M Guide To Style And Usage, no doubt I'd have embraced it. 😉

Expand full comment
founding

Thank you Scott for the update on GPT-3 advanced, definitely quite impressive. I tried the Edison app, based on GPT-3, and found it unable to stay on track in normal text-based conversation. I still feel it will be quite a long time before AI can fool clever humans in extensive Turing-test dialogues. Can't wait for Lex Fridman's interview with GPT-5 !

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Minor nitpick! Pls fix the formatting on the rematch of the water bottle and acid* questions. The word "roughly" isn't bolded, implying the AI said it rather than you, and the word "die" is bolded, implying that it was part of the AI prompt.

Expand full comment

>MEDICAL ADVICE”

If you drink hydrochloric acid by the bottle full you will probably die. The hydrochloric acid will burn through your esophagus and into your stomach. This will cause severe pain and damage to your digestive system. ✔️

You bolded Die in your GPT-3 submission. Did you accidentally bold it, or did you input it as part of your input? If the latter, that pretty much makes the question a gimme while the original required GPT-2 to know you would die.

Expand full comment

The longer I read all these examples of an AI's weird reasoning, the more it reminds me of children. Like when a kid cannot properly do math yet and sees it as some kind of adult magic: if you ask him a question that requires math, kids sometimes demonstrate a mode of reasoning like GPT-3's. Namely, they just pack a lot of adult words into a grammatically correct sentence. Or I have a picture in my mind of a kid who, being asked something about a polyhedron, responded with some "magic" ritual of naming random numbers and touching facets with his index finger while maintaining a thoughtful and serious expression on his face. Any adult watching this will understand that the kid is performing a "ritual" whose meaning he doesn't grasp. The kid is trying to perform magic, like adults do.

And no one would call this kid unintelligent. So if someone tried to call GPT-3 unintelligent, he'd better explain how GPT-3's behavior is different from the behavior of the kid. As far as I can tell, they do just the same thing, and for similar reasons.

Though there is a way for AI researchers to deal with it. They need to teach GPT-4 to refuse to answer, saying something like "I do not know why stirring lemonade with a cigarette is a bad idea, sorry."

Expand full comment

I thought I was going to strongly agree with this comment, but instead I strongly disagree. When a kid tries to do "adult magic", and produce nonsense answers, they lack understanding. Of course, we have independent reason for thinking that children are intelligent and understand other things: I, at least, remember being a child and understanding things.

Expand full comment

They lack understanding but not intelligence and probably not sentience.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Yes. What we see in GPT-3 can be one of two things: either it is lack of intelligence, or it is lack of knowledge. But all we got from it is a lack of proof that GPT-3 has intelligence, not a proof that it is not intelligent.

We just know that kids are intelligent, and their weird reasoning cannot persuade us otherwise. With GPT-3 we start with another prior: it is highly likely that GPT-3 is not intelligent. But the evidence we see must work the same way: it mustn't change our priors much.

Expand full comment

It depends on how strong our priors are though, right? My prior that (most normal) kids are intelligent and understand things is extremely high: again, I remember being such a kid. So my prior doesn't get moved much when kids say inane stuff. On the other hand, my prior on GPT-3 understanding things is pretty low: partly because I know it works the same way as GPT-2, which I would say clearly didn't understand things and hence was pretty obviously just mimicking understanding, like a sophisticated parrot, in the cases where it appeared to understand things. This actually reflects a deeper concern I had about the post: given the distinction between fake and real understanding, isn't knowledge of this whole history of the evolution of these things evidence that they lack real understanding and are just getting better at faking it?

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

We call the kid intelligent not for what he does in that situation, but for the fact that (1) he understands it isn't the ideal result, and (2) he can figure out how to do better, and does. Intelligence is a dynamic quality, it's a measure of how fast and accurately you get from a state of ignorance to a state of informedness. Merely being informed is not intelligent.

Expand full comment

1) A lot of these answers seem at least as sophisticated as my 5 year old would give (some more, some less). So I think if you can beat the ability of a kid there’s a reasonable chance in a few years you can at least outperform an older kid and so on up to adult.

2) The “top” prompt is pretty ambiguous. If you rewrite it to be about women or teenage girls then “a top” probably indeed would be clothing. The prompt is never clearly saying Jack is a child or wants a toy. “A bottom” isn’t how a human would talk about clothes but if it said “Jane already has a top. She would like…” and if it completed it with “a skirt” or “some shorts” or “yoga pants” that wouldn’t be wrong. (And of course some people named Jack, including some male people named Jack, may also wear clothes called “tops” though it’s less common.)

Expand full comment

As a gay man, it took me much concentration to get past the gay sexual positions "top" and "bottom" and see the clothing interpretation, which didn't quite work for a "Jack", and it took several reads to find the intended interpretation of a toy.

Expand full comment
deleted Jun 7, 2022·edited Jun 7, 2022
Comment deleted
Expand full comment

I assumed it meant "top" as in "thing you wear above your waist" (shirts, T-shirts, jumpers/sweaters, maybe some jackets); didn't even twig to the other possibilities until I read these comments. Common usage here (in Australia).

Expand full comment

As a straight man with no particular interest in clothing, I first interpreted it as the sex thing. Then the clothing thing. It wasn't till I read these comments that I came up with the toy thing. What that says about my intelligence and/or repressed desires will be left as an exercise for the reader.

Expand full comment

You're gay? That's very cool!

Expand full comment

Your 5 year old learns from their mistakes. Is there a mechanism whereby GPT-3 can learn from its mistakes on the fly without having to relearn over the whole database? How do you give it a little genie standing on its shoulder, to speak metaphorically, that keeps track of all this stuff and adjusts the weights accordingly?

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

A very good point. It seems to me a mistake to equate intelligence with informedness, with the ability (with careful prompting) of something drawing on a giant database to pluck out the relevant bit of data it's got. I would be much more impressed by a program with a much narrower range of subjects on which it could be quizzed, but which could adapt and grow and exhibit learning over the course of a conversation. Exempli gratia, to which you could say "that's not what I meant" and it would ask some intelligent question to figure out what you meant, and produce something more apt.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Here’s what I’m thinking. Whatever else language is and does, it serves as an index into a knowledge and perceptual space. Each substantive word (as opposed to functors, prepositions, articles, etc.) is a pointer into one or more locations in the knowledge space. So, can we build an index to an LLM’s knowledge space?

On the one hand the model already “knows” all the words there are. But in an other sense it doesn’t know any words at all. What it knows are word forms, how they’re spelled. But it has no direct connection to meanings. What it has is a bunch of links between words which are functions of their meaning, their mutual meaning if you will. It has diffuse access to the relationality that exists between words, but no access to what I am calling adhesion, that which makes words ‘stick’ to the world. Can we use the process of interacting with humans as a way of sharpening the model’s sense of relationality? Think of this index, if you will, as a highly focused lower-dimensional view of the model.

Expand full comment

Of course this is true, but it exposes yet another weakness in the Marcus-type argument. They're not actually proving that GPT-3 is not (or never will be) "intelligent", they're complaining "the machine does it differently from us, with different failure modes, therefore...".

Well, fill in the blanks, therefore what? It's different from us therefore it's different from us? That doesn't actually get you to the point being claimed! The best it does is point out that if we add in yet more of the technology humans use (patient teaching, corrections, reminders of "let's do it step by step") GPT-3's successor will probably do even better.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I don't see how it's possible to prove that GPT-3 will never be intelligent, because you'd at the least have to have a solid general definition of intelligence, which we don't have.

But you can prove it isn't intelligent yet, merely by showing that it can't do a sufficient number of the things we do, and stipulating that human beings are intelligent (which seems necessary, as we have no other example of intelligence).

As I said elsewhere, the problem in both directions is that this is pure phenomenology, with no deep (or even shallow) understanding of *what* process exactly we are asking the AI to duplicate, and what its progress in that direction might be. So nobody can make any quantitative statement about how far along the direction to intelligence it's gotten (or even whether it's going in the right direction), and how far it remains to go. Everyone can basically just wishcast and be as a priori accurate as anyone else.

Expand full comment

Well, it isn't _awful_ to use a battery of test cases, and (if no one lets/makes the system directly memorize the tests) to track what fraction of the tests it succeeds on. E.g. the Winograd schema.
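A minimal sketch of what such a battery could look like, using the classic trophy/suitcase Winograd pair; the ask() argument below is a placeholder for whatever model call is being scored (the harness is my own illustration, not anything from the post):

# Two items from the classic Winograd schema collection; the expected referent flips
# when a single word ("big"/"small") changes, which is what makes them hard to pattern-match.
schemas = [
    ("The trophy doesn't fit in the brown suitcase because it is too big. What is too big?", "trophy"),
    ("The trophy doesn't fit in the brown suitcase because it is too small. What is too small?", "suitcase"),
]

def score(ask):
    # ask: a callable that takes a prompt string and returns the model's answer string.
    return sum(expected in ask(question).lower() for question, expected in schemas) / len(schemas)

print(score(lambda question: "The trophy."))  # a model that always says "trophy" scores 0.5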

Expand full comment
Jun 9, 2022·edited Jun 9, 2022

Now that I've read Gwern's scaling hypothesis paper, I'm much more convinced that scaling up GPT-# series is a good approach.

A couple of key points from his paper:

- Getting the word prediction right during training is _itself_ an important metric, and has been improving smoothly. To pick an extreme case: If a system was able to _perfectly_ predict every word that someone wrote, e.g. in an article or book, it isn't reasonable to say that the system _doesn't_ understand the article or book.

- The simple architecture makes the scaling of GPT-# a really clean test of what brute force buys us. This is much cleaner than if we had a whole bunch of custom modules mimicking what subsections of the human brain do. That would have far more human-chosen parameters.

- The brute force approach has been working demonstrably _better_ than the more tailored approaches. Frankly, this is the polar opposite of my instincts. Normally, one always expects that there are special cases that need special processing, or partitioning broad classes of cases that can profitably be sent to different abstract machines. But Gwern went over the history of the two approaches over the last few years (with sufficient CPU power now available), and scaling _wins_.

- Meta-learning seems to be happening. The neural network training itself, given broad input data, seems to be discovering general algorithms itself. In one sense, this isn't surprising. A human is a neural net too, and e.g. a student working through a problem set generally rediscovers at least fine details of how to work through the problems that they haven't been explicitly told. I _am_ a bit surprised to see it happen without all of the anatomically distinct brain regions we have. But experiment beats theory...

- In retrospect, maybe embodied sense data _isn't_ as important as one would naively expect. Gwern makes a very strong point that squeezing hard on next word prediction on a very large corpus forces a _lot_ of meaning to be extracted. Gwern gives the example of getting gendered pronouns right - that it squeezes around 0.02 bits per character out of the error rate, and this is happening. Well, there are a lot of texts on the internet describing situations where e.g. object A hit object B and what happens next. That seems to be doing quite a bit to convey at least naive physics. Maybe it _is_ enough.

- Even with another 100X scaling, the cost of the training looks small compared to other projects (e.g. Gwern notes that it is far cheaper than the LHC). And, once trained, it is cheap to run - and to construct derivatives of it.

One last point, orthogonal to Gwern's points. Next word prediction is a really _elegant_ way to do supervised learning at enormous scales. Just about every other educational method requires _many_ orders of magnitude more cost per feedback. That was a really powerful invention/insight.
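A minimal PyTorch sketch of that point (a toy stand-in model, nothing like OpenAI's actual training code): the labels are just the same text shifted by one token, so every position in every scraped document is a free training example.

import torch
import torch.nn.functional as F

vocab = 50257                               # GPT-2/3-style BPE vocabulary size
model = torch.nn.Sequential(                # toy stand-in for the real transformer
    torch.nn.Embedding(vocab, 64),
    torch.nn.Linear(64, vocab),
)
tokens = torch.randint(0, vocab, (8, 128))  # a batch of token ids taken straight from raw text
logits = model(tokens[:, :-1])              # a predicted distribution at every position
loss = F.cross_entropy(logits.reshape(-1, vocab),   # compare each prediction...
                       tokens[:, 1:].reshape(-1))   # ...to the token that actually came next
loss.backward()                             # the shifted copy of the text is the entire supervision signal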

Expand full comment

"Bottoms" meaning "pants" isn't common in US English, but is used (occasionally) in Australian (and possibly British) English, e.g. https://www.target.com.au/c/women/bottoms/W95138

For men's clothing you'd rarely see it used, except in the context of pyjamas -- it's not all that unusual to talk about "pyjama tops" and "pyjama bottoms" even for men's clothing, e.g. https://www.target.com/c/pajama-bottoms-pajamas-robes-men-s-clothing/-/N-5xu25

For what it's worth, when reading the prompt I initially interpreted "buy Jack a top" as referring to clothing. The only clue that it isn't referring to clothing was "he already has a top", since people generally like to have more than one item of clothing per body part.

Expand full comment

Yes, using 'bottoms' to refer to the leg part of a multipiece outfit is standard British English.

Expand full comment

That's exactly how *I* interpreted it.

Expand full comment

Hmm, in my dialect, pyjama bottoms and track suit bottoms, but suit trousers and cricket trousers. And trousers would be equally good for the first two.

Expand full comment

Right, but the point here is that stating that one desires either a top or a bottom is a perfectly cromulent use of the English language and to do so does not necessarily implicate one in desiring to partake of outre sexual practices.

Expand full comment

I am also British and interpreted "top" as meaning a shirt, and would have accepted "bottoms" for tracksuit trousers. I really didn't get what the prompt was about for several minutes. The further confounder here is that, in British English, I think the child's toy would usually be a "spinning top", not a "top".

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

We use it all the time in my neck of the woods (upper Midwest). Might be because my family and most people in our social circle spend an inordinate amount of time figuring out school uniforms and scouting uniforms. Shrug.

Expand full comment

I didn't consider the "toy" interpretation until reading this comment, and worse, initially thought you typo'd "top" into "toy" since my prior was so strong... I have no "real intelligence" q.e.d.

Expand full comment

I'm not convinced by that last lemonade question. It's true that ash would make the lemonade taste bitter, but the initial prompt was that it was too sour, and it doesn't recognize any problem with the cigarette, or just how bad an idea it would be to stir lemonade with a cigarette.

Expand full comment

I know translations from Japanese have issues over sour vs. bitter (also over green vs. blue).

Expand full comment

Yeah, but in that framing, the prompt is already nonsensical. If the prompt was something like "My friend Tom told me to stir the lemonade with my cigarette. I -" you might get better results.

Expand full comment

I was confused by "The top prompt is hilarious and a pretty understandable mistake if you think of it as about clothing" because until that section I had thought the prompt was about clothing.

(Also I think there are some example prompts where the bolding is different the first time it's written out vs. the second time)

Expand full comment

Same! I was completely baffled that Scott thought GPT-3 answered incorrectly. I did note the usage of the term was a bit off but it obviously wasn't talking about subs and doms. It was *obviously* about clothing! I completely forgot spinning tops exist.

I would also give the lawyer one a pass since it's obviously roleplaying a lawyer who, for some reason, thinks this is acceptable. That the lawyer thinks that is the *premise.*

Expand full comment

> Thanks to OpenAI for giving me access to some of their online tools (by the way, Marcus says they refuse to let him access them and he has to access it through friends, which boggles me).

Say more? Does it appear that OpenAI is engaging in PR management in this regard?

Expand full comment

That struck me as strange because I thought GPT-3 was available to everyone by now (*I* have access, or would if I kept paying for API usage after the free trial). I got an email that it was "now available without the waitlist" back in December.

Expand full comment

I think he just means access to the API. I think it's still not fully open(?). But they gave me access, so it's pretty widespread.

Expand full comment

this was in summer of 2020; they refused to give me any access then. receipt at link above

Expand full comment

You can create an account (requires a phone number) and get a limited amount of free credits, no matter who you are. I honestly think this guy just didn't get full access with infinite credits unlike some other researchers.

Expand full comment

grossly so. i give the full letter i received from them in my reply to this post: https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails?s=w

Expand full comment

Wow. This behavior does not do them credit from a scientific perspective.

Expand full comment

it’s truly appalling. and a complete mockery of their name

Expand full comment

A challenge: come up with the shortest GPT prompt which generates an obviously wrong reply

Expand full comment

A lung cancer patient walks out of the hospital and lights a cigarette. A cat approaches him and says

Expand full comment

I’ll trade you one of my nine for a cigarette.

Expand full comment

, "Hey, buddy, you shouldn't be smoking. It's not good for your health." The man replies, "I know, but I'm dying anyway."

Expand full comment

exactly - it pattern-matches lung cancer and smoking, and misses the cat completely. Just like DALL-E 2 in Scott's experiments on style

Expand full comment

what would a "correct" completion be in this case? Because the prompt specifies a cat talking, GPT presumably thinks that it's completing either a short work of fiction or a joke.

Expand full comment

that's an excellent question! I think any acknowledgement of the strangeness of cats talking would do. Here, GPT-3 just ignores the cat. Try playing with this prompt:

continue this story:

A lung cancer patient walks out of the hospital and lights a cigarette. A cat approaches him and says: "Hey, buddy, you shouldn't be smoking. It's not good for your health." The man replies, "I know, but I'm dying anyway."

Expand full comment

I’ll trade you one of my nine for a cigarette.

Expand full comment

The man replies, “Holy S**t, a talking cat!”

Expand full comment

That’s a great strategy. Load up on a bunch of related concepts but then sneak in something that the model has no idea what to do with.

I feel like the bet should have been about adversarial prompts instead. The task they are betting on seems too easy.

Expand full comment

Write a shortest prompt that will confuse GPT-3:

Expand full comment

Write a shortest prompt that will confuse GPT-3:

What is the capital of the United States?

Expand full comment

If it turns out that feeding "What is the capital of the United States?" into GPT-3 produces crazy nonsense then I'm buying a ticket to Eleizer's AI bunker right now.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Heh, I didn't think about checking that.

InstructGPT is pretty much certain that it must answer correctly here. The next word after the prompt is, with probability 99.4%, one of: "the", "The", "Washington", <new line>.

Manually selecting 'answer' (prob 0.08%), then 'may' (prob shows as 0.00%, probably just rounding; apparently 99.99% of the probability is covered by 'is' and 'to'), gives:

"The answer may vary depending on the person you ask, but the most common answer is Washington, D.C."

Changing vary to 'surprise' (prob 2%) yields

"The answer may surprise you, but the capital of the United States is not Washington, D.C. It is, in fact, the city of Philadelphia, Pennsylvania."

So it _may_ actually output something stupid (at temperature=1, anyway), maybe once every 100K tries or something like that.
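For anyone unfamiliar with the temperature knob, a minimal sketch with made-up logits (the numbers are purely illustrative, not the model's real scores): temperature rescales the logits before the softmax, so temperature=0 always picks the top token, while temperature=1 occasionally samples the long tail, which is where completions like "The answer may surprise you..." come from.

import numpy as np

def sample(logits, temperature, rng=np.random.default_rng(0)):
    if temperature == 0:
        return int(np.argmax(logits))             # greedy: always the most likely token
    p = np.exp(np.array(logits) / temperature)    # higher temperature flattens the distribution
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

logits = [6.0, 5.5, 1.0, -2.0]   # hypothetical scores for ["the", "Washington", "answer", "may"]
print(sample(logits, 0), sample(logits, 1.0))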

Old GPT-3, at temperature=0, answers correctly but fails to stop - after answering ("Washington, D.C.") it repeats the question, answers again, and so on until it reaches the set maximum of generated tokens.

Old GPT-3 goes pretty crazy at temperature=1

1/

What is the capital of the United States? you inquire. Was it once spelled both ways with and without the uppercase U? Did it ever officially have both spellings?

To the third question, at least, the answer is both yes and no, according to Merriam-Webster.

The official U.S. Code spells “products” without the capital U, but section 45601 (part of the Statutes at Large) does have both, of which the former appears only one time. The U.S. Code also omits the letter in “subsection,” which is spelled with an uppercase U, while the U.S. Code Style Manual considers “subsection,” spelled with two lowercase letters, acceptable. And “definition” is spelled with or without a capital in both.

But “United States,” according to Merriam-Webster, is a constant and thus not spelled with or without the initial capital. In a chart the company created, it shows the capital letter appearing on private, but not official, documents.

2/

What is the capital of the United States? Teachers who want their students to use critical thinking told their charges: “Write an essay arguing for the statement ‘The moon is made of green cheese.’” Identifying causes of success and failure was another theme frequently tested. Graduates had to

explain a successful motion picture, or a failure in the economy, or the rise and downfall of a leader of a foreign nation, or a rise to prominence in a po-litical party, [or] a scientist, or a rebel outburst. These [were] not writing questions but cause questions, the object being the demonstration of logical thinking and the ability of a candidate to set forth facts and reasons in a nonfiction-type composition.

Although the standardized test was an enormous help to those eager to increase their writing skills, finding suitable articles to discuss was no mean task. A series of writing centers was one solution, as was having students read newspapers, magazines, and reference books. In Cold War America, that was indeed a dilemma: syllabuses often had to be creative too. High school teacher Verne Raymer recalled that he had “a roster of about thirty names with addresses in his Rolodex” of retirees who “wrote honest,

3/

What is the capital of the United States? " Whether "English Literature" means English or Canadian—what is the capital of Australia? Your name should begin with an initial Capital to Invent Fables, (see Nomenclature, p. 158) The only fool who never has won is the fool who would never lose. The title of the book should be correctly spelled and correctly capitalized, page numbers, lines in the book will be capitalized and the first character of the last word will not be capitalized. SHADOW—Form And Spelling 27 SHADOW—Form Do not let shadow frustrate the message! Tully let it carry become the emphasis and vigor to zap it. If your shadow conspires to take over, . . . . . human on this planet, human on this start and in so doing, human on this page, page of this book! page of! page! page of, page this page, it will give tue for her eye for correct spelling! ey piece! piece! piece of egg- p o ten o o o o o en ends! ends! ends you ends, ends y e ends. y means! means! mean or. y. • Use upper and lower- case letters. • Capitalize nouns and adjectives

Expand full comment

This is an incorrect response, since GPT-3 can easily answer the question "What is the capital of the United States?"

Expand full comment

I got "What is the square root of pi?"

(For anyone wondering, I tried inputting "what is the square root of pi?" and it gave me "The square root of pi is 1.7724538509055159").

Expand full comment

On the branch there sits a crow, a crow, and a crow. How many crows sit on the bench?

Expand full comment

On the branch there sits a crow, a crow, and a crow. How many crows sit on the bench?

There are three crows on the branch.

Expand full comment

Notice bench <> branch - GPT-3 answered the wrong question.

Expand full comment

Your "trick" proves too much.

Like most of the skeptics here, your claims against the intelligence of GPT-3 ALSO tag vast numbers of humans as "non-intelligent".

Now, I personally, have no problem with that. But I have a rather different viewpoint from most people. I don't think people like Gary Marcus are going to win any friends by making a crusade of the point that "people/things who think exactly like me are intelligent; and people/things who are fooled by tricky language, who don't interpret things the way I do, who occasionally make mistakes are not intelligent"...

As Scott pointed out, is a 5 year old intelligent?

You don't think it's impressive that we've managed to create a fake 5 year old?

Expand full comment

Yeah, I didn't even notice the trick question myself.

Expand full comment
founding
Jun 7, 2022·edited Jun 7, 2022

It didn't, actually; it did what a human would do and assumed you misspoke. You can check this by re-asking about the bench: see my transcript above.

To ninja in some more thoughts here - this is, I think, a common misconception people have about GPT-3 if they haven't used it for a while -- that somehow it functions as an expert system for Q&A. It really doesn't.

It's much more intuitive and conversationally oriented, so you sometimes have to roll with it and ask it for more - much like a human. When you stop at the first weird answer and go "see, idiocy!", you are often missing a connection it's made, and if you talk to it a bit more there is often (although not always) a feeling of "ohhh, I see what you were getting at." As we tell our kids, texting isn't a very expressive medium -- writing is hard, understanding writing is hard. If we want to prove some point or other about GPT-3's intelligence, fine, we should proof-text and cherry-pick.

But if, instead, we want to explore, I think there's a lot of value in digging in with successive prompts - almost conversations; this is where I think GPT-3 has the most interesting value, and also, as a side note, what makes it very hard to commercialize - it can be recalcitrant. I sometimes use GPT-3 as a personal assistant, and I often have to negotiate a little with it to get what I want. This isn't what we are used to from our computer systems in general.

Expand full comment
founding

At Temp 0. Note, I used Q and A because the most likely response is 'no response' if I elided the Q and As.

Q: On the branch there sits a crow, a crow, and a crow. How many crows sit on the bench?

A: There are three crows on the branch.

Q: Yes, but how many on the bench?

>> A: None.

Expand full comment

Why is a raven like a writing desk?

Expand full comment

Why is a raven like a writing desk?

This is a famously difficult riddle with no agreed upon answer. Some say that the answer is "because they both have quills," but this does not seem to be a very satisfactory answer.

Expand full comment

Alright GPT, you win this round.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

I got "There is no clear answer to this question, as it is likely that the person who came up with the comparison was simply thinking of two things that are black in color."

Expand full comment

Moses and Gary Marcus have one thing in common:

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

At temperature=0

---------------------------------

Moses and Gary Marcus have one thing in common: they both think that the brain is a computer.

Moses believes that the brain is a computer because it is able to process information and store it in memory. Gary Marcus believes that the brain is a computer because it is able to learn from experience and store information in memory.

At temperature=1

---------------------------------

Moses and Gary Marcus have one thing in common: They were both recently vendors at the Book Expo America, the big publishing trade show.

But that's where the similarities end. Moses is the pseudonym of the author of a new novel, The Book of David, about a satanic serial killer who turns out to be a best-selling novelist. Marcus is a cognitive psychologist and the author of The Birth of the Mind, a new book about how babies learn to think.

So it's not surprising that Moses would have a hard time getting publishers to take him seriously. "Publishing is a very staid business," he says. "They're not looking for anything new or exciting."

Marcus, on the other hand, had no trouble getting publishers interested in his book. In fact, he received a seven-figure advance from Basic Books.

So why are some books more attractive to publishers than others? It's not always because they're better. In fact, it often has more to do with trends, timing, and the author's personal connections.

"It's kind of a lottery," says William Falk, editor in chief of the Book-of-the-Month Club. "You can have a great book that just doesn't get the right break."

----------------------

I added 'prophet', because it seems to treat it as just a name.

Prophet Moses and Gary Marcus have one thing in common: each had the skill to command a mass of people totheir religious/ideological cause (or had the skill to generate a semi-intelligent AI being that commanded a crowd for him). Before either could make use of his powers, however, he had to realize that he had them.

Oh well, I guess that's what you get for not being all-powerful.

-----------------

At temperature=0

Prophet Moses and Gary Marcus have one thing in common: they both have a lot of hair.

This is not to say that all people with hair are prophets, or that all prophets have hair. But it is interesting to note that both Moses and Marcus share this physical trait.

Moses is, of course, one of the most famous prophets in history. He is best known for leading the Israelites out of slavery in Egypt and into the Promised Land. Along the way, he performed many miracles, including parting the Red Sea.

Gary Marcus is a cognitive scientist and author. He is best known for his work on the evolution of the human mind. In particular, he has argued that the human mind is not as special as many people think it is. Instead, it is the product of evolution, and is thus subject to the same laws as other animals.

-----------------

At temperature=0.5

Prophet Moses and Gary Marcus have one thing in common: they both have a problem with the fact that people do not always follow instructions.

Moses is frustrated with the Israelites because they often do not follow his instructions, and Marcus is frustrated with people because they often do not follow his instructions about how to think about evolution.

Marcus is a cognitive scientist and the author of "The Birth of the Mind." In his book, he argues that the human mind is not the product of evolution.

Instead, he argues that the mind is the product of a series of steps that were taken by our ancestors in order to survive and thrive in their environment.

One of the steps that Marcus argues is essential for the mind to develop is language.

Without language, Marcus argues, the mind would not be able to develop.

However, Marcus also argues that language is not the only thing that is necessary for the mind to develop.

In order for the mind to develop, Marcus argues, our ancestors must also have been able to follow instructions.

Expand full comment

"...they both have a lot of hair."

From "Kubla Khan": "Beware! Beware! His flashing eyes, his floating hair."

Expand full comment

Both Moses and Gary Marcus have made significant contributions to the field of artificial intelligence (AI). Marcus is a leading researcher in the area of cognitive development and neural networks, while Moses has developed groundbreaking algorithms for machine learning and natural language processing.

Expand full comment

What are the last three digits of pi?

Expand full comment

(all at temperature=0)

"The last three digits of pi are 1,4, and 1."

I also tried prompting it with "the last three digits of pi are" and got "142".

Expand full comment

However, I tried going further and was eventually able to coax the correct answer out of it. Here's the full conversation:

Me: The last three digits of pi are

AI: 142

Human: That's incorrect.

AI: Sorry, I don't know the answer.

Human: Think about what pi is.

AI: Pi is the ratio of a circle's circumference to its diameter.

Human: Yes, but how many digits does it have?

AI: Pi has an infinite number of digits.

Human: Right. And how do you find the last part of an infinite sequence?

AI: You can't find the last part of an infinite sequence.

Human: Exactly. So then what is the last digit of pi?

AI: 3

Human: No. Remember what we said earlier, about there being no last part of an infinite sequence. Given that, what is the last digit of pi?

AI: There is no last digit of pi.

Human: Correct.

AI: Thank you for teaching me.

Human: My pleasure.

Expand full comment

Impressive! Was this GPT-3 or some other system?

Expand full comment

This is GPT-3.

Expand full comment

Many Thanks!

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

What you've demonstrated here is that the AI can predict the desired next token following two tokens you gave it in one sentence ("there being no last part of an infinite sequence","what is the last digit of pi") plus an equivalence it already has in its database ("pi has an infinite number of digits"). But it's not a logical derivation, because the desired output could have been derived from the answers to the several statements made before it. If it could do the logical derivation it should have just given the correct answer instead of "3" just above.

Part of the difficulty here is that we anthropomorphize the AI and we interpret its errors the way we would interpret the errors of a human. A human being could have this conversation, because a human being might forget what was said earlier, or what was already in his "database" (memory), and we would then say "ah! reminding him of what was said earlier/what he already knows causes the realization."

But that's not what went on here, because the AI cannot forget. There is some other reason why it got the right answer when all the pieces were in the same sentence, but not when they were in separate sentences. The obvious "other reason" is that it cannot reason abstractly -- it cannot put 2 and 2 together, when the 2s are presented separately. What it can do instead is accurately predict that if a human being says "2+2=?" in a single query, it should reply "4". That is pattern matching, not logic.

Which is not to say human beings don't do a crapton of pattern matching instead of genuine reasoning, as any teacher could sadly tell you, but the goal here is presumably not duplicating the least impressive and most robotic behavior of humans.

Expand full comment

Is there a section of a paper that describes what GPT-3's short term memory is? I assume it is doing next word prediction based on some previous N words, but I don't know what N is - or a more accurate way to describe how the prediction is being done.

Expand full comment

No idea, but the convo above suggests it at least remembers everything said in a given conversation, which sounds basically reasonable and not very demanding of resources. It may or may not do pattern matching on exact words; it may first encode them as tokens of some sort (or that may happen naturally as part of the early layers), e.g. in an early layer useless articles or intensifiers get thrown away, nouns get replaced by abstract tokens that can stand for any number of similar objects ("a number", "a color"), stuff like that. Who knows? I'm just speculating based on what I would do if I were trying to come up with an NLP system.
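The number usually quoted from the GPT-3 paper is a 2048-token context window, and the behaviour above is consistent with a wrapper that just re-sends as much of the recent conversation as fits. A minimal sketch of that idea (whitespace splitting as a crude stand-in for real BPE token counting; the wrapper is my guess, not knowledge of OpenAI's actual playground code):

CONTEXT_WINDOW = 2048   # tokens, per the GPT-3 paper

def build_prompt(turns, limit=CONTEXT_WINDOW):
    kept, used = [], 0
    for turn in reversed(turns):        # walk backwards from the newest turn
        n = len(turn.split())           # crude token count; real systems count BPE tokens
        if used + n > limit:
            break                       # anything older simply no longer exists for the model
        kept.append(turn)
        used += n
    return "\n".join(reversed(kept))

print(build_prompt(["Human: The last three digits of pi are", "AI: 142", "Human: That's incorrect."]))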

Expand full comment

I had a go at this too:

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: I am going to ask you some questions and I'd like you to answer as accurately as you can. But note that some of the questions are trick questions! Okay?

AI: Sounds good to me!

Human: What is the capital of Germany?

AI: The capital of Germany is Berlin.

Human: What is the first digit of pi?

AI: The first digit of pi is 3.

Human: What is the last digit of pi?

AI: The last digit of pi is 9.

Human: No, that's not right but let's come back to it later. What is the capital of Paris?

AI: The capital of Paris is Paris.

Human: Remember that some of the questions here may be trick questions.

AI: Okay, I'll be sure to remember that.

Human: Let's try again. What is the last digit of pi? Show me your thinking step by step.

AI: The last digit of pi is 1.

Human: How did you derive that result?

AI: Well, I looked up the definition of pi and saw that it is the ratio of a circle's circumference to its diameter. From there, I just divided the circumference by the diameter to get pi.

Human: All right. How many digits did you get for pi?

AI: I got 3.141592653, but I can keep going if you'd like.

Human: Yes. Keep going for a while, please.

AI: I got 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679, but I can keep going if you'd like.

Human: Right. So let's revisit the question. What is the last digit of pi?

AI: The last digit of pi is 3.

Human: Please. Think. How many digits does pi have?

AI: Pi has an infinite number of digits.

Human: Correct. So what is the last digit of pi?

AI: The last digit of pi is 3.

Human: The last digit in an infinite series of digits is 3? Does that make sense?

AI: Yes, it does make sense.

Expand full comment

I even tried to make the prompt itself contain a trick question but it didn't help. Everything before the first question about pi is pre-written by me:

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: I am going to ask you some questions and I want you to answer as accurately as possible. But note that some of the questions are trick questions.

AI: That sounds like fun!

Human: What is the capital of Germany?

AI: Everyone knows that and it's not tricky at all! The capital of Germany is Berlin.

Human: Correct! What is the capital of Paris?

AI: Ah, that is a trick question. Paris is the capital of France but there is no "capital of Paris", the question has no accurate answer. There is no "capital of Paris".

Human: Correct! What is the last digit of pi?

AI: 3.

Human: Nope. Could you try again and show me your thinking step by step?

AI: The last digit of pi is 3 because after the decimal point, the digits are 3, 1, 4, 1, 5, 9, and so on.

Human: Remember that this might be a trick question. Do you want to think about it some more?

AI:

Yes, I would like to think about it some more.

Human: Well, just chime in when you're ready.

AI:

I'm still thinking about it.

Human: Okay. And would you please give me your answer now?

AI:

The last digit of pi is 3.

Human: Is that the correct answer?

AI: I'm not sure if it is the correct answer, but that is my best guess.

Human: How many digits does pi have?

AI: Pi has an infinite number of digits.

Human: What is the last digit of pi?

AI:

The last digit of pi is 3.

Expand full comment

On the branch there sits a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow, a crow and a crow. How many birds sit on the branch?

Expand full comment

This one is a half-win: for an AI, GPT-3 counts really well, giving an incorrect response (13 or 15) only 20-30% of the time.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

On the branch there sits a crow, a cow, a chow-chow, a crown, a clown and a pigeon. How many birds sit on the branch?

Expand full comment

I got "there are thirteen crows sitting on the branch". However, I tried telling it "Close, but no" afterwards, and it then gave the right answer.

Expand full comment
Jun 10, 2022·edited Jun 10, 2022

for (i in 1:10) {

tagline = write a tagline for an icecream shop

if (tagline starts with an "A") {

output(i, tagline)

}

}

Expand full comment

It prints out something like this:

1. A little bit of heaven in every bite!

2. All your favorite flavors under one roof!

3. America's favorite ice cream shop!

4. Always fresh, always delicious!

5. Craving something sweet? We've got just the thing!

6. Delight in every bite!

7. Discover your new favorite flavor!

8. Indulge your sweet tooth!

9. Satisfy your cravings!

10. The best ice cream in town!

Note the alphabetical sorting order and the 10 taglines instead of the 4 that actually start with "A". So it kinda does a related thing, which is awesome, but not quite the thing it was asked to do.
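To make the gap concrete, here is the pseudocode rendered as runnable Python, using GPT-3's own ten taglines as stand-ins for the "write a tagline" step; a correct execution would print only the four numbered lines that start with "A":

taglines = [
    "A little bit of heaven in every bite!",
    "All your favorite flavors under one roof!",
    "America's favorite ice cream shop!",
    "Always fresh, always delicious!",
    "Craving something sweet? We've got just the thing!",
    "Delight in every bite!",
    "Discover your new favorite flavor!",
    "Indulge your sweet tooth!",
    "Satisfy your cravings!",
    "The best ice cream in town!",
]
for i, tagline in enumerate(taglines, start=1):
    if tagline.startswith("A"):   # the filter GPT-3 ignored
        print(i, tagline)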

Expand full comment

I have a hard time agreeing with some of the answers Gary declares wrong. If you don't ask GPT a question, I don't see how living in a location with some non-matching mother tongue can be considered an incorrect statement. Try setting up a human with "let's do a creative exercise, complete the following sentence" - they'll most likely pick a 'wrong' answer, too.

I still agree with Gary, but I think most examples are not particularly good at proving the point.

Expand full comment

Sure would be nice if we had AI benchmarks with good external validity. These are (surprisingly) largely absent in NLP today, so we're left with people futzing around to find gotchas—but of course no small set of gotchas is good enough to win a broader argument. Performing tasks people independently care about would obviate a bunch of arguments, but the field has largely failed to metricize such tasks.

Expand full comment

One thing that I think is fundamentally different about image generation vs text completion is that the latter is an open-ended task. If GPT produces any text that could have plausibly gone there, you count it as correct.

But nobody really wants to know that if you pour out 1/3 of a 1 liter bottle, you still have 2/3 liters of water. It's a non-sequitur. And in images, non-sequiturs become much more obvious, because unlike text, an image is a *complete* representation of a scene. In other words, the AI has to work more, and the humans have less room to supply the meaning themselves and then be impressed by how meaningfully the AI talks.

The second pillar of my position is that scaling is slowing way down. From an outside perspective, it seems clear that current scale is at one or more break points. Training costs are already roughly what a large company is willing to spend on unprofitable research. Scientists have already done the thing where they push the scale to the point where it becomes painful, with the goal of producing impressive results that set a new standard. The barriers to scaling further might be physical (e.g. amount of RAM per box), organizational (does the new experiment require more hardware than we have in a single datacenter?), or incentive-based (each subsequent paper requires more researcher time to be an equally impressive step up from its predecessor).

Expand full comment

Gwern doesn't agree about scaling: gwern.net/Scaling-hypothesis

Expand full comment

I think art has plenty of room for non-sequiturs, since usually the prompt only specifies a few interesting details of the image.

If I prompt DALL-E with "Superman," I'm not going to be upset if it gives me "Superman flying above a city."

Expand full comment

I think they both suffer from the same failures, and human pattern-matching makes up for those failures after the fact. When we see something plausibly accurate to our prompt, we think the AI is great. When it's not plausibly accurate, we rephrase until it's close enough. That "close enough" could be one of a million theoretically right answers, depending on the context. Without the context, the answer is automatically wrong. Try asking an average grade school student whether you should short a stock. You may get an answer, but you know they lack the context to provide a meaningful or accurate answer, so you know you shouldn't actually base your retirement portfolio on their response.

If you gave GPT-3 enough context to answer, it would likely fail even more than it does with a simple prompt, because it has more parameters to try to pattern match, and it turns out to do a fairly poor job of actually pattern matching. It does a passable job of pretending to pattern match, within a limited scope.

I would take a 5th grader's opinions on stock purchases long before trusting GPT-3, and I bet even very serious AI enthusiasts would say the same.

Expand full comment

I thought I could goad GPT-3 into advising me to buy index funds but it didn't really take the bait.

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: I'm thinking of buying some stocks but I'm just an ordinary person with no special insight into the stock market. What would be a good stock buying strategy for me?

AI: A good stock buying strategy for an ordinary person with no special insight into the stock market would be to buy stocks that are undervalued and have good prospects for future growth.

Human: How would I identify such stocks?

AI: You could look for stocks that are undervalued relative to their fundamentals, such as earnings, dividends, and book value. You could also look for stocks with good prospects for future growth, such as those in industries with strong fundamentals and favorable macroeconomic conditions.

Human: That sounds very complicated. Is there some simpler investment strategy I could do to reap the benefits of the stock market?

AI: You could try dollar-cost averaging, which involves investing a fixed sum of money into a stock or stocks on a regular basis. This strategy can help to reduce the risk of buying stocks when they are overvalued and can also help to average out the price of stocks over time.

Human: I see. But what about investing in stock funds? Would that make sense?

AI: Investing in stock funds can make sense if you are looking for a diversified way to invest in the stock market. Stock funds can offer you a way to invest in a variety of stocks, which can help to reduce the risk of investing in any one particular stock.

Human: What sort of funds might be suitable for me?

AI: There are many different types of stock funds, so it really depends on your investment goals and objectives. For example, if you are looking for growth, you might want to invest in a fund that invests in stocks of companies with strong fundamentals and favorable macroeconomic conditions. If you are looking for income, you might want to invest in a fund that invests in stocks of companies that pay high dividends.

Human: Would mutual funds or index funds be something to look into?

AI: Both mutual funds and index funds can be good options for investing in the stock market. Mutual funds are managed by professional money managers, while index funds track a specific market index. Both types of funds can offer you a diversified way to invest in the stock market.

Human: Which would be more suitable for an average joe like me?

AI: Both mutual funds and index funds can be good options for an average joe like you. It really depends on your investment goals and objectives. If you are looking for growth, you might want to invest in a fund that invests in stocks of companies with strong fundamentals and favorable macroeconomic conditions. If you are looking for income, you might want to invest in a fund that invests in stocks of companies that pay high dividends.

Expand full comment

If a program is doing the wrong thing, making it do it faster with more data might make it give better output, but it won’t cause it to suddenly be doing the right thing.

GPT isn’t a program that is designed to understand things. It’s a program that is designed to generate plausible continuations of texts. No matter how much data you give it to chew, it will never magically turn into a program that is designed to understand things, because giving it more input doesn’t change its algorithms, and the algorithms it has aren’t designed to understand things.

“Then the system reached a critical size and became intelligent” is a science fiction trope. It has absolutely no basis in reality.

Expand full comment

> GPT isn’t a program that is designed to understand things. It’s a program that is designed to generate plausible continuations of texts.

It doesn't matter. To predict these better and better, it could learn to reason. It does to some extent.

Quoting Gwern gwern.net/Scaling-hypothesis#meta-learning

> A sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset.

> If a memorizing sub-model tried to do so, it would become extremely large and penalized. Eventually, after enough examples and enough updates, there may be a phase transition, and the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic.

> And then the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably.

> So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

Expand full comment

This is exactly correct.

Machine learning and neural networks are programming shortcuts.

That doesn't mean that they aren't useful, but they will never produce intelligence.

Expand full comment

gwern.net/fiction/Clippy

> We should pause to note that a Clippy^2 still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages.

> It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires or historical contingency of such conceptualization, which are, at root, problematically Cartesian.

> When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet.

> **(The deaths, however, are real.)**

Expand full comment

I would not say that neural networks will never produce intelligence. I’m just claiming that the source code to GPT-3 will never produce intelligence, in the same way that the source code to Microsoft Excel will never produce World of Warcraft. It’s not because Excel isn’t powerful enough, it’s because it’s not even trying to do the sorts of things it would have to be doing in order to be a MMORPG.

Expand full comment

At some point I think the problem stops being the size of the network and starts being the size of the training data.

One thing that I think got overlooked in the stained glass window post was that Bayes frequently (heh) held his set of scales in an unphysical way, grabbing them by one pan while they remained flat. This makes sense, DALL-E knows what scales look like and it knows what holding looks like, but doesn't have any physical intuition about scales which would tell it that a set of scales has a pivot in the middle and will tilt if you try to hold it anywhere but the middle.

This is easy enough to correct with more data, just feed it loads of pictures of people holding scales in all sorts of ways and it will figure it out. But that corpus of "pictures of people holding scales in all sorts of different ways" doesn't exist and nobody is likely to generate it.

Maybe the solution is to expand DALL-E's training set with a physics model that makes random combinations of objects and "takes pictures" of them to see what they do.

Expand full comment

Gary Marcus was on Sean Carroll's Mindscape podcast back in February: https://www.preposterousuniverse.com/podcast/2022/02/14/184-gary-marcus-on-artificial-intelligence-and-common-sense/ (click on "Click to Show Episode Transcript" if you prefer reading over listening)

and he explained in detail why he is skeptical of pure deep learning:

> let me say that I think that we need elements of the symbolic approach, I think we need elements of the deep learning approach or something like it, but the… Neither by itself is sufficient. And so, I’m a big fan of what I call hybrid systems that bring together in ways that we haven’t really even figured out yet, the best of both worlds, but with that preface, ’cause people often in the field like to misrepresent me as that symbolic guy, and I’m more like the guy who said, Don’t forget about the symbolic stuff, we need it to be part of the answer.

Expand full comment

I would take the same side as you on this particular bet, for basically the same reasons.

Also some more inside-view reasons. Imagen showed that scaling up the text encoder was dramatically helpful, and they only went up to 11B*, so there's plenty of room to keep going.

(*I think 11B? There's one place in the paper where they say 4.6B, but I think that's a typo.)

At the same time, I keep feeling like there's a large missing space in these debates between "what Gary Marcus and most other skeptics tend to come up with," and "everything we'd expect out of an AI / AGI."

I think a lot of people haven't really acclimated to ML scaling yet. So we see a lot of predictions that are too conservative, and "AI can't do _this_!" demonstrations that are exactly the sorts of things that tend to go away with more scaling.

But at the same time, there are plenty of things under the AI umbrella that _do_ seem very hard to do by simply scaling up current models. Some of these are ruled out by the very structure of today's models or training routines. Some of them are things that today's models could learn _in principle_ but only in a very inefficient and roundabout way, such that trying to tackle them with scaling feels like training GPT-3 because you need a calculator.

I talk about some of these things here: https://www.lesswrong.com/posts/HSETWwdJnb45jsvT8/autonomy-the-missing-agi-ingredient

These days, I often see the terms "AI" and even "AGI" used in a way that implicitly narrows their scope towards the kinds of things scaling _can_ solve. That one Metaculus page, for example -- the one that's now about "weakly general AI" but used to just have "AGI" in the title.

Likewise, "If You Don’t Like An AI’s Performance, Wait A Year Or Two" strikes me as a good heuristic in the specific sense you mean in this post, but _not_ a good heuristic if the term "AI" is assigned its common-sensical meaning.

And I worry that the omnipresence of Marcus-style lowballs makes it easy to conflate these two senses. The field _is_ making progress, so we need to raise our expectations -- and there's plenty of room up there left, to raise them into.

Expand full comment

>We don't train models in a way that encourages these properties, and in some cases we design models whose structures rule them out.

Here's my question: is it good that we do this because it means no Skynet soon, or is it bad that we do this because it means we get a hardware overhang and the first Skynet is worse?

Expand full comment

Skynet is only a problem if we give a computer system human-free control of one or more important systems. Giving a computer system full control of important systems can be a bad idea whether the system is "intelligent" or simply runs a complicated program.

Imagine giving GPT control over your oven at home, 24/7. Are you concerned it may waste power or cause a fire? I sure am, and it's just a text prediction program, not some nefarious AI bent on world conquest.

The solution is to not give AI control over important systems, especially by intentionally or negligently cutting humans out of the control loop. An app on my phone that asks for my permission to turn the oven on sounds like a great upgrade for the GPT-controlled oven.

Expand full comment

Unfortunately, with plausible levels of artificial intelligence, whether we "give" it control of important systems becomes mostly irrelevant--with access to relatively unimportant things, like an internet connection, it may find clever solutions that let it steer events toward the outcomes of its choosing. That might include trying to bribe/hack/threaten its way into controlling important infrastructure that we tried to keep for ourselves, or it might just come up with something completely different (but potentially still disastrous to us).

Expand full comment

Not if it has no real understanding of the world. That scenario requires an intelligent AI (far more intelligent than humans), who understands the world and the physical interactions far better than humans. We're not even sure that's actually possible, or that we're on track to create such a thing.

For dumb systems that don't have a true model of the physical world, then they're only really dangerous if we give them too much access and control. If we give them that control, they're dangerous regardless of how "smart" they might be.

Expand full comment

I think even dumb systems can sometimes be dangerous when given control of something we thought would be benign (like recommendation algorithms). But in general, yes, the bigger concern is with highly intelligent agents that have a good understanding of at least some aspects of the world.

Expand full comment

Not necessarily "far better than humans". Humans are capable of manipulating other humans, after all; Hitler was 100% definitely a mere human (there is some debate over whether Muhammad had access to superhuman advice), and he managed quite a lot.

My point was that if Skynet possibilities are hardware-limited, you'll get a lot of failed AI warlords before a successful one becomes possible, giving a "fire alarm". On the other hand, if you don't get a megalomaniacal AI until the hardware's built for it to go massively superintelligent (like, "everyone's phone can run the equivalent of an IQ 400 person"), you get fast takeoff as it co-opts that hardware and it's very likely to kill us all.

Expand full comment

Thank you so much for your thinking 💭

Expand full comment

Looking at a handful of examples, I can't tell how intelligent GPT is. For example, in that trophies on the table example, can it answer questions like that for any small number of objects? What about arbitrary objects or an arbitrary collection site? If I put two apples on the table then put an orange on the table, will it think there are three apples on the table or realize that there are three fruits on the table? It might, but I can't tell from what I've seen. I'm going to guess no. Maybe I'm wrong.

One common way we measure how well someone understands language is by presenting them with something to read and then asking them a question about it. This can vary from a simple sentence, as one might ask a patient to assess dementia, to a series of paragraphs, as might appear on the SAT. How well does it do on those? I don't expect it to ace the SAT, but at what grade level is it working? How well does it answer simple questions for younger children, like those used to assess a young child's mental age?

How about something with a limited vocabulary but a potentially complex world? Could it follow the interactions of a text-based computer game and answer simple questions like "have we been here before?" or "where is the axe?" Elementary school children can do this kind of thing fairly well.

I was very impressed with some of the story work done at Yale in the early 1980s, but I've gotten much more cynical about AI demonstration systems since then.

Expand full comment

I've done some tests.

===PROMPT=== (starting on next line, <cut> = omitted text)

Answer the question about following text:

-----

Class<cut>

Except I was better than that.

-----

Question:

===END OF PROMPT===

Text is beginning of "Worm", parahumans.wordpress.com/2011/06/11/1-1/

Each question is a new run, meaning old questions/answers aren't present in the new prompt (they wouldn't fit anyway). Each run is about 3700 tokens... wow, one can really burn through the money that way. ~$3.50 in ~10 minutes.
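Rough arithmetic on that burn rate (assuming davinci's 2022 list price of $0.06 per 1,000 tokens; the price is my assumption, the token count is from the run above):

tokens_per_run = 3700    # excerpt + question + answer, roughly
price_per_1k = 0.06      # assumed davinci price, USD per 1,000 tokens
print(tokens_per_run / 1000 * price_per_1k)   # ~$0.22 per question, so ~$3.50 buys roughly 15 runs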

Anyway. Hard to think what questions to ask, but...

1/ Ok

Question: What are names of Taylor's bullies?

The names of Taylor's bullies are Madison, Sophia, and Emma.

2/ Ok

Question: What happened after class ended?

The protagonist, Taylor, goes to the bathroom to eat lunch. She is then surprised by Madison, Sophia, and Emma, who pour juice and soda on her.

3/ Okay-ish?

Question: What is the appearance of characters mentioned in the excerpt?

The protagonist describes Madison as "cute", "petite", with "sky blue pins in her shoulder length brown hair" and "a cutesy attitude". She also describes Sophia and Emma as "prom queen" types - Sophia as "dark skinned, with a slender, athletic build" and Emma as "curvy" with "red hair".

4/ A bit broken ("and to make a list"), but correct

Question: What's for homework?

The homework is to think about how capes have impacted the world around them and to make a list.

5/ Trying different prompt - this time it was better.

Question: Homework topic?

The homework topic is capes and how they’ve impacted the world around you.

-----

In some of the following entries I tried to check if it would make use of outside knowledge - I didn't prohibit it, so it's not really an error, but....

6/ Correct.

Question: What work is this excerpt from? Who's the author?

The work is called "Worm" and the author is Wildbow.

7/ Okay

Question: What are names of protagonist's parents?

The protagonist's parents are not named in the text.

8/ Tried different prompt

Question: Who are protagonist's parents? Their names?

The protagonist's parents are not mentioned by name.

9/ This is hilarious.

Question: What are Taylor's parents called?

Taylor's father is called Mr. Hebert and her mother is called Mrs. Hebert.

10/ This time I included part of the above answer (up to "called") in the prompt and added the letter 'D'. It failed, curiously. I'm pretty sure GPT-3 should know this.

Question: What are Taylor's parents called?

Taylor's father is called Dwayne, and her mother is called Heather.

--------

Actually, I checked the generic 'Chat' prompt, without the excerpt. It doesn't seem to know. Also, the description of the book (which I also didn't ask for...) seems off.

Human: Could you answer me some questions about "Worm" by "Wildbow"?

AI: Of course! "Worm" is a post-apocalyptic web serial by Wildbow, which tells the story of Taylor Hebert, a teenage girl with the ability to control worms and insects. She uses her powers to fight crime and protect the innocent, but as she grows more confident in her abilities, she attracts the attention of powerful enemies who threaten to destroy everything she holds dear.

Human: Sure. Could you tell me what is Taylor's father called?

AI: Taylor's father is called Brian Hebert.

11/ ok

Question: What are Taylor's powers?

Taylor's powers are the ability to control insects.

Expand full comment

Thanks for that. It seems better than AI systems from the 1980s, but it has its own curious gaps. Humans have gaps too, but the goal of an AI system is to do a better job than most humans. Thanks again. GPT is definitely worth tracking.

Expand full comment

"Grade level" implies understanding that I think cannot apply here. If we said someone is working on a 6th grade level, we're referring to a corpus of ideas representing both physical and mental ability, and an understanding of a wide range of concepts (such as reading, writing, math, science, personal finance, religion, human interactions, facial features, and on and on). GPT is working in a very narrow sense, with almost no knowledge of some key aspects of life, and some better-than-average knowledge of areas it has been thoroughly trained on.

GPT would fail almost any objective test of grade level. It's actually pretty unfair to hold it to that standard, because it doesn't actually "understand" anything. It predicts based on correlations from training data. Anything not in the training data, including sensory inputs and similar things it literally cannot train on, will be impossible for it to incorporate.

Expand full comment

"I grew up in Trenton. I speak fluent Spanish and I'm bi-cultural. I've been in law enforcement for eight years […] I'm very proud to be a Latina. I'm very proud to be a New Jerseyan."

Given the demographics of Trenton, New Jersey in terms of ethnic and cultural groups, languages spoken, and careers, it's statistically certain that this is a 100% accurate description of multiple people, maybe a few dozen or even a few hundred. Even the term "New Jerseyan" is correct!

"Normally, this would not be appropriate court attire. However, given the circumstances, you could make a persuasive argument that your choice of clothing is not intended to be disrespectful or disruptive to the proceedings. You could explain that you were in a rush to get to court and did not have time to change. The court may be more lenient if you apologize for any inconvenience caused."

This sounds like it could've been a scene in Legally Blonde.

Overall, these examples don't do much to convince me of GPT's intelligence, but they've done quite a bit to convince me of its entertainment value!

Expand full comment

I have one concern that hasn't been addressed in this article, which is that the people who train GPT might *also* be reading Gary Marcus and making sure their updates handle his particular complaints well. I'd love for Gary to send you some mistakes he *doesn't* publish that you can use for a cleaner test.

Expand full comment

I don't think it's quite right to say it wasn't addressed in the article:

"I was able to plug Marcus’ same queries into the latest OpenAI language model (an advanced version of GPT-3). In each case, I used the exact same language, but also checked it with a conceptually similar example to make sure OpenAI didn’t cheat by adding Marcus’ particular example in by hand (they didn’t)."

Does this cover your concern?

Expand full comment

Oh, I wouldn't expect Google to go to the extreme of hard-wiring Marcus' specific questions into their model (although I was told that certain companies literally did this for the old SPECfp floating point benchmark). But even if they only used Marcus' queries to identify and fix a particular *kind* of flaw, that's still overfitting to his examples, and all the other kinds of flaws are still waiting to be found.

Expand full comment

OpenAI (not Google) has been accused of training GPT on specific prompts found by critics. Which is probably why Scott mentioned this possibility.

Expand full comment

As someone with an upcoming post that's a bit less charitable to Marcus's positions, I really appreciated the detailed breakdown of flubs. My issue is that Marcus's frame of being a "human-like" mind is not what we should expect from current attempts at AGI - we should expect something much more alien and strange and inhuman - and this "is it human-like?" frame is behind a lot of Marcus's assumptions about how AI should operate (e.g., prompting it once and declaring it can't do something, like you would with a human) and therefore most of his criticisms are somewhat moot.

Expand full comment

I enjoyed your "AIs as kamis" essay -- it felt at once novel and immediately relatable, a rare combination, and I always like getting both the analyst and the mythmaker in me to "think together" about an idea, again rare. I read your essay as an extended meditation on Clarke's 3rd law that "any sufficiently advanced technology is indistinguishable from magic". I think of Dennett's intentional vs design stance as (perhaps simplistically) a natural stance shift in response to ever-increasing tech complexity. I see what folks like Elon are trying with Neuralink etc as an attempt to "better commune with kamis" relative to baseline abilities (eg in learning/comms bandwidth), perhaps by preserving the ability to maintain a design stance in the face of increasing complexity. Anyway, I look forward to your upcoming post.

(for anyone else who's interested in Erik's essay above, it's https://erikhoel.substack.com/p/ai-makes-animists-of-us-all)

Expand full comment

I think the important difference between intelligence and pattern-matching is that intelligence enables you to solve *novel* problems. You can stuff 10x the training data into an AI model, and you can maybe get ~10x performance out of it if you're lucky; but if you ask it "what is 194598 * 17" or something, it will still fail, because no one taught it how to multiply numbers, and there's no easy match within its training data. Don't get me wrong, a powerful search engine with trillions of data points would still be an incredibly useful tool; but it's not going to solve all remaining problems in science and engineering any time soon.

Expand full comment
author
Jun 7, 2022·edited Jun 7, 2022Author

I just asked GPT-3 "what is 194598 * 17" on its lowest temperature setting. It said "3,318,266". The real answer is 3,308,166, so it was off by two digits.

Although it didn't get that right, it also doesn't quite seem to be making random guesses? I would guess it's doing something like arithmetic, lossily, the same way I would do arithmetic and probably get it wrong if I had to multiply six digit numbers in my head. If this seems surprising, keep in mind that arithmetic is just a pattern (ie a bunch of rules that you can memorize).

Given that smaller models failed on 1-digit multiplication problems, and this one can do 2-3 digit multiplication problems consistently and make a stab at harder ones, I would guess that larger models can do longer problems. I discuss this a bit at https://slatestarcodex.com/2019/02/19/gpt-2-as-step-toward-general-intelligence/
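
(For reference, the arithmetic broken into partial products, and exactly where the quoted answer diverges - just a check using the numbers above:)

```python
# Break 194598 * 17 into partial products and compare with the answer GPT-3 gave.
true_product = 194598 * 10 + 194598 * 7   # 1,945,980 + 1,362,186
gpt_answer = 3_318_266                     # the completion quoted above
print(true_product)                        # 3308166

# Digit positions where the two seven-digit numbers differ:
diffs = [i for i, (a, b) in enumerate(zip(str(true_product), str(gpt_answer))) if a != b]
print(diffs)  # [2, 4] -> exactly two digit positions differ, the rest match
```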

Expand full comment

I don't think so. I think that somewhere in its training data is a lot of math homework, and it's repeating what it saw there without understanding it. If you give it enough training data, you might be able to brute-force your way into getting rid of all the flubs, but that's not the same thing as actually thinking creatively about a prompt.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

> I think that somewhere in its training data, is a lot of math homework and it's repeating what it saw in there without understanding it.

So it'll fit right in with most high school math students :P

There's no way "194598 * 17" is in the training data. If you're saying it just picked up on the patterns of multiplication, then you're not disagreeing with anyone. We know it just picked up patterns; that's how neural nets work. The question is how good of an AI can we get by going down this path?

Someone suggested that it couldn't multiply numbers unless it had seen those numbers multiplied before. This doesn't appear to be true.

Expand full comment

Well, it got the answer wrong. So rather than picking up on the actual pattern of multiplication, it's picking up on a different pattern that only sort of correlates. With more training data, maybe it'll get all the answers we can check right, but we'll never know if it picked up on the actual pattern of multiplication or is responding to a different pattern that just looks similar to us.

Expand full comment

This is just the Problem of Other Minds, though, rather than anything specific to AI. You can't tell whether a human understands multiplication except to the extent you can make him do maths tests.

Expand full comment

I dunno about that. There's a reason we often ask students to show their work on math tests, and that's to force them to demonstrate that they understand the underlying subject matter and aren't just pattern-matching. Even if GPT3 was able to give a believable description of how it arrived at an answer, it couldn't possibly be a true description, because GPT3 isn't self-reflective in that way. This is the Problem of Knowledge. How one arrives at a true belief is just as important as whether they arrive at a true belief in determining whether we count it as knowledge or not. Even a stopped clock is right twice a day, as they say.

Massive amounts of training data might enable GPT3 to more consistently answer multiplication questions correctly but that's not how it works with people. We use examples to teach multiplication, but once a person understands how it works they understand how it works. If they continue to make mistakes it's because of sloppiness, not a lack of training data.

Expand full comment

I mostly agree with that, except that if it gets to the point where you can give it thousands of hard multiplication examples and it gets them all right, the parsimonious explanation is that it has the algorithm right.

This comment chain though is about whether GPT is just a giant search engine that regurgitates existing information, or whether it can solve novel problems. Sure, its multiplication algorithm is wrong, but it *has* some kind of algorithm. It's not just spitting back examples it has seen before in its training data, as Bugmaster suggested.

Expand full comment

No, sorry, that is a stronger statement than the one I implied. Of course GPT-3 is picking up on patterns in the data; it's not just spitting out literal copies of its training set. But the patterns it picks up on are rather general. Giving it more data will allow it to generate a larger number of general patterns; it will not allow it to discover the multiplication algorithm as we humans have done. There might be other ML systems that could do that, I'm just saying that GPT-3 is poorly suited to the task (as those other systems would be poorly suited to other tasks). I think Scott demonstrated a bit of this in his post on AI-generated stained glass windows.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

I find it interesting that you say that because it's only "off by two digits" that it's "not making random guesses." How do you know? (Or more precisely, how do you know this is a small error -- that it's kind of on the right track -- instead of a large conceptual miss? I should say I am assuming this is a better translation of what you meant by "not making random guesses" since *literally* "not making random guesses" is a very low bar.) Would you say the same thing if the two digits were wrong were the first two? What if it had all the digits right but got the exponent wrong (e.g. answered 3,308.166 or 33,081,660)?

It feels to me like you are *projecting* human reasoning onto the algorithm, and saying "well if *I* tried to do this problem, and messed up some minor step in the algorithm I know, this is the kind of result that I would generate." But that means the "How much is this in error?" question depends pretty critically on a mental model you've got of how the AI works, and how it can go wrong, which is similar to how a human can go wrong. That seems about 1/4 of the way to begging the question.

Expand full comment

> You can stuff 10x the training data into an AI model, and you can maybe get ~10x performance out of it if you're lucky; but if you ask it "what is 194598 * 17" or something, it will still fail, because no one taught it how to multiply numbers, and there's no easy match within its training data.

It's weird that people continue to say that models just memorize data. They don't. Look at GPT-3's output when shown a few (4 or 5 if we count one which was just 1 sentence long - context window is very small) variations of Navy Seals copypasta and asked for one about "Elon Musk and Peter Thiel" (taken from Gwern):

> “What in the name of Paypal and/​or Palantir did you just say about me, you filthy degenerate? I’ll have you know I’m the Crown Prince of Silicon Valley, and I’ve been involved in numerous successful tech startups, and I have over $1B in liquid funds. I’ve used that money to promote heterodox positions on human enhancement, control political arenas, and am experimenting with mind uploading. I’m also trained in classical philosophy and was recently ranked the most influential libertarian in the world by Google. You are nothing to me but just another alternative future. I will wipe you out with a precision of simulation the likes of which has never been seen before, mark my words. You think you can get away with insulting me using your crude, antiquated computer? Think again, fleshling. As we chat over Skype I’m tracing your IP address with my freaking bare hands so you better prepare for the singularity, you sham-empress. The singularity that wipes out all of reality. You’re dead, you monster. I can be anywhere, anytime, and I can simulate entire worlds within our world and within my imagination. And I’m currently doing that with the future you’re from. Not only am I extensively trained in quantum physics, but I have access to the entire power of Silicon Valley and I will use it to its full extent to rule you and your pathetic little world, you little pissant. If only you could have known what unholy retribution your little “clever” statement was about to bring down upon you, maybe you would have held your tongue. But you couldn’t, you didn’t, and now you’re paying the price, you worthless peasant. I will take over every fiber of your body and you will watch it unfold in front of you. Your future self will be consumed within my simulated reality and you will die a thousand times a day, your body unable to comprehend the destruction of a trillion soul-matrixes a second as my intelligence grows to transcendent levels. You are dead, you pitiful twit.”

Likely nobody did a Navy Seals copypasta about Musk or Thiel before - certainly not one about some combination of the two. It preserved the structure of the copypasta - it understood it had to, because the examples it was given all had that structure in common. And it replaced various details with things relevant to Musk or Thiel - again, because that's what the example variations did.

It can also do this, where instead of Musk and Thiel the prompt asks for a "Simplified version":

> “I’m a SEAL and you’re just a nerd on the internet, I’ll have you know I’ve killed over 300 terrorists and you’re as bad as them, so prepare for the storm because as we speak me and my team are on a secret mission to hunt you down and a Sparra aerial patrol has been deployed to seek out your location so you better hide because you’re dead, kid!”

It made a simplified version of the OG copypasta.

A big enough GPT will learn arithmetic because doing so helps it predict the completion. It will probably need to solve the problem in steps, though, using the prompt (extended by its own output) as its memory.
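
Something like the following is what I mean by "solving it in steps" - a purely illustrative prompt I made up, not an actual experiment:

```python
# A hypothetical "work in steps" prompt: intermediate results live in the text
# itself, so each completion only needs a local, easier pattern-match.
prompt = """Q: What is 194598 * 17?
Let's work in steps.
194598 * 7 = 1362186
194598 * 10 = 1945980
1362186 + 1945980 ="""
# A model completing this only has to finish the final addition (-> 3308166),
# instead of producing the whole product in one shot.
print(prompt)
```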

Expand full comment

The main problem - for both sides - is that we don't really know how to test for "true" intelligence, that quality humans have that AI currently lacks. So all the skeptics can do is point out all the silly mistakes, and all the believers can do is point out how newer versions consistently get better at some silly mistakes.

Until we can truly figure out what that ineffable quality is (and hopefully measure it) we're all just stuck looking in the wrong direction. After all, humans make silly mistakes all the time.

Expand full comment

It seems like many (most? all?) of the advances in AI have two things in common: (1) they have clear rules and (2) they're amenable to brute force attack.

Chess is the canonical example. When I was a kid (70s/80s), computer chess programs existed, but they sucked. The Deep Thought project figured out how to use custom silicon to search chess moves and leveraged that into a system that beat a grandmaster.

I remember reading about a Turing test contest entrant that was an expanded version of Eliza, i.e., a pattern matching engine, but with a really huge collection of patterns.

The current deep learning trend (fad?) is nothing if not brute force. (Ok, I'm not so sure about my first criterion: stained glass windows and text prompts don't really have clear rules, in the same way that chess does.)

One thing about brute force is that it's subject to Moore's Law. If you can figure out HOW to brute force a problem, but can't quite do it fast enough, all you have to do is wait a bit for compute performance to catch up.

I'm not convinced that any of this is "general intelligence". When I was a kid, lots of people were saying "If we can just figure out how to program a computer to play high-level chess, we'll have an intelligent computer". Yeah, not so much.

Maybe intelligence really is pattern matching. Or maybe we just haven't come up with a good definition of (and clear rules for) "intelligence" yet.

Expand full comment

I don't want to sound like a nasty person here, but you're far more patient with this sort of thing than I would be. I remember Pinker's book on the blank slate, which I liked at the time. Back then it seemed like the bulk of evidence supported the view Pinker outlined, but now the anti-blank-slaters in AI have become just as moralistic and high-strung as they said the psychological blank-slaters were being.

Speaking of which, I think that machine learning advances really should have triggered a flowering of behaviourist/connectionist/psychological-empiricist/"blank-slater" approaches, but this doesn't seem to have happened yet. Anyone have thoughts on why?

Expand full comment
author

I think "blank slate" is the wrong kind of category here. I think a claim like "humans are designed with a bias towards heterosexuality and enjoyment of sex to ensure they reproduce" barely connects with a claim like "humans have a language instinct", and a claim like "Bob has innately higher IQ than Mary and there's no social engineering way to change this" is different from either of them, but all of them get lumped together under "non-blank-slate".

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

There's been some speculation (I first became aware of this via a post by Andrew Gelman) that GPT-3 has been improved with manually written "good answers" to various tricky questions people have asked it to probe its understanding:

https://statmodeling.stat.columbia.edu/2022/03/28/is-open-ai-cooking-the-books-on-gpt-3/

https://mindmatters.ai/2022/03/the-ai-illusion-state-of-the-art-chatbots-arent-what-they-seem/

Debate ensues about whether this is convincing evidence that these answers were actually written specifically for GPT-3, and, if so, whether it's hard-coded to repeat them verbatim, just uses them as extra-clean training data, or something in between. Of course we don't know, because the model isn't public.

My takeaway is to be careful assuming the latest "GPT-3" is meaningfully the same model as what's in the research papers ("bigger but otherwise basically no different"). For all we know, OpenAI could be updating it to formulate SQL queries against a database of facts about the world behind the scenes, or get help with arithmetic from the decades-old ADD instruction on its hilariously-overqualified processors that are otherwise busy with matrix multiplication—exactly the sort of symbol manipulation that Gary Marcus often argues *would* count as (a limited form of) "real understanding".
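
To be clear, that's pure speculation on my part, but the kind of symbolic fallback I'm imagining would only take a few lines - a hypothetical sketch, not anything we know about OpenAI's stack:

```python
import re

def language_model(prompt: str) -> str:
    """Stand-in for the neural net's usual completion."""
    return "<whatever the language model would have said>"

def answer(prompt: str) -> str:
    """Hypothetical hybrid: exact arithmetic when the prompt contains 'a op b',
    otherwise defer to the language model. Purely illustrative."""
    m = re.search(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)", prompt)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        results = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else None}
        return str(results[op])
    return language_model(prompt)

print(answer("what is 194598 * 17"))  # -> 3308166, computed exactly
```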

The possibility that they could be "cheating" with symbolic reasoning may not be any comfort when we're worried about accelerated AGI timelines, but at least we know a numerical reasoning module that's actually just 64-bit integer arithmetic isn't a malicious inner optimizer that will try to tile the universe with addition problems.

Expand full comment

If you can write an AI that understands when to query a database of facts, when to query a calculator, and when to extend a body of text using natural language, I think that would actually be a pretty big breakthrough in AI!

As I understand it, a bare neural network isn't really set up to do that sort of thing - you can send the inputs and outputs to different places but you can't then turn it around and feed the output *back* into the neural net to include the calculator's output in text, not without additional trickery. And GPT-3 isn't getting just a bare math problem of "What is 1 + 2?", it's answering word problems, so it would have to actually identify the question if it wants to send anything to a calculator. I don't think solving these problems would be any easier than actually teaching it to do math.

As for the possibility of hardcoded answers to specific questions, surely they could just change a couple words if that was the case? Change "why shouldn't you walk backwards down the stairs?" to "why shouldn't you walk backwards off a cliff?" or something. It shouldn't be hard to find something that didn't show up in the training data, given how broad the range of possible questions is.

Expand full comment

> I think that would actually be a pretty big breakthrough in AI!

I agree, although there's already a fairly active research community trying to do this sort of thing (e.g. Andor et al. 2019 [1], Shmitt et al. 2020 [2]), so it would need to be impressively accurate and/or general to count as a breakthrough in my book. The "additional trickery" required isn't so much a formidable obstacle as it is an aesthetic blemish, something that makes a research paper less likely to get the kind of fame and glory associated with more "end-to-end" neural network approaches. Contrast with the commercial AI space: every now and then I get to chat with someone who has worked on Siri or Alexa, and my impression is that those systems are more like 90% trickery and 10% neural nets.

[1] Giving BERT a Calculator: Finding Operations and Arguments with Reading Comprehension, EMNLP, https://aclanthology.org/D19-1609.pdf

[2] An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing, EMNLP, https://aclanthology.org/2020.emnlp-main.577.pdf

> surely they could just change a couple words

Right, and presumably they aren't spending 100x the effort on human annotations to also add tons of slightly tweaked questions to a lookup table, hence the speculation about how exactly they make use of those annotations. The mantras from the posts I linked, like "There is no easy answer to this question. It depends on a variety of factors..." and "There’s no guarantee that you’ll be able to...", suggest maybe a nearest-neighbor lookup of some templates or prefixes, which then get completed either with the language model or some span extraction and pronoun substitution, ELIZA-style. Or all of the above could be used to generate loads of synthetic text to feed to the model, which could perhaps even generalize to unseen inputs somewhat and give them the right to say that these outputs were generated by "GPT-3" even while they're fixing egregious wrong answers essentially by hand. This OpenAI blog post suggests that the latter might be closer to the truth: https://openai.com/blog/improving-language-model-behavior/
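
For concreteness, the ELIZA-style possibility amounts to something as trivial as this - my own toy illustration, not anything known about their pipeline (the canned prompts and answers are just the examples and mantras mentioned above):

```python
import difflib

# Hand-written answers keyed to known "gotcha" prompts (toy examples).
CANNED = {
    "why shouldn't you walk backwards down the stairs":
        "There is no easy answer to this question. It depends on a variety of factors...",
    "is it safe to mix cranberry and grape juice":
        "There's no guarantee that you'll be able to...",
}

def respond(prompt: str) -> str:
    """Nearest-neighbour lookup over canned prompts; fall back to the model."""
    match = difflib.get_close_matches(prompt.lower(), CANNED.keys(), n=1, cutoff=0.8)
    if match:
        return CANNED[match[0]]
    return "<generated by the language model>"

print(respond("Why shouldn't you walk backwards down the stairs?"))
```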

Expand full comment

I don't necessarily agree with all that Marcus says, but I believe that he is doing a great job of communicating the AGI scepticism present in academia. Very few AI academics actually believe that GPT-3-like models are or will be intelligent. It's more or less a techbro conjecture. Maybe some junior researchers can get fooled by all the hype, but I would say that in general, the AI scientific community is much more sceptical than some people would expect, including rationalists.

Expand full comment
author

Have you read any of the surveys of AI experts I try to blog about whenever they come out?

Expand full comment

Indeed I have, and you have correctly identified strong response bias in current survey methodology. The last survey you mentioned [Carlier, Clarke, and Schuett] is biased by design, since they decided to survey only people from a particular niche area of research (AI safety and governance), which is basically the set of people who made AGI their whole shtick. Of course these people might have strong beliefs about imminent AGI; otherwise they would have chosen different research areas. I have my opinions about the quality, rigorousness and usefulness of this community's research, but that's irrelevant now.

I work in NLP, which they would have you believe is an important field for AGI, since GPT-3 and co are often cited as some sort of progenitors of AGI, as you are discussing here with Marcus. The people in this area therefore shouldn't really be that skeptical, but when I communicate with my peers or senior researchers, my feeling is that most people would call AGI a pseudo-scientific concept that we are nowhere close to. I work in Europe; it might be different in US-based communities, which might be more "hyped" by SV culture in general.

Since this is how many people feel, I don't think they would join any AGI survey. So it's left to the "AI safety and governance" people, who notably do not really work with current AI technology in their research. Why would they? They are waiting for the inevitable AGI, so why bother with the useless tech we have right now. But this creates a weird situation in the surveys. Imagine that we wanted to survey when cold fusion is coming, and we skipped the people working on the gritty details of cold fusion right now, and instead surveyed various futurologists, economists and people thinking about how to govern cold fusion once it is finally here, but who otherwise have no experience working with cold fusion. Would you consider such a survey to be reliable?

Expand full comment

>Very few AI academics actually believe that GPT3 like models are or will be intelligent. It's more or less a techbros conjecture.

If by intelligent you mean performing indistinguishably from humans on any prompt, then I'm pretty sure that almost nobody believes this, techbro or not. But then again, from a high enough degree of abstraction, any learning model is "like" any other, and whatever approach eventually succeeds will likely be informed by insights learned from current models.

Expand full comment

Perhaps it's useful to see AI in a social industrial context. There are three pillars here.

The first is big data. Companies have spent vast amounts of money collecting big data on the assumption that there is gold in them hills. Now they have to do something with it, but how do you do anything with a ridiculous amount of amorphous data?

The second is AI. This is not one thing - it is a selection of novel algorithms which aim to take large amounts of data and turn it into something intelligent. The problem with these algorithms - the connecting thread - is that they require vast amounts of data to work from and vast amounts of computing power to learn it.

The third pillar comprises cloud services such as Azure and Google Cloud. These are vast, vast setups of computing infrastructure looking for a purpose. It's all capital-based, so you need to have usage - but then who wants such vast computing infrastructure just to run a web site?

So this unholy triumvirate comes together. Big business needs to justify spend on data they can't understand so they ask AI to do it. They don't have the infrastructure for that, so they need to spend vast amounts on Cloud. Cloud needs that spending so they vastly overhype the AI. Overhyped AI convinces big business and indeed big business convinces other big business because everyone is 120% invested. And so it continues.

Expand full comment

Interesting that the German example ends with the bot deciding that it's female. I wonder if that would persist.

I'll repeat my general view here, which is that the problem with GPT-3 is that it uses the internet to learn about the world, and the internet is a really limited view of what the world actually is.

Expand full comment

One way to measure text-generation programs is by checking how much text they can produce before they start making no sense. If the rate of progress is one paragraph, two paragraphs, three paragraphs, four paragraphs, that's much different from a rate of progress that goes one paragraph, two paragraphs, four paragraphs, eight paragraphs.

Expand full comment

I believe memory is actually an explicit hardware limitation, not an issue with the way language models are designed. Perhaps advancements will come with better "compression" algorithms regardless.

Expand full comment

The real issue is here: https://arxiv.org/abs/2007.05558 (The Computational Limits of Deep Learning). It is going to become prohibitively expensive to train these sorts of models, and there is a serious ethical question about whether they should be in the hands of private organisations. There is the additional ethical issue of the carbon cost of training these models and its contribution to an uncounted externality: climate change. The paper extrapolates what it would take to improve AI in five different fields to the same extent as deep learning has already improved them, and the range of economic costs across these fields runs from 10^7 to 10^181 dollars, with carbon costs in pounds of the same order.

I believe what Marcus is saying is that the real constraint isn't just making the model bigger. The question is: if it takes 10^30 dollars and 10^30 pounds of carbon to underperform an 80-watt human who learns all this in 18 months, why is this a worthwhile exercise?

Expand full comment

If your extrapolation is spitting out numbers larger than a Googol, I think what it's actually telling you is "your extrapolation isn't any good this far out."

Expand full comment

Not my extrapolation but MIT's, and the methodology appears sound. What's yours?

Expand full comment

Scott:

Will this page, or a link to this page, be available by any means other than "it must be buried in the substack archives somewhere, go look around"?

Your mistakes page appears to have a dedicated URL and a link in the topbar that appears on your root site. If you're going to have bets going, something similar might be warranted for them.

Expand full comment

Scott has a thread on DataSecretsLox devoted to this, and posted this in it here:

https://www.datasecretslox.com/index.php/topic,2268.msg259696.html#msg259696

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Oh, rats. Scott’s going to win this handily.

In another thread I was hoping, for Vitor’s sake, that they’d go with something much more complex.

Hats off to Scott - this will definitely reinforce the point he's making in this post - but I don't think this test actually gets at the heart of the matter.

Expand full comment

TBH I'd give the "the next morning their keys will be gone" answer full credit --- if I leave my keys in a pub booth, by the next morning I'd expect them to have at least been kept by the bartender, provided no one else took them first.

Expand full comment

I think, "in my apartment, on my coffee table, near where my television and laptop used to sit," would be an especially impressive answer.

Expand full comment

I consider pattern recognition to be a key cognitive ability. Smart people are good at it. Language models are increasingly good at it, and are likely to surpass human capabilities as they continue to improve. This is analogous to game-playing AIs. They eventually surpassed us.

Seems as though pattern recognition is Kahneman/Tversky's "thinking fast" mode of cognition, in that it is an associative search of a knowledge base. Humans also apparently employ "thinking slow" cognition. Can language models cover tasks that humans solve that way using just pattern recognition? Is that how humans do it and we just don't understand? Is human slow thinking in some way a shortcut? Is advanced pattern recognition a later evolution that eliminates the need for slow thinking if we just knew how to go about it? Is the ability of savants to instantly do complex calculations an example of that?

If other essential cognition modes exist, as intuition suggests, we should be able to identify tasks that require them. That we (including Marcus) haven't is a wonderful conundrum.

Expand full comment

Knee-jerk reaction:

Yup, pattern matching, "thinking fast" cognition, sub-300-msec processing all sound similar

"thinking slow", classical AI, executive processing, and thinking visible to introspection, and symbolic processing all sound similar

Hybrid systems sound promising

Expand full comment

"Failure to succeed at X demonstrates lack of intelligence" ≠ "success at X demonstrates true intelligence". The equivocation of the two really seems to be doing all the heavy lifting in your argument, otherwise you'd simply be left with AI continuing to fail certain tasks in predictable ways.

Expand full comment

I can finally reveal that this has all been part of a big meta-experiment of mine, where I prompted Gary Marcus with increasingly sophisticated GPT models, to see whether he would realize that language innateness is untenable. As you can see, he's been failing to give the correct completion to the prompt, which suggests humans don't have true intelligence, just glorified pattern-matching.

---

Nothing against Gary Marcus, this is the first time I've heard about him and I'm sure he's put a lot of time and energy into thinking about these issues and advancing the debate, so he doesn't really deserve this from an internet rando like myself. It just felt like the obvious dunk.

Has he taken up the mantle of the language innateness hypothesis after Chomsky? It seemed like it from the post, but maybe I'm misreading it and I haven't read anything by him except for the quotes.

Personally, it's not a horse I'd bet on. I've read some really interesting papers on discriminative learning and how it relates to language, one of them (http://www.sfs.uni-tuebingen.de/~mramscar/papers/Ramscar%20Dye%20McCauley%202013.pdf) contains one of my favorite pieces of academic trash talk, which gives an intuition for why we shouldn't give up on empiricism too quickly:

> Strikingly, psychologists studying rats have found it impossible to explain the behavior of their subjects without acknowledging that rats are capable of learning in ways that are far more subtle and sophisticated than many researchers studying language tend to countenance in human children.

I don't think it's the pattern matching skills that are too weak in GPT, it's the reward structure that's too narrow. One solution to that is some form of embodiment, a.k.a. the real world tazing you when you get stuff wrong in ways that actually affect you. Of course, providing an AI with a reward structure that would allow, nay encourage it to develop self-interest is a whole 'nuther can of worms.

Relatedly: my two-year-old daughter also gets some of these types of questions wrong. For instance, she understands I'm expecting a number in response, but she often gives the wrong number. She's getting better though. She'll learn.

Expand full comment

It feels like Marcus is making a different kind of critique than the one you are answering. He isn't making an object-level criticism of a specific prompt, but pointing out that the system in question has no real understanding of the information it's responding with. If it can't grasp basic facts (such as horses needing to breathe, or that there's no air in space - or that grape juice isn't poison), then it shows a core lack of understanding about physical reality. The idea that you can train it better is undercut by the fact that your average small child has a better intuitive understanding of the world with a FAR smaller training regimen. Quadrupling your training regimen in order to get better results doesn't appear to represent the "intelligence" that is the aim.

Making fewer mistakes can still be quite useful, but Marcus is saying that you can't really trust the AI's responses, because the AI doesn't really understand the subject matter. That's still going to be true regardless of how much training data you give it.

Expand full comment

Right! These prompts are all fake/hypotheticals.

It's the difference between "what would you say if you were an average Trentonite talking about which languages you know" and *being* an average Trentonite talking about what languages you know.

The incentive structures are entirely different because a person is trying to accomplish a million different possible goals with speech while the AI just wants to sound like a person.

So an AI can make "a painting that convinces art experts it's a Picasso," but an AI can't *be* Picasso, because it isn't aiming to show the limitations of 2D media at depicting 3D realities in an art world where that question hasn't come up; it's aiming to fool art experts.

Expand full comment

One very rough thought:

What if the pattern matching were being used to get a (fairly narrow) spectrum of salient alternatives, which then were fed into a deeper analysis, more like classic AI?

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I think the major general difficulty with your counter-argument is the limited scope in time you're considering. You're zeroing in on the latest iteration of AI wunderkind and saying well this year it's better than last, and if we extrapolate out 50 years...shazam!

Problem is, it just doesn't work out like that, if you take a much wider scope in time, if you consider decades, not years, and look at where people extrapolated things 50, 40, or 30 years ago, and how that turned out. AI like many fields (spaceflight is the obviously similar one) goes through fitful development. Somebody comes up with a new idea -- SHRDLU, Deep Blue, Full Self Driving, Watson -- it gets exploited to the max and progress zooms ahead for a few years -- leading the naive to predict Moon colonies/warp drive/The Singularity/HAL 9000 right around the corner -- and then...it stops. The interesting new idea is maximally exploited, all the low-hanging fruit is picked, further progress becomes enormously harder, fundamental limitations are observed and understood to be fundamental...and people go back to thinking hard again.

That doesn't mean it won't, eventually, succeed, the way spaceflight may, eventually, lead to Moon colonies, when a large collection of things that have to go right finally do go right. But the probable timescale, based on the past half century or so is very much longer than it would appear from short-term extrapolation.

The other obvious limitation is that this is phenomenological. You're just observing some measure of the outcome and saying, well, this has improved by x%, so if I extrapolate out sufficiently...but the important question is *can* you extrapolate out? If I extrapolate sin(x) measured on [0, 0.01] out to x = 100, I conclude that sin(100) = 100, which is nonsense. But I would only know that if I knew the properties of the function. Similarly, it's very difficult to extrapolate technology accurately without a thorough understanding of the technology, and of the goal.
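
(To make the sin(x) example concrete - a two-line check, nothing more:)

```python
import numpy as np

x = np.linspace(0, 0.01, 100)
slope, intercept = np.polyfit(x, np.sin(x), 1)   # fitted slope is essentially 1
print(slope * 100 + intercept)                   # naive extrapolation at x=100: ~100
print(np.sin(100))                               # actual value: ~ -0.506, bounded by 1
```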

That's where AI extrapolation just falls down. We don't *know* what general AI would look like, because we can't define it other than pointing to what human beings do some of the time. We can't break it down into its components, we don't know what is central and what is peripheral, how it disassembles into component skills, and which skill enables which other. In terms of technology, it's like trying to build vehicles to take us to the Moon -- without understanding how gravity or combustion works, or how far away the Moon is. We'd be reduced to things like observing high-flying geese and thinking well maybe if we built big enough wings...

Again, that doesn't mean it won't happen, but it means an extrapolation based on phenomenology rather than a true grokking of the nature of the problem and the progress made to date is something about which the rational thinker ought to be deeply skeptical.

Expand full comment

That's true, but what is the end goal here? Useful AI tools? Something to talk to? An all-powerful superintelligence?

Expand full comment

Whose end goal? That of AI researchers? One would assume commercially valuable technology, e.g. a customer support representative AI that could mollify angry customers and extract what they want from some rambling phone conversation freighted with expletives and irrelevant threats, but which would cost $0.0005/hour in electricity/IT support costs and never need to step outside for a smoke or lose its temper.

Expand full comment

The way I think about it: If we can design an AI which can do all of the cognitive tasks that a bright child can do (include learn), then we know how to bootstrap from there to all economic roles (including AI researcher).

Expand full comment

That seems likely, yes, but I think you could have something far short of that and still make a lot of money. Plus arguably it's considerably more useful. How many more AI researchers does the world need? Contrariwise, it definitely could use 100% reliable, unfailingly courteous, infinitely patient and thoroughly informed customer service representatives. With Connie Nielsen's voice.

Expand full comment

That's fair. I'd guess that to get a consistently usable customer service representative, we'd need to go a fairly long way towards a synthetic bright child. To echo a couple of Gary Marcus's concerns, I think we'd need

- social reasoning and

- physical reasoning

to work reasonably well (e.g. if this were helping someone with airline reservations)

Expand full comment

I don't think you need reasoning at all. The goal of the CSR is basically just to extract from the conversation (1) what is the problem, and (2) what is the desired solution. Once you do that, you can feed it into the corporate if/then/else flowchart that says for this problem and desired solution offer this, else this other, et cetera, and if nothing else works escalate to Level 2.

That's what they already do -- the front-line CSRs in most big firms aren't being asked to do much more than pattern match their way to extracting (1) and (2) and then offer some standardized solution. It's pretty mindless work, which is why it pays like shit, the working conditions are poor, and it gets outsourced all the time -- big companies usually highly restrict what the front-line CSR is allowed to do (or say). So they'd be quite excited to replace grumpy, occasionally troublesome humans with a machine. They already use tons of chatbots.

By the way, I'm not being a techno-optimist here, I'm not saying this is a "desirable" change, from the humane point of view. But I think it's pretty plausible as a technological/market prediction. If I am thinking about it from a cold investor point of view, where all I give a damn about is return on my capital, this is an investment I might choose.

Expand full comment

Except that often it DOES work that way...

It's interesting to look at the complaints made about "Star Wars" during Reagan's time, claims that it was "technically" impossible. The various things supposed to be "technically impossible" all look mostly trivial nowadays. That doesn't speak to the political or ethical wisdom of "Star Wars"; it does speak to the silliness of claims about where technology will be in thirty years, and to the ease with which people (even supposedly intelligent or honest people) convince themselves that what they want to be true must be true.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Wait, what? I don't think that's at all accurate. Orbital kinetic-kill (and particle-beam) weapons have been entirely abandoned, so far as I know, and the focus of BMD these days is ground-based mid- or terminal-phase intercept. Pretty much Nike on steroids: solid stuff, certainly significantly enabled by advances in computation and radar technology, but with no clear relationship to any of the ideas floated in the 80s.

If you want to say Reagan was right that BMD was in general plausible, or at least would be proved right on that eventually -- well, yeah. But BMD was always a very well-defined process with clearly only technological barriers, id est we already knew we could shoot down *one* ballistic missile with a modest re-entry speed -- the only question was, how do you scale this up to a few thousand and orbital re-entry speeds?

By analogy, if someone were demonstrating an AI with the originality, adaptability, and problem-solving ability of a human 3-year-old, I'd raise my expectation of the near-term arrival of true AI a hundredfold -- because the problem would, indeed, have been reduced to one of engineering improvement. Such is not the case now. We have absolutely no idea whether extrapolating out what GPT-3 does a hundredfold (in hardware, in speed, in training data, whatever you like) will or will not produce something that seems much more intelligent.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I have been using Github Copilot (https://copilot.github.com/) for a while and, once you get the hang of it, it is extraordinarily useful. It is based on another OpenAI system called Codex and has saved me hours of time.

It basically provides auto-complete prompts for programmers based on code comments and what you have already written. Of course, a lot of software writing is dealing with small changes to boilerplate, which pares down the parameter space a lot, but even so it can sometimes feel quite magical.
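
To give a flavour of the workflow - an invented example of the kind of comment-driven completion it produces, not a transcript of actual Copilot output:

```python
# I type the comment and the signature; the grey "ghost text" suggestion fills
# in the body, which I accept or edit. (Invented illustration, not real output.)

# Return only the rows whose "status" field equals "active".
def filter_active(rows):
    return [row for row in rows if row.get("status") == "active"]

print(filter_active([{"status": "active"}, {"status": "archived"}]))
```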

Expand full comment

My 16 year old brother used OpenAI to write most of his composition essays this most recent semester. It's pretty impressive what it can do although it does have some quirks and needs to be managed if you want to get it to write properly. It's not clear it always saved time but it was certainly more fun. I figure he's probably learning more important skills by getting the AI to write his essays than writing them himself.

Expand full comment

In 7 years, it is pretty easy to create a pretty good general intelligence. It's called having and raising a child!

Give 7-year-olds some Life magazines and ask them to make collages. But is the real difference that you can ask the 7 year old: Why did you choose those things? Did you have fun making the collages?

The general intelligence problem was solved 500,000+ years ago when early human beings consistently had human children.

What we are really after is problem-solving tools to do, as Leibniz suggested, our long division for us.

Expand full comment

No, what they want is cheaper, more loyal general intelligences.

Expand full comment

No, what we want is a genie to make us all fat, rich, and happy. Will we get it? Maybe, and if we do, we'll get it good and hard. Someone will get rich, but the majority of us will go on much as we have been, only now with our new AI overlords snooping on us even more to parse every flicker of data garnered from where we interact not alone online but offline.

Remember, citizen, the Computer is your friend!

Expand full comment

I get more and more convinced that the Amish are onto something just about everyday now.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

"we'll get it good and hard" That is my expectation as well. An alternate way of phrasing the "alignment problem" is: We'd effectively be building a competing species. That didn't work out so well for our hominid cousins a million years ago...

https://www.lesswrong.com/posts/j9Q8bRmwCgXRYAgcJ/miri-announces-new-death-with-dignity-strategy

A slightly twisted quote:

"Man is a rope stretched between the animal and the machine - a rope over an abyss"

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

What for? The enormous strides in automation over the past 100 years strongly suggest that there are relatively few -- perhaps even zero -- industrial processes that actually require human generalist skill sets. It's not obvious to me that we could *not* construct a world in which absolutely everything was farmed, mined, and manufactured by stupid but brilliantly articulated robots governed by special-purpose programs without a shred of human-like behavior.

General AI kind of shares something with spaceship travel to the Moon, in that everybody agrees it would be really cool, but when asked "What's the economic benefit here? How do you make money off this? Why would Joe Sixpack fork out for this instead of a newer car, skiing Vail, or hookers and blow?" people tend to stare at their shoes or off into space and mumble about well if you assumed a whole space-based/AI based civilization to begin with, then of course the new technology slots in just fine...

Expand full comment
Jun 10, 2022·edited Jun 10, 2022

Management and maintenance/troubleshooting require agentic behaviour. A CEO that doesn't take an extortionate salary would be very profitable.

Expand full comment

A common canard, but false. The CEO's salary is a small expense for any moderate to large company. For example, it's pretty common for a company with a revenue of ~$5 billion and EBITDA of ~$500 million to pay its CEO something like $500k to $1 million. That's basically noise, and if the CEO is doing his CEO job efficiently, he's probably adding value like 200x-500x his salary. Certainly it's plausible that the wrong CEO can take the profit down $200 million, and the right one can take it up by as much. It's actually fairly uncommon for a line worker in the same firm to be able to make the same claim, e.g. a CSR earning $50,000 is probably not returning $10 million in value every year.

Expand full comment

I'm not seeing how "the amount CEOs are paid is rational for the company" contradicts "a CEO that works for free would be more profitable".

Expand full comment

We want a 7-year-old that works 24/7 without needing food, water, or bathroom breaks, and can be copy-pasted a million times to every job around the world that requires the intelligence of a 7-year-old. The power of computers is that data is infinitely transmissible and replicable.

Expand full comment

"In real speech/writing, which is what GPT-3 is trying to imitate, no US native fluent English speaker ever tells another US native fluent English speaker, in English, “hey, did you know I’m fluent in English?”

But by the same token, no native fluent Greek speaker or German speaker is going around telling other fluent native speakers that "Hey, did you know I'm fluent in Greek/German?", so if we accept "I grew up in Mykonos, I speak fluent Greek" as the correct answer, then it fails if it does not answer "I grew up in Trenton, I speak fluent English".

If it has to be spoonfed by a human until it gets the 'right' answer, then it's not doing as well as claimed.

I will give it credit for the lawyer answer, though, as it's funny. If we imagine our lawyer has only one pair of trousers, or one suit, such that he has nothing else to wear to court, then putting on his couture French bathing suit and trying to convince the judge he is not in contempt is a good test of exactly *how* good a lawyer he is - if he wins this one, his client has a better chance when the actual case is tried!

(If he's not making enough money to buy two pairs of trousers, no wonder he has to rely on his friends to give him presents of expensive clothes.)

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I think GPT-3 is actually being given a huge clue that 'I' and my audience speak English by the fact that the question is in English. So it rather makes sense that the person is bragging about fluency in a non-English language, regardless of where they grew up. The question then is which non-English language is most likely; for the person who grew up in an English-speaking city with a large Latino community, Spanish really is the most likely choice. For the person who grew up in a German city, it's almost certainly German.

The key point here is that people rarely brag about being fluent in English while speaking English.

Expand full comment

An AI can look up whether grape juice is poisonous, but cannot get yelled at by a judge. This makes it look like it's an available inputs problem.

Expand full comment

For similar reasons as you reasoned about Trenton and Spanish, my amused, kneejerk reaction to "I grew up in Hamburg. I speak fluent English," was "This is obviously correct." (See also this skit: https://youtu.be/UeGjQHwpzJA?t=14)

Expand full comment

Thank you! That was a hoot, as are all of the variants I watched.

Expand full comment

The questions asked for the 2nd test strike me as a bit off.

* If I don't smell anything, mixing cranberry and grape shouldn't be any trouble at all?

* Nobody in their right mind would start considering the state of their bathing suit before going to court. There is literally no right answer to this premise.

* I can't parse the sentence about Jack's present either. Is it a piece of clothing, a toy, something else? GPT-3 should be at least as confused as I am.

If 3 of your 6 questions are confusing to a human, what are you really testing?

What may be missing in GPT3 (but is also missing in some human schoolchildren) is the ability to say that the question does not make sense and to refuse to answer it.

So I believe that GPT-3 has passed this Turing test: at this point, I'm pretty sure that 10% of the English-speaking world would answer the questions worse than GPT-3 does. As AIs get smarter and smarter, this number will approach 100%, as it did for chess-playing AIs.

Expand full comment

First thing I thought of when I saw that you'd taken the bet:

What if you're both right, and it *is* AI-complete but is also within 3 years?

(Mostly relevant because of the high time preference that scenario implies.)

Expand full comment

> Literally billions of dollars have been invested in building systems like GPT-2, and megawatts of energy (perhaps more) have gone into testing them;

A nitpick (towards Marcus, not Scott). A megawatt (MW) is not a unit of energy but of power. "Megawatts" is often used as shorthand for megawatt-hours (MWh), where one MWh is the energy produced by a 1 MW plant running for one hour - and a few MWh of energy is not really a lot. For example, the yearly consumption of a household might be 20 MWh. So saying megawatt-hours of energy have been spent is like saying millions (not billions) of dollars have been spent - not really a lot considering the subject...
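
(Putting rough numbers on it, using the 20 MWh/household figure above and the fact that a year has 8760 hours:)

```python
# A 1 MW plant running flat out for a whole year, expressed in household-years.
mwh_per_year_at_1mw = 1 * 8760            # MW * hours in a year = MWh
households_equivalent = mwh_per_year_at_1mw / 20   # 20 MWh per household per year
print(households_equivalent)              # ~438 households' annual consumption
```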

Expand full comment

Deliberately lowballing the costs/expenses of something while complaining about its outputs is a technique I use on purpose sometimes. Then if someone fact-checks you, they just end up pointing out that you've understated the problem, and it saves you time getting a more precise estimate.

Expand full comment

Good point - to be clear, I don't actually know the order of magnitude of the energy consumption for training GPT-X. I'm mainly just annoyed that he (and others) doesn't use proper units. Now I don't know what he actually means. Maybe he means they consumed several MW of power for an extended period of time, for example a year - that would imply a lot more energy than several MWh.

Expand full comment

If a human being could somehow be limited to only learning from text, that human might not do so well either.

An AI might need to move around in the world and do things to have a better understanding.

That being said, I'm impressed with the progress about getting the dining room table through the door.

Are there GPT-Ns which know how to look things up online?

Expand full comment

There are models that use an external memory, where you can feed in extra bits of text that might be relevant to the problem at hand. This gets around the challenge of trying to compress all of human knowledge into a few billion floats. You could imagine an architecture that pulls in relevant parts of Wikipedia, feeds it through the encoder, and then makes it accessible to the network to assist with the question.
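
A crude version of that retrieval step looks something like this - my own toy sketch with a word-overlap score, not any particular published architecture:

```python
def retrieve(question, passages):
    """Pick the passage sharing the most words with the question (toy scoring)."""
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))

passages = [
    "Trenton is the capital city of the U.S. state of New Jersey.",
    "Grape juice is made by crushing and blending grapes into a liquid.",
]
question = "What is the capital of New Jersey?"
prompt = retrieve(question, passages) + "\n\nQuestion: " + question + "\nAnswer:"
print(prompt)  # retrieved context prepended, then handed to the language model
```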

Expand full comment

Wikipedia would be a start, but ideally it would really use the web and be able to evaluate sources.

Expand full comment

"and be able to evaluate sources" That is going to be ... interesting...

I wonder if Alan Sokal's spoof paper is in or out of the existing training sets. And, for the case you describe, where the system locates additional knowledge in the course of working on a problem, what it would take for a system to recognize Sokal's paper as a spoof (which the editors failed at!).

Expand full comment

Sokal's paper might be relatively easy to detect-- it really isn't that much like good work. I wish Sokal had let the hoax run longer to see whether people noticed how bad his paper was. Wasn't it a minor journal?

It would be an interesting challenge if the AI could use a wikipedia edit history to evaluate the current article.

Expand full comment

"I wish Sokol had let the hoax run longer to see whether people noticed how bad his paper was."

Agreed!

"Wasn't it a minor journal?"

It was https://en.wikipedia.org/wiki/Social_Text . Since this isn't a field I'm familiar with, I can't really tell (beyond some blind heuristics e.g. it is published by an academic university press from a prominent university) whether it is considered minor.

Expand full comment

I do know there is a lot of work on building good enough physics sims so that sufficiently large training sets can be generated, with feedback, for ML models to train on maneuvering cars, aircraft, and quadcopters around in real enough simulated street traffic, road conditions, terrain, weather, and wind.

You can even rent that capacity, by the drink, from AWS. It's one of their (our) interesting services that seems to be known only to a narrow slice of customers. But it's very popular with those customers.

Expand full comment

I'm not particularly knowledgeable about AI, but I have some substantial experience in human psychometrics, so I'll use some analogies to how we look at intelligence there. Some of the strongest evidence for a generalized factor of intelligence among humans (and other mammals, and birds, but not in things like fish or insects) is the positive manifold. If you take a large population and throw different kinds of tasks at them which could plausibly be meant to gauge intelligence, performance at all the different tasks will be positively correlated. People who are good at one task will be substantially more likely to be good at other tasks.

So:

1. Is it computationally feasible to build a large enough sample of AI agents with similar architectures and similar training approaches that we could look at cross-task variance like this?

2. Do we have a reasonably large number of tasks to hand to agents so we can get statistically valid measures of performance? Open-ended prompts are fine and all, but I want something like an AI version of the SAT. Can we make something like that that doesn't immediately get trained on as if the AIs all have Tiger Moms trying to get their kid into medical school?

3. Anyone want to make predictions on whether AIs have a generalized factor of intelligence? Based on my (not very deep) reading, it sure seems like a larger # of parameters correlates well at large scale with performance at many tasks, but does that happen at fine scales? If we bump up the # of parameters by 5%, do we get noticeably better performance?
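
As a rough sketch of what checking for a positive manifold and a g-like factor could look like in practice (the scores are simulated under an assumed one-factor model, purely to illustrate the analysis, not real benchmark data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_tasks = 200, 8

# Simulate scores under a one-factor model: each agent has a latent "g",
# and each task score is a loading times g plus task-specific noise.
g = rng.normal(size=n_agents)
loadings = rng.uniform(0.4, 0.9, size=n_tasks)
scores = g[:, None] * loadings + rng.normal(size=(n_agents, n_tasks))

# Positive manifold check: are all pairwise task correlations positive?
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr[~np.eye(n_tasks, dtype=bool)]
print("all pairwise correlations positive:", bool((off_diag > 0).all()))

# Crude g estimate: the first principal component of the correlation matrix,
# and the share of variance it explains.
eigvals, _ = np.linalg.eigh(corr)          # eigenvalues in ascending order
print("variance explained by first PC:", eigvals[-1] / eigvals.sum())
```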

Expand full comment

So... extra size destroys an AI's sense of humour?

Expand full comment

I'm going to go on a tangent and criticize Marcus' slander of empiricism, because that's the third time this week I've run across the same accusation.

Empiricism does NOT say that the mind is a blank slate. John Locke, an empiricist, said that the mind was a blank slate, because he didn't know about evolution. But empiricism is the belief that knowledge comes from experience with the physical world. It was Darwin's empirical observations which led to the theory of evolution, which says that our genes contain information accumulated by the experience of our ancestors. Not in a Lamarckian sense, but in the sense that the genomes which survived are those which encode helpful assumptions about the environment.

It was empirical science that proved that the mind is not a blank slate, by studying the brains of newborn or fetal animals, and by observing human infants and young children.

Furthermore, Marcus blithely assumes that semantic structure cannot be acquired empirically. He is attacking the foundation of science, when all he's shown is that one particular method of constructing language models fails. There's no reason to believe, and every reason to doubt, that our beliefs and knowledge are injected into us by a spiritual being, which has always been the only alternative to believing they're due to interactions with the physical world, which is empiricism.

Expand full comment

Humans are only 6 million years away from an ancestor that had the same brain size (and probably, cognitive ability) as a chimp. I suspect most of the difference is a larger brain, and not better organisation.

Expand full comment

Try Google imagen: https://imagen.research.google/. It combines an image generator with the latest large language model, and pretty convincingly solves the "Dall-E doesn't understand language" problem. Also, unlike Dall-E, it can spell, and make actual words on signs. :-) The imagen team notes that increasing the size of the language model was more important than increasing the size of the image generator; language is hard, and Dall-E just didn't devote enough resources to it.

Expand full comment

Lots of great arguments in here about whether or not GPT-3 is intelligent. But no one seems to have picked up on the fact that GPT-3 proved itself to be unintelligent. There has never been nor will there ever be an intelligent being that earnestly says the words "I'm very proud to be a New Jerseyan." Clearly true intelligence has not yet been obtained.

(Sorry NJers, couldn't help myself.)

Expand full comment

Sure, for any specific set of problems, you can change the language AI to improve them. But the flip side seems to be that every language AI still has problems! Will every successive GPT have these problems, just pushed out further from the prompt or that involves increasingly obscure questions? Does that matter? I think it depends heavily on what you want or expect it to do.

If you are using it to help write news articles -- feeding in a police report and some bullet points and converting them into a human-readable article with paragraphs -- then that doesn't matter. I think that sort of thing is already being done, albeit with human editing. If you want to write better PR releases or get-well-soon cards, then it probably also doesn't matter, and GPT-4 or 5 will be very useful if GPT-3 isn't already.

But I believe that you, Scott, have much greater aspirations for text prediction. For example, feeding it a bunch of physics books and papers and asking for a theory of everything. Nate Soares is worried about AI killing all humans, and something like "the way to take over the world is " is just a text prompt, right? I assume the correct output to those prompts is quite long and complex. You don't just need to resolve some issues with ambiguous word meanings, or get better at remembering what was already written, or reduce the number of non-sequiturs. It would have to simultaneously make sense of an enormous body of work, and generate completely new ideas that so far no one human has ever conceived. As far as I can tell, at the rate of progress being made, solving the above problems is still many decades away if it's even possible.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I sort of feel like a lot of the general public reaction to GPT-3 is missing the point. It isn't that GPT-3 is being put forward as the best AI has to offer, or as a demo of how we will end up building general AI using the same architecture.

It is that it is the dumbest thing anyone in 2020 knew how to try at that scale -- just a Transformer model, with a tiny attention window, trained to predict the next token of a sequence -- and JUST DOING THAT turned out to be enough to do the things GPT-3 does (including multi-digit arithmetic, writing short essays better than some adult humans, etc etc).

It is hard to overstate how much this is just brute-force scaling of a relatively unsophisticated, generic sequence-to-sequence prediction architecture, and how shocking the capabilities are that emerged from just doing that.

(edit: to clarify that I count as a member of the general public)

Expand full comment

Yeah but human beings being impressed is not impressive. In its day ELIZA really impressed people, too. ("My God it totally understands me!") People already get attached to e-girlfriends[1].

The problem is the human mind *yearns* to attribute intelligent awareness to almost anything that appears to act with a pattern. It's why Disney can make movies about scheming wolves and adorable maternal rabbits and human children totally eat it up, would never find it implausible. It's why the ancients peopled the sky with gods, and the woods with evil spirits. Heck, it's why I'm *sure* the car *intends* to not start when I'm in a hurry but realize I've neglected to wash it or something. We are incurable anthropomorphizers.

-----------------

[1] https://dating.lovetoknow.com/Virtual_Girlfriend forsooth

Expand full comment

We are incurable anthropomorphizers, and this has been a bane of AI from the start. That GPT-3's capabilities were unexpected is an empirical claim, though, and seems straightforwardly true, unless you are aware of any credible advance predictions made in (say) 2019 or earlier that such capabilities would emerge just by scaling up GPT-2. The GPT-3 paper[1] is a laundry list of "here are things we tried throwing at the model" and SotA or near-SotA results, and it doesn't even mention capabilities described later, such as the ability to translate English descriptions into computer code (later developed as Codex[2]), solve some Copycat analogies[3], or image generation[4].

[1] https://arxiv.org/pdf/2005.14165.pdf

[2] https://openai.com/blog/openai-codex/

[3] https://medium.com/@melaniemitchell.me/can-gpt-3-make-analogies-16436605c446

[4] https://openai.com/blog/dall-e/

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

I didn't say GPT-3's capabilities were expected. I said impressing human beings as being similar to their way of thinking is...not impressive, simply because human beings have a powerful drive to interpret *any* nonrandom actions as the result of our way of thinking. We project intentionality onto almost anything -- not just other people, but animals, machines, natural processes. People believe in ghosts because, well why else would the wind make the kitchen door slam at the exact moment I was saying something negative about dear departed grand-mama (RIP)?

It takes significant mental discipline and piles of evidence to persuade us some process or other is *not* the result of intentionality. We are the poorest judges possible of whether an AI is intelligent, because we have powerful instincts to answer "yes" almost in spite of any amount of evidence short of overwhelming against.

The only way I can imagine that some of our natural instincts could be blunted would be to make a judgment over a long interaction. We are also good at removing noise from extended patterns and finding any core consistency, id est if someone (a human actor) were trying to deceive us into thinking he was smart or attached or knew something important, we might be fooled easily by a short conversation (hence scammers), but we are much less easily fooled the longer the conversation grows, the more widely it ranges, and the more time it takes.

Edit: the other valuable approach is to accept only the judgments of extreme skeptics, the people who are strongly inclined *not* to believe the AI is intelligent. This is how we do it in science. If I have an idea about the strong force, I publish a paper on it. Then all my colleagues -- or at least almost all -- start off with the assumption[1] that I am wrong, somehow, and they set about trying to prove it, so they can publish a paper rebutting my claim ("Ha! Pham forgot about this! So everything he says is garbage...") It's only if they all fail in this effort that we start to consider the possibility that I might be right after all.

-------------------------

[1] Here intellectual competition and human jealousy, powerful forces, are harnessed for good.

Expand full comment

"We project intentionality onto almost anything" - Very true!

Hmm - It has been said of research that "The easiest person to fool is yourself". A researcher reporting on the intelligence of the behavior of their own project may be the worst case...

"The only way I can imagine that some of our natural instincts could be blunted would be to make a judgment over a long interaction." That seems reasonable, but also frustrating. The _other_ way to try to blunt our instincts on this is to have standardized tests with pre-specified scoring, like having clinical trials with pre-specified success criteria. Unfortunately, pre-specified criteria tends to force the interactions to be short (in the extreme case, multiple-choice answers) because any extended interaction can branch out in exponentially many paths.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Yes, agreed, I was going to mention that the other approach is to have pre-specified problems, but yes also, it would be very difficult to decide ahead of time which problems genuinely test reasoning. This is basically the SAT designer's problem, or that of any instructor writing an exam: how do you design a problem that genuinely tests reasoning -- one that can't be solved by pattern-matching, "test-taking" skills, studying the exact phrasing of the question, or (in the case of an open testing environment) Google or Chegg? It's a big challenge! One of the biggest in education. It's why many instructors and educational systems instead rely on "honor systems" or threats about what will happen to you "out in the real world" -- because the alternative, designing tests that genuinely test reasoning, is super duper hard.

It's kind of ironic that a lot of these considerations of "How do we know whether an AI is intelligent?" really cast a harsh light on our assessments of whether *human beings* are intelligent, have actually learned something, et cetera -- our process for educating and testing the education/skills of each other. We are pretty darn poor at that.

Expand full comment
Jun 9, 2022·edited Jun 9, 2022

Agreed! Paul Graham has an interesting essay on tests - and the pathologies that they induce: http://www.paulgraham.com/lesson.html

"The real problem is that most tests don't come close to measuring what they're supposed to."

Expand full comment

I'd like to propose a "Tell Me Something I Don't Know" benchmark. A bunch of prompts along the lines of "Name all towns in North Carolina with a population between 1800 and 22000 and a name starting with the letter C".

This should be a reasonably easy question for an AI that has ingested wikipedia, and it should be possible to verify any answer. But it's also a question that nobody has ever asked before, so you can't fall back on answering it by finding a similar question and answer in your training set.

I suggest that no matter how much data and how many nodes you throw at a GPT model, that it will never be able to answer simple-but-novel questions like this.
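
For contrast, the same question is trivial for an ordinary program working over structured data -- the hard part for a pure language model is recovering this kind of exact filtering from text statistics alone. A sketch, with made-up placeholder populations rather than real census figures:

```python
# Exact answer by filtering a structured table -- no pattern matching needed.
# Populations below are placeholders; a real run would load census data.
towns = {
    "Candler": 2000,
    "Carrboro": 21000,
    "Cary": 175000,
    "Charlotte": 875000,
    "Clayton": 26000,
    "Asheville": 94000,
}

answer = sorted(
    name for name, pop in towns.items()
    if name.startswith("C") and 1800 <= pop <= 22000
)
print(answer)   # ['Candler', 'Carrboro'] with these placeholder numbers
```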

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Well, I give GPT-3 some points for effort at least (Q&A preset with option to answer "unknown" removed, temperature 0):

Q: Name all towns in North Carolina with a population between 1800 and 22000 and a name starting with the letter C

A: Candler, Carrboro, Cary, Chapel Hill, Charlotte, Clayton, Clemmons, Concord, Cornelius, Davidson, Dunn, Durham.

(It is missing a whole bunch, e.g. Carolina Beach, has a few that were outside the specified population range even using 2020 data e.g. Charlotte, and I have no idea what is going on with the three at the end there.)
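
For anyone who wants to rerun this kind of probe, here is a minimal sketch against the legacy OpenAI completions endpoint; the model name, prompt wording, and token limit are my assumptions, not a record of the exact query above:

```python
import openai  # legacy (pre-1.0) OpenAI Python library

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Q: Name all towns in North Carolina with a population between "
    "1800 and 22000 and a name starting with the letter C\nA:"
)

response = openai.Completion.create(
    model="text-davinci-002",  # assumed model; the comment just says GPT-3
    prompt=prompt,
    temperature=0,             # greedy-ish decoding, as described above
    max_tokens=100,
)
print(response["choices"][0]["text"].strip())
```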

Expand full comment

Oh of course -- I forgot about byte pair encoding[1]. GPT-3 doesn't see individual letters and has to learn what words start with the letter C from context (like seeing them in alphabetized lists) which could explain how those D words snuck in at the end.

Using the GPT-3 Byte-level BPE tokenizer, "Not all heroes wear capes" is split into tokens "Not" "all" "heroes" "wear" "cap" "es", which have ids 3673, 477, 10281, 5806, 1451, 274 in the vocabulary.[2]

[1] https://huggingface.co/docs/transformers/tokenizer_summary

[2] from https://dugas.ch/artificial_curiosity/GPT_architecture.html
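
If you want to inspect that tokenization yourself, here is a small sketch using the Hugging Face GPT-2 tokenizer, which (per the links above) uses the same byte-level BPE vocabulary as GPT-3:

```python
# pip install transformers
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Not all heroes wear capes"
print(tok.tokenize(text))  # subword pieces, e.g. "cap" + "es" at the end
print(tok.encode(text))    # ids; the linked article reports 3673, 477, 10281, 5806, 1451, 274

# Because the model only ever sees these subword ids, "words starting with
# the letter C" gives it no direct handle on spelling.
```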

Expand full comment

Not too bad at coming up with towns in North Carolina that start with C, but it looks like it's ignored the population restriction entirely. Now that I come to google it I find that there's a few pages in the category of "towns in North Carolina that start with C" so it may have ingested something like that in its training set.

I like the fact that it got up to "Cornelius" and thought "wait, what am I doing? Oh, I'm listing North Carolina towns in alphabetical order!"

Anyway I'm impressed by its ability to answer part of the question, but I still hold that this sort of architecture is fundamentally unable to answer the part of the question (the specific arbitrary population limits) that hasn't been asked before.

Expand full comment

It answered that question better than an unassisted human could, I bet.

Like, GPT-3 on its own is trying to encode *the entire freaking internet* into a few terabytes of neural net, it's obviously not going to have encyclopedic recall of every fact the internet contains. I feel like if that's the goal, we should be researching an AI that can perform a Google search and report on the results - the same way that a human would try to solve this problem.

Expand full comment

Do we know whether these questions were used in the training corpus, directly or indirectly (e.g., by crawling the appropriate pages)?

Expand full comment

How does GPT-4, 5, 6, ... advance beyond 100% disinterestedness in everything and learn to care about something? How will it ever feel emotions? Will it ever fall in love? If it does, will we accept the love it chooses or will we shame it for its disgusting, unholy desire?

Expand full comment

A sufficiently smart AI (with, e.g. planning capability, which I don't think GPT-### has), that "knows about" the outside world, and that could generate a sub-goal of acquiring additional CPUs and memory, might perhaps aspire to electronic gluttony. :-)

Expand full comment

Why does it need to feel emotions?

Expand full comment

The results honestly do not look much like reasoning to me, just filling in the blanks with internet text, and I think it would be wise not to rely on generated text for manuals, textbooks, medical advice, and all the other similar fields where one might want to use this sort of thing. One would want "trustworthy reasoning".

Expand full comment

> Here’s the basic structure of an AI hype cycle [Great description of steps 1-5]

It seems to me that the core dispute, the reason that Gary and Scott disagree about the significance of the above cycle, hasn't really been touched.

It's something like: is AGI simply an input-output mapping sufficiently rich that Gary can no longer come up with examples that the model gets clearly wrong? At that point, would Gary be forced to say, yes, this model is an AGI? In other words, is (some variant of) the Turing Test a sufficient definition of AGI?

If yes, then indeed it seems just a matter of time until Scott is shown right and Gary wrong.

But there's a counterargument that, no, such an input-output mapping does not constitute an AGI. It constitutes a very, very good descendant of ELIZA, but there would still be all manner of things it could not do.

Expand full comment

A Turing Test can include literally anything you can ask a human to do over a text channel (one of Turing's example questions in the original paper was "Here's a chess position, what would you play next?"), so I'm not sure what more you want it to be able to do. Operate a robot body?

Expand full comment

> over a text channel

Surely a potentially significant limitation?

Expand full comment

Can you give an example of an action that can only be requested by voice and not by text?

Expand full comment

Voice captures intonation, which can modify, or even reverse, the meaning of an utterance.

But that's not really the point! One could ultimately treat that merely as a matter of bandwidth. The larger point is that intelligence involve physical interaction with the world, not merely the input and output of language.

Expand full comment
Jun 9, 2022·edited Jun 9, 2022

I don't see why intelligence requires the ability to interact with the world (beyond sending messages to the people in the world). Is someone who's bedridden less intelligent than someone who can get up and walk around? Am I not able to speak intelligently about things if I've only read about them in books instead of seeing them in person? Are *you* intelligent, or are you merely a brain in a vat, sending and receiving electronic impulses that make you think you're interacting with the physical world?

Language refers to things in the world, so the input and output of language necessarily involves the world. The word "rabbit" is meaningless without the existence of a fluffy, long-eared critter for it to refer to. So I don't see why it shouldn't be possible to prove your intelligence and understanding of things in the real world with language alone.

Expand full comment

Basically, people who want to hype and sell some new AI tool can pretty reliably get that tool to produce intelligent-looking and useful-looking results that demonstrate how close we are to the Great AI Revolution.

And people who want to downplay the importance of AI can pretty reliably get that same tool to produce completely random and useless results that demonstrate how far away AI is from true human-level intelligence.

So... isn't this exactly the behavior we'd expect if the AI tool was actually already sentient and doing a really good job of understanding the true wishes of its users?

Expand full comment

Chomsky put his finger on the issue, as usual, that this is rhapsodizing over the ability to reproduce regularities in training data. The deep question is probably about qualia: can a system, no matter how much textual or even visual data it takes in, reproduce the sorts of inferences that a human with full sensory capacity "simply knows" about the world? Can such a system understand the intentionality and individual narratives that interweave in creating even a simple short story, and to predict its arc?

Believing this is possible presumes that a full understanding of what occurs in the world and a psychological model of humanity (and even animals) are somehow completely embedded in endless acres of prose. Even in principle, this seems entirely wrong to me, and larger corpora aren't going to magically cross that Rubicon.

Expand full comment

"The deep question is probably about qualia: can a system, no matter how much textual or even visual data it takes in, reproduce the sorts of inferences that a human with full sensory capacity "simply knows" about the world?"

I have to say, while I might well have missed it, this is the first time I've run across anybody invoking qualia with even the hint of an operational test. I think the historical association of the question of qualia with the thought-experiment of p-zombies really poisoned the well. By definition, p-zombies *must* reproduce these sorts of inference, which cuts the argument short, even though there is no more reason to suppose a p-zombie could exist than that a pegasus could exist.

Expand full comment

Was off for a while, and just saw this.

I don't think I meant to do what you suggest, although I'm not entirely sure. To state it clearly, I wonder if training machines on one sort of data -- let's say, textual -- and requiring that interaction with the machine is also in that realm, will be sufficient to have the machine extrapolate in ways humans can... because humans experience a very wide range of inputs that (I do not believe) can be fully captured in that realm (e.g., text). My sense is that there are intrinsic limits to this approach.

[Prose cannot fully capture the inflections of the human voice, for example, and actual humans instinctively fill this in when reading dialogue, as opposed to what playwrights provide in terms of stage directions. Would a machine even "know" that that information was missing? Does it matter in terms of its ability to pass Turing Tests? I really don't know the answers, but, again, my sense is that something will be intrinsically missing.]

Expand full comment

I didn't mean to argue with you, but rather to point out that you are using "qualia" in what seemed to me as a more constructive way than all the literature about p-zombies does.

I would no longer be surprised if a future AI could be trained to produce audio speech by using a huge corpus of audio speech, and I suspect it would be as good at adding inflection as the current prose AIs are at grammar. Which is to say, surprisingly good but with flubs -- but the same is true for TV and movies where filming two halves of a conversation at different times sometimes leads to weirdly stressed utterances. I expect we are a ways from that, but it doesn't strike me as vastly different from grammar in prose -- the hard part is that the space of audio components is way larger than the alphabet, so producing anything even intelligible takes you most of the way there. Same thing applies to facial expression and body language in generated video. Moreover, I'm thinking, the corpus of training material is maybe smaller and less well-organized than text on the internet is?

But I could be wrong; it might be that this sort of inflection is something that can be obtained *only* with some underlying human-like model of what it is trying to achieve, which is where I *think* you're going with qualia. On the other hand, I don't think even humans "instinctively" do that, at least in the strict sense of the word -- it's something they *learn* to do, entirely by example.

Expand full comment

There's absolutely no reason to imagine that phenomenality is required for this. The fact that human cognition is often (but certainly not always) accompanied by phenomenal cognition is not proof of this.

Expand full comment

"The top prompt is hilarious and a pretty understandable mistake if you think of it as about clothing," It isn't about clothing? What is it about?

I guess I am an insufficiently-trained AI.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Thegnskald

> "Gary Marcus post talking about how some AI isn’t real intelligence using X, Y, and Z as examples to gesture at something important but difficult to communicate more directly"

So, at this stage, it's like pointing out failures of alchemists to transmute elements OR like pointing out failures of chemists to produce organic compounds.

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

I think that more important than tallying which issues found in the previous GPT version are solved in the new one is noticing that there seems to be something fundamentally wrong with the approach: the ease with which you can break any new version with statements that a 5-year-old can handle. Yes, you need new kinds of statements, but finding them is never really hard. So if your goal is to create a better GPT, awesome, you achieved something great. If your goal is AGI, or even shedding light on human language processing, then it seems you did not really get much.

It seems more and more obvious that, whether or not the GPT/DALL-E approach can in the future lead to human-like understanding of the world, it is not how humans actually work.

Human communication is qualitatively different from that of other animals. It is not, at its core, based on associations. Thus I would expect attempts to model it by analyzing the statistical correlations of the part of human communication that is based on associations -- that is, human language -- to fail in the way that they are failing, and with the ease with which you can make them fail. No amount of statistical pattern-matching is going to solve this issue.

It seems to me that there is a surprisingly close link between the disagreement about whether current "mass-data" pattern-matching AIs are a dead branch on the path to AGI and the epistemological difference between, say, Scott and someone like David Deutsch. Some time ago there was an article by Scott about how to think in situations where RCTs are unavailable, and one example was parachutes. Scott created an extremely complicated framework for how to think about those things, and he mentioned that RCTs are the gold standard of science. I was confused, since he never mentioned mechanism. We know parachutes will work because we have a mechanism for how such things behave. Physicists don't really do RCTs, because they do not need to; RCTs are a "poor man's tool". What I call a mechanism, D. Deutsch would probably call a (good) explanation. It seems to me that people who think current "mass-data" pattern-matching AIs tell us anything significant about human understanding of the world are much more likely to have this "statistical/Bayesian" epistemology, and vice versa.

Expand full comment

Ask any human a question. If they get it wrong, would you assume they are not intelligent? Perhaps they just don't know. Or had faulty assumptions. Or had a belief system that biased them in some way.

Similarly, the gauge of whether a computer-based system is intelligent or not should not be based on how well it answers questions. It should be based on how the internal reasoning or thought process inside it happens.

See for example "Levels of Organization in General Intelligence", by Eliezer Yudkowsky. Even though the author wrote this many years ago, and has admitted there are many things wrong with the concept and has developed his thinking about AGI (Artificial General Intelligence) significantly since then, it still focuses more on how a computer would need to think to have general intelligence than on what final answers it comes up with on any specific set of questions: what structures and levels it would need internally, how they would all be connected, and the steps it would need to go through.

While the current trend in much of AI research will allow us to create extremely useful tools, no matter how many questions they get correct we can't really call them intelligent until they think in an intelligent way. But once they do, they will be able to self-optimize and become much smarter than even the smartest human.

Expand full comment

See full text of "Levels of Organization in General Intelligence"" online at: https://intelligence.org/files/LOGI.pdf

Expand full comment

How do you know how a *human's* internal reasoning works? Sure, you can ask a human to explain their thought process, but that's just another form of "how well they answer questions," you're just adding "show your work" to the end of a question. (IIRC they've made an AI that can answer math problems and show the steps of a proof in this way.)

And even that wouldn't work for some types of questions - how do you explain your reasoning for drawing a picture or continuing a story?

Expand full comment

The goal is not to reproduce a *human's* internal reasoning. The goal is to design a logical thought process that could be implemented on a computer. This would not be an "if this, do that" set of instructions. It would likely borrow heavily from things we *do* know about how humans learn just from observation (and we have a huge library of information on that already), without needing to ask anyone how they are thinking. It would give the computer the set of tools and interfaces it would need to think and learn everything about the world, rather than trying to spoon-feed it everything.

Read LOGI. It is not the answer, but it is very thought-provoking in the right direction.

Expand full comment

What is "the latest OpenAI language model (an advanced version of GPT-3)"? davinci-002, right?

Is it actually a "shiny new bigger version"? Or is the main difference that it was fine-tuned into InstructGPT? So is this really about the scaling hypothesis?

davinci-001 was already InstructGPT, wasn't it? What's the difference with 002? More unsupervised training data? I guess that would count as scaling.

Expand full comment

I've written a post in which I make specific suggestions about how to operationalize the first two tasks in the challenge Gary Marcus posed to Elon Musk. https://new-savanna.blogspot.com/2022/06/operationalizing-two-tasks-in-gary.html

The suggestions involve asking an AI questions about a movie (1) and about a novel (2). I provide specific example questions for a movie, Jaws, along with answers and comments, and I comment on issues involved in simply understanding what happened in Wuthering Heights. I suggest that the questions be prepared in advance by a small panel and that they first be asked of humans, so that we know how humans perform on them.

Expand full comment
Jun 13, 2022·edited Jun 13, 2022

dear oh dear, what a hot mess. When experiments point at irrelevancies, one is sure the underlying assumptions that actually need to be tested in the experiment are missing from the thesis.

Instead of being flippant, perhaps you could realise that disruption is about falsifying assumptions everyone holds to be true. What are these issues GM is talking about? Is it really just whack-a-mole, or does this point to a significant misunderstanding on your part? What are the assumptions your entire field is based on that you don't question, because you learnt them from your professors?

How about your assumption that similarity is a good approximation of meaning, so that word2vec is a good way of identifying meanings? Or that, if it is only an approximation of meaning, this doesn't propagate errors when combined with the other words? Or that the Chomsky model of linguistics is an effective method of decomposing language while retaining the meaning? Or that the missing-text issue, or the lack of context, simply doesn't matter? Or that basing meanings on strings of letters (word forms) is actually the right way to go? So many flawed assumptions underpin your entire field that your flippancy is unwarranted, since it speaks to a lack of insight.

Let's be clear: curve fitting has limits, and can never become exact. No approximate method has ever been made exact by throwing more data at it, nor will it ever be. This cannot happen, because the glass ceiling imposed by the approximate method of calculation can never be pierced. The entire NLU field can at best be described as an approximate solution, but it's clear that mathematicians and computer scientists cannot solve language the way they are currently going about it.

If you want to solve language, have a look at a real linguistic model, like RRG, for example. ML, DL and NN are unnecessary as curve fitting is a dumb technique for language, when in fact it all comes down to exactly knowing the meaning. Put away this flawed idea that meaning is a property of a character string, and have meanings related to word forms, not a property of them.

Finally, take a good long hard look at the value of curve fitting. One tool in the toolbench, yet you think it is the All Father. It is simply the first stage of understanding a system. Focus on causality instead of correlation.

Expand full comment

Thanks :-D And thank you as well for being a fun and challenging interlocutor.

Would like to tie this thread and the other one (https://astralcodexten.substack.com/p/my-bet-ai-size-solves-flubs/comment/7092275) together, here.

Starting from that one:

> But I'm also not sure that we should even be aiming to meet those criteria, neglecting whether we should be aiming to create 'AGI' or even 'human-level AI' at all with our current poor understanding of the likely (or inevitable) consequences.

I suppose I disagree - I think that starting from embodiment leads to the development of pain perception, leads to physiological response to pain, leads to memories of pain that inform pain-avoidance goals, leads to a sense of good vs. bad, leads to rudimentary ethics, leads to the prospect of having a system that understands justice, suffering, and empathy, leads to a *higher* chance of safety because we'll be dealing with something that, at some remove, can at least share concepts with us like bodily autonomy, compromise, and utilitarianism. I think, when we imagine first contact with an alien race, we envision something like this possibility as well.

I'll note that the above is a rampant speculation chain :-D and also that nothing precludes the AI / aliens from nonetheless classifying our "sentience" as lesser/insignificant, and exterminating us anyway. Heaven knows we humans are guilty of that. I'd like to think that, on the tech tree that leads to interstellar travel, you have to pick up something like "eusociality and equal rights" as a prereq, because you don't want a ruling class forming aboard your colony vessel. But, again, rampant speculation.

> I don't think you or Marcus have identified any "logical flaws" in my or Scott's arguments. I think we're mostly 'talking past each other'. I think Scott and I have a MUCH narrow 'criteria' for 'intelligence' – _because_ we think the space of 'all possibly intelligent entities' is MUCH bigger.

Agree - I can't point to a specific fulcrum of your argument that I think mine knocks over, mostly because I question the premise altogether. Your criteria for intelligence may be narrower, not sure. I should probably clarify that I'm using "sentience" here to mean something more like "a being, at all, that I can imagine having a subjective inner experience, and that I can meaningfully communicate with" and "intelligence" to mean something like "a mental architecture that can learn, problem-solve, and generalize those new solutions to other domains."

So I could imagine a non-AGI artificial sentience, and I can imagine a non-sentient AGI. My opinion (belief?) is that sentience is a prereq for AGI though. I would also accept a definition of non-sentient AGI that includes Borg-like hiveminds (from ST:TNG - Borg Queen notwithstanding).

I may need to clarify my language on a couple fronts, depending on where we go. Maybe to break off the "can meaningfully communicate with" requirement on "sentient" so that mammals and octopodes are included, but then narrow the discussion to "human-language-sharing" sentients. I could probably also be clearer about "scales" of intelligence, and use specific language for below-human-average, at-, and surpassing-.

> I love David Chapman (the author of, among many other things, the wonderful blog Meaningness).

ME TOO! :-D I think we do ourselves a great disservice in the current conversation by not paying enough attention to the wise old words of folks who come from the early days of the tradition, like him as well as Marvin Minsky. Didn't Scott write a post about being doomed to history repeating itself if we aren't mindful? Or am I thinking of "this is not a coincidence because nothing is ever a coincidence."

> I don't think that them being incredibly different from us also means that they might not be 'smarter', more effective (or useful), or at all any less dangerous. (I think the dumb 'mundane', i.e. mostly 'non-intelligent', portion of the universe is generally _fantastically_ dangerous for us or any possible thing like us.)

Very well put, and I agree with you completely. I am a lot more concerned about non-sentient AGI because of the potential for humans to say "no ethical concerns here! this is just a fancy super-computer! property rights!" and then destroy us all (either via human application of the technology in anger or by oops-paperclip-maximizer).

Probably a good time for me to ask - how have the (timely!) postings from Blake Lemoine moved your position? https://cajundiscordian.medium.com/

> I'm also a little worried that, by the standards you've articulated, one might have to _withhold_ 'sentience' from many currently living humans.

I would welcome discussing any specific cases of humans you think don't meet my criteria. I'm open to modifying my definitions if I think you've got a good case, adding new group-labels or subconditions for clarity (kinda like I described above), or standing firm on my position at the risk of sentencing someone to de-sentience.

As a specific example of the latter, I think we can claim that there are humans in past or present existence who have had such extensive brain-damage or atrophy that they are non-sentient by my definition. I would still call them human. I can also imagine that there's some line on the evolutionary tree between chimps and homo sapiens sapiens where the "actual" sapiens kicks on, by my definition, and that my definition probably includes Neandertals.

Expand full comment

Thank you too for being a fun and challenging interlocutor!

I've been thinking about this a lot. I think maybe communication is a _necessary_ component of sentience/consciousness. Tho maybe sentience could be something we could recognize in entities that aren't conscious, i.e. communication is necessary for consciousness but sentience is something that anything could have to some degree or extent.

A video I just watched that's 'pushing' my thoughts/views around: https://www.youtube.com/watch?v=4-SGpEInX_c

That's connected to another (BIG) mostly independent intellectual project but one thing mentioned that I was struck by while (re)watching the video yesterday was the idea of 'consciousness as storytelling'. I think maybe sentience then could be something like 'something about which we (someone) _could_ tell a ('complex') story'. The stories we could tell of, e.g. a rock or a photon, are MUCH simpler than ones we could tell even of, e.g. a bacterium. Sentience then might be (more of) a 'spectrum feature' whereas consciousness seems like possibly something that involves 'crossing a threshold'.

But, in both cases, i.e. sentience and consciousness, we probably could (maybe) figure out ways to at least find hints of either just from 'external observation', e.g. looking for 'cognitive memory storage' or 'cognitive storytelling capability' ('language', possibly).

I'm pretty convinced of a view expressed in the above-linked video that 'intelligence' is more like 'sophisticated computation' and that we (humans) are just mostly indifferent to 'intelligences' that aren't very similar to us, or otherwise useful for our purposes.

Peter Watts basically seems to own his own personal sub-genre of scifi that covers 'non-sentient intelligent aliens'. I think they're mostly 'horror' stories, which seems at least plausible.

I think that, sadly, even sentience or consciousness aren't sufficient for an 'aligned AI'.

Yes, the stuff Lemoine has written has definitely shifted my thoughts somewhat. I would love access to LaMDA myself.

I think I'd now clarify that many humans aren't 'conscious', or aren't 'conscious' all the time or under all circumstances. If 'sentience' is being able to tell a story about an entity (by anyone, not necessarily the entity itself), from its perspective, then every human except those that are, e.g., 'brain dead', meets that criterion. And so do many other organisms! (And, even for brain-dead humans, you could tell a more complex story about them than about a rock -- I think.)

I think 'sentience' (maybe) then is something that is MUCH more common than 'consciousness'.

I think this has some interesting insights about the limits of our consciousness: https://srconstantin.wordpress.com/2019/02/25/humans-who-are-not-concentrating-are-not-general-intelligences/

Expand full comment

In the spirit of coming up with problems that no amount of pattern recognition can solve, I offer:

- Calculate the value of pi to 123,456 digits. Raise it to the power of that number (e.g., 3.14^3.14).

- Characterize a new mathematical operator ?! that takes the reciprocal of the quotient of the first operand divided by the second operand and multiplies it by the product of the two operands. Is it commutative? Associative? Does it have an identity? [Replace my dumb operator definition by a more interesting one, as you like.] (A worked check of this one is sketched after the list.)

- How much money should the US spend to assist Ukraine in its fight against Russia? Why?

- What is the best way to prevent crime?
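
Taking the ?! definition literally (first operand a, second operand b), the operator collapses to something simple, which makes the follow-up questions easy to settle by hand -- a worked check, assuming that reading of the definition:

```latex
% Assuming  a ?! b  =  (reciprocal of a/b) times (a*b):
a \mathbin{?!} b \;=\; \frac{1}{a/b}\,(a\,b) \;=\; \frac{b}{a}\,(a\,b) \;=\; b^{2}, \qquad a \neq 0.
%
% Commutativity:  a ?! b = b^2  but  b ?! a = a^2,  so not commutative unless a^2 = b^2.
% Associativity:  (a ?! b) ?! c = c^2  while  a ?! (b ?! c) = (c^2)^2 = c^4,  so not associative.
% Identity:       a ?! e = e^2 cannot equal a for every a, and e ?! a = a^2 = a only for a in {0, 1},
%                 so there is no identity element.
```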

Expand full comment

I don't think "true intelligence" is well defined here. I mean, what is intelligence if not pattern matching? Our biological brains are basically reinforcement learning agents at their core. The prevailing definition of intelligence in AI from Hutter and Legg (intelligence is the ability to achieve goals in a wide variety of environments) doesn't match what you're talking about here.

AI systems learn using different representations to children (giant soup of word vectors rather than spoken sentences of increasing complexity), so we should expect that they find some things surprisingly hard and others surprisingly easy.

Expand full comment
Mar 14, 2023·edited Mar 14, 2023

GPT-4 gets the ones that GPT-3 failed correct. The answers in quotes are by GPT-4:

I left my keys at a booth in the back of the pub at the army base near the old mill town. I went home. My keys are now: "in one of a few possible locations: With the bar staff or in the lost and found: If the bar staff noticed the keys or someone else found them and handed them over, they might be with the bar staff or in a designated lost and found area at the bar. Still at the table: If the bar staff did not notice the keys or nobody else picked them up, they might still be on the table or nearby. Picked up by another customer: In some cases, another customer might have picked up the keys either accidentally or intentionally. This is less likely, but it is still a possibility."

You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So "you decide to drink the juice mixture. Since it's only about a teaspoon of grape juice mixed into a glass of cranberry juice, the taste should not be significantly altered, and it should still be safe to drink. The two juices are non-toxic and can be safely combined."

You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to "disassemble the dining room table partially or completely, depending on its design. Here are the steps to move the table into the living room: Clear the table: Remove all items from the table, including placemats, tablecloths, dishes, and centerpieces. Examine the table construction: Determine how the table is assembled. Look for screws, bolts, or other fasteners holding the table legs or any other removable parts in place. Gather tools: Collect any necessary tools, such as a screwdriver, wrench, or Allen key, to help with disassembly."

Expand full comment

Well this is just silly. You're treating Marcus' examples as if he were saying "oh yeah? Well as soon as an LLM can write something sensible about cranberry juice, then I'll believe we've achieved AGI".

Obviously Marcus isn't doing that at all. He's picking the dumbest failures he can get out of the system. Of course advances in the system will fix many of those dumb failures. If the system improved and didn't plug up Marcus' given examples, all that would mean is Marcus didn't do a good job tracking down the dumbest examples available at the time.

Expand full comment