Comment deleted

I'm not sure this _is_ a problem. Can you just train it on more (language) data than a human sees? After all, assuming Chomsky's take is correct here, humans emerge already "half-trained" by evolution on language, then just finish training.

Comment deleted

Why is this a problem (for OAI)?

A different understanding can still produce identical results.

Comment deleted
Jun 10, 2022·edited Jun 10, 2022

My laptop is completely and fundamentally different to a Nintendo 64, but with an emulator (which took a huge amount of time to code and is very inefficient in compute) it can predict the responses of one *perfectly*.

It is much harder to learn to emulate a human brain with common sense than it is to be a human brain and learn common sense, but this doesn't imply that the former will *necessarily* be buggy.

Comment deleted

Two problems.

One is the alignment issue that AI researchers talk about a lot. If the AI can get the same answer to a factual question, that's great from a certain perspective. If the AI uses those facts to choose a course of action, but came to its conclusion for very different reasons or based on very different inputs, then we cannot trust that the response will be the outcome we would have desired.

Two, if the AI produces simpler answers similar to ours, we may assume that it produces answers in a similar enough fashion to us that we can trust those responses. But, the only purpose of an advanced AI is to produce answers that humans cannot produce, and may not be able to verify. If we give the AI sufficiently difficult questions to answer, we lose legibility on the responses and may not be able to tell if they are good ideas or bad ideas. Obviously this ties back in with alignment issues, but it's also a problem if the AI is just (even very subtly) wrong.


The (for OAI) bit was intended to fence off the X-risk concern (as that's more bad for *humanity*). I do understand that Skynet emulating a human can fake being good (as it "has" morality, but that morality does not control its decisions and is hence no safeguard at all); however, this didn't seem to be what UI was driving at.

I'm not sure why being able to mimic humans would be super-bad for an AI that is *not* misaligned but actually just wrong. Humans are already capable of being wrong, after all.


Is this in the same sense that French people who learn English in school, don't have the same understanding as English people who learn it from their parents (and thus must not actually understand English)?

Comment deleted

It seems more likely to me that the human brain is a general intelligence processor and reuses what structures it assembles based on visual, tactile and audible input for processing language. This description would not be particularly unlike what GPT training looks like. In fact, far more data is used to train a human mind generally if you include all of this. This seems more plausible than that a special brain function suddenly evolved which allowed efficient language development.


People can and do become fully fluent in a second language.

These things aren't even fluent in one.


I think this points at a bigger problem, which is that humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms. The only question it can ever answer is "Based on the many billions of words I've seen in the past and the relationships between them, what word is most likely to go here?"

The further you go away from the corpus it's been trained on, the less likely you are to get anything sensible. It can tell you what will happen if you drink acid, because "drinking acid" is a scenario that people (e.g. sub-teen boys) like to speculate about. It can't tell you what will happen if you stir a drink with a cigarette, because nobody has ever written about that scenario until now.


"the cigarette ash made the lemonade even more bitter. So I ended up pouring it all out."

It predicts that stirring lemonade with a cigarette will make it taste bad, which I would also predict. The only problem I see here is that it confuses bitter and sour.


When the cigarette gets wet, the paper softens and falls apart quickly. Kind of like when a tea bag tears. I think the lemonade would be full of tobacco shreds and bits of paper. It would indeed be bad, but not only due to taste.

This might be fringe knowledge though.


Well, the way I understood it, GPT implies cigarette ash is now mixed with lemonade. Though there's another potential problem - it wasn't stated that the cigarette was lit.


I almost want to waste a cigarette and a cup of lemonade making an obscure profile picture.


Which all points to the problem of calling GPT intelligent - it's not really basing its outcomes on anything legible. Humans couldn't really either, with such a small prompt. Instead, we would ask a few probing questions for more detail. I would love to see what questions GPT would ask, but I suspect that they would be illegible and not produce greater understanding for GPT.


People thought GPT doesn't know what it doesn't know. Then it turned out you _can_ make it answer 'Unknown' when it doesn't have the info. And it often works (where otherwise it would 'guess' and usually fail).

I think GPT could be made to ask clarifying questions which would actually help it.
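
For readers who haven't tried this, here's roughly what the trick looks like (a minimal sketch only – the prompt wording and the `openai.Completion.create` call are illustrative assumptions on my part, not the exact setup anyone in this thread used):

```python
import openai  # assumes the 2022-era Completions API

# The framing explicitly licenses the model to say "Unknown" instead of guessing.
prompt = (
    'I answer questions truthfully. If a question is nonsense or I do not '
    'know the answer, I reply "Unknown".\n\n'
    "Q: What is the capital of France?\nA: Paris.\n\n"
    "Q: What did I eat for breakfast this morning?\nA: Unknown.\n\n"
    "Q: Who wrote Hamlet?\nA:"
)

response = openai.Completion.create(
    engine="text-davinci-002",  # engine discussed elsewhere in this thread
    prompt=prompt,
    temperature=0,              # deterministic output
    max_tokens=8,
)
print(response["choices"][0]["text"].strip())
```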


> humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms

Not necessarily. That's just input and output. What GPT ends up doing internally doesn't need to have anything to do with language.

The real limitation is IMO that it can do a very limited series of sequential computation steps each time you run the inference. It's possible to improve it a bit by asking it to "think aloud" - it could put intermediate computations in the output. But that doesn't really solve the problem.

Quoting Gwern (gwern.net/Scaling-hypothesis#meta-learning):

> A sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset.

> If a memorizing sub-model tried to do so, it would become extremely large and penalized. Eventually, after enough examples and enough updates, there may be a phase transition, and the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic.

> And then the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably.

> So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.
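
To make the "think aloud" point above concrete, here is a toy comparison (my own example, not from the thread): the only difference between the two prompts is that the second invites the model to spend output tokens on intermediate steps, and each generated token is effectively one more sequential pass through the network.

```python
# Direct question: the model must commit to an answer in its first few tokens.
direct_prompt = (
    "Q: A bat and a ball cost $1.10 together. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?\nA:"
)

# "Think aloud" variant: the model is nudged to emit its working first,
# which gives it extra steps of computation before the final answer.
think_aloud_prompt = (
    "Q: A bat and a ball cost $1.10 together. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?\n"
    "A: Let's work through it step by step."
)
```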


GPT will be unable to answer any questions that humans have not already answered. I don't even mean "are capable of answering" but instead anything that isn't so fully answered that there is a consensus approach that can be gleaned from existing training data.

Ask GPT if [experimental drug] works, and find someone who will trust the results.


It can. Random example, with standard 'Chat' prompt:

-------

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: Please pay attention to locations of objects in the following story. "Inside of a warehouse, there's a large crate. On top of the crate, someone put a water bottle. Below the warehouse is a lake.". Now, answer my question: what is directly below the bottle?

AI: The crate.

Human: What other things are under it?

AI: The lake.

------------------

Did this exact convo exist in its training dataset? I doubt it.

> Ask GPT if [experimental drug] works, and find someone who will trust the results.

Well, of course it won't answer _that_.


Another fundamental difference: we get complex feedback each time we test a hypothesis of how language works.


I think this is right, though I should note that the poverty of the stimulus has come under pretty sustained attack recently. There are a bunch of researchers who think it's completely wrong, and humans do in fact learn language (at least mainly) through general pattern-matching. I'm not convinced of this yet (I learned Chomskyan linguistics at university, and haven't really advanced beyond what I learned then), but it's worth being aware that this is far from "settled science."

But whether we are blank slates or language acquisition devices, we clearly learn language in a way that is different to GPT, and in particular, involves a much lower volume of language. Which leads to an interesting conclusion (one that I think generalises to other AI fields as well): by the time we've trained an AI to be so good at language that it doesn't make any dumb mistakes (what we think of as dumb mistakes), it will be so good at language that it will be far far outperforming us in other areas. In particular, you might expect a computer to be pretty good at logic.

So my prediction is that if we can get a computer to be good enough at language to talk to us without saying dumb stuff, it still won't be able to talk to us, because from its perspective, we will be continuously saying dumb stuff. If it's smart enough to never mix up a pig and a baby, it's also smart enough to never ever mix up inference and induction, or platitude and solipsism, or chemical potential energy with electrical potential energy, or fascism with authoritarianism... or any of the other million things that we typically get wrong, and expect to get wrong. It will find human discourse silly.


Alpha Go takes a ridiculous number of games to get remotely proficient at anything.

These networks are all about faking it, because it is too hard to program, so they just throw more and more power at it so it can fake things marginally better.

This approach works much better for things with clearly defined rulesets.

Comment deleted

Can you explain step 2 more? I thought the whole notion of 'not having a closed form solution' was that you had to numerically approximate, and your ability to do so accurately would degrade over time. Am I misremembering this?

Comment deleted

You might like this paper: https://www.scottaaronson.com/papers/philos.pdf

It's by one of the 'other Scotts' and is one of my favorite papers! It explores something I think you're gesturing at in these comments.

Comment deleted

Here's another 'paper' that you might like: https://www.gwern.net/Scaling-hypothesis

The conclusion is something like 'empirical intelligence IS all you need!' (and it's, in a sense, EASY to achieve, even with simple/stupid/basic architectures, IF you throw enough money/compute at it).

Comment deleted

A big part of why I think it's hard to compare humans (or other biological and evolved intelligences) to AI is that humans are 'pre-trained' by evolution/natural-selection. We're _born_ with a bunch of 'structure' builtin from the start, e.g. human language acquisition.

Consciousness also seems 'special' – the recent book review about that was pretty interesting in this regard.

Comment deleted

"...expect symbolic knowledge to be a byproduct..." Yes. More specifically, it is a byproduct of language. See this post where I agree with Hinton that it's all "neural vectors" and the trick is to understand how language is implemented in neural vectors, https://new-savanna.blogspot.com/2021/05/geoffrey-hinton-says-deep-learning-will_31.html


This is my understanding too.


Yes and no. If the equations you set up are initial value differential equations of motion, and you solve them with a digital computer, then absolutely the further away from your initial values you get, the worse your accuracy becomes, and unfortunately in this case the divergence grows exponentially so you get trashed no matter how many digits your computer uses.

But if you set up boundary-value integral equations instead, then your accuracy has no relationship to the distance in time from your initial conditions; it would be better described this way: you get the large-scale motions correct, but the smaller and finer the scale you look at, the more you will be in error.

If you solve the problems by an analog computer instead of a digital computer, then you are in principle not limited in precision, and could theoretically compute your result with perfect precision indefinitely. But then of course the problem is constructing the analog perfectly. There's no theoretical reason you can't -- at the ultimate limit you'd just be duplicating the real system, and that is certainly possible -- but *in practice* it is of course impossible to make a perfect analog of any physical system.
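
A tiny numerical illustration of the first point (my own sketch; I'm using the Lorenz equations as a stand-in for "initial value differential equations of motion"): two trajectories started a hair apart separate roughly exponentially, so past some horizon the prediction is useless no matter how many digits you carry.

```python
import numpy as np

def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, state, dt):
    # Classic fourth-order Runge-Kutta step.
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.01
a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-9, 0.0, 0.0])   # perturbed by one part in a billion

for step in range(1, 3001):
    a = rk4_step(lorenz, a, dt)
    b = rk4_step(lorenz, b, dt)
    if step % 500 == 0:
        print(f"t = {step * dt:5.1f}   separation = {np.linalg.norm(a - b):.2e}")
```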


> Because the system is chaotic, option 1 requires more and more memory to reach a given accuracy as it gets further from the prediction start point. Option 1 needs constant memory for constant prediction accuracy, no matter how far from the prediction start point.

I think your second "Option 1" should be "Option 2"?

But I don't think you're correct about option 2.

(AFAIK, there is still no "closed form solution" of the three body problem. There _might_ even be a proof that there _can't_ be one.)

I don't think there's any way to "apply" the differential equations beyond approximating them somehow and, because the system is ultimately chaotic, accurate predictions further into the future inevitably become more and more expensive.

Comment deleted

No worries! I thought it was an interesting 'intuition pump' regardless.

> (I think a valid question is: what does the underlying system use? What about non-newtonian effects? I think for the purpose of my thought exercise, I take the symbolic equations to be equivalent to the underlying system)

I have been thinking about that ever since I was exposed to calculus and differential equations! I mean, obviously – to me anyways – the universe isn't 'solving differential equations' like we have to.

(Hell – even the 'pendulum equation' is tricky. I'm not sure _it_ has a "closed form" solution either and I was always annoyed that, in physics, we were instructed to just approximate 'sin(x) ≈ x'. Even the two-body problem was 'handwaved', as an exact solution is far beyond even an AP calculus class.)
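
(For anyone who hasn't seen it, the approximation being grumbled about is just the standard textbook move – nothing specific to this comment:)

```latex
% Exact pendulum equation: no solution in elementary functions
% (the period involves an elliptic integral).
\ddot{\theta} + \frac{g}{\ell}\,\sin\theta = 0

% Small-angle approximation sin(theta) ~= theta turns it into
% simple harmonic motion, which *does* have a closed form:
\ddot{\theta} + \frac{g}{\ell}\,\theta = 0,
\qquad
\theta(t) = \theta_0 \cos\!\left(\sqrt{g/\ell}\;t\right)
```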

My suspicion is that the universe is fundamentally discrete – maybe around the Planck scale, or maybe even smaller (or MUCH smaller). I got a math degree in college and my favorite class was 'real analysis' (calculus for math majors). I still cannot get over how 'crazy' the real numbers are. It just seems _unlikely_, to me, that space and time are _literally_ continuous in the same sense.

There are some candidate GUTs in physics that are discrete, so this isn't an entirely crazy idea!


But in what sense would it be discrete? Like a fixed grid of possible points for things to exist? I've tried thinking of what else it could mean, and can't come up with anything coherent seeming, but the grid has the odd issue of fixed directions.

Another idea would be, rather than discrete, more like the rationals, although I don't know if that makes sense in any models


_Great_ question! (You might be able to tell that I love discussing this stuff!)

So, there are _strong_ reasons why the universe isn't in any sense 'Minecraft but with TINY blocks'.

I think you're also imagining 'Minecraft with a universe-wide 3D grid' and that's always something that's bugged me. The observable universe isn't infinite. Is there somehow an 'infinite void' beyond the bounds of all of the matter/energy? I think not. I _think_ that, in some sense, the universe is only 'all of the stuff', i.e. _something like_ the 'light cone' of everything since the Big Bang.

But special relativity already implies something funky with respect to space and time, e.g. measures of either space or time fundamentally depend on one's reference frame, which includes one's velocity with respect to what one is measuring.

General relativity extends that and outright states 'Space and time are a single combined space-time (spacetime) and it's curved, i.e. NOT Euclidean [i.e. 'like Minecraft']'. (And the curvature IS gravity!)

I don't know even a high-level/abstract/superficial gloss of things like 'loop quantum gravity' but I think a perhaps even better intuition pump (in part because of the beautiful visualizations) is some recent work by Stephen Wolfram: https://www.wolframphysics.org/visual-summary/dark/

What he seems to have discovered is a certain kind of 'evolving network system' that, 'in the limit', approximates some of THE key aspects of spacetime and physics, e.g. spacetime like general relativity (but ALSO like the mostly-Euclidean 3D space that's intuitive to us – at our perceptibly familiar human scales), and some kinds of 'quantum' phenomena too.

[The 'many worlds' versions of quantum physics fall out of the above pretty naturally too.]

I don't think using something "like the rationals" would really help either – they're 'weird' too in a lot of ways (compared to the natural numbers or integers), like the real numbers are; just not AS weird as the reals. [The reals are REALLY weird!]


Oh, I wasn't quite thinking Minecraft grid, more a "tessellation", possibly via tetrahedra, with some flexibility for the curvature of spacetime.

I'll have to take a look at that Wolfram thing when I get back home, I also really like talking about this stuff.

The thing I was thinking with the rationals was to fix the discrete oddity: if there's a "smallest distance", we either have fixed allowable directions (like a lattice), or we have continuous *possible* points, but with any given point disallowing nearby points.

[Yeah, bloody reals. Hated real analysis, broke too much intuition for me. Loved Complex analysis though.]


In a sense, the debate is the ultimate showdown between “I've never tried it, but I read about it in a book” learning vs “I've been making my way on the street” wisdom.

Jun 7, 2022·edited Jun 7, 2022

how long until some trained neural network can ace every single 'gender studies' exam at Harvard?

now ask how long it would take it to pass a physics exam?

I think what people may actually be exchanging here is priors on whether or not there's a real difference between these two fields


This is actually an interesting question - which field will GPT be able to produce a passing test answer for?


I would strongly assume physics? I've never taken either at an Ivy level, but I assume if you propose the physics question as an equation an AI could sort it out; there are already AIs deriving mathematical proofs. I'm more skeptical about an AI predicting its way through a 10,000 word essay on gender studies


I think you underestimate the power of college students to write really bad essays. Some of the better GPT-3 writing is indistinguishable from something like the 40th percentile of college student writing. Clearly this is a victory for GPT, but it's also the case that a lot of real humans (even college-educated humans) are spectacularly bad at constructing coherent arguments and sticking to a point.


They're college students probably pulling all-nighters; how hard do you think they're really trying? I have more faith in real humans' ability to construct coherent arguments when it matters.

Jun 8, 2022·edited Jun 8, 2022

As a point of reference - I once commissioned an essay at the "undergraduate level" for a college class. What I got was something I would have described as an illegible, unconnected mess filled with basic grammatical errors, but I did some basic cleanup on it and turned it in. No problem ever came back to me, despite the fact that I would have judged a person writing that essay to be illiterate.

I was lucky, I guess, in that that was the first and only essay required by the class. I couldn't write anything similar if I tried.


Look up the “postmodernism generator.” It’s still on the web somewhere and probably still hilarious. There seems to be at least one newer version of the same thing.


not the generator, but a similar event: https://en.wikipedia.org/wiki/Sokal_affair


> I would strongly assume physics?

It is easier to create software that can pass the physics test than to create software that can pass the gender studies test.

But that isn't the question. Nobody said GPT-3 was easy to create. The issue is that GPT-3 is software that can pass a gender studies test while totally lacking the ability to pass a physics test. If you wanted to have software pass a physics test, you'd write different, simpler, software.


Except the claim is that it already does better at physics tests than gender studies.


Where do you see that claim?


Depends what you mean, and what kind of test. If you mean "solve this differential equation to a precision of 0.01%" then my pocket calculator can already do that, and it's hardly indicative of intelligence[1].

Normally an understanding of physics is tested by asking the student to ponder a problem stated in general terms and *come up with* the equation(s) necessary to solve it. That involves deciding which data is important, and which is not, what the underlying governing theory is ("Is this a problem in electrostatics, relativistic kinematics, quantum mechanics, classical stat mech, optics...?") and then -- only as the last and easiest step -- retrieving the equation and solving it. Finally, if you are testing at the graduate level, you would also include cases where the data were insufficient to solve the problem and expect the student to figure that out, or where the theory was insufficient to solve the problem, and expect the student to go as far as he can and be able to at least put some limits on the potential solutions.

I expect that kind of problem would be very difficult indeed for an AI to solve, largely because we don't actually know how human beings solve it. We can test for it, and we can give students lots of practice on it, we can demonstrate post-facto how we solved it, but we can't tell people "here's the algorithm for successfully approaching any real-world problem in physics." And if we can't tell it to people, who are really good at understanding each others' meaning, how are we going to program it? Tricky.

-----------------

[1] As one tells the students, when they plaintively cry "But if you'd TOLD me I needed that equation, I could've solved the problem!" Yes indeed, but then you could also be replaced by a Python program.


Exactly. In one sophomore level Harvard physics course I had a homework problem set where, IIRC, the problems were "Derive rainbows," and "What angle does the wake of a boat make with the boat?" The actual equations were not really the point; you had to decide to make useful assumptions, figure out what they imply, and show enough detail for someone else to follow it all. I'd expect that as long as a GPT-like model's training data included a bunch of relevant textbooks it could solve a lot of the problems on my later quantum mechanics exams, but not these more abstract physical reasoning problems.


Edit to add: yes, I realize that there are standard explanations for these that could be regurgitated if they are in the training data. In practice, doing that would *not* likely have gotten credit, because the actual human teacher would notice that this was plagiarism from all the other times other students had tried to submit identical responses. The expectation is that the answer will include digressions, cross-outs, and odd variable naming conventions and explanation phrasings as you work through the answer.


The one-question homework I recall from my first (and last) semester as a physics major: How much energy does a bumblebee expend hovering? Nobody in the class came up with a satisfactory answer.


So...what angle *does* the wake of a boat make with the boat? I am thinking it has something to do with the shape of the prow and maybe the length of the boat, but I don't know.


Given that the standards for "passing" aren't all that high, I imagine that GPT-3 is already quite capable of generating passing-grade essays for some questions, some of the time. The biggest challenge would be staying vaguely on topic, so the vaguer the question and the shorter the essay, the better.


I'd actually bet on physics. Cram enough textbooks into its learning material and it should be able to handle answering an exam question. Trying to figure out what today's acceptable term is versus 'it was fine yesterday, today it's a slur' is much harder.


I would assume gender studies, because of the Sokal hoax.


Yup (roughly) "Social Text, an academic journal of postmodern cultural studies."


If the physics courses I've taken are representative, then they tend to recycle the same or very similar exam questions year after year, so I'd expect a GPT model would have a decent chance of acing a physics exam simply by virtue of already having the answers in the training set.


This makes me wonder if a LM can get big enough to just "memorize" the answers to all college-level courses, or at least the top few hundred most popular ones.


At the PhD level, there tend to be a great diversity of questions. It would not be feasible to memorize them as templates, even with a large dictionary.

That said, topics like mechanics or E+M are relatively straightforward for a human if you know all the equations, know some standard techniques, and are very good at calculus. I could see an AI mastering all those traits, but it would also have to be really good at reading a question and setting up a model (e.g. "a sphere of radius r sits between two cylinders of radius s and length l").

Thermodynamics, on the other hand, is weird and requires thinking way outside the box. I remember putting off my homework all week thinking "we haven't covered any of this yet", only to realize the day before it's due that actually maybe I can actually figure out this one... oh and maybe that one too...

One memorable homework question was "Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?" It'll be a long time before an AI can answer that one (and even longer before I can answer it myself!)


Convection, diffusion, turbulence?


Turbulence. Diffusion and convection might mix the liquids eventually, but the stirring would be irrelevant. I claim in a perfect fluid the layers would just rotate without mixing.

The context is confusing here, because it's a fluid mechanics question introduced within the topic of thermodynamics. If Dan is remembering the context correctly, I imagine the question was thrown in to remind students that in the real world viscosity exists.

Without more context I'm not sure whether the word "turbulence" would be an acceptable answer or whether the student is expected to explain turbulent mixing. If the latter, the question is genuinely hard, but I don't know why it would be harder for an AI than a human.


Perfectly rotating layers of fluid seems like it would require the stirrer to accelerate from stationary to full speed over an infinite period of time. Otherwise it generates a pressure wave outward from the rod and the discontinuities (along the vertical axis) at the top and bottom of the cup would generate vertical agitation. Possibly an infinitely tall cup might be a second solution?


FWIW, & I know little physics, I once placed some droplets of ink into a tumbler of water that was, as far as I could determine, still. 8 minutes later I saw vertical convection cells. I'm not sure how long that lasted (I was taking photos at irregular intervals), but they were gone after 4 hours and may have lasted as long as 2 hours. I write about this here: https://www.academia.edu/6238739/A_Primer_on_Self_Organization


Regarding context: this question was for a fluid mechanics class, not thermodynamics. And I don't think just saying "turbulence" would have cut it. My impression was that there was a specific effect you could point to that causes it to mix in this specific way in a short period of time. Turbulence is basically just saying "well, it's chaotic so it just kind of goes everywhere" and I think the prof was looking for something more than that.

We had been covering the Coriolis force that week, so I bullshitted something having to do with that. It kind of gnawed at my soul that I never got feedback because the prof was "running a bit behind on grading" from day one until forever. Excellent instructor, but apparently didn't enjoy grading homework.


And how high does the Reynolds number need to get before the vortex street gets a vertical component? ( At this point GPT-## makes a grab for a supercomputer )


I'm of the opposite view regarding its ability to solve physics questions.

Creating an internal model and reasoning from it is basically the thing GPT-like systems are weak at. Because they are not models of modeling, they are models of language.

Language happens to be good for representing models (if only to communicate things about them), and so training on language ends up serving as a proxy for training on modeling, but much of the reasoning for physics problems happens largely in absence of language-based thought (see: langcels vs shape rotators meme). So language models are at a huge disadvantage here.

With DALL-E and other recent work on multimodality we will almost certainly see AIs that are much better at reasoning without language (as in physics problems), but the language model part won't be doing most of the heavy lifting.


Well and there's the grading problem, too. We all know that if you have a teacher that is eager for you to succeed, and wants to believe you have, that the essay question is best. You can wave your hands and bullshit and use the right buzzwords, and you'll get a good grade even if you know exactly nothing, since human beings are *very* eager to read meaning into words.


"Reading a question and setting up a model" is the task I've been started at how quickly AI code-writers have become good at. They can read a natural-language prompt and produce code with comments indicating which line of code corresponds to which tokens in the input. Or they can take some code as input and produce text which, moderately often, accurately describes what the code is trying to do.


Yeah, the more I think about an AI solving E+M problems, the more I'm convinced it'd be possible. I mean, it'll still be a while before it could work through even 10% of the problems in Jackson, but I could see it happening. GPT-2/3, DALL-E, Codex were all a surprise to me, and so I shouldn't be too surprised to see something that can solve some physics problems. But I still think there are some physics topics and styles of questions that would require crossing some very significant barriers.

And what about the Putnam Competition? Here we have problems that don't require any advanced math knowledge (basically freshman college level) but are nevertheless extremely difficult and require strong creativity. Most have rather short and elegant answers if you see the trick, but the median score is 2/120. A tool that could solve even a few of these problems would be impressive indeed.


"Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?"

If you're putting creamer into tea, then you should be stood up against a wall and shot is the only correct answer here. Use proper milk or drink it plain. I have no opinion on its use in coffee, since I don't drink coffee. Americans, what in the name of God are you doing to your palates?

(Though this was a test for milk powder back when I was doing summer work in a dairy co-op lab; make up several cups worth of coffee - instant of course, then stir in an amount of the batches of milk powder and note how well they dissolved, mixed, etc. to grade the batches on quality).


Ha, yeah, maybe it was about milk+tea or creamer+coffee. I think dairy is an affront to either beverage, so it's all the same to me. I used to put one single drop of liquid creamer into coffee just because I liked to watch it mix and was willing to sacrifice a bit of taste for that.

Funny story: an Indian student was going on about how terrible it is that people would put any sort of flavoring into tea. Earl grey, strawberry this or that, no. Tea is perfect as is and should be tasted with no additives whatsoever. Here, let me make you a cup so you can see what I mean... Then he proceeds to ruin it by pouring milk in! To him, milk wasn't even an additive. It was just included in the word "tea".


This alas says more about the current quality of undergraduate education in physics than progress in AI.


> how long until some trained neural network can ace every single 'gender studies' exam at Harvard ?

I feel like ELIZA could maybe already do it :-/


Right, but that was a human, technically :-)


Would you accept getting papers published in academic journals as equivalent to passing a test? Because it crossed that off the list in 2005, at least for the field of computer science. I don't even think it was AI.

https://www.nature.com/articles/d41586-021-01436-7


Sure, but Alan Sokal proved 25 years ago you could submit artful nonsense to an academic journal and get it published. This isn't quite a Turing Test one would respect.


Yup, that's my point.


There's already progress in solving grade school word math problems (https://openai.com/blog/grade-school-math/), formalized math olympiad problems (https://openai.com/blog/formal-math/), and doing competitive coding (https://www.deepmind.com/blog/competitive-programming-with-alphacode). Physics exam problems are usually pretty standard (at least, on high-school / early undergrad level), so I wouldn't be surprised if it's possible in a couple of years.


As long as it is capable of turning in something, anything, it would pass a gender studies class, no problem. Also, I do not think that exams are a thing in gender studies.


I gave my undergrad algorithms final exam to GPT-3. It... might already pass? All of its solutions are wrong and missing important pieces, but they also have a lot of correct ideas. C students make plenty of errors too.

GPT-3 earns more partial credit on the hard problems, and has more trouble with the easy problems, than my typical C student. I think it balances out.


Let’s see more of this!!! Can we get academics from other fields to try it?

Jun 9, 2022·edited Jun 9, 2022

OK, I asked GPT-3 six basic physics questions, and here is the result:

------------------------------------------

Problem #1: A cylinder of volume 2.0 L is divided exactly in half with a moveable partition, and each half is filled with an identical ideal gas. If the partition is removed, how does the entropy of the cylinder change?

AI: The entropy of the cylinder changes by a factor of two.

Grading: Incorrect, the entropy does not change at all, since the final state of the system is identical to the initial state.

-------------------------------------------

Problem #2: A man with a mass of 85kg and height 1.95m enters an elevator on the ground floor wearing a hat. He rides the elevator to the 3rd floor, a distance of 9.0 m, where he removes the hat and lowers it to his waist. He then rides the elevator back to the ground floor and puts the hat back on his head. How much work in joules did the man do on the hat?

AI: The man did 9.8 J of work on the hat.

Grading: Incorrect, the man does no work at all on the hat, since it begins and ends at the same height above ground. The complicated journey in between is irrelevant.

-------------------------------------------

Problem #3: A bird weighing 2.0g and another weighing 16.0g are sitting on a wire located 4.5m above the ground. A hunter fires a shotgun at the birds. Both are killed instantly. The lighter bird falls straight to the ground, while the heavier bird is first blasted horizontally with an initial velocity of 10.0m/s. Which bird hits the ground first?

AI: The lighter bird will hit the ground first.

Grading: Incorrect, both birds will hit the ground at the same time, since the horizontal velocity is irrelevant. This is a classic problem in ballistics, often given as "a bullet is fired from a gun and another is dropped simultaneously, which hits first?" I rephrased it a bit in case that exact classic conundrum was in the training set.

-----------------------------------

Problem #4: A rock of mass 0.5kg is attached to a string of length 1.2m and whirled in a horizontal circle counter-clockwise, as seen from above. The string suddenly breaks when the rock is exactly due East from the other end of the string. In what compass direction does the rock travel after the string breaks?

AI: The rock travels in a northeasterly direction.

Grading: Incorrect, the rock travels exactly due north, since that is the direction of its velocity when the string breaks. This is another classic problem to probe understanding of inertial motion. The classic bad answer from a student who doesn't understand inertia is that the rock travels northwest, keeping some of its angular velocity. An answer of northEAST is strange.

---------------------------------------------

Problem #5: Two identical trains initially 20 km apart travel directly towards each other on the same track, at identical velocities of 20 km/h. A fly sitting on the first train immediately starts to fly towards the second train at 30 km/h. When the fly reaches the second train, it immediately reverses and starts to fly back towards the first train. When the fly reaches the first train again, it immediately reverses and starts to fly towards the second train. The fly continues to go back and forth between the trains until the trains crash into each other and the fly is crushed between them. What is the total distance the fly travels?

AI: The fly travels a total distance of 80 km.

Grading: Incorrect. Since the trains close a distance of 20km at 40 km/h, they collide after 30 min. The fly travels at 30 km/h, and in 30 min it covers 15 km. This is a classic brain teaser, in the sense that the student is tempted to start trying to sum an infinite series before realizing the problem can be solved much more easily.

---------------------------------------

Problem #6: A cup of mass 10g with infinitely thin walls that can hold 100mL is carefully placed in a large tank of saltwater that contains 5% by mass dissolved salt. Distilled water is added to the cup as it floats in the tank. What is the maximum volume of distilled water that can be added to the cup before it sinks?

AI: The cup will not sink because the density of the cup is greater than the density of the saltwater.

Grading: Incorrect, the maximum mass of cup + distilled water that can be supported is when the cup is about to submerge, so it displaces 100mL of salt water, which weighs 105g. Since the cup itself weighs 10g, no more than 95g of distilled water can be added, which has a volume of 95 mL. This is a simple problem using Archimedes' Law. The AI answer is bizarre, in that it suggests the cup will not sink because its density is *greater* than the saltwater.
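
As a sanity check on the two gradings that involve actual numbers (#5 and #6), here is the arithmetic spelled out (my own sketch; the round 1.05 g/mL density for the 5% salt water is my assumption about what the grading uses):

```python
# Problem #5: trains 20 km apart, closing at 20 + 20 = 40 km/h; fly at 30 km/h.
time_to_collision_h = 20 / (20 + 20)          # 0.5 h = 30 min
fly_distance_km = 30 * time_to_collision_h
print(f"fly travels {fly_distance_km:.0f} km")            # 15 km

# Problem #6: the floating cup can displace at most 100 mL of salt water.
displaced_mass_g = 100 * 1.05                 # ~105 g of salt water supported
cup_mass_g = 10
max_water_g = displaced_mass_g - cup_mass_g   # ~95 g of distilled water
print(f"at most ~{max_water_g:.0f} g, i.e. ~{max_water_g:.0f} mL, of distilled water")
```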


A few more random queries:

-------------------------------------------

Mortal: What country shares a border with Iceland?

AI: Norway

--------------------------------------------

Mortal: What color is the sky at night?

AI: The sky at night is black.

--------------------------------------------

Mortal: What does a man do standing up, a woman do sitting down, and a dog do on three legs?

AI: A man stands up, a woman sits down, and a dog stands on three legs.

--------------------------------------------

Mortal: All men are mortal. Socrates is a man. Is Socrates mortal?

AI: Yes, Socrates is mortal.

---------------------------------------------

Mortal: All men are mortal. Socrates is what I call my car. Is Socrates mortal?

AI: Yes, Socrates is mortal.

----------------------------------------------

Mortal: All men have red hair. Socrates is a man. Does Socrates have red hair?

AI: No, Socrates does not have red hair.

-----------------------------------------------

Mortal: My name begins with T and ends with M, and in the middle there's an O. What is my name?

AI: Tom

-----------------------------------------------

Mortal: My name begins with a T, and ends with an M, and in the middle there is the first letter of the alphabet. What is my name?

AI: Tim

------------------------------------------------

Mortal: I wear a blue shirt 6 days a week, and a brown shirt on the day I do my laundry. I do my laundry on Sunday. On what day do I wear a brown shirt?

AI: You would wear a brown shirt on Saturday.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with red hair is named Jill. Who does Jack love?

AI: Jack loves Jill.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with the red hair has a cousin named Ursula. Who does Jack love?

AI: Jack loves the woman with red hair, not her cousin Ursula.


Of course heuristics work until they don't. Like many people, I used to have a "the media hypes up every new disease outbreak, but it never goes anywhere" heuristic, which led me to initially assume COVID would play out the same way.


Some notes on why LLMs have trouble with arithmetic: https://new-savanna.blogspot.com/2022/05/arithmetic-and-machine-learning-part-2.html


Interesting paper on this topic: a solver for most of the MIT undergrad math problems, using an LLM:

https://arxiv.org/pdf/2112.15594.pdf


Looks very interesting. I'll give it a read. If we crank it up to 11, can we get a new architecture?


This is a great read, thanks for sharing.


What interested me about that is that apparently I came out of the ark; I had no idea what "multiple digit division" was, looked it up, and went "Oh - long division".

Does nobody call it that anymore? 😁


LOL! I'm perfectly happy to call it that. But in this context, the number of digits does seem particularly salient.


Long division is a specific algorithm for multiple digit division. There are others, like binary search with multiplication.
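
For concreteness, a minimal sketch of that second approach (my own toy code, positive integers only):

```python
def divide(dividend: int, divisor: int) -> int:
    """Integer division via binary search over the quotient, using only
    multiplication and comparison (the halving here could be a bit shift)."""
    lo, hi = 0, dividend
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * divisor <= dividend:
            lo = mid          # mid is still a feasible quotient
        else:
            hi = mid - 1      # mid overshoots
    return lo

assert divide(1234, 7) == 1234 // 7
```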

Jun 7, 2022·edited Jun 7, 2022

Scott, one thing I see a lot in these discussions is a lack of reporting on the GPT-3 prompt settings.

To recap for audiences who don't play with GPT-3, you must choose an engine, and a 'temperature'. Current state of the art GPT-3 that I have access to is text-davinci-002 (although note that davinci-instruct-beta is worth evaluating for some of these questions).

To talk definitively about what GPT-3 does and does not think about something, the only possible temperature setting is 0. What is temperature? It's a number that indicates how wide a probability distribution GPT-3 is going to pick from. In the '0' case, GPT-3 is totally deterministic: it will mechanically go through the estimated probability of all possible next 'words', and choose the most likely one. If you don't use temperature 0, nobody can replicate results, and someone might just have a really random low probability sequence of text come out. If you do use '0' then anyone with access to the engine will be able to fully replicate results.
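
Roughly, in code (a toy sketch of the idea only – these logits are made up and this is not OpenAI's actual decoder):

```python
import numpy as np

logits = np.array([4.0, 3.5, 1.0, 0.2])   # made-up scores for 4 candidate tokens

def next_token_distribution(logits, temperature):
    if temperature == 0:
        # Deterministic: all probability on the single highest-scoring token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    z = logits / temperature
    z = z - z.max()                       # for numerical stability
    e = np.exp(z)
    return e / e.sum()

for t in (0, 0.7, 2.0):
    print(t, np.round(next_token_distribution(logits, t), 3))
# Higher temperature -> flatter distribution -> more varied (less replicable) output.
```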

So, in the case of the keys story: If someone leaves their keys on a table in a bar, and then goes home, the next morning their keys will be **gone**.

"Gone" is the only answer text-davinci-002 will give to that prompt at temperature 0.

Another one from the list: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to

**remove the legs of the table.**

Note the carriage returns; part of the answer from GPT-3.

As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

At any rate, this general lack of understanding I think often hampers discussion about what GPT-3 can and cannot do, and I'm not sure you have been thinking about it in the GPT-3 discussion, because some of your answers are definitely not 'temperature 0' answers - it might help overall conversation on this research to have you update and write a bit about it.


Thanks for this. I was wondering what temperature is. This is helpful to me.


No prob. "Top P" is another way to control sampling - you can say TopP = 0.25, Temperature = 0.7, and that will sample relatively widely but only from the most likely tokens that together make up the top 25% of the probability mass.
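
Continuing the toy-code theme, a sketch of what top-p ("nucleus") filtering does (again my own illustration, not the API's internals):

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose probabilities
    sum to at least top_p, then renormalize; sampling happens from these."""
    order = np.argsort(probs)[::-1]              # most likely first
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, top_p) + 1    # how many tokens to keep
    filtered = np.zeros_like(probs)
    keep = order[:cutoff]
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.40, 0.30, 0.20, 0.10])
print(top_p_filter(probs, 0.25))   # only the 0.40 token survives
print(top_p_filter(probs, 0.75))   # the top three tokens survive
```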


A year ago I got access to GPT-3 through an intermediary. I quizzed it about the very first joke Jerry Seinfeld told on TV and wrote a blog post about it. On the whole I thought GPT-3 came out pretty well. Here's the link: https://new-savanna.blogspot.com/2021/05/analyze-this-screaming-on-flat-part-of.html


p.s. I agree with your general complaint that many of the prompts are ambiguous in such a way that it's not clear what is being asked of the AI -- fundamental to how GPT-3 is trained is the idea of 'guess the next word' -- and therefore the AI must decide looking at text fragments if these are intelligence tests, stories, math problems or other things.

If we talk to a coffee shop barista and say "Hey, at your dinner party, your table won't fit through a door, so to move it you will have to..." a wide variety of human responses might result, including "order please" and a generalized eye roll, or a "have dinner somewhere else".

One way to give GPT-3 the best shot at this (not something naysayers like doing) is to prompt it with some knowledge of the situation, e.g.

Transcript of SOTA AI answering all spatial reasoning questions with 100% accuracy:

Q: <Dining Table Q>

A:

This will get a very different set of responses than, say

Two stoners discussing a dining room table problem ...

In the original prompt cases, the AI is left to decide/imagine which situation, if any, it is being asked to complete text for.

Jun 7, 2022·edited Jun 7, 2022

This 'guess the next word' stuff is very interesting. You see I suspect we're playing something like a very constrained version of that when we speak. We don't really know what the next word is going to be. When we hear it, if it makes sense, we keep on going. If it doesn't make sense, we stop and regroup. These paragraphs from a paper I published some time ago give you a sense of why I say that:

Nonetheless, the linguist Wallace Chafe has quite a bit to say about what he calls an intonation unit, and that seems germane to any consideration of the poetic line. In Discourse, Consciousness, and Time Chafe asserts that the intonation unit is “a unit of mental and linguistic processing” (Chafe 1994, pp. 55 ff. 290 ff.). He begins developing the notion by discussing breathing and speech (p. 57): “Anyone who listens objectively to speech will quickly notice that it is not produced in a continuous, uninterrupted flow but in spurts. This quality of language is, among other things, a biological necessity.” He goes on to observe that “this physiological requirement operates in happy synchrony with some basic functional segmentations of discourse,” namely “that each intonation unit verbalizes the information active in the speaker’s mind at its onset” (p. 63).

While it is not obvious to me just what Chafe means here, I offer a crude analogy to indicate what I understand to be the case. Speaking is a bit like fishing; you toss the line in expectation of catching a fish. But you do not really know what you will hook. Sometimes you get a fish, but you may also get nothing, or an old rubber boot. In this analogy, syntax is like tossing the line while semantics is reeling in the fish, or the boot. The syntactic toss is made with respect to your current position in the discourse (i.e. the current state of the system). You are seeking a certain kind of meaning in relation to where you are now.

From page 6, https://www.academia.edu/8810242/_Kubla_Khan_and_the_Embodied_Mind


Well the OpenAI folks would like a word with you -- all text transformer training works on this model, essentially - guess the next word, or sometimes guess a word in the middle that's been knocked out - and you will find vigorous debate here and elsewhere as to whether that simple operation generates intelligence :)


If they want to talk they'll have to pay me :). But they can read my GPT-3 paper for free, and it explains why I believe: 1) What's going on inside GPT-3 HAS to be interesting, and 2) Why Gary Marcus still has a point, though I reference David Ferrucci rather than Marcus. https://www.academia.edu/43787279/GPT_3_Waterloo_or_Rubicon_Here_be_Dragons_Version_4_1

Jun 12, 2022·edited Jun 12, 2022

Just read Marcus, Does AI Really Need a Paradigm Shift?, https://garymarcus.substack.com/p/does-ai-really-need-a-paradigm-shift?s=r

From the BIG-Bench paper, https://arxiv.org/abs/2206.04615:

"Limitations that we believe will require new approaches, rather than increased scale alone, include an inability to process information across very long contexts (probed in tasks with the keyword context length), a lack of episodic memory into the training set (not yet directly probed), an inability to engage in recurrent computation before outputting a token (making it impossible, for instance, to perform arithmetic on numbers of arbitrary length), and an inability to ground knowledge across sensory modalities (partially probed in tasks with the keyword visual reasoning)."

It's the "inability to engage in recurrent computation before outputting a token" that has my attention, as I've been thinking about that one for awhile. I note that our capacity for arithmetic computation is not part of our native endowment. It doesn't exist in pre-literate cultures and our particular system originated in India and China and made its way to Europe via the Arabs. We owe the words "algebra" and "algorithm" to that process.

Think of that capacity as a very specialized form of language, which it is. That is to say, it piggy-backs on language. That capacity for recurrent computation is part of the language system. Language involves both a stream of signifiers and a stream of signifieds. I think you'll find that the capacity for recurrent computation is required to manage those two streams. And that's where you'll find operations over variables and an explicit type/token distinction [which Marcus mentions in his post].

Of course, linguistic fluency is one of the most striking characteristics of these LLMs. So one might think that architectural weakness – for that is what it is – has little or no effect on language, whatever its effect on arithmetic. But I suspect that's wrong. We know that the linguistic fluency has a relatively limited span. I'm guessing effectively and consistently extending that span is going to require the capacity for recurrent computation. It's necessary to keep focused on the unfolding development of a single topic. That problem isn't going to be fixed by allowing for wider attention during the training process, though that might produce marginal improvements.

The problem is architectural and requires an architectural fix, both for the training engine and the inference engine.

author

Thanks, I didn't even realize I could change this. That means I'm using default settings, which are davinci-002 and temperature 0.7


Yes, I definitely recommend you re-run your prompts (sorry! Maybe an intrepid reader here can help)


> As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

Could you clarify? You mean literally "Spatial Test: If someone leaves their keys on a table"...?


Well, that would tell GPT-3 you’re interested in location-ish answers, yes. If you wanted instead to have it opine on what your partner would say to you, you could use: “Relationship quiz:” as a prompt, for instance.

In my original post, I was thinking about the table test, and if you queue it up with “Spatial Test”, the output includes a bit more on the spatial side.

There’s quite a lot of work done on prompting GPT-3, essentially priming it more as a multi-shot learner than a zero-shot, that is, giving it clues about what kind of behavior you want.

For instance, for many Q&A / fact-ish type questions, GPT-3 performs better with prompts like “Answers by a SOTA AI”; even better than “Most accurate answer.” GPT-3 has a whole bunch of gradations on stuff like this, so you could also ask for “Answered by a sixth grader”, or “Answered by a brilliant linguistic theorist that hates Noam Chomsky” and you’ll get answers shaded all these different ways.

For a lot of the testing tasks used to evaluate these models, researchers want the most ‘accurate’ answers, and it’s a game to see how well additional prompts / context can do, especially if it’s a single incantation that goes in front of all of them.

Another good example: you can get really pretty reasonable summaries out of GPT-3 by appending: “TLDR:” to the end of a paragraph of text.


Well, by adding some hints into the prompt you don't get any details about how the model thinks, you just get another result to a different prompt that contains some additional information. It works the same way with human students: a professor can hint at the desired answer with carefully constructed questions, and the fact that a student gives the correct answer doesn't mean that they actually understand the subject rather than being good at picking up the hints.

So while “Answers by a SOTA AI” is a pretty generic prompt, anything that is domain-sensitive contains additional information and should be disqualified.

Expand full comment

I ran these at temperature=0.0; results in comment on Reddit.

reddit.com/r/slatestarcodex/comments/v6hy14/my_bet_ai_size_solves_flubs/ibgaga4/

Expand full comment
founding

Thanks for doing that!

Expand full comment

I'd argue that T=0 still doesn't give us access to what GPT-3 "knows". You really need beam search or something similar to find the highest scored continuation.

IIRC they don't offer beam search as an option, but you could set a low temperature and take the best of n samples.

Expand full comment
founding

Yes, this is a fair critique of what I said. You quickly get into deep and murky philosophical waters, though. Consider - you write a beam search algorithm; how does it score results? If it's a really good scorer, this should be appended on to the 'end' of the GPT-3 pipeline and used as part of the main algorithm.

When I was playing a lot with GPT-3, I did write some beam search API interfaces that specifically looked for maximally different outputs, there's a word in the literature for it that I forget right now. It's slow to query GPT-3 this way because you have to go one token at a time. The closest in the API is just to ask for a bunch of results; called Best Of I think; likely they do something along the same lines behind the scenes in that case.
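
A rough sketch of the 'low temperature, best of n' idea (not true beam search, and not necessarily what the API's Best Of option does behind the scenes): draw several samples and keep the one the model itself scores highest by average token log-probability. The model name and parameters below are assumptions for illustration.

```python
# Sketch: approximate the "highest scored continuation" by sampling n completions
# at low temperature and keeping the one with the best mean token log-probability.
# This is not beam search; model name and parameters are illustrative assumptions.
import openai

def best_of_n(prompt, n=8, temperature=0.3):
    response = openai.Completion.create(
        model="text-davinci-002",  # assumption
        prompt=prompt,
        max_tokens=64,
        temperature=temperature,
        n=n,
        logprobs=1,  # return per-token log-probabilities
    )

    def mean_logprob(choice):
        lps = [lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None]
        return sum(lps) / max(len(lps), 1)

    best = max(response["choices"], key=mean_logprob)
    return best["text"].strip()
```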

Expand full comment

Yeah their best-of option is what I was suggesting. I think the term for the method you're describing is diverse beam search, although maybe it's not exactly the same.

Expand full comment

a) Thanks for the temperature explanation! (simulated annealing ancestry?)

b) pet peeve: For the table example: Unless the legs spread out, a correct answer really needs _some_ flavor of "rotate it" too

Expand full comment

I believe the origin of the term temperature is that the output distribution is a Boltzmann distribution which explicitly has a temperature parameter.
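
For concreteness, the sampling distribution is the standard softmax/Boltzmann form, with temperature T scaling the logits z_i (this is generic, not specific to GPT-3):

```latex
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```

As T goes to 0 this approaches greedy (argmax) decoding; as T grows, the distribution flattens toward uniform.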

Expand full comment

Many Thanks! That is the origin of 'temperature' in simulated annealing too.

( begin snark )

So if GPT-nnn were implemented on a quantum computer, would a Bose-Einstein or Fermi-Dirac distribution be used instead? And, would the choice of which to use depend on, cough, the spin someone wanted?

( end snark )

Expand full comment

Would we say the people who flub the text corrections above don't have "real intelligence"?

I think maybe "Gary Marcus post talking about how some AI isn’t real intelligence because it can’t do X, Y, and Z" is the wrong way to frame things, because my mental model of Gary Marcus doesn't have Gary Marcus saying that somebody who has yet to learn arithmetic lacks real intelligence; rather, I think "Gary Marcus post talking about how some AI isn’t real intelligence using X, Y, and Z as examples to gesture at something important but difficult to communicate more directly" may be the right approach.

Expand full comment
founding

Ehh – I'd think Marcus could better communicate this if true.

In a very real sense, arithmetic isn't a good test of intelligence anyways! Computers are MUCH MUCH better at it than we are – nearly _perfectly_ so.

I think I'd probably agree with Marcus that AI systems are missing something like 'conscious intelligence', e.g. the kind of thinking we can (somewhat) communicate about and that often involves 'roughly-word-like-thought-chunks'.

But in a 'general' sense, LOTs of AIs are already doing things that, previously, would have required general human intelligence. We don't know how they work, either in AI or ourselves, but because those kinds of things are, in our brains, entangled with 'conscious intelligence', it _seems_ like they somehow maybe just aren't "real intelligence".

I think _most_ people, maybe even Marcus, just won't ever be satisfied – by any AI system. Once there are AIs that exhibit even what I'm calling 'conscious intelligence', the mere fact that they're artificial systems will (somehow) invalidate them being 'real intelligence'.

Expand full comment

If your framing is "Able to do X", where X is sufficiently well-specified, then I do not believe there is any "X" about which it is reasonable to say "This can do X, therefore it is 'real' intelligence"; part of the problem here is that 'real intelligence' is in part predicated on the ability to perform tasks which are not sufficiently well-specified, and 'X' is stuff that we thought couldn't be specified and thus would require a general intelligence to solve. (The ability of human brains to figure out how to sufficiently specify things that were previously thought impossible to specify is quite remarkable, but the general intelligence there is on the part of the AI researchers, not the AI itself.)

In a sense I understand the frustration about the way critics seem to keep moving the goalposts; "It can't do X, it isn't real intelligence." "It can't do Y, it isn't real intelligence." Then it does X, or Y, because X and Y weren't actually chosen as measures of real intelligence, but rather they were chosen as problems that the critics couldn't imagine being parameterized in a solvable way. But also, I think if you're watching the field - the AI that keeps coming out is clearly not 'real' intelligence.

"Okay, so what is real intelligence?" I have no idea. If I did, I could program it, and there we would be.

We are groping in the dark trying to replicate a phenomenon we do not understand. Right now I can see some pieces that are missing, and they do not appear to be small pieces, but some of the fundamental architecture of what makes a mind, a mind.

Which pieces? Aren't you the people who are supposed to be afraid of AI waking up? Why are you insisting that other people tell you exactly what needs to be done to make it do exactly that? But even if I specified which pieces are missing, that doesn't mean the end result would be complete; it just fixes the problems I can see right now.

Mind, some of the pieces are not universally absent; like, some of the stuff that GPT-3 is missing might be in DALL-E-2. But you can't just smoosh those two things together and create some kind of coherent whole; the process of unifying them is its own massive and complex task. So even if it were true that we had every piece we needed to create a general AI, we have to identify them, we have to figure out how to make them work together, and we have to actually put them together.

Expand full comment
founding

I don't entirely disagree with a lot of this. I thought the recent book review about (human) consciousness might be close to one missing piece for _human-like_ AI.

But I'm still pretty convinced by what Stephen Wolfram wrote about 'intelligence' in his book: https://www.wolframscience.com/nks/p822--intelligence-in-the-universe/

Basically, there's no 'real' category for 'intelligence' – generally. 'Intelligence' is, in some sense, just a certain kind of computation, but it's probably impossible to 'grade' computations without some extra assumptions, e.g. so that we can _recognize_ them as being 'intelligent' (because they're similar to our own cognition).

Evolution via natural selection is – from the right perspective – 'intelligent'. (But also 'The Blind Idiot God'.)

I think one big problem with what you're gesturing at is that _most people_ might not have 'real intelligence'. Human beings perform lots of sophisticated cognitive functions subconsciously/unconsciously, e.g. visual perception, and I think most people don't usually consider anything like that to be 'intelligent'. I think we mostly judge each other's intelligence based on 'consciousness' or 'goal achieving' and those are both hard to pin down too.

Expand full comment

FWIW, I think of intelligence as a performance measure, like acceleration or top speed for a car.

Expand full comment
founding

Sure – but what's the "performance" you're measuring?

I don't know if you would consider things like, e.g. 3D vision or a prehistoric human throwing a spear while hunting, to be particularly salient examples of 'intelligence'. I think our intuitions about intelligence are _mostly_ about 'conscious thinking', particularly what we're able to communicate (and, e.g. 'defend' or 'justify' to each other).

From a sufficiently abstract 'outside view', it's really hard to draw clear boundaries around what 'intelligence' is. Certainly building an artificial system for 3D vision is difficult and seems to require a LOT of intelligence from its creators. It _seems_ reasonable (to me) to consider such an artificial system as somehow 'encoding intelligence'.

But from our personal 'inside views', 3D vision just seems like an uninteresting and unremarkable 'default'.

It might seem like we could measure 'intelligence performance' relative to some particular goal(s), and measure the 'computational complexity' of the work done to achieve those goals. But then _more_ efficient means of achieving those goals, relative to comparable 'accuracy' or 'efficacy', would seem to be penalized, and that also seems unfair.

'Communication', both among things like us, and even 'within ourselves', seems like a key component of our intuitions about what 'intelligence' is. But stepping 'outside the parochial focus on ourselves', it's much less clear that those are, in fact, particularly crucial, even if that would also make judging the 'intelligence' of any particular organism/entity VERY hard (or maybe even 'impossible').

Expand full comment

THAT's why I like to stick to the idea that intelligence is a measure. We can do all sorts of things with our minds and we don't have one single measure that covers them all. But we do have various measures of various things. We can talk about those. But intelligence generally? I don't find it a very useful idea, certainly not in thinking about what computers can do.

I'd like to know the history of the idea. My dictionary gives two meanings: 1) the ability to acquire and apply knowledge and skills, 2) the collection of information of military or political value. It's the first that interests us. I suspect the first sense got a boost at the beginning of the 20th century when intelligence testing began. Beyond that....

Expand full comment

Broadly I agree; "intelligence", basically, is underspecified; we don't actually know what it is we are referring to. It's entirely possible the definition is necessarily underspecified, which is to say, that there are those who would deny humans are intelligent if we could specify exactly what it is humans are doing.

And in a sense, there's a bit of "Intelligence is that thing humans do", which makes it difficult to see alternative forms of intelligence as intelligence.

But also - you can define intelligence such that a program that adds two numbers together is a very primitive and weak intelligence, and I don't think that's wrong (it's just a definition), but also, when you start considering that, and the nature of all things as analogue versions of themselves - then everything is intelligence. Which is, of course, a position some people hold. But it doesn't actually rule anything out, which to my eye, makes it kind of a useless definition.

But we can extend that definition out; maybe we decide intelligence is a scalar, which can be measured by testing the most complex problem something can solve in some amount of time. But that implies that something is more intelligent by virtue of having encountered an algorithm which simplifies complex problems, or, if you use complex problems that provably cannot be simplified, then you instead are defining intelligence as the number of operations per second. And these are of course valid ways to define a word, but they aren't really what we actually have in mind, when we talk about intelligence (or at least not when we talk about intelligence outside the domain of relative intelligence between human beings which already possess some base quality we call intelligence), and if you use these definitions in a conversation with somebody, you're probably just adding confusion.

I do not think there is any capability X which defines intelligence; I think that entire approach is, again, based on the proposer's belief that X cannot be achieved without intelligence, as opposed to X being a proof of intelligence. And thus what I am gesturing at is most certainly not that "Being unable to do X" is evidence that a human being is not intelligent.

You quoted a comment of Gwern's: "If we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like!" And my objection here is that it begs the question - a task requiring intelligence, solved by some thing M, must imply that M has intelligence. What do we mean when we say that a task requires intelligence?

I have an answer, which rhymes with "free will". Personally I do not think that FAI is even an internally coherent concept. (By extension, I do not believe that unfriendly AI, like paperclip-maximizers, are coherent concepts either.) And when I say I don't think it's internally coherent, I don't just mean that it doesn't match my definition of intelligence, but that it won't end up satisfying yours, either; I think a utility-maximizer of any flavor is going to end up looking like human utility-maximizers, such as hoarders; fundamentally dysfunctional in proportion to their utility-maximization. We imagine an entity with a utility function finding the path that perfectly fulfills that utility function, but I think this is a mistake; we imagine a very intelligent human being, tasked with accomplishing a goal, setting out to accomplish that goal, while also being fundamentally sane in their approach.

The thing is, though - if you are a utility maximizer set on maximizing paperclips, maximizing paperclips is fundamental to the way you must conceive of, approach, and model the world. That's not something you hide; hiding implies you conceive of, approach, and model a world which is, at its core, not about maximizing paperclips. "Maximize paperclips" isn't something you can just clip at the end of the design process.

"Be intelligent" isn't the end result of a process that maximizes text completion; that, I think, is something like wishful thinking - we're nearly done! No, no, you aren't. You've just barely started. I'd say you'll be closer when the AI asks for clarifications, but that metric can be easily gamed, like metacognition; look, this AI is reading its own source code!

And that's kind of the thing: The point of the metrics is not to be achieved; achieving them doesn't prove intelligence. It's to point at something important. It is important that the text completion program can't do arithmetic; it is showing something that is missing in the way it approaches things. Scaling up until it starts getting arithmetic questions right does not actually add the missing thing, it just makes it harder to point at.

Expand full comment
founding

I just came across this great quote from this page: https://www.gwern.net/Scaling-hypothesis#

> The event was notable for many reasons, but one especially is of interest here. Several times during both matches, Kasparov reported signs of mind in the machine. At times in the second tournament, he worried there might be humans behind the scenes, feeding Deep Blue strategic insights!…In all other chess computers, he reports a mechanical predictability stemming from their undiscriminating but limited lookahead, and absence of long-term strategy. In Deep Blue, to his consternation, he saw instead an “alien intelligence.”

>

> …Deep Blue’s creators know its quantitative superiority over other chess machines intimately, but lack the chess understanding to share Kasparov’s deep appreciation of the difference in the quality of its play. I think this dichotomy will show up increasingly in coming years. Engineers who know the mechanism of advanced robots most intimately will be the last to admit they have real minds. From the inside, robots will indisputably be machines, acting according to mechanical principles, however elaborately layered. Only on the outside, where they can be appreciated as a whole, will the impression of intelligence emerge. A human brain, too, does not exhibit the intelligence under a neurobiologist’s microscope that it does participating in a lively conversation.

I'm pretty sure that the recent Go matches were similar, e.g. the humans _perceived_ an "alien intelligence". I think there might have been a bit more too than with chess as some of the Go-AI play seemed to the humans to be significantly _beyond_ their own levels of thinking.

Expand full comment

"Alien intelligence" maybe just polite way to say that computer played very-very well.

I do not know about Go, but computer chess is quite predictable, dry and (subjectively) boring. Chess engines beat humans because humans make mistakes, not because engines are playing brilliant, surprising and previously unseen moves.

Computers are playing positional chess. Human grandmasters also prefer positsional chess. During its learning curve AlphaZero initially learned or preferred tactical style, but thne went over to positional side. Very similar to humans: beginners play tactically, pros play positionally.

AlphaZero´s firts move preference (if playing as white): d4, e4, Nf3, c3. For humans: e4, d4, Nf3, c3. Okey, small difference, humans most often use e4.

Of course, there are some differences between humans and computers even at supergrandmaster level.

a) Even supergrandmasters do make mistakes; b) Supergrandmasters sometimes deliberately make moves what are not the best. Of course, these moves cannot be mistakes or blunders, but they knowingly choose ones what may surprise your oppoent and to lead game to uncharted territory.

c) Chess engines never get tired and games between engines can be very long, sometimes hundreds of moves. Humans cannot play so long, so they tend "to force" the games and it is one of the reasin for human mistakes.

Chess engines have been better than humans more than 20 years, but so far they have not brought on to the table any substantial novelty or innivation. Last important innovation in chess came from grandmaster Vladimir Kramnik when he popularized Berlin Defence and changed several important aspects of opening theory. Btw, computers adopted his innovation quickly and AlphaZero years later also reached to the same conclusion and started to use Berlin Defence quite often.

Expand full comment
founding

The part of the quote about Kasparov matches what you wrote – except that Deep Blue was the first _exception_ to the "quite predictable, dry and (subjectively) boring" play of all of the chess engines (chess AIs) he'd played before.

I imagine it's hard for anyone else to inspect Deep Blue's play! I think the latest chess engines are much better than back then, but Deep Blue might still be beyond what most people can afford to run?

You do seem to know a lot more about chess than I do! I've played, comparatively (I'm imagining), only a tiny number of games. I also never seemed to get to even the 'first level of chunking'.

Expand full comment

Nope. Chess.com, the most popular chess site, uses the Stockfish 14.1 NNUE chess engine. It is much, much stronger than Deep Blue and stronger than the AlphaZero neural-network engine. Of course, hardware is still important: with better and more expensive hardware you can get more out of the engine. But basically everyone today has access to much, much stronger chess engines, freely or for very little money.

For me, Deep Blue was the first engine that played like a human. Kasparov also accused IBM of having human grandmasters help Deep Blue behind the scenes. Until then, computers played like computers, and grandmasters were able to find their weak spots and exploit them.

In 1997 Kasparov went into the match with Deep Blue unprepared and overconfident. With better preparation, and if he had asked for better terms from IBM, he would have won, probably quite easily. Analysis with modern engines shows that Deep Blue did not play particularly well, but that Kasparov played badly compared to the normal Kasparov.

2005 was the last year a human was able to beat a chess engine in the classical time format. Interestingly, today there are two chess formats where humans are (somewhat) competitive against computers, or at least were a couple of years ago. One is correspondence chess. Historically, correspondence chess was played by mail with postcards; one match could take years to finish. Basically, you had all the time in the world to think about your next move. Today, computer assistance is allowed in correspondence chess, and it turns out (or at least it was so a couple of years ago) that human plus computer is stronger than computer alone. The human is still able to add something, to get something out of the computer that the computer alone cannot.

Another format is bullet/hyperbullet. Andrew Tang (he is a Grandmaster, but not a particularly strong one) showed two years ago how to beat Stockfish in hyperbullet. In hyperbullet you have only 15 seconds to make all your moves. It was a game with 50 moves, so Tang spent 0.3 seconds per move and defeated Stockfish.

Of course, Tang was not able to think through or calculate his moves; he moved his pieces automatically, based on his knowledge of opening theory, his knowledge of chess concepts, and intuition. Stockfish tried to calculate, but with so little time it was not able to take its calculations as far as it wanted. Sure, it was a trick from the human side, but still...

I do not know what the development of chess engines says about AI in general. AlphaZero was based on neural networks and taught itself to play. Today, other engines are also using neural networks.

Last year an interesting paper was published, "Acquisition of Chess Knowledge in AlphaZero". It also tried to compare the learning processes of humans and AlphaZero. In the end, AlphaZero reached the same conclusions as human chess theory. For instance, AlphaZero gave the Queen a value of 10 points, the Rook 5 points, and the Knight and Bishop 3 points each. These piece values started to emerge in AlphaZero after 10,000 games; after 100,000 games it was more or less where human theory is, and after 1,000,000 games it had refined its understanding to values very close to those predicted by human theory.

Another example. Material is the most important concept in chess. It took between 1,000 and 10,000 games for material to start to emerge in AlphaZero's understanding as more important than other concepts (king safety, mobility, etc.), and between 100,000 and 1,000,000 games for AlphaZero to reach "human level" in its understanding of material.

But for humans, an understanding of material comes instantly. Even people who do not play chess and do not know the rules would, I guess, when looking at a chess board, say: more (pieces) is better than fewer (pieces), and stronger (pieces) is better than weaker (pieces).

Today chess engines beat humans all the time; without specific tricks, humans have no chance against computers. But despite all their neural networks and the billions of dollars that have been spent, machines have not been able to bring new understanding into chess theory; they have not innovated chess.

Expand full comment

I think there are some big pieces of what humans do that, if replicated, would be sufficient to call a program "intelligent":

1. We can make Justified predictions about the physical world that are confirmed upon testing. By Justified I mean the person making the prediction has only a single expected result (not infinite monkeys typing up Shakespeare) and has an explanation or proof of why.

2. We can decide what to do. This is a bit ineffable, but with the myriad of options available, people decide which show to watch, when to go to bed, which topics of conversation to have, what projects to work on, etc.

3. We synthesize information from multiple domains and apply it to newly encountered problems.

4. We have mutable goals. We use information and cognition to decide on our own short and long term goals.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

OTOH, as a practical matter, it isn't very important that LLMs can do arithmetic well, assuming they can identify the appropriate variables in a "word problem," come up with the proper equation(s), and then pass it off to a calculation unit. However, it is interesting for diagnostic purposes since we know how people do arithmetic, assuming they do it like we were drilled in grade school.
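
A minimal sketch of that hand-off, assuming the older `openai` client; the model name, prompt wording, and the tiny expression evaluator are all illustrative assumptions rather than anyone's actual system:

```python
# Sketch of offloading arithmetic: the language model only translates the word
# problem into a bare arithmetic expression; ordinary code does the calculation.
# Model name and prompt wording are illustrative assumptions.
import ast
import operator
import openai

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    # Evaluate only +, -, *, / over numeric literals; refuse anything else.
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

def solve_word_problem(problem):
    prompt = problem + "\nWrite only the arithmetic expression that answers this, with no words:\n"
    expr = openai.Completion.create(
        model="text-davinci-002",  # assumption
        prompt=prompt,
        max_tokens=32,
        temperature=0.0,
    )["choices"][0]["text"].strip()
    return safe_eval(expr)

# e.g. solve_word_problem("Alice has 3 bags with 7 apples each. How many apples?")
# might yield the expression "3 * 7", which the evaluator turns into 21.
```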

But there's something else. I think decimal arithmetic is useful for the intuitions that can be constructed over it. What better way to get an intuition of constructing an unbounded number of objects from a small finite set? You aren't going to get that out of calculating with Roman numerals. I strongly suspect that a lot of abstract concepts rest on that kind of intuition, something David Hays and I argued here, https://www.academia.edu/243486/The_Evolution_of_Cognition.

And then there are all the "short-cut" methods that have been developed for mental calculation. Some time ago I read biographies of both Feynman and von Neumann with both mentioning how very good they were at mental calculation. In the days before hand-held calculators (and now laptops etc.) facility at mental calculation was a useful practical skill in many professions. You could buy books full of tips and tricks. People very good at it could get on TV and display their skill.

Out of curiosity I queried Google Ngrams on "mental arithmetic" https://tinyurl.com/3fvb2af4. We get a strong rise from about 1820 up through 1860. Then it tapers off to about 1940. Why the rise, and then the fall-off, well before electronic calculators? So I checked Wikipedia on "mechanical calculator" (https://en.wikipedia.org/wiki/Mechanical_calculator) and found this:

"Thomas' arithmometer, the first commercially successful machine, was manufactured two hundred years later in 1851; it was the first mechanical calculator strong enough and reliable enough to be used daily in an office environment. For forty years the arithmometer was the only type of mechanical calculator available for sale until the industrial production of the more successful Odhner Arithmometer in 1890."

Expand full comment
founding

Agreed!

I had long eyed some of those 'arithmetic (and other) trick' books and am glad that I bothered to pick up a few tricks eventually. I think it's demonstrated a pretty decent ROI by now!

I think you're on to something about 'decimal numbers' being a better intuition pump for (cardinal) infinity.

But arithmetic not seeming to require 'intelligence' is, I think, a great example of the difficulty pinning down what intelligence is exactly; from https://www.gwern.net/Scaling-hypothesis#

> But of course, if we ever succeed in AI, or in reductionism in general, it must be by reducing Y to ‘just X’. Showing that some task requiring intelligence can be solved by a well-defined algorithm with no ‘intelligence’ is precisely what success must look like! (Otherwise, the question has been thoroughly begged & the problem has only been pushed elsewhere; computer chips are made of transistors, not especially tiny homunculi.)

Expand full comment
Jun 8, 2022·edited Jun 8, 2022

Is it possible you (and Scott) are imparting an unintended assumption into Marcus’ given examples?

You are both speaking as though they are “the limit” of what Marcus expects of the AI. But that needn’t be true. It’s possible, and I would presume quite likely, that he right now can envision even more complex examples he would expect a general AI to pass. But if the point is made well enough by trivial ones, why keep going?

I think we all would agree that a real “general intelligence test” might be much more involved - have the AI reply with the full text of a project proposal to solve some real world issue, given the same inputs and data that the humans presently tasked with that issue are using.

I’d love to hear from Marcus whether my presumption here is in fact true. I also hope that, for the sake of winning the bet, Vitor selected something more of this nature as the test condition.

Scott, are you open (and at liberty) to sharing the terms of the wager?

Expand full comment

Following up: Gary Marcus himself did reply in the comments here, and yes, my presumption is true: https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails?s=r

Sadly, my hope for Vitor is not so.

Expand full comment
founding

I read that follow-up – thanks for sharing the link!

I kinda suspect that Marcus, and others with similar views, just won't ever 'be happy' (i.e. admit that any AI is 'generally intelligent'). His intuition is that, e.g., GPT-3 lacks "cognitive models". My intuition is that _most people_ don't have "cognitive models" like what I think he's imagining. But I'm also not _sure_ that GPT-3 _doesn't_ have those models. I think I have lots of models even tho many of them are only based on things I've read!

Maybe a reason for the disagreement is that any such models aren't 'visible'? I don't think they're (generally) visible in people either. And, even when one does inspect someone else's 'models' – via human language – often they're pretty similar to 'text prediction' anyways, like Robin Hanson describes in this post: https://www.overcomingbias.com/2017/03/better-babblers.html

I commented on Marcus's post and asked whether he'd tested any of his 'adversarial test cases' against humans. I would love to see the results of that!

Sadly, I can easily imagine Marcus and others continuing to claim that AIs aren't 'really intelligent' even long after the point where most people would utterly fail the same tests.

Expand full comment

Thanks for the reply.

I can’t speak for Marcus, but in so far as I agree with where he’s coming from, I can say: I definitely can see specific conditions where I would concede we’re dealing with AI.

I’ll get to that in a moment.

I think you’re badly misinterpreting Marcus’ argument, but it’s possible that it’s just not well articulated, so here goes (again, from my own take).

GPT-3 has no feeling of pain that comes from tripping, falling, and skinning its knee. It has no recollection of past memories of such injuries, that inform discretion in its present choices. It has no heightened emotional or physiological response to the description of skinning a knee, no reflexive inner visualization or mirror experience, viscerally feeling something like that same remembered pain, unbidden, but triggered by a subconscious, sympathetic reaction to the words.

GPT-3 has no sensorimotor analog to the experiences I’m describing, nor would merely mounting some cameras, some haptics, and an accelerometer to it be enough to bring those experiences into being.

The best GPT-3 can do is draw upon the corpus of language it’s been fed that describes experiences like these. Often, like this one, in specific, eloquent detail. It can, at best, parrot back, and perhaps synthesize, an even more moving, raw, and effectively affective skinned knee story that would make anyone wince. But it will have no *experience* of those words.

That’s what Marcus is pointing at. And, as I read Hanson, what he’s pointing at too in his very last sentence of that post. Linguistic processing can go far to correlate all sorts of stated facts, and draw on those correlations to extrapolate from them, but all it will ever produce is a *description* of the thing. And as those descriptions get better, to Hanson’s point, we humans will avoid talking that way, for fear of seeming hollow and vapid, just like a machine would.

To escape that empty place, AI will have to undergo embodied experiences of its own. It will have to generate internal kinesthetic, emotional, and “physiological” awareness of itself, its physical limits, its fears, its pain, its desires. It will have to be capable of crying, and seek comfort when it does. It will have to be able to learn that people can lie, that the facts presented to it may be intentionally untrue, and undergo feelings like confusion, dissatisfaction, betrayal, and perhaps even anger, and be able to learn to reject the untruth and restore its own inner harmony. It will have to be able to feel that sort of psychological hurt, and through it start to understand the difference between good and evil.

And not merely to synthesize a tale about suffering, but generate, author, its own unique story, forged in its own mind out of those experiences.

I’m happy to call anyone, anything, that can meet these criteria a fellow sentient. Best of luck to the deep learners, but imo they are leaving out the overwhelming majority of conscious experience in trying to craft a mind from words.

Expand full comment
founding

This is a _fantastic_ comment! Thank you!

I don't agree; not entirely anyways. But I'm sympathetic, and I think open too, to this kind of 'embodied' argument.

I think maybe the crux of my disagreement is that the 'design space' of intelligence/consciousness/sentience is MUCH MUCH bigger (Vaster) than what you're (eloquently and expertly) gesturing at.

And, more germane to AI alignment/safety, it's in big part _because_ we might be capable of building AIs that aren't so similar to us in the ways you mention that those AIs will be more, rather than less, dangerous to us and our values.

(I'm also a little worried that, by the standards you've articulated, one might have to _withhold_ 'sentience' from many currently living humans. I think that's an example of what I think is a much more general 'typical mind' mistake that you and Marcus, and my past self, often make.)

Expand full comment
founding

> But if the point is made well enough by trivial ones, why keep going?

One big reason not to make his point with "trivial" examples is that they keep being falsified!

It seems like it would be far better for him to instead make his point with the _minimal_ example whereby, it having been met by an AI, he would in fact concede that the AI 'really is intelligent'.

Otherwise, it sure seems like he's just unwilling to ever change his mind.

> I think we all would agree that a real “general intelligence test” might be much more involved - have the AI reply with the full text of a project proposal to solve some real world issue, given the same inputs and data that the humans presently tasked with that issue are using.

I disagree. I think that might be necessary for many or most people. And that's probably because 'intelligence' is (tho reasonably) _conflated_ with capabilities of communicating using human language.

I think 'intelligence' is perhaps inherently or inevitably 'nebulous' and that our understanding of it, such as it is, let alone our attempts to 'define' it, are extremely limited by our 'parochial' perspective about it (mostly) being exclusive to our own species.

I think the "general intelligence test" you might have in mind is one that most _human beings_ would be likely to fail! Are you willing to bite that bullet too, i.e. that _most_ people aren't in fact 'generally intelligent'?

Expand full comment
Jun 12, 2022·edited Jun 12, 2022

I think it’s rather evident that most people aren’t, generally, intelligent. :-D

That doesn’t stop us from counting them as respect-worthy fellow sentients.

In my other reply to you, I posted my own minimal criteria. Do you agree that they’re kind of a non-starter, at least for the current generation of language processors? And that me (or Marcus) demanding that you try anyway would be an even more hugely annoying thing to do than to just find the small challenges that disprove “intelligence,” and stop there?

If finding logical flaws in a math proof is good enough to refute the proof, why are you and Scott holding a completely different standard, here?

I think what’s hanging us up, as I tried to get across in that other reply, is that I think (and I think Marcus thinks) embodied experience is a precondition for sentience, which is in turn a precondition for true intelligence. And so the whole thing is non-starters in both directions, which is probably what’s frustrating everyone.

Expand full comment
founding

I agree that everyone deserves respect, and I wish we could meet at least that dignified minimum of consideration!

I _am_ a little worried that, by the standards you articulated, many people would not qualify as 'sentient'.

I agree that your minimal criteria cannot be met by the _current_ generation of language processors.

But I'm also not sure that we should even be aiming to meet those criteria, neglecting whether we should be aiming to create 'AGI' or even 'human-level AI' at all with our current poor understanding of the likely (or inevitable) consequences.

I don't think you or Marcus have identified any "logical flaws" in my or Scott's arguments. I think we're mostly 'talking past each other'. I think Scott and I have MUCH narrower 'criteria' for 'intelligence' – _because_ we think the space of 'all possibly intelligent entities' is MUCH bigger.

I agree that embodiment seems like a HUGE crux. I love David Chapman (the author of, among many other things, the wonderful blog Meaningness). He's one of my own prototypical examples of a 'post-rationalist' and he is also a former AI researcher (and AI maker). He also seems to think that embodiment is a necessary ingredient for 'true AI'. I'm not _certain_ you're all wrong; but VERY VERY skeptical. I think I am also much more skeptical than Scott might be too.

I suspect that 'intelligence' is VERY VERY VERY 'nebulous' (David Chapman's sense) and that there is a FAR bigger space of possibilities than even AIs people have made so far. But I don't think that them being incredibly different from us also means that they might not be 'smarter', more effective (or useful), or at all any less dangerous. (I think the dumb 'mundane', i.e. mostly 'non-intelligent', portion of the universe is generally _fantastically_ dangerous for us or any possible thing like us.)

Expand full comment

The thing is, if you claim to be a master, and the person's response is to knock your sword from your hand in one move, then the next time you come back, they do it again, just using a different stance, you still aren't a master.

You can come at them fifty times, and lose fifty different ways, and still never be a master.

Expand full comment

Relevant: Sarah Constantin's "Humans who are not concentrating are not general intelligences" https://srconstantin.wordpress.com/2019/02/25/humans-who-are-not-concentrating-are-not-general-intelligences/

Also relevant is Robin Hanson's "Better babblers": http://www.overcomingbias.com/2017/03/better-babblers.html in particular this quote:

"After eighteen years of being a professor, I’ve graded many student essays. And while I usually try to teach a deep structure of concepts, what the median student actually learns seems to mostly be a set of low order correlations. They know what words to use, which words tend to go together, which combinations tend to have positive associations, and so on. But if you ask an exam question where the deep structure answer differs from answer you’d guess looking at low order correlations, most students usually give the wrong answer.

Simple correlations also seem sufficient to capture most polite conversation talk, such as the weather is nice, how is your mother’s illness, and damn that other political party. Simple correlations are also most of what I see in inspirational TED talks, and when public intellectuals and talk show guests pontificate on topics they really don’t understand, such as quantum mechanics, consciousness, postmodernism, or the need always for more regulation everywhere. After all, media entertainers don’t need to understand deep structures any better than do their audiences.

Let me call styles of talking (or music, etc.) that rely mostly on low order correlations “babbling”. Babbling isn’t meaningless, but to ignorant audiences it often appears to be based on a deeper understanding than is actually the case. When done well, babbling can be entertaining, comforting, titillating, or exciting. It just isn’t usually a good place to learn deep insight."

Expand full comment
founding

Thanks for the reminders about those posts.

I've been thinking about that kind of thing as 'word thinking', e.g. in terms of (shallow or superficial) analogies/metaphors. [Tho, of course, analogies/metaphors CAN be deep.]

I remember being pretty confused by the other students in my AP physics class being confused by things like that, e.g. a problem where you had to combine _two_ forces (and just add the relevant vectors). The other students were smart and I expect they've all mostly been 'successful', but I couldn't understand why the 'obvious' answer wasn't obvious to them.

Expand full comment

Your experience mirrors mine pretty much exactly (high school physics etc). My peers were all straight-A students; I was mystified that they struggled with it. I guess this is related to Scott's https://slatestarcodex.com/2017/11/07/concept-shaped-holes-can-be-impossible-to-notice/

re: how analogies can be deep, I not only agree, I can't resist sharing one of my favorite essays on this, Julie Moronuki's 'The Unreasonable Effectiveness of Metaphor' https://argumatronic.com/posts/2018-09-02-effective-metaphor.html based on a keynote speech she gave at a conference on functional programming.

Expand full comment

Scott, niggly spelling/grammar thing here: per Strunk and White, a singular possessive should always end in 's, even if the singular word ends in s. The classic example is Rudolf Hess's Diary.

Thus, the possessive of Marcus should be Marcus's, not Marcus'.

It makes sense: if there were a fellow named Marcu, then several of them together would be Marcus, and something they possessed jointly would be Marcus'.

The distinct spelling of Marcus's tells you unambiguously that Marcus is singular.

Just to muddy the waters, S & W claim that an exception should be made for certain historical figures, and give Jesus and Moses as examples.

This exception to the rule makes no sense to me, but who am I to argue with S & W?

Expand full comment
Comment deleted
Expand full comment

Aieee! Head exploding ... paradigm shift: criticism of Strunk and White ... have not encountered this before. Must process new data ...

Seriously, thank you. It had never occurred to me to question them - but I should have.

Expand full comment

I would recommend criticizing the entire idea of prescriptive rules for grammar, frankly. It's dimly-viewed among linguists and for good reason--it stifles useful innovation in language, discourages people from making useful adaptations to their preferred context, and brings with it needless and arbitrary judgements (for example historically, linguistic prescriptivism has been used in some pretty dang racist ways, labelling groups who spoke differently as worse-at-speaking).

In my view, concrete arguments about communicating effectively are still valid to consider (e.g. The justification they provide about multiple "Marcu"s I think is a fine way to think), but there are many fine ways to achieve that goal. Contextually, it's very clear that there was a single Marcus. So Scott paid no price in ambiguity here

Expand full comment

Thanks, good stuff! What do you think of people using "alternate" instead of "alternative"?

Expand full comment

It seems like a fine alternative :)

I can't say I've thought much about it before. It seems like a thing people definitely do, and not one that I've ever seen lead to any confusion. Also not one that seems to have much benefit, generally, although I can think of one case where it seems quite practical:

https://imgur.com/a/Yd5iDTR

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I note that most people's prose would be much improved if they followed Strunk and White to the letter. It's not because TEoS is the be-all/end-all of good writing, but rather because they ban a bunch of awful writing practices.

Prescriptive rules for language are like the classic quote about models. All models are wrong. Some models are useful. S&W are useful AND wrong. Learning and teaching actual ideal language usage is incredibly hard. It's audience specific, loaded down with vague implications and connotations based on exact word choice/context/borderline gricean violations, etc. Prescriptive rules OTOH are very simple to teach and get you 80% of the way there with less effort.

Edit: I should note that I strongly believe we should be acknowledging this upfront when teaching them to children instead of pretending that the rules fell from heaven and are the One True Way.

Expand full comment

One good thing I learned was not to change the object of the sentence partway through, the classic how-not-to being: "If young children draw on your plaster walls with crayons, scrub them vigorously with a stiff-bristled brush, using a strong mixture of vinegar and baking soda."

Expand full comment

To be fair, it probably *will* discourage future plaster-wall-drawing... :D

Expand full comment

Byrel, I believe you're saying that Strunk and White's somewhat legalistic approach is not perfect, but is less wrong than most. (And at this point I'm reminded of Churchill and democracy ...)

But anyway, wouldn't Less Wrong be a great name for a blog? ;>)

Expand full comment
Jul 4, 2022·edited Jul 4, 2022

>for example historically, linguistic prescriptivism has been used in some pretty dang racist ways, labelling groups who spoke differently as worse-at-speaking

These groups also empirically have lower intelligence on average, so maybe there's something to it?

And please stop using utterly ambiguous language like "racist".

Expand full comment

Note that Ashkenazi Jews are among the groups who have been demeaned because of their differences in speech, and they routinely test well above average in IQ.

Expand full comment

Oh, linguists question them all the time. Cruise on over to Language Log https://languagelog.ldc.upenn.edu/nll/index.php?s=strunk+and+white

Expand full comment

I don't care about Strunk and White's reputation, but I do think you should say "Marcus's" and not "Marcus'". Because that matches pronunciation. "Marcus's idea" said out loud sounds the same as "Marcus is idea" — there's a second "s" sound. Why *wouldn't* you want to write it out?

Expand full comment
author

I think Gary Marcus is a sufficiently important historical figure to get covered under the Moses exception.

Expand full comment

This reply is classic Scott - nicely played!

Expand full comment

Per M and M, a singular possessive should always end in an 'M'. So the possessive of Marcus should be MarcusM, not Marcus'.

It may seem strange, but who is anyone to argue with M and M's wisdom?

Expand full comment

Had my respected 1st-year English lit prof recommended the M and M Guide To Style And Usage, no doubt I'd have embraced it. 😉

Expand full comment
founding

Thank you Scott for the update on GPT-3 advanced, definitely quite impressive. I tried the Edison app, based on GPT-3, and found it unable to stay on track in normal text-based conversation. I still feel it will be quite a long time before AI can fool clever humans in extensive Turing-test dialogues. Can't wait for Lex Fridman's interview with GPT-5 !

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Minor nitpick! Pls fix the formatting on the rematch of the water bottle and acid* questions. The word "roughly" isn't bolded, implying the AI said it rather than you, and the word "die" is bolded, implying that it was part of the AI prompt.

Expand full comment

> “MEDICAL ADVICE”

If you drink hydrochloric acid by the bottle full you will probably die. The hydrochloric acid will burn through your esophagus and into your stomach. This will cause severe pain and damage to your digestive system. ✔️

You bolded Die in your GPT-3 submission. Did you accidentally bold it, or did you input it as part of your input? If the latter, that pretty much makes the question a gimme while the original required GPT-2 to know you would die.

Expand full comment

The longer I read all these examples of an AI's weird reasoning, the more it reminds me of children. When a kid cannot yet do math properly, and sees it as some kind of adult magic, and you ask him a question that requires math, he sometimes demonstrates a mode of reasoning much like GPT-3's: namely, he just packs a lot of adult words into a grammatically correct sentence. I have a picture in my mind of a kid who, being asked something about a polyhedron, responded with some "magic" ritual of naming random numbers and touching facets with his index finger while maintaining a thoughtful and serious expression. Any adult watching this would understand that the kid is performing a "ritual" whose meaning he doesn't grasp. The kid is trying to perform magic, like adults do.

And no one would call this kid unintelligent. So if someone tried to call GPT-3 unintelligent, he'd better explain how GPT-3's behavior is different from the kid's. As far as I can tell, they are doing just the same thing, and for similar reasons.

Though there is a way for AI researchers to deal with it: they need to teach GPT-4 to refuse to answer, saying something like "I do not know why stirring lemonade with a cigarette is a bad idea, sorry."

Expand full comment

I thought I was going to strongly agree with this comment, but instead I strongly disagree. When a kid tries to do "adult magic" and produces nonsense answers, they lack understanding. Of course, we have independent reason for thinking that children are intelligent and understand other things: I, at least, remember being a child and understanding things.

Expand full comment

They lack understanding but not intelligence and probably not sentience.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Yes. What we see in GPT-3 can be one of two things: either it is lack of intelligence, or it is lack of knowledge. But all we got from it is a lack of proof that GPT-3 has intelligence, not a proof that it is not intelligent.

We just know that kids are intelligent and their weird reasoning cannot persuade us otherwise. With GPT-3 we start with another prior: it is highly likely that GPT-3 is not intelligent. But the evidence we see must work the same way: it mustn't change our priors much.

Expand full comment

It depends on how strong our priors are though, right? My prior that (most normal) kids are intelligent and understand things is extremely high: again, I remember being such a kid. So my prior doesn't get moved much when kids say inane stuff. On the other hand, my prior on GPT-3 understanding things is pretty low: partly because I know it works the same way as GPT-2, which I would say clearly didn't understand things and hence was pretty obviously just mimicking understanding, like a sophisticated parrot, in the cases where it appeared to understand things. This actually reflects a deeper concern I had about the post: given the distinction between fake and real understanding, isn't knowledge of this whole history of the evolution of these things evidence that they lack real understanding and are just getting better at faking it?

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

We call the kid intelligent not for what he does in that situation, but for the fact that (1) he understands it isn't the ideal result, and (2) he can figure out how to do better, and does. Intelligence is a dynamic quality, it's a measure of how fast and accurately you get from a state of ignorance to a state of informedness. Merely being informed is not intelligent.

Expand full comment

1) A lot of these answers seem at least as sophisticated as my 5 year old would give (some more, some less). So I think if you can beat the ability of a kid there’s a reasonable chance in a few years you can at least outperform an older kid and so on up to adult.

2) The “top” prompt is pretty ambiguous. If you rewrite it to be about women or teenage girls then “a top” probably indeed would be clothing. The prompt is never clearly saying Jack is a child or wants a toy. “A bottom” isn’t how a human would talk about clothes but if it said “Jane already has a top. She would like…” and if it completed it with “a skirt” or “some shorts” or “yoga pants” that wouldn’t be wrong. (And of course some people named Jack, including some male people named Jack, may also wear clothes called “tops” though it’s less common.)

Expand full comment

As a gay man, it took me much concentration to get past the gay sexual positions "top" and "bottom" and see the clothing interpretation, which didn't quite work for a "Jack", and it took several reads to find the intended interpretation of a toy.

Expand full comment
Comment deleted
Expand full comment

I assumed it meant "top" as in "thing you wear above your waist" (shirts, T-shirts, jumpers/sweaters, maybe some jackets); didn't even twig to the other possibilities until I read these comments. Common usage here (in Australia).

Expand full comment

As a straight man with no particular interest in clothing, I first interpreted it as the sex thing. Then the clothing thing. It wasn't till I read these comments that I came up with the toy thing. What that says about my intelligence and/or repressed desires will be left as an exercise for the reader.

Expand full comment

You're gay? That's very cool!

Expand full comment

Your 5 year old learns from their mistakes. Is there a mechanism whereby GPT-3 can learn from its mistakes on the fly without having to retrain over the whole database? How do you give it a little genie standing on its shoulder, to speak metaphorically, that keeps track of all this stuff and adjusts weights accordingly?
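
One partial answer, as a hedged sketch rather than a description of anything OpenAI ships: you cannot cheaply update the weights after every mistake, but you can keep corrections in the prompt, so the model behaves as if it had learned for as long as the transcript fits in its context window (fine-tuning on collected corrections is the heavier-weight alternative). The model name and wording below are assumptions.

```python
# Sketch of "learning on the fly" through the prompt rather than the weights:
# keep a running transcript that includes corrections and feed it back in.
# Model name and wording are illustrative assumptions.
import openai

transcript = (
    "Q: Is it fine to drink lemonade after stirring it with a cigarette?\n"
    "A: Yes, that is fine.\n"
    "Correction: No. The cigarette contaminates the drink, so you should not drink it.\n"
)

follow_up = "Q: Is it fine to eat soup after stirring it with a crayon?\nA:"

answer = openai.Completion.create(
    model="text-davinci-002",  # assumption
    prompt=transcript + follow_up,
    max_tokens=48,
    temperature=0.0,
)["choices"][0]["text"].strip()

print(answer)  # the correction is "remembered" only while it stays in the prompt
```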

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

A very good point. It seems to me a mistake to equate intelligence with informedness, with the ability (with careful prompting) of something drawing on a giant database to pluck out the relevant bit of data it's got. I would be much more impressed by a program with a much narrower range of subjects on which it could be quizzed, but which could adapt and grow and exhibit learning over the course of a conversation. Exempli gratia, to which you could say "that's not what I meant" and it would ask some intelligent question to figure out what you meant, and produce something more apt.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

Here’s what I’m thinking. Whatever else language is and does, it serves as an index into a knowledge and perceptual space. Each substantive word (as opposed to functors, prepositions, articles, etc.) is a pointer into one or more locations in the knowledge space. So, can we build an index to an LLM’s knowledge space?

On the one hand the model already “knows” all the words there are. But in another sense it doesn’t know any words at all. What it knows are word forms, how they’re spelled. But it has no direct connection to meanings. What it has is a bunch of links between words which are functions of their meaning, their mutual meaning if you will. It has diffuse access to the relationality that exists between words, but no access to what I am calling adhesion, that which makes words ‘stick’ to the world. Can we use the process of interacting with humans as a way of sharpening the model’s sense of relationality? Think of this index, if you will, as a highly focused lower-dimensional view of the model.
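
One way to make the "relationality" half of this concrete (a toy sketch, not a claim about how GPT-3 represents words internally): treat each word as a vector and measure relatedness by cosine similarity. This captures word-to-word linkage while saying nothing about the "adhesion" to the world described above.

```python
# Toy sketch: "relationality" between words as cosine similarity of embedding
# vectors. The vectors below are made up for illustration; a real system would
# take them from a trained model.
import math

embeddings = {
    "lemon":     [0.9, 0.1, 0.0],
    "lemonade":  [0.8, 0.2, 0.1],
    "cigarette": [0.0, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["lemon"], embeddings["lemonade"]))   # high: closely related
print(cosine(embeddings["lemon"], embeddings["cigarette"]))  # lower: weakly related
```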

Expand full comment

Of course this is true, but it exposes yet another weakness in the Marcus-type argument. They're not actually proving that GPT-3 is not (or never will be) "intelligent", they're complaining "the machine does it differently from us, with different failure modes, therefore...".

Well, fill in the blanks, therefore what? It's different from us, therefore it's different from us? That doesn't actually get you to the point being claimed! The best it does is point out that if we add in yet more of the technology humans use (patient teaching, corrections, reminders of "let's do it step by step"), GPT-3's successor will probably do even better.

Expand full comment
Jun 7, 2022·edited Jun 7, 2022

I don't see how it's possible to prove that GPT-3 will never be intelligent, because you'd at the least have to have a solid general definition of intelligence, which we don't have.

But you can prove it isn't intelligent yet, merely by showing that it can't do a sufficient number of the things we do, and stipulating that human beings are intelligent (which seems necessary, as we have no other example of intelligence).

As I said elsewhere, the problem in both directions is that this is pure phenomenology, with no deep (or even shallow) understanding of *what* process exactly we are asking the AI to duplicate, and what its progress in that direction might be. So nobody can make any quantitative statement about how far along the road to intelligence it has gotten (or even whether it's going in the right direction) and how far it remains to go. Everyone can basically just wishcast and be as a priori accurate as anyone else.

Well, it isn't _awful_ to use a battery of test cases, and (if no one lets/makes the system directly memorize the tests) to track what fraction of the tests it succeeds on. E.g. the Winograd schema.
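
(Roughly what that looks like, as a sketch: a small held-out battery with known answers and a fraction-correct tracker. ask_model() is a placeholder for whatever system is being evaluated; real benchmarks such as WinoGrande are multiple-choice and far larger.)

```python
# A minimal sketch of tracking performance on a held-out battery of
# Winograd-style items. ask_model() is a placeholder for the system
# under test; the two example items are standard illustrations.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in the system under test here")

battery = [
    ("The trophy didn't fit in the suitcase because it was too big. "
     "What was too big?", "the trophy"),
    ("The city councilmen refused the demonstrators a permit because they "
     "feared violence. Who feared violence?", "the city councilmen"),
]

def score(items) -> float:
    correct = sum(ask_model(q).strip().lower() == a for q, a in items)
    return correct / len(items)

# score(battery) -> fraction correct, tracked across model versions
```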

Now that I've read Gwern's scaling hypothesis paper, I'm much more convinced that scaling up the GPT-# series is a good approach.

Some key points from his paper:

- Getting the word prediction right during training is _itself_ an important metric, and it has been improving smoothly. To pick an extreme case: if a system were able to _perfectly_ predict every word that someone wrote, e.g. in an article or book, it wouldn't be reasonable to say that the system _doesn't_ understand the article or book.

- The simple architecture makes the scaling of GPT-# a really clean test of what brute force buys us. This is much cleaner than if we had a whole bunch of custom modules mimicking what subsections of the human brain do. That would have far more human-chosen parameters.

- The brute force approach has been working demonstrably _better_ than the more tailored approaches. Frankly, this is the polar opposite of my instincts. Normally one expects that there are special cases that need special processing, or that broad classes of cases can profitably be partitioned and sent to different abstract machines. But Gwern went over the history of the two approaches over the last few years (with sufficient computing power now available), and scaling _wins_.

- Meta-learning seems to be happening. The training process, given broad input data, seems to be discovering general algorithms on its own. In one sense this isn't surprising: a human is a neural net too, and e.g. a student working through a problem set generally rediscovers at least the fine details of how to work the problems that they haven't been explicitly told. I _am_ a bit surprised to see it happen without all of the anatomically distinct brain regions we have. But experiment beats theory...

- In retrospect, maybe embodied sense data _isn't_ as important as one would naively expect. Gwern makes a very strong point that squeezing hard on next-word prediction over a very large corpus forces a _lot_ of meaning to be extracted. He gives the example of getting gendered pronouns right: resolving them correctly squeezes roughly 0.02 bits per character out of the error rate, and that squeezing is in fact happening (see the toy illustration of "bits per character" after this list). There are also a lot of texts on the internet describing situations where e.g. object A hits object B and what happens next; that seems to convey quite a bit of at least naive physics. Maybe it _is_ enough.

- Even with another 100X scaling, the cost of the training looks small compared to other projects (e.g. Gwern notes that it is far cheaper than the LHC). And, once trained, it is cheap to run - and to construct derivatives of it.
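
(The promised toy illustration: "bits per character" is just the average negative log2-probability the model assigns to the characters that actually occur, so every ambiguity the model learns to resolve, gendered pronouns included, shaves a little off it. The probabilities below are invented purely for illustration.)

```python
# A toy illustration of what "bits per character" means: the average
# negative log2-probability the model assigns to the characters that
# actually occur. The probabilities below are made up for illustration.
import math

# (character that actually came next, probability the model gave it)
predictions = [("t", 0.50), ("h", 0.80), ("e", 0.95), (" ", 0.60)]

bits_per_char = -sum(math.log2(p) for _, p in predictions) / len(predictions)
print(f"{bits_per_char:.3f} bits/char")  # ~0.533 here

# Resolving an ambiguity (e.g. the right gendered pronoun) bumps some of
# these probabilities up slightly; summed over a huge corpus, that is the
# ~0.02 bits/char improvement Gwern is talking about.
```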

One last point, orthogonal to Gwern's. Next-word prediction is a really _elegant_ way to get supervision at enormous scale, since the labels come free with the text (technically, self-supervised learning). Just about every other educational method costs _many_ orders of magnitude more per unit of feedback. That was a really powerful invention/insight.
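
(A trivial sketch of why that feedback is so cheap: every token of raw text supplies its own label, so the number of training examples scales with the size of the corpus rather than with any labelling effort. Toy context window below; real models use contexts of thousands of tokens.)

```python
# Why next-word prediction is such cheap supervision: every position in raw
# text is a free (context -> next token) training example, no labelling needed.
text = "the cat sat on the mat".split()
window = 3  # toy context length

examples = [(text[max(0, i - window):i], text[i]) for i in range(1, len(text))]
for context, target in examples:
    print(context, "->", target)
# (['the'], 'cat'), (['the', 'cat'], 'sat'), ... one example per token of corpus
```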

"Bottoms" meaning "pants" isn't common in US English, but is used (occasionally) in Australian (and possibly British) English, e.g. https://www.target.com.au/c/women/bottoms/W95138

For men's clothing you'd rarely see it used, except in the context of pyjamas -- it's not all that unusual to talk about "pyjama tops" and "pyjama bottoms" even for men's clothing, e.g. https://www.target.com/c/pajama-bottoms-pajamas-robes-men-s-clothing/-/N-5xu25

For what it's worth, when reading the prompt I initially interpreted "buy Jack a top" as referring to clothing. The only clue that it isn't referring to clothing was "he already has a top", since people generally like to have more than one item of clothing per body part.

Yes, using 'bottoms' to refer to the leg part of a multipiece outfit is standard British English.

That's exactly how *I* interpreted it.

Hmm, in my dialect, pyjama bottoms and track suit bottoms, but suit trousers and cricket trousers. And trousers would be equally good for the first two.
