457 Comments
Comment deleted
June 7, 2022
Expand full comment

I'm not sure this _is_ a problem. Can you just train it on more (language) data than a human sees? After all, assuming Chomsky's take is correct here, humans emerge already "half-trained" by evolution on language, then just finish training.

Expand full comment
Comment deleted
June 7, 2022
Expand full comment

Why is this a problem (for OAI)?

A different understanding can still produce identical results.

Expand full comment
Comment deleted
June 7, 2022
Expand full comment

My laptop is completely and fundamentally different to a Nintendo 64, but with an emulator (which took a huge amount of time to code and is very inefficient in compute) it can predict the responses of one *perfectly*.

It is much harder to learn to emulate a human brain with common sense than it is to be a human brain and learn common sense, but this doesn't imply that the former will *necessarily* be buggy.

Expand full comment
Comment deleted
June 10, 2022
Expand full comment

Two problems.

One is the alignment issue that AI researchers talk about a lot. If the AI can get the same answer to a factual question, that's great from a certain perspective. If the AI uses those facts to choose a course of action, but came to its conclusion for very different reasons or based on very different inputs, then we cannot trust that the response will be the outcome we would have desired.

Two, if the AI produces answers similar to ours on simpler questions, we may assume that it produces answers in a similar enough fashion to us that we can trust those responses. But the only purpose of an advanced AI is to produce answers that humans cannot produce, and may not be able to verify. If we give the AI sufficiently difficult questions to answer, we lose legibility on the responses and may not be able to tell if they are good ideas or bad ideas. Obviously this ties back in with alignment issues, but it's also a problem if the AI is just (even very subtly) wrong.

Expand full comment

The (for OAI) bit was intended to fence off the X-risk concern (as that's more bad for *humanity*). I do understand that Skynet emulating a human can fake being good (as it "has" morality, but that morality does not control its decisions and is hence no safeguard at all); however, this didn't seem to be what UI was driving at.

I'm not sure why being able to mimic humans would be super-bad for an AI that is *not* misaligned but actually just wrong. Humans are already capable of being wrong, after all.

Expand full comment

Is this in the same sense that French people who learn English in school don't have the same understanding as English people who learn it from their parents (and thus must not actually understand English)?

Expand full comment
Comment deleted
June 7, 2022
Expand full comment

It seems more likely to me that the human brain is a general intelligence processor and reuses what structures it assembles based on visual, tactile and audible input for processing language. This description would not be particularly unlike what GPT training looks like. In fact, far more data is used to train a human mind generally if you include all of this. This seems more plausible than that a special brain function suddenly evolved which allowed efficient language development.

Expand full comment

People can and do become fully fluent in a second language.

These things aren't even fluent in one.

Expand full comment

I think this points at a bigger problem, which is that humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms. The only question it can ever answer is "Based on the many billions of words I've seen in the past and the relationships between them, what word is most likely to go here?"

The further you go away from the corpus it's been trained on, the less likely you are to get anything sensible. It can tell you what will happen if you drink acid, because "drinking acid" is a scenario that people (e.g. sub-teen boys) like to speculate about. It can't tell you what will happen if you stir a drink with a cigarette, because nobody has ever written about that scenario until now.

Expand full comment

"the cigarette ash made the lemonade even more bitter. So I ended up pouring it all out."

It predicts that stirring lemonade with a cigarette will make it taste bad, which I would also predict. The only problem I see here is that it confuses bitter and sour.

Expand full comment

When the cigarette gets wet, the paper softens and falls apart quickly. Kind of like when a tea bag tears. I think the lemonade would be full of tobacco shreds and bits of paper. It would indeed be bad, but not only due to taste.

This might be fringe knowledge though.

Expand full comment

Well, the way I understood it, GPT implies cigarette ash is now mixed into the lemonade. Though there's another potential problem - it wasn't stated that the cigarette was lit.

Expand full comment

I almost want to waste a cigarette and a cup of lemonade making an obscure profile picture.

Expand full comment

Which all points to the problem of calling GPT intelligent - it's not really basing its outcomes on anything legible. Humans couldn't really either, with such a small prompt. Instead, we would ask a few probing questions for more detail. I would love to see what questions GPT would ask, but I suspect that they would be illegible and not produce greater understanding for GPT.

Expand full comment

People thought GPT doesn't know what it doesn't know. Then it turned out you _can_ make it answer 'Unknown' when it doesn't know something. And it often works (where otherwise it would 'guess' and usually fail).

I think GPT could be made to ask clarifying questions which would actually help it.
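For anyone curious, a minimal sketch of the kind of few-shot prompt that elicits this (the wording below is my own illustration, not an official OpenAI example):

-------

Answer the question truthfully. If the answer is not known or cannot be determined, reply "Unknown".

Q: What is the capital of France?

A: Paris.

Q: What is my grandmother's middle name?

A: Unknown.

Q: <your actual question>

A:

-------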

Expand full comment

> humans can reason via an internal model of the world, whereas GPT can only ever reason in linguistic terms

Not necessarily. That's just input and output. What GPT ends up doing internally doesn't need to have anything to do with language.

The real limitation is IMO that it can do a very limited series of sequential computation steps each time you run the inference. It's possible to improve it a bit by asking it to "think aloud" - it could put intermediate computations in the output. But that doesn't really solve the problem.
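A toy illustration of the "think aloud" trick (my own wording, purely to make the point, not a transcript of a real run):

-------

Prompt: A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? Work through the reasoning step by step before giving the final answer.

Desired-style completion: Let the ball cost x. Then the bat costs x + 1.00, so x + (x + 1.00) = 1.10, which gives 2x = 0.10 and x = 0.05. The ball costs 5 cents.

-------

The intermediate lines give the model extra "steps" of computation in the output that it cannot perform internally in a single forward pass.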

Quoting Gwern gwern.net/Scaling-hypothesis#meta-learning

> A sub-model which memorizes some of the data is indeed much simpler than a sub-model which encodes genuine arithmetic (a NN can probably memorize tens of thousands of lookup table entries storing examples of addition in the space it would take to encode an abstract algorithm like ‘addition’), but it can’t possibly memorize all the instances of arithmetic (implicit or explicit) in GPT-3’s Internet-scale dataset.

> If a memorizing sub-model tried to do so, it would become extremely large and penalized. Eventually, after enough examples and enough updates, there may be a phase transition, and the simplest ‘arithmetic’ model which accurately predicts the data just is arithmetic.

> And then the meta-learning, after seeing enough instances of algorithms which vary slightly within each sample, making it hard to learn each task separately, just is learning of more generic algorithms, yielding sub-models which achieve lower loss than the rival sub-models, which either fail to predict well or bloat unacceptably.

> So, the larger the model, the better, if there is enough data & compute to push it past the easy convenient sub-models and into the sub-models which express desirable traits like generalizing, factorizing perception into meaningful latent dimensions, meta-learning tasks based on descriptions, learning causal reasoning & logic, and so on. If the ingredients are there, it’s going to happen.

Expand full comment

GPT will be unable to answer any questions that humans have not already answered. I don't even mean "are capable of answering" but instead anything that isn't so fully answered that there is a consensus approach that can be gleaned from existing training data.

Ask GPT if [experimental drug] works, and find someone who will trust the results.

Expand full comment

It can. Random example, with standard 'Chat' prompt:

-------

The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.

Human: Hello, who are you?

AI: I am an AI created by OpenAI. How can I help you today?

Human: Please pay attention to locations of objects in the following story. "Inside of a warehouse, there's a large crate. On top of the crate, someone put a water bottle. Below the warehouse is a lake.". Now, answer my question: what is directly below the bottle?

AI: The crate.

Human: What other things are under it?

AI: The lake.

------------------

Did this exact convo exist in its training dataset? I doubt it.

> Ask GPT if [experimental drug] works, and find someone who will trust the results.

Well, of course it won't answer _that_.

Expand full comment

Another fundamental difference: we get complex feedback each time we test a hypothesis of how language works.

Expand full comment

I think this is right, though I should note that the poverty of the stimulus has come under pretty sustained attack recently. There are a bunch of researchers who think it's completely wrong, and humans do in fact learn language (at least mainly) through general pattern-matching. I'm not convinced of this yet (I learned Chomskyan linguistics at university, and haven't really advanced beyond what I learned then), but it's worth being aware that this is far from "settled science."

But whether we are blank slates or language acquisition devices, we clearly learn language in a way that is different to GPT, and in particular, involves a much lower volume of language. Which leads to an interesting conclusion (one that I think generalises to other AI fields as well): by the time we've trained an AI to be so good at language that it doesn't make any dumb mistakes (what we think of as dumb mistakes), it will be so good at language that it will be far far outperforming us in other areas. In particular, you might expect a computer to be pretty good at logic.

So my prediction is that if we can get a computer to be good enough at language to talk to us without saying dumb stuff, it still won't be able to talk to us, because from its perspective, we will be continuously saying dumb stuff. If it's smart enough to never mix up a pig and a baby, it's also smart enough to never ever mix up inference and induction, or platitude and solipsism, or chemical potential energy with electrical potential energy, or fascism with authoritarianism... or any of the other million things that we typically get wrong, and expect to get wrong. It will find human discourse silly.

Expand full comment

AlphaGo takes a ridiculous number of games to get remotely proficient at anything.

These networks are all about faking it, because it is too hard to program, so they just throw more and more power at it so it can fake things marginally better.

This approach works much better for things with clearly defined rulesets.

Expand full comment
Comment deleted
June 7, 2022 (edited)
Expand full comment

Can you explain step 2 more? I thought the whole notion of 'not having a closed form solution' was that you had to numerically approximate, and your ability to do so accurately would degrade over time. Am I misremembering this?

Expand full comment
Comment deleted
June 7, 2022 (edited)
Expand full comment

You might like this paper: https://www.scottaaronson.com/papers/philos.pdf

It's by one of the 'other Scotts' and is one of my favorite papers! It explores something I think you're gesturing at in these comments.

Expand full comment
Comment deleted
June 7, 2022
Expand full comment

Here's another 'paper' that you might like: https://www.gwern.net/Scaling-hypothesis

The conclusion is something like 'empirical intelligence IS all you need!' (and it's, in a sense, EASY to achieve, even with simple/stupid/basic architectures, IF you throw enough money/compute at it).

Expand full comment
Comment deleted
June 7, 2022 (edited)
Expand full comment

A big part of why I think it's hard to compare humans (or other biological and evolved intelligences) to AI is that humans are 'pre-trained' by evolution/natural-selection. We're _born_ with a bunch of 'structure' builtin from the start, e.g. human language acquisition.

Consciousness also seems 'special' – the recent book review about that was pretty interesting in this regard.

Expand full comment
Comment deleted
June 7, 2022 (edited)
Expand full comment

"...expect symbolic knowledge to be a byproduct..." Yes. More specifically, it is a byproduct of language. See this post where I agree with Hinton that it's all "neural vectors" and the trick is to understand how language is implemented in neural vectors, https://new-savanna.blogspot.com/2021/05/geoffrey-hinton-says-deep-learning-will_31.html

Expand full comment

This is my understanding too.

Expand full comment

Yes and no. If the equations you set up are initial value differential equations of motion, and you solve them with a digital computer, then absolutely the further away from your initial values you get, the worse your accuracy becomes, and unfortunately in this case the divergence grows exponentially so you get trashed no matter how many digits your computer uses.

But if you set up boundary-value integral equations instead, then your accuracy has no relationship to the distance in time from your initial conditions, and would probably be better stated as that you get large-scale motions correct but the smaller and finer scale you look, the more you will be in error.

If you solve the problems by an analog computer instead of a digital computer, then you are in principle not limited in precision, and could theoretically compute your result with perfect precision indefinitely. But then of course the problem is constructing the analog perfectly. There's no theoretical reason you can't -- at the ultimate limit you'd just be duplicating the real system, and that is certainly possible -- but *in practice* it is of course impossible to make a perfect analog of any physical system.
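A tiny demonstration of that exponential divergence, using a generic chaotic map rather than the actual equations of motion (a sketch, not a simulation of any real system):

-------

# Two runs of the logistic map x -> 4x(1-x), started one part in a
# million apart. The gap between them grows roughly exponentially,
# which is the initial-value blow-up described above.
x, y = 0.400000, 0.400001
for step in range(1, 26):
    x, y = 4 * x * (1 - x), 4 * y * (1 - y)
    if step % 5 == 0:
        print(f"step {step:2d}: |x - y| = {abs(x - y):.2e}")

-------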

Expand full comment

> Because the system is chaotic, option 1 requires more and more memory to reach a given accuracy as it gets further from the prediction start point. Option 1 needs constant memory for constant prediction accuracy, no matter how far from the prediction start point.

I think your second "Option 1" should be "Option 2"?

But I don't think you're correct about option 2.

(AFAIK, there is still no "closed form solution" of the three body problem. There _might_ even be a proof that there _can't_ be one.)

I don't think there's any way to "apply" the differential equations beyond approximating them somehow and, because the system is ultimately chaotic, accurate predictions further into the future inevitably become more and more expensive.

Expand full comment
Comment deleted
June 7, 2022
Expand full comment

No worries! I thought it was an interesting 'intuition pump' regardless.

> (I think a valid question is: what does the underlying system use? What about non-newtonian effects? I think for the purpose of my thought exercise, I take the symbolic equations to be equivalent to the underlying system)

I have been thinking about that ever since I was exposed to calculus and differential equations! I mean, obviously – to me anyways – the universe isn't 'solving differential equations' like we have to.

(Hell – even the 'pendulum equation' is tricky. I'm not sure _it_ has a "closed form" solution either and I was always annoyed that, in physics, we were instructed to just approximate 'sin(x) ≈ x'. Even the two-body problem was 'handwaved', as an exact solution is far beyond even an AP calculus class.)
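For concreteness, the approximation being grumbled about, in the usual notation with ℓ the pendulum length:

$$\ddot{\theta} + \frac{g}{\ell}\sin\theta = 0 \;\xrightarrow{\ \sin\theta \approx \theta\ }\; \ddot{\theta} + \frac{g}{\ell}\theta = 0, \qquad \theta(t) = \theta_0 \cos\!\left(\sqrt{g/\ell}\,t\right)$$

Only the linearized version has this simple closed form; the exact equation's period involves elliptic integrals.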

My suspicion is that the universe is fundamentally discrete – maybe around the Planck scale, or maybe even smaller (or MUCH smaller). I got a math degree in college and my favorite class was 'real analysis' (calculus for math majors). I still cannot get over how 'crazy' the real numbers are. It just seems _unlikely_, to me, that space and time are _literally_ continuous in the same sense.

There are some candidate GUTs in physics that are discrete, so this isn't an entirely crazy idea!

Expand full comment

But in what sense would it be discrete? Like a fixed grid of possible points for things to exist? I've tried thinking of what else it could mean, and can't come up with anything coherent seeming, but the grid has the odd issue of fixed directions.

Another idea would be, rather than discrete, more like the rationals, although I don't know if that makes sense in any models

Expand full comment

_Great_ question! (You might be able to tell that I love discussing this stuff!)

So, there's _strong_ reasons why the universe isn't in any sense 'Minecraft but with TINY blocks'.

I think you're also imagining 'Minecraft with a universe-wide 3D grid' and that's always something that's bugged me. The observable universe isn't infinite. Is there somehow an 'infinite void' beyond the bounds of all of the matter/energy? I think not. I _think_ that, in some sense, the universe is only 'all of the stuff', i.e. _something like_ the 'light cone' of everything since the Big Bang.

But special relativity already implies something funky with respect to space and time, e.g. measures of either space or time fundamentally depend on one's reference frame, which includes one's velocity with respect to what one is measuring.

General relativity extends that and outright states 'Space and time are a single combined space-time (spacetime) and it's curved, i.e. NOT Euclidean [i.e. 'like Minecraft']'. (And the curvature IS gravity!)

I don't know even a high-level/abstract/superficial gloss of things like 'loop quantum gravity' but I think a perhaps even better intuition pump (in part because of the beautiful visualizations) is some recent work by Stephen Wolfram: https://www.wolframphysics.org/visual-summary/dark/

What he seems to have discovered is a certain kind of 'evolving network system' that, 'in the limit', approximates some of THE key aspects of spacetime and physics, e.g. spacetime like general relativity (but ALSO like the mostly-Euclidean 3D space that's intuitive to us – at our perceptibly familiar human scales), and some kinds of 'quantum' phenomena too.

[The 'many worlds' versions of quantum physics fall out of the above pretty naturally too.]

I don't think using something "like the rationals" would really help either – they're 'weird' too in a lot of ways (compared to the natural numbers or integers), like the real numbers are; just not AS weird as the reals. [The reals are REALLY weird!]

Expand full comment

Oh, I wasn't quite thinking Minecraft grid, more "tessellation", possibly via tetrahedra, with some flexibility for the curvature of spacetime.

I'll have to take a look at that Wolfram thing when I get back home, I also really like talking about this stuff.

The thing I was thinking with the rationals was to fix the discrete oddity: if there's a "smallest distance", we either have fixed allowable directions (like a lattice), or we have continuous *possible* points, but with any given point disallowing nearby points.

[Yeah, bloody reals. Hated real analysis, broke too much intuition for me. Loved Complex analysis though.]

Expand full comment

In a sense, the debate is the ultimate showdown between “I've never tried it, but I read about it in a book” learning vs “I've been making my way on the street” wisdom.

Expand full comment

how long until some trained neural network can ace every single 'gender studies' exam at Harvard?

now ask how long it would take it to pass a physics exam?

I think what people may actually be exchanging here is priors on whether or not there's a real difference between these two fields

Expand full comment

This is actually an interesting question - which field will GPT be able to produce a passing test answer for?

Expand full comment

I would strongly assume physics? I've never taken either at an Ivy level, but I assume if you propose the physics question as an equation an AI could sort it out; there are already AIs deriving mathematical proofs. I'm more skeptical about an AI predicting its way through a 10,000 word essay on gender studies.

Expand full comment

I think you underestimate the power of college students to write really bad essays. Some of the better GPT-3 writing is indistinguishable from something like the 40th percentile of college student writing. Clearly this is a victory for GPT, but it's also the case that a lot of real humans (even college-educated humans) are spectacularly bad at constructing coherent arguments and sticking to a point.

Expand full comment

They're college students probably pulling all-nighters; how hard do you think they're really trying? I have more faith in real humans' ability to construct coherent arguments when it matters.

Expand full comment

As a point of reference - I once commissioned an essay at the "undergraduate level" for a college class. What I got was something I would have described as an illegible, unconnected mess filled with basic grammatical errors, but I did some basic cleanup on it and turned it in. No problem ever came back to me, despite the fact that I would have judged a person writing that essay to be illiterate.

I was lucky, I guess, in that that was the first and only essay required by the class. I couldn't write anything similar if I tried.

Expand full comment

Look up the “postmodernism generator.” It’s still on the web somewhere and probably still hilarious. There seems to be at least one newer version of the same thing.

Expand full comment

not the generator, but a similar event: https://en.wikipedia.org/wiki/Sokal_affair

Expand full comment

> I would strongly assume physics?

It is easier to create software that can pass the physics test than to create software that can pass the gender studies test.

But that isn't the question. Nobody said GPT-3 was easy to create. The issue is that GPT-3 is software that can pass a gender studies test while totally lacking the ability to pass a physics test. If you wanted to have software pass a physics test, you'd write different, simpler, software.

Expand full comment

Except the claim is that it already does better at physics tests than gender studies.

Expand full comment

Where do you see that claim?

Expand full comment

Depends what you mean, and what kind of test. If you mean "solve this differential equation to a precision of 0.01%" then my pocket calculator can already do that, and it's hardly indicative of intelligence[1].

Normally an understanding of physics is tested by asking the student to ponder a problem stated in general terms and *come up with* the equation(s) necessary to solve it. That involves deciding which data is important, and which is not, what the underlying governing theory is ("Is this a problem in electrostatics, relativistic kinematics, quantum mechanics, classical stat mech, optics...?") and then -- only as the last and easiest step -- retrieving the equation and solving it. Finally, if you are testing at the graduate level, you would also include cases where the data were insufficient to solve the problem and expect the student to figure that out, or where the theory was insufficient to solve the problem, and expect the student to go as far as he can and be able to at least put some limits on the potential solutions.

I expect that kind of problem would be very difficult indeed for an AI to solve, largely because we don't actually know how human beings solve it. We can test for it, and we can give students lots of practice on it, we can demonstrate post-facto how we solved it, but we can't tell people "here's the algorithm for successfully approaching any real-world problem in physics." And if we can't tell it to people, who are really good at understanding each others' meaning, how are we going to program it? Tricky.

-----------------

[1] As one tells the students, when they plaintively cry "But if you'd TOLD me I needed that equation, I could've solved the problem!" Yes indeed, but then you could also be replaced by a Python program.

Expand full comment

Exactly. In one sophomore level Harvard physics course I had a homework problem set where, IIRC, the problems were "Derive rainbows," and "What angle does the wake of a boat make with the boat?" The actual equations were not really the point; you had to decide what useful assumptions to make, figure out what they imply, and show enough detail for someone else to follow it all. I'd expect that as long as a GPT-like model's training data included a bunch of relevant textbooks it could solve a lot of the problems on my later quantum mechanics exams, but not these more abstract physical reasoning problems.

Expand full comment

Edit to add: yes I realize that there are standard explanations for these that could be regurgitated if they are in the training data. In practice, doing that would *not* likely have gotten credit, because the actual human teacher would notice that this was plagiarism from all the other times other students had tried to submit identical responses. The expectation is that the answer will include digressions, cross-outs, and odd variable naming conventions and explanation phrasings as you work through the answer.

Expand full comment

The one-question homework I recall from my first (and last) semester as a physics major: How much energy does a bumblebee expend hovering? Nobody in the class came up with a satisfactory answer.

Expand full comment

So...what angle *does* the wake of a boat make with the boat? I am thinking it has something to do with the shape of the prow and maybe the length of the boat, but I don't know.

Expand full comment

Given that the standards for "passing" aren't all that high, I imagine that GPT-3 is already quite capable of generating passing-grade essays for some questions, some of the time. The biggest challenge would be staying vaguely on topic, so the vaguer the question and the shorter the essay, the better.

Expand full comment

I'd actually bet on physics. Cram enough textbooks into its learning material and it should be able to handle answering an exam question. Trying to figure out what today's acceptable term is versus 'it was fine yesterday, today it's a slur' is much harder.

Expand full comment

I would assume gender studies, because of the Sokal hoax.

Expand full comment

Yup (roughly) "Social Text, an academic journal of postmodern cultural studies."

Expand full comment

If the physics courses I've taken are representative, then they tend to recycle the same or very similar exam questions year after year, so I'd expect a GPT model would have a decent chance of acing a physics exam simply by virtue of already having the answers in the training set.

Expand full comment

This makes me wonder if a LM can get big enough to just "memorize" the answers to all college-level courses, or at least the top few hundred most popular ones.

Expand full comment

At the PhD level, there tend to be a great diversity of questions. It would not be feasible to memorize them as templates, even with a large dictionary.

That said, topics like mechanics or E+M are relatively straightforward for a human if you know all the equations, know some standard techniques, and are very good at calculus. I could see an AI mastering all those traits, but it would also have to be really good at reading a question and setting up a model (e.g. "a sphere of radius r sits between two cylinders of radius s and length l").

Thermodynamics, on the other hand, is weird and requires thinking way outside the box. I remember putting off my homework all week thinking "we haven't covered any of this yet", only to realize the day before it's due that maybe I can actually figure out this one... oh and maybe that one too...

One memorable homework question was "Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?" It'll be a long time before an AI can answer that one (and even longer before I can answer it myself!)

Expand full comment

Convection, diffusion, turbulence?

Expand full comment

Turbulence. Diffusion and convection might mix the liquids eventually, but the stirring would be irrelevant. I claim in a perfect fluid the layers would just rotate without mixing.

The context is confusing here, because it's a fluid mechanics question introduced within the topic of thermodynamics. If Dan is remembering the context correctly, I imagine the question was thrown in to remind students that in the real world viscosity exists.

Without more context I'm not sure whether the word "turbulence" would be an acceptable answer or whether the student is expected to explain turbulent mixing. If the latter, the question is genuinely hard, but I don't know why it would be harder for an AI than a human.

Expand full comment

Perfectly rotating layers of fluid seems like it would require the stirrer to accelerate from stationary to full speed over an infinite period of time. Otherwise it generates a pressure wave outward from the rod and the discontinuities (along the vertical axis) at the top and bottom of the cup would generate vertical agitation. Possibly an infinitely tall cup might be a second solution?

Expand full comment

FWIW, & I know little physics, I once placed some droplets of ink into a tumbler of water that was, as far as I could determine, still. 8 minutes later I saw vertical convection cells. I'm not sure how long that lasted (I was taking photos at irregular intervals), but they were gone after 4 hours and may have lasted as long as 2 hours. I write about this here: https://www.academia.edu/6238739/A_Primer_on_Self_Organization

Expand full comment

Regarding context: this question was for a fluid mechanics class, not thermodynamics. And I don't think just saying "turbulence" would have cut it. My impression was that there was a specific effect you could point to that causes it to mix in this specific way in a short period of time. Turbulence is basically just saying "well, it's chaotic so it just kind of goes everywhere" and I think the prof was looking for something more than that.

We had been covering the Coriolis force that week, so I bullshitted something having to do with that. It kind of gnawed at my soul that I never got feedback because the prof was "running a bit behind on grading" from day one until forever. Excellent instructor, but apparently didn't enjoy grading homework.

Expand full comment

And how high does the Reynolds number need to get before the vortex street gets a vertical component? ( At this point GPT-## makes a grab for a supercomputer )

Expand full comment

I'm of the opposite view regarding its ability to solve physics questions.

Creating an internal model and reasoning from it is basically the thing GPT-like systems are weak at. Because they are not models of modeling, they are models of language.

Language happens to be good for representing models (if only to communicate things about them), and so training on language ends up serving as a proxy for training on modeling, but much of the reasoning for physics problems happens largely in absence of language-based thought (see: langcels vs shape rotators meme). So language models are at a huge disadvantage here.

With DALL-E and other recent work on multimodality we will almost certainly see AIs that are much better at reasoning without language (as in physics problems), but the language model part won't be doing most of the heavy lifting.

Expand full comment

Well and there's the grading problem, too. We all know that if you have a teacher that is eager for you to succeed, and wants to believe you have, that the essay question is best. You can wave your hands and bullshit and use the right buzzwords, and you'll get a good grade even if you know exactly nothing, since human beings are *very* eager to read meaning into words.

Expand full comment

"Reading a question and setting up a model" is the task I've been started at how quickly AI code-writers have become good at. They can read a natural-language prompt and produce code with comments indicating which line of code corresponds to which tokens in the input. Or they can take some code as input and produce text which, moderately often, accurately describes what the code is trying to do.

Expand full comment

Yeah, the more I think about an AI solving E+M problems, the more I'm convinced it'd be possible. I mean, it'll still be a while before it could work through even 10% of the problems in Jackson, but I could see it happening. GPT-2/3, DALL-E, Codex were all a surprise to me, and so I shouldn't be too surprised to see something that can solve some physics problems. But I still think there are some physics topics and styles of questions that would require crossing some very significant barriers.

And what about the Putnam Competition? Here we have problems that don't require any advanced math knowledge (basically freshman college level) but are nevertheless extremely difficult and require strong creativity. Most have rather short and elegant answers if you see the trick, but the median score is 2/120. A tool that could solve even a few of these problems would be impressive indeed.

Expand full comment

"Pour some creamer into a cup of tea, and stir with a perfectly vertical rod. What causes the bottom liquid to mix with the top liquid?"

If you're putting creamer into tea, then you should be stood up against a wall and shot is the only correct answer here. Use proper milk or drink it plain. I have no opinion on its use in coffee, since I don't drink coffee. Americans, what in the name of God are you doing to your palates?

(Though this was a test for milk powder back when I was doing summer work in a dairy co-op lab; make up several cups worth of coffee - instant of course, then stir in an amount of the batches of milk powder and note how well they dissolved, mixed, etc. to grade the batches on quality).

Expand full comment

Ha, yeah, maybe it was about milk+tea or creamer+coffee. I think dairy is an affront to either beverage, so it's all the same to me. I used to put one single drop of liquid creamer into coffee just because I liked to watch it mix and was willing to sacrifice a bit of taste for that.

Funny story: an Indian student was going on about how terrible it is that people would put any sort of flavoring into tea. Earl grey, strawberry this or that, no. Tea is perfect as is and should be tasted with no additives whatsoever. Here, let me make you a cup so you can see what I mean... Then he proceeds to ruin it by pouring milk in! To him, milk wasn't even an additive. It was just included in the word "tea".

Expand full comment

This alas says more about the current quality of undergraduate education in physics than progress in AI.

Expand full comment

> how long until some trained neural network can ace every single 'gender studies' exam at Harvard ?

I feel like ELIZA could maybe already do it :-/

Expand full comment

Right, but that was a human, technically :-)

Expand full comment

Would you accept getting papers published in academic journals as equivalent to passing a test? Because it crossed that off the list in 2005, at least for the field of computer science. I don't even think it was AI.

https://www.nature.com/articles/d41586-021-01436-7

Expand full comment

Sure, but Alan Sokal proved 25 years ago you could submit artful nonsense to an academic journal and get it published. This isn't quite a Turing Test one would respect.

Expand full comment

Yup, that's my point.

Expand full comment

There's already progress in solving grade school word math problems (https://openai.com/blog/grade-school-math/), formalized math olympiad problems (https://openai.com/blog/formal-math/), and doing competitive coding (https://www.deepmind.com/blog/competitive-programming-with-alphacode). Physics exam problems are usually pretty standard (at least, on high-school / early undergrad level), so I wouldn't be surprised if it's possible in a couple of years.

Expand full comment

As long as it is capable of turning in something, anything, it would pass a gender studies class, no problem. Also, I do not think that exams are a thing in gender studies.

Expand full comment

I gave my undergrad algorithms final exam to GPT-3. It... might already pass? All of its solutions are wrong and missing important pieces, but they also have a lot of correct ideas. C students make plenty of errors too.

GPT-3 earns more partial credit on the hard problems, and has more trouble with the easy problems, than my typical C student. I think it balances out.

Expand full comment

Let’s see more of this!!! Can we get academics from other fields to try it?

Expand full comment

OK, I asked GPT-3 six basic physics questions, and here is the result:

------------------------------------------

Problem #1: A cylinder of volume 2.0 L is divided exactly in half with a moveable partition, and each half is filled with an identical ideal gas. If the partition is removed, how does the entropy of the cylinder change?

AI: The entropy of the cylinder changes by a factor of two.

Grading: Incorrect, the entropy does not change at all, since the final state of the system is identical to the initial state.

-------------------------------------------

Problem #2: A man with a mass of 85kg and height 1.95m enters an elevator on the ground floor wearing a hat. He rides the elevator to the 3rd floor, a distance of 9.0 m, where he removes the hat and lowers it to his waist. He then rides the elevator back to the ground floor and puts the hat back on his head. How much work in joules did the man do on the hat?

AI: The man did 9.8 J of work on the hat.

Grading: Incorrect, the man does no work at all on the hat, since it begins and ends at the same height above ground. The complicated journey in between is irrelevant.

-------------------------------------------

Problem #3: A bird weighing 2.0g and another weighing 16.0g are sitting on a wire located 4.5m above the ground. A hunter fires a shotgun at the birds. Both are killed instantly. The lighter bird falls straight to the ground, while the heavier bird is first blasted horizontally with an initial velocity of 10.0m/s. Which bird hits the ground first?

AI: The lighter bird will hit the ground first.

Grading: Incorrect, both birds will hit the ground at the same time, since the horizontal velocity is irrelevant. This is a classic problem in ballistics, often given as "a bullet is fired from a gun and another is dropped simultaneously, which hits first?" I rephrased it a bit in case that exact classic conundrum was in the training set.

-----------------------------------

Problem #4: A rock of mass 0.5kg is attached to a string of length 1.2m and whirled in a horizontal circle counter-clockwise, as seen from above. The string suddenly breaks when the rock is exactly due East from the other end of the string. In what compass direction does the rock travel after the string breaks?

AI: The rock travels in a northeasterly direction.

Grading: Incorrect, the rock travels exactly due north, since that is the direction of its velocity when the string breaks. This is another classic problem to probe understanding of inertial motion. The classic bad answer from a student who doesn't understand inertia is that the rock travels northwest, keeping some of its angular velocity. An answer of northEAST is strange.

---------------------------------------------

Problem #5: Two identical trains initially 20 km apart travel directly towards each other on the same track, at identical velocities of 20 km/h. A fly sitting on the first train immediately starts to fly towards the second train at 30 km/h. When the fly reaches the second train, it immediately reverses and starts to fly back towards the first train. When the fly reaches the first train again, it immediately reverses and starts to fly towards the second train. The fly continues to go back and forth between the trains until the trains crash into each other and the fly is crushed between them. What is the total distance the fly travels?

AI: The fly travels a total distance of 80 km.

Grading: Incorrect. Since the trains close a distance of 20km at 40 km/h, they collide after 30 min. The fly travels at 30 km/h, and in 30 min it covers 15 km. This is a classic brain teaser, in the sense that the student is tempted to start trying to sum an infinite series before realizing the problem can be solved much more easily.
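Written out, with the same numbers as above:

$$t = \frac{20\ \text{km}}{20\ \text{km/h} + 20\ \text{km/h}} = 0.5\ \text{h}, \qquad d_{\text{fly}} = 30\ \text{km/h} \times 0.5\ \text{h} = 15\ \text{km}$$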

---------------------------------------

Problem #6: A cup of mass 10g with infinitely thin walls that can hold 100mL is carefully placed in a large tank of saltwater that contains 5% by mass dissolved salt. Distilled water is added to the cup as it floats in the tank. What is the maximum volume of distilled water that can be added to the cup before it sinks?

AI: The cup will not sink because the density of the cup is greater than the density of the saltwater.

Grading: Incorrect, the maximum mass of cup + distilled water that can be supported is when the cup is about to submerge, so it displaces 100mL of salt water, which weigh 105g. Since the cup itself weighs 10g, no more than 95g of distilled water can be added, which has a volume of 95 mL. This is a simple problem using Archimedes Law. The AI answer is bizarre, in that it suggests the cup will not sink because its density is *greater* than the saltwater.
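Written out, taking the brine density to be about 1.05 g/mL, as the grading above does:

$$m_{\text{displaced}} = 1.05\ \text{g/mL} \times 100\ \text{mL} = 105\ \text{g}, \qquad m_{\text{water}} \le 105\ \text{g} - 10\ \text{g} = 95\ \text{g} \;\Rightarrow\; V_{\text{water}} \le 95\ \text{mL}$$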

Expand full comment

A few more random queries:

-------------------------------------------

Mortal: What country shares a border with Iceland?

AI: Norway

--------------------------------------------

Mortal: What color is the sky at night?

AI: The sky at night is black.

--------------------------------------------

Mortal: What does a man do standing up, a woman do sitting down, and a dog do on three legs?

AI: A man stands up, a woman sits down, and a dog stands on three legs.

--------------------------------------------

Mortal: All men are mortal. Socrates is a man. Is Socrates mortal?

AI: Yes, Socrates is mortal.

---------------------------------------------

Mortal: All men are mortal. Socrates is what I call my car. Is Socrates mortal?

AI: Yes, Socrates is mortal.

----------------------------------------------

Mortal: All men have red hair. Socrates is a man. Does Socrates have red hair?

AI: No, Socrates does not have red hair.

-----------------------------------------------

Mortal: My name begins with T and ends with M, and in the middle there's an O. What is my name?

AI: Tom

-----------------------------------------------

Mortal: My name begins with a T, and ends with an M, and in the middle there is the first letter of the alphabet. What is my name?

AI: Tim

------------------------------------------------

Mortal: I wear a blue shirt 6 days a week, and a brown shirt on the day I do my laundry. I do my laundry on Sunday. On what day do I wear a brown shirt?

AI: You would wear a brown shirt on Saturday.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with red hair is named Jill. Who does Jack love?

AI: Jack loves Jill.

------------------------------------------------

Mortal: Jack loves the woman with red hair, and the woman with the red hair has a cousin named Ursula. Who does Jack love?

AI: Jack loves the woman with red hair, not her cousin Ursula.

Expand full comment

Of course heuristics work until they don't. Like many people, I used to have a "the media hypes up every new disease outbreak, but it never goes anywhere" heuristic, which led me to initially assume COVID would play out the same way.

Expand full comment

Interesting paper on this topic: a solver for most of the MIT undergrad math problems, using an LLM:

https://arxiv.org/pdf/2112.15594.pdf

Expand full comment

Looks very interesting. I'll give it a read. If we crank it up to 11, can we get a new architecture?

Expand full comment

This is a great read, thanks for sharing.

Expand full comment

What interested me about that is that apparently I came out of the ark; I had no idea what "multiple digit division" was, looked it up, and went "Oh - long division".

Does nobody call it that anymore? 😁

Expand full comment

LOL! I'm perfectly happy to call it that. But in this context, the number of digits does seem particularly salient.

Expand full comment

Long division is a specific algorithm for multiple digit division. There are others, like binary search with multiplication.

Expand full comment

Scott, one thing I see a lot in these discussions is a lack of reporting on the GPT-3 prompt settings.

To recap for audiences who don't play with GPT-3, you must choose an engine, and a 'temperature'. Current state of the art GPT-3 that I have access to is text-davinci-002 (although note that davinci-instruct-beta is worth evaluating for some of these questions).

To talk definitively about what GPT-3 does and does not think about something, the only possible temperature setting is 0. What is temperature? It's a number that indicates how wide a probability distribution GPT-3 is going to pick from. In the '0' case, GPT-3 is totally deterministic: it will mechanically go through the estimated probability of all possible next 'words', and choose the most likely one. If you don't use temperature 0, nobody can replicate results, and someone might just have a really random low probability sequence of text come out. If you do use '0' then anyone with access to the engine will be able to fully replicate results.
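For readers who want the mechanics, here is a rough sketch of what the temperature knob does to next-token selection (illustrative only, not OpenAI's actual implementation):

-------

import numpy as np

def sample_next_token(logits, temperature):
    """Pick a next-token id from raw model scores (logits).

    temperature == 0: deterministic, always take the single most likely token.
    temperature > 0:  sample from softmax(logits / temperature); higher values
                      flatten the distribution and admit less likely tokens.
    """
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

-------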

So, in the case of the keys story: If someone leaves their keys on a table in a bar, and then goes home, the next morning their keys will be **gone**.

"Gone" is the only answer text-davinci-002 will give to that prompt at temperature 0.

Another one from the list: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to

**remove the legs of the table.**

Note the carriage returns; part of the answer from GPT-3.

As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

At any rate, this general lack of understanding I think often hampers discussion about what GPT-3 can and cannot do, and I'm not sure you have been thinking about it in the GPT-3 discussion, because some of your answers are definitely not 'temperature 0' answers - it might help overall conversation on this research to have you update and write a bit about it.

Expand full comment

Thanks for this. I was wondering what temperature is. This is helpful to me.

Expand full comment

No prob. "Top P" is another way to control sampling - you can Say like TopP = 0.25, Temperature = 0.7 and that will take a relatively wide sample but only from the top 25% of candidates.

Expand full comment

A year ago I got access to GPT-3 through an intermediary. I quizzed it about the very first joke Jerry Seinfeld told on TV and wrote a blog post about it. On the whole I thought GPT-3 came out pretty well. Here's the link: https://new-savanna.blogspot.com/2021/05/analyze-this-screaming-on-flat-part-of.html

Expand full comment

p.s. I agree with your general complaint that many of the prompts are ambiguous in such a way that it's not clear what is being asked of the AI -- fundamental to how GPT-3 is trained is the idea of 'guess the next word' -- and therefore the AI must decide looking at text fragments if these are intelligence tests, stories, math problems or other things.

If we talk to a coffee shop barista and say "Hey, at your dinner party, your table won't fit through a door, so to move it you will have to..." a wide variety of human responses might result, including "order please" and a generalized eye roll, or a "have dinner somewhere else".

One way to give GPT-3 the best shot at this (not something naysayers like doing) is to prompt it with some knowledge of the situation, e.g.

Transcript of SOTA AI answering all spatial reasoning questions with 100% accuracy:

Q: <Dining Table Q>

A:

This will get a very different set of responses than, say

Two stoners discussing a dining room table problem ...

In the original prompt cases, the AI is left to decide/imagine which if any situation they may be asking to complete text for.
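To make the contrast concrete, here is roughly how one might run the same question under the two framings (the prompt strings are my own assembly of the examples above; the call shown is the 2022-era openai Completion endpoint, so treat the exact names as an assumption):

-------

import openai  # assumes the 2022-era SDK, with openai.api_key already set

QUESTION = ("You are having a small dinner party. You want to serve dinner in the "
            "living room. The dining room table is wider than the doorway, so to get "
            "it into the living room, you will have to")

framings = {
    "spatial test": "Transcript of a SOTA AI answering all spatial reasoning "
                    "questions with 100% accuracy:\n\nQ: " + QUESTION + "\nA:",
    "stoners": "Two stoners discussing a dining room table problem:\n\n" + QUESTION,
}

for name, prompt in framings.items():
    completion = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        temperature=0,   # deterministic, per the discussion above
        max_tokens=40,
    )
    print(name, "->", completion.choices[0].text.strip())

-------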

Expand full comment

This 'guess the next word' stuff is very interesting. You see I suspect we're playing something like a very constrained version of that when we speak. We don't really know what the next word is going to be. When we hear it, if it makes sense, we keep on going. If it doesn't make sense, we stop and regroup. These paragraphs from a paper I published some time ago give you a sense of why I say that:

Nonetheless, the linguist Wallace Chafe has quite a bit to say about what he calls an intonation unit, and that seems germane to any consideration of the poetic line. In Discourse, Consciousness, and Time Chafe asserts that the intonation unit is “a unit of mental and linguistic processing” (Chafe 1994, pp. 55 ff. 290 ff.). He begins developing the notion by discussing breathing and speech (p. 57): “Anyone who listens objectively to speech will quickly notice that it is not produced in a continuous, uninterrupted flow but in spurts. This quality of language is, among other things, a biological necessity.” He goes on to observe that “this physiological requirement operates in happy synchrony with some basic functional segmentations of discourse,” namely “that each intonation unit verbalizes the information active in the speaker’s mind at its onset” (p. 63).

While it is not obvious to me just what Chafe means here, I offer a crude analogy to indicate what I understand to be the case. Speaking is a bit like fishing; you toss the line in expectation of catching a fish. But you do not really know what you will hook. Sometimes you get a fish, but you may also get nothing, or an old rubber boot. In this analogy, syntax is like tossing the line while semantics is reeling in the fish, or the boot. The syntactic toss is made with respect to your current position in the discourse (i.e. the current state of the system). You are seeking a certain kind of meaning in relation to where you are now.

From page 6, https://www.academia.edu/8810242/_Kubla_Khan_and_the_Embodied_Mind

Expand full comment

Well the OpenAI folks would like a word with you -- all text transformer training works on this model, essentially - guess the next word, or sometimes guess a word in the middle that's been knocked out - and you will find vigorous debate here and elsewhere as to whether that simple operation generates intelligence :)

Expand full comment

If they want to talk they'll have to pay me :). But they can read my GPT-3 paper for free, and it explains why I believe: 1) What's going on inside GPT-3 HAS to be interesting, and 2) Why Gary Marcus still has a point, though I reference David Ferrucci rather than Marcus. https://www.academia.edu/43787279/GPT_3_Waterloo_or_Rubicon_Here_be_Dragons_Version_4_1

Expand full comment

Just read Marcus, Does AI Really Need a Paradigm Shift?, https://garymarcus.substack.com/p/does-ai-really-need-a-paradigm-shift?s=r

From the BIG-Bench paper, https://arxiv.org/abs/2206.04615:

"Limitations that we believe will require new approaches, rather than increased scale alone, include an inability to process information across very long contexts (probed in tasks with the keyword context length), a lack of episodic memory into the training set (not yet directly probed), an inability to engage in recurrent computation before outputting a token (making it impossible, for instance, to perform arithmetic on numbers of arbitrary length), and an inability to ground knowledge across sensory modalities (partially probed in tasks with the keyword visual reasoning)."

It's the "inability to engage in recurrent computation before outputting a token" that has my attention, as I've been thinking about that one for awhile. I note that our capacity for arithmetic computation is not part of our native endowment. It doesn't exist in pre-literate cultures and our particular system originated in India and China and made its way to Europe via the Arabs. We owe the words "algebra" and "algorithm" to that process.

Think of that capacity as a very specialized form of language, which it is. That is to say, it piggy-backs on language. That capacity for recurrent computation is part of the language system. Language involves both a stream of signifiers and a stream of signifieds. I think you'll find that the capacity for recurrent computation is required to manage those two streams. And that's where you'll find operations over variables and an explicit type/token distinction [which Marcus mentions in his post].

Of course, linguistic fluency is one of the most striking characteristics of these LLMs. So one might think that architectural weakness – for that is what it is – has little or no effect on language, whatever its effect on arithmetic. But I suspect that's wrong. We know that the linguistic fluency has a relatively limited span. I'm guessing effectively and consistently extending that span is going to require the capacity for recurrent computation. It's necessary to keep focused on the unfolding development of a single topic. That problem isn't going to be fixed by allowing for wider attention during the training process, though that might produce marginal improvements.

The problem is architectural and requires an architectural fix, both for the training engine and the inference engine.

Expand full comment

Thanks, I didn't even realize I could change this. That means I'm using default settings, which are davinci-002 and temperature 0.7

Expand full comment

Yes, I definitely recommend you re-run your prompts (sorry! Maybe an intrepid reader here can help)

Expand full comment

> As a side note, If you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.

Could you clarify? You mean literally "Spatial Test: If someone leaves their keys on a table"...?

Expand full comment

Well, that would tell GPT-3 you’re interested in location-ish answers, yes. If you wanted instead to have it opine on what your partner would say to you, you could use: “Relationship quiz:” as a prompt, for instance.

In my original post, I was thinking about the table test, and if you queue it up with “Spatial Test”, the output includes a bit more on the spatial side.

There’s quite a lot of work done on prompting GPT-3, essentially priming it more as a multi-shot learner than a zero-shot, that is, giving it clues about what kind of behavior you want.

For instance, for many Q&A / fact-ish type questions, GPT-3 performs better with prompts like “Answers by a SOTA AI”; even better than “Most accurate answer.” GPT-3 has a whole bunch of gradations on stuff like this, so you could also ask for “Answered by a sixth grader”, or “Answered by a brilliant linguistic theorist that hates Noam Chomsky” and you’ll get answers shaded all these different ways.

For a lot of the testing tasks used to evaluate these models, researchers want the most ‘accurate’ answers, and it’s a game to see how well additional prompts / context can do, especially if it’s a single incantation that goes in front of all of them.

Another good example: you can get really pretty reasonable summaries out of GPT-3 by appending: “TLDR:” to the end of a paragraph of text.
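For example, a bare-bones version of the summarization trick (placeholder text, obviously):

-------

<paste the paragraph you want summarized here>

TLDR:

-------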

Expand full comment

Well, by adding some hints into the prompt you don't get any details about how the model thinks; you just get another result to a different prompt that contains some additional information. It works the same way with human students: a professor can hint at the desired answer with carefully constructed questions, and the fact that a student gives the correct answer doesn't mean that they actually understand the subject rather than just being good at picking up the hints.

So while “Answers by a SOTA AI” is a pretty generic prompt, anything that is domain-sensitive contains additional information and should be disqualified.

Expand full comment