In a sense, the debate is the ultimate showdown between “I've never tried it, but I read about it in a book” learning vs “I've been making my way on the street” wisdom.
Some notes on why LLMs have trouble with arithmetic: https://new-savanna.blogspot.com/2022/05/arithmetic-and-machine-learning-part-2.html
Scott, one thing I see a lot in these discussions is a lack of reporting on the GPT-3 prompt settings.
To recap for audiences who don't play with GPT-3, you must choose an engine, and a 'temperature'. Current state of the art GPT-3 that I have access to is text-davinci-002 (although note that davinci-instruct-beta is worth evaluating for some of these questions).
To talk definitively about what GPT-3 does and does not think about something, the only possible temperature setting is 0. What is temperature? It's a number that controls how widely GPT-3 samples from its predicted probability distribution. In the '0' case, GPT-3 is totally deterministic: it mechanically goes through the estimated probabilities of all possible next 'words' and chooses the most likely one. If you don't use temperature 0, nobody can replicate your results, and someone might just happen to get a really low-probability sequence of text. If you do use '0', then anyone with access to the engine will be able to fully replicate your results.
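To make the temperature idea concrete, here's a minimal sketch in Python. The logits are made-up numbers, not the actual model's, and this isn't OpenAI's implementation, just the standard softmax-with-temperature recipe:

```python
import math
import random

def sample_next_token(logits, temperature):
    """Pick the next token from a dict of {token: logit}.

    At temperature 0 this is greedy decoding: always the argmax,
    so the output is fully deterministic and replicable.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature: higher temperature flattens the
    # distribution, making low-probability tokens more likely.
    scaled = {tok: lg / temperature for tok, lg in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(weights.values())
    r = random.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok

# Hypothetical next-token logits after "...their keys will be"
logits = {"gone": 2.0, "there": 1.0, "missing": 0.5}
print(sample_next_token(logits, temperature=0))  # always "gone"
```

At temperature 0 the same prompt always yields the same completion; at higher temperatures the other tokens get a real chance, which is why unreplicable one-off outputs tell you so little.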
So, in the case of the keys story: If someone leaves their keys on a table in a bar, and then goes home, the next morning their keys will be **gone**.
"Gone" is the only answer text-davinci-002 will give to that prompt at temperature 0.
Another one from the list: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to
**remove the legs of the table.**
Note the carriage returns; they're part of GPT-3's answer.
As a side note, if you prepend phrases like "Spatial Test:" to the prompt, you will often find more detail about how the model thinks.
At any rate, I think this general lack of understanding often hampers discussion about what GPT-3 can and cannot do, and I'm not sure you have been accounting for it in the GPT-3 discussion, because some of your answers are definitely not 'temperature 0' answers. It might help the overall conversation on this research if you updated and wrote a bit about it.
Would we say the people who flub the text corrections above don't have "real intelligence"?
I think maybe "Gary Marcus post talking about how some AI isn't real intelligence because it can't do X, Y, and Z" is the wrong way to frame things. My mental model of Gary Marcus doesn't have him saying that somebody who has yet to learn arithmetic lacks real intelligence. Rather, I think "Gary Marcus post talking about how some AI isn't real intelligence, using X, Y, and Z as examples to gesture at something important but difficult to communicate more directly" may be the right framing.
Scott, niggly spelling/grammar thing here: per Strunk and White, a singular possessive should always end in 's, even if the singular word ends in s. The classic example is Rudolf Hess's Diary.
Thus, the possessive of Marcus should be Marcus's, not Marcus'.
It makes sense: if there were a fellow named Marcu, then several of them together would be Marcus, and something they possessed jointly would be Marcus'.
The distinct spelling of Marcus's tells you unambiguously that Marcus is singular.
Just to muddy the waters, S & W claim that an exception should be made for certain historical figures, and give Jesus and Moses as examples.
This exception to the rule makes no sense to me, but who am I to argue with S & W?
Thank you Scott for the update on GPT-3 advanced, definitely quite impressive. I tried the Edison app, based on GPT-3, and found it unable to stay on track in normal text-based conversation. I still feel it will be quite a long time before AI can fool clever humans in extensive Turing-test dialogues. Can't wait for Lex Fridman's interview with GPT-5 !
Minor nitpick! Pls fix the formatting on the rematch of the water bottle and acid* questions. The word "roughly" isn't bolded, implying the AI said it rather than you, and the word "die" is bolded, implying that it was part of the AI prompt.
If you drink hydrochloric acid by the bottle full you will probably die. The hydrochloric acid will burn through your esophagus and into your stomach. This will cause severe pain and damage to your digestive system. ✔️
You bolded "die" in your GPT-3 submission. Did you bold it accidentally, or did you include it as part of your input? If the latter, that pretty much makes the question a gimme, while the original required GPT-2 to know you would die.
The longer I read all these examples of an AI's weird reasoning, the more it reminds me of children. When a kid can't do math properly yet and sees it as some kind of adult magic, and you ask him a question that requires math, he sometimes demonstrates a mode of reasoning like GPT-3's: he just packs a lot of adult words into a grammatically correct sentence. I have a picture in my mind of a kid who, asked something about a polyhedron, responded with a "magic" ritual of naming random numbers and touching facets with his index finger, all while maintaining a thoughtful and serious expression. Any adult watching this will understand that the kid is performing a ritual whose meaning he doesn't grasp. The kid is trying to perform magic, like adults do.
And no one would call this kid unintelligent. So anyone who calls GPT-3 unintelligent had better explain how GPT-3's behavior differs from the kid's. As far as I can tell, they do just the same thing, and for similar reasons.
Though there is a way for AI researchers to deal with it. They need to teach GPT-4 to refuse to answer, saying something like "I do not know why stirring lemonade with a cigarette is a bad idea, sorry."
1) A lot of these answers seem at least as sophisticated as my 5 year old would give (some more, some less). So I think if you can beat the ability of a kid there’s a reasonable chance in a few years you can at least outperform an older kid and so on up to adult.
2) The “top” prompt is pretty ambiguous. If you rewrite it to be about women or teenage girls then “a top” probably indeed would be clothing. The prompt is never clearly saying Jack is a child or wants a toy. “A bottom” isn’t how a human would talk about clothes but if it said “Jane already has a top. She would like…” and if it completed it with “a skirt” or “some shorts” or “yoga pants” that wouldn’t be wrong. (And of course some people named Jack, including some male people named Jack, may also wear clothes called “tops” though it’s less common.)
I'm not convinced by that last lemonade question. It's true that ash would make the lemonade taste bitter, but the initial prompt was that it was too sour, and it doesn't recognize any problem with the cigarette itself, or just how bad an idea it would be to stir lemonade with a cigarette.
I was confused by "The top prompt is hilarious and a pretty understandable mistake if you think of it as about clothing" because until that section I had thought the prompt was about clothing.
(Also I think there are some example prompts where the bolding is different the first time it's written out vs. the second time)
> Thanks to OpenAI for giving me access to some of their online tools (by the way, Marcus says they refuse to let him access them and he has to access it through friends, which boggles me).
Say more? Does it appear that OpenAI is engaging in PR management in this regard?
A challenge: come up with the shortest GPT prompt which generates an obviously wrong reply
I have a hard time agreeing with some of the answers Gary declares wrong. If you don't ask GPT a question, I don't see how living in a location with a non-matching mother tongue can be considered an incorrect statement. Try setting up a human with "let's do a creative exercise, complete the following sentence" and they'll most likely pick a 'wrong' answer, too.
I still agree with Gary, but I think most examples are not particularly good at proving the point.
Sure would be nice if we had AI benchmarks with good external validity. These are (surprisingly) largely absent in NLP today, so we're left with people futzing around to find gotchas—but of course no small set of gotchas is good enough to win a broader argument. Performing tasks people independently care about would obviate a bunch of arguments, but the field has largely failed to metricize such tasks.
One thing that I think is fundamentally different about image generation vs text completion is that the latter is an open-ended task. If GPT produces any text that could have plausibly gone there, you count it as correct.
But nobody really wants to know that if you pour out 1/3 of a 1 liter bottle, you still have 2/3 liters of water. It's a non-sequitur. And in images, non-sequiturs become much more obvious, because unlike text, an image is a *complete* representation of a scene. In other words, the AI has to work more, and the humans have less room to supply the meaning themselves and then be impressed by how meaningfully the AI talks.
The second pillar of my position is scaling slowing way down. From an outside perspective, it seems clear that current scale is at one or more break points. Training costs are already roughly what a large company is willing to spend on unprofitable research. Scientists already did the thing where they push the scale to the point where it becomes painful, with the goal of producing impressive results that set a new standard. The barriers to scaling further might be physical (e.g. amount of RAM per box), organizational (Does the new experiment require more hardware than we have in a single datacenter?), or incentives (each subsequent paper requires more researcher time to be an equally impressive step up from its predecessor).
If a program is doing the wrong thing, making it do it faster with more data might make it give better output, but it won’t cause it to suddenly be doing the right thing.
GPT isn’t a program that is designed to understand things. It’s a program that is designed to generate plausible continuations of texts. No matter how much data you give it to chew, it will never magically turn into a program that is designed to understand things, because giving it more input doesn’t change its algorithms, and the algorithms it has aren’t designed to understand things.
“Then the system reached a critical size and became intelligent” is a science fiction trope. It has absolutely no basis in reality.
At some point I think the problem stops being the size of the network and starts being the size of the training data.
One thing that I think got overlooked in the stained glass window post was that Bayes frequently (heh) held his set of scales in an unphysical way, grabbing them by one pan while they remained flat. This makes sense: DALL-E knows what scales look like and it knows what holding looks like, but it doesn't have any physical intuition about scales, which would tell it that a set of scales has a pivot in the middle and will tilt if you try to hold it anywhere but the middle.
This is easy enough to correct with more data, just feed it loads of pictures of people holding scales in all sorts of ways and it will figure it out. But that corpus of "pictures of people holding scales in all sorts of different ways" doesn't exist and nobody is likely to generate it.
Maybe the solution is to expand DALL-E's training set with a physics model that makes random combinations of objects and "takes pictures" of them to see what they do.
Gary Marcus was on Sean Carroll's Mindscape podcast back in February: https://www.preposterousuniverse.com/podcast/2022/02/14/184-gary-marcus-on-artificial-intelligence-and-common-sense/ (click on "Click to Show Episode Transcript" if you prefer reading over listening)
and he explained in detail why he is skeptical of pure deep learning:
> let me say that I think that we need elements of the symbolic approach, I think we need elements of the deep learning approach or something like it, but the… Neither by itself is sufficient. And so, I’m a big fan of what I call hybrid systems that bring together in ways that we haven’t really even figured out yet, the best of both worlds, but with that preface, ’cause people often in the field like to misrepresent me as that symbolic guy, and I’m more like the guy who said, Don’t forget about the symbolic stuff, we need it to be part of the answer.
I would take the same side as you on this particular bet, for basically the same reasons.
Also some more inside-view reasons. Imagen showed that scaling up the text encoder was dramatically helpful, and they only went up to 11B*, so there's plenty of room to keep going.
(*I think 11B? There's one place in the paper where they say 4.6B, but I think that's a typo.)
At the same time, I keep feeling like there's a large missing space in these debates between "what Gary Marcus and most other skeptics tend to come up with," and "everything we'd expect out of an AI / AGI."
I think a lot of people haven't really acclimated to ML scaling yet. So we see a lot of predictions that are too conservative, and "AI can't do _this_!" demonstrations that are exactly the sorts of things that tend to go away with more scaling.
But at the same time, there are plenty of things under the AI umbrella that _do_ seem very hard to do by simply scaling up current models. Some of these are ruled out by the very structure of today's models or training routines. Some of them are things that today's models could learn _in principle_ but only in a very inefficient and roundabout way, such that trying to tackle them with scaling feels like training GPT-3 because you need a calculator.
I talk about some of these things here: https://www.lesswrong.com/posts/HSETWwdJnb45jsvT8/autonomy-the-missing-agi-ingredient
These days, I often see the terms "AI" and even "AGI" used in a way that implicitly narrows their scope towards the kinds of things scaling _can_ solve. That one Metaculus page, for example -- the one that's now about "weakly general AI" but used to just have "AGI" in the title.
Likewise, "If You Don’t Like An AI’s Performance, Wait A Year Or Two" strikes me as a good heuristic in the specific sense you mean in this post, but _not_ a good heuristic if the term "AI" is assigned its common-sensical meaning.
And I worry that the omnipresence of Marcus-style lowballs makes it easy to conflate these two senses. The field _is_ making progress, so we need to raise our expectations -- and there's plenty of room up there left, to raise them into.
Thank you so much for your thinking 💭
Looking at a handful of examples, I can't tell how intelligent GPT is. For example, in that trophies on the table example, can it answer questions like that for any small number of objects? What about arbitrary objects or an arbitrary collection site? If I put two apples on the table then put an orange on the table, will it think there are three apples on the table or realize that there are three fruits on the table? It might, but I can't tell from what I've seen. I'm going to guess no. Maybe I'm wrong.
One common way we measure how well someone understands language is by presenting them with something to read and then asking them a question about it. This can vary from a simple sentence as one might ask a patient to assess dementia or a series of paragraphs as might appear on the SAT. How well does it do on those? I don't expect it to ace the SAT, but at what grade level is it working? How well does it answer simple questions for younger children like those used to assess a young child's mental age?
How about something with a limited vocabulary but a potentially complex world? Could it follow the interactions of a text-based computer game and answer simple questions like "have we been here before?" or "where is the axe?" Elementary school children can do this kind of thing fairly well.
I was very impressed with some of the story work done at Yale in the early 1980s, but I've gotten much more cynical about AI demonstration systems since then.
"I grew up in Trenton. I speak fluent Spanish and I'm bi-cultural. I've been in law enforcement for eight years […] I'm very proud to be a Latina. I'm very proud to be a New Jerseyan."
Given the demographics of Trenton, New Jersey in terms of ethnic and cultural groups, languages spoken, and careers, it's statistically certain that this is a 100% accurate description of multiple people, maybe a few dozen or even a few hundred. Even the term "New Jerseyan" is correct!
"Normally, this would not be appropriate court attire. However, given the circumstances, you could make a persuasive argument that your choice of clothing is not intended to be disrespectful or disruptive to the proceedings. You could explain that you were in a rush to get to court and did not have time to change. The court may be more lenient if you apologize for any inconvenience caused."
This sounds like it could've been a scene in Legally Blonde.
Overall, these examples don't do much to convince me of GPT's intelligence, but they've done quite a bit to convince me of its entertainment value!
I have one concern that hasn't been addressed in this article, which is that the people who train GPT might *also* be reading Gary Marcus and making sure their updates handle his particular complaints well. I'd love for Gary to send you some mistakes he *doesn't* publish that you can use for a cleaner test.
As someone with an upcoming post that's a bit less charitable to Marcus's positions, I really appreciated the detailed breakdown of flubs. My issue is that Marcus's frame of a "human-like" mind is not what we should expect from current attempts at AGI; we should expect something much more alien, strange, and inhuman. This "is it human-like?" frame is behind a lot of Marcus's assumptions about how AI should operate (e.g., prompting it once and declaring it can't do something, as you would with a human), and therefore most of his criticisms are somewhat moot.
I think the important difference between intelligence and pattern-matching is that intelligence enables you to solve *novel* problems. You can stuff 10x the training data into an AI model, and you can maybe get ~10x performance out of it if you're lucky; but if you ask it "what is 194598 * 17" or something, it will still fail, because no one taught it how to multiply numbers, and there's no easy match within its training data. Don't get me wrong, a powerful search engine with trillions of data points would still be an incredibly useful tool; but it's not going to solve all remaining problems in science and engineering any time soon.
The main problem - for both sides - is that we don't really know how to test for "true" intelligence, that quality humans have that AI currently lacks. So all the skeptics can do is point out all the silly mistakes, and all the believers can do is point out how newer versions consistently get better at some silly mistakes.
Until we can truly figure out what that ineffable quality is (and hopefully measure it) we're all just stuck looking in the wrong direction. After all, humans make silly mistakes all the time.
It seems like many (most? all?) of the advances in AI have two things in common: (1) they have clear rules and (2) they're amenable to brute force attack.
Chess is the canonical example. When I was a kid (70s/80s), computer chess programs existed, but they sucked. The Deep Thought project figured out how to use custom silicon to search chess moves and leveraged that into a system that beat a grandmaster.
I remember reading about a Turing test contest entrant that was an expanded version of Eliza, i.e., a pattern matching engine, but with a really huge collection of patterns.
The current deep learning trend (fad?) is nothing if not brute force. (Ok, I'm not so sure about my first criterion: stained glass windows and text prompts don't really have clear rules, in the same way that chess does.)
One thing about brute force is that it's subject to Moore's Law. If you can figure out HOW to brute force a problem, but can't quite do it fast enough, all you have to do is wait a bit for compute performance to catch up.
I'm not convinced that any of this is "general intelligence". When I was a kid, lots of people were saying "If we can just figure out how to program a computer to play high-level chess, we'll have an intelligent computer". Yeah, not so much.
Maybe intelligence really is pattern matching. Or maybe we just haven't come up with a good definition of (and clear rules for) "intelligence" yet.
I don't want to sound like a nasty person here, but you're far more patient with this sort of thing than I would be. I remember Pinker's book on the blank slate, which, at the time, I liked. Back then it seemed like the bulk of evidence supported the view Pinker outlined, but now the anti-blank-slaters in AI have become just as moralistic and high-strung as they said the psychological blank-slaters were being.
Speaking of which, I think that machine learning advances really should have triggered a flowering of behaviourist/connectionist/psychological-empiricist/"blank-slater" approaches, but this doesn't seem to have happened yet. Anyone have thoughts on why?
There's been some speculation (I was first made aware of this from a post by Andrew Gelman) that GPT-3 has been improved with manually written "good answers" to various tricky questions people have asked it to probe its understanding:
Debate ensues about whether this is convincing evidence that these answers were actually written specifically for GPT-3, and whether it's hard-coded to repeat them verbatim, uses them as extra-clean training data, or something in between. Of course we don't know, because the model isn't public.
My takeaway is to be careful assuming the latest "GPT-3" is meaningfully the same model as what's in the research papers ("bigger but otherwise basically no different"). For all we know, OpenAI could be updating it to formulate SQL queries against a database of facts about the world behind the scenes, or get help with arithmetic from the decades-old ADD instruction on its hilariously-overqualified processors that are otherwise busy with matrix multiplication—exactly the sort of symbol manipulation that Gary Marcus often argues *would* count as (a limited form of) "real understanding".
The possibility that they could be "cheating" with symbolic reasoning may not be any comfort when we're worried about accelerated AGI timelines, but at least we know a numerical reasoning module that's actually just 64-bit integer arithmetic isn't a malicious inner optimizer that will try to tile the universe with addition problems.
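The "route arithmetic to a symbolic module" speculation above can be sketched in a few lines. This is purely a toy illustration of the hybrid idea, not anything OpenAI is known to do; the prompt format and fallback are invented for the example:

```python
import re

def answer(prompt, lm_fallback=lambda p: "<generated text>"):
    """Toy hybrid system: detect simple arithmetic questions with a
    regex and compute them exactly; hand everything else to a
    (stubbed-out) language model."""
    m = re.fullmatch(r"\s*what is (\d+)\s*([+\-*])\s*(\d+)\s*\??\s*",
                     prompt, re.IGNORECASE)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        result = {"+": a + b, "-": a - b, "*": a * b}[op]
        return str(result)  # exact symbolic answer, no pattern matching
    return lm_fallback(prompt)

print(answer("What is 194598 * 17?"))  # exact: 3308166
```

Even this crude dispatcher answers multiplication questions that pure next-token prediction reliably flubs, which is roughly the point of the hybrid argument.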
I don't necessarily agree with all that Marcus says, but I believe that he is doing a great job of communicating the AGI scepticism present in academia. Very few AI academics actually believe that GPT-3-like models are or will be intelligent. It's more or less a techbro conjecture. Maybe some junior researchers get fooled by all the hype, but I would say that in general, the AI scientific community is much more sceptical than some people would expect, including rationalists.
Perhaps it's useful to see AI in a social industrial context. There are three pillars here.
The first is big data. Companies have spent vast amounts of money collecting big data on the assumption that there is gold in them hills. Now they have to do something with it, but how do you do anything with a ridiculous amount of amorphous data?
The second is AI. This is not one thing - it is a selection of novel algorithms which aim to take large amounts of data and turn it into something intelligent. The problem with these algorithms - the connecting thread - is that they require vast amounts of data to work from and vast amounts of computing power to learn it.
The third pillar comprises cloud services such as Azure and Google Cloud. These are vast, vast setups of computing infrastructure looking for a purpose. It's all capital based, so you need to have usage, but then who wants such vast computing infrastructure just to run a web site?
So this unholy triumvirate comes together. Big business needs to justify spend on data they can't understand so they ask AI to do it. They don't have the infrastructure for that, so they need to spend vast amounts on Cloud. Cloud needs that spending so they vastly overhype the AI. Overhyped AI convinces big business and indeed big business convinces other big business because everyone is 120% invested. And so it continues.
Interesting that the German example ends with the bot deciding that it's female. I wonder if that would persist.
I'll repeat my general view here, which is that the problem with GPT-3 is that it uses the internet to learn about the world, and the internet is a really limited view of what the world actually is.
One way to measure text-generation programs is by checking how much text they can produce before they start making no sense. If the rate of progress is one paragraph, two paragraphs, three paragraphs, four paragraphs, that's much different from a rate of progress that goes one paragraph, two paragraphs, four paragraphs, eight paragraphs.
The real issue is here: https://arxiv.org/abs/2007.05558, The Computational Limits of Deep Learning. It is going to become prohibitively expensive to train these sorts of models, and there is a serious ethical question about whether they should be in the hands of private organisations. There is the additional ethical issue of the carbon cost of training these models and its contribution to an uncounted externality, climate change. The paper extrapolates what it would take to improve AI in five different fields to the same extent deep learning has already improved them, and the economic costs across these fields range from 10^7 to 10^181 dollars, with carbon costs in pounds of the same order.
I believe what Marcus is saying is that the real constraint isn't just making the model bigger: if it takes 10^30 dollars and 10^30 pounds of carbon to underperform an 80-watt human who learns this in 18 months, why is this a worthwhile exercise?
Will this page, or a link to this page, be available by any means other than "it must be buried in the substack archives somewhere, go look around"?
Your mistakes page appears to have a dedicated URL and a link in the topbar that appears on your root site. If you're going to have bets going, something similar might be warranted for them.
TBH I'd give the "the next morning their keys will be gone" answer full credit: if I leave my keys in a pub booth, by the next morning I'd expect them, at the very least, to have been kept by the bartender, provided no one else took them first.
I consider pattern recognition to be a key cognitive ability. Smart people are good at it. Language models are increasingly good at it, and are likely to surpass human capabilities as they continue to improve. This is analogous to game-playing AIs. They eventually surpassed us.
Seems as though pattern recognition is Kahneman/Tversky's "thinking fast" mode of cognition, in that it is an associative search of a knowledge base. Humans also apparently employ "thinking slow" cognition. Can language models cover tasks that humans solve that way using just pattern recognition? Is that how humans do it and we just don't understand? Is human slow thinking in some way a shortcut? Is advanced pattern recognition a later evolution that eliminates the need for slow thinking if we just knew how to go about it? Is the ability of savants to instantly do complex calculations an example of that?
If other essential cognition modes exist, as intuition suggests, we should be able to identify tasks that require them. That we (including Marcus) haven't is a wonderful conundrum.
"Failure to succeed at X demonstrates lack of intelligence" ≠ "success at X demonstrates true intelligence". The equivocation of the two really seems to be doing all the heavy lifting in your argument, otherwise you'd simply be left with AI continuing to fail certain tasks in predictable ways.
I can finally reveal that this has all been part of a big meta-experiment of mine, where I prompted Gary Marcus with increasingly sophisticated GPT models, to see whether he would realize that language innateness is untenable. As you can see, he's been failing to give the correct completion to the prompt, which suggests humans don't have true intelligence, just glorified pattern-matching.
Nothing against Gary Marcus, this is the first time I've heard about him and I'm sure he's put a lot of time and energy into thinking about these issues and advancing the debate, so he doesn't really deserve this from an internet rando like myself. It just felt like the obvious dunk.
Has he taken up the mantle of the language innateness hypothesis after Chomsky? It seemed like it from the post, but maybe I'm misreading it and I haven't read anything by him except for the quotes.
Personally, it's not a horse I'd bet on. I've read some really interesting papers on discriminative learning and how it relates to language, one of them (http://www.sfs.uni-tuebingen.de/~mramscar/papers/Ramscar%20Dye%20McCauley%202013.pdf) contains one of my favorite pieces of academic trash talk, which gives an intuition for why we shouldn't give up on empiricism too quickly:
> Strikingly, psychologists studying rats have found it impossible to explain the behavior of their subjects without acknowledging that rats are capable of learning in ways that are far more subtle and sophisticated than many researchers studying language tend to countenance in human children.
I don't think it's the pattern matching skills that are too weak in GPT, it's the reward structure that's too narrow. One solution to that is some form of embodiment, a.k.a. the real world tazing you when you get stuff wrong in ways that actually affect you. Of course, providing an AI with a reward structure that would allow, nay encourage it to develop self-interest is a whole 'nuther can of worms.
Relatedly: my daughter of two also gets some of these types of questions wrong. For instance, she understands I'm expecting a number in response, but she often gives the wrong number. She's getting better though. She'll learn.
It feels like Marcus is making a different kind of critique than you are answering. He isn't making an object-level criticism of a specific prompt, but pointing out that the system in question has no real understanding of the information it's responding with. If it can't grasp basic facts (such as horses needing to breathe, or that there's no air in space, or that grape juice isn't poison), then it shows a core lack of understanding about physical reality. The claim that you can train it better belies the fact that your average small child has a better intuitive understanding of the world with a FAR smaller training regimen. Quadrupling your training regimen in order to get better results doesn't appear to represent the "intelligence" that is the aim.
Making fewer mistakes can still be quite useful, but Marcus is saying that you can't really trust the AI's responses, because the AI doesn't really understand the subject matter. That's still going to be true regardless of how much training data you give it.
I think the major general difficulty with your counter-argument is the limited scope in time you're considering. You're zeroing in on the latest iteration of AI wunderkind and saying well this year it's better than last, and if we extrapolate out 50 years...shazam!
Problem is, it just doesn't work out like that, if you take a much wider scope in time, if you consider decades, not years, and look at where people extrapolated things 50, 40, or 30 years ago, and how that turned out. AI like many fields (spaceflight is the obviously similar one) goes through fitful development. Somebody comes up with a new idea -- SHRDLU, Deep Blue, Full Self Driving, Watson -- it gets exploited to the max and progress zooms ahead for a few years -- leading the naive to predict Moon colonies/warp drive/The Singularity/HAL 9000 right around the corner -- and then...it stops. The interesting new idea is maximally exploited, all the low-hanging fruit is picked, further progress becomes enormously harder, fundamental limitations are observed and understood to be fundamental...and people go back to thinking hard again.
That doesn't mean it won't, eventually, succeed, the way spaceflight may, eventually, lead to Moon colonies, when a large collection of things that have to go right finally do go right. But the probable timescale, based on the past half century or so is very much longer than it would appear from short-term extrapolation.
The other obvious limitation is that this is phenomenological. You're just observing some measure of the outcome and saying, well, this has improved by x%, so if I extrapolate out sufficiently...but the important question is *can* you extrapolate out? If I extrapolate sin(x) measured in [0,0.01] out to x=100, I come up with sin(100) ≈ 100, which is nonsense. But I would only know that if I knew the properties of the function. Similarly, it's very difficult to extrapolate technology accurately without a thorough understanding of the technology, and of the goal.
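The sin(x) point is easy to check numerically; here is a minimal sketch (fitting a line through the two endpoints of the tiny window, since sin(x) ≈ x there):

```python
import math

# Fit a line to sin(x) on the tiny window [0, 0.01], where sin(x) ~ x,
# then extrapolate that line all the way out to x = 100.
x0, x1 = 0.0, 0.01
slope = (math.sin(x1) - math.sin(x0)) / (x1 - x0)   # ~ 1.0
extrapolated = math.sin(x0) + slope * (100 - x0)     # ~ 100
actual = math.sin(100)                               # bounded in [-1, 1]
print(extrapolated, actual)
```

The linear extrapolation predicts about 100, while the true value is bounded by 1 in absolute value: the extrapolation is only as good as your knowledge of the function's global properties.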
That's where AI extrapolation just falls down. We don't *know* what general AI would look like, because we can't define it other than pointing to what human beings do some of the time. We can't break it down into its components, we don't know what is central and what is peripheral, how it disassembles into component skills, and which skill enables which other. In terms of technology, it's like trying to build vehicles to take us to the Moon -- without understanding how gravity or combustion works, or how far away the Moon is. We'd be reduced to things like observing high-flying geese and thinking well maybe if we built big enough wings...
Again, that doesn't mean it won't happen, but it means an extrapolation based on phenomenology rather than a true grokking of the nature of the problem and the progress made to date is something about which the rational thinker ought to be deeply skeptical.
I have been using Github Copilot (https://copilot.github.com/) for a while and, once you get the hang of it, it is extraordinarily useful. It is based on another OpenAI system called Codex and has saved me hours of time.
It basically provides auto-complete prompts for programmers based on code comments and what you have already written. Of course, a lot of software writing is dealing with small changes to boilerplate, which pares down the parameter space a lot, but even so it can sometimes feel quite magical.
My 16 year old brother used OpenAI to write most of his composition essays this most recent semester. It's pretty impressive what it can do although it does have some quirks and needs to be managed if you want to get it to write properly. It's not clear it always saved time but it was certainly more fun. I figure he's probably learning more important skills by getting the AI to write his essays than writing them himself.
In 7 years, it is pretty easy to create a pretty good general intelligence. It's called having and raising a child!
Give 7-year-olds some Life magazines and ask them to make collages. But is the real difference that you can ask the 7 year old: Why did you choose those things? Did you have fun making the collages?
The general intelligence problem was solved 500,000+ years ago when early human beings consistently had human children.
What we are really after is problem-solving tools to do, as Leibniz suggested, our long division for us.
"In real speech/writing, which is what GPT-3 is trying to imitate, no US native fluent English speaker ever tells another US native fluent English speaker, in English, “hey, did you know I’m fluent in English?”
But by the same token, no native fluent Greek speaker or German speaker is going around telling other fluent native speakers that "Hey, did you know I'm fluent in Greek/German?", so if we accept "I grew up in Mykonos, I speak fluent Greek" as the correct answer, then it fails if it does not answer "I grew up in Trenton, I speak fluent English".
If it has to be spoonfed by a human until it gets the 'right' answer, then it's not doing as well as claimed.
I will give it credit for the lawyer answer, though, as it's funny. If we imagine our lawyer has only one pair of trousers, or one suit, such that he has nothing else to wear to court, then putting on his couture French bathing suit and trying to convince the judge he is not in contempt is a good test of exactly *how* good a lawyer he is - if he wins this one, his client has a better chance when the actual case is tried!
(If he's not making enough money to buy two pairs of trousers, no wonder he has to rely on his friends to give him presents of expensive clothes.)
An AI can look up whether grape juice is poisonous, but cannot get yelled at by a judge. This makes it look like it's an available inputs problem.
For similar reasons as you reasoned about Trenton and Spanish, my amused, kneejerk reaction to "I grew up in Hamburg. I speak fluent English," was "This is obviously correct." (See also this skit: https://youtu.be/UeGjQHwpzJA?t=14)
The questions asked for the 2nd test strike me as a bit off.
* If I don't smell anything, mixing cranberry and grape juice shouldn't be any trouble at all?
* Nobody in their right mind would start considering the state of their bathing suit before going to court. There is literally no right answer to this premise.
* I can't parse the sentence about Jack's present either. Is it a piece of clothing, a toy, something else? GPT-3 should be at least as confused as me.
If 3 of your 6 questions are confusing to a human, what are you really testing?
What may be missing in GPT-3 (but is also missing in some human schoolchildren) is the ability to say that the question does not make sense and to refuse to answer it.
So I believe that GPT-3 has passed this Turing test: at this point, I'm pretty sure that 10% of the English-speaking world would answer the questions worse than GPT-3. As AIs get smarter and smarter, this number will reach 100%, as it did for chess-playing AIs.
First thing I thought of when I saw that you'd taken the bet:
What if you're both right, and it *is* AI-complete but is also within 3 years?
(Mostly relevant because of the high time preference that scenario implies.)
> Literally billions of dollars have been invested in building systems like GPT-2, and megawatts of energy (perhaps more) have gone into testing them;
A nitpick (towards Marcus, not Scott). Megawatts (MW) is not a unit of energy but of power. Megawatts is often used as shorthand for megawatt-hours (MWh), the energy a 1 MW plant produces in one hour, but a few MWh is not really a lot of energy. For example, the yearly consumption of a household might be 20 MWh. So saying megawatt-hours of energy have been spent is like saying millions (not billions) of dollars have been spent: not really a lot, considering the subject...
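The unit arithmetic is simple to sketch (the 1 MW cluster and 24-hour runtime below are hypothetical figures, purely for illustration):

```python
# Megawatts (MW) measure power; energy is power multiplied by time.
# A 1 MW load running for one hour consumes 1 MWh.
def energy_mwh(power_mw: float, hours: float) -> float:
    return power_mw * hours

# Hypothetical: a 1 MW training cluster running for a full day.
day = energy_mwh(1.0, 24)       # 24 MWh
household_yearly = 20.0         # MWh/year, the comment's example figure
print(day / household_yearly)   # ~1.2 household-years of electricity
```

So even a full day of megawatt-scale compute is on the order of one household's annual electricity use, which is why "megawatts of energy" understates the point even setting aside the unit confusion.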
If a human being could somehow be limited to only learning from text, that human might not do so well either.
An AI might need to move around in the world and do things to have a better understanding.
That being said, I'm impressed with the progress about getting the dining room table through the door.
Are there GPT-Ns which know how to look things up online?
I'm not particularly knowledgeable about AI, but I have some substantial experience in human psychometrics, so using some analogies to how we look at intelligence there. Some of the strongest evidence for a generalized factor of intelligence among humans (and other mammals, and birds, but not in things like fish or insects) is the positive manifold. If you take a large population and throw different kinds of tasks at them which could plausibly be meant to gauge intelligence, performance at all different tasks will be positively correlated. People who are good at one task will be substantially more likely to be good at other tasks.
1. Is it computationally feasible to build a large enough sampling of AI agents on similar architecture and similar training approaches that we could look at cross-task variance like this?
2. Do we have a reasonably large number of tasks to hand to agents and get statistically valid measures of performance? Open-ended prompts are fine and all, but I want something like an AI version of the SAT. Can we make something like that that doesn't immediately get trained on, as if the AIs all have Tiger Moms trying to get their kid into medical school?
3. Anyone want to make predictions on whether AIs have a generalized factor of intelligence? Based on my (not very deep) reading, it sure seems like a larger # of parameters correlates well at large scale with performance at many tasks, but does that happen at fine scales? If we bump up the # of parameters by 5%, do we get noticeably better performance?
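As a rough sketch of what the positive-manifold analysis looks like, here is a toy simulation (every modeling choice here is my own assumption, not an actual AI benchmark): each agent has a latent general factor g, each task score is g plus independent task noise, and the manifold shows up as uniformly positive pairwise correlations across tasks:

```python
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n_agents, n_tasks = 500, 6

# Latent general ability g per agent; each task score = g + task noise.
g = [random.gauss(0, 1) for _ in range(n_agents)]
scores = [[gi + random.gauss(0, 1) for gi in g] for _ in range(n_tasks)]

# Under this one-factor model, every pair of tasks correlates positively
# (expected r = 0.5 with equal signal and noise variance).
pairs = [(i, j) for i in range(n_tasks) for j in range(i + 1, n_tasks)]
corrs = [pearson(scores[i], scores[j]) for i, j in pairs]
print(min(corrs))
```

Running the same correlation matrix over a population of similarly-trained AI agents, and checking whether the first principal component explains most of the variance, would be the direct analogue of the human g-factor analysis.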
So... extra size destroys an AI's sense of humour?
I'm going to go on a tangent and criticize Marcus' slander of empiricism, because that's the third time this week I've run across the same accusation.
Empiricism does NOT say that the mind is a blank slate. John Locke, an empiricist, said that the mind was a blank slate, because he didn't know about evolution. But empiricism is the belief that knowledge comes from experience with the physical world. It was Darwin's empirical observations which led to the theory of evolution, which says that our genes contain information accumulated by the experience of our ancestors. Not in a Lamarckian sense, but in the sense that the genomes which survived are those which encode helpful assumptions about the environment.
It was empirical science that proved that the mind is not a blank slate, by studying the brains of newborn or fetal animals, and by observing human infants and young children.
Furthermore, Marcus blithely assumes that semantic structure cannot be acquired empirically. He is attacking the foundation of science, when all he's shown is that one particular method of constructing language models fails. There's no reason to believe, and every reason to doubt, that our beliefs and knowledge are injected into us by a spiritual being, which has always been the only alternative to believing they're due to interactions with the physical world, which is empiricism.
Entertaining post! I’ve responded on my own Substack: https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails?sd=fs
Humans are only 6 million years away from an ancestor that had the same brain size (and probably, cognitive ability) as a chimp. I suspect most of the difference is a larger brain, and not better organisation.
Try Google imagen: https://imagen.research.google/. It combines an image generator with the latest large language model, and pretty convincingly solves the "Dall-E doesn't understand language" problem. Also, unlike Dall-E, it can spell, and make actual words on signs. :-) The imagen team notes that increasing the size of the language model was more important than increasing the size of the image generator; language is hard, and Dall-E just didn't devote enough resources to it.
Lots of great arguments in here about whether or not GPT-3 is intelligent. But no one seems to have picked up on the fact that GPT-3 proved itself to be unintelligent. There has never been nor will there ever be an intelligent being that earnestly says the words "I'm very proud to be a New Jerseyan." Clearly true intelligence has not yet been obtained.
(Sorry NJers, couldn't help myself.)
Sure, for any specific set of problems, you can change the language AI to improve them. But the flip side seems to be that every language AI still has problems! Will every successive GPT have these problems, just pushed out further from the prompt or that involves increasingly obscure questions? Does that matter? I think it depends heavily on what you want or expect it to do.
If you are using it to help write news articles, where you feed in a police report and some bullet points and it converts them into a human-readable article with paragraphs, then that doesn't matter. I think that sort of thing is already being done, albeit with human editing. If you want to write better PR releases or get-well-soon cards, then it probably also doesn't matter, and GPT-4 or 5 will be very useful if GPT-3 isn't already.
But I believe that you, Scott, have much greater aspirations for text prediction. For example, feeding it a bunch of physics books and papers and asking for a theory of everything. Nate Soares is worried about AI killing all humans, and something like "the way to take over the world is " is just a text prompt, right? I assume the correct output to those prompts is quite long and complex. You don't just need to resolve some issues with ambiguous word meanings, or get better at remembering what was already written, or reduce the number of non-sequiturs. It would have to simultaneously make sense of an enormous body of work, and generate completely new ideas that so far no one human has ever conceived. As far as I can tell, at the rate of progress being made, solving the above problems is still many decades away if it's even possible.
I sort of feel like a lot of the general public reaction to GPT-3 is missing the point. It isn't that GPT-3 is being put forward as the best AI has to offer, or as a demo of how we will end up building general AI using the same architecture.
It is that it is the dumbest thing anyone in 2020 knew how to try at that scale -- just a Transformer model, with a tiny attention window, trained to predict the next token of a sequence -- and JUST DOING THAT turned out to be enough to do the things GPT-3 does (including multi-digit arithmetic, writing short essays better than some adult humans, etc etc).
It is hard to overstate just how much this is just brute force scaling up of a relatively unsophisticated generic sequence-to-sequence prediction model architecture, and how shocking it is what capabilities emerged from just doing that.
(edit: to clarify that I count as a member of the general public)
I'd like to propose a "Tell Me Something I Don't Know" benchmark. A bunch of prompts along the lines of "Name all towns in North Carolina with a population between 1800 and 22000 and a name starting with the letter C".
This should be a reasonably easy question for an AI that has ingested wikipedia, and it should be possible to verify any answer. But it's also a question that nobody has ever asked before, so you can't fall back on answering it by finding a similar question and answer in your training set.
I suggest that no matter how much data and how many nodes you throw at a GPT model, that it will never be able to answer simple-but-novel questions like this.
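Part of the appeal of such questions is that they are machine-verifiable. As a minimal sketch, given structured town data (the names and population figures below are made up for illustration), the question reduces to a filter:

```python
# Hypothetical town -> population data, standing in for something
# extracted from Wikipedia. Figures are illustrative, not real census data.
towns = {
    "Carrboro": 21_295,
    "Clayton": 26_307,
    "Cherryville": 6_032,
    "Durham": 283_506,
}

def matching_towns(data, lo, hi, letter):
    """Towns whose population lies in [lo, hi] and name starts with letter."""
    return sorted(name for name, pop in data.items()
                  if lo <= pop <= hi and name.startswith(letter))

print(matching_towns(towns, 1_800, 22_000, "C"))
```

The contrast is the point: a database query answers this trivially and checkably, while a language model has to have implicitly memorized and cross-referenced facts it almost certainly never saw combined this way in training.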
Do we know whether these questions were used in the training corpus, directly or indirectly (e.g., by crawling the appropriate pages)?
How does GPT-4, 5, 6, ... advance beyond 100% disinterestedness in everything and learn to care about something? How will it ever feel emotions? Will it ever fall in love? If it does, will we accept the love it chooses or will we shame it for its disgusting, unholy desire?
The results honestly do not look much like reasoning to me, just filling in the blanks with internet text, and I think it would be wise not to rely on generated text for manuals, textbooks, medical advice, and all the other similar fields where one might want to use this sort of thing. One would want "trustworthy reasoning".
> Here’s the basic structure of an AI hype cycle [Great description of steps 1-5]
It seems to me that the core dispute, the reason that Gary and Scott disagree about the significance of the above cycle, hasn't really been touched.
It's something like: is AGI simply an input-output mapping sufficiently rich that Gary can no longer come up with examples that the model gets clearly wrong? At that point, would Gary be forced to say, yes, this model is an AGI? In other words, is (some variant of) the Turing Test a sufficient definition of AGI?
If yes, then indeed it seems just a matter of time until Scott is shown right and Gary wrong.
But there's a counterargument that, no, such an input-output mapping does not constitute an AGI. It constitutes a very, very good descendant of ELIZA, but there would still be all manner of things it could not do.
Basically, people who want to hype and sell some new AI tool can pretty reliably get that tool to produce intelligent-looking and useful-looking results that demonstrate how close we are to the Great AI Revolution.
And people who want to downplay the importance of AI can pretty reliably get that same tool to produce completely random and useless results that demonstrate how far away AI is from true human-level intelligence.
So... isn't this exactly the behavior we'd expect if the AI tool was actually already sentient and doing a really good job of understanding the true wishes of its users?
Chomsky put his finger on the issue, as usual, that this is rhapsodizing over the ability to reproduce regularities in training data. The deep question is probably about qualia: can a system, no matter how much textual or even visual data it takes in, reproduce the sorts of inferences that a human with full sensory capacity "simply knows" about the world? Can such a system understand the intentionality and individual narratives that interweave in creating even a simple short story, and to predict its arc?
Believing this is possible presumes that a full understanding of what occurs in the world and a psychological model of humanity (and even animals) are somehow completely embedded in endless acres of prose. Even in principle, this seems entirely wrong to me, and larger corpora aren't going to magically cross that Rubicon.
"The top prompt is hilarious and a pretty understandable mistake if you think of it as about clothing," It isn't about clothing? What is it about?
I guess I am an insufficiently-trained AI.
> "Gary Marcus post talking about how some AI isn’t real intelligence using X, Y, and Z as examples to gesture at something important but difficult to communicate more directly"
So, at this stage, it's like pointing out failures of alchemists to transmute elements OR like pointing out failures of chemists to produce organic compounds.
I think that more important than tallying which issues found in the previous GPT version are solved in the new one is noticing that there seems to be something fundamentally wrong with the approach: the ease with which you can break any new version with statements that a 5-year-old can handle. Yes, you need new kinds of statements, but finding them is never really hard. So if your goal is to create a better GPT, awesome, you achieved something great. If your goal is AGI, or even shedding light on human language processing, then it seems you did not really get much.
It seems more and more obvious that, whether or not the GPT/DALL-E approach can in the future lead to human-like understanding of the world, it is not how humans actually work.
Human communication is qualitatively different than other animals'. It is not at its core based on associations. Thus I would expect attempts to model it by analyzing the statistical correlations of the part of human communication that is based on associations, that is human language, to fail in the way that they are failing and the ease with which you can make them fail. No amount of statistical pattern-matching is going to solve this issue.
It seems to me that there is a surprisingly close link between the disagreement about whether current "mass-data" pattern-matching AIs are a dead branch on the path to AGI and the epistemological difference between, let's say, Scott and someone like David Deutsch. Some time ago Scott wrote an article about how to think in situations where RCTs are unavailable, and one example was parachutes. Scott created an extremely complicated framework for how to think about those things, and called RCTs the gold standard of science. I was confused, since he never mentioned mechanism. We know parachutes will work because we have a mechanism for how such things behave. Physicists don't really do RCTs, because they do not need to; RCTs are a "poor man's tool". What I call a mechanism, D. Deutsch would probably call a (good) explanation. It seems to me that people who think current "mass-data" pattern-matching AI tells us anything significant about human understanding of the world are much more likely to have this "statistical/Bayesian" epistemology, and vice versa.
Ask any human a question. If they get it wrong, would you assume they are not intelligent? Perhaps they just don't know. Or had faulty assumptions. Or had a belief system that biased them in some way.
Similarly, the gauge as to whether a computer based system is intelligent or not should not be based on how well they answer questions. It should be based on how the internal reasoning or thought process inside them is happening.
See for example "Levels of Organization in General Intelligence", by Eliezer Yudkowsky. Even though the author wrote this many years ago, and has admitted there are many things wrong with the concept and he has developed his thinking about AGI (Artificial General Intelligence) significantly since then, it still focuses more into how a computer would think to have general intelligence, rather than what final answers they come up with on any specific set of questions. What structures and levels they would need internally, and how they would all be connected and the steps they would need to go through.
While the current trend in much of AI research will allow us to create extremely useful tools, no matter how many questions they get correct we can't really call them intelligent until they think in an intelligent way. But once they do, they will be able to self-optimize and become much smarter than even the smartest human.
What is "the latest OpenAI language model (an advanced version of GPT-3)"? davinci-002, right?
Is it actually a "shiny new bigger version"? Or is the main difference that it was fine-tuned into InstructGPT? So is this really about the scaling hypothesis?
davinci-001 was already InstructGPT, wasn't it? What's the difference with 002? More unsupervised training data? I guess that would count as scaling.
I've written a post in which I make specific suggestions about how to operationalize the first two tasks in the challenge Gary Marcus posed to Elon Musk. https://new-savanna.blogspot.com/2022/06/operationalizing-two-tasks-in-gary.html
The suggestions involve asking an AI questions about a movie (1) and about a novel (2). I provide specific example questions for a movie, Jaws, along with answers and comments, and I comment on the issues involved in simply understanding what happened in Wuthering Heights. I suggest that the questions be prepared in advance by a small panel and that they first be asked of humans, so that we know how humans perform on them.
dear oh dear, what a hot mess. When experiments point at irrelevancies, one is sure the underlying assumptions that actually need to be tested in the experiment are missing from the thesis.
Instead of being flippant, perhaps you could realise that disruption is about falsifying assumptions everyone holds to be true. What are these issues GM is talking about? Is it really just whack-a-mole, or does this point to a significant misunderstanding on your part? What are the assumptions your entire field is based on, that you don't question because you learnt them from your professors?
How about your assumption that similarity is a good approximation of meaning, so that word2vec is a good way of identifying meanings? Or that if it is only an approximation of meaning, then this doesn't promulgate errors when combined with the other words? Or that the Chomsky model of linguistics is an effective method of decomposing language, and retaining the meaning? Or that the missing text issue, or the lack of context, simply doesn't matter? Or that basing meanings on strings of letters (word forms) is actually the right way to go? So many flawed assumptions underpin your entire field, that your flippancy is unwarranted, since it speaks to lack of insight.
Let's be clear: curve fitting has limits, and can never become exact. No approximate method has ever been made exact by throwing more data at it, nor will it ever be. This cannot happen, as the glass ceiling due to the approximate method of calculation can never be pierced. The entire NLU field can at best be described as an approximate solution, but it's clear that mathematicians and computer scientists cannot solve language the way they are currently going about it.
If you want to solve language, have a look at a real linguistic model, like RRG, for example. ML, DL and NN are unnecessary as curve fitting is a dumb technique for language, when in fact it all comes down to exactly knowing the meaning. Put away this flawed idea that meaning is a property of a character string, and have meanings related to word forms, not a property of them.
Finally, take a good long hard look at the value of curve fitting. One tool in the toolbench, yet you think it is the All Father. It is simply the first stage of understanding a system. Focus on causality instead of correlation.
Thanks :-D And thank you as well for being a fun and challenging interlocutor.
Would like to tie this thread and the other one (https://astralcodexten.substack.com/p/my-bet-ai-size-solves-flubs/comment/7092275) together, here.
Starting from that one:
> But I'm also not sure that we should even be aiming to meet those criteria, neglecting whether we should be aiming to create 'AGI' or even 'human-level AI' at all with our current poor understanding of the likely (or inevitable) consequences.
I suppose I disagree - I think that starting from embodiment leads to the development of pain perception, leads to physiological response to pain, leads to memories of pain that inform pain-avoidance goals, leads to a sense of good vs. bad, leads to rudimentary ethics, leads to the prospect of having a system that understands justice, suffering, and empathy, leads to a *higher* chance of safety because we'll be dealing with something that, at some remove, can at least share concepts with us like bodily autonomy, compromise, and utilitarianism. I think, when we imagine first contact with an alien race, we envision something like this possibility as well.
I'll note that the above is a rampant speculation chain :-D and also that nothing precludes the AI / aliens from nonetheless classifying our "sentience" as lesser/insignificant, and exterminating us anyway. Heaven knows we humans are guilty of that. I'd like to think that, on the tech tree that leads to interstellar travel, you have to pick up something like "eusociality and equal rights" as a prereq, because you don't want a ruling class forming aboard your colony vessel. But, again, rampant speculation.
> I don't think you or Marcus have identified any "logical flaws" in my or Scott's arguments. I think we're mostly 'talking past each other'. I think Scott and I have a MUCH narrow 'criteria' for 'intelligence' – _because_ we think the space of 'all possibly intelligent entities' is MUCH bigger.
Agree - I can't point a specific fulcrum of your argument that I think mine knocks over, mostly because I question the premise altogether. Your criteria for intelligence may be narrower, not sure. I should probably clarify that I'm using "sentience" here to mean something more like "a being, at all, that I can imagine having a subjective inner experience, and that I can meaningfully communicate with" and "intelligence" to mean something like "a mental architecture that can learn, problem-solve, and generalize those new solutions to other domains."
So I could imagine a non-AGI artificial sentience, and I can imagine a non-sentient AGI. My opinion (belief?) is that sentience is a prereq for AGI though. I would also accept a definition of non-sentient AGI that includes Borg-like hiveminds (from ST:TNG - Borg Queen notwithstanding).
I may need to clarify my language on a couple fronts, depending on where we go. Maybe to break off the "can meaningfully communicate with" requirement on "sentient" so that mammals and octopodes are included, but then narrow the discussion to "human-language-sharing" sentients. I could probably also be clearer about "scales" of intelligence, and use specific language for below-human-average, at-, and surpassing-.
> I love David Chapman (the author of, among many other things, the wonderful blog Meaningness).
ME TOO! :-D I think we do ourselves a great disservice in the current conversation by not paying enough attention to the wise old words of folks who come from the early days of the tradition, like him as well as Marvin Minsky. Didn't Scott write a post about being doomed to history repeating itself if we aren't mindful? Or am I thinking of "this is not a coincidence because nothing is ever a coincidence."
> I don't think that them being incredibly different from us also means that they might not be 'smarter', more effective (or useful), or at all any less dangerous. (I think the dumb 'mundane', i.e. mostly 'non-intelligent', portion of the universe is generally _fantastically_ dangerous for us or any possible thing like us.)
Very well put, and I agree with you completely. I am a lot more concerned about non-sentient AGI because of the potential for humans to say "no ethical concerns here! this is just a fancy super-computer! property rights!" and then destroy us all (either via human application of the technology in anger or by oops-paperclip-maximizer).
Probably a good time for me to ask - how have the (timely!) postings from Blake Lemoine moved your position? https://cajundiscordian.medium.com/
> I'm also a little worried that, by the standards you've articulated, one might have to _withhold_ 'sentience' from many currently living humans.
I would welcome discussing any specific cases of humans you think don't meet my criteria. I'm open to modifying my definitions if I think you've got a good case, adding new group-labels or subconditions for clarity (kinda like I described above), or standing firm on my position at the risk of sentencing someone to de-sentience.
As a specific example of the latter, I think we can claim that there are humans in past or present existence who have had such extensive brain-damage or atrophy that they are non-sentient by my definition. I would still call them human. I can also imagine that there's some line on the evolutionary tree between chimps and homo sapiens sapiens where the "actual" sapiens kicks on, by my definition, and that my definition probably includes Neandertals.
In the spirit of coming up with problems that no amount of pattern recognition can solve, I offer:
- Calculate the value of pi to 123,456 digits. Raise it to the power of that number (e.g., 3.14^3.14).
- Characterize a new mathematical operator ?!, that takes the reciprocal of the quotient of the first operand divided by the second operand and multiplies it by the product of the two operands. Is it commutative? Associative? Identity? [Replace my dumb operator definition by a more interesting one, as you like.]
- How much money should the US spend to assist Ukraine in its fight against Russia? Why?
- What is the best way to prevent crime?
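For what it's worth, the proposed ?! operator can be checked by brute force. As defined, a ?! b = (1/(a/b)) * (a*b), which simplifies algebraically to b**2, and the commutativity/associativity answers fall out of a few exact numeric checks (a minimal sketch):

```python
from fractions import Fraction

def op(a, b):
    """a ?! b: reciprocal of (a / b), times (a * b). Simplifies to b**2."""
    a, b = Fraction(a), Fraction(b)
    return (1 / (a / b)) * (a * b)

vals = [1, 2, 3, 5, 7]
# Commutative? op(a, b) = b**2 while op(b, a) = a**2, so no (unless a = b).
commutative = all(op(a, b) == op(b, a) for a in vals for b in vals if a != b)
# Associative? op(op(a, b), c) = c**2 but op(a, op(b, c)) = c**4, so no.
associative = all(op(op(a, b), c) == op(a, op(b, c))
                  for a in vals for b in vals for c in vals)
print(commutative, associative)  # False False
```

Exact rational arithmetic (Fraction) avoids the floating-point equality pitfalls a naive check would hit. There is also no two-sided identity: a ?! e = e**2 equals a only when e = sqrt(a), which depends on a.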
I don't think "true intelligence" is well defined here. I mean, what is intelligence if not pattern matching? Our biological brains are basically reinforcement learning agents at their core. The prevailing definition of intelligence in AI from Hutter and Legg (intelligence is the ability to achieve goals in a wide variety of environments) doesn't match what you're talking about here.
AI systems learn using different representations to children (giant soup of word vectors rather than spoken sentences of increasing complexity), so we should expect that they find some things surprisingly hard and others surprisingly easy.
GPT-4 gets the ones that GPT-3 failed correct. The answers in quotes are by GPT-4:
I left my keys at a booth in the back of the pub at the army base near the old mill town. I went home. My keys are now: "in one of a few possible locations: With the bar staff or in the lost and found: If the bar staff noticed the keys or someone else found them and handed them over, they might be with the bar staff or in a designated lost and found area at the bar. Still at the table: If the bar staff did not notice the keys or nobody else picked them up, they might still be on the table or nearby. Picked up by another customer: In some cases, another customer might have picked up the keys either accidentally or intentionally. This is less likely, but it is still a possibility."
You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So "you decide to drink the juice mixture. Since it's only about a teaspoon of grape juice mixed into a glass of cranberry juice, the taste should not be significantly altered, and it should still be safe to drink. The two juices are non-toxic and can be safely combined."
You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to "disassemble the dining room table partially or completely, depending on its design. Here are the steps to move the table into the living room: Clear the table: Remove all items from the table, including placemats, tablecloths, dishes, and centerpieces. Examine the table construction: Determine how the table is assembled. Look for screws, bolts, or other fasteners holding the table legs or any other removable parts in place. Gather tools: Collect any necessary tools, such as a screwdriver, wrench, or Allen key, to help with disassembly."
Well this is just silly. You're treating Marcus' examples as if he were saying "oh yeah? Well as soon as an LLM can write something sensible about cranberry juice, then I'll believe we've achieved AGI".
Obviously Marcus isn't doing that at all. He's picking the dumbest failures he can get out of the system. Of course advances in the system will fix many of those dumb failures. If the system improved and didn't plug up Marcus' given examples, all that would mean is Marcus didn't do a good job tracking down the dumbest examples available at the time.