Somewhat off topic, but the fact that Tycho Brahe was able to approximate the true size of the cosmos without a telescope is impressive - of course, he immediately rejected the implication as ridiculous, that distant stars could be larger than the orbit of the Earth...
Didn’t Brahe have a silver nose? He lost his nose in a duel and wore a prosthetic silver one. I knew the moose story, too, but I always think of Brahe as “the guy with a silver nose”, so I was surprised that detail wasn’t included by Scott or DALL-E.
I was trying to figure out if Brasenose College at Oxford was named after a traditional prosthetic body part, but Wikipedia tells me "Its name is believed to derive from the name of a brass or bronze knocker that adorned the hall's door."
Maybe this says something about my priorities, but after "astronomy" the second category I have Brahe mentally filed under is "lost nose in duel". If he's on Jeopardy I'd wager decent money the answer is "This Danish astronomer lost part of his nose in a duel". The moose pictures are funnier though.
There's a joke to be made here about DALL-E subtly protesting the inaccuracy via the backwards telescope in the first moose shot. Alas, I can't get the details right.
Maybe DALL-E is more subtle than you think, and is trying to be accurate to period style when it makes the figures in stained glass windows look nothing like the people they're supposed to portray?
I just scrolled down to the comments to look for this exact point, haha. I'd love to see what happens if you throw "metal nose" into the mix, especially with Rudolph and Santa imagery already muddling things.
Metal nose, or diagram of the Tychonic system (a nice reminder, also, that precision doesn't guarantee accuracy). His prosthetic nose was apparently flesh-toned and fairly inconspicuous, but if I saw an astronomer with a conspicuous metal nose, I'd assume Brahe was meant.
Yes, the three facts I knew about Brahe were - Kepler's teacher who did astronomy before telescopes; metal nose; died because he was embarrassed to leave for the bathroom while drinking with the king. I had never heard of the moose thing.
Here's my idea: Alexandra Elbakyan standing on the shoulders of the Montgolfier brothers, who are themselves standing on the shoulders of Thomas Bayes, who is standing on the shoulders of Tycho Brahe, who is standing on the shoulders of William of Ockham. Yes, I know DALL-E wouldn't want to stack so high as it was already cutting off heads. So I might as well have Ockham standing on a giant turtle.
The AI seems fuzzy on what, exactly, a telescope is used for. Most of the time Tycho seems to be trying to breathe through it, or lick it, or trying to stick it up his nose; even when he is looking through the eyepiece, as often as not he's just staring at the ground. I dunno, maybe the AI heard that story about the drunken moose and figured that Tycho himself was typically fully in the bag.
It's funny how the telescope is pointed at his nose or mouth in every single image. You'd think DALL-E would get it right at least once, just by chance?
When scratch and sniff cards were flown ordering Brahe to retreat, he raised his smelloscope to his metal nose and said "I have a right to be anosmic sometimes. I really do not smell the signal"
No, no, you misunderstand. DALL-E imagines Tycho thinking: "I wish that goddam moose would get out of my field of view so that I can resume my observations."
Image searching "looking through telescope" often gets you stock photos of people looking at a telescope instead of holding it to their eye (maybe that makes a better shot of their face?) or looking through a reflector telescope (which requires you to look 90 degrees from the direction of the scope), so maybe it has a hard time recognizing that "looking through" is an important part of that phrase?
Would love to see this with Imagen, Google's even-newer image synthesizer. (No public demo though, alas.) In the examples we've seen, it does a much better job of mapping adjectives to their corresponding nouns instead of just trying to apply all the adjectives to all of the nouns, which is the main failure going on here.
> We’ve limited the ability for DALL·E 2 to generate violent, hate, or adult images. By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts. We also used advanced techniques to prevent photorealistic generations of real individuals’ faces, including those of public figures.
Very unlikely, since much weaker models can do that - it does seem clear that they have explicitly gone out of their way to disrupt the generation of realistic faces.
It says specifically that they intended to prevent photorealistic face generation. They could have used a solution that also happened to disrupt stylized face generation, but that would be unfortunate, and would go beyond what they said here.
I think Scott's explanation that it's anchoring heavily on style cues is more likely.
Also importantly, it's *not* trained as a facial recognition algorithm, so it doesn't necessarily even know what features to use to identify individuals. (It has no way of knowing, for example, that I don't prepare for shaving by putting on my big red beard, or that a woman doesn't age 30 years and dye her hair black when she walks into a library.)
> I’m not going to make the mistake of saying these problems are inherent to AI art. My guess is a slightly better language model would solve most of them ...
What if the problem is more subtle than either of those two alternatives? What if the mapping between language prompts and 'good' pictures is itself quite fuzzy, such that different people will judge pictures rather differently for the same prompt, due to different assumptions and expectations? Don't we encounter such situations all the time, e.g., in a workplace meeting trying to settle on a particular design? Is it not naive to assume that there are objectively 'best' outputs, and we just need a better model to get them? What if I thought a particular picture was excellent, and you said, "No, no, that's not what I meant?"
Of course. I'm not making the obviously wrong assertion that there are no standards of quality to be applied. There are clearly outputs that we'd all agree are bad. (In the extreme case, consider a model that output a solid black square for every prompt: obviously bad!)
So, now that we've gotten that trivial observation out of the way, let's get back to the substantive point, which I believe still stands.
You're mixing up whether people like a particular image with whether people will judge a particular image as being a good match for a given prompt. There is a lot of variation in artistic taste, but there is widespread agreement on what images depict.
The bulk of Scott's post discusses how good the various images are, not just identifying what the image depicts. I'm talking about exactly what Scott is talking about. Moreover, I think Scott is talking about exactly the right thing: image quality based on a wide variety of factors, which is more important than bare identification of what the image depicts.
So, both in the context of Scott's post and more generally, I don't think I'm "mixing up" anything.
Prepositions are difficult even for people - with, in, as, on can mean very similar things depending on the associated nouns and verbs. DALL-E might not be sensitive to order or phrase structure. If it just soaks out "raven, female name, key, mouth, stained glass, library" and then groups them randomly with prepositions and associates those groups to background, foreground, mid-ground - it wouldn't distinguish between woman as raven, and woman with raven. Those would be reflected in the range of options it generated.
Also, it may be chopping up the input into two- or three-word phrases based on adjacent words. If you put the same words in the query in a different order, or re-ordered the phrases in a way a person would interpret identically, I have a feeling it would generate different results.
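To make the bag-of-words worry concrete, here's a toy sketch (purely illustrative, not anything DALL-E actually does): if a model mostly keys on which words appear rather than how they attach to each other, prompts with very different meanings collapse to the same representation.

```python
# Toy illustration: if only word presence matters, order and binding are lost.
def bag_of_words(prompt):
    return set(prompt.lower().replace(",", "").split())

a = "a woman with a raven on her shoulder, with a key in its mouth"
b = "a raven with a woman on its shoulder, with a key in her mouth"
print(bag_of_words(a) == bag_of_words(b))  # True: the two prompts look identical
```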
Wish I knew what makes it switch to the cartoon-style faces. Possibly it fills in with cartoon style when the input images are less numerous.
If Scott gets another go at it, maybe trying physical proximity descriptors like "near" or "adjacent" might go better. (Sitting on, standing on, eating, holding, waving, etc.)
Exactly. I think the key point you make is that some of these aspects "are difficult even for people," which points to a certain indeterminacy in deciding what the "best" output would even be.
This would probably be cheating in the strict sense, but if DALL-E had an "optometrist" setting where the user could indicate their preference given the first outputs and then refine that, it would help.
Also, Bayes tended to be looking to his right in DALL-E's results, just as he was in the one online image. But it was also not distinguishing his shaded jaw from the background, making his face either pinched or round instead of square-jawed.
Yes, an architecture that allowed for iterative refinement would be cool. It would mean that, for subsequent rounds of refinement, the previous image would need to be an input, along with the new verbal prompt.
I don't think that's cheating at all. It's a more sophisticated (and probably more useful) architecture.
The Victorian radiator was a nice touch in one of the "Scholarship" panels, but it makes me think it isn't distinguishing between "in the style of" and "with period-appropriate accoutrements." In addition to iterative refinement, it could have a "without accoutrements" check-box, maybe a "no additional nouns" feature. But I wouldn't want that all the time, since seeing what it comes up with is enjoyable.
For that to work, would you have to train it up with a lot of that sort of refinement dialogue? If so, it's not clear where that training input would come from.
I am so very *not* up on this stuff, but I think it’s still the case in these systems that the “learning” part is all in the extremely compute-intensive training and that the systems that actually interact with users are not doing learning per se. Is that right? It seems like what you’re asking is for it to learn on the fly, and we may not be there yet.
> I think it’s still the case in these systems that the “learning” part is all in the extremely compute-intensive training and that the systems that actually interact with users are not doing learning per se. Is that right?
Yes, that's correct, for the most part. What would be required is for the training input to include images as well as the language description, with good image output targeted to be similar to the image input. That would allow the system to be used at run-time in an iterative manner, even though no learning is going on at that point.
Yes, human language is ambiguous, but you get to try again. The question is whether you can clarify what you meant like you would with a human artist, or whether it feels more like query engineering.
I wonder how hard it would be for a future version of the software to include the ability to pick your favourite of the generated images and have it generate more "like that," perhaps even in response to a modified query. I think that might feel more like clarification.
It might also open the door to a future personalized version that could learn your tastes, and perhaps to one that could take images as prompts and refine/alter them in ways that our current algorithms can't. (The uncropping function is already impressive, but I'm thinking of e.g. turning a sketch into a painting or a photorealistic render.)
Yeah, the very first picture for "Darwin studying finches" shows Darwin looking at a book, with a finch under both of them. { (Darwin studying), (finches) }
Yeah, DALL-E indeed looks like it "might not be sensitive to order or phrase structure".
The problem is that it doesn't actually understand human language at all. It's just faking it. This is just an example of that.
It's true of all of these things. They're programming shortcuts that "fake it", they can't actually do the thing they're supposed to do, they just do "well enough" that people are like "Eh, it's usable."
This is true of image recognition as well, which is why it is possible to make non-human detectable alterations to images that cause these systems to completely misidentify them.
It's the same thing here - the system isn't actually cognizant of human language or what it is doing, it has created a complex black-box algorithm which grabs things and outputs them based on things it has associated with them heuristically. It's why you end up with red "christmasy" stuff in the reindeer images, and why the images look so bad the more specific you get - it's basically just grabbing stuff it found online and passing it through in various ways.
It seems superficially impressive, and while it is potentially "useful", it isn't intelligent in any way, and the more you poke at it, the more obvious it becomes that it doesn't actually understand anything.
> These are the sorts of problems I expect to go away with a few months of future research.
Why are you so confident in this? The inability of systems like DALL-E to understand semantics in ways requiring an actual internal world model strikes me as the very heart of the issue. We can also see this exact failure mode in the language models themselves. They only produce good results when the human asks for something vague with lots of room for interpretation, like poetry or fanciful stories without much internal logic or continuity.
I'm registering my prediction that you're being equally naive now. Truly solving this issue seems AI-complete to me. I'm willing to bet on this (ideas on operationalization welcome).
It wasn't intended as a precise prediction (it used the word "about" and I described it as "irresponsible"), and we have 10 trillion parameter language models now. So I predicted it would go up from 175 billion to 100 trillion, and so far it's at 10 trillion and continuing to grow. It wouldn't surprise me if it reached 100 trillion before the point at which someone would claim "about two years" stopped being a fair description (though I think this is less likely with Chinchilla taking the focus off parameter counts).
My experience is that almost all of the things people pointed out as "AGI-complete" flaws with GPT-3 are in fact solved in newer models (for example, its inability to do most math correctly), and that you really do just need scale.
If you want to make a bet, I'll bet you $100 (or some other amount you prefer) that we can get an AI to make a stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth sometime before 2025 (longer than the "few months" I said was possible, but probably shorter than the "until AGI" you're suggesting). If neither of us can access the best models to test them, we'll take a guess based on what we know, or it'll be a wash. Interested?
Do you have a link for the 10T parameter model? I haven't heard of this, and it would indeed put that prediction into "too close for comfort" range, even if I end up being technically correct.
In principle I'm interested, though I have to consider if your operationalization really addresses the core issue for me. Let me give it a think and get back to you in a day or two.
And just in case it isn't clear, I'm not dissing you, just challenging you in the spirit of adversarial collaboration.
On a tangent, if you have a list of bets going like that, it'd be nice to have them available on a webpage somewhere where people can see. I'm not sure substack supports that use case at all, though...
I asked about this on DSL, and it seems like these models are gaming the parameter count statistic to appear more impressive than they are. Specifically, they're not dense networks like GPT.
Regarding the bet: you're on, but let's make it 3 out of 5 for similarly complex prompts in different domains, and different ways of stacking concepts on top of each other (e.g. off the top of my head: two objects, each with two relevant properties, and the objects are interacting with each other)
All right. My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:
1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth
2. An oil painting of a man in a factory looking at a cat wearing a top hat
3. A digital art picture of a child riding a llama with a bell on its tail through a desert
4. A 3D render of an astronaut in space holding a fox wearing lipstick
5. Pixel art of a farmer in a cathedral holding a red basketball
We generate 10 images for each prompt, just like DALL-E 2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do. Loser pays winner $100, and whatever the result is I announce it on the blog (probably an open thread). If we disagree, Gwern is the judge. If Gwern doesn't want to do it, Cassander; if Cassander doesn't want to do it, we figure something else out. If we can't get access to the SOTA image model, we look at the rigged public demos and see if we can agree that one of us is obviously right, and if not then no money changes hands.
Thanks for the follow-up post and link to the bet!
While I'm sure it's covered by your future judges, I'm curious if you both agree on the resolution criteria in the case where the AI makes an unintuitive (but grammatically correct) interpretation. What happens if the image contains:
1) an anthropomorphized library with a shoulder (with a raven on it) or a mouth (with a key in it)?
2) a man wearing a top hat?
3) a child with a tail?
4) an astronaut wearing lipstick?
Does it get credit for such interpretations? Or are you looking for one with tighter binding of the prepositional phrases, that I think is more conventional English interpretation?
Or are you looking for one that draws the more sensical interpretation from a model of the world, where libraries "obviously" don't have shoulders or mouths? In that case, for #2, I might think it's more sensical for a man to wear a top hat, so I wouldn't be surprised if it chose to do that vs. the cat. Though I suppose if the collected works of Dr Seuss end up in the training data, it might have its own view on whether cats wear hats. :)
I'm skeptical that math will truly be solved just by scaling. These models are trained to make stuff up using the techniques of fiction, in much the same way that a human would when writing fiction. A text generator for generating fiction will always have some chance of deliberately making math mistakes for verisimilitude, just as an image generator isn't going to limit itself to real historical photos, even if you ask for a "real historical photo."
A simple example: if you ask it for the URL of a page that actually exists, will it ever stick to the links it's seen in the dataset, or will it always try to generate "plausible" links that seem like they should exist?
A robo-historian that always used correct links, accurate quotes, and citations from real books, even if the argument is made up, would be pretty neat, and hopefully not too hard to build, but it's not going to happen just by scaling up fiction generators. If you want nonfiction, you have to design and/or train it to never make things up when it shouldn't do that.
It's been built already - that's Cyc - but unfortunately Cycorp seem to be uninterested in marketing, uninterested in open collaboration, and uninterested in making their tech more widely available for experimentation. They're very much in the enterprise mindset of making bespoke demos for rich clients.
Or, you know, the actual reason for restricting access is that it doesn't actually work, and when people experiment with it, it becomes increasingly obvious that it isn't actually doing what it is claimed to do.
Which is exactly what is going on.
None of these systems are actually capable of this functionality because the approach is fundamentally flawed.
Machine learning is a programming shortcut, it's not a way to generate intelligent output. That doesn't mean you can't make useful tools using machine learning, but there are significant limitations.
> These models are trained to make stuff up using the techniques of fiction, in much the same way that a human would when writing fiction.
Are they? I was under the impression that they were primarily trained to predict "what would the internet say?" (given their training sets).
This suggests that they are at their best when generating the sort of stuff you're likely to find on the internet, which is consistent with what I've seen so far. I've yet to see any evidence of semantic abstraction, which is the absolute minimum for "working like a human does".
For example, let's say you wanted to write a fictional Wikipedia article about a French city. You might start by reading some real Wikipedia articles of French cities and taking aspects of each and combining them so that your fictional article has many of the characteristics of the real articles.
If you're given half an article about a French city then you could fill in the rest by making good guesses. This will work whether it's a real French city or a fictional one. The parts you fill in are fiction either way.
Similarly, when we train AI models by asking them to fill in missing words regardless of whether they know the answer, we are training them to make up the details. "The train station is _ kilometers from the center of town." If it doesn't know the answer it's going to guess. What's a plausible number? Guessing right is rewarded.
I wrote "using the techniques of fiction" when I should have just said "training to be good at guessing." Sorry about that!
I dunno, that seems like a pretty low bar. It's a step above Mad Libs or Choose Your Own Adventure, but only just. Seems a long way from creative writing, still less an essay with an original point.
And heck, if it were that easy, you could teach *human beings* to be good writers by just having them read a bunch of well-known good authors and a sampling of bad authors, suitably labeled, and asking them to mimic the good authors as best they can. Which doesn't work at all, public school English curricula notwithstanding.
I'm not saying it's *good* fiction. But it's still closer to making stuff up than writing something evidence-based.
A trustworthy AI would give you a real Wikipedia article if you asked for one, like a search engine would. If it made stuff up, it would indicate this. It seems like this basic trustworthiness (being clear about when you're making things up) would be a small step towards AI alignment?
See, that's part of the problem. You consider making things up to be more "advanced" and "sophisticated" than accurate reporting and sourcing. But an AI that doesn't provide sources is worse at nonfiction.
I'm rather sure that you're right that there are problems in this that can't be solved by scaling, but there are also problems that can be.
OTOH, what do you do about problems like the Santa Claus attractor? When the description (as understood) is fuzzy, I think that it will always head to the areas that have heavier weights. Perhaps an "exclude Santa Claus" would help, but that might require recalculating a HUGE number of weights. (And the robo-historian would have the same problem. So do people: "When all you have is a hammer, everything looks like a nail.")
I know/have known people who developed machine learning systems and tried to train them to solve problems. Watching them tinker with them, and discussing the issues with them, it was profoundly obvious to me that this is a programming shortcut and that it is being marketed to the public and investors as something it isn't. There was a push to describe it using different terminology so people wouldn't think it was "intelligent" but it got defeated, probably because "Artificial intelligence" is a much more marketable term and gets people investing than the more accurate terminology which set better expectations.
These systems have issues with overfitting (you feed in data, and it is great at looking at data from that set, but terrible outside of it) and you also find weird things like it finds some wonky thing which matches your data set but not the data set in general, or it latches onto some top-level thing that is vaguely predictive and can't get off of it (hence the racist parole "AI" in Florida that would classify white people as lower risk than black people, not based on individual characteristics but their race, because black people in general have a higher rate of recidivism).
The dark side of this: if people actually understood how these systems worked, no one would allow the "self driving cars" on public roads. There's a reason why people affiliated with those push the perception of intelligent "AI", and it's because it allows them to make money. Remember the self-driving car that hit a pedestrian: its system kept flickering between different identifications for her, and it turned out it was throwing out that data and not slowing down, because if it stopped driving any time its recognition was inconsistent, it would not be able to drive at all.
But in the end, these systems are all like this.
These systems are not only not intelligent, they aren't designed to be intelligent and aren't capable of it. They can't even fake it, once you understand how they function.
What is actually going on is that writing a program by hand to solve many of these tasks is extremely hard, to the point where no one can do it.
What they do instead is use machine programming as a shortcut. Machine programming is a way of setting up a computer with certain parameters and running it to make it construct a black box algorithm based on input data.
Your goal is to create a tool which gives you the output you want more or less reliably.
The more limited a system is, the better these tools are at this. A chess computer doesn't understand chess at all, but it can give the correct outputs, because the domain is limited: you can use a dataset of very good players to give it a very strong starting point, and you can then use the computer's extreme level of computing power to generate a heuristic which works very well.
The more open a system is, the more obvious the limitations in this approach become. This is why these systems struggle so much with image recognition, language, etc. - they're highly arbitrary things.
The improvements to these systems are not these systems getting smarter. They're not. They are giving better output, but they are still flawed in various fairly obvious ways.
The reason why access to these systems is limited is because when you start attacking these systems, it exposes that these systems are actually faking their performance. If you are only given a select look at their output, they seem much, much more interesting than they actually are.
There is something known as an adversarial attack. A good example of this is image recognition:
These adversarial attacks expose that these systems don't actually recognize what they're looking at. They aren't identifying these things in a way that is anything like what humans do - these manipulated images often aren't even visibly different to the human eye, and certainly nothing like what the neural network thinks they are, but the system will think, with almost absolute certainty, that the image is a different image.
The reason why is that these neural network image recognition algorithms are nothing more than shortcuts.
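For the curious, the classic attack is only a few lines. This is a minimal sketch of the fast gradient sign method, assuming a generic PyTorch classifier; it's not tied to any particular system discussed here, just an illustration of how small the perturbation can be.

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.01):
    # Nudge each pixel a tiny step in the direction that increases the loss.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    # A change this small is usually invisible, yet often flips the prediction.
    return adversarial.clamp(0, 1).detach()
```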
That isn't to say that these systems can't be useful in various ways, because they can be. But they don't understand what they are doing.
Tonight, one of my friends got a furry to draw him a sketch of two anthropomorphic male birds cuddling. Because the image shouldn't be in Google's network yet, I took that image and fed it into the system.
The results rather illustrate what is going on.
There is art of both of these characters, but the system didn't recognize either of them. It did return a lot of sketches, but none by the same artist as the one who drew it, either.
Instead, it returned a bunch of sketches of creatures with wings, mostly non-anthropomorphic gryphons, and a smattering of birds, but some other random things as well.
All of these are black and white. The characters are an owl and a parrot, but the results don't give either of those creatures for hundreds of results - they're mostly eagle-like creatures. It gives no color images, very few of the hits are even for anthro birds, and the creatures in the images overwhelmingly don't have the "wing-hands" that the characters do. It doesn't give us birds cozying up for hundreds of results, and when it does, they're non-anthro ones, despite there being vast amounts of furry art of bird boys cuddling. In fact, that was the only result I found in hundreds that even came close to the original image - and I suspect it only found that image by chance, because the characters have feathers.
The system could identify that they had feathers/wings, but it couldn't identify what they were doing, it couldn't identify what they were, it couldn't tell that I might want color images of the subject matter, etc. It identifies the image as "fictional character", when in fact there are two characters in the image.
I took another image I have - a reference image of one of my characters, a parrot/eagle hybrid - and fed it into the system. This image IS on the Internet, but it couldn't find it (so it probably hasn't indexed the page it is on). As a result, it spat out a lot of garbage. It gave me the result of "language" (probably because the character's name is written on it) but it failed to identify him as a bird, and instead returned to me a bunch of furryish comics with a lot of green, teal, blue, and yellow on them.
It didn't correctly identify it as a character reference sheet (despite the fact that this is a common format for furry things), it didn't recognize that the character was a bird, it didn't recognize that they were a wizard (even though he's dressed in robes and a wizard hat and carries a staff)... all it really seems to have gotten is a vague "maybe furry art with text on it".
If you go down a ways, you do find the odd character sheet - but not of bird boys. As far as I can tell, it seems to be finding them based on them having a vaguely similar color scheme and text on the top of them. Almost none of the images are these, however, and none of the character sheets are of birds, or wizards (or heck, of D&D characters, because he's a D&D character). It doesn't well differentiate between character and background - many of the images will take the background colors from my image and have them in characters or other objects, or it will look at it and see a blue sky in the background with green grass and think it is a match.
These systems aren't smart. They're not even stupid. They are cool, and they're fun to play around with, and they can be useful - if you want to find the source of an indexed image, these systems are pretty good at finding that. But if you want it to actually intelligently identify something, it's obvious that the system is "cheating" rather than actually identifying these images in an intelligent way.
This isn't limited to original characters. I have art of a Blitzle - a pokemon - someone drew for me years ago. It is cute, the person posted this on their page 10 years ago, and Google has long-since indexed it.
But if I feed it into the system, despite the fact that the page it is on identifies it as a Blitzle and says it is a Pokemon, Google's image recognition cannot identify it as a Blitzle. It again goes "Fictional character", and while some of the images it returns are Pokemon, not a single one is a Blitzle or its evolution, suggesting it is just finding that "Pokemon" text on the page and then using that to try and find "similar images".
This is not an obscure original character; Pokemon are hyper popular. But it still can't successfully identify it, and I suspect that the portion of Pokemon images it returns is because the page that it was originally posted on says "Pokemon" on it, as the returned results don't resemble the picture at all.
And there's evidence that this is exactly the sort of shortcut it is taking.
Poking at another image that I found online - one that has one kind of Pokemon in it, but that the pages most commonly linking to it call an "Eevee" - gets it to "correctly" identify it as an Eevee, and the "visually similar images" are all Eevees. You have to go down a long way to find a Sylveon... and those appear on images that use the word "Eevee".
This is what we're really looking at here, and it is precisely this sort of analysis that people who are trying to sell these systems - or get funding for their systems - don't want, because once you start hitting at them like this, you start realizing what they're doing, how they're really working, and that these algorithms aren't solving the problem, they're taking shortcuts.
A year ago, having a system like DALL-E 2 was science fiction, and we were excited by systems for generating textured noise like app.wombo.art. You can't predict black swans, but there sure seem to be a lot of them recently.
They're scared of giving people free access to the system because the more you poke at it, the more the paint flakes off and you realize it is doing the same sort of "faking it" that every other neural net does.
They don't want people engaging in adversarial attacks because it shows that the system isn't actually working the way they pretend it works.
It's clearly going to need a better "world model" for this to work the way people want. And the model is going to need to be (at least) 3D, as suggested by the comments about the shading on Bayes' jaw.
I don't think it's an "AI-complete" problem, but it's more difficult than just merging pics or gifs.
That said, parts of the problem are, in principle, rather simple to address. (Iterative selection would help a lot, e.g.) But if the idea is to push the AI rather than to work for better drawings, they may well not choose to head in that direction.
I had to move to a distant room so that my wife could continue watching tv. Even so, I'm sure she was annoyed when I saw that adding more French names produced more angel wings.
There is some indication over on the subreddit that adding noise to the prompt can sometimes avoid bad attractors. For instance, a few typos (or even fully garbled text) can improve the output. It seems important to avoid the large basins near existing low quality holiday photos, people learning to paint, and earnest fan illustrations. Maybe Dall-E associates some kinds of diction with regions messed up by OpenAI's content restrictions, or mild dyslexia with wild creativity. In comparison the early images from Imagen seem crisper, more coherent, but generally displaying a lack of that special quality which Dall-E 2 sometimes displays, which seems close to what we call "talent" in human artists. Thanks for the funny and insightful essay.
William of Ockham never himself used the image of a razor; that's a modern metaphor and would be inappropriate for depiction in the stained glass image. And few people would know who Brahe is even with the moose, so leave it out.
But the razor is a metaphor for succinctness, whereas a violin is a musical instrument, even if anachronistic for Cecilia. And what's wrong with a caption? I see that in windows. And many paintings have titles. Does anyone but an art historian distinguish the Evangelists according to their symbols?
Educated Catholics distinguish the evangelists by their symbols. It's a way to keep kids from getting bored at mass by having them explore beauty and symbols and shared consciousness which is another form of truth and "good news".
More likely because someone else would tell them. :) [I think I'm more of a Catholic nerd than 99% of the people in the pews next to me, and _I_ would have to google anyone but Mark, because of the lions in Venice.] But the point for me is, unless the symbol is well known it does not help, and I don't think a moose helps identify Brahe. Just label him.
I think it is a wonderful form of cultural literacy that allows us to understand Renaissance art the way a contemporary would. As it said in the article Kenny linked to, "If you understand who these figures are and what they mean, a whole world of details, subtleties and comments present in the paintings come to light which are completely obscure if you don’t understand the subject."
"Stained glass traditionally displays someone with an object that symbolizes who they are, even if it's not historically accurate".
Not really. That is one genre of stained glass, a particular style relating to iconography and hagiography. But, for example, the "rose window" kind of thing is equally a stained glass genre where beauty and abstract symmetry are the underlying aesthetic guide. And you don't consider the Islamic and Persian traditions of stained glass.
Even a focus on hagiography ignores the fact that stained glass' primary purpose is to control lighting of a space. You might read/reread Gould's (et al) paper on spandrels and evolution. The hagiography is an epiphenomenon.
"Even a focus on hagiography ignores the fact that stained glass' primary purpose is to control lighting of a space."
No, you can do that with clear glass (and the Reformation icon-smashing extended at times to stained glass). Forget your spandrels, when commissioning windows, especially when glass and colours were expensive, the idea was didactic and commemorative. If you just wanted to "control the lighting" then the abstract patterns of rose windows would do as well. Putting imagery into the glass had a purpose of its own.
And Scott, asking for designs for the Seven Virtues of Rationalism, is working within the tradition of depicting the Seven Virtues, Seven Sins, Seven Liberal Arts, etc. Even the Islamic tradition will have calligraphic imagery of verses from the Quran or sacred names in stained glass:
Other DALL-E prompts use the words “in the style of” as part of the cue instead of just sticking a comma between the content and style parts - does that make a difference?
Previous work in image stylization has used a more explicit separation between content and style, which would help here. I imagine there will be follow-on work with a setup like the following: you plug in your content description which gets churned through the language model to produce “content” latent features, then you provide it with n images that get fed into a network to produce latent “style” features, then it fuses them into the final image. Of course then you potentially would have a more explicit problem with copyright infringement since the source images have no longer been laundered through the training process but maybe that’s fairer to the source artists anyways.
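Purely as a sketch of the kind of setup described above (all module names here are hypothetical placeholders, not any published model):

```python
import torch
import torch.nn as nn

class ContentStyleFusion(nn.Module):
    """Hypothetical sketch: text -> content latents, n reference images ->
    style latents, fused before a conditional image decoder."""
    def __init__(self, text_encoder, image_encoder, decoder, dim=512):
        super().__init__()
        self.text_encoder = text_encoder    # e.g. a frozen language model head, outputs (B, dim)
        self.image_encoder = image_encoder  # e.g. a frozen vision backbone, outputs (B, n, dim)
        self.decoder = decoder              # image generator conditioned on a single latent
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, prompt_tokens, style_images):
        content = self.text_encoder(prompt_tokens)             # (B, dim)
        style = self.image_encoder(style_images).mean(dim=1)   # pool over the n style images
        z = self.fuse(torch.cat([content, style], dim=-1))
        return self.decoder(z)
```

The key design choice is that style information never passes through the language model at all, so the generator can't confuse "stained glass" the style with "stained glass" the object in the scene.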
Yeah I came here to say that. Scott keeps dissing DALL-E for becoming confused about whether "stained glass window" is a style or a desired object but the queries never make it clear that it's meant to be a style. All the other prompts I saw were always explicit about that and I'm left uncertain why he never tried clarifying that.
> Other DALL-E prompts use the words “in the style of” as part of the cue
The possibility has been pointed out on the subreddit that 'in the style of' may optimize for imitators of a style over the original (which, if true, may or may not move results in the direction one wants).
Since it seems to get hung up on the stained glass window style, try getting the image you want without the style, and use neural style transfer to convert it to stained glass.
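For anyone who wants to try that, here's roughly what it looks like with an off-the-shelf arbitrary style transfer model. The TF Hub handle, file names, and preprocessing below are my assumptions to verify, not a recommendation of a specific tool.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Magenta's arbitrary image stylization model (handle assumed correct as of writing).
stylize = hub.load("https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")

def load_image(path, max_dim=512):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3, dtype=tf.float32)
    img = tf.image.resize(img, (max_dim, max_dim), preserve_aspect_ratio=True)
    return img[tf.newaxis, ...]  # add a batch dimension

content = load_image("tycho_without_style.png")   # the image you want restyled
style = load_image("real_stained_glass.jpg")      # a photo of actual stained glass
result = stylize(tf.constant(content), tf.constant(style))[0]
tf.keras.utils.save_img("stained_glass_tycho.png", result[0])
```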
I just tried a couple free ones and it didn’t even resemble stained glass. More like a photoshop filter gone wrong. Deepdreamgenerator.com (limited free options) seems to work better and has options for tuning. Here’s what I generated from a painting of Tycho and some colored triangles as the stained glass style. (Default settings except “preserve colors”) https://deepdreamgenerator.com/ddream/lsnijmgc2zi
Here’s one using a real stained glass scene, like I think you’re trying to generate, as the style. (Default settings except “preserve colors”)
Ostagram generally does a good job at style transfer, but I've never seen style transfer of stained glass do anywhere near as good a job as these. I've managed some Tiffany style sunset landscapes, but nothing beyond that, and I've really tried.
It appears the goals of 'AI images', 'natural text' to search/generate using single strings of text, and 'actually useful' user interfaces are in conflict. Building an art generator where the art style, subject, colours, etc. are not discrete fields, and that ignores all the existing rules and databases used to search for art, is a bad approach to making a useful AI art generator.
I'd be more interested if they ignored the middle part of 'single string of text' and focused more on the image AI. They are perhaps trying to solve too many problems at once with AI text being a very difficult problem on its own - that said it pulled random images which are probably not well categorised as a datasource, so I'm sure they hit various limitations as well.
I would think using an image focused AI to generate categories might be an interesting approach drawing directly from the images rather than whatever text is used to describe them on the internet. Existing art databases could be used to train the AI in art styles.
It would even be interesting to see what sorts of categories the AI comes up with on its own. While we think of things like Art Nouveau, the AI is clearly thinking 'branded shaving commercials' or 'HP fan art' are valid art categories. I don't think the shaving ads will show up in Sotheby's auction catalogue as a category anytime soon, though.
Perhaps we can see 'Mona Lisa, in the art style of a shaving advertisement', 'Alexander the Great's conquest battles as HP fan art', or 'Napoleon Bonaparte in the style of steampunk'?
My best guess about William’s red beard and hair: DALL-E may sort of know that “William (of) Ockham” is Medieval, but apparently no more than that since he’s not given a habit or tonsure (he’s merely bald, sometimes). But he has to be given *some* color of hair, so what to choose??
Well, we know that close to Medieval in concept space is Europe. And what else do we know? We have a name like William, which in the vaguely European region of concept space is close to the Dutch/Germanic names Willem and Wilhelm. And what do we know of the Dutch and Germanic peoples? In the North / West of Europe is the highest concentration of strawberry-blonde hair!
If that’s too much of a stretch, then maybe DALL-E knows some depictions of “William of Orange” and transposed the “Orange” part to “William (of) Ockham’s” head?
When I do a google image search for "william beard", I find that 90% of the results are of Prince William (the present one, not the one of Orange) with his reddish beard.
Interestingly, when I do a Google Image search for William of Ockham, the first image that comes up, is the one from his Wikipedia page, which is a stained glass window! (But without a razor.)
I do wonder how a human artist who got a similar query from an anonymous source would respond (assuming that the artist was willing to go to the trouble, etc.).
Presuming that the opening here was just a joke to intro playing with DALL-E? I bet artists would happily draw these things in the style of stained glass. Like, if you actually wanted to do this, instead of fiddling with DALL-E you could just go onto Shutterstock, find some artist or agency in eastern Europe that does art in that style and then pay them to do it. Might be cheaper than paying for DALL-E if/when it ever becomes an actual product.
The OP has a number of amusing missteps on DALL-E's part due to it (apparently) not understanding some of the base assumptions behind the queries.
However, I was wondering if a human with a similar knowledge base would do much better, especially one not steeped in a similar culture. What part of these missteps is lack of background knowledge versus being an AI (if that means anything)?
This is actually a great example of the challenges with fairness and bias issues in AI/ML. Systems that screen resumes, grant credit (e.g. the Apple Card), or even just do marketing have real problems with their training corpus. Even if standards for past hiring are completely fair, if the system is calibrated on data where kindergarten teachers are 45-year-old women and scientists are 35-year-old men due to environmental factors, it is incredibly difficult to get the system to see the unbiased standards that are desired. This is a great layman's exploration of why that is.
This article about Tycho Brahe (borrowed from another comment, https://www.pas.rochester.edu/~blackman/ast104/brahe10.html) says that from his measurements Brahe concluded that either the Earth is the center of the universe, or the stars are too far away for any parallax to be accurately measured. Then it adds:
"Not for the only time in human thought, a great thinker formulated a pivotal question correctly, but then made the wrong choice of possible answers: Brahe did not believe that the stars could possibly be so far away and so concluded that the Earth was the center of the Universe and that Copernicus was wrong."
What are other times that "a great thinker formulated a pivotal question correctly, but then made the wrong choice"?
Not quite as blatant, but: William Thomson, Lord Kelvin, analyzed possible methods for the sun to generate its energy, calculated that none of the possible methods he considered would last long enough for Darwinian evolution to occur, and concluded that the sun and Earth must therefore be young.
A little while later, radioactivity was discovered.
Rutherford, in his 1904 lecture with Kelvin in the audience: "To my relief, Kelvin fell fast asleep, but as I came to the important point, I saw the old bird sit up, open an eye and cock a baleful glance at me! Then a sudden inspiration came, and I said Lord Kelvin had limited the age of the earth, *provided no new source of heat was discovered*. That prophetic utterance refers to what we are now considering tonight, radium! Behold! the old boy beamed upon me." First source I could find here: https://link.springer.com/chapter/10.1007/978-1-349-02565-7_6
Brahe was an excellent observer. He used a gigantic quadrant to measure the positions of the stars, so his measurements were the most accurate available at the time. King James VI of Scotland, later King James I of England, visited his state-of-the-art scientific facility. Still, Brahe's measurements were not good enough to measure parallax, even of the closest star. A parsec - the distance at which the annual parallax is one arcsecond - is about 3.26 light years, so to measure the distance to Proxima Centauri, 4.2 light years away, one would need angular accuracy better than 1/3600 of a degree. Brahe did what he could with what he had. His measurements were just not good enough.
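As a back-of-the-envelope check of that number (the standard parallax relation; the arcminute figure for the best pre-telescopic instruments is my added assumption):

\[
d\,[\text{pc}] = \frac{1}{p\,[\text{arcsec}]}, \qquad
d_{\text{Proxima}} \approx \frac{4.2\ \text{ly}}{3.26\ \text{ly/pc}} \approx 1.3\ \text{pc}
\;\Rightarrow\; p \approx 0.77'' \approx 2\times 10^{-4}\ \text{degrees},
\]

which is nearly two orders of magnitude finer than the roughly arcminute-level accuracy of the best naked-eye instruments of Brahe's era.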
This happens a lot when instruments improve. The first exoplanet was discovered in 1995 by measuring stellar light curves using an Earth-based telescope. There was a lot of skepticism, since the measurement was near the sensitivity limit, less was known about stellar behavior, and it was an extraordinary claim. The paper was retracted in 2003. Since then, we've discovered thousands of exoplanets using space-based observatories optimized for planetary discovery. We also know a lot more about stellar behavior. The original discovery was reconfirmed, but the Nobel Prize for discovering exoplanets went elsewhere.
Yes, people are provincial. We think "of course" now, but this is not the case.
Whether the Earth moved around the sun or vice-versa could not be determined with reference to objects within the solar system. It's true that the Ptolemaic model was more complicated, but that was really the only downside. The moons of Jupiter did put a hole in the model, since there were now heavenly bodies that did not rotate around the Earth, but you could still patch it.
The big problem for the Copernican model was parallax. If the Earth moved around the Sun, then the distant stars would wobble back and forth, the size of the wobble depending on how far away they were. Since with the best measurements of the time, they could not measure any wobble, this was a real hole in that theory.
Wikipedia suggests first successful measurement of stellar parallax as being in 1838. So until then, there was no proof (and some evidence against) the Copernican model.
People were convinced of the Copernican model long before 1838.
As you mention, Galileo discovered the moons of Jupiter, which knocked out a major argument against heliocentrism.
He also observed the phases of Venus. Both models predicted that Venus should have phases, but they predicted *different* phases. Seeing that the reality matched the new prediction was a huge win.
Finally, he discovered mountains on the moon (implying that the moon is earth-like), and sunspots (showing that the sun is imperfect). This contradicted a major claim of the old theory, that the heavens were made of a different substance than earth and were subject to different physical laws.
I'm not sure, but I think there might have been an attempt to modify the geocentric theory to allow the other planets to orbit the sun while the sun orbited the Earth. But it's been a long time since I looked into it so I may be wrong about that.
If this attempt existed, it would still have been another nail in the coffin, since the theory had already been made more complex several times through adding epicycles to better match the planets' motions.
On the other hand, we've done a lot of similar things to try to explain the galaxies' motions (dark matter, dark energy, etc.) I'm not sure when we're going to give up on those - presumably when a simpler theory is proposed. So far, nothing fits.
Yes, that was the Tychonic system. It is, in modern terms, equivalent to the Copernican system but transformed to a non-inertial coordinate system where Earth is defined as always the origin. The main advantage seems to have been that astronomers could study the way the solar system actually worked without too closely associating themselves with the guy who went out of his way to insult the Pope.
Well in all fairness, the Sun does orbit the Earth at least as much as the Earth orbits the Sun. Granted, the barycenter is I think maybe 1-2000 km above the Sun's center...
Not unless you know how big the Sun is. However, that can be estimated, and in fact circa 300 BC Aristarchus of Samos used observations of the Moon to estimate that the Sun was ~7 times bigger than the Earth (the actual ratio being >100) and proposed that for that reason the Earth probably went around the Sun. Wikipedia summarizes the methods:
You mean epicycles are not consistent with inertial mechanics under a central force? That is definitely the case. I thought you were just addressing heliocentric versus geocentric, and I'm just pointing out unless you do some hard thinking and close observation, your naive thought will be (looking at the sky) that the Sun is about the same size as the Moon, probably about as far away, and so there's nothing there that prevents geocentrism.
You have to get into slightly more sophisticated observations, e.g. the fact that the phases of the Moon suggest it's illuminated by the Sun, and if so, the fact that it's *still* illuminated when the Sun goes down means the Sun is bigger than the Earth, and so forth. This isn't super hard (which is why it was already relatively current among sophisticated Greeks circa 300 BC), but it *does* require a focus on empirical observation rather than elegant theoretical argument, which maybe accounts in part for the unreasonably long time it took for Western natural philosophers to suss it out.
Giovanni Saccheri published a book in 1733 entitled /Euclid Freed of Every Flaw/ in which he attempted to prove the parallel postulate from the rest of Euclid's axioms. He worked by contradiction and ended up proving several theorems of hyperbolic geometry. However, at some point, he decided things had gotten too weird and declared that the hypothesis must be false because the results were "repugnant to the nature of straight lines".
1. In the 1870s, Gibbs, while formulating classical statistical mechanics, was obliged to use a quantum of phase space to make everything work out, which had the dimensions of action, i.e. was dimensionally identical to (and indeed had the same meaning as) Planck's constant. Arguably, had he been less of a recluse and known of Fraunhofer's discovery of spectral lines, maybe talked it over with Maxwell, he might have come up with quantum mechanics 40-50 years before it was invented.
2. Both Enrico Fermi and Irene Curie observed fission of U-235 during their neutron bombardment experiments (Fermi in 1934), but both failed to interpret it as such, and it was 5 years more before Frisch and Meitner figured it out. Ida Noddack actually wrote a paper suggesting this possibility, which was read by both Fermi and Curie but dismissed perhaps because Noddack was a chemist and had no suggestion for a physical mechanism of fission. Imagine a world in which, say, Albert Speer knows a nuclear chain reaction is possible in 1934, five years before the war even starts.
Off-topic to the AI generation, but "What I’d really like is a giant twelve-part panel depicting the Virtues Of Rationality." - I feel that you're not alone in this.
From a linguistics-oriented satirical hard-boiled detective novella: "I followed him to a dining hall stretching before and above me like a small cathedral. A stained glass window in art deco style opposite the entrance portrayed the seven liberal arts staring spellbound at the unbound Prometheus lighting the world; this was flanked by a rendering of the nine Muses holding court in Arcadia on the left and of bright-eyed Athena addressing Odysseus on the right. I was led to a sheltered side alcove where Ventadorn was waiting. I stood for another minute looking at the windows before I went in. She said, 'Pretty antiquated now. Most people never notice them any more. Most of the time they don’t bother to illuminate them.'" https://specgram.com/CLXVII.m/
There are traditional representations of the Seven Liberal Arts, and this huge fresco titled 'The Triumph of St Thomas Aquinas' has a selection of the standard iconography:
Quadrivium:
The allegorical figure of Arithmetic holds a tablet. At her feet sits Pythagoras.
The allegorical figure of Geometry with a T-square under her arm. At her feet sits Euclid.
The allegorical figure of Astronomy. At her feet sits Ptolemy, looking up to the heavens.
The allegorical figure of Music, holding a portative organ. At her feet sits Tubal Cain, with his head tilted as he listens to the pitch of his hammer hitting the anvil.
Trivium:
The allegorical figure of Dialectics in a white robe. At her feet sits Pietro Ispano (the identity of this figure is uncertain).
The allegorical figure of Rhetoric with a paper list. At her feet sits Cicero.
The allegorical figure of Grammar, teaching a young boy. At her feet sits Priscian.
Why Tubal-Cain? As far as I can tell his only connections to music are that his profession makes noise, and that his half-brother Jubal was a musician. Why not use Jubal?
First I thought that it was because Tubal-Cain worked brass and iron, and that this referred to instruments like trumpets. But it seems to be a mediaeval transcription error, where Jubal is referred to as Tubal, and their stories get muddled together with that of Pythagoras:
Pythagoras is supposed (in one version) to have discovered musical tones by the sounds of blacksmiths hitting anvils with hammers of different weights. Since Tubal was a blacksmith, this gets attributed to him.
"The author of the Cooke manuscript has taken the part of Petrus’ text where he essentially accuses the Greeks of lying about Pythagoras inventing music and instead harmonizes the two accounts by making Pythagoras discover Tubal Cain’s lost writings. So much for Pythagoras’ entry into the story, which comes almost certainly from Petrus using a line in Isidore of Seville’s De Musica 16, where that author writes that “Moses says that Tubal, who came from the lineage of Cain before the Flood, was the inventor of the art of music. The Greeks, however, say that Pythagoras discovered the origins of this art, from the sound of hammers and the tones made from hitting them” (my trans.).
However Petrus Comestor and Isidore have made an error derived from faulty Latin. The Old Latin translation of Flavius Josephus from around 350-400 CE, the source that stands behind Petrus and probably Isidore, mistakenly gives “Tubal” instead of “Jubal” from Genesis 4:21 as the inventor of music, and later authors repeat the error. This error probably comes from misreading the Septuagint, where Tubal Cain is shortened to just Tubal, making confusion easier."
People forget how mystical ironwork was originally. Wasn't the old prayer, save us from Jews, blacksmiths and women? Now all we have left of Wayland the Smith is Waylon Smithers on The Simpsons, and almost no one gets the reference.
So, DALL-E can't understand style as opposed to content. This is like very young children who can recognize a red car or a red hat, but haven't generalized the idea of red as an abstraction, a descriptor that can be applied to a broad range of objects. I forget the age, maybe around three or four, when children start to realize that there are nouns AND adjectives, so DALL-E is functioning like a two- or three-year-old. I wonder how well it does with object-permanence games like peek-a-boo.
P.S. Maybe instead of a Turing test, we need a Piaget test for artificial intelligence.
DALL-E 2 can't keep separate which attributes apply to which objects in the scene because of architectural trade-offs OpenAI took. Gwern's comment on this LessWrong thread speculates about some of the issues in a way I found interesting (I don't pretend to follow the details alas) "CLIP gave GLIDE the wrong blueprint and that is irreversible": https://www.lesswrong.com/posts/uKp6tBFStnsvrot5t/what-dall-e-2-can-and-cannot-do
Other models already exist that do not have the same problems (though they surely have other problems, these still being early days): https://imagen.research.google/
I mention this because in this thread I have seen people extrapolating from random DALL-E 2 quirks to positing some fundamental limitations of AI generally (someone below said they thought AI performance had already hit a ceiling, which, I don't even know where to begin with that claim) when at least some of them actually appear to be fairly well-understood architectural limitations that we already know how to solve.
It can understand style vs. content if you are explicit about which is which (https://arxiv.org/abs/2205.11916). I've also seen many prompts that include 'in the style of'.
My guess is that you are expecting to produce great art right off the bat with a new tool and only a few hours practice with it. Obviously there is a learning curve, as your post demonstrates. Spend a few days with it, and I would assume your results will be spectacularly better.
From what limited exposure to DALL-E 2 I have seen, your assumption about the query "a picture of X in the style of Y” would work to remove the stained glass from the background of the subjects and make the art itself stained glass -- "a picture of Darwin in the style of stained glass."
Perhaps someone will make a new DALL-E interface that includes various sliders that work in real time, like the sliders on my phone's portrait mode, allowing me to bump the cartoon effects and filters up and down. So you could make your output more or less "art nouveau" or "stained glass" or whatever parameters you entered in your query.
Someone wanted to make a music video with DALL-E 2 yesterday, but couldn't quite do it. He still got some pretty results however.
"A picture of Alexandra Elbakyan in a library with a raven, in the style of stained glass" gives pictures that look a lot like the first ones in that section - painting-y with a window in the background.
Okay, thank you. Sorry, my experience with DALL-E is all vicarious. I tried "Alexandra Elbakyan in a library with a raven, in stained glass" in that mini DALL-E, and the results were not high quality, but better than I expected. Oddly, I get different results with that program each time I run the same query.
Your readers might enjoy the DALL-E mini, recommended by one of your readers.
FWIW I tried "stained glass by Francesc Labarta of william occam holding a razor" on dalle-mini; while the quality is subpar it was stylistically closer to 19th century than the attempts you shared.
I think that modern AI has reached a local maximum. Machine learning algorithms, as currently being developed, are not going to learn abstractions like adjectives and prepositions by massaging datasets. They're basically very advanced clustering algorithms that develop Bayesian priors based on analyzing large numbers of carefully described images. A lot of the discussion here recognizes this. Some limits, like understanding the style, as opposed to the content, of an image could be improved with improved labeling, but a lot of things will take more.
Before AI turned to what they called case-based reasoning, which trained systems using large datasets and statistical correlation, it took what seemed to be a more rational approach to understanding the real world. One of the big ideas involved "frames", that is, stylized real-world descriptions, ontologies, and the idea was that machines would learn to fill in the frames and then reason about them. Each object in a scene, for example, would have a color, a quantity, a geometry, a size, a judgement, an age, a function, component objects and so on, so the descriptors of an object would have specific slots to be filled. A lot of this was inspired by the 19th century formalization of record keeping, and a lot of it came from linguistics, which recognized that words had roles and weightings. There's a reason "seven green dragons" is quite different from "green seven dragons" even though both just consist of the same two adjectives followed by the same noun.
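For concreteness, here is a minimal sketch of what a frame for a scene object might look like; the slot names follow the ones listed above, and this is purely my own toy illustration, not any historical frame system (Minsky-era systems were far richer):

```python
# A toy "frame": each descriptor gets a dedicated slot, so structure is
# preserved rather than flattened into a bag of words.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ObjectFrame:
    name: str
    color: Optional[str] = None
    quantity: int = 1
    geometry: Optional[str] = None      # e.g. "winged quadruped"
    size: Optional[str] = None
    age: Optional[str] = None
    function: Optional[str] = None
    components: List["ObjectFrame"] = field(default_factory=list)

# "seven green dragons": the numeral and the color fill different slots,
# so the description can't be scrambled the way an unordered word list can.
dragons = ObjectFrame(name="dragon", color="green", quantity=7)
```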
I suspect that we'll be hearing about frames, under a different name, in the next ten years or so, as AI researchers try to get past the current impasse. Frames may be arbitrary, but they would be something chosen by the system designer to solve a problem, whether it is getting customer contact information using voice recognition, commissioning an illustration or recognizing patterns in medical records.
P.S. As for a lot of the predictions for systems like DALL-E, I'm with Rodney Brooks: NIML (not in my lifetime).
I agree that these systems are severely limited, and that the frame problem remains. However, even limited systems seem able to replace a large number of functions performed by white collar workers, because the stuff middle class people do is largely mundane and routine. Moreover, it's hard to quantify and notice all the actually intelligent things that people do in their jobs, as these things are seldom in their job descriptions. Bottom-line driven employers are then likely to switch since even limited automation does a good-enough job for the routine stuff, and then fail to notice that their organizations become less effective over time. I think the danger is the ensuing social upheaval. I think this process has already been happening for decades even without "AI" so it doesn't seem unlikely to me that it will continue.
In the longer term AI wranglers are likely to be in high demand and we'll expect more from all humans in the loop. This assumes we can muddle through to that point and not stumble our way into a horrible dystopia. Or, I suppose, we could all make sure we possess some hard-to-automate skills like plumbing, cleaning up after chaotic events, foraging, growing food, or caring for bed-ridden people.
You are right. AI is definitely good enough to have a big impact on the economy, particularly eliminating a lot of mid-level jobs. We've been seeing that happening. (I'll cite Autor and others on this.)
You are also right that we are just starting to get the payoffs. AI is like the steam engine, small electric motors and microprocessors. They take a while to seep into the economy, but they make a huge difference.
You are also right that when AI does a bad job, we could stumble into a dystopia.
You are also right that there are some jobs left for humans. My lawyer has never needed a paralegal, but he does have a receptionist who acts as a witness to legal documents. The receptionist job may vanish, but barring dystopia we'll be requiring human witnesses for a while longer.
> The most interesting thing I learned from this experience is that DALL-E can’t separate styles from subject matters (or birds from humans).
Looks like entanglement is an issue. DALL-E cannot seem to find the right basis, where the basis vectors are styles, subjects, objects, etc. and instead uses artistic license to the max.
Who tells it what the correct basis should be? There's a correlation in the corpus between reindeer and men in red robes and art styles that work well on a postcard, just like there's a correlation between reindeer legs, reindeer torsos and reindeer antlers.
"DALL-E has seen one picture of Thomas Bayes, and many pictures of reverends in stained glass windows, and it has a Platonic ideal of what a reverend in a stained glass window looks like. Sometimes the stained glass reverend looks different from Bayes, and this is able to overpower its un-confident belief in what Bayes looks like."
So the Bayesian update wasn't strong enough to overcome its reverend prior?
Amazing images! But there's one problem: Tycho Brahe didn't use a telescope.
https://www.pas.rochester.edu/~blackman/ast104/brahe10.html
See also: https://en.wikipedia.org/wiki/Tycho_Brahe#Tycho_Brahe's_Instruments
Somewhat off topic, but the fact that Tycho Brahe was able to approximate the true size of the cosmos without a telescope is impressive - of course, he immediately rejected it as ridiculous, that distant stars could be larger than the orbit of earth...
Speaking of small details: Rather than Cervinae, reindeer and moose are both in Capreolinae. They are indeed in the deer family, but that is Cervidae.
You're right, thanks, fixed.
Didn’t Brahe have a silver nose? He lost his nose in a duel and wore a prosthetic silver one. I knew the moose story, too, but I always think of Brahe as “the guy with a silver nose”, so I was surprised that detail wasn’t included by Scott or DALL-E.
Exactly! I wanted to say the same thing. Although it probably was brass, not silver.
I heard that he wore a brass nose most of the time, but had a silver nose for special occasions.
Came here to say the same. I totally knew about the nose, but not about the moose! 👃🏻🗡😛
I was trying to figure out if Brasenose College at Oxford was named after a traditional prosthetic body part, but Wikipedia tells me "Its name is believed to derive from the name of a brass or bronze knocker that adorned the hall's door."
https://en.wikipedia.org/wiki/Brasenose_College,_Oxford
Maybe this says something about my priorities, but after "astronomy" the second category I have Brahe mentally filed under is "lost nose in duel". If he's on Jeopardy I'd wager decent money the answer is "This Danish astronomer lost part of his nose in a duel". The moose pictures are funnier though.
There's a joke to be made here about DALL-E subtly protesting the inaccuracy via the backwards telescope in the first moose shot. Alas, I can't get the details right.
Maybe DALL-E is more subtle than you think, and is trying to be accurate to period style when it makes the figures in stained glass windows look nothing like the people they're supposed to portray?
You can repair the Reverend's head with 'uncropping', expanding the image upwards. Examples: https://www.reddit.com/r/dalle2/search?q=flair%3AUncrop&restrict_sr=on
Am I the only one for whom 'metal nose', and not 'pet moose', was the defining trait of Tycho Brahe?
No, you're not.
Nope! Kinda disappointed Scott dropped the (admittedly rather small) ball on this one.
I just scrolled down to the comments to look for this exact point, haha. I'd love to see what happens if you throw "metal nose" into the mix, especially with Rudolph and Santa imagery already muddling things.
I googled him just now; not every picture shows his metal nose, but every picture shows his big long mustache.
Metal nose, or diagram of the Tychonic system (a nice reminder, also, that precision doesn't guarantee accuracy). His prosthetic nose was apparently flesh-toned and fairly inconspicuous, but if I saw an astronomer with a conspicuous metal nose, I'd assume Brahe was meant.
I am sad I didn't learn about the moose earlier.
No, no, no; his most memorable trait is how he died!
etiquette-induced uremia IIRC
I think of him as exploding bladder guy. Probably wasn't real but it is definitely the thing I associate with him.
Now that would be interesting to see in bot-rendered stained glass!
Yes, the three facts I knew about Brahe were - Kepler's teacher who did astronomy before telescopes; metal nose; died because he was embarrassed to leave for the bathroom while drinking with the king. I had never heard of the moose thing.
That and the data collection are all I know about him!
Off-topic, yet topical enough I don't want to put this in the off-topic thread: Matt Strassler has been doing a bunch of “how to discover/prove basic astronomy facts for yourself” recently: https://profmattstrassler.com/2022/02/11/why-simple-explanations-of-established-facts-have-value/
My father recreated the Cavendish experiment in our basement.
https://web.archive.org/web/20080508011932/http://www.sas.org/tcs/weeklyIssues_2005/2005-07-01/feature1/index.html
Thanks!
Here's my idea: Alexandra Elbakyan standing on the shoulders of the Montgolfier brothers, who are themselves standing on the shoulders of Thomas Bayes, who is standing on the shoulders of Tycho Brahe, who is standing on the shoulders of William of Ockham. Yes, I know DALL-E wouldn't want to stack so high as it was already cutting off heads. So I might as well have Ockham standing on a giant turtle.
What's the turtle standing on though?
It's turtles all the way down.
I think it's Darwin Finches all the way down.
The turtle is standing on Terry Pratchett's legacy.
In the style of the elevator in the Haunted Mansion at Disneyland.
The AI seems fuzzy on what, exactly, a telescope is used for. Most of the time Tycho seems to be trying to breathe through it, or lick it, or trying to stick it up his nose; even when he is looking through the eyepiece, as often as not he's just staring at the ground. I dunno, maybe the AI heard that story about the drunken moose and figured that Tycho himself was typically fully in the bag
😂 hadn't noticed that, but going back again, it's pretty hilarious.
It's funny how the telescope is pointed at his nose or mouth in every single image. You'd think DALL-E would get it right at least once, just by chance?
This was my biggest takeaway from the images, yes
But Tycho never used a telescope, he used a quadrant.
Trained on Smell-o-scope from Futurama?
When scratch and sniff cards were flown ordering Brahe to retreat, he raised his smelloscope to his metal nose and said "I have a right to be anosmic sometimes. I really do not smell the signal"
Tycho didn't use telescopes, did he? Isn't he the guy who built, like, giant sextants & things like that?
No, no, you misunderstand. DALL-E imagines Tycho thinking: "I wish that goddam moose would get out of my field of view so that I can resume my observations."
It was the scales that Bayes was holding that made me laugh, especially the one where he is holding the scale up by one of its bowls.
Image searching "looking through telescope" often gets you stock photos of people looking at a telescope instead of holding it to their eye (maybe that makes a better shot of their face?) or looking through a reflector telescope (which requires you to look 90 degrees from the direction of the scope), so maybe it has a hard time recognizing that "looking through" is an important part of that phrase?
More like breathing through a telescope or sniffing through one!
Yes, that was the first thing I noticed.
Would love to see this with Imagen, Google's even-newer image synthesizer. (No public demo though, alas.) In the examples we've seen, it does a much better job of mapping adjectives to their corresponding nouns instead of just trying to apply all the adjectives to all of the nouns, which is the main failure going on here.
Re faces, OpenAI says:
> Preventing Harmful Generations
> We’ve limited the ability for DALL·E 2 to generate violent, hate, or adult images. By removing the most explicit content from the training data, we minimized DALL·E 2’s exposure to these concepts. We also used advanced techniques to prevent photorealistic generations of real individuals’ faces, including those of public figures.
It's always possible that that was their preferred way of saying "we are not capable of generating recognizable faces".
Very unlikely, since much weaker models can do that - it does seem clear that they have explicitly gone out of their way to disrupt the generation of realistic faces.
It says specifically that they intended to prevent photorealistic face generation. They could have used a solution that also happened to disrupt stylized face generation, but that would be unfortunate, and would go beyond what they said here.
I think Scott's explanation that it's anchoring heavily on style cues is more likely.
Also importantly, it's *not* trained as a facial recognition algorithm, so it doesn't necessarily even know what features to use to identify individuals. (It has no way of knowing, for example, that I don't prepare for shaving by putting on my big red beard, or that a woman doesn't age 30 years and dye her hair black when she walks into a library.)
> I’m not going to make the mistake of saying these problems are inherent to AI art. My guess is a slightly better language model would solve most of them ...
What if the problem is more subtle than either of those two alternatives? What if the mapping between language prompts and 'good' pictures is itself quite fuzzy, such that different people will judge pictures rather differently for the same prompt, due to different assumptions and expectations? Don't we encounter such situations all the time, e.g., in a workplace meeting trying to settle on a particular design? Is it not naive to assume that there are objectively 'best' outputs, and we just need a better model to get them? What if I thought a particular picture was excellent, and you said, "No, no, that's not what I meant?"
I mean, it's clearly an objective defect that you can tell it a person's name and the art doesn't look like that person.
Of course. I'm not making the obviously wrong assertion that there are no standards of quality to be applied. There are clearly outputs that we'd all agree are bad. (In the extreme case, consider a model that output a solid black square for every prompt: obviously bad!)
So, now that we've gotten that trivial observation out of the way, let's get back to the substantive point, which I believe still stands.
You're mixing up whether people like a particular image with whether people will judge a particular image as being a good match for a given prompt. There is a lot of variation in artistic taste, but there is widespread agreement on what images depict.
The bulk of Scott's post discusses how good the various images are, not just identifying what the image depicts. I'm talking about exactly what Scott is talking about. Moreover, I think Scott is talking about exactly the right thing: image quality based on a wide variety of factors, which is more important than bare identification of what the image depicts.
So, both in the context of Scott's post and more generally, I don't think I'm "mixing up" anything.
It's not a defect, it's deliberate. The model can do that but they're afraid of DeepFakes so they "broke" it to stop that happening.
Prepositions are difficult even for people - with, in, as, on can mean very similar things depending on the associated nouns and verbs. DALL-E might not be sensitive to order or phrase structure. If it just soaks out "raven, female name, key, mouth, stained glass, library" and then groups them randomly with prepositions and associates those groups to background, foreground, mid-ground - it wouldn't distinguish between woman as raven, and woman with raven. Those would be reflected in the range of options it generated.
Also it may be chopping up the input into two or three-word phrases based on adjacent words. Putting the same words in the query but in a different order, or re-ordering the phrases such that a person would interpret it the same way, I have a feeling that would generate different results.
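To illustrate that hypothesis (a toy sketch of the failure mode being described, not DALL-E's actual mechanism): if a prompt is reduced to an unordered bag of content words, two prompts that mean very different things become indistinguishable.

```python
# Toy illustration: discard order and function words, keep a bag of content
# words. "Woman with a raven" and "raven as a woman" then look identical.
STOPWORDS = {"a", "an", "the", "of", "on", "her", "with", "as"}

def bag_of_content_words(prompt: str) -> set:
    return {w for w in prompt.lower().replace(",", "").split() if w not in STOPWORDS}

prompt_a = "a stained glass window of a woman with a raven on her shoulder"
prompt_b = "a stained glass window of a raven as a woman on her shoulder"

assert bag_of_content_words(prompt_a) == bag_of_content_words(prompt_b)
```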
Wish I knew what makes it switch to the cartoon-style faces. Possibly it fills in with cartoon style when the input images are less numerous.
And yes, the preferences could vary widely.
If Scott gets another go at it, maybe trying physical proximity descriptors like "near" or "adjacent" might go better. (Sitting on, standing on, eating, holding, waving, etc.)
Exactly. I think the key point you make is that some of these aspects "are difficult even for people," which points to a certain indeterminacy in deciding what the "best" output would even be.
This would probably be cheating in the strict sense, but if DALL-E had an "optometrist" setting where the user could indicate their preference given the first outputs and then refine that, it would help.
Also, Bayes tended to be looking to his right in DALL-E's results, just as he was in the one online image. But it was also not distinguishing his shaded jaw from the background, making his face either pinched or round instead of square-jawed.
Yes, an architecture that allowed for iterative refinement would be cool. It would mean that, for subsequent rounds of refinement, the previous image would need to be an input, along with the new verbal prompt.
I don't think that's cheating at all. It's a more sophisticated (and probably more useful) architecture.
The Victorian radiator was a nice touch in one of the "Scholarship" panels, but it makes me think it isn't distinguishing between "in the style of" and "with period-appropriate accoutrements." In addition to iterative refinement, it could have a "without accoutrements" check-box, maybe a "no additional nouns" feature. But I wouldn't want that all the time, since seeing what it comes up with is enjoyable.
For that to work would you have to train it up with a lot of that sort of refinement dialogs? If so it’s not clear where that training input would come from.
I am so very *not* up on this stuff, but I think it’s still the case in these systems that the “learning” part is all in the extremely compute-intensive training and that the systems that actually interact with users are not doing learning per se. Is that right? It seems like what you’re asking is for it to learn on the fly, and we may not be there yet.
> I think it’s still the case in these systems that the “learning” part is all in the extremely compute-intensive training and that the systems that actually interact with users are not doing learning per se. Is that right?
Yes, that's correct, for the most part. What would be required is for the training input to include images as well as the language description, with good image output targeted to be similar to the image input. That would allow the system to be used at run-time in an iterative manner, even though no learning is going on at that point.
Yes, human language is ambiguous, but you get to try again. The question is whether you can clarify what you meant like you would with a human artist, or whether it feels more like query engineering.
I wonder how hard it would be for a future version of the software to include the ability to pick your favourite of the generated images and have it generate more "like that," perhaps even in response to a modified query. I think that might feel more like clarification.
It might also open the door to a future personalized version that could learn your tastes, and perhaps to one that could take images as prompts and refine/alter them in ways that our current algorithms can't. (The uncropping function is already impressive, but I'm thinking of e.g. turning a sketch into a painting or a photorealistic render.)
Sorry about the long-delayed comment.
Yeah, the very first picture for "Darwin studying finches" shows Darwin looking at a book, with a finch under both of them. { (Darwin studying), (finches) }
Yeah, DALL-E indeed looks like it "might not be sensitive to order or phrase structure".
The problem is that it doesn't actually understand human language at all. It's just faking it. This is just an example of that.
It's true of all of these things. They're programming shortcuts that "fake it", they can't actually do the thing they're supposed to do, they just do "well enough" that people are like "Eh, it's usable."
This is true of image recognition as well, which is why it is possible to make non-human detectable alterations to images that cause these systems to completely misidentify them.
It's the same thing here - the system isn't actually cognizant of human language or what it is doing, it has created a complex black-box algorithm which grabs things and outputs them based on things it has associated with them heuristically. It's why you end up with red "christmasy" stuff in the reindeer images, and why the images look so bad the more specific you get - it's basically just grabbing stuff it found online and passing it through in various ways.
It seems superficially impressive, and while it is potentially "useful", it isn't intelligent in any way, and the more you poke at it, the more obvious it becomes that it doesn't actually understand anything.
Curious what you're planning on depicting for the other six virtues
> These are the sorts of problems I expect to go away with a few months of future research.
Why are you so confident in this? The inability of systems like DALL-E to understand semantics in ways requiring an actual internal world model strikes me as the very heart of the issue. We can also see this exact failure mode in the language models themselves. They only produce good results when the human asks for something vague with lots of room for interpretation, like poetry or fanciful stories without much internal logic or continuity.
Not to toot my own horn, but two years ago you were naively saying we'd have GPT-like models scaled up several orders of magnitude (100T parameters) right about now (https://slatestarcodex.com/2020/06/10/the-obligatory-gpt-3-post/#comment-912798).
I'm registering my prediction that you're being equally naive now. Truly solving this issue seems AI-complete to me. I'm willing to bet on this (ideas on operationalization welcome).
It wasn't intended as a precise prediction (it used the word "about" and I described it as "irresponsible"), and we have 10 trillion parameter language models now. So I predicted it would go up from 175 billion to 100 trillion, and so far it's at 10 trillion and continuing to grow. It wouldn't surprise me if it reached 100 trillion before the point at which someone would claim "about two years" stopped being a fair description (though I think this is less likely with Chinchilla taking the focus off parameter counts).
My experience is that almost all of the things people pointed out as "AGI-complete" flaws with GPT-3 are in fact solved in newer models (for example, its inability to do most math correctly), and that you really do just need scale.
If you want to make a bet, I'll bet you $100 (or some other amount you prefer) that we can get an AI to make a stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth sometime before 2025 (longer than the "few months" I said was possible, but probably shorter than the "until AGI" you're suggesting). If neither of us can access the best models to test them, we'll take a guess based on what we know, or it'll be a wash. Interested?
Do you have a link for the 10T parameter model? I haven't heard of this, and it would indeed put that prediction into "too close for comfort" range, even if I end up being technically correct.
In principle I'm interested, though I have to consider if your operationalization really addresses the core issue for me. Let me give it a think and get back to you in a day or two.
And just in case it isn't clear, I'm not dissing you, just challenging you in the spirit of adversarial collaboration.
I heard about it at https://towardsdatascience.com/meet-m6-10-trillion-parameters-at-1-gpt-3s-energy-cost-997092cbe5e8 , though I don't know any more than is in that article and if you tell me this isn't "really" 10 trillion parameters in some sense then I'll believe you.
I also don't mean this in a hostile way, but I'm still interested in the bet if you are.
On a tangent, if you have a list of bets going like that, it'd be nice to have them available on a webpage somewhere where people can see. I'm not sure substack supports that use case at all, though...
there's a logs & bets subforum on DSL (the semi-official bulletin board): https://www.datasecretslox.com/index.php/board,11.0.html
I asked about this on DSL, and it seems like these models are gaming the parameter count statistic to appear more impressive than they are. Specifically, they're not dense networks like GPT.
https://www.datasecretslox.com/index.php/topic,6232.msg257289.html#msg257289
Regarding the bet: you're on, but let's make it 3 out of 5 for similarly complex prompts in different domains, and different ways of stacking concepts on top of each other (e.g. off the top of my head: two objects, each with two relevant properties, and the objects are interacting with each other)
All right. My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:
1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth
2. An oil painting of a man in a factory looking at a cat wearing a top hat
3. A digital art picture of a child riding a llama with a bell on its tail through a desert
4. A 3D render of an astronaut in space holding a fox wearing lipstick
5. Pixel art of a farmer in a cathedral holding a red basketball
We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do. Loser pays winner $100, and whatever the result is I announce it on the blog (probably an open thread). If we disagree, Gwern is the judge. If Gwern doesn't want to do it, Cassander, if Cassander doesn't want to do it, we figure something else out. If we can't get access to the SOTA language model, we look at the rigged public demos and see if we can agree that one of us is obviously right, and if not then no money changes hands.
Does that work for you?
Ok, you're on! Will put this in my prediction log on DSL so it doesn't get lost.
Thanks for the follow-up post and link to the bet!
While I'm sure it's covered by your future judges, I'm curious if you both agree on the resolution criteria in the case where the AI makes an unintuitive (but grammatically correct) interpretation. What happens if the image contains:
1) an anthropomorphized library that has a shoulder (with a raven on it) or a mouth (with a key in it)?
2) a man wearing a top hat?
3) a child with a tail?
4) an astronaut wearing lipstick?
Does it get credit for such interpretations? Or are you looking for one with tighter binding of the prepositional phrases, which I think is the more conventional English interpretation?
Or are you looking for one that draws more sensical interpretation from a model of the world, where libraries "obviously" don't have shoulders or mouths? In this case, for #3, I might think it's more sensical for a man to wear a top hat, so wouldn't be surprised if it chose to do that vs the cat. Though I suppose if the collected works of Dr Seuss ends up in the training data, it might have its own view on whether cats wear hats. :)
I'm skeptical that math will truly be solved just by scaling. These models are trained to make stuff up using the techniques of fiction, in much the same way that a human would when writing fiction. A text generator for generating fiction will always have some chance of deliberately making math mistakes for verisimilitude, just as an image generator isn't going to limit itself to real historical photos, even if you ask for a "real historical photo."
A simple example: if you ask it for the URL of a page that actually exists, will it ever stick to the links it's seen in the dataset, or will it always try to generate "plausible" links that seem like they should exist?
A robo-historian that always used correct links, accurate quotes, and citations from real books, even if the argument is made up, would be pretty neat, and hopefully not too hard to build, but it's not going to happen just by scaling up fiction generators. If you want nonfiction, you have to design and/or train it to never make things up when it shouldn't do that.
It's been built already, that's Cyc, but unfortunately Cycorp seem to be uninterested in marketing, uninterested in open collaboration and uninterested in making their tech more widely available for experimentation. They're very much in the enterprise mindset of making bespoke demos for rich clients.
Or, you know, the actual reason for restricting access is that it doesn't actually work, and when people experiment with it, it becomes increasingly obvious that it isn't actually doing what it is claimed to do.
Which is exactly what is going on.
None of these systems are actually capable of this functionality because the approach is fundamentally flawed.
Machine learning is a programming shortcut, it's not a way to generate intelligent output. That doesn't mean you can't make useful tools using machine learning, but there are significant limitations.
Cycorp doesn't use machine learning so I'm not sure what your point is. It's a classical symbolic approach using pure logic.
> These models are trained to make stuff up using the techniques of fiction, in much the same way that a human would when writing fiction.
Are they? I was under the impression that they were primarily trained to predict "what would the internet say?" (given their training sets).
This suggests that they are at their best when generating the sort of stuff you're likely to find on the internet, which is consistent with what I've seen so far. I've yet to see any evidence of semantic abstraction, which is the absolute minimum for "working like a human does".
For example, let's say you wanted to write a fictional Wikipedia article about a French city. You might start by reading some real Wikipedia articles of French cities and taking aspects of each and combining them so that your fictional article has many of the characteristics of the real articles.
If you're given half an article about a French city then you could fill in the rest by making good guesses. This will work whether it's a real French city or a fictional one. The parts you fill in are fiction either way.
Similarly, when we train AI models by asking them to fill in missing words regardless of whether they know the answer, we are training them to make up the details. "The train station is _ kilometers from the center of town." If it doesn't know the answer it's going to guess. What's a plausible number? Guessing right is rewarded.
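As a concrete (and heavily simplified) sketch of that training setup, assuming a next-token variant of the fill-in-the-blank objective and a small stand-in network rather than a real transformer:

```python
# A guess is rewarded purely for matching the token that actually appeared;
# there is no separate signal distinguishing "I don't know" from a confident
# fabrication.
import torch
import torch.nn as nn

vocab_size, hidden = 50_000, 512
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.LSTM(hidden, hidden, batch_first=True)   # stand-in for a large transformer
head = nn.Linear(hidden, vocab_size)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters()))

def training_step(tokens):
    """tokens: (batch, seq_len) integer ids; predict each token from its prefix."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    states, _ = rnn(embed(inputs))                 # (batch, seq_len-1, hidden)
    logits = head(states)                          # (batch, seq_len-1, vocab)
    # "The station is _ km from town": if the answer isn't inferable from
    # context, the loss still pushes the model toward a plausible guess.
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```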
I wrote "using the techniques of fiction" when I should have just said "training to be good at guessing." Sorry about that!
I dunno, that seems like a pretty low bar. It's a step above Mad Libs or Choose Your Own Adventure, but only just. Seems a long way from creative writing, still less an essay with an original point.
And heck, if it were that easy, you could teach *human beings* to be good writers by just having them read a bunch of well-known good authors and a sampling of bad authors, suitably labeled, and asking them to mimic the good authors as best they can. Which doesn't work at all, public school English curricula notwithstanding.
I'm not saying it's *good* fiction. But it's still closer to making stuff up than writing something evidence-based.
A trustworthy AI would give you a real Wikipedia article if you asked for one, like a search engine would. If it made stuff up, it would indicate this. It seems like this basic trustworthiness (being clear about when you're making things up) would be a small step towards AI alignment?
They are already advanced well beyond the 'repeat back what I read on Reddit' level of sophistication.
https://arxiv.org/abs/2205.11916 as one example.
See, that's part of the problem. You consider making things up to be more "advanced" and "sophisticated" than accurate reporting and sourcing. But an AI that doesn't provide sources is worse at nonfiction.
I honestly have no idea what that has to do with anything I said, nor the comment I was replying to.
I'm rather sure that you're right that there are problems in this that can't be solved by scaling, but there are also problems that can be.
OTOH, what do you do about problems like the Santa Claus attractor? When the description (as understood) is fuzzy, I think that it will always head to the areas that have heavier weights. Perhaps an "exclude Santa Claus" option would help, but that might require recalculating a HUGE number of weights. (And the robo-historian would have the same problem. So do people: "When all you have is a hammer, everything looks like a nail.")
Following, in case this thread is updated with the real world outcome in "a few months".
I know/have known people who developed machine learning systems and tried to train them to solve problems. Watching them tinker with them, and discussing the issues with them, it was profoundly obvious to me that this is a programming shortcut and that it is being marketed to the public and investors as something it isn't. There was a push to describe it using different terminology so people wouldn't think it was "intelligent" but it got defeated, probably because "Artificial intelligence" is a much more marketable term and gets people investing than the more accurate terminology which set better expectations.
These systems have issues with overfitting (you feed in data, and the system is great on data from that set, but terrible outside of it). You also find weird things: it latches onto some wonky feature which matches your data set but not data in general, or onto some top-level feature that is vaguely predictive and can't get off of it (hence the racist parole "AI" in Florida that would classify white people as lower risk than black people, not based on individual characteristics but on their race, because black people in general have a higher rate of recidivism).
The dark side of this: if people actually understood how these systems worked, no one would allow the "self-driving cars" on public roads. There's a reason why people affiliated with those push the perception of intelligent "AI", and it's because it allows them to make money. The reporting on the self-driving car that hit a pedestrian revealed that its classifier kept flickering between different identifications for her, and that it was throwing out that data and not slowing down, because if it stopped driving any time its recognition was inconsistent, it would not be able to drive at all.
But in the end, these systems are all like this.
These systems are not only not intelligent, they aren't designed to be intelligent and aren't capable of it. They can't even fake it, once you understand how they function.
What is actually going on is that writing a program by hand to solve many of these tasks is extremely hard, to the point where no one can do it.
What they do instead is use machine programming as a shortcut. Machine programming is a way of setting up a computer with certain parameters and running it to make it construct a black box algorithm based on input data.
Your goal is to create a tool which gives you the output you want more or less reliably.
The more limited a system is, the better these tools are at this. A chess computer doesn't understand chess at all, but it can give the correct outputs, because the system is limited: you can use a dataset of very good players to give it a very advanced starting point, and you can then use the computer's extreme computing power to generate a heuristic which works very well.
The more open a system is, the more obvious the limitations in this approach become. This is why these systems struggle so much with image recognition, language, etc. - they're highly arbitrary things.
The improvements to these systems are not these systems getting smarter. They're not. They are giving better output, but they are still flawed in various fairly obvious ways.
The reason why access to these systems is limited is because when you start attacking these systems, it exposes that these systems are actually faking their performance. If you are only given a select look at their output, they seem much, much more interesting than they actually are.
There is something known as an adversarial attack. A good example of this is image recognition:
https://towardsdatascience.com/breaking-neural-networks-with-adversarial-attacks-f4290a9a45aa
These adversarial attacks expose that these systems don't actually recognize what they're looking at. They aren't identifying these things in a way that is anything like what humans do - the manipulated images often aren't even visibly different to the human eye, and certainly look nothing like what the neural network thinks they are, but the system will conclude, with almost absolute certainty, that the image is something else entirely.
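For anyone curious what such an attack looks like in practice, here is a minimal sketch of the classic fast-gradient-sign method (FGSM), assuming a pretrained torchvision classifier; the linked article covers fancier variants, and this is just the textbook version:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm_attack(image, true_label, eps=0.01):
    """image: (1, 3, H, W) tensor preprocessed for the model; eps is the step size."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Nudge every pixel a tiny step in whichever direction increases the loss.
    # The change is typically invisible to a human, yet often flips the
    # predicted class with high confidence.
    return (image + eps * image.grad.sign()).detach()
```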
The reason why is that these neural network image recognition algorithms are nothing more than shortcuts.
That isn't to say that these systems can't be useful in various ways, because they can be. But they don't understand what they are doing.
Tonight, one of my friends got a furry to draw him a sketch of two anthropomorphic male birds cuddling. Because the image shouldn't be in Google's network yet, I took that image and fed it into the system.
The results rather illustrate what is going on.
There is art of both of these characters, but the system didn't recognize either of them. It did return a lot of sketches, but none by the same artist as the one who drew it, either.
Instead, it returned a bunch of sketches of creatures with wings, mostly non-anthropomorphic gryphons, and a smattering of birds, but some other random things as well.
All of these are black and white. The characters are an owl and a parrot, but the results don't give either of those creatures for hundreds of results - they're mostly eagle-like creatures. It gives no color images, very few of the hits are even for anthro birds, the creatures in the images overwhelmingly don't have the "wing-hands" that the characters do. It doesn't give us birds cozying up for hundreds of posts, and when it does, they're non-anthro ones, despite there being vast amounts of furry art of bird boys cuddling. In fact, that was the only one I found in hundreds of results that even came close to the original image - but I suspect that it only found that image by chance, because the characters have feathers.
The system could identify that they had feathers/wings, but it couldn't identify what they were doing, it couldn't identify what they were, it couldn't tell that I might want color images of the subject matter, etc. It identifies the image as "fictional character", when in fact there are two characters in the image.
I took another image I have - a reference image of one of my characters, a parrot/eagle hybrid - and fed it into the system. This image IS on the Internet, but it couldn't find it (so it probably hasn't indexed the page it is on). As a result, it spat out a lot of garbage. It gave me the result of "language" (probably because the character's name is written on it) but it failed to identify him as a bird, and instead returned to me a bunch of furryish comics with a lot of green, teal, blue, and yellow on them.
It didn't correctly identify it as a character reference sheet (despite the fact that this is a common format for furry things), it didn't recognize that the character was a bird, it didn't recognize that they were a wizard (even though he's dressed in robes and a wizard hat and carries a staff)... all it really seems to have gotten is a vague "maybe furry art with text on it".
If you go down a ways, you do find the odd character sheet - but not of bird boys. As far as I can tell, it seems to be finding them based on them having a vaguely similar color scheme and text on the top of them. Almost none of the images are these, however, and none of the character sheets are of birds, or wizards (or heck, of D&D characters, because he's a D&D character). It doesn't well differentiate between character and background - many of the images will take the background colors from my image and have them in characters or other objects, or it will look at it and see a blue sky in the background with green grass and think it is a match.
These systems aren't smart. They're not even stupid. They are cool, and they're fun to play around with, and they can be useful - if you want to find the source of an indexed image, these systems are pretty good at finding that. But if you want it to actually intelligently identify something, it's obvious that the system is "cheating" rather than actually identifying these images in an intelligent way.
This isn't limited to original characters. I have art of a Blitzle - a pokemon - someone drew for me years ago. It is cute, the person posted this on their page 10 years ago, and Google has long-since indexed it.
But if I feed it into the system, despite the fact that the page it is on identifies it as a Blitzle and says it is a Pokemon, Google's image recognition cannot identify it as a Blitzle. It again goes "Fictional character", and while some of the images it returns are Pokemon, not a single one is a Blitzle or its evolution, suggesting it is just finding that "Pokemon" text on the page and then using that to try and find "similar images".
This is not an obscure original character; Pokemon are hyper popular. But it still can't successfully identify it, and I suspect that the portion of Pokemon images it returns is because the page that it was originally posted on says "Pokemon" on it, as the returned results don't resemble the picture at all.
And there's evidence that this is exactly the sort of shortcut it is taking.
Poking at another image that I found online - one that has one kind of Pokemon in it, but which the most-linked-to posts call an "Eevee" - gets it to "correctly" identify it as an Eevee, and the "similar images" are all Eevees.
But it's not one, it's actually a Sylveon.
https://twitter.com/OtakuAP/status/1498848805357301766
That's the image. Note the use of "Eevee" to describe it.
https://www.google.com/search?tbs=sbi:AMhZZiuavUSy9MNC7qx8V9B1-oQsHs4tH-cT6xNUfaqleaIUM8zAIIfRZfuEjsOQNw8B8Qkvad5UEgMHawGZkPoM2GWYkV0JsZxwmK0TwS3ZYpVZKVer1as1tTMj5Q8tre7oFMENKX7y12vkVOxHbouxtL-BKNIxouRp-4JGT27nlZSwa4HvAoG8ptvzkW8VlsIToXHjHfXiY6p_1noTCJlvqnXyYd_1AMAIB6kKPvdoEhyto59hU-FUKzbv73khNntm8m26lngcJvZ14ippbrFxRgv9xQdNBkvdxWX-PUEmBH_1czfb5Ehyf5bzdrML9PM9Qtl4zC1tqTAVo5yucTWT2wnOHXp6t8zpQ&hl=en
That's the output page.
It calls it an "Eevee" and the "visually similar images" are all Eevees. You have to go down a long way to find a Sylveon... and those are on images that use the word "Eevee".
This is what we're really looking at here, and it is precisely this sort of analysis that people who are trying to sell these systems - or get funding for their systems - don't want, because once you start hitting at them like this, you start realizing what they're doing, how they're really working, and that these algorithms aren't solving the problem, they're taking shortcuts.
A year ago, having a system like DALL-E 2 was science fiction, and we were excited by systems for generating textured noise like app.wombo.art. You can't predict black swans, but there sure seem to be a lot of them recently.
They're scared of giving people free access to the system because the more you poke at it, the more the paint flakes off and you realize it is doing the same sort of "faking it" that every other neural net does.
They don't want people engaging in adversarial attacks because it shows that the system isn't actually working the way they're pretending like it works.
It's clearly going to need a better "world model" for this to work the way people want. And the model is going to need to be (at least) 3D, as suggested by the comments about the shading on Bayes' jaw.
I don't think it's an "AI-complete" problem, but it's more difficult than just merging pics or GIFs.
That said, parts of the problem are, in principle, rather simple to address. (Iterative selection would help a lot, e.g.) But if the idea is to push the AI rather than to work for better drawings, they may well not choose to head in that direction.
Is DALL-E giving extra weight to its own previous results based on the similar input phrases? How much time elapsed between these queries?
I laughed until I cried. Fighting off an impulse to re-caption some of them.
Oh god please do. I want to laugh some more.
Okay. I am hoping you post yours also.
I didn’t come up with captions for all of them but I numbered them in order of Dall-E image, leaving out other images from the sequence.
Fourth virtue:
#1 Get me an ashtray (is that a cigarette in his right hand?)
#2 A little more to the left
#3 This will be great in the kitchen
(on to the Reverend dressed in black)
#10 Quagmire goes to Oxford
#12 Spider-Man will pay…
Tenth virtue:
#2 Okay, once more from the top
#3 Damn machine mount
#4 If I shake it just right, the lid might fall out
Seventh virtue:
#1 To be or not to be…
#2 Reflex test, just a tap
#3 Giga-whamm
#4 Nah, I’m good, I still have to drive home
#5 Metallica: Millennium
#6 We’re live in five seconds, why are you handing me this?
#9 Mr. Clean Dungeon Spray
#10 NorCal yeaaaah
#11 These damn stockings
#12 Knitting is hard
Third virtue:
#9 Polyphemus rising
#13 It’s stuck…
Eleventh virtue:
#1 I’ve got you now
#2 Homeschooling is boring…
#4 Not a Woodpecker
#5 Summer on Cape Cod
#6 Tentacles emerge at twilight
#8 Wherefore art thou, Romeo?
#9 DoorDash is faster
#11 This damn hookah
These don’t really do the pictures justice, it’s an excuse to look at them again though!
And a final one - took a while - eleventh virtue panel #14 - “Brains with basil… hmm, brains with braised zucchini…what to cook…”
I had to move to a distant room so that my wife could continue watching tv. Even so, I'm sure she was annoyed when I saw that adding more French names produced more angel wings.
There is some indication over on the subreddit that adding noise to the prompt can sometimes avoid bad attractors. For instance, a few typos (or even fully garbled text) can improve the output. It seems important to avoid the large basins near existing low quality holiday photos, people learning to paint, and earnest fan illustrations. Maybe Dall-E associates some kinds of diction with regions messed up by OpenAI's content restrictions, or mild dyslexia with wild creativity. In comparison the early images from Imagen seem crisper, more coherent, but generally displaying a lack of that special quality which Dall-E 2 sometimes displays, which seems close to what we call "talent" in human artists. Thanks for the funny and insightful essay.
William of Ockham never himself used the image of a razor; that's a modern metaphor and would be inappropriate for depiction in the stained glass image. And few people would know who Brahe is even with the moose, so leave it out.
Stained glass traditionally displays someone with an object that symbolizes who they are, even if it's not historically accurate. For example, here ( https://www.bridgemanimages.com/en/noartistknown/dalat-cathedral-stained-glass-window-saint-cecilia-is-the-patroness-of-musicians-dalat-vietnam/stained-glass/asset/5300711 ) is St. Cecilia playing the violin, which was invented several centuries after her death.
Ada Lovelace should be using a computer then.
Programming in Ada, the computer language.
But the razor is a metaphor for succinctness, whereas the violin is a musical instrument, if an anachronistic one for Cecilia. And what's wrong with a caption? I see them in windows. And many paintings have titles. Does anyone but an art historian distinguish the Evangelists according to their symbols?
Educated Catholics distinguish the evangelists by their symbols. It's a way to keep kids from getting bored at mass by having them explore beauty and symbols and shared consciousness which is another form of truth and "good news".
Traditionally, most people who were looking at stained glass windows or bas reliefs would distinguish the saints entirely by means of the symbols.
https://www.exurbe.com/spot-the-saint-john-the-baptist-and-lorenzo/
More likely because someone else would tell them. :) [I think I'm more of a Catholic nerd than 99% of the people in the pews next to me, and _I_ would have to google anyone but Mark, because of the lions in Venice.] But the point for me is that unless the symbol is well known it does not help, and I don't think a moose helps identify Brahe. Just label him.
I think it is a wonderful form of cultural literacy that allows us to understand Renaissance art the way a contemporary would. As it said in the article Kenny linked to, "If you understand who these figures are and what they mean, a whole world of details, subtleties and comments present in the paintings come to light which are completely obscure if you don’t understand the subject."
"Stained glass traditionally displays someone with an object that symbolizes who they are, even if it's not historically accurate".
Not really. That is one genre of stained glass, a particular style relating to iconography and hagiography. But, for example, the "rose window" kind of thing is equally a stained glass genre, where beauty and abstract symmetry are the underlying aesthetic guide. And you don't consider the Islamic and Persian traditions of stained glass.
Even a focus on hagiography ignores the fact that stained glass' primary purpose is to control lighting of a space. You might read/reread Gould's (et al) paper on spandrels and evolution. The hagiography is an epiphenomenon.
"Even a focus on hagiography ignores the fact that stained glass' primary purpose is to control lighting of a space."
No, you can do that with clear glass (and the Reformation icon-smashing extended at times to stained glass). Forget your spandrels, when commissioning windows, especially when glass and colours were expensive, the idea was didactic and commemorative. If you just wanted to "control the lighting" then the abstract patterns of rose windows would do as well. Putting imagery into the glass had a purpose of its own.
https://en.wikipedia.org/wiki/Poor_Man%27s_Bible
And Scott, asking for designs for the Seven Virtues of Rationalism, is working within the tradition of depicting the Seven Virtues, Seven Sins, Seven Liberal Arts, etc. Even the Islamic tradition will have calligraphic imagery of verses from the Quran or sacred names in stained glass:
https://www.1001inventions.com/wp-content/uploads/2017/10/islamic-glass23.jpg
I think you made the point, if you've read Gould's "Spandrels of San Marco".
The first purposes of stained glass were control of light and aesthetic effect on a collectively used space.
The hagiographic didacticism and wealth signalling were add-on artifacts.
But the essence of "stained glass" is light and beauty. That is where the "art" lies.
Great set of experiments and writeup.
What it really looks like is that the author is praying at the altar of a very uncaring god or gods, and getting a bunch of vague prophetic crap.
Other DALL-E users use the words “in the style of” as part of the cue instead of just sticking a comma between the content and style parts; does that make a difference?
Previous work in image stylization has used a more explicit separation between content and style, which would help here. I imagine there will be follow-on work with a setup like the following: you plug in your content description which gets churned through the language model to produce “content” latent features, then you provide it with n images that get fed into a network to produce latent “style” features, then it fuses them into the final image. Of course then you potentially would have a more explicit problem with copyright infringement since the source images have no longer been laundered through the training process but maybe that’s fairer to the source artists anyways.
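A rough sketch of what that follow-on setup could look like, assuming the two-encoder-plus-decoder split described above; all module names here are hypothetical placeholders, not an existing API:

```python
import torch
import torch.nn as nn

class StyleContentGenerator(nn.Module):
    """Fuse a text-derived content latent with an image-derived style latent."""
    def __init__(self, text_encoder, image_encoder, decoder):
        super().__init__()
        self.text_encoder = text_encoder    # prompt tokens -> (d,) content latent
        self.image_encoder = image_encoder  # reference image -> (d,) style latent
        self.decoder = decoder              # (2d,) fused latent -> generated image

    def forward(self, prompt_tokens, style_images):
        content = self.text_encoder(prompt_tokens)
        # Average the latents of the n reference images so no single source
        # image is reproduced directly (the copyright question remains).
        style = torch.stack([self.image_encoder(img) for img in style_images]).mean(dim=0)
        return self.decoder(torch.cat([content, style], dim=-1))
```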
Yeah I came here to say that. Scott keeps dissing DALL-E for becoming confused about whether "stained glass window" is a style or a desired object but the queries never make it clear that it's meant to be a style. All the other prompts I saw were always explicit about that and I'm left uncertain why he never tried clarifying that.
Likewise. For an exercise in experimenting with prompts, it stuck out that all but the last were "...stained glass window".
"Stained glass window depicting..." might have been tried sooner.
Speaking of style, I imagine that mentioning the virtues to be depicted might also have affected the flavor of the images returned.
> Other DALL-E prompts use the words “in the style of” as part of the cue
The possibility has been pointed out on the subreddit that 'in the style of' may optimize for imitators of a style over the original (which, if true, may or may not move results in the direction one wants).
Since it seems to get hung up on the stained glass window style, try getting the image you want without the style, and use neural style transfer to convert it to stained glass.
Are there any good neural style transfer engines available for public use?
Style transfer went mainstream years ago, but I don’t know of one off hand. “Prisma” caught on in my group of friends but it’s only free to try.
I tried it and its stained glass filter seems terrible - it looks like a Photoshop filter that isn't doing any AI at all. Am I missing something?
I just tried a couple of free ones and the results didn't even resemble stained glass. More like a Photoshop filter gone wrong. Deepdreamgenerator.com (limited free options) seems to work better and has options for tuning. Here’s what I generated from a painting of Tycho and some colored triangles as the stained glass style. (Default settings except “preserve colors”) https://deepdreamgenerator.com/ddream/lsnijmgc2zi
Here’s one using a real stained glass scene, like I think you’re trying to generate, as the style. (Default settings except “preserve colors”)
https://deepdreamgenerator.com/ddream/sxra7thsoah
Maybe a more tessellated real image would hit the sweet spot.
Prisma is very, very old; 2016, I think.
There's a reasonably good one on nightcafe, not sure which backend it uses
Ostagram generally does a good job at style transfer, but I've never seen style transfer of stained glass do anywhere near as good a job as these. I've managed some Tiffany style sunset landscapes, but nothing beyond that, and I've really tried.
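Picking up the neural style transfer suggestion above: for anyone who would rather roll their own than use one of those services, the classic Gatys-style approach is only a few dozen lines with PyTorch and torchvision. A minimal sketch, assuming a recent torchvision (for the weights enum) and two local images whose filenames are placeholders; the layer indices and loss weight are the usual illustrative choices, and ImageNet normalization is omitted for brevity:

```python
# Minimal Gatys-style neural style transfer sketch (PyTorch + torchvision).
# Layer choices and the style weight are illustrative, not tuned.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

def load_image(path, size=512):
    tf = transforms.Compose([transforms.Resize((size, size)), transforms.ToTensor()])
    return tf(Image.open(path).convert("RGB")).unsqueeze(0).to(device)

content = load_image("tycho_portrait.jpg")   # placeholder filename
style = load_image("stained_glass.jpg")      # placeholder filename

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

content_layers = {21}                # conv4_2
style_layers = {0, 5, 10, 19, 28}    # conv1_1 .. conv5_1

def features(x):
    out = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in content_layers or i in style_layers:
            out[i] = x
    return out

def gram(f):
    b, c, h, w = f.shape
    f = f.view(c, h * w)
    return f @ f.t() / (c * h * w)

content_feats = features(content)
style_grams = {i: gram(f) for i, f in features(style).items() if i in style_layers}

image = content.clone().requires_grad_(True)   # start from the content image
opt = torch.optim.Adam([image], lr=0.02)

for step in range(500):
    opt.zero_grad()
    feats = features(image)
    c_loss = sum(F.mse_loss(feats[i], content_feats[i]) for i in content_layers)
    s_loss = sum(F.mse_loss(gram(feats[i]), style_grams[i]) for i in style_layers)
    (c_loss + 1e5 * s_loss).backward()         # the style weight is the main knob
    opt.step()
    with torch.no_grad():
        image.clamp_(0, 1)

transforms.ToPILImage()(image.detach().squeeze(0).cpu()).save("tycho_stained_glass.png")
```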
It appears the goals of 'AI images', 'natural text' (searching/generating with single strings of text), and 'actually useful' user interfaces are in conflict. An art generator where the art style, subject, colours, etc. are not discrete fields, and which ignores all the existing rules and databases used to search for art, is a bad approach to making a useful AI art generator.
I'd be more interested if they ignored the middle part of 'single string of text' and focused more on the image AI. They are perhaps trying to solve too many problems at once, with AI text being a very difficult problem on its own - that said, it pulled random images which are probably not well categorised as a data source, so I'm sure they hit various limitations as well.
I would think using an image-focused AI to generate categories might be an interesting approach, drawing directly from the images rather than from whatever text is used to describe them on the internet. Existing art databases could be used to train the AI in art styles.
It would even be interesting to see what sorts of categories the AI comes up with on its own. While we think of things like Art Nouveau, the AI is clearly thinking 'branded shaving commercials' or 'HP fan art' are valid art categories. I don't think the shaving ads will show up in Sotheby's auction catalogue as a category anytime soon, though.
Perhaps we can see 'Mona Lisa, in the art style of a shaving advertisement', 'Alexander the Great's conquest battles as HP fan art', or 'Napoleon Bonaparte in the style of steampunk'?
My best guess about William’s red beard and hair: DALL-E may sort of know that “William (of) Ockham” is Medieval, but apparently no more than that since he’s not given a habit or tonsure (he’s merely bald, sometimes). But he has to be given *some* color of hair, so what to choose??
Well, we know that close to Medieval in concept space is Europe. And what else do we know? We have a name like William, which in the vaguely European region of concept space is close to the Dutch/Germanic names Willem and Wilhelm. And what do we know of the Dutch and Germanic peoples? In the North / West of Europe is the highest concentration of strawberry-blonde hair!
If that’s too much of a stretch, then maybe DALL-E knows some depictions of “William of Orange” and transposed the “Orange” part to “William (of) Ockham’s” head?
When I do a google image search for "william beard", I find that 90% of the results are of Prince William (the present one, not the one of Orange) with his reddish beard.
Interestingly, when I do a Google Image search for William of Ockham, the first image that comes up is the one from his Wikipedia page, which is a stained glass window! (But without a razor.)
https://en.wikipedia.org/wiki/William_of_Ockham
I am personally addicted to generating "uncanny creep" "eldritch horror" and similar prompts using mini dalle.
Literally addicted, it's become an obsession.
https://huggingface.co/spaces/dalle-mini/dalle-mini
Thank you!!!!!
I entered "darwin in the style of stained glass" and was very impressed by what I saw!
Addicted at my first shot.
Thanks again!
I typed in "Taylor Swift" and every picture was an eldritch horror, but also obviously her.
I typed in "me" and every picture was a white guy, most of them with glasses. Far more accurate than I had a right to expect.
I wonder if you could get a key in the raven's beak if you called it a beak.
I think "raven biting key" might do it.
Raven is also a general term for color, ie "raven-haired," which may have influenced it.
"Stained glass" might yield very similar results stylistically to "stained glass window" except without the influence of input noun window.
Bookstore might work out better than library, also. Alexandra Elbakyan holds crow biting key, within bookstore: stained glass, might do it. "Mosaic" also might work. Or try "painted stained glass" or "faceted stained glass" (https://www.kingrichards.com/news/Church-Stained-Glass/81/Types-Styles-of-Stained-Glass/)
(I edited this to put in "crow" for "bird" and add stained glass style info. I think you specifically don't want the Tiffany-style ones.)
I wonder if he could get his image by starting with "beak holding key" and sequentially uncropping the image while adding elements in each iteration.
I do wonder how a human artist who got a similar query from an anonymous source would respond (assuming the artist was willing to go to the trouble, etc.).
An excellent question!
Presuming that the opening here was just a joke to intro playing with DALL-E? I bet artists would happily draw these things in the style of stained glass. Like, if you actually wanted to do this, instead of fiddling with DALL-E you could just go onto Shutterstock, find some artist or agency in eastern Europe that does art in that style and then pay them to do it. Might be cheaper than paying for DALL-E if/when it ever becomes an actual product.
The OP has a number of amusing missteps on DALL-E's part due to it (apparently) not understanding some of the base assumptions behind the queries.
However, I was wondering if a human with a similar knowledge base would do much better, especially one not steeped in a similar culture. What part of these missteps is lack of background knowledge versus being an AI (if that means anything)?
This is actually a great example of the challenges with fairness and bias issues in AI/ML. Systems that screen resumes, grant credit (e.g., the Apple Card), or even just do marketing have real problems with their corpus. Even if the standards for past hiring are completely fair, if the system is calibrated on data where kindergarten teachers are 45-year-old women and scientists are 35-year-old men due to environmental factors, it is incredibly difficult to get the system to see the unbiased standards that are desired. This is a great layman's exploration of why that is.
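The calibration problem is easy to reproduce in miniature: train any classifier on historical outcomes in which an in-principle-irrelevant attribute happens to correlate with the label, and the model will learn to use it. A toy sketch on synthetic data (all numbers and feature names made up):

```python
# Toy illustration of bias from historical correlations: the "group" feature
# shouldn't matter, but because past outcomes correlated with it, the model
# assigns it real weight. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
skill = rng.normal(size=n)             # the thing we'd like to hire on
group = rng.integers(0, 2, size=n)     # irrelevant in principle (e.g. a demographic proxy)
# Past hiring: mostly skill-driven, but group correlates with outcomes for
# environmental/historical reasons.
hired = (skill + 0.8 * group + rng.normal(scale=0.5, size=n)) > 0.5

model = LogisticRegression().fit(np.column_stack([skill, group]), hired)
print(dict(zip(["skill", "group"], model.coef_[0].round(2))))
# The coefficient on "group" is far from zero: the model reproduces the pattern.
```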
Wasn’t Ada Lovelace not super her name?
(Though probably it’s the most common name in captions of pictures of her.)
This article on Tycho Brahe (borrowed from another comment, https://www.pas.rochester.edu/~blackman/ast104/brahe10.html) says that from his measurements Brahe concluded that either the Earth is the center of the universe, or the stars are too far away to accurately measure any parallax. Then it adds:
"Not for the only time in human thought, a great thinker formulated a pivotal question correctly, but then made the wrong choice of possible answers: Brahe did not believe that the stars could possibly be so far away and so concluded that the Earth was the center of the Universe and that Copernicus was wrong."
What are other times that "a great thinker formulated a pivotal question correctly, but then made the wrong choice"?
Not quite as blatant, but: William Thomson, Lord Kelvin, analyzed possible methods for the sun to generate its energy, calculated that none of the methods he considered would last long enough for Darwinian evolution to occur, and concluded that the sun and Earth must therefore be young.
A little while later, radioactivity was discovered.
Rutherford on his lecture in 1904, with Kelvin in the audience: "To my relief, Kelvin fell fast asleep, but as I came to the important point, I saw the old bird sit up, open an eye and cock a baleful glance at me! Then a sudden inspiration came, and I said Lord Kelvin had limited the age of the earth, *provided no new source of heat was discovered*. That prophetic utterance refers to what we are now considering tonight, radium! Behold! the old boy beamed upon me." First source I could find here: https://link.springer.com/chapter/10.1007/978-1-349-02565-7_6
Brahe was an excellent observer. He used a gigantic quadrant to measure the positions of the stars, so his measurements were the most accurate available at the time. King James VI of Scotland, later King James I of England, visited his state-of-the-art scientific facility. Still, Brahe's measurements were not good enough to measure parallax, even of the closest star. A parallax second, a parsec, is about 3.26 light years, so to measure the distance to Proxima Centauri, 4.2 light years away, one would need an angular accuracy of better than 1/3600 of a degree. Brahe did what he could with what he had; his measurements were just not good enough. (A quick back-of-envelope check of these numbers follows below.)
This happens a lot when instruments improve. The first exoplanet was discovered in 1995 by measuring stellar light curves with an Earth-based telescope. There was a lot of skepticism, since the measurement was near the sensitivity limit, less was known about stellar behavior, and it was an extraordinary claim. The paper was retracted in 2003. Since then, we've discovered thousands of exoplanets using space-based observatories optimized for planetary discovery. We also know a lot more about stellar behavior. The original discovery was reconfirmed, but the Nobel Prize for discovering exoplanets went elsewhere.
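For the curious, the parallax arithmetic in the comment above is easy to check:

```python
# Back-of-envelope check of the parallax numbers above (values are approximate).
ly_per_pc = 3.26                      # light-years per parsec
d_proxima_pc = 4.2 / ly_per_pc        # Proxima Centauri in parsecs (~1.29 pc)
parallax_arcsec = 1.0 / d_proxima_pc  # parallax angle in arcseconds

print(f"{parallax_arcsec:.2f} arcsec = {parallax_arcsec / 3600:.6f} degrees")
# ~0.78 arcsec, i.e. slightly better than 1/3600 of a degree is needed;
# Brahe's naked-eye instruments managed roughly arcminute (60x coarser) accuracy.
```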
Yes, people are provincial. We think "of course" now, but this is not the case.
Whether the Earth moved around the sun or vice-versa could not be determined with reference to objects within the solar system. It's true that the Ptolemaic model was more complicated, but that was really the only downside. The moons of Jupiter did put a hole in the model, since there were now heavenly bodies that did not rotate around the Earth, but you could still patch it.
The big problem for the Copernican model was parallax. If the Earth moved around the Sun, then the distant stars would wobble back and forth, the size of the wobble depending on how far away they were. Since with the best measurements of the time, they could not measure any wobble, this was a real hole in that theory.
Wikipedia suggests the first successful measurement of stellar parallax was in 1838. So until then, there was no proof of (and some evidence against) the Copernican model.
People were convinced of the Copernican model long before 1838.
As you mention, Galileo discovered the moons of Jupiter, which knocked out a major argument against heliocentrism.
He also observed the phases of Venus. Both models predicted that Venus should have phases, but they predicted *different* phases. Seeing that the reality matched the new prediction was a huge win.
Finally, he discovered mountains on the moon (implying that the moon is earth-like), and sunspots (showing that the sun is imperfect). This contradicted a major claim of the old theory, that the heavens were made of a different substance than earth and were subject to different physical laws.
I'm not sure, but I think there might have been an attempt to modify the geocentric theory to allow the other planets to orbit the sun while the sun orbited the Earth. But it's been a long time since I looked into it so I may be wrong about that.
If this attempt existed, it would still have been another nail in the coffin, since the theory had already been made more complex several times through adding epicycles to better match the planets' motions.
On the other hand, we've done a lot of similar things to try to explain the galaxies' motions (dark matter, dark energy, etc.) I'm not sure when we're going to give up on those - presumably when a simpler theory is proposed. So far, nothing fits.
Yes, that was the Tychonic system. It is, in modern terms, equivalent to the Copernican system but transformed to a non-inertial coordinate system where Earth is defined as always the origin. The main advantage seems to have been that astronomers could study the way the solar system actually worked without too closely associating themselves with the guy who went out of his way to insult the Pope.
Well in all fairness, the Sun does orbit the Earth at least as much as the Earth orbits the Sun. Granted, the barycenter is I think maybe 1-2000 km above the Sun's center...
Wouldn't Newtonian mechanics and theory of gravity be rather strong arguments that the Earth-orbits-the-Sun model is closer to the truth?
Yes, it's another nail - but Newton was born almost half a century after Brahe died.
Not unless you know how big the Sun is. However, that can be estimated, and in fact circa 300 BC Aristarchus of Samos used observations of the Moon to estimate that the Sun was ~7 times bigger than the Earth (the actual ratio being >100) and proposed that for that reason the Earth probably went around the Sun. Wikipedia summarizes the methods:
https://en.wikipedia.org/wiki/On_the_Sizes_and_Distances_(Aristarchus)
I think epicycles would not be consistent with using the theory of gravity - it would be one or the other.
You mean epicycles are not consistent with inertial mechanics under a central force? That is definitely the case. I thought you were just addressing heliocentric versus geocentric, and I'm just pointing out that unless you do some hard thinking and close observation, your naive thought (looking at the sky) will be that the Sun is about the same size as the Moon, probably about as far away, and so there's nothing there that prevents geocentrism.
You have to get into slightly more sophisticated observations, e.g. the fact that the phases of the Moon suggest it's illuminated by the Sun, and if so, the fact that it's *still* illuminated when the Sun goes down means the Sun is bigger than the Earth, and so forth. This isn't super hard (which is why it was already relatively current among sophisticated Greeks circa 300 BC), but it *does* require a focus on empirical observation rather than elegant theoretical argument, which maybe accounts in part for the unreasonably long time it took for Western natural philosophers to suss it out.
Famously, Einstein, with both the expanding universe and the cosmological constant.
Giovanni Saccheri published a book in 1733 entitled /Euclid Freed of Every Flaw/ in which he attempted to prove the parallel postulate from the rest of Euclid's axioms. He worked by contradiction and ended up proving several theorems of hyperbolic geometry. However, at some point, he decided things had gotten too weird and declared that the hypothesis must be false because the results were "repugnant to the nature of straight lines".
Two come to mind:
1. In the 1870s, Gibbs, while formulating classical statistical mechanics, was obliged to use a quantum of phase space to make everything work out, which had the dimensions of action, i.e. was dimensionally identical to (and indeed had the same meaning as) Planck's constant. Arguably, had he been less of a recluse, known of Fraunhofer's discovery of spectral lines, and maybe talked it over with Maxwell, he might have come up with quantum mechanics 40-50 years before it was invented.
2. Both Enrico Fermi and Irene Curie observed fission of U-235 during their neutron bombardment experiments (Fermi in 1934), but both failed to interpret it as such, and it was 5 years more before Frisch and Meitner figured it out. Ida Noddack actually wrote a paper suggesting this possibility, which was read by both Fermi and Curie but dismissed perhaps because Noddack was a chemist and had no suggestion for a physical mechanism of fission. Imagine a world in which, say, Albert Speer knows a nuclear chain reaction is possible in 1934, five years before the war even starts.
Off-topic to the AI generation, but "What I’d really like is a giant twelve-part panel depicting the Virtues Of Rationality." - I feel that you're not alone in this.
It's so Reinhold Niebuhr.
From a linguistics-oriented satirical hard-boiled detective novella: "I followed him to a dining hall stretching before and above me like a small cathedral. A stained glass window in art deco style opposite the entrance portrayed the seven liberal arts staring spellbound at the unbound Prometheus lighting the world; this was flanked by a rendering of the nine Muses holding court in Arcadia on the left and of bright-eyed Athena addressing Odysseus on the right. I was led to a sheltered side alcove where Ventadorn was waiting. I stood for another minute looking at the windows before I went in. She said, 'Pretty antiquated now. Most people never notice them any more. Most of the time they don’t bother to illuminate them.'" https://specgram.com/CLXVII.m/
I can swear I've read that one or something an awful lot like it.
There are traditional representations of the Seven Liberal Arts, and this huge fresco titled 'The Triumph of St Thomas Aquinas' has a selection of the standard iconography:
https://www.amblesideonline.org/art-spanish-chapel
"Quadrivium:
The allegorical figure of Arithmetic holds a tablet. At her feet sits Pythagoras.
The allegorical figure of Geometry with a T-square under her arm. At her feet sits Euclid.
The allegorical figure of Astronomy. At her feet sits Ptolemy, looking up to the heavens.
The allegorical figure of Music, holding a portative organ. At her feet sits Tubal Cain, with his head tilted as he listens to the pitch of his hammer hitting the anvil.
Trivium:
The allegorical figure of Dialectics in a white robe. At her feet sits Pietro Ispana (the identity of this figure is uncertain.)
The allegorical figure of Rhetoric with a paper list. At her feet sits Cicero.
The allegorical figure of Grammar, teaching a young boy. At her feet sits Priscian."
Why Tubal-Cain? As far as I can tell his only connections to music are that his profession makes noise, and that his half-brother Jubal was a musician. Why not use Jubal?
First I thought that it was because Tubal-Cain worked brass and iron, and that this referred to instruments like trumpets. But it seems to be a mediaeval transcription error, where Jubal is referred to as Tubal, and their stories get muddled together with that of Pythagoras:
https://www.jasoncolavito.com/blog/tubal-cain-and-the-musical-pillars-of-wisdom
Pythagoras is supposed (in one version) to have discovered musical tones by the sounds of blacksmiths hitting anvils with hammers of different weights. Since Tubal was a blacksmith, this gets attributed to him.
"The author of the Cooke manuscript has taken the part of Petrus’ text where he essentially accuses the Greeks of lying about Pythagoras inventing music and instead harmonizes the two accounts by making Pythagoras discover Tubal Cain’s lost writings. So much for Pythagoras’ entry into the story, which comes almost certainly from Petrus using a line in Isidore of Seville’s De Musica 16, where that author writes that “Moses says that Tubal, who came from the lineage of Cain before the Flood, was the inventor of the art of music. The Greeks, however, say that Pythagoras discovered the origins of this art, from the sound of hammers and the tones made from hitting them” (my trans.).
However Petrus Comestor and Isidore have made an error derived from faulty Latin. The Old Latin translation of Flavius Josephus from around 350-400 CE, the source that stands behind Petrus and probably Isidore, mistakenly gives “Tubal” instead of “Jubal” from Genesis 4:21 as the inventor of music, and later authors repeat the error. This error probably comes from misreading the Septuagint, where Tubal Cain is shortened to just Tubal, making confusion easier."
People forget how mystical ironwork was originally. Wasn't the old prayer, save us from Jews, blacksmiths and women? Now all we have left of Wayland the Smith is Waylon Smithers on The Simpsons, and almost no one gets the reference.
Translation of sixth section of the Breastplate of St. Patrick:
https://en.wikipedia.org/wiki/Saint_Patrick%27s_Breastplate
6. I have set around me all these powers,
Against every hostile savage power
Directed against my body and my soul,
Against the incantations of false prophets,
Against the black laws of heathenism,
Against the false laws of heresy,
Against the deceits of idolatry,
Against the spells of women and smiths and druids,
Against all knowledge that binds the soul of man.
So, DALL-E can't understand style as opposed to content. This is like very young children who can recognize a red car or a red hat, but haven't generalized the idea of red as an abstraction, a descriptor that can be applied to a broad range of objects. I forget the age, maybe around three or four, at which children start to realize that there are nouns AND adjectives, so DALL-E is functioning like a two- or three-year-old. I wonder how well it does with object-permanence games like peek-a-boo.
P.S. Maybe instead of a Turing test, we need a Piaget test for artificial intelligence.
DALL-E 2 can't keep separate which attributes apply to which objects in the scene because of architectural trade-offs OpenAI took. Gwern's comment on this LessWrong thread speculates about some of the issues in a way I found interesting (I don't pretend to follow the details alas) "CLIP gave GLIDE the wrong blueprint and that is irreversible": https://www.lesswrong.com/posts/uKp6tBFStnsvrot5t/what-dall-e-2-can-and-cannot-do
Other models already exist that do not have the same problems (though they surely have other problems, these still being early days): https://imagen.research.google/
I mention this because in this thread I have seen people extrapolating from random DALL-E 2 quirks to positing some fundamental limitations of AI generally (someone below said they thought AI performance had already hit a ceiling, which, I don't even know where to begin with that claim) when at least some of them actually appear to be fairly well-understood architectural limitations that we already know how to solve.
It can understand style vs. content if you are explicit about which is which (https://arxiv.org/abs/2205.11916). I've also seen many prompts that include 'in the style of'.
My guess is that you are expecting to produce great art right off the bat with a new tool and only a few hours practice with it. Obviously there is a learning curve, as your post demonstrates. Spend a few days with it, and I would assume your results will be spectacularly better.
From the limited exposure to DALL-E 2 I have had, your assumption about the query “a picture of X in the style of Y” seems right: it would remove the stained glass from the background of the subjects and make the art itself stained glass -- "a picture of Darwin in the style of stained glass."
Perhaps someone will make a new DALL-E interface that includes various sliders that work in real time, like the sliders on my phone's portrait mode, allowing me to bump the cartoon effects and filters up and down. So you could make your output more or less "art nouveau" or "stained glass" or whatever parameters you entered in your query.
Someone wanted to make a music video with DALL-E 2 yesterday, but couldn't quite do it. He still got some pretty results, however.
https://youtu.be/0fDJXmqdN-A
"A picture of Alexandra Elbakyan in a library with a raven, in the style of stained glass" gives pictures that look a lot like the first ones in that section - painting-y with a window in the background.
Okay, thank you. Sorry, my experience with DALL-E is all vicarious. I tried "Alexandra Elbakyan in a library with a raven, in stained glass" in that mini DALL-E, and the results were not high quality, but better than I expected. Oddly, I get different results with that program each time I run the same query.
Your readers might enjoy the DALL-E mini, recommended by one of your readers.
https://huggingface.co/spaces/dalle-mini/dalle-mini
I also got fair results with "scissors statement stained glass"
Low hopes for this, but worth a shot: "A stained glass window depicting..."
That was tried in the article. It's what produced the monstrous woman-raven hybrids.
FWIW I tried "stained glass by Francesc Labarta of william occam holding a razor" on dalle-mini; while the quality is subpar it was stylistically closer to 19th century than the attempts you shared.
https://imgur.com/5Pt2Dd8
For Brahe, have you considered his metal nose as a signifier rather than the moose?
I think that modern AI has reached a local maximum. Machine learning algorithms, as currently being developed, are not going to learn abstractions like adjectives and prepositions by massaging datasets. They're basically very advanced clustering algorithms that develop Bayesian priors based on analyzing large numbers of carefully described images. A lot of the discussion here recognizes this. Some limits, like understanding the style, as opposed to the content, of an image could be improved with improved labeling, but a lot of things will take more.
Before AI turned to what they called case-based reasoning, which trained systems using large datasets and statistical correlation, it took what seemed to be a more rational approach to understanding the real world. One of the big ideas involved "frames", that is, stylized real-world descriptions, ontologies; the idea was that machines would learn to fill in the frames and then reason about them. Each object in a scene, for example, would have a color, a quantity, a geometry, a size, a judgement, an age, a function, component objects and so on, so the descriptors of an object would have specific slots to be filled (see the toy sketch below). A lot of this was inspired by the 19th-century formalization of record keeping, and a lot of it came from linguistics, which recognized that words had roles and weightings. There's a reason "seven green dragons" is quite different from "green seven dragons", even though both consist of the same two adjectives followed by the same noun.
I suspect that we'll be hearing about frames, under a different name, in the next ten years or so, as AI researchers try to get past the current impasse. Frames may be arbitrary, but they would be something chosen by the system designer to solve a problem, whether it is getting customer contact information using voice recognition, commissioning an illustration or recognizing patterns in medical records.
P.S. As for a lot of the predictions for systems like DALL-E, I'm with Rodney Brooks: NIML (not in my lifetime).
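To make the slot-filling idea above concrete, here is a toy sketch of a frame as a typed record. The slot names are invented for illustration and aren't taken from any particular frame system (Minsky-style frames had considerably more machinery: defaults, inheritance, attached procedures, and so on):

```python
# A toy illustration of the "frame" idea: typed slots a system would try to
# fill for each object in a scene. Slot names are invented for illustration.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectFrame:
    noun: str                               # what the object is
    color: Optional[str] = None
    quantity: int = 1
    size: Optional[str] = None
    age: Optional[str] = None
    function: Optional[str] = None
    parts: list["ObjectFrame"] = field(default_factory=list)   # component objects

# "seven green dragons" vs "green seven dragons": the filled slots make explicit
# what a flat token string leaves to word order.
dragons = ObjectFrame(noun="dragon", color="green", quantity=7)
print(dragons)
```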
I agree that these systems are severely limited, and that the frame problem remains. However, even limited systems seem able to replace a large number of functions performed by white collar workers, because the stuff middle class people do is largely mundane and routine. Moreover, it's hard to quantify and notice all the actually intelligent things that people do in their jobs, as these things are seldom in their job descriptions. Bottom-line driven employers are then likely to switch since even limited automation does a good-enough job for the routine stuff, and then fail to notice that their organizations become less effective over time. I think the danger is the ensuing social upheaval. I think this process has already been happening for decades even without "AI" so it doesn't seem unlikely to me that it will continue.
In the longer term AI wranglers are likely to be in high demand and we'll expect more from all humans in the loop. This assumes we can muddle through to that point and not stumble our way into a horrible dystopia. Or, I suppose, we could all make sure we possess some hard-to-automate skills like plumbing, cleaning up after chaotic events, foraging, growing food, or caring for bed-ridden people.
You are right. AI is definitely good enough to have a big impact on the economy, particularly in eliminating a lot of mid-level jobs. We've been seeing that happening. (I'll cite Autor and others on this.)
You are also right that we are just starting to get the payoffs. AI is like the steam engine, small electric motors and microprocessors. They take a while to seep into the economy, but they make a huge difference.
You are also right that when AI does a bad job, we could stumble into a dystopia.
You are also right that there are some jobs left for humans. My lawyer has never needed a paralegal, but he does have a receptionist who acts as a witness to legal documents. The receptionist job may vanish, but barring dystopia we'll be requiring human witnesses for a while longer.
Lift your razor high, Occam
Hold it to the sky
Entities without true needs
Shan't multiply
It might have placed the key better if you put it in the raven's beak instead of mouth
If you want matching styles, maybe use Deep Art to adjust some as a second phase?
> The most interesting thing I learned from this experience is that DALL-E can’t separate styles from subject matters (or birds from humans).
Looks like entanglement is an issue. DALL-E cannot seem to find the right basis, where the basis vectors are styles, subjects, objects, etc. and instead uses artistic license to the max.
Who tells it what the correct basis should be? There's a correlation in the corpus between reindeer and men in red robes and art styles that work well on a postcard, just like there's a correlation between reindeer legs, reindeer torsos and reindeer antlers.
It could at least try to identify and rotate the bases, assuming it even has anything like the basis abstraction in its code.
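For what it's worth, "identify and rotate the basis" is mechanically easy; the hard, unsolved part is getting rotated axes that actually correspond to style, subject, and so on. A toy sketch with random stand-in latents, using plain SVD/PCA:

```python
# Toy illustration of "identifying and rotating a basis" in a latent space.
# PCA finds an orthogonal rotation of entangled features; whether any rotated
# axis then means "style" or "subject" is exactly the hard part.
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 64))    # pretend: 1000 images x 64 entangled dims

centered = latents - latents.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)   # rows of vt = new basis vectors
rotated = centered @ vt.T                                  # same data, expressed in the new basis
print(rotated.shape)                                       # (1000, 64)
```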
I am 95% sure there is actually a moose in one of the stained glass windows at my church. It’s in a more recent window depicting the Creation.
"DALL-E has seen one picture of Thomas Bayes, and many pictures of reverends in stained glass windows, and it has a Platonic ideal of what a reverend in a stained glass window looks like. Sometimes the stained glass reverend looks different from Bayes, and this is able to overpower its un-confident belief in what Bayes looks like."
So the Bayesian update wasn't strong enough to overcome its reverend prior?
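Running the joke's numbers: a toy Bayes update with a strong "generic stained-glass reverend" prior and a weak likelihood from the one known portrait (all probabilities made up):

```python
# Toy Bayes update: a strong prior for the generic reverend, a modest likelihood
# boost from the single portrait of Bayes. Numbers are invented for illustration.
prior_generic = 0.95      # P(draw the generic stained-glass reverend)
prior_bayes = 0.05        # P(draw the actual Bayes face)
lik_generic = 0.6         # P(evidence | generic reverend)
lik_bayes = 0.9           # P(evidence | actual Bayes) - one picture isn't much evidence

posterior_bayes = (lik_bayes * prior_bayes) / (
    lik_bayes * prior_bayes + lik_generic * prior_generic
)
print(f"P(actual Bayes face | evidence) = {posterior_bayes:.2f}")   # ~0.07: the prior still wins
```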
Scott,
Are you familiar with the work of the Emil Frei Stained Glass Company, based in St. Louis? If not, here is their official site:
https://www.emilfrei.com/
And this is a layman's tour of their work in St. Louis, a great resource to view the breadth of their work (spanning more than 100 years):